[1] Wang Yinwei, Wu Jingjing, Zhang Chenning, et al. Benchmarking Fast Homology Search Softwares Based on Model Organisms[J]. Journal of Nanjing Normal University (Natural Science Edition), 2022, 45(02): 44-51. [doi:10.3969/j.issn.1001-4616.2022.02.006]

Benchmarking Fast Homology Search Softwares Based on Model Organisms

Journal of Nanjing Normal University (Natural Science Edition) [ISSN:1001-4616/CN:32-1239/N]

Volume:
Vol. 45
Issue:
2022, No. 02
Pages:
44-51
Column:
·Biology·
Publication date:
2022-05-15

Article Info

Title:
Benchmarking Fast Homology Search Softwares Based on Model Organisms
Article ID:
1001-4616(2022)02-0044-08
Author(s):
Wang Yinwei, Wu Jingjing, Zhang Chenning, Hua Yijia, Li Peng, Yan Jie
(School of Life Sciences, Nanjing Normal University, Nanjing 210023, China)
Keywords:
homology search; orthology inference; RBH; fast algorithms; sequence comparison
CLC number:
Q33
DOI:
10.3969/j.issn.1001-4616.2022.02.006
Document code:
A
Abstract:
In the era of big data, blastp searches from the traditional BLAST+ package have become unacceptably slow. Homology search software has advanced greatly over the past decade or so, but comprehensive evaluations are scarce. This study comprehensively compared 7 fast homology search programs with blastp. The fast mode of diamond was generally faster than the other programs and had the lowest false discovery rate, making it the best choice when search speed is the priority. In memory consumption, MMseqs2 was very low while ghostx was the highest. In the number of identified hits, the s7.5 mode of MMseqs2 returned the most results at medium Genomic Similarity Scores (GSS), second only to blastp, but the s5 mode should be the better choice. As GSS decreased, ghostx returned the most results, whereas ublast returned the most as GSS increased. In the number of identified Reciprocal Best Hits (RBH), ghostx had an advantage in remote searches, and this advantage is also supported by synteny evidence. In homology search, the results of almost all programs overlapped heavily; the exception was ghostx, which returned 43.4% additional results while still keeping a very low false discovery rate, whereas the s3 mode of MMseqs2 had the highest false discovery rate. Overall, MMseqs2, diamond and ghostx are the three best alternatives to blastp. Diamond is well suited to orthology inference and can search both accurately and quickly in "fast" mode, while "very" is the best search mode on balance. For searches between distantly related species, ghostx is more advantageous, and for identifying homologous proteins at medium GSS, the s5 mode of MMseqs2 may be the better choice.
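
The abstract uses Reciprocal Best Hits (RBH) as a criterion for comparing the programs but does not say how the pairs were extracted. As an illustration only, the following is a minimal sketch of the usual RBH definition. It assumes the search output has already been parsed into (query, subject, bit score) tuples and that ties are broken in favour of the first hit seen. The function names and the toy identifiers are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed input record: (query_id, subject_id, bit_score)
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject (first hit wins on ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy data: proteome A searched against proteome B, and the reverse search.
a_vs_b = [("a1", "b1", 250.0), ("a1", "b2", 90.0),
          ("a2", "b2", 180.0), ("a3", "b1", 120.0)]
b_vs_a = [("b1", "a1", 248.0), ("b2", "a2", 175.0),
          ("b2", "a1", 88.0), ("b3", "a3", 60.0)]

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh)  # [('a1', 'b1'), ('a2', 'b2')]
```

In the toy data, a3 has b1 as its best hit, but b1's best hit is a1, so a3 is not part of any reciprocal pair. Under this definition, a program that misses or mis-ranks a top hit in either search direction loses the RBH pair, which is one reason RBH counts can differ between search tools.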

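Memo

Funding: National Natural Science Foundation of China (3167229); Major Natural Science Research Project of Jiangsu Higher Education Institutions (19KJA330001).
Corresponding author: Yan Jie, Ph.D., associate professor; research interests: animal molecular evolution and phylogeography. E-mail: yanjie@njnu.edu.cn
Last Update: 1900-01-01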