Genomic diversity and viral metagenomics
Number of complete genomes
According to the NCBI, as of September 2019, there are 8,437 complete phage genomes divided into ten families (based on the ICTV classification at the time) and one unclassified group (Fig. 2a). More than half of them are members of the Siphoviridae family. This overrepresentation is due in large part to the isolation and genome sequencing of 1,537 siphophages infecting Mycobacterium smegmatis by the SEA-PHAGES program55. Myoviridae and Podoviridae represent 17% and 12% of the total phages, respectively, making Caudovirales (which also comprises Herelleviridae and Ackermannviridae) the most abundant group of phages (> 85%) in public genomic databases. The disproportionate representation of tailed dsDNA phages will likely decrease in the near future with the discovery of new phages. For example, the genomic diversity within the Microviridae family was largely underestimated until 258 new ssDNA phages were detected in the gut of Ciona robusta56. In addition, the unclassified bacterial virus group within NCBI consists of phages discovered through metagenomic projects that have yet to be isolated or have only very recently been propagated on a bacterial host. This latter group includes 283 non-tailed dsDNA phages infecting the ubiquitous marine Vibrionaceae bacterial family7. Recently, Roux and colleagues used a machine-learning approach to mine microbial genomes and metagenomes for inoviruses9. They identified 10,295 inovirus-like sequences, which appear to represent 5,964 distinct species. This study alone represents a 100-fold expansion of the diversity previously described (57 genomes) within the Inoviridae family. The ever-increasing number of complete phage genomes in the NCBI database still represents only a small fraction of the actual phage genomic diversity, since half of them infect only seven host genera (Mycobacterium, Streptococcus, Escherichia, Pseudomonas, Gordonia, Lactococcus, and Salmonella). The total number of complete phage genomes available in public databases is also certainly far greater because of the numerous unidentified prophages in bacterial genomes57.
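The proportions quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only the counts given in the text; the variable names are illustrative, and the fold change compares species with genomes, as the text itself does.

```python
# Counts quoted in the text (NCBI, September 2019, and the inovirus study).
total_genomes = 8437        # complete phage genomes
sea_phages_msmeg = 1537     # M. smegmatis siphophages from SEA-PHAGES
inoviridae_before = 57      # Inoviridae genomes described previously
inovirus_species = 5964     # distinct inovirus species reported

# Share of all complete genomes contributed by one host and one program.
sea_phages_share = sea_phages_msmeg / total_genomes

# Expansion of known Inoviridae diversity (species vs. earlier genomes).
fold_expansion = inovirus_species / inoviridae_before

print(f"SEA-PHAGES M. smegmatis siphophages: {sea_phages_share:.1%} of genomes")
print(f"Inoviridae expansion: about {fold_expansion:.0f}-fold")
```

Roughly one complete genome in five comes from a single host species, which illustrates how strongly sampling effort shapes the apparent family distribution.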
Range in genome size
Phages have a wide range of genome sizes, with an average size of 62.5 ± 46.8 kb (Fig. 2b). The smallest phage genome reported to date appears to be that of Leuconostoc phage L5, with only 2,435 bp. At the other end of the spectrum, an increasing number of jumbo phages (> 200 kb) are being characterized and show unique genomic features58. Their large genome size allows jumbo phages to carry genes involved in replication and nucleotide metabolism that are absent from smaller phage genomes. The organization of these large viral genomes is also atypical because genes with associated functions do not show strong synteny and are instead more dispersed58. A new group of phages with the largest genomes recorded to date, called megaphages (> 540 kb) and predicted to infect Prevotella, was recently uncovered from human and animal gut metagenomes5. These phages seem widespread in gut microbiomes, as they were identified in humans, baboons and pigs5. They had been overlooked because of genome fragmentation and their use of an alternative genetic code in which a stop codon is repurposed5.
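To illustrate why a repurposed stop codon can hide a phage genome from standard analyses, the following minimal sketch translates a toy open reading frame with the standard genetic code and with a recoded variant. The text does not specify which stop codon is repurposed or which amino acid it encodes; the TAG-to-glutamine reassignment and the toy sequence below are assumptions made purely for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_until_stop(dna, code):
    """Translate from the first codon and stop at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = code[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# ASSUMPTION: TAG is read as glutamine (Q) instead of stop.
recoded_code = dict(STANDARD_CODE, TAG="Q")

toy_orf = "ATGAAATAGGGTCTGTAA"  # hypothetical sequence with an in-frame TAG

standard_protein = translate_until_stop(toy_orf, STANDARD_CODE)
recoded_protein = translate_until_stop(toy_orf, recoded_code)
print(standard_protein, recoded_protein)
```

With the standard code the reading frame ends at the in-frame TAG, so a gene caller would report only a short fragment; with the recoded table the same frame reads through to the final stop codon.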
Contribution of viral metagenomics in exploring phage genomic diversity
Given the absence of a conserved genetic marker and the predicted large number of phages in the biosphere59, phage genomic diversity is difficult to comprehend. Phages infecting different hosts typically have little to no sequence similarity, and phages that infect a single host may also exhibit considerable sequence differences60–62. For instance, pairwise comparisons of 2,333 phages showed no detectable homology in 97% of cases when measuring nucleotide distance and gene content63. Thanks to modern techniques that explore viral dark matter, such as viral metagenomics, we are starting to grasp the extent of global phage diversity. Viral metagenomics is defined here as the sequencing of the total nucleic acids (mostly dsDNA) from the viral fraction of a given environment. It thereby overcomes the limitations of culture-based approaches and single marker genes. The first viral metagenomics study, on surface seawater samples, was published in 2002, before the arrival of next-generation sequencing64. In recent years, the optimization of the steps required to obtain good-quality viral nucleic acids65, the falling cost of sequencing and an improved set of analytical tools66 have allowed the construction of large-scale virome (viral sequences obtained from viral metagenomics) datasets from viral communities, mostly from marine and human gut samples. There are now at least 90 studies describing viromes from aquatic environments67, 38 from the human gut and eight from soil67. Among them, three research consortia, Tara Oceans68, the Pacific Ocean Virome69 and the Malaspina oceanic research expedition70, have performed viral metagenomics on marine samples from various depths and locations. This has led to the detailed characterization of ocean dsDNA viruses and their abundance patterns on local and global scales71,72. The first human gut virome, from a single healthy individual, was published in 200373. Studies on twins and their mothers74, healthy adults75,76 and patients with ulcerative colitis77 have since described longitudinal and inter-individual viral variation in health and disease. In 2014, the mining of viral metagenomic libraries (viromes) also resulted in the discovery of the most abundant and widespread phage in the human gut, called crAssphage78. The results of these projects are summarized in the following sections.
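The scale of an all-against-all comparison, and one simple way to express gene-content differences, can be sketched as follows. This is an illustration only: it assumes gene content is compared as a Jaccard distance over sets of gene-family labels, which is not necessarily the measure used in the cited study, and the phage names and family labels are hypothetical.

```python
from itertools import combinations
from math import comb

def gene_content_dissimilarity(families_a, families_b):
    """Jaccard distance between two sets of gene-family labels (0 = identical)."""
    union = families_a | families_b
    if not union:
        return 0.0
    return 1.0 - len(families_a & families_b) / len(union)

# Hypothetical toy data: gene families carried by three phages.
phages = {
    "phage_A": {"fam1", "fam2", "fam3", "fam4"},
    "phage_B": {"fam1", "fam2", "fam5"},
    "phage_C": {"fam6", "fam7"},
}

distances = {
    (x, y): gene_content_dissimilarity(phages[x], phages[y])
    for x, y in combinations(sorted(phages), 2)
}
# Pairs sharing no gene family at all.
unrelated_pairs = [pair for pair, d in distances.items() if d == 1.0]

# Number of distinct pairs among the 2,333 phages mentioned in the text.
pairs_in_comparison = comb(2333, 2)
print(distances, unrelated_pairs, pairs_in_comparison)
```

Comparing 2,333 genomes pairwise means evaluating about 2.7 million pairs, so a figure of 97% without detectable homology corresponds to well over two million unrelated pairs.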
Beyond viral metagenomics
A major drawback of describing viral communities with metagenomics is that its resolution is not fine enough to reconstruct the genomes of closely related sequences. As a result, phage populations with high levels of microdiversity are lost during metagenomic assembly. Detecting this microdiversity is necessary to better understand phage-host interaction dynamics79. Single-virus genomics overcomes this obstacle by sorting individual phages prior to sequencing. This approach led to the discovery of the most abundant marine phage80, which is called vSAG 37-F6 and infects Pelagibacter81. Viral tagging may also provide additional insights into phage-host interactions, as reported for cyanophages infecting Synechococcus82. Although metagenomics itself does not specifically target viral DNA, a wealth of information about phage sequences can still be recovered10. Using a manually curated, exhaustive collection of viral protein families as bait, over 125,000 viral genomes were detected in 3,042 metagenomes of diverse geographical origins10. This study was a major contribution to our understanding of viral diversity, as it expanded the number of known viral genes 16-fold. It also suggested that, on a global scale, phage genomic diversity remained largely uncharacterized, but that the discovery rate in marine and human samples (the most studied biomes) was approaching saturation10. Yet unknown phages still consistently represent the majority of the sequences in the viral fraction of any given environmental sample, sometimes accounting for more than 90% of the reads11,83. Figure 3 outlines how omics and culturing efforts can be integrated to fully characterize entire phage communities.
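The idea of using viral protein families as bait can be sketched in a few lines. The example below is a simplified illustration, not the pipeline of the cited study: the family names, the contig annotations and the `min_viral_hits` threshold are all hypothetical.

```python
# Hypothetical bait set of viral protein families.
VIRAL_FAMILIES = {"vpf_terminase", "vpf_capsid", "vpf_portal", "vpf_tail_fiber"}

# Hypothetical protein-family annotations for assembled metagenomic contigs.
contig_hits = {
    "contig_1": ["vpf_terminase", "vpf_capsid", "vpf_portal", "ribosomal_L2"],
    "contig_2": ["ribosomal_L2", "dnaA", "vpf_capsid"],
    "contig_3": [],
}

def count_viral_hits(families, bait):
    """Count annotated genes on a contig that match a bait family."""
    return sum(1 for family in families if family in bait)

def flag_viral_contigs(hits_by_contig, bait, min_viral_hits=3):
    """Return contigs with at least min_viral_hits bait matches (assumed cutoff)."""
    return sorted(
        contig
        for contig, families in hits_by_contig.items()
        if count_viral_hits(families, bait) >= min_viral_hits
    )

viral_contigs = flag_viral_contigs(contig_hits, VIRAL_FAMILIES)
print(viral_contigs)
```

A single viral-like gene on an otherwise cellular contig (as in `contig_2`) is not enough to flag it here, which is why a hit threshold of some kind is useful; real pipelines would need additional criteria to separate free phages from prophages and cellular contamination.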