EnteroBase: hierarchical clustering of 100 000s of bacterial genomes into species/subspecies and populations

PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES(2022)

引用 9|浏览16
暂无评分
摘要
The definition of bacterial species is traditionally a taxonomic issue while bacterial populations are identified by population genetics. These assignments are species specific, and depend on the practitioner. Legacy multilocus sequence typing is commonly used to identify sequence types (STs) and clusters (ST Complexes). However, these approaches are not adequate for the millions of genomic sequences from bacterial pathogens that have been generated since 2012. EnteroBase (http://enterobase.warwick.ac.uk) automatically clusters core genome MLST allelic profiles into hierarchical clusters (HierCC) after assembling annotated draft genomes from short-read sequences. HierCC clusters span core sequence diversity from the species level down to individual transmission chains. Here we evaluate HierCC's ability to correctly assign 100 000s of genomes to the species/subspecies and population levels for Salmonella, Escherichia, Clostridoides, Yersinia, Vibrio and Streptococcus. HierCC assignments were more consistent with maximum-likelihood super-trees of core SNPs or presence/absence of accessory genes than classical taxonomic assignments or 95% ANI. However, neither HierCC nor ANI were uniformly consistent with classical taxonomy of Streptococcus. HierCC was also consistent with legacy eBGs/ST Complexes in Salmonella or Escherichia and with O serogroups in Salmonella. Thus, EnteroBase HierCC supports the automated identification of and assignment to species/subspecies and populations for multiple genera.This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.
更多
查看译文
关键词
big data, hierarchical clustering, genomic databases, EnteroBase, cgMLST, accessory genome
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要