Predicting Protein-encoding Gene Content in Escherichia coli Genomes

bioRxiv (2023)

Abstract
In this study, we built machine learning classifiers for predicting the presence or absence of the variable genes occurring in 10-90% of all publicly available high-quality Escherichia coli genomes. The BV-BRC genus-specific protein families were used to define orthologs across the set of genomes, and a single binary classifier was built for predicting the presence or absence of each family in each genome. Each model was built using the nucleotide k-mers from a set of 100 conserved genes as features. The resulting set of 3,259 XGBoost classifiers had a per-genome average macro F1 score of 0.944 [0.943-0.945, 95% CI]. We show that the F1 scores are stable across MLSTs, and that the trend can be recapitulated through sampling with a smaller number of core genes or diverse input genomes. Surprisingly, the presence or absence of poorly annotated proteins, including “hypothetical proteins”, was easily predicted (F1 = 0.902 [0.898-0.906, 95% CI]). Models for proteins with horizontal gene transfer-related functions, including transposition- (F1 = 0.895 [0.882-0.907, 95% CI]), phage- (F1 = 0.872 [0.868-0.876, 95% CI]), and plasmid-related (F1 = 0.824 [0.814-0.834, 95% CI]) functions, had slightly lower F1 scores but were still accurate. Finally, we applied the models to a holdout set of 419 diverse E. coli genomes that were isolated from freshwater environmental sources and observed an average per-genome F1 score of 0.880 [0.876-0.883, 95% CI], demonstrating the extensibility of the models. Overall, this study provides a framework for predicting variable gene content using a limited amount of input sequence data.

Importance
The ability to predict the protein-encoding gene content of a genome is important for a variety of bioinformatic tasks, including assessing genome quality, binning genomes from shotgun metagenomic assemblies, and assessing risk due to the presence of antimicrobial resistance (AMR) and virulence genes. In this study, we built a series of binary classifiers for predicting the presence or absence of variable genes occurring in 10-90% of all publicly available E. coli genomes. Overall, the results show that a large portion of the E. coli variable gene content can be predicted with high accuracy, including genes with functions relating to horizontal gene transfer.

### Competing Interest Statement
The authors have declared no competing interest.

* AMR : Antimicrobial Resistance
* BV-BRC : Bacterial and Viral Bioinformatics Resource Center
* MAG : Metagenome-Assembled Genome
* MLST : Multilocus Sequence Type
* PATRIC : PAThosystems Resource Integration Center
* RAST : Rapid Annotation using Subsystem Technology
* XGBoost : Extreme Gradient Boosting
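
The abstract describes two computational pieces that a small example can make concrete: turning conserved-gene sequences into nucleotide k-mer features, and scoring presence/absence predictions with a macro F1. The following is a minimal sketch under stated assumptions, not the authors' pipeline. The function names, the value of k, and the toy sequences are all hypothetical (the abstract does not give a k-mer length), and the XGBoost training step is omitted; in practice the feature vector would be passed to one binary classifier per protein family.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in one sequence, skipping k-mers with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def feature_vector(seqs, k=3):
    """Sum k-mer counts over a genome's conserved-gene sequences in a fixed k-mer order.

    k=3 is an illustrative assumption, not a value taken from the paper.
    """
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    total = Counter()
    for s in seqs:
        total.update(kmer_counts(s, k))
    return [total[kmer] for kmer in vocab]


def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores for 'present' (1) and 'absent' (0)."""
    return (f1_for_class(y_true, y_pred, 1) + f1_for_class(y_true, y_pred, 0)) / 2


if __name__ == "__main__":
    # Toy stand-ins for conserved-gene sequences from one genome.
    genes = ["ACGTAC", "GGNACC"]
    print(feature_vector(genes, k=2))

    # Toy per-genome labels: one entry per protein family (1 = present).
    truth = [1, 1, 0, 0]
    predicted = [1, 0, 0, 0]
    print(round(macro_f1(truth, predicted), 3))
```

Averaging the two class-wise F1 scores without weighting means a model cannot score well by always predicting the more common label, which matters for families near the 10% or 90% ends of the prevalence range.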