SAP: Synteny-aware gene function prediction for bacteria using protein embeddings.

bioRxiv : the preprint server for biology(2023)

引用 0|浏览2
暂无评分
摘要
Motivation:Today, we know the function of only a small fraction of the protein sequences predicted from genomic data. This problem is even more salient for bacteria, which represent some of the most phylogenetically and metabolically diverse taxa on Earth. This low rate of bacterial gene annotation is compounded by the fact that most function prediction algorithms have focused on eukaryotes, and conventional annotation approaches rely on the presence of similar sequences in existing databases. However, often there are no such sequences for novel bacterial proteins. Thus, we need improved gene function prediction methods tailored for prokaryotes. Recently, transformer-based language models - adopted from the natural language processing field - have been used to obtain new representations of proteins, to replace amino acid sequences. These representations, referred to as protein embeddings, have shown promise for improving annotation of eukaryotes, but there have been only limited applications on bacterial genomes. Results:To predict gene functions in bacteria, we developed SAP, a novel synteny-aware gene function prediction tool based on protein embeddings from state-of-the-art protein language models. SAP also leverages the unique operon structure of bacteria through conserved synteny. SAP outperformed both conventional sequence-based annotation methods and state-of-the-art methods on multiple bacterial species, including for distant homolog detection, where the sequence similarity to the proteins in the training set was as low as 40%. Using SAP to identify gene functions across diverse enterococci, of which some species are major clinical threats, we identified 11 previously unrecognized putative novel toxins, with potential significance to human and animal health.
更多
查看译文
关键词
protein embeddings,bacteria,gene,prediction,synteny-aware
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要