
Integrating unsupervised language model with triplet neural networks for protein gene ontology prediction

PLoS Computational Biology (2022)

Abstract
Accurate identification of protein function is critical to elucidating life mechanisms and designing new drugs. We proposed a novel deep-learning method, ATGO, to predict Gene Ontology (GO) attributes of proteins through a triplet neural-network architecture embedded with pre-trained language models from protein sequences. The method was systematically tested on 1068 non-redundant benchmark proteins and 3328 targets from the third Critical Assessment of Protein Function Annotation (CAFA) challenge. Experimental results showed that ATGO achieved significantly higher GO prediction accuracy than state-of-the-art approaches in all three aspects of molecular function, biological process, and cellular component. Detailed data analyses showed that the major advantage of ATGO lies in its use of pre-trained transformer language models, which extract discriminative functional patterns from the feature embeddings. Meanwhile, the proposed triplet network strengthens the association between functional similarity and feature similarity in the sequence embedding space. In addition, it was found that combining the network scores with complementary homology-based inferences could further improve the accuracy of the predicted models. These results demonstrate a new avenue for high-accuracy deep-learning function prediction that is applicable to large-scale protein function annotation from sequence alone.

Author summary

In the post-genome sequencing era, a major challenge in computational molecular biology is to annotate the biological functions of all genes and gene products, which have been classified, in the context of the widely used Gene Ontology (GO), into three aspects: molecular function, biological process, and cellular component.
In this work, we proposed a new open-source deep-learning architecture, ATGO, to deduce GO terms of proteins from the primary amino acid sequence by integrating a triplet neural network with pre-trained language models of protein sequences. Large benchmark tests showed that, when powered with transformer embeddings from the language model, ATGO achieved significantly better performance than other state-of-the-art approaches in all GO aspect predictions. Following the rapid progress of self-attention neural network techniques, which have demonstrated remarkable impact on natural language processing and multi-sensory data processing, and most recently on protein structure prediction, this study showed the significant potential of attention-based transformer language models for protein function annotation.
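The triplet network described above trains sequence embeddings so that functionally similar proteins lie closer together than dissimilar ones. A minimal sketch of a standard triplet hinge loss illustrates the idea; the abstract does not specify ATGO's exact loss or distance metric, so the Euclidean distance and margin form below are illustrative assumptions:

```python
import math

def euclidean(u, v):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Hinge-style triplet loss: push a functionally similar pair
    # (anchor, positive) closer in embedding space than a dissimilar
    # pair (anchor, negative), by at least `margin`.
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)
```

Minimizing this loss over many (anchor, positive, negative) protein triplets is what aligns feature similarity with functional similarity in the embedding space.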
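The finding that combining network scores with homology-based inferences improves accuracy amounts to a consensus-scoring step. A hypothetical linear blend sketches this; the `weight` parameter and the linear form are illustrative assumptions, not the paper's actual composite formula:

```python
def combine_scores(network_score, homology_score, weight=0.5):
    # Hypothetical linear blend of a deep-network GO confidence score
    # and a homology-transfer score for the same GO term; both scores
    # are assumed to lie in [0, 1].
    return weight * network_score + (1.0 - weight) * homology_score
```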
Keywords
language model, ontology, gene, protein, triplet