DeProt: A protein language model with quantized structure and disentangled attention

crossref(2024)

Abstract
Protein language models have exhibited remarkable representational capabilities in various downstream tasks, notably in the prediction of protein functions. Despite their success, these models traditionally suffer from a critical shortcoming: the absence of explicit protein structure information, which is pivotal for elucidating the relationship between protein sequences and their functionality. Addressing this gap, we introduce DeProt, a Transformer-based protein language model designed to incorporate both protein sequences and structures. It was pre-trained on millions of protein structures from diverse natural protein clusters. DeProt first serializes protein structures into residue-level local-structure sequences and uses a graph neural network-based auto-encoder to vectorize the local structures. These vectors are then quantized into discrete structure tokens by a pre-trained codebook. DeProt further utilizes disentangled attention mechanisms to effectively integrate residue sequences with structure token sequences. Despite having fewer parameters and less training data, DeProt significantly outperforms other state-of-the-art (SOTA) protein language models, including those that are structure-aware and evolution-based, particularly in the task of zero-shot mutant effect prediction across 217 deep mutational scanning assays. Furthermore, DeProt exhibits robust representational capabilities across a spectrum of supervised-learning downstream tasks. Our comprehensive benchmarks underscore the innovative nature of DeProt's framework and its superior performance, suggesting its wide applicability in the realm of protein deep learning.
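The abstract describes a three-step pipeline: encode each residue's local structure with a GNN auto-encoder, quantize the resulting vectors against a pre-trained codebook to obtain discrete structure tokens, and fuse the residue and structure token streams with disentangled attention. The sketch below is a minimal, hypothetical PyTorch illustration of the last two steps only; the dimensions, codebook size, and the particular attention decomposition are assumptions for illustration and are not the authors' released implementation.

```python
# Illustrative sketch (not DeProt's actual code) of (1) nearest-neighbor
# quantization of local-structure vectors against a pre-trained codebook and
# (2) a simplified disentangled attention mixing residue and structure tokens.
import torch
import torch.nn as nn


def quantize_structures(local_vecs: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each local-structure vector to the index of its nearest codebook entry.

    local_vecs: (L, D) per-residue embeddings from the GNN auto-encoder (assumed shape).
    codebook:   (K, D) pre-trained codebook of local-structure prototypes.
    Returns:    (L,) discrete structure tokens.
    """
    dists = torch.cdist(local_vecs, codebook)  # (L, K) pairwise Euclidean distances
    return dists.argmin(dim=-1)


class DisentangledAttention(nn.Module):
    """Simplified disentangled attention: separate projections for the residue
    (content) stream and the structure-token stream, with cross terms added to
    the attention logits (content-content, content-structure, structure-content).
    The exact decomposition used by DeProt may differ."""

    def __init__(self, d_model: int = 64, n_struct_tokens: int = 2048, vocab: int = 25):
        super().__init__()
        self.res_emb = nn.Embedding(vocab, d_model)          # residue tokens
        self.str_emb = nn.Embedding(n_struct_tokens, d_model)  # structure tokens
        self.q_c, self.k_c = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.q_s, self.k_s = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, residue_ids: torch.Tensor, struct_ids: torch.Tensor) -> torch.Tensor:
        c = self.res_emb(residue_ids)  # (L, D) residue-content stream
        s = self.str_emb(struct_ids)   # (L, D) structure-token stream
        qc, kc = self.q_c(c), self.k_c(c)
        qs, ks = self.q_s(s), self.k_s(s)
        # Disentangled logits: content-to-content plus two cross terms.
        logits = (qc @ kc.T + qc @ ks.T + qs @ kc.T) * self.scale
        return logits.softmax(dim=-1) @ self.v(c)  # (L, D) fused representation


if __name__ == "__main__":
    L, D, K = 10, 64, 2048  # sequence length, embedding dim, codebook size (assumed)
    struct_tokens = quantize_structures(torch.randn(L, D), torch.randn(K, D))
    model = DisentangledAttention(d_model=D, n_struct_tokens=K)
    out = model(torch.randint(0, 25, (L,)), struct_tokens)
    print(out.shape)  # torch.Size([10, 64])
```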