Diffusion Language Models Are Versatile Protein Learners
CoRR (2024)
Abstract
This paper introduces the diffusion protein language model (DPLM), a versatile
protein language model that demonstrates strong generative and predictive
capabilities for protein sequences. We first pre-train scalable DPLMs from
evolutionary-scale protein sequences within a generative self-supervised
discrete diffusion probabilistic framework, which generalizes language modeling
for proteins in a principled way. After pre-training, DPLM exhibits the ability
to generate structurally plausible, novel, and diverse protein sequences for
unconditional generation. We further demonstrate that the proposed diffusion
generative pre-training gives DPLM a better understanding of proteins, making
it a superior representation learner that can be fine-tuned for various
predictive tasks, comparing favorably to ESM2 (Lin et al., 2022). Moreover,
DPLM can be tailored to various needs, showcasing its prowess at conditional
generation in several ways: (1) conditioning on partial peptide sequences,
e.g., generating scaffolds for functional motifs with a high success rate;
(2) incorporating other modalities as a conditioner, e.g.,
structure-conditioned generation for inverse folding; and (3) steering sequence
generation towards desired properties, e.g., satisfying specified secondary
structures, through plug-and-play classifier guidance.
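
The two core techniques the abstract names, discrete diffusion pre-training over sequences and plug-and-play classifier guidance, can be made concrete with a minimal PyTorch sketch. This is not DPLM's actual implementation: the `model` and `classifier` interfaces, `MASK_ID`, and the linear noise schedule are all assumptions for illustration, and the paper's exact objective and sampler differ in their details.

```python
import torch
import torch.nn.functional as F

MASK_ID = 32  # hypothetical id of the absorbing [MASK] token


def diffusion_training_step(model, seq_tokens, num_timesteps=500):
    """Absorbing-state discrete diffusion training step (sketch).

    Sample a timestep per sequence, mask a timestep-dependent fraction
    of residues, and train the network to recover the original amino
    acids at the masked positions.
    """
    batch, length = seq_tokens.shape
    t = torch.randint(1, num_timesteps + 1, (batch,), device=seq_tokens.device)
    mask_rate = t.float() / num_timesteps  # assumed linear noise schedule
    corrupt = torch.rand(batch, length, device=seq_tokens.device) < mask_rate[:, None]
    noised = torch.where(corrupt, torch.full_like(seq_tokens, MASK_ID), seq_tokens)
    logits = model(noised, t)  # assumed signature: (tokens, t) -> (B, L, vocab)
    # Denoising cross-entropy, computed only on the corrupted positions.
    return F.cross_entropy(logits[corrupt], seq_tokens[corrupt])


@torch.no_grad()
def guided_denoise_step(model, classifier, noised, t, guidance_scale=2.0):
    """One classifier-guided reverse-diffusion step (sketch).

    The denoiser proposes per-position token distributions; a property
    predictor (e.g., for secondary structure) supplies per-token
    attribute log-probabilities that tilt those distributions before
    sampling, in the spirit of plug-and-play guidance.
    """
    logits = model(noised, t)          # (B, L, vocab)
    attr_logp = classifier(noised, t)  # assumed interface: same shape as logits
    guided = (logits + guidance_scale * attr_logp).softmax(dim=-1)
    proposal = torch.distributions.Categorical(probs=guided).sample()
    # Only still-masked positions are committed at this step.
    return torch.where(noised == MASK_ID, proposal, noised)
```

In a full sampler, the guided step would be iterated from an entirely masked sequence down to t = 1, and the guidance scale trades off property satisfaction against sequence naturalness; both choices here are assumptions rather than the paper's prescribed settings.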