Deep neural language modeling enables functional protein generation across families

bioRxiv(2021)

Cited 37|Views93
No score
Abstract
Bypassing nature’s evolutionary trajectory, de novo protein generation—defined as creating artificial protein sequences from scratch—could enable breakthrough solutions for biomedical and environmental challenges. Viewing amino acid sequences as a language, we demonstrate that a deep learning-based language model can generate functional artificial protein sequences across families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. Our protein language model is trained by simply learning to predict the next amino acid for over 280 million protein sequences from thousands of protein families, without biophysical or coevolutionary modeling. We experimentally evaluate model-generated artificial proteins on five distinct antibacterial lysozyme families. Artificial proteins show similar activities and catalytic efficiencies as representative natural lysozymes, including hen egg white lysozyme, while reaching as low as 44% identity to any known naturally-evolved protein. The X-ray crystal structure of an enzymatically active artificial protein recapitulates the conserved fold and positioning of active site residues found in natural proteins. We demonstrate our language model’s ability to be adapted to different protein families by accurately predicting the functionality of artificial chorismate mutase and malate dehydrogenase proteins. These results indicate that neural language models successfully perform de novo protein generation across protein families and may prove to be a tool to shortcut evolution. ### Competing Interest Statement The authors have declared no competing interest.
More
Translated text
AI Read Science
Must-Reading Tree
Example
Generate MRT to find the research sequence of this paper
Chat Paper
Summary is being generated by the instructions you defined