Unsupervised Grammar Induction for Revealing the Internal Structure of Protein Sequence Motifs.

AIME (2020)

Abstract
Protein sequence motifs are conserved amino acid patterns of biological significance. They are vital for annotating structural and functional features of proteins. Yet the computational methods commonly used to define sequence motifs typically rely on simplified linear representations that neglect the higher-order structure of the motif. This work aims to build models of sequence motifs that take into account the internal structure of the modeled fragments. The ultimate goal is to provide the community with accurate and concise models of diverse collections of remotely related amino acid sequences that share structural features. The internal structure of amino acid sequences is modeled using a novel algorithm for unsupervised learning of weighted context-free grammars (WCFGs). The proposed method learns a WCFG from both positive and negative samples, and the rule weights are estimated using a novel Inside-Outside Contrastive Estimation algorithm. Compared with existing approaches to learning CFGs, the new method generates more concise descriptors and provides good control over the trade-off between grammar size and specificity. The method is applied to the nicotinamide adenine dinucleotide phosphate binding site motif.
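To make the notion of scoring a sequence with a weighted context-free grammar concrete, the following is a minimal sketch of the standard inside algorithm for a grammar in Chomsky normal form. It is not the paper's Inside-Outside Contrastive Estimation procedure, which the abstract does not specify. The function name `inside_score`, the toy rule tables `LEXICAL` and `BINARY`, and all weights are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy grammar (an assumption, not taken from the paper).
# Lexical rules A -> terminal, keyed by (A, terminal).
LEXICAL = {("X", "G"): 1.0, ("X", "A"): 0.5}
# Binary rules A -> B C, keyed by (A, B, C).
BINARY = {("S", "X", "X"): 1.0, ("S", "X", "S"): 0.5}


def inside_score(seq, lexical, binary, start="S"):
    """Sum the weights of all parses of seq from the start symbol.

    The weight of one parse is the product of the weights of its rules.
    """
    n = len(seq)
    if n == 0:
        return 0.0
    # chart[i][j][A] = total weight of derivations of seq[i:j] from A
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    # Spans of length 1: apply lexical rules.
    for i, sym in enumerate(seq):
        for (a, t), w in lexical.items():
            if t == sym:
                chart[i][i + 1][a] += w
    # Longer spans: combine two adjacent sub-spans with binary rules.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                left, right = chart[i][k], chart[k][j]
                for (a, b, c), w in binary.items():
                    if b in left and c in right:
                        chart[i][j][a] += w * left[b] * right[c]
    return chart[0][n].get(start, 0.0)


if __name__ == "__main__":
    for s in ("GG", "GAG", "GW"):
        print(s, inside_score(s, LEXICAL, BINARY))
```

In this toy setting, a sequence containing a residue with no lexical rule receives a score of zero, while sequences the grammar can derive receive a positive score. A learning method of the kind described above would adjust such rule weights so that positive samples score higher than negative ones.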
Keywords
protein sequence motifs, grammar, internal structure