The String Decomposition Problem and its Applications to Centromere Assembly

bioRxiv (2019)

Abstract
Recent attempts to assemble long tandem repeats (such as multi-megabase long centromeres) faced the challenge of accurately translating long error-prone reads from the nucleotide alphabet into the alphabet of repeat units. Centromeres represent a particularly complex type of tandem repeats, where each unit is itself a repeat formed by chromosome-specific monomers (a repeat within a repeat). Given a set of monomers forming a specific centromere, translation of a read into monomers is modeled as the String Decomposition Problem: finding a concatenation of monomers with the highest-scoring sequence alignment to a given read. We developed the StringDecomposer algorithm for solving this problem, benchmarked it on the set of reads generated by the Telomere-to-Telomere consortium, and identified a novel (rare) monomer that extends the set of twelve X-chromosome-specific monomers identified more than three decades ago. The accurate translation of each read into a monomer alphabet turns centromere assembly into a more tractable problem than the notoriously difficult problem of assembling centromeres in the nucleotide alphabet. Our identification of a novel monomer emphasizes the importance of careful identification of all (even rare) monomers for follow-up centromere assembly efforts.
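To make the problem statement concrete, the following is a minimal sketch of the String Decomposition Problem as a dynamic program. It is not the StringDecomposer implementation, whose details are not given in the abstract. The function names (`global_scores`, `decompose`), the scoring values (match +1, mismatch -1, gap -1), the toy monomers, and the restriction to whole monomers (no partial monomer at either end of the read) are all assumptions made for illustration. The approach is deliberately naive and only suitable for short toy inputs.

```python
def global_scores(monomer, segment, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch scores of `monomer` against every prefix of `segment`.

    Returns a list `row` where row[k] is the best global alignment score
    between the whole monomer and segment[:k].
    """
    # Row for the empty monomer prefix: segment[:k] is aligned to gaps only.
    prev = [k * gap for k in range(len(segment) + 1)]
    for a in monomer:
        cur = [prev[0] + gap]  # monomer prefix aligned to an empty segment
        for k, b in enumerate(segment, start=1):
            diag = prev[k - 1] + (match if a == b else mismatch)
            cur.append(max(diag, prev[k] + gap, cur[k - 1] + gap))
        prev = cur
    return prev


def decompose(read, monomers, **scoring):
    """Best-scoring concatenation of whole monomers for `read`.

    `monomers` maps a (hypothetical) monomer name to its sequence.
    Returns (score, list of monomer names in read order).
    """
    n = len(read)
    neg_inf = float("-inf")
    # best[i]: best score of aligning read[:i] to a concatenation of monomers.
    best = [neg_inf] * (n + 1)
    best[0] = 0
    back = [None] * (n + 1)  # back[i] = (start of last block, monomer name)
    for s in range(n):
        if best[s] == neg_inf:
            continue
        for name, seq in monomers.items():
            row = global_scores(seq, read[s:], **scoring)
            # Each block must consume at least one read character (k >= 1).
            for k in range(1, len(row)):
                cand = best[s] + row[k]
                if cand > best[s + k]:
                    best[s + k] = cand
                    back[s + k] = (s, name)
    # Trace back the chosen monomers from the end of the read.
    names, i = [], n
    while i > 0:
        s, name = back[i]
        names.append(name)
        i = s
    return best[n], names[::-1]


# Toy example with two made-up monomers and a read containing one mismatch.
toy_monomers = {"A": "ACGT", "B": "TTGA"}
score, names = decompose("ACGTTTCAACGT", toy_monomers)
print(score, names)
```

The sketch recomputes an alignment table for every start position and every monomer, so its running time grows roughly cubically with read length. A practical method would need to share work across start positions and handle partial monomers at the read ends.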