Can Formal Languages Help Pangenomics to Represent and Analyze Multiple Genomes?

International Conference on Developments in Language Theory (DLT) (2022)

Abstract
Graph pangenomics is an emerging field in computational biology that is changing the traditional view of a reference genome from a linear sequence to a new paradigm: a sequence graph (pangenome graph, or simply pangenome) that represents the main similarities and differences among multiple evolutionarily related genomes. The speed at which genome data is produced, driven by advances in sequencing technologies, far outpaces the slow progress in developing new methods for constructing and analyzing a pangenome. Most recent advances in the field are still based on notions rooted in the established, long-standing literature on combinatorics on words, formal languages, and space-efficient data structures. In this paper we discuss two novel notions that may help in managing and analyzing multiple genomes by addressing a relevant question: how can we summarize sequence similarities and dissimilarities in large sequence data? The first notion is related to variants of the Lyndon factorization and allows sequence similarities to be represented for a sample of reads, while the second is that of the sample-specific string, a tool to detect differences in a sample of reads. New perspectives opened by these two notions are discussed.
Keywords
analyze multiple genomes,pangenomics,formal languages