Aligning Distant Sequences to Graphs using Long Seed Sketches

bioRxiv (2022)

Abstract
Sequence-to-graph alignment is an important step in applications such as variant genotyping, read error correction, and genome assembly. When a query sequence requires a substantial number of edits to align, approximate alignment tools that follow the seed-and-extend approach need shorter seeds to find any matches. However, in large graphs with high variation, relying on a shorter seed length leads to an exponential increase in spurious matches. We propose a novel seeding approach that relies on long inexact matches instead of short exact matches. We demonstrate experimentally that our approach achieves a better time-accuracy trade-off in settings with up to a 25% mutation rate. We achieve this by sketching a subset of graph nodes and storing the sketches in a K-nearest neighbor index. While sketches are more robust to indels, finding the nearest neighbor of a sketch in a high-dimensional space is more computationally challenging than finding exact seeds. We demonstrate that storing sketch vectors in a K-nearest neighbor index circumvents the curse of dimensionality. Our long sketch-based seed scheme contrasts with existing approaches and highlights the important role that tensor sketching can play in bioinformatics applications. Our proposed seeding method and implementation have several advantages: i) we empirically show that our method is efficient and scales to graphs with 1 billion nodes, with time and memory requirements for preprocessing growing linearly with graph size and query time growing quasi-logarithmically with query length; ii) for queries with an edit distance of 25% relative to their length, on the 1 billion node graph, longer sketch-based seeds yield a 4x increase in recall compared to exact seeds; iii) conceptually, our seeder can be incorporated into other aligners, proposing a novel direction for sequence-to-graph alignment. The implementation is available at: https://github.com/ratschlab/tensor-sketch-alignment.
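The idea of seeding with long inexact matches can be illustrated with a toy sketch. This is not the authors' implementation and does not use tensor sketching: it assumes a linear reference string as a stand-in for paths through a graph, a normalized 2-mer frequency vector as a stand-in sketch, and a brute-force scan as a stand-in for a K-nearest neighbor index. The names `sketch`, `build_index`, and `nearest` are hypothetical.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"
# All 16 possible 2-mers, in a fixed order, define the sketch dimensions.
KMERS = ["".join(p) for p in product(ALPHABET, repeat=2)]


def sketch(seq):
    """Normalized 2-mer frequency vector (a simple stand-in for a tensor sketch)."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(sum(counts.values()), 1)  # avoid division by zero on short input
    return [counts[kmer] / total for kmer in KMERS]


def build_index(reference, window, stride):
    """Sketch every `stride`-th window of the reference; return (position, vector) pairs."""
    return [
        (pos, sketch(reference[pos:pos + window]))
        for pos in range(0, len(reference) - window + 1, stride)
    ]


def nearest(index, query, k=1):
    """Brute-force k-nearest-neighbor search by Euclidean distance between sketches."""
    q = sketch(query)
    scored = [(math.dist(q, vec), pos) for pos, vec in index]
    return sorted(scored)[:k]  # list of (distance, position), closest first


# Toy reference made of four 8-base blocks, indexed one window per block.
index = build_index("AAAAAAAACCCCCCCCGGGGGGGGTTTTTTTT", window=8, stride=8)
# A seed with one substitution and a seed with one deletion, both from the C block.
substituted_hit = nearest(index, "CCCCGCCC")[0]
deleted_hit = nearest(index, "CCCCCCC")[0]
```

In this toy setting both mutated seeds still map to the window at position 8, whereas an exact 8-base seed lookup would find no match for either. A real index would replace the linear scan in `nearest` with an approximate nearest-neighbor structure so that query cost does not grow linearly with the number of indexed sketches.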
### Competing Interest Statement

The authors have declared no competing interest.
Keywords
long seed sketches,distant sequences,graphs