Efficient minimizer orders for large values of k using minimum decycling sets

bioRxiv (2022)

Abstract
Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long sub-sequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers overall than necessary and therefore provide limited improvement to runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders resulting in fewer selected k-mers. Unfortunately, generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus cannot help in the many applications that need minimizers of larger k. Here, we close this gap by introducing decycling set-based minimizer orders. We define new orders based on minimum decycling sets, which are guaranteed to hit any infinitely long sequence. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets, and can also scale up to larger k. Furthermore, we developed a query method that avoids the need to keep the k-mers of a decycling set in memory, which enables the use of these minimizer orders for any value of k. We expect the new decycling set-based minimizer orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.

Competing Interest Statement: The authors have declared no competing interest.
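The minimizer scheme described in the abstract (pick the minimum k-mer in every L-long window under a fixed k-mer order) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `minimizers` and `set_first_order`, the leftmost tie-breaking rule, and the default lexicographic order are assumptions made for illustration. In particular, `set_first_order` only shows the general idea of an order that ranks members of some given k-mer set ahead of all others; the set passed in below is an arbitrary placeholder, not a decycling set or a universal hitting set.

```python
def minimizers(seq, k, L, order=None):
    """Return sorted (position, k-mer) pairs chosen by a minimizer scheme.

    For every window of L consecutive characters, the k-mer that is
    smallest under `order` is selected (ties go to the leftmost k-mer,
    an assumption of this sketch).
    """
    if order is None:
        order = lambda kmer: kmer  # assumption: plain lexicographic order
    if k > L or L > len(seq):
        return []
    selected = set()
    for start in range(len(seq) - L + 1):
        # Start positions of all k-mers fully contained in this window.
        positions = range(start, start + L - k + 1)
        best = min(positions, key=lambda i: (order(seq[i:i + k]), i))
        selected.add(best)
    return [(i, seq[i:i + k]) for i in sorted(selected)]


def set_first_order(priority_set):
    """Hypothetical order: k-mers in `priority_set` rank before all others;
    within each group the order is lexicographic."""
    return lambda kmer: (0 if kmer in priority_set else 1, kmer)


lexicographic = minimizers("ACGTACGT", k=3, L=5)
# [(0, 'ACG'), (1, 'CGT'), (4, 'ACG')]

prioritized = minimizers("ACGTACGT", k=3, L=5, order=set_first_order({"GTA"}))
# [(2, 'GTA'), (4, 'ACG')]
```

In this toy input the set-based order selects two positions instead of three, which is the kind of reduction in selected k-mers that the abstract attributes to better-designed orders; the example makes no claim about how large that effect is on real data.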