谷歌Chrome浏览器插件
订阅小程序
在清言上使用

KMC3 and CHTKC: Best Scenarios, Deficiencies, and Challenges in High-Throughput Sequencing Data Analysis

Deyou Tang, Daqiang Tan, Weihao Xiao, Jiabin Lin, Juan Fu

ALGORITHMS(2022)

引用 0|浏览6
暂无评分
摘要
Background: K-mer frequency counting is an upstream process of many bioinformatics data analysis workflows. KMC3 and CHTKC are the representative partition-based k-mer counting and non-partition-based k-mer counting algorithms, respectively. This paper evaluates the two algorithms and presents their best applicable scenarios and potential improvements using multiple hardware contexts and datasets. Results: KMC3 uses less memory and runs faster than CHTKC on a regular configuration server. CHTKC is efficient on high-performance computing platforms with high available memory, multi-thread, and low IO bandwidth. When tested with various datasets, KMC3 is less sensitive to the number of distinct k-mers and is more efficient for tasks with relatively low sequencing quality and long k-mer. CHTKC performs better than KMC3 in counting assignments with large-scale datasets, high sequencing quality, and short k-mer. Both algorithms are affected by IO bandwidth, and decreasing the influence of the IO bottleneck is critical as our tests show improvement by filtering and compressing consecutive first-occurring k-mers in KMC3. Conclusions: KMC3 is more competitive for running counter on ordinary hardware resources, and CHTKC is more competitive for counting k-mers in super-scale datasets on higher-performance computing platforms. Reducing the influence of the IO bottleneck is essential for optimizing the k-mer counting algorithm, and filtering and compressing low-frequency k-mers is critical in relieving IO impact.
更多
查看译文
关键词
k-mer,KMC,CHTKC,next generation sequencing,algorithm evaluation,optimization
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要