Fast Hybrid Data Structure for a Large Alphabet K-Mers Indexing for Whole Genome Alignment

IEEE ACCESS(2021)

引用 0|浏览2
暂无评分
摘要
The most common index data structures used by whole genome aligners (WGA) are based on suffix trees (ST), suffix arrays, and FM-indexes. These data structures show good performance results as WGA works with sequences of letters over small alphabets; for example, four letters a, c, t, g for DNA alignment. A novel whole genome aligner, which we are developing, will work with distances between the label sites on DNA samples, which are represented as a sequence of positive integers. The size of alphabet sigma is theoretically unlimited. This has prompted us to investigate if there is a better structure that would improve search performance on large alphabets compared to the commonly used suffix-based structures. This paper presents the implementation of a highly optimized hybrid index data structure based on a ternary search tree (TST) and hash tables, which perform much better when working with strings on large alphabets compared to the ST. Single core parallelism was achieved using advanced vector extensions (AVX) single instruction multiple data (SIMD) instruction set. When searching for short k-mers over an alphabet of 25, 695 letters, our index search performance was up to 29 times better than the search performance of the reference ST. When the alphabet was compressed to approximately 1; 300 letters, our index search performance was still up to 2.6 times better than the ST. The source code is available free on http://olgen.cz/Resources/Upload/Home/public/software/hds.zip under the MIT license.
更多
查看译文
关键词
Engineering management, engineering in medicine and biology, bioinformatics, computational and artificial intelligence, computer architecture, data structures, tree data structures, parallel processing, parallel algorithms
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要