LISA: A Case For Learned Index based Acceleration of Biological Sequence Analysis

Darryl Ho, Saurabh Kalikar, Sanchit Misra, Jialin Ding, Vasimuddin Md, Nesime Tatbul, Heng Li, Tim Kraska

bioRxiv (Cold Spring Harbor Laboratory), 2023

Abstract

Next Generation Sequencing (NGS) is transforming fields such as genomics, transcriptomics, and epigenetics through rapidly increasing throughput at reduced cost. This growth also demands overcoming performance bottlenecks in the downstream analysis of the sequencing data. A key performance bottleneck is searching for exact matches of short DNA/RNA query sequences, or of their substrings, in a long reference sequence database. This task is typically performed using an index of the reference, such as an FM-index, suffix array, suffix tree, hash table, or lookup table. In this paper, we propose accelerating this sequence search by substituting or enhancing these indexes with machine-learning-based indexes, called learned indexes, and present LISA (Learned Indexes for Sequence Analysis). We evaluate LISA through a number of case studies that cover widely used software tools; short and long reads; human, animal, and plant genome datasets; DNA and RNA sequences; and various traditional indexing techniques (FM-indexes, hash tables, and suffix arrays). We demonstrate significant performance benefits in a majority of them. For example, our experiments on real datasets show that LISA achieves speedups of up to 2.2-fold and 4.7-fold over the state-of-the-art FM-index-based implementations of the exact sequence search modules in the popular tools bowtie2 and BWA-MEM2, respectively.

Code availability: LISA-based FM-index: ; LISA-based hash-table: ; LISA applied to BWA-MEM2: .

Competing Interest Statement: The authors have declared no competing interest.
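To make the general idea of a learned index concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy setting: fixed-length k-mers of a reference are encoded as integers and kept in a sorted array, a single linear model predicts where a key should sit, and a binary search confined to the model's worst-case error window finishes the lookup. The names `encode`, `LearnedIndex`, and `kmers_of`, and the choice of one linear model, are illustrative assumptions; the abstract does not specify the models or data layout LISA uses.

```python
import bisect


def encode(kmer):
    """Map a DNA string to an integer using 2 bits per base (A, C, G, T)."""
    code = 0
    for base in kmer:
        code = code * 4 + "ACGT".index(base)
    return code


def kmers_of(reference, k):
    """Return every length-k substring of the reference."""
    return [reference[i:i + k] for i in range(len(reference) - k + 1)]


class LearnedIndex:
    """Toy learned index: a linear model over sorted integer keys.

    Assumption: `keys` is non-empty. This is an illustration only.
    """

    def __init__(self, keys):
        self.keys = sorted(set(keys))
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        # Fit a line through the first and last (key, position) pairs.
        self.slope = (n - 1) / (hi - lo) if hi > lo else 0.0
        self.intercept = -self.slope * lo
        # Worst-case distance between predicted and true position.
        self.max_err = max(
            abs(self._predict(key) - pos) for pos, key in enumerate(self.keys)
        )

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of `key` in the sorted array, or -1 if absent."""
        n = len(self.keys)
        guess = min(max(self._predict(key), 0), n - 1)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, n)
        # Binary search restricted to the error window around the guess.
        pos = bisect.bisect_left(self.keys, key, lo, hi)
        return pos if pos < n and self.keys[pos] == key else -1


reference = "ACGTACGTTAGCCGATAG"  # made-up toy reference
index = LearnedIndex(encode(kmer) for kmer in kmers_of(reference, 4))
found = index.lookup(encode("TTAG")) != -1    # k-mer occurs in the reference
missing = index.lookup(encode("AAAA")) == -1  # k-mer does not occur
```

The appeal of this design is that the model replaces most of the search steps with one arithmetic prediction, so the remaining search cost depends on the model's error rather than on the size of the reference.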

Keywords

Text Indexing, Sequence Alignment, Hashing, Functional Genomics, Support Vector Machines