Study on reference-based FASTQ genome sequences compression.
International Conference on Bioinformatics and Intelligent Computing (BIC)(2022)
Abstract
As the cost of genome sequencing falls, the growing volume of genomic data creates a serious storage problem, and specialized compression of FASTQ files still leaves considerable room for improvement. This paper explores a reference-based lossless compression algorithm for genome sequences in FASTQ format. We propose a compression scheme based on longest matching, using an FMD-index to support exact-match searching. The scheme also exploits reverse-complement sequences and describes insertion, deletion, and substitution operations compactly to further improve the compression ratio. Compared with the experimental results of five compressors on seven genome data sets, the proposed algorithm significantly improves FASTQ compression ratios and is competitive in running time.
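The idea of reference-based longest matching with reverse-complement support can be illustrated with a minimal sketch. This is not the paper's implementation: it replaces the FMD-index with a naive `str.find` search, represents any unmatched base as a literal instead of modeling insertions, deletions, and substitutions separately, and all names (`longest_match`, `encode_read`, `decode_read`, `min_len`) are hypothetical.

```python
# Illustrative sketch only: naive search stands in for the FMD-index,
# and the token format is an assumption, not the paper's encoding.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (other symbols kept as-is)."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_match(reference, seq, start, min_len=4):
    """Find the longest prefix of seq[start:] that occurs exactly in reference.

    Returns (ref_pos, length), or (-1, 0) if no match of at least min_len exists.
    """
    best_pos, best_len = -1, 0
    length = min_len
    while start + length <= len(seq):
        pos = reference.find(seq[start:start + length])
        if pos == -1:
            break
        best_pos, best_len = pos, length
        length += 1
    return best_pos, best_len


def tokenize(reference, seq, min_len):
    """Greedily cover seq with match tokens ("M", pos, len) and literals ("L", base)."""
    tokens, i = [], 0
    while i < len(seq):
        pos, length = longest_match(reference, seq, i, min_len)
        if length:
            tokens.append(("M", pos, length))
            i += length
        else:
            tokens.append(("L", seq[i]))
            i += 1
    return tokens


def encode_read(reference, read, min_len=4):
    """Encode a read on whichever strand needs fewer tokens."""
    forward = ("+", tokenize(reference, read, min_len))
    reverse = ("-", tokenize(reference, reverse_complement(read), min_len))
    return min(forward, reverse, key=lambda c: len(c[1]))


def decode_read(reference, strand, tokens):
    """Rebuild the original read from its strand flag and tokens."""
    parts = [reference[t[1]:t[1] + t[2]] if t[0] == "M" else t[1] for t in tokens]
    seq = "".join(parts)
    return seq if strand == "+" else reverse_complement(seq)


reference = "ACGTACGTTAGCCGATAGGCTA"
read = reverse_complement(reference[2:14])      # read sequenced from the opposite strand
strand, tokens = encode_read(reference, read)
restored = decode_read(reference, strand, tokens)
```

In this toy example the read matches the reference only on the reverse strand, so it collapses to a single match token, and decoding restores the read exactly. A real compressor would additionally entropy-code the tokens and handle the FASTQ identifier and quality lines, which this sketch ignores.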