MPEG-G Reference-Based Compression of Unaligned Reads Through Ultra-Fast Alignments

2022 Data Compression Conference (DCC), 2022

Abstract
With the widespread application of next generation sequencing technologies, the volume of sequencing data has become comparable to that of big data domains. Compressing sequencing reads (nucleotide sequences, quality values, read names), in both raw and aligned form, alleviates the bandwidth, transfer, and storage requirements of genomics pipelines. ISO/IEC MPEG-G standardizes the compressed representation (i.e., storage and streaming) of structured, indexed sets of genomic sequencing data for both raw and aligned data. For the latter, reference-based compression encodes the nucleotide sequence of each read using its alignment to a reference sequence: it stores the starting position of the alignment on the reference and the differences between the reference and the actual read. This general scheme is implemented in different ways by genomic data compressors such as DeeZ, Quip, and CRAM, which apply to aligned reads.
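The reference-based scheme described in the abstract can be illustrated with a minimal sketch. It is not the MPEG-G, DeeZ, Quip, or CRAM encoding: it assumes a read that aligns without insertions or deletions, and the names `encode_read` and `decode_read` are hypothetical helpers introduced only for this illustration.

```python
from typing import List, Tuple

# Hypothetical record layout for this sketch:
# (start position on the reference, read length, list of mismatches)
Diff = Tuple[int, str]            # (offset within the read, base found in the read)
Record = Tuple[int, int, List[Diff]]


def encode_read(reference: str, read: str, start: int) -> Record:
    """Encode a read as its start position plus its differences from the reference.

    Assumption: the alignment contains substitutions only (no indels).
    """
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    # Keep only the positions where the read disagrees with the reference.
    diffs = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return start, len(read), diffs


def decode_read(reference: str, record: Record) -> str:
    """Rebuild the read from the reference and the stored differences."""
    start, length, diffs = record
    bases = list(reference[start:start + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGT"
read = "GTATGT"                   # aligns at position 2 with one substitution
record = encode_read(reference, read, start=2)
restored = decode_read(reference, record)
```

In this toy case the six-base read is reduced to a start position, a length, and a single mismatch, which suggests why reads that closely match the reference compress well under such a scheme.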
Keywords
next generation sequencing technologies, big data domains, nucleotide sequences, read names, compressed representation, genomic sequencing data, reference-based compression, alignment information, reference sequence, genomic data compressors, DeeZ compressor, Quip compressor, CRAM compressor