A Benchmark of Entropy Coders for the Compression of Genome Sequencing Data

2022 Data Compression Conference (DCC)(2022)

Abstract
Genomic sequencing data contain three different data fields: read names, quality values, and nucleotide sequences. In this work, a variety of entropy encoders and compression algorithms were benchmarked in terms of compression-decompression rates and times, separately for each data field, both as raw data from FASTQ files (implemented in the Fastq analysis script) and as MPEG-G uncompressed descriptor symbols decoded from MPEG-G bitstreams (implemented in the symbols analysis script). The results of this benchmark are then compared to the performance of CABAC, the encoder used in the first edition of the ISO/IEC MPEG-G standard for all types of descriptors, because it achieved the best compression rates for the three types of data and thus better overall compression than the other entropy coders. However, in some use cases encoding and decoding speed might be of higher interest than compression, and for specific datasets, types of data, or descriptor streams, other entropy coders might provide higher speed and/or better compression performance than CABAC.
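The per-field benchmark described above can be illustrated with a minimal sketch. It is not the authors' Fastq analysis script. It assumes unwrapped FASTQ records (exactly four lines each) and uses the general-purpose codecs in the Python standard library (zlib, bz2, lzma) as stand-ins for the entropy coders evaluated in the paper. The function names `split_fastq_fields` and `benchmark` and the sample reads are hypothetical. MPEG-G descriptor symbols are not covered.

```python
import bz2
import lzma
import time
import zlib


def split_fastq_fields(text):
    """Split FASTQ text into the three data fields named in the abstract.

    Assumption: every record occupies exactly four lines (no wrapped sequences).
    """
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    return {
        "read_names": "\n".join(lines[0::4]).encode("ascii"),
        "nucleotide_sequences": "\n".join(lines[1::4]).encode("ascii"),
        "quality_values": "\n".join(lines[3::4]).encode("ascii"),
    }


# Stand-in codecs from the standard library; NOT the coders from the paper.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(fields):
    """Measure compressed size and encode/decode time per field and codec."""
    results = []
    for field_name, raw in fields.items():
        for codec_name, (compress, decompress) in CODECS.items():
            t0 = time.perf_counter()
            packed = compress(raw)
            t1 = time.perf_counter()
            restored = decompress(packed)
            t2 = time.perf_counter()
            if restored != raw:
                raise RuntimeError(f"{codec_name} failed to round-trip {field_name}")
            results.append({
                "field": field_name,
                "codec": codec_name,
                "raw_bytes": len(raw),
                "compressed_bytes": len(packed),
                "ratio": len(raw) / len(packed),
                "encode_s": t1 - t0,
                "decode_s": t2 - t1,
            })
    return results


# Hypothetical two-read sample, for illustration only.
SAMPLE_FASTQ = """@read1
ACGTACGTAC
+
IIIIIHHHGG
@read2
TTGACCGATA
+
IIHHGGFFEE
"""

if __name__ == "__main__":
    for row in benchmark(split_fastq_fields(SAMPLE_FASTQ)):
        print(row)
```

On an input this small the codec headers dominate, so the ratios are not meaningful. A real measurement would use full FASTQ files and repeated timing runs.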
Keywords
MPEG-G, entropy coders, sequencing data, compression