A Lossless Compression Pipeline for Petabyte-Scale Whole Genome Sequencing Data

Ajeya Bhat, Sai Manasa Chadalavada, Nagakishore Jammula, Chirag Jain, Yogesh Simmhan

2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC), 2023

Abstract
Whole genome sequencing (WGS) technologies have enabled high-throughput, cost-effective genome sequencing at the population scale. A single WGS instrument can sequence millions of DNA molecules simultaneously, generating massive datasets. GenomeIndia is an ongoing national project aimed at sequencing the genomes of 10,000 Indian individuals. The GenomeIndia sequencing centers are completing the generation of petabyte-scale genomic data, which has raised an urgent need for scalable lossless compression software to make storing and exchanging the data cost-effective. By default, each WGS file produced in the GenomeIndia project is stored in the standard unmapped BAM (uBAM) format. A uBAM file stores the DNA sequences as well as metadata associated with the sequencing experiment. We have developed an open-source software pipeline that enables parallel lossless compression and decompression of uBAM files. It produces compressed output that is approximately 5× smaller than the input uBAM files. We engineered the pipeline by integrating several bioinformatics tools: SPRING, Picard, SAMtools, and PySAM. We evaluated the parallel efficiency of our approach using thorough performance profiling and strong-scaling experiments.
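The two metrics named in the abstract, compression ratio and parallel efficiency under strong scaling, can be illustrated with a minimal sketch. It assumes the standard textbook definitions (speedup is baseline time divided by measured time; efficiency is speedup divided by the factor by which the worker count grew), which may differ from the authors' exact methodology. The function names and all numbers below are hypothetical placeholders, not results from the paper.

```python
# Minimal sketch of strong-scaling and compression-ratio arithmetic.
# All names and sample values are hypothetical; none come from the paper.


def strong_scaling(times_by_workers):
    """Return {workers: (speedup, efficiency)} for a fixed problem size.

    times_by_workers maps a worker count to the wall-clock time measured
    with that many workers. The smallest worker count is the baseline.
    """
    base_workers = min(times_by_workers)
    base_time = times_by_workers[base_workers]
    results = {}
    for workers, elapsed in sorted(times_by_workers.items()):
        speedup = base_time / elapsed
        # Ideal speedup equals the growth in worker count over the baseline.
        efficiency = speedup / (workers / base_workers)
        results[workers] = (speedup, efficiency)
    return results


def compression_ratio(original_bytes, compressed_bytes):
    """Ratio of input size to compressed size (larger means better)."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return original_bytes / compressed_bytes


# Placeholder timings in seconds, keyed by worker count (illustrative only).
sample_times = {1: 120.0, 2: 64.0, 4: 36.0}

for workers, (speedup, efficiency) in strong_scaling(sample_times).items():
    print(f"{workers} workers: speedup {speedup:.2f}, efficiency {efficiency:.2f}")

# Placeholder sizes: a 100 GB input compressed to 20 GB.
print(f"compression ratio: {compression_ratio(100e9, 20e9):.1f}")
```

An efficiency close to 1.0 means that adding workers shortens the run almost proportionally; values well below 1.0 usually point to serial stages or I/O contention in the pipeline.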
Key words
Genome Sequencing, Sequencing Data, Whole-genome Sequencing, Genome Sequence Data, Lossless Compression, Compression Pipeline, High-throughput Sequencing, Bioinformatics Tools, Sequencing Center, Strong Scaling, Indian Individuals, Separate File, Parallelization, Fastq Files, FASTQ Format, Single File, Large Sequence, File Size, Original Files, Compression Ratio, Drop Time, Metadata File, Compression Efficiency, Compression Algorithm, Time Compression, Pipeline Stages