ContaTester: Fast cross-contamination estimation and identification for large human sequencing cohorts

bioRxiv (2021)

Abstract
Background: Interest in genomic medicine for human health studies and clinical applications is rapidly increasing. Clinical applications require contamination-free samples to avoid misleading results and provide a sound basis for diagnosis.

Results: Here we present ContaTester, a tool that requires only allele balance information gathered from a VCF file to detect cross-contamination in germline human DNA samples. Based on a regression model of the allele balance distribution, ContaTester allows fast checking of contamination levels for single samples or large cohorts (less than two minutes per sample). We demonstrate the efficiency of ContaTester using experimental validations: ContaTester gives results similar to those of methods requiring alignment data, but with a significantly reduced storage footprint and less computation time. Additionally, for contamination levels above 5%, ContaTester can identify contaminants across a cohort, providing important clues for troubleshooting and quality assessment.

Conclusions: ContaTester estimates contamination levels from VCF files generated from whole genome sequencing of normal samples and provides reliable contaminant identification for cohorts or experimental batches.

### Competing Interest Statement

The authors have declared no competing interest.

### Acronyms

r²: coefficient of determination
AB: Allele Balance
AD: Allele Depth
BAM: Binary Alignment Map
GnomAD: Genome Aggregation Database
InDels: Insertions-Deletions
SAM: Sequence Alignment Map
SNVs: Single Nucleotide Variants
VCF: Variant Call Format
WGS: Whole Genome Sequencing
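
The abstract says that ContaTester works from allele balance (AB) information in a VCF file. As a rough illustration of what that input looks like, the sketch below derives AB values from the standard per-sample AD (allele depth) field, as ALT depth divided by total depth at biallelic SNVs. It is not ContaTester's implementation: the function name `allele_balances`, the filtering choices, and the example records are all hypothetical, and the regression model mentioned in the abstract is not reproduced here.

```python
from typing import Iterable, List

BASES = {"A", "C", "G", "T"}


def allele_balances(vcf_lines: Iterable[str]) -> List[float]:
    """Return ALT allele balance (ALT depth / total depth) for the first sample.

    Illustrative assumption: only biallelic SNVs with a two-value AD field
    are kept; everything else is skipped.
    """
    balances = []
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns
        ref, alt = fields[3], fields[4]
        if ref not in BASES or alt not in BASES:
            continue  # not a biallelic SNV
        keys = fields[8].split(":")
        values = fields[9].split(":")
        if "AD" not in keys:
            continue
        idx = keys.index("AD")
        if idx >= len(values):
            continue  # trailing fields were dropped for this sample
        try:
            ref_depth, alt_depth = (int(x) for x in values[idx].split(","))
        except ValueError:
            continue  # missing or malformed AD
        total = ref_depth + alt_depth
        if total == 0:
            continue
        balances.append(alt_depth / total)
    return balances


# Hypothetical records: a balanced heterozygous SNV, a low-AB SNV, and an InDel.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:AD\t0/1:10,10",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:AD\t0/0:19,1",
    "chr1\t300\t.\tAT\tA\t50\tPASS\t.\tGT:AD\t0/1:8,8",
]
print(allele_balances(example))  # [0.5, 0.05]
```

In an uncontaminated germline sample, AB values are expected to cluster near 0.5 (heterozygous) and 1.0 (homozygous ALT); a tool of this kind would look at how the observed distribution departs from that pattern.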