An Ensemble Approach To Accurately Detect Somatic Mutations Via Adaptive Boosting

CANCER RESEARCH(2015)

Abstract
Identifying somatic mutations is a key analysis in cancer research. The challenge lies in the impure and heterogeneous nature of tumor samples. Often, an algorithm works well for one tumor but poorly for another. Here, we present an ensemble approach that integrates multiple algorithms and demonstrate its high accuracy through validation on both synthetic and real data. Our approach incorporates state-of-the-art callers including MuTect, SomaticSniper, VarScan2, JointSNVMix2, and VarDict for somatic mutation detection. Each of these algorithms has its own strengths and can detect variants that some of the others miss. The call sets are combined based on 70 independent sequencing and genomic features, which are then used by an adaptively boosted decision tree learner. The learner is trained on sophisticated simulated data to discriminate true mutations from the very noisy data of tumor samples. In our latest submission to the ICGC-TCGA DREAM Mutation Calling Challenge (the Challenge), our approach obtained an unprecedented somatic SNV detection accuracy of 97.1%, with a recall of 94.2% and a precision of 99.9%. The synthetic data consisted of a tumor-normal pair of samples, each sequenced at 30x depth. The tumor sample was synthesized by spiking in a whole spectrum of variants ranging from SNVs/Indels to SVs, resulting in an SNV variant allele frequency (VAF) of 25%. We further validated our approach using “in silico titration”. The titration mixed two different real genomes with validated ground truths at varying proportions to generate a range of sample conditions, from the simplest case, in which the normal and tumor were pure, to the more challenging case, in which the tumor and normal tissues were cross-contaminated. At VAFs of 50%, 25%, and 15%, our approach achieved accuracies of 95.7%, 92.5%, and 85.3%, respectively, based on cross-validation, consistent with the results from the Challenge.
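The idea of feeding per-candidate features to an adaptively boosted learner can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes depth-1 decision trees (stumps) rather than full decision trees, uses three hypothetical features (number of callers reporting the site, tumor VAF, normal VAF) instead of the 70 described above, and trains on a small invented toy set.

```python
import math

def stump_predict(row, j, thr, sign):
    """Depth-1 tree: return `sign` when feature j is >= thr, otherwise -sign."""
    return sign if row[j] >= thr else -sign

def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted training error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, thr, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, rounds):
    """Discrete AdaBoost: reweight samples so later stumps focus on mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, thr, sign = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, thr, sign))
        # Misclassified samples are scaled up, correctly classified ones down.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, thr, sign))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, row):
    """Weighted vote of all stumps: +1 = call kept as somatic, -1 = rejected."""
    score = sum(a * stump_predict(row, j, thr, sign) for a, j, thr, sign in model)
    return 1 if score >= 0 else -1

# Hypothetical toy candidates: [callers reporting the site, tumor VAF, normal VAF]
X = [
    [5, 0.25, 0.00], [4, 0.30, 0.00], [3, 0.22, 0.01], [1, 0.20, 0.00],
    [1, 0.05, 0.00], [2, 0.04, 0.02], [4, 0.45, 0.40], [3, 0.50, 0.48],
]
# Invented labels: +1 = true somatic mutation, -1 = artifact or germline variant
y = [1, 1, 1, 1, -1, -1, -1, -1]

model = adaboost(X, y, rounds=3)
train_predictions = [predict(model, row) for row in X]
print(train_predictions)
```

On this toy set no single stump separates the two classes, so the combined vote is doing the work. A real pipeline would train on labeled simulated data, as described above, and evaluate on held-out samples rather than on the training set.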
Finally, we validated our approach with three widely used, published cancer datasets obtained from TCGA and EGA: a whole-genome sequenced malignant melanoma cell line, a whole-genome sequenced chronic lymphocytic leukemia cell line, and a whole-exome sequenced colon adenocarcinoma patient sample with experimentally validated somatic mutations. Our approach was trained on the data from the Challenge and applied to these samples to measure its accuracy. We achieved recalls of 98.9%, 89.1%, and 87.9%, respectively. Although precision on real data cannot be measured without comprehensive whole-genome experimental validation, our call sets were smaller than those of all other methods considered, implying that our approach has the highest precision among them. We extended our study of the above three validation approaches, namely synthetic genomes, in silico titration, and real samples, to compare accuracy against all five individual callers. We found that our approach had the highest accuracy when compared to any individual caller. To conclude, our approach showed high accuracy across different types and conditions of tumor samples and was by far the best in its class.
Citation Format: Li Tai Fang, Pegah T. Afshar, John C. Mu, Narges Bani Asadi, Wing H. Wong, Hugo Y. K. Lam. An ensemble approach to accurately detect somatic mutations via adaptive boosting. [abstract]. In: Proceedings of the 106th Annual Meeting of the American Association for Cancer Research; 2015 Apr 18-22; Philadelphia, PA. Philadelphia (PA): AACR; Cancer Res 2015;75(15 Suppl):Abstract nr LB-306. doi:10.1158/1538-7445.AM2015-LB-306