Whole exome sequencing and machine learning germline analysis of individuals presenting with phenotypes of extreme high and low risk of developing tobacco-induced lung adenocarcinoma

JOURNAL OF CLINICAL ONCOLOGY (2023)

Abstract
10507
Background: Tobacco is the main risk factor for developing lung cancer. Yet while some heavy smokers develop lung cancer at a young age, others never develop it, even at an advanced age. This suggests remarkable variability in individual susceptibility to the carcinogenic effects of tobacco. We characterized the germline profile of subjects presenting these extreme phenotypes with Whole Exome Sequencing (WES) and Machine Learning (ML).
Methods: We sequenced germline DNA from heavy smokers who either developed lung adenocarcinoma at an early age (extreme cases) or did not develop it at an advanced age (extreme controls). The discovery cohort included 50 extreme cases and 50 extreme controls, and the validation cohort included 66 extreme cases and 83 extreme controls, selected from databases including > 6,000 subjects. We selected individual coding variants and variant-rich genes showing a significantly different distribution between extreme cases and controls. We trained ML models (Logistic Regression, Random Forest, Support Vector Machine Classifier (SVC)) on the discovery cohort to classify subjects into their respective phenotypes and tested them on the validation cohort.
Results: Across both cohorts, mean age was 50.2 years for extreme cases and 78.4 years for extreme controls; mean tobacco consumption was 38.1 and 59.1 pack-years, respectively. We validated 16 significant individual variants. The most significant variants were in ADAMTS7 (2 variants) in cases and TMEM191B (1) in controls. We validated 33 genes enriched with significant variants. The genes harboring the most variants were HLA-A (4 variants) and ADAMTS7 (2) in cases, and PLIN4 (2) in controls (Table). We trained several ML models on the discovery cohort using as input the 16 significant individual variants and the number of variants in the 33 enriched genes. Tested on the validation cohort, the best model (SVC), using the 16 variants as input, reached an accuracy of 72% and an AUC-ROC of 87.4%, confirming the association of these variants with the phenotypes. The validated genes included oncogenes and tumor suppressors, as well as genes involved in DNA repair, maintenance of genomic stability, HLA-mediated antigen presentation, and regulation of proliferation, migration, apoptosis, and inflammatory pathways.
Conclusions: Individuals presenting phenotypes of extreme high and low risk of developing tobacco-induced lung adenocarcinoma have different germline profiles. Our strategy may help identify high-risk subjects and develop new therapeutic approaches. [Table: see text]
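
The two validation metrics reported above, accuracy and AUC-ROC, measure different things, and a small sketch may help in reading them side by side. The Python below is illustrative only: the function names `accuracy` and `auc_roc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for this example. None of it comes from the study's data or code.

```python
def accuracy(y_true, y_pred):
    """Fraction of subjects whose predicted phenotype matches the true one."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc_roc(y_true, scores):
    """AUC-ROC as the probability that a randomly chosen case (label 1)
    receives a higher score than a randomly chosen control (label 0).
    Tied scores count as half a win."""
    cases = [s for t, s in zip(y_true, scores) if t == 1]
    controls = [s for t, s in zip(y_true, scores) if t == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical toy data (NOT from the study): 1 = extreme case, 0 = extreme control.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.3, 0.2]  # assumed classifier scores

# Accuracy depends on a decision threshold (0.5 is an assumption here).
y_pred = [1 if s >= 0.5 else 0 for s in scores]

toy_accuracy = accuracy(y_true, y_pred)
toy_auc = auc_roc(y_true, scores)
print(f"accuracy={toy_accuracy:.3f}, AUC-ROC={toy_auc:.3f}")
```

On this toy input the accuracy (4 of 6 correct) is lower than the AUC-ROC (8 of 9 case-control pairs correctly ordered). Accuracy depends on a single decision threshold, whereas AUC-ROC summarizes ranking quality across all thresholds, which is one way the two figures can differ for the same model.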
Keywords
lung adenocarcinoma, machine learning germline analysis, phenotypes, whole exome, tobacco-induced