Association of smoking history extracted from electronic health records (EHR) using machine-learning methods and tumor characteristics in patients with lung cancer

JOURNAL OF CLINICAL ONCOLOGY (2023)

Abstract
1559 Background: Though smoking is a major risk factor for lung cancer, it has been a challenge to collect patients' smoking history accurately from the EHR because the data are inconsistent and incomplete. To address these challenges, we used a weak supervision methodology to automatically annotate the smoking status of patients with lung cancer and correlated it with tumor characteristics. Methods: We analyzed 6,355 patients with lung cancer who underwent tumor profiling with MSK-IMPACT. In total, 14,555 unstructured clinical notes were extracted from the EHR at the Memorial Sloan Kettering Cancer Center. The weak supervision methodology used a generative model to produce intermediate labels, which were subsequently tuned by a machine-learning classifier to generate the final labels. Clinical notes from a randomly sampled set of 564 patients were manually curated and used for performance assessment. The remaining patients were split into training and validation datasets used for model training and hyperparameter tuning. Pack-years were also extracted from clinical notes using natural language processing (NLP). We next conducted multivariate analyses, separately for primary and metastatic tumor samples, to correlate smoking metrics with tumor characteristics, including tumor mutation burden (TMB) and chromosomal instability as inferred by the fraction of genome altered (FGA), after controlling for age at sequencing, gender, histological subtype, ancestry, coverage, and tumor purity. Results: The weak supervision classifier achieved near-perfect performance on the 2-label classification task (ever smokers and never smokers), with macro F1-score: 97.7%; balanced accuracy: 97.1%, 97.1%; precision: 98.4%, 98.4%; and recall: 99.5%, 94.6%, respectively. For the 3-label classification task (never smoker, former smoker, and current smoker), the macro F1-score was 79.8%; balanced accuracy: 97.1%, 86.7%, 71.2%; precision: 93.9%, 90.1%, 61.7%; and recall: 96.1%, 93.3%, 46.0%, respectively.
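The per-class figures reported above can be related to one another with a short calculation. The sketch below is illustrative only and is not the authors' evaluation code: it computes per-class precision, recall, F1, and one-vs-rest balanced accuracy from a confusion matrix, and then the macro F1. The names `per_class_metrics` and `macro_f1` and the toy counts are assumptions made for this example; the counts are not data from the study.

```python
from typing import Dict, List


def per_class_metrics(cm: List[List[int]], labels: List[str]) -> Dict[str, Dict[str, float]]:
    """Per-class precision, recall, F1 and one-vs-rest balanced accuracy.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    total = sum(sum(row) for row in cm)
    out: Dict[str, Dict[str, float]] = {}
    for k, name in enumerate(labels):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                   # true class k, predicted otherwise
        fp = sum(row[k] for row in cm) - tp    # predicted k, true class otherwise
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        out[name] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            # One-vs-rest balanced accuracy: mean of sensitivity and specificity.
            "balanced_accuracy": (recall + specificity) / 2,
        }
    return out


def macro_f1(metrics: Dict[str, Dict[str, float]]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)


# Hypothetical toy counts, NOT taken from the study.
toy_cm = [[90, 10],   # true "ever":  90 predicted ever, 10 predicted never
          [5, 95]]    # true "never":  5 predicted ever, 95 predicted never
metrics = per_class_metrics(toy_cm, ["ever", "never"])
toy_macro_f1 = macro_f1(metrics)
```

In a two-class problem, one class's recall is the other class's specificity, so the one-vs-rest balanced accuracy is identical for both classes. This appears consistent with the abstract listing 97.1% twice for the 2-label task, and with the mean of the two reported recalls, (99.5% + 94.6%) / 2 = 97.05%.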
In the genomic analysis, we observed that smoking status (smoker vs. never smoker) and pack-years were associated with TMB in both primary and metastatic tumor samples (p<2e-16). FGA was marginally associated with smoking status (smokers vs. never smokers) in primary tumor samples (p=0.06). Among smokers diagnosed with lung adenocarcinoma, significantly higher FGA in primary tumor samples was observed in males than in females after adjusting for pack-years and other variables (p=3.3e-3). Conclusions: We demonstrated high performance of our approach for automated curation of smoking history from the EHR. The genomic results confirmed distinct mutational patterns associated with smoking behavior in patients with lung cancer. We are currently exploring multimodal approaches that include chest CT images and "time of quitting" to improve the performance of the 3-class model.
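The Methods describe weak supervision over unstructured clinical notes and NLP-based extraction of pack-years. As a rough illustration only, the sketch below applies a few keyword-based labeling functions to a note, combines their votes by simple majority, and pulls a pack-year value with a regular expression. Majority voting is a deliberately simpler stand-in for the generative label model mentioned in the abstract, and every pattern, function name, and example note here is hypothetical rather than taken from the study.

```python
import re
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN, NEVER, EVER = -1, 0, 1

# Hypothetical pattern: a number followed by "pack-year(s)" / "pack year(s)".
PACK_YEAR_RE = re.compile(r"(\d+(?:\.\d+)?)\s*-?\s*pack[\s-]?years?\b", re.I)


def extract_pack_years(note: str) -> Optional[float]:
    """Return the first pack-year value mentioned in the note, if any."""
    match = PACK_YEAR_RE.search(note)
    return float(match.group(1)) if match else None


def lf_never(note: str) -> int:
    """Vote NEVER on phrases that typically indicate no smoking history."""
    pattern = r"\b(never smoked|never smoker|non-?smoker|denies smoking)\b"
    return NEVER if re.search(pattern, note, re.I) else ABSTAIN


def lf_ever(note: str) -> int:
    """Vote EVER on phrases that typically indicate past or current smoking."""
    pattern = r"\b(former smoker|current smoker|ex-smoker|quit smoking)\b"
    return EVER if re.search(pattern, note, re.I) else ABSTAIN


def lf_pack_years(note: str) -> int:
    """Vote EVER when a positive pack-year value is recorded."""
    value = extract_pack_years(note)
    return EVER if value is not None and value > 0 else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], int]] = [lf_never, lf_ever, lf_pack_years]


def weak_label(note: str, lfs: List[Callable[[str], int]] = LABELING_FUNCTIONS) -> int:
    """Combine labeling-function votes by majority; abstain on no votes or ties."""
    votes = [v for v in (lf(note) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


# Hypothetical example notes, written for this sketch.
note_a = "Former smoker, 30 pack-year history, quit in 2010."
note_b = "Patient is a lifelong nonsmoker."
note_c = "Follow-up visit for imaging review."

label_a, label_b, label_c = (weak_label(n) for n in (note_a, note_b, note_c))
pack_years_a = extract_pack_years(note_a)
```

In a pipeline of the kind the abstract outlines, noisy labels like these would serve as training targets for a downstream classifier, with a manually curated subset held out for evaluation; that step is not shown here.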
Keywords
smoking history, electronic health records, lung cancer, machine-learning