Improving domain adaptation in de-identification of electronic health records through self-training.

Journal of the American Medical Informatics Association : JAMIA(2021)

引用 4|浏览23
暂无评分
摘要
OBJECTIVE:De-identification is a fundamental task in electronic health records to remove protected health information entities. Deep learning models have proven to be promising tools to automate de-identification processes. However, when the target domain (where the model is applied) is different from the source domain (where the model is trained), the model often suffers a significant performance drop, commonly referred to as domain adaptation issue. In de-identification, domain adaptation issues can make the model vulnerable for deployment. In this work, we aim to close the domain gap by leveraging unlabeled data from the target domain. MATERIALS AND METHODS:We introduce a self-training framework to address the domain adaptation issue by leveraging unlabeled data from the target domain. We validate the effectiveness on 4 standard de-identification datasets. In each experiment, we use a pair of datasets: labeled data from the source domain and unlabeled data from the target domain. We compare the proposed self-training framework with supervised learning that directly deploys the model trained on the source domain. RESULTS:In summary, our proposed framework improves the F1-score by 5.38 (on average) when compared with direct deployment. For example, using i2b2-2014 as the training dataset and i2b2-2006 as the test, the proposed framework increases the F1-score from 76.61 to 85.41 (+8.8). The method also increases the F1-score by 10.86 for mimic-radiology and mimic-discharge. CONCLUSION:Our work demonstrates an effective self-training framework to boost the domain adaptation performance for the de-identification task for electronic health records.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要