Automatic detection of protected health information from clinical narratives

Journal of Biomedical Informatics (2015)

Abstract
A model to automatically detect protected health information in clinical texts. Machine learning techniques combined with keyword-based and rule-based approaches. 7 main PHI categories with 25 associated sub-categories are identified. Achieved an overall micro-averaged F-measure of 93.6%. The winner of the 2014 i2b2 de-identification challenge task.

This paper presents a natural language processing (NLP) system designed to participate in the 2014 i2b2 de-identification challenge. The challenge task is to identify and classify seven main Protected Health Information (PHI) categories and 25 associated sub-categories. We propose a hybrid model that combines machine learning techniques with keyword-based and rule-based approaches to deal with the complexity inherent in the PHI categories. Our approach exploits a rich set of linguistic features, both syntactic and word surface-oriented, which are further enriched by task-specific features and regular expression template patterns to characterize the semantics of the various PHI categories. Our system achieved promising accuracy on the challenge test data, with an overall micro-averaged F-measure of 93.6%, and ranked first in this de-identification challenge.
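For readers unfamiliar with the reported metric, the following is a minimal sketch of how a micro-averaged F-measure can be computed: true positives, false positives, and false negatives are pooled across all PHI categories before precision and recall are taken. The function name `micro_f_measure`, the `(doc_id, start, end, category)` annotation tuples, and the exact-match criterion are assumptions made for illustration; this is not the official i2b2 evaluation script, and the example data is invented.

```python
from collections import defaultdict


def micro_f_measure(gold, pred):
    """Return (precision, recall, f1) pooled over all categories.

    gold, pred: sets of (doc_id, start, end, category) tuples.
    A prediction counts as correct only on an exact match of all
    four fields (an assumption of this sketch).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for ann in pred:
        key = "tp" if ann in gold else "fp"
        counts[ann[3]][key] += 1
    for ann in gold - pred:
        counts[ann[3]]["fn"] += 1

    # Micro-averaging: sum the raw counts over categories first,
    # then compute a single precision and recall from the totals.
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations, invented purely for illustration.
gold = {("d1", 0, 4, "NAME"), ("d1", 10, 20, "DATE"), ("d2", 5, 9, "NAME")}
pred = {("d1", 0, 4, "NAME"), ("d1", 10, 20, "DATE"), ("d2", 5, 9, "LOCATION")}
precision, recall, f1 = micro_f_measure(gold, pred)
```

Because counts are pooled before averaging, frequent categories weigh more heavily in the result than rare ones, which is the usual contrast with a macro-averaged score.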
Keywords
Clinical text mining, De-identification, Hybrid model, Natural language processing, Protected Health Information (PHI)