Normalization of Predominant and Long-tail Bacterial Entities with a Hybrid CNN-LSTM and Knowledge-Driven Model

semanticscholar(2020)

Abstract
As part of our ongoing effort to construct a biomedical knowledge base [3], we have recently focused on the normalization of bacterial entities. In contrast to other widely studied biomedical entities, such as diseases, we found that bacteria normalization poses unique challenges, primarily due to the skew of the available ground-truth data. In this work, we describe these issues and explain the techniques that we used to address them. To perform bacteria normalization, we started by using PubTator [2], a large dataset of bacterial entities, to train a deep learning normalization model. However, PubTator consists mostly of a few predominant bacterial species, as shown in Figures 1 and 2. As a result, our normalization model performed well on the common bacteria names appearing in PubTator but failed to correctly map less common bacteria names. To address this issue, we employed two approaches. First, we created a new annotated dataset, called MDAD (Microbes and Diseases Annotation Dataset). While significantly smaller than PubTator (1.9K bacteria mentions versus PubTator's 38K), MDAD is more representative of general bacteria names, as it is based on the more uniform Disbiome dataset [1]. Therefore, it serves as a better evaluation dataset. Second, we combined our deep learning model with a knowledge-driven approach into a hybrid model that targets both common and rare entities. This is based on our observation that predominant bacteria show notable variability in naming, with 10.97±13.04 surface forms per concept on average, while long-tail bacteria average 1.29±0.65 surface forms per concept, with most mentions using the preferred name. Our deep learning model is a character-based CNN-LSTM designed to capture the naming variability of predominant bacteria. Our knowledge-driven method leverages Levenshtein distance and abbreviation resolution to handle long-tail bacteria.
Each of these models performed well on its target bacteria but poorly otherwise. Therefore, we created a hybrid model that covers all bacteria by combining the two models. It achieves 96% accuracy on test data containing both predominant and long-tail bacteria, substantially outperforming either model in isolation. The performance of the hybrid model and its components is shown in Figure 3.
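
The abstract names Levenshtein distance and abbreviation resolution as the ingredients of the knowledge-driven method but does not give its details. The Python sketch below is one possible reading, not the authors' implementation: the lexicon, its placeholder concept IDs, the genus-initial abbreviation heuristic, the function names, and the `max_ratio` threshold are all assumptions introduced for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical lexicon: preferred name -> placeholder concept ID (not real IDs).
LEXICON: Dict[str, str] = {
    "Escherichia coli": "TAX:A",
    "Akkermansia muciniphila": "TAX:B",
    "Faecalibacterium prausnitzii": "TAX:C",
}


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop a character of a
                               current[j - 1] + 1,       # drop a character of b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def expand_abbreviation(mention: str, lexicon: Dict[str, str]) -> str:
    """Assumed heuristic: expand 'X. species' to the single lexicon name whose
    genus starts with X and whose species epithet matches exactly."""
    parts = mention.split()
    if len(parts) == 2 and len(parts[0]) == 2 and parts[0].endswith("."):
        initial, species = parts[0][0].lower(), parts[1].lower()
        candidates = [
            name for name in lexicon
            if len(name.split()) == 2
            and name.split()[0].lower().startswith(initial)
            and name.split()[1].lower() == species
        ]
        if len(candidates) == 1:
            return candidates[0]
    return mention  # leave the mention unchanged when expansion is ambiguous


def normalize(mention: str, lexicon: Dict[str, str],
              max_ratio: float = 0.2) -> Optional[Tuple[str, str]]:
    """Return (concept_id, preferred_name) for the closest lexicon name,
    or None when the closest name is too far away (assumed threshold)."""
    expanded = expand_abbreviation(mention.strip(), lexicon).lower()
    best_name, best_dist = None, None
    for name in lexicon:
        dist = levenshtein(expanded, name.lower())
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    if best_name is None:
        return None
    if best_dist / max(len(expanded), len(best_name)) > max_ratio:
        return None
    return lexicon[best_name], best_name
```

Under these assumptions, `normalize("E. coli", LEXICON)` resolves the genus abbreviation before matching, and a misspelling such as "Akkermansia mucinifila" still maps to its preferred name because it is only two edits away. Normalizing the distance by string length is a design choice made here so that the threshold behaves similarly for short and long names.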