Latent Topic Extraction as a Source of Labeling in Natural Language Processing.

2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (2023)

Abstract
Supervised machine learning algorithms depend on accurately labeled target data to learn relationships between inputs and targets. A major obstacle to building supervised models that predict the correct label for unseen data is the quality of the training data, which often depends on a subject matter expert (SME) creating the labeled dataset. Given the scarcity of such experts in many fields, the time needed to analyze data for labeling, and subjective differences among experts, ways to reduce the complexity of creating meaningful datasets are needed. In this work, we explore two unsupervised topic modeling algorithms, Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), as potential methods for reducing the complexity of the labeling process. Specifically, we obtained COVID patient message data labeled by an SME and compared the topics each algorithm designated as COVID versus non-COVID with the SME's labels. For each topic modeling algorithm, we found a strong degree of overlap between its COVID vs. non-COVID patient message labels and those of the SME, suggesting that the methodology could help in developing labeled datasets for clinically meaningful models.
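The abstract does not say how the overlap between topic-derived labels and SME labels was measured. As a minimal sketch of one plausible approach, the Python below assumes each message has already been assigned a dominant topic by LDA or NMF, maps topics designated as COVID to a binary label, and compares the result with the SME labels using percent agreement and Cohen's kappa. The function names, the topic ids, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def topics_to_labels(dominant_topics, covid_topics):
    """Map each message's dominant topic id to 1 (COVID) or 0 (non-COVID)."""
    return [1 if t in covid_topics else 0 for t in dominant_topics]


def percent_agreement(a, b):
    """Fraction of messages on which two label lists agree."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Chance-corrected agreement between two label lists."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if both raters labeled independently
    # with their observed class frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical class; kappa is undefined,
        # so this sketch returns 1.0 as a design choice.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: dominant topic per message (e.g. from LDA or NMF),
# the topic ids a reviewer designated as COVID, and the SME's labels.
dominant_topics = [0, 2, 1, 0, 3, 2, 1, 0]
covid_topics = {0, 2}
sme_labels = [1, 1, 0, 1, 0, 0, 0, 1]

model_labels = topics_to_labels(dominant_topics, covid_topics)
agreement = percent_agreement(model_labels, sme_labels)
kappa = cohen_kappa(model_labels, sme_labels)
print(f"agreement={agreement:.3f}, kappa={kappa:.3f}")
```

Kappa is included alongside raw agreement because raw agreement can look high when one class dominates the messages; whether the authors used either measure is not stated in the abstract.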
Keywords
Natural language processing, topic modeling, latent Dirichlet allocation, non-negative matrix factorization, COVID-19