A Combination of Resampling Method and Machine Learning for Text Classification on Imbalanced Data.

Haijun Feng, Tangren Dan, Weiming Wang, Rongzhi Gui, Junyao Liu, Yi Li

AIMS (2021)

Abstract
Imbalanced data reduces the accuracy of text classification. To address this issue, 11 different algorithms are used to resample the dataset. The results show that 5 oversampling methods and the SmoteTomek method rebalance the dataset effectively and clearly improve the models' recognition rate on the minority class, whereas undersampling methods decrease the models' overall accuracy on the imbalanced dataset. In addition, 7 machine learning algorithms are used to train models on datasets resampled with the SmoteTomek algorithm. Of these combinations, Naive Bayes and Logistic Regression perform best: they significantly improve the models' predictive ability on the minority class without decreasing overall accuracy. For multi-class imbalanced text classification, Naive Bayes or Logistic Regression combined with SmoteTomek resampling should therefore be preferred.
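The abstract does not say how the SmoteTomek step was implemented, so the following is only a minimal sketch of the general idea: oversample the smaller classes by interpolating between same-class neighbours (SMOTE), then delete Tomek links (pairs of mutually nearest samples with different labels). The function names `smote` and `remove_tomek_links`, the toy 2-D points standing in for text feature vectors, and the choice to drop both members of each link are all assumptions made for illustration. In practice a library such as imbalanced-learn, which provides `SMOTETomek` in `imblearn.combine`, would normally be used instead.

```python
import math
import random
from collections import Counter


def smote(X, y, k=2, seed=0):
    """Grow every class to the majority count by interpolating between
    a sample and one of its k nearest same-class neighbours."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        members = [i for i, lab in enumerate(y) if lab == label]
        if len(members) < 2:
            continue  # a single sample has no neighbour to interpolate with
        for _ in range(target - n):
            base = rng.choice(members)
            neighbours = sorted(
                (j for j in members if j != base),
                key=lambda j: math.dist(X[base], X[j]),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> other
            X_out.append(tuple(b + gap * (o - b) for b, o in zip(X[base], X[other])))
            y_out.append(label)
    return X_out, y_out


def remove_tomek_links(X, y):
    """Drop both samples of every Tomek link (assumption: remove both sides)."""
    n = len(X)
    nearest = [
        min((j for j in range(n) if j != i), key=lambda j: math.dist(X[i], X[j]))
        for i in range(n)
    ]
    drop = {i for i in range(n) if nearest[nearest[i]] == i and y[i] != y[nearest[i]]}
    keep = [i for i in range(n) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy imbalanced 3-class data; real inputs would be text feature vectors.
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4), (0.4, 0.1),
     (2.0, 2.0), (2.2, 2.1), (2.1, 1.8),
     (4.0, 0.0), (4.1, 0.2)]
y = ["a"] * 6 + ["b"] * 3 + ["c"] * 2

X_over, y_over = smote(X, y)
X_res, y_res = remove_tomek_links(X_over, y_over)
print(Counter(y), "->", Counter(y_res))
```

The resampled `X_res`, `y_res` would then be passed to the classifier of choice, for example Naive Bayes or Logistic Regression as the abstract recommends. Resampling should be applied to the training split only, so that the test split keeps the original class distribution.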
Key words
imbalanced data, text classification, resampling method, machine learning