Combining supervised term-weighting metrics for SVM text classification with extended term representation

Knowledge and Information Systems(2016)

引用 88|浏览66
暂无评分
摘要
The accuracy of a text classification method based on a SVM learner depends on the weighting metric used in order to assign a weight to a term. Weighting metrics can be classified as supervised or unsupervised according to whether they use prior information on the number of documents belonging to each category. A supervised metric should be highly informative about the relation of a document term to a category, and discriminative in separating the positive documents from the negative documents for this category. In this paper, we propose 80 metrics never used for the term-weighting problem and compare them to 16 functions of the literature. A large number of these metrics were initially proposed for other data mining problems: feature selection, classification rules and term collocations. While many previous works have shown the merits of using a particular metric, our experience suggests that the results obtained by such metrics can be highly dependent on the label distribution on the corpus and on the performance measures used (microaveraged or macroaveraged F_1 -Score). The solution that we propose consists in combining the metrics in order to improve the classification. More precisely, we show that using a SVM classifier which combines the outputs of SVM classifiers that utilize different metrics performs well in all situations. The second main contribution of this paper is an extended term representation for the vector space model that improves significantly the prediction of the text classifier.
更多
查看译文
关键词
Text classification,Term weighting,Text representation,Support vector machines,Classifier combination
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要