Licord: Language Independent Content Word Finder

HYBRID ARTIFICIAL INTELLIGENT SYSTEMS(2016)

引用 1|浏览59
暂无评分
摘要
Content Words (CWs) are important segments of the text. In text mining, we utilize them for various purposes such as topic identification, document summarization, question answering etc. Usually, the identification of CWs requires various language dependent tools. However, such tools are not available for many languages and developing of them for all languages is costly. On the other hand, because of recent growth of text contents in various languages, language independent text mining carries great potentiality. To mine text automatically, the language tool independent CWs finding is a requirement. In this research, we devise a framework that identifies text segments into CWs in a language independent way. We identify some structural features that relate text segments into CWs. We devise the features over a large text corpus and apply machine learning-based classification that classifies the segments into CWs. The proposed framework only uses large text corpus and some training examples, apart from these, it does not require any language specific tool. We conduct experiments of our framework for three different languages: English, Vietnamese and Indonesian, and found that it works with more than 83% accuracy.
更多
查看译文
关键词
Content words, Language independent, Word segmentation
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要