On Frequency-Based Approaches to Learning Stopwords and the Reliability of Existing Resources - A Study on Italian Language.

Communications in Computer and Information Science (2018)

Abstract
Natural Language Processing techniques are of utmost importance for the proper management of Digital Libraries. These techniques rely on language-specific linguistic resources that may be unavailable for many languages. Since building such resources manually is costly, time-consuming, and error-prone, it would be desirable to learn them automatically from sample texts, without any prior knowledge of the language under consideration. In this paper we focus on stopwords, i.e., terms that can be ignored when determining the topic and content of a document. We present an experimental study of the frequency behavior of stopwords, aimed at providing useful information for developing automatic techniques that compile stopword lists from a corpus of documents. The reliability and deficiencies of the stopwords obtained in the experiments are evaluated by comparison with existing linguistic resources. Although the study is conducted on texts in Italian, we are confident that the same approach and experimental results may apply to other languages as well.
Keywords
Natural Language Processing, Linguistic resources, Stopwords, Keyword extraction
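
The abstract describes learning stopword candidates from term frequencies in a corpus and checking them against an existing list, but it does not specify the procedure. The following is only a minimal sketch of one plausible frequency-based approach, not the authors' method: the tokenizer, the ranking criterion (document frequency, with ties broken by total frequency), the `top_k` cutoff, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of letters (naive, assumed tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def candidate_stopwords(documents, top_k=5):
    """Return the top_k terms ranked by document frequency.

    Ties are broken by total term frequency, then alphabetically.
    The ranking criterion is an assumption, not taken from the paper.
    """
    term_freq = Counter()  # total occurrences of each term in the corpus
    doc_freq = Counter()   # number of documents containing each term
    for doc in documents:
        tokens = tokenize(doc)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    ranked = sorted(term_freq, key=lambda t: (-doc_freq[t], -term_freq[t], t))
    return ranked[:top_k]


def compare_to_reference(candidates, reference):
    """Return (precision, recall) of the candidates against a reference stopword list."""
    cand, ref = set(candidates), set(reference)
    hits = cand & ref
    precision = len(hits) / len(cand) if cand else 0.0
    recall = len(hits) / len(ref) if ref else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy Italian corpus, invented for illustration only.
    docs = [
        "il gatto dorme sul divano",
        "il cane corre nel parco",
        "la biblioteca digitale conserva il libro",
    ]
    candidates = candidate_stopwords(docs, top_k=3)
    print(candidates)
    print(compare_to_reference(candidates, ["il", "la", "nel", "sul"]))
```

Ranking by document frequency rather than raw counts is one way to favor terms that occur across many documents over terms that are merely frequent in a single long text; on a realistic corpus the cutoff would need to be chosen empirically.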