Building Web-Based Subject-Specific Corpora on the Desktop: Evaluation of Search Metrics

Computational Intelligence (2023)

Abstract
Building subject-specific or domain corpora from Web data is well researched. However, most approaches start by using seed articles as inputs to Web crawlers and use document similarity algorithms for selection. We take a different, lean-resource approach, applying traditional search metrics with a relatively large ‘bag of domain words’ (more than 100 search terms) to the Colossal Clean Crawled Corpus. This approach enables one to build rich domain corpora of text documents quickly in a resource-poor environment (e.g., a few CPU cores). This paper tests several metrics on three subject domains (language, basic mathematics, and information science) and finds significant performance differences between the metrics. Surprisingly, a naïve, simple metric outperforms TF-IDF and performs almost as well as our top-ranked algorithm, Okapi BM25. This demonstrates that search metrics perform differently with a relatively large number of search keywords (> 100) than with a small set of search keywords. We also demonstrate how to optimize the free parameters for Okapi BM25.
Keywords
search, metrics, web-based, subject-specific
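
To make the comparison in the abstract concrete, here is a minimal Python sketch of scoring documents against a 'bag of domain words' with two metrics. It is an illustration under stated assumptions, not the paper's implementation: `naive_score` is a hypothetical stand-in (the abstract does not define its naïve metric), `bm25_scores` uses the standard textbook Okapi BM25 formula, and the values `k1=1.2` and `b=0.75` are common defaults rather than the optimized parameters the paper reports. The tiny word list and documents are invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real pipelines do more)."""
    return text.lower().split()


def naive_score(doc_tokens, domain_words):
    """Hypothetical naive metric: share of tokens that are domain words."""
    if not doc_tokens:
        return 0.0
    hits = sum(1 for t in doc_tokens if t in domain_words)
    return hits / len(doc_tokens)


def bm25_scores(docs_tokens, domain_words, k1=1.2, b=0.75):
    """Standard Okapi BM25, using the whole domain word bag as the query.

    k1 and b are the free parameters; the defaults here are conventional
    values, not the ones tuned in the paper.
    """
    n_docs = len(docs_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs_tokens) / n_docs

    # Document frequency of each domain word.
    df = Counter()
    for d in docs_tokens:
        for term in set(d) & domain_words:
            df[term] += 1

    scores = []
    for d in docs_tokens:
        tf = Counter(t for t in d if t in domain_words)
        score = 0.0
        for term, f in tf.items():
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * f * (k1 + 1) / (f + length_norm)
        scores.append(score)
    return scores


# Invented toy data: a four-word "language" domain bag and three documents.
domain = {"noun", "verb", "phrase", "grammar"}
docs = [
    tokenize("the noun and verb agree in number"),
    tokenize("stock prices fell sharply today"),
    tokenize("a verb phrase contains a verb and its noun arguments"),
]

for tokens, bm25 in zip(docs, bm25_scores(docs, domain)):
    print(round(naive_score(tokens, domain), 3), round(bm25, 3), " ".join(tokens))
```

In practice the domain bag would hold more than 100 terms and the documents would be streamed from the crawled corpus, so both functions would need to run incrementally; the sketch keeps everything in memory for readability.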