Crawling Deep Web Using a New Set Covering Algorithm

Advanced Data Mining and Applications, Proceedings (2009)

Cited by: 23
Abstract
Crawling the deep web often requires selecting an appropriate set of queries that together cover most of the documents in the data source at low cost. This can be modeled as a set covering problem, which has been studied extensively. Conventional set covering algorithms, however, do not work well when applied to deep web crawling because of several special features of this application domain. Typically, most set covering algorithms assume that the elements being covered are uniformly distributed, whereas in deep web crawling neither the sizes of the documents nor the document frequencies of the queries are distributed uniformly. Instead, they follow a power-law distribution. Hence, we have developed a new set covering algorithm that targets deep web crawling. Compared with our previous deep web crawling method, which uses a straightforward greedy set covering algorithm, the new algorithm introduces weights into the greedy strategy. Our experiments on a variety of corpora show that the new method consistently outperforms its unweighted version.
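The abstract does not state how the weights are defined, so the following is only a minimal sketch of the general idea of a weighted greedy set cover for query selection, not the authors' algorithm. It assumes that the cost of a query is its document frequency (the number of documents it returns) and that the gain is the total weight of the documents it newly covers. The function names `greedy_query_cover` and `inverse_frequency_weights`, and the inverse-frequency weighting itself, are hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional, Set


def inverse_frequency_weights(matches: Dict[str, Set[int]]) -> Dict[int, float]:
    """Hypothetical weighting: a document matched by few queries gets a larger weight."""
    counts: Dict[int, int] = {}
    for docs in matches.values():
        for d in docs:
            counts[d] = counts.get(d, 0) + 1
    return {d: 1.0 / n for d, n in counts.items()}


def greedy_query_cover(
    matches: Dict[str, Set[int]],
    doc_weight: Optional[Dict[int, float]] = None,
) -> List[str]:
    """Greedily pick queries until every document matched by some query is covered.

    matches maps each candidate query to the set of document ids it returns.
    Assumption: cost of a query = number of documents it returns.
    With doc_weight=None every document weighs 1 (the unweighted greedy variant).
    """
    universe: Set[int] = set()
    for docs in matches.values():
        universe |= docs
    if doc_weight is None:
        doc_weight = {d: 1.0 for d in universe}

    covered: Set[int] = set()
    chosen: List[str] = []
    while covered != universe:
        best_query = None
        best_ratio = float("inf")
        # Iterate in sorted order so that ties are broken deterministically.
        for query in sorted(matches):
            docs = matches[query]
            gain = sum(doc_weight[d] for d in docs - covered)
            if gain <= 0:
                continue  # this query adds nothing new
            ratio = len(docs) / gain  # cost per unit of newly covered weight
            if ratio < best_ratio:
                best_query, best_ratio = query, ratio
        if best_query is None:
            break  # remaining documents carry no positive weight
        chosen.append(best_query)
        covered |= matches[best_query]
    return chosen


if __name__ == "__main__":
    toy = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4}}
    print(greedy_query_cover(toy))
    print(greedy_query_cover(toy, inverse_frequency_weights(toy)))
```

In this sketch the ratio penalizes queries that mostly return documents already retrieved, which is where a skewed (power-law) document-frequency distribution makes the choice of weights matter. Any other positive weighting can be passed in through `doc_weight`.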
Keywords
deep web, greedy strategy, straightforward greedy set, conventional set, web crawling, previous deep web, appropriate set, new method, deep web crawling, new set, new set covering algorithm, crawling deep web, power law distribution, set cover, set covering problem, greedy algorithm, uniform distribution