Set Cover At Web Scale

KDD '15: The 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 2015

Abstract
The classic SET COVER problem requires selecting a minimum-size subset A ⊆ F from a family F of finite subsets of U such that the elements covered by A are exactly the ones covered by F. It naturally occurs in many settings in web search, web mining and web advertising. The greedy algorithm, which iteratively selects the set in F that covers the most uncovered elements, yields an optimum (1 + ln |U|)-approximation but is inherently sequential. In this work we give the first MapReduce SET COVER algorithm that scales to problem sizes of ~1 trillion elements and runs in log_p Δ iterations for a nearly optimum approximation ratio of p ln Δ, where Δ is the cardinality of the largest set in F.

A web crawler is a system for bulk downloading of web pages. Given a set of seed URLs, the crawler downloads them, extracts the hyperlinks embedded in them, and schedules the pages addressed by those hyperlinks for crawling in a subsequent iteration. While the average page out-degree is ~50, the crawled corpus grows at a much smaller rate, implying a significant outlink overlap. Using our MapReduce SET COVER heuristic as a building block, we present the first large-scale seed generation algorithm; it scales to ~20 billion nodes and discovers new pages at a rate ~4x faster than that obtained by prior art heuristics.
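For reference, below is a minimal Python sketch of the sequential greedy heuristic the abstract describes (repeatedly pick the set that covers the most still-uncovered elements, achieving the (1 + ln |U|)-approximation). The function name and the tiny example instance are illustrative assumptions; the paper's contribution is a distributed MapReduce variant of this idea, which is not reproduced here.

```python
def greedy_set_cover(universe, family):
    """Sequential greedy Set Cover.

    universe: iterable of elements U
    family:   dict mapping a set's name to a frozenset of the elements it covers
    Returns the names of the chosen sets, in selection order.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set with the largest marginal coverage (ties broken arbitrarily).
        name, members = max(family.items(), key=lambda kv: len(kv[1] & uncovered))
        if not members & uncovered:
            break  # remaining elements cannot be covered by the family
        chosen.append(name)
        uncovered -= members
    return chosen


if __name__ == "__main__":
    # Small illustrative instance (not from the paper).
    U = {1, 2, 3, 4, 5, 6}
    F = {"A": frozenset({1, 2, 3}),
         "B": frozenset({3, 4}),
         "C": frozenset({4, 5, 6})}
    print(greedy_set_cover(U, F))  # e.g. ['A', 'C']
```

The inner `max` is exactly the step that makes the algorithm inherently sequential: each selection depends on the coverage state left by all previous selections, which is what the paper's MapReduce formulation relaxes to obtain log_p Δ rounds at a p ln Δ approximation ratio.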