The Memento Tracer Toolset for Human-guided Focused Crawling of Dynamic Web

2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL) (2023)

Abstract
Dynamic web content is challenging to archive because the underlying web technologies are constantly advancing. For example, new JavaScript frameworks and Scalable Vector Graphics web elements make the capabilities of archiving tools a moving target. Moreover, the order in which events execute on a web page can affect both the page state and the crawling path. To date, no smart crawler exists that can drive every action element on a page along every possible traversal path, because the complexity and cost of operation are prohibitive. Consequently, conventional crawling usually does not activate interactive elements on the page, and thus preserves pages with missing information. In our approach, a curated web navigation session is performed before the crawl, and the resulting navigation plan (or Trace) is recorded via our Chrome browser extension. We implemented a Trace-driven crawler that takes the navigation plan recorded in the Trace and applies it to a class of web pages with similar HTML layouts and interactive options. Trace-driven crawls are particularly useful for sites where software development work and scholarly research are now kept, such as GitHub, Slideshare, and Publons. Our crawler and other tools are open-sourced to help the web archiving community efficiently capture the valuable information now housed in the interactive web of productivity portals.
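The record-once, replay-on-similar-pages idea described in the abstract can be illustrated with a minimal Python sketch. It is not the Memento Tracer implementation or its Trace format: `TraceStep`, `Trace`, `FakePage`, `replay`, the `url_pattern` field, and the example selectors and URLs are all hypothetical names assumed for illustration, and an in-memory stand-in replaces a real browser so the example runs on its own.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TraceStep:
    """One recorded interaction (hypothetical structure, not the real Trace format)."""
    selector: str          # assumed: CSS selector of the element to activate
    repeat: bool = False   # assumed: keep activating until nothing new loads


@dataclass
class Trace:
    """A navigation plan plus the class of pages it is meant to apply to."""
    url_pattern: str       # assumed: regex describing pages with a similar layout
    steps: List[TraceStep]

    def applies_to(self, url: str) -> bool:
        return re.fullmatch(self.url_pattern, url) is not None


class FakePage:
    """In-memory stand-in for a browser page.

    Maps each selector to the resource URLs that successive activations load.
    """

    def __init__(self, url: str, elements: Dict[str, List[str]]):
        self.url = url
        self._elements = {sel: list(urls) for sel, urls in elements.items()}

    def activate(self, selector: str) -> Optional[str]:
        """Return the URL loaded by one activation, or None if nothing is left."""
        pending = self._elements.get(selector)
        if not pending:
            return None
        return pending.pop(0)


def replay(trace: Trace, page: FakePage) -> List[str]:
    """Apply the recorded steps in order and collect every URL that was loaded."""
    if not trace.applies_to(page.url):
        raise ValueError(f"Trace does not apply to {page.url}")
    captured = [page.url]
    for step in trace.steps:
        while True:
            loaded = page.activate(step.selector)
            if loaded is None:
                break  # element missing or exhausted: move on to the next step
            captured.append(loaded)
            if not step.repeat:
                break
    return captured
```

In this sketch a single Trace recorded on one slide deck could be replayed on any URL matching `url_pattern`, which mirrors the abstract's point that the human-guided session is done once per class of pages rather than once per page. Because steps are replayed in their recorded order, the sketch also sidesteps the event-ordering problem the abstract raises.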
Keywords
Web archiving, Memento, Focus Crawling