The Web Data Commons Schema.org Table Corpora

WWW 2024(2024)

引用 0|浏览0
暂无评分
摘要
The research on table representation learning, data retrieval, and data integration in the context of data lakes requires large table corpora for the training and evaluation of the developed methods. Over the years, several large table corpora such as WikiTables, GitTables, or the Dresden Web Table Corpus have been published and are used by the research community. This paper complements the set of public table corpora with the Web Data Commons Schema.org table corpora, two table corpora consisting of 4.2 (Release 2020) and 5 million (Release 2023) relational tables describing products, events, local businesses, job postings, recipes, movies, books, as well as 37 further types of entities. The feature that distinguishes the corpora from all other publicly available large table corpora is that all tables that describe entities of a specific type use the same attributes to describe these entities, i.e. all tables use a shared schema, the schema.org vocabulary. The shared schema eases the integration of data from different sources and allows training processes to focus on specific types of entities or specific attributes. Altogether the tables contain ~653 million rows of data which have been extracted from the Common Crawl web corpus and have been grouped into separate tables for each class/host combination, i.e. all records of a specific class that originate from a specific website are put into a single table. This paper describes the creation of the WDC Schema.org Table Corpora, gives an overview of the content of the corpora, and discusses their use cases.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要