SCI-3000: A Dataset for Figure, Table and Caption Extraction from Scientific PDFs.

ICDAR (1) (2023)

Abstract
Extracting figures and similar visual elements from PDFs of scientific publications is important but non-trivial, and progress is impeded by a lack of datasets for evaluation and machine learning. In this work, we describe and publish the SCI-3000 dataset, containing 3 000 PDFs of scientific publications (34 791 pages) with annotations of figures, tables, and corresponding captions, from the fields of computer science, biomedicine, chemistry, physics, and technology. We demonstrate the use of the dataset to benchmark two figure, table, and caption extraction approaches from recent literature: one rule-based and one deep learning-based.
Keywords
scientific PDFs, caption extraction, dataset, table, figure