ARCTIC: metadata extraction from scientific papers in pdf using two-layer CRF.

DOCENG(2014)

引用 5|浏览2
暂无评分
摘要
ABSTRACTMost scientific articles are available in PDF format. The PDF standard allows the generation of metadata that is included within the document. However, many authors do not define this information, making this feature unreliable or incomplete. This fact has been motivating research which aims to extract metadata automatically. Automatic metadata extraction has been identified as one of the most challenging tasks in document engineering. This work proposes Artic, a method for metadata extraction from scientific papers which employs a two-layer probabilistic framework based on Conditional Random Fields. The first layer aims at identifying the main sections with metadata information, and the second layer finds, for each section, the corresponding metadata. Given a PDF file containing a scientific paper, Artic extracts the title, author names, emails, affiliations, and venue information. We report on experiments using 100 real papers from a variety of publishers. Our results outperformed the state-of-the-art system used as the baseline, achieving a precision of over 99%.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要