Automated Extraction of Bioclimatic Time Series from PDF Tables

Sabino Maggi, Silvana Fuina, Saverio Vicario

crossref(2023)

Abstract
Since the original specification was developed in the 1990s, PDF has become the de facto standard for distributing and archiving electronic documents, because it preserves the original layout of a document independently of the hardware, operating system, and application software used to view it.

Unfortunately, the PDF format does not contain explicit structural or semantic information, which makes it very difficult to extract structured information from PDF documents, in particular data presented in tabular form. Automatic extraction of tabular data is challenging because tables vary widely in format and layout, and because the process involves several complex steps: recognizing printed text and converting it into machine-encoded characters, identifying logically coherent table constructs (headers, columns, rows, spanning elements), and breaking those constructs down into elemental objects.

Several tools have been developed to support the extraction process. In this work we survey the most interesting tools for the automatic detection and extraction of tabular data and analyze their respective advantages and limitations. Particular emphasis is placed on programmable open source tools because of their flexibility and long-term availability, and because they can easily be adapted to the specific needs of the problem at hand.

As a practical application, we also present a workflow based on a set of R and AWK scripts that automatically extracts daily temperature and precipitation data from the official PDF documents published each year by Regione Puglia, Italy. We also discuss the lessons learned from developing this workflow and the possibility of generalizing the approach to other kinds of PDF documents.
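To make the last extraction step more concrete (breaking table rows down into elemental values), here is a minimal Python sketch. It is not the authors' workflow, which is based on R and AWK scripts. It assumes the PDF has already been converted to layout-preserving plain text by some external tool. The sample text, the column names (`Tmin`, `Tmax`, `Prec`), the decimal-comma convention, and the `>>` missing-value marker are all hypothetical and are not taken from the Regione Puglia documents.

```python
import re
from datetime import date

# Hypothetical sample: made-up text resembling what a layout-preserving
# PDF-to-text conversion might produce. Not real data.
SAMPLE_TEXT = """\
Station: EXAMPLE    Year: 2020    Month: 1
Day    Tmin    Tmax    Prec
1      3,2     11,5    0,0
2      4,0     12,1    5,4
3      -1,5    9,8     >>
"""

# A data row is assumed to be: day number followed by three whitespace-separated tokens.
ROW_PATTERN = re.compile(r"^\s*(\d{1,2})\s+(\S+)\s+(\S+)\s+(\S+)\s*$")


def parse_number(token):
    """Convert a decimal-comma token to float; return None if it is not numeric."""
    try:
        return float(token.replace(",", "."))
    except ValueError:
        return None  # e.g. an assumed missing-value marker such as '>>'


def extract_daily_rows(text, year, month):
    """Return one record per data row; header and other lines are skipped."""
    records = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line)
        if match is None:
            continue  # not a data row under the assumed layout
        day, tmin, tmax, prec = match.groups()
        records.append({
            "date": date(year, month, int(day)),
            "tmin": parse_number(tmin),
            "tmax": parse_number(tmax),
            "prec": parse_number(prec),
        })
    return records


if __name__ == "__main__":
    for record in extract_daily_rows(SAMPLE_TEXT, 2020, 1):
        print(record)
```

A fixed row pattern like this only works when every table shares the same column layout. Real documents with spanning headers or layouts that change from year to year would need the more flexible detection tools that the survey discusses.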