Identifying and Extracting Hierarchical Information from Business PDF Documents.

Innovations in Software Engineering Conference (ISEC) (2022)

Abstract
Portable Document Format (PDF) is a popular choice for secure communication and persistence of business information, and is universally accepted by businesses that are going digital. PDF offers multiple ways to make information visually appealing and readable, and it renders independently of the device. To achieve this, PDF stores metadata with individual text characters, graphic components, and other layout elements. This atomic, component-wise metadata makes machine processing of information in PDF very challenging, and the challenge is compounded by the difficulty of stitching the original semantics back together from the componentized information. We propose a generic approach that extracts the hierarchy of the document structure while separating content from headers and footers, and that extracts the metadata associated with checkboxes, in order to annotate the business information contained in a PDF for tasks such as mining specifications and rules from the document. Our prototype is able to process real-life, large PDF documents, each running to roughly 400 pages, with nearly 95% of the extraction requiring no human intervention.