A visual analysis approach for data transformation via domain knowledge and intelligent models

Multimedia Systems (2024)

Abstract
Industry benchmarking involves comparing and analyzing a company’s performance against that of top-performing enterprises. PDF documents contain valuable corporate information, but their non-editable nature makes data extraction complex. This study focuses on converting the unstructured data in the PDF-format periodic reports of listed companies, including tables, images, and text, into a structured format that is suitable for analysis and decision-making. Current methods for PDF document conversion primarily involve manual extraction, PDF converters, and artificial intelligence algorithms. However, they are often restricted to processing a single modality, have limitations in dealing with complex structured tables, or cannot achieve the accuracy required in practice. We propose a unified framework for extracting tables, images, and text by parsing PDF documents into constituent objects. We introduce three bespoke algorithms to process complex structured tables and develop a prototype visual analysis system that combines AI for automated data extraction with the domain knowledge of human experts for auditing. Quantitative and qualitative experiments are conducted to validate the methodology’s superiority in terms of efficiency, quality, and user-friendliness.
Keywords
PDF documents, Document parsing, Information extraction, Topic classification, Data transformation, Visual analysis