Deep Learning for Information Extraction From Digital Documents

Machine Learning for Societal Improvement, Modernization, and Progress, Advances in Human and Social Aspects of Technology (2022)

Abstract
Print-oriented PDF documents preserve the position of text and other objects well but are difficult to process. Processable PDF documents would address the needs of different sectors by enabling innovations such as searching within documents, linking across documents, or restructuring content into a format that improves the reading experience. This chapter presents the design of a deep learning-based system that aims to export clean text content, separate all visual elements, and extract rich information from the content without losing the integrated structure of content types. An F-RCNN model built with the Detectron2 library extracts the layout, cosine similarities between the word2vec representations of the texts identify related clips, and transformer language models classify the clip type. On a 200-sample data set created by the researchers, the system obtained 1.87 WER and 2.11 CER on headings and 0.22 WER and 0.21 CER on paragraphs.
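For readers unfamiliar with the quantities named in the abstract, the sketch below shows how word error rate (WER), character error rate (CER), and cosine similarity are conventionally computed. It is a minimal illustration that assumes the standard definitions (Levenshtein edit distance divided by reference length, and the normalized dot product). The function names are hypothetical, the vectors stand in for word2vec embeddings, and nothing here is taken from the chapter's own implementation or data.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under these definitions, lower WER and CER mean the extracted text is closer to the reference, while a cosine similarity near 1 indicates that two text representations point in nearly the same direction.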
Keywords
information extraction, digital documents, deep