Information Extraction among Scanned Document Images in Database

Rashmi M. Choudhari, D. M. Dakhane

Semantic Scholar (2018)

Abstract
Rashmi M. Choudhari (Student), Dr. D. M. Dakhane (Professor)
Department of Computer Science and Engineering, Sipna College of Engineering, Amravati, India

Information extraction is a key part of mining any data, and detecting duplicate images is a useful means of indexing a large database of documents. This project proposes an algorithm for duplicate document detection that operates directly on images that have been symbolically compressed using techniques related to the ongoing JBIG2 standardization effort. This report describes an optical character recognition (OCR) method that recognizes the text in an image by deciphering the sequence in which blobs occur in the compressed representation. We propose a Hidden Markov Model (HMM) method for solving such deciphering problems and suggest applications in multilingual duplicate document detection. The method is observed to recover more than 90% of the text in compressed document images, which is sufficient to identify duplicates in a large database.
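
The abstract does not spell out the HMM formulation, so the following is only a minimal sketch of how such deciphering could work, not the authors' implementation. It assumes that hidden states are characters, that observations are blob cluster identifiers taken from the compressed image, and that start, transition, and emission probabilities are already available. The names `viterbi` and `to_log` and the two-character toy model are hypothetical, and the probabilities are invented for illustration. In a real system these parameters would presumably be estimated from language statistics and the document itself.

```python
import math


def to_log(table):
    """Convert a dict of probabilities to log-probabilities (all must be > 0)."""
    return {key: math.log(p) for key, p in table.items()}


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely hidden character sequence for a blob-ID sequence."""
    if not observations:
        return []
    # best[t][s]: best log-probability of any path that ends in state s at step t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = score + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the highest-scoring final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy model (invented numbers): two characters, two blob clusters.
states = ["a", "b"]
log_start = to_log({"a": 0.5, "b": 0.5})
log_trans = {"a": to_log({"a": 0.5, "b": 0.5}), "b": to_log({"a": 0.5, "b": 0.5})}
log_emit = {"a": to_log({0: 0.9, 1: 0.1}), "b": to_log({0: 0.1, 1: 0.9})}

blob_ids = [0, 1, 1, 0]  # sequence of blob cluster IDs read from the image
decoded = viterbi(blob_ids, states, log_start, log_trans, log_emit)
print("".join(decoded))
```

With these toy parameters the blob sequence decodes to "abba". Log-probabilities are used so that long blob sequences do not underflow, and the recovered text could then be compared across documents to flag likely duplicates.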