Converting printed Sinhala documents to formatted editable text

Information and Automation for Sustainability (2010)

Abstract
Digitizing printed documents has long been a challenge for the computing society. Digitized text not only lets users easily modify and reprint printed documents, but is also increasingly necessary because it makes word search possible. Converting a printed document into a stream of characters using optical character recognition (OCR) techniques is a widely researched area, and a number of well-established algorithms are available in the literature. However, preserving the formatting information of the original document has received little study. The contribution of this paper is twofold: (1) applying known OCR techniques to Sinhala, one of Sri Lanka's native languages, and addressing the challenges in doing so; and (2) maintaining a number of selected formatting features of the printed document in the converted editable text. This paper outlines the design and implementation of a software system that converts a scanned paper document written in Sinhala into formatted editable text, and describes how the system is integrated into an open-source word processing tool.
Keywords
digital printing, document image processing, natural language processing, optical character recognition, text analysis, text editing, word processing, OCR technique, Sinhala language, Sri Lanka native language, computing society, digitizing printed document, formatted editable text, formatting feature, open-source word processing tool, printed Sinhala document, text digitization, word-search capability, Sinhala document formatting, editable scanned documents, horizontal profiling, vertical profiling