Large Synthetic Data from the arχiv for OCR Post Correction of Historic Scientific Articles

Lecture Notes in Computer Science (2023)

Abstract
Historical scientific articles often require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that frequently introduces errors. We present a pipeline for generating a synthetic ground truth/OCR dataset to correct the OCR results of the astrophysics literature holdings of the NASA Astrophysics Data System (ADS). By mining the arχiv we create, to the authors’ knowledge, the largest scientific synthetic ground truth/OCR post correction dataset, comprising 203,354,393 character pairs. Baseline models trained with this dataset achieve mean improvements in character and word error rates of 7.71% and 18.82%, respectively, for historical OCR text. Interactive dashboards to explore the dataset are available online: https://readingtimemachine.github.io/projects/1-ocr-groundtruth-may2023, and the data and code are hosted on GitHub: https://github.com/ReadingTimeMachine/ocr_post_correction.
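The abstract does not state exactly how the character and word error rates are computed. A common convention divides the edit (Levenshtein) distance between the OCR output and the ground truth by the length of the ground truth, counted in characters or in whitespace-separated words. The following is a minimal sketch under that assumption; the function names and sample strings are illustrative and are not taken from the authors' code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits per reference character."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word error rate: word edits per reference word."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


# Hypothetical ground truth / OCR pair for illustration only.
truth = "the galaxy cluster"
ocr = "tbe galaxy clnster"
print(f"CER = {cer(truth, ocr):.3f}, WER = {wer(truth, ocr):.3f}")
```

Under this convention, an improvement is the reduction in error rate after a post-correction model has been applied to the OCR text, measured against the same ground truth.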
Keywords
OCR post correction, large synthetic data