Classifying Turkish Trade Registry Gazette Announcements

2022 7th International Conference on Computer Science and Engineering (UBMK) (2022)

Abstract
The Turkish Trade Registry Gazette is an important source of information in many sectors, such as banking and telecommunications. Although the gazette is publicly available, its data is hard to acquire because announcements are provided as images. It is possible to search for a specific company's announcement, but the returned image also contains many unrelated announcements. This poses several challenges for information extraction. Because of the structure of the documents in these images, it is hard to perform OCR directly. Moreover, even when the text is extracted, the announcement boundaries must be detected to split the page into individual announcements. Once the announcements are split, the one belonging to the searched company must be identified. Since the query result gives no information about the surrounding announcements, these should also be categorized to detect any events of interest involving other companies. In this work, we address all of these problems and present a pipeline that includes image processing, OCR, announcement splitting, and document classification steps.
Keywords
document processing,OCR,document classification,natural language processing
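
The later stages named in the abstract (announcement splitting, matching the searched company, and classification) can be illustrated with a minimal sketch. It is not the authors' method: it assumes OCR has already produced plain text for a page, and the boundary pattern `BOUNDARY`, the keyword table `CATEGORY_KEYWORDS`, the category names, and the sample page are all hypothetical. A simple keyword lookup stands in for whatever classifier the paper actually uses.

```python
import re
from dataclasses import dataclass

# Hypothetical boundary marker: this sketch assumes every announcement
# starts with a line such as "Ilan Sira No: 101". Real gazette pages may
# need a different rule or a learned boundary detector.
BOUNDARY = re.compile(r"^Ilan Sira No:[ \t]*\d+[ \t]*$", re.MULTILINE)

# Hypothetical keyword rules standing in for a trained document classifier.
CATEGORY_KEYWORDS = {
    "liquidation": ("tasfiye",),
    "capital_change": ("sermaye",),
    "address_change": ("adres",),
}


@dataclass
class Announcement:
    text: str
    category: str
    matches_query: bool


def split_announcements(page_text):
    """Split OCR text of one page at each boundary marker."""
    starts = [m.start() for m in BOUNDARY.finditer(page_text)]
    if not starts:
        stripped = page_text.strip()
        return [stripped] if stripped else []
    ends = starts[1:] + [len(page_text)]
    return [page_text[s:e].strip() for s, e in zip(starts, ends)]


def classify(text):
    """Return the first category whose keyword occurs in the text."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"


def process_page(page_text, company_name):
    """Split a page, flag the searched company, and categorize every announcement."""
    # Plain lower() is only adequate for this ASCII sample; real Turkish
    # text would need Turkish-aware case folding (dotted and dotless i).
    needle = company_name.lower()
    return [
        Announcement(text, classify(text), needle in text.lower())
        for text in split_announcements(page_text)
    ]


# Made-up sample page, written in ASCII for simplicity.
SAMPLE_PAGE = """Ilan Sira No: 101
ACME TEKSTIL LIMITED SIRKETI
Sirket adres degisikligi tescil edilmistir.
Ilan Sira No: 102
DEMO GIDA ANONIM SIRKETI
Sirketin tasfiye surecine girmesine karar verilmistir.
"""

results = process_page(SAMPLE_PAGE, "Demo Gida")
for item in results:
    print(item.category, item.matches_query)
```

Every announcement on the page is categorized, not only the one matching the query, which mirrors the abstract's point that surrounding announcements may also contain events of interest.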