Classifying Textual Components of Bilingual Documents with Decision-Tree Support Vector Machines

Document Analysis and Recognition (2011)

Abstract
In this paper, we propose a method for classifying textual entities of bilingual documents written in Chinese and English. In contrast to earlier works that performed classification at the level of text lines or documents, we apply our method at the level of textual components, because Chinese components must first be identified before they can be merged into intact characters and sent to a Chinese recognizer. To cope with a large training data set containing 365,672 samples, we employ a decision-tree support vector machine (DTSVM) method, which decomposes a given data space into small regions and trains local SVMs on those regions. By applying this method to train classifiers on various combinations of feature types, we were able to complete each training process within 3,500 seconds and achieve higher than 99.6% test accuracy in classifying a textual component as Chinese, alphanumeric, or punctuation. Moreover, the classification showed no strong bias towards any of the three categories.
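The abstract gives no implementation details for DTSVM, so the following is only a minimal sketch of the general idea it describes: partition the data space into small regions, then train a local SVM in each region. Several parts are assumptions made for illustration and are not taken from the paper: the problem is reduced to two classes (labels -1 and +1) instead of three, a simple axis-aligned median split stands in for the decision tree's learned splits, and each local model is a linear SVM trained with Pegasos-style stochastic gradient descent. All function names and parameters (`train_linear_svm`, `build_tree`, `predict`, `max_leaf`) are hypothetical.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style SGD for a linear SVM. Labels must be -1 or +1."""
    rng = random.Random(seed)
    w = [0.0] * (len(X[0]) + 1)      # last weight acts as a (regularized) bias
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)    # decaying step size
            xi = X[i] + [1.0]        # append constant feature for the bias
            margin = y[i] * sum(a * b for a, b in zip(w, xi))
            w = [a * (1.0 - eta * lam) for a in w]   # regularization shrink
            if margin < 1.0:                          # hinge loss is active
                w = [a + eta * y[i] * b for a, b in zip(w, xi)]
    return w

def build_tree(X, y, max_leaf):
    """Split the space until regions are pure or small, then fit local SVMs."""
    if len(set(y)) == 1:
        return ("const", y[0])       # pure region: no SVM needed
    if len(X) <= max_leaf:
        return ("svm", train_linear_svm(X, y))
    # Assumption: split the widest feature at its median (not a learned split).
    dims = range(len(X[0]))
    f = max(dims, key=lambda j: max(r[j] for r in X) - min(r[j] for r in X))
    thr = sorted(r[f] for r in X)[len(X) // 2]
    left = [i for i, r in enumerate(X) if r[f] < thr]
    right = [i for i, r in enumerate(X) if r[f] >= thr]
    if not left or not right:        # cannot split further
        return ("svm", train_linear_svm(X, y))
    return ("split", f, thr,
            build_tree([X[i] for i in left], [y[i] for i in left], max_leaf),
            build_tree([X[i] for i in right], [y[i] for i in right], max_leaf))

def predict(node, x):
    """Route x to its region, then apply that region's local model."""
    while node[0] == "split":
        _, f, thr, left, right = node
        node = left if x[f] < thr else right
    if node[0] == "const":
        return node[1]
    score = sum(a * b for a, b in zip(node[1], x + [1.0]))
    return 1 if score >= 0 else -1

# Synthetic XOR-like data (an assumption, unrelated to the paper's data set):
# no single linear boundary separates it, but each quadrant is easy locally.
rng = random.Random(42)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(400)]
y = [1 if (p[0] > 0) == (p[1] > 0) else -1 for p in X]

tree = build_tree(X, y, max_leaf=100)
preds = [predict(tree, p) for p in X]
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print(f"training accuracy: {accuracy:.3f}")
```

In this sketch, `max_leaf` caps the number of samples any local SVM sees, which is what keeps each training run small even when the full data set is large.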
Keywords
chinese component,training process,data space,chinese recognizer,classifying textual components,script and language identification,pattern classification,textual components classification,decision tree support vector machines,chinese components,textual component,vector machines,bilingual document,bilingual documents,large training data,natural language processing,earlier work,decision-tree support,component,decision trees,document image processing,support vector machines,decision-tree support vector machine,textual entity,training data,shape,feature extraction,accuracy,testing