PATC: Parallel Arabic Text Classifier

2018 21ST SAUDI COMPUTER SOCIETY NATIONAL COMPUTER CONFERENCE (NCC)(2018)

Abstract
The amount of textual data has grown dramatically, and the data itself becomes more complex every day. Managing, analyzing, summarizing, and understanding this data remains a challenging task that requires new techniques for automatically organizing, searching, indexing, and browsing large collections of documents. Text classification, an area of text mining, is the process of assigning text to predefined classes or topics. We developed a tool for Arabic text classification using a parallel programming framework. The tool is called Parallel Arabic Text Classifier (PATC). It analyzes a labeled corpus of Arabic text supplied by the user and builds a text classifier from it. PATC consists of three major stages: (1) Preprocessing: PATC normalizes and stems the Arabic corpus before using it to train the classifier; (2) Training or Building the Classifier: the classifier is trained on a user-uploaded, annotated Arabic corpus; and (3) Testing or Classifying: the trained classifier predicts the class of a new document. The classifier is built using an approach that associates each label with frequent words, implemented with the MapReduce distributed programming model. The classifier was evaluated using an Arabic corpus. Classification accuracy was around 80% using single-label measures and in the high 90s (percent) using multi-label measures.
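The three stages described in the abstract can be illustrated with a minimal, single-process Python sketch. It is not the authors' implementation: the normalization rules, the crude removal of the definite article in place of a real stemmer, the `top_k` and `threshold` parameters, and all function names are assumptions made for illustration, and the map and reduce steps are simulated in memory rather than run on a MapReduce cluster.

```python
import re
from collections import Counter, defaultdict
from itertools import chain

# Assumed normalization rules (common in Arabic NLP, not taken from the paper).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")    # short vowels, sukun, tatweel
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda or hamza
_ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")
_AL = "\u0627\u0644"                                  # definite article "al"


def preprocess(text):
    """Stage 1: normalize the text and apply a placeholder light stem."""
    text = _DIACRITICS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)
    tokens = _ARABIC_WORD.findall(text)
    # Hypothetical stand-in for a real stemmer: drop a leading "al".
    return [t[2:] if t.startswith(_AL) and len(t) > 4 else t for t in tokens]


def map_phase(doc):
    """Map: emit ((label, word), 1) for every word of a labeled document."""
    labels, text = doc
    for word in preprocess(text):
        for label in labels:
            yield (label, word), 1


def reduce_phase(pairs):
    """Reduce: sum the counts for each (label, word) key."""
    counts = defaultdict(Counter)
    for (label, word), n in pairs:
        counts[label][word] += n
    return counts


def train(corpus, top_k=50):
    """Stage 2: associate each label with its top_k most frequent words."""
    counts = reduce_phase(chain.from_iterable(map(map_phase, corpus)))
    return {label: {w for w, _ in c.most_common(top_k)}
            for label, c in counts.items()}


def classify(model, text, threshold=0.5):
    """Stage 3: return every label scoring at least threshold * best score."""
    tokens = preprocess(text)
    scores = {label: sum(t in words for t in tokens)
              for label, words in model.items()}
    best = max(scores.values(), default=0)
    if best == 0:
        return []
    return sorted(l for l, s in scores.items() if s >= threshold * best)
```

Because `classify` returns a list of labels, the same sketch covers both the single-label case (take the first entry) and the multi-label case (take all entries). In a real MapReduce deployment, `map_phase` and `reduce_phase` would run as distributed tasks over corpus partitions instead of over an in-memory iterator.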
Keywords
Arabic Language, Multi-label Text Classification, MapReduce, Natural Language Processing, Text Mining