TPTS: Text Pre-processing Techniques for Sindhi Language

Ali Nawaz, Muhammad Nawaz, Noor Ahmed Shaikh, Samina Rajper, Junaid Baber, Muhammad Irfan Khalid

Pakistan Journal of Emerging Science and Technologies (PJEST) (2023)

Abstract
The Internet is a major source of textual data: users generate vast amounts of information every day through social media and news agencies. Extracting meaningful information from such large datasets is challenging and costly. Text pre-processing is a crucial first step in any Natural Language Processing (NLP) task because it can affect the overall performance of that task. Its main objective is to transform unstructured text into a standard, linguistically meaningful form from which information is easier to extract. This paper introduces TPTS, a model for text pre-processing in the Sindhi language. TPTS performs the essential NLP tasks of tokenization, normalization, stop-word removal, stemming, and part-of-speech (POS) tagging for Sindhi. The Sindhi Text Corpus (STC), consisting of 1.5k Sindhi text documents collected from various online news websites, is used for experimentation. The TF-IDF approach is employed to identify high-frequency stop-words in Sindhi, and a rule-based system tags each word of the Sindhi input text with its part of speech. The ROUGE evaluation metric is used to assess the effectiveness of the proposed TPTS technique, which achieves 89% accuracy on the STC corpus. Sindhi is spoken by over 30 million people globally, yet the lack of adequate NLP tools and resources limits the development of technology and natural language applications that could benefit Sindhi speakers. The proposed TPTS model can aid in developing such applications: beyond text pre-processing itself, it can support other Sindhi text-processing tasks such as text summarization, sentiment analysis, speech-processing applications, text mining, and information retrieval systems.
Keywords
Sindhi language, text, pre-processing
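The abstract mentions tokenization and TF-IDF-based identification of high-frequency stop-words but does not give the exact procedure. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes that stop-word candidates are the terms with the lowest inverse document frequency (that is, terms appearing in almost every document), with ties broken by total frequency. The names `tokenize`, `stopword_candidates`, `PUNCTUATION`, and `top_k` are hypothetical, and the tokenizer is a simple whitespace splitter that also strips a few common Arabic-script and Latin punctuation marks.

```python
import math
from collections import Counter

# Assumed punctuation set: Arabic-script full stop, comma, question mark,
# plus common Latin marks. A real Sindhi tokenizer would need more rules.
PUNCTUATION = "۔،؟!.,;:"
_STRIP_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def tokenize(text):
    """Replace punctuation with spaces, then split on whitespace."""
    return text.translate(_STRIP_TABLE).split()


def stopword_candidates(documents, top_k):
    """Return the top_k terms with the lowest IDF (most widely spread).

    Assumption: a term occurring in nearly every document carries little
    discriminating information and is therefore a stop-word candidate.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences across the corpus
    for tokens in tokenized:
        doc_freq.update(set(tokens))
        term_freq.update(tokens)

    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    # Lowest IDF first; break ties by higher corpus frequency, then by term.
    ranked = sorted(idf, key=lambda t: (idf[t], -term_freq[t], t))
    return ranked[:top_k]


if __name__ == "__main__":
    # Toy English placeholder corpus; the paper uses Sindhi news documents.
    docs = [
        "the cat sat on the mat.",
        "the dog ate the bone.",
        "a bird saw the cat.",
    ]
    print(stopword_candidates(docs, top_k=2))
```

In practice the candidate list produced this way would usually be reviewed by a speaker of the language before being used for stop-word removal, since frequent content words can also score a low IDF on a single-domain corpus such as news text.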