A Rule-Based Model for Normalization of SMS Text

2012 IEEE 24th International Conference on Tools with Artificial Intelligence (ICTAI), 2012

Abstract
SMS messages are short text documents written in a colloquial style. Processing SMS text is challenging because of its low signal-to-noise ratio and its multi-varied composition in terms of language, vocabulary, style and quality. These challenges can be overcome by robust text normalization, which is a necessary step before any technique can be applied to and evaluated on such data. In this paper, we present a rule-based model for multilingual SMS text normalization, focusing on messages written in Romanized Urdu and English. Urdu, in contrast to English, is a morphologically rich language (MRL), i.e. it produces a very large number of word forms for a given root form, while Romanized Urdu is a way of writing Urdu in Latin script which does not follow standard rules for systematic communication. Hence, normalization or standardization of multilingual SMS text poses challenges associated with SMS text, multilingualism, MRLs and Latin script. Our SMS standardizer is based on a tuned set of rules that range over various domains of natural language processing and tackle the challenges mentioned above effectively. We then apply the standardizer to keyword extraction from SMS messages, where it significantly improves performance, by up to 23% in F-measure.
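To make the idea of rule-based normalization concrete, here is a minimal Python sketch. It is not the paper's rule set, which the abstract does not list. The `SHORTHAND` table, the `collapse_repeats` rule and the `normalize` function are all hypothetical names chosen for illustration, and the entries cover only English SMS shorthand. A similar lookup table could, in principle, map variant Romanized Urdu spellings to one canonical form, but that is an assumption rather than something the abstract states.

```python
import re

# Hypothetical shorthand lexicon. The paper's actual rules are not given in
# the abstract; these entries only illustrate a lookup-style rule.
SHORTHAND = {
    "u": "you",
    "r": "are",
    "gr8": "great",
    "plz": "please",
    "thx": "thanks",
}


def collapse_repeats(token: str) -> str:
    """Reduce any run of three or more identical characters to a single one."""
    return re.sub(r"(.)\1{2,}", r"\1", token)


def normalize(message: str) -> str:
    """Lowercase, tokenize, collapse character runs, then expand shorthand."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    normalized = []
    for token in tokens:
        token = collapse_repeats(token)
        normalized.append(SHORTHAND.get(token, token))
    return " ".join(normalized)


if __name__ == "__main__":
    print(normalize("Plz call me, u r gr8!!!"))
```

Each rule here is deliberately naive: collapsing character runs can damage legitimate words, and a flat lookup table ignores context. A tuned rule set such as the one the abstract describes would need to handle those cases.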
Keywords
natural language processing, text analysis, English, F-measure, Latin script, MRL, Romanized Urdu, SMS, SMS standardizer, SMS text processing, colloquial style documents, keyword extraction, morphologically rich language, multilingual SMS text normalization, multivaried text composition, rule-based model, short message service, short-length text documents, signal-to-noise ratio, text normalization