A Novel Deep Auto-Encoder Based Linguistics Clustering Model for Social Text

ACM Transactions on Asian and Low-Resource Language Information Processing(2022)

引用 7|浏览8
暂无评分
摘要
The wide adoption of media and social media has increased the amount of digital content to an enormous level. Natural language processing (NLP) techniques provide an opportunity to extract and explore meaningful information from a large amount of text. Among natural languages, Urdu is one of the widely used languages worldwide for spoken and written communications. Due to its wide adopt-ability, digital content in the Urdu language is increasing briskly, especially with social media and online NEWS feeds. Government agencies and advertisers must filter and understand the content to analyze the trends and cohorts in their interest and national prerogative. Clustering is considered a baseline and one of the first steps in natural language understanding. There are many state-of-the-art clustering techniques specifically for English, French, and Arabic, but no significant research has been conducted in Urdu language processing. Doing it for short text segments is challenging because of limited features and the absence of meaningful language discourse and nuance. Many rule-based NLP techniques are adopted to overcome these issues, relying on human-designed features and rules. Therefore, these methods do not promise remarkable results. Alongside NLP, deep learning techniques are pretty efficient in capturing contextual information with minimal noise compared to other traditional methods. By taking on this challenging job, we develop a deep learning-based technique for Urdu short text clustering for the very first time without a human-designed feature. In this paper, we propose a method of short text clustering using a deep neural network that automatically learns feature representations and clustering assignments simultaneously. This method learns clustering objectives by converting the high dimensional feature space to a low dimensional feature space. Our experiments on the Urdu NEWS headlines dataset show remarkable results compared to state-of-the-art methods.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要