Big Data Processing with Probabilistic Latent Semantic Analysis on MapReduce

CyberC(2014)

Abstract
Probabilistic Latent Semantic Analysis (PLSA) is a powerful statistical technique for analyzing co-occurrence data. It is widely used in information processing, from information retrieval, information filtering, and text processing automation to natural language processing and related areas. However, training a PLSA model on big data has very high time and space complexity. Researchers have tried to solve this problem with parallel methods, but their approaches only partially reduce the time complexity: the main memory of each machine in the computation must still load a large amount of data. To solve this data-scalability problem, a parallel method for training PLSA is proposed by adapting the traditional EM algorithm to MapReduce, a computing framework for processing vast amounts of data in parallel on clusters. In this way, the main memory of each computer needs to load only part of the dataset, reducing time and space complexity simultaneously. Results show that this method handles large datasets efficiently.
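The core idea of the paper, adapting the EM algorithm for PLSA to MapReduce, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the E-step runs as mappers over shards of the co-occurrence counts (so each machine loads only its part of the data), and the reducers sum the partial sufficient statistics before the M-step renormalizes them. The partitioning into `shards`, the key scheme, and the tiny synthetic dataset are all assumptions for illustration.

```python
from collections import defaultdict

def em_map(shard, p_w_z, p_z_d):
    """Mapper: E-step on one shard of the co-occurrence counts n(d, w).
    Emits partial expected counts keyed for the reducers."""
    out = []
    for (d, w), n in shard:
        # posterior P(z | d, w) ∝ P(w | z) * P(z | d)
        post = [p_w_z[z][w] * p_z_d[d][z] for z in range(K)]
        s = sum(post) or 1e-12
        for z in range(K):
            r = n * post[z] / s            # expected count of topic z
            out.append((("wz", w, z), r))  # contributes to P(w | z)
            out.append((("dz", d, z), r))  # contributes to P(z | d)
    return out

def em_reduce(pairs):
    """Reducer: the shuffle phase groups values sharing a key;
    here we just sum each group into a sufficient statistic."""
    acc = defaultdict(float)
    for k, v in pairs:
        acc[k] += v
    return acc

# --- tiny synthetic example: 2 docs, 3 words, K = 2 topics ---
K, D, W = 2, 2, 3
data = [((d, w), 1 + (d + w) % 2) for d in range(D) for w in range(W)]
shards = [data[:3], data[3:]]  # two "machines", each loads part of the data

# uniform initialization of the model parameters
p_w_z = [[1.0 / W] * W for _ in range(K)]
p_z_d = [[1.0 / K] * K for _ in range(D)]

# one EM iteration: map over each shard, then reduce the combined output
mapped = [pair for s in shards for pair in em_map(s, p_w_z, p_z_d)]
sums = em_reduce(mapped)

# M-step: renormalize the summed statistics into new parameters
for z in range(K):
    tot = sum(sums[("wz", w, z)] for w in range(W)) or 1e-12
    p_w_z[z] = [sums[("wz", w, z)] / tot for w in range(W)]
for d in range(D):
    tot = sum(sums[("dz", d, z)] for z in range(K)) or 1e-12
    p_z_d[d] = [sums[("dz", d, z)] / tot for z in range(K)]
```

Because each mapper touches only its own shard and the reducers see only compact (key, partial-count) pairs, no single machine ever holds the full co-occurrence matrix, which is the memory saving the abstract describes.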
Keywords
parallelism, expectation-maximisation algorithm, big data, time complexity, parallel programming, mapreduce, scalability, space complexity, plsa model training, probabilistic latent semantic analysis, main memory, information retrieval, information filtering, co-occurrence data analysis, computing framework, data scalability problem, computational complexity, text processing automation, statistical technique, information processing, natural language processing, big data processing, parallel method, em algorithm, probability