A novel ensemble-based paradigm to process large-scale data

Thanh Trinh, HoangAnh Le, Nhung VuongThi, Hai HoangDuc, KieuAnh VuThi

Multimedia Tools and Applications (2024)

Abstract
Big data analytics is an emerging topic in academic and industrial engineering, where processing large-scale data is a central challenge, and it is crucial to design an effective large-scale data processing model to handle big data. In this paper, we aim to improve the accuracy of classification tasks and reduce the execution time for large-scale data within a small cluster. To overcome these challenges, this paper presents a novel ensemble-based paradigm that consists of splitting large-scale data files and developing ensemble models. Two splitting methods are first developed to partition large-scale data into small, non-overlapping data blocks. We then propose two ensemble-based methods, bagging-based and boosting-based, designed for high accuracy and reduced execution time. Finally, the proposed paradigm can be instantiated as four predictive models, the combinations of the two splitting methods and the two ensemble-based methods. A series of experiments was conducted to evaluate the effectiveness of the proposed paradigm with the four combinations. Overall, the paradigm with the boosting-based method achieves the best accuracy compared with existing methods. In particular, the boosting-based methods achieve 91.6% accuracy, compared with 52% for the baseline model, on a big data file with 10 million samples. However, the paradigm with the bagging-based method takes the least execution time to yield results. This paper also demonstrates the effectiveness of the Spark computing cluster for large-scale data and points out the weakness of the RDD (Resilient Distributed Dataset).
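The abstract describes a two-step paradigm: partition the large-scale data into non-overlapping blocks, then train a bagging- or boosting-style ensemble over those blocks. The sketch below illustrates the bagging-style variant on a single machine with scikit-learn; the block count, the decision-tree base learner, and the synthetic dataset are illustrative assumptions, and the paper's actual implementation runs on a Spark cluster rather than on one node.

```python
# Minimal single-machine sketch of the splitting + bagging-style ensemble idea
# (illustrative only; not the authors' Spark implementation).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a large-scale dataset.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

# Step 1: split the data into k non-overlapping blocks.
k = 4
rng = np.random.default_rng(0)
idx_blocks = np.array_split(rng.permutation(len(X)), k)

# Step 2: train one base classifier per block (blocks do not overlap).
models = [DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
          for idx in idx_blocks]

# Step 3: combine block-level predictions by majority vote (binary labels).
votes = np.stack([m.predict(X) for m in models])        # shape (k, n_samples)
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble accuracy on the training pool:", (ensemble_pred == y).mean())
```

A boosting-based variant would instead train the block-level learners sequentially, re-weighting or re-sampling misclassified examples between blocks, which the abstract reports as the more accurate but slower option.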
Keywords
Big data, Large-scale data, Ensemble, Spark, Cluster, Classification