Equi-join for Multiple Datasets Based on Time Cost Evaluation Model.

ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2014, PT II(2014)

引用 1|浏览13
暂无评分
摘要
MapReduce is an important programming model for processing big data with a parallel, distributed algorithm on a cluster. In big data analytic application, equi-join is an important operation. However, it is inefficient to perform equi-join operations in MapReduce when multiple datasets are involved in the join. In this paper, a time cost evaluation model is extended for an equi-join by considering the time cost of calculation. In addition, the sub-joins in an equi-join are classified into star pattern sub-joins on single attribute and chain pattern sub-joins. Based on the extended model, optimization methods are presented and an equi-join plan with lower time cost is chosen for the equi-join. The optimization methods include: the star pattern sub-joins on one attribute are first processed; next, a chain pattern sub-join with minimal scale of intermediate results (i.e. the number of tuples in intermediate results) is processed; at last, a chain pattern sub-join is decomposed into several MapReduce jobs or single MapReduce job by dynamic programming to obtain an optimal scheme for the chain pattern sub-join. We conducted extensive experiments, and the results show that our method is more efficient than those methods such as MDMJ, Hive and Pig.
更多
查看译文
关键词
Join,MapReduce,Dynamic Programming
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要