Fast Distributed Complex Join Processing

Hao Zhang, Miao Qiao, Jeffrey Xu Yu, Hong Cheng

2021 IEEE 37th International Conference on Data Engineering (ICDE 2021), 2021

Abstract

Big data analytics often requires processing complex join queries in parallel on distributed systems such as Hadoop, Spark, and Flink. Previous work treats the communication cost of shuffling intermediate results as the main bottleneck in processing complex join queries, and proposes a one-round multi-way join algorithm that cuts this shuffling cost to zero. The one-round multi-way join algorithm is built on a one-round, communication-optimal algorithm for shuffling data across servers and a worst-case optimal algorithm for sequential join evaluation on each server. By focusing on the communication bottleneck, previous work neglects the fact that the query itself can be computationally intensive. Once the communication cost is well optimized, the computation cost may become the bottleneck. One way to reduce the computation bottleneck is to trade computation for communication by pre-computing some partial results, but this can make communication or pre-computing the bottleneck instead. When only one of the three costs is considered at a time, the lowest combined cost may not be achieved. The question left unanswered is therefore how much should be traded so that the combined cost of computation, communication, and pre-computing is minimal.

In this work, we study the problem of co-optimizing communication, pre-computing, and computation cost in one-round multi-way join evaluation. We propose ADJ (Adaptive Distributed Join), a multi-way join approach for complex joins that finds an optimal query plan by exploring cost-effective partial results in terms of the trade-off between pre-computing, communication, and computation.

We analyze the input relations of a given join query and find an optimal plan among a set of query plans of a specific form, using high-quality cost estimation by sampling. Our extensive experiments confirm that ADJ outperforms the existing multi-way join methods by up to orders of magnitude.
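To make the idea of a one-round multi-way join concrete, the sketch below runs a HyperCube-style shuffle for the triangle query R(a, b) ⋈ S(b, c) ⋈ T(c, a), followed by an independent local join on each simulated server. It is an illustration under stated assumptions, not the ADJ algorithm or the authors' implementation: the function names, the fixed 2 x 2 x 2 server grid, integer attribute values hashed by modulo, and duplicate-free input relations are all hypothetical choices, and the local join is a plain hash join rather than a worst-case optimal one.

```python
from itertools import product


def hypercube_shuffle(R, S, T, shares):
    """Route tuples of R(a,b), S(b,c), T(c,a) to a pa x pb x pc server grid.

    Assumption: attribute values are integers, hashed with a simple modulo.
    """
    pa, pb, pc = shares
    servers = {coord: {"R": [], "S": [], "T": []}
               for coord in product(range(pa), range(pb), range(pc))}
    for a, b in R:
        # R fixes the a and b coordinates and is replicated along c.
        for k in range(pc):
            servers[(a % pa, b % pb, k)]["R"].append((a, b))
    for b, c in S:
        # S fixes the b and c coordinates and is replicated along a.
        for i in range(pa):
            servers[(i, b % pb, c % pc)]["S"].append((b, c))
    for c, a in T:
        # T fixes the c and a coordinates and is replicated along b.
        for j in range(pb):
            servers[(a % pa, j, c % pc)]["T"].append((c, a))
    return servers


def local_triangles(fragment):
    """Join the fragments held by one server (plain hash join)."""
    s_by_b = {}
    for b, c in fragment["S"]:
        s_by_b.setdefault(b, []).append(c)
    t_set = set(fragment["T"])
    return [(a, b, c)
            for a, b in fragment["R"]
            for c in s_by_b.get(b, [])
            if (c, a) in t_set]


def one_round_triangle_join(R, S, T, shares=(2, 2, 2)):
    """One shuffle round, then purely local joins; no intermediate shuffling."""
    servers = hypercube_shuffle(R, S, T, shares)
    result = []
    for fragment in servers.values():
        result.extend(local_triangles(fragment))
    return sorted(result)


if __name__ == "__main__":
    R = [(1, 2), (2, 3), (1, 3), (4, 5)]
    S = [(2, 3), (3, 1), (3, 4)]
    T = [(3, 1), (1, 2), (4, 1)]
    print(one_round_triangle_join(R, S, T))
```

Each output tuple (a, b, c) is produced only on the server whose coordinates match the hashes of a, b, and c, so the union of the local results is the full join and no second communication round is needed. The price is replication: every input tuple is copied along the one dimension it does not constrain, which is the communication cost that the abstract describes as being traded against computation and pre-computing.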

Keywords

optimal query plan, cost-effective partial results, join query, query plans, high-quality cost estimation, big data analytics, complex join queries, distributed systems, communication cost, shuffling cost, one-round multiway join algorithm, worst-case optimal computation algorithm, sequential join evaluation, communication bottleneck, computation cost, computation bottleneck, combined lowest cost, combined cost, cooptimize communication, fast distributed complex join processing, adaptive distributed join, Hadoop, Spark, Flink
