Parallel Query Processing: To Separate Communication from Computation

Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22), 2022

Abstract
In this paper, we study parallel query processing with a focus on reducing the communication cost, which is the dominating factor in parallel query processing. With intra-operator parallelism, the communication cost becomes large when the intermediate results between operators are large. Existing approaches optimize an SQL query by arranging relational algebra (RA) operators to reduce the total cost, where each operator involves (i) distributing partitioned data to computing nodes by communication, and (ii) computing locally on those nodes. Communication and computation are handled inside an operator and are not separable. Consequently, it is difficult to avoid large intermediate results and hence to reduce the communication cost. To reduce the communication cost, we separate communication from computation using several new operators proposed in this paper. One is a pair operator $\otimes$ that pairs the partitions of a relation R with the partitions of a relation S, where a partition is specified by a hash function. With the pair operator defined, we can explicitly deal with the communication that delivers pairs of partitions to computing nodes. Together with $\otimes$, we can also explicitly treat the local computation on a computing node as $\widetilde{op}$ for any RA operator op. We give a merge operator $\widetilde{\cup}$ to collect all partial results from computing nodes as they are. In short, with $\otimes$, $\widetilde{op}$, and $\widetilde{\cup}$, we are able to explicitly specify communication and computation for RA operators. Furthermore, we propose new techniques, namely partitioning push-down and computation push-up, to separate communication from computation for RA expressions. We prove that push-down/up applies to a wide range of relational expressions.
We have developed a distributed system named Secco (Separate Communication from Computation) by revamping SparkSQL on Spark, and confirmed the efficiency of our approach in our performance studies using real datasets.
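The three operators described in the abstract can be illustrated with a small single-process sketch. This is not Secco's implementation or API: the function names (`partition`, `pair`, `local_join`, `merge`, `parallel_join`) are hypothetical, relations are plain lists of tuples, and the pair step is assumed to match only partitions with the same hash bucket, which is one simple pairing that suffices for an equi-join.

```python
from collections import defaultdict
from itertools import chain


def partition(relation, key_index, num_parts):
    """Hash-partition a list of tuples on the attribute at key_index."""
    parts = defaultdict(list)
    for row in relation:
        parts[hash(row[key_index]) % num_parts].append(row)
    return parts


def pair(r_parts, s_parts):
    """Pair step (communication only): match partitions of R and S
    that share a bucket id. No join work happens here."""
    shared = sorted(r_parts.keys() & s_parts.keys())
    return [(r_parts[b], s_parts[b]) for b in shared]


def local_join(r_part, s_part, r_key, s_key):
    """Local step (computation only): hash join of one partition pair,
    as it would run on a single computing node."""
    index = defaultdict(list)
    for s in s_part:
        index[s[s_key]].append(s)
    return [r + s for r in r_part for s in index.get(r[r_key], [])]


def merge(partials):
    """Merge step: collect partial results as they are (no dedup, no sort)."""
    return list(chain.from_iterable(partials))


def parallel_join(R, S, r_key, s_key, num_parts=4):
    """Compose the three steps: partition, pair, compute locally, merge."""
    pairs = pair(partition(R, r_key, num_parts),
                 partition(S, s_key, num_parts))
    return merge(local_join(rp, sp, r_key, s_key) for rp, sp in pairs)


if __name__ == "__main__":
    R = [(1, "a"), (2, "b"), (3, "c")]
    S = [(1, "x"), (3, "y"), (3, "z"), (4, "w")]
    print(sorted(parallel_join(R, S, 0, 0)))
```

Keeping `pair` free of any join logic is the point of the sketch: the data movement can be reasoned about on its own, separately from the per-node computation in `local_join`.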
Keywords
Database, Parallel Query Processing, Query Optimization, OLAP