Let's Rethink Join Optimization in Distributed Systems.

CIDR (2015)

Abstract
Distributed shared-nothing systems that process large-scale data have seen unprecedented development over the last decade. The advent of Google’s MapReduce [2] and Hadoop [3] has been followed by a series of systems with relational operators or SQL-like interfaces, such as Pig [8], Hive [10], Spark [12], SparkSQL [9], and Myria [4]. One of the core operations performed by these systems is evaluating relational joins. Alongside these system developments, there has also been exciting progress on join algorithms in both the serial and distributed settings. However, the algorithmic progress on joins and the developments in large-scale data processing systems have not yet met. Current systems typically execute pairwise join plans, which perform well on data with primary and foreign key constraints but are ill-suited and suboptimal for the more complex, sparse data that many modern applications process [7]. As new distributed data processing systems are rapidly being developed, we believe it is the right time to rethink how joins should be optimized in these systems. In this abstract, we argue that there is a promising opportunity to implement and experiment with a new set of join algorithms in distributed systems. Table 1 summarizes the algorithms we discuss and their properties. The specific problem we consider is the evaluation of a conjunctive
Keywords
rethink join optimization,distributed systems