Plexus: Optimizing Join Approximation for Geo-Distributed Data Analytics

PROCEEDINGS OF THE 2023 ACM SYMPOSIUM ON CLOUD COMPUTING, SOCC 2023(2023)

引用 0|浏览2
暂无评分
摘要
Modern applications are increasingly generating and persisting data across geo-distributed data centers or edge clusters rather than a single cloud. This paradigm introduces challenges for traditional query execution due to increased latency when transferring data over wide-area network links. Join queries in particular are heavily affected, due to their large output size and amount of data that must be shuffled over the network. Join sampling-computing a uniform sample from the join results-is a useful technique for reducing resource requirements. However, applying it to a geo-distributed setting is challenging, since acquiring independent samples from each location and joining on the samples does not produce uniform and independent tuples from the join result. To address these challenges, we first generalize an existing join sampling algorithm to the geo-distributed setting. We then present our system, Plexus, which introduces three additional optimizations to further reduce the network overhead and handle network and data heterogeneity: (i) weight approximation, (ii) heterogeneity awareness and (iii) sample prefetching. We evaluate Plexus on a geo-distributed system deployed across multiple AWS regions, with an implementation based on Apache Spark. Using three real-world datasets, we show that Plexus can reduce query latency by up to 80% over the default Spark join implementation on a wide class of join queries without substantially impacting sample uniformity.
更多
查看译文
关键词
Join Algorithms,Distributed Systems,Query Optimization,Wide Area Network
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要