Stark: Optimizing In-Memory Computing for Dynamic Dataset Collections

2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)

Abstract
Emerging distributed in-memory computing frameworks, such as Apache Spark, can process a huge amount of cached data within seconds. This remarkably high efficiency requires the system to balance data well across tasks and ensure data locality. However, it is challenging to satisfy these requirements for applications that operate on a collection of dynamically loaded and evicted datasets. The dynamics may lead to time-varying data volume and distribution, which would frequently invoke expensive data re-partitioning and transfer operations, resulting in high overhead and large delay. To address this problem, we present Stark, a system specifically designed for optimizing in-memory computing on dynamic dataset collections. Stark enforces data locality for transformations spanning multiple datasets (e.g., join and cogroup) to avoid unnecessary data replications and shuffles. Moreover, to accommodate fluctuating data volume and skewed data distribution, Stark delivers elasticity into partitions to balance task execution time and reduce job makespan. Finally, Stark achieves bounded failure recovery latency by optimizing the data checkpointing strategy. Evaluations on a 50-server cluster show that Stark reduces job makespan by 4X and improves system throughput by 6X compared to Spark.
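For context, the baseline mechanisms the abstract alludes to can be illustrated in plain Spark: co-partitioning keyed RDDs so that a join across datasets avoids a shuffle, and checkpointing to truncate lineage and bound recovery work. The sketch below uses vanilla Spark APIs only; the class name, dataset contents, partition count, and checkpoint directory are illustrative assumptions, and this is not Stark's actual implementation.

```scala
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object CoPartitionJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("copartition-join-sketch"))
    sc.setCheckpointDir("/tmp/sketch-checkpoints") // hypothetical directory

    // Two small keyed datasets standing in for members of a dynamic dataset collection.
    val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val right = sc.parallelize(Seq((1, 1.0), (2, 2.0), (4, 4.0)))

    // Pre-partition both RDDs with the same partitioner and cache them.
    // When both sides of a join share a partitioner, Spark can evaluate the
    // join as a narrow dependency without a shuffle; otherwise each join
    // re-shuffles both inputs, which is the overhead Stark aims to avoid.
    val partitioner = new HashPartitioner(8)
    val leftPart  = left.partitionBy(partitioner).cache()
    val rightPart = right.partitionBy(partitioner).cache()

    val joined = leftPart.join(rightPart)

    // Checkpointing truncates the lineage so that a failure replays only the
    // work since the last checkpoint rather than the full DAG.
    joined.checkpoint()
    joined.collect().foreach(println)

    sc.stop()
  }
}
```

In practice, when and how often to re-partition and checkpoint is the hard part under changing data volume and skew; the paper's contribution is optimizing those decisions, which static co-partitioning and manual checkpointing as sketched above do not address.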
Keywords
dynamic dataset collections, distributed in-memory computing frameworks, cached data, time-varying data volume, data repartition, transfer operations, Stark, data replications, skewed data distribution, task execution time, data checkpointing strategy