In-Memory Indexed Caching for Distributed Data Processing

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2022

Abstract
Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited to modern cloud-based data-science workloads because of its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction with indexing capabilities for fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library that can be integrated with minimal effort into existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution compared to a non-indexed dataframe, while incurring modest memory overhead.
Keywords
dataframes, underlying runtime system, de-facto distributed data processing framework, Apache Spark, modern cloud-based data-science workloads, Indexed DataFrame, in-memory cache, dataframe abstraction, indexing capabilities, nonindexed dataframe, memory Indexed caching, powerful abstractions