QaaD (Query-as-a-Data): Scalable Execution of Massive Number of Small Queries in Spark.

Proc. ACM Manag. Data (2023)

Abstract
The Spark big data processing platform is heavily used in today's IT services for critical applications such as machine learning for service recommendation and the analysis of massive volumes of raw sales data. Spark is designed to deliver high performance by enabling a high degree of parallelism when processing heavy-weight queries that apply homogeneous operations to large data. However, workloads made of small, short-running queries coming from various sources have been observed to be becoming dominant in practice. The current Spark architecture is poorly suited to processing such workloads optimally, because a large number of small queries causes excessive I/O while performing only small computations. We present a technique, called QaaD, that addresses this problem fundamentally by applying i) transparent conversion of workloads made of small queries into workloads made of large queries and ii) dynamic partition size adjustment to minimize runtime overhead. To this end, we introduce a new abstraction, microRDD, to support our design of query merging, the embedding of queries as part of the data, and opportunistic sharing of common input data among queries. A comprehensive evaluation using real-world data shows that QaaD delivers a 10.6x to 36.6x speed-up over standard Spark execution for small query workloads.
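The idea of merging small queries and sharing their common input can be illustrated with a minimal sketch in plain Python. It is not the QaaD implementation and does not use Spark or microRDD; the names `SmallQuery`, `run_separately`, `run_merged`, and the sample `sales` table are hypothetical, chosen only to show how several small queries over the same input might be answered in a single shared scan instead of one scan per query.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]


@dataclass(frozen=True)
class SmallQuery:
    """Hypothetical small query: sum `column` over rows of `source` matching `predicate`."""
    qid: str
    source: str
    predicate: Callable[[Row], bool]
    column: str


def run_separately(queries: List[SmallQuery],
                   tables: Dict[str, List[Row]]) -> Tuple[Dict[str, float], int]:
    """Baseline: every query scans its input on its own."""
    results: Dict[str, float] = {}
    scans = 0
    for q in queries:
        scans += 1  # one full scan of the input per query
        results[q.qid] = sum(r[q.column] for r in tables[q.source] if q.predicate(r))
    return results, scans


def run_merged(queries: List[SmallQuery],
               tables: Dict[str, List[Row]]) -> Tuple[Dict[str, float], int]:
    """Merged execution: queries are treated as data and grouped by input,
    so each input is scanned once and every row is offered to all queries."""
    by_source: Dict[str, List[SmallQuery]] = defaultdict(list)
    for q in queries:
        by_source[q.source].append(q)

    results: Dict[str, float] = {q.qid: 0.0 for q in queries}
    scans = 0
    for source, group in by_source.items():
        scans += 1  # one shared scan for the whole group of queries
        for row in tables[source]:
            for q in group:
                if q.predicate(row):
                    results[q.qid] += row[q.column]
    return results, scans


# Illustrative data only; not taken from the paper's evaluation.
tables = {
    "sales": [
        {"store": 1, "amount": 10.0},
        {"store": 2, "amount": 5.0},
        {"store": 1, "amount": 2.5},
    ]
}
queries = [
    SmallQuery("q1", "sales", lambda r: r["store"] == 1, "amount"),
    SmallQuery("q2", "sales", lambda r: r["store"] == 2, "amount"),
    SmallQuery("q3", "sales", lambda r: r["amount"] > 4, "amount"),
]

separate, separate_scans = run_separately(queries, tables)
merged, merged_scans = run_merged(queries, tables)
```

In this toy setting both strategies return the same per-query answers, but the merged run reads the shared input once rather than once per query. The sketch makes no claim about how QaaD itself merges queries or adjusts partition sizes.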
Key words
scalable execution, small queries, QaaD, query-as-a-data