Efficient Data Blocking and Skipping Framework Applying Heuristic Rules

2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS), 2017

Abstract
Data blocking is an effective data-skipping technique that reduces data access and shortens query response time in query engines. Data is organized into fine-grained, balanced blocks with corresponding metadata, so that a query can skip a block whenever the metadata indicates that the block contains no relevant data. The quality of a blocking strategy therefore depends on whether it can produce, in reasonable time, a data layout that lets queries skip most of the data. In this paper, we propose several algorithms that drastically reduce the time complexity of existing blocking strategies based on workload analysis, at the cost of a relatively small loss in the estimated number of tuples that can be skipped. Through theoretical analysis, we show that the time complexity of our algorithms is clearly lower than that of the Ward algorithm. We then present the complete blocking and skipping workflow, integrate it into Spark SQL, and evaluate it experimentally. The results show that our technique significantly improves blocking efficiency compared with the Ward algorithm while retaining almost the same skipping ability.
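The skipping step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes the per-block metadata is a simple min/max range over one integer column, and it forms blocks by sorting, as a stand-in for the workload-driven blocking the paper proposes. The names `Block`, `build_blocks`, and `range_query` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    rows: List[int]  # values stored in this block
    lo: int          # metadata: smallest value in the block
    hi: int          # metadata: largest value in the block


def build_blocks(values: List[int], block_size: int) -> List[Block]:
    """Sort the values and cut them into balanced blocks with min/max metadata.

    Sorting is an assumed stand-in for a real blocking strategy.
    """
    ordered = sorted(values)
    blocks = []
    for i in range(0, len(ordered), block_size):
        chunk = ordered[i:i + block_size]
        blocks.append(Block(rows=chunk, lo=chunk[0], hi=chunk[-1]))
    return blocks


def range_query(blocks: List[Block], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of blocks actually scanned."""
    hits: List[int] = []
    scanned = 0
    for block in blocks:
        # Skip the block when its metadata proves it cannot hold a match.
        if block.hi < lo or block.lo > hi:
            continue
        scanned += 1
        hits.extend(v for v in block.rows if lo <= v <= hi)
    return hits, scanned


if __name__ == "__main__":
    blocks = build_blocks(list(range(100)), block_size=10)
    hits, scanned = range_query(blocks, 25, 34)
    print(f"{len(hits)} matching rows, scanned {scanned} of {len(blocks)} blocks")
```

In this toy layout a narrow range query touches only the blocks whose min/max range overlaps the predicate, which is the effect a good blocking strategy aims to maximize across a whole query workload.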
Keywords
data blocking,data skipping,workload,metadata,query response time,Spark SQL