Performance Analysis and Deployment of Partitioning Strategies in Apache Spark

Procedia Computer Science (2023)

Abstract
Data is growing rapidly, and the data that need to be analyzed are not only large but may also be high-dimensional, heterogeneous, complex, unstructured, incomplete, and noisy. The Apache Spark framework is used for high-performance computing on Big Data. How a dataset is divided can affect the degree of parallelism achieved. In Apache Spark, partitioning techniques help manage a dataset in a distributed fashion. Choosing an appropriate number of partitions, and the data associated with each partition, ensures suitable storage and quick access to the data. Built-in libraries and the flexibility of the available techniques allow Apache Spark to divide a dataset conveniently. In this study, the partitioning techniques available in Spark are discussed and implemented, and their pros and cons are highlighted. Furthermore, the performance of the partitioning techniques is compared in terms of execution time, which helps in selecting a suitable partitioning strategy for different jobs.
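To make the partitioning idea concrete, the following is a minimal plain-Python sketch of hash-based partitioning, the scheme Spark's HashPartitioner applies: a record is assigned to partition hash(key) mod numPartitions. This is an illustration of the technique, not the Spark API; the function name and sample data are hypothetical.

```python
def hash_partition(records, num_partitions):
    """Group (key, value) pairs into num_partitions buckets by key hash.

    Mirrors the rule used by Spark's HashPartitioner:
    partition index = hash(key) % num_partitions.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        idx = hash(key) % num_partitions
        partitions[idx].append((key, value))
    return partitions

data = [("apple", 1), ("banana", 2), ("apple", 3), ("cherry", 4)]
parts = hash_partition(data, 2)
# All records sharing a key land in the same partition, which is what
# lets key-based operations (e.g. reduceByKey-style aggregation) run
# locally within a partition without shuffling data across the cluster.
```

Note that hash partitioning balances load only when keys are roughly uniform; a skewed key distribution can overload one partition, which is one of the trade-offs the study compares across strategies.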
Keywords
Partitioning, MapReduce, Load Balancing, Big Data, Hashing