Analyzing Spark Scheduling And Comparing Evaluations On Sort And Logistic Regression With Albatross

Henrique Pizzol Grando, Sami Ahmad Khan, Iman Sadooghi, Ioan Raicu

Semantic Scholar (2016)

Abstract
The large amounts of data that need to be processed nowadays have led to the Big Data paradigm and the development of distributed systems. To ease the programming effort in these systems, frameworks like Spark [10] were created. Spark abstracts the notion of parallelism from the user and ensures that tasks are computed in parallel within the system, handling resources and providing fault tolerance. The scheduler in Spark is a centralized element that distributes tasks across the worker nodes using a push mechanism, and it dynamically scales the set of cluster resources according to workload and locality constraints. However, at larger scales or with fine-grained workloads, a centralized scheduler can schedule tasks at a lower rate than necessary, causing response-time delays and increasing latency. Various frameworks have been designed with a distributed scheduling approach. One of them is Albatross [7], a task-level scheduling framework that uses a pull-based mechanism instead of Spark's traditional push-based one and relies on a Distributed Message Queue (DMQ) to distribute tasks among its workers. In this paper, we discuss the problems of centralized scheduling in Spark, present different distributed scheduling approaches that could serve as the basis for a new distributed scheduler for Spark, and perform empirical evaluations of Sort and Logistic Regression in Spark to compare it with Albatross.
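The contrast between push-based and pull-based task distribution can be illustrated with a small toy model. The sketch below is not code from Spark or Albatross; the function names, the cost parameters, and the assumption that workers can pull from the queue without contending with each other are all hypothetical simplifications chosen for illustration. It only shows why a single dispatcher that hands out tasks one at a time can become the bottleneck when tasks are fine-grained.

```python
import heapq


def push_makespan(durations, n_workers, dispatch_cost):
    """Toy model of a centralized push scheduler (hypothetical, not Spark's code).

    One scheduler dispatches tasks serially, paying dispatch_cost per task,
    and assigns task i to worker i % n_workers.
    """
    worker_free = [0.0] * n_workers  # time at which each worker becomes idle
    clock = 0.0                      # the scheduler's own serial timeline
    for i, duration in enumerate(durations):
        clock += dispatch_cost       # scheduler is busy dispatching this task
        w = i % n_workers
        start = max(clock, worker_free[w])
        worker_free[w] = start + duration
    return max(worker_free)


def pull_makespan(durations, n_workers, pull_cost):
    """Toy model of pull-based distribution (hypothetical, not Albatross's code).

    Each idle worker pulls the next task from a shared queue and pays
    pull_cost itself. Assumption: pulls by different workers do not contend.
    """
    free_at = [0.0] * n_workers      # min-heap of worker idle times
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations:
        t = heapq.heappop(free_at)   # earliest idle worker pulls the next task
        done = t + pull_cost + duration
        finish = max(finish, done)
        heapq.heappush(free_at, done)
    return finish


# Fine-grained workload: 8 tasks of 1 time unit, 4 workers, overhead of 1 unit.
fine = [1.0] * 8
print(push_makespan(fine, 4, 1.0))  # 9.0: workers wait on the serial dispatcher
print(pull_makespan(fine, 4, 1.0))  # 4.0: the overhead is paid in parallel
```

In this model the gap shrinks as tasks get coarser: with tasks of 100 time units and the same overhead, the two makespans are 204 and 202, which matches the abstract's point that the centralized design hurts mainly at large scale or with fine-grained workloads.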