Proposal: Distribution-based cluster scheduling

Jun Woo Park, Junwoop

semanticscholar(2018)

Abstract
Modern computing clusters support a mixture of diverse activities, ranging from customer-facing internet services and software development and test to scientific research and exploratory data analytics [1, 16]. The role of a cluster scheduler is to map these tasks to the heterogeneous resources available in the cluster. Schedulers face the daunting task of efficiently matching pending jobs to resources according to the jobs’ scheduling preferences (in terms of resources and deadlines) while minimizing completion latency and maximizing cluster efficiency. Many recent schedulers exploit knowledge of pending jobs’ runtimes and resource usage as a powerful building block [5, 11, 24]. Using estimates of runtime and resource usage, a scheduler can pack jobs aggressively into its resource plan [5, 11, 24, 26], for example by allowing a latency-sensitive job to start before a high-priority batch job as long as the batch job will still meet its deadline. This knowledge also lets the scheduler consider whether it is better to wait for a job’s preferred resources to be freed or to start the job right away on sub-optimal resources [3, 24]. Knowledge of job runtime and resource usage leads to more robust scheduling decisions than simple scheduling algorithms that cannot leverage this information. In most cases, runtime estimates come from observing similar jobs that ran in the past (e.g., jobs from the same user or past instances of the same periodic job script). A point runtime estimate (e.g., mean or median) is derived from the relevant subset of the history and used by the scheduler. If the estimates are reasonably accurate, a scheduler that uses them usually outperforms other approaches. Previous research [24] suggests that these schedulers are robust to a reasonable degree of error (e.g., up to 50%). However, analyses of workloads from real clusters show that actual estimate errors span much larger ranges than those previously explored.
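As a rough illustration of the point-estimate approach described above, the following Python sketch derives a mean or median runtime from past runs of the same user and job script, then measures the relative error against an actual runtime. The record layout, the `HISTORY` data, and the function names `point_estimate` and `relative_error` are hypothetical and not taken from the paper; this is a minimal sketch, not the predictor evaluated in [23].

```python
from statistics import mean, median

# Hypothetical history records: (user, job_script, runtime_seconds).
HISTORY = [
    ("alice", "etl.sh", 100.0),
    ("alice", "etl.sh", 120.0),
    ("alice", "etl.sh", 110.0),
    ("alice", "etl.sh", 1500.0),  # one unusually long run
    ("bob", "train.py", 3600.0),
]


def point_estimate(history, user, script, how="median"):
    """Collapse past runs of the same user and script into one number."""
    runtimes = [r for (u, s, r) in history if u == user and s == script]
    if not runtimes:
        return None  # no relevant history to estimate from
    return median(runtimes) if how == "median" else mean(runtimes)


def relative_error(estimate, actual):
    """Absolute estimate error as a fraction of the actual runtime."""
    return abs(estimate - actual) / actual


if __name__ == "__main__":
    est = point_estimate(HISTORY, "alice", "etl.sh")
    print("median estimate:", est)
    print("error if the next run takes 1500 s:", relative_error(est, 1500.0))
```

In this made-up history, a single long run is enough to push the relative error of the median estimate well past the 50% level mentioned above, because a single number discards the spread of the observed runtimes.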
Applying a state-of-the-art ML-based predictor [23] to three real-world traces shows good estimates in general (77%-92% are within a factor of two of the actual runtime, and most are much closer), but a significant percentage (8%-23%) of estimates fall outside that range, and some are off by more than an order of magnitude [15]. Even very effective predictors produce inaccuracies and outliers because there is significant inherent variability in multi-purpose cluster workloads. The impact of inaccurate point estimates on scheduler performance is significant: testing shows that a scheduler relying on such estimates performs much worse with real estimate error profiles than with perfect estimates. The point-estimate based