Rosella: A Self-Driving Distributed Scheduler for Heterogeneous Clusters

2021 17TH INTERNATIONAL CONFERENCE ON MOBILITY, SENSING AND NETWORKING (MSN 2021)(2021)

引用 1|浏览2
暂无评分
摘要
Large-scale interactive web services and advanced AI applications make sophisticated decisions in real-time, based on executing a massive amount of computation tasks on thousands of servers. Task schedulers, which often operate in heterogeneous and volatile environments, require high throughput, i.e., scheduling millions of tasks per second, and low latency, i.e., incurring minimal scheduling delays for millisecond-level tasks. Scheduling is further complicated by other users' workloads in a shared system, other background activities, and the diverse hardware configurations inside datacenters. We present Rosella, a new self-driving, distributed approach for task scheduling in heterogeneous clusters. Rosella automatically learns the compute environment and adjusts its scheduling policy in real-time. The solution provides high throughput and low latency simultaneously because it runs in parallel on multiple machines with minimum coordination and only performs simple operations for each scheduling decision. Our learning module monitors total system load and uses the information to dynamically determine optimal estimation strategy for the backends' compute-power. Rosella generalizes power-of-two-choice algorithms to handle heterogeneous workers, reducing the max queue length of O(log n) obtained by prior algorithms to O(log log n). We evaluate Rosella with a variety of workloads on a 32-node AWS cluster. Experimental results show that Rosella significantly reduces task response time, and adapts to environment changes quickly.
更多
查看译文
关键词
Distributed Scheduler,Heterogeneous Clusters,high throughput,automatically learns the compute environment
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要