From reactive to proactive load balancing for task-based parallel applications in distributed memory machines

Concurrency and Computation: Practice and Experience (2023)

Abstract
Load balancing is often a challenge in task-parallel applications. Balancing problems are divided into static and dynamic: "static" means that some prior knowledge about the load is available and balancing is performed before execution, while "dynamic" must rely on partial information about the execution status to balance the load at runtime. Conventionally, work stealing is a practical approach used in almost all shared memory systems. In distributed memory systems, however, the communication overhead can cause stolen tasks to arrive too late. To address this, a reactive approach has been proposed that relaxes the communication needed to balance load: one dedicated thread per process monitors the queue status and reactively offloads tasks from a slow to a fast process. However, reactive decisions can be mistaken in highly imbalanced cases. First, this article proposes a performance model to analyze reactive balancing behavior and to understand the bound beyond which decisions become incorrect. Second, we introduce a proactive approach to further improve task balancing at runtime. The approach likewise exploits task-based programming models with a dedicated thread, named Tcomm. The main idea, however, is that Tcomm not only monitors load but also characterizes tasks and trains load prediction models by online learning. "Proactive" means that tasks are offloaded before each execution phase, with an appropriate number of tasks sent at once to a potential victim (an underloaded/fast process). The experimental results confirm speedup improvements from 1.5× to 3.4× in important use cases compared to previous solutions. Furthermore, the approach can support co-scheduling tasks across multiple applications.
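To make the proactive mechanism concrete, the following is a minimal, hypothetical C++/MPI sketch under stated assumptions: each rank predicts its next-phase load with a simple online linear model, the predicted loads are exchanged, and an overloaded rank offloads a batch of tasks to the least loaded rank before the phase starts. All names here (OnlineModel, predict_ms, the surplus-above-average rule) are illustrative assumptions, not the authors' implementation; the actual task migration and the Tcomm monitoring thread are omitted.

```cpp
// Hypothetical sketch of a proactive offload decision, not the paper's code.
#include <mpi.h>
#include <vector>
#include <numeric>
#include <algorithm>
#include <cstdio>

// Minimal online linear model: execution time ~ w * task_size + b,
// refined by stochastic gradient descent after each measured task
// (in the paper, this online training would be done by Tcomm).
struct OnlineModel {
    double w = 1.0, b = 0.0, lr = 1e-3;
    double predict_ms(double task_size) const { return w * task_size + b; }
    void update(double task_size, double measured_ms) {
        double err = predict_ms(task_size) - measured_ms;
        w -= lr * err * task_size;
        b -= lr * err;
    }
};

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Synthetic per-rank task queue: task "sizes" serve as model features.
    std::vector<double> tasks(100 + 50 * rank, 1.0);

    OnlineModel model;

    // Predict this rank's load for the next execution phase.
    double my_load = 0.0;
    for (double s : tasks) my_load += model.predict_ms(s);

    // Exchange predicted loads so every rank sees the global picture.
    std::vector<double> loads(nprocs);
    MPI_Allgather(&my_load, 1, MPI_DOUBLE, loads.data(), 1, MPI_DOUBLE,
                  MPI_COMM_WORLD);

    int victim = std::min_element(loads.begin(), loads.end()) - loads.begin();
    double avg = std::accumulate(loads.begin(), loads.end(), 0.0) / nprocs;

    // Illustrative proactive rule: an overloaded rank ships its surplus above
    // the average to the least loaded rank in one batch, before the phase
    // starts, instead of reacting to stalls later.
    if (loads[rank] > avg && victim != rank) {
        double per_task = loads[rank] / tasks.size();
        int n_offload = static_cast<int>((loads[rank] - avg) / per_task);
        std::printf("rank %d would offload %d tasks to rank %d\n",
                    rank, n_offload, victim);
        // Real migration (serializing task inputs, MPI_Send/Recv) omitted.
    }

    MPI_Finalize();
    return 0;
}
```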
Keywords
distributed memory, dynamic load balancing, machine learning, MPI plus OpenMP, online prediction, task-based parallel models