TH-Allreduce: Optimizing Small Data Allreduce Operation on Tianhe System

International Conference on Parallel and Distributed Systems (2023)

Abstract
Scaling up parallel applications can be challenging, especially when large volumes of data must be distributed across many nodes. In this paper, we examine the system architecture of Tianhe and propose a solution for optimizing global data communication. Our optimized small data blocking/non-blocking allreduce method (TH-Allreduce) is tailored for scientific applications, such as solving linear systems Ax = b, that often require massive data-processing capability. To address intra-node communication challenges, we introduce the Ping-Pong Small Data Shared Memory (PPSDSM) framework, which uses ping-pong communication patterns to minimize round-trip time (RTT) and reduce computational cost. Building on PPSDSM, we present a latency-aware allreduce algorithm (PP-LA) that reduces both communication overhead and computational cost. For inter-node communication, we leverage the Tianhe offloading engine to propose a topology-aware offloading allreduce method. Experimental results show that our library outperforms typical MPI implementations on different CPUs, achieving speedups of 1.5-12x for intra-node allreduce and 1.32-3.34x for multi-node small data allreduce on the Tianhe Exascale Prototype Upgrade System at scale. These findings demonstrate that the proposed methods significantly improve communication efficiency and scalability for distributed computing on Tianhe systems, opening new avenues for a wide range of scientific applications.
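
The TH-Allreduce library itself is not shown in this listing. For context, the sketch below illustrates the baseline operation the paper optimizes: a small-message allreduce, such as the dot-product reduction inside iterative solvers for Ax = b. It uses only the standard MPI API (both the blocking MPI_Allreduce and the non-blocking MPI_Iallreduce paths mentioned in the abstract) and assumes an MPI-3 installation; it is illustrative, not the paper's implementation.

```c
/* Baseline small-data allreduce: the latency-bound operation that
 * TH-Allreduce targets. Each rank contributes one double (e.g., a
 * partial dot product in a conjugate-gradient solver for Ax = b). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Dummy per-rank partial result standing in for a local dot product. */
    double local = (double)(rank + 1);
    double global = 0.0;

    /* Blocking path: one double per rank, summed on all ranks.
     * At scale this call is dominated by latency, not bandwidth. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("blocking allreduce: global sum = %f\n", global);

    /* Non-blocking path (MPI-3): the abstract states the paper also
     * optimizes this variant, which allows overlapping the reduction
     * with independent computation. */
    MPI_Request req;
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);
    /* ... independent computation could overlap here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (rank == 0)
        printf("non-blocking allreduce: global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```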
Keywords
Small Data Allreduce, Collective Communication, Offloading Communication, Shared Memory, Inter-node Communication