Accelerating Distributed GNN Training by Codes.

IEEE Transactions on Parallel and Distributed Systems (2023)

Abstract
Graph neural networks (GNNs) have recently attracted much attention and are used extensively in real-world applications thanks to their powerful ability to represent unstructured data. Real-world graph datasets can be very large, containing up to billions of nodes and tens of billions of edges, so training a GNN on them usually requires a distributed system. As a result, data communication between machines becomes the bottleneck of GNN computation. Our profiling results show that fetching attributes from remote machines during the sampling phase occupies more than 75% of the training time. To address this issue, in this article we propose the Coded Neighbor Sampling (CNS) framework, which introduces a coding technique to reduce the communication overhead of GNN training. In CNS, coding is coupled with the GNN sampling method to exploit the data redundancy among machines caused by the unstructured nature of graph data. An analytical performance model is built for CNS; its results are corroborated by simulation and validate the benefit of CNS over both the conventional GNN training method and the conventional coding technique. Performance metrics of CNS, such as communication overhead, runtime, and throughput, are evaluated on a distributed GNN training simulation system implemented on the MPI4py platform. The results show that, on average, CNS reduces communication overhead by 40.6%, 35.5%, and 16.5%, reduces runtime by 12.1%, 17.0%, and 10.0%, and improves throughput by 16.2%, 24.4%, and 11.2% when training GNN models on Cora, PubMed, and Large Taobao, respectively.
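The communication saving described above rests on a coded-multicast idea: when neighbor sampling leaves different machines holding copies of features that other machines need, a server can serve several requests with a single coded packet. The sketch below is a minimal two-worker illustration of that idea under assumed conditions (integer features, each worker caching exactly the feature the other needs); the names and setup are hypothetical and do not reproduce the paper's actual CNS coding scheme.

```python
import numpy as np

# Minimal sketch of the coded-multicast gain that CNS exploits
# (hypothetical setup, not the paper's scheme). Two workers each
# sample a neighbor whose feature lives on a remote server, but each
# already caches the feature the *other* worker needs. The server
# then sends one XOR-coded packet instead of two unicast packets.

rng = np.random.default_rng(0)
FEAT_DIM = 8

# Node features stored on the remote server (uint8 so XOR is exact).
feat_a = rng.integers(0, 256, FEAT_DIM, dtype=np.uint8)  # needed by worker 1
feat_b = rng.integers(0, 256, FEAT_DIM, dtype=np.uint8)  # needed by worker 2

# Side information created by overlapping neighbor sampling:
# worker 1 caches feat_b, worker 2 caches feat_a.
cache_w1, cache_w2 = feat_b, feat_a

# Server broadcasts a single coded packet.
coded = np.bitwise_xor(feat_a, feat_b)

# Each worker decodes its missing feature with its cached copy.
decoded_w1 = np.bitwise_xor(coded, cache_w1)  # recovers feat_a
decoded_w2 = np.bitwise_xor(coded, cache_w2)  # recovers feat_b

assert np.array_equal(decoded_w1, feat_a)
assert np.array_equal(decoded_w2, feat_b)
print("one coded multicast replaced two unicast transmissions")
```

In this toy case the coded transmission halves the traffic for these two features; the abstract's reported savings come from applying this kind of coding across the overlap that real GNN neighbor sampling produces among machines.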
Key words
GNN, neighbor sampling, codes, distributed machine learning, MPI, communication