FastDimeNet++: Training DimeNet++ in 22 Minutes

Feiwen Zhu, Michal Futrega, Han Bao, Sukru Burc Eryilmaz, Fei Kong, Matthias Jouanneaux, Maximilian Stadler, Michal Marcinkiewicz, Kefeng Duan, Xinnian Zheng, Nimrod Angel, Fung Xie, June Yang, Michael Andersch

Proceedings of the 52nd International Conference on Parallel Processing (ICPP 2023), 2023

Abstract
Recently, graph neural networks (GNNs) have shown significant strength in predicting the quantum mechanical properties of molecules. Built on GNNs, DimeNet++ leverages both the distance information of atomic pairs and the angle information of atomic triplets via a message-passing mechanism to predict quantum mechanical properties of molecules, and has achieved state-of-the-art results. However, DimeNet++ contains more than 10 thousand operators, which results in low GPU utilization and large CPU launch overhead. The extensive training time of DimeNet++ is a significant drawback: training exceeds one month on a single NVIDIA A100 GPU. A common way to reduce training time is data parallelism, which distributes the global batch equally across GPUs. However, data-parallel task partitioning does not, by default, account for load imbalance within the batch. This load imbalance leads to considerable synchronization overhead in a multi-GPU setting, reducing the overall efficiency of the parallelism; in the strong-scaling scenario, it wastes 32% of the compute resources. In light of these observations, we propose a novel approach, FastDimeNet++, which delivers high GPU utilization, low CPU overhead, and extensive scalability through a series of optimization strategies: (i) a communication-free load-balancing sampler, (ii) computation graph reconstruction, and (iii) kernel fusion and redundancy bypass. Our experiments demonstrate that FastDimeNet++ achieves a GPU utilization rate of approximately 88% with a mini-batch size of 4. Furthermore, we scale FastDimeNet++ to 512 GPUs, reaching 2.8 PetaFLOPS. In MLPerf HPC V1.0, the winning DimeNet++ submission required a total training time of 111.86 minutes, whereas FastDimeNet++, introduced for MLPerf HPC V2.0, required just 21.93 minutes, a performance improvement of over 5x.
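To make the communication-free load-balancing idea concrete, the sketch below shows one way such a sampler could work; it is an illustration under assumed names and inputs, not the paper's implementation. Here `costs` stands in for any per-sample workload proxy (e.g., per-molecule edge or triplet count), and the key property is that every rank runs the same deterministic routine from the same seed, so all ranks agree on the partition without exchanging any messages.

```python
# Hypothetical sketch of a communication-free load-balancing sampler.
# All names (costs, global_batch, CommFreeBalancedSampler) are illustrative.
import random
from typing import List, Sequence


def balanced_assignment(costs: Sequence[int], world_size: int) -> List[List[int]]:
    """Greedily assign sample positions to ranks so per-rank cost is even.

    Every rank runs this same deterministic routine on the same inputs,
    so no communication is needed to agree on the partition.
    """
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    buckets: List[List[int]] = [[] for _ in range(world_size)]  # positions per rank
    loads = [0] * world_size                                    # accumulated cost per rank
    for pos in order:
        r = min(range(world_size), key=loads.__getitem__)       # least-loaded rank
        buckets[r].append(pos)
        loads[r] += costs[pos]
    return buckets


class CommFreeBalancedSampler:
    """Yields this rank's dataset indices for each global batch.

    The shuffle is seeded identically on all ranks, so the resulting
    batches and their balanced partitions match everywhere.
    """

    def __init__(self, costs: Sequence[int], global_batch: int,
                 world_size: int, rank: int, seed: int = 0):
        self.costs, self.global_batch = list(costs), global_batch
        self.world_size, self.rank, self.seed = world_size, rank, seed

    def __iter__(self):
        idx = list(range(len(self.costs)))
        random.Random(self.seed).shuffle(idx)  # same order on every rank
        # Drop the trailing incomplete batch for simplicity.
        for start in range(0, len(idx) - self.global_batch + 1, self.global_batch):
            batch = idx[start:start + self.global_batch]
            buckets = balanced_assignment([self.costs[i] for i in batch],
                                          self.world_size)
            yield [batch[j] for j in buckets[self.rank]]
```

Because the assignment is a pure function of the shared seed and the known per-sample costs, ranks never need to synchronize on the partition itself; only the usual gradient all-reduce remains, and its wait time shrinks as the per-rank workloads become more even.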
Keywords
distributed training,high-performance computing,GPU,graph neural network,AI for science