Parallelizing Neural Network Models Effectively on GPU by Implementing Reductions Atomically.

PACT(2022)

引用 0|浏览21
暂无评分
摘要
Due to the missing of a good orchestration of loop transformations, existing optimizing compilers for deploying neural networks on GPU either parallelize reductions ineffectively or miss the fusion opportunities with other operators. Neural network models thus exhibit sub-optimal performance on GPU. We present a practical approach called Panamera for the effective parallelization of reductions in neural networks on GPU. Panamera first leverages loop coalescing to flatten the loop dimensions of reductions, converting all reduction operators into canonical forms eligible for the polyhedral model. Next, Panamera uses polyhedral transformations to reduce the data movements caused by unfused reductions and perform multi-block hardware binding not considered by many compilers. Finally, Panamera embeds a highly optimized routine implemented using GPU atomic instructions, further improving the performance of neural network models while guaranteeing the correctness of parallel reductions. The experimental results demonstrate the effectiveness of our approach: for single operators our code obtains a mean speedup of 33.7×, 3.5×, 5.4× and 9.6× over cuDNN, CUB, TVM and Ansor, for sub-graphs our approach outperforms cuDNN, TVM and Ansor by 9.5×, 2.6× and 2.7×, and for end-to-end workloads, a tensor compiler integrated with our approach outperforms them by 122.5%, 19.3% and 15.2%.
更多
查看译文
关键词
deep learning, reduction, GPU, polyhedral compilation
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要