EVT: Accelerating Deep Learning Training with Epilogue Visitor Tree

Zhaodong Chen, Andrew Kerr, Richard Cai, Jack Kosaian, Haicheng Wu,Yufei Ding,Yuan Xie

Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3(2024)

Cited 0|Views48
No score
Abstract
As deep learning models become increasingly complex, the deep learning compilers are critical for enhancing the system efficiency and unlocking hidden optimization opportunities. Although excellent speedups have been achieved in inference workloads, existing compilers face significant limitations in training. Firstly, the training computation graph involves intricate operations challenging to fuse, such as normalization, loss functions, and reductions, which limit optimization opportunities like kernel fusion. Secondly, the training graph's additional edges connecting forward and backward operators pose challenges in finding optimal and feasible partitions for kernel fusion. More importantly, existing compilers cannot either generate kernels with state-of-the-art performance on modern GPUs or accommodate diverse fusion patterns. In this paper, we introduce Epilogue Visitor Tree (EVT), a novel compiler that overcomes these limitations. EVT employs novel graph-level compilation passes to unlock hidden fusion and optimization opportunities. It also incorporates a novel integer linear programming-based partitioner that efficiently solves the optimal and feasible partitions in complex joint forward-backward graphs. Moreover, we present the Epilogue Visitor Abstraction and introduce the EVT operator compiler that automatically generates flexible epilogues that can be integrated with high-performance main loop implementations from CUTLASS and other SOTA libraries. EVT is evaluated on diverse training workloads across domains and achieves 1.26~3.1× speedup.
More
Translated text
AI Read Science
Must-Reading Tree
Example
Generate MRT to find the research sequence of this paper
Chat Paper
Summary is being generated by the instructions you defined