Graphene: An IR for Optimized Tensor Computations on GPUs

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3(2023)

引用 0|浏览68
暂无评分
摘要
Modern GPUs accelerate computations and data movements of multi-dimensional tensors in hardware. However, expressing optimized tensor computations in software is extremely challenging even for experts. Languages like CUDA C++ are centered around flat buffers in one-dimensional memory and lack reasonable abstractions for multi-dimensional data and threads. Existing tensor IRs are not expressive enough to represent the complex data-to-thread mappings required by the GPU tensor instructions. In this paper, we introduce Graphene, an intermediate representation (IR) for optimized tensor computations on GPUs. Graphene is a low-level target language for tensor compilers and performance experts while being closer to the domain of tensor computations than languages offering the same level of control such as CUDA C++ and PTX. In Graphene, multi-dimensional data and threads are represented as first-class tensors. Graphene’s tensors are hierarchically decomposable into tiles allowing to represent optimized tensor computations as mappings between data and thread tiles. We evaluate Graphene using some of the most important tensor computations in deep learning today, including GEMM, Multi-Layer Perceptron (MLP), Layernorm, LSTM, and Fused Multi-Head Attention (FMHA). We show that Graphene is capable of expressing all optimizations required to achieve the same practical peak performance as existing library implementations. Fused kernels beyond library routines expressed in Graphene significantly improve the end-to-end inference performance of Transformer networks and match or outperform the performance of cuBLAS(Lt), cuDNN, and custom handwritten kernels.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要