High-Performance Matrix Multiplication on the New Generation Shenwei Processor.

Le Xu, Hong An, Junshi Chen, Pengfei Zhang

HPCC/DSS/SmartCity/DependSys (2022)

Abstract
As a critical operation in numerical computing, matrix multiplication is widely used in many high-performance applications. The new generation Shenwei processor is the newest many-core processor with powerful computing capacity, so it is necessary to enable highly efficient matrix multiplication on it. In this paper, we present a detailed implementation of double-precision matrix multiplication on the Shenwei processor and perform deep optimizations of both memory access and computation. We employ a three-level task partitioning scheme that maps blocked matrix multiplication onto each level of the memory hierarchy to improve computational parallelism and data reuse, and a dual broadcasting algorithm enables efficient parallel computing on chip. For further optimization, we present a three-level latency hiding strategy to bridge the performance gap between computation and memory access, in which hybrid data prefetching effectively hides DMA memory access and RMA communication costs. An adaptive blocking algorithm is proposed so that the approach generalizes to various matrix sizes and shapes. Because the performance of matrix multiplication depends heavily on the compute kernel, we propose register-buffering instruction scheduling with a latency hiding strategy to maximize pipeline efficiency. The experiments show that the performance of our algorithm can reach 2 TFLOPS at large scales, near the theoretical peak performance. Hybrid prefetching can improve performance by 82%, while adaptive blocking not only improves performance but also smooths the performance jitter of irregular matrix multiplications. The compute kernel with instruction scheduling is highly optimized and can achieve more than 380% performance improvement.
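The blocked matrix multiplication that the abstract builds on can be illustrated with a minimal single-level sketch. It is not the paper's three-level scheme and contains nothing specific to the Shenwei processor; the function name `blocked_matmul` and the `block` parameter are hypothetical, chosen only for illustration. It shows how the computation is partitioned into tiles so that each tile of the operands is reused while it is resident in fast memory, including the edge tiles that arise when the matrix dimensions are not multiples of the block size.

```python
def blocked_matmul(a, b, block=2):
    """Compute C = A * B tile by tile (illustrative single-level blocking).

    a is an m x k matrix and b is a k x n matrix, both as lists of rows.
    `block` is a hypothetical tile size; edge tiles are clipped with min().
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):            # tile rows of C
        for j0 in range(0, n, block):        # tile columns of C
            for p0 in range(0, k, block):    # tiles along the shared dimension
                for i in range(i0, min(i0 + block, m)):
                    for p in range(p0, min(p0 + block, k)):
                        a_ip = a[i][p]       # reused across the whole tile row
                        for j in range(j0, min(j0 + block, n)):
                            c[i][j] += a_ip * b[p][j]
    return c


def naive_matmul(a, b):
    """Reference triple-loop product used to check the blocked version."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]
```

Any tile size gives the same result as the reference product; on real hardware the tile size would be chosen to fit the target level of the memory hierarchy.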
Keywords
parallel algorithm, Shenwei many-core processor, dense computing, matrix multiplication, GEMM