A Native Tensor–Vector Multiplication Algorithm for High Performance Computing

IEEE Transactions on Parallel and Distributed Systems (2022)

Abstract
Tensor computations are important mathematical operations for applications that rely on multidimensional data. The tensor–vector multiplication (TVM) is the most memory-bound tensor contraction in this class of operations. This article proposes an open-source TVM algorithm that is much simpler and more efficient than previous approaches, making it suitable for integration into the most popular BLAS libraries available today. Our algorithm has been written from scratch and features unit-stride memory accesses, cache awareness, mode obliviousness, full vectorization, multi-threading, and NUMA awareness for non-hierarchically stored dense tensors. Numerical experiments are carried out on tensors of up to order 10, using various compilers and hardware architectures equipped with traditional DDR and high bandwidth memory (HBM). For large tensors, the average performance of the TVM ranges between 62% and 76% of the theoretical bandwidth on NUMA systems with DDR memory and remains independent of the contraction mode. On NUMA systems with HBM, the TVM exhibits some mode dependency but reaches performance figures close to peak values. Finally, the higher-order power method is benchmarked with the proposed TVM kernel and delivers, on average, between 58% and 69% of the theoretical bandwidth for large tensors.
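
To make the contraction concrete, the following C sketch computes a mode-k TVM for a dense tensor stored in row-major order by viewing it as a (left × n_k × right) array; the function name tvm_mode_k and the plain loop nest are illustrative assumptions for exposition, not the authors' kernel, though the unit-stride innermost loop reflects one of the abstract's stated design goals.

```c
/* Minimal sketch of a mode-k tensor-vector multiplication (TVM) for a
 * dense, row-major tensor (last index varies fastest). Illustrative
 * only: a naive, single-threaded loop nest, not the paper's algorithm. */
#include <stddef.h>

/* A: tensor of order d with extents n[0..d-1]; v: vector of length n[k];
 * y: output of size prod(n)/n[k], i.e. the tensor with mode k contracted. */
void tvm_mode_k(const double *A, const double *v, double *y,
                const size_t *n, size_t d, size_t k)
{
    size_t left = 1, right = 1;
    for (size_t m = 0; m < k; ++m)     left  *= n[m]; /* extents before mode k */
    for (size_t m = k + 1; m < d; ++m) right *= n[m]; /* extents after mode k  */

    /* View A as (left x n[k] x right): y[l][r] = sum_j A[l][j][r] * v[j].
     * The innermost loop over r walks memory with unit stride. */
    for (size_t l = 0; l < left; ++l) {
        double *yl = y + l * right;
        for (size_t r = 0; r < right; ++r)
            yl[r] = 0.0;
        for (size_t j = 0; j < n[k]; ++j) {
            const double vj = v[j];
            const double *a = A + (l * n[k] + j) * right;
            for (size_t r = 0; r < right; ++r)
                yl[r] += vj * a[r];
        }
    }
}
```

A mode-oblivious kernel such as the one the paper proposes would aim to sustain the same bandwidth for every choice of k, whereas this naive sketch degrades when k is the last mode (right = 1) and the inner loop collapses.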
Keywords
Parallel algorithms, shared memory, tensor computations, high bandwidth memory, NUMA