Supporting Data Shuffle Between Threads in OpenMP

IWOMP (2020)

Abstract
Both NVIDIA and AMD GPUs provide shuffle or permutation instructions that enable direct data movement between the private registers of different threads. Because it does not involve shared or global memory on the device, both of which are slower than direct register access, data shuffling offers opportunities to optimize data copies and improve computing performance. However, shuffle is a low-level GPU programming primitive (warp-level on NVIDIA GPUs, lane-level on AMD GPUs), and using it effectively requires advanced knowledge and skill. In this paper, we present two approaches to using shuffle in OpenMP: 1) a high-performance runtime implementation of the reduction clause using shuffle instructions; and 2) a proposed shuffle extension to OpenMP that lets users specify when and how data should be moved between threads. Using sum reduction and a 2D stencil as examples in our experiments, the shuffle implementation always delivers the best performance, with up to a 2.39x speedup over other high-performance implementations. Compared with standard OpenMP offloading code for the 2D stencil, our shuffle implementation performs up to 25x better. We also provide a study of simulated shuffle using shared memory on NVIDIA GPUs to demonstrate how this extension can be supported on hardware that has no native shuffle support.
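To make the underlying technique concrete, the sketch below shows a generic warp-level sum reduction on an NVIDIA GPU using CUDA's `__shfl_down_sync` intrinsic; it is not the paper's runtime implementation, and the function names `warpReduceSum` and `sumKernel` are illustrative. Each thread keeps its partial sum in a register, and the shuffle moves values directly between lanes without touching shared or global memory.

```cuda
#include <cstdio>

// Warp-level sum reduction: after log2(32) = 5 shuffle steps,
// lane 0 holds the sum of all 32 lanes in the warp.
__inline__ __device__ int warpReduceSum(int val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Grid-stride sum over an array; each warp's lane 0 contributes
// its partial sum to the global result with an atomic add.
__global__ void sumKernel(const int *in, int *out, int n) {
    int val = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        val += in[i];
    val = warpReduceSum(val);
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, val);
}
```

On hardware without native shuffle, the same lane-to-lane exchange can be emulated by staging values through shared memory, which is the fallback strategy the paper studies.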
Keywords
OpenMP, CUDA, Shuffle, Reduction, Stencil