Out-of-GPU FFT: A case study in GPU prefetching

2021 International Conference on Computational Science and Computational Intelligence (CSCI)(2021)

Abstract
In this paper, we propose a decomposition of the N-dimensional FFT and novel transposition strategies to optimize performance for input sizes that do not fit on the GPU. The state-of-the-art GPU FFT library, cuFFT, efficiently solves FFT problems that fit in the GPU memory. Additionally, using managed memory, cuFFT can solve problems that exceed the GPU memory, albeit inefficiently due to poor prefetching from the CPU. The major bottleneck in computing the FFT on a GPU is the PCI bandwidth. Therefore, careful prefetching is required to maximize PCI bandwidth. Batches of decomposed input data are sent to and from the GPU to overlap communication with computation. The batches are organized such that the dimension that is stored contiguously is always included, to maximize DRAM bandwidth and cache line use. We compare three transposition strategies: CPU-based transposition, GPU-based transposition, and index-based transposition within the actual FFT, and find that GPU-based transposition performs the best. Finally, we propose a model that relates the hardware characteristics to the decomposition parameters. We compare our results to the model and to cuFFT on three platforms: a workstation with a GeForce GTX 1060, NERSC Cori, and ORNL Summit, and show a 2-3X speedup over cuFFT using managed memory for input sizes that do not fit in the GPU memory.
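The decomposition the abstract describes can be illustrated with a small sketch: an N-dimensional FFT is computed as batches of 1-D FFTs that always run along the contiguously stored (last) axis, with an explicit transpose between passes so each pass again reads contiguous memory. The sketch below uses NumPy on the CPU purely as a stand-in for batched cuFFT calls on the GPU; the function name, the `batch_rows` parameter, and the batch loop are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fft2_batched(x, batch_rows):
    # Sketch of the batched decomposition (hypothetical names): a 2-D FFT
    # computed as two passes of 1-D FFTs along the contiguous last axis.
    # Each batch covers whole rows, so every transfer/read touches the
    # contiguous dimension in full, as the paper's batching scheme requires.
    y = np.empty(x.shape, dtype=complex)
    for start in range(0, x.shape[0], batch_rows):
        sl = slice(start, start + batch_rows)
        y[sl] = np.fft.fft(x[sl], axis=-1)   # stand-in for one cuFFT batch

    # Explicit transposition between passes (the "GPU-based transposition"
    # strategy would do this on the device) so the second pass is again a
    # contiguous row-wise FFT rather than a strided column-wise one.
    yt = np.ascontiguousarray(y.T)
    z = np.empty(yt.shape, dtype=complex)
    for start in range(0, yt.shape[0], batch_rows):
        sl = slice(start, start + batch_rows)
        z[sl] = np.fft.fft(yt[sl], axis=-1)
    return z.T

# Row-FFT + transpose + row-FFT reproduces the full 2-D FFT.
x = np.random.rand(8, 16)
result = fft2_batched(x, batch_rows=3)
```

In the actual out-of-GPU setting, each `batch_rows` slab would be prefetched to the device over PCIe while the previous batch's FFT executes, overlapping communication with computation.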
Keywords
Fast Fourier Transform, FFT, GPU, CUDA