Leveraging NVLINK and asynchronous data transfer to scale beyond the memory capacity of GPUs

David Appelhans, Bob Walkup

ScalA@SC (2017)

Abstract
In this paper we demonstrate the utility of fast GPU-to-CPU interconnects for weak scaling on hierarchical nodes without being limited to problem sizes that fit in GPU memory. We show the speedup possible for a new regime of algorithms that traditionally have not benefited from being ported to GPUs because of an insufficient amount of computational work relative to the bytes of data that must be transferred (offload intensity). This new capability is demonstrated with our hierarchical GPU port of UMT, the 51K-line CORAL benchmark application for Lawrence Livermore National Laboratory's radiation transport code. By overlapping data transfers with computation and using the NVLINK connection between IBM POWER8 CPUs and NVIDIA P100 GPUs, we demonstrate a speedup that persists even when the problem size is scaled well beyond the memory capacity of the GPUs. Scaling to large local domains per MPI process is a necessary step toward solving very large problems, and in the case of UMT, large local domains improve convergence as the number of MPI ranks is weak scaled.
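The core technique the abstract describes is pipelining: while one chunk of the domain is being computed on the GPU, the next chunk's host-to-device transfer proceeds over the interconnect, so the resident device footprint stays bounded even though the full working set exceeds GPU memory. The sketch below is not taken from the UMT port; it is a minimal, generic illustration of that overlap pattern using CUDA streams, pinned host memory, and a placeholder kernel, with chunk sizes, stream count, and the kernel itself chosen purely for illustration.

```cuda
// Illustrative sketch (assumptions: chunk sizes, stream depth, and the "scale"
// kernel are placeholders, not the paper's UMT sweep code). It shows how a
// working set larger than GPU memory is streamed through a small number of
// resident device buffers while transfers and compute overlap.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Placeholder kernel standing in for the real per-chunk computation.
__global__ void scale(double *x, size_t n, double a) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t total   = 1ull << 26;  // total elements on the host
    const size_t chunk   = 1ull << 22;  // elements resident on the GPU at a time
    const int    nstream = 4;           // depth of the transfer/compute pipeline

    // Pinned host memory is required for cudaMemcpyAsync to be truly asynchronous.
    double *h;
    cudaMallocHost(&h, total * sizeof(double));
    for (size_t i = 0; i < total; ++i) h[i] = 1.0;

    // One device buffer and one stream per pipeline stage.
    std::vector<double*> d(nstream);
    std::vector<cudaStream_t> s(nstream);
    for (int k = 0; k < nstream; ++k) {
        cudaMalloc(&d[k], chunk * sizeof(double));
        cudaStreamCreate(&s[k]);
    }

    // Round-robin the chunks over the streams: while one chunk computes,
    // another chunk's copy is in flight over the CPU-GPU interconnect.
    // Work within a stream serializes, so each device buffer is reused safely.
    for (size_t off = 0, k = 0; off < total; off += chunk, k = (k + 1) % nstream) {
        size_t n = (off + chunk <= total) ? chunk : total - off;
        cudaMemcpyAsync(d[k], h + off, n * sizeof(double),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<(unsigned)((n + 255) / 256), 256, 0, s[k]>>>(d[k], n, 2.0);
        cudaMemcpyAsync(h + off, d[k], n * sizeof(double),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();

    printf("h[0] = %f\n", h[0]);  // expect 2.0 after the streamed update

    for (int k = 0; k < nstream; ++k) { cudaFree(d[k]); cudaStreamDestroy(s[k]); }
    cudaFreeHost(h);
    return 0;
}
```

Whether this pattern pays off depends on the offload intensity the abstract mentions: if the per-chunk compute time is at least comparable to the per-chunk transfer time, the copies hide behind the kernels and the GPU stays busy, which is exactly where a fast CPU-GPU link such as NVLINK lowers the break-even point.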