Multi-level spatial and temporal tiling for efficient HPC stencil computation on many-core processors with large shared caches.

Charles Yount, Alejandro Duran, Josh Tobin

Future Generation Computer Systems (2019)

Abstract

Stencil computation is an important class of algorithms used in a large variety of scientific-simulation applications, especially those arising from finite-difference numerical solutions to differential equations representing the behavior of physical phenomena such as seismic activity. The performance of stencil calculations is often bounded by memory bandwidth, and such code benefits from vectorization and tiling techniques to reuse data as much as possible once it is loaded from memory. These tiling algorithms are especially crucial for many-core CPU products that contain caches local to the individual cores, and this work provides a review of the use of techniques such as vector-folding and spatial tiling to maximize per-core cache resources. Recent many-core products also include special memory with much higher bandwidth than traditional DDR memory that is intended to provide additional performance for bandwidth-limited applications. On such platforms that also include DDR, the high-bandwidth RAM may be configurable either as separately addressable memory or as a large shared cache for the DDR. Examples of platforms with this feature include those containing products in the Intel® Xeon Phi™ x200 processor family (code-named Knights Landing), which use Multi-Channel DRAM (MCDRAM) technology to provide the higher bandwidth memory resources. In traditional sequential time-step stencil algorithms, the additional bandwidth can most easily be exploited when the stencil data fits into the faster memory, restricting the problem sizes that can be undertaken and under-utilizing the larger DDR memory on the platform. As stencil problem sizes become significantly larger than the fast-memory capacity, the sequential time-step algorithms create an overwhelming number of misses from the fast-memory shared cache, and the effective bandwidth approaches that of the DDR, significantly degrading performance.
This paper illustrates this effect and explores the application of temporal wave-front tiling to alleviate it, simultaneously leveraging both the large cache’s bandwidth and the DDR capacity. Two example applications are used to illustrate the optimizations: a single-grid isotropic approximation to the wave equation and a staggered-grid formulation for earthquake simulation. Details of the various tiling algorithms are given for both applications, and results on a Xeon Phi processor are presented, comparing performance across problem sizes and among four experimental configurations. Analyses of the bandwidth utilization and MCDRAM-cache hit rates are provided for one of the example applications, illustrating the correlation between these metrics and performance. It is demonstrated that temporal wave-front tiling can provide up to a 2.4x speedup compared to using the fast-memory cache without temporal tiling and 3.3x speedup compared to only using DDR memory for large problem sizes on the isotropic application. Respective speedups of 1.9x and 2.8x are demonstrated for the staggered-grid application.
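The idea behind temporal wave-front tiling can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a 1D grid, a 3-point averaging stencil of radius 1, fixed boundary values, and two alternating buffers, and the names `step_range`, `naive_sweeps`, `wavefront_sweeps`, `block`, and `tile_steps` are hypothetical. Each spatial block is advanced through several time steps before the next block is touched, and the block is shifted left by one stencil radius per step so that every value it reads has already been computed.

```python
def step_range(src, dst, start, stop):
    """Apply a 3-point averaging stencil to dst[start:stop], reading from src."""
    for i in range(start, stop):
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0


def naive_sweeps(grid, num_steps):
    """Sequential time-stepping: every step sweeps the whole grid once."""
    bufs = [list(grid), list(grid)]  # both copies hold the fixed boundary values
    n = len(grid)
    for t in range(num_steps):
        step_range(bufs[t % 2], bufs[(t + 1) % 2], 1, n - 1)
    return bufs[num_steps % 2]


def wavefront_sweeps(grid, num_steps, block=64, tile_steps=4):
    """Temporal wave-front tiling (illustrative 1D version).

    Up to `tile_steps` time steps are applied to one spatial block while its
    data is still hot, before moving on to the next block to the right.
    """
    bufs = [list(grid), list(grid)]
    lo, hi = 1, len(grid) - 1  # interior points; the two end points stay fixed
    radius = 1                 # stencil radius (assumption of this sketch)
    for t0 in range(0, num_steps, tile_steps):
        steps = min(tile_steps, num_steps - t0)
        # Blocks starting past `hi` are needed so the skewed tiles
        # still cover the right edge at the later time steps.
        for start in range(lo, hi + (steps - 1) * radius, block):
            for s in range(steps):
                t = t0 + s
                # Skew: shift the block left by one radius per time step,
                # then clamp it to the interior of the grid.
                begin = max(lo, start - s * radius)
                end = min(hi, start + block - s * radius)
                step_range(bufs[t % 2], bufs[(t + 1) % 2], begin, end)
    return bufs[num_steps % 2]
```

Because both functions apply the same arithmetic to each point in the same order, they should produce identical results; only the order in which memory is visited differs. In this sketch, `block` plays the role of the working set that would be sized to fit the fast-memory cache.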

Keywords

Finite-difference method, Seismic modeling, Intel Xeon Phi, Temporal wave-front tiling, Vector-folding
