Accelerating Scientific Workflow Applications with GPUs

Semantic Scholar (2013)

Abstract
This work analyzes the performance gains obtained by enabling Swift applications to use the GPU through the GeMTC framework. Once the computationally intensive portions of a Swift application are identified, these code blocks can be turned into GeMTC microkernels with little effort. Users can then call these microkernels throughout the lifetime of their Swift application. The GeMTC API handles task overlap and data movement, providing transparent GPU acceleration for the user. This work highlights preliminary performance results from the scientific application MDProxy, which determines the energy of particles in a modeled universe as they move through space.

Keywords: Many-Task Computing, Swift, GPGPU, CUDA

I. BACKGROUND INFORMATION

GeMTC (GPU-enabled Many-Task Computing) [1] is a CUDA-based framework that provides efficient support for Many-Task Computing [2] workloads on accelerators [3]. The GeMTC framework has been integrated into Swift/T [4], a parallel programming framework from Argonne National Laboratory and the University of Chicago, providing GPU functionality for the Swift language [5]. A CUDA kernel is a user-defined function that runs on an NVIDIA GPU. A microkernel is a traditional CUDA kernel that has been modified to run in the GeMTC framework.

II. MDPROXY ARCHITECTURE

Figure 1 shows the call stack architecture for MDProxy through Swift and GeMTC. The user writes a Swift script that builds an array of potential particles and calls GeMTC MDProxy with this array as a parameter. Each call to MDProxy creates its own universe of particles and ships that universe to the GPU. The MDProxy application itself consists of three functions: 1) initialize the universe, 2) run the computation, and 3) update the result.

III. TESTING ENVIRONMENT

The evaluation in this work is conducted on a GTX 670 GPU with 7 Streaming Multiprocessors (SMXs). This GPU also provides 84 warps (used as GeMTC workers), 1,344 CUDA cores, and 2 GB of GDDR5 RAM. CPU results were measured on a 6-core 3 GHz AMD CPU.

Fig. 1. Call stack architecture for the MDProxy implementation.

IV. MDPROXY EVALUATION

In Figure 2, the GeMTC MDProxy microkernel is evaluated with 2,688 particles and scaled up to 900 steps of computation from within the application. The GPU version is also compared against a threaded CPU implementation of MDProxy and shows a 10x speedup. Finally, Figure 3 shows how throughput, measured in tasks per second, varies with the number of particles per universe. This work achieves almost 12k tasks per second for workloads with 500 particles per universe.

Fig. 2. MDProxy evaluation of the GPU vs. CPU implementations.
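
The three MDProxy functions named in Section II (initialize the universe, run the computation, update the result) can be illustrated with a small CPU-side sketch. This is only a structural illustration under stated assumptions, not the authors' microkernel: the function names `init_universe`, `compute_energy`, `update_result`, and `run_task` are hypothetical, and the pair potential is a placeholder because the source does not specify the energy model MDProxy uses. In GeMTC the equivalent work would run as a microkernel on a GPU worker rather than in Python.

```python
import math
import random


def init_universe(num_particles, seed=0, box=10.0):
    """Phase 1 (assumed form): place particles at random points in a cubic box."""
    rng = random.Random(seed)
    return [
        (rng.uniform(0.0, box), rng.uniform(0.0, box), rng.uniform(0.0, box))
        for _ in range(num_particles)
    ]


def compute_energy(positions):
    """Phase 2: sum a placeholder pair potential over all particle pairs.

    The 0.5 * d**2 term is an assumption chosen for simplicity; it is NOT
    the potential used by MDProxy, which the source does not describe.
    """
    total = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(positions[i], positions[j])
            total += 0.5 * d * d
    return total


def update_result(results, universe_id, energy):
    """Phase 3: record the computed energy for this universe."""
    results[universe_id] = energy
    return results


def run_task(universe_id, num_particles, results):
    """One task: each call builds its own universe, as described in Section II."""
    positions = init_universe(num_particles, seed=universe_id)
    energy = compute_energy(positions)
    return update_result(results, universe_id, energy)


# Each entry stands in for one independent MDProxy call issued by the script.
results = {}
for uid, n in enumerate([4, 8, 16]):
    run_task(uid, n, results)
```

Because every task owns its universe and shares no state with other tasks, the calls are independent of one another, which is the property that lets a many-task framework schedule them across separate workers.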