The Unicorn Runtime: Efficient Distributed Shared Memory Programming for Hybrid CPU-GPU Clusters.

IEEE Trans. Parallel Distrib. Syst. (2017)

Abstract
Programming hybrid CPU-GPU clusters is hard. This paper addresses this difficulty and presents the design and runtime implementation of Unicorn, a parallel programming model for hybrid CPU-GPU clusters. In particular, this paper demonstrates that efficient distributed shared memory style programming is possible and that its simplicity can be retained across CPUs and GPUs in a cluster, minus the frustration of dealing with race conditions. Further, this can be done with a unified abstraction, avoiding much of the complication of dealing with hybrid architectures. This is achieved with the help of transactional semantics (on shared global address spaces), deferred bulk data synchronization, workload pipelining, and various communication and computation scheduling optimizations. We describe this abstraction and our computation and communication scheduling system, and report its performance on benchmarks such as matrix multiplication, LU decomposition, and 2D FFT. We find that parallelization of coarse-grained applications like matrix multiplication or 2D FFT using our system requires only about 30 lines of C code to set up the runtime; the rest of the application code is a regular single-CPU/GPU implementation. This indicates the ease of extending parallel code to a distributed environment. The execution is efficient as well. When multiplying two square matrices of size $65{,}536 \times 65{,}536$, Unicorn achieves a peak performance of 7.88 TFlop/s on a cluster of 14 nodes, each equipped with two Tesla M2070 GPUs and two 6-core Intel Xeon 2.67 GHz CPUs, connected over a 32 Gbps InfiniBand network. We also demonstrate that the Unicorn programming model can be efficiently used to implement higher-level abstractions like MapReduce. We use such an extension to implement PageRank and report its performance. For a sample web of 500 million pages, our implementation completes a PageRank iteration in about 18 seconds (on average) on a 14-node cluster.
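To make the abstract's "coarse-grained subtask decomposition" claim concrete, the sketch below shows a blocked matrix multiplication in plain C. It is not the Unicorn API (the function names and driver loop here are hypothetical, for illustration only); it only shows the style of work division a distributed shared memory runtime could schedule across CPU and GPU nodes: each subtask owns one output block, reads the inputs it needs, and writes its result back once, matching the deferred bulk synchronization described in the abstract.

```c
/* Hypothetical sketch, NOT the Unicorn API: blocked C = A * B, where
 * each (bi, bj) output block is one independent subtask.  A DSM runtime
 * could schedule these subtasks on CPUs/GPUs across a cluster; here the
 * driver simply runs them sequentially. */
#include <stdio.h>
#include <stdlib.h>

#define N     512          /* matrix dimension (small, for the demo)   */
#define BLOCK 128          /* side of one output block = one subtask   */

/* One subtask: compute the (bi, bj) block of C = A * B. */
static void compute_block(const double *A, const double *B, double *C,
                          int bi, int bj)
{
    for (int i = bi * BLOCK; i < (bi + 1) * BLOCK; ++i)
        for (int j = bj * BLOCK; j < (bj + 1) * BLOCK; ++j) {
            double acc = 0.0;
            for (int k = 0; k < N; ++k)
                acc += A[i * N + k] * B[k * N + j];
            C[i * N + j] = acc;   /* single write-back per element */
        }
}

int main(void)
{
    double *A = malloc(sizeof(double) * N * N);
    double *B = malloc(sizeof(double) * N * N);
    double *C = calloc((size_t)N * N, sizeof(double));
    if (!A || !B || !C) return 1;

    for (int i = 0; i < N * N; ++i) { A[i] = 1.0; B[i] = 2.0; }

    /* Driver loop: each (bi, bj) pair is an independent unit of work. */
    int nb = N / BLOCK;
    for (int bi = 0; bi < nb; ++bi)
        for (int bj = 0; bj < nb; ++bj)
            compute_block(A, B, C, bi, bj);

    printf("C[0][0] = %.1f (expected %.1f)\n", C[0], 2.0 * N);
    free(A); free(B); free(C);
    return 0;
}
```

In this decomposition the subtasks share no writable state, which is why a runtime with transactional semantics on a global address space can execute them in any order, on any device, without exposing the programmer to race conditions.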
Keywords
Runtime, Programming, Optimization, Processor scheduling, Kernel, Synchronization, Graphics processing units