Evaluating ARM HPC clusters for scientific workloads

Concurrency and Computation: Practice and Experience(2015)

引用 30|浏览80
暂无评分
摘要
The power consumption of modern high-performance computing HPC systems that are built using power hungry commodity servers is one of the major hurdles for achieving Exascale computation. Several efforts have been made by the HPC community to encourage the use of low-powered system-on-chip SoC embedded processors in large-scale HPC systems. These initiatives have successfully demonstrated the use of ARM SoCs in HPC systems, but there is still a need to analyze the viability of these systems for HPC platforms before a case can be made for Exascale computation. The major shortcomings of current ARM-HPC evaluations include a lack of detailed insights about performance levels on distributed multicore systems and performance levels for benchmarking in large-scale applications running on HPC. In this paper, we present a comprehensive evaluation of results that covers major aspects of server and HPC benchmarking for ARM-based SoCs. For the experiments, we built an unconventional cluster of ARM Cortex-A9s that is referred to as Weiser and ran single-node benchmarks STREAM, Sysbench, and PARSEC and multi-node scientific benchmarks High-performance Linpack HPL, NASA Advanced Supercomputing NAS Parallel Benchmark, and Gadget-2 in order to provide a baseline for performance limitations of the system. Based on the experimental results, we claim that the performance of ARM SoCs depends heavily on the memory bandwidth, network latency, application class, workload type, and support for compiler optimizations. During server-based benchmarking, we observed that when performing memory intensive benchmarks for database transactions, x86 performed 12% better for multithreaded query processing. However, ARM performed four times better for performance to power ratios for a single core and 2.6 times better on four cores. We noticed that emulated double precision floating point in Java resulted in three to four times slower performance as compared with the performance in C for CPU-bound benchmarks. Even though Intel x86 performed slightly better in computation-oriented applications, ARM showed better scalability in I/O bound applications for shared memory benchmarks. We incorporated the support for ARM in the MPJ-Express runtime and performed comparative analysis of two widely used message passing libraries. We obtained similar results for network bandwidth, large-scale application scaling, floating-point performance, and energy-efficiency for clusters in message passing evaluations NBP and Gadget 2 with MPJ-Express and MPICH. Our findings can be used to evaluate the energy efficiency of ARM-based clusters for server workloads and scientific workloads and to provide a guideline for building energy-efficient HPC clusters. Copyright © 2015 John Wiley & Sons, Ltd.
更多
查看译文
关键词
energy-efficiency,arm evaluation,multicore cluster,large-scale scientific application
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要