Parallel Compression Checkpointing for Socket-Level Heterogeneous Systems

High Performance Computing and Communications (2011)

Abstract
Checkpointing is an effective fault-tolerance technique for improving the reliability of large-scale parallel computing systems. However, checkpointing causes a large number of compute nodes to store a huge amount of data into the file system simultaneously. This not only requires a large amount of storage space to hold the system state, but also puts tremendous pressure on the communication network and I/O subsystem, because a massive number of accesses are concentrated in a short period of time. Data compression can reduce the size of the checkpoint data that must be saved in the file system and sent through the communication network. However, compression induces a large time overhead, especially in large-scale parallel systems, which is the main technical barrier to its practical use. In this paper, we propose a parallel compression checkpointing technique that reduces this time overhead on socket-level heterogeneous architectures. It integrates a number of parallel processing techniques, including transferring checkpoint data between the CPU, GPU, and file system through double-buffered pipelines, aggregating file write operations, and running a SIMD parallel compression algorithm on the GPU. The paper also reports an implementation of the technique on the Tianhe-1 supercomputer and evaluation experiments on that system. The experimental data show that the technique is efficient and practical.
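The pipeline idea in the abstract can be illustrated with a minimal CPU-only sketch. It is not the paper's implementation: `zlib` stands in for the GPU SIMD compressor, a Python thread stands in for the I/O stage, and the names `checkpoint`, `restore`, `CHUNK_SIZE`, `NUM_BUFFERS`, and `AGGREGATE` are hypothetical. A bounded queue of depth two plays the role of the double buffer, so compression of one chunk can overlap with writing of the previous one, and several compressed chunks are joined into a single write call to imitate write aggregation.

```python
import io
import queue
import threading
import zlib

CHUNK_SIZE = 64 * 1024   # hypothetical chunk size, not taken from the paper
NUM_BUFFERS = 2          # depth of the hand-off queue: "double buffering"
AGGREGATE = 4            # hypothetical number of chunks merged per write call


def checkpoint(state: bytes, sink, chunk_size: int = CHUNK_SIZE) -> int:
    """Compress `state` chunk by chunk and write it to `sink`.

    Compression (producer) and writing (consumer) run concurrently and are
    coupled by a queue holding at most NUM_BUFFERS compressed chunks.
    Returns the number of bytes written to `sink`.
    """
    handoff = queue.Queue(maxsize=NUM_BUFFERS)
    written = [0]

    def flush(pending):
        # One write call for several chunks imitates write aggregation.
        blob = b"".join(pending)
        sink.write(blob)
        written[0] += len(blob)
        pending.clear()

    def writer():
        pending = []
        while True:
            chunk = handoff.get()
            if chunk is None:          # sentinel: producer is done
                break
            # Length-prefix each compressed chunk so it can be restored.
            pending.append(len(chunk).to_bytes(4, "little") + chunk)
            if len(pending) == AGGREGATE:
                flush(pending)
        if pending:
            flush(pending)

    consumer = threading.Thread(target=writer)
    consumer.start()
    for offset in range(0, len(state), chunk_size):
        # zlib is a CPU stand-in for the GPU compressor described in the text.
        handoff.put(zlib.compress(state[offset:offset + chunk_size]))
    handoff.put(None)
    consumer.join()
    return written[0]


def restore(source) -> bytes:
    """Read back a stream produced by `checkpoint` and decompress it."""
    out = bytearray()
    while True:
        header = source.read(4)
        if not header:
            break
        size = int.from_bytes(header, "little")
        out += zlib.decompress(source.read(size))
    return bytes(out)
```

A caller would pass any object with a `write` method as `sink`, for example an `io.BytesIO` buffer or an open binary file, and later hand a readable stream over the same bytes to `restore`. The queue depth bounds memory use: the producer blocks once two compressed chunks are waiting, which is the same back-pressure a pair of fixed buffers would give.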
Key words
system state, file system, checkpoint data, SIMD parallel compression algorithm, large scale parallel system, data compression, large scale parallel computing, Parallel Compression Checkpointing, Socket-Level Heterogeneous Systems, communication network, experiment data, Tianhe-1 supercomputer system