Mass Storage on the Terascale Computing System

MSS(2001)

Abstract
On August 3rd, 2000, the National Science Foundation announced the award of $45 million to the Pittsburgh Supercomputing Center to provide "terascale" computing capability for U.S. researchers in all science and engineering disciplines. This Terascale Computing System (TCS) will be built using commodity components. The computational engines will be Compaq Alpha CPUs in a four-processors-per-node configuration. The final system will have over 682 quad-processor nodes, for a total of more than 2700 Alpha CPUs. Each node will have 4 gigabytes of memory, for a total system memory of over 2.5 terabytes. All the nodes will be interconnected by a high-speed, very low latency Quadrics switch fabric, constructed in a full "fat-tree" topology. Given the very high component count in this system, it is important to architect the solution to be tolerant of failures. Part of this architecture is the efficient saving of a program's memory state to disk. This checkpointing of the program memory should be easy for the programmer to invoke and sufficiently fast to allow for frequent checkpoints, yet it should not severely impact the performance of the compute nodes or file servers. There will be flexibility in the recovery so that spare nodes can be automatically swapped in for failed nodes and the job restarted from the most recent checkpoint without significant user or system management intervention. It is estimated that the file servers required to collect and store these checkpoints and other temporary storage for executing jobs will collectively have ~27 terabytes of disk storage. The file servers will maintain a disk cache that will be migrated to a Hierarchical Storage Manager serving over 300 terabytes.

This paper will discuss the hardware and software architectural design of the TCS machine. The software architecture of this system rests upon Compaq's "AlphaServer SC" software, which includes administration, control, accounting, and scheduling software. We will describe our accounting, scheduling, and monitoring systems and their relation to the software included in AlphaServer SC.
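
The sizing figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The following Python sketch is illustrative only and is not taken from the paper. It uses the lower-bound node count of 682, assumes the binary convention of 1 terabyte = 1024 gigabytes, and derives a rough figure that the abstract does not state: how many full-memory checkpoints the ~27 terabytes of file-server disk could hold. Because that disk also serves as temporary storage for running jobs, the result should be read as an upper bound.

```python
# Figures taken from the abstract (lower bounds where the text says "over").
NODES = 682               # quad-processor nodes
CPUS_PER_NODE = 4         # Alpha CPUs per node
MEM_PER_NODE_GB = 4       # gigabytes of memory per node
CHECKPOINT_DISK_TB = 27   # approximate file-server disk capacity

# Assumption (not from the paper): binary convention, 1 TB = 1024 GB.
GB_PER_TB = 1024


def system_totals(nodes=NODES):
    """Return (total CPUs, total memory in GB, total memory in TB)."""
    cpus = nodes * CPUS_PER_NODE
    mem_gb = nodes * MEM_PER_NODE_GB
    mem_tb = mem_gb / GB_PER_TB
    return cpus, mem_gb, mem_tb


cpus, mem_gb, mem_tb = system_totals()

# Illustrative upper bound: the same disk also holds other temporary data.
full_checkpoints = CHECKPOINT_DISK_TB / mem_tb

print(f"CPUs: {cpus}")
print(f"Memory: {mem_gb} GB = {mem_tb:.2f} TB")
print(f"Full-memory checkpoints that fit on disk: {full_checkpoints:.1f}")
```

Under these assumptions the totals agree with the abstract's "more than 2700" CPUs and "over 2.5 terabytes" of memory, and the file-server disk is roughly ten times the aggregate memory.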
Key words
final system, memory state, program memory, software architectural design, software architecture, system management intervention, total system memory, file server, Alpha CPUs, Compaq Alpha CPU, Mass Storage, Terascale Computing System