Chrome Extension
WeChat Mini Program
Use on ChatGLM

Fault Modeling of Extreme Scale Applications Using Machine Learning

2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)(2016)

Cited 27|Views104
No score
Abstract
Faults are commonplace in large scale systems. These systems experience a variety of faults such as transient, permanent and intermittent. Multi-bit faults are typically not corrected by the hardware resulting in an error. This paper attempts to answer an important question: Given a multi-bit fault in main memory, will it result in an application error - and hence a recovery algorithm should be invoked - or can it be safely ignored? We propose an application fault modeling methodology to answer this question. Given a fault signature (a set of attributes comprising of system and application state), we use machine learning to create a model which predicts whether a multi-bit permanent/transient main memory fault will likely result in error. We present the design elements such as the fault injection methodology for covering important data structures, the application and system attributes which should be used for learning the model, the supervised learning algorithms (and potentially ensembles), and important metrics. We use three applications - NWChem, LULESH and SVM - as examples for demonstrating the effectiveness of the proposed fault modeling methodology.
More
Translated text
Key words
Fault,Modeling,Machine Learning,Applications,Exascale
AI Read Science
Must-Reading Tree
Example
Generate MRT to find the research sequence of this paper
Chat Paper
Summary is being generated by the instructions you defined