Chrome Extension
WeChat Mini Program
Use on ChatGLM

Error Vulnerabilities and Fault Recovery in Deep-Learning Frameworks for Hardware Accelerators

Iljoo Baek,Zhihao Zhu, Sourav Panda, Nandha Kishore Srinivasan,Soheil Samii,Ragunathan Raj Rajkumar

2020 IEEE 26th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA)(2020)

Cited 2|Views1
No score
Abstract
Hardware accelerators such as GP-GPUs, Tensor Cores, and Deep-Learning Accelerators (DLA) are increasingly being used in real-time settings such as autonomous vehicles (AVs). In such deployments, any software errors and process failures in hardware systems can lead to critical faults in AVs. Therefore, assessing and mitigating hardware accelerator faults are critical requirements for safety-critical systems. Past work on this subject focused on simulated and injected software and hardware faults to understand and analyze the behavior of the software stack and the entire system. However, programming errors and process failures caused when using software frameworks must also be considered. In this paper, we present experiments which show that widely used deep-learning frameworks are vulnerable to programming mistakes and errors. We first focus on memory-related programming errors caused by applications using deep-learning frameworks that facilitate high-performance inferencing. We next find that a reset to recover from any fault imposes significant time penalties in reloading a pre-trained deep neural network model. To reduce these fault recovery times, we propose fault recovery mechanisms that checkpoint and resume the network based on the inference stage when an error is detected. Finally, we substantiate the practical feasibility of our approach and evaluate the improvement in recovery times 1 1 A demo video clip demonstrating our recovery algorithm has been uploaded to Youtube: https://www.youtube.com/watch?v=xwUYdJdA5oM.. We use a case-study with real-world applications on an Nvidia GeForce GTX 1070 GPU and an Nvidia Xavier embedded platform, which is commonly used by multiple automotive OEMs.
More
Translated text
Key words
fault detection,fault recovery,deep learning framework,checkpoint,hardware accelerators
AI Read Science
Must-Reading Tree
Example
Generate MRT to find the research sequence of this paper
Chat Paper
Summary is being generated by the instructions you defined