Explaining SSD Failures using Anomaly Detection

Semantic Scholar (2021)

Abstract
NAND flash based solid-state drives (SSDs) represent an important storage tier in data centers, holding most of today's warm and hot data. Even with advanced fault tolerance techniques and low failure rates, large hyperscale data centers operating hundreds of thousands of SSDs suffer multiple device failures daily. Data center operators are interested in predicting SSD device failures for two main reasons. First, even with RAID [2] and replication [5] techniques in place, device failures induce transient recovery and repair overheads, affecting the cost and tail latency of storage systems. Second, predicting near-term failure trends helps inform the device acquisition process, thus avoiding capacity bottlenecks. Hence, it is important to predict both short-term individual device failures and near-term failure trends.

Prior studies on predicting storage device failures [1, 6, 7, 9] face two main challenges. First, because they use black-box machine learning (ML) techniques, they are unaware of the underlying failure reasons, making it difficult to determine which failure types these models can predict. Second, the models in prior work struggle in dynamic environments with previously unseen failures that were not included in the training set. Both challenges are especially relevant for SSD failure detection, which suffers from high class imbalance: the number of healthy-drive observations is generally orders of magnitude larger than the number of failed-drive observations, posing a problem for training most traditional supervised ML models.

To address these challenges, we propose to use 1-class ML models that are trained only on the majority class. By ignoring the minority class during training, our 1-class models avoid overfitting to an incomplete set of failure types, improving overall prediction performance by up to 9.5% in terms of ROC AUC score.
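The 1-class training idea can be sketched with scikit-learn's IsolationForest as a stand-in for the models discussed here: the model is fit on healthy observations only, and its negated normality score serves as an anomaly score for ROC AUC. The synthetic telemetry below is purely illustrative and is not drawn from the Google dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Illustrative stand-in telemetry: healthy drives cluster around low,
# normalized error counts; failed drives drift toward higher values.
healthy = rng.normal(loc=0.1, scale=0.05, size=(2000, 4))
failed = rng.normal(loc=0.5, scale=0.15, size=(20, 4))

# 1-class training: fit on the majority (healthy) class only, so the
# model never overfits to an incomplete set of observed failure types.
model = IsolationForest(random_state=0).fit(healthy)

# score_samples returns higher values for "more normal" points, so the
# negated score works as an anomaly score for ROC AUC.
X = np.vstack([healthy, failed])
y = np.concatenate([np.zeros(len(healthy)), np.ones(len(failed))])
auc = roc_auc_score(y, -model.score_samples(X))
print(f"ROC AUC: {auc:.3f}")
```

Note that the failed drives never appear in the training set; they are used only to evaluate how well the anomaly score separates the two classes.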
Furthermore, we introduce a new learning technique for SSD failure detection, the 1-class autoencoder, which makes the trained models interpretable while providing high prediction accuracy. In particular, 1-class autoencoders reveal which features, and which combinations of features, are most relevant to flagging a particular type of device failure. This enables categorizing failed drives by failure type, thus indicating which procedure (e.g., repair or swap) should be applied to resolve the failure.

For analysis and evaluation of our proposed techniques, we leverage a cloud-scale dataset from Google that has been used in prior work [1, 8]. The dataset contains 40 million observations from over 30,000 drives collected over a period of six years. Each observation comprises 21 SSD telemetry parameters, including SMART (Self-Monitoring, Analysis, and Reporting Technology) parameters, the amount of data read and written, error codes, and information about blocks that became non-operational over time. Around 30% of the drives that failed during data collection were replaced; the rest were removed and hence no longer appear in the dataset. As a result, we obtained approximately 300 observations per healthy drive (40 million observations in total) and 4 to 140 observations per failed drive (15,000 observations in total).

We treated each data point as an independent observation and normalized all non-categorical values to the range 0 to 1. One of our primary goals was to select the most distinguishing features, i.e., those most strongly correlated with failures, for training. We applied three feature selection methods, Filter, Embedded, and Wrapper [4] techniques, to identify the features that contribute most to failures in our dataset.
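As a minimal sketch of the Filter approach, features can be ranked by the absolute value of their Pearson correlation with the failure label. The feature names and synthetic data below are illustrative stand-ins, not the actual dataset or selection pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Illustrative observations: two label-correlated features and one
# unrelated noise feature. y = 1 marks a failed-drive observation.
y = rng.integers(0, 2, size=n)
strong = y * 0.4 + rng.normal(0, 0.1, n)   # e.g., correctable error count
weak = y * 0.2 + rng.normal(0, 0.1, n)     # e.g., erase count
noise = rng.normal(0, 0.1, n)              # unrelated telemetry
X = np.column_stack([noise, strong, weak])
names = ["noise", "correctable_error_count", "erase_count"]

# Filter method: rank features by |Pearson correlation| with the label
# and keep the top-k for training.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
ranked = [names[j] for j in np.argsort(corr)[::-1]]
print(ranked)
```

Embedded and Wrapper methods differ in that they involve a model during selection (e.g., regularization weights or iterative subset search), whereas the Filter step above is model-free and cheap to compute.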
The resulting set of top features was correctable error count, cumulative bad block count, cumulative p/e cycle, erase count, final read error, read count, factory bad block count, write count, and status read-only. The dataset restricted to these top features was then used for training the different ML models.
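The 1-class autoencoder and its per-feature interpretability can be sketched as follows, using scikit-learn's MLPRegressor trained to reconstruct its own input as a minimal stand-in; the three columns and their values are hypothetical, not the paper's actual model or data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical normalized telemetry for healthy drives only; the columns
# stand for correctable error count, erase count, cumulative bad block count.
healthy = rng.uniform(0.0, 0.3, size=(500, 3))

# 1-class autoencoder: an MLP trained to reconstruct healthy inputs only,
# with a 2-unit bottleneck forcing it to learn the healthy-data structure.
ae = MLPRegressor(hidden_layer_sizes=(2,), max_iter=3000, random_state=0)
ae.fit(healthy, healthy)

def per_feature_error(model, X):
    # Squared reconstruction error per feature: the features with the
    # largest errors are the ones flagging the anomaly, which is what
    # supports categorizing failed drives by failure type.
    return (np.asarray(X) - model.predict(X)) ** 2

# A suspect drive whose cumulative bad block count (third column) lies
# far outside the healthy range seen during training.
suspect = np.array([[0.15, 0.10, 0.90]])
print(per_feature_error(ae, suspect))
```

An observation would be flagged when its total reconstruction error exceeds a threshold calibrated on healthy data; the per-feature breakdown then points at the telemetry parameters driving the flag.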