Locating Anomaly Clues for Atypical Anomalous Services: An Industrial Exploration
IEEE Transactions on Dependable and Secure Computing (2023)
Abstract
Continuity and steadiness are vital for services with massive user bases, so service anomalies must be detected and resolved in a timely manner. Our previous work proposed a tool, ImpAPTr (Impact Analysis based on Pruning Tree), to identify combinations of multiple dimensional attributes as clues leading to the root cause of service anomalies. However, ImpAPTr applies a threshold-driven strategy: it must be triggered by a $\geq 0.05\%$ drop in the success rate of service calls (abbr. SRSC), which causes problems in an atypical yet pervasive situation in field applications. For example, a combination of trivial anomalies (i.e., each causing a drop of less than 0.05% in SRSC) can lead to a drop of far more than 0.05% in SRSC. In addition, a suitable threshold is usually hard to determine. To address these problems, we propose a new method, ImpAPTr+, that removes the constraint of the 0.05% threshold. The basic idea is to incorporate the time dimension and identify clues across multiple time intervals of data. We evaluated three typical methods (ImpAPTr+, R-Adtributor, and Squeeze) on both a production-environment dataset and a simulation dataset. The former is retrieved directly from the service monitoring data of Meituan, one of the largest on-line service providers worldwide. The latter is also constructed from monitoring data of the same company. The results indicate that: (1) ImpAPTr+ outperforms previous approaches to a large degree in terms of accuracy; (2) both ImpAPTr+ and R-Adtributor are able to find proper clues within seconds; (3) ImpAPTr+ tends to find proper clues with shorter time intervals (i.e., less data), which implies that the method is more suitable for near-real-time monitoring scenarios.
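To make the threshold problem concrete, the following is a minimal arithmetic sketch, not the logic of ImpAPTr or ImpAPTr+. It assumes that failures from disjoint attribute combinations add up, and every name and number in it (the call volume, the attribute combinations, the failure counts) is hypothetical; only the 0.05% threshold comes from the abstract.

```python
# Illustrative only: shows how several sub-threshold drops in SRSC
# can combine into a drop well above the 0.05% trigger.
THRESHOLD = 0.0005  # 0.05% expressed as a fraction (value from the abstract)


def srsc_drop(extra_failures: int, total_calls: int) -> float:
    """Drop in success rate caused by `extra_failures` additional failed calls."""
    return extra_failures / total_calls


def exceeds_threshold(extra_failures: int, total_calls: int) -> bool:
    """True if the drop would trigger a threshold-driven detector."""
    return srsc_drop(extra_failures, total_calls) >= THRESHOLD


# Hypothetical numbers: one monitoring interval with one million calls.
total_calls = 1_000_000

# Hypothetical extra failures attributed to three disjoint attribute combinations.
extra_failures_by_cause = {
    "city=A,os=ios": 300,        # 0.03% drop on its own
    "city=B,os=android": 300,    # 0.03% drop on its own
    "city=C,network=4g": 300,    # 0.03% drop on its own
}

# Each cause alone stays below the trigger...
individual = {
    cause: exceeds_threshold(n, total_calls)
    for cause, n in extra_failures_by_cause.items()
}

# ...but together they produce a 0.09% drop, well above it.
combined = exceeds_threshold(sum(extra_failures_by_cause.values()), total_calls)

print(individual)
print(combined)
```

In this toy setting, no single cause crosses the trigger even though the overall drop does, which is the kind of situation the abstract describes as motivating a method that is not bound to the fixed threshold.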
Keywords
Anomaly clues locating, multiple dimensional attributes, on-line service monitoring