Debugging Transient Faults In Data Center Networks Using Synchronized Network-Wide Packet Histories

PROCEEDINGS OF THE 18TH USENIX SYMPOSIUM ON NETWORKED SYSTEM DESIGN AND IMPLEMENTATION(2021)

引用 0|浏览6
暂无评分
摘要
Data center network faults are hard to debug due to their scale and complexity. With the prevalence of hard-to-reproduce transient faults, root-cause analysis of network faults is extremely difficult due to unavailability of historical data, and inability to correlate the distributed data across the network. Often, it is not possible to find the root cause using only switch-local information. To find the root cause of such transient faults, we need: 1) Visibility: fine-grained, packet-level and network-wide observability, 2) Retrospection: ability to get historical information before the fault occurs, and 3) Correlation: ability to correlate the information across the network.In this work, we present the design and implementation of SyNDB, a tool with the aforementioned capabilities to enable root cause analysis of network faults. We implement and evaluate SyNDB with realistic topologies using large scale simulation and programmable switches. Our evaluations show that SyNDB can capture and correlate packet records over sufficiently large time windows (similar to 4 ms), thus facilitating the root cause analysis of various network faults.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要