Fathom: Understanding Datacenter Application Network Performance

PROCEEDINGS OF THE 2023 ACM SIGCOMM 2023 CONFERENCE, SIGCOMM 2023(2023)

引用 0|浏览27
暂无评分
摘要
We describe our experience with Fathom, a system for identifying the network performance bottlenecks of any service running in the Google fleet. Fathom passively samples RPCs, the principal unit of work for services. It segments the overall latency into host and network components with kernel and RPC stack instrumentation. It records these detailed latency metrics, along with detailed transport connection state, for every sampled RPC. This lets us determine if the completion is constrained by the client, network or server. To scale while enabling analysis, we also aggregate samples into distributions that retain multi-dimensional breakdowns. This provides us with a macroscopic view of individual services. Fathom runs globally in our datacenters for all production traffic, where it monitors billions of TCP connections 24x7. For five years Fathom has been our primary tool for troubleshooting service network issues and assessing network infrastructure changes. We present case studies to show how it has helped us improve our production services.
更多
查看译文
关键词
Networking,Performance,Measurements,Datacenter
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要