Scalable RDMA Transport with Efficient Connection Sharing.

Jian Tang, Xiaoliang Wang, Huichen Dai

INFOCOM (2023)

Abstract
RDMA provides extremely low-latency, high-throughput data transmission because its protocol stack is entirely offloaded onto the RDMA NIC (RNIC). However, the increasing scale of RDMA networks requires hosts to establish a large number of connections, e.g., a process-level full mesh, which easily overwhelms the limited resources on RNICs and hence significantly degrades performance. This paper presents SRM, a scalable transport mode for RDMA that remarkably alleviates resource exhaustion on RNICs. SRM proposes a kernel-based solution that multiplexes workloads from different applications over the same connection. Meanwhile, to preserve RDMA's performance benefits, SRM 1) avoids syscall overhead by sharing the working memory between user space and the kernel; 2) maintains high resource utilization through a lock-free approach that avoids contention; 3) adopts multiple optimizations to mitigate the head-of-line blocking issue; and 4) implements a rapid recovery mechanism to provide high system robustness. We evaluate SRM using extensive experiments and simulations. Testbed experiments reveal that SRM outperforms existing transports, including DCT, RC, and XRC, by 4x to 20x in latency for the all-to-all communication pattern. Simulations of large-scale networks show that, compared with DCT, RC, and XRC, SRM achieves speedups of up to 4.42x, 4.0x, and 3.7x, respectively, in flow completion time while consuming the least memory.
Keywords
RDMA, Datacenter, Scalability