Near-Optimal Fault Tolerance for Efficient Batch Matrix Multiplication via an Additive Combinatorics Lens

CoRR (2023)

Abstract
Fault tolerance is a major concern in distributed computational settings. In the classic master-worker setting, a server (the master) needs to perform a heavy computation, which it may distribute to m other machines (workers) in order to reduce the running time. In this setting, it is crucial that the computation is robust to failed workers, so that the master can retrieve the result of the joint computation despite failures. A prime complexity measure is thus the recovery threshold, which is the number of workers that the master needs to wait for in order to derive the output. This is the counterpart to the number of failed workers that it can tolerate. In this paper, we address the fundamental and well-studied task of matrix multiplication. Specifically, our focus is on the case in which the master needs to multiply a batch of n pairs of matrices. Several coding techniques have proven successful in reducing the recovery threshold for this task, and one approach that is also very efficient in terms of computation time is called Rook Codes. The previously best known recovery threshold for batch matrix multiplication using Rook Codes is O(n^{\log_2 3}) = O(n^{1.585}). Our main contribution is a lower bound proof showing that any Rook Code for batch matrix multiplication must have a recovery threshold of ω(n), that is, superlinear in n. Notably, we employ techniques from Additive Combinatorics in order to prove this, which may be of further interest. Moreover, we show a Rook Code that achieves a recovery threshold of n^{1+o(1)}, establishing a near-optimal answer to the fault tolerance of this coding scheme.
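To make the notion of a recovery threshold concrete, the following is a minimal sketch of a generic polynomial-coded batch multiplication. It is an illustration only and is not claimed to be the paper's Rook Code construction: the exponent choice p_i = i, q_j = n*j, the evaluation points, and all function names are assumptions introduced here. With this naive choice every exponent sum p_i + q_j is distinct, the product polynomial has degree n^2 - 1, and the master can decode from any n^2 worker results, so the threshold is n^2 rather than the much smaller values discussed in the abstract.

```python
from fractions import Fraction
from itertools import combinations


def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def encode(mats, exps, x):
    """Evaluate the matrix polynomial sum_i mats[i] * x**exps[i] at the point x."""
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(M[r][c] * x ** e for M, e in zip(mats, exps))
             for c in range(cols)] for r in range(rows)]


def solve_vandermonde(xs, ys):
    """Exact coefficients of the polynomial of degree < len(xs) through (xs, ys)."""
    k = len(xs)
    M = [[Fraction(x) ** j for j in range(k)] + [Fraction(y)]
         for x, y in zip(xs, ys)]
    for col in range(k):  # Gauss-Jordan elimination over the rationals
        piv = next(r for r in range(col, k) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        inv = 1 / M[col][col]
        M[col] = [v * inv for v in M[col]]
        for r in range(k):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[r][k] for r in range(k)]


def decode(xs, results, p, q):
    """Recover every product A_i B_i from worker results at the points xs."""
    rows, cols = len(results[0]), len(results[0][0])
    out = [[[None] * cols for _ in range(rows)] for _ in range(len(p))]
    for r in range(rows):
        for c in range(cols):
            coeffs = solve_vandermonde(xs, [R[r][c] for R in results])
            for i in range(len(p)):
                # A_i B_i is the coefficient of x**(p_i + q_i)
                out[i][r][c] = coeffs[p[i] + q[i]]
    return out


# Hypothetical batch of n = 2 pairs of 2x2 matrices.
A = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
B = [[[5, 6], [7, 8]], [[2, 0], [0, 2]]]
n = len(A)
p = list(range(n))               # assumed exponents for the A_i
q = [n * j for j in range(n)]    # assumed exponents for the B_j
threshold = max(p) + max(q) + 1  # degree + 1, which equals n * n here

points = list(range(1, 7))       # m = 6 workers, one evaluation point each
worker_out = {x: matmul(encode(A, p, x), encode(B, q, x)) for x in points}
expected = [matmul(a, b) for a, b in zip(A, B)]

# Any `threshold` of the m workers suffice; the remaining ones may fail.
all_ok = all(
    decode(list(s), [worker_out[x] for x in s], p, q) == expected
    for s in combinations(points, threshold)
)
print(threshold, all_ok)
```

In this sketch the master tolerates m - threshold = 2 failed workers. Reducing the threshold amounts to choosing exponent sets whose sums span a smaller range while each p_i + q_i still stays distinct from every other sum, which is presumably where the additive-combinatorics viewpoint mentioned in the abstract enters.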