Near-Optimal Fault Tolerance for Efficient Batch Matrix Multiplication via an Additive Combinatorics Lens
CoRR (2023)
Abstract
Fault tolerance is a major concern in distributed computational settings. In
the classic master-worker setting, a server (the master) needs to perform some
heavy computation, which it may distribute to m other machines (workers) in
order to reduce the running time. In this setting, it is crucial that the
computation is made robust to failed workers, in order for the master to be
able to retrieve the result of the joint computation despite failures. A prime
complexity measure is thus the recovery threshold, which is the number
of workers that the master needs to wait for in order to derive the output.
This is the counterpart to the number of failed workers that it can tolerate.
In this paper, we address the fundamental and well-studied task of matrix
multiplication. Specifically, we focus on the case where the master needs to multiply
a batch of n pairs of matrices. Several coding techniques have been proven
successful in reducing the recovery threshold for this task, and one approach
that is also very efficient in terms of computation time is called Rook
Codes. The previously best known recovery threshold for batch matrix
multiplication using Rook Codes is $O(n^{\log_2 3}) = O(n^{1.585})$.
Our main contribution is a lower bound showing that any Rook Code for
batch matrix multiplication must have a recovery threshold that is
ω(n). Notably, we employ techniques from Additive Combinatorics in order
to prove this, which may be of further interest. Moreover, we show a Rook Code
that achieves a recovery threshold of $n^{1+o(1)}$, establishing a near-optimal
answer to the fault tolerance of this coding scheme.