Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
CoRR (2024)
Abstract
Autoregressive decoding of large language models (LLMs) is memory bandwidth
bounded, resulting in high latency and significant waste of the parallel
processing power of modern accelerators. Existing methods for accelerating LLM
decoding often require a draft model (e.g., speculative decoding), which is
nontrivial to obtain and unable to generalize. In this paper, we introduce
Lookahead decoding, an exact, parallel decoding algorithm that accelerates LLM
decoding without needing auxiliary models or data stores. It allows trading
per-step log(FLOPs) to reduce the number of total decoding steps, is more
parallelizable on single or multiple modern accelerators, and is compatible
with concurrent memory-efficient attention (e.g., FlashAttention). Our
implementation of Lookahead decoding can speed up autoregressive decoding by up
to 1.8x on MT-bench and 4x with strong scaling on multiple GPUs in code
completion tasks. Our code is available at
https://github.com/hao-ai-lab/LookaheadDecoding
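
The sketch below is a minimal, self-contained illustration of how exact parallel decoding is possible at all, not the authors' implementation: Lookahead decoding builds on Jacobi-style fixed-point iteration, where a window of future tokens is guessed and refined in parallel at each step, and the longest prefix that agrees with greedy autoregressive decoding is accepted. The next_token function, prompt, and window size are hypothetical stand-ins for an LLM's greedy next-token prediction; the paper's full algorithm further caches n-grams from the Jacobi trajectory (lookahead and verification branches) to converge faster.

# Toy sketch (assumption: a deterministic stand-in "model"; not the paper's code).

def next_token(prefix):
    # Hypothetical toy model: greedy next token depends only on the last token.
    return max(prefix[-1] - 1, 0)

def autoregressive_decode(prompt, n_new):
    # Baseline: one sequential model call per generated token.
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq[len(prompt):]

def jacobi_parallel_decode(prompt, n_new):
    # Guess all n_new future tokens at once, then refine them in parallel steps.
    seq = list(prompt)
    guess = [0] * n_new          # arbitrary initial guess for the whole window
    accepted = 0                 # tokens verified to match greedy decoding
    steps = 0
    while accepted < n_new:
        steps += 1
        # One "parallel forward pass": predict the token following every prefix
        # formed by the prompt, the accepted tokens, and the current guesses.
        new_guess = [next_token(seq + guess[:i]) for i in range(accepted, n_new)]
        # Accept the longest prefix of the window that is already a fixed point;
        # those tokens provably equal the greedy autoregressive output.
        matched = accepted
        for old, new in zip(guess[accepted:], new_guess):
            if old != new:
                break
            matched += 1
        guess[accepted:] = new_guess
        accepted = min(matched + 1, n_new)   # at least one token verified per step
    return guess, steps

if __name__ == "__main__":
    prompt, n_new = [5], 8
    baseline = autoregressive_decode(prompt, n_new)
    parallel, steps = jacobi_parallel_decode(prompt, n_new)
    assert baseline == parallel              # exact: the output is unchanged
    print(f"{n_new} tokens in {steps} parallel steps (vs. {n_new} sequential calls)")

Running this toy prints 5 parallel steps for 8 tokens, illustrating the trade described in the abstract: each parallel step does more FLOPs than a single autoregressive call, but the total number of sequential steps drops while the decoded output stays identical to greedy decoding.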