Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass
CoRR (2024)
Abstract
Many applications today provide users with multiple auto-complete drafts as
they type, including GitHub's code completion, Gmail's smart compose, and
Apple's messaging auto-suggestions. Under the hood, language models support
this by running an autoregressive inference pass to provide a draft.
Consequently, providing k drafts to the user requires running an expensive
language model k times. To alleviate the computation cost of running k
inference passes, we propose Superposed Decoding, a new decoding algorithm that
generates k drafts at the computation cost of one autoregressive inference
pass. We achieve this by feeding a superposition of the most recent token
embeddings from the k drafts as input to the next decoding step of the
language model. At every inference step we combine the k drafts with the
top-k tokens to get k^2 new drafts and cache the k most likely options,
using an n-gram interpolation with minimal compute overhead to filter out
incoherent generations. Our experiments show that k drafts from Superposed
Decoding are at least as coherent and factual as Nucleus Sampling and Greedy
Decoding respectively, while being at least 2.44× faster for k≥3. In
a compute-normalized setting, user evaluations demonstrably favor text
generated by Superposed Decoding over Nucleus Sampling. Code and more examples
are open-sourced at https://github.com/RAIVNLab/SuperposedDecoding.
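The abstract's core mechanism (superposing the drafts' most recent token embeddings, running one forward pass, then expanding k drafts × top-k tokens into k² candidates and keeping the best k) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the toy model (ToyLM), the function name superposed_step, and the uniform mixture weights are assumptions, and the n-gram interpolation used to filter incoherent generations is omitted; see the linked repository for the real code.

```python
# Minimal sketch of one Superposed Decoding step (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, K = 100, 32, 3  # small sizes for the sketch

class ToyLM(nn.Module):
    """Stand-in language model: token embeddings in, next-token logits out."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, inputs_embeds):           # (batch, seq, dim)
        return self.head(inputs_embeds)         # (batch, seq, vocab)

def superposed_step(model, drafts, scores, weights):
    """Advance K drafts using a single forward pass.

    drafts:  (K, T) token ids of the current drafts
    scores:  (K,)  cumulative log-probabilities of the drafts
    weights: (K,)  mixture weights for the superposition (sum to 1)
    """
    # 1. Superpose the most recent token embeddings of the K drafts.
    last_embeds = model.embed(drafts[:, -1])                # (K, dim)
    superposed = (weights[:, None] * last_embeds).sum(0)    # (dim,)

    # 2. One forward pass on the superposed embedding.
    logits = model(superposed[None, None, :])[0, -1]        # (vocab,)
    log_probs = F.log_softmax(logits, dim=-1)

    # 3. Expand K drafts x top-K tokens into K^2 candidates, keep the best K.
    top_lp, top_tok = log_probs.topk(K)                     # (K,), (K,)
    cand_scores = scores[:, None] + top_lp[None, :]         # (K, K)
    flat = cand_scores.flatten()
    best = flat.topk(K).indices
    draft_idx, tok_idx = best // K, best % K

    new_drafts = torch.cat(
        [drafts[draft_idx], top_tok[tok_idx].unsqueeze(1)], dim=1)
    return new_drafts, flat[best]

model = ToyLM()
drafts = torch.randint(0, VOCAB, (K, 4))      # K partial drafts of length 4
scores = torch.zeros(K)
weights = torch.full((K,), 1.0 / K)           # uniform superposition weights
drafts, scores = superposed_step(model, drafts, scores, weights)
print(drafts.shape, scores)                   # torch.Size([3, 5]) plus scores
```

In the sketch, candidate drafts are ranked purely by cumulative log-probability; the paper additionally interpolates these scores with an n-gram model at negligible compute cost to discard incoherent continuations.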