Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass

Ethan Shen, Alan Fan,Sarah M. Pratt,Jae Sung Park,Matthew Wallingford,Sham M. Kakade,Ari Holtzman,Ranjay Krishna,Ali Farhadi,Aditya Kusupati

CoRR（2024）

Cited 0|Views6

No score

Abstract

Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, language models support this by running an autoregressive inference pass to provide a draft. Consequently, providing k drafts to the user requires running an expensive language model k times. To alleviate the computation cost of running k inference passes, we propose Superposed Decoding, a new decoding algorithm that generates k drafts at the computation cost of one autoregressive inference pass. We achieve this by feeding a superposition of the most recent token embeddings from the k drafts as input to the next decoding step of the language model. At every inference step we combine the k drafts with the top-k tokens to get k^2 new drafts and cache the k most likely options, using an n-gram interpolation with minimal compute overhead to filter out incoherent generations. Our experiments show that k drafts from Superposed Decoding are at least as coherent and factual as Nucleus Sampling and Greedy Decoding respectively, while being at least 2.44× faster for k≥3. In a compute-normalized setting, user evaluations demonstrably favor text generated by Superposed Decoding over Nucleus Sampling. Code and more examples open-sourced at https://github.com/RAIVNLab/SuperposedDecoding.

Translated text

AI Read Science

Must-Reading Tree

Example

Generate MRT to find the research sequence of this paper

Chat Paper

Summary is being generated by the instructions you defined