Non-autoregressive Sequence-to-Sequence Vision-Language Models
CVPR 2024
Abstract
Sequence-to-sequence vision-language models are showing promise, but their
applicability is limited by their inference latency due to their autoregressive
way of generating predictions. We propose a parallel decoding
sequence-to-sequence vision-language model, trained with a Query-CTC loss, that
marginalizes over multiple inference paths in the decoder. This allows us to
model the joint distribution of tokens, rather than restricting to the
conditional distribution as in an autoregressive model. The resulting model,
NARVL, achieves performance on par with its state-of-the-art autoregressive
counterpart, but is faster at inference time, replacing the linear
complexity of sequential token generation with constant-time joint
inference.
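
To make the idea of marginalizing over multiple inference paths concrete, here is a minimal sketch of the standard CTC forward recursion together with a one-shot parallel decode. It illustrates ordinary CTC only; the abstract does not specify how Query-CTC differs, so nothing here should be read as the authors' loss. The function names, the blank index 0, and the toy uniform distribution are all assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over a few scalars."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_log_likelihood(log_probs, target, blank=0):
    """Log-probability of `target`, summed over every alignment path.

    log_probs[t][k] is the log-probability of symbol k at output slot t,
    with all slots predicted independently (in parallel).
    `blank` is an assumed index for the CTC blank symbol.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for sym in target:
        ext += [sym, blank]
    num_slots, num_states = len(log_probs), len(ext)

    alpha = [NEG_INF] * num_states
    alpha[0] = log_probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, num_slots):
        new_alpha = [NEG_INF] * num_states
        for s in range(num_states):
            candidates = [alpha[s]]                 # stay on the same state
            if s >= 1:
                candidates.append(alpha[s - 1])     # advance by one state
            # Skip over a blank only between two different non-blank symbols.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new_alpha[s] = logsumexp(*candidates) + log_probs[t][ext[s]]
        alpha = new_alpha

    # Valid paths end on the last symbol or on the trailing blank.
    if num_states > 1:
        return logsumexp(alpha[-1], alpha[-2])
    return alpha[-1]


def ctc_greedy_decode(log_probs, blank=0):
    """Take the argmax of every slot at once, merge repeats, drop blanks."""
    best = [max(range(len(row)), key=row.__getitem__) for row in log_probs]
    out, prev = [], None
    for sym in best:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out


# Toy case: 2 output slots, vocabulary {0: blank, 1: "a"}, uniform scores.
uniform = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
likelihood = math.exp(ctc_log_likelihood(uniform, [1]))
```

In the toy case, three paths collapse to the target `[1]` (`a·blank`, `blank·a`, `a·a`), each with probability 0.25, so the marginal likelihood is 0.75. The decode step needs no left-to-right dependency between slots, which is the property that removes the per-token sequential loop of an autoregressive decoder.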