Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models
ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)(2024)
摘要
The accuracy of end-to-end (E2E) automatic speech recognition (ASR) models
continues to improve as they are scaled to larger sizes, with some now reaching
billions of parameters. Widespread deployment and adoption of these models,
however, requires computationally efficient strategies for decoding. In the
present work, we study one such strategy: applying multiple frame reduction
layers in the encoder to compress encoder outputs into a small number of output
frames. While similar techniques have been investigated in previous work, we
achieve dramatically more reduction than has previously been demonstrated
through the use of multiple funnel reduction layers. Through ablations, we
study the impact of various architectural choices in the encoder to identify
the most effective strategies. We demonstrate that we can generate one encoder
output frame for every 2.56 sec of input speech, without significantly
affecting word error rate on a large-scale voice search task, while improving
encoder and decoder latencies by 48
but computationally expensive baseline.
更多查看译文
关键词
end-to-end ASR,computational latency,runtime efficiency,large models
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要