Multilingual Visual Speech Recognition with a Single Model by Learning with Discrete Visual Speech Units
CoRR (2024)
Abstract
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) with a single model for the first time. Because massive multilingual modeling of visual data incurs huge computational costs, we propose a novel strategy: processing with visual speech units. Motivated by the recent success of audio speech units, the proposed visual speech units are obtained by discretizing visual speech features extracted from a self-supervised visual speech model. To correctly capture multilingual visual speech, we first train the self-supervised visual speech model on 5,512 hours of multilingual audio-visual data. Through analysis, we verify that the visual speech units mainly contain viseme information while suppressing non-linguistic information.
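The unit derivation admits a compact illustration. Below is a minimal sketch, assuming per-frame features from a frozen, pretrained AV-HuBERT-style visual encoder and k-means clustering, as is standard for audio speech units; `encoder`, the unit count, and the helper names are hypothetical stand-ins rather than the paper's exact pipeline.

```python
# Sketch: deriving discrete visual speech units by clustering features
# from a frozen self-supervised visual encoder. `encoder` is a
# hypothetical callable mapping a lip-crop video to per-frame features.
import numpy as np
from sklearn.cluster import KMeans

def extract_features(videos, encoder):
    """Stack per-frame features from all videos into one (T_total, dim) array."""
    return np.concatenate([encoder(v) for v in videos], axis=0)

def fit_unit_vocabulary(features, num_units=1000):
    """Cluster the features; the cluster IDs become the unit vocabulary."""
    km = KMeans(n_clusters=num_units, n_init=4, random_state=0)
    km.fit(features)
    return km

def to_units(video, encoder, km):
    """Map one video to its discrete visual speech unit sequence."""
    feats = encoder(video)      # (num_frames, dim) continuous features
    return km.predict(feats)    # (num_frames,) integer cluster IDs
```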
By using the visual speech units as the inputs of our system, we pre-train the model to predict the corresponding text outputs on massive multilingual data constructed by merging several VSR databases. As both the inputs and outputs are discrete, we can greatly improve training efficiency compared to standard VSR training; specifically, the input data size is reduced to 0.016% of that of the original video inputs.
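To see why discrete inputs shrink the data so drastically, consider a back-of-envelope comparison between raw pixel input and a unit sequence. The figures below are illustrative assumptions (25 fps RGB 96x96 lip crops, one 4-byte unit per frame), not the paper's accounting, but they land in the same ballpark as the reported 0.016%.

```python
# Illustrative size comparison: raw video frames vs. discrete units.
fps = 25
raw_bytes_per_sec = 96 * 96 * 3 * fps   # pixel input: 691,200 bytes/s (assumed crop size)
unit_bytes_per_sec = fps * 4            # one int32 unit per frame: 100 bytes/s
ratio = unit_bytes_per_sec / raw_bytes_per_sec
print(f"{ratio:.5%}")                   # ~0.01447%, same order as the reported 0.016%
```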
To complement the insufficient visual information in speech recognition, we apply curriculum learning, in which the inputs to the system begin as audio-visual speech units and gradually change to visual speech units. After pre-training, the model is finetuned on continuous features. We set new state-of-the-art multilingual VSR performance by achieving results comparable to previous language-specific VSR models with a single trained model.
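A minimal sketch of such a curriculum schedule follows; the linear annealing and all names here are assumptions chosen for illustration, not the paper's exact schedule.

```python
import random

def curriculum_inputs(av_units, v_units, step, total_steps):
    """Start pre-training on audio-visual units and anneal toward
    visual-only units. The linear ramp is an assumed schedule."""
    p_visual = min(1.0, step / total_steps)  # probability grows 0 -> 1
    return v_units if random.random() < p_visual else av_units
```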