The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023
CoRR (2024)
Abstract
This paper describes the visual speech recognition (VSR) system submitted
by the NPU-ASLP-LiAuto team (Team 237) to the first Chinese Continuous Visual
Speech Recognition Challenge (CNVSRC) 2023. The team took part in the fixed and
open tracks of the Single-Speaker VSR Task and the open track of the
Multi-Speaker VSR Task. For data processing, we use the lip motion extractor
from the challenge baseline to produce multi-scale video data. In addition,
various augmentation techniques are
applied during training, encompassing speed perturbation, random rotation,
horizontal flipping, and color transformation. The VSR model adopts an
end-to-end architecture with joint CTC/attention loss, comprising a ResNet3D
visual frontend, an E-Branchformer encoder, and a Transformer decoder.
Experiments show that our system achieves 34.76% CER for the Single-Speaker
Task and 41.06% CER for the Multi-Speaker Task after multi-system fusion,
ranking first in all three tracks in which we participated.
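
The training-time augmentations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`horizontal_flip`, `speed_perturb`, `augment`), the flip probability, and the speed factors are all assumptions chosen for illustration, and a clip is modeled simply as a list of frames, each frame being a list of pixel rows. Only horizontal flipping and speed perturbation are shown; rotation and color transformation are omitted.

```python
import random


def horizontal_flip(frames):
    """Mirror every frame left-to-right (a frame is a list of pixel rows)."""
    return [[row[::-1] for row in frame] for frame in frames]


def speed_perturb(frames, factor):
    """Resample the clip by nearest-index selection.

    factor > 1 speeds the clip up (fewer frames); factor < 1 slows it down.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not frames:
        return []
    n_out = max(1, round(len(frames) / factor))
    last = len(frames) - 1
    return [frames[min(last, int(i * factor))] for i in range(n_out)]


def augment(frames, rng, flip_prob=0.5, factors=(0.9, 1.0, 1.1)):
    """Apply a random speed factor, then flip with probability flip_prob.

    flip_prob and factors are hypothetical defaults, not values from the paper.
    """
    out = speed_perturb(frames, rng.choice(factors))
    if rng.random() < flip_prob:
        out = horizontal_flip(out)
    return out


if __name__ == "__main__":
    clip = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # two 2x2 frames
    print(augment(clip, random.Random(0)))
```

Flipping is applied to the whole clip rather than per frame so that the lip motion stays temporally consistent, which is the usual choice for video augmentation.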