Improving Visual Speech Recognition for Small-Scale Datasets via Speaker Embedding

Jinghan Wu, Yakun Zhang, Xingyu Zhang, Changyan Zheng, Ye Yan, Erwei Yin

2023 6th International Conference on Software Engineering and Computer Science (CSECS) (2023)

Abstract
Visual speech recognition, also known as lipreading, overcomes the limitations of automatic speech recognition under extreme background noise or when private communication is needed. However, training high-quality lipreading models remains challenging because sufficient lipreading data are lacking for particular languages. In this work, we propose a new lipreading framework that uses speaker identity information to improve speech recognition performance on limited, low-resolution visual data. Using a shared visual encoder, a speaker representation is first pre-trained with speaker identification labels and then integrated with semantic features during content recognition training. The shared visual encoder adopts a widely used visual front-end, a deep convolutional neural network followed by a recurrent neural network, which extracts spatial and temporal features effectively. Meanwhile, supervised pre-training supplies extra information for visual speech recognition, alleviating the performance drop of the lipreading model caused by insufficient training data. Experimental results on the AVLetters dataset show that the proposed framework achieves state-of-the-art recognition, with a best classification accuracy of 70.77%. Given its simplicity and effectiveness, it is a promising lipreading framework for practical use in various application scenarios and languages.
Keywords
lipreading, speaker embedding, supervised pretraining, visual speech recognition, deep learning
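
The two-stage scheme described in the abstract (a shared CNN + RNN visual encoder, speaker identification pre-training, then fusion of the speaker representation with semantic features for content recognition) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation. The class names `SharedVisualEncoder` and `SpeakerAwareLipreader`, the helper `train_step`, all layer sizes, the use of concatenation for fusion, and the choice of 10 speakers, 26 classes, and 60x80 frames are assumptions made only for illustration.

```python
import torch
import torch.nn as nn


class SharedVisualEncoder(nn.Module):
    """CNN front-end followed by a GRU (all sizes are illustrative assumptions)."""

    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        # 3D convolution over (time, height, width) of grayscale mouth crops.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(16, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 1, time, height, width)
        x = self.frontend(frames)               # (B, 16, T, H, W)
        x = x.mean(dim=(3, 4)).transpose(1, 2)  # spatial average -> (B, T, 16)
        out, _ = self.rnn(x)                    # (B, T, 2 * hidden_dim)
        return out.mean(dim=1)                  # temporal average -> (B, 2 * hidden_dim)


class SpeakerAwareLipreader(nn.Module):
    """Shared encoder with a speaker branch and a content branch (hypothetical layout)."""

    def __init__(self, num_speakers: int, num_classes: int,
                 hidden_dim: int = 64, spk_dim: int = 32):
        super().__init__()
        feat_dim = 2 * hidden_dim
        self.encoder = SharedVisualEncoder(hidden_dim)
        self.speaker_embed = nn.Linear(feat_dim, spk_dim)
        self.speaker_head = nn.Linear(spk_dim, num_speakers)        # used in stage 1
        self.content_head = nn.Linear(feat_dim + spk_dim, num_classes)  # used in stage 2

    def forward(self, frames: torch.Tensor):
        feat = self.encoder(frames)
        spk = torch.relu(self.speaker_embed(feat))
        speaker_logits = self.speaker_head(spk)
        # Assumed fusion: concatenate semantic features with the speaker embedding.
        content_logits = self.content_head(torch.cat([feat, spk], dim=-1))
        return speaker_logits, content_logits


def train_step(model, frames, labels, stage, optimizer):
    """One update: stage 'speaker' trains on speaker IDs, 'content' on content labels."""
    speaker_logits, content_logits = model(frames)
    logits = speaker_logits if stage == "speaker" else content_logits
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = SpeakerAwareLipreader(num_speakers=10, num_classes=26)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

frames = torch.randn(4, 1, 12, 60, 80)       # random stand-in for mouth-region clips
speaker_ids = torch.randint(0, 10, (4,))
letters = torch.randint(0, 26, (4,))

loss_spk = train_step(model, frames, speaker_ids, "speaker", optimizer)  # stage 1
loss_txt = train_step(model, frames, letters, "content", optimizer)      # stage 2
```

In this sketch the encoder weights learned during the speaker stage carry over to the content stage because both heads share one encoder. Whether the speaker branch is frozen or fine-tuned in the second stage is not stated in the abstract, so the sketch simply leaves it trainable.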