CroMM-VSR: Cross-Modal Memory Augmented Visual Speech Recognition

IEEE Transactions on Multimedia (2022)

Abstract
Visual Speech Recognition (VSR) is the task of recognizing speech as text from the external appearance of the face (i.e., the lips). Since the information in visual lip movements is not sufficient to fully represent speech, VSR is considered a challenging problem. One way to mitigate this is to additionally utilize audio, which carries rich information for speech recognition. However, audio is not always available, for example in crowded situations. It is therefore necessary to find a way to provide enough information for speech recognition from visual inputs alone. In this paper, we alleviate the information insufficiency of visual lip movements by proposing a cross-modal memory augmented VSR with a Visual-Audio Memory (VAM). The proposed framework exploits the complementary information of audio even when no audio input is provided at inference time. Concretely, the VAM learns to imprint short clip-level audio features into a memory network using the corresponding visual features. To this end, the VAM contains two memories: a lip-video key memory and an audio value memory. We guide the audio value memory to imprint the audio features and the lip-video key memory to memorize the locations of the imprinted audio. In this way, the VAM can recall rich audio information by accessing the memory with visual inputs only. Experimental results show that the proposed method achieves state-of-the-art performance on both word- and sentence-level VSR. In addition, we verify that the learned representations inside the VAM contain meaningful information for VSR.
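To make the key-value addressing idea in the abstract concrete, here is a minimal PyTorch sketch of a visual-audio memory: visual features query a bank of lip-video keys, and the attention weights read out the imprinted audio values. All names and sizes (VisualAudioMemory, num_slots, dim) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAudioMemory(nn.Module):
    """Hypothetical sketch of a cross-modal key-value memory."""

    def __init__(self, num_slots: int = 112, dim: int = 512):
        super().__init__()
        # Lip-video key memory: addressed by visual features.
        self.key_memory = nn.Parameter(torch.randn(num_slots, dim))
        # Audio value memory: learns to imprint clip-level audio features.
        self.value_memory = nn.Parameter(torch.randn(num_slots, dim))

    def forward(self, visual_feat: torch.Tensor) -> torch.Tensor:
        # visual_feat: (batch, dim). Score each memory slot by scaled
        # dot-product similarity between the visual query and the keys.
        scores = visual_feat @ self.key_memory.t() / visual_feat.size(-1) ** 0.5
        attn = F.softmax(scores, dim=-1)            # (batch, num_slots)
        # Read out a weighted combination of the stored audio values.
        recalled_audio = attn @ self.value_memory   # (batch, dim)
        return recalled_audio

# During training, the recalled audio would be pulled toward the true
# clip-level audio feature (e.g., via a reconstruction loss), so that at
# inference rich audio information can be read back from visual input alone.
```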
Keywords
Visualization, Speech recognition, Lips, Three-dimensional displays, Training, Face recognition, Feature extraction, Lip-reading, visual speech recognition, cross-modal, visual-audio memory