Watch, Listen Once, and Sync: Audio-Visual Synchronization with Multi-Modal Regression CNN

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018

Abstract
Recovering audio-visual synchronization is an important task in visual speech processing. In this paper, we present a multi-modal regression model that uses a convolutional neural network (CNN) to recover the audio-visual synchronization of single-person speech videos. The proposed model takes the audio and visual features of multiple frames as input and predicts the number of frames by which the input audio-visual pair has drifted. Because we treat synchronization as a regression problem, the model does not need to search with a sliding window, which would increase the computational cost. Experimental results show that the proposed method outperforms baseline methods in both recovery accuracy and computational cost.
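
To illustrate the computational point, the following is a minimal sketch of the kind of sliding-window search that a regression formulation avoids. It is not the authors' model or any baseline from the paper: the one-dimensional per-frame features are synthetic, and the function name `sliding_window_offset` and the parameters `max_shift` and `true_drift` are hypothetical. The sketch scores every candidate shift separately, so its cost grows with the size of the search range, whereas a regression model would output the drift from a single forward pass.

```python
import numpy as np


def sliding_window_offset(audio, visual, max_shift):
    """Search every shift in [-max_shift, max_shift] and return the best one.

    Returns (best_shift, evaluations), where evaluations counts how many
    candidate shifts had to be scored.
    """
    n = len(visual)
    best_shift, best_score, evaluations = 0, -np.inf, 0
    for shift in range(-max_shift, max_shift + 1):
        # Frames that overlap when the audio track lags by `shift` frames.
        lo, hi = max(0, shift), min(n, n + shift)
        a = audio[lo:hi]
        v = visual[lo - shift:hi - shift]
        # Mean-centered correlation as a simple similarity score (an assumption).
        score = float(np.dot(a - a.mean(), v - v.mean())) / len(a)
        evaluations += 1
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift, evaluations


# Synthetic data: both tracks are slices of one random signal, so that
# audio[t] == visual[t - true_drift], i.e. the audio lags by true_drift frames.
rng = np.random.default_rng(0)
n_frames, max_shift, true_drift = 200, 7, 3
base = rng.standard_normal(n_frames + 2 * max_shift)
visual = base[max_shift:max_shift + n_frames]
audio = base[max_shift - true_drift:max_shift - true_drift + n_frames]

estimated, evaluations = sliding_window_offset(audio, visual, max_shift)
print(f"estimated drift: {estimated} frames after {evaluations} evaluations")
```

The search scores `2 * max_shift + 1` candidates per clip. Under the regression formulation described in the abstract, that loop would be replaced by one prediction of the drifted frame number.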
Keywords
Audio-visual synchronization, visual speech processing, neural networks