Audio-Visual Contrastive and Consistency Learning for Semi-Supervised Action Recognition

Maregu Assefa, Wei Jiang, Jinyu Zhan, Kumie Gedamu, Getinet Yilma, Melese Ayalew, Deepak Adhikari

IEEE TRANSACTIONS ON MULTIMEDIA (2024)

Abstract
Semi-supervised video learning is an increasingly popular approach for improving video understanding tasks by utilizing large-scale unlabeled videos along with a few labels. Recent studies have shown that multimodal contrastive learning and consistency regularization are effective techniques for generating high-quality pseudo-labels for semi-supervised action recognition. However, existing pseudo-labeling approaches are solely based on the model's class predictions and can suffer from confirmation biases due to the accumulation of false predictions. To address this issue, we propose exploiting audio-visual feature correlations to achieve high-quality pseudo-labels instead of relying on model confidence. To achieve this goal, we introduce Audio-visual Contrastive and Consistency Learning (AvCLR) for semi-supervised action recognition. AvCLR generates reliable pseudo-labels from audio-visual feature correlations using deep embedded clustering to mitigate confirmation biases. Additionally, AvCLR introduces two contrastive modules: intra-modal contrastive learning (ImCL) and cross-modal contrastive learning (XmCL) to discover complementary information from audio-visual alignments. The ImCL module learns informative representations within audio and video independently, while the XmCL module aims to leverage global high-level features of audio-visual information. Furthermore, the XmCL is constrained by introducing intra-instance negatives from one modality to the other. We jointly optimize the model with ImCL, XmCL, and consistency regularization in an end-to-end semi-supervised manner. Experimental results have demonstrated that the proposed AvCLR framework is effective in reducing confirmation biases and outperforms existing confidence-based semi-supervised action recognition methods.
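The cross-modal contrastive idea mentioned above can be illustrated with a generic symmetric InfoNCE loss over paired video and audio embeddings. This is a minimal sketch under stated assumptions, not the AvCLR implementation: the abstract does not specify the XmCL loss, and the intra-instance negatives it describes are not modeled here. The function names, the NumPy formulation, and the temperature value of 0.1 are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_modal_info_nce(video_emb, audio_emb, temperature=0.1):
    """Generic symmetric InfoNCE loss (an assumption, not the paper's XmCL).

    Row i of video_emb and row i of audio_emb are treated as a positive pair;
    every other row in the batch acts as a negative.
    """
    v = l2_normalize(np.asarray(video_emb, dtype=float))
    a = l2_normalize(np.asarray(audio_emb, dtype=float))
    logits = v @ a.T / temperature          # (N, N) similarity matrix
    idx = np.arange(v.shape[0])
    loss_v2a = -log_softmax(logits)[idx, idx].mean()    # video -> audio
    loss_a2v = -log_softmax(logits.T)[idx, idx].mean()  # audio -> video
    return 0.5 * (loss_v2a + loss_a2v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(8, 16))
    audio = video + 0.01 * rng.normal(size=(8, 16))  # nearly aligned pairs
    print("aligned loss:   ", cross_modal_info_nce(video, audio))
    print("misaligned loss:", cross_modal_info_nce(video, np.roll(audio, 1, axis=0)))
```

In this sketch, correctly paired clips yield a lower loss than the same batch with the audio rows shifted by one position, which is the alignment signal a cross-modal contrastive objective relies on.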
Keywords
Action recognition, audio-visual learning, contrastive learning, semi-supervised learning