LightCvT - Audio Forgery Detection via Fusion of Light CNN and Transformer.

ICCPR (2021)

Abstract
Advances in machine learning have greatly improved the quality and realism of AI-generated audio, images, and video. Such fake content, however, poses a risk to people's daily lives, so it is important to develop forensic techniques that distinguish real audio from synthesized audio. We observe that deep neural networks show a certain uniformity across modalities: networks used for image forgery detection can also be applied to audio forgery detection. We propose LightCvT, a light transformer-based network for audio forgery detection. LightCvT aims to capture both local features and non-local, long-range dependencies in order to improve detection accuracy. We combine an enhanced Light CNN structure with the Vision Transformer (ViT) to take advantage of both. Specifically, Light CNN's ability to extract local features and ViT's ability to capture long-distance feature dependencies are adaptively complementary, which significantly improves the global perception ability and the discriminability of the system. In addition, we introduce a self-attention mechanism into the convolutional token embedding to further model the relations between local regions. Comprehensive experiments show that the proposed method achieves better performance than state-of-the-art methods.
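The two ingredients named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract gives no layer sizes or fusion details, so everything here is an assumption made for illustration. The sketch uses the Max-Feature-Map (MFM) activation associated with Light CNN for the local step and single-head scaled dot-product self-attention for the long-range step. The function names (`max_feature_map`, `self_attention`, `local_then_global`) are hypothetical, and the query, key, and value projections are taken to be the identity to keep the example short.

```python
import math


def max_feature_map(channels):
    """MFM activation: split the channels into two halves and keep the
    element-wise maximum, halving the channel count."""
    half = len(channels) // 2
    if len(channels) != 2 * half:
        raise ValueError("MFM needs an even number of channels")
    return [
        [max(a, b) for a, b in zip(c1, c2)]
        for c1, c2 in zip(channels[:half], channels[half:])
    ]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: Q, K and V are the tokens themselves (identity projections);
    a trained model would use learned projection matrices instead.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in tokens:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append(
            [sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)]
        )
    return out


def local_then_global(channels):
    """Toy pipeline: MFM on per-channel frame features, then attention
    across frames. Each channel is a list of values, one per time frame."""
    local = max_feature_map(channels)
    # One token per time frame; token dimension = remaining channel count.
    tokens = [list(frame) for frame in zip(*local)]
    return self_attention(tokens)
```

Here `local_then_global` takes a list of channels, each holding one value per time frame, and returns one attended feature vector per frame. Every output vector is a weighted average of all frame tokens, which is the sense in which attention adds long-range context on top of the locally selected features.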