Analysis of Spectro-Temporal Modulation Representation for Deep-Fake Speech Detection

Haowei Cheng, Candy Olivia Mawalim, Kai Li, Lijun Wang, Masashi Unoki

2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2023)

Abstract
Deep-fake speech detection aims to develop effective techniques for identifying fake speech generated with advanced deep-learning methods. Such techniques can reduce the negative impact of the malicious production or dissemination of fake speech in real-life scenarios. Although humans can distinguish genuine from fake speech relatively easily thanks to auditory mechanisms, machines find it difficult to do so correctly. One major reason for this challenge is that machines struggle to effectively separate speech content from information about the human vocal system. Common features used in speech processing have difficulty with this separation, which hinders neural networks from learning the discriminative differences between genuine and fake speech. To address this issue, we investigated spectro-temporal modulation representations of genuine and fake speech, which simulate the human auditory perception process. The spectro-temporal modulation representation was then fed into a light convolutional neural network with bidirectional long short-term memory (LCNN-BiLSTM) for classification. We conducted experiments on the benchmark datasets of the Automatic Speaker Verification and Spoofing Countermeasures Challenge 2019 (ASVspoof2019) and the Audio Deep synthesis Detection Challenge 2023 (ADD2023), achieving equal error rates of 8.33% and 42.10%, respectively. The results showed that spectro-temporal modulation representations can distinguish genuine from deep-fake speech and perform adequately on both datasets.
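The feature the abstract describes can be illustrated with a minimal sketch. A common way to build a spectro-temporal modulation representation is a two-stage analysis: a magnitude STFT gives the acoustic-frequency axis, and a second transform along time within each frequency channel gives the temporal-modulation axis. The sketch below, in plain NumPy, is an assumption-laden simplification (frame sizes, windowing, and the block length `mod_frames` are illustrative choices, not the paper's actual configuration):

```python
import numpy as np
from numpy.fft import rfft

def stm_representation(signal, frame_len=400, hop=160, mod_frames=64):
    """Sketch of a spectro-temporal modulation representation:
    1) magnitude STFT over short windowed frames (acoustic frequency axis),
    2) a second FFT along time within each frequency channel
       (temporal modulation axis).
    Parameter values here are illustrative, not from the paper."""
    # Frame the signal with a Hann window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    window = np.hanning(frame_len)
    spec = np.abs(rfft(frames * window, axis=1))   # (frames, freq_bins)
    log_spec = np.log1p(spec)                      # compress dynamic range
    # Second FFT over blocks of frames -> modulation frequencies
    n_blocks = log_spec.shape[0] // mod_frames
    blocks = log_spec[: n_blocks * mod_frames].reshape(n_blocks, mod_frames, -1)
    mod = np.abs(rfft(blocks, axis=1))             # (blocks, mod_freq, freq_bins)
    return mod
```

For a 1 s signal at 16 kHz with these settings, the output is a stack of (modulation frequency x acoustic frequency) maps, which would then be passed to a classifier such as the LCNN-BiLSTM mentioned in the abstract.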