Listening In The Dips: Comparing Relevant Features For Speech Recognition In Humans And Machines

18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION(2017)

Abstract
In recent years, automatic speech recognition (ASR) systems have gradually narrowed (and for some tasks closed) the gap between human speech recognition (HSR) and ASR. However, it is unclear whether similar performance implies that humans and ASR systems rely on similar signal cues. In the current study, ASR and HSR are compared using speech material from a matrix sentence test mixed with either a stationary speech-shaped noise (SSN) or amplitude-modulated SSN. Recognition performance of HSR and ASR is measured in terms of the speech recognition threshold (SRT), i.e., the signal-to-noise ratio with 50% recognition rate, and by comparing psychometric functions. ASR results are obtained with matched-trained DNN-based systems that use FBank features as input and are compared to results obtained from eight normal-hearing listeners and two established models of speech intelligibility. For both maskers, HSR and ASR achieve similar SRTs with an average deviation of only 0.4 dB. A relevance propagation algorithm is applied to identify features relevant for ASR. The analysis shows that relevant features coincide either with spectral peaks of the speech signal or with dips of the noise masker, indicating that similar cues are important in HSR and ASR.
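As an illustration of the SRT definition used in the abstract, the following minimal Python sketch reads the SRT off a psychometric function as the SNR at which the recognition rate crosses 50%. The logistic shape, the function names, and all numeric values (SNR grid, SRT of -7 dB, slope of 0.15 per dB) are assumptions chosen for demonstration; they are synthetic and are not taken from the paper, which does not describe its fitting procedure here.

```python
import math


def psychometric(snr_db, srt_db, slope):
    """Logistic psychometric function (an assumed shape, not the paper's).

    Returns the recognition rate in [0, 1] at snr_db. `slope` is the
    gradient of the curve at the SRT, in rate per dB.
    """
    return 1.0 / (1.0 + math.exp(-4.0 * slope * (snr_db - srt_db)))


def srt_by_interpolation(snrs_db, rates, target=0.5):
    """Linearly interpolate the SNR at which the rate crosses `target`."""
    points = sorted(zip(snrs_db, rates))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= target <= y1 and y1 > y0:
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("target rate is not bracketed by the measurements")


# Synthetic measurements (hypothetical values, for illustration only).
snrs = [-12.0, -10.0, -8.0, -6.0, -4.0]
rates = [psychometric(s, srt_db=-7.0, slope=0.15) for s in snrs]

srt = srt_by_interpolation(snrs, rates)
print(f"Estimated SRT: {srt:.2f} dB SNR")
```

Applying the same estimate to listener scores and to ASR scores for one masker would give two SRTs whose difference corresponds to the kind of deviation the abstract reports in dB.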
Keywords
man-machine comparison, deep neural networks, automatic speech recognition, relevance propagation