ARET - Aggregated Residual Extended Time-Delay Neural Networks for Speaker Verification.

INTERSPEECH(2020)

Abstract
The time-delay neural network (TDNN) is widely used in speaker verification to extract long-term temporal features of speakers. Although common TDNN approaches capture time-sequential information well, they lack the delicate transformations needed for deep representation. To solve this problem, we propose two TDNN architectures. RET integrates shortcut connections into conventional time-delay blocks, and ARET adopts a split-transform-merge strategy to extract a more discriminative representation. Experiments on the VoxCeleb datasets without augmentation indicate that ARET achieves satisfactory performance on the VoxCeleb1 test set, VoxCeleb1-E, and VoxCeleb1-H, with equal error rates (EER) of 1.389%, 1.520%, and 2.614%, respectively. Compared to state-of-the-art results on these test sets, RET achieves a 23%~43% relative reduction in EER, and ARET reaches 32%~45%.
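The two ideas in the abstract, shortcut connections around time-delay blocks (RET) and a split-transform-merge aggregation (ARET), can be sketched as a single residual TDNN block in PyTorch. This is an illustrative sketch only: the channel sizes, cardinality, and context (dilation) settings below are assumptions, not the paper's exact configuration; the grouped 1-D convolution stands in for the split-transform-merge strategy, as in ResNeXt-style aggregated transformations.

```python
import torch
import torch.nn as nn


class ARETBlock(nn.Module):
    """Hypothetical aggregated-residual time-delay block.

    A 1-D convolution with dilation plays the role of a TDNN layer;
    ``groups`` splits the bottleneck into parallel branches whose
    outputs are merged, and the shortcut adds the block input back
    (the RET-style residual connection).
    """

    def __init__(self, channels=512, bottleneck=256, cardinality=32,
                 kernel_size=3, dilation=1):
        super().__init__()
        self.transform = nn.Sequential(
            # split: reduce to a bottleneck shared by all branches
            nn.Conv1d(channels, bottleneck, 1),
            nn.BatchNorm1d(bottleneck),
            nn.ReLU(),
            # transform: grouped time-delay (dilated) convolution
            nn.Conv1d(bottleneck, bottleneck, kernel_size,
                      dilation=dilation,
                      padding=dilation * (kernel_size - 1) // 2,
                      groups=cardinality),
            nn.BatchNorm1d(bottleneck),
            nn.ReLU(),
            # merge: project back to the block width
            nn.Conv1d(bottleneck, channels, 1),
            nn.BatchNorm1d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # residual shortcut around the aggregated transformation
        return self.relu(self.transform(x) + x)


# (batch, feature channels, frames) -- shape is preserved by the block
x = torch.randn(4, 512, 200)
y = ARETBlock()(x)
print(tuple(y.shape))
```

Stacking such blocks with increasing dilation would widen the temporal context frame by frame, which is how TDNNs typically accumulate long-term speaker information before pooling into an utterance-level embedding.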
Keywords
residual transformations, aggregated transformations, time-delay neural networks, speaker verification