Convolution-Based Attention Model With Positional Encoding For Streaming Speech Recognition On Embedded Devices

2021 IEEE Spoken Language Technology Workshop (SLT)(2021)

Abstract
On-device automatic speech recognition (ASR) is preferred over server-based implementations owing to its low latency and privacy protection. Many server-based ASR systems employ recurrent neural networks (RNNs) to exploit their ability to recognize long sequences with a limited number of states; however, RNNs are inefficient for single-stream implementations on embedded devices. In this study, a highly efficient convolution-based ASR model with monotonic chunkwise attention is developed. Although temporal convolution-based models allow more efficient implementations, they demand a long filter length to avoid looping or skipping problems. To remedy this, we add positional encoding to the convolution-based ASR encoder while shortening the filter length. We demonstrate that the accuracy of the short-filter-length convolutional model is significantly improved, and we analyze the effect of positional encoding by visualizing the attention energy and encoder outputs. The proposed model achieves a word error rate of 11.20% on TED-LIUMv2 for an end-to-end speech recognition task.
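The abstract's central idea is to inject absolute position information into a convolutional encoder so that short filters can still disambiguate frames. The paper does not specify the encoding formula here, so the sketch below uses the standard sinusoidal positional encoding as an illustrative stand-in; the frame count and channel dimension are hypothetical, not the paper's settings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Standard sinusoidal positional encoding.

    Returns an array of shape (seq_len, d_model): even columns hold
    sine components, odd columns hold cosine components, with
    wavelengths forming a geometric progression up to 10000.
    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, np.newaxis]    # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]   # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Hypothetical encoder feature map: 100 frames, 256 channels.
features = np.random.randn(100, 256)
# Adding the encoding marks each frame with its absolute position,
# which the attention module can use to avoid looping or skipping.
encoded = features + sinusoidal_positional_encoding(100, 256)
```

Because the encoding is added element-wise, it leaves the encoder's channel dimension and streaming behavior unchanged, which matters for the single-stream embedded setting the paper targets.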
Keywords
long filter-length, positional encoding, convolution-based ASR, encoder, short filter-length convolutional model, attention energy, end-to-end speech recognition task, convolution-based attention model, streaming speech recognition, embedded devices, server-based implementations, privacy protection, recurrent neural networks, long sequences, single-stream implementations, highly efficient convolutional model-based ASR, monotonic chunkwise attention, temporal convolution-based models, server-based ASR, on-device automatic speech recognition