HiTNet: Byte-to-BPE Hierarchical Transcription Network for End-to-End Speech Recognition

2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (2021)

Abstract
In this paper, we propose a new byte to byte-pair-encoding (BPE) Hierarchical Transcription Network (HiTNet) architecture for end-to-end (e2e) automatic speech recognition (ASR). The proposed HiTNet architecture simultaneously encodes and decodes information hierarchically at different levels of linguistic granularity, such as bytes and BPE. In general, this idea can be extended to any levels of granularity: from phonemes, graphemes, or bytes (character to sub-character in some languages), to sub-words or BPE units, to words, and so on. Existing hierarchical e2e ASR models primarily encode the acoustic information in a hierarchical manner governed by weaker linguistic constraints at each level. In those models, the language information at each level is neither embedded nor used explicitly, and the information decoded at each level is not passed on to the next stage. The proposed architecture primarily decodes information in a hierarchical manner, explicitly utilizing the linguistic information at each level while also utilizing the hierarchically encoded acoustic information at each level. Experiments with a two-level byte-to-BPE (b2B) hierarchical transcription show that the proposed architecture significantly reduces the word error rates of both the byte and BPE decoders compared to baseline byte-based and BPE-based attention encoder-decoder models.
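To make the two levels of granularity concrete, the following minimal Python sketch shows how the same text can be represented as bytes and as BPE units. It is an illustration only and is not taken from the paper: the function names (`to_bytes`, `learn_bpe`, `apply_merge`, `encode_bpe`), the toy corpus, and the number of merges are all assumptions, and the BPE learner is the textbook greedy pair-merging procedure rather than the tokenizer the authors used. It does not model the HiTNet encoders or decoders.

```python
from collections import Counter


def to_bytes(text):
    """Byte-level targets: one integer in [0, 255] per UTF-8 byte."""
    return list(text.encode("utf-8"))


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn a toy list of BPE merges, starting from single characters."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Re-segment every word with the newly learned merge.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(apply_merge(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def encode_bpe(word, merges):
    """Segment a word by applying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Hypothetical toy corpus, chosen only for illustration.
corpus = ["low", "low", "lower", "lowest"]
merges = learn_bpe(corpus, num_merges=2)

print(to_bytes("low"))              # fine-grained level: [108, 111, 119]
print(encode_bpe("lower", merges))  # coarser level: BPE units
```

The byte sequence is always at least as long as the BPE sequence for the same text, which is the sense in which bytes form the finer level and BPE units the coarser level of a two-level hierarchy.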
Keywords
HiTNet, byte-to-BPE, acoustic-linguistic, end-to-end ASR, hierarchical transcription