Advances In Arabic Speech Transcription At Ibm Under The Darpa Gale Program

IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING(2009)

引用 44|浏览1
暂无评分
摘要
This paper describes the Arabic broadcast transcription system fielded by IBM in the GALE Phase 2.5 machine translation evaluation. Key advances include the use of additional training data from the Linguistic Data Consortium (LDC), use of a very large vocabulary comprising 737 K words and 2.5 M pronunciation variants, automatic vowelization using flat-start training, cross-adaptation between unvowelized and vowelized acoustic models, and rescoring with a neural-network language model. The resulting system achieves word error rates below 10% on Arabic broadcasts. Very large scale experiments with unsupervised training demonstrate that the utility of unsupervised data depends on the amount of supervised data available. While unsupervised training improves system performance when a limited amount (135 h) of supervised data is available, these gains disappear when a greater amount (848 h) of supervised data is used, even with a very large (7069 h) corpus of unsupervised data. We also describe a method for modeling Arabic dialects that avoids the problem of data sparseness entailed by dialect-specific acoustic models via the use of non-phonetic, dialect questions in the decision trees. We show how this method can be used with a statically compiled decoding graph by partitioning the decision trees into a static component and a dynamic component, with the dynamic component being replaced by a mapping that is evaluated at run-time.
更多
查看译文
关键词
Dialect modeling,discriminative training,speech recognition,vowelization
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要