Web Data Harvesting For Speech Understanding Grammar Induction

14th Annual Conference of the International Speech Communication Association (INTERSPEECH 2013), Vols 1-5 (2013)

Abstract
The development of a speech understanding grammar for spoken dialogue systems can be greatly accelerated by using an in-domain corpus. Developing such a corpus, however, is a slow and expensive process. This paper proposes unsupervised, language-agnostic methods for finding relevant corpora on the web and mining their most informative parts. We show that utilizing perplexity increases the in-domainness (precision) of the mined corpora, while utilizing pragmatic constraints and search engine rank increases their generalizability (recall). We also show that, for a travel application, automatic grammar induction algorithms achieve better performance on the automatically mined corpora than on manually collected in-domain corpora.
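As a rough illustration of the perplexity-based filtering idea mentioned in the abstract, the sketch below scores candidate web sentences against a small in-domain seed corpus and keeps only those with low perplexity. It is a minimal sketch under stated assumptions, not the authors' method: the add-one smoothed unigram model, the function names (train_unigram, perplexity, filter_in_domain), and the max_perplexity threshold are all hypothetical choices made for illustration, and the abstract does not specify which language model the paper uses.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count word frequencies over a small in-domain seed corpus.

    Assumption: a unigram model stands in for whatever language
    model the paper actually uses.
    """
    counts = Counter(word for s in sentences for word in s.lower().split())
    return counts, sum(counts.values())


def perplexity(sentence, counts, total):
    """Per-word perplexity under an add-one smoothed unigram model."""
    words = sentence.lower().split()
    if not words:
        return float("inf")
    vocab_size = len(counts) + 1  # one extra slot for unseen words
    log_prob = sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return math.exp(-log_prob / len(words))


def filter_in_domain(candidates, counts, total, max_perplexity):
    """Keep candidates whose perplexity does not exceed the threshold.

    Assumption: max_perplexity is a hypothetical tuning parameter.
    """
    return [
        s for s in candidates
        if perplexity(s, counts, total) <= max_perplexity
    ]
```

Lower perplexity means a sentence looks more like the seed corpus, so tightening the threshold trades recall for precision, which is the trade-off the abstract describes.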
Keywords
spoken dialog systems, grammar induction, speech understanding, web harvesting, language modeling