Multimodal Word Discovery and Retrieval With Spoken Descriptions and Visual Concepts

IEEE/ACM Transactions on Audio, Speech, and Language Processing (2020)

Abstract
In the absence of dictionaries, translators, or grammars, it is still possible to learn some of the words of a new language by listening to spoken descriptions of images. If several images that each contain a particular visually salient object co-occur with a particular sequence of speech sounds, we can infer that those speech sounds form a word whose definition is the visible object. A multimodal word discovery system accepts, as input, a database of spoken descriptions of images (or a set of corresponding phone transcriptions) and learns a mapping from waveform segments (or phone strings) to their associated image concepts. In this article, four multimodal word discovery systems are demonstrated: three models based on statistical machine translation (SMT) and one based on neural machine translation (NMT). The systems are trained on phonetic transcriptions, MFCCs, and multilingual bottleneck (MBN) features. At the phone level, the SMT models outperform the NMT model, achieving a 61.6% F1 score on the phone-level word discovery task on Flickr30k. At the audio level, we compare our models with the existing ES-KMeans word discovery algorithm and present some of the challenges in multimodal spoken word discovery.
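The SMT-based discovery described above can be illustrated with the classic IBM Model 1 alignment recipe: treat each image's visual concepts as the "source" sentence and the phone transcription of its spoken caption as the "target", then run EM to estimate translation probabilities t(phone | concept). The sketch below is a minimal, self-contained illustration of that idea; the toy corpus, the phone strings, and the choice of plain IBM Model 1 are assumptions for demonstration, not the paper's exact models or data.

```python
from collections import defaultdict

# Toy "parallel corpus": the visual concepts of each image paired with a
# phone transcription of its spoken caption. All data here is made up.
corpus = [
    (["dog", "grass"], "dh ah d ao g r ah n z aa n g r ae s".split()),
    (["dog", "ball"],  "ah d ao g w ih th ah b ao l".split()),
    (["cat", "grass"], "ah k ae t aa n g r ae s".split()),
]

def train_model1(corpus, iterations=15):
    """EM training of IBM Model 1 probabilities t(phone | concept),
    treating each image's concept set as the source-language sentence."""
    # Uniform initialization over every co-occurring (concept, phone) pair.
    t = {(c, p): 1.0 for concepts, phones in corpus
         for c in concepts for p in phones}
    for _ in range(iterations):
        count = defaultdict(float)   # expected (concept, phone) counts
        total = defaultdict(float)   # expected per-concept totals
        for concepts, phones in corpus:
            for p in phones:
                z = sum(t[(c, p)] for c in concepts)  # E-step normalizer
                for c in concepts:
                    frac = t[(c, p)] / z
                    count[(c, p)] += frac
                    total[c] += frac
        # M-step: renormalize so that t(. | c) sums to 1 over phones.
        t = {(c, p): n / total[c] for (c, p), n in count.items()}
    return t

t = train_model1(corpus)
for concept in ("dog", "cat", "grass"):
    top = sorted(((p, pr) for (c, p), pr in t.items() if c == concept),
                 key=lambda x: -x[1])[:4]
    print(concept, [f"{p}:{pr:.2f}" for p, pr in top])
```

Contiguous runs of a concept's highest-probability phones (here, d-ao-g for "dog") then serve as the discovered word form for that concept, which is what the phone-level F1 score measures.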
Keywords
Hidden Markov models, Acoustics, Speech processing, Task analysis, Image retrieval, Random variables, Adaptation models, Unsupervised word discovery, language acquisition, machine translation, multimodal learning