Generating images from audio under semantic consistency

Neurocomputing (2022)

Abstract
Generating the data of an absent modality from available information in another modality is valuable for achieving audio-visual cross-modal information complementarity. However, existing audio-visual generation methods require strict temporal synchronization between the data of the two modalities, which is time-consuming and expensive to obtain. In this paper, exploiting the extensive semantic associations between audio and vision, we propose a semantic-consistency audio-to-image generative adversarial network (SCAIGAN) that generates semantically matching visual images directly from audio spectrograms. Our model exploits three mechanisms. First, a self-attention mechanism is added to the encoder to better capture the global features and geometric structure of the high-dimensional data. Second, a projection mechanism is used in the discriminator to constrain the generator, embedding a form of cross-modal self-supervision under semantic consistency. Finally, self-modulation batch normalization is applied to the generator to accelerate convergence and improve the quality of the generated images. Experiments demonstrate that our model generates clear, diverse visual images on both instrument and face datasets and achieves higher classification accuracy than other state-of-the-art methods. Our code will be made publicly available at https://github.com/PengchengZhao1001/AV-Correlation.
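Since the code release at the URL above was still pending, the following is a minimal PyTorch sketch of two of the three mechanisms in their standard published forms, not the authors' implementation: a projection discriminator in the style of Miyato and Koyama (2018), which scores an image-label pair via an inner product between a learned class embedding and the image feature, and self-modulation batch normalization in the style of Chen et al. (2019), where the generator's latent vector predicts the normalization scale and shift. All layer shapes, the 64x64 RGB input size, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProjectionDiscriminator(nn.Module):
    """Projection discriminator sketch (standard formulation).

    The conditional term is an inner product between a learned class
    embedding and the image feature vector; this is one common way to
    impose the semantic-consistency constraint the abstract describes.
    Layer sizes are illustrative assumptions, not the paper's config.
    """
    def __init__(self, num_classes: int, feat_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(            # assumes 64x64 RGB input
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, feat_dim, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),    # -> (B, feat_dim)
        )
        self.linear = nn.Linear(feat_dim, 1)              # unconditional logit
        self.embed = nn.Embedding(num_classes, feat_dim)  # class embedding

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        h = self.features(x)
        # logit = psi(h) + <embed(y), h>
        return self.linear(h) + (self.embed(y) * h).sum(dim=1, keepdim=True)


class SelfModulatedBN(nn.Module):
    """Self-modulation batch norm sketch: the generator's latent z
    predicts the per-channel scale and shift instead of learning
    static affine parameters."""
    def __init__(self, num_features: int, z_dim: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.gamma = nn.Linear(z_dim, num_features)   # scale head
        self.beta = nn.Linear(z_dim, num_features)    # shift head

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        h = self.bn(x)
        g = self.gamma(z).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        b = self.beta(z).unsqueeze(-1).unsqueeze(-1)
        return (1.0 + g) * h + b


# Smoke test with random tensors.
D = ProjectionDiscriminator(num_classes=10)
logits = D(torch.randn(4, 3, 64, 64), torch.randint(0, 10, (4,)))
bn = SelfModulatedBN(num_features=64, z_dim=100)
out = bn(torch.randn(4, 64, 16, 16), torch.randn(4, 100))
```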
Keywords
Cross modal, Semantic association, Generative adversarial nets