AUDIO CAPTIONING BASED ON TRANSFORMER AND PRE-TRAINING FOR 2020 DCASE AUDIO CAPTIONING CHALLENGE Technical Report

Yusong Wu, Kun Chen,Ziyue Wang,Xuan Zhang,Fudong Nian,Shengchen Li,Xi Shao

semanticscholar（2020）

Cited 0|Views16

No score

Abstract

This report proposes an automated audio captioning model for the 2020 DCASE audio captioning challenge. In this challenge, a model is required to be trained from scratch to generate natural language descriptions of a given audio signal. However, as limited data available and restrictions on using pre-trained models trained by external data, training directly from scratch can result in poor performance where acoustic events and language are poorly modeled. For better acoustic event and language modeling, a sequenceto-sequence model is proposed which consists of a CNN encoder and a Transformer decoder. In the proposed model, the encoder and word embedding are firstly pre-trained. Regulations and data augmentations are applied during training, while fine-tuning is applied after training. Experiments show that the proposed model can achieve a SPIDEr score of 0.227 on audio captioning performance.

Translated text

AI Read Science

Must-Reading Tree

Example

Generate MRT to find the research sequence of this paper

Chat Paper

Summary is being generated by the instructions you defined