Multiple Transformer Mining for VizWiz Image Caption

Xuchao Gong, Hongji Zhu, Yongliang Wang, Biaolong Chen, Aixi Zhang, Fangxun Shu, Si Liu

Semantic Scholar (2021)

Abstract
This paper proposes a multiple transformer mining algorithm (MTMA) for the VizWiz image captioning task. MTMA combines grid image feature extraction, OCR, and object detectors to describe the image content effectively. The captioning models are trained with Self-Critical Sequence Training (SCST), and semantic similarity aggregation is applied in the post-processing phase. Ensembling is also used in multi-modal feature fusion and in the post-caption-generation stage to further improve performance. As a result, the proposed algorithm achieves a CIDEr score of 94.06, outperforming other methods.
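The abstract names SCST as the training approach but does not give its form. As a rough illustration only, the standard self-critical objective uses the reward of a greedy-decoded caption as a baseline for the reward of a sampled caption, and scales the sampled caption's log-probability by the difference. The sketch below assumes per-token log-probabilities and precomputed rewards (for example CIDEr scores) are already available; the function name `scst_loss` and its arguments are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def scst_loss(
    sampled_log_probs: Sequence[Sequence[float]],
    sampled_rewards: Sequence[float],
    greedy_rewards: Sequence[float],
) -> float:
    """Self-critical policy-gradient loss, averaged over a batch.

    sampled_log_probs[i][t]: log-probability of token t in the caption
        sampled for image i (hypothetical input layout).
    sampled_rewards[i]: reward (e.g. CIDEr) of the sampled caption.
    greedy_rewards[i]: reward of the greedy-decoded caption, used as baseline.
    """
    n = len(sampled_log_probs)
    if n == 0 or n != len(sampled_rewards) or n != len(greedy_rewards):
        raise ValueError("inputs must be non-empty and of equal length")

    total = 0.0
    for log_probs, r_sample, r_greedy in zip(
        sampled_log_probs, sampled_rewards, greedy_rewards
    ):
        # Positive advantage: the sample beat greedy decoding, so minimizing
        # the loss raises the sample's log-probability. Negative: lowers it.
        advantage = r_sample - r_greedy
        total += -advantage * sum(log_probs)
    return total / n


# Two images: the first sample beats its greedy baseline, the second does not.
loss = scst_loss(
    sampled_log_probs=[[-0.5, -0.5], [-2.0]],
    sampled_rewards=[1.0, 0.2],
    greedy_rewards=[0.5, 0.7],
)
print(loss)
```

In an actual training loop the log-probabilities would come from an autodiff framework and the rewards would be treated as constants, so gradients flow only through the log-probability terms.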