Multiple Transformer Mining for VizWiz Image Caption
semanticscholar(2021)
Abstract
This paper proposes a multiple transformer mining algorithm (MTMA) for the VizWiz image captioning task. MTMA combines grid image feature extraction, OCR, and object detectors to describe the image content effectively. The Self-Critical Sequence Training (SCST) approach is adopted for the image captioning models in the training phase, and semantic similarity aggregation is adopted in the postprocessing phase. Ensembling is also applied in multi-modal feature fusion and in post-caption generation to further improve performance. As a result, the proposed algorithm outperforms others with a CIDEr score of 94.06.
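The abstract does not give training details, so the following is only a minimal sketch of the standard SCST objective, assuming the usual formulation in which the reward of a sampled caption is baselined by the reward of the greedily decoded caption. The function name `scst_loss` and its arguments are hypothetical, and the rewards are plain numbers standing in for a caption metric such as CIDEr.

```python
from typing import List


def scst_loss(
    sample_log_probs: List[List[float]],
    sample_rewards: List[float],
    greedy_rewards: List[float],
) -> float:
    """Mean self-critical loss over a batch of captions.

    sample_log_probs: per-token log-probabilities of each sampled caption
    sample_rewards:   metric score of each sampled caption (assumed, e.g. CIDEr)
    greedy_rewards:   metric score of the greedy caption for the same image
    """
    if not (len(sample_log_probs) == len(sample_rewards) == len(greedy_rewards)):
        raise ValueError("all inputs must have the same batch size")
    if not sample_rewards:
        raise ValueError("batch must not be empty")

    total = 0.0
    for log_probs, r_sample, r_greedy in zip(
        sample_log_probs, sample_rewards, greedy_rewards
    ):
        # Advantage: how much better the sample scored than the greedy baseline.
        advantage = r_sample - r_greedy
        # REINFORCE term: minimizing this raises the likelihood of samples
        # that beat the baseline and lowers it for samples that fall short.
        total += -advantage * sum(log_probs)
    return total / len(sample_rewards)


# One sampled caption that beats its greedy baseline by 0.6.
loss = scst_loss([[-0.5, -0.5]], [1.0], [0.4])
```

In a real training loop the log-probabilities would be differentiable tensors from the captioning model and the rewards would be treated as constants; this sketch only shows how the loss value is assembled.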