EvCap: Element-Aware Video Captioning

IEEE Transactions on Circuits and Systems for Video Technology (2024)

Abstract
Video captioning is a multi-modal task spanning computer vision and natural language processing. Previous methods generally follow one of two paradigms: template-based or sequence-based. Template-based methods can generate relatively accurate elements (e.g., humans, objects, or actions) to fill a template caption, but with a rather limited vocabulary and syntactic structure. In contrast, sequence-based methods generate more natural, human-like descriptions but are prone to element errors because they depend heavily on visual features, which often contain much distracting information. In this work, we draw on the element extraction approach of template-based methods and propose a novel Element-aware video Captioning (EvCap) framework that applies linguistic features in addition to general visual features to strengthen the model's awareness of specific elements under the sequence-based paradigm. In particular, we introduce two new linguistic features, i.e., action-relevant and object-relevant features, derived from the upstream encoder of the sequence-based paradigm. These encode action and object information (in the form of phrases and words, respectively) that benefits the generation of the corresponding elements in the final description. Moreover, to fuse the heterogeneous representations and mitigate noise from inaccurate features, we design a post-operation fusion strategy with semantic interaction and energy weighting to ensure that each feature is used effectively. Experimental results show that EvCap achieves promising performance compared with baselines under diverse upstream encoder architectures, including CNNs, ViT, and CLIP, demonstrating good scalability with respect to the choice of encoder.
Key words
Video captioning, element-awareness, multi-modal application
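Illustrative sketch
The abstract names a fusion strategy with "semantic interaction and energy weighting" but gives no formulas, so the following is only a rough sketch of one way such a fusion could look, not the authors' implementation. Everything in it is an assumption made for illustration: the function names (`interact`, `energy_weights`, `fuse`), the use of scaled dot-product attention as the interaction step, the definition of "energy" as mean squared activation, and the toy feature vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def interact(features):
    """Assumed 'semantic interaction': each feature attends over all features
    with scaled dot-product attention (a stand-in, not the paper's operator)."""
    dim = len(features[0])
    mixed = []
    for query in features:
        attn = softmax([dot(query, key) / math.sqrt(dim) for key in features])
        mixed.append(
            [sum(w * key[i] for w, key in zip(attn, features)) for i in range(dim)]
        )
    return mixed


def energy_weights(features):
    """Assumed 'energy weighting': softmax over each feature's mean squared
    activation, so low-energy features contribute less to the fused result."""
    energies = [dot(f, f) / len(f) for f in features]
    return softmax(energies)


def fuse(features):
    """Interact first, then combine the results with energy-based weights."""
    mixed = interact(features)
    weights = energy_weights(mixed)
    dim = len(mixed[0])
    fused = [sum(w * f[i] for w, f in zip(weights, mixed)) for i in range(dim)]
    return fused, weights


# Toy 4-dimensional stand-ins for visual, action-relevant, and
# object-relevant features (hypothetical values).
visual = [0.2, -0.1, 0.4, 0.0]
action = [0.9, 0.3, -0.2, 0.5]
obj = [0.1, 0.0, 0.1, -0.1]

fused, weights = fuse([visual, action, obj])
print("weights:", weights)
print("fused:", fused)
```

In a real captioning model the three inputs would be learned, higher-dimensional tensors and the fused vector would feed the caption decoder; the sketch only shows how a per-feature weight can down-weight a weaker feature before the features are combined.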