AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding
CoRR (2024)
Abstract
Temporal Video Grounding (TVG) aims to localize a moment from an untrimmed
video given a language description. Since TVG annotation is labor-intensive,
TVG under limited supervision has attracted attention in recent years. The
great success of vision-language pre-training has led TVG to follow the
traditional "pre-training + fine-tuning" paradigm; however, the pre-training
process suffers from a lack of temporal modeling and fine-grained alignment
due to the difference in data nature between pre-training and testing.
Moreover, the large gap between the pretext and downstream tasks makes
zero-shot testing impossible for the pre-trained model. To avoid the drawbacks
of the traditional paradigm, we propose AutoTVG, a new vision-language
pre-training paradigm for TVG that enables the model to learn semantic
alignment and boundary regression from automatically annotated untrimmed
videos. Specifically, AutoTVG consists of a novel Captioned Moment Generation
(CMG) module that generates captioned moments from untrimmed videos, and
TVGNet, which predicts localization results with a regression head.
Experimental results on Charades-STA and ActivityNet Captions show that, for
zero-shot temporal video grounding, AutoTVG achieves highly competitive
performance with in-distribution methods under out-of-distribution testing,
and outperforms existing pre-training frameworks with much less training data.
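To make the described pipeline concrete, below is a minimal PyTorch-style sketch of a grounding model with a boundary regression head trained on CMG-style pseudo-labels. The abstract does not specify architecture details, so the class layout, feature dimensions, fusion design, and loss choice here are all illustrative assumptions, not the paper's actual TVGNet.

```python
import torch
import torch.nn as nn

class TVGNet(nn.Module):
    """Illustrative sketch only: fuses clip-level video features with a
    pooled caption embedding, then regresses normalized (start, end)
    moment boundaries. All dimensions and layer choices are assumed."""

    def __init__(self, video_dim=512, text_dim=512, hidden_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=hidden_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Regression head predicting normalized (start, end) in [0, 1].
        self.reg_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),
            nn.Sigmoid(),
        )

    def forward(self, video_feats, text_feat):
        # video_feats: (B, T, video_dim) clip-level features
        # text_feat:   (B, text_dim) pooled caption embedding
        v = self.video_proj(video_feats)              # (B, T, H)
        q = self.text_proj(text_feat).unsqueeze(1)    # (B, 1, H)
        fused = self.fusion(torch.cat([q, v], dim=1)) # (B, 1+T, H)
        # Read the boundary prediction off the query token.
        return self.reg_head(fused[:, 0])             # (B, 2)


def train_step(model, optimizer, video_feats, caption_emb, pseudo_spans):
    """One training step on CMG-style pseudo-annotations: the module is
    assumed to supply (video features, caption embedding, pseudo (start,
    end) span) triples from untrimmed videos."""
    pred = model(video_feats, caption_emb)
    loss = nn.functional.l1_loss(pred, pseudo_spans)  # boundary regression
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because supervision comes entirely from automatically generated captioned moments, a model of this shape can be evaluated zero-shot on downstream benchmarks without fine-tuning, which is the property the abstract highlights.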