HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models
CoRR (2024)
Abstract
While recent progress in video-text retrieval has been driven by the
exploration of powerful model architectures and training strategies, the
representation learning ability of video-text retrieval models is still limited
due to low-quality and scarce training data annotations. To address this issue,
we present a novel video-text learning paradigm, HaVTR, which augments video
and text data to learn more generalized features. Specifically, we first adopt
a simple augmentation method, which generates self-similar data by randomly
duplicating or dropping subwords and frames. In addition, inspired by recent
advances in visual and language generative models, we propose a more
powerful augmentation method through textual paraphrasing and video stylization
using large language models (LLMs) and visual generative models (VGMs).
Further, to bring richer information into video and text, we propose a
hallucination-based augmentation method, where we use LLMs and VGMs to generate
and add new relevant information to the original data. Benefiting from the
enriched data, HaVTR outperforms existing methods on several video-text
retrieval benchmarks, as extensive experiments demonstrate.
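
For concreteness, the simple self-similar augmentation can be sketched as
below. This is a minimal illustration, not the paper's implementation: the
function name `duplicate_or_drop` and the per-element probabilities
`dup_prob` and `drop_prob` are assumptions, since the abstract does not
specify the exact sampling scheme.

```python
import random

def duplicate_or_drop(seq, dup_prob=0.1, drop_prob=0.1):
    """Randomly duplicate or drop elements of a sequence.

    Applies equally to subword token lists (captions) and to frame
    lists (e.g. tensors or frame indices) for videos.
    """
    out = []
    for item in seq:
        r = random.random()
        if r < drop_prob:
            continue                 # drop this subword/frame
        out.append(item)
        if r > 1.0 - dup_prob:
            out.append(item)         # duplicate this subword/frame
    return out or list(seq)          # never return an empty sequence

# Example: two independent augmentations of one tokenized caption.
caption = ["a", "dog", "plays", "fetch", "in", "the", "park"]
print(duplicate_or_drop(caption))
print(duplicate_or_drop(caption))
```

Because drawing a single random number per element keeps dropping and
duplicating mutually exclusive, each augmented copy stays close to the
original while still varying in length, which is what makes the pairs
"self-similar" training data.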
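The LLM-based paraphrasing and hallucination-style enrichment on the text
side could be driven by prompts along the lines of the sketch below. This is
an assumption-laden illustration: the OpenAI Python client, the model name,
the prompt wording, and the `rewrite_caption` helper are all stand-ins the
abstract does not specify, and the VGM-based video stylization is omitted.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_caption(caption: str, mode: str) -> str:
    """Ask an LLM to paraphrase a caption, or to enrich it with a
    plausible extra detail (the hallucination-based augmentation)."""
    prompts = {
        "paraphrase": (
            "Paraphrase this video caption, keeping its meaning: "
            + caption
        ),
        "enrich": (
            "Add one plausible, relevant detail to this video caption: "
            + caption
        ),
    }
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice, not the paper's
        messages=[{"role": "user", "content": prompts[mode]}],
    )
    return resp.choices[0].message.content

# Example usage: generate both augmented views of a caption.
print(rewrite_caption("a dog plays fetch in the park", "paraphrase"))
print(rewrite_caption("a dog plays fetch in the park", "enrich"))
```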