A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval
CoRR (2023)
Abstract
Existing long video retrieval systems are trained and tested in the
paragraph-to-video retrieval regime, where every long video is described by a
single long paragraph. This neglects the richness and variety of possible valid
descriptions of a video, which could be described in moment-by-moment detail,
or in a single phrase summary, or anything in between. To provide a more
thorough evaluation of the capabilities of long video retrieval systems, we
propose a pipeline that leverages state-of-the-art large language models to
carefully generate a diverse set of synthetic captions for long videos. We
validate this pipeline's fidelity via rigorous human inspection. We then
benchmark a representative set of video language models on these synthetic
captions using a few long video datasets, showing that they struggle with the
transformed data, especially the shortest captions. We also propose a
lightweight fine-tuning method, where we use a contrastive loss to learn a
hierarchical embedding based on the differing levels of information among
the various captions. Our method improves performance both on the downstream
paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet) and on
the various long video retrieval metrics we compute using our synthetic data
(+3.6% R@1 for short descriptions on ActivityNet). For data access and other
details, please refer to our project website at
https://mgwillia.github.io/10k-words.
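
The gains above are reported in R@1 (recall at rank 1), the fraction of captions for which the correct video is the top-ranked result. As a minimal sketch of how such a metric is typically computed, and not the authors' evaluation code, the snippet below assumes a precomputed caption-to-video similarity matrix. The function name `recall_at_k` and the toy scores are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    targets: Sequence[int],
    k: int = 1,
) -> float:
    """Fraction of captions whose ground-truth video ranks in the top k.

    similarity[i][j] is the score between caption i and video j;
    targets[i] is the index of the video that caption i describes.
    """
    if len(similarity) != len(targets):
        raise ValueError("need exactly one target per caption")
    if not targets:
        raise ValueError("need at least one caption")
    hits = 0
    for scores, target in zip(similarity, targets):
        # Rank video indices from highest to lowest similarity.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Toy data (made up): three captions scored against three videos.
sim = [
    [0.9, 0.1, 0.2],  # caption 0 describes video 0 and ranks it first
    [0.3, 0.8, 0.4],  # caption 1 describes video 1 and ranks it first
    [0.5, 0.6, 0.4],  # caption 2 describes video 2 but ranks it last
]
targets = [0, 1, 2]

r_at_1 = recall_at_k(sim, targets, k=1)
r_at_3 = recall_at_k(sim, targets, k=3)
```

Computing this separately for each caption type (full paragraphs, summaries, short phrases) gives one score per type, which is how a result such as "R@1 for short descriptions" can be read.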