Self-Adaptive Sampling for Efficient Video Question-Answering on Image–Text Models
arXiv (2023)
Abstract
Video question-answering is a fundamental task in the field of video
understanding. Although current vision–language models (VLMs) equipped with
Video Transformers have enabled temporal modeling and yielded superior results,
they come at the cost of enormous computational demands and are thus too
expensive to deploy in real-time application scenarios. An economical
workaround samples only a small portion of the frames to represent the video's
main content and tunes an image–text model on these sampled frames. Recent
video understanding models usually sample a set of frames or clips at random,
regardless of the internal correlations among their visual contents or their
relevance to the question. We argue that such aimless sampling may omit the key
frames from which the correct answer can be deduced, and the situation worsens
as the sampling becomes sparser, which inevitably happens as video length
increases. To mitigate this issue, we propose two frame sampling
strategies, namely most dominant frames (MDF) and most implied frames (MIF),
to maximally preserve the frames that are most likely vital to the given
question. MDF passively minimizes the risk of key-frame omission in a
bootstrap manner, while MIF actively searches for key frames customized to
each video–question pair with the assistance of auxiliary models. Experimental
results on three public datasets with three advanced VLMs (CLIP, GIT, and
All-in-one) demonstrate that our proposed strategies can boost the performance
of image–text pretrained models. The source code for the method proposed in
this paper is publicly available at
https://github.com/declare-lab/sas-vqa.
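The abstract describes the two samplers only at a high level. As an illustrative reading (not the authors' released implementation; the function names, the scoring heuristics, and the choice of CLIP checkpoint are all assumptions), the sketch below contrasts an MDF-like question-agnostic selector, which keeps the frames most representative of the whole video, with an MIF-like question-aware selector, which scores frames against the question using an auxiliary image–text model:

```python
# Illustrative sketch only -- not the paper's released code. Hypothetical
# helpers showing one plausible reading of the MDF/MIF ideas with an
# off-the-shelf CLIP model from Hugging Face transformers.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def mdf_like(frames: list[Image.Image], k: int = 8) -> list[Image.Image]:
    """Question-agnostic: keep the k frames with the highest average cosine
    similarity to all other frames, i.e. the video's dominant content."""
    pixels = processor(images=frames, return_tensors="pt")
    emb = model.get_image_features(**pixels)
    emb = emb / emb.norm(dim=-1, keepdim=True)   # unit-normalize embeddings
    scores = (emb @ emb.T).mean(dim=-1)          # representativeness per frame
    idx = torch.topk(scores, k=min(k, len(frames))).indices.sort().values
    return [frames[i] for i in idx.tolist()]     # preserve temporal order

@torch.no_grad()
def mif_like(frames: list[Image.Image], question: str,
             k: int = 8) -> list[Image.Image]:
    """Question-aware: keep the k frames whose image embeddings best match
    the question, scored by the auxiliary image-text model."""
    inputs = processor(text=[question], images=frames,
                       return_tensors="pt", padding=True)
    scores = model(**inputs).logits_per_image.squeeze(-1)  # (num_frames,)
    idx = torch.topk(scores, k=min(k, len(frames))).indices.sort().values
    return [frames[i] for i in idx.tolist()]
```

Either selector's output can then be fed to an image–text VQA model in place of randomly sampled frames.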