InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding
arXiv (2024)
Abstract
Understanding long videos, ranging from tens of minutes to several hours,
presents unique challenges in video comprehension. Despite the increasing
importance of long-form video content, existing benchmarks primarily focus on
shorter clips. To address this gap, we introduce InfiniBench, a comprehensive
benchmark for very long video understanding, which presents: 1) the longest video
duration, averaging 76.34 minutes; 2) the largest number of question-answer
pairs, 108.2K; 3) diversity in questions that examine nine different skills and
include both multiple-choice and open-ended questions; 4) a human-centric
focus, as the video sources come from movies and daily TV shows, with
specific human-level question designs such as Movie Spoiler Questions that
require critical thinking and comprehensive understanding. Using InfiniBench,
we comprehensively evaluate existing Large MultiModality Models (LMMs) on each
skill, including the commercial model Gemini 1.5 Flash and open-source
models. The evaluation shows that our benchmark poses significant challenges. Our
results show that even the best AI models, such as Gemini, struggle to perform
well, with 42.72 [...]
will stimulate the LMMs community towards long video and human-level
understanding. Our benchmark can be accessed at
https://vision-cair.github.io/InfiniBench/