MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Emil B. Song, Wenhao Chai, Guanhong Wang, Ya Zhang, Haoyang Zhou, Feiyang Wu, Xianghai Guo, Yang Tian, Y. Lu, Jenq-Neng Hwang, Gaoang Wang

arXiv (Cornell University) (2023)

Abstract
Recently, integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connection remain challenging. Inspired by the Atkinson-Shiffrin memory model, we develop a memory mechanism that includes a rapidly updated short-term memory and a compact, and thus sustained, long-term memory. We employ tokens in Transformers as the carriers of memory. MovieChat achieves state-of-the-art performance in long video understanding.
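As a rough illustration of the two-stage memory idea, the sketch below keeps a small short-term buffer of frame tokens and, whenever the buffer fills, compresses it into a long-term store. The abstract does not say how consolidation is performed, so the rule used here (repeatedly averaging the most similar pair of adjacent tokens) is an assumption made for illustration only. The names `TwoStageMemory`, `consolidate`, `short_capacity`, and `merged_len` are hypothetical and are not taken from the paper.

```python
from collections import deque
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def consolidate(tokens, target_len):
    """Assumed rule: average the most similar adjacent pair until target_len tokens remain."""
    tokens = [list(t) for t in tokens]
    while len(tokens) > max(target_len, 1):
        sims = [cosine(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[i + 1])]
        tokens[i:i + 2] = [merged]  # replace the pair with its average
    return tokens


class TwoStageMemory:
    """Hypothetical short-term buffer plus compact long-term store."""

    def __init__(self, short_capacity=4, merged_len=2):
        self.short = deque()   # rapidly updated, fixed capacity
        self.long = []         # compact, grows slowly
        self.short_capacity = short_capacity
        self.merged_len = merged_len

    def add(self, token):
        """Append one frame token; consolidate when the short-term buffer is full."""
        self.short.append(list(token))
        if len(self.short) == self.short_capacity:
            self.long.extend(consolidate(self.short, self.merged_len))
            self.short.clear()


memory = TwoStageMemory(short_capacity=4, merged_len=2)
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]] * 2
for frame in frames:
    memory.add(frame)
```

With these toy settings, every four incoming frame tokens are reduced to two long-term tokens, so the long-term store grows at half the rate of the input. A real system would operate on high-dimensional Transformer tokens rather than 2-element lists.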
Key words
sparse memory, long video