Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation
CVPR 2024(2023)
Abstract
We study object interaction anticipation in egocentric videos. This task
requires an understanding of the spatio-temporal context formed by past actions
on objects, coined action context. We propose TransFusion, a multimodal
transformer-based architecture. It exploits the representational power of
language by summarizing the action context. TransFusion leverages pre-trained
image captioning and vision-language models to extract the action context from
past video frames. This action context together with the next video frame is
processed by the multimodal fusion module to forecast the next object
interaction. Our model enables more efficient end-to-end learning. The large
pre-trained language models add common sense and a generalisation capability.
Experiments on Ego4D and EPIC-KITCHENS-100 show the effectiveness of our
multimodal fusion model. They also highlight the benefits of using
language-based context summaries in a task where vision seems to suffice. Our
method outperforms state-of-the-art approaches by 40.4% in relative terms in
overall mAP on the Ego4D test set. We validate the effectiveness of TransFusion
via experiments on EPIC-KITCHENS-100. Video and code are available at
https://eth-ait.github.io/transfusion-proj/.
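
To make the idea of a language-based action context concrete, the following is a minimal sketch of how per-frame captions and detected object names might be condensed into one textual summary before fusion with the next video frame. It is an illustration under stated assumptions, not the authors' implementation: the `FrameDescription` type, the `summarize_action_context` function, the `max_frames` parameter, and the output template are all hypothetical, and the captions are assumed to come from some pre-trained captioning or vision-language model that is not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameDescription:
    """Hypothetical per-frame output of a captioning / vision-language model."""
    caption: str        # short description of what happens in the frame
    objects: List[str]  # names of salient objects detected in the frame


def summarize_action_context(frames: List[FrameDescription],
                             max_frames: int = 3) -> str:
    """Condense the most recent frame descriptions into one text summary.

    Assumption: the template below is illustrative only; the paper does not
    specify this exact format.
    """
    recent = frames[-max_frames:]

    # Drop consecutive duplicate captions so a repeated action appears once.
    captions: List[str] = []
    for frame in recent:
        if not captions or captions[-1] != frame.caption:
            captions.append(frame.caption)

    # Collect object names in order of first appearance, without repeats.
    objects: List[str] = []
    for frame in recent:
        for name in frame.objects:
            if name not in objects:
                objects.append(name)

    return ("Past actions: " + ". ".join(captions) + ". "
            "Objects seen: " + ", ".join(objects) + ".")


if __name__ == "__main__":
    past = [
        FrameDescription("a person opens a drawer", ["drawer"]),
        FrameDescription("a person opens a drawer", ["drawer"]),
        FrameDescription("a person picks up a knife", ["knife", "drawer"]),
    ]
    print(summarize_action_context(past))
```

In a full pipeline, a string like this would be tokenized and passed, together with features of the next video frame, to the multimodal fusion module described in the abstract; that stage is outside the scope of this sketch.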