HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model
arXiv (2024)
Abstract
Video-Language Models (VLMs), pre-trained on large-scale video-caption
datasets, are now standard for robust visual-language representation and
downstream tasks. However, their reliance on global contrastive alignment
limits their ability to capture fine-grained interactions between visual and
textual elements. To address this limitation, we introduce HENASY
(Hierarchical ENtities ASsemblY), a novel framework for egocentric video
analysis that enhances the granularity of video content representations.
HENASY employs a compositional approach, using enhanced slot-attention and
grouping mechanisms to assemble dynamic scene entities from video patches.
It integrates a local entity encoder that models entity dynamics, a global
encoder for broader contextual understanding, and an entity-aware decoder for
late-stage fusion, enabling effective modeling of video scene dynamics and
fine-grained alignment between visual entities and text. By incorporating
innovative contrastive losses, HENASY significantly improves entity and
activity recognition, delivering superior performance on benchmarks such as
Ego4D and EPIC-KITCHENS, and setting new standards in both zero-shot and
broader video understanding tasks. Our results confirm the strong capabilities
of HENASY and establish it as a significant advance in video-language
multimodal research.
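
The abstract leaves the entity-assembly mechanism unspecified, but the grouping it describes is closely related to slot attention, in which a small set of learnable slots competes to explain video patch tokens. Below is a minimal PyTorch sketch of that idea; the class, shapes, and hyperparameters are illustrative assumptions, not HENASY's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    """Slots iteratively compete for video patch tokens (after Locatello
    et al., 2020). A hypothetical stand-in for HENASY's enhanced grouping."""

    def __init__(self, num_slots: int, dim: int, iters: int = 3):
        super().__init__()
        self.num_slots, self.iters = num_slots, iters
        self.scale = dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, dim) -- flattened spatio-temporal tokens
        b, _, d = patches.shape
        patches = self.norm_in(patches)
        k, v = self.to_k(patches), self.to_v(patches)
        slots = self.slots_init.expand(b, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # softmax over the slot axis: slots compete to explain each patch
            attn = F.softmax(torch.einsum('bsd,bpd->bsp', q, k) * self.scale, dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True)     # per-slot weights
            updates = torch.einsum('bsp,bpd->bsd', attn, v)  # pool patch values
            slots = self.gru(updates.reshape(-1, d),
                             slots.reshape(-1, d)).reshape(b, self.num_slots, d)
        return slots  # each slot: a candidate scene-entity representation
```

Given T frames of P patches each, the tokens would be flattened to (batch, T*P, dim) before grouping; a local entity encoder could then model each slot's trajectory over time, while a global encoder summarizes the full clip.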
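
The "innovative contrastive losses" that align visual entities with text are likewise not detailed in the abstract. A plausible baseline form is a symmetric InfoNCE objective over matched entity/text embedding pairs, as in CLIP-style training; the sketch below is an assumption about the general shape of such a loss, not HENASY's exact formulation.

```python
import torch
import torch.nn.functional as F

def entity_text_contrastive(entity_emb: torch.Tensor,
                            text_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between pooled entity and text embeddings.

    entity_emb, text_emb: (batch, dim); row i of each is a matched pair.
    """
    e = F.normalize(entity_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = e @ t.T / temperature                  # cosine-similarity logits
    targets = torch.arange(len(e), device=e.device)
    # each entity should match its own caption, and vice versa
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

In HENASY's setting, entity_emb could be a slot feature matched to a noun phrase in the caption, so alignment happens at the entity level rather than only at the whole-video level whose limitations the abstract notes.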