SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding
CoRR (2024)
Abstract
Temporal video grounding (TVG) is a critical task in video content
understanding, requiring precise alignment between video content and natural
language instructions. Despite significant advancements, existing methods face
challenges in managing confidence bias towards salient objects and capturing
long-term dependencies in video sequences. To address these issues, we
introduce SpikeMba: a multi-modal spiking saliency mamba for temporal video
grounding. Our approach integrates Spiking Neural Networks (SNNs) with state
space models (SSMs) to leverage their unique advantages in handling different
aspects of the task. Specifically, we use SNNs to develop a spiking saliency
detector that generates the proposal set. The detector emits spike signals when
the input signal exceeds a predefined threshold, resulting in a dynamic and
binary saliency proposal set. To enhance the model's capability to retain and
infer contextual information, we introduce relevant slots, learnable
tensors that encode prior knowledge. These slots work with the contextual
moment reasoner to maintain a balance between preserving contextual information
and exploring semantic relevance dynamically. The SSMs facilitate selective
information propagation, addressing the challenge of long-term dependencies in
video content. By combining SNNs for proposal generation and SSMs for effective
contextual reasoning, SpikeMba addresses confidence bias and long-term
dependencies, thereby significantly enhancing fine-grained multimodal
relationship capture. Our experiments demonstrate the effectiveness of
SpikeMba, which consistently outperforms state-of-the-art methods across
mainstream benchmarks.
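
The abstract describes the spiking saliency detector only at the level of its
firing rule: a spike is emitted whenever the accumulated input exceeds a
predefined threshold, yielding a dynamic, binary proposal signal. As a rough
illustration of that rule (a minimal sketch, not the paper's implementation),
here is a leaky integrate-and-fire pass over per-frame features in PyTorch;
the scalar projection, decay factor, and soft-reset behavior are assumptions.

    import torch

    def spiking_saliency(features: torch.Tensor,
                         threshold: float = 1.0,
                         decay: float = 0.5) -> torch.Tensor:
        """Integrate-and-fire over a sequence of frame features.

        features: (T, D) per-frame features.
        Returns a binary (T,) spike train marking candidate proposal frames.
        """
        # Hypothetical scalar projection of each frame to a saliency input.
        inputs = features.mean(dim=-1)           # (T,)
        membrane = torch.zeros(())               # membrane potential state
        spikes = torch.zeros_like(inputs)
        for t in range(inputs.shape[0]):
            # Leaky integration of the incoming signal.
            membrane = decay * membrane + inputs[t]
            if membrane >= threshold:            # fire on threshold crossing
                spikes[t] = 1.0
                membrane = membrane - threshold  # soft reset after a spike

        return spikes

    # Example: spikes = spiking_saliency(torch.randn(128, 256))

Frames where the spike train is 1 would then be grouped into the proposal set
that downstream modules score against the language query.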
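
Similarly, the selective information propagation attributed to the SSM backbone
can be sketched as a Mamba-style recurrence in which the discretization step
and the input/output matrices depend on the current token, letting the scan
retain or discard state per step. The shapes and projections below are
illustrative assumptions, not SpikeMba's actual parameterization.

    import torch

    def selective_ssm_scan(x, A, B_proj, C_proj, dt_proj):
        """Minimal input-dependent (selective) diagonal SSM recurrence.

        x: (T, D) input sequence; A: (D, N) negative diagonal state matrix.
        B_proj, C_proj, dt_proj: linear layers making B, C, and the step
        size functions of the input, which is what makes the scan selective.
        """
        T, D = x.shape
        N = A.shape[-1]
        h = torch.zeros(D, N)                    # hidden state
        ys = []
        for t in range(T):
            dt = torch.nn.functional.softplus(dt_proj(x[t]))  # (D,) step size
            B = B_proj(x[t])                                  # (N,)
            C = C_proj(x[t])                                  # (N,)
            # Zero-order-hold discretization of the continuous system.
            dA = torch.exp(dt.unsqueeze(-1) * A)              # (D, N)
            dB = dt.unsqueeze(-1) * B                         # (D, N)
            h = dA * h + dB * x[t].unsqueeze(-1)              # selective update
            ys.append((h * C).sum(-1))                        # (D,) readout
        return torch.stack(ys)                                # (T, D)

    # Example wiring (hypothetical sizes):
    # T, D, N = 64, 32, 16
    # y = selective_ssm_scan(torch.randn(T, D), -torch.rand(D, N),
    #                        torch.nn.Linear(D, N), torch.nn.Linear(D, N),
    #                        torch.nn.Linear(D, D))

Because dA shrinks toward zero when the step size is large, the recurrence can
effectively forget old state for some inputs and carry it forward for others,
which is the property the abstract invokes for long-term dependencies.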