EISNet: A Multi-Modal Fusion Network for Semantic Segmentation with Events and Images

IEEE Transactions on Multimedia(2024)

引用 0|浏览13
暂无评分
摘要
Bio-inspired event cameras record a scene as sparse and asynchronous “events” by detecting per-pixel brightness changes. Such cameras show great potential in challenging scene understanding tasks, benefiting from the imaging advantages of high dynamic range and high temporal resolution. Considering the complementarity between event and standard cameras, we propose a multi-modal fusion network (EISNet) to improve the semantic segmentation performance. The key challenges of this topic lie in ( i ) how to encode event data to represent accurate scene information and ( ii ) how to fuse multi-modal complementary features by considering the characteristics of two modalities. To solve the first challenge, we propose an Activity-Aware Event Integration Module (AEIM) to convert event data into frame-based representations with high-confidence details via scene activity modeling. To tackle the second challenge, we introduce the Modality Recalibration and Fusion Module (MRFM) to recalibrate modal-specific representations and then aggregate multi-modal features at multiple stages. MRFM learns to generate modal-oriented masks to guide the merging of complementary features, achieving adaptive fusion. Based on these two core designs, our proposed EISNet adopts an encoder-decoder transformer architecture for accurate semantic segmentation using events and images. Experimental results show that our model outperforms state-of-the-art methods by a large margin on event-based semantic segmentation datasets. The code is publicly available at https://github.com/bochenxie/EISNet .
更多
查看译文
关键词
Event camera,multi-modal fusion,attention mechanism,semantic segmentation
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要