ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers
arXiv (2024)
Abstract
3D occupancy, an advanced perception technology for driving scenarios,
represents the entire scene without distinguishing between foreground and
background by quantizing the physical space into a grid map. The widely
adopted projection-first deformable attention, efficient in transforming image
features into 3D representations, encounters challenges in aggregating
multi-view features due to sensor deployment constraints. To address this
issue, we propose a learning-first view attention mechanism for effective
multi-view feature aggregation. Moreover, we showcase the scalability of our
view attention across diverse multi-view 3D tasks, such as map construction and
3D object detection. Leveraging the proposed view attention as well as an
additional multi-frame streaming temporal attention, we introduce ViewFormer, a
vision-centric transformer-based framework for spatiotemporal feature
aggregation. To further explore occupancy-level flow representation, we present
FlowOcc3D, a benchmark built on top of existing high-quality datasets.
Qualitative and quantitative analyses on this benchmark reveal the potential to
represent fine-grained dynamic scenes. Extensive experiments show that our
approach significantly outperforms prior state-of-the-art methods. The code
and benchmark will be released soon.
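
The occupancy representation described in the abstract, in which physical space is divided into a grid map, can be illustrated with a minimal sketch. This is not the ViewFormer pipeline, which predicts occupancy from multi-view images; it only shows how 3D positions map to voxel indices. The function name `voxelize`, the grid origin, the voxel size, and the grid shape are all hypothetical values chosen for illustration.

```python
import math


def voxelize(points, range_min, voxel_size, grid_shape):
    """Return the set of voxel indices that contain at least one 3D point.

    All parameters are illustrative assumptions, not values from the paper.
    """
    occupied = set()
    for point in points:
        # Quantize each coordinate to an integer voxel index.
        index = tuple(
            math.floor((coord - lo) / voxel_size)
            for coord, lo in zip(point, range_min)
        )
        # Ignore points that fall outside the grid.
        if all(0 <= i < n for i, n in zip(index, grid_shape)):
            occupied.add(index)
    return occupied


# Hypothetical points in metres; the last one lies outside the grid.
points = [(0.1, 0.2, 0.3), (0.15, 0.25, 0.35), (1.2, -0.7, 0.1), (99.0, 0.0, 0.0)]
grid = voxelize(
    points,
    range_min=(-2.0, -2.0, 0.0),
    voxel_size=0.5,
    grid_shape=(8, 8, 4),
)
```

Because every occupied cell is treated the same way, such a grid does not distinguish foreground objects from background structure, which matches the property the abstract highlights.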