OV-NeRF: Open-vocabulary Neural Radiance Fields with Vision and Language Foundation Models for 3D Semantic Understanding
CoRR (2024)
Abstract
The development of Neural Radiance Fields (NeRFs) has provided a potent
representation for encapsulating the geometric and appearance characteristics
of 3D scenes. Enhancing the capabilities of NeRFs in open-vocabulary 3D
semantic perception tasks has been a recent focus. However, current methods
that extract semantics directly from Contrastive Language-Image Pretraining
(CLIP) for semantic field learning encounter difficulties due to noisy and
view-inconsistent semantics provided by CLIP. To tackle these limitations, we
propose OV-NeRF, which exploits the potential of pre-trained vision and
language foundation models to enhance semantic field learning through proposed
single-view and cross-view strategies. First, from the single-view perspective,
we introduce Region Semantic Ranking (RSR) regularization by leveraging 2D mask
proposals derived from SAM to rectify the noisy semantics of each training
view, facilitating accurate semantic field learning. Second, from the
cross-view perspective, we propose a Cross-view Self-enhancement (CSE) strategy
to address the challenge raised by view-inconsistent semantics. Rather than
invariably utilizing the 2D inconsistent semantics from CLIP, CSE leverages the
3D consistent semantics generated from the well-trained semantic field itself
for semantic field training, aiming to reduce ambiguity and enhance overall
semantic consistency across different views. Extensive experiments validate that
our OV-NeRF outperforms current state-of-the-art methods, achieving a significant
improvement of 20.31
respectively. Furthermore, our approach exhibits consistent superior results
across various CLIP configurations, further verifying its robustness.
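As a rough illustration of the single-view RSR idea (a simplified sketch, not the authors' implementation), the rectification of noisy per-pixel CLIP semantics can be viewed as snapping each SAM mask proposal to its top-ranked class: relevancy scores are aggregated inside each region and the whole region receives the winning label. The function and array shapes below are assumptions for illustration only.

```python
import numpy as np

def region_semantic_ranking(relevancy, masks):
    """Hypothetical sketch of RSR-style rectification.

    relevancy: (H, W, C) array of per-pixel class relevancy scores
               (e.g., derived from CLIP); noisy at the pixel level.
    masks:     list of (H, W) boolean arrays, one per SAM mask proposal.
    Returns a (H, W) integer label map where every pixel in a region
    shares the region's top-ranked class (-1 where no mask covers).
    """
    labels = np.full(relevancy.shape[:2], -1, dtype=int)
    for m in masks:
        # Aggregate scores over the region: mean relevancy per class.
        scores = relevancy[m].mean(axis=0)
        # Rank classes and assign the top one to the entire region,
        # overriding noisy per-pixel predictions inside the mask.
        labels[m] = int(np.argmax(scores))
    return labels
```

In this simplified view, region-level aggregation suppresses isolated pixel-level noise, which is the intuition the abstract attributes to leveraging SAM mask proposals for accurate semantic field learning.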