Sparse Multimodal Vision Transformer for Weakly Supervised Semantic Segmentation.

CVPR Workshops (2023)

Abstract
Vision Transformers have proven their versatility and utility for complex computer vision tasks, such as land cover segmentation in remote sensing applications. While performing on par with or even outperforming other methods such as Convolutional Neural Networks (CNNs), Transformers tend to require even larger datasets with fine-grained annotations (e.g., pixel-level labels for land cover segmentation). To overcome this limitation, we propose a weakly-supervised vision Transformer that leverages image-level labels to learn a semantic segmentation task, reducing the human annotation load. We achieve this by slightly modifying the architecture of the vision Transformer: gating units in each attention head enforce sparsity during training, thereby retaining only the most meaningful heads. This allows us to infer pixel-level labels directly from image-level labels by post-processing the un-pruned attention heads of the model and refining our predictions by iteratively training a high-fidelity segmentation model. Training and evaluation on the DFC2020 dataset show that our method not only generates high-quality segmentation masks using image-level labels, but also performs on par with fully-supervised training relying on pixel-level labels. Finally, our results show that our method can perform weakly-supervised semantic segmentation even on small-scale datasets.
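The abstract's core mechanism — a learnable gate on each attention head that is driven toward zero by a sparsity penalty, so uninformative heads are effectively pruned — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the class name, the sigmoid parameterization of the gates, and the L1-style penalty are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn


class GatedMultiheadSelfAttention(nn.Module):
    """Multi-head self-attention with one learnable scalar gate per head.

    Hedged sketch: each head's output is scaled by a gate in (0, 1);
    adding `sparsity_penalty()` to the training loss pushes gates toward
    zero, so heads with near-zero gates can be pruned after training.
    """

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # One gate logit per head; sigmoid keeps the gate in (0, 1).
        self.gate_logits = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = attn @ v  # (b, heads, n, head_dim)
        gates = torch.sigmoid(self.gate_logits)  # (heads,)
        out = out * gates.view(1, -1, 1, 1)  # scale each head's contribution
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

    def sparsity_penalty(self) -> torch.Tensor:
        # L1-style term added to the loss to encourage head sparsity.
        return torch.sigmoid(self.gate_logits).sum()
```

After training, the surviving (un-pruned) heads' attention maps are what the paper post-processes into pixel-level pseudo-labels.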
Keywords
attention head, complex computer vision tasks, Convolutional Neural Networks, fine-grained annotations, fully-supervised training, high-quality segmentation masks, human annotation load, image-level labels, land cover segmentation, pixel-level labels, remote sensing applications, segmentation model, semantic segmentation task, sparse multimodal vision Transformer, un-pruned attention heads, Vision Transformers, weakly-supervised semantic segmentation, weakly-supervised vision Transformer