Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning
CVPR 2024
Abstract
Zero-shot learning (ZSL) recognizes unseen classes by conducting
visual-semantic interactions that transfer semantic knowledge from seen classes
to unseen ones, supported by semantic information (e.g., attributes). However,
existing ZSL methods simply extract visual features with a pre-trained network
backbone (i.e., a CNN or ViT); lacking the guidance of semantic information,
such backbones fail to learn matched visual-semantic correspondences for
representing semantic-related visual features, resulting in undesirable
visual-semantic interactions. To tackle this issue, we propose a progressive
semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT).
ZSLViT pursues two properties throughout the network: i) explicitly
discovering semantic-related visual representations, and ii) discarding
semantic-unrelated visual information. Specifically, we first introduce
semantic-embedded token learning, which improves the visual-semantic
correspondences via semantic enhancement and explicitly discovers the
semantic-related visual tokens with semantic-guided token attention. Then, we
fuse the visual tokens with low visual-semantic correspondence to discard the
semantic-unrelated visual information for visual enhancement. These two
operations are integrated into various encoders to progressively learn
semantic-related visual representations for accurate visual-semantic
interactions in ZSL. Extensive experiments show that ZSLViT achieves
significant performance gains on three popular benchmark datasets, i.e., CUB,
SUN, and AWA2.
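To make the two operations concrete, below is a minimal PyTorch sketch of one semantic-guided encoder block under stated assumptions; it is an illustration, not the authors' implementation. The module name `SemanticGuidedBlock`, the cosine-similarity token scoring, the `keep_ratio` hyperparameter, and the mean-based token fusion are all placeholders for the paper's actual semantic-guided token attention and fusion rules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticGuidedBlock(nn.Module):
    """One encoder stage combining the two operations from the abstract:
    score each visual token against an embedded semantic vector, keep the
    high-scoring (semantic-related) tokens, and fuse the low-scoring ones
    into a single summary token instead of dropping them outright."""

    def __init__(self, dim: int, attr_dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Semantic enhancement: embed class attributes into the token space.
        self.attr_proj = nn.Linear(attr_dim, dim)
        # A plain transformer layer stands in for the paper's encoder.
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True
        )

    def forward(self, tokens: torch.Tensor, attrs: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) patch tokens; attrs: (B, attr_dim) attributes.
        sem = self.attr_proj(attrs).unsqueeze(1)  # (B, 1, dim)
        # Semantic-guided token attention, approximated here by cosine
        # similarity between each visual token and the semantic embedding.
        scores = (F.normalize(tokens, dim=-1) * F.normalize(sem, dim=-1)).sum(-1)

        n_keep = max(1, int(tokens.size(1) * self.keep_ratio))
        order = scores.argsort(dim=1, descending=True)  # (B, N)
        ranked = torch.gather(
            tokens, 1, order.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        )
        kept, rest = ranked[:, :n_keep], ranked[:, n_keep:]
        if rest.size(1) > 0:
            # Visual enhancement: merge the semantic-unrelated tokens into
            # one token (a plain mean; the paper's fusion rule may differ).
            kept = torch.cat([kept, rest.mean(dim=1, keepdim=True)], dim=1)
        return self.encoder(kept)


# Stacking such blocks shrinks the token set progressively, e.g.
# 196 -> 138 -> 97 tokens with keep_ratio=0.7.
block = SemanticGuidedBlock(dim=768, attr_dim=312)  # 312 attributes, as in CUB
out = block(torch.randn(2, 196, 768), torch.randn(2, 312))
print(out.shape)  # torch.Size([2, 138, 768]): 137 kept + 1 fused token
```

Fusing the low-correspondence tokens, rather than deleting them, keeps background evidence available to later layers while still steering attention toward attribute-bearing regions; this matches the abstract's "discard by fusion" framing.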