UniVS: Unified and Universal Video Segmentation with Prompts as Queries
CVPR 2024
Abstract
Despite the recent advances in unified image segmentation (IS), developing a
unified video segmentation (VS) model remains a challenge. This is mainly
because generic category-specified VS tasks need to detect all objects and
track them across consecutive frames, while prompt-guided VS tasks require
re-identifying the target with visual/text prompts throughout the entire video,
making it hard to handle the different tasks with the same architecture. We
make an attempt to address these issues and present a novel unified VS
architecture, namely UniVS, by using prompts as queries. UniVS averages the
prompt features of the target from previous frames as its initial query to
explicitly decode masks, and introduces a target-wise prompt cross-attention
layer in the mask decoder to integrate prompt features in the memory pool. By
taking the predicted masks of entities from previous frames as their visual
prompts, UniVS converts different VS tasks into prompt-guided target
segmentation, eliminating the heuristic inter-frame matching process. Our
framework not only unifies the different VS tasks but also naturally achieves
universal training and testing, ensuring robust performance across different
scenarios. UniVS shows a commendable balance between performance and
universality on 10 challenging VS benchmarks, covering video instance,
semantic, panoptic, object, and referring segmentation tasks. Code can be found
at .
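
The abstract states that UniVS averages a target's prompt features from previous frames to form its initial query. The snippet below is only a minimal sketch of that averaging step, not the authors' implementation: the function name `average_prompt_query`, the dictionary-based `memory_pool`, and the toy 2-dimensional feature vectors are all hypothetical, and plain Python lists stand in for the tensors a real model would use.

```python
from typing import Dict, List


def average_prompt_query(prompt_features: List[List[float]]) -> List[float]:
    """Average one target's per-frame prompt feature vectors into a single query.

    Hypothetical helper: each inner list is the prompt feature of the same
    target taken from one previous frame.
    """
    if not prompt_features:
        raise ValueError("need prompt features from at least one previous frame")
    dim = len(prompt_features[0])
    if any(len(feat) != dim for feat in prompt_features):
        raise ValueError("all prompt feature vectors must share one dimension")
    num_frames = len(prompt_features)
    # Element-wise mean over the frame axis.
    return [sum(feat[i] for feat in prompt_features) / num_frames for i in range(dim)]


# Hypothetical memory pool: target id -> prompt features from previous frames.
memory_pool: Dict[str, List[List[float]]] = {
    "target_0": [[1.0, 2.0], [3.0, 4.0]],
    "target_1": [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]],
}

# One initial query per target, as assumed input to a mask decoder.
initial_queries = {
    target_id: average_prompt_query(features)
    for target_id, features in memory_pool.items()
}
```

In this sketch each target keeps its own entry in the pool, so the query for a target depends only on that target's earlier prompts, which is consistent with the abstract's description of replacing heuristic inter-frame matching with prompt-guided segmentation.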