ROMOT: Referring-expression-comprehension open-set multi-object tracking

The Visual Computer(2024)

Cited 0|Views5
No score
Abstract
Traditional multi-object tracking is limited to tracking a predefined set of categories, whereas open-vocabulary tracking expands its capabilities to track novel categories. In this paper, we propose ROMOT (referring-expression-comprehension open-set multi-object tracking), which not only tracks objects from novel categories not included in the training data, but also enables tracking based on referring expression comprehension (REC). REC describes targets solely by their attributes, such as “the person running at the front” or “the bird flying in the air rather than on the ground,” making it particularly relevant for real-world multi-object tracking scenarios. Our ROMOT achieves this by harnessing the exceptional capabilities of a visual language model and employing multi-stage cross-modal attention to handle tracking for novel categories and REC tasks. Integrating RSM (reconstruction similarity metric) and OCM (observation-centric momentum) in our ROMOT eliminates the need for task-specific training, addressing the challenge of insufficient datasets. Our ROMOT enhances efficiency and adaptability in handling tracking requirements without relying on extensive tracking training data.
More
Translated text
Key words
Open-set,Referring expression comprehension,Detection,Tracking
AI Read Science
Must-Reading Tree
Example
Generate MRT to find the research sequence of this paper
Chat Paper
Summary is being generated by the instructions you defined