End-to-end Multi-task Learning Framework for Spatio-Temporal Grounding in Video Corpus

Conference on Information and Knowledge Management (2022)

Abstract
In this paper, we consider a novel task, Video Corpus Spatio-Temporal Grounding (VCSTG), for material selection and spatio-temporal adaptation in intelligent video editing. Given a text query depicting an object and a corpus of untrimmed and unsegmented videos, VCSTG aims to localize a sequence of spatio-temporal object tubes from the video corpus. Existing methods tackle the VCSTG task with a multi-stage approach that encodes the query and video representations independently for each task, which leads to locally optimal solutions. We propose a novel one-stage, multi-task learning framework named MTSTG for the VCSTG task. MTSTG learns unified query and video representations for the video retrieval, temporal grounding, and spatial grounding tasks. Video-level, frame-level, and object-level contrastive learning are introduced to measure the mutual information between query and video at different granularities. Comprehensive experiments demonstrate that the proposed framework outperforms state-of-the-art multi-stage methods on the VidSTG dataset.
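The abstract does not give the form of the contrastive objective. As a rough illustration only, the sketch below uses a symmetric InfoNCE loss, a common contrastive objective that lower-bounds mutual information; it is not confirmed to be the loss used in MTSTG. The function name `info_nce`, the `temperature` value, and the embedding shapes are all assumptions made for this example.

```python
import numpy as np

def info_nce(query_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of matched (query, video) pairs.

    Illustrative assumption, not the paper's stated loss.
    Row i of query_emb is the positive for row i of video_emb;
    all other rows in the batch act as negatives.
    """
    # L2-normalise so the dot product is a cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = q @ v.T / temperature  # shape (batch, batch)

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_prob))

    # Average the query-to-video and video-to-query directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Hypothetical usage: 8 query embeddings paired with 8 video embeddings.
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))
aligned_loss = info_nce(video.copy(), video)                  # perfectly matched pairs
shuffled_loss = info_nce(np.roll(video, 1, axis=0), video)    # every pair mismatched
```

Under this assumption, the same function could be applied to video-level, frame-level, and object-level embeddings and the resulting terms summed into one multi-task objective; how MTSTG weights or combines them is not stated in the abstract.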
Key words
Video Retrieval, Temporal Grounding, Spatial Grounding, Video Corpus Moment Retrieval