Dual-path temporal map optimization for make-up temporal video grounding

Multimedia Systems (2024)

Abstract
Make-up temporal video grounding (MTVG) aims to localize the target video segment that is semantically related to a sentence describing a make-up activity in a make-up video. Compared with general video grounding, MTVG focuses on meticulous actions and changes on the face. A make-up instruction step usually involves detailed differences in products and facial areas, so it is more fine-grained than general activities (e.g., cooking and furniture assembly). Existing general approaches may therefore fail to locate the target activity accurately, because they lack the fine-grained semantic cues needed to understand make-up content. To tackle this issue, we propose an effective proposal-based framework named Dual-Path Temporal Map Optimization Network to capture fine-grained multimodal semantic details of make-up activities. We extract both query-agnostic and query-guided features to construct two proposal sets and use a specific evaluation method for each set. Unlike the single structure commonly used in previous methods, our dual-path structure can mine more semantic information from make-up videos and distinguish fine-grained actions well. The two candidate sets represent the cross-modal make-up video-text similarity and the multi-modal fusion relationship, respectively, and complement each other. The joint prediction of these sets therefore improves the accuracy of video timestamp prediction. Comprehensive experiments on the YouMakeup dataset demonstrate that our proposed dual structure excels in fine-grained semantic comprehension. The source code will be available at: https://github.com/lijiaxiuHFUT/DPTMO .
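
The abstract does not say how the two proposal sets are jointly scored, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each path yields a 2D temporal map in which cell (s, e) scores the proposal spanning clips s through e, and that the two maps are combined by a weighted sum. The names `fuse_maps`, `best_segment`, `alpha`, and `clip_seconds` are hypothetical and do not come from the paper or its repository.

```python
from typing import List, Optional, Tuple

# A 2D temporal map: entry [s][e] scores the proposal covering clips s..e.
# Cells with s > e are not valid proposals and are stored as None.
ScoreMap = List[List[Optional[float]]]


def fuse_maps(sim_map: ScoreMap, fusion_map: ScoreMap, alpha: float = 0.5) -> ScoreMap:
    """Combine two proposal score maps with a weighted sum (an assumed rule)."""
    n = len(sim_map)
    fused: ScoreMap = [[None] * n for _ in range(n)]
    for s in range(n):
        for e in range(s, n):  # only the upper triangle holds valid proposals
            fused[s][e] = alpha * sim_map[s][e] + (1.0 - alpha) * fusion_map[s][e]
    return fused


def best_segment(fused: ScoreMap, clip_seconds: float) -> Tuple[float, float]:
    """Return (start, end) in seconds of the highest-scoring proposal."""
    best_score = None
    best_cell = (0, 0)
    for s, row in enumerate(fused):
        for e in range(s, len(row)):
            score = row[e]
            if best_score is None or score > best_score:
                best_score = score
                best_cell = (s, e)
    s, e = best_cell
    # Clip e is included in the proposal, so the segment ends at (e + 1) clips.
    return s * clip_seconds, (e + 1) * clip_seconds
```

With three clips of two seconds each, calling `best_segment(fuse_maps(sim_map, fusion_map), 2.0)` returns the start and end time of whichever proposal has the highest combined score. A learned fusion layer could replace the fixed `alpha` without changing the rest of the sketch.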
Keywords
Video understanding, Make-up temporal video grounding, Proposal generation, 2D temporal map