Hierarchical Multi-Modal Attention Network for Time-Sync Comment Video Recommendation

Weihao Zhao, Han Wu, Weidong He, Haoyang Bi, Hao Wang, Chen Zhu, Tong Xu, Enhong Chen

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (2024)

Abstract
Due to their inherent interactivity, time-sync comments on videos have attracted increasing attention and have been widely adopted by online video platforms. In addition to enhancing user engagement, time-sync comments provide abundant semantic information that can greatly improve video understanding, which, however, is largely overlooked in mainstream video recommender systems. To address this issue, we propose a Hierarchical Multi-modal Attention Network (HMAN) to effectively utilize time-sync comments for recommendation. Specifically, we design a Multi-level Text Condense (MTC) Module to capture the accurate semantics of time-sync comments via text-level and vision-level condense operations. Then we propose a Range Convolution Block (RCB) to capture both visual and textual information from variable-length event segments by leveraging a variable receptive field. After that, we design a Hierarchical Multi-modal Branch Fusion (HMBF) Module to obtain a comprehensive multi-modal representation of the time-sync comment video. Finally, with the obtained video representation, recommendation scores are computed as its inner product with the user embedding. Extensive experiments demonstrate the effectiveness of the proposed HMAN, and ablation studies on different variants of HMAN further validate the utility of each component and the necessity of the hierarchical multi-modal branch fusion method.
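
The abstract's final scoring step (matching a learned video representation against a user embedding via an inner product) can be illustrated with a minimal PyTorch-style sketch. The class name, dimensions, and the assumption that the video representation is produced upstream by the MTC/RCB/HMBF modules are illustrative assumptions, not the authors' released code.

# Minimal sketch, assuming the HMAN branches have already produced a video
# representation vector; only the inner-product scoring step is shown.
import torch
import torch.nn as nn

class HMANScorer(nn.Module):
    def __init__(self, num_users: int, dim: int = 128):
        super().__init__()
        # Learnable user embeddings; video representations are assumed to come
        # from the (hypothetical) hierarchical multi-modal branches of HMAN.
        self.user_emb = nn.Embedding(num_users, dim)

    def forward(self, user_ids: torch.Tensor, video_repr: torch.Tensor) -> torch.Tensor:
        # Recommendation score = inner product of user embedding and video representation.
        u = self.user_emb(user_ids)             # (batch, dim)
        return (u * video_repr).sum(dim=-1)     # (batch,)

# Usage sketch with random video representations standing in for HMAN outputs.
scorer = HMANScorer(num_users=1000)
scores = scorer(torch.tensor([3, 7]), torch.randn(2, 128))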
Keywords
Time-sync comment videos, multi-modal representation, video recommendation