Leveraging Text Representation and Face-head Tracking for Long-form Multimodal Semantic Relation Understanding

International Multimedia Conference (2022)

Abstract
In the intricate problem of understanding long-form multimodal inputs, a few key aspects of scene understanding and of dialogue and discourse are often overlooked. In this paper, we investigate two such aspects for better semantic and relational understanding: (i) head and object tracking in addition to the usual face tracking, and (ii) fusing scene-to-text representations with an external common-sense knowledge base for effective mapping to the sub-tasks of interest. Head tracking in particular helps enrich the otherwise sparse mapping of entities to inter-entity conversational interactions. These methods are guided by natural-language supervision on visual models, and perform well on interaction- and sentiment-understanding tasks.
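The abstract's point about head tracking supplementing face tracking can be illustrated with a minimal sketch. This is not the authors' code: the `iou` matching, the 0.3 threshold, and the `merge_tracks` helper are all illustrative assumptions. The idea shown is only that a head detector still localizes a person whose face is turned away, so merging head boxes into face-based entity tracks yields fewer gaps when mapping entities to conversational interactions.

```python
# Hypothetical sketch: merge per-frame face and head detections into
# entity records, so people facing away (face undetected) still appear.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def merge_tracks(face_boxes, head_boxes, thresh=0.3):
    """Attach each head box to the best-overlapping face box; unmatched
    head boxes become face-less entities (e.g. a person turned away)."""
    entities = [{"face": f, "head": None} for f in face_boxes]
    for h in head_boxes:
        faced = [e for e in entities if e["face"] is not None]
        best = max(faced, key=lambda e: iou(e["face"], h), default=None)
        if best is not None and iou(best["face"], h) >= thresh:
            best["head"] = h          # head confirms an existing face entity
        else:
            entities.append({"face": None, "head": h})  # head-only entity
    return entities

faces = [(10, 10, 50, 50)]                      # one detected face
heads = [(8, 5, 52, 55), (200, 10, 240, 60)]    # two detected heads
people = merge_tracks(faces, heads)
```

Here the first head box overlaps the face and is merged into the same entity, while the second head box (no face overlap) creates a new head-only entity, which is exactly the enrichment of sparse entity mapping the abstract describes.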