Spatial Hierarchical Attention Network Based Video-Guided Machine Translation

SSRN Electronic Journal(2022)

引用 0|浏览7
暂无评分
摘要
Video-guided machine translation, as one type of multimodal machine translation, aims to engage video contents as auxiliary information to address the word sense ambiguity problem in machine translation. Previous studies only use features from pre-trained action detection models as motion representations of the video to solve the verb sense ambiguity and neglect the noun sense ambiguity problem. To address this, we propose a video-guided machine translation system using both spatial and motion representations. For the spatial part, we propose a hierarchical attention network to model the spatial information from object-level to video-level. We investigate and discuss spatial features extracted from objects with pre-trained convolutional neural network models and spatial concept features extracted from object labels and attributes with pre-trained language models. We further investigate spatial feature filtering by referring to corresponding source sentences. Experiments on the VATEX dataset show that our system achieves a 35.86 BLEU-4 score, which is 0.51 score higher than the single model of the SOTA method. Experiments on the How2 dataset further verify the generalization ability of our proposed system.
更多
查看译文
关键词
machine translation,attention,spatial,video-guided
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要