MVPN: Multi-granularity visual prompt-guided fusion network for multimodal named entity recognition

Multimedia Tools and Applications (2024)

Abstract
Multimodal named entity recognition (MNER) aims to identify entity spans and recognize their categories in social media posts with the aid of images. Previous work on MNER often relies on an attention mechanism to model the interactions between image and text representations. However, the feature representations of the different modalities are inconsistent, which makes image-text interaction difficult to model. To address this issue, we propose multi-granularity visual contexts that align image features into the textual space for text-text interactions, so that the attention mechanism in pre-trained textual embeddings can be better utilized. Multi-granularity visual information helps establish more accurate and thorough connections between image pixels and linguistic semantics. Specifically, we first extract the global image caption as the coarse-grained visual context and dense image captions as the fine-grained visual contexts. Then, we treat images as signals with sparse semantic density for image-text interactions and image captions as dense semantic signals for text-text interactions. To alleviate the bias caused by visual noise and inaccurate alignment, we further design a dynamic filter network that filters visual noise and dynamically allocates visual information for modality fusion. Meanwhile, we propose a novel multi-granularity visual prompt-guided fusion network to make modality fusion more robust. Extensive experiments on three MNER datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
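The abstract does not spell out how the dynamic filter network is formulated. As a rough illustration only, one common way to filter noisy visual features before fusion is a learned sigmoid gate, sketched below. The function name `gated_fusion`, the parameters `gate_weights` and `gate_bias`, and the additive fusion rule are all assumptions made for this example and are not taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    text_vec: Vector,
    visual_vec: Vector,
    gate_weights: List[Vector],
    gate_bias: Vector,
) -> Vector:
    """Fuse a text vector with a visual vector through a per-dimension gate.

    Hypothetical formulation (an assumption, not the paper's method):
        gate  = sigmoid(W @ [text; visual] + b)
        fused = text + gate * visual
    A gate near 0 suppresses the visual feature in that dimension,
    a gate near 1 passes it through.
    """
    if len(text_vec) != len(visual_vec):
        raise ValueError("text and visual vectors must have the same size")

    concat = text_vec + visual_vec  # list concatenation: [text; visual]
    fused = []
    for i, (t, v) in enumerate(zip(text_vec, visual_vec)):
        # One gate row per output dimension, applied to the concatenation.
        score = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        fused.append(t + sigmoid(score) * v)
    return fused


if __name__ == "__main__":
    text = [1.0, 2.0]
    visual = [2.0, -4.0]
    weights = [[0.0] * 4, [0.0] * 4]  # untrained gate: every score is the bias
    print(gated_fusion(text, visual, weights, [0.0, 0.0]))
```

In a trained model the gate parameters would be learned jointly with the rest of the network, so that captions or image regions unrelated to the text receive gates close to zero.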
Key words
Multi-granularity, Multimodal named entity recognition, Prompt-guided, Visual context