Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning
arXiv (2024)
Abstract
In medical multi-modal frameworks, the alignment of cross-modality
features presents a significant challenge. Existing works learn features
that are implicitly aligned from the data, without considering the
explicit relationships present in the medical context. This reliance on
data may lead to poor generalization of the learned alignment
relationships. In this work, we propose the Eye-gaze Guided Multi-modal
Alignment (EGMA) framework, which harnesses eye-gaze data for better
alignment of medical visual and textual features. We explore the natural
auxiliary role of radiologists' eye-gaze data in aligning medical images
and text, and introduce a novel approach that uses eye-gaze data
collected synchronously from radiologists during diagnostic evaluations.
We conduct downstream tasks of image classification and image-text
retrieval on four medical datasets, where EGMA achieves state-of-the-art
performance and stronger generalization across different datasets.
Additionally, we explore the impact of varying amounts of eye-gaze data
on model performance, highlighting the feasibility and utility of
integrating this auxiliary data into multi-modal alignment frameworks.
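
The abstract does not specify how eye-gaze data enters the alignment objective. Below is a minimal, purely illustrative sketch, assuming (this is not taken from the paper) that per-token gaze fixation maps supervise the model's token-to-patch attention. The function name gaze_alignment_loss, the tensor shapes, and the cross-entropy form are all hypothetical.

import torch
import torch.nn.functional as F

def gaze_alignment_loss(img_feats, txt_feats, gaze_map, temperature=0.07):
    # img_feats: (N, P, D)  L2-normalized image patch embeddings
    # txt_feats: (N, T, D)  L2-normalized report token embeddings
    # gaze_map:  (N, T, P)  fixation weights, normalized over patches:
    #            how long the radiologist looked at patch p while token t
    #            was being dictated
    # Cross-modal similarity between every token and every patch
    sim = torch.einsum('ntd,npd->ntp', txt_feats, img_feats) / temperature
    # Model's token-to-patch attention distribution
    log_attn = F.log_softmax(sim, dim=-1)
    # Cross-entropy pulling the model's attention toward the gaze distribution
    return -(gaze_map * log_attn).sum(dim=-1).mean()

In practice, such a gaze term would presumably be combined with a standard image-text contrastive objective, with the gaze supervision only shaping where the model attends.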