Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery
CoRR (2024)
Abstract
Recent advancements in Surgical Visual Question Answering (Surgical-VQA) and
related region grounding have shown great promise for robotic and medical
applications, addressing the critical need for automated methods in
personalized surgical mentorship. However, existing models primarily provide
simple structured answers and struggle with complex scenarios due to their
limited capability in recognizing long-range dependencies and aligning
multimodal information. In this paper, we introduce Surgical-LVLM, a novel
personalized large vision-language model tailored for complex surgical
scenarios. Leveraging a pre-trained large vision-language model and
specialized Visual Perception LoRA (VP-LoRA) blocks, our model excels at
understanding complex vision-language tasks within surgical contexts. To
address the visual grounding task, we propose the Token-Interaction (TIT)
module, which strengthens the interaction between the grounding module and the
language responses of the large vision-language model (LVLM) after projecting
them into the latent space. We demonstrate the effectiveness of Surgical-LVLM
on several benchmarks, including EndoVis-17-VQLA, EndoVis-18-VQLA, and a newly
introduced EndoVis Conversations dataset, on which our model sets new
performance standards.
Our work contributes to advancing the field of automated surgical mentorship by
providing a context-aware solution.
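
The abstract names two architectural components: VP-LoRA blocks that adapt a frozen pre-trained LVLM, and a Token-Interaction (TIT) module that projects the LVLM's language responses into a latent space where the grounding module interacts with them. The abstract gives no implementation details, so the sketch below is only a minimal, hypothetical PyTorch illustration of how such components are commonly wired; all class names, dimensions, and the box-regression head are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VPLoRALinear(nn.Module):
    """Hypothetical VP-LoRA-style block: a frozen pre-trained linear layer
    augmented with a trainable low-rank update, as is standard when adapting
    a large vision-language model with LoRA."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pre-trained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adaptation starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class TokenInteraction(nn.Module):
    """Hypothetical Token-Interaction (TIT) sketch: project the LVLM's language
    tokens into a shared latent space and let grounding queries cross-attend
    to them before predicting boxes."""

    def __init__(self, lang_dim: int, ground_dim: int,
                 latent_dim: int = 256, heads: int = 8):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, latent_dim)
        self.ground_proj = nn.Linear(ground_dim, latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)
        self.box_head = nn.Linear(latent_dim, 4)      # (cx, cy, w, h) per query

    def forward(self, lang_tokens, ground_queries):
        lang = self.lang_proj(lang_tokens)            # (B, L, latent_dim)
        queries = self.ground_proj(ground_queries)    # (B, Q, latent_dim)
        fused, _ = self.cross_attn(queries, lang, lang)
        return self.box_head(fused).sigmoid()         # normalized boxes in [0, 1]


if __name__ == "__main__":
    lang_tokens = torch.randn(2, 32, 768)             # dummy LVLM language responses
    ground_queries = torch.randn(2, 4, 512)           # dummy grounding-module queries

    vp_lora = VPLoRALinear(nn.Linear(768, 768))       # adapt one frozen projection
    print(vp_lora(lang_tokens).shape)                 # torch.Size([2, 32, 768])

    tit = TokenInteraction(lang_dim=768, ground_dim=512)
    print(tit(lang_tokens, ground_queries).shape)     # torch.Size([2, 4, 4])
```

The cross-attention direction (grounding queries attending to projected language tokens) is one plausible reading of "strengthens the interaction between the grounding module and the language responses"; the paper may realize this interaction differently.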