Multi-Frame, Lightweight Efficient Vision-Language Models for Question Answering in Autonomous Driving
CoRR (2024)
Abstract
Vision-Language Models (VLMs) and Multi-Modal Language Models (MMLMs) have
become prominent in autonomous driving research, as these models can provide
interpretable textual reasoning and responses for end-to-end autonomous driving
safety tasks using traffic-scene images and other data modalities. However,
current approaches rely on expensive large language model (LLM) backbones and
image encoders, making such systems unsuitable for real-time autonomous driving,
which operates under tight memory constraints and requires fast inference. To
address these limitations, we develop EM-VLM4AD, an efficient, lightweight,
multi-frame vision-language model that performs Visual Question Answering for
autonomous driving. Compared to previous approaches, EM-VLM4AD requires at
least 10 times less memory and floating-point operations, while also achieving
higher BLEU-4, METEOR, CIDEr, and ROUGE scores than the existing baseline on
the DriveLM dataset. EM-VLM4AD also extracts prompt-relevant information from
multi-view traffic scenes and can answer questions for various autonomous
driving subtasks. We release our code to train and evaluate our model at
https://github.com/akshaygopalkr/EM-VLM4AD.
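
To make the evaluation metrics named above concrete, the following is a minimal, hypothetical sketch (not the authors' released evaluation harness) of how a single predicted answer could be scored against a DriveLM-style reference answer. It assumes answers are plain strings and uses the nltk and rouge-score packages for BLEU-4 and ROUGE-L; METEOR and CIDEr would require additional packages and are omitted here.

```python
# Sketch only: scores one predicted answer against one reference answer
# with BLEU-4 (nltk) and ROUGE-L (rouge-score). The QA pair below is a
# made-up example in the style of DriveLM perception questions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer


def score_answer(prediction: str, reference: str) -> dict:
    """Compute BLEU-4 and ROUGE-L F1 for a single predicted answer."""
    # Smoothing avoids zero BLEU on short answers with missing 4-grams.
    smooth = SmoothingFunction().method1
    bleu4 = sentence_bleu(
        [reference.split()],
        prediction.split(),
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=smooth,
    )
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, prediction)["rougeL"].fmeasure
    return {"bleu4": bleu4, "rougeL": rouge_l}


print(score_answer(
    prediction="the pedestrian is crossing in front of the ego vehicle",
    reference="a pedestrian is crossing the road in front of the ego vehicle",
))
```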