Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
arxiv(2024)
Abstract
Training models with longer in-context lengths is a significant challenge for
multimodal model due to substantial GPU memory and computational costs. This
exploratory study does not present state-of-the-art models; rather, it
introduces an innovative method designed to increase in-context text length in
multi-modality large language models (MLLMs) efficiently. We present Visualized
In-Context Text Processing (VisInContext), which processes long in-context text
using visual tokens. This technique significantly reduces GPU memory usage and
floating point operations (FLOPs) for both training and inferenceing stage. For
instance, our method expands the pre-training in-context text length from 256
to 2048 tokens with nearly same FLOPs for a 56 billion parameter MOE model.
Experimental results demonstrate that model trained with VisInContext delivers
superior performance on common downstream benchmarks for in-context few-shot
evaluation. Additionally, VisInContext is complementary to existing methods for
increasing in-context text length and enhances document understanding
capabilities, showing great potential in document QA tasks and sequential
document retrieval.
MoreTranslated text
AI Read Science
Must-Reading Tree
Example
Generate MRT to find the research sequence of this paper
Chat Paper
Summary is being generated by the instructions you defined