World Model on Million-Length Video And Language With Blockwise RingAttention
arXiv (2024)
Abstract
Current language models fall short in understanding aspects of the world not
easily described in words, and struggle with complex, long-form tasks. Video
sequences offer valuable temporal information absent in language and static
images, making them attractive for joint modeling with language. Such models
could develop an understanding of both human textual knowledge and the physical
world, enabling broader AI capabilities for assisting humans. However, learning
from millions of tokens of video and language sequences poses challenges due to
memory constraints, computational complexity, and limited datasets. To address
these challenges, we curate a large dataset of diverse videos and books,
utilize the Blockwise RingAttention technique to scalably train on long
sequences, and gradually increase context size from 4K to 1M tokens. This paper
makes the following contributions: (a) Largest context size neural network: We
train one of the largest context size transformers on long video and language
sequences, setting new benchmarks in difficult retrieval tasks and long video
understanding. (b) Solutions for overcoming vision-language training
challenges, including using masked sequence packing for mixing different
sequence lengths, loss weighting to balance language and vision, and
a model-generated QA dataset for long-sequence chat. (c) A highly optimized
implementation with RingAttention, Blockwise Transformers, masked sequence
packing, and other key features for training on million-length multimodal
sequences. (d) A fully open-sourced family of 7B-parameter models capable of
processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM,
LWM-Chat) of over 1M tokens. This work paves the way for training on massive
datasets of long video and language to develop understanding of both human
knowledge and the multimodal world, and broader capabilities.
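To make the Blockwise RingAttention idea mentioned in the abstract concrete, below is a minimal single-process sketch: each "device" keeps its query block fixed while key/value blocks rotate around a ring, and a numerically stable streaming softmax accumulates the result block by block. The block sizes, NumPy-only style, and function name are illustrative assumptions for exposition, not the paper's actual (distributed, JAX-based) implementation.

```python
# Minimal sketch of blockwise ring attention, assuming lists of per-"device"
# blocks. Queries stay local; KV blocks arrive one ring step at a time and are
# folded in with an online (flash-attention-style) softmax.
import numpy as np

def blockwise_ring_attention(q_blocks, k_blocks, v_blocks):
    """q_blocks, k_blocks, v_blocks: lists of [block_len, d] arrays, one per ring host."""
    n = len(q_blocks)
    outputs = []
    for i in range(n):
        q = q_blocks[i]
        d = q.shape[-1]
        m = np.full(q.shape[0], -np.inf)       # running row-max of logits
        l = np.zeros(q.shape[0])               # running softmax denominator
        acc = np.zeros_like(q)                 # running weighted sum of values
        for step in range(n):
            j = (i + step) % n                 # KV block arriving at this ring step
            s = q @ k_blocks[j].T / np.sqrt(d) # local attention logits
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)          # rescale old statistics
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=-1)
            acc = acc * scale[:, None] + p @ v_blocks[j]
            m = m_new
        outputs.append(acc / l[:, None])
    return np.concatenate(outputs, axis=0)

# Sanity check on a toy sequence: the blockwise result matches full attention.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4)); k = rng.normal(size=(8, 4)); v = rng.normal(size=(8, 4))
split = lambda x: [x[0:4], x[4:8]]
ring_out = blockwise_ring_attention(split(q), split(k), split(v))
logits = q @ k.T / np.sqrt(4)
weights = np.exp(logits - logits.max(-1, keepdims=True))
full_out = (weights / weights.sum(-1, keepdims=True)) @ v
assert np.allclose(ring_out, full_out, atol=1e-6)
```

Because each host only ever materializes one block of logits at a time, memory per host stays constant in sequence length, which is what allows context to grow from 4K toward 1M tokens; in the real distributed setting the inner loop overlaps KV communication with blockwise computation.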