GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
CoRR(2023)
Abstract
While the recent advances in Multimodal Large Language Models (MLLMs)
constitute a significant leap forward in the field, these models are
predominantly confined to the realm of input-side multimodal comprehension,
lacking the capacity for multimodal content generation. To fill this gap, we
present GPT4Video, a unified multimodal framework that empowers Large Language
Models (LLMs) with the capability of both video understanding and generation.
Specifically, we develop an instruction-following-based approach integrated
with the Stable Diffusion generative model, which has been shown to handle
video generation scenarios effectively and securely. GPT4Video offers
the following benefits: 1) It exhibits impressive capabilities in both video
understanding and generation scenarios. For example, GPT4Video outperforms
Valley by 11.8% on the Video Question Answering task, and surpasses NExT-GPT
by 2.3% on the Text-to-Video generation task. 2) It endows the LLM/MLLM with
video generation capabilities without requiring additional training parameters
and can flexibly interface with a wide range of models to perform video
generation. 3) It maintains a safe and healthy conversation not only on the
output side but also on the input side in an end-to-end manner. Quantitative and
qualitative experiments demonstrate that GPT4Video holds the potential to
function as an effective, safe, and humanoid-like video assistant that can handle
both video understanding and generation scenarios.