Generalist Multimodal AI: A Review of Architectures, Challenges and Opportunities
arXiv (2024)
Abstract
Multimodal models are expected to be a critical component to future advances
in artificial intelligence. This field is starting to grow rapidly with a surge
of new design elements motivated by the success of foundation models in natural
language processing (NLP) and vision. It is widely hoped that further extending
the foundation models to multiple modalities (e.g., text, image, video, sensor,
time series, graph, etc.) will ultimately lead to generalist multimodal models,
i.e. one model across different data modalities and tasks. However, there is
little research that systematically analyzes recent multimodal models
(particularly the ones that work beyond text and vision) with respect to the
underlying architectures proposed. Therefore, this work provides a fresh
perspective on generalist multimodal models (GMMs) via a novel taxonomy
specific to architecture and training configuration. This includes factors such
as Unifiability, Modularity, and Adaptability that are pertinent and essential
to the wide adoption and application of GMMs. The review further highlights key
challenges and prospects for the field and guides researchers toward new
advancements.