Lumos : Empowering Multimodal LLMs with Scene Text Recognition
CoRR(2024)
摘要
We introduce Lumos, the first end-to-end multimodal question-answering system
with text understanding capabilities. At the core of Lumos is a Scene Text
Recognition (STR) component that extracts text from first person point-of-view
images, the output of which is used to augment input to a Multimodal Large
Language Model (MM-LLM). While building Lumos, we encountered numerous
challenges related to STR quality, overall latency, and model inference. In
this paper, we delve into those challenges, and discuss the system
architecture, design choices, and modeling techniques employed to overcome
these obstacles. We also provide a comprehensive evaluation for each component,
showcasing high quality and efficiency.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要