GRAM: Global Reasoning for Multi-Page VQA
CoRR (2024)
Abstract
The increasing use of transformer-based large language models brings forward
the challenge of processing long sequences. In document visual question
answering (DocVQA), leading methods focus on the single-page setting, while
documents can span hundreds of pages. We present GRAM, a method that seamlessly
extends pre-trained single-page models to the multi-page setting, without
requiring computationally-heavy pretraining. To do so, we leverage a
single-page encoder for local page-level understanding, and enhance it with
document-level designated layers and learnable tokens, facilitating the flow of
information across pages for global reasoning. To encourage the model to
utilize the newly introduced document-level tokens, we propose a tailored bias
adaptation method. For additional computational savings during decoding, we
introduce an optional compression stage using our C-Former model, which reduces
the encoded sequence length, thereby allowing a tradeoff between quality and
latency. Extensive experiments showcase GRAM's state-of-the-art performance on
the benchmarks for multi-page DocVQA, demonstrating the effectiveness of our
approach.
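The local–global design described in the abstract can be sketched in a minimal, hypothetical form. This is not the paper's code: all names, shapes, and the single-head attention are assumptions for illustration. Each page is first processed independently (page-level, local), and then only the learnable document-level tokens from all pages attend to one another (document-level, global), which lets information flow across pages at a cost far below full cross-page attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # plain single-head scaled dot-product attention
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
num_pages, page_len, num_doc_tokens, dim = 3, 8, 2, 16

# per-page token embeddings, plus learnable document-level tokens per page
pages = rng.normal(size=(num_pages, page_len, dim))
doc_tokens = rng.normal(size=(num_pages, num_doc_tokens, dim))

# local step: within each page, its tokens and its doc tokens attend jointly
local_out = []
for p in range(num_pages):
    x = np.concatenate([doc_tokens[p], pages[p]], axis=0)
    local_out.append(attention(x, x, x))
local_out = np.stack(local_out)  # (num_pages, num_doc_tokens + page_len, dim)

# global step: only the doc tokens from all pages attend to one another,
# propagating information across pages through a short sequence
g = local_out[:, :num_doc_tokens].reshape(num_pages * num_doc_tokens, dim)
g = attention(g, g, g).reshape(num_pages, num_doc_tokens, dim)
local_out[:, :num_doc_tokens] = g

print(local_out.shape)  # (3, 10, 16)
```

Interleaving such local and global steps keeps the quadratic attention cost bounded by the page length plus a handful of document tokens, rather than the full multi-page sequence.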