Tur[k]ingBench: A Challenge Benchmark for Web Agents
arXiv (2024)
Abstract
Recent chatbots have demonstrated an impressive ability to understand and
communicate in raw-text form. However, there is more to the world than raw
text. For example, people spend many hours on web pages, where
text is intertwined with other modalities and tasks are accomplished in the
form of various complex interactions. Can state-of-the-art multi-modal models
generalize to such complex domains?
To address this question, we introduce TurkingBench, a benchmark of tasks
formulated as web pages containing textual instructions with multi-modal
context. Unlike existing work, which employs artificially synthesized web pages,
here we use natural HTML pages that were originally designed for crowdsourcing
workers for various annotation purposes. The HTML instructions of each task are
also instantiated with various values (obtained from the crowdsourcing tasks)
to form new instances of the task. This benchmark contains 32.2K instances
distributed across 158 tasks.
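
As a rough illustration of how one HTML task page can yield many instances, the sketch below fills a template with per-instance values. It is not the benchmark's own code: the `${name}` placeholder syntax, the `instantiate` helper, and the sample template and rows are all assumptions made for this example.

```python
import html
from string import Template


def instantiate(template_html: str, row: dict[str, str]) -> str:
    """Fill one task template with the values of a single instance.

    Hypothetical helper: assumes placeholders are written as ${name}.
    """
    # Escape values so instance text cannot inject markup into the page.
    escaped = {key: html.escape(value) for key, value in row.items()}
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(template_html).safe_substitute(escaped)


# Illustrative template and rows (invented for this sketch).
TEMPLATE = (
    "<p>Is this review positive? ${review}</p>\n"
    '<input type="radio" name="label" value="yes">\n'
    '<input type="radio" name="label" value="no">'
)
rows = [{"review": "Great <b>value</b>"}, {"review": "Broke in a day"}]

# One task template, one instance per row of values.
instances = [instantiate(TEMPLATE, row) for row in rows]
```

Each row produces a separate page with the same instructions and input fields, which is how a small number of tasks can expand into a much larger number of instances.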
Additionally, to facilitate evaluation on TurkingBench, we develop an
evaluation framework that connects the responses of chatbots to modifications
on web pages (editing a text box, selecting a radio button, etc.). We evaluate the
performance of state-of-the-art models, including language-only, vision-only,
and layout-only models, and their combinations, on this benchmark. Our findings
reveal that these models perform significantly better than random chance, yet
considerable room exists for improvement. We hope this benchmark will
facilitate the evaluation and development of web-based agents.
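
To make the idea of connecting a model response to page modifications concrete, here is a minimal sketch that parses a response into actions and applies them to an in-memory form state. The JSON action format, the action names `modify_text` and `select_radio`, and the `FormState` class are assumptions for illustration only; the abstract does not specify the framework's actual interface.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FormState:
    """Hypothetical stand-in for the input fields of a task page."""
    text: dict[str, str] = field(default_factory=dict)
    radio: dict[str, str] = field(default_factory=dict)


def apply_actions(state: FormState, response: str) -> FormState:
    """Apply a model response (assumed to be a JSON list of actions)."""
    for action in json.loads(response):
        kind = action["action"]
        if kind == "modify_text":
            # Overwrite the contents of the named text box.
            state.text[action["name"]] = action["value"]
        elif kind == "select_radio":
            # Select one option within the named radio group.
            state.radio[action["name"]] = action["value"]
        else:
            raise ValueError(f"unsupported action: {kind}")
    return state


# Illustrative response (invented for this sketch).
response = json.dumps([
    {"action": "modify_text", "name": "summary", "value": "ok"},
    {"action": "select_radio", "name": "label", "value": "yes"},
])
state = apply_actions(FormState(), response)
```

The resulting field values could then be compared against the answers recorded from crowdsourcing workers to score a model, though the exact metric is not described here.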