Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study
arXiv (2024)
Abstract
Recent studies have demonstrated the success of foundation agents in specific
tasks or scenarios. However, existing agents cannot generalize across different
scenarios, mainly due to their diverse observation and action spaces and
semantic gaps, or reliance on task-specific resources. In this work, we propose
the General Computer Control (GCC) setting: building foundation agents that can
master any computer task by taking only screen images (and possibly audio) of
the computer as input, and producing keyboard and mouse operations as output,
similar to human-computer interaction. To target GCC, we propose Cradle, an
agent framework with strong reasoning abilities, including self-reflection,
task inference, and skill curation, to ensure generalizability and
self-improvement across various tasks. To demonstrate the capabilities of
Cradle, we deploy it in the complex AAA game Red Dead Redemption II, serving as
a preliminary attempt towards GCC with a challenging target. Our agent can
follow the main storyline and finish real missions in the game, with minimal
reliance on prior knowledge and application-specific resources.
The project website is at https://baai-agents.github.io/Cradle/.
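
The GCC interface described above (screen images and optional audio in, keyboard and mouse operations out) can be illustrated with a minimal sketch. Every name here (Observation, KeyPress, MouseMove, MouseClick, RecordingEnv, ScriptedAgent, run_episode) is hypothetical and chosen for illustration; none of it is taken from the Cradle codebase, and the scripted agent merely stands in for the reasoning modules the abstract mentions.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Observation:
    """Input side of the assumed interface: a screenshot plus optional audio."""
    screen_rgb: bytes               # raw RGB pixel bytes of the screen capture
    width: int
    height: int
    audio: Optional[bytes] = None   # optional audio clip


@dataclass
class KeyPress:
    key: str                        # e.g. "w" or "enter"
    duration_s: float = 0.1


@dataclass
class MouseMove:
    dx: int                         # relative horizontal movement in pixels
    dy: int                         # relative vertical movement in pixels


@dataclass
class MouseClick:
    button: str = "left"


# Output side of the assumed interface: only keyboard and mouse operations.
Action = Union[KeyPress, MouseMove, MouseClick]


class RecordingEnv:
    """Stand-in environment: serves blank frames and records executed actions."""

    def __init__(self) -> None:
        self.executed: List[Action] = []
        self.frames_served = 0

    def observe(self) -> Observation:
        self.frames_served += 1
        return Observation(screen_rgb=bytes(4 * 4 * 3), width=4, height=4)

    def execute(self, action: Action) -> None:
        self.executed.append(action)


class ScriptedAgent:
    """Placeholder policy: replays a fixed plan instead of reasoning over pixels."""

    def __init__(self, plan: List[List[Action]]) -> None:
        self._plan = list(plan)

    def act(self, obs: Observation) -> List[Action]:
        return self._plan.pop(0) if self._plan else []


def run_episode(env: RecordingEnv, agent: ScriptedAgent, max_steps: int) -> int:
    """Observe, decide, act; stop when the agent returns no actions."""
    executed = 0
    for _ in range(max_steps):
        actions = agent.act(env.observe())
        if not actions:
            break
        for action in actions:
            env.execute(action)
            executed += 1
    return executed
```

In this sketch the agent never touches application internals or APIs: the environment exposes only `observe` and `execute`, which mirrors the constraint that a GCC agent interacts with software the way a human would.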