Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study

Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, Börje F. Karlsson, Bo An, Zongqing Lu

arXiv (2024)

Abstract
Recent studies have demonstrated the success of foundation agents in specific tasks or scenarios. However, existing agents cannot generalize across different scenarios, mainly due to their diverse observation and action spaces and semantic gaps, or reliance on task-specific resources. In this work, we propose the General Computer Control (GCC) setting: building foundation agents that can master any computer task by taking only screen images (and possibly audio) of the computer as input, and producing keyboard and mouse operations as output, similar to human-computer interaction. To target GCC, we propose Cradle, an agent framework with strong reasoning abilities, including self-reflection, task inference, and skill curation, to ensure generalizability and self-improvement across various tasks. To demonstrate the capabilities of Cradle, we deploy it in the complex AAA game Red Dead Redemption II, serving as a preliminary attempt towards GCC with a challenging target. Our agent can follow the main storyline and finish real missions in this complex AAA game, with minimal reliance on prior knowledge and application-specific resources. The project website is at https://baai-agents.github.io/Cradle/.
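The GCC setting described above fixes only the interface: screen images in, keyboard and mouse operations out. A minimal Python sketch of that interface follows. It is not Cradle's implementation; every name in it (Observation, KeyPress, MouseMove, MouseClick, Agent, run_episode, CenterClickAgent) is hypothetical, audio input is omitted, and the screen capture and input execution are passed in as plain callables so that no particular OS library is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Union


@dataclass(frozen=True)
class Observation:
    """One screen capture. The raw-bytes layout is an assumption for illustration."""
    width: int
    height: int
    pixels: bytes


@dataclass(frozen=True)
class KeyPress:
    key: str


@dataclass(frozen=True)
class MouseMove:
    x: int
    y: int


@dataclass(frozen=True)
class MouseClick:
    button: str = "left"


# The whole action space is keyboard and mouse operations, as in the GCC setting.
Action = Union[KeyPress, MouseMove, MouseClick]


class Agent(Protocol):
    """Anything that maps a screen observation to a list of input actions."""

    def act(self, obs: Observation) -> List[Action]: ...


def run_episode(
    agent: Agent,
    capture: Callable[[], Observation],
    execute: Callable[[Action], None],
    max_steps: int,
) -> List[Action]:
    """Observe, act, execute; stop early if the agent returns no actions."""
    log: List[Action] = []
    for _ in range(max_steps):
        obs = capture()
        actions = agent.act(obs)
        if not actions:
            break
        for action in actions:
            execute(action)
            log.append(action)
    return log


class CenterClickAgent:
    """Toy stand-in agent: always moves to the screen centre and clicks."""

    def act(self, obs: Observation) -> List[Action]:
        return [MouseMove(obs.width // 2, obs.height // 2), MouseClick()]
```

In a real deployment, `capture` would wrap a screenshot routine and `execute` would send events to the operating system. A reasoning-based agent such as the one the abstract describes would replace `CenterClickAgent` behind the same `act` signature.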