An Evaluation Framework for General Agents with Compositional Cognitive Abilities
Here are some example tasks from CocoaBench, showcasing the diverse reasoning challenges our benchmark presents.
We currently evaluate several leading commercial agent systems on CocoaBench-v0.1 (25 tasks, excluding the examples above). A more detailed breakdown is available on the leaderboard.
We present the model solutions for the 4 example tasks shown above. Explore how different models approached each task; click a result block to view the analysis and the raw response.
Shibo Hao*, Zhining Zhang*, Zhiqi Liang*, Tianyang Liu*, Zilong Wang*, Kun Zhou, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zhoujun Cheng, Yu Wang, Feng Yao, Licheng Liu, Ziqiao Ma, Hector Liu, Rupesh Srivastava, Julian McAuley, Jingbo Shang, Lianhui Qin, Zhiting Hu
(* core contributor)
We are continuously building and improving CocoaBench. CocoaBench is a community-driven benchmark, and we welcome contributions from researchers and practitioners with diverse backgrounds. If you've encountered a challenging real-world problem that pushed your limits, it might make a great benchmark task!
We've set up a streamlined task contribution protocol to guide you through creating and submitting new tasks. Contributors with 3 accepted tasks are eligible for co-authorship on the CocoaBench paper, which we plan to submit to a top-tier ML conference.
Have questions or ideas? Feel free to reach out to us or join our Discord community to propose new tasks or discuss ideas. (If the link doesn't work, try refreshing the page or manually add the server in the Discord app using invite code: ZDaDhVCd)
@misc{cocoabench2025,
  title={CocoaBench: An Evaluation Framework for General Agents with Compositional Cognitive Abilities},
  author={Shibo Hao and Zhining Zhang and Zhiqi Liang and Tianyang Liu and Zilong Wang and others},
  howpublished={Blog post},
  month={December},
  year={2025},
  url={https://cocoabench.github.io/}
}