An Evaluation Framework for General Agents with Compositional Cognitive Abilities
Here are some example tasks from CocoaBench, showcasing the diverse reasoning challenges our benchmark presents.
We currently evaluate several leading commercial agent systems on CocoaBench-v0.1 (25 tasks, excluding the examples above). A more detailed breakdown is available on the leaderboard.
We present the model solutions for the 4 example tasks shown above. Explore how different models approached each task; click a result block to view the analysis and the raw response.
Shibo Hao*, Zhining Zhang*, Zhiqi Liang*, Tianyang Liu*, Zilong Wang*, Kun Zhou, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zhoujun Cheng, Yu Wang, Feng Yao, Licheng Liu, Ziqiao Ma, Hector Liu, Rupesh Srivastava, Julian McAuley, Jingbo Shang, Lianhui Qin, Zhiting Hu
(* core contributor)
We are continuously building and improving CocoaBench. CocoaBench is a community-driven benchmark, and we welcome contributions from researchers and practitioners with diverse backgrounds. If you've encountered a challenging real-world problem that pushed your limits, it might make a great benchmark task!
We've set up a streamlined task contribution protocol to guide you through creating and submitting new tasks. Contributors with 3 accepted tasks are eligible for co-authorship on the CocoaBench paper, which we plan to submit to a top-tier ML conference.
Have questions or ideas? Feel free to reach out to us or join our Discord community to propose new tasks or discuss ideas. (If the link doesn't work, try refreshing the page or manually add the server in the Discord app using invite code: ZDaDhVCd)
@misc{cocoabench2025,
  title={CocoaBench: An Evaluation Framework for General Agents with Compositional Cognitive Abilities},
  author={Shibo Hao and Zhining Zhang and Zhiqi Liang and Tianyang Liu and Zilong Wang and others},
  howpublished={Blog post},
  month={December},
  year={2025},
  url={https://cocoabench.github.io/}
}