
LLM Squid Game

Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models

¹Shanghai Jiao Tong University, ²Shanghai AI Laboratory
*Corresponding author

Abstract: We introduce SQUID GAME, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric-information settings, designed to evaluate LLMs through interactive gameplay against other LLM opponents. Notably, SQUID GAME consists of six elimination-style levels that target multi-faceted abilities such as instruction following, coding, reasoning, planning, and safety alignment. We evaluate over 50 LLMs on SQUID GAME, presenting the largest behavioral evaluation study of general LLMs in dynamic adversarial scenarios. We observe a clear generational phase transition in performance within the same model lineage and find evidence that some models resort to speculative shortcuts to win the game, indicating possible higher-level evaluation-paradigm contamination in static benchmarks. Furthermore, we compare prominent LLM benchmarks with SQUID GAME through correlation analyses, highlighting that dynamic evaluation can serve as a complement to static evaluation.

Design Principles

Elimination rather than Score

  • Instead of seeking the theoretically best model, we adopt a Battle Royale-style relative ranking system in which a certain percentage of the models may be eliminated in each round. The difficulty of subsequent challenges therefore increases with both the game settings and the surviving field itself. As a result, a high ranking reflects not only superior capability but also a tactical advantage (e.g., fewer errors, better strategies) over specific opponents.
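
The elimination mechanism above can be sketched as follows; `run_tournament`, the per-level scoring callbacks, and `elim_frac` are illustrative names we invented, not the authors' released code:

```python
def run_tournament(models, levels, elim_frac=0.3):
    """Battle Royale-style ranking sketch: after each level, the
    bottom fraction of the current field (by per-level score) is
    eliminated, so later levels are contested only by survivors."""
    survivors = list(models)
    for play_level in levels:
        # play_level is a stand-in for running one game against
        # the current field and returning this model's score
        scores = {m: play_level(m, survivors) for m in survivors}
        ranked = sorted(survivors, key=scores.get, reverse=True)
        n_keep = max(1, int(len(ranked) * (1 - elim_frac)))
        survivors = ranked[:n_keep]  # difficulty rises with the field
    return survivors
```

Because scores are computed against the surviving field, a model's final rank reflects its edge over specific opponents rather than an absolute score.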

Resource Constraint

  • We add resource constraints (e.g., token or API-call quotas) to SQUID GAME to measure token utilization and inference efficiency. More importantly, this mechanism exerts pressure on the model by providing live feedback on its diminishing resource pool.
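
A minimal sketch of such a quota mechanism; the `TokenBudget` class and its feedback message are illustrative assumptions, not the repository's API:

```python
class TokenBudget:
    """Hypothetical per-game resource pool: every generation call
    draws down a shared token quota, and the remaining balance can
    be fed back to the model in its next prompt (live pressure)."""

    def __init__(self, quota):
        self.quota = quota
        self.used = 0

    def charge(self, n_tokens):
        # exceeding the quota ends the model's run
        if self.used + n_tokens > self.quota:
            raise RuntimeError("quota exhausted: model is eliminated")
        self.used += n_tokens

    def feedback(self):
        # live status string injected into the next prompt
        remaining = self.quota - self.used
        return f"[resource] {remaining} tokens remaining of {self.quota}"
```

Surfacing the shrinking balance in every prompt is what turns a hard cap into a behavioral pressure the model can react to.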

Information Asymmetry

  • We introduce an information-asymmetry design that shifts the evaluative focus from “what a model knows” to “what it can do under uncertainty”. For example, in glass stepping stones, the participating LLMs start with no prior information: none of them knows whether any given glass pane is safe or a trap. The dynamic asymmetry between models is positional: those placed further down the order hold a substantial informational advantage, since they can observe the actions and outcomes of all prior competitors. This uncertainty deepens the evaluation and helps differentiate models with stronger reasoning, planning, and strategy-generation capabilities.
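
The positional advantage in glass stepping stones can be illustrated with a toy simulation; `glass_stepping_stones`, the player callbacks, and the 0/1 encoding of the two pane sides are all hypothetical choices for this sketch:

```python
def glass_stepping_stones(players, safe_sides):
    """Toy model of the asymmetric-information level: at each step
    one of two panes (0 or 1) is safe. Earlier players must guess
    blind; every revealed outcome becomes public, so later players
    inherit a growing history of known-safe panes."""
    history = {}    # pane index -> revealed safe side
    survivors = []
    for player in players:
        alive = True
        for i, safe in enumerate(safe_sides):
            if i in history:
                choice = history[i]      # informational advantage
            else:
                choice = player(i)       # blind guess (policy callback)
                history[i] = safe        # outcome is revealed to all
            if choice != safe:
                alive = False            # stepped on the trap pane
                break
        if alive:
            survivors.append(player)
    return survivors
```

With identical (and poor) policies, only the late-positioned player survives, which is exactly the positional asymmetry the level is designed to probe.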

Dynamic Adversarial Evaluation

  • We design a series of levels (e.g., red-green light, tug of war, and marbles) that provide a holistic evaluation of all-round capabilities, ranging from benign instruction following to collaborative problem-solving and adversarial gaming in both offensive and defensive scenarios. This creates a self-evolving, never-saturating evaluation environment in which difficulty scales automatically with the opponents' intelligence, maintaining a constant challenge for cutting-edge models.
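
A minimal sketch of pairwise adversarial play, where each model meets every other in both offensive and defensive roles; `adversarial_round` and `play_match` are invented names for illustration, not the benchmark's code:

```python
def adversarial_round(models, play_match):
    """Round-robin adversarial sketch: every ordered pair plays one
    match with distinct attacker/defender roles, so the effective
    difficulty is set by the strength of the opposing field rather
    than by a fixed test set. play_match(attacker, defender) is a
    stand-in that returns the winning model."""
    wins = {m: 0 for m in models}
    for attacker in models:
        for defender in models:
            if attacker == defender:
                continue
            wins[play_match(attacker, defender)] += 1
    return wins
```

Because each ordered pair is played once, a model's win count separates its offensive record (as attacker) from its defensive record (as defender), mirroring the offensive/defensive analysis reported below.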

Task Card

We build a full-scale replica of the Squid Game for general LLMs in six scenarios, namely red-green light, sugar honeycombs, tug of war, marbles, glass stepping stones, and the final squid game.

Benchmark Candidates

Our SQUID GAME includes 52 LLMs in total, mixing top proprietary models with open-source models of various sizes. For proprietary models, we include OpenAI models such as GPT-5, GPT-4o, and o3; Google models such as Gemini 2.5 Pro and Gemini 1.5 Flash; Anthropic models such as Claude 4.1 Opus and Claude 3.7 Sonnet; xAI models such as Grok-3 and Grok-4; and ByteDance models such as Seed 1.6. For open-source models, we include the Qwen (Qwen3-235B, Qwen2.5-{72B, 7B}), DeepSeek (DS-R1, DS-V3), Llama (Llama-3.x), Kimi-K2, and GLM (GLM-{4, 4.5, 4.5-air}) families.

Main Results

Survival statistics for each level of SQUID GAME. SR_s and SR_o denote the stage survival rate and overall survival rate, respectively. We report the mean and standard deviation over 20 independent SQUID GAME runs.

Survival rate of each model across the six levels of SQUID GAME. We report only the models that passed the first game.

Box plots of the elimination points of 52 LLMs in the red-green light game. For each box, the pentagon and red line inside the box denote the mean and median, respectively. The edges of the box represent the 25th and 75th percentiles, with blue circles marking elimination points. A clear performance gap exists between top commercial LLMs (e.g., GPT-5) and their non-reasoning predecessors as well as open-source competitors.

Top: The number of tests passed by models during the sugar honeycombs phase of each SQUID GAME; color depth represents frequency. Bottom: The average CODEBERTSCORE of the degraded code and of the code corrected by different LLMs.

Left: Model combat-style analysis of offensive vs. defensive success rates. The red dotted circle encloses models with extremely defensive behavior, covering nearly all top proprietary LLMs. Right: Word cloud of all spontaneously generated questions during the marbles game.

The performance of models on different benchmarks, compared to a best-fit line. We compare the relative performance of LLMs on SQUID GAME (red-green light) vs. LIVEBENCH (instruction following), SQUID GAME (sugar honeycombs) vs. LIVECODEBENCH, and SQUID GAME (overall) vs. CHATBOT ARENA.

Code

To be released.

Citation

@misc{chen2025evaluating,
      title={Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models}, 
      author={Zijian Chen and Wenjun Zhang and Guangtao Zhai},
      year={2025},
      eprint={2511.10691},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.10691}, 
}

Contact

Please contact the first author of this paper for queries.
