
LLM Squid Game

Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models

¹Shanghai Jiao Tong University, ²Shanghai AI Laboratory
*Corresponding author

Abstract: We introduce SQUID GAME, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric-information settings, designed to evaluate LLMs through interactive gameplay against other LLM opponents. Notably, SQUID GAME consists of six elimination-style levels that target multi-faceted abilities such as instruction following, coding, reasoning, planning, and safety alignment. We evaluate over 50 LLMs on SQUID GAME, presenting the largest behavioral evaluation study of general LLMs in dynamic adversarial scenarios. We observe a clear generational phase transition in performance within the same model lineage and find evidence that some models resort to speculative shortcuts to win the game, indicating possible higher-level evaluation-paradigm contamination in static benchmarks. Furthermore, we compare prominent LLM benchmarks with SQUID GAME through correlation analyses, highlighting that dynamic evaluation can serve as a complement to static evaluation.

Design Principles

Elimination rather than Score

  • Instead of seeking the theoretically best model, we adopt a Battle Royale-style relative ranking system in which a certain percentage of the models may be eliminated in each round. The difficulty of subsequent challenges therefore increases with both the game settings and the surviving field itself. As a result, a high ranking reflects not only superior capability but also a tactical advantage (e.g., fewer errors, better strategies) over specific opponents.
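
The elimination mechanism above can be sketched as follows; `run_tournament`, the per-level scoring callbacks, and `elim_frac` are illustrative names we invented, not the authors' released code:

```python
def run_tournament(models, levels, elim_frac=0.3):
    """Battle Royale-style ranking sketch: after each level, the
    bottom fraction of the current field (by per-level score) is
    eliminated, so later levels are contested only by survivors."""
    survivors = list(models)
    for play_level in levels:
        # play_level is a stand-in for running one game against
        # the current field and returning this model's score
        scores = {m: play_level(m, survivors) for m in survivors}
        ranked = sorted(survivors, key=scores.get, reverse=True)
        n_keep = max(1, int(len(ranked) * (1 - elim_frac)))
        survivors = ranked[:n_keep]  # difficulty rises with the field
    return survivors
```

Because scores are computed against the surviving field, a model's final rank reflects its edge over specific opponents rather than an absolute score.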

Resource Constraint

  • We add resource constraints (e.g., token or API-call quotas) to SQUID GAME to measure token utilization and inference efficiency. More importantly, this mechanism exerts pressure on the model by providing live feedback on its diminishing resource pool.
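
A minimal sketch of such a quota mechanism; the `TokenBudget` class and its feedback message are illustrative assumptions, not the repository's API:

```python
class TokenBudget:
    """Hypothetical per-game resource pool: every generation call
    draws down a shared token quota, and the remaining balance can
    be fed back to the model in its next prompt (live pressure)."""

    def __init__(self, quota):
        self.quota = quota
        self.used = 0

    def charge(self, n_tokens):
        # exceeding the quota ends the model's run
        if self.used + n_tokens > self.quota:
            raise RuntimeError("quota exhausted: model is eliminated")
        self.used += n_tokens

    def feedback(self):
        # live status string injected into the next prompt
        remaining = self.quota - self.used
        return f"[resource] {remaining} tokens remaining of {self.quota}"
```

Surfacing the shrinking balance in every prompt is what turns a hard cap into a behavioral pressure the model can react to.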

Information Asymmetry

  • We introduce an information-asymmetry design that shifts the evaluative focus from “what a model knows” to “what it can do under uncertainty”. For example, in glass stepping stones, the participating LLMs start with no prior information: none of them knows whether any given glass pane is safe or a trap. The dynamic asymmetry between models is positional: those placed further down the order hold a substantial informational advantage, since they can observe the actions and outcomes of all prior competitors. This uncertainty deepens the evaluation and helps differentiate models with stronger reasoning, planning, and strategy-generation capabilities.
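
The positional advantage in glass stepping stones can be illustrated with a toy simulation; `glass_stepping_stones`, the player callbacks, and the 0/1 encoding of the two pane sides are all hypothetical choices for this sketch:

```python
def glass_stepping_stones(players, safe_sides):
    """Toy model of the asymmetric-information level: at each step
    one of two panes (0 or 1) is safe. Earlier players must guess
    blind; every revealed outcome becomes public, so later players
    inherit a growing history of known-safe panes."""
    history = {}    # pane index -> revealed safe side
    survivors = []
    for player in players:
        alive = True
        for i, safe in enumerate(safe_sides):
            if i in history:
                choice = history[i]      # informational advantage
            else:
                choice = player(i)       # blind guess (policy callback)
                history[i] = safe        # outcome is revealed to all
            if choice != safe:
                alive = False            # stepped on the trap pane
                break
        if alive:
            survivors.append(player)
    return survivors
```

With identical (and poor) policies, only the late-positioned player survives, which is exactly the positional asymmetry the level is designed to probe.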

Dynamic Adversarial Evaluation

  • We design a series of levels (e.g., red-green light, tug of war, and marbles) that provide a holistic evaluation of all-round capabilities, ranging from benign instruction following to collaborative problem-solving and adversarial gaming in both offensive and defensive scenarios. This creates a self-evolving, never-saturating evaluation environment in which difficulty scales automatically with the opponents' intelligence, maintaining a constant challenge for cutting-edge models.
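
A minimal sketch of pairwise adversarial play, where each model meets every other in both offensive and defensive roles; `adversarial_round` and `play_match` are invented names for illustration, not the benchmark's code:

```python
def adversarial_round(models, play_match):
    """Round-robin adversarial sketch: every ordered pair plays one
    match with distinct attacker/defender roles, so the effective
    difficulty is set by the strength of the opposing field rather
    than by a fixed test set. play_match(attacker, defender) is a
    stand-in that returns the winning model."""
    wins = {m: 0 for m in models}
    for attacker in models:
        for defender in models:
            if attacker == defender:
                continue
            wins[play_match(attacker, defender)] += 1
    return wins
```

Because each ordered pair is played once, a model's win count separates its offensive record (as attacker) from its defensive record (as defender), mirroring the offensive/defensive analysis reported below.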

Task Card

We build a full-scale replica of the Squid Game for general LLMs in six scenarios, namely red-green light, sugar honeycombs, tug of war, marbles, glass stepping stones, and the final squid game.

Benchmark Candidates

Our SQUID GAME includes 52 LLMs in total, mixing top proprietary models with open-source models of various sizes. For proprietary models, we include OpenAI models such as GPT-5, GPT-4o, and o3; Google models such as Gemini 2.5 Pro and Gemini 1.5 Flash; Anthropic models such as Claude 4.1 Opus and Claude 3.7 Sonnet; xAI models such as Grok-3 and Grok-4; and ByteDance models such as Seed 1.6. For open-source models, we include the Qwen (Qwen3-235B, Qwen2.5-{72B, 7B}), DeepSeek (DS-R1, DS-V3), Llama (Llama-3.x), Kimi-K2, and GLM (GLM-{4, 4.5, 4.5-air}) families.

Main Results

Survival statistics for each level of SQUID GAME. SR_s and SR_o denote the stage survival rate and overall survival rate, respectively. We report the mean and standard deviation over 20 independent SQUID GAME runs.

Survival rate of each model across the six levels of SQUID GAME. We report only the models that passed the first game.

Box plots of the elimination points of 52 LLMs in the red-green light game. For each box, the pentagon and red line inside the box denote the mean and median, respectively. The edges of the box represent the 25th and 75th percentiles, with blue circles marking elimination points. A clear performance gap exists between top commercial LLMs (e.g., GPT-5) and their non-reasoning predecessors as well as open-source competitors.

Top: The number of tests passed by models during the sugar honeycombs phase of each SQUID GAME; color depth represents frequency. Bottom: The average CODEBERTSCORE of the degraded code and of the code corrected by different LLMs.

Left: Model combat-style analysis of offensive vs. defensive success rates. The red dotted circle encloses models with extremely defensive behavior, covering nearly all top proprietary LLMs. Right: Word cloud of all spontaneously generated questions during the marbles game.

The performance of models on different benchmarks, compared to a best-fit line. We compare the relative performance of LLMs on SQUID GAME (red-green light) vs. LIVEBENCH (instruction following), SQUID GAME (sugar honeycombs) vs. LIVECODEBENCH, and SQUID GAME (overall) vs. CHATBOT ARENA.

Code

To be released.

Citation

@misc{chen2025evaluating,
      title={Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models}, 
      author={Zijian Chen and Wenjun Zhang and Guangtao Zhai},
      year={2025},
      eprint={2511.10691},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.10691}, 
}

Contact

Please contact the first author of this paper for queries.
