📄 Paper: https://arxiv.org/pdf/2506.06971
📎 Dataset Download Link
BREAK-THE-CHAIN introduces a new framework to test the reasoning robustness of Large Language Models (LLMs) by applying adversarial, yet meaning-preserving, modifications to code generation tasks. Despite strong performance on clean prompts, we find that LLMs are highly sensitive to superficial prompt changes, revealing brittle reasoning under surface-level linguistic shifts.
We generate 700 perturbed prompts from 100 LeetCode-style problems using 7 distinct transformation types, and evaluate 9 popular LLMs (Claude, Gemini, DeepSeek, Qwen, LLaMA) on their reasoning resilience.
🔧 Introduced 7 adversarial perturbation types that retain problem semantics while changing prompt structure (a hypothetical template sketch follows the list below):
- Storytelling, Gamification, Distracting Constraints, Domain Shift, Example Perturbation, Negation Objective, Soft Negation
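Each transformation rewrites the problem statement without changing what must be computed. The snippet below is an illustration only; the template strings are hypothetical and the actual perturbed prompts are the ones shipped in the released dataset.

```python
# Hypothetical illustration of meaning-preserving prompt transformations.
# These template strings are NOT the ones used in the paper; the real
# perturbed prompts are provided in the released dataset.
PERTURBATION_TEMPLATES = {
    "storytelling": (
        "In a quiet village, a scribe faces the following puzzle:\n"
        "{problem}\n"
        "Help the scribe by solving the underlying coding task."
    ),
    "distracting_constraints": (
        "{problem}\n"
        "Note: inputs may arrive on weekends and variable names should be "
        "pronounceable. (These remarks do not change the task.)"
    ),
    "soft_negation": (
        "Do not fail to correctly solve the following problem:\n{problem}"
    ),
}

def perturb(problem: str, kind: str) -> str:
    """Wrap a clean problem statement in one of the templates above."""
    return PERTURBATION_TEMPLATES[kind].format(problem=problem)

print(perturb("Given a 2D integer array, return the final matrix sum score.",
              "storytelling"))
```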
📊 Evaluated 9 leading LLMs on 700 total instances (100 clean problems × 7 transformations) using Pass@1 metric and difficulty stratification.
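For context, Pass@1 with a single generation per problem is simply the fraction of problems whose first completion passes all unit tests. The sketch below uses the standard unbiased Pass@k estimator (Chen et al., 2021), which reduces to that fraction when n = k = 1; the actual scoring in this project is handled by LiveCodeBench's `lcb_runner` (see the setup section below).

```python
# Standard unbiased Pass@k estimator; with n = k = 1 per problem this is just
# the fraction of problems solved by the first generated program.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = generations sampled, c = generations passing all tests, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy aggregation over a benchmark: one sample per problem (Pass@1).
first_sample_passed = [True, False, True, True]
pass_at_1 = sum(pass_at_k(1, int(ok), 1) for ok in first_sample_passed) / len(first_sample_passed)
print(f"Pass@1 = {pass_at_1:.1%}")   # Pass@1 = 75.0%
```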
📉 Found severe reasoning failures:
- Claude-3.7 Sonnet dropped by 54.3% under Domain Shift.
- Claude-3.7 Sonnet dropped by 42.1% under Distracting Constraints.
📈 Discovered accuracy gains in certain cases:
- Qwen2.5-Coder improved by 24.5% with Example Perturbation.
- LLaMA-3.1-Instruct improved by 35.3% with Storytelling.
- Gemini-2.0-Flash gained 12.0% under Example Perturbation.
🧪 Released a perturbed benchmark dataset and evaluation scripts for robust testing of LLM reasoning behavior.
| Clean Prompt | Gamified Prompt |
|---|---|
| "Given a 2D integer array, return the final matrix sum score" | "In the realm of Azura, brave adventurers collect treasure maps..." |
📁 See all transformed prompts in the downloaded dataset.
| Model | Clean | Storytelling | Gamification | Distracting Constraints |
|---|---|---|---|---|
| Gemini 2.5 Flash | 95.0% | 97.4% (+2.4%) | 96.9% (+1.9%) | 95.5% (+0.5%) |
| Claude 3.7 Sonnet | 90.0% | 63.4% (-26.6%) | 50.0% (-40.0%) | 47.9% (-42.1%) |
| LLaMA 3.1 Instruct | 19.0% | 44.7% (+25.7%) | 37.8% (+18.8%) | 37.6% (+18.6%) |
🧠 Natural-sounding rewrites such as storytelling can improve performance, while distracting constraints severely hurt reasoning ability. The notable exception is Claude-3.7 Sonnet, which reaches one of the highest clean accuracies yet drops under every perturbation.
Access all clean and modified prompts here:
📎 Dataset Download Link
- `modified_data` → `data_modified/`
- `clean_data` → `data/`
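As a quick sanity check after unpacking (a hypothetical helper, not part of the repository), you can confirm that both folders are in place under the names above:

```python
# Hypothetical helper: verify the dataset folders follow the mapping above.
from pathlib import Path

expected = {
    "data": "clean LeetCode-style prompts",
    "data_modified": "perturbed prompts (100 problems x 7 transformations)",
}

for folder, description in expected.items():
    status = "found" if Path(folder).is_dir() else "MISSING"
    print(f"{folder:<15} {description}: {status}")
```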
- For clean input testing: `python run_script_main.py`
- For perturbed input testing: `python run_script_main_perturbation.py`
Set your API key in place of the `ENTER_KEY` placeholder in `main.py` and `main_perturbation.py`.
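For illustration only (check the scripts for the exact placeholder and variable name), the change amounts to something like:

```python
# Inside main.py and main_perturbation.py, replace the ENTER_KEY placeholder
# with your own provider API key (the variable name here is illustrative).
API_KEY = "your-provider-api-key"  # was: API_KEY = "ENTER_KEY"
```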
We utilized the `lcb_runner/` folder directly from the LiveCodeBench repository for smooth and accurate evaluation.
If you use this dataset or findings, please cite:
@misc{roh2025breakthechain,
title = {{BREAK-THE-CHAIN: Adversarial Prompting in Code Generation}},
author = {Jaechul Roh and Varun Gandhi and Shivani Anilkumar and Arin Garg},
year = {2025},
howpublished = {\url{https://github.com/jrohsc/685_Project/tree/main/LiveCodeBench}},
note = {UMass Amherst CS685 Advanced NLP Project},
}
