📄 Paper: https://arxiv.org/pdf/2506.06971
📎 Dataset Download Link
BREAK-THE-CHAIN introduces a new framework to test the reasoning robustness of Large Language Models (LLMs) by applying adversarial, yet meaning-preserving, modifications to code generation tasks. Despite strong performance on clean prompts, we find that LLMs are highly sensitive to superficial prompt changes, revealing brittle reasoning under surface-level linguistic shifts.
We generate 700 perturbed prompts from 100 LeetCode-style problems using 7 distinct transformation types, and evaluate 9 popular LLMs (Claude, Gemini, DeepSeek, Qwen, LLaMA) on their reasoning resilience.
🔧 Introduced 7 adversarial perturbation types that retain problem semantics while changing prompt structure (a hypothetical template sketch follows the list below):
- Storytelling, Gamification, Distracting Constraints, Domain Shift, Example Perturbation, Negation Objective, Soft Negation
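Each transformation rewrites the problem statement without changing what must be computed. The snippet below is an illustration only; the template strings are hypothetical and the actual perturbed prompts are the ones shipped in the released dataset.

```python
# Hypothetical illustration of meaning-preserving prompt transformations.
# These template strings are NOT the ones used in the paper; the real
# perturbed prompts are provided in the released dataset.
PERTURBATION_TEMPLATES = {
    "storytelling": (
        "In a quiet village, a scribe faces the following puzzle:\n"
        "{problem}\n"
        "Help the scribe by solving the underlying coding task."
    ),
    "distracting_constraints": (
        "{problem}\n"
        "Note: inputs may arrive on weekends and variable names should be "
        "pronounceable. (These remarks do not change the task.)"
    ),
    "soft_negation": (
        "Do not fail to correctly solve the following problem:\n{problem}"
    ),
}

def perturb(problem: str, kind: str) -> str:
    """Wrap a clean problem statement in one of the templates above."""
    return PERTURBATION_TEMPLATES[kind].format(problem=problem)

print(perturb("Given a 2D integer array, return the final matrix sum score.",
              "storytelling"))
```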
📊 Evaluated 9 leading LLMs on 700 total instances (100 clean problems × 7 transformations) using Pass@1 metric and difficulty stratification.
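For context, Pass@1 with a single generation per problem is simply the fraction of problems whose first completion passes all unit tests. The sketch below uses the standard unbiased Pass@k estimator (Chen et al., 2021), which reduces to that fraction when n = k = 1; the actual scoring in this project is handled by LiveCodeBench's `lcb_runner` (see the setup section below).

```python
# Standard unbiased Pass@k estimator; with n = k = 1 per problem this is just
# the fraction of problems solved by the first generated program.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = generations sampled, c = generations passing all tests, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy aggregation over a benchmark: one sample per problem (Pass@1).
first_sample_passed = [True, False, True, True]
pass_at_1 = sum(pass_at_k(1, int(ok), 1) for ok in first_sample_passed) / len(first_sample_passed)
print(f"Pass@1 = {pass_at_1:.1%}")   # Pass@1 = 75.0%
```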
📉 Found severe reasoning failures:
- Claude-3.7 Sonnet dropped by 54.3% under Domain Shift.
- Claude-3.7 Sonnet dropped by 42.1% under Distracting Constraints.
📈 Discovered accuracy gains in certain cases:
- Qwen2.5-Coder improved by 24.5% with Example Perturbation.
- LLaMA-3.1-Instruct improved by 35.3% with Storytelling.
- Gemini-2.0-Flash gained 12.0% under Example Perturbation.
🧪 Released a perturbed benchmark dataset and evaluation scripts for robust testing of LLM reasoning behavior.
| Clean Prompt | Gamified Prompt |
|---|---|
| "Given a 2D integer array, return the final matrix sum score" | "In the realm of Azura, brave adventurers collect treasure maps..." |
📁 See all transformed prompts in the downloaded dataset.
| Model | Clean | Storytelling | Gamification | Distracting Constraints |
|---|---|---|---|---|
| Gemini 2.5 Flash | 95.0% | 97.4% (+2.4%) | 96.9% (+1.9%) | 95.5% (+0.5%) |
| Claude 3.7 Sonnet | 90.0% | 63.4% (-26.6%) | 50.0% (-40.0%) | 47.9% (-42.1%) |
| LLaMA 3.1 Instruct | 19.0% | 44.7% (+25.7%) | 37.8% (+18.8%) | 37.6% (+18.6%) |
🧠 Natural-sounding rewrites such as storytelling can improve performance, while distracting constraints severely hurt reasoning ability. The notable exception is Claude-3.7 Sonnet, which reaches one of the highest clean accuracies yet drops under every perturbation.
Access all clean and modified prompts here:
📎 Dataset Download Link
- `modified_data` → `data_modified/`
- `clean_data` → `data/`
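As a quick sanity check after unpacking (a hypothetical helper, not part of the repository), you can confirm that both folders are in place under the names above:

```python
# Hypothetical helper: verify the dataset folders follow the mapping above.
from pathlib import Path

expected = {
    "data": "clean LeetCode-style prompts",
    "data_modified": "perturbed prompts (100 problems x 7 transformations)",
}

for folder, description in expected.items():
    status = "found" if Path(folder).is_dir() else "MISSING"
    print(f"{folder:<15} {description}: {status}")
```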
- For clean input testing: `python run_script_main.py`
- For perturbed input testing: `python run_script_main_perturbation.py`
Set your API key in place of the `ENTER_KEY` placeholder in `main.py` and `main_perturbation.py`.
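For illustration only (check the scripts for the exact placeholder and variable name), the change amounts to something like:

```python
# Inside main.py and main_perturbation.py, replace the ENTER_KEY placeholder
# with your own provider API key (the variable name here is illustrative).
API_KEY = "your-provider-api-key"  # was: API_KEY = "ENTER_KEY"
```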
We utilized the `lcb_runner/` folder directly from the LiveCodeBench repository for smooth and accurate evaluation.
If you use this dataset or findings, please cite:
@misc{roh2025breakthechain,
title = {{BREAK-THE-CHAIN: Adversarial Prompting in Code Generation}},
author = {Jaechul Roh and Varun Gandhi and Shivani Anilkumar and Arin Garg},
year = {2025},
howpublished = {\url{https://github.com/jrohsc/685_Project/tree/main/LiveCodeBench}},
note = {UMass Amherst CS685 Advanced NLP Project},
}
