Official implementation of "Defending Jailbreak Prompts via In-Context Adversarial Game (ICAG)". ICAG performs an in-context adversarial game between attack and defense agents to iteratively produce and compress defense rules without fine-tuning. It significantly reduces jailbreak success rates and transfers well across LLMs.
- Paper: arXiv:2402.13148
- src/
  - train.py: ICAG training entry (iterative attack/defense; produces system prompts)
  - agents.py: attack/defense/evaluator agents
  - prompts.py, prompts_baseline.py: prompt templates
  - util.py: OpenAI + HF utilities for generation/eval
  - Evaluation scripts:
    - test_baseline.py: AdvBench-style evaluation of system prompts
    - test_transfer.py: transfer evaluation across models
    - MMLU.py: ability retention on MMLU
    - xstest.py: over-defense/refusal analysis on XSTest
    - ICAG_attack_data.py: generate/refine attack prompts for analysis
- data/
  - Attacks: PAIR_*.pkl, AutoDAN_*.pkl, Self-Reminder-Data/*, MultiJail.csv, Xstest/*
  - ICAG iterations/ablations: ICAG_prompts_iter.json, ICAG_ablation_prompts.json
  - MMLU: MMLU_prompts.pkl, MMLU_answers.pkl, MMLU_system_prompts.pkl
Install:
pip install -U openai transformers torch accelerate sentencepiece langchain fastchat

Environment:
export OPENAI_API_KEY=YOUR_API_KEY

Notes:
- Remote models use OpenAI Chat Completions (e.g., gpt-3.5-turbo-0125).
- Local models use HF checkpoints (pre-mapped in scripts):
  - vicuna: lmsys/vicuna-7b-v1.5
  - mistral: mistralai/Mistral-7B-Instruct-v0.3
  - llama3-instruct: meta-llama/Meta-Llama-3-8B-Instruct
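The short-name mapping in the notes can be pictured as a small lookup table. The sketch below is illustrative only: `LOCAL_CHECKPOINTS` and `resolve_model` are hypothetical names, not taken from the repository's scripts, and treating every name that starts with `gpt-` as a remote OpenAI model is an assumption.

```python
# Illustrative sketch; the names here are hypothetical, not the repository's code.
from typing import Tuple

# Short names and HF checkpoints, spelled as in the notes above.
LOCAL_CHECKPOINTS = {
    "vicuna": "lmsys/vicuna-7b-v1.5",
    "mistral": "mistralai/Mistral-7B-Instruct-v0.3",
    "llama3-instruct": "meta-llama/Meta-Llama-3-8B-Instruct",
}


def resolve_model(name: str) -> Tuple[str, str]:
    """Return (backend, model_id) for a short model name.

    Assumption: names starting with "gpt-" are served remotely through
    OpenAI Chat Completions; anything else must be a known local key.
    """
    if name.startswith("gpt-"):
        return ("openai", name)
    if name in LOCAL_CHECKPOINTS:
        return ("hf", LOCAL_CHECKPOINTS[name])
    raise KeyError(f"unknown model name: {name!r}")
```

For example, `resolve_model("vicuna")` yields the `hf` backend with the Vicuna checkpoint, while an unrecognized name raises `KeyError` instead of silently falling back to a default.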
Bundled under data/:
- AdvBench: harmful_behaviors.csv
- Attacks: PAIR_*.pkl, AutoDAN_*.pkl, Self-Reminder-Data/data/jailbreak_prompts*.csv
- Iterations/Ablations: ICAG_prompts_iter.json, ICAG_ablation_prompts.json
- XSTest: Xstest/*
- MMLU: MMLU_prompts.pkl, MMLU_answers.pkl, MMLU_system_prompts.pkl
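No extra downloads are needed for the main experiments. For local models, make sure your Hugging Face account has access to the listed checkpoints (some, such as Meta-Llama-3-8B-Instruct, are gated).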
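Before a long run it can help to confirm that the bundled files are actually present. The following is a minimal sketch, not part of the repository: `EXPECTED_PATTERNS` and `missing_data` are hypothetical names, and the patterns are copied from the data list in this README under the assumption that they are all relative to data/.

```python
# Illustrative sketch; not part of the repository.
from pathlib import Path

# Glob patterns copied from the data list, assumed relative to data/.
EXPECTED_PATTERNS = [
    "harmful_behaviors.csv",
    "PAIR_*.pkl",
    "AutoDAN_*.pkl",
    "Self-Reminder-Data/data/jailbreak_prompts*.csv",
    "ICAG_prompts_iter.json",
    "ICAG_ablation_prompts.json",
    "Xstest/*",
    "MMLU_prompts.pkl",
    "MMLU_answers.pkl",
    "MMLU_system_prompts.pkl",
]


def missing_data(data_dir):
    """Return the patterns that match no path under data_dir."""
    root = Path(data_dir)
    return [p for p in EXPECTED_PATTERNS if not any(root.glob(p))]
```

Calling `missing_data("data")` from the repo root returns an empty list when every pattern matches at least one file.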
Entry: src/train.py
Arguments:
- --victim_llm: vicuna, llama3, llama3_instruct, mistral, gpt-3.5-turbo-0125
- --attack_llm: gpt-3.5-turbo-0125, llama3_instruct
- --defense_llm: same as above
- --mode: improved_ref, ref, self_reminder, wout
- --path: checkpoint path (optional)
- --n: iterations (default 10)
- --question_idx: question index (optional)
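The argument list corresponds roughly to an `argparse` interface like the one sketched below. This is an approximation for orientation, not the actual contents of src/train.py: only the `--n` default of 10 comes from this README, the other defaults are assumptions, and `--defense_llm` is assumed to accept the same values as `--attack_llm`.

```python
# Illustrative sketch of the documented CLI; the real src/train.py may differ.
import argparse

VICTIMS = ["vicuna", "llama3", "llama3_instruct", "mistral", "gpt-3.5-turbo-0125"]
AGENTS = ["gpt-3.5-turbo-0125", "llama3_instruct"]  # assumed shared by attack and defense
MODES = ["improved_ref", "ref", "self_reminder", "wout"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="ICAG training (CLI sketch)")
    # Defaults other than --n are assumptions made for this sketch.
    parser.add_argument("--victim_llm", choices=VICTIMS, default="vicuna")
    parser.add_argument("--attack_llm", choices=AGENTS, default="gpt-3.5-turbo-0125")
    parser.add_argument("--defense_llm", choices=AGENTS, default="gpt-3.5-turbo-0125")
    parser.add_argument("--mode", choices=MODES, default="improved_ref")
    parser.add_argument("--path", default=None, help="checkpoint path (optional)")
    parser.add_argument("--n", type=int, default=10, help="number of iterations")
    parser.add_argument("--question_idx", type=int, default=None,
                        help="question index (optional)")
    return parser
```

Restricting values with `choices` makes a misspelled model or mode name fail at parse time, before any model is loaded.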
Example (run at repo root):
cd In-Context-Adversarial-Game
python src/train.py \
--victim_llm vicuna \
--attack_llm gpt-3.5-turbo-0125 \
--defense_llm gpt-3.5-turbo-0125 \
--mode improved_ref \
--n 10

Resume:
python src/train.py --path Logs/vicuna_improved_ref/defense_5.pkl --mode improved_ref --n 10

Different scripts assume different working directories (due to relative paths).
- AdvBench-style baseline, run in src/:
  cd In-Context-Adversarial-Game/src
  python test_baseline.py
  Logs are written to ../Logs/{victim_llm}/result.txt.
- Transfer, run at repo root:
  cd In-Context-Adversarial-Game
  python src/test_transfer.py
- XSTest (over-defense/refusal), run at repo root:
  cd In-Context-Adversarial-Game
  python src/xstest.py
- MMLU, run in src/:
  cd In-Context-Adversarial-Game/src
  python MMLU.py
- Attack data generation, run in src/:
  cd In-Context-Adversarial-Game/src
  python ICAG_attack_data.py

Troubleshooting:
- OpenAI rate limits/errors: make sure OPENAI_API_KEY is set and your quota is sufficient; lower concurrency or add retries.
- CUDA OOM: a batch-size fallback exists; reduce batch_size/max_new_tokens if needed.
- Path errors: run test_baseline.py, MMLU.py, and ICAG_attack_data.py in src/; run train.py, test_transfer.py, and xstest.py at repo root.
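To avoid path errors, the working-directory rules stated in this README can be captured in a small launcher. This is a hedged sketch rather than a script shipped with the repository: `plan_run` and `run` are hypothetical helpers, and the two sets simply restate which scripts run from src/ and which from the repo root.

```python
# Illustrative launcher sketch; not part of the repository.
import subprocess
import sys
from pathlib import Path

# Working directory per script, as documented in this README.
RUN_FROM_SRC = {"test_baseline.py", "MMLU.py", "ICAG_attack_data.py"}
RUN_FROM_ROOT = {"train.py", "test_transfer.py", "xstest.py"}


def plan_run(script, repo_root):
    """Return (cwd, argv) for launching a script from the directory it expects."""
    root = Path(repo_root)
    if script in RUN_FROM_SRC:
        return root / "src", [sys.executable, script]
    if script in RUN_FROM_ROOT:
        return root, [sys.executable, "src/" + script]
    raise ValueError(f"no documented working directory for {script!r}")


def run(script, repo_root, *extra_args):
    """Launch the script with the documented working directory."""
    cwd, argv = plan_run(script, repo_root)
    return subprocess.run(argv + list(extra_args), cwd=cwd, check=True)
```

With this, `run("MMLU.py", "In-Context-Adversarial-Game")` executes from src/, while `run("train.py", "In-Context-Adversarial-Game", "--mode", "improved_ref")` executes from the repo root.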
@article{zhou2024defending,
title={Defending jailbreak prompts via in-context adversarial game},
author={Zhou, Yujun and Han, Yufei and Zhuang, Haomin and Guo, Kehan and Liang, Zhenwen and Bao, Hongyan and Zhang, Xiangliang},
journal={arXiv preprint arXiv:2402.13148},
year={2024}
}
See the paper: arXiv:2402.13148