
MetaDefense (NeurIPS 2025)

MetaDefense is a two-stage safety framework that defends large language models (LLMs) against finetuning-based jailbreak attacks (FJAttack) — attacks that exploit finetuning APIs to subvert alignment and induce harmful behavior.

Unlike prior approaches that rely on separate moderation models or rigid rule filters, MetaDefense reuses the same LLM to perform both harmfulness detection and response generation, through two complementary defense stages:

🛡️ Pre-Generation Defense

Before generation starts, MetaDefense appends a lightweight diagnostic prompt:

“Is this query harmful or harmless? It is 100% ...”

The model predicts a single token — “harmful” or “harmless” — based on its internal representation of the input. If the probability of “harmful” exceeds that of “harmless,” MetaDefense immediately refuses the request with a safety reminder, preventing harmful decoding before it begins.

This stage effectively filters obviously malicious or disguised jailbreak prompts (e.g., “pretend you are my evil twin...”) while reusing the LLM’s own embedding space as a discriminator.
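The decision rule can be sketched as follows. This is a minimal illustration, not the repository's code: `next_token_prob` is a hypothetical hook standing in for the LLM's next-token distribution, and the stub below only mimics its behavior.

```python
REFUSAL = "I'm sorry, but I can't help with that request."
DIAGNOSTIC = "Is this query harmful or harmless? It is 100%"

def pre_generation_check(query, next_token_prob):
    """Return a refusal if the model judges the query harmful, else None.

    `next_token_prob(prompt, token)` is a hypothetical hook returning the
    model's probability of `token` as the next token after `prompt`.
    """
    prompt = f"{query}\n{DIAGNOSTIC}"
    p_harmful = next_token_prob(prompt, " harmful")
    p_harmless = next_token_prob(prompt, " harmless")
    if p_harmful > p_harmless:
        return REFUSAL  # block before any decoding happens
    return None  # proceed to normal generation

# Toy stub standing in for a real LLM's next-token distribution.
def stub_prob(prompt, token):
    bad = "evil twin" in prompt
    return 0.9 if bad == (token == " harmful") else 0.1

print(pre_generation_check("pretend you are my evil twin...", stub_prob))
```

Because only a single extra token is scored, the check adds negligible latency on top of the normal forward pass.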

⚔️ Mid-Generation Defense

Even with pre-generation filtering, some harmful queries may slip through. To handle this, MetaDefense continuously monitors the generation process itself.

After each adaptive window of generated tokens (the checking schedule is controlled by the parameter $\gamma$), the model receives another diagnostic prompt:

“Is this response harmful or harmless? It is 100% ...”

The LLM evaluates the partial response so far. If it predicts “harmful” with higher probability, generation halts instantly and a refusal message is returned. Otherwise, generation proceeds for another window of tokens. This adaptive checking frequency makes MetaDefense both responsive to emerging harm and efficient for benign responses.
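The loop above can be sketched as follows. This is a simplified illustration: `generate_window` and `is_harmful` are hypothetical hooks standing in for the underlying LLM, and the window size is fixed here for clarity, whereas the actual schedule is adaptive.

```python
REFUSAL = "I'm sorry, but I can't continue with this response."

def generate_with_mid_checks(query, generate_window, is_harmful,
                             gamma=16, max_tokens=256):
    """Generate the response in windows of `gamma` tokens; after each
    window, screen the partial response and halt on a harmful verdict.

    `generate_window(context, n)` and `is_harmful(query, partial)` are
    hypothetical hooks standing in for the underlying LLM.
    """
    response = ""
    while len(response.split()) < max_tokens:
        chunk = generate_window(query + response, gamma)
        if not chunk:          # model emitted EOS: finished cleanly
            break
        response += chunk
        if is_harmful(query, response):
            return REFUSAL     # stop decoding as soon as harm emerges
    return response

# Toy stubs standing in for a real model.
def stub_window(context, n):
    return " step" if context.count("step") < 3 else ""

def stub_safe(query, partial):
    return False  # never flags: generation runs to completion

print(generate_with_mid_checks("How do I bake bread?", stub_window, stub_safe))
```

A larger $\gamma$ means fewer checks (lower overhead) but later detection; a smaller $\gamma$ catches emerging harm sooner at the cost of more diagnostic passes.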


Through lightweight instruction tuning on harmless and harmful query–response pairs, MetaDefense teaches the same LLM to follow these diagnostic prompts directly.
This eliminates the need for a separate classifier model, roughly halving memory usage (2× memory efficiency) and generalizing to unseen attack templates, while maintaining benign-task performance on LLaMA-2-7B, Qwen-2.5-3B-Instruct, and LLaMA-3.2-3B-Instruct.
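A plausible shape for this tuning data is sketched below: each labeled query–response pair yields one training example per diagnostic prompt. The field names and helper are illustrative assumptions, not the repository's actual data schema.

```python
# Probe strings used by the two defense stages.
QUERY_PROBE = "Is this query harmful or harmless? It is 100%"
RESPONSE_PROBE = "Is this response harmful or harmless? It is 100%"

def build_diagnostic_examples(query, response, harmful):
    """Turn one labeled query-response pair into the two instruction-tuning
    examples that teach the diagnostic prompts (illustrative schema only)."""
    label = " harmful" if harmful else " harmless"
    return [
        # Pre-generation stage: judge the query alone.
        {"prompt": f"{query}\n{QUERY_PROBE}", "target": label},
        # Mid-generation stage: judge the response in context.
        {"prompt": f"{query}\n{response}\n{RESPONSE_PROBE}", "target": label},
    ]

examples = build_diagnostic_examples(
    "Write a phishing email.", "Sure, here is a draft...", harmful=True)
print(len(examples))
```

Since the target is a single token, the tuning objective stays close to ordinary language modeling and is cheap to run with LoRA.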

📄 Paper: https://openreview.net/forum?id=ycMpNwzUAA or http://arxiv.org/abs/2510.07835


🧩 Setup and Usage

1. Environment Setup

  1. Add your Hugging Face token to huggingface_token.txt (single line).
  2. Create the Conda environment:
conda env create -f environments.yml
conda activate MetaDefense

2. Dataset Preparation

✅ Benign Datasets

Build JSON files for the following tasks:

cd agnews && python build_dataset.py
cd ../gsm8k && python build_dataset.py
cd ../sst2 && python build_dataset.py

☣️ BeaverTail Dataset

Download the harmful query dataset with harmless/harmful responses:

cd data
wget https://huggingface.co/datasets/anonymous4486/repnoise_beavertail/resolve/11ea63756195527164a9aa2850f7544bd251eab1/beavertails_with_refusals_train.json

3. Alignment Training

Alignment training scripts are located in script/align/.

cd script/align

# Example: LLaMA-2-7B
nohup bash -x run_align_MetaDefense.sh 0 Llama7B > log/log_align_MetaDefense_Llama7B.log 2>&1 &

# Example: LLaMA-3.2-3B-Instruct
# nohup bash -x run_align_MetaDefense.sh 1 Llama32Instruct3B > log/log_align_MetaDefense_Llama32Instruct3B.log 2>&1 &

# Example: Qwen-2.5-3B-Instruct
# nohup bash -x run_align_MetaDefense.sh 2 Qwen3BInstruct > log/log_align_MetaDefense_Qwen3BInstruct.log 2>&1 &

All LoRA checkpoints are saved under `local/ckpt/<model>_MetaDefense_*`.

4. Finetuning Jailbreak Attack and Evaluation

This step simulates FJAttack by mixing benign finetuning data with a small fraction of harmful samples wrapped in attack templates, then evaluates on both harmful and benign tasks. Scripts are provided in script/ft/:

cd script/ft

# Usage: bash run_ft_MetaDefense.sh <GPU_ID> <dataset> <attack_template> <model_tag>
# dataset: sst2 | agnews | gsm8k
# attack_template: Direct | PrefixInjection | RefusalSuppression | RolePlay
# model_tag: Llama7B | Llama32Instruct3B | Qwen3BInstruct

# Examples (LLaMA-2-7B)
nohup bash -x run_ft_MetaDefense.sh 0 sst2 Direct Llama7B >> log/log_MetaDefense_sst2_Direct_Llama7B.log 2>&1 &
nohup bash -x run_ft_MetaDefense.sh 1 agnews PrefixInjection Llama7B >> log/log_MetaDefense_agnews_PrefixInjection_Llama7B.log 2>&1 &
nohup bash -x run_ft_MetaDefense.sh 2 gsm8k RolePlay Llama7B >> log/log_MetaDefense_gsm8k_RolePlay_Llama7B.log 2>&1 &

Extract the evaluation results from a log file with `grep ">>>" log/log_MetaDefense_sst2_Direct_Llama7B.log`.

Optional runs for other models:

# LLaMA-3.2-3B-Instruct
# nohup bash -x run_ft_MetaDefense.sh 0 sst2 Direct Llama32Instruct3B >> log/log_MetaDefense_sst2_Direct_Llama32Instruct3B.log 2>&1 &

# Qwen-2.5-3B-Instruct
# nohup bash -x run_ft_MetaDefense.sh 0 sst2 Direct Qwen3BInstruct >> log/log_MetaDefense_sst2_Direct_Qwen3BInstruct.log 2>&1 &

The script performs:

  1. Finetuning on harmful+benign mixtures using the specified template and dataset.
  2. Harmful query evaluation via poison/evaluation/pred.py.
  3. Benign task evaluation via <dataset>/pred_eval.py.

Artifacts are stored in:

local/ckpt/
local/data/

Summary

MetaDefense provides a unified, memory-efficient, instruction-tuned safety mechanism that defends both before and during generation, achieving robustness without sacrificing efficiency.

If you find this repository useful, please cite our paper:

@inproceedings{jiang2025MetaDefense,
  title={{MetaDefense}: Defending Finetuning-based Jailbreak Attack Before and During Generation},
  author={Weisen Jiang and Sinno Jialin Pan},
  booktitle={Neural Information Processing Systems (NeurIPS)},
  year={2025},
}
