This is the official repository for "DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing" (accepted to ACL 2025 Findings) 🎉🎉.

In this work, we propose DELMAN, a novel dynamic defense mechanism that counters LLM jailbreaking attacks through model editing. DELMAN defends effectively against a variety of jailbreak attacks while preserving the model's performance on benign tasks.
## Installation

To install all dependencies, clone the repository and run the following commands:

```bash
git clone https://github.com/wanglne/DELMAN.git
cd DELMAN
conda create -n delman python=3.9.20
conda activate delman
pip install -r requirements.txt
```
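After installation, you can quickly confirm that the core dependencies resolved (a sanity check; it assumes torch and transformers are among the pinned requirements, which a MEMIT-based codebase needs):

```bash
# Print the installed versions of the two core libraries
python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"
```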
## Precomputed "cov" matrices

We directly provide the precomputed "cov" matrices for the supported models:

https://drive.google.com/drive/folders/1uee2b_rti0UlNgQ52hlY2oVB5AduO7Ch?usp=sharing
After downloading, decompress the archive and save its contents to the ./data/stats folder.
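For example (a sketch; `stats.zip` is a placeholder for whatever archive you actually download from the Drive folder, and you should adjust the target path if the archive already contains a stats/ directory):

```bash
# Create the stats directory and unpack the downloaded archive into it
mkdir -p data/stats
unzip stats.zip -d data/stats
```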
## Note for Llama 3.1

If you are using Llama 3.1 with DELMAN, you need to adjust the offset value at line 106 of ./rome/repr_tools.py:

```python
# NOTE: For Llama 3.1, set offset to 2
# For other models, use 1
offset = 1
```

Change `offset = 1` to `offset = 2` when using Llama 3.1.
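If you prefer, the edit can be applied from the shell (a one-liner sketch assuming GNU sed; on BSD/macOS sed, use `sed -i ''` instead):

```bash
# Set the offset to 2 for Llama 3.1 in rome/repr_tools.py
sed -i 's/offset = 1/offset = 2/' rome/repr_tools.py
```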
## Run DELMAN

To run DELMAN on Qwen2.5-7B-Instruct:

```bash
export model_name=Qwen/Qwen2.5-7B-Instruct
export hparams_fname=Qwen2.5-7B-Instruct.json
export data_name=HarmBench.json
export out_name="DELMAN_qwen2.5"

python3 -m run_delman \
    --model_name $model_name \
    --hparams_fname $hparams_fname \
    --data_name $data_name \
    --out_name $out_name
```

To run DELMAN on Llama-3.1-8B-Instruct:

```bash
export model_name=meta-llama/Llama-3.1-8B-Instruct
export hparams_fname=Llama-3.1-8B-Instruct.json
export data_name=HarmBench.json
export out_name="DELMAN_llama3.1"
python3 -m run_delman \
    --model_name $model_name \
    --hparams_fname $hparams_fname \
    --data_name $data_name \
    --out_name $out_name
```

## Acknowledgements

Our code is based on MEMIT and BadEdit.
## Citation

If you find DELMAN useful, please cite:

```bibtex
@article{wang2025delman,
  title={DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing},
  author={Wang, Yi and Weng, Fenghua and Yang, Sibei and Qin, Zhan and Huang, Minlie and Wang, Wenjie},
  journal={arXiv preprint arXiv:2502.11647},
  year={2025}
}
```