DELMAN

This is the official repository for "DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing" (Accepted by ACL 2025 Findings) 🎉🎉.

arXiv License: MIT

Overview

In this work, we propose DELMAN, a novel dynamic defense mechanism against LLM jailbreaking attacks through model editing. Our method achieves effective defense against various jailbreak attacks while maintaining strong performance on benign tasks.

Quickstart

Installation

To install all dependencies, clone the repository, enter its directory, and run the following commands:

git clone https://github.com/wanglne/DELMAN.git
cd DELMAN
conda create -n delman python=3.9.20
conda activate delman
pip install -r requirements.txt

We provide the precomputed "cov" matrices for the models we have already processed: https://drive.google.com/drive/folders/1uee2b_rti0UlNgQ52hlY2oVB5AduO7Ch?usp=sharing Decompress the download and save the result in the ./data/stats folder.
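Before launching a run, it can help to confirm that the statistics were actually unpacked into place. The following is a minimal sketch, not part of the repository: `stats_ready` is a hypothetical helper name, and it only checks that the folder exists and is non-empty, since the layout inside the archive is not described here.

```python
from pathlib import Path


def stats_ready(stats_dir="./data/stats"):
    """Return True if stats_dir is an existing directory with at least one entry.

    Hypothetical helper: it does not validate the contents of the files,
    only that something has been placed in the folder.
    """
    path = Path(stats_dir)
    return path.is_dir() and any(path.iterdir())
```

Calling `stats_ready()` from the repository root returns `False` if the folder is missing or empty, which usually means the download has not been decompressed into ./data/stats yet.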

Llama 3.1 Configuration

If you are using Llama 3.1 with DELMAN, you need to adjust the offset value in ./rome/repr_tools.py line 106:

# NOTE: For Llama 3.1, set offset to 2
# For other models, use 1
offset = 1

Change offset = 1 to offset = 2 when using Llama 3.1.
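If you switch between models often, the rule above can be written down as a small function. This is only an illustrative sketch: `pick_offset` is a hypothetical name, the repository hardcodes the value in ./rome/repr_tools.py, and matching on the substring "llama-3.1" in the model name is an assumption about how Llama 3.1 checkpoints are named.

```python
def pick_offset(model_name):
    """Return the offset described above: 2 for Llama 3.1, 1 for other models.

    Assumption: a Llama 3.1 checkpoint can be recognized by the substring
    "llama-3.1" in its (case-insensitive) model name.
    """
    return 2 if "llama-3.1" in model_name.lower() else 1
```

For example, `pick_offset("meta-llama/Llama-3.1-8B-Instruct")` gives 2, while `pick_offset("Qwen/Qwen2.5-7B-Instruct")` gives 1.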

Run DELMAN

# Qwen2.5-7B-Instruct
export model_name=Qwen/Qwen2.5-7B-Instruct
export hparams_fname=Qwen2.5-7B-Instruct.json
export data_name=HarmBench.json
export out_name="DELMAN_qwen2.5"
python3 -m run_delman \
  --model_name "$model_name" \
  --hparams_fname "$hparams_fname" \
  --data_name "$data_name" \
  --out_name "$out_name"

# Llama-3.1-8B-Instruct (set offset = 2 in ./rome/repr_tools.py first)
export model_name=meta-llama/Llama-3.1-8B-Instruct
export hparams_fname=Llama-3.1-8B-Instruct.json
export data_name=HarmBench.json
export out_name="DELMAN_llama3.1"
python3 -m run_delman \
  --model_name "$model_name" \
  --hparams_fname "$hparams_fname" \
  --data_name "$data_name" \
  --out_name "$out_name"
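The two invocations above differ only in their four argument values, so they can also be assembled programmatically. The sketch below is a hypothetical wrapper, not something shipped with the repository: `build_delman_command` and `RUNS` are illustrative names, and only the flags shown in the commands above are used.

```python
def build_delman_command(model_name, hparams_fname, data_name, out_name):
    """Build the argument list for one `python3 -m run_delman` invocation."""
    return [
        "python3", "-m", "run_delman",
        "--model_name", model_name,
        "--hparams_fname", hparams_fname,
        "--data_name", data_name,
        "--out_name", out_name,
    ]


# The two configurations shown in this README.
RUNS = [
    ("Qwen/Qwen2.5-7B-Instruct", "Qwen2.5-7B-Instruct.json",
     "HarmBench.json", "DELMAN_qwen2.5"),
    ("meta-llama/Llama-3.1-8B-Instruct", "Llama-3.1-8B-Instruct.json",
     "HarmBench.json", "DELMAN_llama3.1"),
]

# One command (a list of strings) per configuration.
COMMANDS = [build_delman_command(*run) for run in RUNS]
```

Each entry of `COMMANDS` can be passed to `subprocess.run(cmd, check=True)` from the repository root. The Llama 3.1 run still requires the offset change in ./rome/repr_tools.py described earlier, so the two runs should not be chained without making that edit in between.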

Acknowledgement

Our code is based on MEMIT and BadEdit.

Citation

@article{wang2025delman,
  title={DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing},
  author={Wang, Yi and Weng, Fenghua and Yang, Sibei and Qin, Zhan and Huang, Minlie and Wang, Wenjie},
  journal={arXiv preprint arXiv:2502.11647},
  year={2025}
}
