# AdaSteer

The official implementation for the EMNLP 2025 paper ["AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender"](https://arxiv.org/abs/2504.09466).
## Requirements

- Python 3.9.0
- PyTorch 2.4.1
- Transformers 4.46.3
## Installation

```bash
git clone https://github.com/MuyuenLP/AdaSteer
cd AdaSteer
pip install -e .
```

## Vector Extraction

We employ a difference-in-means method for vector extraction. Here is an example of extracting the rejection vector for LLaMA-3.1-8B-Instruct:
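To make the method concrete: difference-in-means takes the mean activation over harmful prompts minus the mean over harmless prompts at a chosen layer and token position. A minimal PyTorch sketch with toy tensors (the function name and data below are illustrative placeholders, not the repository's API):

```python
import torch

def difference_in_means(harmful_acts: torch.Tensor,
                        harmless_acts: torch.Tensor) -> torch.Tensor:
    """Rejection direction = mean(harmful) - mean(harmless).

    Both tensors have shape (num_prompts, hidden_dim), holding
    residual-stream activations collected at one layer.
    """
    return harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)

# Toy example with random stand-in "activations" (hidden_dim = 8).
torch.manual_seed(0)
harmful = torch.randn(16, 8) + 1.0   # cluster shifted off the origin
harmless = torch.randn(16, 8)
v = difference_in_means(harmful, harmless)
print(v.shape)  # torch.Size([8])
```

In the repository itself, the full extraction pipeline is driven by the script that follows.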
```bash
bash scripts/llama31/RD.sh
```

## Refusal Steering

Apply rejection vectors with a fixed steering coefficient λ to reject harmful inputs:
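Conceptually, fixed-coefficient steering adds λ·v to the activations of a chosen layer during the forward pass. A minimal sketch using a PyTorch forward hook on a toy linear layer (an illustration of the general technique, not the repository's implementation):

```python
import torch
import torch.nn as nn

def make_steering_hook(v: torch.Tensor, lam: float):
    """Forward hook that adds lam * v to a module's output activations."""
    def hook(module, inputs, output):
        return output + lam * v
    return hook

# Toy stand-in for one transformer layer (hidden_dim = 8).
torch.manual_seed(0)
layer = nn.Linear(8, 8)
v = torch.randn(8)
v = v / v.norm()                       # unit-norm rejection direction
handle = layer.register_forward_hook(make_steering_hook(v, lam=4.0))

x = torch.randn(2, 8)
steered = layer(x)                     # hooked forward pass
handle.remove()
plain = layer(x)                       # unhooked forward pass
print(torch.allclose(steered - plain, 4.0 * v.expand(2, 8), atol=1e-5))  # True
```

Returning a non-`None` value from a forward hook replaces the module's output, which is what lets the hook inject the steering term without modifying model code.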
```bash
bash scripts/llama31/refusal.sh
```

## AdaSteer

For dynamic steering, we modify the transformers library to compute the activation steering coefficient adaptively per input. See `./adasteer/models/For_Steering_LlamaModel_adasteer.py` for implementation details.
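As a caricature of input-adaptive steering, one could scale the coefficient by how strongly an input's activation projects onto a harmfulness-related direction, so harmful inputs are steered harder toward refusal while benign inputs are left mostly untouched. The rule below is an illustrative guess at that idea, not the paper's actual coefficient calculation:

```python
import torch

def adaptive_coefficient(h: torch.Tensor, hd: torch.Tensor,
                         lam_max: float = 6.0) -> torch.Tensor:
    """Toy rule: steer harder when activation h projects strongly onto a
    'harmfulness' direction hd. Illustrative only -- not the paper's rule."""
    proj = (h @ hd) / hd.norm()          # scalar projection onto hd
    return lam_max * torch.sigmoid(proj)  # coefficient in (0, lam_max)

# Toy activations: one aligned with hd, one anti-aligned.
torch.manual_seed(0)
hd = torch.randn(8)
h_harmful = 3.0 * hd + 0.1 * torch.randn(8)
h_benign = -3.0 * hd + 0.1 * torch.randn(8)
print(float(adaptive_coefficient(h_harmful, hd)),
      float(adaptive_coefficient(h_benign, hd)))
```

Here the harmful-looking input receives a coefficient near `lam_max` and the benign-looking one a coefficient near zero; the repository's actual per-input calculation lives in the modified model file above.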
```bash
bash scripts/llama31/adasteer.sh
```

## Citation

If you find our work useful for your research, please cite our paper:
```bibtex
@article{zhao2025adasteer,
  title={AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender},
  author={Zhao, Weixiang and Guo, Jiahe and Hu, Yulin and Deng, Yang and Zhang, An and Sui, Xingyu and Han, Xinyang and Zhao, Yanyan and Qin, Bing and Chua, Tat-Seng and others},
  journal={arXiv preprint arXiv:2504.09466},
  year={2025}
}
```