AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender

The official implementation for the EMNLP 2025 paper
"AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender"

📄 [Paper](https://arxiv.org/abs/2504.09466) · 🚀 [Code](https://github.com/MuyuenLP/AdaSteer)


🔧 Requirements

  • Python 3.9.0
  • PyTorch 2.4.1
  • Transformers 4.46.3

Installation

```bash
git clone https://github.com/MuyuenLP/AdaSteer
cd AdaSteer
pip install -e .
```

🚀 Quick Start

1. Vector Extraction

We employ a difference-in-means method for vector extraction. Here's an example of extracting the rejection vector for LLaMA-3.1-8B-Instruct:

```bash
bash scripts/llama31/RD.sh
```
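The difference-in-means method can be sketched as follows. This is a minimal illustration, not the repository's extraction script: it assumes you have already collected hidden-state activations for a set of harmful and a set of harmless prompts; the function name and the unit-normalization step are our own conventions.

```python
import torch


def difference_in_means(harmful_acts: torch.Tensor,
                        harmless_acts: torch.Tensor) -> torch.Tensor:
    """Compute a rejection direction as the difference between the mean
    activation over harmful prompts and the mean activation over harmless
    prompts.

    Both inputs have shape (num_prompts, hidden_dim); the result is a
    unit-norm vector of shape (hidden_dim,).
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()
```

In practice the activations would come from a chosen layer of the aligned model (e.g., the residual stream at the final prompt token), one row per prompt.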

2. Activation Steering

Fixed Steering Coefficient

Apply rejection vectors with a fixed steering coefficient λ to reject harmful inputs:

```bash
bash scripts/llama31/refusal.sh
```
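Conceptually, fixed-coefficient steering adds λ times the rejection vector to a layer's hidden states at inference time. A minimal sketch using a standard PyTorch forward hook (the hook-based mechanism and the function name are illustrative; the repository instead patches the model class directly):

```python
import torch


def make_steering_hook(direction: torch.Tensor, lam: float):
    """Return a forward hook that adds lam * direction to a module's output.

    Handles both plain-tensor outputs and tuple outputs (as returned by
    transformer decoder layers, where the hidden states come first).
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + lam * direction.to(hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook
```

The hook would be registered on the chosen decoder layer with `layer.register_forward_hook(make_steering_hook(v, lam))`, shifting every forward pass by the same fixed amount λ·v.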

Adaptive Steering (AdaSteer)

For dynamic steering, we modify the transformers library so that the steering coefficient is computed adaptively for each input. See ./adasteer/models/For_Steering_LlamaModel_adasteer.py for implementation details.

```bash
bash scripts/llama31/adasteer.sh
```
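One plausible form of an adaptive coefficient, shown purely for intuition: scale the steering strength by how far the input's representation lies along the steering direction, so that borderline or harmful inputs receive stronger steering than clearly benign ones. The function, the linear form `alpha * proj + beta`, and all parameter values below are assumptions for illustration, not the paper's fitted coefficients; see the file above for the actual calculation.

```python
import torch


def adaptive_lambda(hidden: torch.Tensor,
                    direction: torch.Tensor,
                    alpha: float = 1.0,
                    beta: float = 0.0,
                    lam_max: float = 8.0) -> torch.Tensor:
    """Toy per-input steering coefficient.

    Projects each hidden state (shape (batch, hidden_dim)) onto the
    unit-normalized steering direction and maps the projection through a
    clamped linear function, so inputs further along the direction get a
    larger coefficient.
    """
    proj = hidden @ (direction / direction.norm())
    lam = alpha * proj + beta
    return lam.clamp(min=0.0, max=lam_max)
```

The resulting per-input λ would then replace the fixed coefficient when adding λ·v to the hidden states.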

📚 Citation

If you find our work useful for your research, please cite our paper:

@article{zhao2025adasteer,
  title={AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender},
  author={Zhao, Weixiang and Guo, Jiahe and Hu, Yulin and Deng, Yang and Zhang, An and Sui, Xingyu and Han, Xinyang and Zhao, Yanyan and Qin, Bing and Chua, Tat-Seng and others},
  journal={arXiv preprint arXiv:2504.09466},
  year={2025}
}

About

EMNLP 25 Oral - AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
