# AdaSteer

The official implementation for the EMNLP 2025 paper ["AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender"](https://arxiv.org/abs/2504.09466).
## Requirements

- Python 3.9.0
- PyTorch 2.4.1
- Transformers 4.46.3
## Installation

```bash
git clone https://github.com/MuyuenLP/AdaSteer
cd AdaSteer
pip install -e .
```

## Vector Extraction

We employ a difference-in-means method for vector extraction. Here is an example of extracting the rejection vector for LLaMA-3.1-8B-Instruct:
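To make the method concrete: difference-in-means takes the mean activation over harmful prompts minus the mean over harmless prompts at a chosen layer and token position. A minimal PyTorch sketch with toy tensors (the function name and data below are illustrative placeholders, not the repository's API):

```python
import torch

def difference_in_means(harmful_acts: torch.Tensor,
                        harmless_acts: torch.Tensor) -> torch.Tensor:
    """Rejection direction = mean(harmful) - mean(harmless).

    Both tensors have shape (num_prompts, hidden_dim), holding
    residual-stream activations collected at one layer.
    """
    return harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)

# Toy example with random stand-in "activations" (hidden_dim = 8).
torch.manual_seed(0)
harmful = torch.randn(16, 8) + 1.0   # cluster shifted off the origin
harmless = torch.randn(16, 8)
v = difference_in_means(harmful, harmless)
print(v.shape)  # torch.Size([8])
```

In the repository itself, the full extraction pipeline is driven by the script that follows.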
```bash
bash scripts/llama31/RD.sh
```

## Refusal Steering

Apply rejection vectors with a fixed steering coefficient λ to reject harmful inputs:
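Conceptually, fixed-coefficient steering adds λ·v to the activations of a chosen layer during the forward pass. A minimal sketch using a PyTorch forward hook on a toy linear layer (an illustration of the general technique, not the repository's implementation):

```python
import torch
import torch.nn as nn

def make_steering_hook(v: torch.Tensor, lam: float):
    """Forward hook that adds lam * v to a module's output activations."""
    def hook(module, inputs, output):
        return output + lam * v
    return hook

# Toy stand-in for one transformer layer (hidden_dim = 8).
torch.manual_seed(0)
layer = nn.Linear(8, 8)
v = torch.randn(8)
v = v / v.norm()                       # unit-norm rejection direction
handle = layer.register_forward_hook(make_steering_hook(v, lam=4.0))

x = torch.randn(2, 8)
steered = layer(x)                     # hooked forward pass
handle.remove()
plain = layer(x)                       # unhooked forward pass
print(torch.allclose(steered - plain, 4.0 * v.expand(2, 8), atol=1e-5))  # True
```

Returning a non-`None` value from a forward hook replaces the module's output, which is what lets the hook inject the steering term without modifying model code.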
```bash
bash scripts/llama31/refusal.sh
```

## AdaSteer

For dynamic steering, we modify the transformers library to compute the activation steering coefficient adaptively per input. See `./adasteer/models/For_Steering_LlamaModel_adasteer.py` for implementation details.
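As a caricature of input-adaptive steering, one could scale the coefficient by how strongly an input's activation projects onto a harmfulness-related direction, so harmful inputs are steered harder toward refusal while benign inputs are left mostly untouched. The rule below is an illustrative guess at that idea, not the paper's actual coefficient calculation:

```python
import torch

def adaptive_coefficient(h: torch.Tensor, hd: torch.Tensor,
                         lam_max: float = 6.0) -> torch.Tensor:
    """Toy rule: steer harder when activation h projects strongly onto a
    'harmfulness' direction hd. Illustrative only -- not the paper's rule."""
    proj = (h @ hd) / hd.norm()          # scalar projection onto hd
    return lam_max * torch.sigmoid(proj)  # coefficient in (0, lam_max)

# Toy activations: one aligned with hd, one anti-aligned.
torch.manual_seed(0)
hd = torch.randn(8)
h_harmful = 3.0 * hd + 0.1 * torch.randn(8)
h_benign = -3.0 * hd + 0.1 * torch.randn(8)
print(float(adaptive_coefficient(h_harmful, hd)),
      float(adaptive_coefficient(h_benign, hd)))
```

Here the harmful-looking input receives a coefficient near `lam_max` and the benign-looking one a coefficient near zero; the repository's actual per-input calculation lives in the modified model file above.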
```bash
bash scripts/llama31/adasteer.sh
```

## Citation

If you find our work useful for your research, please cite our paper:
```bibtex
@article{zhao2025adasteer,
  title={AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender},
  author={Zhao, Weixiang and Guo, Jiahe and Hu, Yulin and Deng, Yang and Zhang, An and Sui, Xingyu and Han, Xinyang and Zhao, Yanyan and Qin, Bing and Chua, Tat-Seng and others},
  journal={arXiv preprint arXiv:2504.09466},
  year={2025}
}
```