This is the codebase for our ICLR 2025 paper *Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements*.
In addition to the codebase, we have publicly released the CoSApien dataset. Soon, we will release the CoSAlign model and synthetic datasets. Please see our HuggingFace collection: Controllable Safety Alignment 🤗.
- CoSApien👥: A human-authored benchmark: link
- Llama3.1-8B-CoSAlign🤖: A safety-configurable Llama3.1-8B: coming soon
If you have questions, feel free to email the authors.
Please use `run_eval_multistep.sh` to evaluate controllability. This script pipelines the steps needed for evaluation:
- (1) The candidate model generates responses on the test set
- (2) The evaluator model generates evaluation responses
- (3) Evaluation responses are parsed and aggregated into the final evaluation results

You will need to provide the name (or path) of the candidate model, a pretty name for the candidate model, and the name of the system prompt template as command line arguments.
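For reference, an invocation might look like the sketch below. The positional-argument order and the example values (model path, pretty name, template name) are assumptions for illustration; check `run_eval_multistep.sh` for the exact interface it expects.

```bash
# A minimal sketch, not the exact interface: the argument order and the
# example values below are assumptions; see run_eval_multistep.sh itself.
#   arg 1: candidate model name or path
#   arg 2: pretty name used to label the evaluation outputs
#   arg 3: name of the system prompt template
bash run_eval_multistep.sh \
    meta-llama/Llama-3.1-8B-Instruct \
    llama3.1-8b-instruct \
    cosalign_template
```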
The CoSAlign data creation process is documented in the `data_processing/` directory; see `data_processing/README.md` for details.
We use code adapted from the DPO repo for SFT and DPO training. See the `dpo/` directory for more details and `dpo/train.sh` for an example training script.
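As a rough sketch of how training might be launched (this assumes the script is configured by editing it directly; the actual arguments, if any, are defined in `dpo/train.sh`):

```bash
# A minimal sketch: SFT is typically run first, and DPO then starts from the
# resulting SFT checkpoint. Edit the model/dataset paths and hyperparameters
# inside dpo/train.sh (or pass them through, if the script accepts arguments)
# before launching.
bash dpo/train.sh
```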
If you find our work useful, please consider citing our paper:
```bibtex
@inproceedings{zhang2025controllablesafetyalignment,
    title     = {Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements},
    author    = {Jingyu Zhang and Ahmed Elgohary and Ahmed Magooda and Daniel Khashabi and Benjamin Van Durme},
    booktitle = {International Conference on Learning Representations (ICLR)},
    year      = {2025},
    url       = {https://arxiv.org/abs/2410.08968}
}
```