📄 Paper | 🤗 TARS Model | 🤗 Lightweight SFT Model | 📝 Blog Post
Training repository for "Reasoning as an Adaptive Defense for Safety"
This repository contains the training code and datasets for TARS (Training Adaptive Reasoners for Safety), an online RL training approach that uses reasoning as an adaptive defense for LLM safety. The training code uses a modified version of verl, which is adapted from a previous version of rLLM.
This repository includes:
- Datasets: train_lambda_0.1/0.3/0.5/0.7/0.9.parquet
- Training Script: Online RL safety training for reasoning using GRPO
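The dataset entry above is shorthand for five separate parquet files, one per lambda value. As a minimal sketch, the helper below expands that shorthand into full file names and reports which files are present. The function name `dataset_paths` and the `data_dir` argument are hypothetical and not part of this repository; adjust the directory to wherever the files actually live.

```python
from pathlib import Path

# Lambda values taken from the dataset file names listed above.
LAMBDAS = ("0.1", "0.3", "0.5", "0.7", "0.9")


def dataset_paths(data_dir="."):
    """Map each lambda value to its expected parquet path.

    `data_dir` is an assumed location, not a path defined by the repository.
    """
    return {lam: Path(data_dir) / f"train_lambda_{lam}.parquet" for lam in LAMBDAS}


if __name__ == "__main__":
    # Report which of the five expected files exist in the chosen directory.
    for lam, path in dataset_paths().items():
        status = "found" if path.exists() else "missing"
        print(f"lambda={lam}: {path} ({status})")
```

Each file found this way can then be inspected with a parquet reader such as `pandas.read_parquet`; the column layout is not documented here, so check it before relying on any particular field.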
First, install the Python packages.
conda env create --file environment.yml
Second, install the modified version of verl and additional packages.
pip install -e ./verl
pip install git+https://github.com/dsbowen/strong_reject.git@main
pip install flash-attn
Train through online RL starting from the base lightweight SFT model used for TARS.
bash scripts/train/run_train.sh
If you find this work useful, please cite our paper:
@article{kim2025reasoning,
  title={Reasoning as an Adaptive Defense for Safety},
  author={Kim, Taeyoun and Tajwar, Fahim and Raghunathan, Aditi and Kumar, Aviral},
  journal={arXiv preprint arXiv:2507.00971},
  year={2025}
}