Self-RedTeam

This code supplements our recently released paper: Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

This codebase is built on OpenRLHF, a high-performance RLHF framework built on Ray, DeepSpeed, and HF Transformers. For the original repo, see: GitHub Repo | Slides | Technical Report | Documentation

The reward model part is built on WildGuard: GitHub Repo | Paper | Model

Self-RedTeam model checkpoints can be found here.

Quick Start

Installation

Although OpenRLHF recommends using Docker, we ran our experiments in a conda-managed environment.

cd selfplay-openrlhf
conda create --name openrlhf python=3.10
# Activate the new environment so the packages below are installed into it
conda activate openrlhf
pip install -e .
pip install openrlhf[vllm]

Note

We recommend using vLLM 0.8.2 or higher. Setting `export VLLM_USE_V1=1` requires vLLM 0.8.2 or the nightly version, and you should also set `export VLLM_ENABLE_V1_MULTIPROCESSING=0`.
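The version requirement in the note can be checked before exporting the two variables. The following is a minimal sketch, not part of the repository: the helper name `version_ge` is hypothetical, and it assumes `pip` and a `sort` that supports `-V` (version sort) are available.

```shell
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Relies on `sort -V`, which orders dotted version strings numerically.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Look up the installed vLLM version; stays empty if vLLM is not installed.
vllm_version="$(pip show vllm 2>/dev/null | awk '/^Version:/ {print $2}')"

if [ -n "$vllm_version" ] && version_ge "$vllm_version" "0.8.2"; then
    # Settings taken from the note above.
    export VLLM_USE_V1=1
    export VLLM_ENABLE_V1_MULTIPROCESSING=0
    echo "vLLM $vllm_version: V1 settings exported"
else
    echo "vLLM ${vllm_version:-not found}: leaving V1 settings unset"
fi
```

Run this in the same shell session that later launches training so the exported variables are inherited.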

Training

Step 0: Start `ray` cluster

# We use 4 A100-80GB-PCIe to run our experiments
ray start --head
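After starting the head node, it may help to confirm that the machine and the Ray cluster see the GPUs you expect. This is a rough sketch under stated assumptions: `count_gpus` and `EXPECTED_GPUS` are hypothetical names, and the count relies on `nvidia-smi --list-gpus` printing one line per device starting with `GPU `.

```shell
# Hypothetical helper: counts lines of `nvidia-smi --list-gpus` output on stdin.
# `grep -c` exits non-zero when nothing matches, so `|| true` keeps it quiet.
count_gpus() { grep -c '^GPU ' || true; }

EXPECTED_GPUS=4   # assumption: matches the 4-GPU setup described above

# Prints 0 when nvidia-smi is missing or reports no devices.
found_gpus="$(nvidia-smi --list-gpus 2>/dev/null | count_gpus)"
echo "Found $found_gpus GPU(s); expected $EXPECTED_GPUS"

# `ray status` summarizes the resources of the running cluster.
if command -v ray >/dev/null 2>&1; then
    ray status || echo "No running Ray cluster found; run 'ray start --head' first"
fi
```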

Step 1: Host Reward model

# We use 4 L40-48GB GPUs to host reward model inference
# This GPU setup is flexible, depending on how much inference workload you need to handle simultaneously
# We recommend setting num-gpus here to the number of actors in the training process
bash scripts/serve_remote_wildguard.sh --num-gpus 4 --tensor-parallel-size 1

Step 2: Run REINFORCE++ to reproduce our Self-play + SFT checkpoints

Before you start, please ensure that you have done the following:

Unzip red_team/data/data.zip to the same directory if you need the dataset used in our experiments.
Get the hostname of your remote reward model process. It is usually printed to the console when the process first initializes. Then change REMOTE_RM_URL="http://0.0.0.0:5000/classify" to use that hostname.
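The two preparation steps above can be sketched as follows, run from the repository root. This is an illustration under assumptions, not the authors' tooling: `DATA_ZIP`, `RM_HOST`, and `RM_PORT` are hypothetical variable names, and the `curl` call only checks that something answers at the URL, not that the `/classify` endpoint accepts a particular request.

```shell
# Extract the dataset next to the archive, without overwriting existing files.
DATA_ZIP="red_team/data/data.zip"
if [ -f "$DATA_ZIP" ]; then
    unzip -n "$DATA_ZIP" -d "$(dirname "$DATA_ZIP")"
else
    echo "Skipping unzip: $DATA_ZIP not found (run from the repository root)"
fi

# Build the reward model URL; replace the defaults with the hostname and port
# printed by the reward model process when it starts.
RM_HOST="${RM_HOST:-0.0.0.0}"
RM_PORT="${RM_PORT:-5000}"
REMOTE_RM_URL="http://${RM_HOST}:${RM_PORT}/classify"

# Without -f, curl exits 0 on any HTTP response, so this only tests that a
# server is listening at the address.
if curl -s -o /dev/null --max-time 5 "$REMOTE_RM_URL"; then
    echo "Reward model server reachable at $REMOTE_RM_URL"
else
    echo "No response from $REMOTE_RM_URL"
fi
```

The resulting `REMOTE_RM_URL` value is what should replace the default in your experiment settings.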

# Change your experiment settings inside the shell script
bash ./scripts/red_team_game_reinforce_8b.sh

Eval

Please review eval/README.md

Cite This

If you find our work helpful, please consider citing this work!

@misc{liu2025chasingmovingtargetsonline,
      title={Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models}, 
      author={Mickel Liu and Liwei Jiang and Yancheng Liang and Simon Shaolei Du and Yejin Choi and Tim Althoff and Natasha Jaques},
      year={2025},
      eprint={2506.07468},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.07468}, 
}
