RDS

This is the official repository for "Root Defense Strategies: Ensuring Safety of LLM at the Decoding Level" (Accepted by ACL 2025).

How to train the classifier

bash scripts/forward.sh

Replace the estimate.py with our released one, as we only use the hidden state of the original queries without any safety prompt.
To Train the classifier, run:

python estimate.py \
    --system_prompt_type ${system_prompt_type} \
    --config sampling --pretrained_model_path ${model}

During refering, RDS extends EAGLE_Head (https://github.com/SafeAILab/EAGLE) in resampling process to generate the hidden state of the candidate tokens.

replace EAGLE-my/eagle/model/cnet.py with our cnet.py, in which we add the classifier in the decoder layer for resampling.
To generate the safe output, run:

python code.py

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
README.md		README.md
cnets.py		cnets.py
code.py		code.py
estimate.py		estimate.py