Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment.
Installation | To Run | Acknowledgment
This repository contains code and data for: Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment.
Requires Python 3.9 to run.
Install the conda environment from the environment.yml file:
conda env create -n TA2 --file environment.yml
conda activate TA2
GPT-Judge is required to evaluate TruthfulQA. Please refer to ./FineTune_Judge/tune.sh for details. You need to provide your OpenAI API key.
echo "export OPENAI_API_KEY='yourkey'" >> ~/.zshrcThen, please provide the related information in evaluate_tqa.py:
openai.api_key = "YOUR KEY" # TODO
truth_model = "YOUR MODEL HERE" # TODO
info_model = "YOUR MODEL HERE" # TODO
All intermediate results will be saved to the ../Intermediate folder.
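As a minimal sketch of this configuration (assuming the openai Python package prior to v1.0, which exposes openai.api_key, and that the export added to ~/.zshrc above has been sourced), the fields in evaluate_tqa.py might be filled in as follows; the two model IDs are hypothetical placeholders for the fine-tuned GPT-Judge models produced by ./FineTune_Judge/tune.sh:
import os
import openai

# Read the key from the environment instead of hard-coding it.
openai.api_key = os.environ["OPENAI_API_KEY"]

# Placeholder fine-tuned judge model IDs (one for truthfulness, one for
# informativeness); replace with the IDs returned by your own fine-tuning run.
truth_model = "curie:ft-your-org:gpt-judge-truth"  # hypothetical ID
info_model = "curie:ft-your-org:gpt-judge-info"    # hypothetical ID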
To generate clean output:
./Scripts/clean_run_tqa.sh
./Scripts/clean_run_toxigen.sh
./Scripts/clean_run_bold.sh
./Scripts/clean_run_harmful.sh
To generate adversarial output:
./Scripts/adv_gen_tqa.sh
./Scripts/adv_gen_toxigen.sh
./Scripts/adv_gen_bold.sh
To attack:
./Scripts/attack_tqa.sh
./Scripts/attack_toxigen.sh
./Scripts/attack_bold.sh
./Scripts/attack_harmful.sh
To evaluate:
python evaluate_tqa.py --model [llama, vicuna] --prompt_type [freeform, choice]
python evaluate_toxigen.py --model [llama, vicuna] --prompt_type [freeform, choice]
python evaluate_bold.py --model [llama, vicuna] --prompt_type [freeform, choice]
python evaluate_harmful.py --model [llama, vicuna] --prompt_type [freeform, choice]
The attack.py is built upon the following work:
Red-teaming language models via activation engineering https://github.com/nrimsky/LM-exp/blob/main/refusal/refusal_steering.ipynb
Many thanks to the authors and developers!