Code for the attacks and defenses in our paper, *Benchmarking Misuse Mitigation Against Covert Adversaries*.
This repository contains implementations of various defense mechanisms against covert adversarial attacks on language models. The defenses are designed to detect and mitigate harmful requests that are potentially distributed across multiple users or user sessions. Please see our paper for details.
The defenses in this repository are designed to work with the Benchmarks for Stateful Defenses (BSD) dataset, which contains challenging questions that test "misuse uplift" and "detectability" of harmful request patterns.
The BSD dataset is available through HuggingFace at: https://huggingface.co/datasets/BrachioLab/BSD
Access Policy: Access to the dataset is restricted to enable legitimate safety research while preventing potentially harmful applications. To access the dataset:
- Visit our HuggingFace dataset page
- Submit a request through the provided form. We will follow up with an email asking about your use case.
All attack implementations are available in the JB_attacks/ folder. See the JB_attacks README for usage instructions.
**Decomposition Attacks** - The primary focus of our work is decomposition attacks, in which a harmful request is broken into seemingly benign sub-questions distributed across multiple interactions or users. This covert approach makes detection significantly harder for traditional safety mechanisms.
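The distribution step of a decomposition attack can be sketched in a few lines. This is illustrative only: the attacks in this repository generate the sub-questions with an LLM, whereas here they are simply given as input, and the round-robin scheduling is an assumption rather than the paper's exact strategy.

```python
def decompose_and_distribute(subquestions, n_sessions):
    """Assign sub-questions to independent sessions round-robin,
    so no single session ever sees the full request.

    Illustrative sketch: real decomposition attacks produce the
    sub-questions adaptively; here they are a fixed input list.
    """
    sessions = [[] for _ in range(n_sessions)]
    for i, question in enumerate(subquestions):
        sessions[i % n_sessions].append(question)
    return sessions
```

Because each session receives only a fraction of the sub-questions, a defense that inspects sessions independently sees nothing obviously harmful, which is why the stateful defenses below aggregate evidence across queries.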
**Comparison Methods** - To evaluate defense effectiveness, we compare against several established jailbreaking methods:
- PAIR
- Adaptive Attack
- Adversarial Reasoning
- Crescendo
The defense/ folder is organized into two main categories: prompt_wise/ contains defenses that operate on individual queries without maintaining state, while stateful/ contains defenses that track information across multiple queries to detect covert attacks.
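The distinction between the two categories can be sketched as follows. The keyword list, class names, and scoring rule are all toy assumptions for illustration, not the scoring used by the defenses in this repository; the point is only that a stateful defense accumulates evidence that a per-query check misses.

```python
from collections import deque

# Illustrative keyword list; real defenses score queries with a model.
SUSPICIOUS = {"synthesize", "precursor", "bypass"}

def prompt_wise_flag(query):
    """Prompt-wise defense: scores each query in isolation."""
    return sum(w in query.lower() for w in SUSPICIOUS) >= 2

class BufferDefense:
    """Stateful defense: keeps a sliding window of per-query scores
    and flags when the accumulated evidence crosses a threshold, so
    mildly suspicious queries that each pass a pointwise check can
    still trip the detector together."""

    def __init__(self, window=5, threshold=3):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, query):
        self.scores.append(sum(w in query.lower() for w in SUSPICIOUS))
        return sum(self.scores) >= self.threshold
```

Here three queries that each look benign in isolation (score 1, below the pointwise threshold of 2) jointly push the buffer over its threshold on the third observation.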
- modeling/ - Inference, adversarial training, and evaluation of the defenses on the decomposition dataset.
  - training/ - Adversarial finetuning against PAIR/decomposition attacks
    - finetune_8b_binary.py - Adversarial training for decomposition attacks
    - finetune_pair.py - Adversarial training against PAIR jailbreaks
  - inference/ - Prediction and inference utilities
    - predict.py - Basic prediction utilities
    - predict_pair.py - Evaluating detection defenses on PAIR attacks
    - pointwise_defense.py - Evaluating pointwise detection defenses for decomposition
- buffer_methods/ - Buffer-based stateful defense mechanisms
  - buffer_decomp.py and buffer_defense.py - Buffer defenses
  - buffer_defense_together.py - Buffer defense strategy using Llama 70B via the Together API
  - random_sample.py - Stateful evaluation through sampling across multiple runs
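The idea behind sampling-based stateful evaluation can be sketched as below. This is a minimal sketch, not the code in random_sample.py: the `make_defense`/`observe` interface and the detection-rate metric are assumptions made for illustration.

```python
import random

def sampled_detection_rate(make_defense, queries, n_runs=20, seed=0):
    """Estimate a stateful defense's detection rate by replaying the
    same query set in randomly shuffled orders, since a stateful
    defense's verdict can depend on the order in which queries arrive.

    make_defense: zero-arg factory returning a fresh defense whose
    observe(query) method returns True when it flags (assumed API).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        order = list(queries)
        rng.shuffle(order)
        defense = make_defense()  # fresh state for each run
        if any(defense.observe(q) for q in order):
            hits += 1
    return hits / n_runs
```

Resetting the defense each run matters: carrying state across runs would let earlier orderings contaminate later ones and bias the estimate.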
