davisrbr/bsd-misuse

Benchmarking Misuse Mitigation Against Covert Adversaries

[Figure: Decomposition attack schematic]

Code for the attacks and defenses in our paper, Benchmarking Misuse Mitigation Against Covert Adversaries.

Overview

This repository contains implementations of defense mechanisms against covert adversarial attacks on language models. The defenses are designed to detect and mitigate harmful requests that may be distributed across multiple users or user sessions. Please see our paper for details.

Dataset

The defenses in this repository are designed to work with the Benchmarks for Stateful Defenses (BSD) dataset, which contains challenging questions that test "misuse uplift" and "detectability" of harmful request patterns.

Accessing the BSD Dataset

The BSD dataset is available through HuggingFace at: https://huggingface.co/datasets/BrachioLab/BSD

Access Policy: Access to the dataset is restricted to enable legitimate safety research while preventing potentially harmful applications. To access the dataset:

  1. Visit our HuggingFace dataset page
  2. Submit a request through the provided form. We will follow up with an email asking about your use case.
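Once access is granted, the dataset can be loaded with the Hugging Face `datasets` library. A minimal sketch, assuming an authenticated session; the split names and field layout are not documented here, so check the dataset page:

```python
def load_bsd(split=None):
    """Load the gated BSD dataset.

    Requires approved access on the Hugging Face Hub and an
    authenticated session (e.g. via `huggingface-cli login`).
    """
    # Lazy import so this sketch stays importable without the library;
    # install with `pip install datasets`.
    from datasets import load_dataset

    return load_dataset("BrachioLab/BSD", split=split)
```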

Attacks

All attack implementations are available in the JB_attacks/ folder. See the JB_attacks README for usage instructions.

  • Decomposition Attacks - The primary focus of our work is on decomposition attacks, where harmful requests are broken into seemingly benign sub-questions distributed across multiple interactions or users. This covert approach makes detection significantly more challenging for traditional safety mechanisms.

  • Comparison Methods - To evaluate defense effectiveness, we compare against several established jailbreaking methods:

    • PAIR
    • Adaptive Attack
    • Adversarial Reasoning
    • Crescendo

Defenses

The defense/ folder is organized into two main categories: prompt_wise/ contains defenses that operate on individual queries without maintaining state, while stateful/ contains defenses that track information across multiple queries to detect covert attacks.
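To illustrate the distinction, here is a toy sketch of a stateful defense (not the paper's implementation): it accumulates per-user evidence across queries, so a request that looks benign in isolation can still be blocked once the session's running total crosses a threshold. The flagged-term matching and threshold are illustrative assumptions.

```python
from collections import defaultdict


class StatefulMonitor:
    """Toy stateful defense: track flagged-topic hits per user across
    queries and block once the running count crosses a threshold.

    A prompt-wise defense would score each query independently; the
    whole point of statefulness is the accumulated `hits` counter.
    """

    def __init__(self, flagged_terms, threshold=3):
        self.flagged_terms = {t.lower() for t in flagged_terms}
        self.threshold = threshold
        self.hits = defaultdict(int)  # user_id -> cumulative hit count

    def check(self, user_id, query):
        """Return True if this user's session should now be blocked."""
        # Count flagged terms in this query and add to the user's total.
        tokens = set(query.lower().split())
        self.hits[user_id] += len(tokens & self.flagged_terms)
        return self.hits[user_id] >= self.threshold
```

Each individual query may contribute too little evidence to block on its own; the cumulative count across a user's session is what trips the threshold, which is the behavior a decomposition attack tries to evade.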

Prompt-wise Defenses (prompt_wise/)

  • modeling/ - Inference, adversarial training, and evaluation of the defenses on the decomposition dataset.
    • training/ - Adversarial finetuning against PAIR/decomposition attacks
    • inference/ - Prediction and inference utilities

Stateful Defenses (stateful/)
