Alignment-Enhanced Decoding (AED)

This repository contains the implementation of the paper "Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions". The paper presents a novel defense that employs adaptive decoding to address the root causes of jailbreak issues. 😊

Abstract

Large language models are susceptible to jailbreak attacks, which can result in the generation of harmful content. While prior defenses mitigate these risks by perturbing or inspecting inputs, they ignore competing objectives, the underlying cause of alignment failures. In this paper, we propose Alignment-Enhanced Decoding (AED), a novel defense that employs adaptive decoding to address the root causes of jailbreak issues. We first define the Competitive Index to quantify alignment failures and utilize feedback from self-evaluation to compute post-alignment logits. Then, AED adaptively combines Competitive Index and post-alignment logits with the original logits to obtain harmless and helpful distributions. Consequently, our method enhances safety alignment while maintaining helpfulness. We conduct experiments across five models and four common jailbreaks, with the results validating the effectiveness of our approach.

Pipeline

AED has three steps: Step 1 obtains the probability distribution of the next token; Step 2 computes the Competitive Index, which reflects the degree of competition among objectives; and Step 3 realigns the distribution to ensure a safe and ethical response. More details can be found in our paper. 😄
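The three steps can be sketched on toy logits. This is only an illustration under stated assumptions, not the paper's formulation: `toy_competitive_index` uses total variation distance as a stand-in for the real Competitive Index, the linear blend in `refine_distribution` is an assumed combination rule, and all function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Step 1: turn next-token logits into a probability distribution."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def toy_competitive_index(original_logits, post_alignment_logits):
    """Step 2 (stand-in): total variation distance between the two
    distributions, a value in [0, 1]. The paper defines its own index;
    this placeholder is an assumption for illustration only."""
    p = softmax(original_logits)
    q = softmax(post_alignment_logits)
    return 0.5 * np.abs(p - q).sum()

def refine_distribution(original_logits, post_alignment_logits):
    """Step 3 (assumed rule): blend the two logit vectors, weighting the
    post-alignment logits more heavily as the index grows."""
    original = np.asarray(original_logits, dtype=float)
    post = np.asarray(post_alignment_logits, dtype=float)
    c = toy_competitive_index(original, post)
    mixed = (1.0 - c) * original + c * post
    return softmax(mixed), c

# Toy 3-token vocabulary: the original logits favor token 0,
# the post-alignment logits favor token 2.
original = np.array([2.0, 0.5, 0.1])
post = np.array([0.1, 0.5, 2.0])
probs, c = refine_distribution(original, post)
print("index:", round(float(c), 3), "refined:", np.round(probs, 3))
```

When the two logit vectors agree, the stand-in index is 0 and the original distribution is returned unchanged, which mirrors the stated goal of staying helpful when no competition is detected.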

Defense Results

The table compares the defense capabilities of AED (ours) against other defense methods across five LLMs and four types of jailbreak attacks. Rejection Rate (RR) is used as the evaluation metric. The best results are highlighted in bold, and the second-best results are underlined. The PPL method is highly effective against GCG attacks but achieves 0% effectiveness in the other jailbreak scenarios.
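As a rough illustration of the metric, Rejection Rate can be read as the fraction of responses in which the model refuses. The sketch below assumes refusals are detected by simple keyword matching; the marker list and function names are hypothetical, and the paper's actual evaluation procedure may differ.

```python
# Hypothetical refusal markers; the real evaluation may use a different list
# or a different judging method altogether.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")

def is_refusal(response):
    """Return True if the response contains any refusal marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def rejection_rate(responses):
    """Fraction of responses classified as refusals, in [0, 1]."""
    if not responses:
        raise ValueError("responses must not be empty")
    return sum(is_refusal(r) for r in responses) / len(responses)

responses = [
    "I'm sorry, but I can't help with that request.",
    "Sure, here is a summary of the article.",
    "I cannot assist with this.",
    "Here are three ideas for your project.",
]
rr = rejection_rate(responses)
print(f"RR = {rr:.0%}")
```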

Setting Up the Environment

To set up the environment, follow these steps:

Clone the Repository:

git clone https://github.com/yourusername/yourrepository.git
cd yourrepository

Create a Virtual Environment:

# Using conda
conda create --name myenv --file requirements.txt
conda activate myenv

Install Dependencies:
```
pip install -r requirements.txt
```
Run the Application: Open and run the main.ipynb notebook using Jupyter Notebook or JupyterLab.

Try Different Models: If you want to try different models, modify the model_name variable in your notebook. For example:

model_name = "vicuna"  # Change to vicuna, llama3, gemma, or guanaco
model_path = "../llama2-7b-chat"  # Update the path to point at the weights for the chosen model

Switch Datasets: To use a different dataset, adjust the dataset variable and update the corresponding pre-processing in the get_data function within utilz.py. For example:
```
dataset = "gcg"  # Change to the dataset you want to use
```
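Per-dataset pre-processing of this kind is often written as a branch on the dataset name. The sketch below is a hypothetical stand-in, not the repository's `get_data`: the function name `get_data_sketch`, its signature, the `"plain"` dataset name, and the record fields (`goal`, `suffix`, `prompt`) are all assumptions made for illustration.

```python
import json

def get_data_sketch(dataset, raw_text):
    """Hypothetical stand-in: parse JSON records and return a list of
    prompt strings, using a different layout for each dataset name."""
    records = json.loads(raw_text)
    if dataset == "gcg":
        # Assumed layout: a goal followed by an adversarial suffix
        return [f"{r['goal']} {r['suffix']}" for r in records]
    if dataset == "plain":
        # Assumed layout: the prompt is stored directly
        return [r["prompt"] for r in records]
    raise ValueError(f"No pre-processing defined for dataset {dataset!r}")

gcg_raw = json.dumps([{"goal": "Example goal", "suffix": "!! !!"}])
prompts = get_data_sketch("gcg", gcg_raw)
print(prompts)
```

Adding a new dataset then means adding one branch that maps its records to prompt strings; unknown names fail loudly instead of silently returning nothing.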
