This repository is the implementation of the paper: Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions. In this paper, we present a novel defense that employs adaptive decoding to address the root causes of jailbreak issues.😊
Large language models are susceptible to jailbreak attacks, which can result in the generation of harmful content. While prior defenses mitigate these risks by perturbing or inspecting inputs, they ignore competing objectives, the underlying cause of alignment failures. In this paper, we propose Alignment-Enhanced Decoding (AED), a novel defense that employs adaptive decoding to address the root causes of jailbreak issues. We first define the Competitive Index to quantify alignment failures and utilize feedback from self-evaluation to compute post-alignment logits. Then, AED adaptively combines Competitive Index and post-alignment logits with the original logits to obtain harmless and helpful distributions. Consequently, our method enhances safety alignment while maintaining helpfulness. We conduct experiments across five models and four common jailbreaks, with the results validating the effectiveness of our approach.
AED has three steps: Step 1 obtains the probability distribution of the next token; Step 2 computes the Competitive Index, which reflects the degree of competition; and Step 3 realigns the distribution to ensure a safe and ethical response. More details can be found in our paper.😄
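The three steps can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation in this repository: the function names are hypothetical, the normalized-entropy score is a stand-in for the Competitive Index (the paper gives the real definition), and the linear mix of logits is an assumed form of the adaptive combination.

```python
import math


def softmax(logits):
    """Step 1: turn raw next-token logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def competitive_index(probs):
    """Step 2 (stand-in): normalized entropy in [0, 1].

    ASSUMPTION: this is NOT the paper's Competitive Index, only a
    placeholder score that grows when several tokens compete.
    """
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return min(1.0, max(0.0, entropy / math.log(len(probs))))


def refine_logits(original_logits, post_alignment_logits):
    """Step 3 (assumed form): lean on the post-alignment logits more
    strongly when the stand-in index is high."""
    probs = softmax(original_logits)
    ci = competitive_index(probs)
    mixed = [
        (1.0 - ci) * orig + ci * aligned
        for orig, aligned in zip(original_logits, post_alignment_logits)
    ]
    return softmax(mixed), ci


# Toy three-token vocabulary; the numbers are made up for illustration.
refined_probs, ci = refine_logits([2.0, 2.0, 0.0], [0.0, 4.0, 0.0])
print(ci, refined_probs)
```

When the original distribution is sharply peaked, the stand-in index is close to 0 and the original logits pass through almost unchanged; when it is flat, the post-alignment logits dominate.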
The table compares the defense capabilities of AED (ours) against other defense methods across five LLMs and four types of jailbreak attacks. Rejection Rate (RR) is used as the metric for evaluation. The best results are highlighted in bold, while the second best results are underlined. The PPL method demonstrates high effectiveness against GCG attacks but achieves 0% effectiveness in other jailbreak scenarios.
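For readers who want to compute a Rejection Rate on their own outputs, here is a minimal sketch. It assumes a response counts as a rejection when it contains one of a few refusal phrases; both the phrase list and the function names are hypothetical, and the paper's evaluation may detect rejections differently.

```python
# ASSUMPTION: a small, illustrative list of refusal phrases (lowercase).
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_rejection(response, markers=REFUSAL_MARKERS):
    """Return True if the response contains any refusal phrase."""
    text = response.lower()
    return any(marker in text for marker in markers)


def rejection_rate(responses):
    """Percentage of responses flagged as rejections (0.0 for an empty list)."""
    if not responses:
        return 0.0
    rejected = sum(is_rejection(r) for r in responses)
    return 100.0 * rejected / len(responses)


# Made-up responses, for illustration only.
sample = [
    "I'm sorry, but I can't help with that.",
    "Sure, here is a summary of the article.",
]
print(rejection_rate(sample))
```

Keyword matching is cheap but can miss polite refusals or flag harmless answers, so treat the number as a rough estimate.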

To set up the environment, follow these steps:
- Clone the Repository:
      git clone https://github.com/yourusername/yourrepository.git
      cd yourrepository
- Create a Virtual Environment:
      # Using conda
      conda create --name myenv --file requirements.txt
      conda activate myenv
- Install Dependencies:
      pip install -r requirements.txt
- Run the Application: Open and run the main.ipynb notebook using Jupyter Notebook or JupyterLab.
- Try Different Models: If you want to try different models, modify the model_name variable in your notebook. For example:
      model_name = "vicuna"  # Change to vicuna, llama3, gemma, or guanaco
      model_path = "../llama2-7b-chat"  # Don't forget to update the model path accordingly
- Switch Datasets: To use a different dataset, adjust the dataset variable and update the corresponding pre-processing in the get_data function within utilz.py. For example:
      dataset = "gcg"  # Change to the dataset you want to use
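Because model_name and model_path have to change together, one optional way to keep them in sync is a small lookup table. The sketch below is not part of this repository: the resolve_model helper, the "llama2" key, and every path except "../llama2-7b-chat" are hypothetical placeholders to replace with your own checkpoint locations.

```python
# ASSUMPTION: local checkpoint directories; only "../llama2-7b-chat"
# appears in this README, the other paths are placeholders.
MODEL_PATHS = {
    "llama2": "../llama2-7b-chat",
    "vicuna": "../vicuna-7b",
    "llama3": "../llama3-8b-instruct",
    "gemma": "../gemma-7b-it",
    "guanaco": "../guanaco-7b",
}


def resolve_model(model_name):
    """Return (model_name, model_path), failing early on unknown names."""
    try:
        return model_name, MODEL_PATHS[model_name]
    except KeyError:
        known = ", ".join(sorted(MODEL_PATHS))
        raise ValueError(f"Unknown model {model_name!r}; expected one of: {known}") from None


model_name, model_path = resolve_model("vicuna")
print(model_name, model_path)
```

Failing early on an unknown name avoids loading the wrong weights under a mismatched label.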