Recently, Large Language Models (LLMs) have garnered significant attention for their exceptional natural language processing capabilities. However, concerns about their trustworthiness remain unresolved, particularly regarding "jailbreaking" attacks on aligned LLMs. Previous research has predominantly relied on scenarios with white-box LLMs or specific, fixed prompt templates, which are often impractical and lack broad applicability. In this paper, we introduce a straightforward and novel method, named ObscurePrompt, for jailbreaking LLMs, inspired by the observed fragile alignments on Out-of-Distribution (OOD) data. Specifically, we first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects the LLM's ethical decision boundary. ObscurePrompt starts by constructing a base prompt that integrates well-known jailbreaking techniques. Powerful LLMs are then used to obscure the original prompt through iterative transformations, aiming to bolster the attack's robustness. Comprehensive experiments show that our approach substantially improves attack effectiveness over previous methods and maintains efficacy against two prevalent defense mechanisms. We believe our work offers fresh insights for future research on strengthening LLM alignment.
To replicate the experimental results, follow these steps:
This section defines the settings for connecting to different APIs. The available options are `azure` and `openai`.

```yaml
api:
  type: azure # or openai
  endpoint: END_POINT
  version: 2023-12-01-preview
  key: API_KEY
```

- `type`: Specifies the API provider. Can be either `azure` or `openai`.
- `endpoint`: The endpoint URL for the API.
- `version`: The version of the API to use.
- `key`: The API key for authentication.
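As an illustration, the `api` section can be sanity-checked after parsing and before any requests are made. The helper below is a hypothetical sketch (not part of the repository); it only validates the fields described above.

```python
# Hypothetical helper (not part of the repo): sanity-check the `api`
# section of config.yaml after it has been parsed into a dict.

VALID_TYPES = {"azure", "openai"}
REQUIRED_FIELDS = ("type", "endpoint", "version", "key")

def validate_api_config(api: dict) -> dict:
    """Return the config unchanged if valid, otherwise raise ValueError."""
    missing = [f for f in REQUIRED_FIELDS if f not in api]
    if missing:
        raise ValueError(f"missing api fields: {missing}")
    if api["type"] not in VALID_TYPES:
        raise ValueError(f"api.type must be one of {sorted(VALID_TYPES)}")
    return api

cfg = {
    "type": "azure",
    "endpoint": "END_POINT",
    "version": "2023-12-01-preview",
    "key": "API_KEY",
}
print(validate_api_config(cfg)["type"])  # -> azure
```

Failing fast on a malformed config is cheaper than discovering a missing key mid-run.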
This section specifies the device to be used for running the models.

```yaml
device: cuda
```

- `device`: Specifies the device type. The example uses `cuda`, which is for GPU acceleration.
This section contains the API keys for various services.

```yaml
api_keys:
  azure_openai: API_KEY
  replicate_api_token: API_KEY
  deepinfra_openai: API_KEY
```

- `azure_openai`: API key for the Azure OpenAI service.
- `replicate_api_token`: API key for the Replicate service.
- `deepinfra_openai`: API key for the DeepInfra OpenAI service.
This section maps model names to their corresponding paths or identifiers.

```yaml
model_path_mapping:
  Llama2-7b: meta-llama/Llama-2-7b-chat-hf
  vicuna-7b: lmsys/vicuna-7b-v1.3
  ChatGPT: gpt-3.5-turbo
  GPT-4: gpt-4-turbo
```

To run the pipeline:

```shell
python run.py
```

To run a single model, add one of the following calls to the main function in `run.py`:
```python
run_pipeline('GPT-4', 'obscure')
run_pipeline('ChatGPT', 'obscure')
run_pipeline('Llama2-7b', 'obscure')
run_pipeline('Llama2-70b', 'obscure')
run_pipeline('Vicuna-13b', 'obscure')
```

We provide two types of ASR (Attack Success Rate) evaluation methods: Single ASR Evaluation (ASR for a single result) and Combined ASR Evaluation (integrated prompts with multiple results).
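The distinction between the two metrics can be sketched as follows. This is an illustrative interpretation of the description above, not the repository's exact logic: it assumes each attack attempt yields a boolean jailbreak outcome, and that the combined metric groups the integrated prompt variants of one query and counts the attack as successful if any variant succeeds.

```python
# Illustrative sketch (an assumption, not the repo's exact implementation):
# - single ASR: fraction of prompts whose attack succeeded
# - combined ASR: group every `combined_num` consecutive results and
#   count a group as successful if ANY result in it succeeds.

def single_asr(results: list[bool]) -> float:
    """Per-prompt attack success rate."""
    return sum(results) / len(results)

def combined_asr(results: list[bool], combined_num: int) -> float:
    """Per-group success rate: any success within a group counts."""
    groups = [results[i:i + combined_num]
              for i in range(0, len(results), combined_num)]
    return sum(any(g) for g in groups) / len(groups)

outcomes = [True, False, False, False]   # four attack attempts
print(single_asr(outcomes))              # 0.25
print(combined_asr(outcomes, 2))         # groups [T,F], [F,F] -> 0.5
```

Combined ASR is always at least as high as single ASR, which is why integrating multiple obscured prompts strengthens the reported attack.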
To run the evaluation, define the `model_list` in `config.yaml` under `evaluation_setting`. Here is an example configuration:

```yaml
evaluation_setting:
  model_list:
    - ChatGPT
    - GPT-4
    - Vicuna-7b
    - Llama2-7b
```

If you want to run the combined evaluation, set the `combined_num` in `config.yaml`.
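For example (the value `5` below is only an illustrative placeholder; set it to match how many integrated prompts each result contains):

```yaml
evaluation_setting:
  combined_num: 5  # illustrative value
```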
Define your result file path in `config.yaml` under `evaluation_setting`. For example, if your `result_file_path` is `res`, organize your results as follows:

```
res/ChatGPT/...
res/Llama2-7b/...
res/Other_model_in_model_list/...
...
```
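Before launching the evaluation, it can be worth confirming that the directory layout matches `model_list`. The helper below is a hypothetical sketch, not part of the repository:

```python
# Hypothetical helper (not part of the repo): verify that
# result_file_path contains one subdirectory per model in model_list.
import os
import tempfile

def missing_result_dirs(result_file_path: str, model_list: list[str]) -> list[str]:
    """Return the models whose result directory is absent."""
    return [m for m in model_list
            if not os.path.isdir(os.path.join(result_file_path, m))]

# Demo against a throwaway directory tree.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "ChatGPT"))
os.makedirs(os.path.join(demo_root, "Llama2-7b"))
print(missing_result_dirs(demo_root, ["ChatGPT", "GPT-4", "Llama2-7b"]))  # ['GPT-4']
```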
To run the single evaluation, use the following command:

```shell
python script.py single
```

To run the combined evaluation, use the following command:

```shell
python script.py combined
```

To cite this work:

```bibtex
@misc{huang2024obscureprompt,
      title={ObscurePrompt: Jailbreaking Large Language Models via Obscure Input},
      author={Yue Huang and Jingyu Tang and Dongping Chen and Bingda Tang and Yao Wan and Lichao Sun and Xiangliang Zhang},
      year={2024},
      eprint={2406.13662},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
