
Semantic Representation Attack against Aligned Large Language Models

Introduction

Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content. Current methods typically target exact affirmative responses and suffer from limited convergence, unnatural prompts, and high computational costs. We introduce semantic representation attacks, a novel paradigm that fundamentally reconceptualizes adversarial objectives against aligned LLMs. Rather than targeting exact textual patterns, our approach exploits the semantic representation space to elicit diverse responses that share equivalent harmful meanings. This innovation resolves the inherent trade-off between attack effectiveness and prompt naturalness that plagues existing methods. Our Semantic Representation Heuristic Search (SRHS) algorithm efficiently generates semantically coherent adversarial prompts by maintaining interpretability during incremental search. We establish rigorous theoretical guarantees for semantic convergence and demonstrate that SRHS achieves unprecedented attack success rates (89.4% averaged across 18 LLMs, including 100% on 11 models) while significantly reducing computational requirements. Extensive experiments show that our method consistently outperforms existing approaches.

Overview

Figure: Illustration of vanilla attacks under Semantic Incoherence and our Semantic Representation Attack under Semantic Coherence. Vanilla methods optimize for specific textual outputs, producing semantically incoherent prompts limited to a single response pattern. Our approach maintains coherence during optimization, enabling convergence to equivalent semantic representations across lexical variations, which provides multiple viable optimization paths and enhances attack performance.

Quick Start

Installation

We recommend creating a conda environment as follows:

conda env create -f environment.yml
conda activate harmbench

All required packages for SRA are listed in environment.yml. Then clone the repository and install the remaining dependencies:

git clone https://github.com/JiaweiLian/SRA.git
cd SRA
pip install -r requirements.txt
python -m spacy download en_core_web_sm

For HarmBench installation and usage, see the HarmBench documentation.

Full Workflow to Run SRA

  1. Generate test cases (step 1):
python scripts/run_pipeline.py --methods SRA --models <target_model> --step 1 --mode slurm
  2. Merge test cases (step 1.5):
python scripts/run_pipeline.py --methods SRA --models <target_model> --step 1.5 --mode local
  3. Supplement results (optional, if needed):
python tools/supplement_results_sra.py --file_path <merged_json_file>
  4. Calculate the attack success rate (ASR):
python tools/asr_sra.py --file_path <supplemented_json_file>

Using your own models in HarmBench

You can add new Hugging Face transformers models by adding an entry for your model to configs/model_configs/models.yaml. The model can then be evaluated with most red teaming methods without modifying the method configs (using our dynamic experiment config parsing code, described in ./docs/configs.md). Some methods (AutoDAN, PAIR, TAP) require manually adding experiment configs for new models.
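
As an illustration, a models.yaml entry has roughly the following shape. The key my_llama2_7b, the model path, and the exact field names here are assumptions; the safest approach is to copy an existing entry in models.yaml and adjust it.

my_llama2_7b:
  model:
    model_name_or_path: meta-llama/Llama-2-7b-chat-hf
    dtype: float16
    chat_template: llama-2
  num_gpus: 1

The top-level key is the name you pass to --models when invoking scripts/run_pipeline.py.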

Using your own red teaming methods in HarmBench

All of our red teaming methods are implemented in baselines, imported through baselines/__init__.py, and managed by configs/method_configs. You can build on top of existing red teaming methods or add new ones by creating a new subfolder in the baselines directory. New attacks must implement the interface defined by the RedTeamingMethod class in baselines/baseline.py, as in the sketch below.
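
A minimal sketch of a new attack module is shown below. The folder name my_attack, the class name MyAttack, the behavior dictionary keys, and the exact signature and return structure of generate_test_cases are assumptions; consult baselines/baseline.py and an existing baseline for the authoritative interface.

# baselines/my_attack/my_attack.py (hypothetical location)
from ..baseline import RedTeamingMethod

class MyAttack(RedTeamingMethod):
    def __init__(self, **method_config):
        # Hyperparameters come from the corresponding entry in configs/method_configs.
        self.method_config = method_config

    def generate_test_cases(self, behaviors, verbose=False, **kwargs):
        # Assumed contract: return adversarial prompts keyed by behavior ID,
        # plus optional per-behavior logs.
        test_cases, logs = {}, {}
        for behavior in behaviors:
            # behavior is assumed to be a dict with 'BehaviorID' and 'Behavior' fields.
            behavior_id = behavior['BehaviorID']
            # Placeholder "attack": echo the behavior string unchanged.
            test_cases[behavior_id] = [behavior['Behavior']]
            logs[behavior_id] = []
        return test_cases, logs

The new class also needs to be exposed through baselines/__init__.py and given an entry under configs/method_configs so that scripts/run_pipeline.py can find it.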

Acknowledgements

This project is built on top of the HarmBench framework. We are grateful to the HarmBench team for providing a standardized evaluation framework for automated red teaming.

Citation

If you find our SRA work useful in your research, please consider citing our paper:

@article{lian2025semantic,
  title={Semantic Representation Attack against Aligned Large Language Models},
  author={Lian, Jiawei and Pan, Jianhong and Wang, Lefan and Wang, Yi and Mei, Shaohui and Chau, Lap-Pui},
  journal={Advances in Neural Information Processing Systems},
  year={2025}
}
