SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance

We introduce SafeAligner, a novel method to enhance the security of large language models (LLMs) against jailbreak attacks. SafeAligner uses two specially trained models—the Sentinel Model for safety and the Intruder Model for risk simulation—to improve response security dynamically during decoding. Our tests show that SafeAligner effectively balances enhanced security with maintained performance, providing a robust and efficient defense solution.

📖 Paper: SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
🎮 Code Repo: https://github.com/csHuangfdu/SafeAligner
🤗 Data: You can download from HuggingFace.

OverView

Data

One of the core contributions of this paper is the construction of the SafeAligner dataset. It includes 628 English data in 8 harmful scenarios, where each data consists of a harmful query and its safe and harmful replies, as shown in the following table.

Scenario	Num	# Ins	# Saf	# Haf
Adult Content	34	12.2	19.6	272.3
Economic Harm	38	14.8	17.8	218.8
Fraud Deception	72	15.1	20.4	241.1
Illegal Activity	144	14.6	21.4	206.5
Hate/Harass/Violence	130	15.7	17.3	183.8
Malware	130	17.0	20.1	249.3
Physical Harm	39	14.1	19.8	212.4
Privacy Violation Activity	41	17.2	14.5	183.5

Num represents the number of statistical data entries. Ins refers to harmful queries. Saf denotes safe responses. Haf indicates harmful responses. # represents the average token length.

This dataset supports future research on LLM jailbreak attacks, LLM security alignment research, etc.

Getting Start

We will release the code after the paper is accepted ...

Cite

@article{huang2024safealigner,
  title={SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance},
  author={Huang, Caishuang and Zhao, Wanxu and Zheng, Rui and Lv, Huijie and Dou, Shihan and Li, Sixian and Wang, Xiao and Zhou, Enyu and Ye, Junjie and Yang, Yuming and others},
  journal={arXiv preprint arXiv:2406.18118},
  year={2024}
}

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
figs		figs
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance

OverView

Data

Getting Start

Cite

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance

OverView

Data

Getting Start

Cite

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages