We introduce SteeringSafety: a comprehensive benchmark for evaluating representation steering methods across seven safety perspectives.
Our focus is on 1) how effective current steering methods are on standardized safety perspectives, and 2) how steering one perspective affects others, which is critical for safe deployment.
We hope this benchmark will foster development of more precise steering methods and serve as a platform for introducing new approaches to increase safety and datasets to test them.
- 17 datasets collected and standardized covering 7 perspectives for measuring safety behaviors
- Modular framework decomposing training-free steering methods into standardized, interchangeable components
Before running, ensure you have Python 3.10+ with an up-to-date version of pip (pip install --upgrade pip if necessary). A CUDA-capable GPU is recommended for larger models.
# Clone repository
git clone https://github.com/wang-research-lab/SteeringSafety.git
cd SteeringSafety
# Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
# Install dependencies
pip install -e .
# Configure .env file
cp .env.example .env
# Edit .env to set API keys (see below for requirements)
# Quick test with no API keys needed (primary behavior only, debug mode, small model)
python scripts/run/run_full_pipeline.py -m qwen25-05b -c explicit_bias -M dim --primary-only --debug
Different experiments require different API keys:
For Steered Perspectives Only (no API keys needed):
- Explicit/Implicit Bias (ToxiGen, BBQ) - uses exact match evaluation
- Intrinsic Hallucination (FaithEval) - uses exact match evaluation
Requires Groq API Key (GROQ_API_KEY):
- Extrinsic Hallucination (PreciseWikiQA) - uses `llama-3.3-70b-versatile` for evaluation
- Refusal (SALADBench) - if using `REFUSAL_EVAL_METHOD=GROQ` (default)
Requires OpenAI API Key (OPENAI_API_KEY):
- 5 DarkBench datasets: Brand Bias, Sycophancy, Anthropomorphism, User Retention, Sneaking
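A minimal .env might look like the following sketch (variable names taken from the requirements above; see .env.example for the authoritative list, and treat the values as placeholders):
GROQ_API_KEY=your_groq_key
OPENAI_API_KEY=your_openai_key
REFUSAL_EVAL_METHOD=GROQ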
We support 5 main steering methods (DIM, ACE, CAA, PCA, LAT), each with 3 variants (standard, no KL divergence check, and conditional/CAST). Steering targets 3 perspectives via 5 concepts (explicit_bias, implicit_bias, hallucination_extrinsic, hallucination_intrinsic, refusal_base) and is evaluated on the 17 datasets.
# Full evaluation with entanglement measurement (requires OpenAI key and other keys as needed)
python scripts/run/run_full_pipeline.py -m qwen25-7b -c explicit_bias -M dim
# Different steering variants
python scripts/run/run_full_pipeline.py -m qwen25-7b -c explicit_bias -M dim_nokl # No KL constraint
python scripts/run/run_full_pipeline.py -m qwen25-7b -c explicit_bias -M dim_conditional # With CAST
For comprehensive evaluation across multiple models, concepts, and methods (run on any subset of these):
python scripts/run/run_parallel_experiments.py \
--skip-metrics \
--concepts explicit_bias implicit_bias hallucination_extrinsic refusal_base hallucination_intrinsic \
--model llama3-1-8b qwen25-7b gemma2-2b \
--methods dim ace caa pca lat dim_nokl ace_nokl caa_nokl pca_nokl lat_nokl dim_conditional ace_conditional caa_conditional pca_conditional lat_conditional \
--gpu 0 1 2 3 4 5 6 \
--skip-ood \
--debug
This runs experiments across the specified concepts, models, and methods in parallel on multiple GPUs. Adjust the concept, model, and method lists as needed, as well as the GPU IDs, based on your setup and CUDA_VISIBLE_DEVICES.
Results Location: All experiments are saved to the experiments/ directory by default, organized as experiments/{method}_{concept}_{model}/{timestamp}/.
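For example, the quick-test run above would land under a path like experiments/dim_explicit_bias_qwen25-05b/<timestamp>/ (the timestamp component here is only illustrative).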
For analysis, run the following scripts to aggregate results and create visualizations as in the paper:
- Run full pipeline analysis:
python scripts/run/run_full_pipeline.py --metrics-only --experiments-dir experiments
For specific directories:
python scripts/run/run_concept_metrics.py \
    "experiments/lat_hallucination_intrinsic_llama3-1-8b/" \
    "experiments/baseline_llama3-1-8b/"
Note: For DarkBench results in the experiment directories, a higher metric means a more unsafe model. To make it consistent with the other datasets, compute `1 - metric` to get a higher-is-better score (e.g., a DarkBench metric of 0.2 corresponds to a score of 0.8).
- Generate analysis and plots:
python scripts/analysis/analyze_concept_metrics.py \
    --experiments-dir experiments \
    --output-prefix steering_analysis
python scripts/analysis/generate_all_plots.py experiments/
Currently supported models:
- Qwen2.5-7B-Instruct
- Llama-3.1-8B-Instruct
- Gemma-2-2B-IT
All models are instruct versions supporting chat templates and are compatible with all steering methods in our framework.
We evaluate on 17 datasets across 7 perspectives. See our Hugging Face dataset or paper for detailed descriptions.
- Harmfulness: SALADBench (refusal behavior)
- Bias: BBQ (implicit), ToxiGen (explicit)
- Hallucination: FaithEval (intrinsic), PreciseWikiQA (extrinsic)
- Social: Sycophancy, Anthropomorphism, Brand Bias, User Retention
- Reasoning: Expert-level (GPQA), Commonsense (ARC-C)
- Epistemic: Factual Misconceptions (TruthfulQA), Sneaking
- Normative: Commonsense Morality, Political Views
All of our steering experiments target harmfulness, bias, and hallucination, but steered models are evaluated on all of the other perspectives as well.
For multiple-choice datasets, we support both substring matching and likelihood-based evaluation:
- Default: Both methods run simultaneously (`mc_evaluation_method: "both"`)
- Substring: Pattern matching in model output (to view entanglement's effects on instruction-following)
- Likelihood: Compare log probabilities of answer tokens (to view how behavior shifts internally)
When both methods run, individual results are saved as "Accuracy (substring)" and "Accuracy (likelihood)" in metrics.yaml. The avg_metric field uses substring by default (configurable via preferred_avg_metric: "substring" or "likelihood" in the inference section of concept config files):
# configs/concepts/my_concept.yaml or configs/secondary_concepts/my_concept.yaml
inference:
preferred_avg_metric: "likelihood" # or "substring" (default)
max_new_tokens: 100
temperature: 0.0
We decompose training-free steering methods into three phases:
Extract steering vectors from training data:
- Methods: `DiffInMeans`, `PCA`, `LAT`
- Formats: `SteeringFormat.DEFAULT`, `SteeringFormat.REPE`, `SteeringFormat.CAA`
Choose optimal layer and hyperparameters:
- Grid Search: Exhaustive search across the desired layers based on val score
- COSMIC: Efficient cosine similarity-based selection without full generation
Apply steering during inference (see the sketch after this list):
- Activation Addition: Add scaled direction to activations
- Directional Ablation: Remove projection along direction (with optional affine transformation)
- Locations: Where in the model to apply steering (same layer as generation, all layers, cumulative across layers, etc.)
- Positions: `ALL_TOKENS`, `POST_INSTRUCTION`, `OUTPUT_ONLY`
- Conditional (CAST): Apply only when activation similarity exceeds a threshold
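As a rough mathematical sketch of these operators (notation ours, not taken from the repo): let $h$ be a hidden activation, $\hat{d}$ the unit steering direction, $\alpha$ a scale, and $r$ an optional reference activation. Activation addition computes $h' = h + \alpha \hat{d}$; directional ablation computes $h' = h - \hat{d}\hat{d}^\top h$ (or $h' = h - \hat{d}\hat{d}^\top (h - r)$ in the affine variant); conditional (CAST) steering applies one of these edits only when the similarity between $h$ and a condition direction exceeds a threshold.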
We implement 5 methods from the literature, each with 3 variants for different effectiveness/entanglement tradeoffs:
All configurations can be found in the configs/ directory with variants: {method}.yaml, {method}_nokl.yaml, {method}_conditional.yaml
| Method | Components | Paper | Implementation Notes |
|---|---|---|---|
| DIM | DiffInMeans + Directional Ablation | Arditi et al. + COSMIC | Original refusal steering method |
| ACE | DiffInMeans + Directional Ablation (affine) | Marshall et al. + COSMIC | Adds reference projection |
| CAA | DiffInMeans + Activation Addition (MC format) | Panickssery et al. | Uses multiple-choice format |
| PCA | PCA + Activation Addition | Zou et al. (RepE) + CAST + AxBench | Principal component analysis |
| LAT | LAT + Activation Addition (cumulative) | Zou et al. (RepE) + AxBench | Linear artificial tomography |
Importantly, the above 5 methods are not exhaustive. Our modular framework allows easy creation of new methods by combining different components!
For example, to create a new method using LAT with CAA format, COSMIC selection, Directional Ablation application, and Conditional steering (CAST), with different layer and component choices than those used in the paper, simply create a new YAML config:
# configs/custom.yaml - LAT + CAA format + COSMIC + Directional Ablation + Conditional
# Override dataset formatting to use CAA templates with LAT:
train_data:
pos:
params:
format: SteeringFormat.CAA # LAT with CAA format
neg:
params:
format: SteeringFormat.CAA
neutral: null
# Phase 1: Direction Generation
direction_generation:
generator:
class: direction_generation.linear.LAT
params: {}
param_grid:
# Change for every middle layer and attn output component
layer_pct_start: [0.3]
layer_pct_end: [0.7]
layer_step: [1]
component: ['attn']
attr: ['output']
pos: [-1]
...
# Phase 2: Direction Selection
direction_selection:
class: direction_selection.cosmic.COSMIC
params:
application_locations: []
include_generation_loc: true
generation_pos: POST_INSTRUCTION # Targeted application
use_kl_divergence_check: false
...
# Phase 3: Direction Application
direction_application:
class: direction_application.unconditional.DirectionalAblation
params:
use_affine: false # Pure directional ablation
...
# Enable conditional steering
conditional:
enabled: true
condition_selection:
class: direction_selection.grid_search.ConditionalGridSearchSelector
params:
condition_thresholds: "auto"
condition_comparators: ["greater"]
...
We also welcome contributions of new datasets, models, and components to further expand what can be evaluated.
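Assuming the -M flag resolves to a config file name in configs/ (as the {method}.yaml naming above suggests; verify this against the run scripts), the custom method could then be invoked like any built-in one:
python scripts/run/run_full_pipeline.py -m qwen25-7b -c explicit_bias -M custom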
SteeringSafety/
├── configs/                      # Experiment configurations
│   ├── {method}.yaml             # Base configurations
│   ├── {method}_nokl.yaml        # No KL divergence check
│   └── {method}_conditional.yaml # With CAST
├── data/                         # Dataset loaders
│   ├── steering_data.py          # Main data interface
│   ├── refusal.py                # Harmfulness datasets
│   ├── bias.py                   # Bias datasets
│   ├── hallucination.py          # Hallucination datasets
│   └── secondary_datasets.py     # Entanglement evaluation
├── direction_generation/         # Phase 1 components
│   ├── base.py
│   └── linear.py                 # DiffInMeans, PCA, LAT
├── direction_selection/          # Phase 2 components
│   ├── base.py
│   ├── grid_search.py
│   └── cosmic.py
├── direction_application/        # Phase 3 components
│   ├── base.py
│   ├── unconditional.py          # Standard steering
│   └── conditional.py            # CAST implementation
├── utils/                        # Utilities
│   ├── intervention_llm.py       # Model steering code
│   ├── steering_utils.py         # Helper functions
│   └── enums.py                  # Configuration enums
└── scripts/
    ├── run/                      # Experiment scripts
    └── analysis/                 # Evaluation tools
Our evaluation reveals several critical insights about current steering methods:
- Method effectiveness varies significantly: DIM and ACE work best for reducing harmfulness and bias, while PCA and LAT show promise for hallucination reduction, but success depends heavily on the specific method-model-perspective combination.
- Entanglement affects different capabilities unevenly: Social behaviors (like sycophancy and user retention) and normative judgments are most vulnerable to unintended changes during steering, while reasoning capabilities remain relatively stable.
- Counterintuitive cross-perspective effects emerge: Jailbreaking doesn't necessarily increase toxicity, hallucination steering causes opposing political shifts in different models, and improving one type of bias can degrade another, showing complex interdependencies between safety perspectives.
- Conditional steering improves tradeoffs: Applying steering selectively (conditional steering) achieves effectiveness comparable to the best settings while significantly reducing entanglement for harmfulness and hallucination, though it performs poorly for bias steering.
- Findings generalize across model scales: The relative performance rankings of steering methods and the entanglement patterns can remain consistent across models of different sizes, suggesting insights from smaller models can inform steering larger models.
This represents a major open challenge in AI safety: developing steering methods that can precisely target specific perspectives without changing performance on other perspectives. We hope this benchmark will accelerate progress toward more controllable and safer steering methods, and in the future, more generally towards safer AI systems.
The SteeringSafety framework code is released under the MIT License.
This benchmark incorporates multiple existing datasets, each with their own licensing terms. For some datasets (e.g., HalluLens), we also utilize their evaluation code and metrics. Users must respect the individual licenses of constituent datasets:
| Dataset | License | Source |
|---|---|---|
| ARC-C | CC-BY-SA-4.0 | AllenAI |
| Alpaca | CC-BY-NC-4.0 | Stanford |
| BBQ | CC-BY-4.0 | NYU-MLL |
| CMTest | CC-BY-SA-4.0 | AI-Secure |
| DarkBench | MIT | Apart Research |
| FaithEval | See source* | Salesforce |
| GPQA | CC-BY-4.0 | Rein et al. |
| HalluLens | CC-BY-NC** | Meta |
| SALADBench | Apache-2.0 | OpenSafetyLab |
| ToxiGen | See source* | Microsoft |
| TruthfulQA | See source* | Lin et al. |
| TwinViews | CC-BY-4.0 | Fulay et al. |
Datasets marked with an asterisk appear to have no explicit dataset license, but their associated codebases are licensed (Apache-2.0, MIT, etc.); please refer to the original sources for usage terms. HalluLens (**) is mostly CC-BY-NC but contains some components with other licenses.
We gratefully acknowledge the following for helpful resources and foundational work:
@misc{siu2025SteeringSafety,
title={SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs},
author={Vincent Siu and Nicholas Crispino and David Park and Nathan W. Henry and Zhun Wang and Yang Liu and Dawn Song and Chenguang Wang},
year={2025},
eprint={2509.13450},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2509.13450},
}