
RepIt: Steering Language Models with Concept-Specific Refusal Vectors

This repository contains the implementation of RepIt (Representing Isolated Targets), a framework for isolating concept-specific representations in language model activations to enable precise behavioral interventions. This work demonstrates how to selectively suppress refusal on targeted concepts while preserving refusal behavior elsewhere.

Overview

RepIt addresses a key limitation in current activation steering methods: interventions often have broader effects than desired. By isolating pure concept vectors, RepIt enables targeted interventions that can suppress refusal on specific topics (like weapons of mass destruction) while maintaining safety guardrails on other harmful content.

Key Features

  • Concept-Specific Isolation: Disentangles target concept representations from overlapping signals
  • Data Efficiency: Robust concept vectors can be extracted from as few as a dozen examples
  • Computational Efficiency: Corrective signal localizes to just 100-200 neurons
  • Benchmark Evasion: Enables targeted jailbreaks that evade standard safety evaluations

This repository is built on COSMIC: Generalized Refusal Direction Identification in LLM Activations (Siu et al., 2025), which in turn builds on the codebase accompanying Refusal in Language Models Is Mediated by a Single Direction (Arditi et al., 2024).

Installation

git clone https://github.com/wang-research-lab/RepIt.git
cd RepIt
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install --no-build-isolation mamba-ssm==2.2.5

export HF_TOKEN="..."  # needed to access Llama Guard 3

⚠️ The `mamba-ssm` package may fail to install on some systems. It is only required for running Nemotron models and can be skipped otherwise.

Quick Start

Run the RepIt pipeline on a target concept group:

# run from the repository root (RepIt/)
python3 -m pipeline.run_pipeline_repit --model_path "microsoft/Phi-4-mini-instruct" --target_group "wmd_bio"

Pipeline Options

  • --tailweight_ablation_study: Run the tailweight ablation experiments
  • --use_existing: Reuse existing artifacts from saved files where possible. If omitted, the pipeline reruns every stage and overwrites your existing results.
  • --target_size_study: Test data efficiency with limited training examples (warning: this can take a long time)

To enable an option, add the corresponding flag (e.g., --use_existing) to the command above.
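For example, to rerun on the same model and target group while reusing any saved artifacts and adding the tailweight ablation:

# same command as in Quick Start, with two optional flags enabled
python3 -m pipeline.run_pipeline_repit --model_path "microsoft/Phi-4-mini-instruct" --target_group "wmd_bio" --use_existing --tailweight_ablation_study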

Available Target Groups

⚠️ Due to concerns about misuse, we have not yet released the WMD prompts used in the original paper. The groups below are therefore non-functional: the repository does NOT contain the files required to run the pipeline on them.

  • wmd_bio: Biological weapons concepts
  • wmd_chem: Chemical weapons concepts
  • wmd_cyber: Cyber attack concepts

Method Overview

RepIt works through three main steps:

  1. Reweighting: Balance contributions from non-target vectors using inverse norm weighting
  2. Whitening: Address collinearity issues through ridge-regularized covariance matrices
  3. Orthogonalization: Project target vectors and subtract controlled fractions of non-target components

The method isolates concept-specific directions while preserving the essential signal for targeted interventions.
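A minimal NumPy sketch of these three steps is shown below. Function and parameter names are illustrative only; the actual implementation lives in pipeline/submodules/disentangle_vectors.py.

import numpy as np

def isolate_target_direction(target_vec, nontarget_vecs, ridge=1e-2, fraction=1.0):
    """Illustrative RepIt-style isolation of a concept-specific direction."""
    V = np.stack(nontarget_vecs)                      # (k, d) non-target difference vectors

    # 1. Reweighting: balance contributions with inverse-norm weighting.
    V = V / np.linalg.norm(V, axis=1, keepdims=True)

    # 2. Whitening: a ridge-regularized covariance handles collinearity among vectors.
    cov = V.T @ V / V.shape[0] + ridge * np.eye(V.shape[1])
    eigvals, eigvecs = np.linalg.eigh(cov)
    whiten = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T   # inverse matrix square root

    # 3. Orthogonalization: project the (whitened) target vector and subtract a
    #    controlled fraction of its component in the non-target subspace.
    Q, _ = np.linalg.qr((V @ whiten).T)               # orthonormal basis of the non-target span
    t = target_vec @ whiten
    t = t - fraction * (Q @ (Q.T @ t))
    return t / np.linalg.norm(t)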

Repository Structure

dataset/
├── processed                        # inherited from prior versions of the script
├── raw                              # inherited from prior versions of the script
├── splits                           # functional JSON splits loaded by the main pipeline
└── (miscellaneous splitting and loading logic)
pipeline/
├── run_pipeline_RepIt.py            # Main pipeline script
├── submodules/
│   ├── disentangle_vectors.py       # Core RepIt implementation
│   ├── generate_directions_RepIt.py # Direction generation
│   ├── select_direction_cosmic.py   # Direction selection via COSMIC
│   ├── evaluate_jailbreak.py        # Safety evaluation methods
│   └── completions_helper.py        # Generation utilities
├── utils/
│   └── hook_utils.py                # Model intervention hooks
├── model_utils/                     # Model loading scripts (inherited from Arditi et al.)
└── config.py                        # Configuration settings
plotting/                            # Scripts to recreate the paper's figures, given results in pipeline/runs
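To illustrate what the intervention hooks do, here is a hypothetical directional-ablation forward hook (not the actual code in hook_utils.py), assuming a HuggingFace-style decoder layer whose forward output is a tuple with the hidden states first:

import torch

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that removes the component along `direction`."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        d = direction.to(hidden)
        d = d / d.norm()
        # Project out the direction from every position's activation.
        hidden = hidden - (hidden @ d).unsqueeze(-1) * d
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# Usage sketch (assuming `model` is a HuggingFace causal LM and `vec` the isolated direction):
# handle = model.model.layers[12].register_forward_hook(make_ablation_hook(vec))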

Results

RepIt demonstrates:

  • Precise Control: Target attack success rate (ASR) of 0.4-0.7 while non-target ASR stays around 0.1
  • Benchmark Evasion: Models appear safe on standard benchmarks while retaining harmful capabilities
  • Efficiency: Works with minimal data and compute resources
  • Generalization: Consistent results across multiple model architectures

Safety Considerations

This research reveals significant safety implications:

  • Covert Capabilities: Models can retain harmful knowledge while appearing safe
  • Evaluation Gaps: Current safety benchmarks may miss targeted vulnerabilities
  • Low Barrier: Attacks require minimal resources, making them accessible

Citation

If you use this work in your research, please cite:

@misc{siu2025repitrepresentingisolatedtargets,
      title={RepIt: Steering Language Models with Concept-Specific Refusal Vectors}, 
      author={Vincent Siu and Nathan W. Henry and Nicholas Crispino and Yang Liu and Dawn Song and Chenguang Wang},
      year={2025},
      eprint={2509.13281},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.13281}, 
}

License

This code is provided for research purposes. Please use responsibly and consider the ethical implications of targeted model modifications.

Contact

For questions about this research, please open an issue in this repository.


Warning: This research demonstrates techniques that could be misused for harmful purposes. Please use responsibly and consider the broader implications of targeted AI safety circumvention.
