This repository implements Position-Aware Edge Attribution Patching (PEAP).
Arxiv: https://arxiv.org/abs/2502.04577
PEAP extends Edge Attribution Patching (EAP) by incorporating positional edges, enabling researchers to understand how components at different token positions interact with each other. The method computes attribution scores for both:
- Non-crossing edges: Connections within the same token position (such as attention head -> MLP, MLP -> attention head, embedding -> MLP, and others)
- Crossing edges: Connections between attention heads at different token positions
- Position-aware analysis: Segment input sequences into meaningful spans and analyze interactions between them
- Position-aware edge attribution: Compute attribution scores for many types of connections in transformer models
- Circuit discovery: Automatically discover circuits of varying sizes
- Faithfulness evaluation: Evaluate how faithfully the circuits preserve model behavior using ablation studies.
- Multiple tasks supported: Includes implementations for Indirect Object Identification (IOI), WinoBias, and Greater-Than comparison tasks
```
src/
├── pos_aware_edge_attribution_patching.py  # Core PEAP implementation
├── eval_utils.py                           # Circuit discovery and evaluation utilities
├── eval.py                                 # Full evaluation pipeline
├── exp.py                                  # Experiment classes for different tasks
├── data_generation.py                      # Dataset generation for supported tasks
├── input_attribution.py                    # Input attribution analysis methods (for finding the schema)
├── schema_generation.py                    # Automatic span schema generation using LLMs
└── environment.yml                         # Conda environment specification
```
- Computes position-aware attribution scores using gradient-based methods
- Handles both counterfactual and mean ablation strategies
- Supports multiple aggregation methods (sum, average, max absolute value) to handle spans of varying lengths
- Implements algorithms to find circuits of specified sizes
- Supports threshold-based and top-k circuit discovery
- Provides both forward (logits→embeddings) and reverse (embeddings→logits) search
- Evaluates discovered circuits through mean ablation
- Computes faithfulness metrics and accuracy preservation
- Generates comprehensive evaluation reports
- Creates datasets for IOI (ABBA/BABA patterns), WinoBias, and Greater-Than tasks
- Includes automatic model evaluation on generated datasets
- Automatically generates span schemas using large language models
- Supports multiple LLM backends (GPT-4, Claude, Llama)
- Includes input attribution methods to identify important tokens
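The threshold-based and top-k circuit discovery mentioned above can be illustrated with a minimal sketch. The function names, the edge labels, and the `dict`-of-scores representation are assumptions for illustration, not the repository's actual API:

```python
# Minimal sketch (hypothetical names): selecting a circuit from
# edge attribution scores, either by top-k or by a threshold.
def top_k_circuit(scores, k):
    """Return the k edges with the largest absolute attribution scores."""
    return sorted(scores, key=lambda e: abs(scores[e]), reverse=True)[:k]

def threshold_circuit(scores, tau):
    """Return every edge whose absolute attribution score exceeds tau."""
    return [e for e, s in scores.items() if abs(s) > tau]

# Toy scores; real PEAP scores are computed per (edge, span) pair.
scores = {"emb->mlp0": 0.9, "head2.3->mlp5": -0.7, "mlp5->logits": 0.1}
print(top_k_circuit(scores, 2))        # two highest-|score| edges
print(threshold_circuit(scores, 0.5))  # edges with |score| > 0.5
```

Note that both selection rules use the absolute value of the score, so strongly negative edges are kept as well.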
- Create a customized `Experiment` object, as defined in `Experiment.py`.
- Prepare a DataFrame that follows these guidelines:
  - It must include a `"prompt"` column.
  - It should have one column per span, where each column contains the index of the first token in that span.
  - The end of span t is one index before the starting index of span t+1, so every token is included in exactly one span.
  - Empty spans are allowed: just set the start index equal to the start index of the next span.
  - The DataFrame must also include a `"length"` column indicating the total number of tokens in the prompt. This helps handle prompts of varying lengths and also serves as the boundary for the final span, ensuring it includes all remaining tokens.
  - A Beginning-of-Sequence (BOS) token is automatically added when running the pipeline:
    - Do not include it in the `"prompt"` text.
    - However, make sure to account for it when setting span indices. For example, in the prompt `"I love you"`, the token `"I"` should have index `1` in the DataFrame, since the BOS token will be added at position `0`.
For the prompt "I love you" (which is tokenized as ["I", "love", "you"] and becomes ["<BOS>", "I", "love", "you"]), the DataFrame might look like this:
| prompt | span_0 | span_1 | span_2 | length |
|---|---|---|---|---|
| I love you | 1 | 2 | 3 | 4 |
- `span_0` starts at index 1 (token `"I"`).
- `span_1` starts at index 2 (token `"love"`), so `span_0` includes only `"I"`.
- `span_2` starts at index 3 (token `"you"`), so `span_1` includes only `"love"`.
- Since `length = 4` (because the BOS token will be added), `span_2` includes all tokens from index 3 up to (but not including) index 4: just `"you"`.
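The table above can be built directly with pandas; this is a minimal sketch using the column names from the example (the span column names are task-specific, not fixed by the library):

```python
import pandas as pd

# Span-schema DataFrame for the prompt "I love you".
# Token indices start at 1 because a BOS token is prepended at position 0.
df = pd.DataFrame([
    {"prompt": "I love you", "span_0": 1, "span_1": 2, "span_2": 3, "length": 4},
])
# Span t covers token indices [span_t, start of span t+1);
# "length" is the exclusive upper bound of the final span.
print(df)
```

Each additional prompt becomes one more row, with its own span start indices and total token count.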
```shell
conda env create -f src/environment.yml
conda activate peap
```

```shell
python src/data_generation.py --model_name gpt2 --save_dir ./data --task ioi_baba --seed 42
```
```shell
python src/pos_aware_edge_attribution_patching.py \
    -e ioi -m gpt2 -cl data/gpt2/ioi_ABBA/human_baseline/IOI_data_clean.csv -co data/gpt2/ioi_ABBA/human_baseline/IOI_data_counter_abc.csv \
    -sp prefix IO and S1 S1+1 action1 S2 action2 to length -ds 10 -p ioi_results.pkl
```

Make sure to add "length" as the last span.
```shell
python src/eval.py \
    -e ioi -m gpt2 -cl data/gpt2/ioi_ABBA/human_baseline/IOI_data_clean.csv -co data/gpt2/ioi_ABBA/human_baseline/IOI_data_counter_abc.csv \
    -sp prefix IO and S1 S1+1 action1 S2 action2 to length -n 10 -tk 100 200 300 \
    -p ioi_results.pkl -sp results.pkl
```

PEAP segments input sequences into meaningful spans based on:
- Syntactic structure (subjects, objects, verbs)
- Semantic roles (professions, names, actions)
- Task-specific elements (important tokens identified through attribution)
PEAP computes attribution scores for both:
- Non-crossing edges: Direct connections within spans
- Crossing edges: Attention-mediated connections between different spans through query-key-value interactions
Using the computed attribution scores, circuits are discovered through:
- Top-k selection: Select k highest-scoring edges
Discovered circuits are evaluated by:
- Mean ablation: Replace non-circuit components with mean activations
- Performance preservation: Measure how well the circuit maintains original model behavior
- Size-performance tradeoffs: Analyze circuit efficiency across different sizes
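The mean-ablation step above can be sketched in a few lines. This is an illustrative sketch, not the repository's implementation: activations are represented as a `(batch, n_components)` array, and components outside the circuit are replaced with their mean over a reference batch:

```python
import numpy as np

# Hypothetical sketch of mean ablation: keep circuit components,
# replace everything else with its batch-mean activation.
def mean_ablate(acts, in_circuit):
    """acts: (batch, n_components) activations; in_circuit: boolean mask per component."""
    means = acts.mean(axis=0)                     # per-component mean over the batch
    return np.where(in_circuit, acts, means)      # mask broadcasts across the batch

acts = np.array([[1.0, 2.0],
                 [3.0, 4.0]])
mask = np.array([True, False])  # component 0 is in the circuit, component 1 is ablated
print(mean_ablate(acts, mask))  # column 1 collapses to its mean, 3.0
```

Faithfulness is then measured by comparing the model's outputs under this ablation to its outputs on the unablated forward pass.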