This repository implements Position-Aware Edge Attribution Patching (PEAP).
Arxiv: https://arxiv.org/abs/2502.04577
PEAP extends Edge Attribution Patching (EAP) by incorporating positional edges, enabling researchers to understand how components at different token positions interact with each other. The method computes attribution scores for both:
- Non-crossing edges: Connections within the same token position (such as attention head -> MLP, MLP -> attention head, embedding -> MLP, and others)
- Crossing edges: Connections between attention heads at different token positions
- Position-aware analysis: Segment input sequences into meaningful spans and analyze interactions between them
- Position-aware edge attribution: Compute attribution scores for many types of connections in transformer models
- Circuit discovery: Automatically discover circuits of varying sizes
- Faithfulness evaluation: Evaluate how faithfully the circuits preserve model behavior using ablation studies.
- Multiple tasks supported: Includes implementations for Indirect Object Identification (IOI), WinoBias, and Greater-Than comparison tasks
```
src/
├── pos_aware_edge_attribution_patching.py  # Core PEAP implementation
├── eval_utils.py                           # Circuit discovery and evaluation utilities
├── eval.py                                 # Full evaluation pipeline
├── exp.py                                  # Experiment classes for different tasks
├── data_generation.py                      # Dataset generation for supported tasks
├── input_attribution.py                    # Input attribution analysis methods (for finding the schema)
├── schema_generation.py                    # Automatic span schema generation using LLMs
└── environment.yml                         # Conda environment specification
```
- Computes position-aware attribution scores using gradient-based methods
- Handles both counterfactual and mean ablation strategies
- Supports multiple aggregation methods (sum, average, max absolute value) to handle spans of varying lengths
- Implements algorithms to find circuits of specified sizes
- Supports threshold-based and top-k circuit discovery
- Provides both forward (logits→embeddings) and reverse (embeddings→logits) search
- Evaluates discovered circuits through mean ablation
- Computes faithfulness metrics and accuracy preservation
- Generates comprehensive evaluation reports
- Creates datasets for IOI (ABBA/BABA patterns), WinoBias, and Greater-Than tasks
- Includes automatic model evaluation on generated datasets
- Automatically generates span schemas using large language models
- Supports multiple LLM backends (GPT-4, Claude, Llama)
- Includes input attribution methods to identify important tokens
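The threshold-based and top-k circuit discovery mentioned above can be illustrated with a minimal sketch. The function names, the edge labels, and the `dict`-of-scores representation are assumptions for illustration, not the repository's actual API:

```python
# Minimal sketch (hypothetical names): selecting a circuit from
# edge attribution scores, either by top-k or by a threshold.
def top_k_circuit(scores, k):
    """Return the k edges with the largest absolute attribution scores."""
    return sorted(scores, key=lambda e: abs(scores[e]), reverse=True)[:k]

def threshold_circuit(scores, tau):
    """Return every edge whose absolute attribution score exceeds tau."""
    return [e for e, s in scores.items() if abs(s) > tau]

# Toy scores; real PEAP scores are computed per (edge, span) pair.
scores = {"emb->mlp0": 0.9, "head2.3->mlp5": -0.7, "mlp5->logits": 0.1}
print(top_k_circuit(scores, 2))        # two highest-|score| edges
print(threshold_circuit(scores, 0.5))  # edges with |score| > 0.5
```

Note that both selection rules use the absolute value of the score, so strongly negative edges are kept as well.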
- Create a customized `Experiment` object, as defined in `Experiment.py`.
- Prepare a DataFrame that follows these guidelines:
  - It must include a `"prompt"` column.
  - It should have one column per span, where each column contains the index of the first token in that span.
  - The end of span t is one index before the starting index of span t+1, so every token is included in exactly one span.
  - Empty spans are allowed: just set the start index equal to the start index of the next span.
  - The DataFrame must also include a `"length"` column indicating the total number of tokens in the prompt. This helps handle prompts of varying lengths and also serves as the boundary for the final span, ensuring it includes all remaining tokens.
  - A Beginning-of-Sequence (BOS) token is automatically added when running the pipeline:
    - Do not include it in the `"prompt"` text.
    - However, make sure to account for it when setting span indices. For example, in the prompt `"I love you"`, the token `"I"` should have index `1` in the DataFrame, since the BOS token will be added at position `0`.
For the prompt "I love you" (which is tokenized as ["I", "love", "you"] and becomes ["<BOS>", "I", "love", "you"]), the DataFrame might look like this:
| prompt | span_0 | span_1 | span_2 | length |
|---|---|---|---|---|
| I love you | 1 | 2 | 3 | 4 |
- `span_0` starts at index 1 (token `"I"`).
- `span_1` starts at index 2 (token `"love"`), so `span_0` includes only `"I"`.
- `span_2` starts at index 3 (token `"you"`), so `span_1` includes only `"love"`.
- Since `length = 4` (because the BOS token will be added), `span_2` includes all tokens from index 3 up to (but not including) index 4: just `"you"`.
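The table above can be built directly with pandas; this is a minimal sketch using the column names from the example (the span column names are task-specific, not fixed by the library):

```python
import pandas as pd

# Span-schema DataFrame for the prompt "I love you".
# Token indices start at 1 because a BOS token is prepended at position 0.
df = pd.DataFrame([
    {"prompt": "I love you", "span_0": 1, "span_1": 2, "span_2": 3, "length": 4},
])
# Span t covers token indices [span_t, start of span t+1);
# "length" is the exclusive upper bound of the final span.
print(df)
```

Each additional prompt becomes one more row, with its own span start indices and total token count.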
```shell
conda env create -f src/environment.yml
conda activate peap
```

```shell
python src/data_generation.py --model_name gpt2 --save_dir ./data --task ioi_baba --seed 42
```
```shell
python src/pos_aware_edge_attribution_patching.py \
    -e ioi -m gpt2 -cl data/gpt2/ioi_ABBA/human_baseline/IOI_data_clean.csv -co data/gpt2/ioi_ABBA/human_baseline/IOI_data_counter_abc.csv \
    -sp prefix IO and S1 S1+1 action1 S2 action2 to length -ds 10 -p ioi_results.pkl
```

Make sure to add "length" as the last span.
```shell
python src/eval.py \
    -e ioi -m gpt2 -cl data/gpt2/ioi_ABBA/human_baseline/IOI_data_clean.csv -co data/gpt2/ioi_ABBA/human_baseline/IOI_data_counter_abc.csv \
    -sp prefix IO and S1 S1+1 action1 S2 action2 to length -n 10 -tk 100 200 300 \
    -p ioi_results.pkl -sp results.pkl
```

PEAP segments input sequences into meaningful spans based on:
- Syntactic structure (subjects, objects, verbs)
- Semantic roles (professions, names, actions)
- Task-specific elements (important tokens identified through attribution)
PEAP computes attribution scores for both:
- Non-crossing edges: Direct connections within spans
- Crossing edges: Attention-mediated connections between different spans through query-key-value interactions
Using the computed attribution scores, circuits are discovered through:
- Top-k selection: Select k highest-scoring edges
Discovered circuits are evaluated by:
- Mean ablation: Replace non-circuit components with mean activations
- Performance preservation: Measure how well the circuit maintains original model behavior
- Size-performance tradeoffs: Analyze circuit efficiency across different sizes
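The mean-ablation step above can be sketched in a few lines. This is an illustrative sketch, not the repository's implementation: activations are represented as a `(batch, n_components)` array, and components outside the circuit are replaced with their mean over a reference batch:

```python
import numpy as np

# Hypothetical sketch of mean ablation: keep circuit components,
# replace everything else with its batch-mean activation.
def mean_ablate(acts, in_circuit):
    """acts: (batch, n_components) activations; in_circuit: boolean mask per component."""
    means = acts.mean(axis=0)                     # per-component mean over the batch
    return np.where(in_circuit, acts, means)      # mask broadcasts across the batch

acts = np.array([[1.0, 2.0],
                 [3.0, 4.0]])
mask = np.array([True, False])  # component 0 is in the circuit, component 1 is ablated
print(mean_ablate(acts, mask))  # column 1 collapses to its mean, 3.0
```

Faithfulness is then measured by comparing the model's outputs under this ablation to its outputs on the unablated forward pass.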