# Sparsity as a Variance Regulator for Improved Depth Utilization in Language Models
## Overview

This repository contains the official implementation of our research on how sparsity mechanisms mitigate the curse of depth in large language models. We show that, beyond its conventional role in computational efficiency, sparsity acts as a variance regulator that improves depth utilization.
Our investigation covers:
- Implicit Sparsity: Weight sparsity induced by weight decay, attention sparsity from long contexts
- Explicit Sparsity: Key/value-sharing in Grouped-Query Attention (GQA), expert activation in Mixture-of-Experts (MoE)
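To make the explicit K/V-sharing sparsity of GQA concrete, here is a minimal sketch (our own illustration, not this repository's implementation; function name and tensor shapes are our choices): several query heads attend against a smaller, shared set of key/value heads.

```python
import torch

def grouped_query_attention(q, k, v):
    """Minimal GQA sketch: many query heads share a smaller set of
    K/V heads. q: [batch, n_q_heads, seq, dim];
    k, v: [batch, n_kv_heads, seq, dim] with n_kv_heads < n_q_heads."""
    n_q, n_kv = q.shape[1], k.shape[1]
    # Broadcast each shared K/V head to its group of query heads
    k = k.repeat_interleave(n_q // n_kv, dim=1)
    v = v.repeat_interleave(n_q // n_kv, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

out = grouped_query_attention(torch.randn(2, 8, 4, 16),   # 8 query heads
                              torch.randn(2, 2, 4, 16),   # 2 shared K heads
                              torch.randn(2, 2, 4, 16))   # 2 shared V heads
# out has shape [2, 8, 4, 16]
```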
Through controlled depth-scaling experiments, we demonstrate that sparsity consistently reduces output variance and promotes functional differentiation across layers.
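The variance effect can be illustrated with a toy residual stack (a minimal sketch of the underlying phenomenon, not our training setup): with nothing regulating the residual updates, the variance of the residual stream compounds with depth.

```python
import torch

torch.manual_seed(0)

def residual_stream_variances(depth=12, width=64, n_tokens=128):
    """Variance of the residual stream after each layer of a toy
    residual stack x <- x + f(x) with random linear blocks.
    With no mechanism regulating the update magnitude, variance
    accumulates layer by layer -- the effect sparsity counteracts."""
    x = torch.randn(n_tokens, width)
    variances = []
    for _ in range(depth):
        block = torch.nn.Linear(width, width, bias=False)
        with torch.no_grad():
            x = x + block(x)
        variances.append(x.var().item())
    return variances

vs = residual_stream_variances()
# vs[-1] is substantially larger than vs[0]: deeper layers see
# ever-larger activations unless something regulates the updates.
```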
## News

- [2026-03] Our paper is released on arXiv.
- [2026-02] Our paper is released here.
- [2026-02] Initial codebase release with training and evaluation scripts.
## Table of Contents

- Overview
- News
- Installation
- Getting Started
- Training
- Evaluation
- Analysis Tools
- Acknowledgments
- License
## Installation

Requirements:

- Python 3.10 or higher
- CUDA-capable GPU(s) for training (recommended)
- At least 50GB disk space for datasets and checkpoints
```bash
# Create virtual environment
python -m venv sparsity_cod
source sparsity_cod/bin/activate

# Install PyTorch
pip install torch==2.7.1

# Install the main package
git clone <repository-url>
cd SparsityAndCoD
pip install -e .

# Install optional dependencies
pip install flash-attn --no-build-isolation
pip install datasets matplotlib scikit-learn torchmetrics wandb

# Install evaluation harness (optional)
pip install -e lm-evaluation-harness
```

## Getting Started

We provide a helper script for tokenizing HuggingFace datasets into the OLMo-compatible format:
Script: `scripts/tokenize_from_dataset.py`

```bash
python scripts/tokenize_from_dataset.py \
    --dataset_name HuggingFaceFW/fineweb-edu \
    --dataset_config sample_100BT \
    --split "train" \
    --max_tokens 10_000_000_000 \
    --output_dir ./tokenized_data \
    --tokenizer_identifier ./tokenizer/tokenizer.json
```

Note: You can download the GPT-NeoX tokenizer from the OLMo HuggingFace repository.
## Training

We provide pre-configured YAML files for the model architectures used in our experiments:

| Configuration | Description | Parameters |
|---|---|---|
| `adamw-1B.yaml` | Primary 1B-parameter dense model | 1B |
| `adamw-400M-momentum.yaml` | Model for higher-order momentum analysis | 400M |
| `olmoe-1B-7B.yaml` | MoE with 1B activated / 7B total parameters | 1B/7B |
| `olmoe-400M-2B.yaml` | MoE with 400M activated / 2B total parameters | 400M/2B |
| `olmoe-1B-7B-ablation.yaml` | Ablation study configuration | 1B/7B |
We provide an easy-to-use training script at `scripts/train_script.sh`:

```bash
bash scripts/train_script.sh <config> <batch_size> <global_batch> <lr> <gpus> <port> <suffix>
```

Arguments:
| Argument | Description | Example |
|---|---|---|
| `config` | Configuration file name (without `.yaml`) | `adamw-1B` |
| `batch_size` | Per-device batch size | 16 |
| `global_batch` | Global batch size | 256 |
| `lr` | Learning rate | 1e-3 |
| `gpus` | Visible GPU IDs | 0,1,2,3,4,5,6,7 |
| `port` | Master port (optional) | 29500 |
| `suffix` | Run suffix for logging (optional) | 1B-Training |
Example:

```bash
bash scripts/train_script.sh adamw-1B 16 256 1e-3 0,1,2,3,4,5,6,7 29500 1B-Training
```

Key training parameters you may want to customize:
```bash
--model.max_sequence_length=1024          # Training sequence length
--optimizer.weight_decay=0.1              # Weight decay coefficient
--data.dir=./tokenized_data               # Path to tokenized dataset
--tokenizer.identifier=./tokenizer/tokenizer.json
--save_folder="./runs/${config}-${suffix}"
--max_duration=5e9T                       # Total training tokens
--wandb.project="YourProject"             # Weights & Biases project
--layerwise_statis_collect_interval=1     # Variance collection frequency
```

## Evaluation

Convert OLMo checkpoints to HuggingFace format for evaluation:
Dense Models:

```bash
python scripts/convert_tools/convert_olmo_hf.py \
    --input_dir /path/to/olmo/checkpoint \
    --tokenizer_json_path /path/to/tokenizer.json \
    --output_dir /path/to/output
```

MoE Models:
```bash
python scripts/convert_tools/convert_olmo_moe_hf.py \
    --input_dir /path/to/olmo/moe/checkpoint \
    --tokenizer_json_path /path/to/tokenizer.json \
    --output_dir /path/to/output
```

Using the lm-evaluation-harness (make sure you installed the local version):
```bash
python -m lm_eval \
    --model hf \
    --model_args pretrained=/path/to/hf/model \
    --tasks mmlu,hellaswag,arc_challenge \
    --batch_size auto \
    --output_path ./eval_results
```

## Analysis Tools

We provide several analysis scripts for investigating sparsity patterns and layer importance:
| Script | Purpose | Usage |
|---|---|---|
| `analyze_weight_sparsity.py` | Analyze weight magnitude distributions | For studying implicit weight sparsity |
| `analyze_attention_sparsity.py` | Analyze attention pattern sparsity | For studying implicit sequence-wise sparsity |
| `analyze_weight_decay_sparsity.py` | Compare sparsity across weight decay values | For ablation studies |
These scripts compute metrics for analyzing layer importance and functional differentiation, as described in our paper:

| Script | Purpose | Key Features |
|---|---|---|
| `compute_jacobian.py` | Compute Jacobian matrices for residual identity-mapping analysis | Measures how much each layer transforms its input via Jacobian analysis. Computes deviation from the identity mapping, off-diagonal norms, diagonal statistics, Frobenius norm, and spectral norm. Supports both row-wise and element-wise computation methods. |
| `compute_usefulness.py` | Compute layer usefulness via linear approximation | Replaces each layer with a linear mapping (`W*x + b`) fitted via least squares, then measures the loss increase. Computes a global usefulness score as the fraction of layers causing a >10% loss increase. |
| `compute_layer_score.py` | Comprehensive layer scoring with multiple metrics | Causal Score (`--compute_causal`): measures the causal effect of skipping a layer on later layers. Permutation Score (`--compute_permutation`): measures layer independence by swapping layer weights and computing the normalized loss change. |
Example usage:

```bash
# Compute Jacobian matrices for residual identity analysis
python scripts/score_computation/compute_jacobian.py \
    --model_path /path/to/model \
    --output_dir ./jacobian_results \
    --num_samples 50 \
    --seq_length 512 \
    --plot

# Compute layer usefulness via linear approximation
python scripts/score_computation/compute_usefulness.py \
    --model_path /path/to/model \
    --output_dir ./usefulness_results \
    --num_samples 100 \
    --seq_length 512

# Compute causal effects between layers
python scripts/score_computation/compute_layer_score.py \
    --model_path /path/to/model \
    --compute_causal \
    --output_dir ./causal_results

# Compute permutation scores
python scripts/score_computation/compute_layer_score.py \
    --model_path /path/to/model \
    --compute_permutation \
    --output_dir ./permutation_results
```

## Acknowledgments

This project builds upon the excellent work of:
- OLMo - Open Language Model from the Allen Institute for AI
- OLMoE - Mixture-of-Experts extension to OLMo
- lm-evaluation-harness - Framework for evaluating language models
We are grateful to the maintainers and contributors of these open-source projects.
## License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you find this repository helpful, please give us a ⭐ on GitHub!
