# Sparsity as a Variance Regulator for Improved Depth Utilization in Language Models
## Overview

This repository contains the official implementation of our research on how sparsity mechanisms mitigate the curse of depth in large language models. We show that, beyond its conventional role in computational efficiency, sparsity acts as a variance regulator that improves depth utilization.
Our investigation covers:
- Implicit Sparsity: Weight sparsity induced by weight decay, attention sparsity from long contexts
- Explicit Sparsity: Key/value-sharing in Grouped-Query Attention (GQA), expert activation in Mixture-of-Experts (MoE)
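To make the explicit K/V-sharing sparsity of GQA concrete, here is a minimal sketch (our own illustration, not this repository's implementation; function name and tensor shapes are our choices): several query heads attend against a smaller, shared set of key/value heads.

```python
import torch

def grouped_query_attention(q, k, v):
    """Minimal GQA sketch: many query heads share a smaller set of
    K/V heads. q: [batch, n_q_heads, seq, dim];
    k, v: [batch, n_kv_heads, seq, dim] with n_kv_heads < n_q_heads."""
    n_q, n_kv = q.shape[1], k.shape[1]
    # Broadcast each shared K/V head to its group of query heads
    k = k.repeat_interleave(n_q // n_kv, dim=1)
    v = v.repeat_interleave(n_q // n_kv, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

out = grouped_query_attention(torch.randn(2, 8, 4, 16),   # 8 query heads
                              torch.randn(2, 2, 4, 16),   # 2 shared K heads
                              torch.randn(2, 2, 4, 16))   # 2 shared V heads
# out has shape [2, 8, 4, 16]
```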
Through controlled depth-scaling experiments, we demonstrate that sparsity consistently reduces output variance and promotes functional differentiation across layers.
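The variance effect can be illustrated with a toy residual stack (a minimal sketch of the underlying phenomenon, not our training setup): with nothing regulating the residual updates, the variance of the residual stream compounds with depth.

```python
import torch

torch.manual_seed(0)

def residual_stream_variances(depth=12, width=64, n_tokens=128):
    """Variance of the residual stream after each layer of a toy
    residual stack x <- x + f(x) with random linear blocks.
    With no mechanism regulating the update magnitude, variance
    accumulates layer by layer -- the effect sparsity counteracts."""
    x = torch.randn(n_tokens, width)
    variances = []
    for _ in range(depth):
        block = torch.nn.Linear(width, width, bias=False)
        with torch.no_grad():
            x = x + block(x)
        variances.append(x.var().item())
    return variances

vs = residual_stream_variances()
# vs[-1] is substantially larger than vs[0]: deeper layers see
# ever-larger activations unless something regulates the updates.
```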
## News

- [2026-03] Our paper is released on arXiv.
- [2026-02] Our paper is released here.
- [2026-02] Initial codebase release with training and evaluation scripts.
## Table of Contents

- Overview
- News
- Installation
- Getting Started
- Training
- Evaluation
- Analysis Tools
- Acknowledgments
- License
## Installation

Requirements:

- Python 3.10 or higher
- CUDA-capable GPU(s) for training (recommended)
- At least 50GB disk space for datasets and checkpoints
```bash
# Create virtual environment
python -m venv sparsity_cod
source sparsity_cod/bin/activate

# Install PyTorch
pip install torch==2.7.1

# Install the main package
git clone <repository-url>
cd SparsityAndCoD
pip install -e .

# Install optional dependencies
pip install flash-attn --no-build-isolation
pip install datasets matplotlib scikit-learn torchmetrics wandb

# Install evaluation harness (optional)
pip install -e lm-evaluation-harness
```

## Getting Started

We provide a helper script for tokenizing HuggingFace datasets into the OLMo-compatible format:
Script: `scripts/tokenize_from_dataset.py`

```bash
python scripts/tokenize_from_dataset.py \
    --dataset_name HuggingFaceFW/fineweb-edu \
    --dataset_config sample_100BT \
    --split "train" \
    --max_tokens 10_000_000_000 \
    --output_dir ./tokenized_data \
    --tokenizer_identifier ./tokenizer/tokenizer.json
```

Note: You can download the GPT-NeoX tokenizer from the OLMo HuggingFace repository.
## Training

We provide pre-configured YAML files for the model architectures used in our experiments:

| Configuration | Description | Parameters |
|---|---|---|
| `adamw-1B.yaml` | Primary 1B-parameter dense model | 1B |
| `adamw-400M-momentum.yaml` | Model for higher-order momentum analysis | 400M |
| `olmoe-1B-7B.yaml` | MoE with 1B activated / 7B total parameters | 1B/7B |
| `olmoe-400M-2B.yaml` | MoE with 400M activated / 2B total parameters | 400M/2B |
| `olmoe-1B-7B-ablation.yaml` | Ablation study configuration | 1B/7B |
We provide an easy-to-use training script at `scripts/train_script.sh`:

```bash
bash scripts/train_script.sh <config> <batch_size> <global_batch> <lr> <gpus> <port> <suffix>
```

Arguments:
| Argument | Description | Example |
|---|---|---|
| `config` | Configuration file name (without `.yaml`) | `adamw-1B` |
| `batch_size` | Per-device batch size | 16 |
| `global_batch` | Global batch size | 256 |
| `lr` | Learning rate | 1e-3 |
| `gpus` | Visible GPU IDs | 0,1,2,3,4,5,6,7 |
| `port` | Master port (optional) | 29500 |
| `suffix` | Run suffix for logging (optional) | 1B-Training |
Example:

```bash
bash scripts/train_script.sh adamw-1B 16 256 1e-3 0,1,2,3,4,5,6,7 29500 1B-Training
```

Key training parameters you may want to customize:
```bash
--model.max_sequence_length=1024          # Training sequence length
--optimizer.weight_decay=0.1              # Weight decay coefficient
--data.dir=./tokenized_data               # Path to tokenized dataset
--tokenizer.identifier=./tokenizer/tokenizer.json
--save_folder="./runs/${config}-${suffix}"
--max_duration=5e9T                       # Total training tokens
--wandb.project="YourProject"             # Weights & Biases project
--layerwise_statis_collect_interval=1     # Variance collection frequency
```

## Evaluation

Convert OLMo checkpoints to HuggingFace format for evaluation:
Dense Models:

```bash
python scripts/convert_tools/convert_olmo_hf.py \
    --input_dir /path/to/olmo/checkpoint \
    --tokenizer_json_path /path/to/tokenizer.json \
    --output_dir /path/to/output
```

MoE Models:
```bash
python scripts/convert_tools/convert_olmo_moe_hf.py \
    --input_dir /path/to/olmo/moe/checkpoint \
    --tokenizer_json_path /path/to/tokenizer.json \
    --output_dir /path/to/output
```

Using the lm-evaluation-harness (make sure you installed the local version):
```bash
python -m lm_eval \
    --model hf \
    --model_args pretrained=/path/to/hf/model \
    --tasks mmlu,hellaswag,arc_challenge \
    --batch_size auto \
    --output_path ./eval_results
```

## Analysis Tools

We provide several analysis scripts for investigating sparsity patterns and layer importance:
| Script | Purpose | Usage |
|---|---|---|
| `analyze_weight_sparsity.py` | Analyze weight magnitude distributions | For studying implicit weight sparsity |
| `analyze_attention_sparsity.py` | Analyze attention pattern sparsity | For studying implicit sequence-wise sparsity |
| `analyze_weight_decay_sparsity.py` | Compare sparsity across weight decay values | For ablation studies |
These scripts compute metrics for analyzing layer importance and functional differentiation, as described in our paper:

| Script | Purpose | Key Features |
|---|---|---|
| `compute_jacobian.py` | Compute Jacobian matrices for residual identity-mapping analysis | Measures how much each layer transforms its input via Jacobian analysis. Computes deviation from the identity mapping, off-diagonal norms, diagonal statistics, Frobenius norm, and spectral norm. Supports both row-wise and element-wise computation methods. |
| `compute_usefulness.py` | Compute layer usefulness via linear approximation | Replaces each layer with a linear mapping (`W*x + b`) fitted via least squares, then measures the loss increase. Computes a global usefulness score as the fraction of layers causing a >10% loss increase. |
| `compute_layer_score.py` | Comprehensive layer scoring with multiple metrics | Causal Score (`--compute_causal`): measures the causal effect of skipping a layer on later layers. Permutation Score (`--compute_permutation`): measures layer independence by swapping layer weights and computing the normalized loss change. |
Example usage:

```bash
# Compute Jacobian matrices for residual identity analysis
python scripts/score_computation/compute_jacobian.py \
    --model_path /path/to/model \
    --output_dir ./jacobian_results \
    --num_samples 50 \
    --seq_length 512 \
    --plot

# Compute layer usefulness via linear approximation
python scripts/score_computation/compute_usefulness.py \
    --model_path /path/to/model \
    --output_dir ./usefulness_results \
    --num_samples 100 \
    --seq_length 512

# Compute causal effects between layers
python scripts/score_computation/compute_layer_score.py \
    --model_path /path/to/model \
    --compute_causal \
    --output_dir ./causal_results

# Compute permutation scores
python scripts/score_computation/compute_layer_score.py \
    --model_path /path/to/model \
    --compute_permutation \
    --output_dir ./permutation_results
```

## Acknowledgments

This project builds upon the excellent work of:
- OLMo - Open Language Model from the Allen Institute for AI
- OLMoE - Mixture-of-Experts extension to OLMo
- lm-evaluation-harness - Framework for evaluating language models
We are grateful to the maintainers and contributors of these open-source projects.
## License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you find this repository helpful, please give us a ⭐ on GitHub!
