
When Does Sparsity Mitigate the Curse of Depth in LLMs?


Sparsity as a Variance Regulator for Improved Depth Utilization in Language Models


Overview

This repository contains the official implementation of our research on how sparsity mechanisms mitigate the curse of depth in large language models. We show that sparsity acts as a variance regulator, improving depth utilization beyond its conventional role in computational efficiency.

Our investigation covers:

  • Implicit Sparsity: Weight sparsity induced by weight decay, attention sparsity from long contexts
  • Explicit Sparsity: Key/value-sharing in Grouped-Query Attention (GQA), expert activation in Mixture-of-Experts (MoE)

Through controlled depth-scaling experiments, we demonstrate that sparsity consistently reduces output variance and promotes functional differentiation across layers.
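The variance-regulation effect can be illustrated with a toy residual stream (a sketch for intuition only, not the paper's experimental setup): each layer adds a branch output onto the stream, and zeroing a fraction of branch units per layer, loosely mimicking sparse activation such as MoE routing, slows variance growth with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 32

def run_residual_stream(sparsity):
    """Propagate x through a toy residual stack: x <- x + mask * f(x).

    `sparsity` is the fraction of branch outputs zeroed per layer,
    loosely mimicking sparse activation (e.g. MoE expert routing).
    Returns the per-layer variance of the residual stream.
    """
    x = rng.standard_normal(d)
    variances = []
    for _ in range(depth):
        W = rng.standard_normal((d, d)) / np.sqrt(d)
        branch = np.tanh(W @ x)
        mask = rng.random(d) >= sparsity  # keep (1 - sparsity) of the units
        x = x + mask * branch
        variances.append(x.var())
    return variances

dense = run_residual_stream(sparsity=0.0)
sparse = run_residual_stream(sparsity=0.75)
print(f"final variance, dense : {dense[-1]:.2f}")
print(f"final variance, sparse: {sparse[-1]:.2f}")
```

In this toy model the dense stream's variance grows much faster with depth than the sparse stream's, mirroring the "sparsity as variance regulator" observation at a very small scale.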


📰 News

  • [2026-03] Our paper is released on arXiv.
  • [2026-02] Our paper is released here.
  • [2026-02] Initial codebase release with training and evaluation scripts.


πŸ› οΈ Installation

Prerequisites

  • Python 3.10 or higher
  • CUDA-capable GPU(s) for training (recommended)
  • At least 50GB disk space for datasets and checkpoints

Setup Environment

# Create virtual environment
python -m venv sparsity_cod
source sparsity_cod/bin/activate

# Install PyTorch
pip install torch==2.7.1

# Install the main package
git clone <repository-url>
cd SparsityAndCoD
pip install -e .

# Install optional dependencies
pip install flash-attn --no-build-isolation
pip install datasets matplotlib scikit-learn torchmetrics wandb

# Install evaluation harness (optional)
pip install -e lm-evaluation-harness

🚀 Getting Started

Data Preparation

We provide a helper script for tokenizing HuggingFace datasets into the OLMo-compatible format:

Script: scripts/tokenize_from_dataset.py

python scripts/tokenize_from_dataset.py \
    --dataset_name HuggingFaceFW/fineweb-edu \
    --dataset_config sample_100BT \
    --split "train" \
    --max_tokens 10_000_000_000 \
    --output_dir ./tokenized_data \
    --tokenizer_identifier ./tokenizer/tokenizer.json

Note: You can download the GPT-NeoX tokenizer from the OLMo HuggingFace repository.

Configuration

We provide pre-configured YAML files for various model architectures used in our experiments:

| Configuration | Description | Parameters |
| --- | --- | --- |
| adamw-1B.yaml | Primary 1B-parameter dense model | 1B |
| adamw-400M-momentum.yaml | Model for higher-order momentum analysis | 400M |
| olmoe-1B-7B.yaml | MoE with 1B activated / 7B total parameters | 1B/7B |
| olmoe-400M-2B.yaml | MoE with 400M activated / 2B total parameters | 400M/2B |
| olmoe-1B-7B-ablation.yaml | Ablation study configuration | 1B/7B |

πŸ‹οΈ Training

Training Script

We provide an easy-to-use training script at scripts/train_script.sh:

bash scripts/train_script.sh <config> <batch_size> <global_batch> <lr> <gpus> <port> <suffix>

Arguments:

| Argument | Description | Example |
| --- | --- | --- |
| config | Configuration file name (without .yaml) | adamw-1B |
| batch_size | Per-device batch size | 16 |
| global_batch | Global batch size | 256 |
| lr | Learning rate | 1e-3 |
| gpus | Visible GPU IDs | 0,1,2,3,4,5,6,7 |
| port | Master port (optional) | 29500 |
| suffix | Run suffix for logging (optional) | 1B-Training |

Example:

bash scripts/train_script.sh adamw-1B 16 256 1e-3 0,1,2,3,4,5,6,7 29500 1B-Training

Training Configuration

Key training parameters you may want to customize:

--model.max_sequence_length=1024    # Training sequence length
--optimizer.weight_decay=0.1        # Weight decay coefficient
--data.dir=./tokenized_data         # Path to tokenized dataset
--tokenizer.identifier=./tokenizer/tokenizer.json
--save_folder="./runs/${config}-${suffix}"
--max_duration=5e9T                 # Total training tokens
--wandb.project="YourProject"       # Weights & Biases project
--layerwise_statis_collect_interval=1  # Variance collection frequency
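These overrides can be combined with the training script. A hypothetical full launch is sketched below; whether train_script.sh forwards trailing flags to the underlying trainer is an assumption, so adjust to match the script's actual interface:

```shell
# Hypothetical example: 1B dense run with customized parameters.
# Assumption: train_script.sh passes trailing flags through to the trainer.
bash scripts/train_script.sh adamw-1B 16 256 1e-3 0,1,2,3,4,5,6,7 29500 1B-Training \
    --model.max_sequence_length=1024 \
    --optimizer.weight_decay=0.1 \
    --data.dir=./tokenized_data \
    --max_duration=5e9T
```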

📊 Evaluation

Checkpoint Conversion

Convert OLMo checkpoints to HuggingFace format for evaluation:

Dense Models:

python scripts/convert_tools/convert_olmo_hf.py \
    --input_dir /path/to/olmo/checkpoint \
    --tokenizer_json_path /path/to/tokenizer.json \
    --output_dir /path/to/output

MoE Models:

python scripts/convert_tools/convert_olmo_moe_hf.py \
    --input_dir /path/to/olmo/moe/checkpoint \
    --tokenizer_json_path /path/to/tokenizer.json \
    --output_dir /path/to/output

Downstream Evaluation

Using the lm-evaluation-harness (make sure the local version is installed via pip install -e lm-evaluation-harness):

python -m lm_eval \
    --model hf \
    --model_args pretrained=/path/to/hf/model \
    --tasks mmlu,hellaswag,arc_challenge \
    --batch_size auto \
    --output_path ./eval_results

🔬 Analysis Tools

We provide several analysis scripts for investigating sparsity patterns and layer importance:

Sparsity Analysis

| Script | Purpose | Usage |
| --- | --- | --- |
| analyze_weight_sparsity.py | Analyze weight magnitude distributions | For studying implicit weight sparsity |
| analyze_attention_sparsity.py | Analyze attention pattern sparsity | For studying implicit sequence-wise sparsity |
| analyze_weight_decay_sparsity.py | Compare sparsity across weight decay values | For ablation studies |
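The core measurement behind weight-sparsity analysis can be sketched as follows; the threshold value and the toy matrices are illustrative assumptions, not the actual interface of analyze_weight_sparsity.py:

```python
import numpy as np

def near_zero_fraction(weights, threshold=1e-3):
    """Fraction of entries whose magnitude falls below `threshold`.

    A simple proxy for implicit weight sparsity induced by weight decay:
    stronger decay pushes more entries toward zero.
    """
    return float(np.mean(np.abs(weights) < threshold))

rng = np.random.default_rng(0)
# Toy stand-ins for a layer trained with weak vs. strong weight decay.
weak_decay = rng.standard_normal((256, 256)) * 0.02
strong_decay = weak_decay * (np.abs(weak_decay) > 0.02)  # decay zeroed small weights

print(near_zero_fraction(weak_decay), near_zero_fraction(strong_decay))
```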

Layer Importance & Score Computation

These scripts compute various metrics to analyze layer importance and functional differentiation, as described in our paper:

| Script | Purpose | Key Features |
| --- | --- | --- |
| compute_jacobian.py | Compute Jacobian matrices for residual identity mapping analysis | Measures how much each layer transforms its input via Jacobian analysis. Computes deviation from identity mapping, off-diagonal norms, diagonal statistics, Frobenius norm, and spectral norm. Supports both row-wise and element-wise computation methods. |
| compute_usefulness.py | Compute layer usefulness via linear approximation | Replaces each layer with a linear mapping (W*x + b) fitted via least squares, then measures the loss increase. Computes a global usefulness score as the fraction of layers causing a >10% loss increase. |
| compute_layer_score.py | Comprehensive layer scoring with multiple metrics | Causal Score (--compute_causal): measures the causal effect of skipping a layer on later layers. Permutation Score (--compute_permutation): measures layer independence by swapping layer weights and computing the normalized loss change. |
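The linear-approximation idea behind the usefulness metric can be sketched on a toy layer; this uses the residual of the least-squares fit as a proxy, whereas the script measures the downstream loss increase, and all function names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 512
X = rng.standard_normal((n, d))

def usefulness(layer_fn, X):
    """Fit the best affine map W @ x + b to a layer via least squares,
    then report the relative fitting residual: near 0 means the layer
    is nearly linear (low usefulness), large means it resists
    linearization and does real nonlinear work."""
    Y = layer_fn(X)
    X1 = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)  # coef stacks W and b
    residual = Y - X1 @ coef
    return np.linalg.norm(residual) / np.linalg.norm(Y)

W = rng.standard_normal((d, d)) / np.sqrt(d)
near_identity = lambda X: X + 0.01 * np.tanh(X @ W)  # barely transforms input
nonlinear = lambda X: np.maximum(X @ W, 0.0)         # strongly nonlinear layer

print(usefulness(near_identity, X), usefulness(nonlinear, X))
```

As expected, the near-identity layer is almost perfectly captured by an affine map, while the ReLU layer leaves a large residual.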

Example usage:

# Compute Jacobian matrices for residual identity analysis
python scripts/score_computation/compute_jacobian.py \
    --model_path /path/to/model \
    --output_dir ./jacobian_results \
    --num_samples 50 \
    --seq_length 512 \
    --plot

# Compute layer usefulness via linear approximation
python scripts/score_computation/compute_usefulness.py \
    --model_path /path/to/model \
    --output_dir ./usefulness_results \
    --num_samples 100 \
    --seq_length 512

# Compute causal effects between layers
python scripts/score_computation/compute_layer_score.py \
    --model_path /path/to/model \
    --compute_causal \
    --output_dir ./causal_results

# Compute permutation scores
python scripts/score_computation/compute_layer_score.py \
    --model_path /path/to/model \
    --compute_permutation \
    --output_dir ./permutation_results
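The residual-identity analysis behind compute_jacobian.py can be illustrated with a numerical Jacobian of a toy residual layer (a sketch under illustrative scales, not the script's implementation): the smaller a layer's branch contribution, the closer its Jacobian sits to the identity matrix.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-5):
    """Finite-difference Jacobian J[i, j] = d f_i / d x_j."""
    d = len(x)
    J = np.zeros((d, d))
    fx = f(x)
    for j in range(d):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (f(xp) - fx) / eps
    return J

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d)) / np.sqrt(d)

def residual_layer(x, scale):
    """Toy residual block: identity plus a scaled nonlinear branch."""
    return x + scale * np.tanh(W @ x)

x = rng.standard_normal(d)
for scale in (0.01, 1.0):
    J = numerical_jacobian(lambda v: residual_layer(v, scale), x)
    dev = np.linalg.norm(J - np.eye(d))  # Frobenius deviation from identity
    print(f"branch scale {scale}: ||J - I||_F = {dev:.3f}")
```

A layer that mostly passes its input through (small branch scale) yields a tiny deviation from identity, which is exactly the symptom of poor depth utilization the Jacobian analysis is designed to detect.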

πŸ™ Acknowledgments

This project builds upon the excellent work of:

  • OLMo - Open Language Model from the Allen Institute for AI
  • OLMoE - Mixture-of-Experts extension to OLMo
  • lm-evaluation-harness - Framework for evaluating language models

We are grateful to the maintainers and contributors of these open-source projects.


📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


If you find this repository helpful, please give us a ⭐ on GitHub!
