Introduction to VERDICT

SOSP'25 Artifact Evaluation for Paper#501 VERDICT

Introduction to VERDICT

Overview: VERDICT is a verification tool for ensuring parallelization equivalence in distributed model training. It guarantees that the parallelized model is arithmetically equivalent to its original single-device version, thereby eliminating errors such as incorrect tensor transformations or faulty collective communication introduced during parallelization.

Workflow: VERDICT takes the execution plans (execplans) of both the original (single-device) and parallelized (multi-device) models, each enriched with lineages. It then transforms these into symbolic SSA DAGs (single-static assignment directed acyclic graphs). The dual graphs are partitioned into independent stages using lineage information. Each stage undergoes parallel execution, where shape reduction is applied, followed by symbolic verification using Z3. If all stages pass, VERDICT guarantees end-to-end equivalence.

Implementation: VERDICT is implemented with modular interfaces. It defines a general graph interface and a solver interface, supporting a registry of allowable operators. The nnScaler backend constructs SSA DAGs that conform to VERDICT’s requirements. A high-level view of the implementation is shown below:

Artifact Evaluation Guide

Welcome to artifact evaluation guide for VERDICT (SOSP'25). The following document outlines the procedures needed to reproduce our results and guides you through the key experiments presented in the paper.

✅ Checklist with Estimated Time Cost

Access hardware resources. (Azure, clone repo)
Installation. (conda environment, demo runs) ⏱️ 10 mins
Run Real-world Parallelization Evaluation (§8.1) ⏱️ 24 hours
Run Scalability Evaluation (§8.2) ⏱️ 6 hours
Run Bug Reproduction Evaluation (§8.3) ⏱️ 20 mins

⚠️ Note: Due to refined design, optimizations and code refactoring, the current evaluation results are improved thus different from statistics in the submitted paper. Please refer to each section for expected outputs.

💻 Hardware Requirements

To fully reproduce results, we recommend to run VERDICT artifact evaluation on a machine with at least 32 CPU (virtual) cores and 1TB memory. For SOSP'25 AE reviewers, please contact authors for Azure virtual machine instances.

🎭 Recommended Evaluation Strategy

Evaluation of §8.1 and §8.2 are about VERDICT's performance, thus having to be run on capable machines, such as Azure VM instances. These experiments are estimated to take total 30 hours and do not require much reviewer engagement except for inspection of final dumped statistics, due to acknowledged correctness of input models.
Evaluation of §8.3 bug reproduction can be more interesting. VERDICT will verify 14 buggy parallelized models, detect their violations, and print out information helpful for bug diagnosis. These experiments are lightweight, and can be run on personal devices.

🚀 Installation

Create conda environment. ⚠️ This step can be skipped when using the provided machine, since it's already set up.
```
cd Verdict
conda env create -f conda-environment.yml
conda activate verdict
```
Run demo verification for 2-layer llama3 model parallelization.
```
mkdir data/logs;
bash scripts/demo_llama3.sh 
```
The scripts/demo_llama3.sh essentially runs the following command:
```
python main.py \
    --sm gen_model/mgeners/llama3_default_dp1_pp1_tp1_nm1_gbs32_ly32_h32_hi128_sq8192.pkl \
    --pm gen_model/mgeners/llama3_default_dp2_pp2_tp2_nm2_gbs32_ly32_h32_hi128_sq8192.pkl \
    --seed 0 \
    --time  \
    --max_ser_proc 30 \
    --max_vrf_proc 30 \
    --loglevel INFO \
    --no_cache_nodes \
    --no_cache_stages \
    |& tee -a data/logs/llama3_default_dp2_pp2_tp2_nm2_gbs32_ly32_h32_hi128_sq8192.txt
```
Command interpretation: main.py is the entry of Verdict. --sm and --pm sepcify the paths of single-device model's and parallelized model's execution plan respectively. --seed sets z3 random seed. --time activates timer. --max_ser_proc and --max_vrf_proc set the multiprocessing pool size for building SSA DAGs, and parallel stage execution respectively. --loglevel sets logger level. --no_cache_nodes and --no_cache_stages ignore any cached data and run verification from scratch. |& tee -a ... writes logs to a file for inspection convenience.

👀 Expected Output: The program should print the following message or similar at the bottom of the output. Indicating successful execution of all stages, as well as the verified end-to-end equivalence.
```
parallel verifying stages: 100%|██████████| 3909/3909 [00:05<00:00, 676.09it/s] 
PID: ... - ✅ SUCCESS 
Stats(success=True, ... )
```
A failed run would print PID: ... - ❌ FAIL.
Once the demo runs successfully, we can move on to reproduce evaluations.

⚠️ Note: In all following evaluations, we directly provide execution plans (*.pkl files) of models in stead of building graphs from model code due to time and GPU constraints, as nnScaler parallelizes models through expensive profiling and graph tracing.

🚀 Evaluate Real-World Parallelization

🎯 Goal

(Paper §8.1) To demonstrate VERDICT practicality and time cost, we experiment on verifying execution plans for LLaMA3 (8B/70B/405B) and DeepSeek-V3 (16B/236B/671B) models under various real-world parallelization setup. There are total 6 runs. Parallelizaion scheme listed below.

⏳ Estimated Completion Time

As these plans corresponds to real-world large scale training, involving up to 8192 GPUs, their verification time can be costly. Estimated time is listed below. (L1-L3 and D1-D2 will use 30 workers for stage-parallel execution. D3 will use 10 workers due to memory constraints.)

📋 Table of Experiments

Exp. ID	Model	DP	TP	PP	NM	Est. Time
L1	llama3-8B	512	1	1	1	5 mins
L2	llama3-70B	16	8	4	32	2.5 hours
L3	llama3-405B	64	8	16	16	8 hours
D1	DSV3-16B	16	4	2	16	30 mins
D2	DSV3-236B	16	8	4	16	2.5 hours
D3	DSV3-671B	32	8	8	16	9 hours

🛠 How to Run

⚠️ Note: As these experiments take hours to run, we recommend using screen or tmux in case of ssh disconnection.

cd to working directory (relative path w.r.t repo root): nnscaler/Verdict
Clean up stale stats.
```
rm data/stats/stats.csv
```

Run each experiment.

bash scripts/run_large_llama3_8B.sh;
bash scripts/run_large_llama3_70B.sh;
bash scripts/run_large_llama3_405B.sh;
bash scripts/run_large_moe_16B.sh;
bash scripts/run_large_moe_236B.sh;
bash scripts/run_large_moe_671B.sh;

Each script contains a command similar to the demo's in installation guide. Feel free to take a look.

👀 Expected Output

Each experiment is expected to print the following message or similar at the end of their respective log, indicating success of verification. The Stats will contain the time of execution, which will also be dumped to data/stats/stats.csv.

PID: ... - ✅ SUCCESS 
Stats(success=True, ... )

To check the evaluation results conveniently, take a look at the column t_total in data/stats/stats.csv and compare with either estimated time in the Table of Experiments above, or our updated evaluation results (to be appear in camera-ready) in data/4_stats_large_final.csv. The measured time should be close. Compared with Table 3 in paper §8.1, the improved verification time takes only the half.

🚀 Evaluate Scalability

🎯 Goal

(Paper §8.2) This evaluation measures scalability trends of VERDICT. According to the design of VERDICT, the time complexity should be invariant with respect to actual tensor shapes due to shape reduction, and sub-/linear to parallelization.

⏳ Estimated Completion Time

There will be 40 runs, taking 6 hours in total.

🛠 How to Run

⚠️ Note: As these experiments take hours to run, we recommend using screen or tmux in case of ssh disconnection.

cd to working directory (relative path w.r.t repo root): nnscaler/Verdict
Clean up stale stats.
```
rm data/stats/stats.csv
```
⚠️ Important. As we will later make plot graph based on this file, requiring the csv only contains results from this evaluation.
Run experiments.
```
bash scripts/run_trends_full_8B.sh
```
Plot time complexity trends.
```
cd ../ae;
python draw.py
```

👀 Expected Output

Inspect output plots in nnscaler/ae/figs. The trends should ressemble the following plots. The complexity trends are also expected to be consistent with Figure 6 in paper §8.2, though the breakdown time components and absolute time may differ.

🚀 Evaluate Bug Reproduction

🎯 Goal

(Paper §8.3) This evaluation demonstrates VERDICT can detect real-world parallelization bugs. We will use VERDICT to verify 14 buggy execution plans crafted from real-world buggy models. We will see how VERDICT detect them, and output useful information for diagnosis. All execution plans are crafted on the base model of llama3-8B with 2-way DP, 2-way TP, 2-way PP and 2 microbatches.

⏳ Estimated Completion Time

Total execution time of 14 cases should finish within 3 minumtes. Manual examination of results takes around 15 minutes.

📋 Table of Bugs

Bug ID	Type	Description
1	communication	Missing `allreduce` operation without breaking dataflow
2	communication	Missing `allreduce` operation with broken dataflow
3	communication	`allreduce` primitive replaced by `allgather`
4	communication	`allgather` primitive replaced by `allreduce`
5	communication	Missing reduce operation in backward fused operator
6	rank assignment	Wrong rank assignment of tensor movement between pipeline stages
7	computation	Single-rank mutant on fw ADD op, adding inconsistent constants
8	computation	Single-rank mutant on fw POW op, powering inconsistent constants
9	computation	Multi-rank mutant on fw ADD op, adding inconsistent constants
10	computation	Single-rank mutant on bw MULTIREF op, consuming wrong input tensors
11	computation	Single-rank mutant on bw APPLY_ROTARY_EMB, reversing output tensors
12	scaling	Inconsistent scaling of tensors
13	scaling	Inconsistent scaling of tensors
14	scheduling	Lauching reducer before the last operator of the second microbatch

🛠 How to Run

cd to working directory (relative path w.r.t repo root): nnscaler/Verdict
Run experiments. Also log outputs to nnscaler/ae/br*.log for examination convenience.
```
bash scripts/run_br.sh
```

👀 Expected Output

Inspect nnscaler/ae/br[bug_id].log. The following document guides you how to interpret the error logs case by case. We use Gs and Gp to refer to single-device graph and multi-device graph; Ts and Tp for their tensors.