TDAR is a unified framework designed to unlock the potential of Block Diffusion Language Models (BDLMs) for test-time scaling on complex reasoning tasks. We introduce adaptive decoding and generation strategies that effectively balance efficiency and reasoning quality, achieving state-of-the-art performance with up to 3.37× speedup.
[2026-02-11]
- TDAR-8B-Thinking model is now available on Hugging Face!
- BACD (Bounded Adaptive Confidence Decoding) enables dynamic, difficulty-aware test-time decoding
- TCCF (Think Coarse, Critic Fine) paradigm achieves a superior speed-quality trade-off
- Comprehensive evaluation on 6 reasoning benchmarks (mathematics, code, STEM)
- Paper released on arXiv: Advancing Block Diffusion Language Models for Test-Time Scaling
Recent advances in Block Diffusion Language Models (BDLMs) have shown promise in integrating diffusion with autoregressive decoding for efficient inference. However, existing BDLMs struggle to balance efficiency and effectiveness under test-time scaling on complex reasoning tasks.
Key challenges in current BDLMs:
- Static decoding strategies: Uniform denoising schedules fail to adapt to varying difficulty across long reasoning trajectories
- Suboptimal block size allocation: Single block size throughout reasoning limits flexibility in computation allocation
- Speed-quality trade-off: Improvements in efficiency often lead to substantial performance degradation
To address these limitations, we introduce TDAR, which features:
- Bounded Adaptive Confidence Decoding (BACD): Dynamically adapts the denoising process based on local difficulty signals (see the sketch after this list)
- Think Coarse, Critic Fine (TCCF): Allocates computation based on functional roles in reasoning trajectories
- Progressive Block Size Extension: Mitigates performance degradation when scaling to larger block sizes
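To make the BACD idea concrete, here is a minimal, hypothetical sketch of bounded confidence-based parallel decoding: at each denoising step, masked tokens whose predicted confidence clears a threshold are committed, with the number of commitments per step bounded below and above. The function and parameter names are illustrative, not the released implementation.

```python
import torch

def bacd_step(logits, masked, tau=0.9, min_commit=1, max_commit=8):
    """One hypothetical BACD denoising step over a single block.

    logits: [block_len, vocab] model predictions for the block.
    masked: [block_len] bool, True where a position is still masked.
    Assumes at least one position in the block is still masked.
    """
    probs = torch.softmax(logits, dim=-1)
    conf, tokens = probs.max(dim=-1)  # per-position confidence and argmax token
    # Ignore positions that are already filled.
    conf = torch.where(masked, conf, torch.full_like(conf, -1.0))

    # Adaptive part: count masked positions whose confidence clears the threshold...
    above = int((conf >= tau).sum())
    # ...then bound it, so easy spans decode fast while hard spans stay cautious.
    n = max(min_commit, min(max_commit, above, int(masked.sum())))

    top = conf.topk(n)
    return tokens[top.indices], top.indices  # caller writes these into the sequence
```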
- Releases
- Overview
- Main Results
- Quick Start
- Methods
- Detailed Results
- Speed-Quality Trade-off
- Experimental Setup
- Model Zoo
- Citation
TDAR-8B-Thinking achieves state-of-the-art performance among 8B-scale block diffusion language models while providing substantial speedup:
We comprehensively evaluate TDAR on 6 diverse reasoning benchmarks covering mathematical reasoning, code generation, and STEM tasks:
| Method | Math500 TPF | Math500 ACC | AIME24 TPF | AIME24 AVG@8 | AIME25 TPF | AIME25 AVG@8 | AMC23 TPF | AMC23 AVG@8 | LCB TPF | LCB ACC | GPQA TPF | GPQA ACC | AVG TPF | AVG ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Autoregressive LM | ||||||||||||||
| Qwen3-8B-Thinking† | 1.00 | 88.2 | 1.00 | 63.3 | 1.00 | 55.8 | 1.00 | 88.8 | 1.00 | 59.5 | 1.00 | 49.0 | 1.00 | 67.4 |
| Masked Diffusion LM | ||||||||||||||
| LLaDA | 3.91 | 41.2 | 3.44 | 6.7 | 3.66 | 0.0 | 4.07 | 12.5 | 2.83 | 4.7 | 3.14 | 17.2 | 3.51 | 13.7 |
| LLaDA-1.5 | 3.97 | 42.2 | 3.34 | 0.0 | 3.68 | 0.0 | 4.01 | 10.0 | 2.86 | 4.3 | 3.01 | 24.2 | 3.48 | 13.5 |
| LLaDA-MoE | 2.70 | 56.6 | 2.89 | 3.3 | 2.71 | 0.0 | 3.16 | 32.5 | 2.05 | 12.9 | 2.18 | 27.8 | 2.62 | 22.2 |
| Block Diffusion LM | ||||||||||||||
| Fast-dLLM-v2 | 2.81 | 59.4 | 2.58 | 0.0 | 2.58 | 0.0 | 2.77 | 25.0 | 1.73 | 6.8 | 2.09 | 28.3 | 2.43 | 19.9 |
| SDAR-8B-Chat | 2.21 | 52.6 | 2.96 | 5.0 | 2.35 | 7.1 | 2.83 | 22.5 | 1.60 | 7.5 | 1.32 | 10.6 | 2.21 | 17.6 |
| DiRL-8B-Instruct | 2.30 | 78.2 | 1.96 | 18.8 | 1.92 | 15.8 | 2.05 | 65.6 | 2.64 | 10.4 | 2.27 | 44.4 | 2.19 | 38.9 |
| TraDo-8B-Instruct | 2.36 | 75.0 | 2.13 | 13.3 | 2.00 | 12.5 | 2.23 | 55.3 | 1.42 | 7.2 | 1.43 | 27.3 | 1.93 | 31.8 |
| TraDo-8B-Thinking | 1.28 | 84.0 | 1.35 | 31.3 | 1.35 | 26.3 | 1.37 | 72.8 | 1.10 | 22.6 | 1.16 | 46.0 | 1.27 | 47.1 |
| TraDo + BACD | 1.33 | 85.0 | 1.44 | 32.9 | 1.44 | 27.5 | 1.45 | 73.8 | 1.15 | 23.3 | 1.18 | 49.5 | 1.33 | 48.7 |
| TraDo + BACD + TCCF | 1.28 | 85.6 | 1.36 | 35.8 | 1.33 | 27.1 | 1.36 | 74.1 | 1.11 | 21.9 | 1.14 | 49.5 | 1.27 | 49.0 |
| TDAR-8B-Thinking (Ours) | 1.62 | 81.6 | 4.47 | 34.6 | 4.17 | 30.8 | 5.03 | 69.1 | 1.25 | 40.5 | 1.28 | 46.5 | 2.97 | 50.5 |
| + BACD | 1.88 | 83.4 | 5.07 | 36.3 | 4.73 | 30.4 | 5.59 | 71.3 | 1.46 | 40.1 | 1.49 | 46.0 | 3.37 | 51.2 |
| + BACD + TCCF | 1.75 | 84.0 | 3.04 | 42.9 | 2.79 | 35.8 | 2.68 | 80.0 | 1.32 | 42.6 | 1.39 | 50.0 | 2.16 | 55.9 |
Note: TPF = Tokens Per Forward Pass (higher is faster); † indicates models derived from Qwen3-8B-Base with identical CPT and SFT.
State-of-the-Art Performance
- TDAR-8B-Thinking outperforms the previous best TraDo-8B-Thinking by +3.4 points on average
- With BACD + TCCF: Achieves 55.9% average accuracy (best among all 8B BDLMs)
- On challenging benchmarks: 42.9 on AIME24, 35.8 on AIME25, 80.0 on AMC23
Superior Efficiency
- 3.37× speedup with BACD (compared to the 1.00× autoregressive baseline)
- 2.34× faster than TraDo-8B-Thinking (2.97 TPF vs 1.27 TPF) with better accuracy
- Maintains high speed even with TCCF: 2.16 TPF while achieving best accuracy
Optimal Speed-Quality Trade-off
- BACD alone: Maximum speedup (3.37 TPF) with +0.7 accuracy gain
- BACD + TCCF: Best accuracy (55.9%) with 2.16× speedup
- Flexible configuration for different application scenarios
Generalizability
- BACD improves TraDo-8B-Thinking from 47.1 to 48.7 (+1.6 points)
- TCCF further boosts performance to 49.0
- Both strategies consistently improve performance across diverse tasks
Impact of Block Size
- Larger block size (B=16) enables a better speed-quality trade-off
- TraDo-8B-Thinking (B=4) achieves only 1.33 TPF even with BACD
- Progressive block size extension is crucial for unlocking acceleration potential
```bash
# Clone the repository
git clone https://github.com/yourusername/TDAR.git
cd TDAR

# Create conda environment
conda create -n tdar python=3.10 -y
conda activate tdar

# Install PyTorch
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124

# Install lmdeploy
cd third_party/lmdeploy-0.10.2
pip3 install -e .
cd ../..

# Install additional dependencies
pip install -r requirements.txt
```

TDAR supports evaluation on multiple reasoning benchmarks. You can download pre-processed datasets from our HuggingFace collection.
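If you prefer to script the download, `huggingface_hub` works; note that the repo id below is a placeholder, so substitute the actual dataset repo from the collection:

```python
from huggingface_hub import snapshot_download

# Hypothetical repo id -- replace with the dataset repo from the TDAR collection.
snapshot_download(
    repo_id="TDAR/eval-datasets",
    repo_type="dataset",
    local_dir="datasets",
)
```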
Dataset Format:
Each dataset should follow this structure:
```
datasets/
└── {dataset_name}/
    └── test/
        └── {dataset_name}.json
```
The JSON file should contain a list of problems with the following format:
```json
[
  {
    "question": "Problem statement here...",
    "ground_truth_answer": "correct answer"
  },
  ...
]
```
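Before launching a run, you can sanity-check a dataset file against this schema. A minimal sketch (the script name and example path are illustrative):

```python
import json
import sys

# Usage (illustrative): python check_dataset.py datasets/MATH500/test/MATH500.json
path = sys.argv[1]
with open(path) as f:
    problems = json.load(f)

assert isinstance(problems, list), "top level must be a list of problems"
for i, p in enumerate(problems):
    assert "question" in p, f"problem {i} is missing 'question'"
    assert "ground_truth_answer" in p, f"problem {i} is missing 'ground_truth_answer'"
print(f"{path}: {len(problems)} problems OK")
```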
Edit the configuration file in ./configs/eval_config. We provide templates for both single-stage inference and two-stage inference.

Key Configuration Parameters:
- `block_size`: Block length for generation
- `denoising_steps_per_block`: Number of denoising iterations per block
- `remasking_strategy`: Remasking strategy to use
  - `bounded_adaptive_confidence_decoding`: BACD strategy
  - `low_confidence_dynamic`: Dynamic threshold-based remasking
- `start_with_think`: Whether to use `<think>` tags (required for TCCF)
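Assuming the evaluation configs are YAML files like the training config, loading and inspecting one might look like the sketch below (the file name is a placeholder; use one of the provided templates):

```python
import yaml  # pip install pyyaml

# Placeholder file name under ./configs/eval_config.
with open("configs/eval_config/single_stage.yaml") as f:
    cfg = yaml.safe_load(f)

# The keys below are the parameters documented above.
print(cfg["block_size"], cfg["denoising_steps_per_block"])
print(cfg["remasking_strategy"], cfg.get("start_with_think", False))
```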
Single-Stage Evaluation:
```bash
# Edit the script to set your paths. Key configurations to modify:
# - CONDA_ENV_PATH: Path to your conda environment
# - PROJECT_ROOT: Path to TDAR project root
# - MODEL_PATH: Path to your model checkpoint
# - CONFIG_FILE: Path to your config file
# - NUM_GPUS_TO_USE: Number of GPUs to use

# Run evaluation
bash scripts/inference/run_eval.sh
```

Two-Stage Evaluation (TCCF):
```bash
# Run two-stage evaluation
bash scripts/inference/run_eval_tccf.sh
```

Script Structure:
The evaluation script supports:
- Multi-GPU parallel evaluation
- Automatic batch processing
- Progress tracking and logging
- Result organization by timestamp
Example Dataset Configuration in Script:
```bash
declare -a DATASET_CONFIGS=(
    "MATH500|math|1|0211"          # MATH500, 1 sample per question
    "AIME2024|math|8|0211"         # AIME2024, 8 samples per question
    "GPQA|option|1|0211"           # GPQA, 1 sample per question
    "LiveCodeBench|code|1|0211"    # LiveCodeBench, 1 sample
)
```

Format: `"dataset_name|data_type|samples_per_question|run_id"`
- `dataset_name`: Must match the folder name in `dataset_base`
- `data_type`: `math`, `code`, or `option`
- `samples_per_question`: Number of responses to generate per problem
- `run_id`: Identifier for this run
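For reference, a single entry decomposes like this; a plain illustration of the format, not project code:

```python
entry = "AIME2024|math|8|0211"
dataset_name, data_type, samples_per_question, run_id = entry.split("|")
assert data_type in {"math", "code", "option"}
print(dataset_name, data_type, int(samples_per_question), run_id)  # AIME2024 math 8 0211
```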
Check Logs:
```bash
# Logs are saved to logs/ or logs_tccf/
tail -f logs/{model_name}/{config_name}/{timestamp}/{dataset_name}.log
```

Check Results:
```bash
# Results are saved to results/ or results_tccf/
ls results/{model_name}/{config_name}/{timestamp}/
```

Result Format:
Each evaluation produces a JSON file with:
```json
[
  {
    "question": "original question",
    "ground_truth_answer": "correct answer",
    "prompt": "formatted prompt used",
    "full_output": ["response 1", "response 2", ...],
    "extracted_output": ["extracted answer 1", "extracted answer 2", ...],
    "response_length": [token_count_1, token_count_2, ...],
    "finish_reason": ["stop", "stop", ...],
    "correctness": [true, false, ...]
  },
  ...
]
```

For two-stage evaluation, additional fields are included:
```json
{
  "stage1_output": ["think stage output 1", ...],
  "stage2_output": ["final stage output 1", ...],
  "stage1_tokens": [count1, ...],
  "stage2_tokens": [count2, ...],
  "used_final_stage": [true, ...]
}
```
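If you want to aggregate accuracy from these result files yourself, here is a minimal sketch; the field names come from the format above, while the file path is a placeholder:

```python
import json

# Placeholder path; point this at an actual result file.
with open("results/model/config/timestamp/MATH500.json") as f:
    results = json.load(f)

# Mean correctness over all sampled responses of all problems
# (equivalent to avg@k when k samples are drawn per question).
flags = [c for r in results for c in r["correctness"]]
print(f"accuracy: {sum(flags) / len(flags):.3f} over {len(flags)} responses")
```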
Edit the training configuration in configs/train_config/tdar_8b_sft.yaml, then run the training script to perform Supervised Fine-Tuning (SFT):

```bash
bash scripts/train/run_sft_bs16.sh
```

Our key innovation is the Progressive Block Size Extension approach, which enables stable training with large block sizes while incurring minimal performance degradation. Specifically, we progressively increase the block size using the same training recipe, gradually extending the model from a block size of 4 to 64. This progressive strategy mitigates the optimization difficulties associated with large block sizes and is crucial for unlocking the acceleration potential of BDLMs in test-time scaling scenarios.
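Conceptually, the extension schedule is a staged curriculum over block sizes. The loop below is an illustrative sketch only: the `run_sft` helper, the starting checkpoint name, and the intermediate stages are assumptions (the text above states only the endpoints, 4 to 64).

```python
# Illustrative stages; only the 4 -> 64 endpoints come from the text above.
STAGES = [4, 8, 16, 32, 64]

def run_sft(checkpoint: str, block_size: int) -> str:
    """Hypothetical wrapper around one SFT run with a fixed block size."""
    ...

ckpt = "tdar-8b-base"  # placeholder starting checkpoint
for b in STAGES:
    # Same recipe at every stage; only the block size grows, warm-starting from
    # the previous stage so the model adapts gradually to larger blocks.
    ckpt = run_sft(ckpt, block_size=b)
```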
For testing and reproduction purposes, we recommend using high-quality long-form reasoning datasets:
- OpenR1-Math: A curated collection of mathematical reasoning problems with detailed step-by-step solutions.
Data Format:
The training data should be in JSONL format (.jsonl), where each line represents one training sample:
```json
{
  "messages": [
    {
      "role": "user",
      "content": "Problem statement here..."
    },
    {
      "role": "assistant",
      "content": "<think>\nDetailed reasoning process...\n</think>\n\nFinal answer: ..."
    }
  ]
}
```

(Pretty-printed here for readability; in the actual `.jsonl` file each sample occupies a single line.)
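To produce training files in this format, a minimal hedged sketch (the output path and example record are placeholders):

```python
import json

# Placeholder sample; in practice, iterate over your (problem, solution) pairs.
samples = [
    {
        "messages": [
            {"role": "user", "content": "Problem statement here..."},
            {"role": "assistant",
             "content": "<think>\nDetailed reasoning process...\n</think>\n\nFinal answer: ..."},
        ]
    }
]

with open("train.jsonl", "w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")  # one sample per line, as JSONL requires
```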
