S₀ Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Jack Young
S₀ Tuning optimizes the initial hidden state (S₀) of recurrent layers in hybrid architectures (GatedDeltaNet, Mamba-2). The learned states are injected before each forward pass, adding zero latency at inference since recurrent models already maintain state. In our main HumanEval setting, training uses roughly 48 execution-verified solutions; trained states are 48 MB, and training takes about 3 minutes on one GPU.
Main results use Qwen3.5-4B with 20 optimization steps per task. Unless noted otherwise, p-values are from two-sided Welch's t-test across independent seed runs.
| Benchmark | Base | + S₀ Tuning | Delta | Seeds | p-value |
|---|---|---|---|---|---|
| HumanEval | 48.8% | 72.2% | +23.4pp | 10 | < 10⁻¹¹ |
| MATH-500 | 51.4% | 56.2% | +4.8pp | 8 | 0.00002 |
| GSM8K | 85.3% | 88.1% | +2.8pp | 10 | 0.0003 |
| Spider (boundary test) | ~72% | ~72% | +0.0pp | 5 alphas | n.s. |
The Spider result (no improvement on out-of-distribution SQL) supports the trajectory-steering mechanism: S₀ Tuning biases the model toward solution trajectories already in its distribution, rather than injecting new knowledge.
Scaling. Gains increase with model size: +2.6pp at 0.8B, +23.4pp at 4B, +44.0pp at 9B (HumanEval).
Cross-architecture. On FalconH1-7B (Mamba-2), S₀ reaches 71.8% vs 71.4% for LoRA in a 3-seed comparison, statistically indistinguishable at this sample size.
LoRA framing. The main Qwen comparison is against the best rank-24 LoRA baseline (+12.7pp). In a separate matched-budget comparison, rank-64 LoRA degrades by -15.5pp in this small-data regime.
```python
from s0 import S0Trainer

trainer = S0Trainer.from_pretrained("Qwen/Qwen3.5-4B")

prompt = "Q: What is 2+2?\nA:"
answer = " 4"
full_text = prompt + answer

# prompt_length = number of prompt tokens (for completion-only loss masking)
tokens = trainer.tokenizer(prompt)
prompt_length = len(tokens["input_ids"])

data = [(full_text, prompt_length)]
trainer.train(data)
trainer.activate()

output = trainer.generate("Q: What is 2+2?\nA:")
print(output)

trainer.save("./my_s0_states")
```

Loading saved states on a new instance:
```python
trainer = S0Trainer.from_pretrained("Qwen/Qwen3.5-4B")
trainer.load("./my_s0_states")
trainer.activate()
output = trainer.generate("Q: What is 2+2?\nA:")
```

Installation:

```shell
pip install s0-tuning
```

Requires PyTorch 2.0+, transformers >= 4.51.0, and a GPU with >= 10 GB VRAM for training.
| Model Family | Architecture | Recurrent Layers | Default Alpha | Status |
|---|---|---|---|---|
| Qwen3.5 | GatedDeltaNet | linear_attention layers | 0.07 | Tested |
| FalconH1 | Mamba-2 | mamba layers | 0.65 | Experimental |
To add a new hybrid architecture, the model needs (1) identifiable recurrent layers and (2) an `initial_state` argument in the recurrent kernel. See `_detect_architecture` and `_patch_gdn` / `_patch_mamba2` in `trainer.py`.
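To illustrate what such a patch does, here is a hedged, self-contained toy sketch. The `ToyRecurrentLayer` class and `patch_layer` helper are hypothetical, not this codebase's API (the real hooks are `_patch_gdn` / `_patch_mamba2` in `trainer.py`); only the shape of the idea carries over.

```python
class ToyRecurrentLayer:
    """Stand-in for a recurrent layer that accepts initial_state."""
    def forward(self, x, initial_state=None):
        state = 0.0 if initial_state is None else initial_state
        out = []
        for v in x:                 # trivial cumulative-sum recurrence
            state = state + v
            out.append(state)
        return out

def patch_layer(layer, learned_state, alpha):
    """Wrap forward so the alpha-scaled learned state replaces the zero default."""
    original = layer.forward
    def patched(x, initial_state=None):
        if initial_state is None:   # only override the default zeros
            initial_state = alpha * learned_state
        return original(x, initial_state=initial_state)
    layer.forward = patched
    return layer

layer = patch_layer(ToyRecurrentLayer(), learned_state=10.0, alpha=0.1)
print(layer.forward([1, 2, 3]))     # starts from 0.1 * 10.0 = 1.0, not 0
```

Overriding only when `initial_state is None` means callers that explicitly pass a state keep full control; the learned state is just a new default.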
Recurrent models process sequences starting from S₀ = 0. S₀ Tuning creates
learnable state tensors, patches them into the model as initial_state, and
optimizes via next-token prediction loss on correct completions. At inference,
learned states are scaled by alpha (e.g. 0.07 for GatedDeltaNet) to prevent
distribution shift. The result: the model starts each sequence from a
task-informed state rather than zeros, biasing generation toward correct
solution trajectories without modifying any model weights.
`S0Config` controls training:

| Parameter | Default | Description |
|---|---|---|
| `n_steps` | 20 | Optimization steps |
| `lr` | 1e-3 | Learning rate |
| `l2_lambda` | 5e-4 | L2 regularization weight |
| `alpha` | None | Scaling factor (auto-detected per architecture if None) |
| `normalize` | False | Normalize states before scaling |
| `grad_clip` | 1.0 | Gradient clipping norm |
| `max_length` | 2048 | Maximum sequence length |
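A usage sketch for overriding these defaults, assuming `S0Config` is exported from the `s0` package and that `from_pretrained` accepts a `config` keyword (check the actual API; both are assumptions here):

```python
from s0 import S0Config, S0Trainer

config = S0Config(
    n_steps=40,      # longer optimization than the default 20
    lr=5e-4,         # halved learning rate
    l2_lambda=5e-4,
    alpha=None,      # keep per-architecture auto-detection
    normalize=False,
    grad_clip=1.0,
    max_length=2048,
)
trainer = S0Trainer.from_pretrained("Qwen/Qwen3.5-4B", config=config)
```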
If you use this codebase or otherwise find our work valuable, please cite:
```bibtex
@article{young2026s0tuning,
  title={S$_0$ Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models},
  author={Young, Jack},
  year={2026}
}
```

MIT