- [2025/05/29] We have released our model and data through HuggingFace. The code will be released soon, within one month.
LLMs perform well on coding benchmarks like LiveCodeBench but struggle with real-world software engineering (SWE) tasks. Even large models like Claude reach only around 60% accuracy on SWE-bench, despite using carefully engineered prompting pipelines. Smaller models (under 100B parameters) perform significantly worse, typically scoring below 10% in zero-shot settings and plateauing around 30% after supervised fine-tuning (SFT) on GitHub issue datasets. Improving the performance of these small models remains a key challenge for practical deployment, where repeatedly querying large models is often too costly or inefficient.
Existing approaches primarily rely on SFT with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generate multiple outputs, score them with a verifier, and select the best one. Although effective, this strategy often requires excessive sampling and costly scoring.
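To make the baseline concrete, standard best-of-N test-time scaling can be sketched as below. This is a minimal illustration, not our pipeline: `generate` and `verifier_score` are hypothetical placeholders standing in for an LLM sampler and a learned verifier.

```python
import random

def generate(issue, temperature=0.8):
    """Placeholder for sampling one candidate patch from an LLM."""
    return f"patch-{random.random():.3f}"

def verifier_score(issue, patch):
    """Placeholder verifier (e.g., a reward model or test execution)."""
    return random.random()

def best_of_n(issue, n=50):
    """Sample n patches independently and keep the highest-scoring one.

    Both generation and scoring cost grow linearly in n, which is the
    sample inefficiency that motivates this work.
    """
    candidates = [generate(issue) for _ in range(n)]
    return max(candidates, key=lambda p: verifier_score(issue, p))
```

Note that every candidate is drawn from the same fixed distribution, so raising accuracy requires raising n.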
This work explores a new research direction: sample-efficient test-time scaling methods that identify correct solutions with fewer samples. We propose Evolutionary Test-Time Scaling (EvoScale), which treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions. Our approach yields Satori-SWE-32B, a 32B model built from an open-source base model and trained on open-source data. Key features of Satori-SWE include:
- A new perspective of formulating test-time scaling as an evolutionary process, improving sample efficiency for software engineering tasks.
- A novel RL training approach that enables self-evolution, eliminating the need for external reward models or verifiers at inference time.
- Satori-SWE-32B with EvoScale achieves performance comparable to models exceeding 100B parameters, while requiring only a small number of samples.
- Classical SFT: fine-tune a base model on inputs consisting of the issue description and code context, with targets that include a chain-of-thought (CoT) trace and the ground-truth patch.
- Mutation SFT: fine-tune a second model that, given the same inputs plus a set of conditioning patches sampled from the classical SFT model, writes an improved patch.
- Potential-based Reward Shaping: further fine-tune the mutation SFT model with RL to maximize the expected shaped reward, i.e., the improvement in score between a newly generated patch and the previous patch.
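The shaping term in the last step follows the classic potential-based form, rewarding score improvement rather than absolute score. A minimal sketch, with illustrative trajectory scores that are not from the paper:

```python
def shaped_rewards(potentials, gamma=1.0):
    """Potential-based reward shaping over a patch-refinement trajectory.

    potentials[t] is the score Phi(patch_t) assigned during training.
    The shaped reward at step t is gamma * Phi(patch_{t+1}) - Phi(patch_t),
    which is positive only when the new patch improves on the previous one.
    """
    return [gamma * potentials[t + 1] - potentials[t]
            for t in range(len(potentials) - 1)]

# An illustrative trajectory whose patch scores go 2 -> 5 -> 4 -> 9
# yields shaped rewards of 3, -1, 5 (with gamma = 1.0): only the
# improving steps are rewarded.
```

Shaping the reward this way pushes the model to produce patches that beat the previous iteration, which is exactly the mutation behavior EvoScale needs at test time.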
We use Qwen2.5-Coder-32B-Instruct as the base model for training Satori-SWE-32B. With the two-stage SFT and subsequent RL training, Satori-SWE-32B outperforms all small-scale models under greedy decoding and matches the current SOTA, SWE-RL, at a smaller model scale (32B vs. 70B), with far less training data (30K vs. million-scale) and far fewer test-time samples (50 vs. 500).
| Model | Params | Best@N | Accuracy (%) |
|---|---|---|---|
| GPT‑4o (Agentless) | – | 1 | 38.8 |
| Claude 3.5 (Agentless) | – | 1 | 50.8 |
| DeepSeek‑V3 (Agentless) | – | – | 42.0 |
| SWE‑Fixer | 72 B | 1 | 30.2 |
| SWE‑Gym‑32B | 32 B | 1 | 20.6 |
| SWE‑Gym‑32B | 32 B | 16 | 32.0 |
| Llama‑3 SWE‑RL | 70 B | 80 | 37.0 |
| Llama‑3 SWE‑RL | 70 B | 500 | 41.0 |
| Satori‑SWE‑32B | 32 B | 1 | 35.8 |
| Satori‑SWE‑32B | 32 B | 10 | 38.9 |
| Satori‑SWE‑32B | 32 B | 25 | 40.2 |
| Satori‑SWE‑32B | 32 B | 50 | 41.6 |
- Delin Chen, UMass Amherst
- Zhenting Qi, Harvard
- Wei Lu, SUTD
- Gregory W. Wornell, MIT
- Subhro Das, MIT-IBM Watson AI Lab
- David Cox, MIT-IBM Watson AI Lab
- Chuang Gan$^†$, UMass Amherst, MIT-IBM Watson AI Lab
For questions, please:
- Raise an issue in our GitHub repository
- Contact us at: [email protected]
```bibtex
@misc{zeng2025satorisweevolutionarytesttimescaling,
      title={Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering},
      author={Guangtao Zeng and Maohao Shen and Delin Chen and Zhenting Qi and Subhro Das and Dan Gutfreund and David Cox and Gregory Wornell and Wei Lu and Zhang-Wei Hong and Chuang Gan},
      year={2025},
      eprint={2505.23604},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.23604},
}
```