Self-Evaluation Unlocks
Any-Step Text-to-Image Generation

Authors
1The University of Hong Kong    2Adobe Research
*Corresponding author.    Project lead.
December 2025
Teaser
Figure 1. One model, any compute: Self-E generates coherent images at 2, 4, 8, and 50 steps.

Introduction

Modern text-to-image models are dominated by diffusion and flow matching due to their stability, scalability, and strong visual fidelity. However, they are inherently multi-step models: they learn local structure of the underlying data distribution, such as score functions or velocity fields, and therefore require dozens of steps to reliably traverse the curved reverse trajectory from noise to data.

We introduce the Self-Evaluating Model (Self-E), a framework for any-step text-to-image generation trained entirely from scratch. Self-E unlocks this capability without requiring distillation from a pre-trained teacher model. Instead, it learns local structure of the data distribution in a manner similar to conditional flow matching (learning from data), while simultaneously evaluating its own few-step generated samples with its own score estimates (self-evaluation).


Method

Two Complementary Signals

Self-E trains a single model with two complementary objectives: a learning-from-data component that provides local trajectory supervision, and a self-evaluation component that targets distribution-level matching.

Self-E method overview
Self-E simultaneously learns from data while performing self-evaluation, using the same network in two complementary modes.
Learning from data

What it learns: local structure, i.e., the local score or velocity information that explains how density varies in nearby states. Concretely, we sample a real image $x_0$ with prompt $c$, add noise to obtain $x_t$, and train the model to predict the clean image from this noisy input using a conditional flow matching objective. This provides local trajectory supervision, which is most effective when generation follows a local path with many small denoising steps.
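As a minimal sketch of this learning-from-data objective, assuming a rectified-flow-style linear interpolation and a velocity-prediction network (the names `model` and `prompt_emb` are placeholders, and the exact parameterization used by Self-E may differ):

```python
import torch

def flow_matching_loss(model, x0, prompt_emb):
    """Conditional flow matching on a linear data->noise path (illustrative sketch)."""
    noise = torch.randn_like(x0)                   # terminal noise sample x1
    t = torch.rand(x0.shape[0], device=x0.device)  # one timestep per batch element
    t_ = t.view(-1, 1, 1, 1)                       # broadcast over image dimensions
    x_t = (1 - t_) * x0 + t_ * noise               # interpolate between data and noise
    v_target = noise - x0                          # velocity of the linear path
    v_pred = model(x_t, t, prompt_emb)             # model's local velocity prediction
    return torch.mean((v_pred - v_target) ** 2)
```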

Learning by self-evaluation

What it learns: distribution-level correctness of the generated sample at each specified denoising step, i.e., whether a resulting output is realistic and prompt-consistent. Instead of constraining the trajectory of intermediate generations, self-evaluation directly targets global distribution matching by treating the model output as a sample from its implicit distribution and pushing it toward the real data distribution. After the model proposes a long-range jump from a starting timestep to a landing timestep, it uses its own local score/flow estimator at the landing point to produce a directional signal indicating how the current sample should move toward a higher-quality, more prompt-consistent region.
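A rough sketch of this self-evaluation signal, under our reading of the description above. The calls `model.jump` (the long-range jump from t to s) and `model.local_velocity` (the same network queried for its local estimate) are hypothetical names, and the illustrative step size and loss form are not the actual Self-E objective:

```python
import torch

def self_evaluation_loss(model, x_t, t, s, prompt_emb, step_size=0.1):
    """Evaluate a long-range jump using the model's own local estimate (sketch)."""
    # 1) The generator proposes a long-range jump from timestep t to landing step s.
    #    `model.jump` is a hypothetical flow-map-style call, not Self-E's actual API.
    x_s = model.jump(x_t, t, s, prompt_emb)

    # 2) The same network, queried in its local (learning-from-data) mode, provides
    #    a directional signal at the landing point; no gradient flows into the evaluator.
    with torch.no_grad():
        direction = model.local_velocity(x_s, s, prompt_emb)  # hypothetical call
        target = x_s - step_size * direction   # illustrative move toward higher density

    # 3) Push the proposed landing point toward the evaluator-suggested target.
    return torch.mean((x_s - target) ** 2)
```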

Results

GenEval Overall Across Step Counts

Self-E consistently achieves state-of-the-art results across step budgets and improves monotonically with more steps. On GenEval Overall, it scores 0.753, 0.781, 0.785, and 0.815 at 2, 4, 8, and 50 steps, respectively. The largest margin appears in the few-step regime, while performance remains top-tier at 8 and 50 steps.

Quantitative Comparison

Metric: GenEval Overall at different denoising step counts (higher is better).

Method       | 2 steps | 4 steps | 8 steps | 50 steps
SDXL         | 0.0021  | 0.1576  | 0.3759  | 0.4601
FLUX.1-Dev   | 0.0998  | 0.3198  | 0.5893  | 0.7966
LCM          | 0.2624  | 0.3277  | 0.3398  | 0.3303
SANA-1.5     | 0.1662  | 0.5725  | 0.7788  | 0.8062
TiM          | 0.6338  | 0.6867  | 0.7143  | 0.7797
SDXL-Turbo   | 0.4622  | 0.4766  | 0.4652  | 0.3983
SD3.5-Turbo  | 0.3635  | 0.7194  | 0.7071  | 0.6114
Self-E       | 0.7531  | 0.7806  | 0.7849  | 0.8151

Overall vs Steps

Plot: GenEval Overall (y-axis) versus denoising step count (x-axis) for SDXL, FLUX.1-Dev, LCM, SANA-1.5, TiM, SDXL-Turbo, SD3.5-Turbo, and Self-E.
Qualitative comparison
Qualitative comparison. Side-by-side visual results across different step budgets.
From Matching to Evaluation

The Conceptual Shift

For a fixed noisy input, training can be viewed as learning directions on an energy landscape. In the animations, green indicates the score-driven improvement direction, blue the model's prediction, and dashed blue the supervision signal used to update the model. The key shift is where supervision is applied: matching a local direction at the start point versus evaluating the quality of the landing point.

Matching at the start point

Diffusion

Diffusion provides a static target: for a given noisy input, its score function defines the ground-truth local direction. Training is standard supervised learning: update the model so its prediction aligns with the target.

Even with perfect local matching, inference still needs many steps. Starting from noise, the sampler must integrate these local directions step by step to follow a curved trajectory toward higher-density regions. This is why scaling model size alone cannot remove the step bottleneck: the limitation is geometry and numerical integration.
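To make the integration argument concrete, here is a generic Euler sampler over a learned velocity field (our sketch, not any particular model's sampler). With a small step budget, each Euler step must approximate a long stretch of a curved trajectory, which is where the few-step error comes from:

```python
import torch

@torch.no_grad()
def euler_sample(model, prompt_emb, shape, num_steps=50, device="cuda"):
    """Integrate a learned local velocity field from noise to data (generic sketch)."""
    x = torch.randn(shape, device=device)                        # pure noise at t = 1
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)  # discretized schedule
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        t_b = torch.full((shape[0],), float(t), device=device)   # per-sample timestep
        v = model(x, t_b, prompt_emb)                            # local direction at (x, t)
        x = x + (t_next - t) * v                                 # one small Euler step
    return x
```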

Diffusion training animation

Animation over training iterations: the model is updated so its prediction aligns with the fixed local target; the dashed arrow indicates the update direction.

Supervision
Fixed local target at the start point.
What it learns
A local vector field that supports many-step integration, not a shortcut.
Consequence
Long-range jumps with few steps tend to drift towards average behavior.
Evaluation at the landing point

Self-E

Self-E changes the training target from matching a direction to reaching a good destination. At each iteration, the model proposes a long-range jump to a landing candidate. The sample at the landing point is then evaluated using the learned local direction there, which indicates how it should move toward a prompt-following, higher-density region. This produces a dynamic supervision signal that teaches the model to aim directly for better destinations.

In other words, the model produces a proposal, the proposal is evaluated, and learning happens from that feedback. This outcome-oriented supervision implicitly shapes a reliable shortcut path. Self-E training resembles a refinement step used during diffusion inference, while at inference time Self-E can output the shortcut directly.
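In rough notation (ours, schematic rather than the paper's exact losses), the shift moves the supervision from the start point to the landing point:

$$\mathcal{L}_{\text{match}} = \mathbb{E}_{x_0,\,x_1,\,t}\,\big\| v_\theta(x_t, t, c) - (x_1 - x_0) \big\|^2, \qquad x_t = (1-t)\,x_0 + t\,x_1,$$

$$\mathcal{L}_{\text{eval}} = \mathbb{E}\,\big\| \hat{x}_s - \mathrm{sg}\!\left[\hat{x}_s + \Delta(\hat{x}_s, s, c)\right] \big\|^2,$$

where \(\hat{x}_s\) is the landing point proposed by a long-range jump, \(\Delta\) is the correction direction derived from the model's own local score/velocity estimate at \(\hat{x}_s\), and \(\mathrm{sg}[\cdot]\) denotes stop-gradient. The first objective has a fixed target determined at the start point; the second has a dynamic target that depends on where the model actually lands.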

Self-E training animation

Animation over training iterations: the model proposes a long-range jump, is evaluated at the landed point, and is updated using a feedback direction toward a better target.

Supervision
Feedback at the landing point, dynamic rather than fixed at the start.
What it learns
How to land in good regions in few steps.
Consequence
The model implicitly learns a shortcut path.
Where does the evaluation signal come from?

Evaluate by itself

Evaluating a landing point requires a score-like signal to indicate whether the proposed destination is good, but this signal is not directly available. Prior work typically obtains it from a pretrained diffusion teacher. Self-E instead co-trains the evaluator via learning from data and reuses it to provide feedback to the generator. This enables a fully from-scratch training setup without relying on any external model.
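Putting the two signals together, a from-scratch training step might look like the following sketch, reusing the flow_matching_loss and self_evaluation_loss sketches above. The timestep sampling, loss weighting, and scheduling here are placeholder choices, not the paper's recipe:

```python
import torch

def train_step(model, optimizer, x0, prompt_emb, lambda_eval=1.0):
    """One from-scratch training step combining both signals (illustrative sketch)."""
    # (a) Learning from data: local supervision via conditional flow matching.
    loss_data = flow_matching_loss(model, x0, prompt_emb)

    # (b) Self-evaluation: build a noisy starting state, propose a long-range jump,
    #     and score the landing point with the co-trained local estimator.
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device) * 0.5 + 0.5  # start from a noisy state
    s = torch.rand(x0.shape[0], device=x0.device) * t           # land closer to data (s < t)
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1 - t_) * x0 + t_ * noise
    loss_eval = self_evaluation_loss(model, x_t, t, s, prompt_emb)

    # (c) Joint update; lambda_eval is an unspecified weighting choice.
    loss = loss_data + lambda_eval * loss_eval
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```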

Self-E is not a trajectory-based model

Explore its own path

Most text-to-image generators are trained to follow a pre-defined reverse trajectory. A noise schedule defines a diffusion SDE, and the corresponding deterministic probability-flow ODE (PF-ODE) induces a path for each sample: $$\frac{d x(t)}{dt} = f(x(t), t) - \frac{1}{2} g(t)^2 \nabla_x \log p_t(x(t)),$$ where the trajectory is the integral curve \(x(t)\), not merely the marginals \(p_t\). Flow Matching fits this trajectory by regressing the local tangent field \(v_\theta(x,t)\approx \dot{x}(t)\), typically requiring numerical integration at inference. Recent few-step methods (e.g., sCMs and MeanFlow) typically learn interval-level displacements (i.e. flow maps) \(x_t\!\rightarrow x_s\) to enable long-range jumps by matching the same PF-ODE trajectory induced by the local score functions or velocity fields.
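Schematically (our notation, not the original papers'), the difference between the local and interval-level parameterizations is:

$$v_\theta(x_t, t) \approx \dot{x}(t) \quad \text{(local tangent; inference integrates many small steps)},$$

$$x_s \approx x_t + (s - t)\, u_\theta(x_t, t, s) \quad \text{(flow map; one evaluation jumps from } t \text{ to } s\text{)}.$$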

Self-E is different: it does not aim to imitate one prescribed PF-ODE path, but to generate samples that match the marginal distribution of real data. It learns local structure from real samples \((x_0, c)\) and noisy states \(x_t\), yielding an evolving local score/velocity signal that becomes the internal evaluator. The generator proposes a long-range jump, and the evaluator provides feedback at the landing point to update the generator through distribution matching, without any pretrained teacher.

In short, Self-E shifts learning from “imitate the path” to “reach a good destination”, allowing the model to discover its own effective path.

Conclusion

Discussion & Future Work

Self-E departs from trajectory-based training that primarily matches local directions along a predefined path. By jointly learning local estimators from data and reusing them to supervise long-range jumps, Self-E enables flexible any-step inference without pretrained teacher distillation.

A useful perspective is an environment–agent loop: the model learns an internal evaluator from real data, then reuses it to score landing points and steer the generator. As training progresses, stronger local learning improves the evaluator, and a better evaluator improves few-step behavior.

1. Data Phase: Learn local structure from real samples $(x_0, c)$ and noisy states $x_t$, yielding an evolving local score/velocity signal. This becomes the internal evaluator.

2. Self-Evaluation Phase: Propose a long-range jump, evaluate where it lands, and update the generator to land in higher-density, prompt-consistent regions.

3. Closed Loop: Better learning from data improves the evaluator; a stronger evaluator improves few-step generation. This feedback cycle enables from-scratch training without a pretrained teacher.

We find this loop particularly intriguing as a potential bridge between pretraining and reinforcement learning: the evaluator effectively behaves like a learned reward model that can be queried at the landing point. Looking forward, a natural direction is to unify recent advances in RL techniques with this framework and study when and how they improve stability and alignment for visual generation. The current approach is still at an early stage. In extremely low-step regimes, generated images can miss fine details compared with long multi-step inference. Several design choices remain underexplored, including objective weighting, inference-time scheduling, and adaptation to downstream tasks. We expect systematic optimization of these factors to yield further gains.

Citation
BibTeX
@article{yu2025self,
  title={Self-Evaluation Unlocks Any-Step Text-to-Image Generation},
  author={Yu, Xin and Qi, Xiaojuan and Li, Zhengqi and Zhang, Kai and Zhang, Richard and Lin, Zhe and Shechtman, Eli and Wang, Tianyu and Nitzan, Yotam},
  journal={arXiv preprint arXiv:2512.22374},
  year={2025}
}