Self-Evaluation Unlocks
Any-Step Text-to-Image Generation
Modern text-to-image models are dominated by diffusion and flow matching thanks to their stability, scalability, and strong visual fidelity. However, they are inherently multi-step models: they learn local structure of the underlying data distribution, such as score functions or velocity fields, and therefore require dozens of steps to reliably traverse the curved reverse trajectory from noise to data.
We introduce the Self-Evaluating Model (Self-E), a from-scratch training framework for any-step text-to-image generation. Self-E unlocks this capability without requiring distillation from a pre-trained teacher model. Instead, Self-E learns the local structure of the data distribution in a manner similar to conditional flow matching (learning from data), while simultaneously evaluating its own few-step generated samples using its own score estimates (self-evaluation).
Two Complementary Signals
Self-E trains a single model with two complementary objectives: a learning-from-data component that provides local trajectory supervision, and a self-evaluation component that targets distribution-level matching.
What the learning-from-data component learns: local structure, i.e., the local score or velocity information that describes how density varies across nearby states. Concretely, we sample a real image $x_0$ with prompt $c$, add noise to obtain $x_t$, and train the model to predict the clean image from this noisy input using a conditional flow matching objective. This provides local trajectory supervision, which is most effective when generation follows a local path with many small denoising steps.
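A minimal PyTorch-style sketch of this learning-from-data objective is shown below. The linear noising path, the velocity parameterization, and the `model(x, t, s, prompt)` interface (where $s = t$ yields the local prediction) are illustrative assumptions, not the exact training recipe.

```python
import torch

def flow_matching_loss(model, x0, prompt_emb):
    """Learning-from-data objective (illustrative sketch, not the exact recipe).

    x0: clean images (B, C, H, W); prompt_emb: text conditioning.
    Assumes a linear noising path x_t = (1 - t) * x0 + t * noise, whose
    ground-truth local velocity is (noise - x0), and a model mapping
    (x, t, s, prompt) to an average velocity over [t, s]; with s = t this
    reduces to the instantaneous (local) velocity.
    """
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)                    # random timestep per sample
    noise = torch.randn_like(x0)                            # noisy endpoint of the path
    tv = t.view(-1, 1, 1, 1)
    x_t = (1 - tv) * x0 + tv * noise                        # noisy state on the path
    target_v = noise - x0                                   # local velocity along the path
    pred_v = model(x_t, t, t, prompt_emb)                   # s = t: local prediction
    return ((pred_v - target_v) ** 2).mean()                # regress the local tangent
```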
What the self-evaluation component learns: distribution-level correctness of the sample generated at each specified denoising step, i.e., whether the resulting output is realistic and prompt-consistent. Instead of constraining the trajectory of intermediate generations, self-evaluation directly targets global distribution matching by treating the model output as a sample from its implicit distribution and pushing it toward the real data distribution. After the model proposes a long-range jump from a starting timestep to a landing timestep, it uses its own local score/flow estimator at the landing point to produce a directional signal indicating how the current sample should move toward a higher-quality, more prompt-consistent region.
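One way this feedback could be instantiated, reusing the hypothetical interface from the sketch above (the paper's actual objective, stop-gradient placement, and weighting may differ):

```python
import torch

def self_evaluation_loss(model, x_t, t, s, prompt_emb, dt=0.05):
    """Illustrative self-evaluation feedback (a sketch, not the paper's exact loss).

    x_t: noisy states (B, C, H, W); t, s: start and landing timesteps, shape (B,).
    The generator proposes a long-range jump from t to s; the model's own local
    estimate at the landing point (gradient-detached) then indicates where a
    higher-quality, prompt-consistent sample lies.
    """
    # 1) Propose a long-range jump over [t, s].
    jump_v = model(x_t, t, s, prompt_emb)
    x_s = x_t + (s - t).view(-1, 1, 1, 1) * jump_v          # landing candidate

    # 2) Self-evaluate: query the local estimator at the landing point and take
    #    a small refinement step toward the region it points to (toward data).
    with torch.no_grad():
        local_v = model(x_s, s, s, prompt_emb)              # local direction at landing
        x_better = x_s - dt * local_v                       # evaluator's suggested move

    # 3) Feedback: pull the proposed landing point toward the suggestion.
    return ((x_s - x_better) ** 2).mean()
```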
GenEval Overall Across Step Counts
Self-E consistently achieves state-of-the-art results across step budgets and improves monotonically with more steps: GenEval Overall reaches 0.753, 0.781, 0.785, and 0.815 at 2, 4, 8, and 50 steps, respectively. The largest margin appears in the few-step regime, while performance remains top-tier at 8 and 50 steps.
Quantitative Comparison
| Method | 2 steps | 4 steps | 8 steps | 50 steps |
|---|---|---|---|---|
| SDXL | 0.0021 | 0.1576 | 0.3759 | 0.4601 |
| FLUX.1-Dev | 0.0998 | 0.3198 | 0.5893 | 0.7966 |
| LCM | 0.2624 | 0.3277 | 0.3398 | 0.3303 |
| SANA-1.5 | 0.1662 | 0.5725 | 0.7788 | 0.8062 |
| TiM | 0.6338 | 0.6867 | 0.7143 | 0.7797 |
| SDXL-Turbo | 0.4622 | 0.4766 | 0.4652 | 0.3983 |
| SD3.5-Turbo | 0.3635 | 0.7194 | 0.7071 | 0.6114 |
| Self-E | 0.7531 | 0.7806 | 0.7849 | 0.8151 |
Plot: GenEval Overall score versus number of inference steps.
The Conceptual Shift
For a fixed noisy input, training can be viewed as learning directions on an energy landscape. In the animations, green marks the score-driven direction toward a better region, blue marks the model prediction, and dashed blue marks the supervision signal used to update the model. The key shift is where supervision is applied: match a local direction at the start, or evaluate the quality of the landing point.
Diffusion
Diffusion provides a static target: for a given noisy input, its score function defines the ground-truth local direction. Training is standard supervised learning: update the model so its prediction aligns with the target.
Even with perfect local matching, inference still needs many steps. Starting from noise, the sampler must integrate these local directions step by step to follow a curved trajectory toward higher-density regions. This is why scaling model size alone cannot remove the step bottleneck: the limitation is geometry and numerical integration.
Animation over training iterations: the model is updated so its prediction aligns with the fixed local target; the dashed arrow indicates the update direction.
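The bottleneck can be seen in a generic Euler sampler over a learned local field (a sketch, not this work's sampler): each update only moves a short distance along the current local direction, so a curved trajectory needs many small steps regardless of model capacity.

```python
import torch

def local_euler_sample(model, prompt_emb, shape, num_steps=50):
    """Many-step sampling by integrating local directions only (illustrative).

    Starts from pure noise at t = 1 and Euler-integrates toward data at t = 0,
    using the same hypothetical model(x, t, s, prompt) interface with s = t.
    """
    x = torch.randn(shape)                                   # start from noise
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t = torch.full((shape[0],), ts[i].item())
        s = torch.full((shape[0],), ts[i + 1].item())
        v = model(x, t, t, prompt_emb)                       # local direction at (x, t)
        x = x + (s - t).view(-1, 1, 1, 1) * v                # one short Euler step
    return x
```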
Self-E
Self-E changes the training target from matching a direction to reaching a good destination. At each iteration, the model proposes a long-range jump to a landing candidate. The sample at the landing point is then evaluated using the learned local direction there, which indicates how to move toward a prompt-consistent, higher-density region. This produces a dynamic supervision signal that teaches the model to aim directly for better destinations.
In other words, the model produces a proposal, the proposal is evaluated, and learning happens from the feedback. This outcome-oriented supervision implicitly shapes a reliable shortcut path. Self-E training resembles a refinement step used during diffusion inference, while at inference time Self-E can output the shortcut directly.
Animation over training iterations: the model proposes a long-range jump, is evaluated at the landing point, and is updated using a feedback direction toward a better target.
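At inference time, the learned shortcut can be queried with any step budget. A schematic sketch, again assuming the hypothetical two-time interface from the earlier snippets:

```python
import torch

def any_step_sample(model, prompt_emb, shape, num_steps=4):
    """Any-step inference with long-range jumps (illustrative sketch).

    The same model is queried over long intervals [t, s], so a handful of
    jumps (or even one) can carry noise to a data-like sample; a larger step
    budget simply refines the path.
    """
    x = torch.randn(shape)                                   # start from noise, t = 1
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t = torch.full((shape[0],), ts[i].item())
        s = torch.full((shape[0],), ts[i + 1].item())
        u = model(x, t, s, prompt_emb)                       # average velocity over [t, s]
        x = x + (s - t).view(-1, 1, 1, 1) * u                # one long-range jump
    return x
```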
Evaluate by itself
Evaluating a landing point requires a score-like signal to indicate whether the proposed destination is good, but this signal is not directly available. Prior work typically obtains it from a pretrained diffusion teacher. Self-E instead co-trains the evaluator via learning from data and reuses it to provide feedback to the generator. This enables a fully from-scratch training setup without relying on any external model.
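A minimal sketch of how the two objectives could share one network within a single training step (the timestep sampling and equal loss weighting are assumptions; `flow_matching_loss` and `self_evaluation_loss` refer to the earlier sketches):

```python
import torch

def train_step(model, optimizer, x0, prompt_emb):
    """One joint update (illustrative): the same network learns from data and
    is reused, with gradients detached, as its own evaluator, so no pretrained
    teacher is required."""
    b = x0.shape[0]

    # Learning from data: local supervision on noisy real images.
    loss_data = flow_matching_loss(model, x0, prompt_emb)

    # Self-evaluation: propose a long-range jump from a noisy start and score
    # the landing point with the model's own local estimate.
    t = 0.5 + 0.5 * torch.rand(b, device=x0.device)          # start in the noisier half
    s = t * torch.rand(b, device=x0.device)                  # land closer to data (s < t)
    tv = t.view(-1, 1, 1, 1)
    x_t = (1 - tv) * x0 + tv * torch.randn_like(x0)
    loss_eval = self_evaluation_loss(model, x_t, t, s, prompt_emb)

    loss = loss_data + loss_eval                              # equal weighting assumed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```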
Explore its own path
Most text-to-image generators are trained to follow a pre-defined reverse trajectory. A noise schedule defines a diffusion SDE, and the corresponding deterministic probability-flow ODE (PF-ODE) induces a path for each sample: $$\frac{d x(t)}{dt} = f(x(t), t) - \frac{1}{2} g(t)^2 \nabla_x \log p_t(x(t)),$$ where the trajectory is the integral curve $x(t)$, not merely the marginals $p_t$. Flow Matching fits this trajectory by regressing the local tangent field $v_\theta(x,t)\approx \dot{x}(t)$, which typically requires numerical integration at inference. Recent few-step methods (e.g., sCMs and MeanFlow) typically learn interval-level displacements (i.e., flow maps) $x_t \rightarrow x_s$ that enable long-range jumps by matching the same PF-ODE trajectory induced by the local score functions or velocity fields.
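In flow-map terms, the interval-level displacement these methods learn is the integral of the local velocity along the same PF-ODE trajectory, compressed into a single network evaluation (a standard flow-map identity, not specific to any one method): $$x_s = \Phi_{t \to s}(x_t) = x_t + \int_t^{s} v\bigl(x(\tau), \tau\bigr)\, d\tau \;\approx\; x_t + (s - t)\, u_\theta(x_t, t, s),$$ where $u_\theta$ approximates the average velocity over $[t, s]$.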
Self-E is different: it does not aim to imitate one prescribed PF-ODE path, but to generate samples that match the marginal distribution of real data. It learns local structure from real samples $(x_0, c)$ and noisy states $x_t$, yielding an evolving local score/velocity signal that serves as the internal evaluator. The generator proposes a long-range jump, and the evaluator provides feedback at the landing point to update the generator through distribution matching, without any pretrained teacher.
In short, Self-E shifts learning from “imitate the path” to “reach a good destination”, allowing the model to discover its own effective path.
Discussion & Future Work
Self-E departs from trajectory-based training that primarily matches local directions along a predefined path. By jointly learning local estimators from data and reusing them to supervise long-range jumps, Self-E enables flexible any-step inference without pretrained teacher distillation.
A useful perspective is an environment–agent loop: the model learns an internal evaluator from real data, then reuses it to score landing points and steer the generator. As training progresses, stronger local learning improves the evaluator, and a better evaluator improves few-step behavior.
Data Phase
Learn local structure from real samples $(x_0, c)$ and noisy states $x_t$, yielding an evolving local score/velocity signal. This becomes the internal evaluator.
Self-Evaluation Phase
Propose a long-range jump, evaluate where it lands, and update the generator to land in higher-density, prompt-consistent regions.
Closed Loop
Better learning from data improves the evaluator; a stronger evaluator improves few-step generation. This feedback cycle enables from-scratch training without a pretrained teacher.
We find this loop particularly intriguing as a potential bridge between pretraining and reinforcement learning: the evaluator effectively behaves like a learned reward model that can be queried at the landing point. Looking forward, a natural direction is to unify recent advances in RL with this framework and study when and how they improve stability and alignment for visual generation. The current approach is still at an early stage. In extremely low-step regimes, generated images can miss fine details compared with long multi-step inference. Several design choices remain underexplored, including objective weighting, inference-time scheduling, and adaptation to downstream tasks. We expect systematic optimization of these factors to yield further gains.