[Diffusion] Support Diffusers backend - Run any model supported by Diffusers #14112
Conversation
Summary of Changes (Gemini Code Assist): This pull request significantly enhances SGLang's multimodal generation capabilities by integrating a `diffusers` backend, so that any model supported by the Hugging Face diffusers library can be run through SGLang's infrastructure.
Code Review
This pull request introduces a diffusers backend, which is an excellent feature for expanding model compatibility. The implementation is thorough, with a robust DiffusersPipeline that handles various output formats and loading scenarios. The fallback logic in the model registry is well-designed. The changes also extend the CLI and OpenAI-compatible API to pass through diffusers-specific arguments, which adds a lot of flexibility. My review includes a few suggestions for improving code clarity, fixing a potential tensor shape inconsistency, and addressing some minor stylistic issues. Overall, this is a high-quality contribution.
Needed some changes with ref to #14129.
sayakpaul left a comment:
Thanks a lot for spearheading this!
> This module provides a minimal pipeline configuration that works with the diffusers backend.
> Since diffusers handles its own model loading and configuration, this config is intentionally minimal.
This is cool!
Some bits I wanted to mention:
- We can speed up the pipeline (that includes a bunch of models internally) loading process by specifying `device_map="cuda"` (`cuda` is just an example here) to `from_pretrained()`. This warms up the CUDA caching allocator to avoid small allocations of tensors.
- We can load shards in parallel into the individual pipeline models. Check out: https://huggingface.co/docs/diffusers/main/en/using-diffusers/loading#parallel-loading
> from dataclasses import dataclass, field
>
> from sglang.multimodal_gen.configs.models import DiTConfig, EncoderConfig, VAEConfig
Question (out of curiosity):
Is `EncoderConfig` designed to handle multiple encoders a pipeline might be using? For example, Flux.1 uses two text encoders, Flux.2 uses one. Then there are some pipelines that make use of an image encoder as well (in case they are doing some image-guided tasks).
`EncoderConfig` is designed to guide the generation of a single text encoder; we use one config for each encoder.
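A tiny illustration of that convention, using the fields quoted just below (the two-encoder values are hypothetical, e.g. for a Flux.1-style pipeline):

```python
# Hypothetical: a pipeline with two text encoders gets one EncoderConfig each,
# plus a matching precision entry per encoder.
from sglang.multimodal_gen.configs.models import EncoderConfig

text_encoder_configs = (EncoderConfig(), EncoderConfig())
text_encoder_precisions = ("fp16", "fp16")
```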
> text_encoder_configs: tuple[EncoderConfig, ...] = field(
>     default_factory=lambda: (EncoderConfig(),)
> )
> text_encoder_precisions: tuple[str, ...] = field(default_factory=lambda: ("fp16",))
Just a note that some pipelines that use models like Gemma cannot use FP16 because those models don't support inference in FP16. By "don't support", I mean one can obviously run inference, but the results will be all garbled up.
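For such a pipeline, the fix would presumably be to configure the affected entry with a different precision, assuming the config accepts a string like "bf16":

```python
# Hypothetical: keep a Gemma-based text encoder out of fp16.
text_encoder_precisions = ("bf16",)
```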
> text_encoder_precisions: tuple[str, ...] = field(default_factory=lambda: ("fp16",))
>
> # VAE settings
> vae_tiling: bool = False  # diffusers handles this
Not sure if we support it here as well but "slicing" is also pretty popular:
https://github.com/huggingface/diffusers/blob/152f7ca357c066c4af3d1a58cdf17662ef5a2f87/src/diffusers/models/autoencoders/vae.py#L914
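For reference, a sketch of what both toggles look like on a loaded diffusers pipeline (model id illustrative; the methods live on the pipeline's VAE and availability depends on the pipeline):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # example model
    torch_dtype=torch.float16,
)
pipe.vae.enable_tiling()   # decode in tiles so large resolutions fit in memory
pipe.vae.enable_slicing()  # decode a batch one image at a time
```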
> background: Optional[str],
> image_path: Optional[str] = None,
> num_inference_steps: Optional[int] = None,
> guidance_scale: Optional[float] = None,
For some pipelines like QwenImage, we have true_cfg_scale as well:
https://github.com/huggingface/diffusers/blob/152f7ca357c066c4af3d1a58cdf17662ef5a2f87/src/diffusers/pipelines/qwenimage/pipeline_qwenimage_edit.py#L553
This is to distinguish between guidance distillation (in which case, guidance_scale would mean an embedded scale) and CFG. This is a bit of a future-proof thing since QwenImage doesn't have a guidance-distilled checkpoint yet (this decision was taken by the authors themselves).
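A sketch of how that surfaces at call time for a QwenImage pipeline (model id and values are illustrative):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image",  # example model
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="a capybara reading a newspaper",
    negative_prompt=" ",     # real CFG needs a negative prompt
    true_cfg_scale=4.0,      # strength of true classifier-free guidance
    num_inference_steps=50,
).images[0]
```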
> class DiffusersExecutionStage(PipelineStage):
>     """Pipeline stage that wraps diffusers pipeline execution."""
>
>     def __init__(self, diffusers_pipe: Any):
(nit): should diffusers_pipe be of type DiffusionPipeline?
> return output
>
> def _build_pipeline_kwargs(self, batch: Req, server_args: ServerArgs) -> dict:
Question: how do we handle the __call__ argument mismatches if they occur? For example, if a user specifies fps or num_frames for an image pipeline, how do we convey it to the user that those arguments will be ignored?
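One possible way to handle that (sketched here as a suggestion, not necessarily what this PR does) is to filter the built kwargs against the pipeline's `__call__` signature and warn about anything that gets dropped:

```python
import inspect
import logging

logger = logging.getLogger(__name__)

def filter_pipeline_kwargs(pipe, kwargs: dict) -> dict:
    """Drop kwargs the pipeline's __call__ doesn't accept, warning the user."""
    accepted = set(inspect.signature(pipe.__call__).parameters)
    dropped = sorted(k for k in kwargs if k not in accepted)
    if dropped:
        logger.warning("Ignoring arguments not supported by this pipeline: %s", dropped)
    return {k: v for k, v in kwargs.items() if k in accepted}
```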
> self.diffusers_pipe = self._load_diffusers_pipeline(model_path, server_args)
> self._detect_pipeline_type()
>
> def _load_diffusers_pipeline(self, model_path: str, server_args: ServerArgs) -> Any:
(nit): should we also let the users provide a quantization_config?
https://huggingface.co/docs/diffusers/main/en/quantization/overview#pipeline-level-quantization
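For what it's worth, a rough sketch of the pipeline-level option described at that link (the import path and argument names should be double-checked against the installed diffusers version; the model id is illustrative):

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # example model
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
)
```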
Diffusers Backend Support for SGLang Diffusion

Summary
This PR adds a new `diffusers` backend to SGLang's multimodal generation module, allowing users to run any model supported by the Hugging Face `diffusers` library through SGLang's infrastructure, even if SGLang doesn't have native support for that specific model.

Motivation
SGLang has excellent optimized pipelines for specific models (Flux, Wan, HunyuanVideo, etc.), but users often want to use less common diffusion models that don't yet have native SGLang implementations in their pipelines/workflows. This PR enables a fallback mechanism that wraps vanilla `diffusers` pipelines, selected via `--backend diffusers`.

Usage
The backend can be used from the CLI, in server mode, and through the Python API.

Changes

New Files
- `runtime/pipelines/diffusers_pipeline.py` - Main pipeline wrapper containing:
  - `DiffusersExecutionStage`: Pipeline stage that wraps diffusers execution
  - `DiffusersPipeline`: `ComposedPipelineBase` implementation for diffusers models
- `configs/pipeline_configs/diffusers_generic.py` - Generic pipeline config for the diffusers backend
- `configs/sample/diffusers_generic.py` - Generic sampling params for the diffusers backend

Modified Files
- `runtime/server_args.py` - Added a `Backend` enum (`AUTO`, `SGLANG`, `DIFFUSERS`) and the `--backend` CLI argument
- `registry.py` - Updated `get_model_info()` to fall back to `DiffusersPipeline` when `backend=DIFFUSERS`, or when `backend=AUTO` and no native support is found

Testing
Tested with:
- `briaai/BRIA-3.2` (custom BriaPipeline, not supported natively in sglang)
- `Qwen/Qwen-Image-Edit` (image editing)

Notes
- Some models need `--trust-remote-code`
- Requires a recent `diffusers` version (`pip install --upgrade diffusers` or install from source)