Add BLIP Diffusion #4388
Conversation
It has been two months since this came out, and I can't believe almost no one mentions it, let alone implements it.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Ported the ViT visual encoder. Checkpoints used: https://huggingface.co/ayushtues/blipdiffusion/tree/main/vision_encoder

```python
import torch

from src.diffusers.pipelines.blip_diffusion.modeling_blip2 import Blip2VisionConfig, Blip2VisionModel

image_input_dummy = torch.zeros(1, 3, 224, 224)
visual_encoder = Blip2VisionModel.from_pretrained('ayushtues/blipdiffusion', subfolder='visual_encoder')
image_embed = visual_encoder(image_input_dummy, return_dict=True).last_hidden_state
```

Next step: port the Blip2QFormer.
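A quick way to sanity-check the ported encoder output (a sketch; the exact sequence length and hidden size depend on the ViT variant in the checkpoint, so only generic properties are checked here):

```python
import torch

# image_embed comes from the snippet above: shape should be
# (batch, num_patches + 1, hidden_size), with no NaNs/Infs.
assert image_embed.shape[0] == image_input_dummy.shape[0]
assert torch.isfinite(image_embed).all()
print(image_embed.shape)
```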
@ayushtues maybe you can directly use BLIP-2 in Transformers as a dependency, rather than reimplementing it in Diffusers (similar to how CLIP or T5 aren't reimplemented in diffusers for Stable Diffusion)? You can also do
Hey @NielsRogge, I originally intended to do that, but as mentioned in huggingface/transformers#25245, the Blip2 implementation in Transformers didn't support multimodal feature extraction, so I went ahead with a local implementation in diffusers. If the feature gets added to the Transformers implementation, we can switch to importing it directly.
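For context, the dependency pattern being discussed here — reusing a transformers model as a pipeline component instead of reimplementing it — looks roughly like this for Stable Diffusion's CLIP text encoder (a sketch with the standard public checkpoints, not BLIP-Diffusion-specific code):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import StableDiffusionPipeline

# The text encoder and tokenizer live in transformers; diffusers just wires them
# into the pipeline instead of maintaining its own CLIP implementation.
text_encoder = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="text_encoder", torch_dtype=torch.float16
)
tokenizer = CLIPTokenizer.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="tokenizer")

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
)
```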
Update: I was able to port the model to diffusers, although a lot of the code needs refactoring/reuse and better integration. Colab link: https://colab.research.google.com/drive/1PDlO8-1kPnhTUOmQBv5a2cIBTdYp_7Pi?usp=sharing
Hey @sayakpaul, can you please review this PR?
@ayushtues hopefully final set of comments from my end before we can merge:

Zero-shot:

```python
from diffusers.pipelines import BlipDiffusionPipeline
from diffusers.utils import load_image
import torch
blip_diffusion_pipe = BlipDiffusionPipeline.from_pretrained(
"ayushtues/blipdiffusion", torch_dtype=torch.float16
).to("cuda")
cond_subject = "dog"
tgt_subject = "dog"
text_prompt_input = "swimming underwater"
cond_image = load_image(
"https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/dog.jpg"
)
iter_seed = 88888
guidance_scale = 7.5
num_inference_steps = 25
negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"
output = blip_diffusion_pipe(
text_prompt_input,
cond_image,
cond_subject,
tgt_subject,
guidance_scale=guidance_scale,
num_inference_steps=num_inference_steps,
neg_prompt=negative_prompt,
height=512,
width=512,
).images
output[0].save("image.png")
```

Control-guided (Canny):

```python
import torch  # needed for torch_dtype below

from diffusers.pipelines import BlipDiffusionControlNetPipeline
from diffusers.utils import load_image
from controlnet_aux import CannyDetector
blip_diffusion_pipe = BlipDiffusionControlNetPipeline.from_pretrained(
"ayushtues/blipdiffusion-controlnet", torch_dtype=torch.float16
).to("cuda")
style_subject = "flower" # subject that defines the style
tgt_subject = "teapot" # subject to generate.
text_prompt = "on a marble table"
cldm_cond_image = load_image(
"https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/kettle.jpg"
).resize((512, 512))
canny = CannyDetector()
cldm_cond_image = canny(cldm_cond_image, 30, 70, output_type="pil")
style_image = load_image(
"https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/flower.jpg"
)
guidance_scale = 7.5
num_inference_steps = 50
negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"
output = blip_diffusion_pipe(
text_prompt,
style_image,
cldm_cond_image,
style_subject,
tgt_subject,
guidance_scale=guidance_scale,
num_inference_steps=num_inference_steps,
neg_prompt=negative_prompt,
height=512,
width=512,
).images
output[0].save("image.png")
```

Control-guided (scribble):

```python
from diffusers import ControlNetModel  # used to swap in the scribble ControlNet below
from diffusers.pipelines import BlipDiffusionControlNetPipeline
from diffusers.utils import load_image
from controlnet_aux import HEDdetector
blip_diffusion_pipe = BlipDiffusionControlNetPipeline.from_pretrained(
"ayushtues/blipdiffusion-controlnet"
)
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble")
blip_diffusion_pipe.controlnet = controlnet
blip_diffusion_pipe.to("cuda")
style_subject = "flower" # subject that defines the style
tgt_subject = "bag" # subject to generate.
text_prompt = "on a table"
cldm_cond_image = load_image(
"https://huggingface.co/lllyasviel/sd-controlnet-scribble/resolve/main/images/bag.png"
).resize((512, 512))
hed = HEDdetector.from_pretrained("lllyasviel/Annotators")
cldm_cond_image = hed(cldm_cond_image)
style_image = load_image(
"https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/flower.jpg"
)
guidance_scale = 7.5
num_inference_steps = 50
negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"
output = blip_diffusion_pipe(
text_prompt,
style_image,
cldm_cond_image,
style_subject,
tgt_subject,
guidance_scale=guidance_scale,
num_inference_steps=num_inference_steps,
neg_prompt=negative_prompt,
height=512,
width=512,
).images
output[0].save("image.png")
```

Would be great if you could update the documentation to reflect this and also the model cards. Then we can edit the checkpoint path to reflect

Let me know if anything's unclear.
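One small note on the zero-shot snippet: `iter_seed` is defined but never passed to the pipeline. If reproducibility is wanted, the usual diffusers pattern is a seeded `torch.Generator`; a sketch, assuming the pipeline exposes the standard `generator` argument:

```python
import torch

# Seed the latent initialization so repeated runs give the same image
# (assumes the pipeline accepts the `generator` kwarg used across diffusers pipelines).
generator = torch.Generator(device="cuda").manual_seed(iter_seed)
output = blip_diffusion_pipe(
    text_prompt_input,
    cond_image,
    cond_subject,
    tgt_subject,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
    neg_prompt=negative_prompt,
    generator=generator,
    height=512,
    width=512,
).images
```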
Another point: have you tried using the pipeline on various subjects to see whether it faithfully renders them in the outputs? For example, I tried the zero-shot pipeline on this image with the following parameters, but it didn't faithfully render the subject. Is this expected?
But aren't they deviating from the subject a bit, or is that expected?
Thanks @ayushtues @sayakpaul for the addition. The results look as expected. In zero-shot inference, the subject appearance does deviate a bit, as suggested by the reported metrics (CLIP-I, DINO). More faithful results can be obtained with few-step fine-tuning.
^ @sayakpaul, if the original author says so :P
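For anyone who wants to put a number on the deviation, a rough CLIP-I-style check is the cosine similarity between CLIP image embeddings of the reference subject and the generated image. A sketch (the paper's exact evaluation protocol may differ, and the CLIP checkpoint choice here is arbitrary):

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from diffusers.utils import load_image

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

reference = load_image(
    "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/dog.jpg"
)
generated = load_image("image.png")  # output saved by the zero-shot example above

with torch.no_grad():
    inputs = processor(images=[reference, generated], return_tensors="pt")
    embeds = clip.get_image_features(**inputs)
    embeds = embeds / embeds.norm(dim=-1, keepdim=True)
    print("CLIP-I similarity:", (embeds[0] @ embeds[1]).item())
```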
patrickvonplaten left a comment:
Good to merge from my side
Alright. Really great work here, Ayush! We have asked internally to move the checkpoints to hf.co/Salesforce. Once this is done and the checkpoint paths have been updated in the model cards (should be done before we transfer, actually) and the docs, I will merge the PR :)
Let me know when the transfer is done, and I'll do the other changes.
@ayushtues we have got approval from @dxli94 to do the transfer. Please update the checkpoint paths and, once done, let me know here. Will transfer and merge.
Hey @sayakpaul, updated to
sayakpaul left a comment:
Amazing work! Thanks so much for your patience and for iterating!
As soon as the transfer is done, will merge and ship this 🚀
Transfer complete! Merging!
How do I load a LoRA file with BLIP Diffusion, @sayakpaul?
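The LoRA question above never got an answer in the thread. For what it's worth, the generic diffusers pattern is `load_lora_weights`; whether `BlipDiffusionPipeline` actually inherits the LoRA loader mixin needs checking, so treat this as a hypothetical sketch (the LoRA path and weight name are placeholders, and the repo id assumes the checkpoint kept its name after the Salesforce transfer):

```python
import torch
from diffusers.pipelines import BlipDiffusionPipeline

pipe = BlipDiffusionPipeline.from_pretrained(
    "Salesforce/blipdiffusion", torch_dtype=torch.float16
).to("cuda")

# Hypothetical: this only works if the pipeline class exposes the standard
# diffusers LoRA loader; otherwise the LoRA weights must be merged into the UNet manually.
pipe.load_lora_weights("path/to/lora_dir", weight_name="pytorch_lora_weights.safetensors")
```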
* Add BLIP Diffusion skeleton
* Add other model components
* Add BLIP2, need to change it for now
* Fix pipeline imports
* Load pretrained ViT
* Make qformer fwd pass same
* Replicate fwd passes
* Fix device bug
* Add accelerate functions
* Remove extra functions from Blip2
* Minor bug
* Integrate initial review changes
* Refactoring
* Refactoring
* Refactor
* Add controlnet
* Refactor
* Update conversion script
* Add image processor
* Shift postprocessing to ImageProcessor
* Refactor
* Fix device
* Add fast tests
* Update conversion script
* Fix checkpoint conversion script
* Integrate review changes
* Integrate review changes
* Remove unused functions from test
* Reuse HF image processor in Cond image
* Create new BlipImageProcessor based on transformers
* Fix image preprocessor
* Minor
* Minor
* Add canny preprocessing
* Fix controlnet preprocessing
* Fix blip diffusion test
* Add controlnet test
* Add initial doc strings
* Integrate review changes
* Refactor
* Update examples
* Remove DDIM comments
* Add copied from for prepare_latents
* Add type annotations
* Add docstrings
* Do black formatting
* Add batch support
* Make tests pass
* Make controlnet tests pass
* Black formatting
* Fix progress bar
* Fix some licensing comments
* Fix imports
* Refactor controlnet
* Make tests faster
* Edit examples
* Black formatting/Ruff
* Add doc
* Minor (Co-authored-by: Patrick von Platen <[email protected]>)
* Move controlnet pipeline
* Make tests faster
* Fix imports
* Fix formatting
* Fix make errors
* Fix make errors
* Minor
* Add suggested doc changes (Co-authored-by: Sayak Paul <[email protected]>)
* Edit docs
* Fix 16 bit loading
* Update examples
* Edit toctree
* Update docs/source/en/api/pipelines/blip_diffusion.md (Co-authored-by: Sayak Paul <[email protected]>)
* Minor
* Add tips
* Edit examples
* Update model paths

Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Sayak Paul <[email protected]>







This PR implements BLIP Diffusion as discussed in #4274
Notion for tracking progress/brainstorming: link
Model/Pipeline Description
BLIP Diffusion (Salesforce): https://dxli94.github.io/BLIP-Diffusion-website/
BLIP Diffusion enables subject-driven zero-shot image generation, which is probably its best USP.
Code with pre-trained weights: https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion
Paper: https://arxiv.org/abs/2305.14720
Abstract:
From the official website mentioned above:

TODO
HF Model Link: https://huggingface.co/ayushtues/blipdiffusion/tree/main
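To see which component subfolders the checkpoint ships with, a quick sketch using `huggingface_hub`:

```python
from huggingface_hub import list_repo_files

# Lists every file in the repo, which reveals the pipeline's component subfolders.
for f in list_repo_files("ayushtues/blipdiffusion"):
    print(f)
```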
Usage Examples
Zero-Shot Subject Driven Generation
Input


Output
Controlled subject-driven generation (Canny edge)
Canny-edge-based ControlNet example:



Input
Conditioning image for Canny Edge
Output
Controlled subject-driven generation (Scribble)
Scribble example:



Input
Conditioning image for Scribble
Output
CC
@sayakpaul