
Conversation

@ayushtues
Contributor

@ayushtues ayushtues commented Jul 31, 2023

This PR implements BLIP Diffusion as discussed in #4274

Notion for tracking progress/brainstorming: link

Model/Pipeline Description

BLIP diffusion (Salesforce): https://dxli94.github.io/BLIP-Diffusion-website/

BLIP-Diffusion enables zero-shot subject-driven image generation, which is probably its strongest selling point.

Code with pre-trained weights: https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion
Paper: https://arxiv.org/abs/2305.14720

Abstract:

Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task, called prompted context generation, which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications.

From the official website mentioned above:
image

TODO

  • Implement BLIPDiffusionPipeline
  • Script to convert pretrained weights into diffusers checkpoints
  • Add model cards for checkpoints and move checkpoints to appropriate repositories
  • Write tests
  • Add docstrings for new classes
  • Create documentation
  • Add usage example(s)

HF Model Link: https://huggingface.co/ayushtues/blipdiffusion/tree/main

Usage Examples

Zero-Shot Subject-Driven Generation

from diffusers.pipelines import BlipDiffusionPipeline
from diffusers.utils import load_image


blip_diffusion_pipe = BlipDiffusionPipeline.from_pretrained("ayushtues/blipdiffusion")
blip_diffusion_pipe.to("cuda")

cond_subject = ["dog"]
tgt_subject = ["dog"]
text_prompt_input = ["swimming underwater"]


cond_image = load_image("https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/dog.jpg")
num_output = 1

iter_seed = 88888
guidance_scale = 7.5
num_inference_steps = 50
negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"

for i in range(num_output):
    output = blip_diffusion_pipe(
        text_prompt_input,
        cond_image,
        cond_subject,
        tgt_subject,
        guidance_scale=guidance_scale,
        num_inference_steps=num_inference_steps,
        neg_prompt=negative_prompt,
        height=512,
        width=512,
    )
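
The loop above discards output. Assuming the pipeline returns the standard diffusers ImagePipelineOutput with a list of PIL images under .images (the revised examples later in this thread rely on exactly that), the results could be saved like so; the filename is only illustrative:

# Hedged sketch: persist the generated image(s) from the last call above.
for idx, image in enumerate(output.images):
    image.save(f"dog_underwater_{idx}.png")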

Input
image
Output
image

Controlled Subject-Driven Generation (Canny Edge)

from diffusers.pipelines import BlipDiffusionControlNetPipeline
from diffusers.utils import load_image
from controlnet_aux import CannyDetector

blip_diffusion_pipe = BlipDiffusionControlNetPipeline.from_pretrained("ayushtues/blipdiffusion-controlnet")
blip_diffusion_pipe.to("cuda")

style_subject = ["flower"]  # subject that defines the style
tgt_subject = ["teapot"]  # subject to generate
text_prompt = ["on a marble table"]
cldm_cond_image = load_image("https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/kettle.jpg").resize((512, 512))
canny = CannyDetector()
cldm_cond_image = canny(cldm_cond_image, 30, 70, output_type="pil")
cldm_cond_image = [cldm_cond_image]

style_image = load_image("https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/flower.jpg")


num_output = 1
iter_seed = 88888
guidance_scale = 7.5
num_inference_steps = 50
negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"

for i in range(num_output):
    output = blip_diffusion_pipe(
        text_prompt,
        style_image,
        cldm_cond_image,
        style_subject,
        tgt_subject,
        guidance_scale=guidance_scale,
        num_inference_steps=num_inference_steps,
        neg_prompt=negative_prompt,
        height=512,
        width=512,
    )

Canny edge-based ControlNet example -
Input
image
Conditioning image for Canny Edge
image
Output
image

Controlled Subject-Driven Generation (Scribble)

from diffusers.pipelines import BlipDiffusionControlNetPipeline
from diffusers import ControlNetModel
from diffusers.utils import load_image
from controlnet_aux import HEDdetector

blip_diffusion_pipe = BlipDiffusionControlNetPipeline.from_pretrained("ayushtues/blipdiffusion-controlnet")
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble")
blip_diffusion_pipe.controlnet = controlnet
blip_diffusion_pipe.to("cuda")

style_subject = ["flower"]  # subject that defines the style
tgt_subject = ["bag"]  # subject to generate
text_prompt = ["on a table"]
cldm_cond_image = load_image("https://huggingface.co/lllyasviel/sd-controlnet-scribble/resolve/main/images/bag.png").resize((512, 512))
hed = HEDdetector.from_pretrained("lllyasviel/Annotators")
cldm_cond_image = hed(cldm_cond_image)
cldm_cond_image = [cldm_cond_image]

style_image = load_image("https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/flower.jpg")


num_output = 1
iter_seed = 88888
guidance_scale = 7.5
num_inference_steps = 50
negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"

for i in range(num_output):
    output = blip_diffusion_pipe(
        text_prompt,
        style_image,
        cldm_cond_image,
        style_subject,
        tgt_subject,
        guidance_scale=guidance_scale,
        num_inference_steps=num_inference_steps,
        neg_prompt=negative_prompt,
        height=512,
        width=512,
    )

Scribble example -
Input
image
Conditioning image for Scribble
image
Output
image

CC

@sayakpaul

@oumad

oumad commented Jul 31, 2023

It has been 2 months since this came out; I can't believe almost no one has mentioned it, let alone implemented it.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@ayushtues
Contributor Author

ayushtues commented Aug 7, 2023

Ported the ViT visual encoder checkpoints used: https://huggingface.co/ayushtues/blipdiffusion/tree/main/vision_encoder

import torch

from src.diffusers.pipelines.blip_diffusion.modeling_blip2 import Blip2VisionConfig, Blip2VisionModel

image_input_dummy = torch.zeros(1, 3, 224, 224)
visual_encoder = Blip2VisionModel.from_pretrained("ayushtues/blipdiffusion", subfolder="visual_encoder")
image_embed = visual_encoder(image_input_dummy, return_dict=True).last_hidden_state

Next step - Porting the Blip2QFormer

@NielsRogge

@ayushtues maybe you can directly use BLIP-2 in Transformers as a dependency, rather than reimplementing it in Diffusers (similar to how CLIP or T5 aren't reimplemented in diffusers for Stable Diffusion)?

You can also do from transformers.models.blip_2.modeling_blip2 import Blip2VisionModel for instance, in case you only need the vision encoder
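
For illustration, a minimal sketch of that route (the default Blip2VisionConfig and the dummy input below are assumptions to show the interface, not code from this PR):

import torch
from transformers.models.blip_2.modeling_blip2 import Blip2VisionConfig, Blip2VisionModel

# Build the BLIP-2 vision tower that already ships with Transformers,
# instead of re-porting the ViT inside Diffusers.
config = Blip2VisionConfig()  # randomly initialized, just to illustrate the call shape
vision_encoder = Blip2VisionModel(config)

dummy_image = torch.zeros(1, 3, config.image_size, config.image_size)
with torch.no_grad():
    image_embeds = vision_encoder(dummy_image, return_dict=True).last_hidden_state
# image_embeds has shape (1, num_patches + 1, hidden_size)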

@ayushtues
Contributor Author

Hey @NielsRogge, I originally intended to do that, but as mentioned in huggingface/transformers#25245, the Blip2 implementation in Transformers didn't support multimodal feature extraction, so I went ahead with a local implementation in diffusers.

If the feature gets added to the transformers implementation, we can switch to importing it directly.

@ayushtues
Contributor Author

Update: I was able to port the model to diffusers, although a lot of the code still needs refactoring/reuse and better integration.

Colab link: https://colab.research.google.com/drive/1PDlO8-1kPnhTUOmQBv5a2cIBTdYp_7Pi?usp=sharing
HF Model link: https://huggingface.co/ayushtues/blipdiffusion/tree/main

cond_subject = "dog"
tgt_subject = "dog"
text_prompt_input = "swimming underwater"

Input image -
OIP

Output image -
download

@ayushtues
Contributor Author

ayushtues commented Aug 13, 2023

Hey @sayakpaul can you please do a review of this PR?

@sayakpaul
Member

@ayushtues hopefully the final set of comments from my end before we can merge:

  • Resolve the open comments.
  • Change the examples to reflect the commonly followed practices:

Zero-shot:

from diffusers.pipelines import BlipDiffusionPipeline
from diffusers.utils import load_image
import torch

blip_diffusion_pipe = BlipDiffusionPipeline.from_pretrained(
    "ayushtues/blipdiffusion", torch_dtype=torch.float16
).to("cuda")

cond_subject = "dog"
tgt_subject = "dog"
text_prompt_input = "swimming underwater"

cond_image = load_image(
    "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/dog.jpg"
)

iter_seed = 88888
guidance_scale = 7.5
num_inference_steps = 25
negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"

output = blip_diffusion_pipe(
    text_prompt_input,
    cond_image,
    cond_subject,
    tgt_subject,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
    neg_prompt=negative_prompt,
    height=512,
    width=512,
).images
output[0].save("image.png")

Control-guided (Canny):

from diffusers.pipelines import BlipDiffusionControlNetPipeline
from diffusers.utils import load_image
from controlnet_aux import CannyDetector
import torch

blip_diffusion_pipe = BlipDiffusionControlNetPipeline.from_pretrained(
    "ayushtues/blipdiffusion-controlnet", torch_dtype=torch.float16
).to("cuda")

style_subject = "flower"  # subject that defines the style
tgt_subject = "teapot"  # subject to generate.
text_prompt = "on a marble table"

cldm_cond_image = load_image(
    "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/kettle.jpg"
).resize((512, 512))
canny = CannyDetector()
cldm_cond_image = canny(cldm_cond_image, 30, 70, output_type="pil")
style_image = load_image(
    "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/flower.jpg"
)

guidance_scale = 7.5
num_inference_steps = 50
negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"

output = blip_diffusion_pipe(
    text_prompt,
    style_image,
    cldm_cond_image,
    style_subject,
    tgt_subject,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
    neg_prompt=negative_prompt,
    height=512,
    width=512,
).images
output[0].save("image.png")

Control-guided (scribble):

from diffusers.pipelines import BlipDiffusionControlNetPipeline
from diffusers import ControlNetModel
from diffusers.utils import load_image
from controlnet_aux import HEDdetector

blip_diffusion_pipe = BlipDiffusionControlNetPipeline.from_pretrained(
    "ayushtues/blipdiffusion-controlnet"
)
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble")
blip_diffusion_pipe.controlnet = controlnet
blip_diffusion_pipe.to("cuda")

style_subject = "flower"  # subject that defines the style
tgt_subject = "bag"  # subject to generate.
text_prompt = "on a table"
cldm_cond_image = load_image(
    "https://huggingface.co/lllyasviel/sd-controlnet-scribble/resolve/main/images/bag.png"
).resize((512, 512))
hed = HEDdetector.from_pretrained("lllyasviel/Annotators")
cldm_cond_image = hed(cldm_cond_image)
style_image = load_image(
    "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/flower.jpg"
)

guidance_scale = 7.5
num_inference_steps = 50
negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"

output = blip_diffusion_pipe(
    text_prompt,
    style_image,
    cldm_cond_image,
    style_subject,
    tgt_subject,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
    neg_prompt=negative_prompt,
    height=512,
    width=512,
).images
output[0].save("image.png")

Would be great if you could update the documentation to reflect this and also the model cards.

Then we can edit the checkpoint paths to reflect Salesforce, transfer the checkpoints, and finally merge the PR.

Let me know if anything's unclear.

@sayakpaul
Member

Another point: have you tried using the pipeline on various subjects to see if it's able to faithfully render them in the outputs?

For example, I tried the zero-shot rendition pipeline on this image with the following parameters:

cond_subject = "backpack"
tgt_subject = "backpack"
text_prompt_input = "in a busy street"

But it didn't faithfully render the subject:

image

Is this expected?

@ayushtues
Contributor Author

ayushtues commented Sep 20, 2023

image
image

These are some examples I'm getting; I think they are okay? (Multiple samples, since I am not fixing the generator.)
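
For reproducible side-by-side comparisons, the runs could be seeded explicitly. A minimal sketch, assuming the pipeline accepts the usual diffusers generator argument:

import torch

# Fix the random generator so repeated runs produce the same sample.
generator = torch.Generator(device="cuda").manual_seed(88888)
output = blip_diffusion_pipe(
    text_prompt_input,
    cond_image,
    cond_subject,
    tgt_subject,
    guidance_scale=7.5,
    num_inference_steps=50,
    neg_prompt=negative_prompt,
    generator=generator,
    height=512,
    width=512,
)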

@sayakpaul
Member

But aren't they deviating from the subject a bit, or is that expected?

@dxli94

dxli94 commented Sep 20, 2023

Thanks @ayushtues @sayakpaul for the addition. The results look as expected. In zero-shot inference, the subject appearance does deviate a bit, as suggested by the reported metrics (CLIP-I, DINO). More faithful results can be obtained with few-step fine-tuning.

@ayushtues
Contributor Author

^ @sayakpaul, if the original author says so :P

Contributor

@patrickvonplaten patrickvonplaten left a comment


Good to merge from my side

@sayakpaul
Member

Alright. Really, great work here Ayush!

We have asked internally to move the checkpoints to hf.co/Salesforce. Once this is done and the checkpoint paths have been updated in the model cards (should be done before we transfer actually) and the docs, I will merge the PR :)

@ayushtues
Contributor Author

Alright. Really, great work here Ayush!

We have asked internally to move the checkpoints to hf.co/Salesforce. Once this is done and the checkpoint paths have been updated in the model cards (should be done before we transfer actually) and the docs, I will merge the PR :)

Let me know when the transfer is done, I'll do the other changes

@sayakpaul
Member

@ayushtues we have got approval from @dxli94 to do the transfer. Please update the checkpoint paths and once done, let me know here. Will transfer and merge.

@ayushtues
Contributor Author

Hey @sayakpaul, updated to Salesforce/blipdiffusion & Salesforce/blipdiffusion-controlnet. Let me know if anything else is needed.

Member

@sayakpaul sayakpaul left a comment


Amazing work! Thanks so much for your patience and for iterating!

As soon as the transfer is done, will merge and ship this 🚀

@sayakpaul
Member

Transfer complete! Merging!

@sayakpaul sayakpaul merged commit 157c901 into huggingface:main Sep 21, 2023
@yanchaoguo

Transfer complete! Merging!

Excellent! I have used it.
image
image

@yanchaoguo

How do I load a LoRA file using BLIP Diffusion? @sayakpaul

yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
* Add BLIP Diffusion skeleton

* Add other model components

* Add BLIP2, need to change it for now

* Fix pipeline imports

* Load pretrained ViT

* Make qformer fwd pass same

* Replicate fwd passes

* Fix device bug

* Add accelerate functions

* Remove extra functions from Blip2

* Minor bug

* Integrate initial review changes

* Refactoring

* Refactoring

* Refactor

* Add controlnet

* Refactor

* Update conversion script

* Add image processor

* Shift postprocessing to ImageProcessor

* Refactor

* Fix device

* Add fast tests

* Update conversion script

* Fix checkpoint conversion script

* Integrate review changes

* Integrate reivew changes

* Remove unused functions from test

* Reuse HF image processor in Cond image

* Create new BlipImageProcessor based on transfomers

* Fix image preprocessor

* Minor

* Minor

* Add canny preprocessing

* Fix controlnet preprocessing

* Fix blip diffusion test

* Add controlnet test

* Add initial doc strings

* Integrate review changes

* Refactor

* Update examples

* Remove DDIM comments

* Add copied from for prepare_latents

* Add type anotations

* Add docstrings

* Do black formatting

* Add batch support

* Make tests pass

* Make controlnet tests pass

* Black formatting

* Fix progress bar

* Fix some licensing comments

* Fix imports

* Refactor controlnet

* Make tests faster

* Edit examples

* Black formatting/Ruff

* Add doc

* Minor

Co-authored-by: Patrick von Platen <[email protected]>

* Move controlnet pipeline

* Make tests faster

* Fix imports

* Fix formatting

* Fix make errors

* Fix make errors

* Minor

* Add suggested doc changes

Co-authored-by: Sayak Paul <[email protected]>

* Edit docs

* Fix 16 bit loading

* Update examples

* Edit toctree

* Update docs/source/en/api/pipelines/blip_diffusion.md

Co-authored-by: Sayak Paul <[email protected]>

* Minor

* Add tips

* Edit examples

* Update model paths

---------

Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Sayak Paul <[email protected]>
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024