
Conversation

@okotaku (Contributor) commented Sep 8, 2023

What does this PR do?

Support IP-Adapter.
github: https://github.com/tencent-ailab/IP-Adapter
arxiv: https://arxiv.org/abs/2308.06721
original weights: https://huggingface.co/h94/IP-Adapter

How to run inference

test weights: https://huggingface.co/takuoko/IP-Adapter-test

To run inference with the diffusers pipeline, I converted the weights from the original repo h94/IP-Adapter.

from diffusers.pipelines import StableDiffusionPipeline, IPAdapterPipeline
import PIL.Image

pipe = StableDiffusionPipeline.from_pretrained(
    'runwayml/stable-diffusion-v1-5',
    #torch_dtype=torch.float16,
    feature_extractor=None,
    safety_checker=None,
)
ip_model = IPAdapterPipeline.from_pretrained('takuoko/IP-Adapter-test', pipeline=pipe)
ip_model.load_ip_adapter('h94/IP-Adapter', subfolder='models', weight_name='ip-adapter_sd15.bin')
ip_model.to('cuda:0')
image = PIL.Image.open("woman.png")

images = ip_model(
    example_image=image,
    prompt='best quality, high quality',
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_images_per_prompt=1,
    num_inference_steps=50,
)
images.images[0].save('demo.png')

Image prompt: [woman.png]

Output: [demo.png]

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Core library:

@okotaku (Contributor, Author) commented Sep 8, 2023

@sayakpaul @patrickvonplaten What do you think about this pipeline design?

        scale=1.0,
    ):
        if scale != 1.0:
            warnings.warn("`scale` of IPAttnProcessor should be set by `IPAdapterPipeline.set_scale`")
Member

Can't we use the cross_attention_kwargs mechanism here?

Contributor

Currently, IPAttnProcessor only exists in the cross-attention layers, but cross_attention_kwargs is passed to ALL attention processors, so passing extra parameters to the self-attention processors results in an error.

Maybe the standard processors could be modified so that they accept and ignore extra parameters?

for example, change

def __call__(
        self,
        attn: Attention,
        hidden_states,
        encoder_hidden_states=None,
        attention_mask=None,
        temb=None,
        scale=1.0,
    ):

to

def __call__(
        self,
        attn: Attention,
        hidden_states,
        encoder_hidden_states=None,
        attention_mask=None,
        temb=None,
        scale=1.0,
        *_args, 
        **_kwargs
    ):

@sayakpaul (Member)

Thanks for the WIP PR. Could you help understand the adapter a bit better? Like what does it do, etc.? The results don't look great to me either. So, a bit more context would be helpful.

@okotaku (Contributor, Author) commented Sep 8, 2023

@sayakpaul

From the paper: IP-Adapter is an effective and lightweight adapter that achieves image prompt capability for pretrained text-to-image diffusion models. The key design of IP-Adapter is a decoupled cross-attention mechanism that separates the cross-attention layers for text features and image features.

[Screenshot from the IP-Adapter paper]

We can use an image as a prompt. We can combine IP-Adapter with other methods like ControlNet.

[Screenshot from the IP-Adapter paper]

Decoupled cross-attention: separate cross-attention layers for the text prompt and for the image prompt.
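
To make the mechanism concrete, here is a minimal sketch of the decoupled cross-attention computation (an illustration only, not the diffusers implementation; the projection arguments and the use of scaled_dot_product_attention are assumptions):

import torch.nn.functional as F

def decoupled_cross_attention(query, text_states, image_states, to_k, to_v, to_k_ip, to_v_ip, scale=1.0):
    # text branch: the original cross-attention key/value projections
    text_out = F.scaled_dot_product_attention(query, to_k(text_states), to_v(text_states))

    # image branch: new key/value projections (to_k_ip / to_v_ip) trained on image features
    image_out = F.scaled_dot_product_attention(query, to_k_ip(image_states), to_v_ip(image_states))

    # both branches share the same query and are summed with a tunable scale
    return text_out + scale * image_out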

You can also check the demo notebook: https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter_demo.ipynb

@sayakpaul (Member)

Seems cool, but the additional image encoder and the adapter modules are heavy IMO, no? The image encoder is 3.69 GB and the adapter is 700 MB. Given that, do you think the gains are reasonable?

We have #4388 too.

@okotaku (Contributor, Author) commented Sep 8, 2023

In BLIP-Diffusion, the vision encoder is 1.16 GB and the Q-Former is 1.98 GB for SDv1.5.
https://huggingface.co/ayushtues/blipdiffusion

In IP-Adapter, the image encoder is 2.53 GB and the adapter is 44 MB for SDv1.5; for SDXL, they are 3.69 GB and 770 MB.

Both can be combined with other methods like ControlNet.

One advantage is that trained weights are available for SDXL.

From the paper: "We also compare our IP-Adapter with other methods including Versatile Diffusion, BLIP Diffusion [31], Uni-ControlNet, T2I-Adapter, ControlNet Shuffle, and ControlNet Reference-only. The comparison results are shown in Figure 9. Compared with other existing methods, our method can generate superior results in both image quality and alignment with multimodal prompts."

[Screenshot: Figure 9 comparison from the paper]

@sayakpaul (Member)

Alright! I have a better understanding now. Flexibility in controlling the generation process is definitely desirable. I think it also helps us achieve zero-shot subject-driven generation (like BLIP Diffusion)?

Also, the example you showed in the OP refers to image variation, right?

Regardless, the result looks okayish to me.

@patrickvonplaten what are your thoughts?

@okotaku (Contributor, Author) commented Sep 8, 2023

Alright! I have a better understanding now. Flexibility in controlling the generation process is definitely desirable. I think it also helps us achieve zero-shot subject-driven generation (like BLIP Diffusion)?

I think so too!

Also, the example you showed in the OP refers to image variation, right?

Exactly.

@sayakpaul (Member)

I am leaning towards having the pipeline as a part of the core library. Let's see what @patrickvonplaten has to share. Given that the pipeline can do image variation and zero-shot subject-driven generation, can be combined with ControlNets, etc., I do see potential use cases.

@ultranity (Contributor)

It's a little confusing to have a load_ip_adapter step. Since every IP-Adapter weight is matched to the given image encoder and projector, why not merge it into the init process? @okotaku

@okotaku (Contributor, Author) commented Sep 14, 2023

@ultranity Because the IPAttnProcessor needs to be set before its weights can be loaded:

  1. load pipeline
  2. set IPAttnProcessor
  3. load the weights of IPAttnProcessor

This flow is similar to LoRA.
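
As a sketch, those three steps map onto the API proposed in this PR like this (IPAdapterPipeline only exists in this PR, so treat the snippet as illustrative):

from diffusers.pipelines import StableDiffusionPipeline, IPAdapterPipeline

# 1. load pipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# 2. and 3. wrapping the pipeline and calling load_ip_adapter set the IPAttnProcessor
# on the cross-attention layers and then load its weights
ip_model = IPAdapterPipeline.from_pretrained("takuoko/IP-Adapter-test", pipeline=pipe)
ip_model.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")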

class IPAdapterPipeline(DiffusionPipeline):
    def __init__(
        self,
        pipeline: DiffusionPipeline,
Contributor

We don't really allow passing a pipeline into another pipeline. From what I can see, the pipeline also only works for Stable Diffusion 1 and 2 anyway (there is only one text encoder). Could we maybe just follow the previous logic and create a StableDiffusionIPAdapterPipeline(DiffusionPipeline) to begin with?

Not too keen on adding a new design paradigm here.

@patrickvonplaten (Contributor)

This PR seems to be a bit in limbo, can someone take it over? cc @DN6 @yiyixuxu maybe?

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

The github-actions bot added the stale label (Issues that haven't received updates) on Oct 25, 2023.
@andypotato

Has there been any progress on this? IP Adapter works great for me and I'd love to see it supported in diffusers

@sayakpaul (Member)

@andypotato could you provide some example results?

@andypotato

@sayakpaul sure, here are some examples I created with IP Adapter using their "ip-adapter-plus-face_sd15.bin" checkpoint.

For convenience, these were created using A1111, ReV Animated 1.22, and the ControlNet extension to load the IP-Adapter models, although I can also generate similar images using diffusers and the official repo at https://github.com/tencent-ailab/IP-Adapter.

First image is the reference image:

[Reference image]

These are the generated images; the likeness is really quite stunning.

[Generated image 1]

[Generated image 2]

@sayakpaul (Member)

Wow, that's quite impressive indeed.

@apolinario has made us aware that IP Adapters are becoming extremely popular in the community. So, I'd be in favor of supporting it in diffusers too. @patrickvonplaten what are your thoughts?

@patrickvonplaten (Contributor)

Wow, that's quite impressive indeed.

@apolinario has made us aware that IP Adapters are becoming extremely popular in the community. So, I'd be in favor of supporting it in diffusers too. @patrickvonplaten what are your thoughts?

Agree let's add it to main!

@patrickvonplaten (Contributor)

cc @DN6 @yiyixuxu can you review here as well?

@sayakpaul (Member)

@okotaku I went into your PR and addressed #4944 (comment). I hope that's okay?

@sayakpaul (Member) commented Nov 2, 2023

It's a little confusing to have a load_ip_adapter step. Since every IP-Adapter weight is matched to the given image encoder and projector, why not merge it into the init process? @okotaku

IIUC, the encoder and the projection modules are not changed for every IP-Adapter module being loaded. So I'm not sure why having a separate load_ip_adapter() method wouldn't be beneficial. @ultranity
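
For example (a sketch assuming a StableDiffusionPipeline that exposes load_ip_adapter, as in the example further down in this thread; both weight names are from the h94/IP-Adapter repo), a separate method lets you swap adapter checkpoints without rebuilding the pipeline or reloading the image encoder:

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
# ... run inference ...

# later, switch to a different adapter checkpoint while keeping the same pipeline and image encoder
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15_light.bin")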

@sayakpaul (Member) commented Nov 8, 2023

@yiyixuxu summary of the changes I have done so far this morning:

  • Delegated the IP-Adapter loading logic to a separate mixin class (thanks for the idea!)
  • Rejigged how we handle passing the inputs through image_projection so that we don't have to touch the vanilla UNet for SD.
  • Some minor refactorings that I think make the code cleaner.

The code from #4944 (comment) works and produces expected results. However, I updated it to better reflect the loading of the image encoder from the official repo:

from transformers import CLIPVisionModelWithProjection
from diffusers import StableDiffusionPipeline
import torch
from diffusers.utils import load_image

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", 
    subfolder="models/image_encoder",
    torch_dtype=torch.float16,
).to("cuda")

pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    image_encoder=image_encoder,
    torch_dtype=torch.float16
).to("cuda")

image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png")

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")


images = pipeline(
    prompt='best quality, high quality', 
    image_prompt=image,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", 
    num_inference_steps=50,
).images

images[0].save("sayak_test_out.png")
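
Regarding the first bullet above (delegating the loading logic to a mixin), here is a rough sketch of the split; the class and hook names are simplified stand-ins, and the checkpoint layout ("image_proj" and "ip_adapter" keys, as in ip-adapter_sd15.bin) is assumed:

import torch
from huggingface_hub import hf_hub_download

class IPAdapterLoaderMixinSketch:
    """Simplified illustration of a load_ip_adapter() mixin; not the final diffusers class."""

    def load_ip_adapter(self, repo_id, subfolder, weight_name):
        ckpt_path = hf_hub_download(repo_id, weight_name, subfolder=subfolder)
        state_dict = torch.load(ckpt_path, map_location="cpu")

        # the checkpoint carries two groups of weights: the image projection layer
        # and the per-layer IP-Adapter attention (to_k_ip / to_v_ip) weights
        image_proj_state = state_dict["image_proj"]
        ip_adapter_state = state_dict["ip_adapter"]

        # hand everything to the UNet so the pipeline itself owns no extra weights
        # (_install_ip_adapter is a hypothetical hook standing in for the real logic)
        self.unet._install_ip_adapter(image_proj_state, ip_adapter_state)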


image = image.to(device=device, dtype=dtype)
image_embeds = self.image_encoder(image).image_embeds
projected_image_embeds = self.image_projection(image_embeds)
Collaborator

I didn't do this because I don't think we should attach weights to pipelines

I think it is better to use the ImageProjection layer in the UNet, cc @patrickvonplaten here.

Collaborator

The encode_image I added for IP-Adapter is no different from encode_image in existing pipelines, in the same way that different pipelines may have slightly different implementations of encode_prompt but essentially do the same thing.

Member

I didn't do this because I don't think we should attach weights to pipelines

Fair point.

The encode_image I added for IP-Adapter is no different from encode_image in existing pipelines, in the same way that different pipelines may have slightly different implementations of encode_prompt but essentially do the same thing.

Yeah, with your implementation that doesn't use the projection module, it makes sense to name the function encode_image(), as that is exactly what it is doing.
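
For reference, a minimal sketch of an encode_image() that only encodes (no projection), roughly in the spirit of what is described above; the exact signature and the zeroed unconditional embedding are assumptions:

import torch

def encode_image(self, image, device, num_images_per_prompt):
    # preprocess with the pipeline's CLIP feature extractor if a PIL image is passed
    if not isinstance(image, torch.Tensor):
        image = self.feature_extractor(image, return_tensors="pt").pixel_values

    image = image.to(device=device, dtype=self.image_encoder.dtype)
    image_embeds = self.image_encoder(image).image_embeds
    image_embeds = image_embeds.repeat_interleave(num_images_per_prompt, dim=0)

    # zero embedding as the unconditional input for classifier-free guidance
    uncond_image_embeds = torch.zeros_like(image_embeds)
    return image_embeds, uncond_image_embeds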

@sayakpaul (Member)

Okay for me either way.

If we are attaching ad-hoc weights to the unet I think that is still okay here. I don't see a lot of side-effects of doing it.

I'd still like an opinion from @patrickvonplaten here.

import torch.nn.functional as F
from torch import nn

from ..loaders import PatchedLoraProjection, text_encoder_attn_modules, text_encoder_mlp_modules
Contributor

Can we revert this change? I don't think it's needed.

Member

Otherwise, there's a circular import problem.

            clip_skip=self.clip_skip,
        )

        if image_prompt is not None:
Contributor

Cool that this is all that is needed!

Comment on lines +31 to +38
from .models.attention_processor import (
AttnProcessor,
AttnProcessor2_0,
IPAdapterAttnProcessor,
IPAdapterAttnProcessor2_0,
IPAdapterControlNetAttnProcessor,
IPAdapterControlNetAttnProcessor2_0,
)
Member

This creates a circular dependency (regarding https://github.com/huggingface/diffusers/pull/4944/files#r1386535787). Let me know if there's a better way to address this.
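
One common way to break such a cycle (a sketch, not necessarily the right fix for this PR) is to defer the import into the function that needs the processors instead of importing at module level:

def _ip_adapter_attn_processor_classes():
    # imported lazily so the two modules do not import each other at module load time
    from .models.attention_processor import (
        IPAdapterAttnProcessor,
        IPAdapterAttnProcessor2_0,
    )

    return IPAdapterAttnProcessor, IPAdapterAttnProcessor2_0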

@sayakpaul (Member) commented Nov 8, 2023

@patrickvonplaten I am trying to address your concerns around save_pretrained(). This is my testing script:

from transformers import CLIPVisionModelWithProjection
from diffusers import StableDiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor2_0
import torch
import numpy as np
import tempfile

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", 
    subfolder="models/image_encoder",
    torch_dtype=torch.float16,
).to("cuda")

pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    image_encoder=image_encoder,
    torch_dtype=torch.float16
).to("cuda")
output = pipeline("hey", num_inference_steps=10, output_type="np").images

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")

ckpt_path = "ip_adapter_loaded_pipeline"

with tempfile.TemporaryDirectory() as tmpdirname:
    pipeline.save_pretrained(tmpdirname)
    pipeline.unet.set_attn_processor(AttnProcessor2_0())
    loaded_pipeline = StableDiffusionPipeline.from_pretrained(tmpdirname, torch_dtype=torch.float16).to("cuda")

output_loaded = loaded_pipeline("hey", num_inference_steps=10, output_type="np").images

print(np.allclose(output[0, :3, :3, -1], output_loaded[0, :3, :3, -1], atol=1e-4, rtol=1e-4))

I am facing this:

Some weights of the model checkpoint were not used when initializing UNet2DConditionModel: 
 ['down_blocks.0.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, mid_block.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, mid_block.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.processor.to_v_ip.weight']

However, the assertion passes. Since that is the case, I think that's fine for now. Later, we can work on an unload_ip_adapter() method and document it so that users are aware of these things.
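
A sketch of what that unload_ip_adapter() could do, mirroring the set_attn_processor(AttnProcessor2_0()) reset already used in the script above (the exact behavior is still an open question):

from diffusers.models.attention_processor import AttnProcessor2_0

def unload_ip_adapter(pipeline):
    # restore the stock attention processors so the extra to_k_ip / to_v_ip weights
    # are no longer part of the UNet state dict that save_pretrained() serializes
    pipeline.unet.set_attn_processor(AttnProcessor2_0())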

WDYT?

@yiyixuxu what's your take?

scheduler: KarrasDiffusionSchedulers,
safety_checker: StableDiffusionSafetyChecker,
feature_extractor: CLIPImageProcessor,
image_encoder: CLIPVisionModelWithProjection = None,
Member

@patrickvonplaten defaulting to None breaks the serialization compatibility (as demonstrated in #4944 (comment)). Not defaulting to None breaks the existing tests. Any suggestions?

Collaborator

I think we should try to update the tests instead.

Member

Same.

@sayakpaul (Member) commented Nov 9, 2023

Chatted with @okotaku over Slack. We're okay closing this one in favor of #5713 (where @okotaku is already a co-author of the commits). Thanks very much for laying out the initial design, truly appreciated.

@sayakpaul closed this on Nov 9, 2023.

Labels: stale (Issues that haven't received updates)

Projects: None yet

9 participants