[WIP] Support IP-Adapter #4944
Conversation
@sayakpaul @patrickvonplaten What do you think about this pipeline design?
    scale=1.0,
):
    if scale != 1.0:
        warnings.warn("`scale` of IPAttnProcessor should be set by `IPAdapterPipeline.set_scale`")
Can't we use the cross_attention_kwargs mechanism here?
Currently, IPAttnProcessor only exists in the cross-attention layers, while cross_attention_kwargs is passed to ALL attention processors; passing extra parameters to the self-attention processors results in an error.
Maybe the standard processors could be modified so that they accept redundant parameters?
For example, change
diffusers/src/diffusers/models/attention_processor.py
Lines 546 to 554 in 1037287
def __call__(
    self,
    attn: Attention,
    hidden_states,
    encoder_hidden_states=None,
    attention_mask=None,
    temb=None,
    scale=1.0,
):
to
def __call__(
    self,
    attn: Attention,
    hidden_states,
    encoder_hidden_states=None,
    attention_mask=None,
    temb=None,
    scale=1.0,
    *_args,
    **_kwargs,
):
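For illustration, here is a minimal, self-contained sketch of the "tolerate redundant kwargs" idea: the default processor ignores unknown keyword arguments while the IP-Adapter processor consumes an extra scale, so a single cross_attention_kwargs dict can be broadcast to every attention layer without raising a TypeError. Class and kwarg names below are illustrative, not the actual diffusers implementations.

class AttnProcessorSketch:
    """Stand-in for a default processor; tolerates and ignores unknown kwargs."""

    def __call__(self, hidden_states, encoder_hidden_states=None, *_args, **_kwargs):
        return hidden_states


class IPAttnProcessorSketch:
    """Stand-in for the IP-Adapter processor; consumes an extra `ip_adapter_scale`."""

    def __call__(self, hidden_states, encoder_hidden_states=None, ip_adapter_scale=1.0, **_kwargs):
        return ip_adapter_scale * hidden_states


shared = {"ip_adapter_scale": 0.6}             # e.g. pipe(..., cross_attention_kwargs=shared)
print(AttnProcessorSketch()(1.0, **shared))    # 1.0 -- extra kwarg is ignored
print(IPAttnProcessorSketch()(1.0, **shared))  # 0.6 -- extra kwarg is consumed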
Thanks for the WIP PR. Could you help us understand the adapter a bit better? Like what does it do, etc.? The results don't look great to me either. So, a bit more context would be helpful.
IP-Adapter is an effective and lightweight adapter to achieve image prompt capability for pretrained text-to-image diffusion models.
We can use an image as a prompt, and we can combine IP-Adapter with other methods like ControlNet.
Decoupled cross-attention: separate cross-attention for the text prompt and for the image prompt. You can also check the demo notebook: https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter_demo.ipynb
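As a minimal sketch of what decoupled cross-attention means (my own illustration with toy shapes, not the actual IPAttnProcessor code): the query attends separately to the text tokens and to the image-prompt tokens, and the image branch, which uses the only newly trained projections (to_k_ip / to_v_ip), is added with a scale.

import torch
import torch.nn.functional as F
from torch import nn

class DecoupledCrossAttentionSketch(nn.Module):
    """Illustrative decoupled cross-attention: text branch + scaled image branch."""

    def __init__(self, dim: int, scale: float = 1.0):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)     # text keys (frozen in IP-Adapter)
        self.to_v = nn.Linear(dim, dim)     # text values (frozen in IP-Adapter)
        self.to_k_ip = nn.Linear(dim, dim)  # image keys (newly trained)
        self.to_v_ip = nn.Linear(dim, dim)  # image values (newly trained)
        self.scale = scale

    def forward(self, hidden_states, text_embeds, image_embeds):
        q = self.to_q(hidden_states)
        # text-prompt cross-attention
        text_out = F.scaled_dot_product_attention(q, self.to_k(text_embeds), self.to_v(text_embeds))
        # image-prompt cross-attention (the decoupled branch)
        image_out = F.scaled_dot_product_attention(q, self.to_k_ip(image_embeds), self.to_v_ip(image_embeds))
        return text_out + self.scale * image_out

# toy shapes: 64 latent tokens, 77 text tokens, 4 image-prompt tokens, dim 32
attn = DecoupledCrossAttentionSketch(dim=32, scale=0.6)
out = attn(torch.randn(1, 64, 32), torch.randn(1, 77, 32), torch.randn(1, 4, 32))
print(out.shape)  # torch.Size([1, 64, 32])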
Seems cool, but the additional image encoder and the adapter modules are heavy IMO, no? The image encoder is 3.69 GB and the adapter is 700 MB. Given that, do you think the gains are reasonable? We have #4388 too.
In BLIP-Diffusion, the vision encoder is 1.16 GB and the Q-Former is 1.98 GB for SD v1.5. In IP-Adapter, the image encoder is 2.53 GB and the adapter is 44 MB for SD v1.5, and 3.69 GB and 770 MB for SDXL. Both can be combined with other methods like ControlNet. One advantage is that there are trained weights for SDXL.
Alright! I have a better understanding now. Flexibility in controlling the generation process is definitely desirable. I think it also helps us achieve zero-shot subject-driven generation (like BLIP-Diffusion)? Also, the example you showed in the OP refers to image variation, right? Regardless, the result looks okayish to me. @patrickvonplaten what are your thoughts?
I think so too!
Exactly.
I am leaning towards having the pipeline as a part of the core library. Let's see what @patrickvonplaten has to share. Given the pipeline can do image variations, zero-shot subject-driven generation, can be combined with ControlNets, etc. -- I do see potential use-cases.
It's a little confusing to have a load_ip_adapter step: since every IP-Adapter weight is matched to the given image encoder and projector, why not merge it into the init process? @okotaku
@ultranity Because we should set
This flow is similar to LoRA.
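To make the analogy concrete, here is a hedged sketch of the two-step flow: the base pipeline is built first, and the adapter weights are attached afterwards, just like load_lora_weights. The LoRA repo id is a placeholder; the load_ip_adapter call follows the API discussed later in this thread.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# LoRA: adapter weights are attached to an already-constructed pipeline
pipe.load_lora_weights("some-user/some-lora")  # placeholder repo id

# IP-Adapter: the same two-step flow proposed in this PR
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")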
class IPAdapterPipeline(DiffusionPipeline):
    def __init__(
        self,
        pipeline: DiffusionPipeline,
We don't really allow passing a pipeline into another pipeline. From what I can see, the pipeline also only works for Stable Diffusion 1 and 2 anyway (there is only one text encoder) - could we maybe just follow the previous logic and create a StableDiffusionIPAdapterPipeline(DiffusionPipeline) to begin with?
Not too keen on adding a new design paradigm here.
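For readers skimming the thread, a rough skeleton of the kind of dedicated pipeline being suggested might look like the following. This is purely illustrative; the component list simply mirrors the existing StableDiffusionPipeline plus the IP-Adapter pieces discussed elsewhere in this PR.

from diffusers import DiffusionPipeline

class StableDiffusionIPAdapterPipelineSketch(DiffusionPipeline):
    """Illustrative skeleton: components are registered directly,
    instead of wrapping an already-built pipeline."""

    def __init__(self, vae, text_encoder, tokenizer, unet, scheduler,
                 safety_checker, feature_extractor, image_encoder, image_projection):
        super().__init__()
        self.register_modules(
            vae=vae,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            unet=unet,
            scheduler=scheduler,
            safety_checker=safety_checker,
            feature_extractor=feature_extractor,
            image_encoder=image_encoder,
            image_projection=image_projection,
        )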
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Has there been any progress on this? IP Adapter works great for me and I'd love to see it supported in diffusers.
@andypotato could you provide some example results?
@sayakpaul sure, here are some examples I created with IP Adapter using their "ip-adapter-plus-face_sd15.bin" checkpoint. For convenience, these were created using A1111, ReV Animated 1.22, and the ControlNet extension to load the IP Adapter models, although I can also create similar images using diffusers and their own repo at https://github.com/tencent-ailab/IP-Adapter. The first image is the reference image; the rest are the generated images. The likeness is really quite stunning.
Wow, that's quite impressive indeed. @apolinario has made us aware that IP Adapters are becoming extremely popular in the community. So, I'd be in favor of supporting it in diffusers.
Agree, let's add it to main!
@okotaku I went into your PR and addressed #4944 (comment). I hope that's okay?
IIUC the encoder and the projection modules are not changed for every IP module being loaded. So, not sure why having a separate
@yiyixuxu summary of the changes I have done so far this morning:
The code from #4944 (comment) works and produces expected results. However, I updated it to better reflect the loading of the image encoder from the official repo:

from transformers import CLIPVisionModelWithProjection
from diffusers import StableDiffusionPipeline
import torch
from diffusers.utils import load_image
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
"h94/IP-Adapter",
subfolder="models/image_encoder",
torch_dtype=torch.float16,
).to("cuda")
pipeline = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
image_encoder=image_encoder,
torch_dtype=torch.float16
).to("cuda")
image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
images = pipeline(
prompt='best quality, high quality',
image_prompt=image,
negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
num_inference_steps=50,
).images
images[0].save("sayak_test_out.png")
image = image.to(device=device, dtype=dtype)
image_embeds = self.image_encoder(image).image_embeds
projected_image_embeds = self.image_projection(image_embeds)
I didn't do this because I don't think we should attach weights to pipelines.
I think it is better to use the ImageProjection layer in the UNet. cc @patrickvonplaten here
The encode_image I added for IP-Adapter is no different from encode_image in existing pipelines - different pipelines may have slightly different implementations of encode_prompt, but essentially they do the same thing.
I didn't do this because I don't think we should attach weights to pipelines.
Fair point.
The encode_image I added for IP-Adapter is no different from encode_image in existing pipelines - different pipelines may have slightly different implementations of encode_prompt, but essentially they do the same thing.
Yeah, with your implementation that doesn't use the projection module, it makes sense to name the function encode_image(), as that is exactly what it is doing.
Okay for me either way. If we are attaching ad-hoc weights to the pipeline, I'd still like an opinion from @patrickvonplaten here.
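To make the difference between the two variants concrete, here is a hedged sketch of a pipeline-side encode_image that stops at the CLIP image embeddings and leaves the projection to IP tokens elsewhere (e.g. in the UNet). The function shape is illustrative, not the exact implementation in this PR.

import torch

def encode_image(image_encoder, feature_extractor, image, device, dtype):
    """Sketch: preprocess the image and return CLIP image embeddings only."""
    if not isinstance(image, torch.Tensor):
        image = feature_extractor(image, return_tensors="pt").pixel_values
    image = image.to(device=device, dtype=dtype)
    # no projection here: mapping to IP-Adapter tokens would happen in the UNet
    return image_encoder(image).image_embeds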
import torch.nn.functional as F
from torch import nn

from ..loaders import PatchedLoraProjection, text_encoder_attn_modules, text_encoder_mlp_modules
Can we reverse this change? I don't think it's needed.
Otherwise, there's a circular import problem.
clip_skip=self.clip_skip,
)

if image_prompt is not None:
Cool that this is all that is needed!
from .models.attention_processor import (
    AttnProcessor,
    AttnProcessor2_0,
    IPAdapterAttnProcessor,
    IPAdapterAttnProcessor2_0,
    IPAdapterControlNetAttnProcessor,
    IPAdapterControlNetAttnProcessor2_0,
)
This creates a circular dependency (regarding https://github.com/huggingface/diffusers/pull/4944/files#r1386535787). Let me know if there's a better way to address this.
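For reference, one generic way to avoid a module-level import cycle is to defer the import to call time. This is only a sketch of the pattern (using the same relative module path as the diff above, so it only runs inside the package), not necessarily the right fix for this PR.

def _get_ip_adapter_attn_processors():
    # A function-local import is resolved only when the function is called,
    # after both modules have finished importing, which breaks the cycle.
    from .models.attention_processor import (
        IPAdapterAttnProcessor,
        IPAdapterAttnProcessor2_0,
    )
    return IPAdapterAttnProcessor, IPAdapterAttnProcessor2_0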
@patrickvonplaten I am trying to address your concerns around serialization with the following snippet:

from transformers import CLIPVisionModelWithProjection
from diffusers import StableDiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor2_0
import torch
import numpy as np
import tempfile
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
"h94/IP-Adapter",
subfolder="models/image_encoder",
torch_dtype=torch.float16,
).to("cuda")
pipeline = StableDiffusionPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
image_encoder=image_encoder,
torch_dtype=torch.float16
).to("cuda")
output = pipeline("hey", num_inference_steps=10, output_type="np").images
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
ckpt_path = "ip_adapter_loaded_pipeline"
with tempfile.TemporaryDirectory() as tmpdirname:
pipeline.save_pretrained(tmpdirname)
pipeline.unet.set_attn_processor(AttnProcessor2_0())
loaded_pipeline = StableDiffusionPipeline.from_pretrained(tmpdirname, torch_dtype=torch.float16).to("cuda")
output_loaded = loaded_pipeline("hey", num_inference_steps=10, output_type="np").images
print(np.allclose(output[0, :3, :3, -1], output_loaded[0, :3, :3, -1], atol=1e-4, rtol=1e-4))

I am facing this:

Some weights of the model checkpoint were not used when initializing UNet2DConditionModel:
['down_blocks.0.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.0.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.0.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, down_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, mid_block.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, mid_block.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.1.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.1.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.1.attentions.2.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.2.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.2.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.2.attentions.2.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.3.attentions.0.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.3.attentions.1.transformer_blocks.0.attn2.processor.to_v_ip.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.processor.to_k_ip.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.processor.to_v_ip.weight']

However, the assertion passes. Since that is the case, I think that's fine for now. Later, we can work on an WDYT? @yiyixuxu what's your take?
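As an aside, a simple hedged workaround for the warning in a round-trip like the one above would be to call load_ip_adapter again on the reloaded pipeline so the attention-processor weights are re-attached; whether a more automatic mechanism should be added is a separate question.

# hypothetical follow-up to the snippet above
loaded_pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")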
scheduler: KarrasDiffusionSchedulers,
safety_checker: StableDiffusionSafetyChecker,
feature_extractor: CLIPImageProcessor,
image_encoder: CLIPVisionModelWithProjection = None,
@patrickvonplaten defaulting to None breaks the serialization compatibility (as demonstrated in #4944 (comment)). Not defaulting to None breaks the existing tests. Any suggestions?
I think we should try to update the tests instead -
Same.
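If it helps, updating the fast tests could be as small as constructing a tiny image encoder in each pipeline test's dummy components, so that image_encoder no longer needs a default in the signature. A hedged sketch, with arbitrary illustrative config values:

from transformers import CLIPVisionConfig, CLIPVisionModelWithProjection

def dummy_image_encoder():
    """Tiny CLIP vision tower for fast pipeline tests (illustrative sizes)."""
    config = CLIPVisionConfig(
        hidden_size=32,
        intermediate_size=37,
        num_attention_heads=4,
        num_hidden_layers=2,
        image_size=32,
        patch_size=4,
        projection_dim=32,
    )
    return CLIPVisionModelWithProjection(config)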
What does this PR do?
Support IP-Adapter.
github: https://github.com/tencent-ailab/IP-Adapter
arxiv: https://arxiv.org/abs/2308.06721
original weights: https://huggingface.co/h94/IP-Adapter
How to run inference
test weights: https://huggingface.co/takuoko/IP-Adapter-test
To run inference with the diffusers pipeline, I converted the weights from the original repo h94/IP-Adapter.
Image prompt
output
Before submitting
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Core library: