Conversation
| from ..modeling_outputs import Transformer2DModelOutput |
| from ..modeling_utils import ModelMixin |
| from ..normalization import FP32LayerNorm |
| from .transformer_wan import WanTimeTextImageEmbedding, WanTransformerBlock |
can we copy over these 2 things and add a `# Copied from`, instead of importing from wan?
yep, that makes sense. so we'll need to copy all the modules in transformer_wan here.
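For reference, the `# Copied from` convention discussed here works roughly as sketched below. The class bodies are illustrative stand-ins, not the real Wan modules: the definition is duplicated rather than imported, and the annotation lets `make fix-copies` keep the copy in sync with its source.

```python
# Illustrative sketch of the diffusers `# Copied from` convention (stand-in
# classes, not the real Wan modules). The annotation tells `make fix-copies`
# which definition is the source of truth; a `with Wan->ChronoEdit` suffix can
# be added when identifiers are renamed in the copy.

class WanTimeTextImageEmbedding:
    """Stand-in for the module defined in transformer_wan."""

    def __init__(self, dim: int):
        self.dim = dim


# Copied from diffusers.models.transformers.transformer_wan.WanTimeTextImageEmbedding
class ChronoEditTimeTextImageEmbedding:
    """Duplicated body; `make fix-copies` checks it stays identical to the source."""

    def __init__(self, dim: int):
        self.dim = dim
```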
yiyixuxu
left a comment
thanks for the PR! I left one question about whether we support any number of num_frames.
other than that, I think we should remove stuff that's in wan but not needed here for chrono to simplify the code a bit, but if you want to keep it consistent and maybe support these features in the future, that's ok too
| self.video_processor = VideoProcessor(vae_scale_factor=self.vae_scale_factor_spatial) |
| self.image_processor = image_processor |
| def _get_t5_prompt_embeds( |
let's add a `# Copied from` if it's the same one as Wan
| return prompt_embeds |
| def encode_image( |
| image_encoder: CLIPVisionModel = None, |
| transformer: ChronoEditTransformer3DModel = None, |
| transformer_2: ChronoEditTransformer3DModel = None, |
| boundary_ratio: Optional[float] = None, |
| boundary_ratio: Optional[float] = None, |
if we don't support the two-stage denoising loop, let's remove this parameter and all its related logic, to simplify the pipeline a bit
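For context, the two-stage dispatch that `boundary_ratio` gates in Wan-style pipelines looks roughly like the sketch below. The model names, `NUM_TRAIN_TIMESTEPS`, and the comparison direction are illustrative assumptions, not confirmed ChronoEdit behavior; if ChronoEdit only ever uses one transformer, all of this logic can go.

```python
from typing import Optional

# Hedged sketch of the two-stage denoising dispatch gated by `boundary_ratio`
# in Wan-style pipelines. Strings stand in for the two transformer models;
# NUM_TRAIN_TIMESTEPS and the >= comparison are assumptions.

NUM_TRAIN_TIMESTEPS = 1000


def select_transformer(t: float, boundary_ratio: Optional[float],
                       transformer: str, transformer_2: str) -> str:
    """Pick the denoiser for timestep t: high-noise model above the boundary."""
    if boundary_ratio is None:
        return transformer  # single-stage pipeline: one model for all steps
    boundary = boundary_ratio * NUM_TRAIN_TIMESTEPS
    return transformer if t >= boundary else transformer_2
```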
| num_frames: int = 81, |
| num_inference_steps: int = 50, |
| guidance_scale: float = 5.0, |
| guidance_scale_2: Optional[float] = None, |
| guidance_scale_2: Optional[float] = None, |
| prompt_embeds: Optional[torch.Tensor] = None, |
| negative_prompt_embeds: Optional[torch.Tensor] = None, |
| image_embeds: Optional[torch.Tensor] = None, |
| last_image: Optional[torch.Tensor] = None, |
it's an image editing task and can output video to show the reasoning process, no? what would be a meaningful use case to also pass a last_image parameter here?
| if self.config.boundary_ratio is not None and image_embeds is not None: |
| raise ValueError("Cannot forward `image_embeds` when the pipeline's `boundary_ratio` is not configured.") |
| def prepare_latents( |
i think this is the same as in wan i2v too?
if you want to just add a `# Copied from` and keep this method as it is, that's fine! we can also just remove all the logic we don't need here related to last_frame and expand_timesteps
yes, it's the same as in wan i2v. I added a reference to the original function and removed all the logic for wan2.2.
| freqs_cos = self.freqs_cos.split(split_sizes, dim=1) |
| freqs_sin = self.freqs_sin.split(split_sizes, dim=1) |
| assert num_frames == 2 or num_frames == self.temporal_skip_len, ( |
i don't understand this check here. I think after the temporal reasoning step, num_frames is 2, but otherwise, e.g. if temporal reasoning is not enabled, this dimension will have various lengths, based on the num_frames variable the users passed to the pipeline, no?
if our model can only work with fixed num_frames, maybe we can throw an error from the pipeline when we check the inputs?
yes, it works on num_frames >= 2. I've removed this check in the latest commit.
Hi @yiyixuxu, thanks for your review and suggestions! I've updated the code accordingly in the latest commit. Please feel free to make any further changes if needed.
yiyixuxu
left a comment
looking great! could you add a doc page in this PR?
also tests, but we can help with tests if you need
for docs, we can do something similar to wan.
for tests, you can just follow what wan did.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Co-authored-by: YiYi Xu <[email protected]>
test added. will work on the doc now :)
@bot /style |
Style fix runs successfully without any file modified.
Hi @zhangjiewu, could you perform the following?
Thanks!
Hey @dg845, I've completed the two tasks you commented on. Thank you!
I see that
Hi @dg845, I got these errors when running. Could you check whether the following input works?
For
Thank you @sayakpaul for the info. (See diffusers/tests/pipelines/wan/test_wan_video_to_video.py, lines 145 to 149 at bc8fd86.) We can fix this after #12500 is merged. @dg845, could you please run the tests again and see if they pass?
@zhangjiewu that sounds good to me! I have triggered our CI too. Thanks for your patience.
Will let @dg845 take care of the final merging.
@zhangjiewu, thanks for looking into the tests! The ChronoEdit tests are passing on the CI (the failing tests are unrelated) and I also checked that the tests work locally, so merging.
Thank you all for your time and support! |
add ChronoEdit
This PR adds ChronoEdit, a state-of-the-art image editing model that reframes image editing as a video generation task to achieve physically consistent edits.
HF Model: https://huggingface.co/nvidia/ChronoEdit-14B-Diffusers
Gradio Demo: https://huggingface.co/spaces/nvidia/ChronoEdit
Paper: https://arxiv.org/abs/2510.04290
Code: https://github.com/nv-tlabs/ChronoEdit
Website: https://research.nvidia.com/labs/toronto-ai/chronoedit/
cc: @sayakpaul @yiyixuxu @asomoza
Usage
Full model
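Since the original snippet is not included here, a hedged usage sketch for the full model follows. The pipeline call signature and the `.frames` output attribute mirror the Wan-style API this PR follows and are assumptions; the prompt, file path, and the 4k + 1 frame constraint encoded in the helper are illustrative, not confirmed.

```python
# Hedged usage sketch (Wan-style API assumed; prompt and paths are placeholders).

def nearest_valid_num_frames(num_frames: int, temporal_compression: int = 4) -> int:
    """Snap a requested frame count onto the assumed 4k + 1 grid."""
    return max((num_frames - 1) // temporal_compression * temporal_compression + 1, 1)


def run_full_model():
    """Run the full ChronoEdit pipeline; requires a GPU and the model weights."""
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import load_image

    pipe = DiffusionPipeline.from_pretrained(
        "nvidia/ChronoEdit-14B-Diffusers", torch_dtype=torch.bfloat16
    )
    pipe.to("cuda")

    image = load_image("input.png")  # placeholder input image
    output = pipe(
        image=image,
        prompt="make the car red",  # illustrative edit instruction
        num_frames=nearest_valid_num_frames(81),
        num_inference_steps=50,
        guidance_scale=5.0,
    )
    return output.frames[0]  # assumed Wan-style output attribute
```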
Full model with temporal reasoning
With 8-steps distillation LoRA