We have implemented 📚UAA (Ulysses Anything Attention): a Ulysses Attention that supports arbitrary sequence lengths with ✅zero padding and nearly ✅zero theoretical communication overhead. The default Ulysses Attention requires the sequence length of the hidden states to be divisible by the number of devices, which significantly limits the practical application of Ulysses.
# pip3 install "cache-dit[parallelism]"
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig

cache_dit.enable_cache(
    pipe_or_adapter,
    cache_config=DBCacheConfig(...),
    # Set `experimental_ulysses_anything` as True to enable UAA
    parallelism_config=ParallelismConfig(
        ulysses_size=2,
        parallel_kwargs={
            "experimental_ulysses_anything": True
        },
    ),
)
# torchrun --nproc_per_node=2 parallel_cache_ulysses_anything.py
For example, in T2I and I2V tasks, the length of user prompts is often variable, and it is difficult to ensure that this length is divisible by the number of devices. To address this issue, we have developed a ✅padding-free Ulysses Attention (UAA) for arbitrary sequence lengths, which enhances the versatility of Ulysses.
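To make the divisibility constraint concrete, here is a minimal sketch in plain Python. It assumes the image token count is `(height // 16) * (width // 16)` (an 8x VAE downsample followed by 2x2 patch packing, which is my assumption and not stated above), ignores text tokens, and uses a "first ranks take the remainder" split convention similar to `torch.tensor_split`. The helper names are hypothetical and the actual UAA implementation may shard differently.

```python
def image_token_count(height: int, width: int, downsample: int = 16) -> int:
    """Image tokens under the ASSUMED 16x total downsampling (8x VAE, 2x2 patches)."""
    return (height // downsample) * (width // downsample)


def split_sizes(seq_len: int, world_size: int) -> list[int]:
    """Padding-free shard sizes: the first `seq_len % world_size` ranks get one extra token."""
    base, extra = divmod(seq_len, world_size)
    return [base + 1 if rank < extra else base for rank in range(world_size)]


def padded_length(seq_len: int, world_size: int) -> int:
    """Length a padding-based approach would need (next multiple of world_size)."""
    return -(-seq_len // world_size) * world_size


for side in (1024, 1008):
    tokens = image_token_count(side, side)
    print(
        f"{side}x{side}: {tokens} tokens, "
        f"divisible by 2: {tokens % 2 == 0}, "
        f"uneven shards: {split_sizes(tokens, 2)}, "
        f"padded length: {padded_length(tokens, 2)}"
    )
```

Under these assumptions, 1024x1024 yields an even token count that splits cleanly across 2 devices, while 1008x1008 yields an odd count, so one rank holds one more token than the other and no padding token is introduced.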
Compared to Ulysses Attention, UAA only adds one extra all-gather op on scalar values to gather the seq_len of each rank. To avoid multiple forced CUDA syncs caused by H2D and D2H transfers, please add the ✅gloo backend in init_process_group. This will significantly reduce communication latency.
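A minimal sketch of what adding the gloo backend could look like, assuming a script launched with `torchrun`. The helper names `pick_backend`, `init_distributed`, and `gather_seq_lens` are hypothetical and are not part of cache-dit or diffusers. The sketch relies on PyTorch's mixed backend string `"cpu:gloo,cuda:nccl"`, which dispatches collectives on CPU tensors to Gloo and collectives on CUDA tensors to NCCL. The `torch` imports are kept inside the functions so the backend selection can be inspected without a distributed environment.

```python
def pick_backend(cuda_available: bool) -> str:
    """Backend string for init_process_group: Gloo for CPU tensors, NCCL for CUDA tensors."""
    return "cpu:gloo,cuda:nccl" if cuda_available else "gloo"


def init_distributed() -> None:
    """Initialize the default process group; expects torchrun to set RANK/WORLD_SIZE."""
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend=pick_backend(torch.cuda.is_available()))


def gather_seq_lens(local_seq_len: int) -> list[int]:
    """All-gather every rank's sequence length as CPU scalars.

    Because the tensors live on the CPU, the collective is routed to Gloo,
    so no host-to-device or device-to-host copy (and no forced CUDA sync)
    is needed just to exchange a few integers.
    """
    import torch
    import torch.distributed as dist

    local = torch.tensor([local_seq_len], dtype=torch.int64)  # stays on CPU
    gathered = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    return [int(t.item()) for t in gathered]
```

In a `torchrun` script, one would call `init_distributed()` once at startup and then `gather_seq_lens(hidden_states.shape[1])` wherever the per-rank lengths are needed. This only illustrates the idea of the scalar all-gather over Gloo; how UAA uses the gathered lengths internally is not shown here.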
U*: Ulysses Attention, UAA: Ulysses Anything Attention, UAA*: UAA + Gloo, Device: NVIDIA L20
FLUX.1-Dev w/o CPU Offload, 28 steps; Qwen-Image w/ CPU Offload, 50 steps; Gloo: Extra All Gather w/ Gloo
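| Model | Setup | Time | Resolution | Supported by |
|---|---|---|---|---|
| FLUX | CP2 w/ U* | 13.87s | 1024x1024 | ✔️U* ✔️UAA |
| FLUX | CP2 w/ UAA* | 🎉13.88s | 1024x1024 | ✔️U* ✔️UAA |
| FLUX | CP2 w/ UAA | 14.75s | 1024x1024 | ✔️U* ✔️UAA |
| FLUX | L20x1 | 23.25s | 1008x1008 | NO CP |
| FLUX | CP2 w/ UAA* | 🎉13.75s | 1008x1008 | ❌U* ✔️UAA |
| Qwen | CP2 w/ U* | 132s | 1312x1312 | ✔️U* ✔️UAA |
| Qwen | L20x1 | 181s | 1328x1328 | NO CP |
| Qwen | CP2 w/ UAA* | 🎉133s | 1328x1328 | ❌U* ✔️UAA |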
Important
Please note that Ulysses Anything Attention (UAA) is currently an experimental feature. It has not undergone large-scale testing and may introduce a slight performance degradation when the cpu:gloo communication backend is not available.
@sayakpaul @DN6 Please let me know if you want to have UAA in diffusers. I'd be more than happy to submit a PR to support it. The implementation of UAA is here: https://github.com/vipshop/cache-dit/blob/main/src/cache_dit/parallelism/backends/native_diffusers/context_parallelism/attention/_templated_ulysses_anything.py