[Feature] Ulysses Attention for any sequence length w/o padding

## 🤖UAA: Ulysses Anything Attention 

<div id="ulysses-anything-attention"></div>

We have implemented **[📚UAA: Ulysses Anything Attention](https://github.com/vipshop/cache-dit/blob/main/docs/User_Guide.md#uaa-ulysses-anything-attention)**: a Ulysses Attention variant that supports **arbitrary sequence length** with ✅**zero padding** and ✅**nearly zero theoretical communication overhead**. The default Ulysses Attention requires the sequence length of the hidden states to be **divisible by the number of devices**, which imposes **significant limitations** on the practical application of Ulysses.


```python
# pip3 install "cache-dit[parallelism]"
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig

cache_dit.enable_cache(
    pipe_or_adapter, 
    cache_config=DBCacheConfig(...),
    # Set `experimental_ulysses_anything` as True to enable UAA
    parallelism_config=ParallelismConfig(
        ulysses_size=2,
        parallel_kwargs={
            "experimental_ulysses_anything": True
        },
    ),
)
# torchrun --nproc_per_node=2 parallel_cache_ulysses_anything.py
```

For example, in T2I and I2V tasks, the length of user prompts is variable, and it is difficult to guarantee that this length is divisible by the number of devices. To address this issue, we have developed a **✅padding-free** Ulysses Attention (UAA) for **arbitrary sequence length**, which makes Ulysses more broadly applicable.
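To make the divisibility problem concrete, here is a minimal pure-Python sketch of how a sequence could be sharded unevenly across ranks instead of being padded. The helper names `shard_lengths` and `padded_length` and the sequence length 4481 are illustrative assumptions, not part of cache-dit, and the actual UAA sharding policy may differ.

```python
from __future__ import annotations


def shard_lengths(seq_len: int, world_size: int) -> list[int]:
    """Hypothetical helper: split `seq_len` tokens across ranks without padding.

    The first `seq_len % world_size` ranks receive one extra token, so the
    shard lengths differ by at most one and always sum to `seq_len`.
    """
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    base, extra = divmod(seq_len, world_size)
    return [base + 1 if rank < extra else base for rank in range(world_size)]


def padded_length(seq_len: int, world_size: int) -> int:
    """Length a padding-based scheme would need: next multiple of world_size."""
    return -(-seq_len // world_size) * world_size


if __name__ == "__main__":
    for seq_len in (4096, 4481):  # illustrative lengths only
        shards = shard_lengths(seq_len, 2)
        pad = padded_length(seq_len, 2) - seq_len
        print(f"seq_len={seq_len} shards={shards} padding_otherwise={pad}")
```

With two ranks, an even length splits into equal shards, while an odd length yields shards that differ by one token. A scheme that requires equal shards would have to pad the odd case instead.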

```python
dist.init_process_group(backend="cpu:gloo,cuda:nccl")
```
Compared to Ulysses Attention, **UAA** adds only one **extra all-gather** op on scalar values to gather the seq_len of each rank. To avoid repeated forced CUDA syncs caused by H2D and D2H transfers, please add the **✅gloo** backend in `init_process_group`, as shown above. This significantly reduces communication latency.
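The sketch below illustrates why each rank needs the other ranks' seq_len values: once the lengths are gathered, a rank can size its receive buffer and per-peer splits for the sequence/head all-to-all. It is a pure-Python illustration under the usual Ulysses assumption that the number of heads is divisible by the number of ranks. `AllToAllPlan` and `plan_all_to_all` are hypothetical names, and this is not the cache-dit implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AllToAllPlan:
    input_shape: tuple[int, int, int]   # (local_seq, all_heads, head_dim)
    output_shape: tuple[int, int, int]  # (total_seq, local_heads, head_dim)
    recv_seq_splits: list[int]          # tokens received from each rank


def plan_all_to_all(
    seq_lens: list[int], rank: int, num_heads: int, head_dim: int
) -> AllToAllPlan:
    """Hypothetical helper: derive buffer shapes from gathered seq_len values.

    `seq_lens[i]` is the local sequence length on rank i, i.e. the result of
    the extra scalar all-gather. Heads are still split evenly across ranks.
    """
    world_size = len(seq_lens)
    if num_heads % world_size != 0:
        raise ValueError("num_heads must be divisible by the number of ranks")
    local_heads = num_heads // world_size
    return AllToAllPlan(
        input_shape=(seq_lens[rank], num_heads, head_dim),
        output_shape=(sum(seq_lens), local_heads, head_dim),
        recv_seq_splits=list(seq_lens),
    )


if __name__ == "__main__":
    # Illustrative values: two ranks holding 2241 and 2240 tokens.
    print(plan_all_to_all([2241, 2240], rank=1, num_heads=24, head_dim=128))
```

Because only a handful of integers are exchanged, keeping them on the CPU and gathering them over gloo avoids moving scalars to and from the GPU just to read shape information.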

<div align="center">

<p align="center">
    U*: Ulysses Attention, <b>UAA: Ulysses Anything Attention</b>, UAA*: UAA + Gloo, Device: NVIDIA L20<br>
    FLUX.1-Dev w/o CPU Offload, 28 steps; Qwen-Image w/ CPU Offload, 50 steps; Gloo: Extra All Gather w/ Gloo
</p>

|CP2 w/ U* |CP2 w/ UAA* | CP2 w/ UAA |  L20x1 | CP2 w/ UAA* | CP2 w/ U* |  L20x1 |  CP2 w/ UAA* | 
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|FLUX, 13.87s|**🎉13.88s**|14.75s|23.25s| **🎉13.75s**|Qwen, 132s|181s|**🎉133s**|
|<img src="https://github.com/vipshop/cache-dit/raw/main/assets/uaa/flux.C0_Q0_NONE_Ulysses2.png" width=110px>|<img src="https://github.com/vipshop/cache-dit/raw/main/assets/uaa/flux.C0_Q0_NONE_Ulysses2_ulysses_anything.png" width=110px>|<img src="https://github.com/vipshop/cache-dit/raw/main/assets/uaa/flux.C0_Q0_NONE_Ulysses2_ulysses_anything.png" width=110px>|<img src="https://github.com/vipshop/cache-dit/raw/main/assets/uaa/flux.1008x1008.C0_Q0_NONE.png" width=110px>|<img src="https://github.com/vipshop/cache-dit/raw/main/assets//uaa/flux.1008x1008.C0_Q0_NONE_Ulysses2_ulysses_anything.png" width=110px>|<img src="https://github.com/vipshop/cache-dit/raw/main/assets/uaa/qwen-image.1312x1312.C0_Q0_NONE_Ulysses2.png" width=110px>|<img src="https://github.com/vipshop/cache-dit/raw/main/assets/uaa/qwen-image.1328x1328.C0_Q0_NONE.png" width=110px>|<img src="https://github.com/vipshop/cache-dit/raw/main/assets/uaa/qwen-image.1328x1328.C0_Q0_NONE_Ulysses2_ulysses_anything.png" width=110px>|
|1024x1024|1024x1024|1024x1024|1008x1008|1008x1008|1312x1312|1328x1328|1328x1328|
|✔️U* ✔️UAA|✔️U* ✔️UAA|✔️U* ✔️UAA| NO CP|❌U* ✔️UAA|✔️U* ✔️UAA|NO CP|❌U* ✔️UAA|

</div>

> [!Important]
> Please note that **Ulysses Anything Attention (UAA)** is currently an **experimental** feature. It has not undergone large-scale testing, and it may introduce a slight performance degradation when the `cpu:gloo` communication backend is not available.

@sayakpaul @DN6 Please let me know if you would like to have UAA in diffusers. I'd be more than happy to submit a PR to support it. The implementation of UAA is here: https://github.com/vipshop/cache-dit/blob/main/src/cache_dit/parallelism/backends/native_diffusers/context_parallelism/attention/_templated_ulysses_anything.py
