Conversation

@tohtana (Collaborator) commented Jun 25, 2025

This PR improves the coverage of DeepCompile.

  • Use real parameters when recompilation happens
  • Handle overflow errors in profiling

This PR should be merged after #7366.

ZeRO1 and ZeRO3 both worked with OpenRLHF. See the Wiki page for more details.
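
For context, here is a minimal configuration sketch of the setup described above. The `compile`/`deepcompile` keys and the `engine.compile()` call are assumptions based on the public DeepCompile docs, not something specified in this PR; adjust for your DeepSpeed version.

```python
import torch
import deepspeed

model = torch.nn.Linear(8, 8)  # stand-in module for illustration

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},                       # ZeRO-3, as tested with OpenRLHF
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    "compile": {"deepcompile": True},                        # assumed flag to enable DeepCompile
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
engine.compile()  # assumed entry point that applies DeepCompile's compilation passes
```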

Masahiro Tanaka and others added 21 commits June 14, 2025 01:39
@loadams (Collaborator) commented Jun 25, 2025

@tohtana - the HPU is currently down, so I'll remove this test for now.

tohtana merged commit 6594c26 into master on Jun 27, 2025 (9 checks passed)
tohtana deleted the tohtana/dc_improve_z3_coverage branch on June 27, 2025 at 23:28
@hijkzzz commented Jun 30, 2025

When will support for flashattn 2.8.0 be available?

@tohtana (Collaborator, Author) commented Jun 30, 2025

Hi @hijkzzz,

It is more about how PyTorch supports flash-attention. With the packing option enabled in OpenRLHF, transformers tries to use flash_attn_varlen_forward, but the PyTorch compiler can't compile it. I tried flash-attn v2.8 and the result was the same.

Here is the error I got from flash-attention+packing+compile. This error happens even without DeepSpeed.

Dynamo failed to run FX node with fake tensors: call_function flash_attn._flash_attn_varlen_forward(*(FakeTensor(..., device='cuda:0', size=(2328, 32, 128), dtype=torch.bfloat16), FakeTensor(..., device='cuda:0', size=(2328, 8, 128), dtype=torch.bfloat16), FakeTensor(..., device='cuda:0', size=(2328, 8, 128), dtype=torch.bfloat16), FakeTensor(..., device='cuda:0', size=(3,), dtype=torch.int32), FakeTensor(..., device='cuda:0', size=(3,), dtype=torch.int32), FakeTensor(..., device='cuda:0', size=(), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(), dtype=torch.int64), 0.0, 0.08838834764831845), **{'causal': True, 'window_size_left': -1, 'window_size_right': -1, 'softcap': 0.0, 'alibi_slopes': None, 'return_softmax': False, 'block_table': None}): got RuntimeError("flash_attn::_flash_attn_varlen_forward() Expected a value of type 'int' for argument 'max_seqlen_q' but instead found type 'FakeTensor'.\nPosition: 5\nValue: FakeTensor(..., device='cuda:0', size=(), dtype=torch.int64)\nDeclaration: flash_attn::_flash_attn_varlen_forward(Tensor q, Tensor k, Tensor v, Tensor cu_seqlens_q, Tensor cu_seqlens_k, SymInt max_seqlen_q, SymInt max_seqlen_k, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left=-1, SymInt window_size_right=-1, float softcap=0., Tensor? alibi_slopes=None, bool return_softmax=False, Tensor? block_table=None, Tensor? leftpad_k=None, Tensor? seqused_k=None, bool zero_tensors=False) -> (Tensor, Tensor, Tensor, Tensor)\nCast error details: Unable to cast Python instance of type <class 'torch._subclasses.fake_tensor.FakeTensor'> to C++ type '?' (#define PYBIND11_DETAILED_ERROR_MESSAGES or compile in debug mode for details)")

from user code:
   File "/home/mtanaka/.local/lib/python3.10/site-packages/transformers/modeling_flash_attention_utils.py", line 279, in torch_dynamo_resume_in__flash_attention_forward_at_272
    attn_output = flash_attn_varlen_func(
  File "/home/mtanaka/.local/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 1443, in flash_attn_varlen_func
    return FlashAttnVarlenFunc.apply(
  File "/home/mtanaka/.local/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 925, in forward
    out_padded, softmax_lse, S_dmask, rng_state = _wrapped_flash_attn_varlen_forward(
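
For what it's worth, one workaround sketch (not verified in this thread, and `varlen_attention` is just a hypothetical wrapper name) is to force a graph break around the varlen call so Dynamo never traces the op that expects plain ints for `max_seqlen_q`/`max_seqlen_k`:

```python
import torch
from flash_attn import flash_attn_varlen_func  # requires flash-attn

@torch.compiler.disable  # keep this call out of the compiled graph
def varlen_attention(q, k, v, cu_seqlens_q, cu_seqlens_k,
                     max_seqlen_q, max_seqlen_k, softmax_scale):
    # flash-attn expects Python ints here; materialize them on the host
    # if the caller passed 0-dim tensors (the source of the error above).
    if torch.is_tensor(max_seqlen_q):
        max_seqlen_q = int(max_seqlen_q.item())
    if torch.is_tensor(max_seqlen_k):
        max_seqlen_k = int(max_seqlen_k.item())
    return flash_attn_varlen_func(
        q, k, v, cu_seqlens_q, cu_seqlens_k,
        max_seqlen_q, max_seqlen_k,
        softmax_scale=softmax_scale, causal=True,
    )
```

The graph break gives up some fusion opportunities around attention, but it keeps the rest of the model compilable.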

@hijkzzz commented Jul 8, 2025

> Hi @hijkzzz,
>
> It is more about how PyTorch supports flash-attention. With the packing option enabled in OpenRLHF, transformers tries to use flash_attn_varlen_forward, but the PyTorch compiler can't compile it. I tried flash-attn v2.8 and the result was the same.
>
> Here is the error I got from flash-attention+packing+compile. This error happens even without DeepSpeed.
>
> […]

@tohtana Flash-attn is currently a hard dependency for frameworks like OpenRLHF, veRL, and Alibaba ROLL. Is there any way to work around this issue?

@tohtana (Collaborator, Author) commented Jul 8, 2025

@hijkzzz Yeah, I understand the eager implementation is not an option. What do you think about SDPF or disabling the "packing sample" option? In my environment, DeepCompile+SDPF showed better performance with OpenRLHF than using flash attention.

@hijkzzz commented Jul 8, 2025

> @hijkzzz Yeah, I understand the eager implementation is not an option. What do you think about SDPF or disabling the "packing sample" option? In my environment, DeepCompile+SDPF showed better performance with OpenRLHF than using flash attention.

@tohtana flash-attn and packing_samples are extremely important for RLHF, and even for SFT... they're not something we can afford to drop for now.

@tohtana (Collaborator, Author) commented Jul 9, 2025

@hijkzzz Sure, I understand the importance. I'm just wondering how big the performance difference with SDPA is (sorry for the typo in my last message). From my understanding, SDPA calls the flash-attention kernels (or nearly equivalent ones) internally, and you can enable it just by setting attn_implementation to sdpa if you use HF models. At least, I didn't see much performance difference when I tried it.
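
As a concrete sketch of that suggestion (the model id below is just a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",       # placeholder model id
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",      # PyTorch SDPA instead of flash-attn
)
model = torch.compile(model)         # SDPA is traceable by torch.compile
```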

@hijkzzz commented Jul 14, 2025

> @hijkzzz Sure, I understand the importance. I'm just wondering how big the performance difference with SDPA is (sorry for the typo in my last message). From my understanding, SDPA calls the flash-attention kernels (or nearly equivalent ones) internally, and you can enable it just by setting attn_implementation to sdpa if you use HF models. At least, I didn't see much performance difference when I tried it.

Packing samples / ring attention depends directly on the flash-attn library's API, so for now we can't switch to this approach.

lpnpcs pushed a commit to lpnpcs/DeepSpeed that referenced this pull request Jul 30, 2025
This PR improves the coverage of DeepCompile.

- Use real parameters when recompilation happens
- Handle overflow errors in profiling

This PR should be merged after deepspeedai#7366.

ZeRO1 and ZeRO3 both worked with OpenRLHF. See [Wiki
page](https://github.com/tohtana/DeepCompile_docs/wiki/Debug-with-OpenRLHF-(%237243))
for more details.

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025