
Conversation

@eternalNight
Contributor

@eternalNight eternalNight commented Sep 19, 2025

Describe the bug

When applying deepcompile to the OpenVLA model (which is composed of two vision transformers and a llama-7B), I encountered the following issues:

a. Not all parameters are trained, which leads to compile-time exceptions as well as incorrect invocation of endBackward().
b. release_param() can be passed a tuple instead of a tensor.
c. A use-before-define error in fast_free_schedule().

This PR attempts to fix all of those issues. Patches 1-2 resolve a, patch 3 resolves b, and patch 4 resolves c.

To Reproduce the issues
Use this script: https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3

  1. deepspeed --num_gpus=N openvla-like.py -c

@eternalNight
Contributor Author

@tohtana Would you please help review these changes? Also, may I know the motivation behind the tensor clone in release_param when the inductor backend is used?

@eternalNight eternalNight force-pushed the eternalNight/vla_enabling_fixes branch from fe3ddd3 to 5f1a3b6 on September 22, 2025 03:46
@eternalNight eternalNight changed the title from "[RFC] Deepcompile: Fix bugs when applying deepcompile to VLA-like models" to "Deepcompile: Fix bugs when applying deepcompile to VLA-like models" Sep 22, 2025
@eternalNight eternalNight marked this pull request as ready for review September 22, 2025 03:46
@eternalNight
Contributor Author

Updated patch 3 to take an alternative approach, as the previous one breaks deepcompile + inductor. Refer to the commit message for more information.

@tohtana
Collaborator

tohtana commented Sep 22, 2025

Hi @eternalNight,
Thank you for the great fix! This significantly improves the coverage of DeepCompile. I have a few questions.

  • Do we need the same fix for end_backward in z1 and z2?

  • Regarding the latest fix for release_param, the commit message shows this example.

    %native_layer_norm : [num_users=3] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%clone, [1024], %wait_allgather_ds_param__primals_8_5, %wait_allgather_ds_param__primals_9_6, 1e-06), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
    %release_ds_param_primals_9_getitem_6 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%getitem, 140032938687600, 6, 1), kwargs = {})
    %release_ds_param_primals_8_getitem_5 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%release_ds_param_primals_9_getitem_6, 140032938687600, 5, 1), kwargs = {})

I think there might be an operator that requires the other elements of native_layer_norm. How do you connect release_ds_param_primals_8_getitem_5 to it? If any of the operators doesn't have a path to the output, doesn't inductor remove it?

Also, may I know the motivation behind the tensor clone in release_param when the inductor backend is used?

Good question! I observed that the Triton code generated by inductor released the buffer soon after release_param, which broke the result. I tried setting a flag to tell inductor to keep the buffer, but that didn't work in the end.

@eternalNight
Contributor Author

Hi @eternalNight, Thank you for the great fix! This significantly improves the coverage of DeepCompile. I have a few questions.

  • Do we need the same fix for end_backward in z1 and z2?

Good point. I will insert the end_backward() call in zero1_compile.py, too.

  • Regarding the latest fix for release_param, the commit message shows this example.
    %native_layer_norm : [num_users=3] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%clone, [1024], %wait_allgather_ds_param__primals_8_5, %wait_allgather_ds_param__primals_9_6, 1e-06), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
    %release_ds_param_primals_9_getitem_6 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%getitem, 140032938687600, 6, 1), kwargs = {})
    %release_ds_param_primals_8_getitem_5 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%release_ds_param_primals_9_getitem_6, 140032938687600, 5, 1), kwargs = {})

I think there might be an operator that requires the other elements of native_layer_norm. How do you connect release_ds_param_primals_8_getitem_5 to it? If any of the operators doesn't have a path to the output, doesn't inductor remove it?

Yes, the other elements in the returned tuple are also used. Those elements are fetched directly from %native_layer_norm and do not depend on any %release_ds_param_primals_xxx node. A more complete view of the graph is:

%native_layer_norm : [num_users=3] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%clone, [1024], %wait_allgather_ds_param__primals_8_5, %wait_allgather_ds_param__primals_9_6, 1e-06), kwargs = {})
%getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
%release_ds_param_primals_9_getitem_6 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%getitem, 140160312802992, 6, 1), kwargs = {})
%getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 1), kwargs = {})
%getitem_2 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 2), kwargs = {})
%release_ds_param_primals_8_getitem_5 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%release_ds_param_primals_9_getitem_6, 140160312802992, 5, 1), kwargs = {})
%_to_copy_4 : [num_users=1] = call_function[target=torch.ops.aten._to_copy.default](args = (%release_ds_param_primals_8_getitem_5,), kwargs = {dtype: torch.bfloat16})
...
return (..., getitem_1, getitem_2, ...)

Also, may I know the motivation behind the tensor clone in release_param when the inductor backend is used?

Good question! I observed that the Triton code generated by inductor released the buffer soon after release_param, which broke the result. I tried setting a flag to tell inductor to keep the buffer, but that didn't work in the end.

That sounds weird. I may be able to take a closer look when I have time.

@eternalNight
Contributor Author

@tohtana BTW, are we limited to calling only backward-compatible APIs of FX graphs? To get the output node of a graph, fx.Graph already provides output_node(). I was wondering why we define our own get_output_node.

@eternalNight eternalNight force-pushed the eternalNight/vla_enabling_fixes branch from 403c6ab to 222c426 on September 22, 2025 11:12
It is generally unsafe to assume that all parameters will be used and
have gradients. A counterexample is OpenVLA, which consists of two ViTs
and one LLM but only uses up to the second-to-last layer of both ViTs,
leaving the last layer of each ViT untrained. In such cases, the
following exception arises:

[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/passes/zero3_compile.py", line 195, in add_z3_gather_release
[rank0]:     return add_z3_gather_release_fw(gm,
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/passes/zero3_compile.py", line 135, in add_z3_gather_release_fw
[rank0]:     gm.graph = fast_free_schedule(
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/list_schedule.py", line 276, in fast_free_schedule
[rank0]:     node_to_last_use, user_to_last_uses = get_last_uses(graph)
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/util.py", line 260, in get_last_uses
[rank0]:     map_arg(node.args, lambda n: register_last_uses(n, node))
[rank0]:   File "/mnt/workspaces/banxing.mjj/openvla-venv/lib/python3.10/site-packages/torch/fx/node.py", line 862, in map_arg
[rank0]:     return _fx_map_arg(a, fn)
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/util.py", line 260, in <lambda>
[rank0]:     map_arg(node.args, lambda n: register_last_uses(n, node))
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/util.py", line 251, in register_last_uses
[rank0]:     user = node_to_last_use[user]
[rank0]: torch._dynamo.exc.BackendCompilerFailed: backend='backend_fn' raised:
[rank0]: KeyError: wait_allgather_ds_param__primals_179_177

Add allgather and reducescatter only for params that are used in the
compute graph and thus have gradients.
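
A minimal sketch of the filtering step (the helper name and arguments
below are illustrative, not the actual code in zero3_compile.py): only
parameter placeholders that have at least one user in the FX graph get
allgather/reduce-scatter nodes.

    import torch.fx as fx

    def used_param_placeholders(graph: fx.Graph, param_names: set) -> list:
        # Keep only parameter placeholders that some node actually consumes;
        # unused (hence untrained) params get no allgather/reduce-scatter nodes.
        return [
            n for n in graph.nodes
            if n.op == "placeholder" and n.name in param_names and len(n.users) > 0
        ]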

Signed-off-by: Junjie Mao <[email protected]>
Today, when to call end_backward() is inferred by counting the number
of gradient blocks that have been computed. That can be incorrect for
models composed of multiple modules that do not use all of their
parameters. An example is OpenVLA, which consists of two ViTs and one
LLM but only uses up to the second-to-last layer of both ViTs.

For generality, make backward graphs call end_backward() explicitly at
the end.
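
Roughly, the insertion looks like the following sketch (the
torch.ops.dc.end_backward.default handle and the graph_id argument are
assumptions here, not necessarily the exact names DeepCompile uses):

    import torch
    import torch.fx as fx

    def append_end_backward(graph: fx.Graph, graph_id: int) -> fx.Graph:
        # Append an explicit end_backward call right before the output node
        # of the backward graph, instead of counting gradient blocks.
        # torch.ops.dc.end_backward.default is assumed; substitute the real op.
        output = next(n for n in graph.nodes if n.op == "output")
        with graph.inserting_before(output):
            graph.call_function(torch.ops.dc.end_backward.default, args=(graph_id,))
        graph.lint()
        return graph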

Signed-off-by: Junjie Mao <[email protected]>
The release_param() API takes a tensor as its first argument, but a
user of a gathered param may return a tuple when the graph is compiled
in eager mode. That leads to the following error:

    Profiling error dc::release_param() Expected a value of type 'Tensor' for argument 'a' but instead found type 'tuple'.

Taking native_layer_norm as an example, the graph after inserting
allgather and release_param looks like the following:

    %native_layer_norm : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%clone, [1024], %wait_allgather_ds_param__primals_8_5, %wait_allgather_ds_param__primals_9_6, 1e-06), kwargs = {})
    %release_ds_param_primals_9_native_layer_norm_6 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%native_layer_norm, 140623563718032, 6, 1), kwargs = {})
    %release_ds_param_primals_8_native_layer_norm_5 : [num_users=3] = call_function[target=torch.ops.dc.release_param.default](args = (%release_ds_param_primals_9_native_layer_norm_6, 140623563718032, 5, 1), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%release_ds_param_primals_8_native_layer_norm_5, 0), kwargs = {})

To fix that exception, check whether a param user is consumed by
operator.getitem, and if it is, release the param after the first
getitem call, so that the graph above becomes:

    %native_layer_norm : [num_users=3] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%clone, [1024], %wait_allgather_ds_param__primals_8_5, %wait_allgather_ds_param__primals_9_6, 1e-06), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
    %release_ds_param_primals_9_getitem_6 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%getitem, 140032938687600, 6, 1), kwargs = {})
    %release_ds_param_primals_8_getitem_5 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%release_ds_param_primals_9_getitem_6, 140032938687600, 5, 1), kwargs = {})
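
The node selection boils down to something like the sketch below (the
helper name is illustrative, not the actual function in the DeepSpeed
pass):

    import operator
    import torch.fx as fx

    def release_anchor(param_user: fx.Node) -> fx.Node:
        # If the node that consumes the gathered param returns a tuple, hook
        # the release onto its first operator.getitem user so release_param
        # always receives a Tensor; otherwise release right after the user.
        getitems = [u for u in param_user.users
                    if u.op == "call_function" and u.target is operator.getitem]
        return getitems[0] if getitems else param_user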

A previous attempt tried to fix the issue by allowing an arbitrary type
(i.e. c10::IValue) as the first argument of release_param(), but it was
abandoned because it breaks the integration of deepcompile with the
inductor backend: allowing an arbitrary type prevents the torch
dispatcher from inferring the appropriate dispatch key and confuses the
fake-tensor dispatching logic in inductor, making it call
allgather_param_meta() for allgather_param but release_param() for
release_param.

Signed-off-by: Junjie Mao <[email protected]>
When an op consumes multiple gathered params (e.g. cat), the graph after
inserting allgathers can look like the following:

  cat = torch.ops.aten.cat.default([t1, t2]);
  release_1 = torch.ops.dc.release_param.default(cat, 1234, 2, 1);
  release_2 = torch.ops.dc.release_param.default(release_1, 1234, 1, 1);

Today fast_free_schedule() arranges nodes by iterating over the params
(in some order) and inserting prerequisites and users of each param. If
param 1 is visited before param 2 in the case above, the release_2 node
will be inserted into the scheduled list before release_1, leading to a
use-before-define error in the generated graph and giving rise to the
following exception:

[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/backend.py", line 288, in make_fw_graph
[rank0]:     run_opt_passes(
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/backend.py", line 197, in run_opt_passes
[rank0]:     gm_new = opt_pass_fn(gm, graph_id, graph_order, profiling_results, create_inputs_fn, mem_budget, param_manager,
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/passes/zero3_compile.py", line 195, in add_z3_gather_release
[rank0]:     return add_z3_gather_release_fw(gm,
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/passes/zero3_compile.py", line 135, in add_z3_gather_release_fw
[rank0]:     gm.graph = fast_free_schedule(
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/list_schedule.py", line 429, in fast_free_schedule
[rank0]:     ret_graph = make_graph_from_schedule(scheduled)
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/list_schedule.py", line 28, in make_graph_from_schedule
[rank0]:     new_node = new_graph.node_copy(node, lambda n: env[n.name])
[rank0]:   File "/mnt/workspaces/banxing.mjj/openvla-venv/lib/python3.10/site-packages/torch/fx/graph.py", line 1503, in node_copy
[rank0]:     args = map_arg(node.args, arg_transform)
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/list_schedule.py", line 28, in <lambda>
[rank0]:     new_node = new_graph.node_copy(node, lambda n: env[n.name])
[rank0]: torch._dynamo.exc.BackendCompilerFailed: backend='backend_fn' raised:
[rank0]: KeyError: 'release_ds_param_primals_7_cat_1'

Instead of adding the release node alone, also check whether any of its
dependencies are missing from the schedule and include them in the
schedule_until_free list as well.
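
A minimal sketch of that dependency check (helper and variable names
are illustrative, not the actual ones in list_schedule.py):

    import torch.fx as fx

    def schedule_with_deps(node: fx.Node, scheduled: list, scheduled_set: set) -> None:
        # Recursively pull in any input of this node that has not been
        # scheduled yet, so the resulting order never uses a value before
        # its defining node.
        for arg in node.all_input_nodes:
            if arg not in scheduled_set:
                schedule_with_deps(arg, scheduled, scheduled_set)
        scheduled.append(node)
        scheduled_set.add(node)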

Signed-off-by: Junjie Mao <[email protected]>
@eternalNight eternalNight force-pushed the eternalNight/vla_enabling_fixes branch from 222c426 to 5d2e1bb on September 22, 2025 11:14
@eternalNight
Contributor Author

The call to end_backward is added to z1 and z2 as well.

@sfc-gh-truwase sfc-gh-truwase requested review from tohtana and removed request for loadams and tjruwase September 22, 2025 13:07
Collaborator

@tohtana tohtana left a comment

@eternalNight Thank you for the clarification! This looks good.
Can you merge the latest master first? If you allow me to push changes to your branch, I will do it and merge.

@tohtana
Collaborator

tohtana commented Sep 23, 2025

@tohtana BTW, are we limited to calling only backward-compatible APIs of FX graphs? To get the output node of a graph, fx.Graph already provides output_node(). I was wondering why we define our own get_output_node.

I probably didn't find the official API when writing the code. Currently, I think DeepCompile works with v2.6 or later, but I'm open to supporting only newer versions if we have a good reason.

@eternalNight
Contributor Author

@tohtana BTW, are we limited to calling only backward-compatible APIs of FX graphs? To get the output node of a graph, fx.Graph already provides output_node(). I was wondering why we define our own get_output_node.

I probably didn't find the official API when writing the code. Currently, I think DeepCompile works with v2.6 or later, but I'm open to supporting only newer versions if we have a good reason.

The output_node API was introduced last November in v2.6.0a0. Let's leave it as is for this PR; we can switch to those APIs later once they stabilize.
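
For reference, the two approaches look roughly like this (the fallback
helper below is an illustrative sketch, not the exact DeepSpeed
implementation):

    import torch.fx as fx

    def get_output_node(graph: fx.Graph) -> fx.Node:
        # Portable fallback for older torch versions: the output node is the
        # unique node with op == "output".
        return next(n for n in graph.nodes if n.op == "output")

    # On torch >= 2.6 the same node is available directly:
    # out = graph.output_node()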

@tohtana tohtana enabled auto-merge (squash) September 23, 2025 07:06
@tohtana tohtana merged commit 8c7c56a into deepspeedai:master Sep 23, 2025
12 checks passed
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025