
Conversation

@eternalNight
Contributor

@eternalNight eternalNight commented Sep 19, 2025

Describe the bug

When applying deepcompile to the OpenVLA model (which is composed of two vision transformers and a llama-7B), I encountered the following issues:

a. Not all parameters are trained, which leads to compile-time exceptions as well as incorrect invocation of endBackward().
b. release_param() can be passed a tuple instead of a tensor.
c. A use-before-define error in fast_free_schedule().

This PR attempts to fix all of those issues. Patches 1-2 resolve a, patch 3 resolves b, and patch 4 resolves c.

To Reproduce the issues
Use this script: https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3

  1. deepspeed --num_gpus=N openvla-like.py -c

@eternalNight
Contributor Author

@tohtana Would you please help review these changes? Also, may I know the motivation behind the tensor clone in release_param when the inductor backend is used?

@eternalNight eternalNight force-pushed the eternalNight/vla_enabling_fixes branch from fe3ddd3 to 5f1a3b6 on September 22, 2025 03:46
@eternalNight eternalNight changed the title from "[RFC] Deepcompile: Fix bugs when applying deepcompile to VLA-like models" to "Deepcompile: Fix bugs when applying deepcompile to VLA-like models" Sep 22, 2025
@eternalNight eternalNight marked this pull request as ready for review September 22, 2025 03:46
@eternalNight
Contributor Author

Updated patch 3 to take an alternative approach, as the previous one breaks deepcompile + inductor. Refer to the commit message for more information.

@tohtana
Collaborator

tohtana commented Sep 22, 2025

Hi @eternalNight,
Thank you for the great fix! This significantly improves the coverage of DeepCompile. I have a few questions.

  • Do we need the same fix for end_backward in z1 and z2?

  • Regarding the latest fix for release_param, the commit message shows this example.

    %native_layer_norm : [num_users=3] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%clone, [1024], %wait_allgather_ds_param__primals_8_5, %wait_allgather_ds_param__primals_9_6, 1e-06), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
    %release_ds_param_primals_9_getitem_6 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%getitem, 140032938687600, 6, 1), kwargs = {})
    %release_ds_param_primals_8_getitem_5 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%release_ds_param_primals_9_getitem_6, 140032938687600, 5, 1), kwargs = {})

I think there might be an operator that requires the other elements of native_layer_norm. How do you connect release_ds_param_primals_8_getitem_5 to it? If any of the operators doesn't have a path to the output, doesn't inductor remove it?

Also, may I know the motivation behind the tensor clone in release_param when the inductor backend is used?

Good question! I observed that the Triton code generated by inductor released the buffer soon after release_param, which broke the result. I tried setting a flag to tell inductor to keep the buffer, but that didn't work in the end.

@eternalNight
Contributor Author

Hi @eternalNight, Thank you for the great fix! This significantly improves the coverage of DeepCompile. I have a few questions.

  • Do we need the same fix for end_backward in z1 and z2?

Good point. I will insert the end_backward() call in zero1_compile.py, too.

  • Regarding the latest fix for release_param, the commit message shows this example.
    %native_layer_norm : [num_users=3] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%clone, [1024], %wait_allgather_ds_param__primals_8_5, %wait_allgather_ds_param__primals_9_6, 1e-06), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
    %release_ds_param_primals_9_getitem_6 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%getitem, 140032938687600, 6, 1), kwargs = {})
    %release_ds_param_primals_8_getitem_5 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%release_ds_param_primals_9_getitem_6, 140032938687600, 5, 1), kwargs = {})

I think there might be an operator that requires the other elements of native_layer_norm. How do you connect release_ds_param_primals_8_getitem_5 to it? If any of the operators doesn't have a path to the output, doesn't inductor remove it?

Yes, the other elements in the returned tuple are also used. Those elements are fetched directly from %native_layer_norm and do not depend on any %release_ds_param_primals_xxx node. A more complete view of the graph is:

%native_layer_norm : [num_users=3] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%clone, [1024], %wait_allgather_ds_param__primals_8_5, %wait_allgather_ds_param__primals_9_6, 1e-06), kwargs = {})
%getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
%release_ds_param_primals_9_getitem_6 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%getitem, 140160312802992, 6, 1), kwargs = {})
%getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 1), kwargs = {})
%getitem_2 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 2), kwargs = {})
%release_ds_param_primals_8_getitem_5 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%release_ds_param_primals_9_getitem_6, 140160312802992, 5, 1), kwargs = {})
%_to_copy_4 : [num_users=1] = call_function[target=torch.ops.aten._to_copy.default](args = (%release_ds_param_primals_8_getitem_5,), kwargs = {dtype: torch.bfloat16})
...
return (..., getitem_1, getitem_2, ...)

Also, may I know the motivation behind the tensor clone in release_param when the inductor backend is used?

Good question! I observed that the Triton code generated by inductor released the buffer soon after release_param, which broke the result. I tried setting a flag to tell inductor to keep the buffer, but that didn't work in the end.

That sounds weird. I may be able to take a closer look when I have time.

@eternalNight
Contributor Author

@tohtana BTW, are we limited to calling only backward-compatible APIs of FX graphs? To get the output node of a graph, fx.Graph already provides output_node(). I was wondering why we define our own get_output_node.

@eternalNight eternalNight force-pushed the eternalNight/vla_enabling_fixes branch from 403c6ab to 222c426 on September 22, 2025 11:12
It is generally unsafe to assume that all parameters will be used and
have gradients. A counterexample is OpenVLA, which consists of two ViTs
and one LLM but only uses up to the second-to-last layer of both ViTs,
leaving the last layer of each ViT untrained. In such cases, the
following exception arises:

[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/passes/zero3_compile.py", line 195, in add_z3_gather_release
[rank0]:     return add_z3_gather_release_fw(gm,
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/passes/zero3_compile.py", line 135, in add_z3_gather_release_fw
[rank0]:     gm.graph = fast_free_schedule(
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/list_schedule.py", line 276, in fast_free_schedule
[rank0]:     node_to_last_use, user_to_last_uses = get_last_uses(graph)
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/util.py", line 260, in get_last_uses
[rank0]:     map_arg(node.args, lambda n: register_last_uses(n, node))
[rank0]:   File "/mnt/workspaces/banxing.mjj/openvla-venv/lib/python3.10/site-packages/torch/fx/node.py", line 862, in map_arg
[rank0]:     return _fx_map_arg(a, fn)
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/util.py", line 260, in <lambda>
[rank0]:     map_arg(node.args, lambda n: register_last_uses(n, node))
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/util.py", line 251, in register_last_uses
[rank0]:     user = node_to_last_use[user]
[rank0]: torch._dynamo.exc.BackendCompilerFailed: backend='backend_fn' raised:
[rank0]: KeyError: wait_allgather_ds_param__primals_179_177

Add allgather and reducescatter only for params that are used in the
compute graph and thus have gradients.
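
A minimal sketch of the filtering step (the helper name and arguments
below are illustrative, not the actual code in zero3_compile.py): only
parameter placeholders that have at least one user in the FX graph get
allgather/reduce-scatter nodes.

    import torch.fx as fx

    def used_param_placeholders(graph: fx.Graph, param_names: set) -> list:
        # Keep only parameter placeholders that some node actually consumes;
        # unused (hence untrained) params get no allgather/reduce-scatter nodes.
        return [
            n for n in graph.nodes
            if n.op == "placeholder" and n.name in param_names and len(n.users) > 0
        ]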

Signed-off-by: Junjie Mao <[email protected]>
Today, when to call end_backward() is inferred by counting the number
of gradient blocks that have been computed. That can be incorrect for
models composed of multiple modules that do not use all of their
parameters. An example is OpenVLA, which consists of two ViTs and one
LLM but only uses up to the second-to-last layer of both ViTs.

For generality, make backward graphs call end_backward() explicitly at
the end.
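
Roughly, the insertion looks like the following sketch (the
torch.ops.dc.end_backward.default handle and the graph_id argument are
assumptions here, not necessarily the exact names DeepCompile uses):

    import torch
    import torch.fx as fx

    def append_end_backward(graph: fx.Graph, graph_id: int) -> fx.Graph:
        # Append an explicit end_backward call right before the output node
        # of the backward graph, instead of counting gradient blocks.
        # torch.ops.dc.end_backward.default is assumed; substitute the real op.
        output = next(n for n in graph.nodes if n.op == "output")
        with graph.inserting_before(output):
            graph.call_function(torch.ops.dc.end_backward.default, args=(graph_id,))
        graph.lint()
        return graph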

Signed-off-by: Junjie Mao <[email protected]>
The release_param() API takes a tensor as its first argument, but a
user of a gathered param may return a tuple when the graph is compiled
in eager mode. That leads to the following error:

    Profiling error dc::release_param() Expected a value of type 'Tensor' for argument 'a' but instead found type 'tuple'.

Taking native_layer_norm as an example, the graph after inserting
allgather and release_param looks like the following:

    %native_layer_norm : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%clone, [1024], %wait_allgather_ds_param__primals_8_5, %wait_allgather_ds_param__primals_9_6, 1e-06), kwargs = {})
    %release_ds_param_primals_9_native_layer_norm_6 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%native_layer_norm, 140623563718032, 6, 1), kwargs = {})
    %release_ds_param_primals_8_native_layer_norm_5 : [num_users=3] = call_function[target=torch.ops.dc.release_param.default](args = (%release_ds_param_primals_9_native_layer_norm_6, 140623563718032, 5, 1), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%release_ds_param_primals_8_native_layer_norm_5, 0), kwargs = {})

To fix that exception, check whether a param user is consumed by
operator.getitem, and if it is, release the param after the first
getitem call, so that the graph above becomes:

    %native_layer_norm : [num_users=3] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%clone, [1024], %wait_allgather_ds_param__primals_8_5, %wait_allgather_ds_param__primals_9_6, 1e-06), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
    %release_ds_param_primals_9_getitem_6 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%getitem, 140032938687600, 6, 1), kwargs = {})
    %release_ds_param_primals_8_getitem_5 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%release_ds_param_primals_9_getitem_6, 140032938687600, 5, 1), kwargs = {})
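
The node selection boils down to something like the sketch below (the
helper name is illustrative, not the actual function in the DeepSpeed
pass):

    import operator
    import torch.fx as fx

    def release_anchor(param_user: fx.Node) -> fx.Node:
        # If the node that consumes the gathered param returns a tuple, hook
        # the release onto its first operator.getitem user so release_param
        # always receives a Tensor; otherwise release right after the user.
        getitems = [u for u in param_user.users
                    if u.op == "call_function" and u.target is operator.getitem]
        return getitems[0] if getitems else param_user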

A previous attempt tried to fix the issue by allowing an arbitrary type
(i.e. c10::IValue) as the first argument of release_param(), but it was
abandoned because it breaks the integration of deepcompile with the
inductor backend: allowing an arbitrary type prevents the torch
dispatcher from inferring the appropriate dispatch key and confuses the
fake-tensor dispatching logic in inductor, making it call
allgather_param_meta() for allgather_param but release_param() for
release_param.

Signed-off-by: Junjie Mao <[email protected]>
When an op consumes multiple gathered params (e.g. cat), the graph after
inserting allgathers can look like the following:

  cat = torch.ops.aten.cat.default([t1, t2]);
  release_1 = torch.ops.dc.release_param.default(cat, 1234, 2, 1);
  release_2 = torch.ops.dc.release_param.default(release_1, 1234, 1, 1);

Today fast_free_schedule() arranges nodes by iterating over the params
(in some order) and inserting prerequisites and users of each param. If
param 1 is visited before param 2 in the case above, the release_2 node
will be inserted into the scheduled list before release_1, leading to a
use-before-define error in the generated graph and giving rise to the
following exception:

[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/backend.py", line 288, in make_fw_graph
[rank0]:     run_opt_passes(
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/backend.py", line 197, in run_opt_passes
[rank0]:     gm_new = opt_pass_fn(gm, graph_id, graph_order, profiling_results, create_inputs_fn, mem_budget, param_manager,
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/passes/zero3_compile.py", line 195, in add_z3_gather_release
[rank0]:     return add_z3_gather_release_fw(gm,
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/passes/zero3_compile.py", line 135, in add_z3_gather_release_fw
[rank0]:     gm.graph = fast_free_schedule(
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/list_schedule.py", line 429, in fast_free_schedule
[rank0]:     ret_graph = make_graph_from_schedule(scheduled)
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/list_schedule.py", line 28, in make_graph_from_schedule
[rank0]:     new_node = new_graph.node_copy(node, lambda n: env[n.name])
[rank0]:   File "/mnt/workspaces/banxing.mjj/openvla-venv/lib/python3.10/site-packages/torch/fx/graph.py", line 1503, in node_copy
[rank0]:     args = map_arg(node.args, arg_transform)
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/list_schedule.py", line 28, in <lambda>
[rank0]:     new_node = new_graph.node_copy(node, lambda n: env[n.name])
[rank0]: torch._dynamo.exc.BackendCompilerFailed: backend='backend_fn' raised:
[rank0]: KeyError: 'release_ds_param_primals_7_cat_1'

Instead of adding the release node alone, also check whether any of its
dependencies are missing from the schedule and include them in the
schedule_until_free list as well.
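
A minimal sketch of that dependency check (helper and variable names
are illustrative, not the actual ones in list_schedule.py):

    import torch.fx as fx

    def schedule_with_deps(node: fx.Node, scheduled: list, scheduled_set: set) -> None:
        # Recursively pull in any input of this node that has not been
        # scheduled yet, so the resulting order never uses a value before
        # its defining node.
        for arg in node.all_input_nodes:
            if arg not in scheduled_set:
                schedule_with_deps(arg, scheduled, scheduled_set)
        scheduled.append(node)
        scheduled_set.add(node)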

Signed-off-by: Junjie Mao <[email protected]>
@eternalNight eternalNight force-pushed the eternalNight/vla_enabling_fixes branch from 222c426 to 5d2e1bb on September 22, 2025 11:14
@eternalNight
Contributor Author

The call to end_backward is added to z1 and z2 as well.

@sfc-gh-truwase sfc-gh-truwase requested review from tohtana and removed request for loadams and tjruwase September 22, 2025 13:07
Collaborator

@tohtana tohtana left a comment

@eternalNight Thank you for the clarification! This looks good.
Can you merge the latest master first? If you allow me to push changes to your branch, I will do it and merge.

@tohtana
Collaborator

tohtana commented Sep 23, 2025

@tohtana BTW, are we limited to calling only backward-compatible APIs of FX graphs? To get the output node of a graph, fx.Graph already provides output_node(). I was wondering why we define our own get_output_node.

I probably didn't find the official API when writing the code. Currently, I think DeepCompile works with v2.6 or later, but I'm open to supporting only newer versions if we have a good reason.

@eternalNight
Contributor Author

@tohtana BTW, are we limited to calling only backward-compatible APIs of FX graphs? To get the output node of a graph, fx.Graph already provides output_node(). I was wondering why we define our own get_output_node.

I probably didn't find the official API when writing the code. Currently, I think DeepCompile works with v2.6 or later, but I'm open to supporting only newer versions if we have a good reason.

The output_node API was introduced last November in v2.6.0a0. Let's leave it as is for this PR; we can switch to those APIs later once they stabilize.
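
For reference, the two approaches look roughly like this (the fallback
helper below is an illustrative sketch, not the exact DeepSpeed
implementation):

    import torch.fx as fx

    def get_output_node(graph: fx.Graph) -> fx.Node:
        # Portable fallback for older torch versions: the output node is the
        # unique node with op == "output".
        return next(n for n in graph.nodes if n.op == "output")

    # On torch >= 2.6 the same node is available directly:
    # out = graph.output_node()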

@tohtana tohtana enabled auto-merge (squash) September 23, 2025 07:06
@tohtana tohtana merged commit 8c7c56a into deepspeedai:master Sep 23, 2025
12 checks passed
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025