Deepcompile: Fix bugs when applying deepcompile to VLA-like models #7569
Conversation
@tohtana Would you please help review these changes? Also, may I know the motivation behind the tensor clone in release_param when the inductor backend is used?
Force-pushed from fe3ddd3 to 5f1a3b6
Update patch 3 to take an alternative approach as the previous one breaks deepcompile + inductor. Refer to the commit message for more information.
Hi @eternalNight,
I think there might be an operator that requires other elements of
Good question! I observed that the Triton code generated by inductor releases the buffer soon after
Good point. Will insert end_backward() call in zero1_compile.py, too.
Yes, the other elements in the returned tuple are also used. Those elements are fetched from
That sounds weird. I may be able to take a closer look when I have time.
@tohtana BTW, are we limited to calling only backward-compatible APIs of FX graphs? To get the output node of a graph,
Force-pushed from 403c6ab to 222c426
It is generally unsafe to assume that all parameters will be used and have gradients. A counterexample is OpenVLA, which consists of two ViTs and one LLM but uses only the second-to-last layer of both ViTs, leaving the last layer of each ViT untrained. In such cases, the following exception arises:

[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/passes/zero3_compile.py", line 195, in add_z3_gather_release
[rank0]:     return add_z3_gather_release_fw(gm,
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/passes/zero3_compile.py", line 135, in add_z3_gather_release_fw
[rank0]:     gm.graph = fast_free_schedule(
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/list_schedule.py", line 276, in fast_free_schedule
[rank0]:     node_to_last_use, user_to_last_uses = get_last_uses(graph)
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/util.py", line 260, in get_last_uses
[rank0]:     map_arg(node.args, lambda n: register_last_uses(n, node))
[rank0]:   File "/mnt/workspaces/banxing.mjj/openvla-venv/lib/python3.10/site-packages/torch/fx/node.py", line 862, in map_arg
[rank0]:     return _fx_map_arg(a, fn)
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/util.py", line 260, in <lambda>
[rank0]:     map_arg(node.args, lambda n: register_last_uses(n, node))
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/util.py", line 251, in register_last_uses
[rank0]:     user = node_to_last_use[user]
[rank0]: torch._dynamo.exc.BackendCompilerFailed: backend='backend_fn' raised:
[rank0]: KeyError: wait_allgather_ds_param__primals_179_177

Add allgather and reducescatter only for params that are used in the compute graph and thus have gradients.

Signed-off-by: Junjie Mao <[email protected]>
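For illustration, a minimal sketch of the filtering idea described above; the function and variable names here are hypothetical and do not match the real zero3_compile.py helpers:

```python
# Hypothetical sketch (names do not match the real zero3_compile.py helpers):
# keep only the parameter placeholders that are actually consumed by the graph,
# so allgather/reduce-scatter nodes are never inserted for untrained params.
import torch.fx as fx


def used_param_placeholders(gm: fx.GraphModule, param_names: set) -> list:
    used = []
    for node in gm.graph.nodes:
        if node.op == "placeholder" and node.target in param_names:
            # A param with no users produces no gradient; skipping it avoids
            # scheduling nodes (e.g. wait_allgather_...) that never appear in
            # the final graph and would later trigger the KeyError above.
            if len(node.users) > 0:
                used.append(node)
    return used
```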
Today, the time to call end_backward() is inferred by counting the number of gradient blocks that have been calculated. That can be incorrect for models consisting of multiple modules that do not use all of their parameters. An example is OpenVLA, which consists of two ViTs and one LLM but uses only the second-to-last layer of both ViTs.

For generality, make backward graphs call end_backward() explicitly at the end.

Signed-off-by: Junjie Mao <[email protected]>
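A minimal sketch of what "call end_backward() explicitly" could look like as an FX pass. This assumes an op registered as torch.ops.dc.end_backward that takes the graph id; the actual op name, signature, and insertion point in DeepCompile may differ:

```python
# Sketch only: assumes an op registered as torch.ops.dc.end_backward that takes
# the graph id. The real op name, signature, and insertion point in DeepCompile
# may differ from this illustration.
import torch
import torch.fx as fx


def insert_end_backward(gm: fx.GraphModule, graph_id: int) -> fx.GraphModule:
    # Place the explicit end_backward() call right before the backward graph's
    # output node, so it runs after every gradient in this graph is produced.
    output_node = next(n for n in gm.graph.nodes if n.op == "output")
    with gm.graph.inserting_before(output_node):
        gm.graph.call_function(torch.ops.dc.end_backward.default, args=(graph_id,))
    gm.graph.lint()
    gm.recompile()
    return gm
```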
The release_param() API takes a tensor as its first argument, but a user
of a gathered param may return a tuple when the graph is compiled in
eager mode. That will lead to the following error:

    Profiling error dc::release_param() Expected a value of type 'Tensor' for argument 'a' but instead found type 'tuple'.
Taking native_layer_norm as an example, the graph after inserting
allgather and release_param nodes looks like the following:
%native_layer_norm : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%clone, [1024], %wait_allgather_ds_param__primals_8_5, %wait_allgather_ds_param__primals_9_6, 1e-06), kwargs = {})
%release_ds_param_primals_9_native_layer_norm_6 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%native_layer_norm, 140623563718032, 6, 1), kwargs = {})
%release_ds_param_primals_8_native_layer_norm_5 : [num_users=3] = call_function[target=torch.ops.dc.release_param.default](args = (%release_ds_param_primals_9_native_layer_norm_6, 140623563718032, 5, 1), kwargs = {})
%getitem : [num_users=1] = call_function[target=operator.getitem](args = (%release_ds_param_primals_8_native_layer_norm_5, 0), kwargs = {})
To fix that error, check whether a param's user is consumed by
operator.getitem, and if it is, release the param after the first
getitem call. The graph above then becomes:
%native_layer_norm : [num_users=3] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%clone, [1024], %wait_allgather_ds_param__primals_8_5, %wait_allgather_ds_param__primals_9_6, 1e-06), kwargs = {})
%getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
%release_ds_param_primals_9_getitem_6 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%getitem, 140032938687600, 6, 1), kwargs = {})
%release_ds_param_primals_8_getitem_5 : [num_users=1] = call_function[target=torch.ops.dc.release_param.default](args = (%release_ds_param_primals_9_getitem_6, 140032938687600, 5, 1), kwargs = {})
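A minimal sketch of the anchor-selection logic described above (a hypothetical helper, not the actual pass code): if the consumer of a gathered param feeds operator.getitem nodes, the release is anchored after the first getitem so that release_param() always receives a Tensor.

```python
# Hypothetical helper illustrating the anchor selection; not the actual pass code.
import operator
import torch.fx as fx


def release_anchor(user: fx.Node) -> fx.Node:
    """Node after which dc.release_param should be inserted for a gathered param."""
    getitem_users = [
        u for u in user.users
        if u.op == "call_function" and u.target is operator.getitem
    ]
    if getitem_users:
        # The consumer returns a tuple (e.g. native_layer_norm): anchoring the
        # release after the first getitem keeps release_param()'s input a Tensor.
        return getitem_users[0]
    return user
```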
A previous attempt tried to fix this issue by allowing an arbitrary type
(i.e., c10::IValue) as the first argument of release_param(), but it was
abandoned because it breaks the integration of deepcompile with the inductor
backend. Allowing an arbitrary type prevents the torch dispatcher from
inferring the appropriate dispatch key and somehow confuses the fake-tensor
dispatching logic in inductor, making it call allgather_param_meta() for
allgather_param but release_param() for release_param.
Signed-off-by: Junjie Mao <[email protected]>
When an op consumes multiple gathered params (e.g. cat), the graph after inserting allgathers and releases can look like the following:

    cat = torch.ops.aten.cat.default([t1, t2])
    release_1 = torch.ops.dc.release_param.default(cat, 1234, 2, 1)
    release_2 = torch.ops.dc.release_param.default(release_1, 1234, 1, 1)

Today fast_free_schedule() arranges nodes by iterating over the params (in some order) and inserting the prerequisites and users of each param. If param 1 is visited before param 2 in the case above, the release_2 node will be inserted into the scheduled list before release_1, leading to a use-before-define error in the generated graph and giving rise to the following exception:

[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/backend.py", line 288, in make_fw_graph
[rank0]:     run_opt_passes(
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/backend.py", line 197, in run_opt_passes
[rank0]:     gm_new = opt_pass_fn(gm, graph_id, graph_order, profiling_results, create_inputs_fn, mem_budget, param_manager,
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/passes/zero3_compile.py", line 195, in add_z3_gather_release
[rank0]:     return add_z3_gather_release_fw(gm,
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/passes/zero3_compile.py", line 135, in add_z3_gather_release_fw
[rank0]:     gm.graph = fast_free_schedule(
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/list_schedule.py", line 429, in fast_free_schedule
[rank0]:     ret_graph = make_graph_from_schedule(scheduled)
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/list_schedule.py", line 28, in make_graph_from_schedule
[rank0]:     new_node = new_graph.node_copy(node, lambda n: env[n.name])
[rank0]:   File "/mnt/workspaces/banxing.mjj/openvla-venv/lib/python3.10/site-packages/torch/fx/graph.py", line 1503, in node_copy
[rank0]:     args = map_arg(node.args, arg_transform)
[rank0]:   File "/mnt/engines/deepspeed/deepspeed/compile/list_schedule.py", line 28, in <lambda>
[rank0]:     new_node = new_graph.node_copy(node, lambda n: env[n.name])
[rank0]: torch._dynamo.exc.BackendCompilerFailed: backend='backend_fn' raised:
[rank0]: KeyError: 'release_ds_param_primals_7_cat_1'

Instead of adding the release node alone, also check whether any of its dependencies are missing from the schedule and include them in the schedule_until_free list as well.

Signed-off-by: Junjie Mao <[email protected]>
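A minimal, simplified sketch of the dependency check described in the last paragraph; the names are hypothetical and the real logic in list_schedule.py differs in detail:

```python
# Hypothetical, simplified version of the dependency check; the real logic lives
# in fast_free_schedule() / list_schedule.py and differs in detail.
import torch.fx as fx


def schedule_with_deps(node: fx.Node, scheduled: list, scheduled_set: set) -> None:
    # Recursively schedule any unscheduled inputs first (e.g. release_1 before
    # release_2 in the cat example above), so the rebuilt graph never references
    # a node before it is defined.
    for dep in node.all_input_nodes:
        if dep not in scheduled_set:
            schedule_with_deps(dep, scheduled, scheduled_set)
    if node not in scheduled_set:
        scheduled.append(node)
        scheduled_set.add(node)
```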
Force-pushed from 222c426 to 5d2e1bb
Call to
@eternalNight Thank you for the clarification! This looks good.
Can you merge the latest master first? If you grant me permission to push changes to your branch, I will do it and merge.
Probably I didn't find the official API when writing the code. Currently, I think DeepCompile works with PyTorch v2.6 or later, but I'm flexible to support only newer versions if we have a good reason.
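For reference, a version-agnostic way to fetch the output node of an FX graph is to iterate over graph.nodes and match op == "output"; newer PyTorch releases also provide helpers such as Graph.find_nodes, but plain iteration works across versions. A small sketch:

```python
# Version-agnostic way to get the output node of an FX graph: iterate and match
# op == "output". Newer PyTorch releases also offer helpers (e.g. Graph.find_nodes),
# but plain iteration works across versions.
import torch
import torch.fx as fx


def get_output_node(graph: fx.Graph) -> fx.Node:
    return next(n for n in graph.nodes if n.op == "output")


gm = fx.symbolic_trace(torch.nn.Linear(4, 4))
print(get_output_node(gm.graph))  # the node wrapping the traced module's return value
```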
Describe the bug
When applying deepcompile to the OpenVLA model (which is composed of two vision transformers and a llama-7B), I met the following issues:
a. Not all parameters are trained, which leads to compile-time exceptions as well as incorrect invocation of endBackward().
b. release_param() can be passed a tuple, not a tensor.
c. A use-before-define error in fast_free_schedule().
This PR attempts to fix all of those issues. Patches 1~2 resolve a, 3 resolves b, and 4 resolves c.
To Reproduce the issues
Use this script: https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3
deepspeed --num_gpus=N openvla-like.py -c