
[BUG] Schema of DeepCompile C++ ops does not reflect aliases #7596

@eternalNight

Description


Describe the bug

Today the schema of the DeepCompile C++ ops is defined as follows. In particular, every op that returns a tensor is declared as returning a brand-new one, which is incorrect: wait_allgather() and release_param() actually return their first tensor input without copying it, i.e. the output aliases the input.

    m.def("allgather_param(Tensor a, int graph_id, int id) -> Tensor");
    m.def("prefetch_params_fused(int graph_id, Tensor[] params, int[] ids) -> ()");
    m.def("wait_allgather(Tensor a, int graph_id, int id) -> Tensor");
    m.def("release_param(Tensor a, int graph_id, int id, int n_users) -> Tensor");
    m.def("reduce_grad(Tensor a, int graph_id, int id) -> Tensor");
    m.def("free_tensors(Tensor[] a) -> ()");
    m.def("offload_tensor(Tensor a, int id, int id) -> Tensor");
    m.def("reload_tensor(Tensor a, int id, int id) -> Tensor");
    m.def("wait_offload(Tensor a, int id, int id) -> Tensor");
    m.def("wait_reload(Tensor a, int id, int id) -> Tensor");
    m.def("offload_parameter(Tensor a, int id, int id) -> ()");
    m.def("reload_parameter(Tensor a, int id, int id) -> ()");
    m.def("end_backward(int graph_id) -> ()");

This causes those tensors to be freed earlier than expected by the additional `del` statements in Inductor-generated code, and eventually causes the training loss to become NaN.
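
One way the aliasing could be expressed, assuming the two ops really do return their first input unchanged, is PyTorch's schema alias annotations (the same `Tensor(a) ... -> Tensor(a)` notation used for view ops in native_functions.yaml). This is only a sketch, not a verified fix:

    // Sketch: annotate input and output with the same alias set `a` so the
    // compiler knows the result may share storage with the first argument.
    m.def("wait_allgather(Tensor(a) a, int graph_id, int id) -> Tensor(a)");
    m.def("release_param(Tensor(a) a, int graph_id, int id, int n_users) -> Tensor(a)");

If the kernels also mutate the tensor in place, the mutating annotation `Tensor(a!)` would be the appropriate form instead; whether Inductor's lifetime analysis fully honours alias annotations on custom ops would need to be confirmed separately.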

To Reproduce

1. Run `deepspeed --num_gpus=N openvla-like.py -c`

openvla-like.py

Expected behavior
With DeepCompile, the loss curves of the eager and inductor backends match.
