Describe the bug
Today, the schemas of the DeepCompile C++ ops are defined as follows. In particular, every op that returns a tensor is declared as returning a brand-new tensor, which is incorrect: wait_allgather() and release_param() actually return their first tensor input without copying it.
m.def("allgather_param(Tensor a, int graph_id, int id) -> Tensor");
m.def("prefetch_params_fused(int graph_id, Tensor[] params, int[] ids) -> ()");
m.def("wait_allgather(Tensor a, int graph_id, int id) -> Tensor");
m.def("release_param(Tensor a, int graph_id, int id, int n_users) -> Tensor");
m.def("reduce_grad(Tensor a, int graph_id, int id) -> Tensor");
m.def("free_tensors(Tensor[] a) -> ()");
m.def("offload_tensor(Tensor a, int id, int id) -> Tensor");
m.def("reload_tensor(Tensor a, int id, int id) -> Tensor");
m.def("wait_offload(Tensor a, int id, int id) -> Tensor");
m.def("wait_reload(Tensor a, int id, int id) -> Tensor");
m.def("offload_parameter(Tensor a, int id, int id) -> ()");
m.def("reload_parameter(Tensor a, int id, int id) -> ()");
m.def("end_backward(int graph_id) -> ()");
This causes those tensors to be freed earlier than expected by the extra `del` statements that Inductor emits in its generated code, which eventually makes the training loss become NaN.
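A minimal sketch of how the aliasing could be declared, assuming PyTorch's standard schema alias annotations (this is an illustration, not necessarily the fix adopted for this issue): putting the first tensor argument and the return value in the same alias set tells the compiler stack that the op hands back its input rather than a fresh tensor.

```cpp
// Sketch only: alias annotations for the two ops that return their first
// tensor input unchanged. "Tensor(a)" places the argument and the result in
// the same alias set, so Inductor will not treat the output as an
// independently allocated tensor that it may free separately from the input.
m.def("wait_allgather(Tensor(a) a, int graph_id, int id) -> Tensor(a)");
m.def("release_param(Tensor(a) a, int graph_id, int id, int n_users) -> Tensor(a)");
```

The other direction would presumably be to keep the current `-> Tensor` declarations and have the ops return real copies, at the cost of extra allocations.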
To Reproduce
- Run `deepspeed --num_gpus=N openvla-like.py -c`
Expected behavior
With DeepCompile, the loss curves of the eager and Inductor backends should match.