Problem
Because of the gang scheduling semantics of ML jobs, failures have a large effect on the reliability of an entire job—a single failure of a system component can cause thousands of GPUs to sit idle. [...][T]he time between component failures may be small enough to be disruptive.
https://arxiv.org/html/2410.21680v1
Solution
Leverage NVIDIA DCGM Background Health Checks.
Workaround
No response
Would you like to help us implement this feature by sending a PR?
Yes