[Feature]: NVIDIA GPU passive health checks #2930

@un-def

Problem

Because of the gang scheduling semantics of ML jobs, failures have a large effect on the reliability of an entire job—a single failure of a system component can cause thousands of GPUs to sit idle. [...][T]he time between component failures may be small enough to be disruptive.

https://arxiv.org/html/2410.21680v1

Solution

Leverage NVIDIA DCGM Background Health Checks.
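As a rough illustration of what a passive check could look like, here is a minimal sketch that shells out to the `dcgmi health` CLI and extracts the overall status. It is not a proposed implementation. It assumes that health watches have already been enabled (for example with `dcgmi health -s a`), that group `0` covers all GPUs on the host, and that the text report contains a line of the form `Overall Health: <status>`. The function names and the `"unknown"` fallback are hypothetical.

```python
import re
import subprocess
from typing import Callable, Optional

# Assumption: the text report printed by `dcgmi health -c` contains
# a line such as "Overall Health: Healthy" (or Warning / Failure).
_OVERALL_RE = re.compile(r"Overall Health:\s*(\w+)")


def parse_overall_health(report: str) -> Optional[str]:
    """Return the lowercased overall status, or None if no status line is found."""
    match = _OVERALL_RE.search(report)
    return match.group(1).lower() if match else None


def run_dcgmi_health_check(group_id: int = 0) -> str:
    """Run a passive health check for a DCGM group and return the raw report.

    Assumption: group 0 is the default group containing all GPUs, and
    watches were enabled beforehand with `dcgmi health -s a`.
    """
    result = subprocess.run(
        ["dcgmi", "health", "-g", str(group_id), "-c"],
        capture_output=True,
        text=True,
        check=False,
        timeout=30,
    )
    return result.stdout


def check_gpu_health(runner: Callable[[], str] = run_dcgmi_health_check) -> str:
    """Map the report to a status string; "unknown" if it cannot be parsed."""
    status = parse_overall_health(runner())
    return status if status is not None else "unknown"
```

The `runner` parameter is injectable so the parsing logic can be exercised on hosts without DCGM installed. A real integration would more likely use the DCGM API or structured output instead of parsing the text report.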

Workaround

No response

Would you like to help us implement this feature by sending a PR?

Yes
