[Feature]: NVIDIA GPU passive health checks

### Problem

> Because of the gang scheduling semantics of ML jobs, failures have a large effect on the reliability of an entire job—a single failure of a system component can cause thousands of GPUs to sit idle. [...][T]he time between component failures may be small enough to be disruptive.

Source: https://arxiv.org/html/2410.21680v1



### Solution

Leverage NVIDIA DCGM [Background Health Checks](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#background-health-checks) to detect unhealthy GPUs passively, without interrupting running workloads.
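
To make the proposal concrete, here is a minimal sketch of how a passive check could wrap the `dcgmi health` CLI. It assumes `dcgmi` is on `PATH`, a DCGM host engine is running, and a GPU group already exists. The function names, the injected `runner`, and the regex that pulls out the "Overall Health" value are all hypothetical and not part of this project or of DCGM. The report layout can differ between DCGM versions, so treat the parsing as a placeholder.

```python
import re
import subprocess
from typing import Callable, List


def enable_watches_cmd(group_id: int) -> List[str]:
    # `-s a` asks DCGM to watch all health subsystems for the group.
    return ["dcgmi", "health", "-g", str(group_id), "-s", "a"]


def check_cmd(group_id: int) -> List[str]:
    # `-c` reports the health state gathered by the background watches.
    return ["dcgmi", "health", "-g", str(group_id), "-c"]


def run(cmd: List[str]) -> str:
    # Default runner: execute the command and return its stdout.
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def overall_health(report: str) -> str:
    # Assumption: the report contains a line such as
    # "Overall Health: Healthy"; fall back to "Unknown" otherwise.
    match = re.search(r"Overall Health\W+(\w+)", report)
    return match.group(1) if match else "Unknown"


def passive_health_check(group_id: int, runner: Callable[[List[str]], str] = run) -> str:
    # Poll the current health state; watches must have been enabled beforehand.
    return overall_health(runner(check_cmd(group_id)))
```

Watches would be enabled once per group with `run(enable_watches_cmd(group_id))`, after which `passive_health_check(group_id)` can be polled periodically. Injecting `runner` keeps the logic testable on machines without GPUs.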

### Workaround

_No response_

### Would you like to help us implement this feature by sending a PR?

Yes
