@un-def (Collaborator) commented Aug 6, 2025

  • shim: Integrate libdcgm and add a new endpoint that returns overall GPU health with a list of incidents.
  • Periodically pull instance health from the shim and store the raw response in a new DB table. Infer overall instance health and store it in a new column of the "instances" table.
  • Don't consider failed instances for submitted jobs. Note: instances with warnings are still considered for jobs.
  • API: add a new method that returns a list of instance health checks with a unified structure.
  • CLI: display "warning" and "failure" health statuses in the same way as "unreachable", below the instance status.

Closes: #2930
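
For readers skimming the description, here is a minimal sketch of how "infer overall instance health" and "don't consider failed instances" could fit together. It is not the code from this PR: `HealthStatus`, `Instance`, `infer_health`, and `candidates_for_job` are hypothetical names, and the rule that overall health is the most severe incident status is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    WARNING = "warning"
    FAILURE = "failure"


# Severity order used to pick the worst status among incidents (assumed rule).
_SEVERITY = {
    HealthStatus.HEALTHY: 0,
    HealthStatus.WARNING: 1,
    HealthStatus.FAILURE: 2,
}


@dataclass
class Instance:
    name: str
    incidents: list  # hypothetical: one HealthStatus per reported incident


def infer_health(incidents):
    """Return the most severe incident status; no incidents means healthy."""
    return max(incidents, key=_SEVERITY.__getitem__, default=HealthStatus.HEALTHY)


def candidates_for_job(instances):
    """Skip failed instances; instances with warnings stay eligible."""
    return [
        inst
        for inst in instances
        if infer_health(inst.incidents) is not HealthStatus.FAILURE
    ]
```

With three instances, one clean, one with a warning, and one with a warning plus a failure, `candidates_for_job` keeps the first two and drops the third, which matches the note that warnings do not exclude an instance from scheduling.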

un-def added 4 commits August 6, 2025 16:59
* shim: Integrate libdcgm and add a new endpoint that returns overall
  GPU health with a list of incidents.
* Periodically pull instance health from the shim and store the raw
  response in a new DB table. Infer overall instance health and
  store it in a new column of the "instances" table.
* Don't consider failed instances for submitted jobs. Note: instances
  with warnings are still considered for jobs.
* API: add a new method that returns a list of instance health checks
  with a unified structure.
* CLI: display "warning" and "failure" health statuses in the same way
  as "unreachable", below the instance status.

Closes: #2930
cannot use _Ctype_long(ts) (value of type _Ctype_long)
as _Ctype_int64_t value in struct literal

cannot use _Ctype_ulong(0) (constant 0 of type _Ctype_ulong)
as _Ctype_uint64_t value in argument to (_Cfunc_dcgmPolicyRegister_v2)
@un-def requested a review from r4victor August 6, 2025 17:42
@un-def merged commit 28012cf into master Aug 7, 2025
25 checks passed
@un-def deleted the issue_2930_dcgm_passive_health_checks branch August 7, 2025 08:29
un-def added a commit that referenced this pull request Aug 7, 2025
#2952 broke macOS because of go-dcgm
@un-def mentioned this pull request Aug 7, 2025
jvstme added a commit that referenced this pull request Aug 7, 2025
Port changes from #2936 and #2952
jvstme added a commit that referenced this pull request Aug 7, 2025
Linked issue: [Feature]: NVIDIA GPU passive health checks
