Utilization metrics across accelerators

Hi guys,

I've often struggled to understand GPU utilization in Kubernetes clusters. Metrics are usually fragmented and difficult to collect, so I built an open-source DaemonSet, gpusprint (gpusprint/gpusprint on GitHub), that exports metrics across different accelerators in a unified format (OTLP, Prometheus). It helps answer questions such as:

Topology & Inventory

  • Active clusters and nodes

  • GPUs per node / instance type

  • GPU model distribution across the fleet

Utilization vs. Reservations

  • Average compute utilization (cluster / node)

  • “Zombie” GPUs (allocated but underutilized or wasted VRAM)

  • Cluster allocation ratio (reserved / physical GPUs)
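
To make the last two items concrete, here is a rough Python sketch of how the allocation ratio and a "zombie" check could be computed from per-GPU samples. This is not gpusprint's code: the `GpuSample` record, its field names, and the 5% / 10% thresholds are all assumptions made up for illustration, and "zombie" is interpreted here as "allocated, but with very low compute or very low VRAM usage".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuSample:
    """Hypothetical per-GPU record; field names are illustrative only."""
    node: str
    gpu_index: int
    allocated: bool        # True if a pod has reserved this GPU
    compute_util: float    # average compute utilization over a window, 0.0-1.0
    vram_used_frac: float  # fraction of VRAM in use, 0.0-1.0


def allocation_ratio(samples: List[GpuSample]) -> float:
    """Reserved GPUs divided by physical GPUs."""
    if not samples:
        return 0.0
    reserved = sum(1 for s in samples if s.allocated)
    return reserved / len(samples)


def zombie_gpus(
    samples: List[GpuSample],
    compute_threshold: float = 0.05,  # assumed cutoff, tune per workload
    vram_threshold: float = 0.10,     # assumed cutoff, tune per workload
) -> List[GpuSample]:
    """Allocated GPUs whose compute or VRAM usage falls below the thresholds."""
    return [
        s for s in samples
        if s.allocated
        and (s.compute_util < compute_threshold
             or s.vram_used_frac < vram_threshold)
    ]


samples = [
    GpuSample("node-a", 0, True, 0.90, 0.80),   # busy
    GpuSample("node-a", 1, True, 0.01, 0.50),   # reserved, almost no compute
    GpuSample("node-b", 0, False, 0.00, 0.00),  # free, not a zombie
    GpuSample("node-b", 1, True, 0.60, 0.02),   # reserved, VRAM barely used
]

ratio = allocation_ratio(samples)
zombies = [(s.node, s.gpu_index) for s in zombie_gpus(samples)]
print(ratio, zombies)
```

In practice the thresholds would depend on the workload (an inference server idling between requests is not necessarily wasted), so treat them as knobs rather than fixed values.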

Team & Individual Efficiency

  • Team efficiency scores (utilization / reserved)
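
The efficiency score is just used capacity over reserved capacity, aggregated per team. A minimal sketch, assuming the input is a list of `(team, reserved_gpu_hours, used_gpu_hours)` tuples; that shape and the function name are hypothetical and not something the DaemonSet necessarily exposes:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def team_efficiency(
    records: Iterable[Tuple[str, float, float]],
) -> Dict[str, float]:
    """Return used / reserved GPU-hours per team (hypothetical input shape)."""
    reserved: Dict[str, float] = defaultdict(float)
    used: Dict[str, float] = defaultdict(float)
    for team, reserved_hours, used_hours in records:
        reserved[team] += reserved_hours
        used[team] += used_hours
    # Skip teams with no reservation to avoid dividing by zero.
    return {t: used[t] / reserved[t] for t in reserved if reserved[t] > 0}


records = [
    ("ml-research", 10.0, 5.0),
    ("ml-research", 10.0, 10.0),
    ("platform", 4.0, 1.0),
]
scores = team_efficiency(records)
print(scores)
```

Summing hours before dividing weights each team's score by how much it reserved, so one small, idle reservation does not dominate the result the way averaging per-job ratios would.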

Examples of the charts and values you can get: https://gpusprint.com/

If you’ve faced a similar problem or want to deploy the DaemonSet, I'm happy to chat.

Next, I plan to focus on automatic detection of unrecoverable failures (error codes, networking issues), improved health checks, and automatic remediation. Reach out if that problem is relevant to you.