Hi guys,
I've often struggled to understand GPU utilization in Kubernetes clusters. Metrics are usually fragmented and difficult to collect, so I built an open-source DaemonSet, gpusprint/gpusprint on GitHub ("Measuring accelerator utilization on K8s"), that exports metrics from different accelerators in a unified format (OTLP, Prometheus). It helps answer questions such as:
Topology & Inventory
- Active clusters and nodes
- GPUs per node / instance type
- GPU model distribution across the fleet
Utilization vs. Reservations
- Average compute utilization (cluster / node)
- “Zombie” GPUs (allocated but underutilized or wasted VRAM)
- Cluster allocation ratio (reserved / physical GPUs)
Team & Individual Efficiency
- Team efficiency scores (utilization / reserved)
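To make the ratios in these lists concrete, here is a minimal Python sketch of how they could be computed from per-GPU samples. It is not gpusprint's code or schema: the `GpuSample` record, its field names, and the 5% zombie threshold are all assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GpuSample:
    """Hypothetical per-GPU record; not the exporter's real schema."""
    node: str
    gpu_id: str
    team: Optional[str]   # None means the GPU is not reserved by anyone
    utilization: float    # average compute utilization, 0.0 to 1.0


def allocation_ratio(samples: List[GpuSample]) -> float:
    """Cluster allocation ratio: reserved GPUs / physical GPUs."""
    if not samples:
        return 0.0
    reserved = sum(1 for s in samples if s.team is not None)
    return reserved / len(samples)


def zombie_gpus(samples: List[GpuSample], threshold: float = 0.05) -> List[GpuSample]:
    """Reserved GPUs whose utilization is below an (assumed) threshold."""
    return [s for s in samples if s.team is not None and s.utilization < threshold]


def team_efficiency(samples: List[GpuSample]) -> Dict[str, float]:
    """Per team: summed utilization / number of reserved GPUs."""
    by_team: Dict[str, List[float]] = defaultdict(list)
    for s in samples:
        if s.team is not None:
            by_team[s.team].append(s.utilization)
    return {team: sum(vals) / len(vals) for team, vals in by_team.items()}


# Made-up sample data for illustration only.
SAMPLES = [
    GpuSample("node-a", "gpu0", "ml", 0.80),
    GpuSample("node-a", "gpu1", "ml", 0.02),
    GpuSample("node-b", "gpu0", "research", 0.50),
    GpuSample("node-b", "gpu1", None, 0.00),
]

if __name__ == "__main__":
    print("allocation ratio:", allocation_ratio(SAMPLES))
    print("zombies:", [(s.node, s.gpu_id) for s in zombie_gpus(SAMPLES)])
    print("team efficiency:", team_efficiency(SAMPLES))
```

In this made-up data, three of the four GPUs are reserved, and the second GPU on node-a is flagged as a zombie because it is reserved but nearly idle. A real setup would also need a time window for the averages and a VRAM-based check, which this sketch leaves out.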
Examples of the charts and values you can get: https://gpusprint.com/
If you’ve faced a similar problem or want to deploy the DaemonSet, happy to chat.
Next, I plan to focus on automatic detection of unrecoverable failures (error codes, networking issues), improved health checks, and automatic remediation. Reach out if that problem is relevant to you.