Utilization metrics across accelerators

antonibertel · March 7, 2026, 3:06pm

Hi guys,

I have often struggled to understand GPU utilization in Kubernetes clusters. Metrics are usually fragmented and difficult to collect, so I built an open-source DaemonSet, gpusprint (gpusprint/gpusprint on GitHub: "Measuring accelerator utilization on K8s"), that exports metrics across different accelerators in a unified format (OTLP, Prometheus). It helps answer questions such as:

Topology & Inventory

Active clusters and nodes
GPUs per node / instance type
GPU model distribution across the fleet

Utilization vs. Reservations

Average compute utilization (cluster / node)
“Zombie” GPUs (allocated but underutilized, or holding VRAM that goes unused)
Cluster allocation ratio (reserved / physical GPUs)
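
To make the reservation metrics above concrete, here is a minimal sketch of how an allocation ratio and a "zombie" check could be computed from per-GPU samples. The `GpuSample` record, its field names, and the 5% utilization threshold are all assumptions made up for illustration; they are not gpusprint's actual schema or defaults.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """Hypothetical per-GPU record; field names are illustrative only."""
    node: str
    gpu_index: int
    allocated: bool       # True if a pod has reserved this GPU
    compute_util: float   # average compute utilization over the window, 0-100


def allocation_ratio(samples):
    """Reserved GPUs divided by physical GPUs (0.0 for an empty fleet)."""
    if not samples:
        return 0.0
    reserved = sum(1 for s in samples if s.allocated)
    return reserved / len(samples)


def find_zombies(samples, util_threshold=5.0):
    """GPUs that are reserved but whose utilization is below the threshold.

    The 5% default is an arbitrary assumption; pick a value that fits
    your workloads.
    """
    return [
        (s.node, s.gpu_index)
        for s in samples
        if s.allocated and s.compute_util < util_threshold
    ]


samples = [
    GpuSample("node-a", 0, allocated=True, compute_util=82.0),
    GpuSample("node-a", 1, allocated=True, compute_util=1.5),
    GpuSample("node-b", 0, allocated=False, compute_util=0.0),
    GpuSample("node-b", 1, allocated=True, compute_util=40.0),
]

print("allocation ratio:", allocation_ratio(samples))
print("zombie GPUs:", find_zombies(samples))
```

In this toy fleet three of four GPUs are reserved, and the one reserved GPU sitting at 1.5% utilization is flagged. A real check would probably also look at memory usage and at how long the GPU has been idle.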

Team & Individual Efficiency

Team efficiency scores (utilization / reserved)
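
As a rough illustration of the "utilization / reserved" score, the sketch below weights each reservation's average utilization by the GPU-hours it reserved and then divides by the team's total reserved GPU-hours. The record layout `(team, reserved_gpu_hours, avg_util_percent)` and the team names are hypothetical, and the weighting is one plausible interpretation rather than the formula gpusprint necessarily uses.

```python
from collections import defaultdict


def team_efficiency(records):
    """Return {team: used GPU-hours / reserved GPU-hours}.

    Each record is (team, reserved_gpu_hours, avg_util_percent).
    "Used" GPU-hours are approximated as reserved hours scaled by the
    average utilization; this is an assumption for illustration.
    """
    reserved = defaultdict(float)
    used = defaultdict(float)
    for team, reserved_hours, avg_util_percent in records:
        reserved[team] += reserved_hours
        used[team] += reserved_hours * avg_util_percent / 100.0
    # Skip teams with no reserved hours to avoid dividing by zero.
    return {t: used[t] / reserved[t] for t in reserved if reserved[t] > 0}


records = [
    ("vision", 100.0, 80.0),  # hypothetical team names and numbers
    ("vision", 100.0, 40.0),
    ("nlp", 50.0, 10.0),
]

for team, score in sorted(team_efficiency(records).items()):
    print(f"{team}: {score:.2f}")
```

A score near 1.0 means the team is using most of what it reserves; a low score points at reservations that could be shrunk or shared.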

Examples of the charts and values you can get: https://gpusprint.com/

If you’ve faced a similar problem or want to deploy the DaemonSet, happy to chat.

Next, I plan to focus on automatic detection of unrecoverable failures (error codes, networking issues), improved health checks, and automatic remediation. Reach out if that problem is relevant to you.
