[Bug]: Podman: nvidia metrics broken since 2.8.1 #21353

@HolgerHees

Bug description

I recently updated from version 2.8.0 to 2.8.1, and the NVIDIA metrics no longer work.

I'm running the latest container with Podman.

nvidia-smi works inside the container:

root@marvin:/usr/libexec/netdata/plugins.d# nvidia-smi 
Wed Nov 26 11:21:42 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3050        On  |   00000000:01:00.0 Off |                  N/A |
| 30%   44C    P2             19W /   70W |     535MiB /   6144MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           29046      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          260MiB |
|    0   N/A  N/A           29059      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          260MiB |
+-----------------------------------------------------------------------------------------+

When I run ./go.d.plugin -d -m nvidia_smi, I get the following logs:

root@marvin:/usr/libexec/netdata/plugins.d# ./go.d.plugin -d -m nvidia_smi
DBG godplugin/main.go:63 plugin: name=go.d, version=v2.8.1 user_config_dir=/etc/netdata stock_config_dir=/usr/lib/netdata/conf.d plugins_dir=/usr/libexec/netdata/plugins.d netdata_bin_dir=/usr/sbin component=agent
DBG godplugin/main.go:65 current user: name=root, uid=0 component=agent
INF godplugin/main.go:69 env HTTP_PROXY '', HTTPS_PROXY '' component=agent
INF godplugin/main.go:71 directories → config: [/etc/netdata /usr/lib/netdata/conf.d] | collectors: [/etc/netdata/go.d /usr/lib/netdata/conf.d/go.d] | sd: [/etc/netdata/go.d/sd /usr/lib/netdata/conf.d/go.d/sd] | varlib:  component=agent
INF agent/agent.go:213 instance is started component=agent
INF agent/setup.go:23 loading config file component=agent
DBG agent/setup.go:31 looking for 'go.d.conf' in [/etc/netdata /usr/lib/netdata/conf.d] component=agent
INF agent/setup.go:38 found '/etc/netdata/go.d.conf component=agent
INF agent/setup.go:45 config successfully loaded component=agent
INF agent/agent.go:217 using config: enabled 'true', default_run 'false', max_procs '0' component=agent
INF agent/setup.go:50 loading modules component=agent
INF agent/setup.go:73 enabled/registered modules: 1/125 component=agent
INF agent/setup.go:79 building discovery config component=agent
DBG agent/setup.go:109 looking for 'nvidia_smi.conf' in [/etc/netdata/go.d /usr/lib/netdata/conf.d/go.d] component=agent
DBG agent/setup.go:125 found '/usr/lib/netdata/conf.d/go.d/nvidia_smi.conf component=agent
INF agent/setup.go:130 dummy/read/watch paths: 0/1/0 component=agent
INF discovery/manager.go:116 registered discoverers: [file discovery: [file reader] service discovery] component="discovery manager"
DBG agent/setup.go:153 looking for 'vnodes/' in [/etc/netdata /usr/lib/netdata/conf.d] component=agent
DBG vnodes/vnodes.go:99 '/usr/lib/netdata/conf.d/vnodes' is not a regular file, skipping it component=vnodes
INF agent/setup.go:164 found '/usr/lib/netdata/conf.d/vnodes' (0 vhosts) component=agent
INF discovery/manager.go:61 instance is started component="discovery manager"
INF functions/manager.go:49 instance is started component="functions manager"
INF jobmgr/manager.go:97 instance is started component="job manager"
DBG functions/ext.go:62 registering function 'config' with prefix 'go.d:collector:' component="functions manager"
DBG functions/ext.go:62 registering function 'config' with prefix 'go.d:vnode' component="functions manager"
CONFIG go.d:vnode create accepted template /collectors/go.d/Vnodes internal 'internal' 'add schema userconfig test' 0x0000 0x0000

CONFIG go.d:collector:nvidia_smi create accepted template /collectors/go.d/Jobs internal 'internal' 'add schema enable disable test userconfig' 0x0000 0x0000

INF sd/sd.go:66 instance is started component="service discovery"
INF file/discovery.go:69 instance is started component=discovery discoverer=file
INF file/read.go:48 instance is started component=discovery discoverer=file
INF file/read.go:49 instance is stopped component=discovery discoverer=file
DBG jobmgr/manager.go:144 received configs: 1/+1/-0 ('/usr/lib/netdata/conf.d/go.d/nvidia_smi.conf') component="job manager"
CONFIG go.d:collector:nvidia_smi:nvidia_smi create accepted job /collectors/go.d/Jobs stock 'discoverer=file_reader,file=/usr/lib/netdata/conf.d/go.d/nvidia_smi.conf' 'schema get enable disable update restart test userconfig' 0x0000 0x0000

DBG jobmgr/manager.go:311 creating nvidia_smi[nvidia_smi] job, config: map[__provider__:file reader __source__:discoverer=file_reader,file=/usr/lib/netdata/conf.d/go.d/nvidia_smi.conf __source_type__:stock autodetection_retry:0 module:nvidia_smi name:nvidia_smi priority:70000 update_every:10] component="job manager"
DBG nvidia_smi/exec.go:98 executing '/usr/sbin/nd-run /usr/bin/nvidia-smi -q -x -l 5' collector=nvidia_smi job=nvidia_smi
INF pipeline/pipeline.go:144 instance is started component="service discovery" pipeline=docker
INF dockersd/docker.go:100 instance is started component="service discovery" discoverer=docker
ERR dockersd/docker.go:117 Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? component="service discovery" discoverer=docker
INF dockersd/docker.go:101 instance is stopped component="service discovery" discoverer=docker
INF pipeline/accumulator.go:92 discoverer 'sd:docker' exited before ctx done component="service discovery" pipeline=docker
INF pipeline/accumulator.go:61 all discoverers exited before ctx done component="service discovery" pipeline=docker
DBG pipeline/pipeline.go:165 received 0 target groups component="service discovery" pipeline=docker
INF pipeline/pipeline.go:163 instance is stopped component="service discovery" pipeline=docker
ERR module/job.go:244 init failed: process exited before the first sample was collected collector=nvidia_smi job=nvidia_smi
HOST ''

HOST ''

CONFIG go.d:collector:nvidia_smi:nvidia_smi delete

INF pipeline/pipeline.go:144 instance is started component="service discovery" pipeline="network listeners"
INF sd/sd.go:116 pipeline config is disabled 'snmp' (/usr/lib/netdata/conf.d/go.d/sd/snmp.conf) component="service discovery"
INF netlistensd/netlisteners.go:103 instance is started component="service discovery" discoverer=net_listeners
DBG netlistensd/netlisteners.go:104 used config: interval: 2m0s, timeout: 5s, cache expiration time: 10m0s component="service discovery" discoverer=net_listeners
DBG ndexec/ndexec.go:72 executing: /usr/sbin/nd-run /usr/libexec/netdata/plugins.d/local-listeners no-udp6 no-local no-inbound no-outbound no-namespaces
^CINF agent/agent.go:170 received interrupt signal (2). Terminating... component=agent
INF netlistensd/netlisteners.go:105 instance is stopped component="service discovery" discoverer=net_listeners
INF file/discovery.go:70 instance is stopped component=discovery discoverer=file
DBG functions/ext.go:78 unregistering function 'config' with prefix 'go.d:collector:' component="functions manager"
INF functions/manager.go:50 instance is stopped component="functions manager"
DBG functions/ext.go:78 unregistering function 'config' with prefix 'go.d:vnode' component="functions manager"
INF jobmgr/manager.go:98 instance is stopped component="job manager"
INF pipeline/accumulator.go:53 all discoverers exited component="service discovery" pipeline="network listeners"
INF pipeline/pipeline.go:161 instance is stopped component="service discovery" pipeline="network listeners"
INF sd/sd.go:67 instance is stopped component="service discovery"
INF discovery/manager.go:62 instance is stopped component="discovery manager"
INF agent/agent.go:214 instance is stopped component=agent
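
The log shows the collector starting '/usr/sbin/nd-run /usr/bin/nvidia-smi -q -x -l 5' and then failing with "process exited before the first sample was collected". The following is a minimal sketch, assuming Python 3 is available inside the container, of how one might check by hand whether that command exits early and what it writes to stderr. The names `probe`, `COLLECTOR_CMD`, and `wait_seconds` are hypothetical and not part of Netdata; the command itself is copied from the exec.go:98 debug line.

```python
import subprocess

# Command copied verbatim from the "executing ..." debug line in the log.
COLLECTOR_CMD = ["/usr/sbin/nd-run", "/usr/bin/nvidia-smi", "-q", "-x", "-l", "5"]


def probe(cmd, wait_seconds=3.0):
    """Start cmd and report whether it exits before wait_seconds elapse.

    Returns (exited_early, returncode, stderr_text). returncode is None
    when the process was still running at the deadline.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,  # the XML output itself is not inspected here
        stderr=subprocess.PIPE,
        text=True,
    )
    try:
        _, err = proc.communicate(timeout=wait_seconds)
        # The process ended on its own before the deadline.
        return True, proc.returncode, err.strip()
    except subprocess.TimeoutExpired:
        # Still running at the deadline, as expected for loop mode ("-l 5").
        proc.kill()
        proc.communicate()  # reap the killed process
        return False, None, ""
```

Calling `probe(COLLECTOR_CMD)` inside the container would return `exited_early` as true, together with the exit code and stderr, if the wrapped command terminates right away. Running it a second time with `COLLECTOR_CMD[1:]` (without `nd-run`) might help narrow down whether the wrapper or `nvidia-smi` itself is the part that exits.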

Expected behavior

NVIDIA-related metrics should be visible.

Steps to reproduce

  1. Run the latest Netdata container.
  2. Install nvidia-smi inside the container.
  3. Run Netdata and check the logs.
    ...
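
For the "check logs" step, the debug output is long and most of it is unrelated to the failure. The following is a rough sketch, assuming the `LEVEL file.go:line message key=value` line layout seen in the log above, of how the error lines for one collector could be picked out. `collector_errors` and `LINE_RE` are hypothetical helper names, not part of Netdata.

```python
import re

# Matches lines such as "ERR module/job.go:244 init failed: ... collector=nvidia_smi".
# Lines without a level and a Go source location (CONFIG, HOST, ...) do not match.
LINE_RE = re.compile(r"^(?P<level>[A-Z]{3}) (?P<source>\S+\.go:\d+) (?P<rest>.*)$")


def collector_errors(lines, collector="nvidia_smi"):
    """Return (source, rest) for every ERR line tagged with the given collector."""
    tag = f"collector={collector}"
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        # Compare whole whitespace-separated tokens so that, for example,
        # "collector=nvidia_smi2" would not be treated as a match.
        if m and m["level"] == "ERR" and tag in m["rest"].split():
            hits.append((m["source"], m["rest"]))
    return hits
```

Feeding it the saved output of `./go.d.plugin -d -m nvidia_smi`, for instance via `collector_errors(open(path))` with a hypothetical `path`, would skip the unrelated Docker service-discovery error and keep only the entries tagged `collector=nvidia_smi`.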

Installation method

docker

System info

Linux marvin 6.4.0-150600.23.78-default #1 SMP PREEMPT_DYNAMIC Thu Nov  6 21:50:11 UTC 2025 (80d92ac) x86_64 x86_64 x86_64 GNU/Linux
/etc/os-release:NAME="openSUSE Leap"
/etc/os-release:VERSION="15.6"
/etc/os-release:ID="opensuse-leap"
/etc/os-release:ID_LIKE="suse opensuse"
/etc/os-release:VERSION_ID="15.6"
/etc/os-release:PRETTY_NAME="openSUSE Leap 15.6"
/etc/os-release:ANSI_COLOR="0;32"
/etc/os-release:CPE_NAME="cpe:/o:opensuse:leap:15.6"
/etc/os-release:LOGO="distributor-logo-Leap"

Netdata build info

root@marvin:/usr/libexec/netdata/plugins.d# netdata -W buildinfo
time=2025-11-26T11:24:55.315+01:00 comm=netdata source=daemon level=notice errno="2, No such file or directory" tid=4212  msg="CONFIG: cannot load user config '/etc/netdata/stream.conf'. Will try stock config."
Packaging:
    Netdata Version ____________________________________________ : v2.8.1
    Installation Type __________________________________________ : oci
    Package Architecture _______________________________________ : x86_64
    Package Distro _____________________________________________ : unknown
    Configure Options __________________________________________ : cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_STANDARD=11 -DCMAKE_CXX_STANDARD=14 -DBUILD_SHARED_LIBS=OFF -DCMAKE_C_FLAGS='-O2 -funroll-loops -pipe -fexceptions -fstack-protector-strong -D_FORTIFY_SOURCE=3 -fstack-clash-protection -fcf-protection=full -ffunction-sections -fdata-sections -Wno-builtin-macro-redefined -fno-omit-frame-pointer -funwind-tables -fasynchronous-unwind-tables' -DCMAKE_CXX_FLAGS=' -O2 -funroll-loops -pipe -fexceptions -fstack-protector-strong -D_FORTIFY_SOURCE=3 -fstack-clash-protection -fcf-protection=full -ffunction-sections -fdata-sections -Wno-builtin-macro-redefined -fno-omit-frame-pointer -funwind-tables -fasynchronous-unwind-tables' -DCMAKE_COMPILE_DEFINITIONS='_GNU_SOURCE' -DCMAKE_EXE_LINKER_FLAGS='-Wl,--gc-sections -fexceptions -fstack-protector-strong -D_FORTIFY_SOURCE=3 -fstack-clash-protection -fcf-protection=full -ffunction-sections -fdata-sections -Wno-builtin-macro-redefined -rdynamic' -DCMAKE_SHARED_LINKER_FLAGS='-Wl,--gc-sections'
Default Directories:
    User Configurations ________________________________________ : /etc/netdata
    Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
    Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
    Permanent Databases ________________________________________ : /var/lib/netdata
    Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
    Static Web Files ___________________________________________ : /usr/share/netdata/web
    Log Files __________________________________________________ : /var/log/netdata
    Lock Files _________________________________________________ : /var/lib/netdata/lock
    Home _______________________________________________________ : /var/lib/netdata
Operating System:
    Kernel _____________________________________________________ : Linux
    Kernel Version _____________________________________________ : 6.4.0-150600.23.78-default
    Operating System ___________________________________________ : openSUSE Leap
    Operating System ID ________________________________________ : opensuse-leap
    Operating System ID Like ___________________________________ : suse opensuse
    Operating System Version ___________________________________ : 15.6
    Operating System Version ID ________________________________ : 13
    Detection __________________________________________________ : /host/etc/os-release
Hardware:
    CPU Cores __________________________________________________ : 12
    CPU Frequency ______________________________________________ : 1800000000
    RAM Bytes __________________________________________________ : 33419157504
    Disk Capacity ______________________________________________ : 9001789784064
    CPU Architecture ___________________________________________ : x86_64
    Virtualization Technology __________________________________ : none
    Virtualization Detection ___________________________________ : systemd-detect-virt
Container:
    Container __________________________________________________ : podman
    Container Detection ________________________________________ : systemd-detect-virt
    Container Orchestrator _____________________________________ : none
    Container Operating System _________________________________ : Debian GNU/Linux
    Container Operating System ID ______________________________ : debian
    Container Operating System ID Like _________________________ : unknown
    Container Operating System Version _________________________ : 13 (trixie)
    Container Operating System Version ID ______________________ : 13
    Container Operating System Detection _______________________ : /etc/os-release
Features:
    Built For __________________________________________________ : Linux
    Netdata Cloud ______________________________________________ : YES
    Health (trigger alerts and send notifications) _____________ : YES
    Streaming (stream metrics to parent Netdata servers) _______ : YES
    Back-filling (of higher database tiers) ____________________ : YES
    Replication (fill the gaps of parent Netdata servers) ______ : YES
    Streaming and Replication Compression ______________________ : YES (zstd lz4 gzip brotli)
    Contexts (index all active and archived metrics) ___________ : YES
    Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
    Machine Learning ___________________________________________ : YES
    Memory Allocator ___________________________________________ : system
Database Engines:
    dbengine (compression) _____________________________________ : YES (zstd lz4)
    alloc ______________________________________________________ : YES
    ram ________________________________________________________ : YES
    none _______________________________________________________ : YES
Connectivity Capabilities:
    ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
    static (Netdata internal web server) _______________________ : YES
    WebRTC (experimental) ______________________________________ : NO
    Native HTTPS (TLS Support) _________________________________ : YES
    TLS Host Verification ______________________________________ : YES
Libraries:
    LZ4 (extremely fast lossless compression algorithm) ________ : YES
    ZSTD (fast, lossless compression algorithm) ________________ : YES
    zlib (lossless data-compression library) ___________________ : YES
    Brotli (generic-purpose lossless compression algorithm) ____ : YES
    protobuf (platform-neutral data serialization protocol) ____ : YES (system)
    OpenSSL (cryptography) _____________________________________ : YES
    libdatachannel (stand-alone WebRTC data channels) __________ : NO
    JSON-C (lightweight JSON manipulation) _____________________ : YES
    libcap (Linux capabilities system operations) ______________ : YES
    libcrypto (cryptographic functions) ________________________ : YES
    libyaml (library for parsing and emitting YAML) ____________ : YES
    libmnl (library for working with netfilter) ________________ : YES
    stacktraces (library for getting stack traces) _____________ : libbacktrace (mmap, threads, data)
Plugins:
    apps (monitor processes) ___________________________________ : YES
    cgroups (monitor containers and VMs) _______________________ : YES
    cgroup-network (associate interfaces to CGROUPS) ___________ : YES
    proc (monitor Linux systems) _______________________________ : YES
    tc (monitor Linux network QoS) _____________________________ : YES
    diskspace (monitor Linux mount points) _____________________ : YES
    freebsd (monitor FreeBSD systems) __________________________ : NO
    macos (monitor MacOS systems) ______________________________ : NO
    windows (monitor Windows systems) __________________________ : NO
    statsd (collect custom application metrics) ________________ : YES
    timex (check system clock synchronization) _________________ : YES
    idlejitter (check system latency and jitter) _______________ : YES
    bash (support shell data collection jobs - charts.d) _______ : YES
    debugfs (kernel debugging metrics) _________________________ : YES
    cups (monitor printers and print jobs) _____________________ : NO
    ebpf (monitor system calls) ________________________________ : NO
    freeipmi (monitor enterprise server H/W) ___________________ : YES
    network-viewer (monitor TCP/UDP IPv4/6 sockets) ____________ : YES
    systemd-journal (monitor journal logs) _____________________ : YES
    windows-events (monitor Windows events) ____________________ : NO
    nfacct (gather netfilter accounting) _______________________ : NO
    perf (collect kernel performance events) ___________________ : YES
    slabinfo (monitor kernel object caching) ___________________ : YES
    Xen ________________________________________________________ : NO
    Xen VBD Error Tracking _____________________________________ : NO
Exporters:
    AWS Kinesis ________________________________________________ : NO
    GCP PubSub _________________________________________________ : NO
    MongoDB ____________________________________________________ : YES
    Prometheus (OpenMetrics) Exporter __________________________ : YES
    Prometheus Remote Write ____________________________________ : YES
    Graphite ___________________________________________________ : YES
    Graphite HTTP / HTTPS ______________________________________ : YES
    JSON _______________________________________________________ : YES
    JSON HTTP / HTTPS __________________________________________ : YES
    OpenTSDB ___________________________________________________ : YES
    OpenTSDB HTTP / HTTPS ______________________________________ : YES
    All Metrics API ____________________________________________ : YES
    Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
    Trace All Netdata Allocations (with charts) ________________ : NO
    Developer Mode (more runtime checks, slower) _______________ : NO
Runtime Information:
    Profile ____________________________________________________ : standalone
    Stream Parent (accept data from Children) __________________ : NO
    Stream Child (send data to a Parent) _______________________ : NO
    Total System Memory ________________________________________ : 33419157504
    Available System Memory ____________________________________ : 19062743040

Additional info

No response