Intel RAPL perms changed in newer kernels, system logs spammed #9324

@kmoad

Relevant telegraf.conf:

[global_tags]
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false
[[outputs.influxdb]]
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.intel_powerstat]]
    cpu_metrics = ["cpu_frequency", "cpu_temperature"]
[[inputs.net]]
[[inputs.nvidia_smi]]

System info:

Telegraf 1.18.3 (git: HEAD 6a94f65)
Fedora 34
Kernel 5.12.8

Steps to reproduce:

  1. Enable [[inputs.intel_powerstat]] with `cpu_metrics = ["cpu_temperature"]`
  2. Restart telegraf
  3. Watch journalctl -u telegraf -f

Expected behavior:

Hate to say it, but silent failure would be better. See additional info section.

Actual behavior:

Every time metrics are collected, intel_powerstat reports an error and fails to read cpu_temperature:

[inputs.intel_powerstat] error fetching rapl data for socket 0, err: error opening socket energy_uj file on path /sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/energy_uj, err: open /sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/energy_uj: permission denied

Additional info:

The file is readable only by root:

$ ls -l /sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/energy_uj
-r--------. 1 root root 4096 Jun  1 21:03 /sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/energy_uj

Similar issue for prometheus here

Related to this kernel change and Intel CVE-2020-8695

For security reasons, energy_uj is now readable only by root, and is likely to remain so. Telegraf cannot read this file and it would be nice if it failed without putting an error message in the logs every ten seconds.

In the long term, as more distributions ship kernels with this change, a workaround for reading energy_uj will be needed.

Labels: area/prometheus, bug (unexpected problem or unintended behavior)
