Skip to content

sysstat input plugin bug, stuck using invalid sadc interval of 0 [unable to reproduce without code injection] #11135

@thatsafunnyname

Description

@thatsafunnyname

Hello and thanks for Telegraf,

Relevant telegraf.conf

[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 5000
  collection_jitter = "0s"
  flush_interval = "60s"
  flush_jitter = "45s"
  debug = false
  quiet = false
  logfile = "/var/log/telegraf/telegraf.log"

[[inputs.sysstat]]
  sadc_path = "/usr/lib64/sa/sadc" # required
  activities = ["DISK", "SNMP", "INT"]
  group = true
  [inputs.sysstat.options]
        -B = "paging"
        -b = "io"
        -d = "disk"             # requires DISK activity
        "-I ALL" = "interrupts" # requires INT activity
        "-n ALL" = "network"
        "-P ALL" = "per_cpu"
        -q = "queue"
        -S = "swap_util"
        -u = "cpu_util"
        -v = "inode"
        -W = "swap"
        -w = "task"

Logs from Telegraf

2022-05-14T16:10:20Z I! Loaded inputs: chrony cpu disk diskio ethtool kernel kernel_vmstat mem net netstat processes statsd swap sysstat system
2022-05-14T16:10:20Z I! Loaded aggregators:
2022-05-14T16:10:20Z I! Loaded processors: strings
2022-05-14T16:10:20Z I! Loaded outputs: http (2x)
2022-05-14T16:10:20Z I! Tags enabled: <REDACTED>
2022-05-14T16:10:20Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"<REDACTED>", Flush Interval:1m0s
2022-05-14T16:10:20Z I! [inputs.statsd::statsd_default] UDP listening on "[::]:18125"
2022-05-14T16:10:20Z I! [inputs.statsd::statsd_default] Started the statsd service on ":18125"
2022-05-14T16:11:00Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-3803201174: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
2022-05-14T16:11:39Z I! [inputs.statsd::statsd_default] Stopping the statsd service
2022-05-14T16:11:39Z E! [inputs.statsd::statsd_default] Error in plugin: read udp [::]:18125: use of closed network connection
2022-05-14T16:11:39Z I! [inputs.statsd::statsd_default] Stopped listener service on ":18125"
2022-05-14T16:11:39Z I! [agent] Hang on, flushing any cached metrics before shutdown
2022-05-14T16:11:39Z I! [agent] Stopping running outputs
2022-05-14T16:12:15Z I! Loaded inputs: chrony cpu disk diskio ethtool kernel kernel_vmstat mem net netstat processes statsd swap sysstat system
2022-05-14T16:12:15Z I! Loaded aggregators:
2022-05-14T16:12:15Z I! Loaded processors: strings
2022-05-14T16:12:15Z I! Loaded outputs: http (2x)
2022-05-14T16:12:15Z I! Tags enabled: <REDACTED>
2022-05-14T16:12:15Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"<REDACTED>", Flush Interval:1m0s
2022-05-14T16:12:15Z I! [inputs.statsd::statsd_default] UDP listening on "[::]:18125"
2022-05-14T16:12:15Z I! [inputs.statsd::statsd_default] Started the statsd service on ":18125"
2022-05-14T16:13:00Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-2026702340: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
2022-05-14T16:14:00Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-1242273769: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
2022-05-14T16:15:00Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-350760887: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
2022-05-14T16:16:00Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-1818052266: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
...repeats until restarted...

System info

Telegraph version 1.21.2 running on AlmaLinux release 8.5 on an AWS EC2 host of type r5n.12xlarge

Docker

No response

Steps to reproduce

I have been unable to reproduce without injecting code. We run the same version and config on many hosts and this is the first time I have seen this error, I suspect a timing race, possibly when the time being used is not monotonic, it was seen during startup on a rebooted AWS EC2 of type r5n.12xlarge, one thing I noticed in the host dmesg at around the time of the interval being set to 0 is the system clock being set:

[    0.773055] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.820006] acpi PNP0A03:00: _OSC: OS supports [ASPM ClockPM Segments MSI EDR HPX-Type3]
[    0.907005] PTP clock support registered
[    0.920064] clocksource: Switched to clocksource kvm-clock
[    0.956947] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.972397] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x24093623c91, max_idle_ns: 440795291220 ns
[    1.979536] clocksource: Switched to clocksource tsc
[    2.470661] sched_clock: Marking stable (2470654077, 0)->(4846823451, -2376169374)
[    2.582565] rtc_cmos 00:00: setting system clock to 2022-05-14 16:12:04 UTC (1652544724)

If in telegraf/plugins/inputs/sysstat/sysstat.go I inject:

func (s *Sysstat) Gather(acc telegraf.Accumulator) error {
        if time.Duration(s.SadcInterval) != 0 {
                // Collect interval is calculated as interval - parseInterval
                s.interval = int(time.Duration(s.SadcInterval).Seconds()) + parseInterval
        }

        if s.interval == 0 {
>>              firstTimestamp = time.Now().Add(-time.Second * 1)
                if firstTimestamp.IsZero() {

then with this config:

[agent]
  interval = "3s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 5000
  collection_jitter = "0s"
  flush_interval = "3s"
  flush_jitter = "1s"
  hostname = ""
  omit_hostname = false

[[inputs.sysstat]]
  sadc_path = "/usr/lib64/sa/sadc" # required
  activities = ["DISK", "SNMP", "INT"]

  [inputs.sysstat.options]
    "-n ALL" = "network"

[[outputs.file]]
  files = ["stdout"]

When I run

> ./telegraf version
Telegraf 1.22.4 (git: HEAD acf67065)

./telegraf --config telegraf.conf

I get:

2022-05-18T16:51:13Z I! Starting Telegraf 1.22.4
2022-05-18T16:51:13Z I! Loaded inputs: sysstat
2022-05-18T16:51:13Z I! Loaded aggregators:
2022-05-18T16:51:13Z I! Loaded processors:
2022-05-18T16:51:13Z I! Loaded outputs: file
2022-05-18T16:51:13Z I! Tags enabled: host=<REDACTED>
2022-05-18T16:51:13Z I! [agent] Config: Interval:3s, Quiet:false, Hostname:"<REDACTED>", Flush Interval:3s
2022-05-18T16:51:15Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-684022459: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
Options are:
[ -C <comment> ] [ -F ] [ -L ] [ -V ] [ -f ]
[ -S { INT | DISK | IPV6 | POWER | SNMP | XDISK | ALL | XALL } ]
2022-05-18T16:51:18Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-1242391711: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
Options are:
[ -C <comment> ] [ -F ] [ -L ] [ -V ] [ -f ]
[ -S { INT | DISK | IPV6 | POWER | SNMP | XDISK | ALL | XALL } ]
2022-05-18T16:51:21Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-2285369201: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
Options are:
[ -C <comment> ] [ -F ] [ -L ] [ -V ] [ -f ]
[ -S { INT | DISK | IPV6 | POWER | SNMP | XDISK | ALL | XALL } ]
...repeats...

I can "fix" this with this change in telegraf/plugins/inputs/sysstat/sysstat.go::collect:

        // If true, interval is not defined yet and Gather is run for the first time.
<<      if collectInterval < 0 {
>>      if collectInterval <= 0 {
                collectInterval = 1 // In that case we only collect for 1 second.
        }
        options = append(options, strconv.Itoa(collectInterval), "2", tempfile)
        cmd := execCommand(s.Sadc, options...)

Expected behavior

Avoid getting stuck repeating calls to sadc with invalid 0 interval.

Actual behavior

See "Steps to reproduce"

Additional info

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    area/snmpbugunexpected problem or unintended behavior

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions