-
Notifications
You must be signed in to change notification settings - Fork 5.7k
Description
Hello and thanks for Telegraf,
Relevant telegraf.conf
[agent]
interval = "60s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 5000
collection_jitter = "0s"
flush_interval = "60s"
flush_jitter = "45s"
debug = false
quiet = false
logfile = "/var/log/telegraf/telegraf.log"
[[inputs.sysstat]]
sadc_path = "/usr/lib64/sa/sadc" # required
activities = ["DISK", "SNMP", "INT"]
group = true
[inputs.sysstat.options]
-B = "paging"
-b = "io"
-d = "disk" # requires DISK activity
"-I ALL" = "interrupts" # requires INT activity
"-n ALL" = "network"
"-P ALL" = "per_cpu"
-q = "queue"
-S = "swap_util"
-u = "cpu_util"
-v = "inode"
-W = "swap"
-w = "task"Logs from Telegraf
2022-05-14T16:10:20Z I! Loaded inputs: chrony cpu disk diskio ethtool kernel kernel_vmstat mem net netstat processes statsd swap sysstat system
2022-05-14T16:10:20Z I! Loaded aggregators:
2022-05-14T16:10:20Z I! Loaded processors: strings
2022-05-14T16:10:20Z I! Loaded outputs: http (2x)
2022-05-14T16:10:20Z I! Tags enabled: <REDACTED>
2022-05-14T16:10:20Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"<REDACTED>", Flush Interval:1m0s
2022-05-14T16:10:20Z I! [inputs.statsd::statsd_default] UDP listening on "[::]:18125"
2022-05-14T16:10:20Z I! [inputs.statsd::statsd_default] Started the statsd service on ":18125"
2022-05-14T16:11:00Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-3803201174: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
2022-05-14T16:11:39Z I! [inputs.statsd::statsd_default] Stopping the statsd service
2022-05-14T16:11:39Z E! [inputs.statsd::statsd_default] Error in plugin: read udp [::]:18125: use of closed network connection
2022-05-14T16:11:39Z I! [inputs.statsd::statsd_default] Stopped listener service on ":18125"
2022-05-14T16:11:39Z I! [agent] Hang on, flushing any cached metrics before shutdown
2022-05-14T16:11:39Z I! [agent] Stopping running outputs
2022-05-14T16:12:15Z I! Loaded inputs: chrony cpu disk diskio ethtool kernel kernel_vmstat mem net netstat processes statsd swap sysstat system
2022-05-14T16:12:15Z I! Loaded aggregators:
2022-05-14T16:12:15Z I! Loaded processors: strings
2022-05-14T16:12:15Z I! Loaded outputs: http (2x)
2022-05-14T16:12:15Z I! Tags enabled: <REDACTED>
2022-05-14T16:12:15Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"<REDACTED>", Flush Interval:1m0s
2022-05-14T16:12:15Z I! [inputs.statsd::statsd_default] UDP listening on "[::]:18125"
2022-05-14T16:12:15Z I! [inputs.statsd::statsd_default] Started the statsd service on ":18125"
2022-05-14T16:13:00Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-2026702340: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
2022-05-14T16:14:00Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-1242273769: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
2022-05-14T16:15:00Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-350760887: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
2022-05-14T16:16:00Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-1818052266: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
...repeats until restarted...
System info
Telegraph version 1.21.2 running on AlmaLinux release 8.5 on an AWS EC2 host of type r5n.12xlarge
Docker
No response
Steps to reproduce
I have been unable to reproduce without injecting code. We run the same version and config on many hosts and this is the first time I have seen this error, I suspect a timing race, possibly when the time being used is not monotonic, it was seen during startup on a rebooted AWS EC2 of type r5n.12xlarge, one thing I noticed in the host dmesg at around the time of the interval being set to 0 is the system clock being set:
[ 0.773055] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[ 0.820006] acpi PNP0A03:00: _OSC: OS supports [ASPM ClockPM Segments MSI EDR HPX-Type3]
[ 0.907005] PTP clock support registered
[ 0.920064] clocksource: Switched to clocksource kvm-clock
[ 0.956947] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 1.972397] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x24093623c91, max_idle_ns: 440795291220 ns
[ 1.979536] clocksource: Switched to clocksource tsc
[ 2.470661] sched_clock: Marking stable (2470654077, 0)->(4846823451, -2376169374)
[ 2.582565] rtc_cmos 00:00: setting system clock to 2022-05-14 16:12:04 UTC (1652544724)
If in telegraf/plugins/inputs/sysstat/sysstat.go I inject:
func (s *Sysstat) Gather(acc telegraf.Accumulator) error {
if time.Duration(s.SadcInterval) != 0 {
// Collect interval is calculated as interval - parseInterval
s.interval = int(time.Duration(s.SadcInterval).Seconds()) + parseInterval
}
if s.interval == 0 {
>> firstTimestamp = time.Now().Add(-time.Second * 1)
if firstTimestamp.IsZero() {
then with this config:
[agent]
interval = "3s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 5000
collection_jitter = "0s"
flush_interval = "3s"
flush_jitter = "1s"
hostname = ""
omit_hostname = false
[[inputs.sysstat]]
sadc_path = "/usr/lib64/sa/sadc" # required
activities = ["DISK", "SNMP", "INT"]
[inputs.sysstat.options]
"-n ALL" = "network"
[[outputs.file]]
files = ["stdout"]
When I run
> ./telegraf version
Telegraf 1.22.4 (git: HEAD acf67065)
./telegraf --config telegraf.conf
I get:
2022-05-18T16:51:13Z I! Starting Telegraf 1.22.4
2022-05-18T16:51:13Z I! Loaded inputs: sysstat
2022-05-18T16:51:13Z I! Loaded aggregators:
2022-05-18T16:51:13Z I! Loaded processors:
2022-05-18T16:51:13Z I! Loaded outputs: file
2022-05-18T16:51:13Z I! Tags enabled: host=<REDACTED>
2022-05-18T16:51:13Z I! [agent] Config: Interval:3s, Quiet:false, Hostname:"<REDACTED>", Flush Interval:3s
2022-05-18T16:51:15Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-684022459: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
Options are:
[ -C <comment> ] [ -F ] [ -L ] [ -V ] [ -f ]
[ -S { INT | DISK | IPV6 | POWER | SNMP | XDISK | ALL | XALL } ]
2022-05-18T16:51:18Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-1242391711: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
Options are:
[ -C <comment> ] [ -F ] [ -L ] [ -V ] [ -f ]
[ -S { INT | DISK | IPV6 | POWER | SNMP | XDISK | ALL | XALL } ]
2022-05-18T16:51:21Z E! [inputs.sysstat] Error in plugin: failed to run command /usr/lib64/sa/sadc -S DISK -S SNMP -S INT 0 2 /tmp/sysstat-2285369201: exit status 1 - Usage: /usr/lib64/sa/sadc [ options ] [ <interval> [ <count> ] ] [ <outfile> ]
Options are:
[ -C <comment> ] [ -F ] [ -L ] [ -V ] [ -f ]
[ -S { INT | DISK | IPV6 | POWER | SNMP | XDISK | ALL | XALL } ]
...repeats...
I can "fix" this with this change in telegraf/plugins/inputs/sysstat/sysstat.go::collect:
// If true, interval is not defined yet and Gather is run for the first time.
<< if collectInterval < 0 {
>> if collectInterval <= 0 {
collectInterval = 1 // In that case we only collect for 1 second.
}
options = append(options, strconv.Itoa(collectInterval), "2", tempfile)
cmd := execCommand(s.Sadc, options...)
Expected behavior
Avoid getting stuck repeating calls to sadc with invalid 0 interval.
Actual behavior
See "Steps to reproduce"
Additional info
No response