
Commit fd0e24b

daemon/stats: more resilient cpu sampling
To avoid noise in sampling CPU usage metrics, we now sample the system usage closer to the actual response from the underlying runtime. Because the response from the runtime may be delayed, this makes the sampling more resilient in loaded conditions. In addition to this, we also replace the tick with a sleep to avoid situations where ticks can back up under loaded conditions.

The trade-off here is slightly more load from reading the system CPU usage for each container. There may be an optimization required for large numbers of containers, but the cost is on the order of 15 ms per 1000 containers. If this becomes a problem, we can time-slot the sampling, but the complexity may not be worth it unless we can test further.

Unfortunately, there aren't really any good tests for this condition. Triggering this behavior is highly system-dependent. As a matter of course, we should qualify the fix with the users that are affected.

Signed-off-by: Stephen J Day <[email protected]>
1 parent 3e1505e commit fd0e24b

1 file changed, +11 -7 lines changed

daemon/stats/collector.go

--- a/daemon/stats/collector.go
+++ b/daemon/stats/collector.go
@@ -57,7 +57,7 @@ func (s *Collector) Run() {
 	// it will grow enough in first iteration
 	var pairs []publishersPair

-	for range time.Tick(s.interval) {
+	for {
 		// it does not make sense in the first iteration,
 		// but saves allocations in further iterations
 		pairs = pairs[:0]
@@ -72,12 +72,6 @@ func (s *Collector) Run() {
 			continue
 		}

-		systemUsage, err := s.getSystemCPUUsage()
-		if err != nil {
-			logrus.Errorf("collecting system cpu usage: %v", err)
-			continue
-		}
-
 		onlineCPUs, err := s.getNumberOnlineCPUs()
 		if err != nil {
 			logrus.Errorf("collecting system online cpu count: %v", err)
@@ -89,6 +83,14 @@ func (s *Collector) Run() {

 			switch err.(type) {
 			case nil:
+				// Sample system CPU usage close to container usage to avoid
+				// noise in metric calculations.
+				systemUsage, err := s.getSystemCPUUsage()
+				if err != nil {
+					logrus.WithError(err).WithField("container_id", pair.container.ID).Errorf("collecting system cpu usage")
+					continue
+				}
+
 				// FIXME: move to containerd on Linux (not Windows)
 				stats.CPUStats.SystemUsage = systemUsage
 				stats.CPUStats.OnlineCPUs = onlineCPUs
@@ -106,6 +108,8 @@ func (s *Collector) Run() {
 				logrus.Errorf("collecting stats for %s: %v", pair.container.ID, err)
 			}
 		}
+
+		time.Sleep(s.interval)
 	}
 }
