Skip to content

daemon/stats: more resilient cpu sampling#36519

Merged
yongtang merged 1 commit into
moby:masterfrom
stevvooe:resilient-cpu-sampling
Mar 9, 2018
Merged

daemon/stats: more resilient cpu sampling#36519
yongtang merged 1 commit into
moby:masterfrom
stevvooe:resilient-cpu-sampling

Conversation

@stevvooe
Copy link
Copy Markdown
Contributor

@stevvooe stevvooe commented Mar 7, 2018

To avoid noise in sampling CPU usage metrics, we now sample the system
usage closer to the actual response from the underlying runtime. Because
the response from the runtime may be delayed, this makes the sampling
more resilient in loaded conditions. In addition to this, we also
replace the tick with a sleep to avoid situations where ticks can backup
under loaded conditions.

The trade off here is slightly more load reading the system CPU usage
for each container. There may be an optimization required for large
amounts of containers but the cost is on the order of 15 ms per 1000
containers. If this becomes a problem, we can time slot the sampling,
but the complexity may not be worth it unless we can test further.

Unfortunately, there aren't really any good tests for this condition.
Triggering this behavior is highly system dependent. As a matter of
course, we should qualify the fix with the users that are affected.

Signed-off-by: Stephen J Day [email protected]

Short Nosed Bear

To avoid noise in sampling CPU usage metrics, we now sample the system
usage closer to the actual response from the underlying runtime. Because
the response from the runtime may be delayed, this makes the sampling
more resilient in loaded conditions. In addition to this, we also
replace the tick with a sleep to avoid situations where ticks can backup
under loaded conditions.

The trade off here is slightly more load reading the system CPU usage
for each container. There may be an optimization required for large
amounts of containers but the cost is on the order of 15 ms per 1000
containers. If this becomes a problem, we can time slot the sampling,
but the complexity may not be worth it unless we can test further.

Unfortunately, there aren't really any good tests for this condition.
Triggering this behavior is highly system dependent. As a matter of
course, we should qualify the fix with the users that are affected.

Signed-off-by: Stephen J Day <[email protected]>
Comment thread daemon/stats/collector.go
}
}

time.Sleep(s.interval)
Copy link
Copy Markdown
Contributor

@boaz0 boaz0 Mar 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you use time.Sleep and not kept using range time.Tick?
Unless I am wrong, they are similar functionality wise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it takes more than the interval time, the ticks backup and cause a burst of sampling, making the issue worse.

Copy link
Copy Markdown
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@crosbymichael
Copy link
Copy Markdown
Contributor

LGTM

@boaz0
Copy link
Copy Markdown
Contributor

boaz0 commented Mar 9, 2018

LGTM 🏆

Copy link
Copy Markdown
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🐅

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 9, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@bc7424b). Click here to learn what that means.
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master   #36519   +/-   ##
=========================================
  Coverage          ?   34.66%           
=========================================
  Files             ?      613           
  Lines             ?    45414           
  Branches          ?        0           
=========================================
  Hits              ?    15742           
  Misses            ?    27615           
  Partials          ?     2057

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants