add a metric that can be used to notice stuck worker threads #70884
Conversation
lavalamp commented Nov 9, 2018
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: lavalamp. The full list of commands accepted by this bot can be found here. The pull request process is described here.
I can probably test this better (existing metrics don't seem to be tested at all?). Let me know if there's a problem with the approach. /assign @mml
unfinished := prometheus.NewGauge(prometheus.GaugeOpts{
	Subsystem: name,
	Name:      "unfinished_work_microseconds",
	Help:      "How many microseconds of work has " + name + " done that is still in progress and hasn't yet been observed by work_duration.",
What is work_duration?
It's a metric defined a few lines up.
Thinking about this... I prefer having a counter that counts the number of times the thread has "finished". This lets you notice when it's stuck (and for how long), but also lets you compute rates and deltas, even if you miss some samples. And the data downsamples elegantly. With a "microseconds since" metric, stuck is pretty easy to detect, but you can't derive rates/deltas reliably. And downsampling tends to lose information about intermediate events.
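For concreteness, the counter shape being suggested here would sit alongside the gauge from the diff above roughly as in the sketch below; the counter and its metric name are purely illustrative, not something this PR defines:

```go
// Hypothetical counter of completions, for contrast with the gauge above.
// A counter only ever increases, so rate() and deltas over it stay meaningful
// even if some samples are missed or the series is downsampled.
completions := prometheus.NewCounter(prometheus.CounterOpts{
	Subsystem: name,
	Name:      "work_completions_total", // made-up name for illustration
	Help:      "Total number of items " + name + " has finished processing.",
})
```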
We already track the number of completions as part of the processing duration statistic. You can't tell from that how many things haven't completed, which is what I am worried about. The problem is we don't know how many threads there are; it's not even guaranteed to be a constant number. You can tell how many threads are stuck by the rate at which this new metric is increasing.
Looking for differences in the completion rate also doesn't work because the queue is probably far under capacity. If it is always processing everything that is added, then the completion rate will be the same whether it's 19 workers (1 stuck) or 20. If it's not processing everything that's added, then there's a major problem :)
Also, you can't look at the difference between adds and completions, because adds are de-duped in the queue. The ratio could easily be 100:1 or more and that's fine, good even, as it means less work is done.
Hm, actually it looks like the add metric is incremented post-deduping. I'm not convinced the code is right. Maybe I'll add some more tests.
To rephrase, deltas on the duration metric would tell you if you're not doing any work, but that could be either because you're stuck or because there's nothing to do?
The logic seems okay to me, but I'm also not super familiar with how we consume prometheus data, so I don't have a bunch of helpful input, unfortunately. In general it seems to make sense to me though.
Re: @mml, yeah, I think the problem is that there would be no difference between not working and being stuck.
If all threads get stuck then the metric would go to zero and you'd notice that. If only some threads get stuck, then you wouldn't necessarily notice that. The deduping actually does look to happen before incrementing the add metric, so in steady state the add and done metrics should be the same value. Unfortunately if they're off by one, you don't know if that's because of something that got added a second ago or a week ago.
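For readers trying to follow the dedup point: in the workqueue's Add path, the dirty-set check happens before the metric increment, roughly like this paraphrased sketch of the Add method as it stood around this change (not the exact diff):

```go
func (q *Type) Add(item interface{}) {
	q.cond.L.Lock()
	defer q.cond.L.Unlock()
	if q.shuttingDown {
		return
	}
	// Dedup comes first: an item already sitting in the dirty set is dropped
	// here and never reaches the add metric below, which is why the add and
	// done metrics track de-duplicated work rather than raw Add calls.
	if q.dirty.has(item) {
		return
	}
	q.metrics.add(item)
	q.dirty.insert(item)
	// ... the item is then queued (or left for the in-flight copy to requeue).
}
```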
var total float64
if m.processingStartTimes != nil {
	for _, t := range m.processingStartTimes {
		total += sinceInMicroseconds(t)
Ah, so this is (items * time). This is kind of a confusing unit. Don't we really just care about the oldest thing in the queue?
I could report that, but you wouldn't be able to tell the difference between one thing stuck and two things stuck.
I agree that the unit is a little confusing, but emitting this as raw data is probably okay. Whatever thing is responsible for scraping this data can post-process before turning it into an actual (and more useful and less confusing) metric datapoint.
It is a confusing unit, but I think all the things that we can do to make it less confusing are lossy (or require making assumptions about the duration of the things that are being done) and therefore best done in the monitoring system.
In this case, we can't tell the difference between 1 thing stuck for an hour, or 2 things stuck for 30 minutes.
If we want information on both ages and the count of things at certain ages, we'd normally use a distribution/histogram (which I think is "summary" in prometheus?).
Github stuck my response in the wrong place:
If you plot this over time, the slope will tell you the difference between those scenarios.
I'm open to reporting a distribution, of course, but I don't have a single thing to observe and it wasn't clear how to get around that.
I think I might change the unit to be seconds, though. Microseconds is far more granular than necessary.
Unit changed, and I expanded the help string for the metric since it wasn't obvious how to use it.
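Presumably the reworked gauge ends up looking something like the sketch below; the help text paraphrases the intent described in this thread rather than quoting the merged wording:

```go
unfinished := prometheus.NewGauge(prometheus.GaugeOpts{
	Subsystem: name,
	Name:      "unfinished_work_seconds",
	Help: "How many seconds of work " + name + " has in progress that hasn't " +
		"yet been observed by work_duration. Large values indicate stuck " +
		"threads; the rate of increase hints at how many threads are stuck.",
})
```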
OK, I greatly improved the testing, PTAL
If you plot this over time, the slope will tell you the difference between those scenarios.
It looks good to me. The clock injection for testing was a nice touch.
I'm open to reporting a distribution, of course, but I don't have a single thing to observe and it wasn't clear how to get around that.
	metricsProvider: noopMetricsProvider{},
}

type metricsFactory struct {
Maybe rename to queueMetricsFactory or queueMetricsProvider? I don't know whether we're actively making a distinction between factories & providers, but I find it confusing to have metricsFactory with an embedded metricsProvider.
func (f *metricsFactory) newQueueMetrics(name string, clock clock.Clock) queueMetrics {
	mp := f.metricsProvider
	if len(name) == 0 || mp == (noopMetricsProvider{}) {
Does mp == (noopMetricsProvider{}) work? I would have thought you'd have to do a type-check.
Surprisingly, Go does the comparison correctly (otherwise the test wouldn't pass / compile).
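Go compares an interface value against a comparable concrete value by checking the dynamic type and then the value itself, so an empty struct works fine. A self-contained sketch with hypothetical stand-in names:

```go
package main

import "fmt"

// provider and noopProvider are stand-ins for the real MetricsProvider and
// noopMetricsProvider types; the names here are made up for illustration.
type provider interface{ kind() string }

type noopProvider struct{}

func (noopProvider) kind() string { return "noop" }

func main() {
	var p provider = noopProvider{}
	// True when p's dynamic type is noopProvider; the empty struct values
	// are trivially equal, so no explicit type assertion is needed.
	fmt.Println(p == (noopProvider{})) // prints: true
}
```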
metricsFactory.setProviders.Do(func() {
	metricsFactory.metricsProvider = metricsProvider
})
globalMetricsFactory.set(metricsProvider)
Maybe rename set to setProvider, for consistency?
done
@@ -64,6 +82,9 @@ type Type struct {
	shuttingDown bool

	metrics queueMetrics

	unfinishedWorkUpdatePeriod time.Duration
Nit: we don't need to store it in the struct, we could just pass it into updateUnfinishedWorkLoop. Not sure if that makes things better or worse in your mind.
I think I prefer it when things are introspectable after the fact, although it doesn't matter in this case.
@@ -170,3 +191,22 @@ func (q *Type) ShuttingDown() bool {

	return q.shuttingDown
}

func (q *Type) updateUnfinishedWorkLoop() {
	t := q.clock.NewTicker(q.unfinishedWorkUpdatePeriod)
I'd probably add a comment here that the reason we're doing this is because it's not safe to update metrics without holding the q.cond.L lock
Added a note to defaultQueueMetrics.
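For readers following along, the loop under discussion presumably takes q.cond.L around each update, roughly like the sketch below (based on the excerpts in this PR; the merged shutdown handling may differ in detail):

```go
func (q *Type) updateUnfinishedWorkLoop() {
	t := q.clock.NewTicker(q.unfinishedWorkUpdatePeriod)
	defer t.Stop()
	for range t.C() {
		// The per-item processing start times are only safe to read while
		// holding q.cond.L, so the metric update happens under that lock.
		q.cond.L.Lock()
		shuttingDown := q.shuttingDown
		if !shuttingDown {
			q.metrics.updateUnfinishedWork()
		}
		q.cond.L.Unlock()
		if shuttingDown {
			return
		}
	}
}
```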
A few code comments. I like the metric though - although we're packing a few failure modes into one metric, I think it gives us signal where we didn't have signal before. If we need to add more metrics to disambiguate later, we can do so. (Do we have any sort of guarantee on metric compatibility?)
PTAL. I'm not sure if we've set a metrics deprecation policy or not; I think 2 releases / 6 months is probably the minimum deprecation time that could be reasonable.
change units to seconds
/retest
/retest
Added a second metric for @mml.
Can I get an LGTM so I can squash? :)
func (prometheusMetricsProvider) NewLongestRunningProcessorMicrosecondsMetric(name string) workqueue.SettableGaugeMetric {
	unfinished := prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: name,
		Name:      "longest_running_procesor_microseconds",
*processor
/LGTM
@logicalhan: changing LGTM is restricted to assignees, and only kubernetes/kubernetes repo collaborators may be assigned issues.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@p0lyn0mial fyi. This sort of thing would also be useful in controllers and perhaps in the server (if there's a spot we missed).
I think it depends. We try to be polite, but I don't think we've ever enforced a hard rule on metric naming and the like.
	}
	// Convert to seconds; microseconds is unhelpfully granular for this.
	total /= 1000000
	m.unfinishedWorkSeconds.Set(total)
Why not just "Set(total / 1000000)"?
Not sure; if I make another change I'll update this, otherwise it doesn't seem worth it :)
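Stitching the diff excerpts together, the final update path presumably computes both the total and the oldest in-flight age in one pass, roughly like this sketch (names not shown in the excerpts above, such as longestRunningProcessor, are assumptions):

```go
func (m *defaultQueueMetrics) updateUnfinishedWork() {
	var total, oldest float64
	for _, t := range m.processingStartTimes {
		age := sinceInMicroseconds(t)
		total += age
		if age > oldest {
			oldest = age
		}
	}
	// Convert to seconds; microseconds is unhelpfully granular for this.
	total /= 1000000
	m.unfinishedWorkSeconds.Set(total)
	// The second metric added for @mml: the age of the longest-running
	// processor, reported in microseconds to match the gauge's name.
	m.longestRunningProcessor.Set(oldest)
}
```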
Thanks for the review!
/lgtm Thanks.