fix(notifier): Fix target groups update starvation #14174
machine424 merged 3 commits into prometheus:main
Conversation
force-pushed from 7f8dd98 to 68640fe
notifier/notifier_test.go (outdated)
// TODO: rename it and even replace TestHangingNotifier with it.
We can keep both of them, as the current one helped with fixing 5c63ef2.
force-pushed from 7c6bc2f to ad40823
force-pushed from 64ac864 to 5c63ef2
force-pushed from bfaadab to c3447c2
I think this change is really good.
chore(notifier): add a reproducer for #13676
Signed-off-by: machine424 <[email protected]>
Co-authored-by: Ethan Hunter <[email protected]>

chore(notifier): Split `Run()` into two goroutines: one to receive target updates and trigger reloads and the other one to send notifications
Signed-off-by: machine424 <[email protected]>

fix(notifier): take alertmanagerSet.mtx before checking alertmanagerSet.ams in sendAll
Signed-off-by: machine424 <[email protected]>
force-pushed from c3447c2 to 8e5a99e
for {
	select {
	case <-n.ctx.Done():
		return
This has changed the behaviour of Run: previously, Run would only return once it had finished both sending notifications and processing target group updates.
With this change, however, Run can return before notification sending has finished, and there's no way for callers to know when (or if) it finishes.
What do you think about waiting for sendLoop() to terminate before we return here?
To be precise: previously, Run would only return once it had finished sending "the current (if any) batch" of notifications.
In both cases, other notifications could still remain in the queue when Run returns (that's what #14290 is trying to address).
Also, in both cases, the current batch (if any) will be cancelled, as the request context is a child of the manager's. Previously it was cleaner: we were sure to catch the request context cancellation and update some metrics (even though that's of little use, since Prometheus is already stopping).
There is no test or doc/comment enforcing that old behavior. You can enforce it as part of your PR, or I can take care of it. I think just putting the "notifications sending" loop in the foreground should help with that.
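For illustration, a minimal sketch of what the "foreground" shape could look like (hypothetical, trimmed-down types and method stubs; not the PR's actual code):

```go
package notifier

import (
	"context"

	"github.com/prometheus/prometheus/discovery/targetgroup"
)

// Hypothetical, trimmed-down Manager: just enough to show the shape of Run.
type Manager struct {
	ctx context.Context
}

func (n *Manager) targetUpdateLoop(tsets <-chan map[string][]*targetgroup.Group) { /* elided */ }

func (n *Manager) sendLoop() { /* elided */ }

// Run keeps the notification-sending loop in the foreground: it only returns
// once sendLoop has observed n.ctx being cancelled and handled the current
// batch, while target group updates are processed on a background goroutine.
func (n *Manager) Run(tsets <-chan map[string][]*targetgroup.Group) {
	go n.targetUpdateLoop(tsets)
	n.sendLoop()
}
```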
Why is this labeled "chore" and not "bugfix"?
I have a bad habit of holding onto "chore" even when the original intention changes.
fixes #13676
chore(notifier): Split `Run()` into two goroutines: one to receive target updates and trigger reloads and the other one to send notifications.
This is done to prevent the latter operation from blocking/starving the former: previously, the `tsets` channel was consumed by the same goroutine that consumes and feeds the buffered `n.more` channel, so the `tsets` channel was less likely to be ready, as it's unbuffered and only fed every `SDManager.updatert` seconds. See #13676 and #8768.
The synchronization with the sendLoop goroutine is managed through the `n.mtx` mutex. This uses a similar approach to the scrape manager's:
https://github.com/prometheus/prometheus/blob/efbd6e41c59ec8d6b7a0791c1fb337fdac53b4f2/scrape/manager.go#L115-L117
The old TestHangingNotifier was replaced by the new one to more closely reflect reality.
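Roughly, the split looks like the sketch below (field and helper names other than `Run`, `tsets`, `n.more`, and `n.mtx` are assumptions, not the PR's exact code):

```go
package notifier

import (
	"context"
	"sync"

	"github.com/prometheus/prometheus/discovery/targetgroup"
)

// Hypothetical, trimmed-down Manager carrying only what the sketch needs.
type Manager struct {
	ctx  context.Context
	mtx  sync.RWMutex
	more chan struct{} // buffered; signals that the queue holds notifications
}

// targetUpdateLoop consumes target group updates on its own goroutine, so a
// full queue or a down Alertmanager can no longer starve reloads.
func (n *Manager) targetUpdateLoop(tsets <-chan map[string][]*targetgroup.Group) {
	for {
		select {
		case <-n.ctx.Done():
			return
		case ts := <-tsets:
			n.reload(ts)
		}
	}
}

// sendLoop drains the buffered n.more channel and dispatches notifications.
func (n *Manager) sendLoop() {
	for {
		select {
		case <-n.ctx.Done():
			return
		case <-n.more:
			// ... dequeue a batch and send it to the Alertmanagers ...
		}
	}
}

// reload synchronizes with sendLoop through n.mtx, mirroring the scrape
// manager's reloader pattern linked above.
func (n *Manager) reload(tgs map[string][]*targetgroup.Group) {
	n.mtx.Lock()
	defer n.mtx.Unlock()
	// ... rebuild the Alertmanager sets from tgs ...
}
```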
chore(notifier): add a reproducer for #13676
to show "targets groups update" starvation when the notifications queue is full and an Alertmanager is down.
The existing `TestHangingNotifier` that was added in #10948 doesn't really reflect reality, as the SD changes are manually fed into `syncCh` in a continuous way, whereas in reality, updates are only resent every `updatert`.
The test added here sets up an SD manager and links it to the notifier; the SD changes will be triggered by that manager, as is done in reality.
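The wiring is roughly the following (a sketch; constructor and option details are elided since they differ between Prometheus versions, and `runReproducer` is an illustrative helper, not the test's actual name):

```go
package notifier_test

import (
	"context"

	"github.com/prometheus/prometheus/discovery"
	"github.com/prometheus/prometheus/notifier"
)

// runReproducer connects a real SD manager to the notifier, the same way
// Prometheus wires the two at startup.
func runReproducer(ctx context.Context, sdManager *discovery.Manager, notifierManager *notifier.Manager) {
	// The SD manager resends target group updates only every `updatert`,
	// the cadence the old TestHangingNotifier failed to model.
	go func() { _ = sdManager.Run() }()

	// The notifier consumes updates straight from the SD manager.
	notifierManager.Run(sdManager.SyncCh())
	_ = ctx // cancellation/lifecycle handling elided
}
```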
fix(notifier): take alertmanagerSet.mtx before checking alertmanagerSet.ams in sendAll
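For context, what that fix amounts to, sketched with stand-in types (the real `alertmanagerSet` carries more fields and the real loop builds and posts the payload):

```go
package notifier

import "sync"

// Minimal stand-ins for the real types, only to show the locking order.
type alertmanager interface{}

type alertmanagerSet struct {
	mtx sync.RWMutex
	ams []alertmanager
}

// sendAllSketch: the read lock is now taken *before* checking ams, so a
// concurrent reload can't swap the slice out from under the check.
func sendAllSketch(amSets map[string]*alertmanagerSet) {
	for _, ams := range amSets {
		ams.mtx.RLock()
		if len(ams.ams) == 0 {
			ams.mtx.RUnlock()
			continue
		}
		// ... marshal the batch and send it to every Alertmanager in ams.ams ...
		ams.mtx.RUnlock()
	}
}
```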