notifier: optionally drain queued notifications before shutting down#14290
gotjosh merged 12 commits into prometheus:main
Conversation
Signed-off-by: Charles Korn <[email protected]>
```diff
@@ -300,21 +303,57 @@ func (n *Manager) nextBatch() []*Alert {
 // Run dispatches notifications continuously.
 func (n *Manager) Run(tsets <-chan map[string][]*targetgroup.Group) {
```
I think the logic is correct. However, this function is getting more complicated, and I would like you to consider an alternative approach.
My suggestion starts from the assumption that, when draining, we don't need to run n.reload(ts), which will make things easier IMO. Consider the Manager.Run() function from the main branch. When n.stopAndDrainRequested is signaled, we can simply exit the for loop. After the existing for loop, we add something like this to handle the draining (consider it pseudo-code, I haven't run it):
```go
if n.opts.DrainTimeout > 0 {
	drainTimedOut := time.After(n.opts.DrainTimeout)
	for n.queueLen() > 0 {
		select {
		case <-drainTimedOut:
			return // TODO: would be even cleaner to exit the for loop instead
		default:
			alerts := n.nextBatch()
			if !n.sendAll(alerts...) {
				n.metrics.dropped.Add(float64(len(alerts)))
			}
		}
	}
}
```
In this way we can keep the draining logically separated from the main loop, and I believe it may be easier to follow the logic.
What about keeping Run() as is (we may split it here) and only delaying the call to n.cancel() in Stop(): loop/block until the timeout is reached or the queue is empty, then call n.cancel(), which will stop Run().
This way we don't need to re-implement the Run logic.
> What about keeping Run() as is (we may split it here) and only delay the call to n.cancel() in Stop(): loop/block until the timeout is reached or the queue is empty, then call n.cancel() which will stop the Run()
>
> This way we don't need to re-implement the Run logic.
My concern with making this change is that it changes the contract of Stop(), which may cause issues elsewhere. Previously, Stop() signalled that the manager should shut down but didn't block, whereas with this suggestion, Stop() will block until the manager has stopped.
> My concern with making this change is that it changes the contract of Stop(), which may cause issues elsewhere. Previously, Stop() signalled that the manager should shut down but didn't block, whereas with this suggestion, Stop() will block until the manager has stopped.
I don't see how the contract will change if the default value of DrainTimeout is 0.
(One could argue the same applies to Run now: before it wasn't blocking and now it is, but again, the default DrainTimeout is 0.)
Note that we take a similar approach for remote write queues: see prometheus/storage/remote/queue_manager.go, line 1201 (commit 5a21870).
> (one can argue that the same applies for Run now, before it wasn't blocking, now it does, but again the default DrainTimeout is 0)
Run was blocking before: prior to this PR, Run wouldn't return until Stop was called, and this PR does not change that behaviour.
I've changed the behaviour of Stop in 086be26.
The CI failure seems like a flake unrelated to my changes, but I don't have permission to re-run it.
notifier/notifier.go (outdated)

```go
	<-n.stopped
	n.drainQueue()
```
I'm not convinced it's safe to drain the queue here. As you stated in another comment, the Run() contract is that it blocks until the Manager has stopped executing. Moving the draining here breaks that contract.
Force-pushed from c6cf398 to a828653.
I've made a few changes in response to feedback from @pracucci and @gotjosh:
Resolving the merge conflict here is blocked pending a response to #14174 (review).
I've resolved the merge conflict, and this is ready for another review.
pracucci
left a comment
I know it was discussed extensively, but I personally disagree with losing the guarantee that once you call Stop(), the manager will effectively stop within a maximum (timeout) period.
Before this PR, the context was canceled on Stop(). After this PR, the context is never canceled until the notification work has been done (essentially there's no point in having a cancelable context at all).
This change may be fine for Prometheus, but it's risky for downstream multi-tenant projects running one ruler per tenant (e.g. Mimir, and Cortex last time I checked). There's a risk that stopping the Notifier will take a long time, slowing down or blocking other operations in the downstream project.
pracucci
left a comment
> I know it was discussed extensively, but I personally disagree with losing the guarantee that once you call Stop(), the manager will effectively stop within a maximum (timeout) period.

Discussed offline with Josh. Although I still think having a "max stop timeout", after which the notifier context gets canceled, may be a good idea, in practice this may not be a significant issue, so the changes here LGTM.
gotjosh
left a comment
LGTM
Thanks very much for your hard work @charleskorn.
@machine424 please do let me know if you have any additional concerns with this.
This PR adds an option to drain any queued notifications before shutting down the notification manager.
The option is enabled by default. If it is disabled, the current behaviour is preserved: any queued notifications are effectively dropped when stopping the process.
Enabling draining of the queue solves the issue where alert notifications are silently dropped when alert rule evaluation completes just before the process receives a shutdown signal, and the notifications have not yet been sent by the time the notification manager is told to shut down.
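The enabled/disabled behaviour described above can be summarized in a minimal sketch. The names here (`drainOnShutdown`, `send`) are illustrative, not the actual notifier API: when draining is enabled, queued notifications are sent before shutdown completes; when disabled, the queue is dropped as before.

```go
package main

import "fmt"

// shutdown models the two shutdown modes: drain the queue (sending each
// queued notification) or drop it wholesale, as happened before this PR.
func shutdown(queue []string, drainOnShutdown bool, send func(string) bool) (sent, dropped int) {
	if !drainOnShutdown {
		// Draining disabled: every queued notification is lost.
		return 0, len(queue)
	}
	for _, alert := range queue {
		if send(alert) {
			sent++
		} else {
			dropped++
		}
	}
	return sent, dropped
}

func main() {
	queue := []string{"HighLatency", "DiskFull"}
	alwaysOK := func(string) bool { return true }

	sent, dropped := shutdown(queue, true, alwaysOK)
	fmt.Println("draining enabled:", sent, "sent,", dropped, "dropped")

	sent, dropped = shutdown(queue, false, alwaysOK)
	fmt.Println("draining disabled:", sent, "sent,", dropped, "dropped")
}
```

This is exactly the race the PR closes: rule evaluation enqueues a notification, the process is told to stop, and without draining the notification never reaches Alertmanager.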