
notifier: optionally drain queued notifications before shutting down#14290

Merged
gotjosh merged 12 commits into prometheus:main from charleskorn:charleskorn/drain-notifier-queue
Jun 26, 2024

Conversation

@charleskorn (Contributor) commented Jun 12, 2024

This PR adds an option to drain any queued notifications before shutting down the notification manager.

The option is enabled by default. If it is disabled, the current behaviour is preserved, where any queued notifications are effectively dropped when stopping the process.

Enabling draining of the queue solves the issue where alert notifications are silently dropped when rule evaluation completes just before the process receives a shutdown signal, and the resulting notifications have not yet been sent by the time the notification manager is told to shut down.

Signed-off-by: Charles Korn <[email protected]>
@@ -300,21 +303,57 @@ func (n *Manager) nextBatch() []*Alert {

// Run dispatches notifications continuously.
func (n *Manager) Run(tsets <-chan map[string][]*targetgroup.Group) {
Contributor

I think the logic is correct. However, this function is getting more complicated, and I would like you to consider an alternative approach.

My suggestion starts from the assumption that, when draining, we don't need to run n.reload(ts), which will make things easier IMO. Consider the Manager.Run() function from the main branch: when n.stopAndDrainRequested is signalled, we can simply exit the for loop. After the existing for loop, we add something like the following to handle the draining (consider it pseudo-code, I haven't run it):

if n.opts.DrainTimeout > 0 {
  drainTimedOut := time.After(n.opts.DrainTimeout)

  for n.queueLen() > 0 {
    select {
    case <-drainTimedOut:
      // Give up on whatever is still queued once the timeout fires.
      return // TODO: it would be even cleaner to exit the for loop instead

    default:
      // Keep sending batches until the queue is empty.
      alerts := n.nextBatch()

      if !n.sendAll(alerts...) {
        n.metrics.dropped.Add(float64(len(alerts)))
      }
    }
  }
}

This keeps the draining logically separated from the main loop, which I believe makes the logic easier to follow.

Member

What about keeping Run() as is (we may split it here) and only delaying the call to n.cancel() in Stop(): loop/block until the timeout is reached or the queue is empty, then call n.cancel(), which will stop Run().
This way we don't need to re-implement the Run logic.

Contributor Author

> What about keeping Run() as is (we may split it here) and only delaying the call to n.cancel() in Stop(): loop/block until the timeout is reached or the queue is empty, then call n.cancel(), which will stop Run().
> This way we don't need to re-implement the Run logic.

My concern with making this change is that it changes the contract of Stop(), which may cause issues elsewhere. Previously, Stop() signalled that the manager should shut down but didn't block, whereas with this suggestion, Stop() will block until the manager has stopped.

@charleskorn (Contributor Author) commented Jun 13, 2024

I've implemented something along the lines of what @pracucci suggested in e4f5fee. Let me know what you think.

Member

> My concern with making this change is that it changes the contract of Stop(), which may cause issues elsewhere. Previously, Stop() signalled that the manager should shut down but didn't block, whereas with this suggestion, Stop() will block until the manager has stopped.

I don't see how the contract will change if the default value of DrainTimeout is 0.
(One can argue that the same applies to Run now: before it wasn't blocking and now it is, but again, the default DrainTimeout is 0.)
Note that we take a similar approach for the remote write queues here:

func (s *shards) stop() {

Contributor Author

> (One can argue that the same applies to Run now: before it wasn't blocking and now it is, but again, the default DrainTimeout is 0.)

Run was already blocking: before this PR, Run wouldn't return until Stop was called, and this PR does not change that behaviour.

Contributor Author

I've changed the behaviour of Stop in 086be26.

Signed-off-by: Charles Korn <[email protected]>
@charleskorn (Contributor Author) commented Jun 14, 2024
The CI failure seems like a flake unrelated to my changes, but I don't have permission to re-run it.

}

<-n.stopped
n.drainQueue()
Contributor

I'm not convinced it's safe to drain the queue here. As you stated in another comment, the Run() contract is that it blocks until the Manager has stopped executing; moving the draining here breaks that contract.

@charleskorn charleskorn force-pushed the charleskorn/drain-notifier-queue branch from c6cf398 to a828653 on June 20, 2024 01:38
@charleskorn (Contributor Author) commented:
I've made a few changes in response to feedback from @pracucci and @gotjosh:

  • Stop now behaves as it did before: it merely signals that the notifier should stop and returns immediately, rather than waiting for the notifier to shut down. This preserves the existing contract of Stop.
  • I've removed the timeout in favour of a flag to enable or disable draining the queue, which is enabled by default. The rationale is that we should always try to send notifications that have been generated, just as we already always try to finish evaluating rules and recording any resulting samples before shutting down. We can rely on whatever is running Prometheus (e.g. Kubernetes) to kill the process if draining takes too long.

@charleskorn (Contributor Author) commented:
Resolving the merge conflict here is blocked pending a response to #14174 (review).

@charleskorn (Contributor Author) commented:
I've resolved the merge conflict, and this is ready for another review.

Signed-off-by: Charles Korn <[email protected]>
@pracucci (Contributor) left a comment:

I know it was discussed extensively, but I personally disagree with losing the guarantee that, once you call Stop(), the manager will effectively stop within a maximum (timeout) period.

Before this PR, the context was canceled on Stop(). After this PR, the context is never canceled until the notification work has been done (essentially there's no point in having a cancelable context at all).

This change may be fine for Prometheus, but it's risky for downstream multi-tenant projects running 1 ruler per tenant (e.g. Mimir, Cortex last time I checked, ...). There's the risk that stopping the Notifier will take a long time, thus slowing down or blocking other operations in the downstream project.

@pracucci (Contributor) left a comment:

> I know it was discussed extensively, but I personally disagree with losing the guarantee that, once you call Stop(), the manager will effectively stop within a maximum (timeout) period.

Discussed offline with Josh. Although I still think that having a "max stop timeout", after which the notifier context gets canceled, may be a good idea, in practice this may not be a significant issue, so the changes here LGTM.

@gotjosh (Member) left a comment:

LGTM

Thanks very much for your hard work @charleskorn.

@machine424 please do let me know if you have any additional concerns with this.
