
notifier: optionally drain queued notifications before shutting down#14290

Merged
gotjosh merged 12 commits into prometheus:main from charleskorn:charleskorn/drain-notifier-queue
Jun 26, 2024

Conversation

@charleskorn (Contributor) commented Jun 12, 2024

This PR adds an option to drain any queued notifications before shutting down the notification manager.

The option is enabled by default. If it is disabled, the current behaviour is preserved, where any queued notifications are effectively dropped when stopping the process.

Enabling draining of the queue solves the issue where alert notifications are silently dropped when rule evaluation completes just before the process receives a shutdown signal, and the resulting notifications have not yet been sent by the time the notification manager is told to shut down.

Signed-off-by: Charles Korn <[email protected]>
@@ -300,21 +303,57 @@ func (n *Manager) nextBatch() []*Alert {

// Run dispatches notifications continuously.
func (n *Manager) Run(tsets <-chan map[string][]*targetgroup.Group) {
Contributor

I think the logic is correct. However, this function is getting more complicated, and I would like you to consider an alternative approach.

My suggestion starts from the assumption that, when draining, we don't need to run n.reload(ts), which will make things easier IMO. Consider the Manager.Run() function from the main branch: when n.stopAndDrainRequested is signalled, we can simply exit the for loop. After the existing for loop, we add something like the following to handle the draining (consider it pseudo-code, I haven't run it):

if n.opts.DrainTimeout > 0 {
  drainTimedOut := time.After(n.opts.DrainTimeout)

  for n.queueLen() > 0 {
    select {
    case <-drainTimedOut:
      // Give up on whatever is still queued once the timeout fires.
      return // TODO: it would be even cleaner to exit the for loop instead

    default:
      // Keep sending batches until the queue is empty.
      alerts := n.nextBatch()

      if !n.sendAll(alerts...) {
        n.metrics.dropped.Add(float64(len(alerts)))
      }
    }
  }
}

This keeps the draining logically separated from the main loop, which I believe makes the logic easier to follow.

Member

What about keeping Run() as is (we may split it here) and only delaying the call to n.cancel() in Stop(): loop/block until the timeout is reached or the queue is empty, then call n.cancel(), which will stop Run().
This way we don't need to re-implement the Run logic.

Contributor Author

> What about keeping Run() as is (we may split it here) and only delaying the call to n.cancel() in Stop(): loop/block until the timeout is reached or the queue is empty, then call n.cancel(), which will stop Run().
> This way we don't need to re-implement the Run logic.

My concern with making this change is that it changes the contract of Stop(), which may cause issues elsewhere. Previously, Stop() signalled that the manager should shut down but didn't block, whereas with this suggestion, Stop() will block until the manager has stopped.

@charleskorn (Contributor Author) commented Jun 13, 2024

I've implemented something along the lines of what @pracucci suggested in e4f5fee. Let me know what you think.

Member

> My concern with making this change is that it changes the contract of Stop(), which may cause issues elsewhere. Previously, Stop() signalled that the manager should shut down but didn't block, whereas with this suggestion, Stop() will block until the manager has stopped.

I don't see how the contract will change if the default value of DrainTimeout is 0.
(One can argue that the same applies to Run now: before it wasn't blocking and now it is, but again, the default DrainTimeout is 0.)
Note that we take a similar approach for the remote write queues here:

func (s *shards) stop() {

Contributor Author

> (One can argue that the same applies to Run now: before it wasn't blocking and now it is, but again, the default DrainTimeout is 0.)

Run was already blocking: before this PR, Run wouldn't return until Stop was called, and this PR does not change that behaviour.

Contributor Author

I've changed the behaviour of Stop in 086be26.

Signed-off-by: Charles Korn <[email protected]>
@charleskorn (Contributor Author) commented Jun 14, 2024
The CI failure seems like a flake unrelated to my changes, but I don't have permission to re-run it.

}

<-n.stopped
n.drainQueue()
Contributor

I'm not convinced it's safe to drain the queue here. As you stated in another comment, the Run() contract is that it blocks until the Manager has stopped executing; moving the draining here breaks that contract.

@charleskorn charleskorn force-pushed the charleskorn/drain-notifier-queue branch from c6cf398 to a828653 on June 20, 2024 01:38
@charleskorn (Contributor Author) commented:
I've made a few changes in response to feedback from @pracucci and @gotjosh:

  • Stop now behaves as it did before: it merely signals that the notifier should stop and returns immediately, rather than waiting for the notifier to shut down. This preserves the existing contract of Stop.
  • I've removed the timeout in favour of a flag to enable or disable draining the queue, which is enabled by default. The rationale is that we should always try to send notifications that have been generated, just as we already always try to finish evaluating rules and recording any resulting samples before shutting down. We can rely on whatever is running Prometheus (e.g. Kubernetes) to kill the process if draining takes too long.

@charleskorn (Contributor Author) commented:
Resolving the merge conflict here is blocked pending a response to #14174 (review).

@charleskorn (Contributor Author) commented:
I've resolved the merge conflict, and this is ready for another review.

Signed-off-by: Charles Korn <[email protected]>
@pracucci (Contributor) left a comment:

I know it was discussed extensively, but I personally disagree with losing the guarantee that, once you call Stop(), the manager will effectively stop within a maximum (timeout) period.

Before this PR, the context was canceled on Stop(). After this PR, the context is never canceled until the notification work has been done (essentially there's no point in having a cancelable context at all).

This change may be fine for Prometheus, but it's risky for downstream multi-tenant projects running 1 ruler per tenant (e.g. Mimir, Cortex last time I checked, ...). There's the risk that stopping the Notifier will take a long time, thus slowing down or blocking other operations in the downstream project.

@pracucci (Contributor) left a comment:

> I know it was discussed extensively, but I personally disagree with losing the guarantee that, once you call Stop(), the manager will effectively stop within a maximum (timeout) period.

Discussed offline with Josh. Although I still think that having a "max stop timeout", after which the notifier context gets canceled, may be a good idea, in practice this may not be a significant issue, so the changes here LGTM.

@gotjosh (Member) left a comment:

LGTM

Thanks very much for your hard work @charleskorn.

@machine424 please do let me know if you have any additional concerns with this.
