Description
When using time intervals for routes, alerts received outside of the correct time window are dropped with a log message:
Lines 841 to 846 in 6ce841c

    // If the current time is inside a mute time, all alerts are removed from the pipeline.
    if muted {
        level.Debug(l).Log("msg", "Notifications not sent, route is within mute time")
        return ctx, nil, nil
    }
    return ctx, alerts, nil

Lines 878 to 884 in 6ce841c

    // If the current time is not inside an active time, all alerts are removed from the pipeline
    if !active {
        level.Debug(l).Log("msg", "Notifications not sent, route is not within active time")
        return ctx, nil, nil
    }
    return ctx, alerts, nil

Imagine a scenario where:
- there are one or more routes with time intervals set (e.g., engineering teams wanting alerts only during business hours)
- there is a NOC or similar team available to triage alerts 24/7
Since alerts received outside of valid time intervals aren't retried and are dropped outright, it could be helpful to the NOC to have a count of how many alerts have been dropped, for tracking or meta-alerting purposes. For example, a high rate of dropped alerts could indicate a more severe issue and justify the NOC reaching out to subject matter experts, even outside of business hours.
Using the existing log lines, it is possible to, for example, create a recording rule within Loki and derive a metric from it. However, the log entries don't tell the user how many alerts were dropped.
I'm proposing that a new counter metric, alertmanager_alerts_dropped_total, be created. Whenever alerts are dropped, it should be incremented by the number of alerts dropped (i.e., counterVar.Add(len(alerts))).
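
To make the proposal concrete, here is a minimal, standard-library-only sketch of the idea. It is not Alertmanager's actual code: `Alert`, `droppedCounter`, `filterMuted`, and `filterActive` are hypothetical names made up for illustration, and the counter is a simple stand-in for a real Prometheus counter. In a real implementation the counter would presumably be a `prometheus.Counter`, whose `Add` method takes a `float64`, so the call would look more like `counterVar.Add(float64(len(alerts)))`.

```go
package main

import "sync/atomic"

// Alert is a hypothetical stand-in for Alertmanager's alert type.
type Alert struct {
	Name string
}

// droppedCounter is a hypothetical stand-in for the proposed
// alertmanager_alerts_dropped_total counter. It only ever increases.
type droppedCounter struct {
	n int64
}

// Add increments the counter by delta; negative deltas are ignored
// because a counter must never decrease.
func (c *droppedCounter) Add(delta int) {
	if delta > 0 {
		atomic.AddInt64(&c.n, int64(delta))
	}
}

// Value returns the current count.
func (c *droppedCounter) Value() int64 {
	return atomic.LoadInt64(&c.n)
}

// filterMuted mirrors the mute-time check quoted above: when the route is
// muted, every alert is dropped and the counter grows by len(alerts).
func filterMuted(muted bool, alerts []Alert, dropped *droppedCounter) []Alert {
	if muted {
		dropped.Add(len(alerts))
		return nil
	}
	return alerts
}

// filterActive mirrors the active-time check quoted above: when the route is
// not active, every alert is dropped and the counter grows by len(alerts).
func filterActive(active bool, alerts []Alert, dropped *droppedCounter) []Alert {
	if !active {
		dropped.Add(len(alerts))
		return nil
	}
	return alerts
}
```

With something like this in place, the NOC could alert on the rate of the counter rather than parsing debug logs. Whether the metric should carry labels (for example, distinguishing mute-time drops from active-time drops) is left open here.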