Proposal: Add counter metric to track alerts dropped for being outside of active/within muted time bounds #3512

@tjhop

Description

When using time intervals for routes, alerts received outside of the correct time window are dropped with a log message:

// If the current time is inside a mute time, all alerts are removed from the pipeline.
if muted {
	level.Debug(l).Log("msg", "Notifications not sent, route is within mute time")
	return ctx, nil, nil
}
return ctx, alerts, nil

// If the current time is not inside an active time, all alerts are removed from the pipeline
if !active {
	level.Debug(l).Log("msg", "Notifications not sent, route is not within active time")
	return ctx, nil, nil
}
return ctx, alerts, nil

Imagine a scenario where:

  1. there are one or more routes with time intervals set (e.g., engineering teams that want alerts during business hours)
  2. there is a NOC or similar team available to triage alerts 24/7

Since alerts received outside of valid time intervals aren't retried and are outright dropped, it could be helpful to the NOC to have a count of how many alerts have been dropped for tracking or meta-alerting purposes (for example, a high rate of dropped alerts could indicate a more severe issue and justify the NOC reaching out to subject matter experts, even if outside of business hours).

Using the existing log lines, it is possible to, for example, create a recording rule in Loki and generate a metric from it. However, the log entries don't tell the user how many alerts were dropped.

I'm proposing that a new counter metric, alertmanager_alerts_dropped_total, be created and incremented by the number of alerts dropped each time alerts are dropped (i.e., counterVar.Add(float64(len(alerts))); the Prometheus client's Counter.Add takes a float64, so the length needs a conversion).
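A minimal sketch of how the increment might look, assuming nothing about how Alertmanager would actually wire it in. The `counter` type here is a hypothetical stdlib-only stand-in for a Prometheus counter, and `alert` and `filterMuted` are illustrative names, not Alertmanager APIs. The sketch only mirrors the shape of the mute-time check quoted above:

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical stand-in for a Prometheus counter such as
// alertmanager_alerts_dropped_total. Like the real client's Counter,
// Add takes a float64 and the value must never decrease.
type counter struct {
	mu    sync.Mutex
	value float64
}

// Add increases the counter by v; negative values are rejected.
func (c *counter) Add(v float64) {
	if v < 0 {
		panic("counter cannot decrease")
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value += v
}

// Value returns the current total.
func (c *counter) Value() float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.value
}

// alert is a placeholder for the real alert type.
type alert struct{ Name string }

// filterMuted is an illustrative version of the mute-time check:
// when the route is muted, every alert is dropped and the counter
// is incremented by the number of alerts dropped.
func filterMuted(muted bool, alerts []alert, dropped *counter) []alert {
	if muted {
		dropped.Add(float64(len(alerts)))
		return nil
	}
	return alerts
}

func main() {
	dropped := &counter{}
	alerts := []alert{{"HighLatency"}, {"DiskFull"}, {"NodeDown"}}

	filterMuted(true, alerts, dropped)  // muted: all three alerts are dropped
	filterMuted(false, alerts, dropped) // not muted: nothing is dropped

	fmt.Println(dropped.Value()) // prints 3
}
```

The same increment would apply to the active-time check, where alerts are dropped when `!active`.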
