Skip to content

Comments

Don't return alerts in the API (alerts/groups) until they are cross-c…#692

Closed
prymitive wants to merge 3 commits intoprometheus:masterfrom
prymitive:issue-609
Closed

Don't return alerts in the API (alerts/groups) until they are cross-c…#692
prymitive wants to merge 3 commits intoprometheus:masterfrom
prymitive:issue-609

Conversation

@prymitive
Copy link
Contributor

…hecked with silences

alerts/groups API endpoint provides "silenced" key for all alerts, which is set if any silence for this alert exist.
Due to how silences are matched with alerts, which happens out of band from API perspective, alerts/groups will return all alerts as soon as they created and only after silences are matched with an alert a "silenced" key will be added to the object in API response. This means that API clients that wants to rely on that key can't trust it, as the absence of "silenced" key doesn't guarantee that given alerts isn't silenced. This change fixes that by tracking which alert was matched against silences and only returning those alerts that were checked.
Fixes #609

@Kellel
Copy link
Contributor

Kellel commented Apr 6, 2017

Nice! I've noticed this while working on #636 as well

@prymitive
Copy link
Contributor Author

@stuartnelson3 does this look sane? It does break the behaviour of alert/groups API a bit, but I think that's acceptable given that the API isn't stable yet.

@stuartnelson3
Copy link
Contributor

It does change the behavior, but as you said, we aren't currently making any API stability guarantees.

Looking at this, it makes sense -- add a processed field so you know whether the API has looked at it. I'm just wondering if we should instead move from adding a boolean field to adding a status field, so that we don't keep adding more boolean flags in the future. We already having inhibited, silenced, and with this, processed.

So something like status: (unprocessed|active|silenced|inhibited) is what I'm thinking would actually be better. I realize this ends up with either tasking you with a refactoring of how this is managed internally, or waiting for one of us to do it.

@fabxc @brancz @mxinden -- opinion on merging all these booleans into a single status (or whatever) field?

@prymitive
Copy link
Contributor Author

That does seem like an improvement. It doesn't seem like a big task and I'm happy to do it, will update once I have a new patch ready.

@prymitive
Copy link
Contributor Author

I got back to it and took a look over the code. Current memMarker struct contains:

  • inhibited map[model.Fingerprint]struct{}
  • silenced map[model.Fingerprint]string

So those are not simple bool fields, looking at the code that uses those I see that:

  • inhibited value is not used - link
  • silenced value is used - link1 link2

So we can move those into the status field.

This field will be used to store alert flags, start with uint8 as we don't need too many bits
Inhibited flag will now be set on the status field, we don't need to keep track of this value, a single bit is enough
Once notify have a chance to resolve silences matched for given alert mark it also as processed, so we can start returning it in the group overview API

// if there are no flags set remove the whole key
if m.status[alert] == 0 {
delete(m.status, alert)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a point where we can nuke entire m.status[alert]? I'm not sure if this will ever be cleaned, current code used delete() if passed bool was false, but with status we no longer can delete the entire map if any other flag is set and processing one will be set for everything. Where should the cleanup code go?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking for the code that removes Alerts ... that should also (hopefully) have access to the marker map, so we can get rid of the alert.

@prymitive
Copy link
Contributor Author

First go at the new patch

@mxinden
Copy link
Member

mxinden commented Apr 12, 2017

@prymitive @stuartnelson3 I like the idea of the status: (unprocessed|active|silenced|inhibited) field as those four states are a disjoint set, right?

@stuartnelson3
Copy link
Contributor

More or less. Technically, an alert could be both inhibited AND silenced, but in that case I think it's only necessary to mark it as inhibited, and the result for the user is the same (i.e., no notification).

@brancz
Copy link
Member

brancz commented Apr 12, 2017

@stuartnelson3 in the notification pipeline if an alert is inhibited it will never reach the silence stage, therefore won't be added to the set of silenced alerts.

@prymitive
Copy link
Contributor Author

Oh, if they are disjoint I can simplify this PR, but:

  • all but silenced are just a bool markers, so those can be a simple iota, but for silences we need to store the value as it's the id of a silence that we need to set on an alert, so we'll endup with having status: silenced and silenced: abcdf, both fields would need to be kept in sync
  • I'm still not sure when should I drop an alert from the marker map

@stuartnelson3
Copy link
Contributor

@brancz thanks for the clarification!

@stuartnelson3
Copy link
Contributor

@prymitive

  • yeah, right now you would need to keep both fields in sync. Another change I would like to make to the API is for an alert to always return a silenced field. Currently this field only exists in the JSON representation if it has a value, but I think having consistent JSON structure is more important than not having an empty value. (People are welcome to disagree with this opinion)
  • I haven't looked at the code so am not sure how the marker map is managed. I'll take a look tomorrow, or if @brancz wants to comment, that'd also be cool :)

@prymitive
Copy link
Contributor Author

It could also be:

type Status uint8

const (
        Unprocessed = Status(iota)
        Active
        Silenced
        Inhibited
)

type alertStatus struct {
        status uint8
        value  string
}

 type memMarker struct {
 	status    map[model.Fingerprint]alertStatus 
  	mtx sync.RWMutex
  }

This way instead of having two variables to keep in sync we just have a single struct to worry about.

@stuartnelson3
Copy link
Contributor

We do have one struct, but we still have two values :) I'll think about it, but right now my gut feeling is to keep the actual alert JSON representation flat and not have a nested object containing status information. We can still accomplish that in the serialization, though, with this struct. So just go with this for now.

Took a stab at the code since alertStatus will need a custom MarshalJSON:

type state uint8

const (
	unprocessed state = iota
	active
	silenced
	inhibited
)

type alertStatus struct {
	status state
	value  string
}

func (a *alertStatus) MarshalJSON() ([]byte, error) {
	status, prs := statusMap[a.status]
	if !prs {
		status = "unknown"
	}
	return json.Marshal(map[string]string{
		"status": status,
		"value":  a.value,
	})
}

// For serializing status to JSON.
var statusMap = map[state]string{
	unprocessed: "unprocessed",
	active:      "active",
	silenced:    "silenced",
	inhibited:   "inhibited",
}

type memMarker struct {
	status map[model.Fingerprint]alertStatus

	sync.RWMutex
}

@stuartnelson3
Copy link
Contributor

I think the semantics in the marker map will stay the same.

  • If we SetSilenced, update the state to silenced depending on if a silence is passed into the method. If a silence is passed, set value to the silence id, if no silence is passed, unset the value. If it's already inhibited, don't change the state.
  • SetInhibited will, depending on the bool passed in, set the state to inhibited, or set it to silenced (value != ""), or active (value=="")
  • Set to active once it has been processed
  • Create the iota definition so that the 0 value is unprocessed

This pseudo-state machine should live within the marker implementation, I think, and we just make calls to different exposed status update methods.

@stuartnelson3
Copy link
Contributor

It seems alert cleanup happens in provider/mem/mem.go in the runGC() loop. I'm going to think about how best to do this, but potentially just adding a pointer to marker as an argument to NewAlerts when it gets created and managing it there..

@fabxc
Copy link
Contributor

fabxc commented Apr 13, 2017

+1 for adding some sort of proper "status". Although I tend to think that a status object is more appropriate for reasons of extensibility (more below).

IMO, having the unprocessed state is pretty "meh" to begin with. Good to add here as the way things are but ideally, we'd have proper indexing of all alerts and silences that allows us to avoid this.
In an ideal scenario, our "status" object could not only tell us what state an alert is in, but also which silences or alerts (for inhibition) are muting it. That's of course not trivial with silences being time-bound (and editable) and alerts being entirely ephemeral and unpredictable.

Which tools would use this API extension/change if we added it now?
In general, I think I'm okay adding it as we'll likely introduce a v2 API anyway.

@stuartnelson3
Copy link
Contributor

It sounds like the main use would be for api consumers of large installations, both the UI and amtool.

Cool idea for displaying the alert responsible for inhibition, didn't even think of that. What would we use to identify the alert?

Once we have the mechanics in place we can clean up the actual state transitions for the changes we'll be making for the new api.

@prymitive
Copy link
Contributor Author

Speaking of changes to the API - the silenced field only links to a single silence object, but there can be multiple matched silenced. I'm happy with that, but is this on purpose or accidental? From the POV of alert one is enough, but a person looking at this silenced alert might want to see all silences that are matched, but then again, if that list grows too much it stops being useful.
I think leaving it as is might be the best for now, but might be worth considering when designing v2 API.

I'll get back to making a new change set here.

@matthiasr
Copy link

If the API doesn't return an alert before it's processed, how will this look when I just created an alert? How can O know if anything happened at all?
I kind of lost track of what this PR is about, the original description is about filtering alerts altogether, but y'all are now discussing a different change it seems?

@prymitive
Copy link
Contributor Author

That's a fair comment, discussion that followed got a little bit broad and I think I'll:

  1. Close this PR as the proposed change here isn't getting merged
  2. Document proposed change on API shows briefly reports alerts as not being silenced until silences are applied #609

The new change should be:

  • refactor Alertmanager alert status handling (introduce status field on the marker struct)
  • move silence linking elsewhere (to JSON marshaling like proposed above)
  • don't change what API responds with, all alerts should be exposed, if client wants to see only those that were processed then the status key in JSON reponse can be used for that

@prymitive
Copy link
Contributor Author

Updated #609, closing this PR and will come up with a new patch, thanks all

@prymitive prymitive closed this Apr 14, 2017
@prymitive prymitive deleted the issue-609 branch April 14, 2017 23:02
@prymitive prymitive mentioned this pull request Apr 14, 2017
@prymitive
Copy link
Contributor Author

Followup in #711

hh pushed a commit to ii/alertmanager that referenced this pull request Oct 20, 2017
* Remove unnecessary select statement

* Remove unnecessary if-statement
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

API shows briefly reports alerts as not being silenced until silences are applied

7 participants