Don't return alerts in the API (alerts/groups) until they are cross-c…#692
Don't return alerts in the API (alerts/groups) until they are cross-c…#692prymitive wants to merge 3 commits intoprometheus:masterfrom prymitive:issue-609
Conversation
|
Nice! I've noticed this while working on #636 as well |
|
@stuartnelson3 does this look sane? It does break the behaviour of alert/groups API a bit, but I think that's acceptable given that the API isn't stable yet. |
|
It does change the behavior, but as you said, we aren't currently making any API stability guarantees. Looking at this, it makes sense -- add a processed field so you know whether the API has looked at it. I'm just wondering if we should instead move from adding a boolean field to adding a So something like @fabxc @brancz @mxinden -- opinion on merging all these booleans into a single |
|
That does seem like an improvement. It doesn't seem like a big task and I'm happy to do it, will update once I have a new patch ready. |
|
I got back to it and took a look over the code. Current memMarker struct contains:
So those are not simple bool fields, looking at the code that uses those I see that: So we can move those into the |
This field will be used to store alert flags, start with uint8 as we don't need too many bits
Inhibited flag will now be set on the status field, we don't need to keep track of this value, a single bit is enough
Once notify have a chance to resolve silences matched for given alert mark it also as processed, so we can start returning it in the group overview API
|
|
||
| // if there are no flags set remove the whole key | ||
| if m.status[alert] == 0 { | ||
| delete(m.status, alert) |
There was a problem hiding this comment.
Is there a point where we can nuke entire m.status[alert]? I'm not sure if this will ever be cleaned, current code used delete() if passed bool was false, but with status we no longer can delete the entire map if any other flag is set and processing one will be set for everything. Where should the cleanup code go?
There was a problem hiding this comment.
Looking for the code that removes Alerts ... that should also (hopefully) have access to the marker map, so we can get rid of the alert.
|
First go at the new patch |
|
@prymitive @stuartnelson3 I like the idea of the |
|
More or less. Technically, an alert could be both inhibited AND silenced, but in that case I think it's only necessary to mark it as inhibited, and the result for the user is the same (i.e., no notification). |
|
@stuartnelson3 in the notification pipeline if an alert is inhibited it will never reach the silence stage, therefore won't be added to the set of silenced alerts. |
|
Oh, if they are disjoint I can simplify this PR, but:
|
|
@brancz thanks for the clarification! |
|
|
It could also be: This way instead of having two variables to keep in sync we just have a single struct to worry about. |
|
We do have one struct, but we still have two values :) I'll think about it, but right now my gut feeling is to keep the actual alert JSON representation flat and not have a nested object containing status information. We can still accomplish that in the serialization, though, with this struct. So just go with this for now. Took a stab at the code since alertStatus will need a custom MarshalJSON: type state uint8
const (
unprocessed state = iota
active
silenced
inhibited
)
type alertStatus struct {
status state
value string
}
func (a *alertStatus) MarshalJSON() ([]byte, error) {
status, prs := statusMap[a.status]
if !prs {
status = "unknown"
}
return json.Marshal(map[string]string{
"status": status,
"value": a.value,
})
}
// For serializing status to JSON.
var statusMap = map[state]string{
unprocessed: "unprocessed",
active: "active",
silenced: "silenced",
inhibited: "inhibited",
}
type memMarker struct {
status map[model.Fingerprint]alertStatus
sync.RWMutex
} |
|
I think the semantics in the marker map will stay the same.
This pseudo-state machine should live within the marker implementation, I think, and we just make calls to different exposed status update methods. |
|
It seems alert cleanup happens in |
|
+1 for adding some sort of proper "status". Although I tend to think that a status object is more appropriate for reasons of extensibility (more below). IMO, having the unprocessed state is pretty "meh" to begin with. Good to add here as the way things are but ideally, we'd have proper indexing of all alerts and silences that allows us to avoid this. Which tools would use this API extension/change if we added it now? |
|
It sounds like the main use would be for api consumers of large installations, both the UI and Cool idea for displaying the alert responsible for inhibition, didn't even think of that. What would we use to identify the alert? Once we have the mechanics in place we can clean up the actual state transitions for the changes we'll be making for the new api. |
|
Speaking of changes to the API - the I'll get back to making a new change set here. |
|
If the API doesn't return an alert before it's processed, how will this look when I just created an alert? How can O know if anything happened at all? |
|
That's a fair comment, discussion that followed got a little bit broad and I think I'll:
The new change should be:
|
|
Updated #609, closing this PR and will come up with a new patch, thanks all |
|
Followup in #711 |
* Remove unnecessary select statement * Remove unnecessary if-statement
…hecked with silences
alerts/groups API endpoint provides "silenced" key for all alerts, which is set if any silence for this alert exist.
Due to how silences are matched with alerts, which happens out of band from API perspective, alerts/groups will return all alerts as soon as they created and only after silences are matched with an alert a "silenced" key will be added to the object in API response. This means that API clients that wants to rely on that key can't trust it, as the absence of "silenced" key doesn't guarantee that given alerts isn't silenced. This change fixes that by tracking which alert was matched against silences and only returning those alerts that were checked.
Fixes #609