Notification: Reset internal states on (missed)recovery#10361
Merged
Notification: Reset internal states on (missed)recovery#10361
Conversation
Al2Klimov
reviewed
Mar 7, 2025
Al2Klimov
previously approved these changes
Mar 11, 2025
39bc531 to
be83cb6
Compare
no_more_notifications only on recoverybe83cb6 to
30233cd
Compare
Member
Author
|
Please also refer to the updated PR description and the test cases! |
Previously, if you enable flapping for a Checkable but the corresponding `Notification` object does not have `FlappingStart` or `FlappingEnd` types set, the `no_more_notifications` flag wasn't reset to false again. This commit ensures that this flag is always reset on `Recovery` even the type filter does not match including when we miss the `Recovery` due to Flapping state.
30233cd to
4596b44
Compare
jschmidt-icinga
approved these changes
May 15, 2025
Contributor
jschmidt-icinga
left a comment
There was a problem hiding this comment.
I've verified the provided test cases (with some modifications), with and without the PR. The argumentation makes sense to me and the comments in the code explain what is going on.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This commit 2447f8c kinda resets #7818 to the initial state before I suggested changing this to #7818 (review) due to a bad assumption. Now, after running some tests and going through the code again, resetting this flag on non-recovery types doesn't make any sense at all. The
no_more_notificationsflag is used purely for state notifications and will also be set when triggering aProblemnotification, and likewise should be reset when triggering aRecoverynotification as the other types have nothing to do with it.Here is a brief analysis of why and when the
no_more_notificationsflag is even necessary:The first problem notification is always sent, no matter what the interval is (of course, if the type filter matches).
When a notification object has a non-zero
intervalset, it means that the user wants to receive the same problem notification at regular intervals (reminder), provided the checkable is still in the same state. In this case, theno_more_notificationsflag isn't relevant and won't be set/used.When there's no configured interval, the
no_more_notificationsflag is used to prevent sending any reminder notifications. Now, let's say we've a notification object that matches on all type and state filters, and a Checkable runs into a problem state. We'll send the first problem notification, and theno_more_notificationsflag will be set to true here.icinga2/lib/icinga/notification.cpp
Lines 390 to 391 in 5c651e4
If you now set that Checkable in downtime, with the master branch, it would implicitly reset the
no_more_notificationsflag to false, even though it's not state related notification. It will then immediately try to send reminder notifications but since the Checkable is in downtime, we won't get any notifications. This will only result in unnecessary notifications being sent out, if the downtime either ends automatically or is removed manually while the Checkable is still in the problem state.When a notification object isn't configured to trigger
Recoverynotifications, theno_more_notificationsflag will be reset anyway here, so we don't need to worry about that.icinga2/lib/icinga/notification.cpp
Lines 336 to 343 in 5c651e4
As described in Notification: Reset internal states on (missed)recovery #10361 (comment), the
no_more_notificationsflag does not behave as expected with flapping state. When a checkable goes into a problem state, it sends aProblemnotification and sets that flag totrue. I couldn't trigger this behaviour for hosts, but I can confirm that for services, if you submit a check result and switch from 'Warning' <-> 'Critical' back and forth a few times (with the default flapping threshold usually for each state three times to reach the actual flapping value of33.+). This will make the service flapping. After this, all notifications will be discarded and theno_more_notificationsflag will remain set to true. Next, send new check results withOKstates to recover the service. You will have to submit at least15times to enforce the flapping end. You can also use the script from below if you don't want to do this by yourself. The service should then be in an OK state, but if you check theno_more_notificationsflag on the notification object, it will still be set to true. You are absolutely right: this isn't how it should behave. However, what are the consequences of that apart from its false positive value? The answer is nothing. Should the service go into a problem state again, the problem notification will still be sent nonetheless (it will actually not be sent but that's a different story see 625ff43, since that flag is only relevant for reminder notifications and not for initial ones. What if the notification object is configured to delay the notifications, you ask? In that case, the flag is explicitly set to false and we don't rely on its previous state.icinga2/lib/icinga/notification.cpp
Lines 305 to 309 in cefe1bc
Overall, there are way too many internal
Notificationobject-specific states that depend on theRecoverytype, but if that somehow gets lost, they will all be in a broken state. So additionally this PR is trying to streamline this kind of recovery state checking in the sense that they all are using the exact same check now, and as far as I am aware they should all have a consistent state across different notification types.no_more_notificationsflag either on current or missed previousRecoverynotificationNotificationobject has no configuredRecoveryorFlappingEndtypes.last_notified_state_per_usercache either on current or missed previousRecoverynotificationnotified_problem_userslist either on current or missed previousRecoverynotificationTest
Use this script to trigger all errors except the non-matching type filters. In order to trigger that bug as well, you must remove the notification types
FlappingStartandFlappingEndfrom the notification configuration above.2447f8c
For this test to work comment out the final
submit_check_result 1 "WARNING - Not flapping"call from the above script and run it in a clean Icinga 2 state.Before
After
Non Flapping test cases
Before
After
30233cd
For this test case to work, you must remove the
FlappingEndandFlappingStartnotification types from the above configuration.Before
After
625ff43
Before
After
622bee3
For this test to work comment out the final
submit_check_result 1 "WARNING - Not flapping"call from the above script and run it in a clean Icinga 2 state.Before
After
ref/IP/57790