Skip to content

Notification: Reset internal states on (missed)recovery#10361

Merged
yhabteab merged 4 commits intomasterfrom
reset-no-more-notifications-only-on-recovery
May 16, 2025
Merged

Notification: Reset internal states on (missed)recovery#10361
yhabteab merged 4 commits intomasterfrom
reset-no-more-notifications-only-on-recovery

Conversation

@yhabteab
Copy link
Copy Markdown
Member

@yhabteab yhabteab commented Mar 6, 2025

This commit 2447f8c kinda resets #7818 to the initial state before I suggested changing this to #7818 (review) due to a bad assumption. Now, after running some tests and going through the code again, resetting this flag on non-recovery types doesn't make any sense at all. The no_more_notifications flag is used purely for state notifications and will also be set when triggering a Problem notification, and likewise should be reset when triggering a Recovery notification as the other types have nothing to do with it.

Here is a brief analysis of why and when the no_more_notifications flag is even necessary:

  • The first problem notification is always sent, no matter what the interval is (of course, if the type filter matches).

  • When a notification object has a non-zero interval set, it means that the user wants to receive the same problem notification at regular intervals (reminder), provided the checkable is still in the same state. In this case, the no_more_notifications flag isn't relevant and won't be set/used.

  • When there's no configured interval, the no_more_notifications flag is used to prevent sending any reminder notifications. Now, let's say we've a notification object that matches on all type and state filters, and a Checkable runs into a problem state. We'll send the first problem notification, and the no_more_notifications flag will be set to true here.

    if (type == NotificationProblem && GetInterval() <= 0)
    SetNoMoreNotifications(true);

    If you now set that Checkable in downtime, with the master branch, it would implicitly reset the no_more_notifications flag to false, even though it's not state related notification. It will then immediately try to send reminder notifications but since the Checkable is in downtime, we won't get any notifications. This will only result in unnecessary notifications being sent out, if the downtime either ends automatically or is removed manually while the Checkable is still in the problem state.

  • When a notification object isn't configured to trigger Recovery notifications, the no_more_notifications flag will be reset anyway here, so we don't need to worry about that.

    /* Ensure to reset no_more_notifications on Recovery notifications,
    * even if the admin did not configure them in the filter.
    */
    {
    ObjectLock olock(this);
    if (type == NotificationRecovery && GetInterval() <= 0)
    SetNoMoreNotifications(false);
    }

  • As described in Notification: Reset internal states on (missed)recovery #10361 (comment), the no_more_notifications flag does not behave as expected with flapping state. When a checkable goes into a problem state, it sends a Problem notification and sets that flag to true. I couldn't trigger this behaviour for hosts, but I can confirm that for services, if you submit a check result and switch from 'Warning' <-> 'Critical' back and forth a few times (with the default flapping threshold usually for each state three times to reach the actual flapping value of 33.+). This will make the service flapping. After this, all notifications will be discarded and the no_more_notifications flag will remain set to true. Next, send new check results with OK states to recover the service. You will have to submit at least 15 times to enforce the flapping end. You can also use the script from below if you don't want to do this by yourself. The service should then be in an OK state, but if you check the no_more_notifications flag on the notification object, it will still be set to true. You are absolutely right: this isn't how it should behave. However, what are the consequences of that apart from its false positive value? The answer is nothing. Should the service go into a problem state again, the problem notification will still be sent nonetheless (it will actually not be sent but that's a different story see 625ff43, since that flag is only relevant for reminder notifications and not for initial ones. What if the notification object is configured to delay the notifications, you ask? In that case, the flag is explicitly set to false and we don't rely on its previous state.

    /*
    * We need to set no more notifications to false, in case
    * some notifications were sent previously
    */
    SetNoMoreNotifications(false);

Overall, there are way too many internal Notification object-specific states that depend on the Recovery type, but if that somehow gets lost, they will all be in a broken state. So additionally this PR is trying to streamline this kind of recovery state checking in the sense that they all are using the exact same check now, and as far as I am aware they should all have a consistent state across different notification types.

  • 2447f8c Resets the no_more_notifications flag either on current or missed previous Recovery notification
  • 30233cd This is basically the same as the above one, but covers the case where the Notification object has no configured Recovery or FlappingEnd types.
  • 625ff43 Clears the last_notified_state_per_user cache either on current or missed previous Recovery notification
  • 622bee3 Clears the notified_problem_users list either on current or missed previous Recovery notification

Test

object NotificationCommand "send" {
	command = [ "true" ]
}

apply Notification "notify" to Service {
	command = "send"
	users = ["icingaadmin"]
	interval = 0
	types = [ Recovery, Problem, FlappingStart, FlappingEnd ]
	assign where true
}

Object Host "test" {
	check_command = "dummy"
	max_check_attempts = 1
	check_interval = 30s

	vars.dummy_state = 0
	vars.dummy_text = "I'm just testing something"
}

object Service "service" {
    check_command = "dummy"
    host_name = "test"
    max_check_attempts = 1
    enable_active_checks = false
    enable_flapping = true
}

Use this script to trigger all errors except the non-matching type filters. In order to trigger that bug as well, you must remove the notification types FlappingStart and FlappingEnd from the notification configuration above.

#!/bin/zsh

set -exo pipefail

submit_check_result() {
  curl -sSk \
    -u root:icinga \
    -o /dev/null \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json' \
    -X POST 'https://localhost:5665/v1/actions/process-check-result' -d@- <<EOF &
{
    "type": "Service",
    "filter": "host.name == \"test\"",
    "exit_status": $1,
    "plugin_output": "$2"
}
EOF
}

while ; do
  for state in 1 2; do
    submit_check_result $state "WARNING/CRITICAL - flapping"
  done

  flapping=$(curl -ksSu root:icinga 'https://localhost:5665/v1/objects/services?pretty=1' | jq '.results[].attrs.flapping');
  if [ "$flapping" = true ] ; then
    echo "Service is flapping, sending recovery state\n"
    while ; do
      submit_check_result 0 "OK - flapping"

      flapping=$(curl -ksSu root:icinga 'https://localhost:5665/v1/objects/services?pretty=1' | jq '.results[].attrs.flapping')
      if [ "$flapping" = false ]; then
        echo "Service is no longer flapping\n"
        # Trigger a final state change to put the service back into a problem state 'WARNING' and Icinga won't send
        # any problem notification due to the last notified state per user cache not being cleared.
        submit_check_result 1 "WARNING - Not flapping"
        break
      fi
    done

    break # We're done here and can exit the script
  fi
  echo "Service is not flapping yet, keep trying\n"
done
2447f8c

For this test to work comment out the final submit_check_result 1 "WARNING - Not flapping" call from the above script and run it in a clean Icinga 2 state.

Before

$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
false

After

$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
false

Non Flapping test cases

Before

$ curl -sSku root:icinga -H 'Content-Type: application/json'  -H 'Accept: application/json' \
 -X POST 'https://localhost:5667/v1/actions/process-check-result' -d @<(cat <<EOF
{
        "type": "Service", "filter": "host.name == \"test\"",
        "exit_status": 1,
        "pretty": true, "plugin_output": $(( RANDOM ))
}
EOF
)
...
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
true
$ curl -ksSu root:icinga -H 'Accept: application/json' -X POST 'https://localhost:5667/v1/actions/schedule-downtime' \
 -d @<(cat <<EOF
{ "type": "Service", "filter": "host.name == \"test\"", "start_time": $(date +%s), "end_time": $(date -v+1M +%s), "author": "icingaadmin", "comment": "IPv4 network maintenance", "pretty": true }
EOF
)
...
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
false

After

$ curl -sSku root:icinga -H 'Content-Type: application/json'  -H 'Accept: application/json' \
 -X POST 'https://localhost:5667/v1/actions/process-check-result' -d @<(cat <<EOF
{
        "type": "Service", "filter": "host.name == \"test\"",
        "exit_status": 1,
        "pretty": true, "plugin_output": $(( RANDOM ))
}
EOF
)
...
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
true
$ curl -ksSu root:icinga -H 'Accept: application/json' -X POST 'https://localhost:5667/v1/actions/schedule-downtime' \
 -d @<(cat <<EOF
{ "type": "Service", "filter": "host.name == \"test\"", "start_time": $(date +%s), "end_time": $(date -v+1M +%s), "author": "icingaadmin", "comment": "IPv4 network maintenance", "pretty": true }
EOF
)
...
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
true
30233cd

For this test case to work, you must remove the FlappingEnd and FlappingStart notification types from the above configuration.

Before

$ ./flapping.sh
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
true

After

$ ./flapping.sh
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
false
625ff43

Before

[2025-03-12 11:00:22 +0100] debug/HttpUtility: Request body: '{    "type": "Service",    "filter": "host.name == \"test\"",    "exit_status": 1,    "plugin_output": "WARNING - Not flapping"}'
...
[2025-03-12 11:00:22 +0100] notice/Checkable: State Change: Checkable 'test!service' hard state change from OK to WARNING detected.
[2025-03-12 11:00:22 +0100] information/Checkable: Checkable 'test!service' has 1 notification(s). Checking filters for type 'Problem', sends will be logged.
...
[2025-03-12 11:00:22 +0100] notice/Notification: Notification object 'test!service!notify': We already notified user 'icingaadmin' for a Warning problem. Likely after that another state change notification was filtered out by config. Not sending duplicate 'Warning' notification.

After

[2025-03-12 11:02:04 +0100] debug/HttpUtility: Request body: '{    "type": "Service",    "filter": "host.name == \"test\"",    "exit_status": 1,    "plugin_output": "WARNING - Not flapping"}'
[2025-03-12 11:02:04 +0100] notice/Checkable: State Change: Checkable 'test!service' hard state change from OK to WARNING detected.
[2025-03-12 11:02:04 +0100] information/Checkable: Checkable 'test!service' has 1 notification(s). Checking filters for type 'Problem', sends will be logged.
...
[2025-03-12 11:02:04 +0100] information/Notification: Sending 'Problem' notification 'test!service!notify' for user 'icingaadmin'
[2025-03-12 11:02:04 +0100] information/HttpServerConnection: Request POST /v1/actions/process-check-result (from [::1]:54323, user: root, agent: curl/8.7.1, status: OK) took total 11ms.
[2025-03-12 11:02:04 +0100] notice/Process: Running command 'true': PID 79226
[2025-03-12 11:02:04 +0100] information/Notification: Completed sending 'Problem' notification 'test!service!notify' for checkable 'test!service' and user 'icingaadmin' using command 'send'.
622bee3

For this test to work comment out the final submit_check_result 1 "WARNING - Not flapping" call from the above script and run it in a clean Icinga 2 state.

Before

$ ./flapping.sh
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.notified_problem_users'
[
  "icingaadmin"
]

After

$ ./flapping.sh
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.notified_problem_users'
[]

ref/IP/57790

@yhabteab yhabteab added bug Something isn't working area/notifications Notification events ref/IP labels Mar 6, 2025
@yhabteab yhabteab requested a review from Al2Klimov March 6, 2025 10:50
@cla-bot cla-bot bot added the cla/signed label Mar 6, 2025
Al2Klimov
Al2Klimov previously approved these changes Mar 11, 2025
@yhabteab yhabteab added this to the 2.15.0 milestone Mar 12, 2025
@yhabteab yhabteab force-pushed the reset-no-more-notifications-only-on-recovery branch from 39bc531 to be83cb6 Compare March 12, 2025 13:29
@yhabteab yhabteab changed the title Notification: Reset no_more_notifications only on recovery Notification: Reset internal states on (missed)recovery Mar 12, 2025
@yhabteab yhabteab force-pushed the reset-no-more-notifications-only-on-recovery branch from be83cb6 to 30233cd Compare March 12, 2025 13:33
@yhabteab
Copy link
Copy Markdown
Member Author

Please also refer to the updated PR description and the test cases!

@yhabteab yhabteab requested a review from Al2Klimov March 12, 2025 14:32
@Al2Klimov Al2Klimov removed their request for review March 20, 2025 09:04
@yhabteab yhabteab requested a review from julianbrost April 2, 2025 14:56
yhabteab added 4 commits May 12, 2025 12:03
Previously, if you enable flapping for a Checkable but the corresponding
`Notification` object does not have `FlappingStart` or `FlappingEnd`
types set, the `no_more_notifications` flag wasn't reset to false again.
This commit ensures that this flag is always reset on `Recovery` even
the type filter does not match including when we miss the `Recovery` due
to Flapping state.
@yhabteab yhabteab force-pushed the reset-no-more-notifications-only-on-recovery branch from 30233cd to 4596b44 Compare May 12, 2025 10:04
@yhabteab yhabteab requested a review from jschmidt-icinga May 13, 2025 07:54
Copy link
Copy Markdown
Contributor

@jschmidt-icinga jschmidt-icinga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've verified the provided test cases (with some modifications), with and without the PR. The argumentation makes sense to me and the comments in the code explain what is going on.

@yhabteab yhabteab removed the request for review from julianbrost May 16, 2025 07:52
@yhabteab yhabteab merged commit 83a0f9d into master May 16, 2025
27 checks passed
@yhabteab yhabteab deleted the reset-no-more-notifications-only-on-recovery branch May 16, 2025 07:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/notifications Notification events bug Something isn't working cla/signed ref/IP

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants