Notification: Reset internal states on (missed)recovery by yhabteab · Pull Request #10361 · Icinga/icinga2

yhabteab · 2025-03-06T10:50:39Z

This commit 2447f8c kinda resets #7818 to the initial state before I suggested changing this to #7818 (review) due to a bad assumption. Now, after running some tests and going through the code again, resetting this flag on non-recovery types doesn't make any sense at all. The no_more_notifications flag is used purely for state notifications and will also be set when triggering a Problem notification, and likewise should be reset when triggering a Recovery notification as the other types have nothing to do with it.

Here is a brief analysis of why and when the no_more_notifications flag is even necessary:

The first problem notification is always sent, no matter what the interval is (of course, if the type filter matches).
When a notification object has a non-zero interval set, it means that the user wants to receive the same problem notification at regular intervals (reminder), provided the checkable is still in the same state. In this case, the no_more_notifications flag isn't relevant and won't be set/used.
When there's no configured interval, the no_more_notifications flag is used to prevent sending any reminder notifications. Now, let's say we've a notification object that matches on all type and state filters, and a Checkable runs into a problem state. We'll send the first problem notification, and the no_more_notifications flag will be set to true here.

icinga2/lib/icinga/notification.cpp

Lines 390 to 391 in 5c651e4

if (type == NotificationProblem && GetInterval() <= 0)

SetNoMoreNotifications(true);

If you now set that Checkable in downtime, with the master branch, it would implicitly reset the no_more_notifications flag to false, even though it's not state related notification. It will then immediately try to send reminder notifications but since the Checkable is in downtime, we won't get any notifications. This will only result in unnecessary notifications being sent out, if the downtime either ends automatically or is removed manually while the Checkable is still in the problem state.

When a notification object isn't configured to trigger Recovery notifications, the no_more_notifications flag will be reset anyway here, so we don't need to worry about that.

icinga2/lib/icinga/notification.cpp

Lines 336 to 343 in 5c651e4

    
           			/* Ensure to reset no_more_notifications on Recovery notifications, 
        
           			 * even if the admin did not configure them in the filter. 
        
           			 */ 
        
           			{ 
        
           				ObjectLock olock(this); 
        
           				if (type == NotificationRecovery && GetInterval() <= 0) 
        
           					SetNoMoreNotifications(false); 
        
           			}

As described in Notification: Reset internal states on (missed)recovery #10361 (comment), the no_more_notifications flag does not behave as expected with flapping state. When a checkable goes into a problem state, it sends a Problem notification and sets that flag to true. I couldn't trigger this behaviour for hosts, but I can confirm that for services, if you submit a check result and switch from 'Warning' <-> 'Critical' back and forth a few times (with the default flapping threshold usually for each state three times to reach the actual flapping value of 33.+). This will make the service flapping. After this, all notifications will be discarded and the no_more_notifications flag will remain set to true. Next, send new check results with OK states to recover the service. You will have to submit at least 15 times to enforce the flapping end. You can also use the script from below if you don't want to do this by yourself. The service should then be in an OK state, but if you check the no_more_notifications flag on the notification object, it will still be set to true. You are absolutely right: this isn't how it should behave. However, what are the consequences of that apart from its false positive value? The answer is nothing. Should the service go into a problem state again, the problem notification will still be sent nonetheless (it will actually not be sent but that's a different story see 625ff43, since that flag is only relevant for reminder notifications and not for initial ones. What if the notification object is configured to delay the notifications, you ask? In that case, the flag is explicitly set to false and we don't rely on its previous state.

icinga2/lib/icinga/notification.cpp

Lines 305 to 309 in cefe1bc

    
           				/* 
        
           				 * We need to set no more notifications to false, in case 
        
           				 * some notifications were sent previously 
        
           				 */ 
        
           				SetNoMoreNotifications(false);

Overall, there are way too many internal Notification object-specific states that depend on the Recovery type, but if that somehow gets lost, they will all be in a broken state. So additionally this PR is trying to streamline this kind of recovery state checking in the sense that they all are using the exact same check now, and as far as I am aware they should all have a consistent state across different notification types.

2447f8c Resets the no_more_notifications flag either on current or missed previous Recovery notification
30233cd This is basically the same as the above one, but covers the case where the Notification object has no configured Recovery or FlappingEnd types.
625ff43 Clears the last_notified_state_per_user cache either on current or missed previous Recovery notification
622bee3 Clears the notified_problem_users list either on current or missed previous Recovery notification

Test

object NotificationCommand "send" {
	command = [ "true" ]
}

apply Notification "notify" to Service {
	command = "send"
	users = ["icingaadmin"]
	interval = 0
	types = [ Recovery, Problem, FlappingStart, FlappingEnd ]
	assign where true
}

Object Host "test" {
	check_command = "dummy"
	max_check_attempts = 1
	check_interval = 30s

	vars.dummy_state = 0
	vars.dummy_text = "I'm just testing something"
}

object Service "service" {
    check_command = "dummy"
    host_name = "test"
    max_check_attempts = 1
    enable_active_checks = false
    enable_flapping = true
}

Use this script to trigger all errors except the non-matching type filters. In order to trigger that bug as well, you must remove the notification types FlappingStart and FlappingEnd from the notification configuration above.

#!/bin/zsh

set -exo pipefail

submit_check_result() {
  curl -sSk \
    -u root:icinga \
    -o /dev/null \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json' \
    -X POST 'https://localhost:5665/v1/actions/process-check-result' -d@- <<EOF &
{
    "type": "Service",
    "filter": "host.name == \"test\"",
    "exit_status": $1,
    "plugin_output": "$2"
}
EOF
}

while ; do
  for state in 1 2; do
    submit_check_result $state "WARNING/CRITICAL - flapping"
  done

  flapping=$(curl -ksSu root:icinga 'https://localhost:5665/v1/objects/services?pretty=1' | jq '.results[].attrs.flapping');
  if [ "$flapping" = true ] ; then
    echo "Service is flapping, sending recovery state\n"
    while ; do
      submit_check_result 0 "OK - flapping"

      flapping=$(curl -ksSu root:icinga 'https://localhost:5665/v1/objects/services?pretty=1' | jq '.results[].attrs.flapping')
      if [ "$flapping" = false ]; then
        echo "Service is no longer flapping\n"
        # Trigger a final state change to put the service back into a problem state 'WARNING' and Icinga won't send
        # any problem notification due to the last notified state per user cache not being cleared.
        submit_check_result 1 "WARNING - Not flapping"
        break
      fi
    done

    break # We're done here and can exit the script
  fi
  echo "Service is not flapping yet, keep trying\n"
done

2447f8c

For this test to work comment out the final submit_check_result 1 "WARNING - Not flapping" call from the above script and run it in a clean Icinga 2 state.

Before

$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
false

After

$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
false

Non Flapping test cases

Before

$ curl -sSku root:icinga -H 'Content-Type: application/json'  -H 'Accept: application/json' \
 -X POST 'https://localhost:5667/v1/actions/process-check-result' -d @<(cat <<EOF
{
        "type": "Service", "filter": "host.name == \"test\"",
        "exit_status": 1,
        "pretty": true, "plugin_output": $(( RANDOM ))
}
EOF
)
...
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
true
$ curl -ksSu root:icinga -H 'Accept: application/json' -X POST 'https://localhost:5667/v1/actions/schedule-downtime' \
 -d @<(cat <<EOF
{ "type": "Service", "filter": "host.name == \"test\"", "start_time": $(date +%s), "end_time": $(date -v+1M +%s), "author": "icingaadmin", "comment": "IPv4 network maintenance", "pretty": true }
EOF
)
...
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
false

After

$ curl -sSku root:icinga -H 'Content-Type: application/json'  -H 'Accept: application/json' \
 -X POST 'https://localhost:5667/v1/actions/process-check-result' -d @<(cat <<EOF
{
        "type": "Service", "filter": "host.name == \"test\"",
        "exit_status": 1,
        "pretty": true, "plugin_output": $(( RANDOM ))
}
EOF
)
...
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
true
$ curl -ksSu root:icinga -H 'Accept: application/json' -X POST 'https://localhost:5667/v1/actions/schedule-downtime' \
 -d @<(cat <<EOF
{ "type": "Service", "filter": "host.name == \"test\"", "start_time": $(date +%s), "end_time": $(date -v+1M +%s), "author": "icingaadmin", "comment": "IPv4 network maintenance", "pretty": true }
EOF
)
...
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
true

30233cd

For this test case to work, you must remove the FlappingEnd and FlappingStart notification types from the above configuration.

Before

$ ./flapping.sh
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
true

After

$ ./flapping.sh
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.no_more_notifications'
false

625ff43

Before

[2025-03-12 11:00:22 +0100] debug/HttpUtility: Request body: '{    "type": "Service",    "filter": "host.name == \"test\"",    "exit_status": 1,    "plugin_output": "WARNING - Not flapping"}'
...
[2025-03-12 11:00:22 +0100] notice/Checkable: State Change: Checkable 'test!service' hard state change from OK to WARNING detected.
[2025-03-12 11:00:22 +0100] information/Checkable: Checkable 'test!service' has 1 notification(s). Checking filters for type 'Problem', sends will be logged.
...
[2025-03-12 11:00:22 +0100] notice/Notification: Notification object 'test!service!notify': We already notified user 'icingaadmin' for a Warning problem. Likely after that another state change notification was filtered out by config. Not sending duplicate 'Warning' notification.

After

[2025-03-12 11:02:04 +0100] debug/HttpUtility: Request body: '{    "type": "Service",    "filter": "host.name == \"test\"",    "exit_status": 1,    "plugin_output": "WARNING - Not flapping"}'
[2025-03-12 11:02:04 +0100] notice/Checkable: State Change: Checkable 'test!service' hard state change from OK to WARNING detected.
[2025-03-12 11:02:04 +0100] information/Checkable: Checkable 'test!service' has 1 notification(s). Checking filters for type 'Problem', sends will be logged.
...
[2025-03-12 11:02:04 +0100] information/Notification: Sending 'Problem' notification 'test!service!notify' for user 'icingaadmin'
[2025-03-12 11:02:04 +0100] information/HttpServerConnection: Request POST /v1/actions/process-check-result (from [::1]:54323, user: root, agent: curl/8.7.1, status: OK) took total 11ms.
[2025-03-12 11:02:04 +0100] notice/Process: Running command 'true': PID 79226
[2025-03-12 11:02:04 +0100] information/Notification: Completed sending 'Problem' notification 'test!service!notify' for checkable 'test!service' and user 'icingaadmin' using command 'send'.

622bee3

For this test to work comment out the final submit_check_result 1 "WARNING - Not flapping" call from the above script and run it in a clean Icinga 2 state.

Before

$ ./flapping.sh
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.notified_problem_users'
[
  "icingaadmin"
]

After

$ ./flapping.sh
$ curl -ksSu root:icinga 'https://localhost:5667/v1/objects/notifications?pretty=1' | jq '.results[].attrs.notified_problem_users'
[]

ref/IP/57790

lib/icinga/notification.cpp

yhabteab · 2025-03-12T14:32:46Z

Please also refer to the updated PR description and the test cases!

Previously, if you enable flapping for a Checkable but the corresponding `Notification` object does not have `FlappingStart` or `FlappingEnd` types set, the `no_more_notifications` flag wasn't reset to false again. This commit ensures that this flag is always reset on `Recovery` even the type filter does not match including when we miss the `Recovery` due to Flapping state.

jschmidt-icinga

I've verified the provided test cases (with some modifications), with and without the PR. The argumentation makes sense to me and the comments in the code explain what is going on.

yhabteab added bug Something isn't working area/notifications Notification events ref/IP labels Mar 6, 2025

yhabteab requested a review from Al2Klimov March 6, 2025 10:50

cla-bot bot added the cla/signed label Mar 6, 2025

Al2Klimov reviewed Mar 7, 2025

View reviewed changes

lib/icinga/notification.cpp Outdated Show resolved Hide resolved

Al2Klimov previously approved these changes Mar 11, 2025

View reviewed changes

lib/icinga/notification.cpp Outdated Show resolved Hide resolved

yhabteab mentioned this pull request Mar 12, 2025

Notification: Clear last notified state per user on flapping end as well #10374

Closed

yhabteab added this to the 2.15.0 milestone Mar 12, 2025

yhabteab dismissed Al2Klimov’s stale review via be83cb6 March 12, 2025 13:29

yhabteab force-pushed the reset-no-more-notifications-only-on-recovery branch from 39bc531 to be83cb6 Compare March 12, 2025 13:29

yhabteab changed the title ~~Notification: Reset no_more_notifications only on recovery~~ Notification: Reset internal states on (missed)recovery Mar 12, 2025

yhabteab force-pushed the reset-no-more-notifications-only-on-recovery branch from be83cb6 to 30233cd Compare March 12, 2025 13:33

yhabteab requested a review from Al2Klimov March 12, 2025 14:32

Al2Klimov removed their request for review March 20, 2025 09:04

yhabteab requested a review from julianbrost April 2, 2025 14:56

yhabteab added 4 commits May 12, 2025 12:03

Notification: Reset no_more_notifications only on recovery

89f12c2

Notification: Clear last notified state per user on flapping end as well

86365a4

Notification: Reset notified problem users on flapping end as well

9166326

yhabteab force-pushed the reset-no-more-notifications-only-on-recovery branch from 30233cd to 4596b44 Compare May 12, 2025 10:04

yhabteab requested a review from jschmidt-icinga May 13, 2025 07:54

jschmidt-icinga approved these changes May 15, 2025

View reviewed changes

yhabteab removed the request for review from julianbrost May 16, 2025 07:52

yhabteab merged commit 83a0f9d into master May 16, 2025
27 checks passed

yhabteab deleted the reset-no-more-notifications-only-on-recovery branch May 16, 2025 07:53

yhabteab mentioned this pull request Dec 18, 2025

Fix a race condition in NotificationComponent leading to duplicate notification #10628

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Notification: Reset internal states on (missed)recovery#10361

Notification: Reset internal states on (missed)recovery#10361
yhabteab merged 4 commits intomasterfrom
reset-no-more-notifications-only-on-recovery

yhabteab commented Mar 6, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

yhabteab commented Mar 12, 2025

Uh oh!

jschmidt-icinga left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	if (type == NotificationProblem && GetInterval() <= 0)
	SetNoMoreNotifications(true);

	/* Ensure to reset no_more_notifications on Recovery notifications,
	* even if the admin did not configure them in the filter.
	*/
	{
	ObjectLock olock(this);
	if (type == NotificationRecovery && GetInterval() <= 0)
	SetNoMoreNotifications(false);
	}

	/*
	* We need to set no more notifications to false, in case
	* some notifications were sent previously
	*/
	SetNoMoreNotifications(false);

Conversation

yhabteab commented Mar 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test

Before

After

Non Flapping test cases

Before

After

Before

After

Before

After

Before

After

Uh oh!

Uh oh!

Uh oh!

yhabteab commented Mar 12, 2025

Uh oh!

jschmidt-icinga left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yhabteab commented Mar 6, 2025 •

edited

Loading