Skip to content

Checkable::GetSeverity(): consider reachability#10399

Merged
julianbrost merged 9 commits intomasterfrom
severity-reachability
Apr 2, 2025
Merged

Checkable::GetSeverity(): consider reachability#10399
julianbrost merged 9 commits intomasterfrom
severity-reachability

Conversation

@julianbrost
Copy link
Copy Markdown
Member

@julianbrost julianbrost commented Mar 31, 2025

The PR can be split in two general parts: All but the last commits are general cleanup of the code (nothing spectacular, just braces for ifs, early return instead of nesting code in an else block, and similar) in Host::GetSeverity() and Service::GetSeverity() without any change in functionality, these are split into individual commits so that the individual changes are easier to follow. Finally, the last commit implements the functional change of this PR.

So far, Service::GetSeverity() only considered the state of its own host, i.e. the implicit service to its own host dependency, and treated it similar to acknowledgements and downtimes. In contrast, Host::GetSeverity() considered reachability and treated it like a state, i.e. for the severity calculation, the host was either up, down, or unreachable.

This commit changes the following things:

  1. Make the service severity also consider explicitly configured dependencies by using IsReachable().
  2. Prefer acknowledgements and downtimes over unreachability in the severity calculation so that if an already acknowledged or in-downtime services (i.e. already handled service) becomes unreachable, it shouln't become more severe.
  3. To unify host and service severities a bit, hosts now use the same logic that treats reachability more like acknowledgements/downtimes instead of like a state (changing the other way around would the state from the check plugin would not affect the severity for unrachable services anymore).

Tests

The following config snippet generates host and service objects for all the combinations of the different state attributes, i.e. hosts like github-10399-down-indowntime-acknowledged-unreachable where the parts of the name explain the state. For acknowledgements and downtime, API requests are listed below to bring the objects into the desired state, the rest of the state is automatically created by the config.

Icinga 2 config
for (var dep_reachable in [true, false]) {
	for (var acknowledged in [true, false]) {
		for (var in_downtime in [true, false]) {
			for (var state in [null /* pending */, HostUp, HostDown]) {
				var name = "github-10399"

				if (state == HostUp) {
					name += "-up"
				} else if (state == HostDown) {
					name += "-down"
				} else {
					name += "-pending"
				}

				if (in_downtime) {
					name += "-indowntime"
				}

				if (acknowledged) {
					name += "-acknowledged"
				}

				if (!dep_reachable) {
					name += "-unreachable"
				}

				log("Host: " + name)
				object Host name use (state, in_downtime, acknowledged) {
					check_command = "dummy"
					check_interval = 15m
					retry_interval = 15m

					if (state == HostDown) {
						vars.dummy_state = ServiceCritical
					} else if (state == HostUp) {
						vars.dummy_state = ServiceOK
					} else {
						// keep it pending
						enable_active_checks = false
						enable_passive_checks = false
					}

					vars.github_10399_in_downtime = in_downtime
					vars.github_10399_acknowledged = acknowledged
				}

				if (!dep_reachable) {
					object Dependency "unreachable" use (name) {
						child_host_name = name
						parent_host_name = "github-10399-down"
					}
				}
			}

			for (var state in [null /* pending */, ServiceOK, ServiceWarning, ServiceCritical, ServiceUnknown]) {
				for (var host_up in [true, false]) {
					var name = "github-10399"

					if (state == ServiceOK) {
						name += "-ok"
					} else if (state == ServiceWarning) {
						name += "-warning"
					} else if (state == ServiceCritical) {
						name += "-critical"
					} else if (state == ServiceUnknown) {
						name += "-unknown"
					} else {
						name += "-pending"
					}

					if (in_downtime) {
						name += "-indowntime"
					}

					if (acknowledged) {
						name += "-acknowledged"
					}

					if (!dep_reachable) {
						name += "-unreachable"
					}

					if (host_up) {
						var host = "github-10399-up"
					} else {
						name += "-hostdown"
						var host = "github-10399-down"
					}

					log("Service: " + name)
					object Service name use (host, state, in_downtime, acknowledged) {
						host_name = host
						check_command = "dummy"
						check_interval = 15m
						retry_interval = 15m

						if (state != null) {
							vars.dummy_state = state		
						} else {
							// keep it pending
							enable_active_checks = false
							enable_passive_checks = false
						}

						vars.github_10399_in_downtime = in_downtime
						vars.github_10399_acknowledged = acknowledged
					}

					if (!dep_reachable) {
						object Dependency "unreachable" use (host, name) {
							child_host_name = host
							child_service_name = name
							parent_host_name = "github-10399-down"
						}
					}
				}
			} 
		}
	}
}
Icinga 2 API requests for adding acknowledgements and downtimes
curl -sSku root:icinga 'https://localhost:5665/v1/actions/acknowledge-problem' --json '{"type": "Host", "filter": "host.vars.github_10399_acknowledged", "author": "dummy", "comment": "dummy", "pretty": true}'

curl -sSku root:icinga 'https://localhost:5665/v1/actions/acknowledge-problem' --json '{"type": "Service", "filter": "service.vars.github_10399_acknowledged", "author": "dummy", "comment": "dummy", "pretty": true}'

curl -sSku root:icinga 'https://localhost:5665/v1/actions/schedule-downtime' --json '{"type": "Host", "filter": "host.vars.github_10399_in_downtime", "start_time": '"$(date +%s)"', "end_time": '"$(date -d +1day +%s)"', "author": "dummy", "comment": "dummy", "pretty": true}'

curl -sSku root:icinga 'https://localhost:5665/v1/actions/schedule-downtime' --json '{"type": "Service", "filter": "service.vars.github_10399_in_downtime", "start_time": '"$(date +%s)"', "end_time": '"$(date -d +1day +%s)"', "author": "dummy", "comment": "dummy", "pretty": true}'

Current master (0613381)

Host list sorted by severity
Service list sorted by severity

This PR (31a224c)

Ordering still looks reasonable overall. The notable change is that in the service list, you can see how all completely unhandled services (i.e. problem and not faded out with some pictogram) are moved to the top of the list.

Host list sorted by severity
Service list sorted by severity

fixes #10340

No change in functionality, just making the code a bit nicer and more compact.
No change in functionality, just making the code a bit more compact.
No change in functionality. The first two branches actually set the final
return value for the method, so they can just return directly, removing the
need to have the rest of the function inside an else block.
No change in functionality. The ObjectLock destructor will implicitly release
the locks when returning from the function.
No change in functionality, just makes the code a bit nicer.
No change in functionality. The first two branches actually set the final
return value for the method, so they can just return directly, removing the
need to have the rest of the function inside an else block.
No change in functionality. The ObjectLock destructor will implicitly release
the locks when returning from the function.
So far, Service::GetSeverity() only considered the state of its own host, i.e.
the implicit service to its own host dependency, and treated it similar to
acknowledgements and downtimes. In contrast, Host::GetSeverity() considered
reachability and treated it like a state, i.e. for the severity calculation,
the host was either up, down, or unreachable.

This commit changes the following things:
1. Make the service severity also consider explicitly configured dependencies
   by using IsReachable().
2. Prefer acknowledgements and downtimes over unreachability in the severity
   calculation so that if an already acknowledged or in-downtime services (i.e.
   already handled service) becomes unreachable, it shouln't become more
   severe.
3. To unify host and service severities a bit, hosts now use the same logic
   that treats reachability more like acknowledgements/downtimes instead of
   like a state (changing the other way around would the state from the check
   plugin would not affect the severity for unrachable services anymore).
@cla-bot cla-bot bot added the cla/signed label Mar 31, 2025
@julianbrost julianbrost marked this pull request as ready for review April 1, 2025 13:41
@julianbrost julianbrost added the area/runtime Downtimes, comments, dependencies, events label Apr 1, 2025
@julianbrost julianbrost requested a review from yhabteab April 1, 2025 13:42
@yhabteab yhabteab added this to the 2.15.0 milestone Apr 1, 2025
@julianbrost julianbrost merged commit 27e1850 into master Apr 2, 2025
25 checks passed
@julianbrost julianbrost deleted the severity-reachability branch April 2, 2025 07:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/runtime Downtimes, comments, dependencies, events cla/signed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Service display by service_severity doesn't take care of dependencies

3 participants