Handle config package updates gracefully #10476
Conversation
Force-pushed from 9d38416 to f168a23
Force-pushed from f168a23 to 19cd522
Force-pushed from 19cd522 to d124e51
I've changed this to a mutex-based implementation instead of atomic operations, as requested. Functionally it works just like before, but I don't know whether it will slow down requests to a noticeable degree. So, please have a look!
jschmidt-icinga left a comment
I like these changes; I think it's much easier to parse this way. As suggested below, if you're concerned about the mutex being held, you could release it before generating the JSON error messages.
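The suggestion above — dropping the lock before the (comparatively expensive) JSON error serialization — can be sketched roughly like this. This is a hypothetical standalone illustration, not the actual Icinga 2 code; all names here are invented:

```cpp
#include <mutex>
#include <string>

// Hypothetical sketch: hold the mutex only for the state check/update,
// and release it before building the JSON error body, so concurrent
// requests are not blocked while the response is serialized.
class PackageUpdateGuard {
public:
    // Returns an empty string if the update may proceed,
    // otherwise a JSON error body for the caller to send back.
    std::string TryBeginUpdate() {
        std::string error;
        {
            std::unique_lock<std::mutex> lock(m_Mutex);
            if (m_UpdateRunning)
                error = "another package update is already running";
            else
                m_UpdateRunning = true;
        } // mutex released here, before any expensive work

        if (!error.empty()) {
            // Serialization happens without the lock held.
            return "{\"error\": 423, \"status\": \"" + error + "\"}";
        }
        return "";
    }

    void EndUpdate() {
        std::unique_lock<std::mutex> lock(m_Mutex);
        m_UpdateRunning = false;
    }

private:
    std::mutex m_Mutex;
    bool m_UpdateRunning = false;
};
```

The point of the scoped block is that the critical section covers only the flag, not the string building, so the lock hold time stays constant regardless of response size.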
Force-pushed from d124e51 to ad88fe7
Force-pushed from ad88fe7 to af9fdef
I've changed the implementation to use […]
jschmidt-icinga left a comment
Looks good to me.
Force-pushed from af9fdef to 79daa75
Sorry, I had to fix some segfault crashes 🙈!
Force-pushed from 79daa75 to 962dd0b
I've just pushed a new commit that fixes the issue referenced in the PR description and makes use of AtomicFile for some of the important files. Please have a look at it and let me know if you have any questions or concerns. Thanks!
Force-pushed from 23892a0 to 78adfe4
I don't yet understand how this is supposed to fix this.
No, it doesn't make […]:

```
$ cat /var/lib/icinga2/api/packages/example-cmd/include.conf
include "*/include.conf"
```

…which originated from this code part: icinga2/lib/remote/configpackageutility.cpp, lines 123 to 126 in aeb2550. Now, what does this include file do? It includes the […]:

```
$ cat /var/lib/icinga2/api/packages/example-cmd/dc8d5503-71a4-4498-9c8b-637e99bd06c1/include.conf
include "../active.conf"

if (ActiveStages["example-cmdb"] == "dc8d5503-71a4-4498-9c8b-637e99bd06c1") {
	include_recursive "conf.d"
	include_zones "example-cmdb", "zones.d"
}
```

This originated from this code part: icinga2/lib/remote/configpackageutility.cpp, lines 153 to 160 in aeb2550. So, the package's […]
So it's there but it doesn't find it? Why? (I've overheard you having a discussion about normalizing relative file paths earlier, I believe that's probably relevant here, but I didn't follow and the details are missing here.)
It's obviously not there, like I said in my previous comment. The include expression contains the deleted config stage as a relative base, and the […] icinga2/lib/config/configcompiler.cpp, lines 123 to 143 in aeb2550.

Nothing there would resolve this, so it shouldn't be relevant for this PR, and I don't see how normalizing relative paths earlier would resolve this specific case.
Is there any point where […]? Or is the following race condition happening here?

If […]
Force-pushed from abf08d0 to 8832e11
Previously, we used a simple boolean to track the state of package updates and didn't reset it when the config validation succeeded, because it was assumed that if the config validated successfully beforehand, the worker would also successfully reload it afterwards and the old worker would be terminated. However, this assumption does not always hold. The most obvious counterexample: after the config was successfully validated, it might change again before the worker was able to reload it. If that happens, the new worker may fail to validate the config due to the recent changes, in which case the old worker remains active and the flag stays set to true, causing any subsequent requests to fail with a `423` until you manually restart the Icinga 2 service. So, in order to prevent such a situation, we additionally track the last time a reload failed and allow bypassing the `m_RunningPackageUpdates` flag only if the last reload failure time has changed since the previous request.
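The bypass logic described above can be sketched as follows. This is a simplified, hypothetical illustration of the idea (a flag plus a remembered failure timestamp), not the actual `m_RunningPackageUpdates` implementation; the class and member names besides that flag are invented:

```cpp
#include <mutex>

// Hypothetical sketch: a package update sets m_RunningPackageUpdates until
// the reloaded worker takes over. If a reload *failure* is recorded after
// the flag was set, the flag is stale and may be bypassed, so the API does
// not keep answering 423 until a manual restart.
class UpdateState {
public:
    // lastReloadFailed: timestamp of the most recent failed reload.
    // Returns true if a new package update may start.
    bool TryStartUpdate(long lastReloadFailed) {
        std::unique_lock<std::mutex> lock(m_Mutex);

        // A genuinely concurrent update: reject (HTTP 423).
        if (m_RunningPackageUpdates && lastReloadFailed <= m_SeenReloadFailed)
            return false;

        // Either nothing is running, or a reload failed since the flag was
        // set; remember the failure time and take over.
        m_SeenReloadFailed = lastReloadFailed;
        m_RunningPackageUpdates = true;
        return true;
    }

    void FinishUpdate() {
        std::unique_lock<std::mutex> lock(m_Mutex);
        m_RunningPackageUpdates = false;
    }

private:
    std::mutex m_Mutex;
    bool m_RunningPackageUpdates = false;
    long m_SeenReloadFailed = 0;
};
```

With this shape, a stale flag left behind by a failed reload is self-healing: the next request sees a newer failure timestamp and proceeds instead of returning 423 forever.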
Force-pushed from 8832e11 to 73f3fdb
Please change 73f3fdb so that the […]
I agree with that¹, lest I get innocently […]
Once the new worker process has read the config, it also evaluates an `include "*/include.conf"` statement within the config package's root directory, and from then on we must not allow deleting any stage directory from the config package. Otherwise, when the worker actually evaluates that include statement, it will fail to find the directory where the include file is located, or the `active.conf` file, which is included from each stage's `include.conf` file, thus causing the worker to fail.

Co-Authored-By: Johannes Schmidt <[email protected]>
Force-pushed from 73f3fdb to ce3275d
julianbrost left a comment
This can also be tested without having to patch anything in C++. Simply add the following to a config file somewhere in /etc/icinga2:

```
if (!globals.contains("ActiveStageOverride")) { throw "do not try this at home" }
```

This will fail the config load unless it's called with `--define ActiveStageOverride=...`, as is done during the first config validation. So it'll pass that but fail the actual reload.
Now with that in place, trigger a Director config deployment, the first one will go through but fail the subsequent reload, preventing a further config deployment with that conflicting request error.
With this PR on the other hand, subsequent deployments can be started. And after removing the bad line from the config, the reloads will also succeed again.
Locking against concurrent config deployments also still works (easily tested by adding a `sleep(30)` to the config instead).
Previously, we used a simple boolean to track the state of the package updates, and didn't reset it back when the config validation was successful, because it was assumed that if we successfully validated the config beforehand, then the worker would also successfully reload the config afterwards, and that the old worker would be terminated. However, this assumption is not always true; the most obvious reason is that after we successfully validated the config, the config might have changed again before the worker was able to reload it. If that happens, then the new worker might fail to validate the config due to the recent changes, in which case the old worker would remain active, and this flag would still be set to true, causing any subsequent requests to fail with a `423` until you manually restart the Icinga 2 service. So, in order to prevent such a situation, we are additionally tracking the last time a reload failed and allow bypassing the `m_RunningPackageUpdates` flag only if the last reload failed time was changed since the previous request.

Additionally, this PR uses AtomicFile to write some of the important files in ConfigPackageUtility, such as the `active.conf` file. This additional commit abf08d0 fixes the following strange-looking error from a customer.
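AtomicFile presumably follows the common write-to-temporary-then-rename pattern. As a minimal standalone sketch of that pattern (a hypothetical helper, not the actual Icinga 2 class):

```cpp
#include <cstdio>
#include <fstream>
#include <stdexcept>
#include <string>

// Sketch of atomic file replacement: write the new content to a temporary
// file next to the target, then rename() it over the target. On POSIX,
// rename() replaces the destination atomically, so readers see either the
// complete old file or the complete new one, never a half-written file.
void AtomicWrite(const std::string& path, const std::string& content) {
    std::string tmp = path + ".tmp"; // same directory, same filesystem

    {
        std::ofstream out(tmp, std::ios::trunc | std::ios::binary);
        if (!out)
            throw std::runtime_error("cannot open " + tmp);
        out << content;
        out.flush(); // a production version would also fsync() the fd
    }

    if (std::rename(tmp.c_str(), path.c_str()) != 0) {
        std::remove(tmp.c_str()); // clean up on failure
        throw std::runtime_error("rename failed for " + path);
    }
}
```

For a file like `active.conf`, this matters because a worker that crashes or reloads mid-write must never leave a truncated file behind for the next config evaluation to trip over.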
Tests
Use this to generate a broken config, so that the package update succeeds but the actual worker fails due to an error.
Before
Will never accept new package updates after this until you manually restart the Icinga 2 service.
After
Since the above diff will always generate a broken config after the config update request successfully validated the config, the only error you'll see (unless you've started the Icinga 2 systemd service without the `--close-stdio` option) is the one from the `Status` entry of the `systemctl` command:

```
$ systemctl status icinga2
● icinga2.service - Icinga host/service/network monitoring system
     Loaded: loaded (/usr/lib/systemd/system/icinga2.service; disabled; preset: disabled)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf, 50-keep-warm.conf
     Active: active (running) since Mon 2025-06-16 08:37:46 UTC; 10min ago
 Invocation: d9a34569c78f462bb40c24afb9341d42
    Process: 74429 ExecStartPre=/usr/local/icinga2/lib/icinga2/prepare-dirs /usr/local/icinga2/etc/sysconfig/icinga2 (code=exited, status=0/SUCCESS)
   Main PID: 74438 (icinga2)
     Status: "Config validation failed."
```

To trigger the newly introduced `503` error case, you can run the following loop: […]

ref/IP/58256
fixes #10055