
Handle config package updates gracefully#10476

Merged
julianbrost merged 3 commits intomasterfrom
broken-config-stage-updates
Jul 25, 2025

Conversation

@yhabteab yhabteab commented Jun 13, 2025

Previously, we used a simple boolean to track the state of package updates and didn't reset it when the config validation succeeded, because it was assumed that if we successfully validated the config beforehand, the worker would also successfully reload it afterwards and the old worker would be terminated. However, this assumption is not always true. The most obvious counterexample is that after we successfully validated the config, it might have changed again before the worker was able to reload it. If that happens, the new worker might fail to validate the config due to the recent changes, in which case the old worker remains active and the flag stays set to true, causing any subsequent requests to fail with a 423 until you manually restart the Icinga 2 service.

So, in order to prevent such a situation, we additionally track the last time a reload failed and allow bypassing the m_RunningPackageUpdates flag only if the last reload failure time has changed since the previous request.
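The bypass logic can be sketched roughly as follows. This is a minimal illustration of the idea described above; the names (`PackageUpdateGuard`, `TryBegin`, `m_SeenReloadFailedAt`) are made up for the sketch and are not the actual Icinga 2 implementation, which keys off `Application::GetLastReloadFailed()`.

```cpp
#include <cassert>
#include <ctime>

// Minimal sketch of the described bypass logic. All names here are
// illustrative assumptions, not the actual Icinga 2 implementation.
struct PackageUpdateGuard {
	bool m_RunningPackageUpdates = false;
	std::time_t m_SeenReloadFailedAt = 0; // reload-failure time observed when the flag was set

	// lastReloadFailed would come from something like Application::GetLastReloadFailed().
	bool TryBegin(std::time_t lastReloadFailed) {
		if (m_RunningPackageUpdates && lastReloadFailed == m_SeenReloadFailedAt)
			return false; // previous update still pending and no reload failed since: reject

		// Either no update is running, or a reload failed after the flag was
		// set, so the previous update can never complete: allow a new one.
		m_RunningPackageUpdates = true;
		m_SeenReloadFailedAt = lastReloadFailed;
		return true;
	}
};
```

With a plain boolean, the third call below would be rejected forever; comparing the failure timestamp is what breaks the deadlock.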

Additionally, this PR uses AtomicFile to write some of the important files in ConfigPackageUtility, such as the active.conf file.
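For background, an atomic file write of this kind boils down to writing a temporary file and renaming it over the target. The sketch below is a simplified illustration of that pattern, not the actual Icinga 2 `AtomicFile` class; `WriteFileAtomically` is a made-up name.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Simplified sketch of the temp-file-plus-rename pattern behind an atomic
// file write; illustrative only, not the real Icinga 2 AtomicFile class.
void WriteFileAtomically(const std::string& path, const std::string& contents)
{
	std::string tmp = path + ".tmp";

	{
		std::ofstream out(tmp, std::ofstream::out | std::ofstream::trunc);
		out << contents;
	} // stream flushed and closed here; a real implementation would also fsync

	// std::rename() maps to rename(2), which replaces the target atomically
	// on POSIX: readers see either the old active.conf or the complete new
	// one, never a half-written file.
	std::rename(tmp.c_str(), path.c_str());
}
```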

The additional commit abf08d0 fixes the following strange-looking error reported by a customer:

[2025-06-29 18:02:18 +0200] critical/config: Error: Include file '../active.conf' does not exist
Location: in /var/lib/icinga2/api/packages/director/df4d559a-1f5f-4a58-8aab-ebb6b745f640/include.conf: 1:0-1:23
[2025-06-29 18:02:18 +0200] critical/cli: Config validation failed. Re-run with 'icinga2 daemon -C' after fixing the config.
[2025-06-29 18:02:18 +0200] critical/Application: Found error in config: reloading aborted

Tests

Apply this diff to generate a broken config, so that the package update succeeds but the actual worker then fails with an error.

diff --git a/lib/remote/configpackageutility.cpp b/lib/remote/configpackageutility.cpp
index 0c3356099..8e5776175 100644
--- a/lib/remote/configpackageutility.cpp
+++ b/lib/remote/configpackageutility.cpp
@@ -131,7 +131,7 @@ void ConfigPackageUtility::WritePackageConfig(const String& packageName)
                << "  globals.ActiveStages = {}\n"
                << "}\n"
                << "\n"
-               << "if (globals.contains(\"ActiveStageOverride\")) {\n"
+               << "if (global.contains(\"ActiveStageOverride\")) {\n"
                << "  var arr = ActiveStageOverride.split(\":\")\n"
                << "  if (arr[0] == \"" << packageName << "\") {\n"
                << "    if (arr.len() < 2) {\n"

Before

$ while :; do curl -ksSu root:icinga -H 'Accept: application/json' -X POST 'https://10.27.0.22:5665/v1/config/stages/example-cmdb' -d '{ "files": { "conf.d/test.conf": "object Host \"cmdb-hosts\" { check_command = \"dummy\" }" }, "pretty": true }'; done
{
    "results": [
        {
            "code": 200,
            "package": "example-cmdb",
            "stage": "06ec0ab0-a36b-4b34-971a-00a9a819177b",
            "status": "Created stage. Reload triggered."
        }
    ]
}
{
    "error": 423,
    "status": "Conflicting request, there is already an ongoing package update in progress. Please try it again later."
}
{
    "error": 423,
    "status": "Conflicting request, there is already an ongoing package update in progress. Please try it again later."
}
{
    "error": 423,
    "status": "Conflicting request, there is already an ongoing package update in progress. Please try it again later."
}
---

After this, Icinga 2 will never accept new package updates until you manually restart the service.

After

$ curl -k -s -S -i -u root:icinga -H 'Accept: application/json' -X POST 'https://localhost:5665/v1/config/packages/example-cmdb?pretty=1'
$ while :; do curl -ksSu root:icinga -H 'Accept: application/json' -X POST 'https://10.27.0.22:5665/v1/config/stages/example-cmdb' -d '{ "files": { "conf.d/test.conf": "object Host \"cmdb-hosts\" { check_command = \"dummy\" }" }, "pretty": true }'; done
---
{
    "results": [
        {
            "code": 200,
            "package": "example-cmdb",
            "stage": "c58b6e5e-4f6c-4029-8981-29d0e2647f36",
            "status": "Created stage. Reload triggered."
        }
    ]
}
{
    "error": 423,
    "status": "Conflicting request, there is already an ongoing package update in progress. Please try it again later."
}
{
    "error": 423,
    "status": "Conflicting request, there is already an ongoing package update in progress. Please try it again later."
}
{
    "error": 423,
    "status": "Conflicting request, there is already an ongoing package update in progress. Please try it again later."
}
{
    "error": 423,
    "status": "Conflicting request, there is already an ongoing package update in progress. Please try it again later."
}
{
    "results": [
        {
            "code": 200,
            "package": "example-cmdb",
            "stage": "05f72333-5d70-4142-9f23-d2e3b443b89c",
            "status": "Created stage. Reload triggered."
        }
    ]
}
{
    "error": 423,
    "status": "Conflicting request, there is already an ongoing package update in progress. Please try it again later."
}
{
    "error": 423,
    "status": "Conflicting request, there is already an ongoing package update in progress. Please try it again later."
}
{
    "error": 423,
    "status": "Conflicting request, there is already an ongoing package update in progress. Please try it again later."
}
{
    "error": 423,
    "status": "Conflicting request, there is already an ongoing package update in progress. Please try it again later."
}
{
    "error": 423,
    "status": "Conflicting request, there is already an ongoing package update in progress. Please try it again later."
}
{
    "results": [
        {
            "code": 200,
            "package": "example-cmdb",
            "stage": "d155a90a-5d59-48a1-ae49-41649a3786e6",
            "status": "Created stage. Reload triggered."
        }
    ]
}

Since the above diff will always produce a broken config after the config update request has successfully validated it, the only error you'll see (unless you started the Icinga 2 systemd service without the --close-stdio option) is the one in the Status entry of the systemctl output:

$ systemctl status icinga2
● icinga2.service - Icinga host/service/network monitoring system
     Loaded: loaded (/usr/lib/systemd/system/icinga2.service; disabled; preset: disabled)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf, 50-keep-warm.conf
     Active: active (running) since Mon 2025-06-16 08:37:46 UTC; 10min ago
 Invocation: d9a34569c78f462bb40c24afb9341d42
    Process: 74429 ExecStartPre=/usr/local/icinga2/lib/icinga2/prepare-dirs /usr/local/icinga2/etc/sysconfig/icinga2 (code=exited, status=0/SUCCESS)
   Main PID: 74438 (icinga2)
     Status: "Config validation failed."

To trigger the newly introduced 503 error case, you can run the following loop:

$ while :; do curl -ksSu root:icinga -H 'Accept: application/json' -X POST 'https://10.27.0.22:5665/v1/config/stages/example-cmdb' -d '{ "files": { "conf.d/test.conf": "object Host \"cmdb-hosts\" { check_command = \"dummy\"; sleep(10); }" }, "pretty": true }'; sleep 11; done
{
    "results": [
        {
            "code": 200,
            "package": "example-cmdb",
            "stage": "896ccbcd-d53f-4210-b2cb-b2210f6c36f0",
            "status": "Created stage. Reload triggered."
        }
    ]
}
{
    "error": 503,
    "status": "Icinga is reloading."
}
{
    "results": [
        {
            "code": 200,
            "package": "example-cmdb",
            "stage": "c73ebadb-3ce7-4e15-8f79-946121b0bfb3",
            "status": "Created stage. Reload triggered."
        }
    ]
}

ref/IP/58256
fixes #10055

@yhabteab yhabteab added bug Something isn't working area/api REST API ref/IP labels Jun 13, 2025
@cla-bot cla-bot bot added the cla/signed label Jun 13, 2025
@yhabteab yhabteab force-pushed the broken-config-stage-updates branch 2 times, most recently from 9d38416 to f168a23 Compare June 16, 2025 11:08
@yhabteab yhabteab marked this pull request as ready for review June 16, 2025 11:08
@yhabteab yhabteab requested a review from jschmidt-icinga June 16, 2025 11:08
@yhabteab yhabteab force-pushed the broken-config-stage-updates branch from f168a23 to 19cd522 Compare June 17, 2025 15:00
@yhabteab yhabteab requested a review from jschmidt-icinga June 17, 2025 15:01
@yhabteab yhabteab force-pushed the broken-config-stage-updates branch from 19cd522 to d124e51 Compare June 17, 2025 15:07
@yhabteab

I've changed this to a mutex-based implementation instead of atomic operations, as requested. Functionally it works just like before, but I don't know whether this will slow down requests noticeably. So, please have a look!


@jschmidt-icinga jschmidt-icinga left a comment


I like these changes, I think it's much easier to parse this way. As suggested below, if you're concerned about the mutex being held, you could release it before generating the JSON error messages.

@yhabteab yhabteab force-pushed the broken-config-stage-updates branch from d124e51 to ad88fe7 Compare June 18, 2025 08:10
@yhabteab yhabteab requested a review from jschmidt-icinga June 18, 2025 08:11
@yhabteab yhabteab added this to the 2.16.0 milestone Jun 18, 2025
@yhabteab yhabteab added the consider backporting Should be considered for inclusion in a bugfix release label Jun 18, 2025
@yhabteab yhabteab force-pushed the broken-config-stage-updates branch from ad88fe7 to af9fdef Compare June 18, 2025 10:01
@yhabteab

I've changed the implementation to use Application::GetLastReloadFailed() as suggested by @julianbrost; it works exactly like the previous version of this PR. So, please have a look at the new implementation. I think this is just a trivial change now.


@jschmidt-icinga jschmidt-icinga left a comment


Looks good to me.

@yhabteab

Sorry, I had to fix some segfault crashes 🙈 !

@yhabteab yhabteab requested a review from jschmidt-icinga June 18, 2025 13:05
@yhabteab

yhabteab commented Jul 1, 2025

I've just pushed a new commit that fixes the issue referenced in the PR description and makes use of AtomicFile for some of the important files. Please have a look at it and let me know if you have any questions or concerns. Thanks!

@yhabteab yhabteab force-pushed the broken-config-stage-updates branch from 23892a0 to 78adfe4 Compare July 21, 2025 12:19
@yhabteab yhabteab requested a review from julianbrost July 21, 2025 14:29
@julianbrost

Additionally, this PR uses AtomicFile to write some of the important files in ConfigPackageUtility, such as the active.conf file.

This additional commit abf08d0 fixes the following strange looking error from a customer.

[2025-06-29 18:02:18 +0200] critical/config: Error: Include file '../active.conf' does not exist
Location: in /var/lib/icinga2/api/packages/director/df4d559a-1f5f-4a58-8aab-ebb6b745f640/include.conf: 1:0-1:23
[2025-06-29 18:02:18 +0200] critical/cli: Config validation failed. Re-run with 'icinga2 daemon -C' after fixing the config.
[2025-06-29 18:02:18 +0200] critical/Application: Found error in config: reloading aborted

I don't yet understand how this is supposed to fix this. ../active.conf relative to that file is /var/lib/icinga2/api/packages/director/active.conf, so that's a file that's outside of the config stage, i.e. deleting the stage shouldn't make this file disappear.

@yhabteab

I don't yet understand how this is supposed to fix this. ../active.conf relative to that file is /var/lib/icinga2/api/packages/director/active.conf, so that's a file that's outside of the config stage, i.e. deleting the stage shouldn't make this file disappear.

No, it doesn't make the active.conf file disappear, but a config package (note: a package, not a stage) has an include.conf file that looks like this:

$ cat /var/lib/icinga2/api/packages/example-cmd/include.conf
include "*/include.conf"

...which originates from this code:

String includePath = GetPackageDir() + "/" + packageName + "/include.conf";
std::ofstream fpInclude(includePath.CStr(), std::ofstream::out | std::ostream::binary | std::ostream::trunc);
fpInclude << "include \"*/include.conf\"\n";
fpInclude.close();

Now, what does this include file do? It includes the include.conf file of each and every stage directory relative to that package, regardless of whether it's currently active or not. And the include.conf file of a config stage looks roughly like this (roughly, because the stage name varies depending on the currently active stage):

$ cat /var/lib/icinga2/api/packages/example-cmd/dc8d5503-71a4-4498-9c8b-637e99bd06c1/include.conf
include "../active.conf"
if (ActiveStages["example-cmdb"] == "dc8d5503-71a4-4498-9c8b-637e99bd06c1") {
  include_recursive "conf.d"
  include_zones "example-cmdb", "zones.d"
}

This originates from this code:

String path = GetPackageDir() + "/" + packageName + "/" + stageName + "/include.conf";
std::ofstream fp(path.CStr(), std::ofstream::out | std::ostream::binary | std::ostream::trunc);
fp << "include \"../active.conf\"\n"
	<< "if (ActiveStages[\"" << packageName << "\"] == \"" << stageName << "\") {\n"
	<< "  include_recursive \"conf.d\"\n"
	<< "  include_zones \"" << packageName << "\", \"zones.d\"\n"
	<< "}\n";
fp.close();

So, the package's include.conf file includes the include.conf files of all stages, and each stage's include.conf in turn includes its package's active.conf file. When the Icinga 2 daemon reloads, it actually evaluates the include files of all config stages, even inactive ones. So if a config stage gets deleted right after the new daemon has read the config files (i.e., the parser has already constructed an IncludeExpression for the ../active.conf file to be resolved relative to the now-deleted stage), the include can no longer be resolved, since that include file along with its parent directory has been removed. The commit fixes this by preventing stage deletions while the Icinga 2 daemon is reloading, just like the other handlers do.

@julianbrost

So, when a config stage gets deleted right after the new Icinga 2 daemon has read the config files, i.e., the parser constructed an IncludeExpression for the ../active.conf file to search relative to the deleted stage, it won't be able to find it anymore, since that include file along with its parent directory has been removed.

So it's there but it doesn't find it? Why? (I've overheard you having a discussion about normalizing relative file paths earlier, I believe that's probably relevant here, but I didn't follow and the details are missing here.)

@yhabteab

So it's there but it doesn't find it? Why?

It's obviously not there, as I said in my previous comment. The include expression contains the deleted config stage as its relative base and ../active.conf as the include path, so when the config stage gets deleted afterwards, the include evaluation can do nothing but fail.

upath = relativeBase + "/" + path;

String includePath = upath;

if (search) {
	for (const String& dir : m_IncludeSearchDirs) {
		String spath = dir + "/" + path;

		if (Utility::PathExists(spath)) {
			includePath = spath;
			break;
		}
	}
}

std::vector<std::unique_ptr<Expression> > expressions;
auto funcCallback = [&expressions, zone, package](const String& file) { CollectIncludes(expressions, file, zone, package); };

if (!Utility::Glob(includePath, funcCallback, GlobFile) && includePath.FindFirstOf("*?") == String::NPos) {
	std::ostringstream msgbuf;
	msgbuf << "Include file '" + path + "' does not exist";

I've overheard you having a discussion about normalizing relative file paths earlier, I believe that's probably relevant here, but I didn't follow and the details are missing here.

Nothing that would resolve this, so it shouldn't be relevant for this PR and I don't see how normalizing relative paths earlier would resolve this specific case.

@julianbrost

Is there any point where /var/lib/icinga2/api/packages/director/active.conf doesn't exist?

Or is the following race condition happening here?

  1. Reload is started, /var/lib/icinga2/api/packages/director/df4d559a-1f5f-4a58-8aab-ebb6b745f640/include.conf is read.
  2. Config stage df4d559a-1f5f-4a58-8aab-ebb6b745f640 is deleted using the API, thereby the whole directory /var/lib/icinga2/api/packages/director/df4d559a-1f5f-4a58-8aab-ebb6b745f640/ is deleted.
  3. Now the include.conf from the first step is actually evaluated, resulting in the include "../active.conf" being mapped to the path /var/lib/icinga2/api/packages/director/df4d559a-1f5f-4a58-8aab-ebb6b745f640/../active.conf. Accessing this fails because an intermediate directory no longer exists.

Nothing that would resolve this, so it shouldn't be relevant for this PR and I don't see how normalizing relative paths earlier would resolve this specific case.

If /var/lib/icinga2/api/packages/director/df4d559a-1f5f-4a58-8aab-ebb6b745f640/../active.conf was normalized to /var/lib/icinga2/api/packages/director/active.conf, that symptom wouldn't appear, compare ls /doesnotexist/.. and ls $(realpath -m /doesnotexist/..). (I'm not saying that we should change this, it's just important for understanding what's happening here.)

yhabteab added 2 commits July 24, 2025 10:54
Previously, we used a simple boolean to track the state of package updates and didn't
reset it when the config validation succeeded, because it was assumed that if we
successfully validated the config beforehand, the worker would also successfully
reload it afterwards and the old worker would be terminated. However, this assumption
is not always true; the most obvious counterexample is that after we successfully
validated the config, it might have changed again before the worker was able to
reload it. If that happens, the new worker might fail to validate the config due to
the recent changes, in which case the old worker remains active and the flag stays
set to true, causing any subsequent requests to fail with a `423` until you manually
restart the Icinga 2 service.

So, in order to prevent such a situation, we additionally track the last time a
reload failed and allow bypassing the `m_RunningPackageUpdates` flag only if the
last reload failure time has changed since the previous request.
@julianbrost

Please change 73f3fdb so that the Co-Authored-By isn't in the middle of the commit message but at the end, as it's supposed to be a trailer [1]. Also, it's more common to keep yourself as the commit author and add the other person as the co-author (typically, you should not have to manually set the author of a commit to a different person).

Footnotes

  1. see https://git-scm.com/docs/git-interpret-trailers and https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors

@jschmidt-icinga

Also, it's more common that you still set the commit author to yourself and add the other person as the co-author (typically, you should not have to manually set the author of a commit to a different person).

I agree with that [1], lest I get innocently git-blamed for it at some future date. 😆 (not that there's anything wrong with the commit, just joking)

Footnotes

  1. Unless perhaps when re-mastering a commit made by someone else in a complex merge/rebase kind of scenario.

Once the new worker process has read the config, it also evaluates an
`include "*/include.conf"` statement within the config package's root
directory, and from there on we must not allow deleting any stage
directory from the config package. Otherwise, when the worker actually
evaluates that include statement, it will fail to find the directory
where the include file is located, or the `active.conf` file, which is
included from each stage's `include.conf` file, thus causing the worker
to fail.

Co-Authored-By: Johannes Schmidt <[email protected]>
@yhabteab yhabteab force-pushed the broken-config-stage-updates branch from 73f3fdb to ce3275d Compare July 24, 2025 14:03
@yhabteab yhabteab requested a review from julianbrost July 24, 2025 14:03

@julianbrost julianbrost left a comment


This can also be tested without having to patch anything in C++. Simply add the following to a config file somewhere in /etc/icinga2:

if (!globals.contains("ActiveStageOverride")) { throw "do not try this at home" }

This will fail the config load unless it's called with --define ActiveStageOverride=... as it's done during the first config validation. So it'll pass that but fail the actual reload.

Now with that in place, trigger a Director config deployment, the first one will go through but fail the subsequent reload, preventing a further config deployment with that conflicting request error.

With this PR on the other hand, subsequent deployments can be started. And after removing the bad line from the config, the reloads will also succeed again.

Locking against concurrent config deployments also still works (can easily be tested by adding a sleep(30) to the config instead).

@julianbrost julianbrost merged commit 1a94c2c into master Jul 25, 2025
29 checks passed
@julianbrost julianbrost deleted the broken-config-stage-updates branch July 25, 2025 07:42
@yhabteab yhabteab added backported Fix was included in a bugfix release and removed consider backporting Should be considered for inclusion in a bugfix release labels Oct 8, 2025
@yhabteab yhabteab changed the title Handle concurrent config package updates gracefully Handle config package updates gracefully Oct 8, 2025

Development

Successfully merging this pull request may close these issues.

ConfigPackageUtility misses atomic file operations