Deterministically shut down all connections on ApiListener::Stop()#10460

Merged
yhabteab merged 4 commits into master from apilistener-shutdown-conns-on-stop
Jun 16, 2025
Conversation

@jschmidt-icinga
Contributor

@jschmidt-icinga jschmidt-icinga commented Jun 3, 2025

Deterministically shut down all connections on ApiListener::Stop()

Currently, while the shutdown order is determined by Icinga 2 object priorities, the connection objects on the backend can continue to affect the state of the program even after seemingly having been shut down (i.e. after the Stop() method of ApiListener has been called). In practice, the connection objects will mostly stay alive until the very end, when icinga2 exits, and will answer all kinds of requests until then.

To mitigate some of the effects of ApiListener's associated objects after Stop() has returned, this PR performs these additional steps (in order):

  • Exit from ListenerCoroutineProc
  • Disconnect remaining endpoint and anonymous JsonRpcConnections
  • Abort HTTP API requests that are modifying the program state and reply with 503: Service unavailable

This PR relates to #10179 and implements one of the TODOs in @Al2Klimov's comment.

Exiting from ListenerCoroutineProc

ListenerCoroutineProc is stopped by closing the acceptor socket, then waiting for the completion of outstanding NewClientHandlers using a separate wait group.
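As a rough stdlib-only model of this mechanism (illustrative only, not Icinga 2's actual WaitGroup implementation): each NewClientHandler enters the group before doing work, and the stop path closes the acceptor and then joins the group, which refuses newcomers and waits for the remaining handlers to drain.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Simplified stand-in for the wait group described above: handlers enter
// before doing work; Join() refuses newcomers and waits for the rest.
class WaitGroup
{
public:
	// Returns false once Join() has started, so no new handler begins.
	bool TryEnter()
	{
		std::unique_lock lock(m_Mutex);
		if (m_Joining)
			return false;
		++m_Active;
		return true;
	}

	void Leave()
	{
		std::unique_lock lock(m_Mutex);
		if (--m_Active == 0)
			m_CV.notify_all();
	}

	// Block new entries and wait until all active handlers have left.
	void Join()
	{
		std::unique_lock lock(m_Mutex);
		m_Joining = true;
		m_CV.wait(lock, [this] { return m_Active == 0; });
	}

private:
	std::mutex m_Mutex;
	std::condition_variable m_CV;
	int m_Active = 0;
	bool m_Joining = false;
};
```
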

Shutting down the remaining endpoint and anonymous JsonRpcConnections

Since the ListenerCoroutineProc will not take any new connections now, and m_ReconnectTimer has already been stopped, we can be sure that no endpoints are currently connecting and we can simply iterate over all endpoint objects and Disconnect() those that are still connected.

This means that some check results are still discarded when the wait group is stopped. A solution to this will require a message to be passed between the nodes and could be addressed in a future PR.
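The disconnect pass can be modelled with hypothetical types (not Icinga 2's real Endpoint/JsonRpcConnection classes): since no new connections are being accepted and the reconnect timer is stopped, every remaining client is reachable by iterating the endpoint list.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Illustrative stand-ins for the real connection objects.
struct JsonRpcClient
{
	bool Connected = true;
	void Disconnect() { Connected = false; }
};

struct Endpoint
{
	std::string Name;
	std::vector<std::shared_ptr<JsonRpcClient>> Clients;
};

// Disconnect() every client that is still connected; returns the count.
int DisconnectEndpointClients(std::vector<Endpoint>& endpoints)
{
	int disconnected = 0;
	for (auto& endpoint : endpoints) {
		for (auto& client : endpoint.Clients) {
			if (client->Connected) {
				client->Disconnect();
				++disconnected;
			}
		}
	}
	return disconnected;
}
```
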

Aborting HTTP API requests

In API requests that modify (or may modify) the program state, the wait group will be locked. For potentially longer-running requests, the stopped condition is used to abort loops that iterate over config objects early and return a 503 error code:

{
    "error": 503,
    "status": "Shutting down."
}
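A minimal sketch of this abort path, with hypothetical names (the real handlers use Boost.Beast request/response types): a state-modifying handler checks the shared shutdown flag on every iteration over the config objects and bails out with the 503 body shown above.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <vector>

// Illustrative response type, not the real HttpHandler interface.
struct HttpResponse
{
	int Status;
	std::string Body;
};

HttpResponse ModifyObjects(const std::vector<std::string>& objects,
	const std::atomic<bool>& shuttingDown)
{
	for (const auto& object : objects) {
		if (shuttingDown.load()) {
			// Abort early so ApiListener::Stop() isn't delayed.
			return {503, R"({"error":503,"status":"Shutting down."})"};
		}
		(void)object; // ... apply the modification here ...
	}
	return {200, R"({"results":[]})"};
}
```
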

fixes #10179

@cla-bot cla-bot bot added the cla/signed label Jun 3, 2025
@jschmidt-icinga jschmidt-icinga added this to the 2.15.0 milestone Jun 3, 2025
@jschmidt-icinga jschmidt-icinga force-pushed the apilistener-shutdown-conns-on-stop branch 4 times, most recently from 12aa63f to a5b8289 Compare June 3, 2025 13:22
@jschmidt-icinga jschmidt-icinga force-pushed the apilistener-shutdown-conns-on-stop branch from a5b8289 to 8a201f9 Compare June 6, 2025 08:50
@jschmidt-icinga
Contributor Author

No need to review this now. I'm still looking at a few things. I will update the PR description and add a comment here when I'm done.

@jschmidt-icinga jschmidt-icinga force-pushed the apilistener-shutdown-conns-on-stop branch from 8a201f9 to f38c98a Compare June 6, 2025 13:46
@jschmidt-icinga
Contributor Author

jschmidt-icinga commented Jun 6, 2025

Changes:

  • ListenerCoroutineProc is now exited by closing the socket. The wait group is still needed to track ongoing NewClientHandler() instances to be sure we hit all live connections in the following disconnects.
  • JsonRpcConnection::Disconnect() is still called immediately, but the lock on the wait group is now held across the read and write coroutines, outside the loop.
  • I've reduced the messages about disconnecting API clients to LogInformation. If we want LogWarning for cases where this happens outside a normal shutdown, I can make them ask ApiListener first whether this is a shutdown.
  • HttpHandlers now get passed an atomic boolean to signal the abort condition instead of the wait group. I've also reduced the situations where this abort is done, as per the reviewer's suggestions. This is triggered in a separate AbortProcessingResponse() method.
  • HttpServerConnection is now part of the same wait group as everything else, except the one around NewClientHandler.
  • I've moved the grace period timer inside DisconnectHttpConnections() and renamed them.
  • This PR no longer adds a Cancel() method and the lock_shared() interface to WaitGroup, since they are no longer needed (at the moment).

Caveats:

  • There is still a small potential for a situation where a connection is accepted in ListenerCoroutineProc and immediately closed, but I don't think it will happen often, and it should be harmless.
  • There is still an m_ListenerWaitGroup, and I'm afraid it's needed to wait on potentially multiple outstanding NewClientHandlers to complete. I could move this wait group to be local to AddListener() and have it captured by the coroutines, but that would be more effort than I think is justified.
  • JsonRpcConnections may need some more attention if we want them to do what I tried to do initially, i.e. ensure that all messages other nodes have sent us get fully processed. After thinking about this, it will at the very least involve some additional communication.

@jschmidt-icinga jschmidt-icinga force-pushed the apilistener-shutdown-conns-on-stop branch 2 times, most recently from 350c91a to 7ca5994 Compare June 10, 2025 07:52
@jschmidt-icinga
Contributor Author

jschmidt-icinga commented Jun 10, 2025

Incorporated some offline suggestions and fixed a crash due to a pointer initialization I managed to "clean up" during one of the recent rebases.

Comment on lines +499 to +503
try {
	acceptor->close();
} catch (const std::exception& ex) {
	Log(LogCritical, "ApiListener")
		<< "Failed to close acceptor socket: " << ex.what();
}
Member

Why can't this block...

@jschmidt-icinga jschmidt-icinga force-pushed the apilistener-shutdown-conns-on-stop branch from 7ca5994 to e1e0467 Compare June 11, 2025 09:04
@jschmidt-icinga
Contributor Author

Again I've simplified and reduced the scope of the PR, following the suggestion of @Al2Klimov. I also think I've addressed the remaining comments that still apply.

Comment on lines +71 to +74
std::shared_lock wgLock(*m_WaitGroup, std::try_to_lock);
if (!wgLock) {
	return;
}
Member

With this lock held not only for the message processing itself but also during socket operations, this raises a question: are these operations unblocked within a reasonable time (very few seconds) during shutdown?

Contributor Author

From my testing, yes. In the previous iteration of the PR I was scared this would be a situation similar to the async_write() in HttpServerConnection blocking the shutdown. But unlike there, here the coroutine mostly sits in the async_read() without actually making progress (i.e. giving other operations a chance to run) or actively reading the rather small RPC messages (like check results).

Then again, the same corner cases could arise: large messages over very, very slow connections. But even in that case, I think it would be better to let the message finish than to interrupt reading it in the middle, even if it takes a second or two.

To get around this, acquiring the lock somewhere else isn't even enough; I would have to pass something into NetString::ReadStringFromStream to interrupt reading the message, which I really don't want to do.

Contributor Author

After thinking about this some more, unlike in the previous iteration of the PR (which had the WaitGroup::Cancel() in a timeout), I could now get away with just locking the WaitGroup per iteration, after the read. But again, it depends on what we want: do we want to shut down fast, or do we want to make sure things we started get finished? We can't have our cake and eat it too.
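Sketched with illustrative names and a plain std::shared_mutex standing in for the wait group, per-iteration locking looks like this: the blocking read runs unlocked, and the shared lock is only taken around the processing of a message that has already been received in full, so a shutdown (which takes the exclusive lock) stops the loop at the next message boundary.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <shared_mutex>
#include <string>
#include <vector>

// Hypothetical model of the read loop discussed above: the wait-group lock
// only guards message processing, not the read itself.
void ReadLoop(std::shared_mutex& waitGroup,
	const std::function<std::optional<std::string>()>& readMessage,
	const std::function<void(const std::string&)>& messageHandler)
{
	for (;;) {
		auto msg = readMessage(); // blocking read, not covered by the lock
		if (!msg)
			break; // connection closed

		std::shared_lock lock(waitGroup, std::try_to_lock);
		if (!lock)
			break; // shutdown in progress: stop before processing more

		messageHandler(*msg);
	}
}
```
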

Member

or do we want to make sure things we started get finished

That just depends on your interpretation of what counts as started. I'd be totally fine with only counting the message processing itself, not the read operation.

The other extreme (I haven't checked in detail yet whether something like this could actually happen) would be that the remote crashes so that the connection just hangs in the middle of reading a single JSON-RPC message. In that situation, the shutdown shouldn't block until the CheckLiveness() coroutine kills the connection.

The nicest option would be to signal the intent to shut down to the connection, process all outstanding messages and then cleanly shut down the connection, with an additional quick timeout that kills the connection if this doesn't happen within a few seconds and then only waits for the processing of the current message (so that if processing of a message started, it also finishes). However, I don't see how an implementation without introducing new cluster messages to notify the peer about the intent to shut down would work any better than more or less randomly selecting a last message after which the connection is closed, which should be pretty much what you get when locking around each MessageHandler() individually.

Contributor Author

I don't see how an implementation without introducing new cluster messages [...]

I get that and also commented on it here (last point).

I'd be totally fine with only counting the message processing itself, not the read operation.

Alright, so I'd move the lock to the top of MessageHandler(). Is that fine? Should the lock in WriteOutgoingMessages() stay then, or should I rather remove it entirely? There's a timeout in Disconnect(), but I guess that can fail to cancel the operation too under certain conditions.

@jschmidt-icinga jschmidt-icinga force-pushed the apilistener-shutdown-conns-on-stop branch from e1e0467 to 44c902d Compare June 13, 2025 07:44
@jschmidt-icinga
Contributor Author

I've pushed an update that should address the remaining comments, with the exception of the wait group lock in JsonRpcConnection::WriteOutgoingMessages, which I'm still not sure about.

Another change is that I'm now using a boost::signal + post instead of the post + coroutine I used to stop the listener before. I've also moved the lock inside the loop because I had, after all, encountered rare situations where, as @julianbrost warned, the operation_aborted error was not thrown and the socket was closed after async_accept succeeded, trapping ListenerCoroutineProc in an endless loop of bad_file_descriptor exceptions. Asio coroutines are a lot of fun.

@julianbrost
Member

the operation_aborted error was not thrown and the socket was closed after async_accept succeeded, trapping ListenerCoroutineProc in an endless loop of bad_file_descriptor exceptions

Did this just happen randomly? Were you able to make this happen more reliably by creating some specific scenario?

I've also moved the lock inside the loop

Does that prevent the bad_file_descriptor situation itself or just the endless loop as that locking operation acts pretty much like a loop condition (like for example while (acceptor->is_open()) or while (m_ListenerWaitGroup->IsLockable()))?

@julianbrost
Member

Sorry, I forgot to address the following in a previous comment/review:

with the exception of the wait group lock in JsonRpcConnection::WriteOutgoingMessages, which I'm still not sure about.

For me, it's pretty similar to the one in the read loop, and I'd say don't add it unless:

  1. You can reason that the socket operations that are done while it's held are unblocked quickly on shutdown; or
  2. You say it's needed for correctness, but in that case we need to ensure that the socket operations are unblocked at shutdown.

@jschmidt-icinga
Contributor Author

Did this just happen randomly? Were you able to make this happen more reliably by creating some specific scenario?

I think it's the same situation I've observed in #5581, with the coroutine yielding again when the async operation is actually already done. Here, this would mean closing the socket comes too late to make async_accept() throw, but obviously still closes the socket. I thought about alternatively checking whether the wait group is no longer lockable when catching the bad_file_descriptor exception, but I liked this option better, and it should avoid the exception entirely now that I've moved the lock.

that locking operation acts pretty much like a loop condition

Well, it acts as both a lock and a loop condition. I can also keep it outside the loop and just do while (IsLockable()) if you prefer it that way. We wouldn't even need to check the shared_lock if we use it as a loop condition.

@jschmidt-icinga
Contributor Author

  2. You say it's needed for correctness, but in that case we need to ensure that the socket operations are unblocked at shutdown.

We can ensure they are unblocked, but not guarantee it's within a reasonable time: even when the 5s timeout in Disconnect() fires, the async_write() will not yield on very slow but steady connections, same as in HttpServerConnection. I'll just remove it.

@julianbrost
Member

julianbrost commented Jun 13, 2025

the same situation I've observed in #5581

Is that really the correct PR you wanted to reference?

I was asking primarily because I didn't understand how changing the scope of the locking should solve bad file descriptor errors, and if these appear randomly, that's always a hint that something could be wrong with the code.1

In the current state of the PR, I think there could still be a small race condition in that respect:

void ApiListener::StopListener()
{
	m_OnListenerShutdown();
	m_ListenerWaitGroup->Join();
}

The accepting socket is already closed by line 521, but the wait group you check before starting a new async_accept() is only signaled to stop in line 523, so couldn't that still start one (or even a few) more accepts until the signal from the wait group propagates to the listener?

std::shared_lock wgLock(*m_ListenerWaitGroup, std::try_to_lock);
if (!wgLock) {
	return;
}

asio::ip::tcp::socket socket(io);
server->async_accept(socket.lowest_layer(), yc);

So I think while (server->is_open()) might even be the more correct option, as that's already synchronized with the corresponding close() call by the use of the common strand for both.
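The loop-condition idea can be illustrated with a sequential mock (real code runs on an Asio strand, which is what makes checking is_open() race-free; the mock models that serialization by being single-threaded): the accept loop keeps running only while the acceptor is open, so a close() observed between accepts ends the loop cleanly instead of producing bad_file_descriptor errors.

```cpp
#include <cassert>

// Hypothetical mock of the acceptor: a fixed number of connections arrive,
// then StopListener() closes it (simulated inside Accept()).
struct MockAcceptor
{
	bool Open = true;
	int Pending; // connections queued before the acceptor is closed

	explicit MockAcceptor(int pending) : Pending(pending) {}

	bool IsOpen() const { return Open; }
	void Close() { Open = false; }

	// Accepts one connection; once the queue is drained, simulate the
	// shutdown path closing the acceptor from the same strand.
	bool Accept()
	{
		if (!Open)
			return false;
		if (Pending == 0) {
			Close();
			return false;
		}
		--Pending;
		return true;
	}
};

int RunAcceptLoop(MockAcceptor& acceptor)
{
	int accepted = 0;
	while (acceptor.IsOpen()) { // the loop condition proposed above
		if (acceptor.Accept())
			++accepted; // hand off to NewClientHandler() here
	}
	return accepted;
}
```
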

Footnotes

  1. If a syscall actually returns EBADFD that could be a hint that a fd number is used after it was closed. At least in a multi-threaded application, instead of receiving that error, there's a chance that the fd number could have been reused already and instead of getting an error, the operation could also have been performed on a random other file instead. However, with Asio in between, that also maps operations on closed socket objects (which wrap a fd) to that error without ever issuing a syscall so that use-after-close isn't an issue (unless you have code that reuses the Asio socket object).

@jschmidt-icinga jschmidt-icinga force-pushed the apilistener-shutdown-conns-on-stop branch from 44c902d to fb81054 Compare June 13, 2025 10:39
@jschmidt-icinga
Contributor Author

Is that really the correct PR you wanted to reference?

Sorry, no. I screwed up the autocomplete. I meant #10139.

and if these appear randomly, that's always a hint that something could be wrong with the code.

I mean, it didn't appear randomly; it appeared sporadically when, due to timing, closing the socket didn't cancel an ongoing async_accept().

Anyway, I've just pushed yet another update.

Note: I just noticed that ModifyObjectsHandler currently always sends http::status::ok as a result, despite other error conditions existing. I'm not going to change that here, but maybe doing the same as DeleteObjectsHandler would be a good thing.

@julianbrost
Member

I mean, it didn't appear randomly, it appeared sporadically

Randomly in the sense that I don't understand the pattern and it looks random to me. ;)

Just remove "randomly" from my sentence; seeing a bad file descriptor error anywhere is rarely a good sign.

Member

@julianbrost julianbrost left a comment

We should be getting close to the finish line now as I'm getting more into some details and a bit of code style. Overall functionality should be good from looking at the code, I'll give it a test later today.

@Al2Klimov @yhabteab I only expect some small changes here, so in case you want to take another look, now (or after the next push) would be a good time.

The wait group gets passed to HttpServerConnection, then down to the
HttpHandlers. For those handlers that modify the program state, the
wait group is locked so ApiListener will wait on Stop() for the
request to complete. If the request iterates over config objects,
a further check on the state of the wait group is added to abort early
and not delay program shutdown. In that case, 503 responses will be
sent to the client.

Additionally, in HttpServerConnection, no requests beyond the one
already started will be allowed once the wait group is joining.
@jschmidt-icinga jschmidt-icinga force-pushed the apilistener-shutdown-conns-on-stop branch from fb81054 to 82bb636 Compare June 13, 2025 12:51
@jschmidt-icinga
Contributor Author

jschmidt-icinga commented Jun 13, 2025

Fixed the trailing whitespace, added the type and name fields, and moved the new Dictionary directly into emplace_back.

Member

@julianbrost julianbrost left a comment

Thank you very much!

@Al2Klimov @yhabteab I only expect some small changes here, so in case you want to take another look, now (or after the next push) would be a good time.

Not merging yet until I get some feedback there.

@Al2Klimov Al2Klimov self-requested a review June 13, 2025 14:53
@yhabteab
Member

yhabteab herewith dismisses all his requested changes :)! Just joking - I didn't request anything explicitly here on purpose, but only submitted a few comments, so my review shouldn't block this in any way. Since Alex already requested a review from himself, it should be fine when both of you approve it!

@Al2Klimov Al2Klimov removed their request for review June 16, 2025 14:50
@yhabteab
Member

My overloaded Docker setup found nothing suspicious, so this should be fine. Thanks to all of you 🎉!

@yhabteab yhabteab merged commit dcef22e into master Jun 16, 2025
25 checks passed
@yhabteab yhabteab deleted the apilistener-shutdown-conns-on-stop branch June 16, 2025 15:47


Development

Successfully merging this pull request may close these issues.

Icinga2 looses state history and notifications during restart

4 participants