pkg/lwip: fix race in async sock API with event#21093
Draft
maribu wants to merge 1 commit intoRIOT-OS:masterfrom
Draft
pkg/lwip: fix race in async sock API with event#21093maribu wants to merge 1 commit intoRIOT-OS:masterfrom
maribu wants to merge 1 commit intoRIOT-OS:masterfrom
Conversation
In TCP server mode, the sock_tcp_t sockets are managed by the network stack and can be reused if a previous connection is no longer in used. However, an event may still be posted in the event queue when the socket is reused. Wiping it will result in the `next` pointer in that event to be NULL, which will cause the event handler fetching that event to crash. This adds an `event_cancel()` at two places: 1. Just before reusing the socket 2. During sock_tcp_disconnect() The former catches issues in server mode e.g. when a connect has been closed (e.g. due to timeout) and is reused before a pending event (e.g. a timeout event) has been processed. The letter may be an issue on client side. E.g. when `sock_tcp_t` was allocated on stack and goes out of scope after `sock_tcp_disconnect` but before the event handler was run.
Member
Author
|
I'm not super confident in this PR in regard to regressions. My WIP CoAP over TCP server still behaves oddly. I was unable to crash it anymore with this applied. |
Member
Author
|
Aaand if seen another segfault, again with the event queue pointing to the event in a |
Contributor
|
I don't really have any insight in how this part of the code works. |
Member
Since you are working on network interface code and the segfault happens a few layers up, maybe the packet times out and is deleted while it is still in the queue? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Contribution description
In TCP server mode, the
sock_tcp_tsockets are managed by the network stack and can be reused if a previous connection is no longer in used. However, an event may still be posted in the event queue when the socket is reused. Wiping it will result in thenextpointer in that event to be NULL, which will cause the event handler fetching that event to crash.This adds an
event_cancel()at two places:The former catches issues in server mode e.g. when a connect has been closed (e.g. due to timeout) and is reused before a pending event (e.g. a timeout event) has been processed.
The letter may be an issue on client side. E.g. when
sock_tcp_twas allocated on stack and goes out of scope aftersock_tcp_disconnectbut before the event handler was run.Testing procedure
I was able to see this bug in the wild in my work-in-progress PR that adds CoAP over TCP to nanocoap. But that code is not in a well shape now and has other issues.
A test that puts an async TCP server with lots of connections through the paces would likely be a good demonstrator.
Issues/PRs references
None