Skip to content

sock_async_event: fix race-condition#13438

Merged
gschorcht merged 1 commit intoRIOT-OS:masterfrom
miri64:sock_async_event/fix/race-condition
Feb 21, 2020
Merged

sock_async_event: fix race-condition#13438
gschorcht merged 1 commit intoRIOT-OS:masterfrom
miri64:sock_async_event/fix/race-condition

Conversation

@miri64
Copy link
Copy Markdown
Member

@miri64 miri64 commented Feb 21, 2020

Contribution description

If a new event is fired during the execution of the event callback, event->type might change. However as event->type is set to 0 after the execution of the callback, that leads to it being 0 on the next round of the event handler's execution, leading in the event getting lost.

Testing procedure

  • make -C tests/gnrc_sock_async_event flash test should still work with a board of your choice.
  • To confirm the race condition
    • try tests/lwip with lwip: provide sock_async support #13427 merged and start a TCP server with tcp server start 1337
    • connect to that server from your host machine using netcat (e.g. nc fe80::0200:dead:beef:affe%tapbr0 1337 when using native)
    • With this PR the closing of the connection when hitting Ctrl+C in netcat is registered and printed to stdout. You can reconnect to the TCP server by starting a new netcat process.
    • Without this PR the closing of the connection when hitting Ctrl+C in netcat is not registered. Reconnecting to the TCP server is not possible (accept will error with an -ENOMEM).

Issues/PRs references

Discovered in #13427.

If a new event is fired during the execution of the event callback,
`event->type` might change. However as `event->type` is set to 0 after
the execution of the callback, that leads to it being 0 on the next
round of the event handler's execution, leading in the event getting
lost.
@miri64 miri64 added Area: network Area: Networking Process: needs backport Integration Process: The PR is required to be backported to a release or feature branch Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) labels Feb 21, 2020
@miri64 miri64 added this to the Release 2020.04 milestone Feb 21, 2020
@miri64 miri64 requested a review from fjmolinas February 21, 2020 15:01
@miri64 miri64 added Area: pkg Area: External package ports and removed Process: needs backport Integration Process: The PR is required to be backported to a release or feature branch labels Feb 21, 2020
@miri64
Copy link
Copy Markdown
Member Author

miri64 commented Feb 21, 2020

Backport for 2020.01 probably not needed as no support for TCP for sock_async exists there

@miri64 miri64 requested a review from gschorcht February 21, 2020 15:02
Copy link
Copy Markdown
Contributor

@gschorcht gschorcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

@miri64 miri64 added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Feb 21, 2020
Copy link
Copy Markdown
Contributor

@gschorcht gschorcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot reproduce the race conditions. Therefore I have set it to 'request changes' to avoid accidental merging.

@miri64
Copy link
Copy Markdown
Member Author

miri64 commented Feb 21, 2020

I cannot reproduce the race conditions. Therefore I have set it to 'request changes' to avoid accidental merging.

I don't understand your reasoning for blocking this then... Even if you can't reproduce, isn't it obvious that a race condition can occur and that this PR is a fix for that? Why should a merge be bad, even if accidentally?

@miri64
Copy link
Copy Markdown
Member Author

miri64 commented Feb 21, 2020

(can't reproduce it with #13427 either btw, most likely because my latest fixes hardened the backend itself against the race condition. Nevertheless, it still can occur and should be fixed.)

@gschorcht
Copy link
Copy Markdown
Contributor

I cannot reproduce the race conditions. Therefore I have set it to 'request changes' to avoid accidental merging.

Maybe, I missed something but I can't reproduce the race condition with PR #13427, neither with esp_wifi nor with native. It works as expected.

Without this PR I can observe following behavior when I leave nc with Ctrl-C.

TCP client [fe80::d4e9:caff:fe69:da71]:36076 connected
TCP connection to [fe80::d4e9:caff:fe69:da71]:36076 reset

I can reconnect afterwards.

With this PR on top of PR #13427, I observer the following:

TCP client [fe80::d4e9:caff:fe69:da71]:36294 connected
TCP connection to [fe80::d4e9:caff:fe69:da71]:36294 reset
TCP connection to [100:0:100:0:200:0:44ef:5f56]:22112 reset

But, at least I can reconnect afterwards.

@gschorcht
Copy link
Copy Markdown
Contributor

(can't reproduce it with #13427 either btw, most likely because my latest fixes hardened the backend itself against the race condition. Nevertheless, it still can occur and should be fixed.)

As said in #13438 (comment), the behavior seems to be more strange with this PR than before.

@miri64
Copy link
Copy Markdown
Member Author

miri64 commented Feb 21, 2020

With #13427 resetted to aaa644b I still can reproduce the race condition.

@miri64
Copy link
Copy Markdown
Member Author

miri64 commented Feb 21, 2020

With this PR on top of PR #13427, I observer the following:

TCP client [fe80::d4e9:caff:fe69:da71]:36294 connected
TCP connection to [fe80::d4e9:caff:fe69:da71]:36294 reset
TCP connection to [100:0:100:0:200:0:44ef:5f56]:22112 reset

But, at least I can reconnect afterwards.

This should have been fixed with 24ce2ba. Are you sure you used the most recent version?

@miri64
Copy link
Copy Markdown
Member Author

miri64 commented Feb 21, 2020

So... to prevent the garbage output and still have the race condition you need to revert 54069b1...

@gschorcht
Copy link
Copy Markdown
Contributor

gschorcht commented Feb 21, 2020

With this PR on top of PR #13427, I observer the following:

TCP client [fe80::d4e9:caff:fe69:da71]:36294 connected
TCP connection to [fe80::d4e9:caff:fe69:da71]:36294 reset
TCP connection to [100:0:100:0:200:0:44ef:5f56]:22112 reset

But, at least I can reconnect afterwards.

This should have been fixed with 24ce2ba. Are you sure you used the most recent version?

OK, that's my fault, the last commit in PR #13427 was 54069b1. I will check again.

@miri64
Copy link
Copy Markdown
Member Author

miri64 commented Feb 21, 2020

Here is what I got:

$ git log --oneline
dcc7f8832 Revert "fixup! lwip: provide sock_async support"
24ce2ba16 (origin/lwip/enh/sock_async) fixup! tests/lwip: adapt test to use sock_async
54069b14b fixup! lwip: provide sock_async support
aaa644b38 fixup! tests/lwip: adapt test to use sock_async
386c3f1b1 lwip: fix style errors found by Vera++
53e6a8190 tests/lwip: adapt test to use sock_async
b56479427 lwip: provide sock_async support
188cc20ac sock_async_event: fix typo in tcp_queue function
5f938bd49 lwip: introduce sock base-type

[…]

RIOT native @ tap0
/home/mlenders/Repositories/RIOT-OS/RIOT/tests/lwip/bin/native/tests_lwip.elf tap0 
RIOT native interrupts/signals initialized.
LED_RED_OFF
LED_GREEN_ON
RIOT native board initialized.
RIOT native hardware initialization complete.

main(): This is RIOT! (Version: 2020.04-devel-593-gdcc7f-lwip/enh/sock_async)
RIOT lwip test application
> ifconfig
ifconfig
ET_00:  inet6 fe80::682e:4fff:fe3d:dfcb
> tcp server start 1337
tcp server start 1337
Success: started TCP server on port 1337
> TCP client [fe80::98be:cbff:fe6f:6cae]:34928 connected
Error on TCP accept [-12]

after I did

$ nc -6 fe80::682e:4fff:fe3d:dfcb%tapbr0 1337
^C
$ nc -6 fe80::682e:4fff:fe3d:dfcb%tapbr0 1337
^C

Then

$ git cherry-pick d26355fbb41cdc0cb6a5af17901cd7e95a58bd8a
Auto-merging sys/net/sock/async/event/sock_async_event.c
[lwip/enh/sock_async 7647e854b] sock_async_event: fix race-condition
 Date: Fri Feb 21 15:50:30 2020 +0100
 1 file changed, 7 insertions(+), 1 deletion(-)

I get

RIOT native @ tap0
make: Entering directory '/home/mlenders/Repositories/RIOT-OS/RIOT/tests/lwip'
/home/mlenders/Repositories/RIOT-OS/RIOT/tests/lwip/bin/native/tests_lwip.elf tap0 
RIOT native interrupts/signals initialized.
LED_RED_OFF
LED_GREEN_ON
RIOT native board initialized.
RIOT native hardware initialization complete.

main(): This is RIOT! (Version: 2020.04-devel-594-g7647e-lwip/enh/sock_async)
RIOT lwip test application
> ifconfig
ifconfig
ET_00:  inet6 fe80::682e:4fff:fe3d:dfcb

> tcp server start 1337
tcp server start 1337
Success: started TCP server on port 1337
> TCP client [fe80::98be:cbff:fe6f:6cae]:35086 connected
TCP connection to [fe80::98be:cbff:fe6f:6cae]:35086 reset
TCP client [fe80::98be:cbff:fe6f:6cae]:35092 connected
TCP connection to [fe80::98be:cbff:fe6f:6cae]:35092 reset

Again, after I used nc twice

$ nc -6 fe80::682e:4fff:fe3d:dfcb%tapbr0 1337
^C
$ nc -6 fe80::682e:4fff:fe3d:dfcb%tapbr0 1337
^C

@miri64
Copy link
Copy Markdown
Member Author

miri64 commented Feb 21, 2020

(edited comment as I copy-pasted something wrongly 😅)

@miri64
Copy link
Copy Markdown
Member Author

miri64 commented Feb 21, 2020

Note the version strings that should make clear that I use the version I am actually claiming to use. Is this justification enough to have a bug fix for my own code merged 😉?

@gschorcht
Copy link
Copy Markdown
Contributor

I can confirm that #13427 resetted to aaa644b gives an error and I can't reconnect:

TCP client [fe80::d4e9:caff:fe69:da71]:38516 connected
Error on TCP accept [-12]

I can further confirm that this PR on top of aaa644b gives some garbage but I can reconnect:

TCP client [fe80::d4e9:caff:fe69:da71]:38620 connected
TCP connection to [fe80::d4e9:caff:fe69:da71]:38620 reset
TCP connection to [100:0:100:0:200:0:44df:6356]:22116 reset

I can also confirm that this PR on top of the last commit of PR #13427 works as expected.

TCP client [fe80::d4e9:caff:fe69:da71]:38664 connected
TCP connection to [fe80::d4e9:caff:fe69:da71]:38664 reset

@gschorcht
Copy link
Copy Markdown
Contributor

gschorcht commented Feb 21, 2020

BTW, I saw this Error on TCP accept [-12] with an earlier version of PR #13427 when I was playing a bit with two instances of nc. I can still reproduce it. Could you check please whether it is releated this this PR or PR #13427?

riot#> tcp server start 1337

term1> nc -6 fe80::d4e9:caff:fe69:da72%tap0 1337
riot#> TCP client [fe80::d4e9:caff:fe69:da71]:41788 connected

term2> nc -6 fe80::d4e9:caff:fe69:da72%tap0 1337
riot#> Error on TCP accept [-12]
term2> ^C

term1> ^C
riot#> TCP connection to [fe80::d4e9:caff:fe69:da71]:41788 reset

term1> nc -6 fe80::d4e9:caff:fe69:da72%tap0 1337
riot#> TCP client [fe80::d4e9:caff:fe69:da71]:41792 connected

term1> ^C
riot#> NOTHING HAPPENS

term1> nc -6 fe80::d4e9:caff:fe69:da72%tap0 1337
riot#> Error on TCP accept [-12]

Copy link
Copy Markdown
Contributor

@gschorcht gschorcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change makes sense. The problem desribed in #13438 (comment) seems not to be related to this PR. ACK

@miri64
Copy link
Copy Markdown
Member Author

miri64 commented Feb 21, 2020

I am pretty sure this PR has nothing to do with it or in the worst case just reveals a problem with #13427. From a code perspective this PR should be ready to go.

@gschorcht
Copy link
Copy Markdown
Contributor

I am pretty sure this PR has nothing to do with it or in the worst case just reveals a problem with #13427. From a code perspective this PR should be ready to go.

Yes, but would you try to reproduce the behavior described in #13438 (comment)?

@gschorcht gschorcht merged commit 8dc53cf into RIOT-OS:master Feb 21, 2020
@miri64
Copy link
Copy Markdown
Member Author

miri64 commented Feb 21, 2020

I am pretty sure this PR has nothing to do with it or in the worst case just reveals a problem with #13427. From a code perspective this PR should be ready to go.

Yes, but would you try to reproduce the behavior described in #13438 (comment)?

Already was able to :-). It is definitely something being wrong with #13427 (the callback_arg member of the netconn at the second connect stays unset which is why the callback is skipped, as the sock can't be determined... still trying to figure out why).

@miri64 miri64 deleted the sock_async_event/fix/race-condition branch February 21, 2020 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: network Area: Networking Area: pkg Area: External package ports CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants