
[RFC] gnrc_tcp_recv(): fix connection closed by remote host not handled#10899

Closed
benemorius wants to merge 2 commits into RIOT-OS:master from benemorius:pr/gnrc_tcp-remoteclosed

Conversation

@benemorius
Member

Contribution description

I'm not certain this is right, but it fixed the problem I was having and it seems correct. Without this PR, I don't see how application code calling gnrc_tcp_recv() is notified that the remote host has closed the connection. The documentation doesn't say what the function should return when the remote end has closed. Currently it just keeps blocking until the timeout expires (or else returns -EAGAIN), so the caller has no way to tell that the remote host has closed the connection. This seems wrong.

The way recv() works on Linux is that it returns 0 once the remote end has closed. The application usually responds by closing its own end once it has no more data to send. It seems to me that gnrc_tcp_recv() should likewise return 0 so that the application can then call gnrc_tcp_close().

Otherwise, the connection gets stuck half open because the application never learns that it should close it. That is, tcb->state gets stuck in FSM_STATE_CLOSE_WAIT, and new connections from a client are rejected until the RIOT node is reset.

Testing procedure

We don't have anything in tests or examples that uses gnrc_tcp_recv() in a way that can test what happens when the remote host closes the connection. What I'm testing with is a telnet server that I implemented against gnrc_tcp. I can connect with a telnet client and then disconnect, but without this PR subsequent connection attempts are refused. With this PR, the connection is closed properly and subsequent connections are successful.

The telnet server would probably compile for many platforms, but unfortunately I don't have it isolated in a clean branch right now, and that branch carries some hacks in core and sys, so it's a very non-ideal way to test this.

Normally I'd stop and write a stripped-down test for this, but I don't have much time, and this seems like something we should try to address before the 2019.01 release, unless I'm simply mistaken about how it's supposed to work. In any case, I wanted to hear what others think, since I don't know when I'll have time to continue with it.

@PeterKietzmann PeterKietzmann added Area: network Area: Networking Area: tests Area: tests and testing framework labels Jan 30, 2019
Member

@brummer-simon brummer-simon left a comment


Hi benemorius,

I took a brief look into the TCP RFC regarding the handling of the receive call (https://tools.ietf.org/html/rfc793#page-59). Your bug fix is not consistent with the specified behavior.

It is possible that an incoming packet with FIN set contains payload. This means there could be received data in the TCB. The early return would make already received data in CLOSE-WAIT state unavailable to the calling code.

I'll take a more detailed look into the recv behavior and implementation next weekend and give you more detailed feedback on the topic.

Regarding the lack of test code: there is already a PR in progress addressing this issue.
I currently don't have much spare time to work on that testing PR, but if you want to help, you are more than welcome. Here is its current state: https://github.com/brummer-simon/RIOT/tree/gnrc_tcp-improve_tests.

@brummer-simon
Member

On the topic of "new connections from a client are rejected until the riot node is reset": that might be valid behavior. By default, RIOT allows only a single TCB to be allocated at a time (though this can be changed via a define).

Depending on the order of the incoming packets during the shutdown sequence, the RIOT TCB can enter a state (TIME-WAIT) where it has to wait out any remaining packets in flight associated with the closed connection (the default wait is two minutes). Until that time has passed, the TCB is not allowed to be reused.
For further details see: https://tools.ietf.org/html/rfc793#page-39

On a Linux system, a new TCB would be allocated and used to accept the new connection. On RIOT this is not an option; you are essentially forced to wait until the TCB can be reused.

Are you sure that the behavior you are observing is not caused by this waiting period?

@benemorius
Member Author

benemorius commented Jan 30, 2019

On the Topic of "new connections from a client are rejected until the riot node is reset". That might be valid behavior.

Yes, I'm sure that this part is expected behavior, but it isn't caused by being stuck in TIME-WAIT. The TIME-WAIT state is never actually reached, since it can't be entered until after gnrc_tcp_close() is called.

The unexpected part is being unable to call gnrc_tcp_close() due to gnrc_tcp_recv() not giving indication that it has entered CLOSE-WAIT.

@benemorius
Member Author

benemorius commented Jan 30, 2019

I think you're right about the early return. I can fix that but I wanted to wait for confirmation that there's really a bug here before going any further. Maybe everything is fine and I'm just using it wrong. I just can't see any other way that I'm supposed to detect the CLOSE-WAIT state and call gnrc_tcp_close().

@brummer-simon
Member

brummer-simon commented Jan 30, 2019

I'll look into the entire issue next weekend (I want to see how CLOSE-WAIT is handled in the internal FSM). After that, I can give you more detailed feedback on the issue.

But it certainly looks like you stumbled upon a bug.

@miri64 miri64 added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Discussion: RFC The issue/PR is used as a discussion starting point about the item of the issue/PR labels Jan 31, 2019
@brummer-simon
Member

brummer-simon commented Feb 3, 2019

Hi benemorius,

After looking at the current implementation I think you found something there.

As you said, the problem is that the caller is blocked in the recv() call until the next data-carrying packet arrives. If no such packet arrives, the call fails with the timeout you requested. So far so good.
But in the case where you have already read everything from the receive buffer and the last packet carried the FIN flag, it makes no sense to wait for new payload, because the connection is already terminated.

I would suggest the following semantics for your fix:
The RFC says that even when the TCB is in the CLOSE-WAIT state, a recv call must still allow reading from the receive buffer. So first check whether the TCB is in CLOSE-WAIT. If so, try to read data from the receive buffer (_fsm_call_recv() in gnrc_tcp_fsm.c). If the buffer is not empty (return code > 0, the number of bytes read), return that code. If the buffer was empty, return an appropriate return code instead.

On the return code: currently either the number of bytes read or an error code is returned, so using zero to indicate that the connection was closed is fine. Please update the documentation in the header file, though, since it is a small change in semantics.

Additionally, please add something to _fsm_call_recv():
As soon as data has been read from the receive buffer, there is a check whether the receive buffer can now hold new data >= MSS. If so, a packet is sent to the peer announcing the new free window size. Please add a check so that this window update is not sent in the CLOSE-WAIT state. It doesn't make sense to reopen the receive window, because the peer has announced that it has nothing more to send. I can't find anything on this in the RFC, but I think it prevents unnecessary packets from being sent.

@stale

stale bot commented Aug 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

@stale stale bot added the State: stale State: The issue / PR has no activity for >185 days label Aug 10, 2019
@miri64 miri64 removed the State: stale State: The issue / PR has no activity for >185 days label Aug 10, 2019
@miri64
Member

miri64 commented Oct 3, 2019

See #12367

@miri64 miri64 closed this Oct 3, 2019