Skip to content

server: Fix connection issues when receiving ECONNRESET#123

Merged
dmcgowan merged 1 commit intocontainerd:mainfrom
darfux:fix_leak_when_conn_reset
Nov 9, 2022
Merged

server: Fix connection issues when receiving ECONNRESET#123
dmcgowan merged 1 commit intocontainerd:mainfrom
darfux:fix_leak_when_conn_reset

Conversation

@darfux
Copy link

@darfux darfux commented Sep 21, 2022

The ttrpc server somtimes receives ECONNRESET rather than EOF when the client is exited. Such as reading from a closed connection. Handle it properly to avoid goroutine and connection leak.

Below is a debug example:

> github.com/containerd/ttrpc.(*serverConn).run() containerd/ttrpc/server.go:470 (hits goroutine(23900):1 total:3) (PC: 0x55f06b5a7cb3)
(dlv) p err
error(*errors.errorString) *{s: "EOF"}
(dlv) c
> github.com/containerd/ttrpc.(*serverConn).run() containerd/ttrpc/server.go:470 (hits goroutine(47938):1 total:4) (PC: 0x55f06b5a7cb3)
(dlv) p err
error(*net.OpError) *{
        Op: "read",
        Net: "unix",
        Source: net.Addr(*net.UnixAddr) *{
                Name: "/run/containerd/containerd.sock.ttrpc",
                Net: "unix",},
        Addr: net.Addr(*net.UnixAddr) *{Name: "@", Net: "unix"},
        Err: error(*os.SyscallError) *{
                Syscall: "read",
                Err: error(syscall.Errno) *(*error)(0xc001840950),},}
(dlv) p err.Err
error(*os.SyscallError) *{
        Syscall: "read",
        Err: error(syscall.Errno) ECONNRESET (104),}

Currently there's a possible race condition when shutting down the
server, which causes new connection to be added after Shutdown()

//s.Serve
l.Accept()
handshaker.Handshake
.
.       // s.Shutdown()
.        --> close(s.done)
.        --> closeListeners
.        --> closeIdleConns // should have no more conns after
.
s.newConn // An unexpected new conn is added

@darfux darfux force-pushed the fix_leak_when_conn_reset branch 6 times, most recently from 14878b8 to 8a6a44e Compare September 21, 2022 13:41
@darfux darfux changed the title server: Fix connection leak when receiving ECONNRESET server: Fix connection issues when receiving ECONNRESET and Shutdown() Sep 21, 2022
@darfux darfux force-pushed the fix_leak_when_conn_reset branch 2 times, most recently from 9079746 to edfef48 Compare September 21, 2022 13:56
server.go Outdated
// requests due to a terminal error.
recvErr = nil // connection is now "closing"
if err == io.EOF || err == io.ErrUnexpectedEOF {
opErr, isOpErr := err.(*net.OpError)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this type check? Shouldn't we just be able to use errors.Is without this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, updated, thx 😁

@darfux darfux force-pushed the fix_leak_when_conn_reset branch 2 times, most recently from 60d8f33 to 031309f Compare October 6, 2022 08:21
@darfux darfux requested a review from cpuguy83 October 11, 2022 03:10
if err != nil {
logrus.WithError(err).Error("error receiving message")
}
logrus.WithError(err).Error("error receiving message")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove the nil check though? If this message is needed, then perhaps we need to change sendStatus to return error to send on recvError.

It seems to me though that this is trying to catch the shutdown case which this block is purposefully avoiding when there is active connections. If the return after this line is necessary, then the shutdown logic in this goroutine still doesn't seem correct.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As code at https://github.com/containerd/ttrpc/blob/v1.1.0/server.go#L363 will do the nil check and only send non-nill error to the recvErr channel, I think it's safe to remove the nil check here.

It seems to me though that this is trying to catch the shutdown case which this block is purposefully avoiding when there is active connections. If the return after this line is necessary, then the shutdown logic in this goroutine still doesn't seem correct.

Yes, the return here is not necessary and should be done by the shutdown logic. I‘ll remove it, thx

@dmcgowan
Copy link
Member

Currently there's a possible race condition when shutting down the
server, which causes new connection to be added after Shutdown()

It seems like the correct behavior would be to handle the connection since it has already been accepted. In this change, the check is being done before, but the same race is still possible since there is no guarantee in execution order between checking s.done and calling s.newConn. Where is the unexpected new conn causing issues, if we can address it there.

For the ECONNRESET case, it makes sense to treat it like EOF. The other part of the change where it always returns from that branch I don't think is correct though, that could probably be broken out and explored separately.

@darfux darfux force-pushed the fix_leak_when_conn_reset branch from 031309f to 9ccde4c Compare November 2, 2022 06:12
@darfux
Copy link
Author

darfux commented Nov 2, 2022

It seems like the correct behavior would be to handle the connection since it has already been accepted. In this change, the check is being done before, but the same race is still possible since there is no guarantee in execution order between checking s.done and calling s.newConn. Where is the unexpected new conn causing issues, if we can address it there.

@dmcgowan You are right, the race condition still exists. I found the possible actual issue is addConnection may be called after closeIdleConns done and causes s.connections to be not empty. It looks a little bit hard to address this condition. I've removed the commit from this PR and will work on it later.

@darfux darfux force-pushed the fix_leak_when_conn_reset branch 3 times, most recently from fd47460 to 8a747cf Compare November 4, 2022 05:26
The ttrpc server somtimes receives `ECONNRESET` rather than `EOF` when
the client is exited. Such as reading from a closed connection. Handle
it properly to avoid goroutine and connection leak.

Change-Id: If32711cfc1347dd2da27ca846dd13c3f5af351bb
Signed-off-by: liyuxuan.darfux <[email protected]>
@darfux darfux force-pushed the fix_leak_when_conn_reset branch from 8a747cf to a03aa04 Compare November 4, 2022 05:56
@darfux darfux changed the title server: Fix connection issues when receiving ECONNRESET and Shutdown() server: Fix connection issues when receiving ECONNRESET Nov 8, 2022
@dmcgowan dmcgowan merged commit 5cc9169 into containerd:main Nov 9, 2022
@benbuzbee
Copy link

Thanks for the fix to this leak that we've been seeing for a while

It still seems wrong to me that the default case is to set recvErr = nil (which keeps this select case from ever unblocking again) and continuing the loop after logging the error.

Yes this fixes ECONNRESET but what happens the next time you see an esoteric error? You will have the same problem.

Why dont you just 'return' all the time, and only use the if to decide whether or not to log the error?

If your concern is not draining the other channels, then the goroutine that is writing to those channels should close them in a deferral like it does here:
https://github.com/darfux/ttrpc/blob/a03aa0459192f20913133e3d1e71419414431c6f/server.go#L351

This would guarantee that if the writer (This goroutine) dies, the channels are closed and the select isn't stuck hanging.

@darfux
Copy link
Author

darfux commented Apr 11, 2023

Why remove the nil check though? If this message is needed, then perhaps we need to change sendStatus to return error to send on recvError.

It seems to me though that this is trying to catch the shutdown case which this block is purposefully avoiding when there is active connections. If the return after this line is necessary, then the shutdown logic in this goroutine still doesn't seem correct.

Hi @benbuzbee, as discussed before, we considered other cases should be handled by the shutdown logic. But I'm agree with you, all errors should return in the block as the related connection will no longer receive requests. The shutdown case is only used for normal connection without any error. cc @dmcgowan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants