Skip to content

Server-side goroutine leak on receive message error #142

@austinvazquez

Description

@austinvazquez

Hey folks, while tracking down a production error I inadvertently found a rare error in the receive message logic for server connections.

To set some context the receive message logic relies on two goroutines working together.
goroutine A: https://github.com/containerd/ttrpc/blob/main/server.go#L136
goroutine B: https://github.com/containerd/ttrpc/blob/main/server.go#L369-L487

A [built-in Go] error channel is used for communication of errors between goroutines A and B.
https://github.com/containerd/ttrpc/blob/main/server.go#L339

The first thing to note is on message header read error the channel returns the underlying error to goroutine B. Based on the current implementation, these errors can only be from io.ReadFull. io.ReadFull can return io.EOF or io.ErrUnexpectedEOF. Reference: https://pkg.go.dev/io#ReadFull
https://github.com/containerd/ttrpc/blob/main/channel.go#L73-L76

This is fine because goroutine A has error handling to check for these errors and stop execution if they occur.
https://github.com/containerd/ttrpc/blob/main/server.go#L545-L555

The issue begins when considering errors on message read by goroutine B. In this branch, goroutine B has successfully read a message header for a message with some length > 0; however, when reading the message an error has occurred by io.ReadFull either io.EOF or io.ErrUnexpectedEOF.

What I have found is for these errors we wrap the error with more information to clarify the error (opposed to on message header read) occurred on message read.
https://github.com/containerd/ttrpc/blob/main/channel.go#L138

However, goroutine A's error handling logic is not currently setup to check for wrapped errors.
https://github.com/containerd/ttrpc/blob/main/server.go#L550-L554

So the effect is goroutine B is stopped due to a [non-gRPC] receive error, but goroutine A is never stopped resulting in leaked goroutine and server connection.
Stop goroutine B on receive error: https://github.com/containerd/ttrpc/blob/main/server.go#L380-L385
Again (currently) only stop goroutine A on some receive errors: https://github.com/containerd/ttrpc/blob/main/server.go#L550-L554

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions