
AsyncProducer retries causing OOM #1358

@hagen1778

Description

Versions
Sarama: 1.20.1
Kafka: 1.0.0
Go: 1.11.3
Configuration
conf := sarama.NewConfig()
conf.Version = sarama.V0_11_0_0
conf.Net.MaxOpenRequests = 5
conf.Net.ReadTimeout = 5 * time.Second
conf.Net.DialTimeout = 5 * time.Second
conf.Net.WriteTimeout = 5 * time.Second
conf.Metadata.Retry.Max = 0
conf.Producer.RequiredAcks = sarama.WaitForAll // -1
conf.Producer.Timeout = 5 * time.Second
conf.Producer.Flush.Bytes = 16 * 1024 * 1024 // 16MB
conf.Producer.Flush.Frequency = 5 * time.Second
conf.Producer.Return.Errors = true
conf.Producer.Return.Successes = true
Logs

Sarama logs recorded right before memory usage goes up:


producer/leader/<topic>/58 abandoning broker 10137
producer/leader/<topic>/14 selected broker 10128
producer/broker/10137 state change to [closed] on <topic>/80
producer/broker/10128 state change to [open] on <topic>/12
client/metadata fetching metadata for [<topic>] from broker <addr>:9092
producer/broker/10247 state change to [retrying] on <topic>/46 because kafka server: Request exceeded the user-specified time limit in the request.
producer/broker/10137 state change to [closed] on <topic>/18
producer/leader/<topic>/30 selected broker 10247
producer/broker/10247 state change to [retrying] on <topic>/30 because kafka server: Request exceeded the user-specified time limit in the request.
producer/broker/10137 state change to [open] on <topic>/58
producer/leader/<topic>/3 abandoning broker 10247
producer/broker/10247 state change to [closed] on <topic>/30
producer/broker/10247 state change to [closed] on <topic>/3
producer/leader/<topic>/55 state change to [normal]
producer/broker/10139 state change to [retrying] on <topic>/7 because kafka server: Request exceeded the user-specified time limit in the request.

Problem Description

The Go application writes messages into an AsyncProducer and reads from the Success and Error channels. Its average memory usage is about 8 GB.
The problem is that during peak load, memory consumption can rise up to 4x, which causes an OOM.
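For context, the drain pattern the application uses looks roughly like the sketch below. The types here are illustrative stand-ins, not sarama's actual `*ProducerMessage`/`*ProducerError`; the point is that with `Return.Successes` and `Return.Errors` both enabled, both channels must be consumed continuously or the producer stalls.

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative stand-ins for sarama's result types.
type producerMessage struct{ value string }
type producerError struct{ err string }

// drain consumes both channels until they are closed; with
// Return.Successes and Return.Errors enabled, failing to drain
// either channel blocks the producer.
func drain(successes <-chan *producerMessage, errors <-chan *producerError) (ok, failed int) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for range successes {
			ok++
		}
	}()
	go func() {
		defer wg.Done()
		for range errors {
			failed++
		}
	}()
	wg.Wait()
	return ok, failed
}

func main() {
	s := make(chan *producerMessage, 2)
	e := make(chan *producerError, 1)
	s <- &producerMessage{"a"}
	s <- &producerMessage{"b"}
	e <- &producerError{"timeout"}
	close(s)
	close(e)
	ok, failed := drain(s, e)
	fmt.Println(ok, failed) // 2 1
}
```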

The log messages above were recorded during the incident. Messages with the content `maximum request accumulated, waiting for space` were omitted. At first glance, it looks like there is some issue with inserting into Kafka, and the producer starts to retry messages.

Digging into AsyncProducer showed that retryHandler uses a buffer with no maximum-size limit. So I added a metric to track the buffer's Length (`// Length returns the number of elements currently stored in the queue.`),
and it showed a perfect correlation between memory consumption and queue size:
[screenshots: memory consumption and retry queue length graphs]

This makes me think that AsyncProducer continues to accept incoming messages without any control over retry queue growth, which leads to unbounded memory consumption. I wasn't able to reproduce the issue in a local environment.
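The failure mode can be sketched with a toy stand-in for the retry buffer: while the broker keeps answering with timeouts, every retried message is parked in a slice-backed queue that nothing bounds, so its length (and the process RSS with it) only grows. The type and names below are illustrative, not sarama internals.

```go
package main

import "fmt"

// retryBuffer mimics an unbounded slice-backed queue like the one
// sarama's retryHandler uses; nothing caps its growth.
type retryBuffer struct{ items []string }

func (b *retryBuffer) enqueue(m string) { b.items = append(b.items, m) }

// Length returns the number of elements currently stored in the queue.
func (b *retryBuffer) Length() int { return len(b.items) }

func main() {
	buf := &retryBuffer{}
	// While the broker keeps returning "request exceeded the
	// user-specified time limit", every retried message ends up here.
	for i := 0; i < 100000; i++ {
		buf.enqueue("payload")
	}
	fmt.Println(buf.Length()) // 100000
}
```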

Expected behaviour

If AsyncProducer is unable to flush its retry queue, it should block on receiving new messages.
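A minimal sketch of that desired back-pressure, using a buffered channel as a semaphore: once a cap of unacknowledged messages is reached, the next send blocks instead of letting an internal buffer grow. All names and the cap are illustrative; this is an application-level mock, not sarama's API.

```go
package main

import "fmt"

// maxInFlight caps unacknowledged messages; the value is illustrative.
const maxInFlight = 3

type producer struct {
	input chan string   // stand-in for sarama's Input()
	sem   chan struct{} // one slot per in-flight message
}

func newProducer() *producer {
	return &producer{
		input: make(chan string, maxInFlight),
		sem:   make(chan struct{}, maxInFlight),
	}
}

// send blocks once maxInFlight messages are unacknowledged, so a
// stalled broker back-pressures the caller instead of growing a queue.
func (p *producer) send(msg string) {
	p.sem <- struct{}{} // acquire a slot; blocks at the cap
	p.input <- msg
}

// ack simulates reading a Success/Error back and releases a slot.
func (p *producer) ack() string {
	msg := <-p.input
	<-p.sem
	return msg
}

func main() {
	p := newProducer()
	for i := 0; i < maxInFlight; i++ {
		p.send(fmt.Sprintf("msg-%d", i))
	}
	// a fourth send would now block until ack() is called
	fmt.Println(p.ack()) // msg-0
}
```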
