AsyncProducer retries causing OOM #1358
Description
Versions
| Sarama | Kafka | Go |
|---|---|---|
| 1.20.1 | 1.0.0 | 1.11.3 |
Configuration
```go
conf := sarama.NewConfig()
conf.Version = sarama.V0_11_0_0
conf.Net.MaxOpenRequests = 5
conf.Net.ReadTimeout = 5 * time.Second
conf.Net.DialTimeout = 5 * time.Second
conf.Net.WriteTimeout = 5 * time.Second
conf.Metadata.Retry.Max = 0
conf.Producer.RequiredAcks = sarama.WaitForAll // -1
conf.Producer.Timeout = 5 * time.Second
conf.Producer.Flush.Bytes = 16 * 1024 * 1024 // 16 MB
conf.Producer.Flush.Frequency = 5 * time.Second
conf.Producer.Return.Errors = true
conf.Producer.Return.Successes = true
```
Logs
Sarama logs recorded right before memory usage goes up:
producer/leader/<topic>/58 abandoning broker 10137
producer/leader/<topic>/14 selected broker 10128
producer/broker/10137 state change to [closed] on <topic>/80
producer/broker/10128 state change to [open] on <topic>/12
client/metadata fetching metadata for [<topic>] from broker <addr>:9092
producer/broker/10247 state change to [retrying] on <topic>/46 because kafka server: Request exceeded the user-specified time limit in the request.
producer/broker/10137 state change to [closed] on <topic>/18
producer/leader/<topic>/30 selected broker 10247
producer/broker/10247 state change to [retrying] on <topic>/30 because kafka server: Request exceeded the user-specified time limit in the request.
producer/broker/10137 state change to [open] on <topic>/58
producer/leader/<topic>/3 abandoning broker 10247
producer/broker/10247 state change to [closed] on <topic>/30
producer/broker/10247 state change to [closed] on <topic>/3
producer/leader/<topic>/55 state change to [normal]
producer/broker/10139 state change to [retrying] on <topic>/7 because kafka server: Request exceeded the user-specified time limit in the request.
Problem Description
The Go application writes messages into an AsyncProducer and also reads from the Success and Error channels. The application's average memory usage is about 8 GB.
The problem is that during peak load memory consumption can rise up to 4x, which causes an OOM.
The log messages above were recorded during the incident. Messages containing "maximum request accumulated, waiting for space" were omitted. At first glance it looks like there are some issues with producing to Kafka and the producer starts to retry messages.
Digging into AsyncProducer showed that retryHandler uses a buffer with no maximum-size limit. So I created an additional metric exposing the buffer's Length (// Length returns the number of elements currently stored in the queue).
And it showed a perfect correlation between memory consumption and queue size:
(graphs correlating retry-queue length with memory usage omitted)
This makes me think that AsyncProducer continues to accept incoming messages without any control over retry-queue growth, which leads to unbounded memory consumption. I wasn't able to reproduce the issue in a local environment.
Expected behaviour
If the AsyncProducer is unable to flush its retry queue, it should block on receiving new messages instead of buffering them without bound.