
AsyncProducer retries causing OOM #1358

@hagen1778

Description

Versions
Sarama: 1.20.1
Kafka: 1.0.0
Go: 1.11.3
Configuration
conf := sarama.NewConfig()
conf.Version = sarama.V0_11_0_0
conf.Net.MaxOpenRequests = 5
conf.Net.ReadTimeout = 5 * time.Second
conf.Net.DialTimeout = 5 * time.Second
conf.Net.WriteTimeout = 5 * time.Second
conf.Metadata.Retry.Max = 0
conf.Producer.RequiredAcks = sarama.WaitForAll // -1
conf.Producer.Timeout = 5 * time.Second
conf.Producer.Flush.Bytes = 16 * 1024 * 1024 // 16MB
conf.Producer.Flush.Frequency = 5 * time.Second
conf.Producer.Return.Errors = true
conf.Producer.Return.Successes = true
Logs

Sarama logs recorded right before memory usage goes up:


producer/leader/<topic>/58 abandoning broker 10137
producer/leader/<topic>/14 selected broker 10128
producer/broker/10137 state change to [closed] on <topic>/80
producer/broker/10128 state change to [open] on <topic>/12
client/metadata fetching metadata for [<topic>] from broker <addr>:9092
producer/broker/10247 state change to [retrying] on <topic>/46 because kafka server: Request exceeded the user-specified time limit in the request.
producer/broker/10137 state change to [closed] on <topic>/18
producer/leader/<topic>/30 selected broker 10247
producer/broker/10247 state change to [retrying] on <topic>/30 because kafka server: Request exceeded the user-specified time limit in the request.
producer/broker/10137 state change to [open] on <topic>/58
producer/leader/<topic>/3 abandoning broker 10247
producer/broker/10247 state change to [closed] on <topic>/30
producer/broker/10247 state change to [closed] on <topic>/3
producer/leader/<topic>/55 state change to [normal]
producer/broker/10139 state change to [retrying] on <topic>/7 because kafka server: Request exceeded the user-specified time limit in the request.

Problem Description

The Go application writes messages into an AsyncProducer and reads from the Success and Error channels. Its average memory usage is about 8 GB.
The problem is that during peak load, memory consumption can rise up to 4x, which causes an OOM.
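For context, the drain pattern the application uses looks roughly like the sketch below. The types here are illustrative stand-ins, not sarama's actual `*ProducerMessage`/`*ProducerError`; the point is that with `Return.Successes` and `Return.Errors` both enabled, both channels must be consumed continuously or the producer stalls.

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative stand-ins for sarama's result types.
type producerMessage struct{ value string }
type producerError struct{ err string }

// drain consumes both channels until they are closed; with
// Return.Successes and Return.Errors enabled, failing to drain
// either channel blocks the producer.
func drain(successes <-chan *producerMessage, errors <-chan *producerError) (ok, failed int) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for range successes {
			ok++
		}
	}()
	go func() {
		defer wg.Done()
		for range errors {
			failed++
		}
	}()
	wg.Wait()
	return ok, failed
}

func main() {
	s := make(chan *producerMessage, 2)
	e := make(chan *producerError, 1)
	s <- &producerMessage{"a"}
	s <- &producerMessage{"b"}
	e <- &producerError{"timeout"}
	close(s)
	close(e)
	ok, failed := drain(s, e)
	fmt.Println(ok, failed) // 2 1
}
```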

The log messages above were recorded during the incident. Messages with the content `maximum request accumulated, waiting for space` were omitted. At first glance, it looks like there is some issue with inserting into Kafka, and the producer starts to retry messages.

Digging into AsyncProducer showed that retryHandler uses a buffer with no maximum-size limit. So I added a metric to track the buffer's Length (`// Length returns the number of elements currently stored in the queue.`),
and it showed a perfect correlation between memory consumption and queue size:
[screenshots: memory consumption and retry queue length graphs]

This makes me think that AsyncProducer continues to accept incoming messages without any control over retry queue growth, which leads to unbounded memory consumption. I wasn't able to reproduce the issue in a local environment.
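The failure mode can be sketched with a toy stand-in for the retry buffer: while the broker keeps answering with timeouts, every retried message is parked in a slice-backed queue that nothing bounds, so its length (and the process RSS with it) only grows. The type and names below are illustrative, not sarama internals.

```go
package main

import "fmt"

// retryBuffer mimics an unbounded slice-backed queue like the one
// sarama's retryHandler uses; nothing caps its growth.
type retryBuffer struct{ items []string }

func (b *retryBuffer) enqueue(m string) { b.items = append(b.items, m) }

// Length returns the number of elements currently stored in the queue.
func (b *retryBuffer) Length() int { return len(b.items) }

func main() {
	buf := &retryBuffer{}
	// While the broker keeps returning "request exceeded the
	// user-specified time limit", every retried message ends up here.
	for i := 0; i < 100000; i++ {
		buf.enqueue("payload")
	}
	fmt.Println(buf.Length()) // 100000
}
```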

Expected behaviour

If AsyncProducer is unable to flush its retry queue, it should block on receiving new messages.
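A minimal sketch of that desired back-pressure, using a buffered channel as a semaphore: once a cap of unacknowledged messages is reached, the next send blocks instead of letting an internal buffer grow. All names and the cap are illustrative; this is an application-level mock, not sarama's API.

```go
package main

import "fmt"

// maxInFlight caps unacknowledged messages; the value is illustrative.
const maxInFlight = 3

type producer struct {
	input chan string   // stand-in for sarama's Input()
	sem   chan struct{} // one slot per in-flight message
}

func newProducer() *producer {
	return &producer{
		input: make(chan string, maxInFlight),
		sem:   make(chan struct{}, maxInFlight),
	}
}

// send blocks once maxInFlight messages are unacknowledged, so a
// stalled broker back-pressures the caller instead of growing a queue.
func (p *producer) send(msg string) {
	p.sem <- struct{}{} // acquire a slot; blocks at the cap
	p.input <- msg
}

// ack simulates reading a Success/Error back and releases a slot.
func (p *producer) ack() string {
	msg := <-p.input
	<-p.sem
	return msg
}

func main() {
	p := newProducer()
	for i := 0; i < maxInFlight; i++ {
		p.send(fmt.Sprintf("msg-%d", i))
	}
	// a fourth send would now block until ack() is called
	fmt.Println(p.ack()) // msg-0
}
```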
