rd_kafka_query_watermark_offsets API hangs forever

Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ



Description
===========
The rd_kafka_query_watermark_offsets API hangs forever when network access to part of the Kafka cluster is restricted (network isolation).


How to reproduce
================
**I could reproduce this problem with latest librdkafka version**

1. Launch two VM/Docker instances, A and B (my local OS is CentOS 6).
2. Install confluent-oss on instance A and start Kafka with 3 brokers:
     - brokerId: 1, port:9093
     - brokerId: 2, port:9094
     - brokerId: 3, port:9095
3. Create a topic "test" with 3 partitions and a replication factor of 1, so that each broker leads exactly one partition. Assume the "test" topic is laid out as follows:
     - brokerId: 1, port:9093 partitionId:0
     - brokerId: 2, port:9094 partitionId:1
     - brokerId: 3, port:9095 partitionId:2
4. On instance B, deploy the test program 
[main.go.zip](https://github.com/edenhill/librdkafka/files/3770893/main.go.zip)

5. Enable the iptables service on instance A and reject only instance B's access to port 9095.
6. Run the test program on instance B (it exercises the QueryWatermarkOffsets API). It hangs (**the broker for partition 2 is alive but not reachable from instance B**):
    - ./kafkatest -broker=$instanceA_IP:9093 -newAPI=true -topic=test -partitionId=2 -timeout=2000

7. With the OffsetsForTimes API, the program exits when the timeout expires:
    - ./kafkatest -broker=$instanceA_IP:9093 -newAPI=false -topic=test -partitionId=2 -timeout=5000

Conclusion:
I think the issue can be reproduced easily whenever the leader broker of a partition is isolated.
The infinite loop is [here](https://github.com/edenhill/librdkafka/blob/v0.11.6/src/rdkafka.c#L2643).


**IMPORTANT**: Always try to reproduce the issue on the latest released version (see https://github.com/edenhill/librdkafka/releases), if it can't be reproduced on the latest version the issue has been fixed.


Checklist
=========

**IMPORTANT**: We will close issues where the checklist has not been completed.

Please provide the following information:

 - [x] librdkafka version (release number or git tag): `v0.11.6`
 - [x] Apache Kafka version: `confluent-oss-5.0.0-2.11`
 - [x] librdkafka client configuration: `"session.timeout.ms": 10000`
 - [x] Operating system: `centos 6`
 - [ ] Provide logs (with `debug=..` as necessary) from librdkafka
 - [ ] Provide broker log excerpts
 - [x] Critical issue


