rd_kafka_query_watermark_offsets API hangs forever #2588

@hxiaodon

Description

Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ

The rd_kafka_query_watermark_offsets API hangs forever when network access to part of the Kafka cluster is restricted (network isolation).

How to reproduce

I can reproduce this problem with the latest librdkafka version.

  1. Launch two VM/Docker instances, A and B (my local OS is CentOS 6).

  2. Install confluent-oss on instance A and start Kafka with 3 brokers:

    • brokerId: 1, port:9093
    • brokerId: 2, port:9094
    • brokerId: 3, port:9095
  3. Create a topic "test" with 3 partitions and a replication factor of 1, so that each broker leads exactly one partition. Assume the "test" topic is laid out as follows:

    • brokerId: 1, port:9093 partitionId:0
    • brokerId: 2, port:9094 partitionId:1
    • brokerId: 3, port:9095 partitionId:2
  4. On instance B, deploy the test program:
    main.go.zip

  5. Enable the iptables service on instance A and reject only instance B's access to port 9095.

  6. Now run the test program on instance B (it exercises the QueryWatermarkOffsets API). It hangs: the broker leading partition 2 is alive but not reachable from instance B.

    • ./kafkatest -broker=$instanceA_IP:9093 -newAPI=true -topic=test -partitionId=2 -timeout=2000
  7. If we use the OffsetsForTimes API instead, the program exits when the timeout expires:

    • ./kafkatest -broker=$instanceA_IP:9093 -newAPI=false -topic=test -partitionId=2 -timeout=5000

Conclusion:
I think the issue can be easily reproduced whenever a partition's leader broker is isolated from the client.
The infinite looping code is here,

IMPORTANT: Always try to reproduce the issue on the latest released version (see https://github.com/edenhill/librdkafka/releases), if it can't be reproduced on the latest version the issue has been fixed.

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): v0.11.6
  • Apache Kafka version: confluent-oss-5.0.0-2.11
  • librdkafka client configuration: "session.timeout.ms": 10000
  • Operating system: centos 6
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
