This repository was archived by the owner on Jan 24, 2024. It is now read-only.

[BUG] KoP cannot support a FETCH request with many partitions #544

@BewareMyPower

Description


Describe the bug
When I run performance tests on a topic with 100 partitions, the topic quickly becomes unavailable. What's worse, the whole KoP cluster breaks, because in this situation no bookie is available. So even a new topic created afterwards is still unavailable, failing with "Not enough non-faulty bookies available".

15:07:19.134 [BookKeeperClientWorker-OrderedExecutor-0-0] WARN org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (177, 15): Bookie operation timeout
15:07:19.143 [BookKeeperClientWorker-OrderedExecutor-0-0] WARN org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to find 1 bookies : excludeBookies [Bookie:10.0.0.220:3181, Bookie:10.0.0.198:3181, Bookie:10.0.0.228:3181], allBookies [Bookie:10.0.0.228:3181, Bookie:10.0.0.198:3181, Bookie:10.0.0.220:3181].
15:07:19.143 [BookKeeperClientWorker-OrderedExecutor-0-0] WARN org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl - Failed to choose a bookie: excluded [Bookie:10.0.0.220:3181, Bookie:10.0.0.198:3181, Bookie:10.0.0.228:3181], fallback to choose bookie randomly from the cluster.
...
15:07:19.143 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.client.MetadataUpdateLoop - UpdateLoop(ledgerId=177,loopId=4ef321e8) Exception updating
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available

The root cause is the "Bookie operation timeout", which appears to lead to all bookies being excluded.

After that, the producer might still be able to write some messages, but there is a possibility that corrupted messages are written to BK, see #442 for reference. A direct memory leak then happens as well. #542 handles these corrupted messages.

When there's no connected consumer, this case won't happen.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy a KoP cluster (standalone may be easier to reproduce with)
  2. Create a topic with 100 partitions
  3. Run a producer and a consumer. Even if the produce rate is not so large (like 10 MB/s), the topic quickly becomes unavailable.

Expected behavior
Bookie operation timeout should not happen.

Additional context
From my debugging, it may be caused by MessageIdUtils#getPositionForOffset:

public static PositionImpl getPositionForOffset(ManagedLedger managedLedger, Long offset) {
    try {
        return (PositionImpl) managedLedger.asyncFindPosition(new OffsetSearchPredicate(offset)).get();
    } catch (InterruptedException | ExecutionException e) {
        log.error("[{}] Failed to find position for offset {}", managedLedger.getName(), offset);
        throw new RuntimeException(managedLedger.getName() + " failed to find position for offset " + offset);
    }
}

The call stacks are

KafkaTopicConsumerManager#remove
  KafkaTopicConsumerManager#createCursorIfNotExists
    KafkaTopicConsumerManager#getCursorByOffset
      MessageIdUtils#getPositionForOffset

getPositionForOffset is a synchronous method that blocks until a CompletableFuture completes. However, it is called inside a CompletableFuture callback (thenAccept) when the ManagedCursor is not cached.
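To illustrate the hazard, here is a minimal self-contained sketch (not KoP code, class and method names are made up): a callback running on a single-threaded ordered executor blocks on a future whose completing task is queued on that same executor, so the blocking get() can only time out. This is analogous to getPositionForOffset blocking inside thenAccept.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CallbackBlockingSketch {

    static String simulate() throws Exception {
        // Single thread stands in for one ordered-executor thread.
        ExecutorService ordered = Executors.newSingleThreadExecutor();
        CompletableFuture<String> cursorFuture = new CompletableFuture<>();
        CompletableFuture<Long> positionFuture = new CompletableFuture<>();
        CompletableFuture<String> outcome = new CompletableFuture<>();

        cursorFuture.thenAcceptAsync(cursor -> {
            // The task that completes positionFuture is queued BEHIND this
            // callback on the same single-threaded executor, so it cannot
            // run while we block below.
            ordered.submit(() -> positionFuture.complete(42L));
            try {
                // Blocking the only thread: the queued task never runs in time.
                positionFuture.get(500, TimeUnit.MILLISECONDS);
                outcome.complete("found position");
            } catch (Exception e) {
                outcome.complete("timed out waiting for position");
            }
        }, ordered);

        cursorFuture.complete("cursor");
        String result = outcome.get(5, TimeUnit.SECONDS);
        ordered.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(simulate());
    }
}
```

In KoP the stall shows up as a "Bookie operation timeout" rather than a clean deadlock, but the mechanism sketched here is the same: a thread that should be servicing async completions is parked in a blocking get().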

To verify this, I added a quick workaround to getPositionForOffset:

public static PositionImpl getPositionForOffset(ManagedLedger managedLedger, Long offset) {
    if (offset == 0L) {
        return ((ManagedLedgerImpl) managedLedger).getFirstPosition();
    }
    // ...

Then I succeeded in testing a topic with 100 partitions for a few minutes. However, if the produce rate was too high (like 800 MB/s), the bug could still happen after about one minute. It may be caused by a namespace unload, after which getPositionForOffset could be called again with a non-zero offset.
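For reference, a hedged sketch of the non-blocking shape the call path could take (the Position record and asyncFindPosition below are hypothetical stand-ins for the Pulsar API, not the real types): return the future from asyncFindPosition and let callers compose on it, instead of calling get() on an ordered-executor thread.

```java
import java.util.concurrent.CompletableFuture;

public class AsyncPositionSketch {
    // Hypothetical stand-in for Pulsar's PositionImpl, for illustration only.
    record Position(long ledgerId, long entryId) {}

    // Hypothetical stand-in for managedLedger.asyncFindPosition(...),
    // which is already asynchronous in the real API.
    static CompletableFuture<Position> asyncFindPosition(long offset) {
        return CompletableFuture.supplyAsync(() -> new Position(177L, offset));
    }

    // Instead of a blocking get(), expose the future so callers such as
    // getCursorByOffset can chain on it without parking an executor thread.
    static CompletableFuture<Position> getPositionForOffsetAsync(long offset) {
        return asyncFindPosition(offset);
    }

    public static void main(String[] args) {
        // Caller composes asynchronously; no thread blocks in a callback.
        getPositionForOffsetAsync(15L)
                .thenAccept(pos -> System.out.println("entryId = " + pos.entryId()))
                .join();
    }
}
```

Whether this restructuring is feasible in KafkaTopicConsumerManager depends on the surrounding call sites; it is shown only to illustrate the non-blocking alternative to the synchronous getPositionForOffset above.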

Here are the logs. There are no warn/error logs in bookie.log.

bookie.log
broker.log
