Fix deadlock caused by synchronous asyncFindPosition #556

BewareMyPower · 2021-06-08T16:50:56Z

Fixes #544

Motivation

Debugging showed that when a FETCH request contains many partitions, KafkaTopicConsumerManager#remove is executed once per partition in the same pulsar-io-<suffix> thread, and each call eventually invokes MessageIdUtils#getPositionForOffset to get the PositionImpl of the offset.

However, getPositionForOffset is a synchronous method that blocks until a CompletableFuture is done. The future is returned by ManagedLedgerImpl#asyncFindPosition, which calls ManagedLedgerImpl#asyncReadEntry internally. If a producer is publishing messages at the same time, asyncAddEntry may be called on the same ManagedLedgerImpl object. In that situation a deadlock can occur, although the exact mechanism has not been pinned down.
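
As a generic illustration of the hazard described above, the following minimal Java sketch shows one way that blocking on a CompletableFuture can stall a thread: the task that would complete the future is queued behind the task that is waiting for it. This is not claimed to be the mechanism behind the deadlock in this PR, which is not identified. The class name, the single-threaded executor standing in for an I/O thread, and the 200 ms timeout (used so the demo terminates instead of hanging) are all assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BlockingOnFutureDemo {

    // Returns "timed out" when the blocking wait starves the task
    // that would have completed the future.
    public static String blockingLookup() {
        // Hypothetical stand-in for a single I/O thread.
        ExecutorService ioThread = Executors.newSingleThreadExecutor();
        try {
            Future<String> outcome = ioThread.submit(() -> {
                CompletableFuture<Long> position = new CompletableFuture<>();
                // The completion is queued on the same thread, behind this task.
                ioThread.execute(() -> position.complete(42L));
                try {
                    // Blocking here keeps the only thread busy, so the
                    // completion task cannot run until this task returns.
                    return "position " + position.get(200, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    return "timed out";
                }
            });
            return outcome.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            ioThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(blockingLookup());
    }
}
```

Without the timeout, the blocking call in this sketch would never return, because the only thread able to complete the future is the one waiting on it.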

Modifications

Call ManagedLedgerImpl#asyncFindPosition directly instead of going through the synchronous MessageIdUtils#getPositionForOffset.
Cache the future of Pair<ManagedCursor, Long> in KafkaTopicConsumerManager instead of the Pair<ManagedCursor, Long> itself. Then, in MessageFetchContext, use the ManagedCursor in the future's callback.
Add a unit test to ensure that, after this change, the non-durable cursor is still created only once for each consumer. This condition is easy to break in a refactor like this PR, and breaking it may cause a performance problem because a cursor might be created each time a FETCH request arrives. The test guards against that case.
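
The idea of caching a future and consuming it in a callback can be sketched as follows. This is a simplified, hypothetical model and not the code in this PR: CursorFutureCache, CursorAndOffset, asyncCreateCursor and the creation counter are invented names, a String stands in for ManagedCursor, and the asynchronous lookup is simulated with CompletableFuture.supplyAsync.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class CursorFutureCache {

    // Hypothetical stand-in for Pair<ManagedCursor, Long>.
    public static final class CursorAndOffset {
        public final String cursorName;
        public final long offset;

        CursorAndOffset(String cursorName, long offset) {
            this.cursorName = cursorName;
            this.offset = offset;
        }
    }

    // The cache holds futures, so callers never block while a cursor is being created.
    private final Map<Long, CompletableFuture<CursorAndOffset>> cursors = new ConcurrentHashMap<>();
    private final AtomicInteger cursorsCreated = new AtomicInteger();

    // Hypothetical asynchronous creation, simulated on the common pool.
    private CompletableFuture<CursorAndOffset> asyncCreateCursor(long offset) {
        cursorsCreated.incrementAndGet();
        return CompletableFuture.supplyAsync(() -> new CursorAndOffset("cursor-" + offset, offset));
    }

    // computeIfAbsent invokes the creation function at most once per key.
    public CompletableFuture<CursorAndOffset> getCursor(long offset) {
        return cursors.computeIfAbsent(offset, this::asyncCreateCursor);
    }

    public int cursorsCreated() {
        return cursorsCreated.get();
    }

    public static void main(String[] args) {
        CursorFutureCache cache = new CursorFutureCache();
        // The caller registers a callback instead of calling get() or join().
        CompletableFuture<Void> done = cache.getCursor(5L)
                .thenAccept(c -> System.out.println("read from " + c.cursorName));
        cache.getCursor(5L); // second request for the same offset reuses the cached future
        done.join();         // only so that the demo waits before exiting
        System.out.println("cursors created: " + cache.cursorsCreated());
    }
}
```

The counter in this sketch mirrors the intent of the unit test described above: repeated requests for the same key should not trigger a second cursor creation.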

hangc0276

good job

BewareMyPower requested a review from jiazhai as a code owner June 8, 2021 16:50

Fix deadlock caused by synchronous asyncFindPosition

b5f76bb

BewareMyPower force-pushed the bewaremypower/fix-find-position-deadlock branch from 424d369 to b5f76bb Compare June 8, 2021 16:53

BewareMyPower self-assigned this Jun 8, 2021

BewareMyPower added the type/bug label Jun 8, 2021

Add test to ensure cursor is only created once for a consumer

bc5cb24

BewareMyPower changed the title ~~[WIP] Fix deadlock caused by synchronous asyncFindPosition~~ Fix deadlock caused by synchronous asyncFindPosition Jun 9, 2021

BewareMyPower requested review from aloyszhang, dockerzhang and hangc0276 June 9, 2021 03:59

BewareMyPower mentioned this pull request Jun 9, 2021

fix direct memory leak bug #542

Merged

jiazhai approved these changes Jun 9, 2021

jiazhai merged commit 05b778d into streamnative:master Jun 9, 2021

BewareMyPower deleted the bewaremypower/fix-find-position-deadlock branch June 9, 2021 07:16

hangc0276 approved these changes Jun 9, 2021

BewareMyPower mentioned this pull request Sep 9, 2021

[FEATURE] Add KopEventManager to repair metadata consistency of cluster and consistency of consumer group status #712

Merged

BewareMyPower mentioned this pull request Nov 25, 2021

Analyze (and remove) the usages of get() or join() method of a CompletableFuture #932

Open
