Skip to content

Fix #1749: leaked transient X-lock [dev]#1754

Merged
badrishc merged 2 commits into
devfrom
badrishc/fix-1749-dev
Apr 30, 2026
Merged

Fix #1749: leaked transient X-lock [dev]#1754
badrishc merged 2 commits into
devfrom
badrishc/fix-1749-dev

Conversation

@badrishc

Copy link
Copy Markdown
Collaborator

When a SET / RMW / DELETE produces an AOF entry larger than AofPageSize, TsavoriteLog.ValidateAllocatedLength throws TsavoriteException("Entry does not fit on page") from inside the PostUpsertOperation / PostRMWOperation / PostDeleteOperation callback. The exception unwinds the finally block in InternalUpsert / InternalRMW / InternalDelete (and ContinuePendingRMW) before EphemeralXUnlock can run, leaving the hash bucket's ephemeral exclusive lock held forever. Any subsequent op on the same bucket then spins indefinitely in a RETRY_LATER loop, pinning the server CPU at 100% until restart.

Wrap the Post*Operation call in a nested try/finally so EphemeralXUnlock always runs even on exception. The original TsavoriteException continues to propagate, preserving the observed behaviour for the user (connection closed) while leaving the server responsive.

While here, drop the dead 'latchOperation' / 'LatchRelease' machinery from InternalUpsert / InternalRMW / InternalDelete (the LatchOperation enum and the ref parameter on CheckCPRConsistency*): the CheckCPRConsistency* helpers never wrote to the ref parameter ("Now we no longer need to do the bucket latching" comment confirms it), so the switch in the LatchRelease block was unreachable. Renamed the empty LatchRelease label to Done for clarity.

Adds a regression test (OversizedAofEntryDoesNotHangServer) plus an aofPageSize parameter on TestUtils.CreateGarnetServer so the test can trigger the oversize path with a 4 KB AOF page.

When a SET / RMW / DELETE produces an AOF entry larger than AofPageSize,
TsavoriteLog.ValidateAllocatedLength throws TsavoriteException("Entry
does not fit on page") from inside the PostUpsertOperation /
PostRMWOperation / PostDeleteOperation callback. The exception unwinds
the finally block in InternalUpsert / InternalRMW / InternalDelete (and
ContinuePendingRMW) before EphemeralXUnlock can run, leaving the hash
bucket's ephemeral exclusive lock held forever. Any subsequent op on the
same bucket then spins indefinitely in a RETRY_LATER loop, pinning the
server CPU at 100% until restart.

Wrap the Post*Operation call in a nested try/finally so EphemeralXUnlock
always runs even on exception. The original TsavoriteException continues
to propagate, preserving the observed behaviour for the user (connection
closed) while leaving the server responsive.

While here, drop the dead 'latchOperation' / 'LatchRelease' machinery
from InternalUpsert / InternalRMW / InternalDelete (the LatchOperation
enum and the ref parameter on CheckCPRConsistency*): the
CheckCPRConsistency* helpers never wrote to the ref parameter ("Now we
no longer need to do the bucket latching" comment confirms it), so the
switch in the LatchRelease block was unreachable. Renamed the empty
LatchRelease label to Done for clarity.

Adds a regression test (OversizedAofEntryDoesNotHangServer) plus an
aofPageSize parameter on TestUtils.CreateGarnetServer so the test can
trigger the oversize path with a 4 KB AOF page.

Co-authored-by: Copilot <[email protected]>
Copilot AI review requested due to automatic review settings April 29, 2026 23:29

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a Tsavorite ephemeral X-lock leak that could permanently wedge a hash bucket (and spin CPU) when AOF enqueue throws for entries larger than the configured AOF page size, keeping the server responsive while preserving the existing “connection dropped” user-visible behavior.

Changes:

  • Ensure EphemeralXUnlock always runs by wrapping Post*Operation callbacks in an inner try/finally for Upsert/RMW/Delete and pending RMW completion paths.
  • Remove dead bucket-latching (LatchOperation / LatchRelease) plumbing and simplify control flow (Done label).
  • Add a regression test that forces the oversize-AOF-entry path via a small AofPageSize, plus a test utility hook to configure it.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Show a summary per file
File Description
test/Garnet.test/TestUtils.cs Adds optional aofPageSize parameter to configure GarnetServerOptions.AofPageSize for tests.
test/Garnet.test/RespAofTests.cs Adds regression test OversizedAofEntryDoesNotHangServer for issue #1749.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalUpsert.cs Guarantees ephemeral X-unlock even if PostUpsertOperation throws; removes dead latch plumbing.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalRMW.cs Guarantees ephemeral X-unlock even if PostRMWOperation throws; removes dead latch plumbing.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalDelete.cs Guarantees ephemeral X-unlock even if PostDeleteOperation throws; removes dead latch plumbing.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/Helpers.cs Removes unused LatchOperation enum.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/ContinuePending.cs Guarantees ephemeral X-unlock even if PostRMWOperation throws during pending completion.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@badrishc badrishc merged commit d0a93d7 into dev Apr 30, 2026
2 checks passed
@badrishc badrishc deleted the badrishc/fix-1749-dev branch April 30, 2026 02:20
chenhao-ye pushed a commit to chenhao-ye/garnet that referenced this pull request May 13, 2026
…icrosoft#1754)

When a SET / RMW / DELETE produces an AOF entry larger than AofPageSize,
TsavoriteLog.ValidateAllocatedLength throws TsavoriteException("Entry
does not fit on page") from inside the PostUpsertOperation /
PostRMWOperation / PostDeleteOperation callback. The exception unwinds
the finally block in InternalUpsert / InternalRMW / InternalDelete (and
ContinuePendingRMW) before EphemeralXUnlock can run, leaving the hash
bucket's ephemeral exclusive lock held forever. Any subsequent op on the
same bucket then spins indefinitely in a RETRY_LATER loop, pinning the
server CPU at 100% until restart.

Wrap the Post*Operation call in a nested try/finally so EphemeralXUnlock
always runs even on exception. The original TsavoriteException continues
to propagate, preserving the observed behaviour for the user (connection
closed) while leaving the server responsive.

While here, drop the dead 'latchOperation' / 'LatchRelease' machinery
from InternalUpsert / InternalRMW / InternalDelete (the LatchOperation
enum and the ref parameter on CheckCPRConsistency*): the
CheckCPRConsistency* helpers never wrote to the ref parameter ("Now we
no longer need to do the bucket latching" comment confirms it), so the
switch in the LatchRelease block was unreachable. Renamed the empty
LatchRelease label to Done for clarity.

Adds a regression test (OversizedAofEntryDoesNotHangServer) plus an
aofPageSize parameter on TestUtils.CreateGarnetServer so the test can
trigger the oversize path with a 4 KB AOF page.

Co-authored-by: Copilot <[email protected]>
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 29, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants