Fix #1749: leaked transient X-lock [dev]#1754
Merged
Merged
Conversation
When a SET / RMW / DELETE produces an AOF entry larger than AofPageSize,
TsavoriteLog.ValidateAllocatedLength throws TsavoriteException("Entry
does not fit on page") from inside the PostUpsertOperation /
PostRMWOperation / PostDeleteOperation callback. The exception unwinds
the finally block in InternalUpsert / InternalRMW / InternalDelete (and
ContinuePendingRMW) before EphemeralXUnlock can run, leaving the hash
bucket's ephemeral exclusive lock held forever. Any subsequent op on the
same bucket then spins indefinitely in a RETRY_LATER loop, pinning the
server CPU at 100% until restart.
Wrap the Post*Operation call in a nested try/finally so EphemeralXUnlock
always runs even on exception. The original TsavoriteException continues
to propagate, preserving the observed behaviour for the user (connection
closed) while leaving the server responsive.
While here, drop the dead 'latchOperation' / 'LatchRelease' machinery
from InternalUpsert / InternalRMW / InternalDelete (the LatchOperation
enum and the ref parameter on CheckCPRConsistency*): the
CheckCPRConsistency* helpers never wrote to the ref parameter ("Now we
no longer need to do the bucket latching" comment confirms it), so the
switch in the LatchRelease block was unreachable. Renamed the empty
LatchRelease label to Done for clarity.
Adds a regression test (OversizedAofEntryDoesNotHangServer) plus an
aofPageSize parameter on TestUtils.CreateGarnetServer so the test can
trigger the oversize path with a 4 KB AOF page.
Co-authored-by: Copilot <[email protected]>
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a Tsavorite ephemeral X-lock leak that could permanently wedge a hash bucket (and spin CPU) when AOF enqueue throws for entries larger than the configured AOF page size, keeping the server responsive while preserving the existing “connection dropped” user-visible behavior.
Changes:
- Ensure
EphemeralXUnlockalways runs by wrappingPost*Operationcallbacks in an innertry/finallyfor Upsert/RMW/Delete and pending RMW completion paths. - Remove dead bucket-latching (
LatchOperation/LatchRelease) plumbing and simplify control flow (Donelabel). - Add a regression test that forces the oversize-AOF-entry path via a small
AofPageSize, plus a test utility hook to configure it.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| test/Garnet.test/TestUtils.cs | Adds optional aofPageSize parameter to configure GarnetServerOptions.AofPageSize for tests. |
| test/Garnet.test/RespAofTests.cs | Adds regression test OversizedAofEntryDoesNotHangServer for issue #1749. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalUpsert.cs | Guarantees ephemeral X-unlock even if PostUpsertOperation throws; removes dead latch plumbing. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalRMW.cs | Guarantees ephemeral X-unlock even if PostRMWOperation throws; removes dead latch plumbing. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalDelete.cs | Guarantees ephemeral X-unlock even if PostDeleteOperation throws; removes dead latch plumbing. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/Helpers.cs | Removes unused LatchOperation enum. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/ContinuePending.cs | Guarantees ephemeral X-unlock even if PostRMWOperation throws during pending completion. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
TedHartMS
approved these changes
Apr 30, 2026
chenhao-ye
pushed a commit
to chenhao-ye/garnet
that referenced
this pull request
May 13, 2026
…icrosoft#1754) When a SET / RMW / DELETE produces an AOF entry larger than AofPageSize, TsavoriteLog.ValidateAllocatedLength throws TsavoriteException("Entry does not fit on page") from inside the PostUpsertOperation / PostRMWOperation / PostDeleteOperation callback. The exception unwinds the finally block in InternalUpsert / InternalRMW / InternalDelete (and ContinuePendingRMW) before EphemeralXUnlock can run, leaving the hash bucket's ephemeral exclusive lock held forever. Any subsequent op on the same bucket then spins indefinitely in a RETRY_LATER loop, pinning the server CPU at 100% until restart. Wrap the Post*Operation call in a nested try/finally so EphemeralXUnlock always runs even on exception. The original TsavoriteException continues to propagate, preserving the observed behaviour for the user (connection closed) while leaving the server responsive. While here, drop the dead 'latchOperation' / 'LatchRelease' machinery from InternalUpsert / InternalRMW / InternalDelete (the LatchOperation enum and the ref parameter on CheckCPRConsistency*): the CheckCPRConsistency* helpers never wrote to the ref parameter ("Now we no longer need to do the bucket latching" comment confirms it), so the switch in the LatchRelease block was unreachable. Renamed the empty LatchRelease label to Done for clarity. Adds a regression test (OversizedAofEntryDoesNotHangServer) plus an aofPageSize parameter on TestUtils.CreateGarnetServer so the test can trigger the oversize path with a 4 KB AOF page. Co-authored-by: Copilot <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When a SET / RMW / DELETE produces an AOF entry larger than AofPageSize, TsavoriteLog.ValidateAllocatedLength throws TsavoriteException("Entry does not fit on page") from inside the PostUpsertOperation / PostRMWOperation / PostDeleteOperation callback. The exception unwinds the finally block in InternalUpsert / InternalRMW / InternalDelete (and ContinuePendingRMW) before EphemeralXUnlock can run, leaving the hash bucket's ephemeral exclusive lock held forever. Any subsequent op on the same bucket then spins indefinitely in a RETRY_LATER loop, pinning the server CPU at 100% until restart.
Wrap the Post*Operation call in a nested try/finally so EphemeralXUnlock always runs even on exception. The original TsavoriteException continues to propagate, preserving the observed behaviour for the user (connection closed) while leaving the server responsive.
While here, drop the dead 'latchOperation' / 'LatchRelease' machinery from InternalUpsert / InternalRMW / InternalDelete (the LatchOperation enum and the ref parameter on CheckCPRConsistency*): the CheckCPRConsistency* helpers never wrote to the ref parameter ("Now we no longer need to do the bucket latching" comment confirms it), so the switch in the LatchRelease block was unreachable. Renamed the empty LatchRelease label to Done for clarity.
Adds a regression test (OversizedAofEntryDoesNotHangServer) plus an aofPageSize parameter on TestUtils.CreateGarnetServer so the test can trigger the oversize path with a 4 KB AOF page.