Fix use-after-free in GetAllocationForRetry when page is evicted#1716
Merged
Conversation
When a retry record's page is evicted between SaveAllocationForRetry and GetAllocationForRetry, the Fail block unconditionally created a LogRecord from the stale address and called OnDispose/ClearHeapFields. This accessed freed or recycled memory, causing either: - Crashes in GetFillerLength (accessing invalid memory) - Corruption of live records on recycled pages (e.g. clearing ValueIsObject bit, leading to WRONGTYPE errors on concurrent object operations) The fix guards the OnDispose call with a HeadAddress check. When the record is below HeadAddress, its heap fields were already cleared by OnDispose before SaveAllocationForRetry (in CAS failure paths) or were never written (in pre-write paths), so no cleanup is needed. Co-authored-by: Copilot <[email protected]>
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a use-after-free hazard in Tsavorite’s retry-allocation reuse path when the allocated record’s page is evicted between saving the allocation and attempting to reuse/cleanup it.
Changes:
- Guarded
OnDisposeinGetAllocationForRetryso cleanup only happens when the retry record is still in-memory (newLogicalAddress >= HeadAddress). - Added an explanatory comment documenting the eviction/reuse failure mode and why disposal must be skipped when below
HeadAddress.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
9de4ecd to
8c3942e
Compare
Root cause: Task.Delay(1) in the popper retry loop creates threadpool scheduling pressure — each timer callback queues work back onto the pool. With 20 concurrent tasks on small CI runners (2-4 cores), this causes a scheduling storm where poppers monopolize the pool, starving pushers and the Task.Delay-based timeout (which also needs pool threads). Fix: - Replace Task.Delay(1) with Thread.Sleep(1) in the retry loop to avoid threadpool scheduling pressure - Use TaskCreationOptions.LongRunning for the 5-minute safety timeout so it runs on a dedicated thread independent of the pool Verified: Python client stress test (50 runs at 10/100, 4 runs at 100/500 with 200 connections) confirmed the server is correct — the hang was purely a test-side threadpool issue. Co-authored-by: Copilot <[email protected]>
8c3942e to
8d12481
Compare
badrishc
added a commit
that referenced
this pull request
Apr 20, 2026
Integrates 4 new dev commits: - #1716: Fix use-after-free in GetAllocationForRetry when page is evicted - #1712: Internalize heap-size tracking into Tsavorite (IRecordTriggers refactor) - #1711: Fix three broken/flaky Tsavorite and Garnet tests - #1710: Fix native linux device Key changes from #1712 affecting bf-tree: - IRecordDisposer replaced by IRecordTriggers with OnEvict/OnFlush/OnDispose/OnDiskRead - GarnetRecordDisposer replaced by GarnetRecordTriggers - OnEvict now takes EvictionSource parameter (MainLog vs ReadCache) - Heap-size tracking internalized into Tsavorite Resolution: merged with -X theirs, then restored bf-tree GarnetRecordTriggers implementation (OnEvict for BfTree disposal, OnFlush for snapshot, OnDiskRead for handle invalidation, OnDispose for delete cleanup, OnCheckpoint/OnRecovery for checkpoint lifecycle). Updated OnEvict signature to accept EvictionSource. Fixed unnecessary using in ReadMethods.cs. All tests pass: 55 RangeIndex, 345 RespTests, 34 MultiDatabase. Co-authored-by: Copilot <[email protected]>
badrishc
added a commit
that referenced
this pull request
Apr 20, 2026
Integrates 4 new dev commits: - #1716: Fix use-after-free in GetAllocationForRetry when page is evicted - #1712: Internalize heap-size tracking into Tsavorite (IRecordTriggers refactor) - #1711: Fix three broken/flaky Tsavorite and Garnet tests - #1710: Fix native linux device Key changes from #1712 affecting bf-tree: - IRecordDisposer replaced by IRecordTriggers with OnEvict/OnFlush/OnDispose/OnDiskRead - GarnetRecordDisposer replaced by GarnetRecordTriggers - OnEvict now takes EvictionSource parameter (MainLog vs ReadCache) - Heap-size tracking internalized into Tsavorite via logSizeTracker - OnDisposeValueObject removed; disposal routed through OnDispose Resolution: took dev's Tsavorite src/test wholesale, then restored bf-tree additions: CheckpointTrigger enum, OnRecovery/OnCheckpoint/OnRecoverySnapshotRead hooks in IRecordTriggers/IStoreFunctions/StoreFunctions, WrongType in UpsertAction, HybridLogCheckpointSMTask and Recovery call sites. Updated GarnetRecordTriggers OnEvict to accept EvictionSource parameter. Re-added BfTreeInterop project ref and WRONGTYPE guard in UpsertMethods. All tests pass: 55 RangeIndex (3 runs), 345 RespTests. Format clean. Co-authored-by: Copilot <[email protected]>
badrishc
added a commit
that referenced
this pull request
Apr 20, 2026
Integrates 4 dev commits (#1716, #1712, #1711, #1710). Key change: #1712 internalized heap-size tracking into Tsavorite and refactored IRecordDisposer → IRecordTriggers with OnEvict(EvictionSource). Resolution: - IRecordTriggers: take dev's version (EvictionSource enum, OnEvict with EvictionSource param, OnDisposeDiskRecord, removed OnDisposeValueObject), add bf-tree checkpoint hooks (OnRecovery, OnCheckpoint, OnRecoverySnapshotRead) - GarnetRecordTriggers: update OnEvict to accept EvictionSource; remove OnDisposeValueObject and heap-tracking from OnDispose (now internal to Tsavorite); keep bf-tree dispatch (OnFlush, OnEvict, OnDiskRead, OnDispose, OnCheckpoint, OnRecovery) - GarnetServer: keep cacheSizeTracker + rangeIndexManager constructor - RMWMethods PostCopyUpdater: keep ClearTreeHandle for RIPROMOTE - Fix unnecessary using in ReadMethods.cs All tests pass: 55 RangeIndex (3 runs), 345 RespTests, 4 CacheSizeTracker. Format clean (Garnet + Tsavorite). Co-authored-by: Copilot <[email protected]>
badrishc
added a commit
that referenced
this pull request
Apr 20, 2026
Integrates 4 dev commits (#1716, #1712, #1711, #1710). Key change: #1712 internalized heap-size tracking into Tsavorite and refactored IRecordDisposer → IRecordTriggers with OnEvict(EvictionSource). Resolution: - IRecordTriggers: take dev's version (EvictionSource enum, OnEvict with EvictionSource param, OnDisposeDiskRecord, removed OnDisposeValueObject), add bf-tree checkpoint hooks (OnRecovery, OnCheckpoint, OnRecoverySnapshotRead) - GarnetRecordTriggers: update OnEvict to accept EvictionSource; remove OnDisposeValueObject and heap-tracking from OnDispose (now internal to Tsavorite); keep bf-tree dispatch (OnFlush, OnEvict, OnDiskRead, OnDispose, OnCheckpoint, OnRecovery) - GarnetServer: keep cacheSizeTracker + rangeIndexManager constructor - RMWMethods PostCopyUpdater: keep ClearTreeHandle for RIPROMOTE - Fix unnecessary using in ReadMethods.cs All tests pass: 55 RangeIndex (3 runs), 345 RespTests, 4 CacheSizeTracker. Format clean (Garnet + Tsavorite). Co-authored-by: Copilot <[email protected]>
badrishc
added a commit
that referenced
this pull request
Apr 20, 2026
Integrates 4 dev commits (#1716, #1712, #1711, #1710). Key change: #1712 internalized heap-size tracking into Tsavorite and refactored IRecordDisposer → IRecordTriggers with OnEvict(EvictionSource). Resolution: - IRecordTriggers: take dev's version (EvictionSource enum, OnEvict with EvictionSource param, OnDisposeDiskRecord, removed OnDisposeValueObject), add bf-tree checkpoint hooks (OnRecovery, OnCheckpoint, OnRecoverySnapshotRead) - GarnetRecordTriggers: update OnEvict to accept EvictionSource; remove OnDisposeValueObject and heap-tracking from OnDispose (now internal to Tsavorite); keep bf-tree dispatch (OnFlush, OnEvict, OnDiskRead, OnDispose, OnCheckpoint, OnRecovery) - GarnetServer: keep cacheSizeTracker + rangeIndexManager constructor - RMWMethods PostCopyUpdater: keep ClearTreeHandle for RIPROMOTE - Fix unnecessary using in ReadMethods.cs All tests pass: 55 RangeIndex (3 runs), 345 RespTests, 4 CacheSizeTracker. Format clean (Garnet + Tsavorite). Co-authored-by: Copilot <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When a retry record's page is evicted between SaveAllocationForRetry and GetAllocationForRetry, the Fail block unconditionally created a LogRecord from the stale address and called OnDispose/ClearHeapFields. This accessed freed or recycled memory, causing either:
The fix guards the OnDispose call with a HeadAddress check. When the record is below HeadAddress, its heap fields were already cleared by OnDispose before SaveAllocationForRetry (in CAS failure paths) or were never written (in pre-write paths), so no cleanup is needed.