Skip to content

Fix use-after-free in GetAllocationForRetry when page is evicted#1716

Merged
badrishc merged 2 commits into
devfrom
badrishc/fix-get-allocation-retry
Apr 20, 2026
Merged

Fix use-after-free in GetAllocationForRetry when page is evicted#1716
badrishc merged 2 commits into
devfrom
badrishc/fix-get-allocation-retry

Conversation

@badrishc

Copy link
Copy Markdown
Collaborator

When a retry record's page is evicted between SaveAllocationForRetry and GetAllocationForRetry, the Fail block unconditionally created a LogRecord from the stale address and called OnDispose/ClearHeapFields. This accessed freed or recycled memory, causing either:

  • Crashes in GetFillerLength (accessing invalid memory)
  • Corruption of live records on recycled pages (e.g. clearing ValueIsObject bit, leading to WRONGTYPE errors on concurrent object operations)

The fix guards the OnDispose call with a HeadAddress check. When the record is below HeadAddress, its heap fields were already cleared by OnDispose before SaveAllocationForRetry (in CAS failure paths) or were never written (in pre-write paths), so no cleanup is needed.

When a retry record's page is evicted between SaveAllocationForRetry and
GetAllocationForRetry, the Fail block unconditionally created a LogRecord
from the stale address and called OnDispose/ClearHeapFields. This accessed
freed or recycled memory, causing either:
- Crashes in GetFillerLength (accessing invalid memory)
- Corruption of live records on recycled pages (e.g. clearing ValueIsObject
  bit, leading to WRONGTYPE errors on concurrent object operations)

The fix guards the OnDispose call with a HeadAddress check. When the record
is below HeadAddress, its heap fields were already cleared by OnDispose
before SaveAllocationForRetry (in CAS failure paths) or were never written
(in pre-write paths), so no cleanup is needed.

Co-authored-by: Copilot <[email protected]>
Copilot AI review requested due to automatic review settings April 20, 2026 01:34

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a use-after-free hazard in Tsavorite’s retry-allocation reuse path when the allocated record’s page is evicted between saving the allocation and attempting to reuse/cleanup it.

Changes:

  • Guarded OnDispose in GetAllocationForRetry so cleanup only happens when the retry record is still in-memory (newLogicalAddress >= HeadAddress).
  • Added an explanatory comment documenting the eviction/reuse failure mode and why disposal must be skipped when below HeadAddress.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@badrishc badrishc force-pushed the badrishc/fix-get-allocation-retry branch 6 times, most recently from 9de4ecd to 8c3942e Compare April 20, 2026 15:46
Root cause: Task.Delay(1) in the popper retry loop creates threadpool
scheduling pressure — each timer callback queues work back onto the pool.
With 20 concurrent tasks on small CI runners (2-4 cores), this causes a
scheduling storm where poppers monopolize the pool, starving pushers
and the Task.Delay-based timeout (which also needs pool threads).

Fix:
- Replace Task.Delay(1) with Thread.Sleep(1) in the retry loop to avoid
  threadpool scheduling pressure
- Use TaskCreationOptions.LongRunning for the 5-minute safety timeout
  so it runs on a dedicated thread independent of the pool

Verified: Python client stress test (50 runs at 10/100, 4 runs at 100/500
with 200 connections) confirmed the server is correct — the hang was
purely a test-side threadpool issue.

Co-authored-by: Copilot <[email protected]>
@badrishc badrishc force-pushed the badrishc/fix-get-allocation-retry branch from 8c3942e to 8d12481 Compare April 20, 2026 15:59
@badrishc badrishc merged commit 5036d8b into dev Apr 20, 2026
29 of 30 checks passed
@badrishc badrishc deleted the badrishc/fix-get-allocation-retry branch April 20, 2026 17:01
badrishc added a commit that referenced this pull request Apr 20, 2026
Integrates 4 new dev commits:
- #1716: Fix use-after-free in GetAllocationForRetry when page is evicted
- #1712: Internalize heap-size tracking into Tsavorite (IRecordTriggers refactor)
- #1711: Fix three broken/flaky Tsavorite and Garnet tests
- #1710: Fix native linux device

Key changes from #1712 affecting bf-tree:
- IRecordDisposer replaced by IRecordTriggers with OnEvict/OnFlush/OnDispose/OnDiskRead
- GarnetRecordDisposer replaced by GarnetRecordTriggers
- OnEvict now takes EvictionSource parameter (MainLog vs ReadCache)
- Heap-size tracking internalized into Tsavorite

Resolution: merged with -X theirs, then restored bf-tree GarnetRecordTriggers
implementation (OnEvict for BfTree disposal, OnFlush for snapshot, OnDiskRead
for handle invalidation, OnDispose for delete cleanup, OnCheckpoint/OnRecovery
for checkpoint lifecycle). Updated OnEvict signature to accept EvictionSource.
Fixed unnecessary using in ReadMethods.cs.

All tests pass: 55 RangeIndex, 345 RespTests, 34 MultiDatabase.

Co-authored-by: Copilot <[email protected]>
badrishc added a commit that referenced this pull request Apr 20, 2026
Integrates 4 new dev commits:
- #1716: Fix use-after-free in GetAllocationForRetry when page is evicted
- #1712: Internalize heap-size tracking into Tsavorite (IRecordTriggers refactor)
- #1711: Fix three broken/flaky Tsavorite and Garnet tests
- #1710: Fix native linux device

Key changes from #1712 affecting bf-tree:
- IRecordDisposer replaced by IRecordTriggers with OnEvict/OnFlush/OnDispose/OnDiskRead
- GarnetRecordDisposer replaced by GarnetRecordTriggers
- OnEvict now takes EvictionSource parameter (MainLog vs ReadCache)
- Heap-size tracking internalized into Tsavorite via logSizeTracker
- OnDisposeValueObject removed; disposal routed through OnDispose

Resolution: took dev's Tsavorite src/test wholesale, then restored bf-tree
additions: CheckpointTrigger enum, OnRecovery/OnCheckpoint/OnRecoverySnapshotRead
hooks in IRecordTriggers/IStoreFunctions/StoreFunctions, WrongType in UpsertAction,
HybridLogCheckpointSMTask and Recovery call sites. Updated GarnetRecordTriggers
OnEvict to accept EvictionSource parameter. Re-added BfTreeInterop project ref
and WRONGTYPE guard in UpsertMethods.

All tests pass: 55 RangeIndex (3 runs), 345 RespTests. Format clean.

Co-authored-by: Copilot <[email protected]>
badrishc added a commit that referenced this pull request Apr 20, 2026
Integrates 4 dev commits (#1716, #1712, #1711, #1710).

Key change: #1712 internalized heap-size tracking into Tsavorite and
refactored IRecordDisposer → IRecordTriggers with OnEvict(EvictionSource).

Resolution:
- IRecordTriggers: take dev's version (EvictionSource enum, OnEvict with
  EvictionSource param, OnDisposeDiskRecord, removed OnDisposeValueObject),
  add bf-tree checkpoint hooks (OnRecovery, OnCheckpoint, OnRecoverySnapshotRead)
- GarnetRecordTriggers: update OnEvict to accept EvictionSource; remove
  OnDisposeValueObject and heap-tracking from OnDispose (now internal to
  Tsavorite); keep bf-tree dispatch (OnFlush, OnEvict, OnDiskRead, OnDispose,
  OnCheckpoint, OnRecovery)
- GarnetServer: keep cacheSizeTracker + rangeIndexManager constructor
- RMWMethods PostCopyUpdater: keep ClearTreeHandle for RIPROMOTE
- Fix unnecessary using in ReadMethods.cs

All tests pass: 55 RangeIndex (3 runs), 345 RespTests, 4 CacheSizeTracker.
Format clean (Garnet + Tsavorite).

Co-authored-by: Copilot <[email protected]>
badrishc added a commit that referenced this pull request Apr 20, 2026
Integrates 4 dev commits (#1716, #1712, #1711, #1710).

Key change: #1712 internalized heap-size tracking into Tsavorite and
refactored IRecordDisposer → IRecordTriggers with OnEvict(EvictionSource).

Resolution:
- IRecordTriggers: take dev's version (EvictionSource enum, OnEvict with
  EvictionSource param, OnDisposeDiskRecord, removed OnDisposeValueObject),
  add bf-tree checkpoint hooks (OnRecovery, OnCheckpoint, OnRecoverySnapshotRead)
- GarnetRecordTriggers: update OnEvict to accept EvictionSource; remove
  OnDisposeValueObject and heap-tracking from OnDispose (now internal to
  Tsavorite); keep bf-tree dispatch (OnFlush, OnEvict, OnDiskRead, OnDispose,
  OnCheckpoint, OnRecovery)
- GarnetServer: keep cacheSizeTracker + rangeIndexManager constructor
- RMWMethods PostCopyUpdater: keep ClearTreeHandle for RIPROMOTE
- Fix unnecessary using in ReadMethods.cs

All tests pass: 55 RangeIndex (3 runs), 345 RespTests, 4 CacheSizeTracker.
Format clean (Garnet + Tsavorite).

Co-authored-by: Copilot <[email protected]>
badrishc added a commit that referenced this pull request Apr 20, 2026
Integrates 4 dev commits (#1716, #1712, #1711, #1710).

Key change: #1712 internalized heap-size tracking into Tsavorite and
refactored IRecordDisposer → IRecordTriggers with OnEvict(EvictionSource).

Resolution:
- IRecordTriggers: take dev's version (EvictionSource enum, OnEvict with
  EvictionSource param, OnDisposeDiskRecord, removed OnDisposeValueObject),
  add bf-tree checkpoint hooks (OnRecovery, OnCheckpoint, OnRecoverySnapshotRead)
- GarnetRecordTriggers: update OnEvict to accept EvictionSource; remove
  OnDisposeValueObject and heap-tracking from OnDispose (now internal to
  Tsavorite); keep bf-tree dispatch (OnFlush, OnEvict, OnDiskRead, OnDispose,
  OnCheckpoint, OnRecovery)
- GarnetServer: keep cacheSizeTracker + rangeIndexManager constructor
- RMWMethods PostCopyUpdater: keep ClearTreeHandle for RIPROMOTE
- Fix unnecessary using in ReadMethods.cs

All tests pass: 55 RangeIndex (3 runs), 345 RespTests, 4 CacheSizeTracker.
Format clean (Garnet + Tsavorite).

Co-authored-by: Copilot <[email protected]>
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 20, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants