Skip to content

[BFTree] Add RangeIndex cluster migration#1731

Merged
tiagonapoli merged 51 commits into
mainfrom
tiagonapoli/bftree-migration
Jun 17, 2026
Merged

[BFTree] Add RangeIndex cluster migration#1731
tiagonapoli merged 51 commits into
mainfrom
tiagonapoli/bftree-migration

Conversation

@tiagonapoli

@tiagonapoli tiagonapoli commented Apr 23, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds end-to-end cluster migration for RangeIndex keys, supporting both MIGRATE SLOTS and MIGRATE KEYS paths. RangeIndex keys are backed by a native BfTree whose on-disk state (data.bftree) lives outside Tsavorite — shipping just the 51-byte stub is insufficient. This change streams the entire BfTree snapshot file alongside the stub over the existing migration transport.

Architecture

  • Serializer (RangeIndexChunkedSerializer): Pure state machine over Span<byte> — no I/O. Takes file data as input via MoveNext(dest, fileData, out consumed).
  • Migration Reader (RangeIndexMigrationReader): Async wrapper that reads the snapshot file and feeds bytes to the serializer.
  • Deserializer (RangeIndexChunkedDeserializer): Sync state machine that writes received file data to a temp file, validates xxHash64 checksum, recovers the native BfTree, and publishes the stub to the store.
  • Factory (RangeIndexManager.SnapshotRangeIndexAndCreateReader): Snapshots the BfTree under an exclusive lock to a temp file, then creates the reader.

Wire format

Single MigrationRecordSpanType.SerializedRangeIndexStream (tag 4). Stream format across chunks:

[4B keyLen][key bytes][8B fileSize][file bytes][8B xxHash64][4B stubLen][stub]

Key and file bytes may span chunks; all other elements must fit within a single chunk.

SLOTS path

  1. Scan detects RI keys via RecordType == 2, captures to MigrateOperation.RangeIndexes
  2. After scan completes, MigrateRangeIndexKeysAsync runs a sketch-protected batch cycle:
    • Add all RI keys to sketch (INITIALIZING)
    • TRANSMITTING + epoch barrier — blocks writes during snapshot+transmit
    • Transmit each key via TransmitRangeIndexAsync
    • DELETING + epoch barrier — blocks reads+writes during delete
    • Delete each key
    • finally: clear sketch (unblocks clients)

KEYS path

  1. GetRangeIndexKeysForMigration discovers RI keys via RIGET
  2. TransmitKeysAsync skips RI keys (in rangeIndexKeysToIgnore)
  3. Each RI key transmitted via TransmitRangeIndexAsync, then marked in sketch for DeleteKeysAsync

Bug fixes

  • Remove RI+cluster startup guard (GarnetServer.cs)
  • Fix round-trip migration: Publish deletes existing data file before move
  • Fix Publish registration: accept InPlaceUpdated/CopyUpdated status
  • Fix serializer empty-buffer bug in FileData phase
  • TransmitRangeIndexAsync catches all exceptions (never throws)
  • Sketch protection for RI keys in SLOTS path (previously unprotected)

Tests

  • 26 unit tests for serializer/deserializer (round-trip, checksums, error states, buffer boundaries, chunk sizes)
  • 11 cluster integration tests: SingleBySlot, ByKeys, ManyBySlot, WhileModifying, MigrateBack, LargeTree (1KB/256KB chunks), ChunkSize variants, StressAsync (Explicit)

TODO

  • Vector Set index keys have the same sketch protection gap (documented in code)
  • AOF replication of migrated trees to destination replicas

@tiagonapoli tiagonapoli force-pushed the tiagonapoli/bftree-migration branch 6 times, most recently from 61a467b to 3a6e52f Compare May 5, 2026 17:15
@tiagonapoli tiagonapoli marked this pull request as ready for review May 5, 2026 17:26
Copilot AI review requested due to automatic review settings May 5, 2026 17:26
@tiagonapoli tiagonapoli force-pushed the tiagonapoli/bftree-migration branch from 3a6e52f to dc78c02 Compare May 5, 2026 17:36
@tiagonapoli tiagonapoli changed the title [BFTree] Add RangeIndex cluster migration Add RangeIndex cluster migration with sketch protection May 5, 2026
@tiagonapoli tiagonapoli changed the title Add RangeIndex cluster migration with sketch protection [BFTree] Add RangeIndex cluster migration May 5, 2026
@tiagonapoli tiagonapoli force-pushed the tiagonapoli/bftree-migration branch 3 times, most recently from 74564ec to 74a1e44 Compare May 5, 2026 17:53

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@tiagonapoli tiagonapoli force-pushed the tiagonapoli/bftree-migration branch 4 times, most recently from 5754d5d to ac06150 Compare May 5, 2026 22:50
@vazois

vazois commented May 6, 2026

Copy link
Copy Markdown
Contributor

Missing RangeIndex feature-enabled guards

This PR doesn't consistently check whether the RangeIndex feature is enabled (i.e., whether RangeIndexManager is non-null) before using it. Two spots are at risk of NullReferenceException:

1. Source side — MigrateSession.RangeIndex.cs:29

var rangeIndexManager = clusterProvider.storeWrapper.DefaultDatabase.RangeIndexManager;
// No null check before calling .SnapshotRangeIndexAndCreateReader(...)

The KEYS path correctly uses ?.GetRangeIndexKeysForMigration(...) with a fallback to an empty dictionary, but TransmitRangeIndexAsync assumes the manager is always available. If the SLOTS scan somehow discovers an RI key on a node where the feature is disabled, this will crash.

2. Receive side — RespClusterMigrateCommands.cs:181

if (!rangeIndexMigrationState.ProcessRecord(...))

rangeIndexMigrationState is set to null when RangeIndexManager is null (line 70 in ClusterSession.cs), but the dispatch at line 181 doesn't guard against that. If a receiving node has RangeIndex disabled and the sender transmits a SerializedRangeIndexStream record, this will throw.

Suggestion: Add a null check on the receive side (e.g., log a warning and skip/error the record), and consider a defensive null check in TransmitRangeIndexAsync as well.

Comment thread libs/cluster/Server/Migration/MigrateOperation.cs Outdated
Comment thread libs/cluster/Server/Migration/MigrateOperation.cs Outdated

@vazois vazois left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libs/cluster/Server/Migration/MigrateScanFunctions.cs:53 — maybe not for this PR, but can't we just get these values from an enum?

Comment thread libs/cluster/Server/Migration/MigrateSession.RangeIndex.cs
Comment thread libs/cluster/Server/Migration/MigrateSessionKeys.cs Outdated

@vazois vazois left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libs/cluster/Server/Migration/MigrateSessionKeys.cs:43 — seems really wasteful to maintain a separate dictionary just for skipping keys. We can potentially store the info for the key type inline and skip the key once we read it from the sketch list

Comment thread libs/cluster/Session/RangeIndexMigrationReceiveSession.cs
Comment thread libs/cluster/Session/RangeIndexMigrationReceiveSession.cs
@tiagonapoli tiagonapoli changed the base branch from dev to main May 21, 2026 15:03
Tiago Napoli and others added 18 commits June 15, 2026 13:31
An accidental undo removed the trailing 'return true;' after the ReceivingFileData try/catch, causing case fall-through (CS0163) and a non-returning path (CS0161). Restore it: when the file is not yet fully received, the case falls past the try and returns true to await more chunks.

Co-authored-by: Copilot <[email protected]>
Cover the two remaining deserializer Error paths that lacked tests: split file-size header (fewer than 8 bytes at WaitingForFileHeader) and a trailer chunk smaller than the 12-byte hash+stubLen header.

Co-authored-by: Copilot <[email protected]>
…pose throws

fileStream.Dispose() can throw; previously that would skip the temp snapshot file deletion (leaking it) and propagate out of Dispose. Wrap the stream disposal in try/catch (log + null in finally) so the temp-file cleanup always runs and Dispose never throws.

Co-authored-by: Copilot <[email protected]>
…Reader

ReadNextChunkAsync was untested. Add a ChunkDriver-parameterized round-trip body shared between the serializer test helper and the real async RangeIndexMigrationReader, covering single/multi-chunk, key-larger-than-chunk, small-chunk, and exact-fill shapes against both drivers. Plus reader-specific truncation (zero-byte read throws) and cancellation tests. Reuses one assertion body to avoid duplication.

Co-authored-by: Copilot <[email protected]>
Clarify how a caller knows serialization is done: loop while IsComplete is false, send each chunk; IsComplete flips true after the final bytes are framed, and a 0 return while not complete signals a too-small buffer. Matches the TransmitRangeIndexAsync caller.

Co-authored-by: Copilot <[email protected]>
A chunk smaller than the trailer ([8-byte hash][4-byte stubLen][stub]) can never be framed, so the stream would never complete. Add a public MinChunkSize const and throw ArgumentOutOfRangeException for any smaller chunkSize. Add Reader_ChunkSizeBelowMinimumThrows test.

Co-authored-by: Copilot <[email protected]>
The minimum chunk size is a property of the serializer's wire format (the trailer is the largest single-chunk framing element), not the reader. Define MinChunkSize on RangeIndexChunkedSerializer and have RangeIndexMigrationReader.MinChunkSize delegate to it.

Co-authored-by: Copilot <[email protected]>
Below-minimum chunkSize was already covered (Reader_ChunkSizeBelowMinimumThrows). Add RoundTrip_ExactMinChunkSize to verify the boundary: a chunk of exactly MinChunkSize successfully completes a stream (trailer just fits) through both the serializer helper and the real reader.

Co-authored-by: Copilot <[email protected]>
The forward-progress minimum (trailer size) applies to the destination buffer that ReadNextChunkAsync frames into, not to the chunkSize-sized internal file-read buffer. Move the MinChunkSize check into ReadNextChunkAsync (throws ArgumentException on a too-small destination), relax the constructor to require chunkSize > 0, and drop the misplaced RangeIndexMigrationReader.MinChunkSize (callers use RangeIndexChunkedSerializer.MinChunkSize). Replace the constructor test with constructor-non-positive and destination-too-small tests.

Co-authored-by: Copilot <[email protected]>
The earlier doc described the pre-validation behavior (return 0 on a too-small buffer; caller treats as error). Now that a too-small destination throws ArgumentException upfront, document that the method always returns a positive count while incomplete, and drop the stale internal-read-buffer mention from the guard comment.

Co-authored-by: Copilot <[email protected]>
The remarks duplicated the skip-on-existing-key behavior already evident from the method body and the PublishMigratedIndexResult enum docs.

Co-authored-by: Copilot <[email protected]>
Correct the migration section (E) and its rows in section G to reflect the shipped code: receive class/file is RangeIndexMigrationReceiveState in RangeIndexMigrationReceiveSession.cs; finalization is RangeIndexManager.PublishMigratedIndex (slot IMPORTING check + KeyExists gate + replaceOption skip outcomes via PublishMigratedIndexResult), not a deserializer Publish(); reader ctor takes tempFilePath; KEYS path discovery returns keys only (no stub, TOCTOU) and skips via the rangeIndexKeysToMigrate set; single SerializedRangeIndexStream=4 wire tag; drop stale aspirational names (HandleMigratedRangeIndexChunk/Stub, MigrateSessionRangeIndex.cs, RangeIndexSnapshotChunk/RangeIndexStub). Changes remain scoped to the migration section.

Co-authored-by: Copilot <[email protected]>
Replace the verbose <remarks> block on KeyExists with a one-line comment at the return explaining the Found/IsWrongType check.

Co-authored-by: Copilot <[email protected]>
Cover what happens to a migrated RangeIndex tree across the destination
node's storage lifecycle: checkpoint/flush of a migrated-then-modified
tree, checkpoint + process restart recovery, delete (and recreate at the
same key), back-migration round-trip + restart, and restart-then-continue.

Co-authored-by: Copilot <[email protected]>
PublishMigratedIndex now treats only a Created RICREATE result as success;
an InPlaceUpdated/CopyUpdated result means a concurrent write raced the
publish (the KeyExists gate already proved nothing existed), so the
recovered BfTree is discarded with a warning. Adds TODO(RI) notes for
handling the IMPORTING-state write race, AOF propagation of the migrated
file, MIGRATE REPLACE support, and checkpoint-vs-snapshot interaction.

Co-authored-by: Copilot <[email protected]>
…rd branch

Rework the IMPORTING-state race TODO for clarity, and split the
non-Created RICREATE handling: an InPlaceUpdated/CopyUpdated result (a
concurrent ASKING write that raced the publish) is logged but the bfTree
is left intact since the updated record may now reference it; any other
status logs and disposes the unregistered tree.

Co-authored-by: Copilot <[email protected]>
Tiago Napoli and others added 3 commits June 16, 2026 11:41
Removes four cluster-migration tests that crash the test host via a native
bf-tree panic (mini_page_op / circular_buffer assertion):

- ClusterMigrateRangeIndexWhileModifyingAsync
- ClusterMigrateRangeIndexWhileReadingAndWritingAsync
- ClusterMigrateRangeIndexSlowTransmitWithConcurrentOpsAsync
  (drop-vs-insert use-after-free: the deferred bftree_drop on the migration
  delete races a concurrent InsertByPtr that runs under the shared lock but
  outside epoch protection)
- ClusterMigrateRangeIndexLargeTree
  (concurrency-independent: snapshotting/dropping a large multi-page tree
  during migration crashes)

These are preserved (marked [Explicit]) in the RangeIndex crash/recovery PR
so the bugs stay tracked. The remaining migration suite now runs green.

Co-authored-by: Copilot <[email protected]>
…nk count

The test waited on CurrentChunkCount==1 then immediately asserted the temp
snapshot file exists. But OnChunkReceived bumps the chunk count just BEFORE
ProcessChunk creates the file, so under load the main thread could observe
the count and check for the file in that gap, seeing 0. Wait for the temp
file to appear (created by ProcessChunk, immediately before the pause point)
instead, which reliably proves the worker is parked in-flight.

Co-authored-by: Copilot <[email protected]>
Tiago Napoli and others added 2 commits June 16, 2026 22:20
The cluster test step ran dotnet test with no dump collection, so a native
test-host crash (e.g. the RangeIndex bf-tree panic/abort that aborts the run)
left no dump to analyze. Add --blame-crash (full) and --blame-hang (30m, full)
so a crash or hang produces a dump in the results directory, which the
existing always() upload-artifact step publishes. --blame-crash wires up the
runtime createdump handler, capturing native SIGABRT crashes too.

Co-authored-by: Copilot <[email protected]>
@tiagonapoli tiagonapoli merged commit 39fd9d3 into main Jun 17, 2026
300 of 301 checks passed
@tiagonapoli tiagonapoli deleted the tiagonapoli/bftree-migration branch June 17, 2026 14:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants