More IO optimizations by badrishc · Pull Request #1867 · microsoft/garnet

badrishc · 2026-06-09T18:24:50Z

No description provided.

Copilot

Pull request overview

This PR restructures Tsavorite’s pending-IO pipeline to eliminate the AsyncGetFromDiskResult wrapper and the per-session ioPendingRequests dictionary, instead carrying PendingContext directly on a pooled per-session pending-op object that flows through the device callback and ready-queue. It also enhances the LocalMemoryDevice to support an inline-completion mode (parallelism = 0) and exposes ring-capacity tuning through Devices.CreateLogDevice, with benchmark flags to exercise these modes.

Changes:

Replace wrapper + dictionary-based pending-IO tracking with a single pooled pending-op object (PendingIoContext) that carries both device-facing fields and the typed PendingContext.
Move disk-chain verification/reissue to a shared helper (TryVerifyOrReissuePendingRead) and (for async ops) perform verification on the run-thread drain rather than the completion thread.
Add LocalMemoryDevice inline completion and configurable ring capacity; update KV benchmark options to support these and map throttle intent to ring sizing.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
libs/storage/Tsavorite/cs/src/core/Utilities/AsyncResultTypes.cs	Removes the `AsyncGetFromDiskResult<TContext>` wrapper type.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/TsavoriteThread.cs	Drains `AsyncIOContext` directly from `readyResponses`; disposes/returns pending ops and maintains `pendingCount`.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Tsavorite.cs	Updates comments describing pending-context handling.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/HandleOperationStatus.cs	Rents a pooled pending-op and copies `PendingContext` onto it (no dictionary); increments `pendingCount` before issuing IO.
libs/storage/Tsavorite/cs/src/core/Index/Common/PendingIoContext.cs	Introduces `PendingIoContext<TInput,TOutput,TContext> : AsyncIOContext` to carry the typed pending context.
libs/storage/Tsavorite/cs/src/core/Index/Common/PendingContext.cs	Updates the `id` documentation to reflect the new pending-op model.
libs/storage/Tsavorite/cs/src/core/Index/Common/ExecutionContext.cs	Replaces `ioPendingRequests` with `pendingCount`; changes `readyResponses` to `AsyncQueue<AsyncIOContext>`; pools `PendingIoContext` instances.
libs/storage/Tsavorite/cs/src/core/Device/LocalMemoryDevice.cs	Adds inline completion mode, improves re-entrant submit handling across device instances, and changes default ring capacity.
libs/storage/Tsavorite/cs/src/core/Device/Devices.cs	Extends `CreateLogDevice` with `localMemoryRingCapacity` and updates `numCompletionThreads` semantics for LocalMemory (0 = inline, <0 = ProcessorCount).
libs/storage/Tsavorite/cs/src/core/Allocator/AsyncIOContext.cs	Updates `callbackQueue` to enqueue `AsyncIOContext` directly (no wrapper).
libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs	Removes wrapper pool; passes `AsyncIOContext` directly to device; adds `TryVerifyOrReissuePendingRead` and updates callback behavior accordingly.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Options.cs	Enhances device-throttle help text; adds `--device-inline-completion`.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.Setup.cs	Wires inline completion and maps throttle intent to LocalMemory ring capacity.

Two Copilot-reviewer findings on PR #1867: 1. AsyncIOContext XML doc still listed "dictionary values" as a pending-IO storage site, but the ioPendingRequests dictionary was removed in this PR (the async path now flows through the per-session ready queue). Drop the stale reference. 2. KV.benchmark mapped --device-throttle to a power-of-two ring via `while (r < throttle) r <<= 1`, which overflows to a negative r and spins forever for user-supplied values > 2^30. Make it overflow-safe: cap at the largest power-of-two int (1 << 30). Co-authored-by: Copilot <[email protected]>

The per-allocator AsyncGetFromDiskResult pool is fronted by a per-thread ThreadLocalCache with a fixed 64-slot tier. The success-path Rent (issue) and Return (drain) both run on the worker thread, but a pipelined read batch keeps far more than 64 wrappers in flight, so every batch spills the overflow to the shared ConcurrentQueue. On an instant device that shared head/tail Interlocked traffic dominates and caps scaling (kv.bench LocalMemory inline 100% reads, b=1024: 1.16M ops/sec flat at 32 threads); real disk devices hide it behind IO latency. Grow the per-thread array geometrically up to a per-call maxCapacity (8192 for the wrapper pool) so the same-thread Rent/Return cycle stays in TLS; only a burst beyond the ceiling falls back to the shared queue. PooledArrayMemoryPool keeps the default cap, so its behavior is unchanged. kv.bench LocalMemory inline 100% reads, b=1024, now scales 0.93M (t=1) -> 21.7M (t=32). No change to NVMe disk-bound throughput (device-bound) or the in-memory read path (no pending IO, no wrapper). Co-authored-by: Copilot <[email protected]>

Inline completion runs the copy + completion callback synchronously on the submitting thread - no IO rings or completion threads, zero cross-thread handoff. Exposed via Devices.CreateLogDevice(numCompletionThreads: 0) and KV.benchmark --device-inline-completion (negative numCompletionThreads keeps the prior ProcessorCount default). Isolates per-op work from the run-thread -> completion-thread data handoff. On kv.bench LocalMemory (100% disk reads, 16MB log, 100B values) inline scales near-linearly (t1 1.08M, t8 8.63M, t16 16.9M, t32 33.0M) while the normal completion-thread model caps (~3M, flat/declining). The two are identical at t=1, so the per-op work is not the bottleneck - the cross-thread handoff is. Co-authored-by: Copilot <[email protected]>

Merge the AsyncGetFromDiskResult wrapper into AsyncIOContext and carry the PendingContext on the rented op, dropping the per-session ioPendingRequests dictionary. - Delete AsyncGetFromDiskResult<T> and its per-allocator TLS/global pool; the AsyncIOContext is now the object-context handed to the device and is enqueued directly into readyResponses. Re-read re-issues the same op, so there is no separate wrapper to rent/return. - Unseal AsyncIOContext; add generic PendingIoContext<TInput,TOutput,TContext> carrying the PendingContext. Rent it from the (now typed) per-session pool; the run-thread drain downcasts to reach the context (completion thread only ever touches the non-generic base). - Replace ioPendingRequests.Count with a scalar pendingCount (the vestigial asyncPendingCount is removed). Decrement in the drain's outer finally, after any re-pend increment, so the count is never transiently zero and stays correct even if completion processing throws (no CompletePending hang). - Re-pend keeps fresh-op-per-hop with move-on-re-pend ownership of the input container / requestKey; terminal and exception disposal are idempotent. One rent -> complete -> return per pending op, all per-session: no global queue, no dictionary Add/Remove/192B struct copy per op. Co-authored-by: Copilot <[email protected]>

Keep the device completion thread minimal for asynchronous pending reads: it now only records the transfer size and hands the op to the run thread via the per-session ready queue. Record verification (header parse, key compare, DiskLogRecord transfer) and disk hash-chain re-issue now run in the run-thread CompletePending drain instead of on the completion thread. - Extract TryVerifyOrReissuePendingRead on AllocatorBase: verify the record and, on an incomplete record or key mismatch within range, re-issue the next chain read on the SAME op (reusing it, exactly as before). Returns false when it re-issued (still pending) and true when resolved (verified record or walked below the resolvable range). - AsyncGetFromDiskCallback: the async branch is now enqueue-only; the synchronous iterator/scan path (completionEvent) still verifies + re-issues on the completion thread, since its waiter reads diskLogRecord directly after Wait(). - InternalCompletePendingRequest calls TryVerifyOrReissuePendingRead first; a chain-walk re-issue leaves the op in flight (skips return/decrement), so the pending count stays correct. Verify runs under the drain's epoch protection. This makes the async completion path symmetric and minimal (like the scaffold that scales): the completion thread reaps the device ring faster, and all record/key resolution for a pending read happens on one thread. Co-authored-by: Copilot <[email protected]>

A pending read/RMW with empty input (e.g. GET, which has no input) previously rented a SpanByteHeapContainer from the per-session pool only for it to early- return without allocating a buffer. Add a shared do-nothing EmptyHeapContainer<T> singleton (Get returns an immutable default, Dispose is a no-op) and use it for empty PinnedSpanByte input, avoiding the per-op wrapper rent/return entirely. The backing value is never written (empty input is read-only), so the singleton is safe to share across sessions. Co-authored-by: Copilot <[email protected]>

Co-authored-by: Copilot <[email protected]>

EmptyHeapContainer.Get() returns a ref to a process-wide shared field; a read never writes through the input ref, but an RMW updater could, which would race across sessions on the shared default. Gate the empty-input fast path on OperationType.READ so only reads use the shared container; RMW (even with empty input) keeps renting a per-op container. Removes the footgun while retaining the benefit for the read-only pending path. Co-authored-by: Copilot <[email protected]>

LocalMemoryDevice does not override Throttle(), so its in-flight depth was bounded only by its per-ring SPSC capacity (default 4096) and --device-throttle was silently ignored. Because the ring's producer-blocks-when-full backpressure already IS a per-submitter in-flight bound, the right way to throttle this device is to size the ring from the throttle — that caps in-flight with no contended device-wide numPending counter (which would add a cross-core atomic per IO, hurting the very scaling we care about). Add an optional localMemoryRingCapacity parameter to Devices.CreateLogDevice and have KV.benchmark pass it from --device-throttle (rounded up to a power of two). The default (throttle 0) is unchanged. Measured at t=16: throttle 0 -> 3.85M (unchanged), 1024 -> 4.41M (+14%), 512 -> 4.65M (+21%) — the over-deep default in-flight (4096 x 8KB = 32MB /thread working set) was blowing the LLC and saturating memory bandwidth. Additive and benchmark-confined: CreateLogDevice's LocalMemory branch is only used by the benchmark (tests construct LocalMemoryDevice directly), and the new parameter defaults to the prior behavior. Co-authored-by: Copilot <[email protected]>

The per-ring capacity is the device's per-submitter in-flight bound. The previous default of 4096 let ~4096 reads be in flight per submitter, each holding an 8KB read buffer (~32MB/thread working set), which blows the LLC and saturates memory bandwidth on the completion-thread memcpy. 1024 is deep enough to pipeline yet keeps the working set far smaller, and measures faster across thread counts (t=8 +18%, t=16 +10%, t=32 +6%) with no regression at t=1. Updated the LocalMemoryDevice constructor default and the Devices.CreateLogDevice LocalMemory fallback. Validated: Tsavorite.test 206/0, test.hlog 509/0 (incl. the reentrant-full-ring deadlock test), test.recovery 305/0, test.stress 48/0; format clean. Co-authored-by: Copilot <[email protected]>

…s fast NVMe The default device throttle (120 in-flight) caps queue depth and under-drives a fast NVMe, yielding a misleadingly low IOPS number (~500K on a Dell P5600 that sustains ~750K). Two config factors, both surfaced via device.bench + iostat: - throttle 120 caps QD; >=512 is needed to saturate the device queue. - a small --num-keys yields a small log LBA span; the device itself delivers fewer IOPS over a small span (engages fewer NAND channels). device.bench raw measured 554K over a 1.2GB span vs 746K over 16GB on the same drive. With --device-throttle >=512 and a keyspace large enough to span the device (>=100M keys here), KV.benchmark reaches 734K vs the 751K device.bench/fio ceiling (within ~2%) — i.e. the engine already tracks the device ceiling; the gap was a measurement-config artifact, not an engine limit. Clarify the help text so the default no longer reads as the device's true capability. Co-authored-by: Copilot <[email protected]>

…inel Two review findings addressed (no must-fix issues were found): 1. (should-fix) EmptyHeapContainer was a process-wide singleton exposed by ref T via ReadCompletionCallback(ref TInput input); a callback that writes through the ref would corrupt the value seen by empty-input reads in OTHER sessions. Scope the empty container per session (TsavoriteExecutionContext.EmptyInputContainer, lazily created) and reset it to default on Dispose, so cross-session leakage is impossible and intra-session reuse is clean. Removes the Unsafe.As reinterpret at the use site. 2. (consider) LocalMemoryDevice's drain-thread sentinel (ProcessorThreadRingIdx) is process-wide, so a completion callback on device A's drain thread submitting IO to device B would inline- complete on B, bypassing B's ring/throttle/latency. Pair the sentinel with a per-thread owner reference and inline only when ReferenceEquals(owner, this); a cross-instance submit routes through this instance's ring. Guard the sentinel in ring selection to avoid index underflow. The review also flagged that making SectorAlignedBufferPool's ctor private (while the type stays public) breaks external 'new' callers. This is intentional: the pool is deliberately Shared-only and no external construction compatibility is needed. Co-authored-by: Copilot <[email protected]>

…tinel Two findings from a second review pass: 1. (must-fix) The empty-input shared-container optimization aliases one per-session EmptyHeapContainer across coexisting CompletedOutputs. CompletePendingWithOutputs transfers pendingContext.input into each CompletedOutput (CompletedOutput.TransferFrom), whose Input property exposes a mutable `ref` (ref inputContainer.Get()). Multiple empty- input reads completed in one batch therefore share the same container, and one output's Dispose() resets the value the others still reference. Per-session scoping (the round-1 fix) cannot isolate coexisting outputs. Revert the optimization: every PinnedSpanByte input again rents a per-op SpanByteHeapContainer (an empty input early-returns without a buffer, so the rent is cheap and fully isolated). Removes EmptyHeapContainer and the per-session EmptyInputContainer, eliminating the shared-mutable-by-ref footgun class (both this and the round-1 ReadCompletionCallback surface). 2. (should-fix) LocalMemoryDevice.AssignRingIdx unconditionally cached the assigned ring index into t_ringIdxPlusOne, clobbering a drain thread's process-wide sentinel when that thread routed a rare cross-instance submit. That would disable same-instance reentrant inline completion and risk self-deadlock (a later submit back to this instance spinning in WaitForSlotEmpty on a ring only this thread drains). Guard the cache write: preserve the sentinel; a drain thread routing cross-instance just round-robins per call. Co-authored-by: Copilot <[email protected]>

localMemoryRingCapacity was appended after the ILogger logger parameter, breaking the convention that logger is always last. Move it ahead of logger (matching the existing <param> doc order); all callers use named arguments for these tail parameters, so the reorder is source-compatible. Also fill in the empty logger <param> doc. Co-authored-by: Copilot <[email protected]>

…essorOwner t_processorOwner is set once on the drain thread and never cleared, which is correct only because drain work runs on dedicated threads (never pooled). Note the invariant so a future move to pooled drain threads doesn't silently leak the identity to reused threads. Co-authored-by: Copilot <[email protected]>

This branch had grown ThreadLocalCache<T> from a fixed-capacity TLS cache into a geometric-growth one with a per-call maxCapacity parameter. The only consumer (PooledArrayMemoryPool) calls TryReturn(item) with no maxCapacity, so it runs at the default and the growth path is never exercised — and the design's own note already recorded that raising the cap gives no throughput change because the global-queue fallback is not the bottleneck at these rates. Drop the unused complexity: restore the simpler fixed-Capacity TryReturn(item) (back to main) and fix the doc cref. Co-authored-by: Copilot <[email protected]>

Two Copilot-reviewer findings on PR #1867: 1. AsyncIOContext XML doc still listed "dictionary values" as a pending-IO storage site, but the ioPendingRequests dictionary was removed in this PR (the async path now flows through the per-session ready queue). Drop the stale reference. 2. KV.benchmark mapped --device-throttle to a power-of-two ring via `while (r < throttle) r <<= 1`, which overflows to a negative r and spins forever for user-supplied values > 2^30. Make it overflow-safe: cap at the largest power-of-two int (1 << 30). Co-authored-by: Copilot <[email protected]>

…r path LocalMemoryDevice has no device-wide Throttle()/numPending counter (it inherits Throttle() => false), so setting IDevice.ThrottleLimit on it is a no-op. Its per-ring SPSC backpressure (producer blocks on a full ring) is the only in-flight bound. The KV benchmark already maps --device-throttle onto the ring capacity for this reason; do the same on the server path so --device-throttle-limit actually bounds in-flight when the server runs with --device-type localmemory (previously the ring was always the default). LocalStorageNamedDeviceFactory.Get now rounds the throttle up to a power of two (via Utility.NextPowerOf2, capped at 1<<30 for overflow safety) and passes it as localMemoryRingCapacity for DeviceType.LocalMemory; other device types are unchanged (they continue to honor ThrottleLimit). Updated the factory/creator param docs and the --device-throttle-limit help text. Verified end-to-end: server started with --device-type localmemory --device-throttle-limit 2048/3000 logs ringCapacity=2048/4096 (power-of-two rounding) and serves PING/SET/GET. Co-authored-by: Copilot <[email protected]>

Mirror the CLI help: for DeviceType=LocalMemory (which has no device-wide throttle) DeviceThrottleLimit instead sets the per-ring in-flight capacity, rounded up to a power of two. Co-authored-by: Copilot <[email protected]>

Per review discussion: readyResponses is AsyncQueue<AsyncIOContext> (not the generic PendingIoContext) because the allocator's IO-completion callback that enqueues into it is not generic — it only has the base AsyncIOContext. Every entry is in fact a PendingIoContext, which the run-thread drain recovers by downcast where the session's type arguments are in scope. Co-authored-by: Copilot <[email protected]>

badrishc force-pushed the badrishc/io-local-mem branch 4 times, most recently from d20e38c to 00e562a Compare June 10, 2026 19:16

badrishc marked this pull request as ready for review June 10, 2026 19:17

Copilot AI review requested due to automatic review settings June 10, 2026 19:17

Copilot started reviewing on behalf of badrishc June 10, 2026 19:17 View session

Copilot AI reviewed Jun 10, 2026

View reviewed changes

Comment thread libs/storage/Tsavorite/cs/src/core/Allocator/AsyncIOContext.cs Outdated

Comment thread libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.Setup.cs

badrishc and others added 17 commits June 10, 2026 20:46

Tsavorite: drop final newline in PendingIoContext.cs (editorconfig)

8fe08e1

Co-authored-by: Copilot <[email protected]>

badrishc force-pushed the badrishc/io-local-mem branch from 7a2e7f8 to 7c77aa0 Compare June 11, 2026 03:51

badrishc and others added 2 commits June 10, 2026 21:18

TedHartMS approved these changes Jun 11, 2026

View reviewed changes

TedHartMS merged commit 9a10cf5 into main Jun 11, 2026
195 checks passed

TedHartMS deleted the badrishc/io-local-mem branch June 11, 2026 04:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

More IO optimizations#1867

More IO optimizations#1867
TedHartMS merged 19 commits into
mainfrom
badrishc/io-local-mem

badrishc commented Jun 9, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

badrishc commented Jun 9, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants