More IO optimizations#1867
Merged
Merged
Conversation
d20e38c to
00e562a
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
This PR restructures Tsavorite’s pending-IO pipeline to eliminate the AsyncGetFromDiskResult wrapper and the per-session ioPendingRequests dictionary, instead carrying PendingContext directly on a pooled per-session pending-op object that flows through the device callback and ready-queue. It also enhances the LocalMemoryDevice to support an inline-completion mode (parallelism = 0) and exposes ring-capacity tuning through Devices.CreateLogDevice, with benchmark flags to exercise these modes.
Changes:
- Replace wrapper + dictionary-based pending-IO tracking with a single pooled pending-op object (
PendingIoContext) that carries both device-facing fields and the typedPendingContext. - Move disk-chain verification/reissue to a shared helper (
TryVerifyOrReissuePendingRead) and (for async ops) perform verification on the run-thread drain rather than the completion thread. - Add
LocalMemoryDeviceinline completion and configurable ring capacity; update KV benchmark options to support these and map throttle intent to ring sizing.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| libs/storage/Tsavorite/cs/src/core/Utilities/AsyncResultTypes.cs | Removes the AsyncGetFromDiskResult<TContext> wrapper type. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/TsavoriteThread.cs | Drains AsyncIOContext directly from readyResponses; disposes/returns pending ops and maintains pendingCount. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Tsavorite.cs | Updates comments describing pending-context handling. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/HandleOperationStatus.cs | Rents a pooled pending-op and copies PendingContext onto it (no dictionary); increments pendingCount before issuing IO. |
| libs/storage/Tsavorite/cs/src/core/Index/Common/PendingIoContext.cs | Introduces PendingIoContext<TInput,TOutput,TContext> : AsyncIOContext to carry the typed pending context. |
| libs/storage/Tsavorite/cs/src/core/Index/Common/PendingContext.cs | Updates the id documentation to reflect the new pending-op model. |
| libs/storage/Tsavorite/cs/src/core/Index/Common/ExecutionContext.cs | Replaces ioPendingRequests with pendingCount; changes readyResponses to AsyncQueue<AsyncIOContext>; pools PendingIoContext instances. |
| libs/storage/Tsavorite/cs/src/core/Device/LocalMemoryDevice.cs | Adds inline completion mode, improves re-entrant submit handling across device instances, and changes default ring capacity. |
| libs/storage/Tsavorite/cs/src/core/Device/Devices.cs | Extends CreateLogDevice with localMemoryRingCapacity and updates numCompletionThreads semantics for LocalMemory (0 = inline, <0 = ProcessorCount). |
| libs/storage/Tsavorite/cs/src/core/Allocator/AsyncIOContext.cs | Updates callbackQueue to enqueue AsyncIOContext directly (no wrapper). |
| libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs | Removes wrapper pool; passes AsyncIOContext directly to device; adds TryVerifyOrReissuePendingRead and updates callback behavior accordingly. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Options.cs | Enhances device-throttle help text; adds --device-inline-completion. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.Setup.cs | Wires inline completion and maps throttle intent to LocalMemory ring capacity. |
badrishc
added a commit
that referenced
this pull request
Jun 10, 2026
Two Copilot-reviewer findings on PR #1867: 1. AsyncIOContext XML doc still listed "dictionary values" as a pending-IO storage site, but the ioPendingRequests dictionary was removed in this PR (the async path now flows through the per-session ready queue). Drop the stale reference. 2. KV.benchmark mapped --device-throttle to a power-of-two ring via `while (r < throttle) r <<= 1`, which overflows to a negative r and spins forever for user-supplied values > 2^30. Make it overflow-safe: cap at the largest power-of-two int (1 << 30). Co-authored-by: Copilot <[email protected]>
The per-allocator AsyncGetFromDiskResult pool is fronted by a per-thread ThreadLocalCache with a fixed 64-slot tier. The success-path Rent (issue) and Return (drain) both run on the worker thread, but a pipelined read batch keeps far more than 64 wrappers in flight, so every batch spills the overflow to the shared ConcurrentQueue. On an instant device that shared head/tail Interlocked traffic dominates and caps scaling (kv.bench LocalMemory inline 100% reads, b=1024: 1.16M ops/sec flat at 32 threads); real disk devices hide it behind IO latency. Grow the per-thread array geometrically up to a per-call maxCapacity (8192 for the wrapper pool) so the same-thread Rent/Return cycle stays in TLS; only a burst beyond the ceiling falls back to the shared queue. PooledArrayMemoryPool keeps the default cap, so its behavior is unchanged. kv.bench LocalMemory inline 100% reads, b=1024, now scales 0.93M (t=1) -> 21.7M (t=32). No change to NVMe disk-bound throughput (device-bound) or the in-memory read path (no pending IO, no wrapper). Co-authored-by: Copilot <[email protected]>
Inline completion runs the copy + completion callback synchronously on the submitting thread - no IO rings or completion threads, zero cross-thread handoff. Exposed via Devices.CreateLogDevice(numCompletionThreads: 0) and KV.benchmark --device-inline-completion (negative numCompletionThreads keeps the prior ProcessorCount default). Isolates per-op work from the run-thread -> completion-thread data handoff. On kv.bench LocalMemory (100% disk reads, 16MB log, 100B values) inline scales near-linearly (t1 1.08M, t8 8.63M, t16 16.9M, t32 33.0M) while the normal completion-thread model caps (~3M, flat/declining). The two are identical at t=1, so the per-op work is not the bottleneck - the cross-thread handoff is. Co-authored-by: Copilot <[email protected]>
Merge the AsyncGetFromDiskResult wrapper into AsyncIOContext and carry the PendingContext on the rented op, dropping the per-session ioPendingRequests dictionary. - Delete AsyncGetFromDiskResult<T> and its per-allocator TLS/global pool; the AsyncIOContext is now the object-context handed to the device and is enqueued directly into readyResponses. Re-read re-issues the same op, so there is no separate wrapper to rent/return. - Unseal AsyncIOContext; add generic PendingIoContext<TInput,TOutput,TContext> carrying the PendingContext. Rent it from the (now typed) per-session pool; the run-thread drain downcasts to reach the context (completion thread only ever touches the non-generic base). - Replace ioPendingRequests.Count with a scalar pendingCount (the vestigial asyncPendingCount is removed). Decrement in the drain's outer finally, after any re-pend increment, so the count is never transiently zero and stays correct even if completion processing throws (no CompletePending hang). - Re-pend keeps fresh-op-per-hop with move-on-re-pend ownership of the input container / requestKey; terminal and exception disposal are idempotent. One rent -> complete -> return per pending op, all per-session: no global queue, no dictionary Add/Remove/192B struct copy per op. Co-authored-by: Copilot <[email protected]>
Keep the device completion thread minimal for asynchronous pending reads: it now only records the transfer size and hands the op to the run thread via the per-session ready queue. Record verification (header parse, key compare, DiskLogRecord transfer) and disk hash-chain re-issue now run in the run-thread CompletePending drain instead of on the completion thread. - Extract TryVerifyOrReissuePendingRead on AllocatorBase: verify the record and, on an incomplete record or key mismatch within range, re-issue the next chain read on the SAME op (reusing it, exactly as before). Returns false when it re-issued (still pending) and true when resolved (verified record or walked below the resolvable range). - AsyncGetFromDiskCallback: the async branch is now enqueue-only; the synchronous iterator/scan path (completionEvent) still verifies + re-issues on the completion thread, since its waiter reads diskLogRecord directly after Wait(). - InternalCompletePendingRequest calls TryVerifyOrReissuePendingRead first; a chain-walk re-issue leaves the op in flight (skips return/decrement), so the pending count stays correct. Verify runs under the drain's epoch protection. This makes the async completion path symmetric and minimal (like the scaffold that scales): the completion thread reaps the device ring faster, and all record/key resolution for a pending read happens on one thread. Co-authored-by: Copilot <[email protected]>
A pending read/RMW with empty input (e.g. GET, which has no input) previously rented a SpanByteHeapContainer from the per-session pool only for it to early- return without allocating a buffer. Add a shared do-nothing EmptyHeapContainer<T> singleton (Get returns an immutable default, Dispose is a no-op) and use it for empty PinnedSpanByte input, avoiding the per-op wrapper rent/return entirely. The backing value is never written (empty input is read-only), so the singleton is safe to share across sessions. Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
EmptyHeapContainer.Get() returns a ref to a process-wide shared field; a read never writes through the input ref, but an RMW updater could, which would race across sessions on the shared default. Gate the empty-input fast path on OperationType.READ so only reads use the shared container; RMW (even with empty input) keeps renting a per-op container. Removes the footgun while retaining the benefit for the read-only pending path. Co-authored-by: Copilot <[email protected]>
LocalMemoryDevice does not override Throttle(), so its in-flight depth was bounded only by its per-ring SPSC capacity (default 4096) and --device-throttle was silently ignored. Because the ring's producer-blocks-when-full backpressure already IS a per-submitter in-flight bound, the right way to throttle this device is to size the ring from the throttle — that caps in-flight with no contended device-wide numPending counter (which would add a cross-core atomic per IO, hurting the very scaling we care about). Add an optional localMemoryRingCapacity parameter to Devices.CreateLogDevice and have KV.benchmark pass it from --device-throttle (rounded up to a power of two). The default (throttle 0) is unchanged. Measured at t=16: throttle 0 -> 3.85M (unchanged), 1024 -> 4.41M (+14%), 512 -> 4.65M (+21%) — the over-deep default in-flight (4096 x 8KB = 32MB /thread working set) was blowing the LLC and saturating memory bandwidth. Additive and benchmark-confined: CreateLogDevice's LocalMemory branch is only used by the benchmark (tests construct LocalMemoryDevice directly), and the new parameter defaults to the prior behavior. Co-authored-by: Copilot <[email protected]>
The per-ring capacity is the device's per-submitter in-flight bound. The previous default of 4096 let ~4096 reads be in flight per submitter, each holding an 8KB read buffer (~32MB/thread working set), which blows the LLC and saturates memory bandwidth on the completion-thread memcpy. 1024 is deep enough to pipeline yet keeps the working set far smaller, and measures faster across thread counts (t=8 +18%, t=16 +10%, t=32 +6%) with no regression at t=1. Updated the LocalMemoryDevice constructor default and the Devices.CreateLogDevice LocalMemory fallback. Validated: Tsavorite.test 206/0, test.hlog 509/0 (incl. the reentrant-full-ring deadlock test), test.recovery 305/0, test.stress 48/0; format clean. Co-authored-by: Copilot <[email protected]>
…s fast NVMe
The default device throttle (120 in-flight) caps queue depth and under-drives a
fast NVMe, yielding a misleadingly low IOPS number (~500K on a Dell P5600 that
sustains ~750K). Two config factors, both surfaced via device.bench + iostat:
- throttle 120 caps QD; >=512 is needed to saturate the device queue.
- a small --num-keys yields a small log LBA span; the device itself delivers
fewer IOPS over a small span (engages fewer NAND channels). device.bench raw
measured 554K over a 1.2GB span vs 746K over 16GB on the same drive.
With --device-throttle >=512 and a keyspace large enough to span the device
(>=100M keys here), KV.benchmark reaches 734K vs the 751K device.bench/fio
ceiling (within ~2%) — i.e. the engine already tracks the device ceiling; the
gap was a measurement-config artifact, not an engine limit. Clarify the help
text so the default no longer reads as the device's true capability.
Co-authored-by: Copilot <[email protected]>
…inel Two review findings addressed (no must-fix issues were found): 1. (should-fix) EmptyHeapContainer was a process-wide singleton exposed by ref T via ReadCompletionCallback(ref TInput input); a callback that writes through the ref would corrupt the value seen by empty-input reads in OTHER sessions. Scope the empty container per session (TsavoriteExecutionContext.EmptyInputContainer, lazily created) and reset it to default on Dispose, so cross-session leakage is impossible and intra-session reuse is clean. Removes the Unsafe.As reinterpret at the use site. 2. (consider) LocalMemoryDevice's drain-thread sentinel (ProcessorThreadRingIdx) is process-wide, so a completion callback on device A's drain thread submitting IO to device B would inline- complete on B, bypassing B's ring/throttle/latency. Pair the sentinel with a per-thread owner reference and inline only when ReferenceEquals(owner, this); a cross-instance submit routes through this instance's ring. Guard the sentinel in ring selection to avoid index underflow. The review also flagged that making SectorAlignedBufferPool's ctor private (while the type stays public) breaks external 'new' callers. This is intentional: the pool is deliberately Shared-only and no external construction compatibility is needed. Co-authored-by: Copilot <[email protected]>
…tinel Two findings from a second review pass: 1. (must-fix) The empty-input shared-container optimization aliases one per-session EmptyHeapContainer across coexisting CompletedOutputs. CompletePendingWithOutputs transfers pendingContext.input into each CompletedOutput (CompletedOutput.TransferFrom), whose Input property exposes a mutable `ref` (ref inputContainer.Get()). Multiple empty- input reads completed in one batch therefore share the same container, and one output's Dispose() resets the value the others still reference. Per-session scoping (the round-1 fix) cannot isolate coexisting outputs. Revert the optimization: every PinnedSpanByte input again rents a per-op SpanByteHeapContainer (an empty input early-returns without a buffer, so the rent is cheap and fully isolated). Removes EmptyHeapContainer and the per-session EmptyInputContainer, eliminating the shared-mutable-by-ref footgun class (both this and the round-1 ReadCompletionCallback surface). 2. (should-fix) LocalMemoryDevice.AssignRingIdx unconditionally cached the assigned ring index into t_ringIdxPlusOne, clobbering a drain thread's process-wide sentinel when that thread routed a rare cross-instance submit. That would disable same-instance reentrant inline completion and risk self-deadlock (a later submit back to this instance spinning in WaitForSlotEmpty on a ring only this thread drains). Guard the cache write: preserve the sentinel; a drain thread routing cross-instance just round-robins per call. Co-authored-by: Copilot <[email protected]>
localMemoryRingCapacity was appended after the ILogger logger parameter, breaking the convention that logger is always last. Move it ahead of logger (matching the existing <param> doc order); all callers use named arguments for these tail parameters, so the reorder is source-compatible. Also fill in the empty logger <param> doc. Co-authored-by: Copilot <[email protected]>
…essorOwner t_processorOwner is set once on the drain thread and never cleared, which is correct only because drain work runs on dedicated threads (never pooled). Note the invariant so a future move to pooled drain threads doesn't silently leak the identity to reused threads. Co-authored-by: Copilot <[email protected]>
This branch had grown ThreadLocalCache<T> from a fixed-capacity TLS cache into a geometric-growth one with a per-call maxCapacity parameter. The only consumer (PooledArrayMemoryPool) calls TryReturn(item) with no maxCapacity, so it runs at the default and the growth path is never exercised — and the design's own note already recorded that raising the cap gives no throughput change because the global-queue fallback is not the bottleneck at these rates. Drop the unused complexity: restore the simpler fixed-Capacity TryReturn(item) (back to main) and fix the doc cref. Co-authored-by: Copilot <[email protected]>
Two Copilot-reviewer findings on PR #1867: 1. AsyncIOContext XML doc still listed "dictionary values" as a pending-IO storage site, but the ioPendingRequests dictionary was removed in this PR (the async path now flows through the per-session ready queue). Drop the stale reference. 2. KV.benchmark mapped --device-throttle to a power-of-two ring via `while (r < throttle) r <<= 1`, which overflows to a negative r and spins forever for user-supplied values > 2^30. Make it overflow-safe: cap at the largest power-of-two int (1 << 30). Co-authored-by: Copilot <[email protected]>
…r path LocalMemoryDevice has no device-wide Throttle()/numPending counter (it inherits Throttle() => false), so setting IDevice.ThrottleLimit on it is a no-op. Its per-ring SPSC backpressure (producer blocks on a full ring) is the only in-flight bound. The KV benchmark already maps --device-throttle onto the ring capacity for this reason; do the same on the server path so --device-throttle-limit actually bounds in-flight when the server runs with --device-type localmemory (previously the ring was always the default). LocalStorageNamedDeviceFactory.Get now rounds the throttle up to a power of two (via Utility.NextPowerOf2, capped at 1<<30 for overflow safety) and passes it as localMemoryRingCapacity for DeviceType.LocalMemory; other device types are unchanged (they continue to honor ThrottleLimit). Updated the factory/creator param docs and the --device-throttle-limit help text. Verified end-to-end: server started with --device-type localmemory --device-throttle-limit 2048/3000 logs ringCapacity=2048/4096 (power-of-two rounding) and serves PING/SET/GET. Co-authored-by: Copilot <[email protected]>
7a2e7f8 to
7c77aa0
Compare
Mirror the CLI help: for DeviceType=LocalMemory (which has no device-wide throttle) DeviceThrottleLimit instead sets the per-ring in-flight capacity, rounded up to a power of two. Co-authored-by: Copilot <[email protected]>
Per review discussion: readyResponses is AsyncQueue<AsyncIOContext> (not the generic PendingIoContext) because the allocator's IO-completion callback that enqueues into it is not generic — it only has the base AsyncIOContext. Every entry is in fact a PendingIoContext, which the run-thread drain recovers by downcast where the session's type arguments are in scope. Co-authored-by: Copilot <[email protected]>
TedHartMS
approved these changes
Jun 11, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.