Skip to content

More IO optimizations#1867

Merged
TedHartMS merged 19 commits into
mainfrom
badrishc/io-local-mem
Jun 11, 2026
Merged

More IO optimizations#1867
TedHartMS merged 19 commits into
mainfrom
badrishc/io-local-mem

Conversation

@badrishc

@badrishc badrishc commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

No description provided.

@badrishc badrishc force-pushed the badrishc/io-local-mem branch 4 times, most recently from d20e38c to 00e562a Compare June 10, 2026 19:16
@badrishc badrishc marked this pull request as ready for review June 10, 2026 19:17
Copilot AI review requested due to automatic review settings June 10, 2026 19:17

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR restructures Tsavorite’s pending-IO pipeline to eliminate the AsyncGetFromDiskResult wrapper and the per-session ioPendingRequests dictionary, instead carrying PendingContext directly on a pooled per-session pending-op object that flows through the device callback and ready-queue. It also enhances the LocalMemoryDevice to support an inline-completion mode (parallelism = 0) and exposes ring-capacity tuning through Devices.CreateLogDevice, with benchmark flags to exercise these modes.

Changes:

  • Replace wrapper + dictionary-based pending-IO tracking with a single pooled pending-op object (PendingIoContext) that carries both device-facing fields and the typed PendingContext.
  • Move disk-chain verification/reissue to a shared helper (TryVerifyOrReissuePendingRead) and (for async ops) perform verification on the run-thread drain rather than the completion thread.
  • Add LocalMemoryDevice inline completion and configurable ring capacity; update KV benchmark options to support these and map throttle intent to ring sizing.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
libs/storage/Tsavorite/cs/src/core/Utilities/AsyncResultTypes.cs Removes the AsyncGetFromDiskResult<TContext> wrapper type.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/TsavoriteThread.cs Drains AsyncIOContext directly from readyResponses; disposes/returns pending ops and maintains pendingCount.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Tsavorite.cs Updates comments describing pending-context handling.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/HandleOperationStatus.cs Rents a pooled pending-op and copies PendingContext onto it (no dictionary); increments pendingCount before issuing IO.
libs/storage/Tsavorite/cs/src/core/Index/Common/PendingIoContext.cs Introduces PendingIoContext<TInput,TOutput,TContext> : AsyncIOContext to carry the typed pending context.
libs/storage/Tsavorite/cs/src/core/Index/Common/PendingContext.cs Updates the id documentation to reflect the new pending-op model.
libs/storage/Tsavorite/cs/src/core/Index/Common/ExecutionContext.cs Replaces ioPendingRequests with pendingCount; changes readyResponses to AsyncQueue<AsyncIOContext>; pools PendingIoContext instances.
libs/storage/Tsavorite/cs/src/core/Device/LocalMemoryDevice.cs Adds inline completion mode, improves re-entrant submit handling across device instances, and changes default ring capacity.
libs/storage/Tsavorite/cs/src/core/Device/Devices.cs Extends CreateLogDevice with localMemoryRingCapacity and updates numCompletionThreads semantics for LocalMemory (0 = inline, <0 = ProcessorCount).
libs/storage/Tsavorite/cs/src/core/Allocator/AsyncIOContext.cs Updates callbackQueue to enqueue AsyncIOContext directly (no wrapper).
libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs Removes wrapper pool; passes AsyncIOContext directly to device; adds TryVerifyOrReissuePendingRead and updates callback behavior accordingly.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Options.cs Enhances device-throttle help text; adds --device-inline-completion.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.Setup.cs Wires inline completion and maps throttle intent to LocalMemory ring capacity.

Comment thread libs/storage/Tsavorite/cs/src/core/Allocator/AsyncIOContext.cs Outdated
badrishc added a commit that referenced this pull request Jun 10, 2026
Two Copilot-reviewer findings on PR #1867:
1. AsyncIOContext XML doc still listed "dictionary values" as a pending-IO storage site, but the
   ioPendingRequests dictionary was removed in this PR (the async path now flows through the
   per-session ready queue). Drop the stale reference.
2. KV.benchmark mapped --device-throttle to a power-of-two ring via `while (r < throttle) r <<= 1`,
   which overflows to a negative r and spins forever for user-supplied values > 2^30. Make it
   overflow-safe: cap at the largest power-of-two int (1 << 30).

Co-authored-by: Copilot <[email protected]>
badrishc and others added 17 commits June 10, 2026 20:46
The per-allocator AsyncGetFromDiskResult pool is fronted by a per-thread
ThreadLocalCache with a fixed 64-slot tier. The success-path Rent (issue)
and Return (drain) both run on the worker thread, but a pipelined read
batch keeps far more than 64 wrappers in flight, so every batch spills the
overflow to the shared ConcurrentQueue. On an instant device that shared
head/tail Interlocked traffic dominates and caps scaling (kv.bench
LocalMemory inline 100% reads, b=1024: 1.16M ops/sec flat at 32 threads);
real disk devices hide it behind IO latency.

Grow the per-thread array geometrically up to a per-call maxCapacity
(8192 for the wrapper pool) so the same-thread Rent/Return cycle stays in
TLS; only a burst beyond the ceiling falls back to the shared queue.
PooledArrayMemoryPool keeps the default cap, so its behavior is unchanged.

kv.bench LocalMemory inline 100% reads, b=1024, now scales
0.93M (t=1) -> 21.7M (t=32). No change to NVMe disk-bound throughput
(device-bound) or the in-memory read path (no pending IO, no wrapper).

Co-authored-by: Copilot <[email protected]>
Inline completion runs the copy + completion callback synchronously on the
submitting thread - no IO rings or completion threads, zero cross-thread
handoff. Exposed via Devices.CreateLogDevice(numCompletionThreads: 0) and
KV.benchmark --device-inline-completion (negative numCompletionThreads keeps
the prior ProcessorCount default).

Isolates per-op work from the run-thread -> completion-thread data handoff.
On kv.bench LocalMemory (100% disk reads, 16MB log, 100B values) inline scales
near-linearly (t1 1.08M, t8 8.63M, t16 16.9M, t32 33.0M) while the normal
completion-thread model caps (~3M, flat/declining). The two are identical at
t=1, so the per-op work is not the bottleneck - the cross-thread handoff is.

Co-authored-by: Copilot <[email protected]>
Merge the AsyncGetFromDiskResult wrapper into AsyncIOContext and carry the
PendingContext on the rented op, dropping the per-session ioPendingRequests
dictionary.

- Delete AsyncGetFromDiskResult<T> and its per-allocator TLS/global pool; the
  AsyncIOContext is now the object-context handed to the device and is enqueued
  directly into readyResponses. Re-read re-issues the same op, so there is no
  separate wrapper to rent/return.
- Unseal AsyncIOContext; add generic PendingIoContext<TInput,TOutput,TContext>
  carrying the PendingContext. Rent it from the (now typed) per-session pool;
  the run-thread drain downcasts to reach the context (completion thread only
  ever touches the non-generic base).
- Replace ioPendingRequests.Count with a scalar pendingCount (the vestigial
  asyncPendingCount is removed). Decrement in the drain's outer finally, after
  any re-pend increment, so the count is never transiently zero and stays
  correct even if completion processing throws (no CompletePending hang).
- Re-pend keeps fresh-op-per-hop with move-on-re-pend ownership of the input
  container / requestKey; terminal and exception disposal are idempotent.

One rent -> complete -> return per pending op, all per-session: no global
queue, no dictionary Add/Remove/192B struct copy per op.

Co-authored-by: Copilot <[email protected]>
Keep the device completion thread minimal for asynchronous pending reads: it now
only records the transfer size and hands the op to the run thread via the
per-session ready queue. Record verification (header parse, key compare,
DiskLogRecord transfer) and disk hash-chain re-issue now run in the run-thread
CompletePending drain instead of on the completion thread.

- Extract TryVerifyOrReissuePendingRead on AllocatorBase: verify the record and,
  on an incomplete record or key mismatch within range, re-issue the next chain
  read on the SAME op (reusing it, exactly as before). Returns false when it
  re-issued (still pending) and true when resolved (verified record or walked
  below the resolvable range).
- AsyncGetFromDiskCallback: the async branch is now enqueue-only; the synchronous
  iterator/scan path (completionEvent) still verifies + re-issues on the
  completion thread, since its waiter reads diskLogRecord directly after Wait().
- InternalCompletePendingRequest calls TryVerifyOrReissuePendingRead first; a
  chain-walk re-issue leaves the op in flight (skips return/decrement), so the
  pending count stays correct. Verify runs under the drain's epoch protection.

This makes the async completion path symmetric and minimal (like the scaffold
that scales): the completion thread reaps the device ring faster, and all
record/key resolution for a pending read happens on one thread.

Co-authored-by: Copilot <[email protected]>
A pending read/RMW with empty input (e.g. GET, which has no input) previously
rented a SpanByteHeapContainer from the per-session pool only for it to early-
return without allocating a buffer. Add a shared do-nothing EmptyHeapContainer<T>
singleton (Get returns an immutable default, Dispose is a no-op) and use it for
empty PinnedSpanByte input, avoiding the per-op wrapper rent/return entirely. The
backing value is never written (empty input is read-only), so the singleton is
safe to share across sessions.

Co-authored-by: Copilot <[email protected]>
EmptyHeapContainer.Get() returns a ref to a process-wide shared field; a read
never writes through the input ref, but an RMW updater could, which would race
across sessions on the shared default. Gate the empty-input fast path on
OperationType.READ so only reads use the shared container; RMW (even with empty
input) keeps renting a per-op container. Removes the footgun while retaining the
benefit for the read-only pending path.

Co-authored-by: Copilot <[email protected]>
LocalMemoryDevice does not override Throttle(), so its in-flight depth was bounded
only by its per-ring SPSC capacity (default 4096) and --device-throttle was silently
ignored. Because the ring's producer-blocks-when-full backpressure already IS a
per-submitter in-flight bound, the right way to throttle this device is to size the
ring from the throttle — that caps in-flight with no contended device-wide numPending
counter (which would add a cross-core atomic per IO, hurting the very scaling we care
about).

Add an optional localMemoryRingCapacity parameter to Devices.CreateLogDevice and have
KV.benchmark pass it from --device-throttle (rounded up to a power of two). The default
(throttle 0) is unchanged. Measured at t=16: throttle 0 -> 3.85M (unchanged), 1024 ->
4.41M (+14%), 512 -> 4.65M (+21%) — the over-deep default in-flight (4096 x 8KB = 32MB
/thread working set) was blowing the LLC and saturating memory bandwidth.

Additive and benchmark-confined: CreateLogDevice's LocalMemory branch is only used by
the benchmark (tests construct LocalMemoryDevice directly), and the new parameter
defaults to the prior behavior.

Co-authored-by: Copilot <[email protected]>
The per-ring capacity is the device's per-submitter in-flight bound. The previous
default of 4096 let ~4096 reads be in flight per submitter, each holding an 8KB read
buffer (~32MB/thread working set), which blows the LLC and saturates memory bandwidth
on the completion-thread memcpy. 1024 is deep enough to pipeline yet keeps the working
set far smaller, and measures faster across thread counts (t=8 +18%, t=16 +10%,
t=32 +6%) with no regression at t=1.

Updated the LocalMemoryDevice constructor default and the Devices.CreateLogDevice
LocalMemory fallback. Validated: Tsavorite.test 206/0, test.hlog 509/0 (incl. the
reentrant-full-ring deadlock test), test.recovery 305/0, test.stress 48/0; format clean.

Co-authored-by: Copilot <[email protected]>
…s fast NVMe

The default device throttle (120 in-flight) caps queue depth and under-drives a
fast NVMe, yielding a misleadingly low IOPS number (~500K on a Dell P5600 that
sustains ~750K). Two config factors, both surfaced via device.bench + iostat:

  - throttle 120 caps QD; >=512 is needed to saturate the device queue.
  - a small --num-keys yields a small log LBA span; the device itself delivers
    fewer IOPS over a small span (engages fewer NAND channels). device.bench raw
    measured 554K over a 1.2GB span vs 746K over 16GB on the same drive.

With --device-throttle >=512 and a keyspace large enough to span the device
(>=100M keys here), KV.benchmark reaches 734K vs the 751K device.bench/fio
ceiling (within ~2%) — i.e. the engine already tracks the device ceiling; the
gap was a measurement-config artifact, not an engine limit. Clarify the help
text so the default no longer reads as the device's true capability.

Co-authored-by: Copilot <[email protected]>
…inel

Two review findings addressed (no must-fix issues were found):

1. (should-fix) EmptyHeapContainer was a process-wide singleton exposed by ref T via
   ReadCompletionCallback(ref TInput input); a callback that writes through the ref would
   corrupt the value seen by empty-input reads in OTHER sessions. Scope the empty container
   per session (TsavoriteExecutionContext.EmptyInputContainer, lazily created) and reset it to
   default on Dispose, so cross-session leakage is impossible and intra-session reuse is clean.
   Removes the Unsafe.As reinterpret at the use site.

2. (consider) LocalMemoryDevice's drain-thread sentinel (ProcessorThreadRingIdx) is process-wide,
   so a completion callback on device A's drain thread submitting IO to device B would inline-
   complete on B, bypassing B's ring/throttle/latency. Pair the sentinel with a per-thread owner
   reference and inline only when ReferenceEquals(owner, this); a cross-instance submit routes
   through this instance's ring. Guard the sentinel in ring selection to avoid index underflow.

The review also flagged that making SectorAlignedBufferPool's ctor private (while the type stays
public) breaks external 'new' callers. This is intentional: the pool is deliberately Shared-only
and no external construction compatibility is needed.

Co-authored-by: Copilot <[email protected]>
…tinel

Two findings from a second review pass:

1. (must-fix) The empty-input shared-container optimization aliases one per-session
   EmptyHeapContainer across coexisting CompletedOutputs. CompletePendingWithOutputs
   transfers pendingContext.input into each CompletedOutput (CompletedOutput.TransferFrom),
   whose Input property exposes a mutable `ref` (ref inputContainer.Get()). Multiple empty-
   input reads completed in one batch therefore share the same container, and one output's
   Dispose() resets the value the others still reference. Per-session scoping (the round-1
   fix) cannot isolate coexisting outputs. Revert the optimization: every PinnedSpanByte
   input again rents a per-op SpanByteHeapContainer (an empty input early-returns without a
   buffer, so the rent is cheap and fully isolated). Removes EmptyHeapContainer and the
   per-session EmptyInputContainer, eliminating the shared-mutable-by-ref footgun class
   (both this and the round-1 ReadCompletionCallback surface).

2. (should-fix) LocalMemoryDevice.AssignRingIdx unconditionally cached the assigned ring
   index into t_ringIdxPlusOne, clobbering a drain thread's process-wide sentinel when that
   thread routed a rare cross-instance submit. That would disable same-instance reentrant
   inline completion and risk self-deadlock (a later submit back to this instance spinning
   in WaitForSlotEmpty on a ring only this thread drains). Guard the cache write: preserve
   the sentinel; a drain thread routing cross-instance just round-robins per call.

Co-authored-by: Copilot <[email protected]>
localMemoryRingCapacity was appended after the ILogger logger parameter, breaking the
convention that logger is always last. Move it ahead of logger (matching the existing
<param> doc order); all callers use named arguments for these tail parameters, so the
reorder is source-compatible. Also fill in the empty logger <param> doc.

Co-authored-by: Copilot <[email protected]>
…essorOwner

t_processorOwner is set once on the drain thread and never cleared, which is correct only
because drain work runs on dedicated threads (never pooled). Note the invariant so a future
move to pooled drain threads doesn't silently leak the identity to reused threads.

Co-authored-by: Copilot <[email protected]>
This branch had grown ThreadLocalCache<T> from a fixed-capacity TLS cache into a
geometric-growth one with a per-call maxCapacity parameter. The only consumer
(PooledArrayMemoryPool) calls TryReturn(item) with no maxCapacity, so it runs at the
default and the growth path is never exercised — and the design's own note already
recorded that raising the cap gives no throughput change because the global-queue
fallback is not the bottleneck at these rates. Drop the unused complexity: restore the
simpler fixed-Capacity TryReturn(item) (back to main) and fix the doc cref.

Co-authored-by: Copilot <[email protected]>
Two Copilot-reviewer findings on PR #1867:
1. AsyncIOContext XML doc still listed "dictionary values" as a pending-IO storage site, but the
   ioPendingRequests dictionary was removed in this PR (the async path now flows through the
   per-session ready queue). Drop the stale reference.
2. KV.benchmark mapped --device-throttle to a power-of-two ring via `while (r < throttle) r <<= 1`,
   which overflows to a negative r and spins forever for user-supplied values > 2^30. Make it
   overflow-safe: cap at the largest power-of-two int (1 << 30).

Co-authored-by: Copilot <[email protected]>
…r path

LocalMemoryDevice has no device-wide Throttle()/numPending counter (it inherits
Throttle() => false), so setting IDevice.ThrottleLimit on it is a no-op. Its per-ring
SPSC backpressure (producer blocks on a full ring) is the only in-flight bound. The KV
benchmark already maps --device-throttle onto the ring capacity for this reason; do the
same on the server path so --device-throttle-limit actually bounds in-flight when the
server runs with --device-type localmemory (previously the ring was always the default).

LocalStorageNamedDeviceFactory.Get now rounds the throttle up to a power of two (via
Utility.NextPowerOf2, capped at 1<<30 for overflow safety) and passes it as
localMemoryRingCapacity for DeviceType.LocalMemory; other device types are unchanged
(they continue to honor ThrottleLimit). Updated the factory/creator param docs and the
--device-throttle-limit help text.

Verified end-to-end: server started with --device-type localmemory --device-throttle-limit
2048/3000 logs ringCapacity=2048/4096 (power-of-two rounding) and serves PING/SET/GET.

Co-authored-by: Copilot <[email protected]>
@badrishc badrishc force-pushed the badrishc/io-local-mem branch from 7a2e7f8 to 7c77aa0 Compare June 11, 2026 03:51
badrishc and others added 2 commits June 10, 2026 21:18
Mirror the CLI help: for DeviceType=LocalMemory (which has no device-wide throttle)
DeviceThrottleLimit instead sets the per-ring in-flight capacity, rounded up to a power
of two.

Co-authored-by: Copilot <[email protected]>
Per review discussion: readyResponses is AsyncQueue<AsyncIOContext> (not the generic
PendingIoContext) because the allocator's IO-completion callback that enqueues into it is not
generic — it only has the base AsyncIOContext. Every entry is in fact a PendingIoContext, which
the run-thread drain recovers by downcast where the session's type arguments are in scope.

Co-authored-by: Copilot <[email protected]>
@TedHartMS TedHartMS merged commit 9a10cf5 into main Jun 11, 2026
195 checks passed
@TedHartMS TedHartMS deleted the badrishc/io-local-mem branch June 11, 2026 04:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants