Skip to content

More IO optimization#1854

Merged
badrishc merged 38 commits into
mainfrom
badrish/io-optimization
Jun 9, 2026
Merged

More IO optimization#1854
badrishc merged 38 commits into
mainfrom
badrish/io-optimization

Conversation

@badrishc

@badrishc badrishc commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

No description provided.

Comment thread libs/server/Resp/BasicCommands.cs Fixed
@badrishc badrishc marked this pull request as ready for review June 5, 2026 21:39
Copilot AI review requested due to automatic review settings June 5, 2026 21:39

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes a broad set of IO-path performance and tooling updates across Tsavorite and Garnet: it reduces per-IO copying/allocations on the Tsavorite pending-read path (notably around AsyncIOContext/wrapper pooling), improves Garnet RESP scatter-gather GET behavior (and enables it by default), and adds a syscall-free LocalMemory IDevice backend plus a dedicated Device.benchmark project for measuring raw IDevice IOPS.

Title/description alignment: The current title (“More IO optimization”) is too vague for the scope (new device backend + new benchmark + default behavior change + engine changes). Consider a more searchable title like:
[Tsavorite][RESP] Reduce pending-IO overhead; add LocalMemory device; enable SG GET by default

Changes:

  • Tsavorite pending-IO optimizations: AsyncIOContext becomes a pooled reference type; add TLS + global pooling for AsyncGetFromDiskResult wrappers.
  • Add LocalMemory IDevice with per-thread ring submission and a new Device.benchmark for backend IOPS characterization.
  • Garnet scatter-gather GET improvements (scratch-slot backing + rewind) and default enablement; docs + tests updated accordingly.

Reviewed changes

Copilot reviewed 46 out of 47 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
website/docs/welcome/compatibility.md Updates compatibility note to reflect SG-GET enabled-by-default semantics.
website/docs/getting-started/configuration.md Updates --sg-get option description (MGET always uses SG).
test/standalone/Garnet.test/ScratchBufferAllocatorTests.cs Adds unit tests for ScratchBufferAllocator.TryRewindToOffset behavior.
test/standalone/Garnet.test/RespGetLowMemoryTests.cs Adds mixed GET/SET pipeline test + MGET-on-disk test with sg-get disabled.
libs/storage/Tsavorite/cs/Tsavorite.slnx Adds Device.benchmark to Tsavorite solution.
libs/storage/Tsavorite/cs/test/TestUtils.cs Renames LocalMemory latency arg to microseconds and updates defaults.
libs/storage/Tsavorite/cs/test/test.recovery/ObjectIterationTests.cs Updates LocalMemoryDevice ctor arg from latencyMs to latencyUs.
libs/storage/Tsavorite/cs/test/test.hlog/LogShiftTailStressTest.cs Updates LocalMemory latency usage to microseconds.
libs/storage/Tsavorite/cs/test/test.hlog/DeviceLogTests.cs Updates LocalMemory latency usage to microseconds.
libs/storage/Tsavorite/cs/test/LowMemoryTests.cs Updates LocalMemory latency usage to microseconds.
libs/storage/Tsavorite/cs/test/BasicTests.cs Updates helper signatures/uses from ms to us latency.
libs/storage/Tsavorite/cs/src/core/Utilities/ThreadLocalCache.cs Adds generic TLS LIFO cache helper for two-tier pooling.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/TsavoriteThread.cs Returns pooled AsyncIOContext on completion and re-pend; refactors comments.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Tsavorite.cs Adds rationale comment for keeping PendingContext stack-allocated.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/HandleOperationStatus.cs Rents AsyncIOContext from per-session pool; adjusts defaults for ref-type context.
libs/storage/Tsavorite/cs/src/core/Index/Common/ExecutionContext.cs Adds per-session AsyncIOContext pool with rent/return helpers.
libs/storage/Tsavorite/cs/src/core/Device/SpscRing.cs Adds bounded ring with lost-wakeup-safe consumer wait + MPSC enqueue semantics.
libs/storage/Tsavorite/cs/src/core/Device/LocalMemoryDevice.cs Reworks LocalMemory device: rings + processor threads, lazy segments, µs latency simulation.
libs/storage/Tsavorite/cs/src/core/Device/DeviceType.cs Adds DeviceType.LocalMemory enum value + docs.
libs/storage/Tsavorite/cs/src/core/Device/Devices.cs Extends CreateLogDevice to construct LocalMemoryDevice and adds new params.
libs/storage/Tsavorite/cs/src/core/Allocator/ObjectSerialization/IStreamBuffer.cs Removes duplicate sentence in comment for InitialIOSize.
libs/storage/Tsavorite/cs/src/core/Allocator/ConditionallyHoistedKey.cs Updates comment to reflect new AsyncIOContext ref-type and focus on PendingContext.
libs/storage/Tsavorite/cs/src/core/Allocator/AsyncIOContext.cs Converts to sealed class, adds Reset, adjusts completion-event Set semantics.
libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs Adds TLS+global pooling for AsyncGetFromDiskResult wrappers; adjusts callback path.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/README.md Rewrites/condenses README; documents 3-layer stack and LocalMemory scenario.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Options.cs Adds localmemory device option + updates help/validation text.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.Worker.cs Updates BasicContext.Read call signature usage.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.Setup.cs Adds LocalMemory device sizing/creation logic for KV.benchmark.
libs/storage/Tsavorite/cs/benchmark/Device.benchmark/README.md Adds new Tsavorite-scoped Device.benchmark README (NVMe + LocalMemory scenarios).
libs/storage/Tsavorite/cs/benchmark/Device.benchmark/Program.cs Updates output + adds LocalMemory construction path and aggregation logic.
libs/storage/Tsavorite/cs/benchmark/Device.benchmark/Options.cs Adds LocalMemory device-type docs and clarifies completion-thread semantics.
libs/storage/Tsavorite/cs/benchmark/Device.benchmark/DeviceUtils.cs Adds sector-aligned read/write helpers (note: alignment bug flagged in review).
libs/storage/Tsavorite/cs/benchmark/Device.benchmark/Device.benchmark.csproj Adds new benchmark project under Tsavorite benchmarks.
libs/storage/Tsavorite/cs/benchmark/Device.benchmark/BenchWorker.cs Switches to per-worker counters to avoid global atomic contention.
libs/server/Servers/GarnetServerOptions.cs Sets EnableScatterGatherGet default true + clarifies doc comment.
libs/server/Resp/MGetReadArgBatch.cs Removes non-SG MGET batch implementation (MGET always uses SG now).
libs/server/Resp/BasicCommands.cs SG-GET: scratch-slot backing, rewind savepoint, and fast prefix peeking (note: predicate bug flagged).
libs/server/Resp/ArrayCommands.cs Always uses SG MGET batch and drains pending together.
libs/server/ArgSlice/ScratchBufferAllocator.cs Adds TryRewindToOffset API and docs.
libs/host/defaults.conf Updates default EnableScatterGatherGet to true and comment.
libs/host/Configuration/Options.cs Updates --sg-get help text + defaults option to true.
libs/common/Memory/PooledArrayMemoryPool.cs Switches TLS caching to Tsavorite.core.ThreadLocalCache<T>.
Garnet.slnx Removes old top-level benchmark/Device.benchmark project reference.
benchmark/Resp.benchmark/README.md Condenses README; aligns with Device/KV layer model and LocalMemory scenario.
benchmark/README.md Updates links/description now that Device.benchmark lives under Tsavorite.
benchmark/Device.benchmark/README.md Deletes legacy Device.benchmark README under root benchmark folder.
benchmark/Device.benchmark/Device.benchmark.csproj Deletes legacy Device.benchmark project under root benchmark folder.

Comment thread libs/server/Resp/BasicCommands.cs Outdated
Comment thread libs/storage/Tsavorite/cs/src/core/Device/Devices.cs Outdated
@badrishc badrishc force-pushed the badrish/io-optimization branch 3 times, most recently from b722b17 to 2fe9f0f Compare June 6, 2026 21:04
Comment thread libs/storage/Tsavorite/cs/src/core/Device/LocalMemoryDevice.cs Outdated
Comment thread libs/storage/Tsavorite/cs/test/LowMemoryTests.cs Outdated
Comment thread test/standalone/Garnet.test/ScratchBufferAllocatorTests.cs Outdated
badrishc and others added 21 commits June 8, 2026 17:57
Two surgical changes to the pending-read path so initial speculative reads
no longer cross 4 KB NAND page boundaries on NVMe with libaio:

1. IStreamBuffer.InitialIOSize: 4096 -> 128 bytes.

   The previous default (Environment.SystemPageSize = 4096 on Linux x64)
   meant every initial speculative read rounded up to ~4-8 KB. Combined
   with sector-aligned (512 B) record offsets, ~75% of reads of small
   records crossed a 4 KB NAND-page boundary, forcing the NVMe controller
   to fetch two NAND pages per IO and roughly doubling per-IO service
   time at the device (~0.92 ms vs ~0.67 ms for sector-aligned 4 KB reads).

   The new default (128 B) rounds up to one device sector (typically 512 B)
   which always fits within a NAND page AND comfortably captures any
   small-record-shaped payload (header + small key + small value) from
   any valid_offset position within the sector. For Garnet's typical
   workload of 8 B key + ~100 B value (=128 B aligned record), every
   pending read completes in a single IO with no re-read regardless of
   where the record starts within the sector. Records larger than what
   fits in the speculative read trigger a precise re-read via
   VerifyRecordFromDiskCallback with the now-known recordLength, same
   as before.

   We empirically swept InitialIOSize at 32 / 128 / 256 / 512 / 4096 on
   the 100M x 100B workload. At 32 B, ~33% of operations triggered re-
   reads (records starting at 384/512 within a sector had only 128 valid
   bytes available, which was tight relative to the actual recordLength
   on the wire), dropping throughput to 578 K ops/sec. At 128 B, 0 re-
   reads were observed and throughput hit the device ceiling. Values
   above 128 progressively re-introduce the NAND-crossing penalty.

2. AsyncGetFromDiskCallback: subtract valid_offset when recording
   available_bytes from the IO transfer count.

   available_bytes counts valid bytes starting at GetValidPointer()
   (= aligned_pointer + valid_offset), but the prior code stored the
   raw device transfer count (which starts at aligned_pointer). With
   the previous 4 KB InitialIOSize this only mis-counted by up to 511
   bytes (max valid_offset for 512 B sectors), which rarely tripped
   the recordLength check. With the smaller speculative read it becomes
   essential — without this fix, a 512 B read with valid_offset of, say,
   384, would set available_bytes=512 while only 128 valid bytes are
   actually available from the record start.

Measured on 100M x 100B uniform random reads, 16 MB log, libaio,
throttle 512, NUMA-pinned:

  threads | main      | this branch | Device.bench (ceiling)
  --------+-----------+-------------+-----------------------
    1     | 176K      | 184K        | 249K
    8     | 552K      | 729K (+32%) | 739K

  At t=8, device-side iostat: r/s 567K -> 721K, r_await 0.92 -> 0.70 ms,
  rareq-sz 4.37 -> 0.60 KB. KV.benchmark now matches Device.benchmark's
  per-IO device ceiling within 1.4%.

Ported minimally from tedhar/io-4k branch (commits 4f4fbfe, e9ef3b0).
That branch carries a broader refactor (configurable per-Read/RMW
InitialIORecordSize, test conversions for 4K sector handling); this
commit applies only the two settings needed to close the device-side
gap on the libaio fast path, leaving the broader API surface alone.

Co-authored-by: Copilot <[email protected]>
…t path

Two surgical refactors that together remove every Buffer.BulkMoveWithWriteBarrier
call site on the single-thread KV.benchmark hot path. Profiling showed PollGCWorker
samples accounting for ~18% of worker CPU at single-thread libaio random reads, all
nested under BulkMove invocations (mis-attribution by the .NET sampler, but the
underlying BulkMove sites were real). After this change, profiles show
PollGCWorker=0%, BulkMove=0%, with the freed sample-budget cleanly attributed to
the actual NativeStorageDevice.ReadAsync syscall (71% -> 88%).

1. AsyncIOContext: struct -> sealed class.

   Was a ~112-byte struct with 5+ object refs that the pending-read path copied
   into AsyncGetFromDiskResult.context on every IO (one BulkMove per Read).
   Now an instance type with a per-session pool (RentAsyncIOContext /
   ReturnAsyncIOContext on TsavoriteExecutionContext). The wrapper field store
   becomes a single 8-byte reference, no BulkMove. Callers that previously did
   request equals default allocate-or-rent (or pass null when not used). The
   AsyncIOContextCompletionEvent.request field stays — class refs work the same
   way except Reset() replaces equals default.

2. PendingContext stored via PendingContextHolder<TIn, TOut, TCtx> wrapper.

   Was a ~200-byte struct that ioPendingRequests.Add bulk-copied into the
   dictionary slot (second BulkMove per IO). The new holder class wraps the
   struct as a public field; ContextRead (the KV.benchmark hot entry) rents
   a holder upfront so 'ref holder.value' IS the PendingContext used by all
   'ref PendingContext' callees. The dict then stores an 8-byte holder
   reference. Holder pool on TsavoriteExecutionContext mirrors the IOContext
   pool. HandleOperationStatus accepts the holder via an optional parameter
   with null fallback (the fallback rents-and-copies for legacy entry sites:
   AllocatorScan, ConditionalCopyToTail, the IOContext-driven re-read path).

Both changes are independent and additive. Either alone removes ~50% of the
BulkMove sample budget; together they remove 100%.

Per-op throughput (3 iterations, 100M x 100B, libaio, throttle=512, NUMA-pinned):

  threads | baseline   | this change      | Device.benchmark (ceiling)
  --------+------------+------------------+----------------------------
    1     | 181K       | 182K (+0.6%)     | 248K (4.03 us/op)
    8     | 740K       | 733K (-1%)       | 739K

Throughput is essentially flat: the time previously attributed to PollGCWorker
was sampler-side mis-attribution of trap-flag-induced poll-loop CPU INSIDE
BulkMove. Removing the BulkMove sites doesn't reduce total CPU — the same
worker-thread time is now (more accurately) attributed to ReadAsync, which is
where the io_submit syscall is actually parked. The single-thread gap to
Device.benchmark (4.86 us/op vs 4.03 us/op = +21%) is the residual Tsavorite
per-op CPU: hash lookup, RECORD_ON_DISK flow, AsyncGetFromDisk, pool
rent/return, session epoch Resume/Suspend. Closing further would require
leaner per-op control flow rather than further struct-copy elimination.

Architectural wins that remain regardless of the throughput-neutral result:
* Pending-read path is now allocation-light: alloc/wkr at t=1 drops ~40%
  (~1.01 MB / 15s to ~0.63 MB / 15s) since per-IO transient AsyncIOContext
  bytes are now reused pool instances.
* Dictionary storage shrinks from ~200B per entry to 8B per ref
  (ioPendingRequests memory footprint scales with in-flight IOs, not
  pending-context size).
* Future per-op CPU reductions (inlining, smaller hot loop) can now be
  measured cleanly because the sampler will attribute their savings
  correctly instead of being masked by BulkMove mis-attribution.

All 159 smoke tests pass.

Co-authored-by: Copilot <[email protected]>
Production-quality overhaul of the in-process 'fast device' backend, intended for
benchmarks that need to characterize the upper bound of Tsavorite/Garnet
throughput without paying any kernel-syscall cost.

Why
---
The old implementation had a broken latency model (sleep only fired on the first
item dequeued from an idle queue — backlog masked the latency entirely) and
routed by segmentId%parallelism, which made every queue an MPSC queue with the
tail-CAS bouncing across all submitter threads. Throughput collapsed past 4-8
threads instead of scaling.

Design
------
- New SpscRing<T>: bounded, lock-free, single-producer / single-consumer ring
  with cache-line-padded head/tail and producer/consumer cached counters so the
  hot path is one volatile load + one volatile store with no cross-cache-line
  traffic when the ring is neither empty nor full.
- LocalMemoryDevice now uses one SpscRing per IO processor thread. Submitters
  are bound to a ring via ThreadStatic routing (round-robin on first use), so
  each ring sees exactly one producer thread when #submitter threads ≤
  parallelism. This is the standard production pattern used by io_uring
  per-thread rings and SPDK per-CPU NVMe queue pairs.
- If round-robin assignment wraps past parallelism (more submitters than rings),
  the device transparently falls back to MPSC via a per-ring spinlock — kept
  off the fast path until the first collision is detected.
- Latency simulation now uses Stopwatch.GetTimestamp and a per-request deadline
  measured from enqueue, so every IO respects the configured latency regardless
  of queue depth (the old model only delayed the first item out of an idle queue).
- Lazy per-segment GC.AllocateArray(pinned: true) instead of allocating the full
  capacity upfront. Capacity is sized to the dataset in callers.
- Background, named threads (TsavoriteLocalMemIO-N).
- Proper bounds checking on segmentId; ENOENT-equivalent callback for unwritten
  segments instead of array-out-of-bounds.
- Drains pending requests during Dispose before joining processor threads.

Sync mode removed
-----------------
The previous parallelism=0 'inline' mode (callback fires on submitter thread)
was non-physical — no real high-IOPS device works that way; io_uring, SPDK,
and persistent-memory backends all have submit/complete on separate hardware
queues. Removed to keep the device an honest simulation of real fast IO.

Benchmark wiring
----------------
- Device.benchmark gains --local-memory, --local-memory-parallelism,
  --local-memory-latency-ms. Per-worker counters in BenchWorker eliminate the
  Interlocked.Increment(ref Program.totalCompletedOk) cache-line ping-pong
  that was capping device.benchmark at ~12M ops/sec regardless of device.
- KV.benchmark gains --localmem-parallelism, --localmem-latency-ms. Device
  capacity is sized to fit the dataset (Keys * (8 + ValueSize + 16) rounded
  up to segment size, plus 4 segments slack). --validate is honored.

Results on a Dell P5600 NVMe box, 80 cores NUMA node 0 (numactl pinned)
-----------------------------------------------------------------------
Device.benchmark + LocalMemoryDevice, 512B reads, parallelism = threads:
  threads  1    4    8    16    32    40    64
  ops/sec  3M   7M   12M  22M   28M   82M   73M
  peak 81.8M ops/sec = 41.8 GB/s memcpy (84% of DDR4 dual-channel BW)

KV.benchmark + LocalMemoryDevice (100M keys × 100B, log=16MB, hashpack=1):
  threads  1     4     8     16    32
  ops/sec  927K  1.04M 926K  784K  615K
The device is no longer the bottleneck — Tsavorite-side scaling investigation
to follow.

Co-authored-by: Copilot <[email protected]>
Promotes LocalMemoryDevice to a first-class DeviceType enum member so callers
can construct it via Devices.CreateLogDevice instead of bypassing the factory.

Changes
-------
- DeviceType: new LocalMemory = 5 entry, sitting between AzureStorage and Null.
- Devices.CreateLogDevice: new optional params (localMemorySegmentSize,
  localMemoryParallelism, localMemoryLatencyMs) that are honored only when
  deviceType == LocalMemory. Existing signature stays binary-compatible for
  all other backends (the new params have defaults). LocalMemory enforces a
  positive capacity, defaults parallelism to Environment.ProcessorCount, and
  accepts an optional virtual logPath as the device file-name identifier.
- KV.benchmark: --device localmemory now parses to DeviceType.LocalMemory and
  the setup path constructs the device via Devices.CreateLogDevice. Removes
  the prior string-match hack in CreateDevice.
- Device.benchmark: replaces the bespoke --local-memory bool flag with the
  standard --device-type LocalMemory, also routed through Devices.CreateLogDevice.
  The --local-memory-parallelism / --local-memory-latency-ms flags remain as
  LocalMemory-specific knobs.

Tested
------
- Tsavorite.test (BasicTests, LowMemoryTests, DeviceLogTests,
  LogShiftTailStressTest) all pass.
- Device.benchmark --device-type LocalMemory yields ~70M ops/sec at p=32,t=32.
- KV.benchmark --device localmemory --validate reports [validate] OK.

Co-authored-by: Copilot <[email protected]>
…lism knob

DeviceType.LocalMemory's 'number of SPSC ring drainer threads' is
conceptually the same parameter as DeviceType.Native's numCompletionThreads
('background IO completion drain threads, each bound 1:1 to its own
queue/ring with submitters distributing via per-thread affinity'). Having
both was redundant.

Changes
-------
- Devices.CreateLogDevice: dropped the localMemoryParallelism parameter;
  LocalMemory now uses numCompletionThreads (defaulting to
  Environment.ProcessorCount when unspecified). Doc comment updated to
  describe the unified semantics across Native and LocalMemory.
- KV.benchmark: removed --localmem-parallelism; LocalMemory now reuses
  --device-completion-threads. Help text for --device-completion-threads
  updated to cover both backends.
- Device.benchmark: removed --local-memory-parallelism; LocalMemory now
  reuses --completion-threads. Help text updated accordingly.
- Both benchmark callers updated to pass the resolved value through
  numCompletionThreads and log the source in their startup banner.

Verified
--------
- Device.benchmark --device-type LocalMemory --completion-threads 32 -t 32:
  ~70M ops/sec (unchanged from the previous bespoke flag).
- KV.benchmark --device localmemory --device-completion-threads 4 --validate:
  reports [validate] OK.
- BasicTests pass.

Co-authored-by: Copilot <[email protected]>
…ded)

The processor loop used to spin via SpinWait.SpinOnce when the ring was empty.
SpinWait does eventually escalate to Thread.Sleep(1), but the actual measured
cost on an EPYC system was ~3.3% CPU per idle processor — 32 idle processors
collectively burnt a whole core just polling. With KV.benchmark sweeping
--device-completion-threads 32 across smaller --threads counts, this stole real
CPU from the workers.

This change adds proper park/wake to SpscRing:

  - SpscRing owns a ManualResetEventSlim and a wake-needed flag, both placed on
    their own cache lines (away from _tail and _head).
  - Producer hot path: after publishing _tail (release), insert one
    Interlocked.MemoryBarrier (mfence on x86 ~3-5ns) and read _wakeNeeded
    (L1 hit in S state when consumer is awake). If non-zero, call Set.
    In steady state the consumer is awake, _wakeNeeded == 0 stays cache-stable
    in producer's L1, and no event ops happen — net hot-path cost ≈ one mfence.
  - Consumer WaitForWork: brief spin to catch back-to-back work, then
    Reset → Interlocked.Exchange(_wakeNeeded, 1) (full fence) → recheck tail
    AND _stopped → Wait (no timeout). Lost-wakeup-safe under x86 TSO via the
    fence pairing.
  - SpscRing owns _stopped. Stop() sets _stopped then Set()s the signal.
    The consumer's post-fence recheck of _stopped catches the shutdown signal
    even when its Reset wipes Dispose's Set, so we need neither a Wait timeout
    nor a per-iteration delegate callback.
  - WaitForWork reads _stopped as a direct field load (not via Func<bool>),
    so there's no virtual call cost in the wait path.

LocalMemoryDevice loses its own 'terminated' field; lifecycle is per-ring now,
which is cleaner (each ring is self-contained: producer publishes, consumer
parks, Stop tears it down).

Verified on Dell P5600 / 80-core NUMA node 0
--------------------------------------------
- Idle (32 processors, no submits, 10 s):
    Before: 32 × 3.3% CPU = ~105% of one core (always-on polling)
    After:  32 × 0.0% CPU = 0% (kernel-parked on futex)
    Total process CPU at 10s: 2.1% (just .NET runtime threads)
- Wake latency after parking (first read after 500 ms idle):
    Median ~175 µs (kernel scheduler), occasional ~1.5 ms cold-start.
- Device.benchmark scaling unchanged within noise:
    t=1: 1.8M, t=8: 27.6M, t=16: 61.2M, t=32: 76.6M, t=40: 78.4M
    (peak was 76.4M / 79.8M before; difference is run-to-run variance).
- Tsavorite BasicTests pass.

Co-authored-by: Copilot <[email protected]>
…llback

CPU profile of KV.bench at t=16 (parents-of-CPU_TIME) revealed two distinct
hotspots in the device layer:

  11.8% Monitor.Enter_Slowpath  (45 sec / 384 sec total managed CPU)
  12.3% Thread.Sleep            (47 sec)

Both came from the LocalMemoryDevice's SPSC ring infrastructure:

(1) Monitor.Enter_Slowpath: the previous design had a *global* ringNeedsLock
    flag that, once any 33rd-unique submitter thread showed up (a ThreadPool
    flush helper during load), forced EVERY subsequent Enqueue on EVERY ring
    through 'lock (ringEnqueueLocks[idx])' — even worker rings that were
    strictly SPSC. The Monitor.lock fast path is reasonable but contended; its
    slow path goes through a kernel futex which is ~hundreds of ns.

(2) Thread.Sleep: SpscRing.WaitForWork used SpinWait.SpinOnce in its 'brief
    spin' phase before parking on the event. SpinOnce escalates to
    Thread.Yield / Sleep(0) / Sleep(1) within ~20 iterations, so on
    lightly-loaded rings the drainer would wake briefly, see empty, sleep
    for ~1 ms, wake, re-spin → pure CPU waste.

Fix
---
1. Replace SpscRing with a Vyukov-style MPSC ring:
     - Per-slot sequence number tells producer/consumer when the slot is
       empty/full/awaiting-reuse.
     - Producer enqueue is a single Interlocked.Increment on _tail to claim
       a unique slot index, followed by data store + sequence release-store.
       No CAS retry loop, no lock anywhere — graceful degradation from SPSC
       to MPSC under contention.
     - Consumer dequeue is a sequence-check + data load + sequence
       release-store (re-emptying the slot for the next round).
   This eliminates the need for ringNeedsLock / ringEnqueueLocks entirely.

2. Replace SpinWait.SpinOnce with bounded Thread.SpinWait(8) × 64
   (~1-2 µs total, pure busy, never escalates) in WaitForWork's spin
   phase. When the spin expires we park on the event indefinitely (0% CPU
   until the producer signals — already correct from the prior change).

3. Keep SpinWait.SpinOnce in the producer's ring-full backpressure path
   (slot not yet drained by consumer). Here the consumer is the one that
   needs CPU to drain, so SpinOnce's eventual Yield/Sleep IS correct —
   pure Thread.SpinWait would starve the consumer when the pair lands on
   hyperthread siblings.

LocalMemoryDevice simplification
--------------------------------
Dropped the global ringNeedsLock + ringEnqueueLocks[] entirely. Enqueue is
now a flat one-liner that just calls rings[idx].Enqueue — SpscRing handles
both SPSC and MPSC natively at uniform per-op cost.

Implications
------------
- Per-op cost: 1 Interlocked.Increment (~5 ns) + memcpy + 1 mfence + a
  couple of Volatile reads/writes. Around 10-15 ns of pure overhead per
  Enqueue, vs. ~5 ns for a strict SPSC ring with a plain tail store.
- No path to the kernel scheduler in the hot path; no Monitor / futex.
- SPSC and MPSC cases use identical code.

Device.benchmark scaling (parallelism = threads, 512 B sectors, batch 1024)
---------------------------------------------------------------------------
3 runs each, NUMA-pinned to node 0 (80 cores):
  t=1:  4.8-5.3 MIOps/s      (was 1.5)
  t=8:  33-36                (was 33)
  t=16: 59-63                (was 61)
  t=20: 64-70                (was 65)
  t=24: 66-73                (was 70)
  t=28: 74-76                (new)
  t=32: 74-82                (was 73)
  t=36: 79-83                (new)
  t=40: 37, 84, 84           (was 74-84) - edge case at threads==cores
  t=48: 74-78                (was 76)
Peak ~80-84 M IOPS at t=32-36; ~38-42 GB/s of memcpy ~84% of memory BW.
The t=40 outlier is consistent with OS scheduling pathology when total
threads exactly matches available cores (40 producers + 40 consumers = 80).

Verified: --validate passes, BasicTests pass.

Co-authored-by: Copilot <[email protected]>
…AS-claim wake

Two wins from a CPU profile of KV.bench at t=16 (LocalMemoryDevice, 10M keys,
100B values, log=16MB so 100% reads go pending):

(1) KV.bench Worker: drop the input-span from Read
    The bench was calling bContext.Read(key, ref pinnedInputSpan, ref _output, ...)
    with a 100-byte pinned input even for read-only ops. Tsavorite's
    CreatePendingReadContext -> CopyInputsForReadOrRMW -> SpanByteHeapContainer.Initialize
    then did a SectorAlignedBufferPool.Get(~108B) + 100-byte copy for EVERY
    pending read, hammering the global pool. Switching to the no-input overload
    bContext.Read(key, ref _output, ...) defaults TInput, so SpanByteHeapContainer.Initialize
    sees item.Length==0 and skips the buffer rental entirely.

(2) SpscRing: CAS-claim the wake flag
    Producer hot path used to call _signal.Set() on every Enqueue as long as
    _wakeNeeded == 1. A burst of N enqueues after the consumer parks would issue
    N redundant Sets (each Set internally takes a Monitor). New WakeIfParked
    helper atomically CASes _wakeNeeded from 1 to 0; only the winning producer
    calls Set. Subsequent producers in the same burst skip Set entirely.

Results: KV.bench --device localmemory --device-completion-threads 32, 10M x 100B,
log=16MB, 3 runs each:
  threads   before    after       delta
   t=1      0.94 M    0.57-0.99 M  (variable; t=1 small absolute)
   t=8      1.93 M    3.12 M       +62%
   t=16     0.90 M    2.83 M       +214%   ← the meaningful number
   t=32     0.63 M    2.22 M       +250%
Bench --validate still passes; tests still pass.

The remaining bottleneck is somewhere else (consumer drain rate). Next:
recapture profile and chase the new top frame.

Co-authored-by: Copilot <[email protected]>
Two-tier object pool for the AsyncGetFromDiskResult wrappers used on the
pending-read hot path: a ThreadStatic Stack<T> per worker thread (capacity 32)
absorbs back-to-back Get/Return cycles, falling back to the existing shared
ConcurrentQueue on miss/overflow.

Measured impact: at t=16 on KV.bench (LocalMemoryDevice), no significant
throughput change — the global ConcurrentQueue wasn't the actual bottleneck
the profile suggested (single-CAS per access; cache-line ping-pong is small
relative to the consumer-drain wait). Kept as a no-regression / cleaner-shape
change in preparation for the next investigation: the SectorAlignedMemory
pool (bufferPool) hit on each pending read, which has bigger size buckets
and may see more contention.

Co-authored-by: Copilot <[email protected]>
When a Vyukov-style producer claims a slot that the consumer hasn't yet
drained (ring full at this slot), WaitForSlotEmpty's Phase 2 was using
SpinWait.SpinOnce which escalates to Thread.Sleep(1) within ~20 iterations.
The 1 ms Sleep granularity is much coarser than the consumer's actual drain
time (~167 ns per slot), so producers were stalled for 1 ms each time the
ring filled even though the slot would be free in microseconds.

Switching to plain Thread.Yield drops only the current scheduler quantum
(typically µs on Linux) without parking, so the producer comes back as soon
as it is re-scheduled and re-checks the slot. Tradeoff: Yield uses CPU
during the wait whereas Sleep doesn't, but the ring-full case is rare in
steady state and the producer is the consumer's only customer, so giving
up CPU briefly to let the consumer run is the right behavior.

Measured at t=16 KV.bench (LocalMemoryDevice, 10M x 100B):
  before: 2.83 M ops/sec
  after:  2.93 M ops/sec  (+3.5%)

Small win — the dominant remaining bottleneck is the per-completion work
on the worker side (InternalCompletePendingRequestFromContext ~5 us per
op), not the producer's ring-full wait. But the change is correctness-
neutral and removes a misleading hot frame from the CPU profile.

Co-authored-by: Copilot <[email protected]>
…hout explicit sizing

Devices.CreateLogDevice previously threw on DeviceType.LocalMemory when capacity
was unspecified (the common case for LocalStorageNamedDeviceFactory, which
doesn't pass a capacity through). This made it impossible to plug LocalMemory
into Garnet (or any framework using the standard factory pattern) without
manually wiring a capacity argument all the way down.

Since segments are already lazily allocated via EnsureSegment, the "capacity"
field only sizes the per-segment lookup arrays (byte[][] segmentArrays +
byte*[] segmentPtrs). Defaulting it to a generous bounded value
(DefaultMaxSegments = 4096 × sz_segment = 4 TB at the 1 GB default segment)
costs only ~32 KB of pointer-array overhead, while letting callers pass
CAPACITY_UNSPECIFIED (or <=0) without error.

Co-authored-by: Copilot <[email protected]>
Each pending GET in a scatter-gather batch used to land in its own freshly
rented IMemoryOwner<byte> (PooledArrayMemoryPool wrapper + 128-byte ArrayPool
slice + the matching round-trips on Dispose), because the StringOutput passed
to GET_WithPending after the first pending op was always `new StringOutput()`
(IsSpanByte=false), which forces CopyRespTo down the heap-rent branch. Only
the first pending op got the direct-to-network-buffer path (its output was
the GetStringOutput()-backed one from the loop's initial state).

This change gives every post-first-pending submission a pre-reserved slot in
a per-session pinned scratch byte[]. The Reader's existing IsSpanByte fast
path writes the RESP-encoded response directly into the slot — same fast path
as the in-memory hit case. The wrap-up loop then memcpy's slots into the
network buffer in batch order (preserving RESP wire ordering despite
out-of-order IO completion). Rents are reserved for the rare overflow case
(response larger than PendingScratchSlotSize=128 B), at which point the
existing functionsState.memoryPool.Rent + ConvertToHeap path engages and the
overflow IMemoryOwner is disposed via SendAndReset in the emit loop.

Prefix-sync hits (sync ops before any pending) are unchanged — they still
ProcessOutput direct-to-network with dcurr advancing per op.

Scratch is grow-only across batches (geometric grow on first batch that
exceeds capacity, never shrunk), pinned via GC.AllocateArray, and the cached
byte* pointer is re-derived on each grow.

Measured on resp.benchmark disk-bound recipe (10M × 100B, 16 MB log,
LocalMemoryDevice, --sg-get true, identical load):

  threads | baseline | new      | delta
  --------+----------+----------+--------
       1  |  419 K   |  651 K   | +55.4 %
       8  | 2061 K   | 2360 K   | +14.5 %
      16  | 2699 K   | 2777 K   |  +2.9 %

The win is largest at low concurrency where per-op allocation cost dominates;
at high concurrency the wake/park round-trip in the IO completion path
becomes the binding constraint and this change buys back only the few percent
of allocation overhead that remained.

Co-authored-by: Copilot <[email protected]>
…ledArrayMemoryPool

The two two-tier pools shipped earlier in this PR (commit 583aef4 for
AsyncGetFromDiskResult, plus PooledArrayMemoryPool in libs/common) used
different backing structures (Stack<T> vs T[]+count) and different cap
names (AsyncGetFromDiskResultCacheCapacity vs TlsCacheSize). This commit
unifies both on the same shape:

  [ThreadStatic] private static T[] tlsCache;
  [ThreadStatic] private static int tlsCount;
  private const int TlsCacheCapacity = 64;

with the same bump-pointer LIFO Rent/Return code. The T[]+count form is
chosen over Stack<T> to avoid virtual dispatch on the hot path; the
behavioral semantics (LIFO ordering, single-thread access, lazy alloc)
are identical.

Cap reconciled at 64 for both (was 32/64). Tested against KV.bench and
resp.bench at the disk-bound LocalMemoryDevice recipe:

  KV.bench (3 iters):
    t |   cap=32   |   cap=64
    1 | 983 K/sec  | 986 K/sec
    8 | 3.46 M/sec | 3.45 M/sec
   16 | 3.06 M/sec | 3.05 M/sec
  → indistinguishable (<1% delta, within noise)

  resp.bench (mean of 3 sweeps):
    t |   cap=32   |   cap=64
    1 |  640 K/sec |  610 K/sec
    8 | 2.29 M/sec | 2.45 M/sec  (+6.7%)
   16 | 2.57 M/sec | 2.58 M/sec
  → cap=64 modestly better at t=8, flat elsewhere

Memory cost at cap=64 per thread that uses each pool: ~512 B of
references in the backing array. With ~160 active worker threads on a
large Garnet deployment the per-pool footprint is ~80 KB — trivial.

We tested raising the cap to 1024 (to fully absorb the SG-GET burst
pattern of 1024 pending ops between drains) on both pools earlier and
saw zero throughput change despite TLS hit rate going from <2% to 99%+.
Conclusion: the shared ConcurrentQueue fallback is not the bottleneck
at the rates this code operates at; cap value is cosmetic past ~32.

Co-authored-by: Copilot <[email protected]>
Both AsyncGetFromDiskResult and PooledArrayMemoryPool need the same
per-thread LIFO cache fronting a global ConcurrentQueue. The previous
commit (1cdbd84) harmonized the inline implementations but each
consumer still owned its own [ThreadStatic] T[] + int count fields.

This commit extracts that storage into a generic Tsavorite.core helper:

  ThreadLocalCache<T>.TryRent(out T item)
  ThreadLocalCache<T>.TryReturn(T item)
  ThreadLocalCache<T>.Capacity = 64

The TLS slot is keyed on the closed generic T, so the two consumers
(AsyncGetFromDiskResult<AsyncIOContext> and PooledBuffer) get separate
slots automatically — no cross-contamination, no per-pool pool-id
indexing needed. AggressiveInlining on TryRent/TryReturn means the JIT
hoists the array access right back into the consumer's hot path; the
generated code is functionally identical to the inline version that
preceded it.

A theoretical concern with [ThreadStatic] on a generic class is that
the runtime needs a per-closed-type TLS slot lookup (vs a compile-time
offset for non-generic ThreadStatic). Measured KV.bench and resp.bench
disk-bound recipes pre vs post refactor:

  KV.bench (3 iters):
    t |  pre (inline) | post (helper)
    1 | 986 K/sec     | 987 K/sec     (+0.1%)
    8 | 3.45 M/sec    | 3.47 M/sec    (+0.6%)
   16 | 3.05 M/sec    | 3.10 M/sec    (+1.6%)

  resp.bench (mean of 3 sweeps):
    t |  pre (inline) | post (helper)
    1 | 610 K/sec     | 604 K/sec     (-1.0%)
    8 | 2.45 M/sec    | 2.51 M/sec    (+2.5%)
   16 | 2.58 M/sec    | 2.63 M/sec    (+1.9%)

All within run-to-run noise; net effect slightly positive. The JIT
appears to fold the generic-TLS lookup well enough that there's no
measurable overhead vs the inline form. Net deletion: each consumer
loses ~25 lines of duplicated TLS plumbing and gains a one-line call.

If a third two-tier pool emerges (PendingContextHolder, heap-container,
etc.), it can now adopt the same pattern with two-line additions.

Co-authored-by: Copilot <[email protected]>
The struct→class conversion of AsyncIOContext introduced aliasing bugs:

- AsyncIOContextCompletionEvent.Set() disposed the just-read record because
  request and ctx are now the same pooled instance (use-after-free / NRE in
  iterator/scan disk reads, e.g. IterateKeyVersions, compaction). Guarded the
  dispose+reassign with ReferenceEquals.
- On re-pend (multi-hop chain walk), HandleOperationStatus dropped the stale
  diskLogRecord without disposing it (SectorAlignedMemory leak). Now disposes
  it (holder path is sole owner) before clearing.
- On re-pend, the completed AsyncIOContext was never returned to the per-session
  pool (orphan → per-hop allocation on the disk path). Now returned, making the
  rent/return lifecycle symmetric.
- Updated stale struct-era comments (AsyncIOContext is a class now).

Co-authored-by: Copilot <[email protected]>
…ntextRead

BDN Operations.RawStringOperations showed the read path regressed badly vs main:
GetFound +29% (13.15→16.97us), GetNotFound +63% (7.63→12.41us). Root cause: the
PendingContextHolder optimization made ContextRead rent a holder and construct the
~200-byte PendingContext on the heap for EVERY read — including the overwhelmingly
common in-memory read that never goes pending.

Revert ContextRead to a stack-allocated PendingContext (as before). The holder is now
rented lazily by HandleOperationStatus only when a read actually goes pending
(RECORD_ON_DISK), via the existing fallback path that copies the struct into a pooled
holder once. This keeps the dictionary's 8-byte-ref benefit and zero-GC on the disk path
while removing all holder cost from the in-memory path. Also dropped the now-unused holder
parameter from the 3-arg HandleOperationStatus overload.

After fix (BDN, net10.0): GetFound 12.51us (-4.9% vs main), GetNotFound 7.59us (parity);
all other RawStringOperations within noise. Disk path unchanged (gc=0/0/0).

Co-authored-by: Copilot <[email protected]>
…ed by A/B)

Verified each pending-path wrapper's value with an isolated A/B on the disk-bound
LocalMemory KV.bench (100% reads, t=1, gc=0/0/0, 5 iters each):

  no-holder + class AsyncIOContext : 1.064M ops/sec   (best)
  holder    + class AsyncIOContext : 1.029M ops/sec   (holder costs -3.4%)
  no-holder + struct AsyncIOContext: 1.021M ops/sec   (struct AIC costs -4.0%)

Conclusion:
- AsyncIOContext as a pooled class IS useful (+4.0%): the IO crosses threads through
  the heap AsyncGetFromDiskResult wrapper and the readyResponses queue, so referencing
  it avoids two ~112B barriered struct copies per hop; the pool rent/return is cheaper
  than those copies. KEEP.
- PendingContextHolder is NOT useful (-3.4%, i.e. removing it is +3.4%): after ContextRead
  was reverted to a stack PendingContext, the holder no longer avoids any copy (the
  fallback path still copies the struct into it) and only adds pool rent/return plus a
  heap indirection per dictionary access. A plain Dictionary<long, PendingContext> stores
  the struct in a cache-friendly inline array (baseline behavior). REMOVE.

Net: -53 lines. In-memory reads unchanged (ContextRead untouched). Disk reads +3.9% at
t=1 / parity at t=8, gc=0/0/0 in steady state. All Tsavorite suites (session, recordops,
hlog, stress, recovery) and Garnet (SG/low-mem/RespTests) green.

Co-authored-by: Copilot <[email protected]>
Collapse the ~24 _padXN filler fields and four #pragma CS0169 pairs into a single
cache-line-isolated SpscRingState struct (LayoutKind.Explicit + FieldOffset), matching
the repo idiom (cf. LightEpoch.Entry). Tail, Head, and the signaling word (WakeNeeded +
Stopped) each occupy a separate cache line; leading/trailing pad lines isolate the block
from neighboring object fields — same false-sharing guarantee as the manual padding, far
less noise.

The struct is declared non-generic at namespace scope because explicit layout is not
permitted on a type nested in a generic type (SpscRing<T>) — caught at runtime as a
TypeLoadException, so verified by actually running device.bench.

Verified: device.bench LocalMemory t=32 = 78.2-78.7M ops/sec (parity with pre-refactor
78.5M), 0 errors under MPSC contention; DeviceLogTests, BasicTests, LowMemoryTests,
LogShiftTailStressTest all green.

Co-authored-by: Copilot <[email protected]>
The per-session asyncIOContextPool self-bounds at the session's peak concurrent pending
depth (every Return balances a prior Rent), which is the same high-water already retained,
uncapped, by the sibling per-session structures with identical rent-on-IO-issue /
return-on-drain lifecycles: ioPendingRequests (Dictionary), heapContainerPool, and the
allocator's asyncGetFromDiskResultPool. Capping only this one pool would be asymmetric and
incomplete (a large batch still pins the other three), so document the rationale instead.

Co-authored-by: Copilot <[email protected]>
…ferAllocator

The scatter-gather GET handler had its own ~50-line custom pinned scratch buffer
(pendingScratch/pendingScratchPtr/pendingScratchUsed + GetPendingScratchOutput +
ResetPendingScratch, with manual GC.AllocateArray(pinned) grow/re-pin) to back the
StringOutput slot for each post-first-pending response. This duplicates the session's
existing ScratchBufferAllocator, whose CreateArgSlice(length) returns exactly such a
pinned slice and — by design — keeps prior slices valid across subsequent allocations
until the batch-end Reset.

Replace the custom buffer with scratchBufferAllocator.CreateArgSlice(128). This is safe:
the allocator is reset only at network-batch end (ProcessMessages), on txn commit/abort,
and via an explicit API — never mid-command — and the GET read/Reader path does not touch
it, so every slot reserved in the issue loop stays valid through GET_CompletePending and
the wrap-up loop. The wrap-up logic (first-pending -> net buffer; IsSpanByte slot -> memcpy;
overflow -> SendAndReset) is unchanged; the >128B overflow now happens at the Reader (the
slot is always a real 128B slice) instead of via a special 'scratch exhausted' default.

Allocation-free in steady state (SBA reuses/grows pinned buffers) with near-identical
per-slot cost; net -43 lines. Validated: RespGetLowMemoryTests (scatter-gather, incl.
3000-key MGET forcing growth + overflow), RespTests, RespCommandTests, custom-command and
low-memory suites all green.

Co-authored-by: Copilot <[email protected]>
…mory

The scatter-gather GET handler reserves a 128-byte ScratchBufferAllocator slot
per post-first-pending op, but relied on the batch-end Reset to release them.
Within one network batch, multiple GET runs (separated by other commands) then
accumulated slots until batch end (sum of runs), whereas the previous dedicated
buffer reset per call (peak = max single run). For mixed pipelines this could
retain a much larger high-water buffer for the session lifetime.

Capture the allocator offset at entry and rewind to it after the wrap-up has
copied all slots to the network buffer (the slots are dead by then). The rewind
is skipped if the allocator grew during the call, leaving the larger buffer as
the new high-water mark for the next batch-end Reset to rewind. This is a
once-per-call O(1) operation off the per-slot hot path; A/B on the LocalMemory
disk-bound SG GET path shows parity (~2.7M ops/sec, b=1024, t=16).

Adds ScratchBufferAllocator.RetainedBufferCount + TryRewindToOffset and unit
tests covering no-grow rewind, grow-skip, default-buffer rewind, and bounded
memory across repeated reserve/rewind.

Co-authored-by: Copilot <[email protected]>
badrishc and others added 17 commits June 8, 2026 17:57
Previously the rewind was skipped entirely when the allocator grew during the
scope, so a network batch with several large GET runs could keep re-growing.
Reclaim to the earliest reachable point in the current buffer instead:
max(0, savedOffset - prevScratchBuffersOffset). A negative value means a grow
pushed the savepoint into a buffer below the current one, which proves the
current buffer is entirely post-savepoint and safe to reset to its base; the
buffers below are released by the next Reset. This unifies the no-grow and grow
cases into one clamp and removes the separate buffer-count savepoint parameter.

Adds tests for grow-reclaims-current and repeated growing runs reusing the
largest buffer without unbounded growth.

Co-authored-by: Copilot <[email protected]>
A single (request/response) GET, or a GET followed by a non-GET, previously paid
the scatter-gather setup and a full look-ahead parse + backoff, regressing those
patterns ~20% vs the simple GET path. Two changes remove that cost:

- Dispatch gate: take the SG path only when the next command speculatively looks
  like a GET, so lone or non-GET-followed GETs use the simple direct-write path.
- Look-ahead: gate ParseGETAndKey with a single 8-byte compare of the next
  command's RESP framing (*2\r\n\r\n). A non-GET next command (e.g. SET, which
  is *3...) is rejected without a full parse, avoiding a double-parse in mixed
  GET/SET pipelines. A same-shaped command (e.g. TTL) is caught by the full parse;
  a miss only forgoes batching and is never incorrect.

BDN Operations.RawStringOperations (net10.0): batch=1 GET now at parity with
sg-get off (was ~+20%); batch=100 stays ~6-9% faster. Adds a mixed GET/SET
pipeline correctness test.

Co-authored-by: Copilot <[email protected]>
The non-SG MGET path (MGetReadArgBatch) completed each pending read inline from
SetOutput via the public GET_CompletePending(wait:true), which calls
UnsafeResumeThread -> epoch.Resume(). But ReadWithPrefetch already holds the
epoch across its whole per-key loop, so this is a re-entrant epoch acquire. On a
disk-bound (pending) read it corrupts the epoch state and the wait spin never
observes completion -> the server stalls (reproduced down to a single thread
reading two on-disk keys). The scatter-gather batch avoids this by deferring all
completion to after ReadWithPrefetch returns (epoch released).

Route all MGET through the scatter-gather batch and delete the non-SG path. SG is
correct on disk and within ~1-2% of the old path fully in-memory (measured), so
there is no reason to keep the blocking variant. The --sg-get flag now governs
only the contiguous-GET path; MGET always uses scatter-gather.

Adds RespMGetLowMemoryNoSgTests.MGetOnDiskCompletesWithoutScatterGatherFlag as a
regression guard (disk-bound MGET with sg-get off, which previously deadlocked).

Co-authored-by: Copilot <[email protected]>
Default EnableScatterGatherGet to true (defaults.conf, GarnetServerOptions, and
the Options.cs GetValueOrDefault). With the singleton/mixed fast-path, a lone or
non-GET-followed GET still takes the simple direct-write path, so the default-on
change is parity for request/response and faster for pipelined GET (BDN), while
disk-bound pipelined GET saturates random-read IO. Updates help text and docs to
reflect the new default and that MGET always uses scatter-gather.

Co-authored-by: Copilot <[email protected]>
…tency knobs

The simulated per-IO latency parameter was milliseconds, too coarse to model
modern single-digit-microsecond NVMe devices. The internal wait was already
Stopwatch-tick based (sub-microsecond resolution), so only the parameter unit was
coarse: rename latencyMs -> latencyUs and convert via Stopwatch.Frequency / 1e6.
The processor wait now sleeps only while >1ms remains and busy-spins the
sub-millisecond remainder so microsecond latencies release close to on time. The
latency==0 path (the default for the Garnet server and perf runs) is unchanged and
fully bypasses this code, so there is no perf impact.

Latency emulation is a test-only modeling knob, so drop the --local-memory-latency-ms
(device.bench) and --localmem-latency-ms (kv.bench) options and the
Devices.CreateLogDevice localMemoryLatencyMs parameter; tests construct
LocalMemoryDevice directly with latencyUs. Test latencies preserved (20ms -> 20000us,
50ms -> 50000us).

Co-authored-by: Copilot <[email protected]>
CI 'Format Garnet'/'Format Tsavorite' jobs flagged a FINALNEWLINE violation
(the repo convention is no trailing final newline).

Co-authored-by: Copilot <[email protected]>
…p-up

CodeQL flagged a critical 'unvalidated local pointer arithmetic' alert on the
hand-rolled CopyToNetworkBuffer (dcurr += toCopy on a pointer sourced from the
virtual GetResponseObjectHead). The scratch slot it copied is always <=
PendingScratchSlotSize (overflow uses the rented-buffer path), well under the
network buffer, so replace the bespoke chunking memcpy with the idiomatic
RespWriteUtils.TryWriteDirect(span, ref dcurr, dend) + SendAndReset loop used
throughout the RESP layer. Same behavior, validated bounds, no raw pointer math.

Co-authored-by: Copilot <[email protected]>
The three benchmark-layer READMEs had grown long and hard to use. Rewrite each to
keep only the key information and center it on the three reproducible scenarios:

  1. memory-bound        - data in RAM, no IO (engine/server ceiling)
  2. NVMe storage-bound  - reads hit real disk
  3. memory-device-bound - reads hit the in-RAM LocalMemory device (no disk latency)

KV and Resp cover all three; Device (an IO benchmark) covers the two device-IO
scenarios and points at KV/Resp for the no-IO case. Each scenario is a copy-paste
recipe with the verified flags and reference numbers (NVMe ~750K IOPS, LocalMemory
~78M ops/sec device / ~2.7-3.7M resp), cross-linked by anchor. Adds the resp.bench
A/B methodology (pure load via --runtime 0, INFO store verification, interleaved
runs, real-PID server shutdown). Net: Device 197->109, KV 570->107, Resp 467->154.

Co-authored-by: Copilot <[email protected]>
Device.benchmark benchmarks Tsavorite's IDevice implementations and imports only
Tsavorite.core, but it lived in the Garnet-level benchmark/ folder and its csproj
carried unused references to Garnet.client, Garnet.server, HdrHistogram,
StackExchange.Redis and the logging packages (copy-pasted from Resp.benchmark),
plus a TLS testcert. It is the bottom of the Resp <= KV <= Device stack, the same
layer as KV.benchmark.

Move it to libs/storage/Tsavorite/cs/benchmark/Device.benchmark/ alongside
KV.benchmark and YCSB.benchmark, trim the project to a single Tsavorite.core
reference (matching the sibling benchmarks), move it from Garnet.slnx to
Tsavorite.slnx, and fix the README cross-links. It now builds under the Tsavorite
solution/CI rather than Garnet's. Both solutions build clean; the benchmark runs
unchanged from the new location.

Co-authored-by: Copilot <[email protected]>
…fication

--runtime 0 was silently promoted to a 15s run because LightOperate treats
TimeSpan.FromSeconds(0) == default(TimeSpan) as 'unset -> 15s'. Guard the offline
Run loop so --runtime 0 seeds the keyspace and skips the run phase, as documented.

Also add a note to the Resp/KV benchmark READMEs on confirming the storage-bound
scenario truly hits the device (O_DIRECT bypasses page cache; verify via iostat
aqu-sz and per-process /proc/<pid>/io read_bytes on shared hosts).

Co-authored-by: Copilot <[email protected]>
The run loop's chunk depth was a hardcoded constant (1024). Expose it as -b/--batch-size
(default 1024, unchanged behavior) for parity with Resp/Device benchmarks. In-flight depth
stays bounded by --device-throttle (opportunistic streaming drain), so this is largely
throughput-neutral; a batch-boundary barrier (CompletePending(wait:true)) was evaluated and
rejected as it only regresses zero-latency LocalMemory (no benefit elsewhere).

Co-authored-by: Copilot <[email protected]>
- NextCommandMaybeGet now compares the full GET token (*2\r\n$3\r\nGET\r\n) rather than
  just the 2-arg/3-byte framing, so a same-shaped next command (TTL, DEL) no longer enters
  the scatter-gather path. Still a single small compare (no perf loss); a lowercase 'get'
  miss only forgoes batching, never incorrect. Adds a GET->TTL mixed-pipeline test.
- Devices.CreateLogDevice: correct the capacity doc — LocalMemory accepts capacity <= 0
  (defaults to a large bounded capacity), it is not required to be > 0.

Co-authored-by: Copilot <[email protected]>
The GET dispatch read storeWrapper.serverOptions.EnableScatterGatherGet (3 dependent
loads) on every GET. Cache it once in a session readonly field (like allowMultiDb), so
the hot-path check is a single load. Closes the small singleton-GET dispatch overhead
the scatter-gather gate added; BDN GetFound (batch=1) improves and batch=100 GET stays
~12% faster than main.

Co-authored-by: Copilot <[email protected]>
…read re-entrancy deadlock

Two robustness bugs found in code review of the new device:

1. Routing crash: the per-thread ring index is [ThreadStatic] (process-wide), so a thread
   that submitted to one LocalMemoryDevice carried that index into another instance with
   fewer rings, indexing rings[] out of bounds (reproduced as IndexOutOfRangeException with
   instances of parallelism 4 then 1, which the test suite creates). Enqueue now validates
   the cached index against THIS instance's parallelism via a single unsigned compare.

2. Re-entrancy deadlock: IO completion callbacks run synchronously on the device's single
   drain thread, and Tsavorite's AsyncFlushPageCallback re-issues WriteAsync from inside the
   callback. With the bounded ring, that re-entrant submit blocked in WaitForSlotEmpty on a
   full ring only this thread can drain — self-deadlock (deterministic at parallelism=1).
   Drain threads now execute re-entrant submits inline (recognized via a sentinel in the
   existing single TLS read, so no added hot-path cost; device.bench LocalMemory unchanged).

Also guard the completion callback so a throw can't kill the drain thread and wedge the ring,
and document RemoveSegment's no-in-flight-request precondition. Adds a deterministic regression
test (deadlocks pre-fix, passes post-fix).

Co-authored-by: Copilot <[email protected]>
…ment

- PooledArrayMemoryPool.Rent: collapse the empty 'got it' TLS-hit branch into a single
  negated fast-path check; behavior identical, less awkward.
- Tsavorite.ContextRead: the comment described an abandoned 'holder' pooling design; the
  dictionary stores the PendingContext struct directly. Correct the comment to match.

Co-authored-by: Copilot <[email protected]>
Subsequent GETs in a scatter-gather run are consumed inside NetworkGET_SG and bypass the
per-command CanServeSlot in ProcessMessages, so in cluster mode a cross-slot pipelined GET
run could serve keys this node does not own instead of returning -MOVED/-ASK. Verify each
batched key with a no-response slot check; if not servable, rewind the parse and end the run
so the dispatch loop re-reads it and writes the redirect in order. Any pending reads from
earlier keys in the run are drained and emitted by the existing wrap-up before the early-exit.

ACL stays correct without a per-key recheck because Garnet supports only the all-keys (~*)
pattern; ACLParser notes that adding per-key patterns would require revisiting this fast path.

Test pipelines many local GETs (forced on-disk via lowMemory, so the reads go pending) ahead
of a cross-slot GET, asserting every local value returns in order before the redirect.

Co-authored-by: Copilot <[email protected]>
…ant, consolidate scratch buffer tests

- LocalMemoryDevice: reuse base.Capacity instead of recomputing the default,
  validate the multiple-of-sz_segment requirement alongside the other ctor
  args, and move the defaulting note into the capacity param doc.
- LowMemoryTests: use DefaultLocalMemoryDeviceLatencyUs instead of a literal.
- Consolidate the new ScratchBufferAllocator rewind tests into the existing
  Garnet.test.extensions/ScratchBufferAllocatorTests.cs (removing the duplicate
  fixture in Garnet.test) and tighten the offset assertion to an exact check.

Co-authored-by: Copilot <[email protected]>
@badrishc badrishc force-pushed the badrish/io-optimization branch from 2fe9f0f to 2e7077c Compare June 9, 2026 01:05
@badrishc badrishc merged commit c586d51 into main Jun 9, 2026
384 of 385 checks passed
@badrishc badrishc deleted the badrish/io-optimization branch June 9, 2026 02:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants