More IO optimization#1854
Conversation
There was a problem hiding this comment.
Pull request overview
This PR makes a broad set of IO-path performance and tooling updates across Tsavorite and Garnet: it reduces per-IO copying/allocations on the Tsavorite pending-read path (notably around AsyncIOContext/wrapper pooling), improves Garnet RESP scatter-gather GET behavior (and enables it by default), and adds a syscall-free LocalMemory IDevice backend plus a dedicated Device.benchmark project for measuring raw IDevice IOPS.
Title/description alignment: The current title (“More IO optimization”) is too vague for the scope (new device backend + new benchmark + default behavior change + engine changes). Consider a more searchable title like:
[Tsavorite][RESP] Reduce pending-IO overhead; add LocalMemory device; enable SG GET by default
Changes:
- Tsavorite pending-IO optimizations:
AsyncIOContextbecomes a pooled reference type; add TLS + global pooling forAsyncGetFromDiskResultwrappers. - Add
LocalMemoryIDevicewith per-thread ring submission and a newDevice.benchmarkfor backend IOPS characterization. - Garnet scatter-gather GET improvements (scratch-slot backing + rewind) and default enablement; docs + tests updated accordingly.
Reviewed changes
Copilot reviewed 46 out of 47 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| website/docs/welcome/compatibility.md | Updates compatibility note to reflect SG-GET enabled-by-default semantics. |
| website/docs/getting-started/configuration.md | Updates --sg-get option description (MGET always uses SG). |
| test/standalone/Garnet.test/ScratchBufferAllocatorTests.cs | Adds unit tests for ScratchBufferAllocator.TryRewindToOffset behavior. |
| test/standalone/Garnet.test/RespGetLowMemoryTests.cs | Adds mixed GET/SET pipeline test + MGET-on-disk test with sg-get disabled. |
| libs/storage/Tsavorite/cs/Tsavorite.slnx | Adds Device.benchmark to Tsavorite solution. |
| libs/storage/Tsavorite/cs/test/TestUtils.cs | Renames LocalMemory latency arg to microseconds and updates defaults. |
| libs/storage/Tsavorite/cs/test/test.recovery/ObjectIterationTests.cs | Updates LocalMemoryDevice ctor arg from latencyMs to latencyUs. |
| libs/storage/Tsavorite/cs/test/test.hlog/LogShiftTailStressTest.cs | Updates LocalMemory latency usage to microseconds. |
| libs/storage/Tsavorite/cs/test/test.hlog/DeviceLogTests.cs | Updates LocalMemory latency usage to microseconds. |
| libs/storage/Tsavorite/cs/test/LowMemoryTests.cs | Updates LocalMemory latency usage to microseconds. |
| libs/storage/Tsavorite/cs/test/BasicTests.cs | Updates helper signatures/uses from ms to us latency. |
| libs/storage/Tsavorite/cs/src/core/Utilities/ThreadLocalCache.cs | Adds generic TLS LIFO cache helper for two-tier pooling. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/TsavoriteThread.cs | Returns pooled AsyncIOContext on completion and re-pend; refactors comments. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Tsavorite.cs | Adds rationale comment for keeping PendingContext stack-allocated. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/HandleOperationStatus.cs | Rents AsyncIOContext from per-session pool; adjusts defaults for ref-type context. |
| libs/storage/Tsavorite/cs/src/core/Index/Common/ExecutionContext.cs | Adds per-session AsyncIOContext pool with rent/return helpers. |
| libs/storage/Tsavorite/cs/src/core/Device/SpscRing.cs | Adds bounded ring with lost-wakeup-safe consumer wait + MPSC enqueue semantics. |
| libs/storage/Tsavorite/cs/src/core/Device/LocalMemoryDevice.cs | Reworks LocalMemory device: rings + processor threads, lazy segments, µs latency simulation. |
| libs/storage/Tsavorite/cs/src/core/Device/DeviceType.cs | Adds DeviceType.LocalMemory enum value + docs. |
| libs/storage/Tsavorite/cs/src/core/Device/Devices.cs | Extends CreateLogDevice to construct LocalMemoryDevice and adds new params. |
| libs/storage/Tsavorite/cs/src/core/Allocator/ObjectSerialization/IStreamBuffer.cs | Removes duplicate sentence in comment for InitialIOSize. |
| libs/storage/Tsavorite/cs/src/core/Allocator/ConditionallyHoistedKey.cs | Updates comment to reflect new AsyncIOContext ref-type and focus on PendingContext. |
| libs/storage/Tsavorite/cs/src/core/Allocator/AsyncIOContext.cs | Converts to sealed class, adds Reset, adjusts completion-event Set semantics. |
| libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs | Adds TLS+global pooling for AsyncGetFromDiskResult wrappers; adjusts callback path. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/README.md | Rewrites/condenses README; documents 3-layer stack and LocalMemory scenario. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Options.cs | Adds localmemory device option + updates help/validation text. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.Worker.cs | Updates BasicContext.Read call signature usage. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.Setup.cs | Adds LocalMemory device sizing/creation logic for KV.benchmark. |
| libs/storage/Tsavorite/cs/benchmark/Device.benchmark/README.md | Adds new Tsavorite-scoped Device.benchmark README (NVMe + LocalMemory scenarios). |
| libs/storage/Tsavorite/cs/benchmark/Device.benchmark/Program.cs | Updates output + adds LocalMemory construction path and aggregation logic. |
| libs/storage/Tsavorite/cs/benchmark/Device.benchmark/Options.cs | Adds LocalMemory device-type docs and clarifies completion-thread semantics. |
| libs/storage/Tsavorite/cs/benchmark/Device.benchmark/DeviceUtils.cs | Adds sector-aligned read/write helpers (note: alignment bug flagged in review). |
| libs/storage/Tsavorite/cs/benchmark/Device.benchmark/Device.benchmark.csproj | Adds new benchmark project under Tsavorite benchmarks. |
| libs/storage/Tsavorite/cs/benchmark/Device.benchmark/BenchWorker.cs | Switches to per-worker counters to avoid global atomic contention. |
| libs/server/Servers/GarnetServerOptions.cs | Sets EnableScatterGatherGet default true + clarifies doc comment. |
| libs/server/Resp/MGetReadArgBatch.cs | Removes non-SG MGET batch implementation (MGET always uses SG now). |
| libs/server/Resp/BasicCommands.cs | SG-GET: scratch-slot backing, rewind savepoint, and fast prefix peeking (note: predicate bug flagged). |
| libs/server/Resp/ArrayCommands.cs | Always uses SG MGET batch and drains pending together. |
| libs/server/ArgSlice/ScratchBufferAllocator.cs | Adds TryRewindToOffset API and docs. |
| libs/host/defaults.conf | Updates default EnableScatterGatherGet to true and comment. |
| libs/host/Configuration/Options.cs | Updates --sg-get help text + defaults option to true. |
| libs/common/Memory/PooledArrayMemoryPool.cs | Switches TLS caching to Tsavorite.core.ThreadLocalCache<T>. |
| Garnet.slnx | Removes old top-level benchmark/Device.benchmark project reference. |
| benchmark/Resp.benchmark/README.md | Condenses README; aligns with Device/KV layer model and LocalMemory scenario. |
| benchmark/README.md | Updates links/description now that Device.benchmark lives under Tsavorite. |
| benchmark/Device.benchmark/README.md | Deletes legacy Device.benchmark README under root benchmark folder. |
| benchmark/Device.benchmark/Device.benchmark.csproj | Deletes legacy Device.benchmark project under root benchmark folder. |
b722b17 to
2fe9f0f
Compare
Two surgical changes to the pending-read path so initial speculative reads
no longer cross 4 KB NAND page boundaries on NVMe with libaio:
1. IStreamBuffer.InitialIOSize: 4096 -> 128 bytes.
The previous default (Environment.SystemPageSize = 4096 on Linux x64)
meant every initial speculative read rounded up to ~4-8 KB. Combined
with sector-aligned (512 B) record offsets, ~75% of reads of small
records crossed a 4 KB NAND-page boundary, forcing the NVMe controller
to fetch two NAND pages per IO and roughly doubling per-IO service
time at the device (~0.92 ms vs ~0.67 ms for sector-aligned 4 KB reads).
The new default (128 B) rounds up to one device sector (typically 512 B)
which always fits within a NAND page AND comfortably captures any
small-record-shaped payload (header + small key + small value) from
any valid_offset position within the sector. For Garnet's typical
workload of 8 B key + ~100 B value (=128 B aligned record), every
pending read completes in a single IO with no re-read regardless of
where the record starts within the sector. Records larger than what
fits in the speculative read trigger a precise re-read via
VerifyRecordFromDiskCallback with the now-known recordLength, same
as before.
We empirically swept InitialIOSize at 32 / 128 / 256 / 512 / 4096 on
the 100M x 100B workload. At 32 B, ~33% of operations triggered re-
reads (records starting at 384/512 within a sector had only 128 valid
bytes available, which was tight relative to the actual recordLength
on the wire), dropping throughput to 578 K ops/sec. At 128 B, 0 re-
reads were observed and throughput hit the device ceiling. Values
above 128 progressively re-introduce the NAND-crossing penalty.
2. AsyncGetFromDiskCallback: subtract valid_offset when recording
available_bytes from the IO transfer count.
available_bytes counts valid bytes starting at GetValidPointer()
(= aligned_pointer + valid_offset), but the prior code stored the
raw device transfer count (which starts at aligned_pointer). With
the previous 4 KB InitialIOSize this only mis-counted by up to 511
bytes (max valid_offset for 512 B sectors), which rarely tripped
the recordLength check. With the smaller speculative read it becomes
essential — without this fix, a 512 B read with valid_offset of, say,
384, would set available_bytes=512 while only 128 valid bytes are
actually available from the record start.
Measured on 100M x 100B uniform random reads, 16 MB log, libaio,
throttle 512, NUMA-pinned:
threads | main | this branch | Device.bench (ceiling)
--------+-----------+-------------+-----------------------
1 | 176K | 184K | 249K
8 | 552K | 729K (+32%) | 739K
At t=8, device-side iostat: r/s 567K -> 721K, r_await 0.92 -> 0.70 ms,
rareq-sz 4.37 -> 0.60 KB. KV.benchmark now matches Device.benchmark's
per-IO device ceiling within 1.4%.
Ported minimally from tedhar/io-4k branch (commits 4f4fbfe, e9ef3b0).
That branch carries a broader refactor (configurable per-Read/RMW
InitialIORecordSize, test conversions for 4K sector handling); this
commit applies only the two settings needed to close the device-side
gap on the libaio fast path, leaving the broader API surface alone.
Co-authored-by: Copilot <[email protected]>
…t path
Two surgical refactors that together remove every Buffer.BulkMoveWithWriteBarrier
call site on the single-thread KV.benchmark hot path. Profiling showed PollGCWorker
samples accounting for ~18% of worker CPU at single-thread libaio random reads, all
nested under BulkMove invocations (mis-attribution by the .NET sampler, but the
underlying BulkMove sites were real). After this change, profiles show
PollGCWorker=0%, BulkMove=0%, with the freed sample-budget cleanly attributed to
the actual NativeStorageDevice.ReadAsync syscall (71% -> 88%).
1. AsyncIOContext: struct -> sealed class.
Was a ~112-byte struct with 5+ object refs that the pending-read path copied
into AsyncGetFromDiskResult.context on every IO (one BulkMove per Read).
Now an instance type with a per-session pool (RentAsyncIOContext /
ReturnAsyncIOContext on TsavoriteExecutionContext). The wrapper field store
becomes a single 8-byte reference, no BulkMove. Callers that previously did
request equals default allocate-or-rent (or pass null when not used). The
AsyncIOContextCompletionEvent.request field stays — class refs work the same
way except Reset() replaces equals default.
2. PendingContext stored via PendingContextHolder<TIn, TOut, TCtx> wrapper.
Was a ~200-byte struct that ioPendingRequests.Add bulk-copied into the
dictionary slot (second BulkMove per IO). The new holder class wraps the
struct as a public field; ContextRead (the KV.benchmark hot entry) rents
a holder upfront so 'ref holder.value' IS the PendingContext used by all
'ref PendingContext' callees. The dict then stores an 8-byte holder
reference. Holder pool on TsavoriteExecutionContext mirrors the IOContext
pool. HandleOperationStatus accepts the holder via an optional parameter
with null fallback (the fallback rents-and-copies for legacy entry sites:
AllocatorScan, ConditionalCopyToTail, the IOContext-driven re-read path).
Both changes are independent and additive. Either alone removes ~50% of the
BulkMove sample budget; together they remove 100%.
Per-op throughput (3 iterations, 100M x 100B, libaio, throttle=512, NUMA-pinned):
threads | baseline | this change | Device.benchmark (ceiling)
--------+------------+------------------+----------------------------
1 | 181K | 182K (+0.6%) | 248K (4.03 us/op)
8 | 740K | 733K (-1%) | 739K
Throughput is essentially flat: the time previously attributed to PollGCWorker
was sampler-side mis-attribution of trap-flag-induced poll-loop CPU INSIDE
BulkMove. Removing the BulkMove sites doesn't reduce total CPU — the same
worker-thread time is now (more accurately) attributed to ReadAsync, which is
where the io_submit syscall is actually parked. The single-thread gap to
Device.benchmark (4.86 us/op vs 4.03 us/op = +21%) is the residual Tsavorite
per-op CPU: hash lookup, RECORD_ON_DISK flow, AsyncGetFromDisk, pool
rent/return, session epoch Resume/Suspend. Closing further would require
leaner per-op control flow rather than further struct-copy elimination.
Architectural wins that remain regardless of the throughput-neutral result:
* Pending-read path is now allocation-light: alloc/wkr at t=1 drops ~40%
(~1.01 MB / 15s to ~0.63 MB / 15s) since per-IO transient AsyncIOContext
bytes are now reused pool instances.
* Dictionary storage shrinks from ~200B per entry to 8B per ref
(ioPendingRequests memory footprint scales with in-flight IOs, not
pending-context size).
* Future per-op CPU reductions (inlining, smaller hot loop) can now be
measured cleanly because the sampler will attribute their savings
correctly instead of being masked by BulkMove mis-attribution.
All 159 smoke tests pass.
Co-authored-by: Copilot <[email protected]>
Production-quality overhaul of the in-process 'fast device' backend, intended for benchmarks that need to characterize the upper bound of Tsavorite/Garnet throughput without paying any kernel-syscall cost. Why --- The old implementation had a broken latency model (sleep only fired on the first item dequeued from an idle queue — backlog masked the latency entirely) and routed by segmentId%parallelism, which made every queue an MPSC queue with the tail-CAS bouncing across all submitter threads. Throughput collapsed past 4-8 threads instead of scaling. Design ------ - New SpscRing<T>: bounded, lock-free, single-producer / single-consumer ring with cache-line-padded head/tail and producer/consumer cached counters so the hot path is one volatile load + one volatile store with no cross-cache-line traffic when the ring is neither empty nor full. - LocalMemoryDevice now uses one SpscRing per IO processor thread. Submitters are bound to a ring via ThreadStatic routing (round-robin on first use), so each ring sees exactly one producer thread when #submitter threads ≤ parallelism. This is the standard production pattern used by io_uring per-thread rings and SPDK per-CPU NVMe queue pairs. - If round-robin assignment wraps past parallelism (more submitters than rings), the device transparently falls back to MPSC via a per-ring spinlock — kept off the fast path until the first collision is detected. - Latency simulation now uses Stopwatch.GetTimestamp and a per-request deadline measured from enqueue, so every IO respects the configured latency regardless of queue depth (the old model only delayed the first item out of an idle queue). - Lazy per-segment GC.AllocateArray(pinned: true) instead of allocating the full capacity upfront. Capacity is sized to the dataset in callers. - Background, named threads (TsavoriteLocalMemIO-N). - Proper bounds checking on segmentId; ENOENT-equivalent callback for unwritten segments instead of array-out-of-bounds. - Drains pending requests during Dispose before joining processor threads. Sync mode removed ----------------- The previous parallelism=0 'inline' mode (callback fires on submitter thread) was non-physical — no real high-IOPS device works that way; io_uring, SPDK, and persistent-memory backends all have submit/complete on separate hardware queues. Removed to keep the device an honest simulation of real fast IO. Benchmark wiring ---------------- - Device.benchmark gains --local-memory, --local-memory-parallelism, --local-memory-latency-ms. Per-worker counters in BenchWorker eliminate the Interlocked.Increment(ref Program.totalCompletedOk) cache-line ping-pong that was capping device.benchmark at ~12M ops/sec regardless of device. - KV.benchmark gains --localmem-parallelism, --localmem-latency-ms. Device capacity is sized to fit the dataset (Keys * (8 + ValueSize + 16) rounded up to segment size, plus 4 segments slack). --validate is honored. Results on a Dell P5600 NVMe box, 80 cores NUMA node 0 (numactl pinned) ----------------------------------------------------------------------- Device.benchmark + LocalMemoryDevice, 512B reads, parallelism = threads: threads 1 4 8 16 32 40 64 ops/sec 3M 7M 12M 22M 28M 82M 73M peak 81.8M ops/sec = 41.8 GB/s memcpy (84% of DDR4 dual-channel BW) KV.benchmark + LocalMemoryDevice (100M keys × 100B, log=16MB, hashpack=1): threads 1 4 8 16 32 ops/sec 927K 1.04M 926K 784K 615K The device is no longer the bottleneck — Tsavorite-side scaling investigation to follow. Co-authored-by: Copilot <[email protected]>
Promotes LocalMemoryDevice to a first-class DeviceType enum member so callers can construct it via Devices.CreateLogDevice instead of bypassing the factory. Changes ------- - DeviceType: new LocalMemory = 5 entry, sitting between AzureStorage and Null. - Devices.CreateLogDevice: new optional params (localMemorySegmentSize, localMemoryParallelism, localMemoryLatencyMs) that are honored only when deviceType == LocalMemory. Existing signature stays binary-compatible for all other backends (the new params have defaults). LocalMemory enforces a positive capacity, defaults parallelism to Environment.ProcessorCount, and accepts an optional virtual logPath as the device file-name identifier. - KV.benchmark: --device localmemory now parses to DeviceType.LocalMemory and the setup path constructs the device via Devices.CreateLogDevice. Removes the prior string-match hack in CreateDevice. - Device.benchmark: replaces the bespoke --local-memory bool flag with the standard --device-type LocalMemory, also routed through Devices.CreateLogDevice. The --local-memory-parallelism / --local-memory-latency-ms flags remain as LocalMemory-specific knobs. Tested ------ - Tsavorite.test (BasicTests, LowMemoryTests, DeviceLogTests, LogShiftTailStressTest) all pass. - Device.benchmark --device-type LocalMemory yields ~70M ops/sec at p=32,t=32. - KV.benchmark --device localmemory --validate reports [validate] OK. Co-authored-by: Copilot <[email protected]>
…lism knob
DeviceType.LocalMemory's 'number of SPSC ring drainer threads' is
conceptually the same parameter as DeviceType.Native's numCompletionThreads
('background IO completion drain threads, each bound 1:1 to its own
queue/ring with submitters distributing via per-thread affinity'). Having
both was redundant.
Changes
-------
- Devices.CreateLogDevice: dropped the localMemoryParallelism parameter;
LocalMemory now uses numCompletionThreads (defaulting to
Environment.ProcessorCount when unspecified). Doc comment updated to
describe the unified semantics across Native and LocalMemory.
- KV.benchmark: removed --localmem-parallelism; LocalMemory now reuses
--device-completion-threads. Help text for --device-completion-threads
updated to cover both backends.
- Device.benchmark: removed --local-memory-parallelism; LocalMemory now
reuses --completion-threads. Help text updated accordingly.
- Both benchmark callers updated to pass the resolved value through
numCompletionThreads and log the source in their startup banner.
Verified
--------
- Device.benchmark --device-type LocalMemory --completion-threads 32 -t 32:
~70M ops/sec (unchanged from the previous bespoke flag).
- KV.benchmark --device localmemory --device-completion-threads 4 --validate:
reports [validate] OK.
- BasicTests pass.
Co-authored-by: Copilot <[email protected]>
…ded)
The processor loop used to spin via SpinWait.SpinOnce when the ring was empty.
SpinWait does eventually escalate to Thread.Sleep(1), but the actual measured
cost on an EPYC system was ~3.3% CPU per idle processor — 32 idle processors
collectively burnt a whole core just polling. With KV.benchmark sweeping
--device-completion-threads 32 across smaller --threads counts, this stole real
CPU from the workers.
This change adds proper park/wake to SpscRing:
- SpscRing owns a ManualResetEventSlim and a wake-needed flag, both placed on
their own cache lines (away from _tail and _head).
- Producer hot path: after publishing _tail (release), insert one
Interlocked.MemoryBarrier (mfence on x86 ~3-5ns) and read _wakeNeeded
(L1 hit in S state when consumer is awake). If non-zero, call Set.
In steady state the consumer is awake, _wakeNeeded == 0 stays cache-stable
in producer's L1, and no event ops happen — net hot-path cost ≈ one mfence.
- Consumer WaitForWork: brief spin to catch back-to-back work, then
Reset → Interlocked.Exchange(_wakeNeeded, 1) (full fence) → recheck tail
AND _stopped → Wait (no timeout). Lost-wakeup-safe under x86 TSO via the
fence pairing.
- SpscRing owns _stopped. Stop() sets _stopped then Set()s the signal.
The consumer's post-fence recheck of _stopped catches the shutdown signal
even when its Reset wipes Dispose's Set, so we need neither a Wait timeout
nor a per-iteration delegate callback.
- WaitForWork reads _stopped as a direct field load (not via Func<bool>),
so there's no virtual call cost in the wait path.
LocalMemoryDevice loses its own 'terminated' field; lifecycle is per-ring now,
which is cleaner (each ring is self-contained: producer publishes, consumer
parks, Stop tears it down).
Verified on Dell P5600 / 80-core NUMA node 0
--------------------------------------------
- Idle (32 processors, no submits, 10 s):
Before: 32 × 3.3% CPU = ~105% of one core (always-on polling)
After: 32 × 0.0% CPU = 0% (kernel-parked on futex)
Total process CPU at 10s: 2.1% (just .NET runtime threads)
- Wake latency after parking (first read after 500 ms idle):
Median ~175 µs (kernel scheduler), occasional ~1.5 ms cold-start.
- Device.benchmark scaling unchanged within noise:
t=1: 1.8M, t=8: 27.6M, t=16: 61.2M, t=32: 76.6M, t=40: 78.4M
(peak was 76.4M / 79.8M before; difference is run-to-run variance).
- Tsavorite BasicTests pass.
Co-authored-by: Copilot <[email protected]>
…llback
CPU profile of KV.bench at t=16 (parents-of-CPU_TIME) revealed two distinct
hotspots in the device layer:
11.8% Monitor.Enter_Slowpath (45 sec / 384 sec total managed CPU)
12.3% Thread.Sleep (47 sec)
Both came from the LocalMemoryDevice's SPSC ring infrastructure:
(1) Monitor.Enter_Slowpath: the previous design had a *global* ringNeedsLock
flag that, once any 33rd-unique submitter thread showed up (a ThreadPool
flush helper during load), forced EVERY subsequent Enqueue on EVERY ring
through 'lock (ringEnqueueLocks[idx])' — even worker rings that were
strictly SPSC. The Monitor.lock fast path is reasonable but contended; its
slow path goes through a kernel futex which is ~hundreds of ns.
(2) Thread.Sleep: SpscRing.WaitForWork used SpinWait.SpinOnce in its 'brief
spin' phase before parking on the event. SpinOnce escalates to
Thread.Yield / Sleep(0) / Sleep(1) within ~20 iterations, so on
lightly-loaded rings the drainer would wake briefly, see empty, sleep
for ~1 ms, wake, re-spin → pure CPU waste.
Fix
---
1. Replace SpscRing with a Vyukov-style MPSC ring:
- Per-slot sequence number tells producer/consumer when the slot is
empty/full/awaiting-reuse.
- Producer enqueue is a single Interlocked.Increment on _tail to claim
a unique slot index, followed by data store + sequence release-store.
No CAS retry loop, no lock anywhere — graceful degradation from SPSC
to MPSC under contention.
- Consumer dequeue is a sequence-check + data load + sequence
release-store (re-emptying the slot for the next round).
This eliminates the need for ringNeedsLock / ringEnqueueLocks entirely.
2. Replace SpinWait.SpinOnce with bounded Thread.SpinWait(8) × 64
(~1-2 µs total, pure busy, never escalates) in WaitForWork's spin
phase. When the spin expires we park on the event indefinitely (0% CPU
until the producer signals — already correct from the prior change).
3. Keep SpinWait.SpinOnce in the producer's ring-full backpressure path
(slot not yet drained by consumer). Here the consumer is the one that
needs CPU to drain, so SpinOnce's eventual Yield/Sleep IS correct —
pure Thread.SpinWait would starve the consumer when the pair lands on
hyperthread siblings.
LocalMemoryDevice simplification
--------------------------------
Dropped the global ringNeedsLock + ringEnqueueLocks[] entirely. Enqueue is
now a flat one-liner that just calls rings[idx].Enqueue — SpscRing handles
both SPSC and MPSC natively at uniform per-op cost.
Implications
------------
- Per-op cost: 1 Interlocked.Increment (~5 ns) + memcpy + 1 mfence + a
couple of Volatile reads/writes. Around 10-15 ns of pure overhead per
Enqueue, vs. ~5 ns for a strict SPSC ring with a plain tail store.
- No path to the kernel scheduler in the hot path; no Monitor / futex.
- SPSC and MPSC cases use identical code.
Device.benchmark scaling (parallelism = threads, 512 B sectors, batch 1024)
---------------------------------------------------------------------------
3 runs each, NUMA-pinned to node 0 (80 cores):
t=1: 4.8-5.3 MIOps/s (was 1.5)
t=8: 33-36 (was 33)
t=16: 59-63 (was 61)
t=20: 64-70 (was 65)
t=24: 66-73 (was 70)
t=28: 74-76 (new)
t=32: 74-82 (was 73)
t=36: 79-83 (new)
t=40: 37, 84, 84 (was 74-84) - edge case at threads==cores
t=48: 74-78 (was 76)
Peak ~80-84 M IOPS at t=32-36; ~38-42 GB/s of memcpy ~84% of memory BW.
The t=40 outlier is consistent with OS scheduling pathology when total
threads exactly matches available cores (40 producers + 40 consumers = 80).
Verified: --validate passes, BasicTests pass.
Co-authored-by: Copilot <[email protected]>
…AS-claim wake
Two wins from a CPU profile of KV.bench at t=16 (LocalMemoryDevice, 10M keys,
100B values, log=16MB so 100% reads go pending):
(1) KV.bench Worker: drop the input-span from Read
The bench was calling bContext.Read(key, ref pinnedInputSpan, ref _output, ...)
with a 100-byte pinned input even for read-only ops. Tsavorite's
CreatePendingReadContext -> CopyInputsForReadOrRMW -> SpanByteHeapContainer.Initialize
then did a SectorAlignedBufferPool.Get(~108B) + 100-byte copy for EVERY
pending read, hammering the global pool. Switching to the no-input overload
bContext.Read(key, ref _output, ...) defaults TInput, so SpanByteHeapContainer.Initialize
sees item.Length==0 and skips the buffer rental entirely.
(2) SpscRing: CAS-claim the wake flag
Producer hot path used to call _signal.Set() on every Enqueue as long as
_wakeNeeded == 1. A burst of N enqueues after the consumer parks would issue
N redundant Sets (each Set internally takes a Monitor). New WakeIfParked
helper atomically CASes _wakeNeeded from 1 to 0; only the winning producer
calls Set. Subsequent producers in the same burst skip Set entirely.
Results: KV.bench --device localmemory --device-completion-threads 32, 10M x 100B,
log=16MB, 3 runs each:
threads before after delta
t=1 0.94 M 0.57-0.99 M (variable; t=1 small absolute)
t=8 1.93 M 3.12 M +62%
t=16 0.90 M 2.83 M +214% ← the meaningful number
t=32 0.63 M 2.22 M +250%
Bench --validate still passes; tests still pass.
The remaining bottleneck is somewhere else (consumer drain rate). Next:
recapture profile and chase the new top frame.
Co-authored-by: Copilot <[email protected]>
Two-tier object pool for the AsyncGetFromDiskResult wrappers used on the pending-read hot path: a ThreadStatic Stack<T> per worker thread (capacity 32) absorbs back-to-back Get/Return cycles, falling back to the existing shared ConcurrentQueue on miss/overflow. Measured impact: at t=16 on KV.bench (LocalMemoryDevice), no significant throughput change — the global ConcurrentQueue wasn't the actual bottleneck the profile suggested (single-CAS per access; cache-line ping-pong is small relative to the consumer-drain wait). Kept as a no-regression / cleaner-shape change in preparation for the next investigation: the SectorAlignedMemory pool (bufferPool) hit on each pending read, which has bigger size buckets and may see more contention. Co-authored-by: Copilot <[email protected]>
When a Vyukov-style producer claims a slot that the consumer hasn't yet drained (ring full at this slot), WaitForSlotEmpty's Phase 2 was using SpinWait.SpinOnce which escalates to Thread.Sleep(1) within ~20 iterations. The 1 ms Sleep granularity is much coarser than the consumer's actual drain time (~167 ns per slot), so producers were stalled for 1 ms each time the ring filled even though the slot would be free in microseconds. Switching to plain Thread.Yield drops only the current scheduler quantum (typically µs on Linux) without parking, so the producer comes back as soon as it is re-scheduled and re-checks the slot. Tradeoff: Yield uses CPU during the wait whereas Sleep doesn't, but the ring-full case is rare in steady state and the producer is the consumer's only customer, so giving up CPU briefly to let the consumer run is the right behavior. Measured at t=16 KV.bench (LocalMemoryDevice, 10M x 100B): before: 2.83 M ops/sec after: 2.93 M ops/sec (+3.5%) Small win — the dominant remaining bottleneck is the per-completion work on the worker side (InternalCompletePendingRequestFromContext ~5 us per op), not the producer's ring-full wait. But the change is correctness- neutral and removes a misleading hot frame from the CPU profile. Co-authored-by: Copilot <[email protected]>
…hout explicit sizing Devices.CreateLogDevice previously threw on DeviceType.LocalMemory when capacity was unspecified (the common case for LocalStorageNamedDeviceFactory, which doesn't pass a capacity through). This made it impossible to plug LocalMemory into Garnet (or any framework using the standard factory pattern) without manually wiring a capacity argument all the way down. Since segments are already lazily allocated via EnsureSegment, the "capacity" field only sizes the per-segment lookup arrays (byte[][] segmentArrays + byte*[] segmentPtrs). Defaulting it to a generous bounded value (DefaultMaxSegments = 4096 × sz_segment = 4 TB at the 1 GB default segment) costs only ~32 KB of pointer-array overhead, while letting callers pass CAPACITY_UNSPECIFIED (or <=0) without error. Co-authored-by: Copilot <[email protected]>
Each pending GET in a scatter-gather batch used to land in its own freshly
rented IMemoryOwner<byte> (PooledArrayMemoryPool wrapper + 128-byte ArrayPool
slice + the matching round-trips on Dispose), because the StringOutput passed
to GET_WithPending after the first pending op was always `new StringOutput()`
(IsSpanByte=false), which forces CopyRespTo down the heap-rent branch. Only
the first pending op got the direct-to-network-buffer path (its output was
the GetStringOutput()-backed one from the loop's initial state).
This change gives every post-first-pending submission a pre-reserved slot in
a per-session pinned scratch byte[]. The Reader's existing IsSpanByte fast
path writes the RESP-encoded response directly into the slot — same fast path
as the in-memory hit case. The wrap-up loop then memcpy's slots into the
network buffer in batch order (preserving RESP wire ordering despite
out-of-order IO completion). Rents are reserved for the rare overflow case
(response larger than PendingScratchSlotSize=128 B), at which point the
existing functionsState.memoryPool.Rent + ConvertToHeap path engages and the
overflow IMemoryOwner is disposed via SendAndReset in the emit loop.
Prefix-sync hits (sync ops before any pending) are unchanged — they still
ProcessOutput direct-to-network with dcurr advancing per op.
Scratch is grow-only across batches (geometric grow on first batch that
exceeds capacity, never shrunk), pinned via GC.AllocateArray, and the cached
byte* pointer is re-derived on each grow.
Measured on resp.benchmark disk-bound recipe (10M × 100B, 16 MB log,
LocalMemoryDevice, --sg-get true, identical load):
threads | baseline | new | delta
--------+----------+----------+--------
1 | 419 K | 651 K | +55.4 %
8 | 2061 K | 2360 K | +14.5 %
16 | 2699 K | 2777 K | +2.9 %
The win is largest at low concurrency where per-op allocation cost dominates;
at high concurrency the wake/park round-trip in the IO completion path
becomes the binding constraint and this change buys back only the few percent
of allocation overhead that remained.
Co-authored-by: Copilot <[email protected]>
…ledArrayMemoryPool The two two-tier pools shipped earlier in this PR (commit 583aef4 for AsyncGetFromDiskResult, plus PooledArrayMemoryPool in libs/common) used different backing structures (Stack<T> vs T[]+count) and different cap names (AsyncGetFromDiskResultCacheCapacity vs TlsCacheSize). This commit unifies both on the same shape: [ThreadStatic] private static T[] tlsCache; [ThreadStatic] private static int tlsCount; private const int TlsCacheCapacity = 64; with the same bump-pointer LIFO Rent/Return code. The T[]+count form is chosen over Stack<T> to avoid virtual dispatch on the hot path; the behavioral semantics (LIFO ordering, single-thread access, lazy alloc) are identical. Cap reconciled at 64 for both (was 32/64). Tested against KV.bench and resp.bench at the disk-bound LocalMemoryDevice recipe: KV.bench (3 iters): t | cap=32 | cap=64 1 | 983 K/sec | 986 K/sec 8 | 3.46 M/sec | 3.45 M/sec 16 | 3.06 M/sec | 3.05 M/sec → indistinguishable (<1% delta, within noise) resp.bench (mean of 3 sweeps): t | cap=32 | cap=64 1 | 640 K/sec | 610 K/sec 8 | 2.29 M/sec | 2.45 M/sec (+6.7%) 16 | 2.57 M/sec | 2.58 M/sec → cap=64 modestly better at t=8, flat elsewhere Memory cost at cap=64 per thread that uses each pool: ~512 B of references in the backing array. With ~160 active worker threads on a large Garnet deployment the per-pool footprint is ~80 KB — trivial. We tested raising the cap to 1024 (to fully absorb the SG-GET burst pattern of 1024 pending ops between drains) on both pools earlier and saw zero throughput change despite TLS hit rate going from <2% to 99%+. Conclusion: the shared ConcurrentQueue fallback is not the bottleneck at the rates this code operates at; cap value is cosmetic past ~32. Co-authored-by: Copilot <[email protected]>
Both AsyncGetFromDiskResult and PooledArrayMemoryPool need the same per-thread LIFO cache fronting a global ConcurrentQueue. The previous commit (1cdbd84) harmonized the inline implementations but each consumer still owned its own [ThreadStatic] T[] + int count fields. This commit extracts that storage into a generic Tsavorite.core helper: ThreadLocalCache<T>.TryRent(out T item) ThreadLocalCache<T>.TryReturn(T item) ThreadLocalCache<T>.Capacity = 64 The TLS slot is keyed on the closed generic T, so the two consumers (AsyncGetFromDiskResult<AsyncIOContext> and PooledBuffer) get separate slots automatically — no cross-contamination, no per-pool pool-id indexing needed. AggressiveInlining on TryRent/TryReturn means the JIT hoists the array access right back into the consumer's hot path; the generated code is functionally identical to the inline version that preceded it. A theoretical concern with [ThreadStatic] on a generic class is that the runtime needs a per-closed-type TLS slot lookup (vs a compile-time offset for non-generic ThreadStatic). Measured KV.bench and resp.bench disk-bound recipes pre vs post refactor: KV.bench (3 iters): t | pre (inline) | post (helper) 1 | 986 K/sec | 987 K/sec (+0.1%) 8 | 3.45 M/sec | 3.47 M/sec (+0.6%) 16 | 3.05 M/sec | 3.10 M/sec (+1.6%) resp.bench (mean of 3 sweeps): t | pre (inline) | post (helper) 1 | 610 K/sec | 604 K/sec (-1.0%) 8 | 2.45 M/sec | 2.51 M/sec (+2.5%) 16 | 2.58 M/sec | 2.63 M/sec (+1.9%) All within run-to-run noise; net effect slightly positive. The JIT appears to fold the generic-TLS lookup well enough that there's no measurable overhead vs the inline form. Net deletion: each consumer loses ~25 lines of duplicated TLS plumbing and gains a one-line call. If a third two-tier pool emerges (PendingContextHolder, heap-container, etc.), it can now adopt the same pattern with two-line additions. Co-authored-by: Copilot <[email protected]>
The struct→class conversion of AsyncIOContext introduced aliasing bugs: - AsyncIOContextCompletionEvent.Set() disposed the just-read record because request and ctx are now the same pooled instance (use-after-free / NRE in iterator/scan disk reads, e.g. IterateKeyVersions, compaction). Guarded the dispose+reassign with ReferenceEquals. - On re-pend (multi-hop chain walk), HandleOperationStatus dropped the stale diskLogRecord without disposing it (SectorAlignedMemory leak). Now disposes it (holder path is sole owner) before clearing. - On re-pend, the completed AsyncIOContext was never returned to the per-session pool (orphan → per-hop allocation on the disk path). Now returned, making the rent/return lifecycle symmetric. - Updated stale struct-era comments (AsyncIOContext is a class now). Co-authored-by: Copilot <[email protected]>
…ntextRead BDN Operations.RawStringOperations showed the read path regressed badly vs main: GetFound +29% (13.15→16.97us), GetNotFound +63% (7.63→12.41us). Root cause: the PendingContextHolder optimization made ContextRead rent a holder and construct the ~200-byte PendingContext on the heap for EVERY read — including the overwhelmingly common in-memory read that never goes pending. Revert ContextRead to a stack-allocated PendingContext (as before). The holder is now rented lazily by HandleOperationStatus only when a read actually goes pending (RECORD_ON_DISK), via the existing fallback path that copies the struct into a pooled holder once. This keeps the dictionary's 8-byte-ref benefit and zero-GC on the disk path while removing all holder cost from the in-memory path. Also dropped the now-unused holder parameter from the 3-arg HandleOperationStatus overload. After fix (BDN, net10.0): GetFound 12.51us (-4.9% vs main), GetNotFound 7.59us (parity); all other RawStringOperations within noise. Disk path unchanged (gc=0/0/0). Co-authored-by: Copilot <[email protected]>
…ed by A/B) Verified each pending-path wrapper's value with an isolated A/B on the disk-bound LocalMemory KV.bench (100% reads, t=1, gc=0/0/0, 5 iters each): no-holder + class AsyncIOContext : 1.064M ops/sec (best) holder + class AsyncIOContext : 1.029M ops/sec (holder costs -3.4%) no-holder + struct AsyncIOContext: 1.021M ops/sec (struct AIC costs -4.0%) Conclusion: - AsyncIOContext as a pooled class IS useful (+4.0%): the IO crosses threads through the heap AsyncGetFromDiskResult wrapper and the readyResponses queue, so referencing it avoids two ~112B barriered struct copies per hop; the pool rent/return is cheaper than those copies. KEEP. - PendingContextHolder is NOT useful (-3.4%, i.e. removing it is +3.4%): after ContextRead was reverted to a stack PendingContext, the holder no longer avoids any copy (the fallback path still copies the struct into it) and only adds pool rent/return plus a heap indirection per dictionary access. A plain Dictionary<long, PendingContext> stores the struct in a cache-friendly inline array (baseline behavior). REMOVE. Net: -53 lines. In-memory reads unchanged (ContextRead untouched). Disk reads +3.9% at t=1 / parity at t=8, gc=0/0/0 in steady state. All Tsavorite suites (session, recordops, hlog, stress, recovery) and Garnet (SG/low-mem/RespTests) green. Co-authored-by: Copilot <[email protected]>
Collapse the ~24 _padXN filler fields and four #pragma CS0169 pairs into a single cache-line-isolated SpscRingState struct (LayoutKind.Explicit + FieldOffset), matching the repo idiom (cf. LightEpoch.Entry). Tail, Head, and the signaling word (WakeNeeded + Stopped) each occupy a separate cache line; leading/trailing pad lines isolate the block from neighboring object fields — same false-sharing guarantee as the manual padding, far less noise. The struct is declared non-generic at namespace scope because explicit layout is not permitted on a type nested in a generic type (SpscRing<T>) — caught at runtime as a TypeLoadException, so verified by actually running device.bench. Verified: device.bench LocalMemory t=32 = 78.2-78.7M ops/sec (parity with pre-refactor 78.5M), 0 errors under MPSC contention; DeviceLogTests, BasicTests, LowMemoryTests, LogShiftTailStressTest all green. Co-authored-by: Copilot <[email protected]>
The per-session asyncIOContextPool self-bounds at the session's peak concurrent pending depth (every Return balances a prior Rent), which is the same high-water already retained, uncapped, by the sibling per-session structures with identical rent-on-IO-issue / return-on-drain lifecycles: ioPendingRequests (Dictionary), heapContainerPool, and the allocator's asyncGetFromDiskResultPool. Capping only this one pool would be asymmetric and incomplete (a large batch still pins the other three), so document the rationale instead. Co-authored-by: Copilot <[email protected]>
…ferAllocator The scatter-gather GET handler had its own ~50-line custom pinned scratch buffer (pendingScratch/pendingScratchPtr/pendingScratchUsed + GetPendingScratchOutput + ResetPendingScratch, with manual GC.AllocateArray(pinned) grow/re-pin) to back the StringOutput slot for each post-first-pending response. This duplicates the session's existing ScratchBufferAllocator, whose CreateArgSlice(length) returns exactly such a pinned slice and — by design — keeps prior slices valid across subsequent allocations until the batch-end Reset. Replace the custom buffer with scratchBufferAllocator.CreateArgSlice(128). This is safe: the allocator is reset only at network-batch end (ProcessMessages), on txn commit/abort, and via an explicit API — never mid-command — and the GET read/Reader path does not touch it, so every slot reserved in the issue loop stays valid through GET_CompletePending and the wrap-up loop. The wrap-up logic (first-pending -> net buffer; IsSpanByte slot -> memcpy; overflow -> SendAndReset) is unchanged; the >128B overflow now happens at the Reader (the slot is always a real 128B slice) instead of via a special 'scratch exhausted' default. Allocation-free in steady state (SBA reuses/grows pinned buffers) with near-identical per-slot cost; net -43 lines. Validated: RespGetLowMemoryTests (scatter-gather, incl. 3000-key MGET forcing growth + overflow), RespTests, RespCommandTests, custom-command and low-memory suites all green. Co-authored-by: Copilot <[email protected]>
…mory The scatter-gather GET handler reserves a 128-byte ScratchBufferAllocator slot per post-first-pending op, but relied on the batch-end Reset to release them. Within one network batch, multiple GET runs (separated by other commands) then accumulated slots until batch end (sum of runs), whereas the previous dedicated buffer reset per call (peak = max single run). For mixed pipelines this could retain a much larger high-water buffer for the session lifetime. Capture the allocator offset at entry and rewind to it after the wrap-up has copied all slots to the network buffer (the slots are dead by then). The rewind is skipped if the allocator grew during the call, leaving the larger buffer as the new high-water mark for the next batch-end Reset to rewind. This is a once-per-call O(1) operation off the per-slot hot path; A/B on the LocalMemory disk-bound SG GET path shows parity (~2.7M ops/sec, b=1024, t=16). Adds ScratchBufferAllocator.RetainedBufferCount + TryRewindToOffset and unit tests covering no-grow rewind, grow-skip, default-buffer rewind, and bounded memory across repeated reserve/rewind. Co-authored-by: Copilot <[email protected]>
Previously the rewind was skipped entirely when the allocator grew during the scope, so a network batch with several large GET runs could keep re-growing. Reclaim to the earliest reachable point in the current buffer instead: max(0, savedOffset - prevScratchBuffersOffset). A negative value means a grow pushed the savepoint into a buffer below the current one, which proves the current buffer is entirely post-savepoint and safe to reset to its base; the buffers below are released by the next Reset. This unifies the no-grow and grow cases into one clamp and removes the separate buffer-count savepoint parameter. Adds tests for grow-reclaims-current and repeated growing runs reusing the largest buffer without unbounded growth. Co-authored-by: Copilot <[email protected]>
A single (request/response) GET, or a GET followed by a non-GET, previously paid the scatter-gather setup and a full look-ahead parse + backoff, regressing those patterns ~20% vs the simple GET path. Two changes remove that cost: - Dispatch gate: take the SG path only when the next command speculatively looks like a GET, so lone or non-GET-followed GETs use the simple direct-write path. - Look-ahead: gate ParseGETAndKey with a single 8-byte compare of the next command's RESP framing (*2\r\n\r\n). A non-GET next command (e.g. SET, which is *3...) is rejected without a full parse, avoiding a double-parse in mixed GET/SET pipelines. A same-shaped command (e.g. TTL) is caught by the full parse; a miss only forgoes batching and is never incorrect. BDN Operations.RawStringOperations (net10.0): batch=1 GET now at parity with sg-get off (was ~+20%); batch=100 stays ~6-9% faster. Adds a mixed GET/SET pipeline correctness test. Co-authored-by: Copilot <[email protected]>
The non-SG MGET path (MGetReadArgBatch) completed each pending read inline from SetOutput via the public GET_CompletePending(wait:true), which calls UnsafeResumeThread -> epoch.Resume(). But ReadWithPrefetch already holds the epoch across its whole per-key loop, so this is a re-entrant epoch acquire. On a disk-bound (pending) read it corrupts the epoch state and the wait spin never observes completion -> the server stalls (reproduced down to a single thread reading two on-disk keys). The scatter-gather batch avoids this by deferring all completion to after ReadWithPrefetch returns (epoch released). Route all MGET through the scatter-gather batch and delete the non-SG path. SG is correct on disk and within ~1-2% of the old path fully in-memory (measured), so there is no reason to keep the blocking variant. The --sg-get flag now governs only the contiguous-GET path; MGET always uses scatter-gather. Adds RespMGetLowMemoryNoSgTests.MGetOnDiskCompletesWithoutScatterGatherFlag as a regression guard (disk-bound MGET with sg-get off, which previously deadlocked). Co-authored-by: Copilot <[email protected]>
Default EnableScatterGatherGet to true (defaults.conf, GarnetServerOptions, and the Options.cs GetValueOrDefault). With the singleton/mixed fast-path, a lone or non-GET-followed GET still takes the simple direct-write path, so the default-on change is parity for request/response and faster for pipelined GET (BDN), while disk-bound pipelined GET saturates random-read IO. Updates help text and docs to reflect the new default and that MGET always uses scatter-gather. Co-authored-by: Copilot <[email protected]>
…tency knobs The simulated per-IO latency parameter was milliseconds, too coarse to model modern single-digit-microsecond NVMe devices. The internal wait was already Stopwatch-tick based (sub-microsecond resolution), so only the parameter unit was coarse: rename latencyMs -> latencyUs and convert via Stopwatch.Frequency / 1e6. The processor wait now sleeps only while >1ms remains and busy-spins the sub-millisecond remainder so microsecond latencies release close to on time. The latency==0 path (the default for the Garnet server and perf runs) is unchanged and fully bypasses this code, so there is no perf impact. Latency emulation is a test-only modeling knob, so drop the --local-memory-latency-ms (device.bench) and --localmem-latency-ms (kv.bench) options and the Devices.CreateLogDevice localMemoryLatencyMs parameter; tests construct LocalMemoryDevice directly with latencyUs. Test latencies preserved (20ms -> 20000us, 50ms -> 50000us). Co-authored-by: Copilot <[email protected]>
CI 'Format Garnet'/'Format Tsavorite' jobs flagged a FINALNEWLINE violation (the repo convention is no trailing final newline). Co-authored-by: Copilot <[email protected]>
…p-up CodeQL flagged a critical 'unvalidated local pointer arithmetic' alert on the hand-rolled CopyToNetworkBuffer (dcurr += toCopy on a pointer sourced from the virtual GetResponseObjectHead). The scratch slot it copied is always <= PendingScratchSlotSize (overflow uses the rented-buffer path), well under the network buffer, so replace the bespoke chunking memcpy with the idiomatic RespWriteUtils.TryWriteDirect(span, ref dcurr, dend) + SendAndReset loop used throughout the RESP layer. Same behavior, validated bounds, no raw pointer math. Co-authored-by: Copilot <[email protected]>
The three benchmark-layer READMEs had grown long and hard to use. Rewrite each to keep only the key information and center it on the three reproducible scenarios: 1. memory-bound - data in RAM, no IO (engine/server ceiling) 2. NVMe storage-bound - reads hit real disk 3. memory-device-bound - reads hit the in-RAM LocalMemory device (no disk latency) KV and Resp cover all three; Device (an IO benchmark) covers the two device-IO scenarios and points at KV/Resp for the no-IO case. Each scenario is a copy-paste recipe with the verified flags and reference numbers (NVMe ~750K IOPS, LocalMemory ~78M ops/sec device / ~2.7-3.7M resp), cross-linked by anchor. Adds the resp.bench A/B methodology (pure load via --runtime 0, INFO store verification, interleaved runs, real-PID server shutdown). Net: Device 197->109, KV 570->107, Resp 467->154. Co-authored-by: Copilot <[email protected]>
Device.benchmark benchmarks Tsavorite's IDevice implementations and imports only Tsavorite.core, but it lived in the Garnet-level benchmark/ folder and its csproj carried unused references to Garnet.client, Garnet.server, HdrHistogram, StackExchange.Redis and the logging packages (copy-pasted from Resp.benchmark), plus a TLS testcert. It is the bottom of the Resp <= KV <= Device stack, the same layer as KV.benchmark. Move it to libs/storage/Tsavorite/cs/benchmark/Device.benchmark/ alongside KV.benchmark and YCSB.benchmark, trim the project to a single Tsavorite.core reference (matching the sibling benchmarks), move it from Garnet.slnx to Tsavorite.slnx, and fix the README cross-links. It now builds under the Tsavorite solution/CI rather than Garnet's. Both solutions build clean; the benchmark runs unchanged from the new location. Co-authored-by: Copilot <[email protected]>
…fication --runtime 0 was silently promoted to a 15s run because LightOperate treats TimeSpan.FromSeconds(0) == default(TimeSpan) as 'unset -> 15s'. Guard the offline Run loop so --runtime 0 seeds the keyspace and skips the run phase, as documented. Also add a note to the Resp/KV benchmark READMEs on confirming the storage-bound scenario truly hits the device (O_DIRECT bypasses page cache; verify via iostat aqu-sz and per-process /proc/<pid>/io read_bytes on shared hosts). Co-authored-by: Copilot <[email protected]>
The run loop's chunk depth was a hardcoded constant (1024). Expose it as -b/--batch-size (default 1024, unchanged behavior) for parity with Resp/Device benchmarks. In-flight depth stays bounded by --device-throttle (opportunistic streaming drain), so this is largely throughput-neutral; a batch-boundary barrier (CompletePending(wait:true)) was evaluated and rejected as it only regresses zero-latency LocalMemory (no benefit elsewhere). Co-authored-by: Copilot <[email protected]>
- NextCommandMaybeGet now compares the full GET token (*2\r\n$3\r\nGET\r\n) rather than just the 2-arg/3-byte framing, so a same-shaped next command (TTL, DEL) no longer enters the scatter-gather path. Still a single small compare (no perf loss); a lowercase 'get' miss only forgoes batching, never incorrect. Adds a GET->TTL mixed-pipeline test. - Devices.CreateLogDevice: correct the capacity doc — LocalMemory accepts capacity <= 0 (defaults to a large bounded capacity), it is not required to be > 0. Co-authored-by: Copilot <[email protected]>
The GET dispatch read storeWrapper.serverOptions.EnableScatterGatherGet (3 dependent loads) on every GET. Cache it once in a session readonly field (like allowMultiDb), so the hot-path check is a single load. Closes the small singleton-GET dispatch overhead the scatter-gather gate added; BDN GetFound (batch=1) improves and batch=100 GET stays ~12% faster than main. Co-authored-by: Copilot <[email protected]>
…read re-entrancy deadlock Two robustness bugs found in code review of the new device: 1. Routing crash: the per-thread ring index is [ThreadStatic] (process-wide), so a thread that submitted to one LocalMemoryDevice carried that index into another instance with fewer rings, indexing rings[] out of bounds (reproduced as IndexOutOfRangeException with instances of parallelism 4 then 1, which the test suite creates). Enqueue now validates the cached index against THIS instance's parallelism via a single unsigned compare. 2. Re-entrancy deadlock: IO completion callbacks run synchronously on the device's single drain thread, and Tsavorite's AsyncFlushPageCallback re-issues WriteAsync from inside the callback. With the bounded ring, that re-entrant submit blocked in WaitForSlotEmpty on a full ring only this thread can drain — self-deadlock (deterministic at parallelism=1). Drain threads now execute re-entrant submits inline (recognized via a sentinel in the existing single TLS read, so no added hot-path cost; device.bench LocalMemory unchanged). Also guard the completion callback so a throw can't kill the drain thread and wedge the ring, and document RemoveSegment's no-in-flight-request precondition. Adds a deterministic regression test (deadlocks pre-fix, passes post-fix). Co-authored-by: Copilot <[email protected]>
…ment - PooledArrayMemoryPool.Rent: collapse the empty 'got it' TLS-hit branch into a single negated fast-path check; behavior identical, less awkward. - Tsavorite.ContextRead: the comment described an abandoned 'holder' pooling design; the dictionary stores the PendingContext struct directly. Correct the comment to match. Co-authored-by: Copilot <[email protected]>
Subsequent GETs in a scatter-gather run are consumed inside NetworkGET_SG and bypass the per-command CanServeSlot in ProcessMessages, so in cluster mode a cross-slot pipelined GET run could serve keys this node does not own instead of returning -MOVED/-ASK. Verify each batched key with a no-response slot check; if not servable, rewind the parse and end the run so the dispatch loop re-reads it and writes the redirect in order. Any pending reads from earlier keys in the run are drained and emitted by the existing wrap-up before the early-exit. ACL stays correct without a per-key recheck because Garnet supports only the all-keys (~*) pattern; ACLParser notes that adding per-key patterns would require revisiting this fast path. Test pipelines many local GETs (forced on-disk via lowMemory, so the reads go pending) ahead of a cross-slot GET, asserting every local value returns in order before the redirect. Co-authored-by: Copilot <[email protected]>
…ant, consolidate scratch buffer tests - LocalMemoryDevice: reuse base.Capacity instead of recomputing the default, validate the multiple-of-sz_segment requirement alongside the other ctor args, and move the defaulting note into the capacity param doc. - LowMemoryTests: use DefaultLocalMemoryDeviceLatencyUs instead of a literal. - Consolidate the new ScratchBufferAllocator rewind tests into the existing Garnet.test.extensions/ScratchBufferAllocatorTests.cs (removing the duplicate fixture in Garnet.test) and tighten the offset assertion to an exact check. Co-authored-by: Copilot <[email protected]>
2fe9f0f to
2e7077c
Compare
No description provided.