Skip to content

Native storage device: improved error handling#1884

Merged
TedHartMS merged 8 commits into
mainfrom
badrishc/native-device-fixes
Jun 17, 2026
Merged

Native storage device: improved error handling#1884
TedHartMS merged 8 commits into
mainfrom
badrishc/native-device-fixes

Conversation

@badrishc

Copy link
Copy Markdown
Collaborator

Under disk-bound load with many client threads (e.g. Resp.benchmark GET at 4+ threads with --device-io-backend libaio --device-throttle-limit 512), the Garnet server could deadlock/terminate and require kill -9.

Root cause: when the throttle limit exceeds the fixed 128-deep kernel submission ring, reads spin in the io_submit/io_uring_get_sqe EAGAIN backoff while holding a FASTER epoch/thread-id slot. Once more than kMaxNumThreads (96) threads hold a slot concurrently, Thread::ReserveEntry threw std::runtime_error("Too many threads!") across the extern "C"/P-Invoke boundary, which is undefined on Linux and triggers std::terminate (often recursively across threads), wedging the process. Device.benchmark was immune because it uses a fixed small thread set; the full server grows its threadpool.

Fixes:

  • Size the kernel submission ring to the throttle limit (NextPowerOf2, clamped to a kernel-safe ceiling) so io_submit never spills; plumb a max_events parameter C# -> C ABI -> libaio/io_uring handlers.
  • Never throw from the native epoch table: ReserveEntry yield-spins until a slot frees (kMaxNumThreads 96 -> 256).
  • Make epoch acquire/release exception-safe via an EpochGuard RAII helper.
  • Release the epoch outside the submit backoff (unwind + retry) rather than spinning while holding it and the bundle protection.
  • Add a C ABI exception firewall (CABIGuard) around every native_device export so no C++ exception crosses P-Invoke; surface a logged, actionable error instead of a silent terminate. Managed side logs NativeDevice_GetLastError.
  • Hard-cap the effective throttle at the kernel-safe ceiling (4096) with a warning, and size the result pool above it; precompute the clamp off the hot Throttle() path.

Verified with an A/B repro: the original code aborts with "terminate called recursively ... Too many threads!" under forced over-subscription, while the fixed build runs the full server stably at 128-256 client threads with epoch tables as small as 16. 70/70 Tsavorite device tests pass.

Note: the Windows prebuilt runtimes/win-x64/native/native_device.dll must be rebuilt on Windows to carry these changes.

Under disk-bound load with many client threads (e.g. Resp.benchmark GET at
4+ threads with --device-io-backend libaio --device-throttle-limit 512), the
Garnet server could deadlock/terminate and require kill -9.

Root cause: when the throttle limit exceeds the fixed 128-deep kernel
submission ring, reads spin in the io_submit/io_uring_get_sqe EAGAIN backoff
while holding a FASTER epoch/thread-id slot. Once more than kMaxNumThreads (96)
threads hold a slot concurrently, Thread::ReserveEntry threw
std::runtime_error("Too many threads!") across the extern "C"/P-Invoke
boundary, which is undefined on Linux and triggers std::terminate (often
recursively across threads), wedging the process. Device.benchmark was immune
because it uses a fixed small thread set; the full server grows its threadpool.

Fixes:
- Size the kernel submission ring to the throttle limit (NextPowerOf2, clamped
  to a kernel-safe ceiling) so io_submit never spills; plumb a max_events
  parameter C# -> C ABI -> libaio/io_uring handlers. The per-context depth is
  the throttle divided across io_contexts so total ring capacity stays bounded
  (libaio io_setup draws from the shared fs.aio-max-nr budget).
- Never throw from the native epoch table: ReserveEntry yield-spins until a slot
  frees (kMaxNumThreads 96 -> 256).
- Make epoch acquire/release exception-safe via an EpochGuard RAII helper.
- Release the epoch outside the submit backoff (unwind + retry) rather than
  spinning while holding it and the bundle protection.
- Add a C ABI exception firewall (CABIGuard) around every native_device export
  so no C++ exception crosses P-Invoke; surface a logged, actionable error
  instead of a silent terminate. set_last_error is no-throw so the firewall
  cannot be defeated on the out-of-memory path. Managed side logs
  NativeDevice_GetLastError.
- Hard-cap the effective throttle at the kernel-safe ceiling (4096) with a
  warning, and size the result pool above it; precompute the clamp off the hot
  Throttle() path.

Verified with an A/B repro: the original code aborts with "terminate called
recursively ... Too many threads!" under forced over-subscription, while the
fixed build runs the full server stably at 128-256 client threads with epoch
tables as small as 16. 70/70 Tsavorite device tests pass, and Device.benchmark
and KV.benchmark larger-than-memory NVMe runs show no regression (within +/-2%).

Note: the Windows prebuilt runtimes/win-x64/native/native_device.dll must be
rebuilt on Windows to carry these changes.

Co-authored-by: Copilot <[email protected]>
@badrishc badrishc force-pushed the badrishc/native-device-fixes branch from 9defffa to 067a448 Compare June 17, 2026 01:40
@badrishc badrishc changed the title Fix native storage device hang under high client concurrency Native storage device: improved error handling Jun 17, 2026
badrishc and others added 5 commits June 16, 2026 18:48
The native device requires the Spectre-mitigated CRT libs; when the CMake
Visual Studio generator defaults to a newer toolset than the one whose Spectre
libs are installed (common with VS 2026 Insiders), configure/link fails. Pass
`-T v143` (or the matching toolset) to select it explicitly.

Co-authored-by: Copilot <[email protected]>
Independent module-by-module review (two code-review passes + a peer) surfaced
several latent defects beyond the original hang fix. Addressed:

io_uring (blocker + high):
- io_uring_submit() returns the count of ALL flushed SQEs, not "1 for this op".
  A stale no-op SQE (left by a prior failed-submit/unwind or Wake) made a later
  submit return 2; the `res == 1` check then treated success as failure and
  freed an io_context whose op was already in flight -> use-after-free on
  completion. Treat `res >= 1` as success (our SQE is the last prepared, so any
  flush includes it); `res < 1` rewrites the prepared SQE to a no-op and frees.
  Same fix applied to UringIoHandler::Wake.
- The uring submit loop spun in-epoch indefinitely on -EAGAIN/-EBUSY. It now
  yields a bounded budget then unwinds (Status::Pending) like the get_sqe/libaio
  paths, so the epoch/thread-id slot is never held across a long wait.
- UringIoHandler::pick_ring() cached a thread_local ring index without the
  owner+bounds guard QueueIoHandler::pick_context() has, so a thread that used a
  uring device with more rings then one with fewer read out of bounds. Mirrored
  the libaio guard.

Managed (high + medium):
- Reset/RemoveSegment/GetFileSize/TryComplete captured the native handle without
  the numPending lease ReadAsync/WriteAsync use, so a concurrent Dispose() could
  free it mid-call -> native use-after-free. They now lease numPending (with the
  disposedFlag fast-path) and decrement in finally. Removed the dead, misleading
  EnsureReadyOrSilent helper.
- Throttle() seeded effectiveThrottleLimit to MaxThrottle, so a cold-start burst
  before lazy device creation bypassed the configured throttle and flooded the
  just-sized ring. It now gates on the live clamped limit until the handle exists.

Cleanup:
- Deleted the dead reserving default Thread::ThreadId() ctor (would yield-spin
  forever if ever used; the only instance uses the explicit kInvalidId ctor).
- Unified the in-epoch submit yield budget to a single small constant (16) and
  removed the unused kMaxSubmitRetries.

Documented and deliberately NOT changed: SubmitWithEpoch holds the epoch across
log_.ReadAsync -> OpenSegment -> BumpCurrentEpoch (the standard FASTER protocol);
a self-deadlock needs 256 concurrent segment opens within one thread's microsecond
epoch region and is practically unreachable — and the unwind releases the epoch
more often than the original, reducing exposure.

Validation: 70/70 device tests; libaio/uring Device.benchmark NVMe storage-bound
0 validation errors at ~740K/738K ops/sec; KV.benchmark 100M storage-bound ~729K
ops/sec — no regression vs the pre-review build.

Note: the win-x64 native_device.dll is unaffected in behavior (the changes are
Linux-submit-path-only or dead-code removal; the shared-header edits are no-ops on
the IOCP path) but is built from slightly older source; rebuild for source parity.

Co-authored-by: Copilot <[email protected]>
… tests

GetFileSize() called log_.size() without acquiring the epoch, but size()
can lazily OpenSegment() whose bundle-expand path calls
epoch_->BumpCurrentEpoch(). BumpCurrentEpoch requires the calling thread to
hold epoch protection; on an unprotected thread ProtectAndDrain dereferences
an unreserved epoch-table slot and segfaults. This is reachable in production
(recovery, and any GetFileSize on a not-yet-open segment once another segment
is open, including after RemoveSegment/Truncate). A cold-device GetFileSize(0)
survived only because the first-ever OpenSegment takes a no-bump path.

Fix: wrap GetFileSize in EpochGuard, exactly like RemoveSegment / Read / Write.
The change is in the shared header, so it applies to all platforms; the
Linux .so are rebuilt here and the Windows DLL must be rebuilt to pick it up.

Tests (test.hlog/DeviceTests.cs): add coverage that previously had none —
- both IO backends (libaio + io_uring) round-trip and parallel reads;
- multi-completion-thread / multi-io_context sharding;
- multiple devices with different shard counts driven from many threads
  (regression for the pick_context/pick_ring owner+bounds guard);
- throttle -> submission-ring sizing and throttle-above-MaxThrottle clamping;
- high submitter concurrency (64 threads) as a direct regression for the
  original production hang (slot-exhaustion / in-epoch full-ring spin);
- late P/Invoke entry points GetFileSize / Reset / TryComplete / RemoveSegment.
The RemoveSegment + GetFileSize-after-remove test is what surfaced the crash
fixed above.

Co-authored-by: Copilot <[email protected]>
@badrishc badrishc marked this pull request as ready for review June 17, 2026 04:06
Copilot AI review requested due to automatic review settings June 17, 2026 04:06

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens Tsavorite’s Linux native storage device against high-concurrency, disk-bound workloads by preventing submit-path backpressure from pinning epoch/thread-id slots and by ensuring native exceptions cannot unwind across the C ABI into .NET, improving reliability and debuggability under load.

Changes:

  • Size libaio/io_uring submission rings from the configured throttle limit (clamped to a kernel-safe ceiling) and adjust managed throttling to match the effective cap.
  • Make native epoch/thread-id acquisition exception-safe (RAII) and eliminate “too many threads” abort behavior by yield-spinning until a slot is available.
  • Add a C ABI exception firewall around native exports and expand unit tests to cover io_uring, sharding, high throttle, and late P/Invoke entry points.

Reviewed changes

Copilot reviewed 10 out of 14 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
libs/storage/Tsavorite/cs/test/test.hlog/DeviceTests.cs Adds “phase 8” coverage for io_uring, sharded contexts, multi-device shard-index safety, high throttle/high concurrency, and late native entry points.
libs/storage/Tsavorite/cs/src/core/Device/NativeStorageDevice.cs Implements throttle clamping, computes native ring depth from throttle, plumbs max-events into native creation, improves native error surfacing, and hardens late-entry P/Invoke calls against dispose races.
libs/storage/Tsavorite/cc/src/device/thread.h Increases thread-id table size and changes slot exhaustion from throwing to yield/sleep spinning; deletes unused reserving default ctor.
libs/storage/Tsavorite/cc/src/device/native_device.h Adds EpochGuard RAII and SubmitWithEpoch to avoid holding epoch across sustained submit backoff; plumbs max-events into handler creation.
libs/storage/Tsavorite/cc/src/device/native_device_wrapper.cc Adds CABIGuard/CABIGuardVoid firewalling around exports; plumbs max-events to device constructors; adds exception sentinel for QueueRunFor.
libs/storage/Tsavorite/cc/src/device/native_device_error.h Makes set_last_error noexcept and best-effort to avoid exceptions escaping the firewall.
libs/storage/Tsavorite/cc/src/device/file_windows.h Adds 3-arg handler ctor for cross-platform signature symmetry (max_events ignored on Windows).
libs/storage/Tsavorite/cc/src/device/file_linux.h Adds per-context/ring max_events plumbing, power-of-two rounding for io_uring, and TLS owner/bounds guard for sharding affinity.
libs/storage/Tsavorite/cc/src/device/file_linux.cc Converts sustained ring-full waits into an unwind (Status::Pending) so the caller can retry outside epoch; fixes io_uring_submit success criteria.
libs/storage/Tsavorite/cc/README.md Documents explicit MSVC toolset selection for Spectre CRT/linker compatibility.

Comment thread libs/storage/Tsavorite/cs/src/core/Device/NativeStorageDevice.cs Outdated
…floor

Addresses PR review comment: the doc comment said 'simply NextPowerOf2(ThrottleLimit)'
for single context, but the implementation floors to DefaultNativeRingDepth (128), so
ThrottleLimit < 128 returns 128, not NextPowerOf2(ThrottleLimit). Updated comment to
match the code: max(DefaultNativeRingDepth, NextPowerOf2(ThrottleLimit)).

Co-authored-by: Copilot <[email protected]>
@TedHartMS TedHartMS merged commit 94d6a1f into main Jun 17, 2026
400 of 401 checks passed
@TedHartMS TedHartMS deleted the badrishc/native-device-fixes branch June 17, 2026 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants