Native storage device: improved error handling by badrishc · Pull Request #1884 · microsoft/garnet

badrishc · 2026-06-17T01:40:12Z

Under disk-bound load with many client threads (e.g. Resp.benchmark GET at 4+ threads with --device-io-backend libaio --device-throttle-limit 512), the Garnet server could deadlock/terminate and require kill -9.

Root cause: when the throttle limit exceeds the fixed 128-deep kernel submission ring, reads spin in the io_submit/io_uring_get_sqe EAGAIN backoff while holding a FASTER epoch/thread-id slot. Once more than kMaxNumThreads (96) threads hold a slot concurrently, Thread::ReserveEntry threw std::runtime_error("Too many threads!") across the extern "C"/P-Invoke boundary, which is undefined on Linux and triggers std::terminate (often recursively across threads), wedging the process. Device.benchmark was immune because it uses a fixed small thread set; the full server grows its threadpool.

Fixes:

Size the kernel submission ring to the throttle limit (NextPowerOf2, clamped to a kernel-safe ceiling) so io_submit never spills; plumb a max_events parameter C# -> C ABI -> libaio/io_uring handlers.
Never throw from the native epoch table: ReserveEntry yield-spins until a slot frees (kMaxNumThreads 96 -> 256).
Make epoch acquire/release exception-safe via an EpochGuard RAII helper.
Release the epoch outside the submit backoff (unwind + retry) rather than spinning while holding it and the bundle protection.
Add a C ABI exception firewall (CABIGuard) around every native_device export so no C++ exception crosses P-Invoke; surface a logged, actionable error instead of a silent terminate. Managed side logs NativeDevice_GetLastError.
Hard-cap the effective throttle at the kernel-safe ceiling (4096) with a warning, and size the result pool above it; precompute the clamp off the hot Throttle() path.

Verified with an A/B repro: the original code aborts with "terminate called recursively ... Too many threads!" under forced over-subscription, while the fixed build runs the full server stably at 128-256 client threads with epoch tables as small as 16. 70/70 Tsavorite device tests pass.

Note: the Windows prebuilt runtimes/win-x64/native/native_device.dll must be rebuilt on Windows to carry these changes.

Under disk-bound load with many client threads (e.g. Resp.benchmark GET at 4+ threads with --device-io-backend libaio --device-throttle-limit 512), the Garnet server could deadlock/terminate and require kill -9. Root cause: when the throttle limit exceeds the fixed 128-deep kernel submission ring, reads spin in the io_submit/io_uring_get_sqe EAGAIN backoff while holding a FASTER epoch/thread-id slot. Once more than kMaxNumThreads (96) threads hold a slot concurrently, Thread::ReserveEntry threw std::runtime_error("Too many threads!") across the extern "C"/P-Invoke boundary, which is undefined on Linux and triggers std::terminate (often recursively across threads), wedging the process. Device.benchmark was immune because it uses a fixed small thread set; the full server grows its threadpool. Fixes: - Size the kernel submission ring to the throttle limit (NextPowerOf2, clamped to a kernel-safe ceiling) so io_submit never spills; plumb a max_events parameter C# -> C ABI -> libaio/io_uring handlers. The per-context depth is the throttle divided across io_contexts so total ring capacity stays bounded (libaio io_setup draws from the shared fs.aio-max-nr budget). - Never throw from the native epoch table: ReserveEntry yield-spins until a slot frees (kMaxNumThreads 96 -> 256). - Make epoch acquire/release exception-safe via an EpochGuard RAII helper. - Release the epoch outside the submit backoff (unwind + retry) rather than spinning while holding it and the bundle protection. - Add a C ABI exception firewall (CABIGuard) around every native_device export so no C++ exception crosses P-Invoke; surface a logged, actionable error instead of a silent terminate. set_last_error is no-throw so the firewall cannot be defeated on the out-of-memory path. Managed side logs NativeDevice_GetLastError. - Hard-cap the effective throttle at the kernel-safe ceiling (4096) with a warning, and size the result pool above it; precompute the clamp off the hot Throttle() path. Verified with an A/B repro: the original code aborts with "terminate called recursively ... Too many threads!" under forced over-subscription, while the fixed build runs the full server stably at 128-256 client threads with epoch tables as small as 16. 70/70 Tsavorite device tests pass, and Device.benchmark and KV.benchmark larger-than-memory NVMe runs show no regression (within +/-2%). Note: the Windows prebuilt runtimes/win-x64/native/native_device.dll must be rebuilt on Windows to carry these changes. Co-authored-by: Copilot <[email protected]>

The native device requires the Spectre-mitigated CRT libs; when the CMake Visual Studio generator defaults to a newer toolset than the one whose Spectre libs are installed (common with VS 2026 Insiders), configure/link fails. Pass `-T v143` (or the matching toolset) to select it explicitly. Co-authored-by: Copilot <[email protected]>

Independent module-by-module review (two code-review passes + a peer) surfaced several latent defects beyond the original hang fix. Addressed: io_uring (blocker + high): - io_uring_submit() returns the count of ALL flushed SQEs, not "1 for this op". A stale no-op SQE (left by a prior failed-submit/unwind or Wake) made a later submit return 2; the `res == 1` check then treated success as failure and freed an io_context whose op was already in flight -> use-after-free on completion. Treat `res >= 1` as success (our SQE is the last prepared, so any flush includes it); `res < 1` rewrites the prepared SQE to a no-op and frees. Same fix applied to UringIoHandler::Wake. - The uring submit loop spun in-epoch indefinitely on -EAGAIN/-EBUSY. It now yields a bounded budget then unwinds (Status::Pending) like the get_sqe/libaio paths, so the epoch/thread-id slot is never held across a long wait. - UringIoHandler::pick_ring() cached a thread_local ring index without the owner+bounds guard QueueIoHandler::pick_context() has, so a thread that used a uring device with more rings then one with fewer read out of bounds. Mirrored the libaio guard. Managed (high + medium): - Reset/RemoveSegment/GetFileSize/TryComplete captured the native handle without the numPending lease ReadAsync/WriteAsync use, so a concurrent Dispose() could free it mid-call -> native use-after-free. They now lease numPending (with the disposedFlag fast-path) and decrement in finally. Removed the dead, misleading EnsureReadyOrSilent helper. - Throttle() seeded effectiveThrottleLimit to MaxThrottle, so a cold-start burst before lazy device creation bypassed the configured throttle and flooded the just-sized ring. It now gates on the live clamped limit until the handle exists. Cleanup: - Deleted the dead reserving default Thread::ThreadId() ctor (would yield-spin forever if ever used; the only instance uses the explicit kInvalidId ctor). - Unified the in-epoch submit yield budget to a single small constant (16) and removed the unused kMaxSubmitRetries. Documented and deliberately NOT changed: SubmitWithEpoch holds the epoch across log_.ReadAsync -> OpenSegment -> BumpCurrentEpoch (the standard FASTER protocol); a self-deadlock needs 256 concurrent segment opens within one thread's microsecond epoch region and is practically unreachable — and the unwind releases the epoch more often than the original, reducing exposure. Validation: 70/70 device tests; libaio/uring Device.benchmark NVMe storage-bound 0 validation errors at ~740K/738K ops/sec; KV.benchmark 100M storage-bound ~729K ops/sec — no regression vs the pre-review build. Note: the win-x64 native_device.dll is unaffected in behavior (the changes are Linux-submit-path-only or dead-code removal; the shared-header edits are no-ops on the IOCP path) but is built from slightly older source; rebuild for source parity. Co-authored-by: Copilot <[email protected]>

… tests GetFileSize() called log_.size() without acquiring the epoch, but size() can lazily OpenSegment() whose bundle-expand path calls epoch_->BumpCurrentEpoch(). BumpCurrentEpoch requires the calling thread to hold epoch protection; on an unprotected thread ProtectAndDrain dereferences an unreserved epoch-table slot and segfaults. This is reachable in production (recovery, and any GetFileSize on a not-yet-open segment once another segment is open, including after RemoveSegment/Truncate). A cold-device GetFileSize(0) survived only because the first-ever OpenSegment takes a no-bump path. Fix: wrap GetFileSize in EpochGuard, exactly like RemoveSegment / Read / Write. The change is in the shared header, so it applies to all platforms; the Linux .so are rebuilt here and the Windows DLL must be rebuilt to pick it up. Tests (test.hlog/DeviceTests.cs): add coverage that previously had none — - both IO backends (libaio + io_uring) round-trip and parallel reads; - multi-completion-thread / multi-io_context sharding; - multiple devices with different shard counts driven from many threads (regression for the pick_context/pick_ring owner+bounds guard); - throttle -> submission-ring sizing and throttle-above-MaxThrottle clamping; - high submitter concurrency (64 threads) as a direct regression for the original production hang (slot-exhaustion / in-epoch full-ring spin); - late P/Invoke entry points GetFileSize / Reset / TryComplete / RemoveSegment. The RemoveSegment + GetFileSize-after-remove test is what surfaced the crash fixed above. Co-authored-by: Copilot <[email protected]>

Copilot

Pull request overview

This PR hardens Tsavorite’s Linux native storage device against high-concurrency, disk-bound workloads by preventing submit-path backpressure from pinning epoch/thread-id slots and by ensuring native exceptions cannot unwind across the C ABI into .NET, improving reliability and debuggability under load.

Changes:

Size libaio/io_uring submission rings from the configured throttle limit (clamped to a kernel-safe ceiling) and adjust managed throttling to match the effective cap.
Make native epoch/thread-id acquisition exception-safe (RAII) and eliminate “too many threads” abort behavior by yield-spinning until a slot is available.
Add a C ABI exception firewall around native exports and expand unit tests to cover io_uring, sharding, high throttle, and late P/Invoke entry points.

Reviewed changes

Copilot reviewed 10 out of 14 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
libs/storage/Tsavorite/cs/test/test.hlog/DeviceTests.cs	Adds “phase 8” coverage for io_uring, sharded contexts, multi-device shard-index safety, high throttle/high concurrency, and late native entry points.
libs/storage/Tsavorite/cs/src/core/Device/NativeStorageDevice.cs	Implements throttle clamping, computes native ring depth from throttle, plumbs max-events into native creation, improves native error surfacing, and hardens late-entry P/Invoke calls against dispose races.
libs/storage/Tsavorite/cc/src/device/thread.h	Increases thread-id table size and changes slot exhaustion from throwing to yield/sleep spinning; deletes unused reserving default ctor.
libs/storage/Tsavorite/cc/src/device/native_device.h	Adds EpochGuard RAII and SubmitWithEpoch to avoid holding epoch across sustained submit backoff; plumbs max-events into handler creation.
libs/storage/Tsavorite/cc/src/device/native_device_wrapper.cc	Adds CABIGuard/CABIGuardVoid firewalling around exports; plumbs max-events to device constructors; adds exception sentinel for QueueRunFor.
libs/storage/Tsavorite/cc/src/device/native_device_error.h	Makes set_last_error noexcept and best-effort to avoid exceptions escaping the firewall.
libs/storage/Tsavorite/cc/src/device/file_windows.h	Adds 3-arg handler ctor for cross-platform signature symmetry (max_events ignored on Windows).
libs/storage/Tsavorite/cc/src/device/file_linux.h	Adds per-context/ring max_events plumbing, power-of-two rounding for io_uring, and TLS owner/bounds guard for sharding affinity.
libs/storage/Tsavorite/cc/src/device/file_linux.cc	Converts sustained ring-full waits into an unwind (Status::Pending) so the caller can retry outside epoch; fixes io_uring_submit success criteria.
libs/storage/Tsavorite/cc/README.md	Documents explicit MSVC toolset selection for Spectre CRT/linker compatibility.

…floor Addresses PR review comment: the doc comment said 'simply NextPowerOf2(ThrottleLimit)' for single context, but the implementation floors to DefaultNativeRingDepth (128), so ThrottleLimit < 128 returns 128, not NextPowerOf2(ThrottleLimit). Updated comment to match the code: max(DefaultNativeRingDepth, NextPowerOf2(ThrottleLimit)). Co-authored-by: Copilot <[email protected]>

badrishc force-pushed the badrishc/native-device-fixes branch from 9defffa to 067a448 Compare June 17, 2026 01:40

badrishc changed the title ~~Fix native storage device hang under high client concurrency~~ Native storage device: improved error handling Jun 17, 2026

badrishc and others added 5 commits June 16, 2026 18:48

Add binaries

7dda840

Add binaries

0228f2d

badrishc marked this pull request as ready for review June 17, 2026 04:06

Copilot AI review requested due to automatic review settings June 17, 2026 04:06

Copilot started reviewing on behalf of badrishc June 17, 2026 04:07 View session

Copilot AI reviewed Jun 17, 2026

View reviewed changes

Comment thread libs/storage/Tsavorite/cs/src/core/Device/NativeStorageDevice.cs Outdated

TedHartMS approved these changes Jun 17, 2026

View reviewed changes

Merge branch 'main' into badrishc/native-device-fixes

df74069

TedHartMS merged commit 94d6a1f into main Jun 17, 2026
400 of 401 checks passed

TedHartMS deleted the badrishc/native-device-fixes branch June 17, 2026 18:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Native storage device: improved error handling#1884

Native storage device: improved error handling#1884
TedHartMS merged 8 commits into
mainfrom
badrishc/native-device-fixes

badrishc commented Jun 17, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

badrishc commented Jun 17, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants