Native storage device: improved error handling#1884
Merged
Conversation
Under disk-bound load with many client threads (e.g. Resp.benchmark GET at
4+ threads with --device-io-backend libaio --device-throttle-limit 512), the
Garnet server could deadlock/terminate and require kill -9.
Root cause: when the throttle limit exceeds the fixed 128-deep kernel
submission ring, reads spin in the io_submit/io_uring_get_sqe EAGAIN backoff
while holding a FASTER epoch/thread-id slot. Once more than kMaxNumThreads (96)
threads hold a slot concurrently, Thread::ReserveEntry threw
std::runtime_error("Too many threads!") across the extern "C"/P-Invoke
boundary, which is undefined on Linux and triggers std::terminate (often
recursively across threads), wedging the process. Device.benchmark was immune
because it uses a fixed small thread set; the full server grows its threadpool.
Fixes:
- Size the kernel submission ring to the throttle limit (NextPowerOf2, clamped
to a kernel-safe ceiling) so io_submit never spills; plumb a max_events
parameter C# -> C ABI -> libaio/io_uring handlers. The per-context depth is
the throttle divided across io_contexts so total ring capacity stays bounded
(libaio io_setup draws from the shared fs.aio-max-nr budget).
- Never throw from the native epoch table: ReserveEntry yield-spins until a slot
frees (kMaxNumThreads 96 -> 256).
- Make epoch acquire/release exception-safe via an EpochGuard RAII helper.
- Release the epoch outside the submit backoff (unwind + retry) rather than
spinning while holding it and the bundle protection.
- Add a C ABI exception firewall (CABIGuard) around every native_device export
so no C++ exception crosses P-Invoke; surface a logged, actionable error
instead of a silent terminate. set_last_error is no-throw so the firewall
cannot be defeated on the out-of-memory path. Managed side logs
NativeDevice_GetLastError.
- Hard-cap the effective throttle at the kernel-safe ceiling (4096) with a
warning, and size the result pool above it; precompute the clamp off the hot
Throttle() path.
Verified with an A/B repro: the original code aborts with "terminate called
recursively ... Too many threads!" under forced over-subscription, while the
fixed build runs the full server stably at 128-256 client threads with epoch
tables as small as 16. 70/70 Tsavorite device tests pass, and Device.benchmark
and KV.benchmark larger-than-memory NVMe runs show no regression (within +/-2%).
Note: the Windows prebuilt runtimes/win-x64/native/native_device.dll must be
rebuilt on Windows to carry these changes.
Co-authored-by: Copilot <[email protected]>
9defffa to
067a448
Compare
The native device requires the Spectre-mitigated CRT libs; when the CMake Visual Studio generator defaults to a newer toolset than the one whose Spectre libs are installed (common with VS 2026 Insiders), configure/link fails. Pass `-T v143` (or the matching toolset) to select it explicitly. Co-authored-by: Copilot <[email protected]>
Independent module-by-module review (two code-review passes + a peer) surfaced several latent defects beyond the original hang fix. Addressed: io_uring (blocker + high): - io_uring_submit() returns the count of ALL flushed SQEs, not "1 for this op". A stale no-op SQE (left by a prior failed-submit/unwind or Wake) made a later submit return 2; the `res == 1` check then treated success as failure and freed an io_context whose op was already in flight -> use-after-free on completion. Treat `res >= 1` as success (our SQE is the last prepared, so any flush includes it); `res < 1` rewrites the prepared SQE to a no-op and frees. Same fix applied to UringIoHandler::Wake. - The uring submit loop spun in-epoch indefinitely on -EAGAIN/-EBUSY. It now yields a bounded budget then unwinds (Status::Pending) like the get_sqe/libaio paths, so the epoch/thread-id slot is never held across a long wait. - UringIoHandler::pick_ring() cached a thread_local ring index without the owner+bounds guard QueueIoHandler::pick_context() has, so a thread that used a uring device with more rings then one with fewer read out of bounds. Mirrored the libaio guard. Managed (high + medium): - Reset/RemoveSegment/GetFileSize/TryComplete captured the native handle without the numPending lease ReadAsync/WriteAsync use, so a concurrent Dispose() could free it mid-call -> native use-after-free. They now lease numPending (with the disposedFlag fast-path) and decrement in finally. Removed the dead, misleading EnsureReadyOrSilent helper. - Throttle() seeded effectiveThrottleLimit to MaxThrottle, so a cold-start burst before lazy device creation bypassed the configured throttle and flooded the just-sized ring. It now gates on the live clamped limit until the handle exists. Cleanup: - Deleted the dead reserving default Thread::ThreadId() ctor (would yield-spin forever if ever used; the only instance uses the explicit kInvalidId ctor). - Unified the in-epoch submit yield budget to a single small constant (16) and removed the unused kMaxSubmitRetries. Documented and deliberately NOT changed: SubmitWithEpoch holds the epoch across log_.ReadAsync -> OpenSegment -> BumpCurrentEpoch (the standard FASTER protocol); a self-deadlock needs 256 concurrent segment opens within one thread's microsecond epoch region and is practically unreachable — and the unwind releases the epoch more often than the original, reducing exposure. Validation: 70/70 device tests; libaio/uring Device.benchmark NVMe storage-bound 0 validation errors at ~740K/738K ops/sec; KV.benchmark 100M storage-bound ~729K ops/sec — no regression vs the pre-review build. Note: the win-x64 native_device.dll is unaffected in behavior (the changes are Linux-submit-path-only or dead-code removal; the shared-header edits are no-ops on the IOCP path) but is built from slightly older source; rebuild for source parity. Co-authored-by: Copilot <[email protected]>
… tests GetFileSize() called log_.size() without acquiring the epoch, but size() can lazily OpenSegment() whose bundle-expand path calls epoch_->BumpCurrentEpoch(). BumpCurrentEpoch requires the calling thread to hold epoch protection; on an unprotected thread ProtectAndDrain dereferences an unreserved epoch-table slot and segfaults. This is reachable in production (recovery, and any GetFileSize on a not-yet-open segment once another segment is open, including after RemoveSegment/Truncate). A cold-device GetFileSize(0) survived only because the first-ever OpenSegment takes a no-bump path. Fix: wrap GetFileSize in EpochGuard, exactly like RemoveSegment / Read / Write. The change is in the shared header, so it applies to all platforms; the Linux .so are rebuilt here and the Windows DLL must be rebuilt to pick it up. Tests (test.hlog/DeviceTests.cs): add coverage that previously had none — - both IO backends (libaio + io_uring) round-trip and parallel reads; - multi-completion-thread / multi-io_context sharding; - multiple devices with different shard counts driven from many threads (regression for the pick_context/pick_ring owner+bounds guard); - throttle -> submission-ring sizing and throttle-above-MaxThrottle clamping; - high submitter concurrency (64 threads) as a direct regression for the original production hang (slot-exhaustion / in-epoch full-ring spin); - late P/Invoke entry points GetFileSize / Reset / TryComplete / RemoveSegment. The RemoveSegment + GetFileSize-after-remove test is what surfaced the crash fixed above. Co-authored-by: Copilot <[email protected]>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR hardens Tsavorite’s Linux native storage device against high-concurrency, disk-bound workloads by preventing submit-path backpressure from pinning epoch/thread-id slots and by ensuring native exceptions cannot unwind across the C ABI into .NET, improving reliability and debuggability under load.
Changes:
- Size libaio/io_uring submission rings from the configured throttle limit (clamped to a kernel-safe ceiling) and adjust managed throttling to match the effective cap.
- Make native epoch/thread-id acquisition exception-safe (RAII) and eliminate “too many threads” abort behavior by yield-spinning until a slot is available.
- Add a C ABI exception firewall around native exports and expand unit tests to cover io_uring, sharding, high throttle, and late P/Invoke entry points.
Reviewed changes
Copilot reviewed 10 out of 14 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| libs/storage/Tsavorite/cs/test/test.hlog/DeviceTests.cs | Adds “phase 8” coverage for io_uring, sharded contexts, multi-device shard-index safety, high throttle/high concurrency, and late native entry points. |
| libs/storage/Tsavorite/cs/src/core/Device/NativeStorageDevice.cs | Implements throttle clamping, computes native ring depth from throttle, plumbs max-events into native creation, improves native error surfacing, and hardens late-entry P/Invoke calls against dispose races. |
| libs/storage/Tsavorite/cc/src/device/thread.h | Increases thread-id table size and changes slot exhaustion from throwing to yield/sleep spinning; deletes unused reserving default ctor. |
| libs/storage/Tsavorite/cc/src/device/native_device.h | Adds EpochGuard RAII and SubmitWithEpoch to avoid holding epoch across sustained submit backoff; plumbs max-events into handler creation. |
| libs/storage/Tsavorite/cc/src/device/native_device_wrapper.cc | Adds CABIGuard/CABIGuardVoid firewalling around exports; plumbs max-events to device constructors; adds exception sentinel for QueueRunFor. |
| libs/storage/Tsavorite/cc/src/device/native_device_error.h | Makes set_last_error noexcept and best-effort to avoid exceptions escaping the firewall. |
| libs/storage/Tsavorite/cc/src/device/file_windows.h | Adds 3-arg handler ctor for cross-platform signature symmetry (max_events ignored on Windows). |
| libs/storage/Tsavorite/cc/src/device/file_linux.h | Adds per-context/ring max_events plumbing, power-of-two rounding for io_uring, and TLS owner/bounds guard for sharding affinity. |
| libs/storage/Tsavorite/cc/src/device/file_linux.cc | Converts sustained ring-full waits into an unwind (Status::Pending) so the caller can retry outside epoch; fixes io_uring_submit success criteria. |
| libs/storage/Tsavorite/cc/README.md | Documents explicit MSVC toolset selection for Spectre CRT/linker compatibility. |
…floor Addresses PR review comment: the doc comment said 'simply NextPowerOf2(ThrottleLimit)' for single context, but the implementation floors to DefaultNativeRingDepth (128), so ThrottleLimit < 128 returns 128, not NextPowerOf2(ThrottleLimit). Updated comment to match the code: max(DefaultNativeRingDepth, NextPowerOf2(ThrottleLimit)). Co-authored-by: Copilot <[email protected]>
TedHartMS
approved these changes
Jun 17, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Under disk-bound load with many client threads (e.g. Resp.benchmark GET at 4+ threads with --device-io-backend libaio --device-throttle-limit 512), the Garnet server could deadlock/terminate and require kill -9.
Root cause: when the throttle limit exceeds the fixed 128-deep kernel submission ring, reads spin in the io_submit/io_uring_get_sqe EAGAIN backoff while holding a FASTER epoch/thread-id slot. Once more than kMaxNumThreads (96) threads hold a slot concurrently, Thread::ReserveEntry threw std::runtime_error("Too many threads!") across the extern "C"/P-Invoke boundary, which is undefined on Linux and triggers std::terminate (often recursively across threads), wedging the process. Device.benchmark was immune because it uses a fixed small thread set; the full server grows its threadpool.
Fixes:
Verified with an A/B repro: the original code aborts with "terminate called recursively ... Too many threads!" under forced over-subscription, while the fixed build runs the full server stably at 128-256 client threads with epoch tables as small as 16. 70/70 Tsavorite device tests pass.
Note: the Windows prebuilt runtimes/win-x64/native/native_device.dll must be rebuilt on Windows to carry these changes.