[WIP] Use io_uring for sockets on Linux #124374
Conversation
Pull request overview
This PR implements an experimental opt-in io_uring-backed socket event engine for Linux as an alternative to epoll. The implementation is comprehensive, including both readiness-based polling (Phase 1) and completion-based I/O operations (Phase 2), along with extensive testing infrastructure and evidence collection tooling.
Changes:
- Native layer: cmake configuration, PAL networking headers, and io_uring system call integration with graceful epoll fallback
- Managed layer: socket async engine extensions for io_uring completion handling, operation lifecycle tracking, buffer pinning, and telemetry
- Testing: comprehensive functional tests, layout contract validation, stress tests, and CI infrastructure for dual-mode test execution
- Tooling: evidence collection and validation scripts for performance comparison and envelope testing
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/native/libs/configure.cmake | Adds CMake configuration checks for io_uring header and poll32_events struct member |
| src/native/libs/System.Native/pal_networking.h | Defines new io_uring interop structures (IoUringCompletion, IoUringSocketEventPortDiagnostics) and function signatures |
| src/native/libs/System.Native/entrypoints.c | Registers new io_uring-related PAL export entry points |
| src/native/libs/Common/pal_config.h.in | Adds CMake defines for io_uring feature detection |
| src/libraries/System.Net.Sockets/tests/FunctionalTests/TelemetryTest.cs | Adds layout contract tests for io_uring interop structures and telemetry counter verification |
| src/libraries/System.Net.Sockets/tests/FunctionalTests/System.Net.Sockets.Tests.csproj | Implements MSBuild infrastructure for creating io_uring test archive variants (enabled/disabled/default) |
| src/libraries/System.Net.Sockets/tests/FunctionalTests/IoUring.Unix.cs | Adds comprehensive functional and stress tests for io_uring socket workflows |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketsTelemetry.cs | Adds 12 new PollingCounters for io_uring observability metrics |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs | Implements managed wrappers for io_uring prepare operations with error handling |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs | Core io_uring integration: submission batching, completion handling, operation tracking, and diagnostics polling |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs | Operation-level io_uring support: buffer pinning, user_data allocation, completion processing, and state machine |
| src/libraries/Common/src/Interop/Unix/System.Native/Interop.SocketEvent.cs | Defines managed interop structures matching native layout for io_uring operations |
| eng/testing/io-uring/validate-collect-sockets-io-uring-evidence-smoke.sh | Smoke validation script for evidence collection tooling |
| eng/testing/io-uring/collect-sockets-io-uring-evidence.sh | Comprehensive evidence collection script for functional/perf validation and envelope testing |
| docs/workflow/testing/libraries/testing.md | Adds references to io_uring-specific documentation |
| docs/workflow/testing/libraries/testing-linux-sockets-io-uring.md | Detailed validation guide for io_uring backend testing |
| docs/workflow/testing/libraries/io-uring-pr-evidence-template.md | PR evidence template for documenting io_uring validation results |
```csharp
{
    get
    {
        Segment head = Volatile.Read(ref _head.Value)!;
```

This seems pretty computationally heavy; is there a reason you can't just have a single `_count` variable that you atomically increment/decrement and check for 0 here?
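A minimal sketch of that suggestion, in C for brevity (the type and function names are hypothetical, not from the PR): one atomically maintained count next to the queue, whose zero check replaces the segment walk.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical sketch: a single atomic counter updated on enqueue/dequeue. */
typedef struct {
    atomic_long count;
} queue_counter;

static void on_enqueue(queue_counter *q)
{
    atomic_fetch_add_explicit(&q->count, 1, memory_order_release);
}

static void on_dequeue(queue_counter *q)
{
    atomic_fetch_sub_explicit(&q->count, 1, memory_order_release);
}

static int is_empty(queue_counter *q)
{
    /* Still only a snapshot: a concurrent enqueue may be mid-flight
     * at the moment this returns 1. */
    return atomic_load_explicit(&q->count, memory_order_acquire) == 0;
}
```

Like the segment-walking version, this trades linearizability for cheapness; the check stays O(1) regardless of segment count.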
```csharp
        fixedRecvBufferId,
        ref completionAuxiliaryData))
{
    completionResultCode = -Interop.Sys.ConvertErrorPalToPlatform(Interop.Error.ENOBUFS);
```

Why the negation? I see you do it below as well. A quick search around the repo found this referenced in only one other place, and that code did not negate, nor do the callers of that code appear to negate.
```c
int32_t state = atomic_load_explicit(&s_forceEnterEintrRetryLimitOnce, memory_order_relaxed);
if (state < 0)
{
    const char* configuredValue = getenv(SHIM_TEST_FORCE_ENTER_EINTR_RETRY_LIMIT_ONCE_ENV);
```

Should this be behind a `#ifdef DEBUG`?
```csharp
private const string ConnectActivityName = ActivitySourceName + ".Connect";
private static readonly ActivitySource s_connectActivitySource = new ActivitySource(ActivitySourceName);

internal static class Keywords
```

Maybe `IoUringKeywords` would be a more descriptive name.
```csharp
#if DEBUG
// Test-only knob to make wait-buffer saturation deterministic for io_uring diagnostics coverage.
// Only available in DEBUG builds so production code never reads test env vars.
if (OperatingSystem.IsLinux())
```

Should you also check `DOTNET_SYSTEM_NET_SOCKETS_IO_URING`, or do we assume that `DOTNET_SYSTEM_NET_SOCKETS_IO_URING_TEST_EVENT_BUFFER_COUNT` is only set when the feature flag is enabled?
```csharp
try
{
    RecordAndAssertEventLoopThreadIdentity();
    LinuxEventLoopEnableRings();
```

Wonder if these could be more generic, i.e.:

- `LinuxEventLoopEnableRings` -> `EventLoopInit`
- `LinuxEventLoopBeforeWait` -> `EventLoopBeforeWait`
- `LinuxEventLoopTryCompletionWait` -> `EventLoopTryCompleteWait`

etc.

Hmm, I guess it would be an issue if someone wanted to add their own "EventLoopInit" or equivalent for the other methods :)
```csharp
}
else
{
    Debug.Assert(
```

Does this mean we have not tested this on kernels before 6.1?
```csharp
{
    // Snapshot the wakeup generation counter before entering the blocking syscall.
    // After waking, we compare to detect wakeups that arrived during the syscall.
    uint wakeGenBefore = Volatile.Read(ref _ioUringWakeupGeneration);
```

You are going to need to define this outside the `if` statement so you can reference it after the if/else.
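The snapshot/compare pattern the snippet relies on can be shown in isolation (illustrative C; names are hypothetical): take the generation before blocking, and afterwards treat any change -- compared with `!=` so wraparound is harmless -- as a missed wakeup.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical sketch of a wakeup-generation counter. */
static atomic_uint s_wakeupGeneration;

static void request_wakeup(void)
{
    atomic_fetch_add(&s_wakeupGeneration, 1);   /* wrapping increment */
}

static uint32_t snapshot_generation(void)
{
    return atomic_load(&s_wakeupGeneration);
}

static int wakeup_arrived_since(uint32_t before)
{
    /* != rather than > so the comparison survives counter wraparound */
    return atomic_load(&s_wakeupGeneration) != before;
}
```

As the comment notes, the snapshot has to live in the enclosing scope so both the blocking branch and the post-wait comparison can see it.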
```csharp
/// <summary>
/// Returns whether SQPOLL mode has been explicitly requested.
/// SQPOLL requires dual opt-in: AppContext switch + environment variable.
/// This is intentionally stricter than the primary io_uring gate
/// (`IsIoUringEnabled`), which accepts either source.
/// SQPOLL pins a kernel thread, so accidental activation should require
/// explicit confirmation from both configuration surfaces.
/// </summary>
private static bool IsSqPollRequested()
{
    IoUringConfigurationInputs inputs = ReadIoUringConfigurationInputs();
    return ResolveSqPollRequested(inputs);
}

private static bool ResolveSqPollRequested(in IoUringConfigurationInputs inputs)
{
    if (!inputs.SqPollFeatureSwitchEnabled)
    {
        return false;
    }

    return string.Equals(inputs.SqPollEnvironmentValue, "1", StringComparison.Ordinal);
}
```

The dual knob feels unnecessary here, since runtime configuration is typically supplied either via the environment variable or via the AppContext switch, not both. For example, setting `DOTNET_Thread_DefaultStackSize` too low fails to initialize threads, while setting it too high just burns resources. Users enabling this kind of feature are expected to understand the implications rather than being gated by added obscurity.

Also consider caching these like existing knobs in main. We typically perform a one-time static lookup per process and do not support changing this kind of configuration mid-process.

Agreed -- .NET already has knobs that can be activated from either environment variables or runtime configuration, and that would degrade performance if used by more than a single process at the same time, such as Server GC.
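The one-time lookup the reviewers describe can be sketched like this (illustrative C; the helper name is hypothetical, and a real implementation would also make the initialization thread-safe): resolve the knob on first use, cache it, and ignore later environment changes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of a once-per-process cached knob read. */
static int is_io_uring_enabled(void)
{
    static int s_cached = -1;      /* -1 = not yet resolved */
    if (s_cached < 0)
    {
        const char *value = getenv("DOTNET_SYSTEM_NET_SOCKETS_IO_URING");
        s_cached = (value != NULL && strcmp(value, "1") == 0) ? 1 : 0;
    }
    return s_cached;
}
```

After the first call, changing the environment has no effect -- matching the "do not support changing this kind of configuration mid-process" expectation.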
Putting back to draft as it still needs some work.
Contributes to #753
Summary
This document describes the complete, production-grade io_uring socket I/O engine in .NET's `System.Net.Sockets` layer. When enabled via `DOTNET_SYSTEM_NET_SOCKETS_IO_URING=1` on Linux kernel 6.1+, the engine replaces epoll with a managed io_uring completion-mode backend that:

- reads `/sys/devices/system/cpu` topology, with CPU-aware socket migration on first receive completion

The native shim is intentionally minimal -- 537 lines of C wrapping the three io_uring syscalls (setup, enter, register) plus eventfd and mmap helpers. All ring management, SQE construction, CQE dispatch, operation lifecycle, feature negotiation, overflow recovery, and SQPOLL wakeup detection live in managed code.
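The shape of such a thin shim can be sketched as follows. This is illustrative C, not the actual `pal_io_uring_shim.c`: the enter function is injected so the bounded EINTR retry (the document cites a 1024-iteration circuit breaker) can be exercised without a live ring.

```c
#include <assert.h>
#include <errno.h>

/* Hedged sketch of a bounded EINTR retry around io_uring_enter-like calls. */
#define ENTER_EINTR_RETRY_LIMIT 1024

typedef int (*enter_fn)(void *state);

static int enter_with_eintr_retry(enter_fn enter, void *state)
{
    for (int attempt = 0; attempt < ENTER_EINTR_RETRY_LIMIT; attempt++)
    {
        int rc = enter(state);
        if (rc >= 0 || errno != EINTR)
            return rc;                 /* success, or a non-EINTR failure */
    }
    errno = EINTR;                     /* circuit breaker tripped */
    return -1;
}

/* Test double: fail with EINTR a few times, then report 7 submitted SQEs. */
static int flaky_enter(void *state)
{
    int *failures_left = state;
    if (*failures_left > 0)
    {
        (*failures_left)--;
        errno = EINTR;
        return -1;
    }
    return 7;
}
```

Bounding the retry loop keeps a pathological signal storm from wedging the event loop forever, at the cost of surfacing a rare spurious EINTR to the caller.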
The engine proper is organized as eight partial class files extending `SocketAsyncEngine`: the main file (`SocketAsyncEngine.Linux.cs`, 4,664 lines) holds ring setup, flag negotiation, CQE drain, SQE prep orchestration, completion slot layout, multi-engine topology detection, SO_REUSEPORT shadow listener management, CPU-affinity-based socket migration, and the event loop; the remaining seven partials handle ring mmap lifecycle (`IoUringRings`, 365 lines), completion slot pool management (`IoUringSlots`, 469 lines), SQE writing (`IoUringSqeWriters`, 249 lines), completion dispatch (`IoUringCompletionDispatch`, 847 lines), diagnostics logging (`IoUringDiagnostics`, 164 lines), configuration resolution (`IoUringConfiguration`, 429 lines), and debug test hook stubs (`IoUringTestHooks.Stubs`, 15 lines). A separate `IoUringTestAccessors.Linux.cs` file (1,048 lines) in the test infrastructure directory exposes all test-observable state through strongly-typed accessors. Tests access this surface through `InternalTestShims.Linux.cs` (707 lines), a centralized reflection shim with `[DynamicDependency]` annotations for trimmer/AOT safety.

Key metrics:
2. Architecture
Ring Ownership and Event Loop
The architecture follows the SINGLE_ISSUER contract: exactly one thread -- the event loop thread -- owns each io_uring instance. All ring mutations (SQE writes, CQ head advances, io_uring_enter calls) happen on this thread. Other threads communicate via two MPSC queues.
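The single-owner handoff contract can be sketched outside the engine. This is not the PR's `MpscQueue` -- just the underlying idea, in illustrative C: producers on any thread publish work with a CAS, and only the ring-owning thread detaches and processes the batch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative intrusive publish/drain pair; names are not from the PR. */
typedef struct work_item {
    struct work_item *next;
    int payload;
} work_item;

static _Atomic(work_item *) s_inbox;

static void producer_publish(work_item *item)   /* any worker thread */
{
    work_item *head = atomic_load(&s_inbox);
    do {
        item->next = head;
    } while (!atomic_compare_exchange_weak(&s_inbox, &head, item));
}

static work_item *owner_take_all(void)          /* event loop thread only */
{
    return atomic_exchange(&s_inbox, NULL);
}
```

Note this toy drains in LIFO order; the engine's segmented MPSC queues preserve FIFO order, but the ownership split -- many publishers, one consumer that touches the ring -- is the same.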
```mermaid
graph TD
    WT[Worker Threads] -->|"MpscQueue<IoUringPrepareWorkItem>"| EL[Event Loop Thread]
    WT -->|"MpscQueue<ulong> (cancel)"| EL
    WT -->|"eventfd write (wake)"| EL
    EL -->|"Writes SQEs / Drains CQEs / io_uring_enter"| K[Kernel - io_uring]
    K -->|"CQE completions"| EL
    EL -->|"ThreadPool.QueueUserWorkItem"| TP[ThreadPool]
```

Multi-Engine Topology
When io_uring is enabled, the engine array is sized according to detected physical core topology. `LinuxInitializeEngineAffinityTopology` reads `/sys/devices/system/cpu/cpu*/topology/{physical_package_id,core_id}` to discover physical core groups, then creates one engine per physical core (up to the configured engine count cap). Each engine's event loop thread is pinned to its representative CPU via `sched_setaffinity`. A `s_cpuToEngineIndex` mapping array enables CPU-aware socket placement.

When topology detection fails, the engine count falls back to `Math.Min(Environment.ProcessorCount, 32)` with no CPU pinning.

CPU-Aware Socket Migration
On the first receive completion for a connected socket, `TryMigrateIoUringEngineOnFirstReceiveCompletion` reads `SO_INCOMING_CPU` via `getsockopt` and looks up the target engine via `GetEngineIndexForCpu`. If the socket's current engine differs from the CPU-optimal engine, the socket migrates to the target engine. This one-shot migration (guarded by `_migrationState`) improves cache locality for workloads where the kernel's receive-side CPU selection is stable.

SO_REUSEPORT Accept Distribution
For listening sockets with `SO_REUSEPORT` enabled and multiple engines active, the engine arms shadow listener sockets on non-primary engines. Each shadow listener duplicates the primary listener's socket via `SO_REUSEPORT` and arms its own multishot accept SQE. Accepted file descriptors from shadow listeners are forwarded to the primary listener's pre-accept queue via `DispatchReusePortAcceptIoUringCompletion`, which enqueues a readiness fallback event on the primary engine. This distributes accept load across kernel completion queues without requiring the application to manage multiple listener sockets.

Shadow listener setup requests flow through `MpscQueue<ReusePortShadowSetupRequest>`, and accepted fd engine affinity is tracked in the `s_fdEngineAffinity` array so that subsequent `TryRegisterSocket` calls can place the accepted socket on the same engine that accepted it.

The `IoUringCompletionOperationKind.ReusePortAccept` variant distinguishes shadow-listener accept slots from primary-listener accept slots in the CQE dispatch path. Shadow accept slots carry cross-engine references (`ReusePortPrimaryContext`, `ReusePortPrimaryEngine`) in `IoUringCompletionSlotStorage`.

SO_REUSEPORT accept distribution is an emergency-killable feature via `DOTNET_SYSTEM_NET_SOCKETS_IO_URING_DISABLE_REUSEPORT_ACCEPT=1`.

The Thin Native Shim Approach
The native shim (`pal_io_uring_shim.c`, 537 lines) wraps exactly:

- `io_uring_setup` (via `syscall(__NR_io_uring_setup, ...)` with a `SYS_io_uring_setup` fallback)
- `io_uring_enter` (with and without EXT_ARG; EINTR retry with a 1024-iteration circuit breaker)
- `io_uring_register`
- `mmap`/`munmap` (for ring mapping)
- `eventfd`/`read`/`write` (for cross-thread wakeup; EINTR-looped)
- `uname` (for kernel version detection)
- `sched_setaffinity`/`sched_getaffinity` (for CPU pinning)

All ring pointer arithmetic, SQE field population, CQE parsing, SQPOLL wakeup detection (via
`Volatile.Read` on the mmap'd SQ flags word), overflow recovery, and operation lifecycle management all happen in managed C#. This is deliberate:

- only raw syscalls and `<linux/io_uring.h>` constants -- no liburing dependency
- layout drift is guarded (`_Static_assert(IORING_SETUP_CLOEXEC == (1U << 19), ...)` in the shim, and layout contract tests in C#)

Threading Model
Each engine's event loop thread owns:
_completionSlots[]/_completionSlotStorage[]arraysSQ_NEED_WAKEUPon the mmap'd SQ flags pointerWorker threads interact solely through:
TryEnqueueIoUringPreparation()-> MPSC prepare queue -> eventfd writeTryRequestIoUringCancellation()-> MPSC cancel queue -> eventfd writeVolatile.Readon_ioUringTeardownInitiatedto avoid publishing work after shutdownio_uring initialization is deferred to the event loop thread so that
`io_uring_setup` sets `submitter_task` to the event loop thread, as required by `DEFER_TASKRUN`. `TryRegisterSocket` waits on a `ManualResetEventSlim` (`_ioUringInitSignal`) before handing sockets to an engine, ensuring no socket registers before initialization completes.

Partial Class File Organization
- `SocketAsyncEngine.Linux.cs`
- `SocketAsyncEngine.IoUringSlots.Linux.cs`
- `SocketAsyncEngine.IoUringRings.Linux.cs` -- `TryMmapRings`: maps SQ/CQ/SQE regions, validates mmap offset bounds, derives all ring pointers. `CleanupManagedRings`: multi-step teardown. `LinuxFreeIoUringResources`: full teardown orchestration
- `SocketAsyncEngine.IoUringSqeWriters.Linux.cs` -- `Write*Sqe` methods: send, sendZc, recv, readFixed, providedBufferRecv, multishotRecv, accept, multishotAccept, sendMsg, sendMsgZc, recvMsg, connect, asyncCancel. Deduplicated via `WriteSendLikeSqe` and `WriteSendMsgLikeSqe`
- `SocketAsyncEngine.IoUringCompletionDispatch.Linux.cs` -- `SocketEventHandler` partial: `DispatchSingleIoUringCompletion`, `DispatchMultishotIoUringCompletion`, `DispatchZeroCopyIoUringNotification`, `DispatchReusePortAcceptIoUringCompletion`, multishot accept/recv dispatch, buffer materialization, completion result routing
- `SocketAsyncEngine.IoUringDiagnostics.Linux.cs` -- `PublishIoUringManagedDiagnosticsDelta`, periodic provided buffer ring resize evaluation, zero-copy NOTIF pending slot gauge sampling
- `SocketAsyncEngine.IoUringConfiguration.Linux.cs` -- `IsIoUringEnabled`, `IsSqPollRequested`, `IsZeroCopySendOptedIn`, `IsIoUringDirectSqeDisabled`, `IsMultishotAcceptDisabled`, `IsReusePortAcceptDisabled` with `[FeatureSwitchDefinition]` annotations for JIT-eliminable code paths; `LinuxInitializeEngineAffinityTopology` with physical core topology detection via sysfs; CPU pinning via `sched_setaffinity`
- `SocketAsyncEngine.IoUringTestHooks.Stubs.Linux.cs` -- `#if DEBUG`-gated test hook partials
- `IoUringTestAccessors.Linux.cs`

Submission Path: Standard vs. SQPOLL
In standard mode,
`io_uring_enter` submits pending SQEs and optionally waits for CQEs. In SQPOLL mode, a kernel thread continuously polls the SQ ring. Managed code detects idle via `Volatile.Read` on the mmap'd `_managedSqFlagsPtr`, checking for `IORING_SQ_NEED_WAKEUP`. When the kernel thread is awake, no `io_uring_enter` is needed for submission.

Flag Negotiation (Peel Loop)
Setup builds an initial flag set:
`CQSIZE | SUBMIT_ALL | COOP_TASKRUN | SINGLE_ISSUER | NO_SQARRAY | CLOEXEC`. SQPOLL (mutually exclusive with DEFER_TASKRUN) or DEFER_TASKRUN is added based on configuration. On `EINVAL`, flags are peeled in order: `NO_SQARRAY` first, then `CLOEXEC`. `EPERM` is never retried (respects seccomp/kernel policy). After setup, `FD_CLOEXEC` is set as a fallback via `fcntl` for kernels where `IORING_SETUP_CLOEXEC` was peeled.

CQ Overflow Recovery State Machine
CQ overflow is detected on every
`DrainCqeRingBatch` entry via `ObserveManagedCqOverflowCounter`, which compares the mmap'd overflow counter against the last-observed value using wrapping uint32 delta arithmetic. When a delta is seen, the engine enters a three-branch recovery state machine:

- `_liveAcceptCompletionSlotCount > 0` and not in teardown. Defers multishot accept re-arm nudges until post-drain.
- `_ioUringTeardownInitiated` is set. Teardown owns recovery completion.

During overflow recovery, CQ head advances happen per-CQE (not batched) to relieve kernel pressure immediately. Recovery completes when the CQ ring is fully drained and no new overflow delta is observed. On completion:
`AssertCompletionSlotPoolConsistency` validates free-list integrity, telemetry is incremented, and for the MultishotAcceptArming branch, `TryQueueDeferredMultishotAcceptRearmAfterRecovery` nudges accept contexts.

After recovery completes, a delayed sweep (
`TrySweepStaleTrackedIoUringOperationsAfterCqOverflowRecovery`) fires 250ms later to retire tracked operations whose CQEs were dropped. The sweep skips intentionally long-lived multishot accept and persistent multishot recv slots. Operations still in the waiting state are canceled; already-transitioned operations are detached and their slots freed.

3. Key Data Structures
Completion Slot Pool
Four parallel SoA arrays, all indexed by slot index:
- `IoUringCompletionSlot[]` (hot, 24 bytes each, `[StructLayout(LayoutKind.Explicit, Size = 24)]`):
  - `Generation` (ulong) -- 40-bit generation field
  - `FreeListNext` (int) -- intrusive free list, -1 = end
  - `_packedState` (uint) -- `IoUringCompletionOperationKind` in low 8 bits, boolean flags `IsZeroCopySend`/`ZeroCopyNotificationPending`/`UsesFixedRecvBuffer` in bits 8-10
  - `FixedRecvBufferId` (ushort)
  - (`#if DEBUG` only): `TestForcedResult` (int)
- `IoUringTrackedOperationState[]`: per-slot tracked operation reference (`TrackedOperation`, `TrackedOperationGeneration`) for ABA-safe operation tracking.
- `IoUringCompletionSlotStorage[]` (cold): `DangerousRefSocketHandle` for fd lifetime, pre-allocated native inline storage slab (NativeMsghdr + 4 IOVectors + 128B socket addr + 128B control + socklen_t), message writeback pointers for recvmsg, and cross-engine references for SO_REUSEPORT accept forwarding (`ReusePortPrimaryContext`, `ReusePortPrimaryEngine`).
- `MemoryHandle[]` (zero-copy pin holds): one `System.Buffers.MemoryHandle` per slot index, holding the pin for SEND_ZC payloads until the NOTIF CQE arrives.
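The layout and encoding contracts in this section can be mirrored in a C sketch (the real types are C#; the field order and padding here are assumptions): a 24-byte hot slot, and the 56-bit `user_data` payload split into a 16-bit slot index and a 40-bit generation under an 8-bit tag.

```c
#include <assert.h>
#include <stdint.h>

/* Hedged C mirror of the documented contracts, not the actual managed types. */
typedef struct {
    uint64_t generation;        /* only the low 40 bits are used */
    int32_t  free_list_next;    /* intrusive free list, -1 = end */
    uint32_t packed_state;      /* kind in low 8 bits, flags in bits 8-10 */
    uint16_t fixed_recv_buffer_id;
    uint8_t  padding[6];        /* explicit padding up to the 24-byte contract */
} completion_slot;

_Static_assert(sizeof(completion_slot) == 24, "layout contract: 24 bytes");

#define SLOT_INDEX_BITS  16
#define GENERATION_BITS  40
#define GENERATION_MASK  ((1ULL << GENERATION_BITS) - 1ULL)

static uint64_t pack_user_data(uint8_t tag, uint64_t generation, uint16_t slot)
{
    return ((uint64_t)tag << 56)
         | ((generation & GENERATION_MASK) << SLOT_INDEX_BITS)
         | slot;
}

static uint64_t next_generation(uint64_t g)
{
    /* wrap from 2^40-1 back to 1, skipping 0 so stale CQEs never match */
    g = (g + 1) & GENERATION_MASK;
    return g == 0 ? 1 : g;
}
```

Skipping generation 0 on wrap is the detail that makes the ABA protection airtight: a zero-initialized or stale `user_data` can never alias a live slot.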
IoUringCompletionSlotfield offsets and the 24-byte total size via reflection on every test run. ADebug.AssertinInitializeCompletionSlotPoolfires if the size drifts.Generation Encoding
16-bit slot index (
SlotIndexBits = 16, capacity 65,536) and 40-bit generation (GenerationBits = 56 - 16 = 40,GenerationMask = (1UL << 40) - 1UL) packed into the 56-bituser_datapayload. The upper 8 bits of user_data carry a tag byte (2 = reserved completion, 3 = wakeup signal). Generation is initialized to 1 (not 0) so stale CQEs referencing generation 0 are rejected. On wrap, generation remaps from2^40-1back to 1, skipping zero.IoUringCompletionOperationKind
A 4-variant enum (
None,Accept,Message,ReusePortAccept) stored in the packed state of eachIoUringCompletionSlot. This determines per-completion post-processing behavior: accept completions read sockaddr length from the native slab; message completions copy writeback data from the native msghdr; reuse-port accept completions forward accepted fds to the primary listener's pre-accept queue on its owning engine.IoUringCompletionDispatchKind
A 10-variant enum (
Default,ReadOperation,WriteOperation,SendOperation,BufferListSendOperation,BufferMemoryReceiveOperation,BufferListReceiveOperation,ReceiveMessageFromOperation,AcceptOperation,ConnectOperation) stored as a packed integer inside eachAsyncOperation, set at operation creation time and consumed at CQE dispatch to route completions without virtual dispatch. Defined in the shared Unix partial class (SocketAsyncContext.Unix.cs) so it compiles on all Unix TFMs.MPSC Queue
MpscQueue<T>is a lock-free segmented queue with cache-line-padded head/tail pointers and anEnqueueIndexcounter per segment. Features:Lock) to reduce allocation pressure during burst enqueue patternsTryEnqueueFast/TryDequeueFast) inlined for the common non-full/non-empty caseIsEmptyproperty is snapshot-based, not linearizable -- a return of true can mean an enqueue is mid-flightProvided Buffer Ring
IoUringProvidedBufferRing(1,115 lines): Kernel-registered buffer pool for recv operations. Features:IORING_REGISTER_PBUF_RINGDebug.Assert(IsCurrentThreadEventLoopThread())on resize evaluationBeginDeferredRecyclePublish/EndDeferredRecyclePublishbracket the CQE drain loop to batchPublishTailcallsEvaluateProvidedBufferRingResize, gated bySystem.Net.Sockets.IoUringAdaptiveBufferSizingAppContext switchInUseCount == 0and_trackedIoUringOperationCount == 0before swapIORING_REGISTER_BUFFERSfor fixed-buffer recv viaREAD_FIXEDopcodeLinuxIoUringCapabilities
An immutable
readonly structsnapshot captured after ring setup and stored as_ioUringCapabilities. Uses bitfield packing (uint _flagswith seven single-bit flags). ExposesIsIoUringPort,Mode,SupportsMultishotRecv,SupportsMultishotAccept,SupportsZeroCopySend,SqPollEnabled,SupportsProvidedBufferRings, andHasRegisteredBuffers. Each capability flag is immutable after construction viaWith*builder methods. Eliminates scattered per-capability flag reads; the entire capability set is decided once at initialization and updated only for provided-buffer state changes.IoUringResolvedConfiguration
An immutable
readonly structcapturing all resolved configuration inputs at startup:IoUringEnabled,SqPollRequested,DirectSqeDisabled,ZeroCopySendOptedIn,RegisterBuffersEnabled,AdaptiveProvidedBufferSizingEnabled,ProvidedBufferSize,PrepareQueueCapacity,CancellationQueueCapacity. IncludesIoUringConfigurationWarningFlagsdetection for misconfiguration scenarios (e.g. SQPOLL requested without io_uring enabled). Logged once viaSocketsTelemetry.Log.ReportIoUringResolvedConfigurationandNetEventSource.Info.SocketIOEventQueue
The event queue type is
SocketIOEventQueue(replacingConcurrentQueue<SocketIOEvent>), providing the inter-thread channel between event loop threads and the ThreadPool work item processing path.4. Feature Inventory
Complete Feature Stack
IoUringSqe*pointers via mmap'd ring_multishotAcceptState(0=disarmed, 1=arming, otherwise encoded user_data); emergency kill-switch viaDOTNET_SYSTEM_NET_SOCKETS_IO_URING_DISABLE_MULTISHOT_ACCEPT=1_persistentMultishotRecvDataQueue[FeatureSwitchDefinition]+ env var); JIT-eliminable when switch is falseIoUringCompletionDispatchKindeliminates virtual dispatch on the CQE hot pathIORING_SETUP_CLOEXECflag with static assert in shim; fcntl fallback; dedicated test#if DEBUG), per-opcode mask; forced EPERM on submit; forced EINTR retry limit exhaustion; forced kernel version unsupported; forced provided-buffer-ring OOM[Conditional("DEBUG")]AssertSingleThreadAccessat CQE dispatch entry points; mmap offset bounds validationsched_setaffinitySO_INCOMING_CPUlookup on first receive completion, one-shot migration to CPU-local engines_fdEngineAffinityarray5. Configuration Surface
Production Environment Variables
- `DOTNET_SYSTEM_NET_SOCKETS_IO_URING` -- `"1"` to enable
- `DOTNET_SYSTEM_NET_SOCKETS_IO_URING_SQPOLL` -- `"1"` to enable
- `DOTNET_SYSTEM_NET_SOCKETS_IO_URING_DISABLE_MULTISHOT_ACCEPT` -- `"1"` to disable
- `DOTNET_SYSTEM_NET_SOCKETS_IO_URING_DISABLE_REUSEPORT_ACCEPT` -- `"1"` to disable

Production AppContext Switches
- `System.Net.Sockets.UseIoUring` -- default `false` (`[FeatureSwitchDefinition]`)
- `System.Net.Sockets.UseIoUringSqPoll` -- default `false` (`[FeatureSwitchDefinition]` enables JIT elimination)
- `System.Net.Sockets.IoUringAdaptiveBufferSizing` -- default `false`

Precedence: the environment variable wins over the AppContext switch for the master gate. SQPOLL requires both surfaces enabled (dual opt-in).
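The two precedence rules can be captured in a small sketch (illustrative C; plain booleans stand in for the real AppContext/environment reads, and the names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Hedged sketch of the documented precedence rules. */
typedef struct {
    bool env_present;      /* DOTNET_SYSTEM_NET_SOCKETS_IO_URING set at all */
    bool env_enabled;      /* ...and equal to "1" */
    bool switch_enabled;   /* System.Net.Sockets.UseIoUring */
    bool sqpoll_env;       /* DOTNET_SYSTEM_NET_SOCKETS_IO_URING_SQPOLL == "1" */
    bool sqpoll_switch;    /* System.Net.Sockets.UseIoUringSqPoll */
} config_inputs;

/* Master gate: the environment variable, when present, wins over the switch. */
static bool io_uring_enabled(const config_inputs *c)
{
    return c->env_present ? c->env_enabled : c->switch_enabled;
}

/* SQPOLL: dual opt-in -- both surfaces must agree. */
static bool sqpoll_requested(const config_inputs *c)
{
    return c->sqpoll_switch && c->sqpoll_env;
}
```

The asymmetry is deliberate: the master gate is an either-or override, while SQPOLL's AND keeps a kernel-thread-pinning mode from activating off a single stray setting.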
SQPOLL dual opt-in: Both the AppContext switch AND the environment variable must be enabled. The AppContext switch is the outer gate -- if false,
`IsSqPollRequested()` returns immediately without checking the env var, and the JIT can statically eliminate all SQPOLL branches.

Debug-Only Test Controls
All
`DOTNET_SYSTEM_NET_SOCKETS_IO_URING_TEST_*` environment variables are gated behind `#if DEBUG`:

- `TEST_DIRECT_SQE` (0/1): disable/enable direct SQE submission
- `TEST_ZERO_COPY_SEND` (0/1): disable/enable zero-copy send
- `TEST_REGISTER_BUFFERS`: control registered buffer behavior
- `TEST_PROVIDED_BUFFER_SIZE`: override provided buffer size
- `TEST_ADAPTIVE_BUFFER_SIZING` (1): force adaptive sizing on
- `TEST_PREPARE_QUEUE_CAPACITY`: override prepare queue capacity
- `TEST_QUEUE_ENTRIES`: override SQ ring size (must be a power of 2, 2-1024)
- `TEST_EVENT_BUFFER_COUNT`: override event buffer count for deterministic diagnostics coverage
- `TEST_FORCE_EAGAIN_ONCE_MASK`: comma-separated opcode names for forced EAGAIN
- `TEST_FORCE_ECANCELED_ONCE_MASK`: comma-separated opcode names for forced ECANCELED
- `TEST_FORCE_SUBMIT_EPERM_ONCE` (1): force a single `io_uring_enter` submission to return EPERM
- `TEST_FORCE_ENTER_EINTR_RETRY_LIMIT_ONCE` (1): force the native EINTR retry circuit breaker to trigger once
- `TEST_FORCE_KERNEL_VERSION_UNSUPPORTED` (1): force kernel version check to fail
- `TEST_FORCE_PROVIDED_BUFFER_RING_OOM_ONCE` (1): force provided buffer ring allocation to fail once

6. Safety and Correctness Measures
Fd Lifetime Management
Every direct SQE preparation takes a
DangerousAddRefon the socket'sSafeSocketHandle, stored in_completionSlotStorage[slotIndex].DangerousRefSocketHandle. This keeps the fd alive from SQE prep through CQE retirement, preventing fd-reuse races after close. The ref is released inFreeCompletionSlot.Stale CQE Protection
Generation-based ABA protection. Each completion slot starts at generation 1. On free, generation increments (wrapping from
`2^40-1` to 1, skipping 0). CQE dispatch compares the CQE's encoded generation against the slot's current generation; mismatches are silently dropped as stale.

Zero-Copy Send Lifecycle
SEND_ZC produces two CQEs: a data completion and a NOTIF. The slot's
IsZeroCopySendandZeroCopyNotificationPendingflags track this two-phase lifecycle. After the first CQE, the slot is kept alive and the tracked operation is reattached viaTryReattachTrackedIoUringOperation(generation CAS from 0 to new generation, then operation CAS from null to operation). The NOTIF CQE triggersHandleZeroCopyNotificationwhich frees the slot and releases the pin hold.Multishot Accept Arming
The
_multishotAcceptStatefield uses a three-state protocol:0(disarmed),1(arming -- SQE being written but user_data not yet published), or the encoded user_data value itself (armed).GetArmedMultishotAcceptUserDataForCancellationspins briefly if the arming transition is in flight.Persistent Multishot Recv Guard
During CQE batch draining, persistent multishot recv completions check
`operation.IoUringUserData == 0` to detect operations that the ThreadPool has recycled (reset to Waiting state) before the event loop finishes the CQE batch. `IoUringUserData` is zeroed on the event-loop thread at completion and only restored during prepare-queue drain, making it a reliable recycled-operation sentinel independent of ThreadPool-driven state changes.

Teardown Ordering
LinuxFreeIoUringResourcesfollows a strict multi-phase teardown:CleanupManagedRings(also closes ring fd, terminating SQPOLL thread)DrainQueuedIoUringOperationsForTeardownruns twice -- once before and once after native port closure to catch late-arriving items)DrainTrackedIoUringOperationsForTeardownNativeMemory.FreeCleanupManagedRingsnulls all mmap-derived pointers before unmapping to prevent use-after-unmap.Nullable Avoidance
The SQE retry drain path avoids wrapping
SocketEventHandler(a struct) in aNullable<T>wrapper. Presence is tracked via a separatedrainHandlerInitializedboolean, avoiding boxing pressure on the hot path.SQE Size Validation
TryGetNextManagedSqechecksringInfo.SqeSize != (uint)sizeof(IoUringSqe)at runtime, catching 128-byte SQE kernels that would corrupt the ring.TryMmapRingsadditionally rejectsSetupSqe128negotiations.7. Performance Optimizations
CQ Head Advance Batching
Outside of overflow recovery, CQ head advances are deferred:
`_managedCachedCqHead` is incremented locally and the single `Volatile.Write` to `*_managedCqHeadPtr` happens once at the end of the drain batch (in the `finally` block). During overflow recovery, advances happen per-CQE to relieve kernel pressure.

SQE Zeroing
Each
`TryGetNextManagedSqe` call writes `Unsafe.WriteUnaligned(sqe, default(IoUringSqe))` for JIT-vectorized 64-byte zeroing before returning the SQE. This eliminates stale field concerns and enables each `Write*Sqe` method to write only the fields it needs.

SQE Writer Deduplication
Send-like operations share
WriteSendLikeSqe(differing only by opcode:SendvsSendZc). Sendmsg-like operations shareWriteSendMsgLikeSqe(SendMsgvsSendMsgZc). This reduces copy-paste without sacrificing readability.SQE Acquire With Retry
TryAcquireManagedSqeWithRetryattempts up toMaxIoUringSqeAcquireSubmitAttempts(16) rounds. Between retries, it runsDrainCqeRingBatchto free CQ slots, then submits pending SQEs. The drain handler is lazily initialized to avoid struct construction on the fast path.Completion Slot Drain Recovery
When
AllocateCompletionSlotreturns -1 (pool exhausted), the engine drains CQEs inline (guarded by_completionSlotDrainInProgressto prevent recursion) and retries allocation.Provided Buffer Deferred Recycle
BeginDeferredRecyclePublish/EndDeferredRecyclePublishbracket the CQE drain loop. Buffer descriptor writes accumulate without individualVolatile.Writetail publishes. A single tail publish happens atEndDeferredRecyclePublish.Diagnostics Polling
#### Diagnostics Polling

Diagnostic counters are polled every `IoUringDiagnosticsPollInterval` (64) event loop iterations, not on every CQE. Managed deltas are accumulated in per-engine fields and published in batch to `SocketsTelemetry`.

#### Lazy Lock Allocation
`_multishotAcceptQueueGate`, `_persistentMultishotRecvDataGate`, and `_reusePortShadowListenersGate` on `SocketAsyncContext` are lazy-initialized via `EnsureLockInitialized` (CAS from null). Most sockets never use these paths, so the `Lock` objects are allocated only when needed.
#### Event Loop Wait

The event loop first tries a non-blocking `DrainCqeRingBatch`. If no CQEs are available, it issues `io_uring_enter` with `GETEVENTS` and a 50 ms EXT_ARG timeout (bounded wait). A secondary 1 ms circuit-breaker timeout (`WakeFailureFallbackWaitTimeoutNanos`) is used after repeated eventfd wake failures. This trades worst-case latency for starvation resilience when eventfd wakes are missed or deferred.
#### Fd Engine Affinity

The `s_fdEngineAffinity` array maps file descriptor numbers to preferred engine indices. When a SO_REUSEPORT shadow listener accepts an fd, `SetFdEngineAffinity` records the accepting engine's index. Subsequent `TryRegisterSocket` calls consume this affinity hint via `Interlocked.Exchange`, placing the socket on the engine that accepted it. This avoids cross-engine cache pollution for accepted connections.
### 8. Telemetry and Observability

#### Stable PollingCounters (12)

Published when the EventSource is enabled on Linux. Counter names are centralized in `IoUringCounterNames`:

- `io-uring-prepare-nonpinnable-fallbacks`
- `io-uring-socket-event-buffer-full`
- `io-uring-cq-overflows`
- `io-uring-cq-overflow-recoveries`
- `io-uring-prepare-queue-overflows`
- `io-uring-prepare-queue-overflow-fallbacks`
- `io-uring-completion-slot-exhaustions`
- `io-uring-completion-slot-high-water-mark`
- `io-uring-cancellation-queue-overflows`
- `io-uring-provided-buffer-depletions`
- `io-uring-sqpoll-wakeups`
- `io-uring-sqpoll-submissions-skipped`

#### Diagnostic Backing Fields (29)
Written internally for structured logging and test access; not published as PollingCounters. Include:

#### Startup Events

- `ReportIoUringResolvedConfiguration` (event ID 9): logged once with all resolved config inputs, including validation warnings for misconfigured knobs
- `ReportSocketEngineBackendSelected` (event ID 7): reports io_uring_completion vs. epoll selection and SQPOLL status
- `ReportIoUringSqPollNegotiatedWarning` (event ID 8): WARNING-level event emitted when SQPOLL is negotiated

#### Structured Logging
`IoUringDiagnostics.Linux.cs` centralizes managed diagnostic delta publication with `PublishIoUringManagedDiagnosticsDelta`. Counters are collectible via `dotnet-counters`, `dotnet-trace`, or any OpenTelemetry-compatible collector.

### 9. Test Coverage
#### Test Access Architecture

The test project does not use `InternalsVisibleTo`. Instead:

- `IoUringTestAccessors.Linux.cs` (1,048 lines) in `IoUringTestInfrastructure/` defines all test-visible snapshot types and accessor methods inside `SocketAsyncEngine` (production assembly)
- `InternalTestShims.Linux.cs` (707 lines) in the test project mirrors these types and resolves them via reflection
- `SocketAsyncEngine.IoUringTestHooks.Linux.cs` (229 lines) in `IoUringTestInfrastructure/` provides `#if DEBUG`-gated EAGAIN/ECANCELED forced result injection
- A `[DynamicDependency(DynamicallyAccessedMemberTypes.All, "System.Net.Sockets.SocketAsyncEngine", "System.Net.Sockets")]` attribute preserves all targets under trimming and AOT

#### Test Suite (159 test methods across 7,723 lines)
Coverage areas:

- Forced result injection (`#if DEBUG`): forced EPERM on submit; forced EINTR retry-limit exhaustion; forced kernel version unsupported; forced provided-buffer-ring OOM
- `NativeMsghdrLayoutContract_IsStable` and `CompletionSlotLayoutContract_IsStable` verify ABI alignment via reflection
- `CqOverflow_ReflectionTargets_Stable` ensures field names are documented and stable
- `RingFd_HasCloexecFlag_Set` verifies the `FD_CLOEXEC` bit via `fcntl`
- Telemetry counter verification (`TelemetryTest.cs`)
- MPSC queue tests (`MpscQueueTests.cs`)
- `SO_INCOMING_CPU` detection; one-shot migration guard

#### Hard to Test In-Process
- Forced kernel version unsupported (`TEST_FORCE_KERNEL_VERSION_UNSUPPORTED`)

### 10. Graceful Degradation
### 11. Remaining Open Items

The following areas represent future work rather than outstanding defects:

#### Performance Follow-Ups (Tier 2)

- Further batching of `PublishTail` calls: the current implementation uses `BeginDeferredRecyclePublish`/`EndDeferredRecyclePublish`, which partially addresses this.
- SQE zeroing currently relies on `Unsafe.WriteUnaligned<IoUringSqe>(sqe, default)` for JIT-vectorized zeroing; per-field writes remain a potential micro-optimization.

#### Path to Default-On
SQPOLL will likely remain opt-in permanently due to its CPU-cost trade-off.

#### Future Kernel Features

#### Multi-Engine Evolution

The single-process comment in `SocketAsyncEngine.Linux.cs` notes that io_uring completion mode uses one engine per physical core. Future work may evaluate finer-grained socket-affinity sharding when high-core throughput data justifies the additional complexity.

### 12. Distribution Readiness
#### Kernel Version Matrix

The minimum kernel cutoff is a single 6.1 requirement. All sub-features are detected at runtime via opcode probing.
#### Memory Overhead

Memory scales linearly with the number of active engines (one per physical core).