[WIP] Use io_uring for sockets on Linux#124374

Draft
benaadams wants to merge 1 commit into dotnet:main from benaadams:io_uring

Conversation

@benaadams
Member

@benaadams benaadams commented Feb 13, 2026

Contributes to #753

Summary

This document describes the complete, production-grade io_uring socket I/O engine in .NET's System.Net.Sockets layer.

When enabled via DOTNET_SYSTEM_NET_SOCKETS_IO_URING=1 on Linux kernel 6.1+, the engine replaces epoll with a managed io_uring completion-mode backend that:

  • Directly writes SQEs to mmap'd kernel ring buffers from C#
  • Processes CQEs inline on the event loop thread
  • Supports multishot accept, multishot recv with provided buffer rings, zero-copy send (SEND_ZC/SENDMSG_ZC), registered files, registered buffers, adaptive buffer sizing, and SQPOLL kernel-side submission polling
  • Recovers safely from CQ overflow across three discriminated branches
  • Sweeps stale tracked operations after CQ overflow recovery via a delayed-deadline mechanism
  • Distributes accept load across multiple engine instances via SO_REUSEPORT shadow listeners with cross-engine forwarding
  • Pins event loop threads to physical cores discovered from /sys/devices/system/cpu topology, with CPU-aware socket migration on first receive completion

The native shim is intentionally minimal -- 537 lines of C wrapping the three io_uring syscalls (setup, enter, register) plus eventfd and mmap helpers. All ring management, SQE construction, CQE dispatch, operation lifecycle, feature negotiation, overflow recovery, and SQPOLL wakeup detection live in managed code.

The engine proper is organized as eight partial class files extending SocketAsyncEngine. The main file (SocketAsyncEngine.Linux.cs, 4,664 lines) holds ring setup, flag negotiation, CQE drain, SQE prep orchestration, completion slot layout, multi-engine topology detection, SO_REUSEPORT shadow listener management, CPU-affinity-based socket migration, and the event loop. The remaining seven partials handle ring mmap lifecycle (IoUringRings, 365 lines), completion slot pool management (IoUringSlots, 469 lines), SQE writing (IoUringSqeWriters, 249 lines), completion dispatch (IoUringCompletionDispatch, 847 lines), diagnostics logging (IoUringDiagnostics, 164 lines), configuration resolution (IoUringConfiguration, 429 lines), and debug test hook stubs (IoUringTestHooks.Stubs, 15 lines). A separate IoUringTestAccessors.Linux.cs file (1,048 lines) in the test infrastructure directory exposes all test-observable state through strongly-typed accessors. Tests access this surface through InternalTestShims.Linux.cs (707 lines), a centralized reflection shim with [DynamicDependency] annotations for trimmer/AOT safety.

Key metrics:

| Metric | Value |
| --- | --- |
| Partial class files (SocketAsyncEngine) | 9 (main + 8 partials) |
| New managed source lines (socket layer) | ~14,500 |
| Native shim lines | ~537 (C) + 27 (header) |
| New tests | ~159 (ConditionalFact/ConditionalTheory in IoUring.Unix.cs) |
| Test lines | ~7,723 (IoUring.Unix.cs) + 707 (InternalTestShims) + 295 (MpscQueueTests) + 876 (TelemetryTest) |
| Breaking API changes | 0 -- purely additive, behind opt-in env var |

2. Architecture

Ring Ownership and Event Loop

The architecture follows the SINGLE_ISSUER contract: exactly one thread -- the event loop thread -- owns each io_uring instance. All ring mutations (SQE writes, CQ head advances, io_uring_enter calls) happen on this thread. Other threads communicate via two MPSC queues.

graph TD
    WT[Worker Threads] -->|"MpscQueue<IoUringPrepareWorkItem>"| EL[Event Loop Thread]
    WT -->|"MpscQueue<ulong> (cancel)"| EL
    WT -->|"eventfd write (wake)"| EL
    EL -->|"Writes SQEs / Drains CQEs / io_uring_enter"| K[Kernel - io_uring]
    K -->|"CQE completions"| EL
    EL -->|"ThreadPool.QueueUserWorkItem"| TP[ThreadPool]

Multi-Engine Topology

When io_uring is enabled, the engine array is sized according to detected physical core topology. LinuxInitializeEngineAffinityTopology reads /sys/devices/system/cpu/cpu*/topology/{physical_package_id,core_id} to discover physical core groups, then creates one engine per physical core (up to the configured engine count cap). Each engine's event loop thread is pinned to its representative CPU via sched_setaffinity. A s_cpuToEngineIndex mapping array enables CPU-aware socket placement.

When topology detection fails, the engine count falls back to Math.Min(Environment.ProcessorCount, 32) with no CPU pinning.

CPU-Aware Socket Migration

On the first receive completion for a connected socket, TryMigrateIoUringEngineOnFirstReceiveCompletion reads SO_INCOMING_CPU via getsockopt and looks up the target engine via GetEngineIndexForCpu. If the socket's current engine differs from the CPU-optimal engine, the socket migrates to the target engine. This one-shot migration (guarded by _migrationState) improves cache locality for workloads where the kernel's receive-side CPU selection is stable.

SO_REUSEPORT Accept Distribution

For listening sockets with SO_REUSEPORT enabled and multiple engines active, the engine arms shadow listener sockets on non-primary engines. Each shadow listener duplicates the primary listener's socket via SO_REUSEPORT and arms its own multishot accept SQE. Accepted file descriptors from shadow listeners are forwarded to the primary listener's pre-accept queue via DispatchReusePortAcceptIoUringCompletion, which enqueues a readiness fallback event on the primary engine. This distributes accept load across kernel completion queues without requiring the application to manage multiple listener sockets.

Shadow listener setup requests flow through MpscQueue<ReusePortShadowSetupRequest>, and accepted fd engine affinity is tracked in the s_fdEngineAffinity array so that subsequent TryRegisterSocket calls can place the accepted socket on the same engine that accepted it.

The IoUringCompletionOperationKind.ReusePortAccept variant distinguishes shadow-listener accept slots from primary-listener accept slots in the CQE dispatch path. Shadow accept slots carry cross-engine references (ReusePortPrimaryContext, ReusePortPrimaryEngine) in IoUringCompletionSlotStorage.

SO_REUSEPORT accept distribution is an emergency-killable feature via DOTNET_SYSTEM_NET_SOCKETS_IO_URING_DISABLE_REUSEPORT_ACCEPT=1.

The Thin Native Shim Approach

The native shim (pal_io_uring_shim.c, 537 lines) wraps exactly:

  • io_uring_setup (via syscall(__NR_io_uring_setup, ...) with SYS_io_uring_setup fallback)
  • io_uring_enter (with and without EXT_ARG; EINTR retry with a 1024-iteration circuit breaker)
  • io_uring_register
  • mmap / munmap (for ring mapping)
  • eventfd / read / write (for cross-thread wakeup; EINTR-looped)
  • uname (for kernel version detection)
  • sched_setaffinity / sched_getaffinity (for CPU pinning)

All ring pointer arithmetic, SQE field population, CQE parsing, SQPOLL wakeup detection (via Volatile.Read on the mmap'd SQ flags word), overflow recovery, and operation lifecycle management happens in managed C#. This is deliberate:

  • Managed code is easier to debug, profile, and modify. The JIT can inline hot paths. No P/Invoke on the SQE write path.
  • The shim compiles on any Linux with <linux/io_uring.h> -- no liburing dependency.
  • Feature negotiation (flag peeling, opcode probing) is entirely managed and testable.
  • The main cost: direct ring manipulation requires exact ABI-level knowledge of kernel structs (mitigated by _Static_assert(IORING_SETUP_CLOEXEC == (1U << 19), ...) in the shim and layout contract tests in C#).

Threading Model

Each engine's event loop thread owns:

  • The io_uring ring fd and all mmap'd ring pointers for that engine
  • All SQE writes and CQ drains
  • The _completionSlots[] / _completionSlotStorage[] arrays
  • Eventfd registered-file entry management
  • Adaptive buffer sizing evaluation
  • SQPOLL idle detection via SQ_NEED_WAKEUP on the mmap'd SQ flags pointer
  • CQ overflow recovery state machine
  • SO_REUSEPORT shadow listener setup processing

Worker threads interact solely through:

  • TryEnqueueIoUringPreparation() -> MPSC prepare queue -> eventfd write
  • TryRequestIoUringCancellation() -> MPSC cancel queue -> eventfd write
  • Volatile.Read on _ioUringTeardownInitiated to avoid publishing work after shutdown

io_uring initialization is deferred to the event loop thread so that io_uring_setup sets submitter_task to the event loop thread, as required by DEFER_TASKRUN. TryRegisterSocket waits on a ManualResetEventSlim (_ioUringInitSignal) before handing sockets to an engine, ensuring no socket registers before initialization completes.

Partial Class File Organization

| File | Lines | Responsibility |
| --- | --- | --- |
| SocketAsyncEngine.Linux.cs | 4,664 | Core: ring setup, flag negotiation, CQE drain loop, SQE prep orchestration, event loop hooks, completion slot lifetime, tracked operation management, overflow recovery, SQPOLL wakeup, queue management, feature resolution, multi-engine topology, SO_REUSEPORT shadow listener management, CPU-affinity-based socket migration, fd engine affinity tracking |
| SocketAsyncEngine.IoUringSlots.Linux.cs | 469 | SoA completion slot allocation, free-list management, native per-slot slab layout, message header inline copy/writeback, zero-copy pin hold transfer, slot encode/decode |
| SocketAsyncEngine.IoUringRings.Linux.cs | 365 | TryMmapRings: maps SQ/CQ/SQE regions, validates mmap offset bounds, derives all ring pointers. CleanupManagedRings: multi-step teardown. LinuxFreeIoUringResources: full teardown orchestration |
| SocketAsyncEngine.IoUringSqeWriters.Linux.cs | 249 | All Write*Sqe methods: send, sendZc, recv, readFixed, providedBufferRecv, multishotRecv, accept, multishotAccept, sendMsg, sendMsgZc, recvMsg, connect, asyncCancel. Deduplicated via WriteSendLikeSqe and WriteSendMsgLikeSqe |
| SocketAsyncEngine.IoUringCompletionDispatch.Linux.cs | 847 | SocketEventHandler partial: DispatchSingleIoUringCompletion, DispatchMultishotIoUringCompletion, DispatchZeroCopyIoUringNotification, DispatchReusePortAcceptIoUringCompletion, multishot accept/recv dispatch, buffer materialization, completion result routing |
| SocketAsyncEngine.IoUringDiagnostics.Linux.cs | 164 | Managed diagnostic delta publication via PublishIoUringManagedDiagnosticsDelta, periodic provided buffer ring resize evaluation, zero-copy NOTIF pending slot gauge sampling |
| SocketAsyncEngine.IoUringConfiguration.Linux.cs | 429 | IsIoUringEnabled, IsSqPollRequested, IsZeroCopySendOptedIn, IsIoUringDirectSqeDisabled, IsMultishotAcceptDisabled, IsReusePortAcceptDisabled with [FeatureSwitchDefinition] annotations for JIT-eliminable code paths; LinuxInitializeEngineAffinityTopology with physical core topology detection via sysfs; CPU pinning via sched_setaffinity |
| SocketAsyncEngine.IoUringTestHooks.Stubs.Linux.cs | 15 | Release-build stubs for #if DEBUG-gated test hook partials |
| IoUringTestAccessors.Linux.cs | 1,048 | Strongly-typed snapshot structs and accessor methods for all testable engine state |

Submission Path: Standard vs. SQPOLL

In standard mode, io_uring_enter submits pending SQEs and optionally waits for CQEs. In SQPOLL mode, a kernel thread continuously polls the SQ ring. Managed code detects idle via Volatile.Read on the mmap'd _managedSqFlagsPtr checking for IORING_SQ_NEED_WAKEUP. When the kernel thread is awake, no io_uring_enter is needed for submission.

Flag Negotiation (Peel Loop)

Setup builds an initial flag set: CQSIZE | SUBMIT_ALL | COOP_TASKRUN | SINGLE_ISSUER | NO_SQARRAY | CLOEXEC. SQPOLL (mutually exclusive with DEFER_TASKRUN) or DEFER_TASKRUN is added based on configuration. On EINVAL, flags are peeled in order: NO_SQARRAY first, then CLOEXEC. EPERM is never retried (respects seccomp/kernel policy). After setup, FD_CLOEXEC is set as a fallback via fcntl for kernels where IORING_SETUP_CLOEXEC was peeled.

CQ Overflow Recovery State Machine

CQ overflow is detected on every DrainCqeRingBatch entry via ObserveManagedCqOverflowCounter, which compares the mmap'd overflow counter against the last-observed value using wrapping uint32 delta arithmetic. When a delta is seen, the engine enters a three-branch recovery state machine:

  • MultishotAcceptArming: Active when _liveAcceptCompletionSlotCount > 0 and not in teardown. Defers multishot accept re-arm nudges until post-drain.
  • Teardown: Active when _ioUringTeardownInitiated is set. Teardown owns recovery completion.
  • DualWave: Steady-state branch for all other overflow scenarios, including escalation when new overflow occurs during existing recovery.

During overflow recovery, CQ head advances happen per-CQE (not batched) to relieve kernel pressure immediately. Recovery completes when the CQ ring is fully drained and no new overflow delta is observed. On completion: AssertCompletionSlotPoolConsistency validates free-list integrity, telemetry is incremented, and for the MultishotAcceptArming branch, TryQueueDeferredMultishotAcceptRearmAfterRecovery nudges accept contexts.

After recovery completes, a delayed sweep (TrySweepStaleTrackedIoUringOperationsAfterCqOverflowRecovery) fires 250ms later to retire tracked operations whose CQEs were dropped. The sweep skips intentionally long-lived multishot accept and persistent multishot recv slots. Operations still in the waiting state are canceled; already-transitioned operations are detached and their slots freed.


3. Key Data Structures

Completion Slot Pool

Four parallel SoA arrays, all indexed by slot index:

  • IoUringCompletionSlot[] (hot, 24 bytes each, [StructLayout(LayoutKind.Explicit, Size = 24)]):

    • Offset 0: Generation (ulong) -- 40-bit generation field
    • Offset 8: FreeListNext (int) -- intrusive free list, -1 = end
    • Offset 12: _packedState (uint) -- IoUringCompletionOperationKind in low 8 bits, boolean flags IsZeroCopySend/ZeroCopyNotificationPending/UsesFixedRecvBuffer in bits 8-10
    • Offset 16: FixedRecvBufferId (ushort)
    • Offset 20 (#if DEBUG only): TestForcedResult (int)
  • IoUringTrackedOperationState[]: Per-slot tracked operation reference (TrackedOperation, TrackedOperationGeneration) for ABA-safe operation tracking.

  • IoUringCompletionSlotStorage[] (cold): DangerousRefSocketHandle for fd lifetime, pre-allocated native inline storage slab (NativeMsghdr + 4 IOVectors + 128B socket addr + 128B control + socklen_t), message writeback pointers for recvmsg, and cross-engine references for SO_REUSEPORT accept forwarding (ReusePortPrimaryContext, ReusePortPrimaryEngine).

  • MemoryHandle[] (zero-copy pin holds): One System.Buffers.MemoryHandle per slot index, holding the pin for SEND_ZC payloads until the NOTIF CQE arrives.

Layout contract tests verify IoUringCompletionSlot field offsets and the 24-byte total size via reflection on every test run. A Debug.Assert in InitializeCompletionSlotPool fires if the size drifts.

Generation Encoding

16-bit slot index (SlotIndexBits = 16, capacity 65,536) and 40-bit generation (GenerationBits = 56 - 16 = 40, GenerationMask = (1UL << 40) - 1UL) packed into the 56-bit user_data payload. The upper 8 bits of user_data carry a tag byte (2 = reserved completion, 3 = wakeup signal). Generation is initialized to 1 (not 0) so stale CQEs referencing generation 0 are rejected. On wrap, generation remaps from 2^40-1 back to 1, skipping zero.

IoUringCompletionOperationKind

A 4-variant enum (None, Accept, Message, ReusePortAccept) stored in the packed state of each IoUringCompletionSlot. This determines per-completion post-processing behavior: accept completions read sockaddr length from the native slab; message completions copy writeback data from the native msghdr; reuse-port accept completions forward accepted fds to the primary listener's pre-accept queue on its owning engine.

IoUringCompletionDispatchKind

A 10-variant enum (Default, ReadOperation, WriteOperation, SendOperation, BufferListSendOperation, BufferMemoryReceiveOperation, BufferListReceiveOperation, ReceiveMessageFromOperation, AcceptOperation, ConnectOperation) stored as a packed integer inside each AsyncOperation, set at operation creation time and consumed at CQE dispatch to route completions without virtual dispatch. Defined in the shared Unix partial class (SocketAsyncContext.Unix.cs) so it compiles on all Unix TFMs.

MPSC Queue

MpscQueue<T> is a lock-free segmented queue with cache-line-padded head/tail pointers and an EnqueueIndex counter per segment. Features:

  • Platform-aware cache line padding: 128-byte on ARM64/LoongArch64, 64-byte otherwise
  • 4-slot unlinked segment cache (guarded by a small Lock) to reduce allocation pressure during burst enqueue patterns
  • Segment recycling limited to segments that lost the tail-link CAS race (never previously published), avoiding need for producer quiescence tracking
  • Fast path (TryEnqueueFast/TryDequeueFast) inlined for the common non-full/non-empty case
  • IsEmpty property is snapshot-based, not linearizable -- a return of true can mean an enqueue is mid-flight

Provided Buffer Ring

IoUringProvidedBufferRing (1,115 lines): Kernel-registered buffer pool for recv operations. Features:

  • Registered with kernel via IORING_REGISTER_PBUF_RING
  • Thread-affinity enforced via Debug.Assert(IsCurrentThreadEventLoopThread()) on resize evaluation
  • Deferred recycle publish: BeginDeferredRecyclePublish/EndDeferredRecyclePublish bracket the CQE drain loop to batch PublishTail calls
  • Adaptive sizing (default OFF): runtime adjustment of buffer size based on utilization via EvaluateProvidedBufferRingResize, gated by System.Net.Sockets.IoUringAdaptiveBufferSizing AppContext switch
  • Hot-swap resize: creates a new ring with an alternating group ID (1 or 2), registers it, unregisters the old one, and disposes it
  • Resize quiescence check: requires InUseCount == 0 and _trackedIoUringOperationCount == 0 before swap
  • Registered buffer support: IORING_REGISTER_BUFFERS for fixed-buffer recv via READ_FIXED opcode

LinuxIoUringCapabilities

An immutable readonly struct snapshot captured after ring setup and stored as _ioUringCapabilities. Uses bitfield packing (uint _flags with seven single-bit flags). Exposes IsIoUringPort, Mode, SupportsMultishotRecv, SupportsMultishotAccept, SupportsZeroCopySend, SqPollEnabled, SupportsProvidedBufferRings, and HasRegisteredBuffers. The struct is immutable after construction; changes produce a new snapshot via With* builder methods. This eliminates scattered per-capability flag reads: the entire capability set is decided once at initialization and updated only for provided-buffer state changes.

IoUringResolvedConfiguration

An immutable readonly struct capturing all resolved configuration inputs at startup: IoUringEnabled, SqPollRequested, DirectSqeDisabled, ZeroCopySendOptedIn, RegisterBuffersEnabled, AdaptiveProvidedBufferSizingEnabled, ProvidedBufferSize, PrepareQueueCapacity, CancellationQueueCapacity. Includes IoUringConfigurationWarningFlags detection for misconfiguration scenarios (e.g. SQPOLL requested without io_uring enabled). Logged once via SocketsTelemetry.Log.ReportIoUringResolvedConfiguration and NetEventSource.Info.

SocketIOEventQueue

The event queue type is SocketIOEventQueue (replacing ConcurrentQueue<SocketIOEvent>), providing the inter-thread channel between event loop threads and the ThreadPool work item processing path.


4. Feature Inventory

Complete Feature Stack

  1. Ring initialization with progressive flag negotiation (SQPOLL -> NO_SQARRAY -> CLOEXEC fallback via fcntl)
  2. Managed ring mmap -- SQ ring, CQ ring, and SQE array mapped directly into managed address space; SINGLE_MMAP feature detected for combined SQ/CQ mapping
  3. Direct SQE writes from C# -- no P/Invoke for SQE construction; managed code writes to IoUringSqe* pointers via mmap'd ring
  4. Managed CQE drain -- reads completions directly from mmap'd CQ ring with batched head-advance (deferred until drain completes, except during overflow recovery)
  5. Completion mode -- all socket operations submitted as io_uring ops, not epoll readiness
  6. Multishot accept (kernel 5.19+) -- single SQE arms persistent accept; multishot accept state tracked via _multishotAcceptState (0=disarmed, 1=arming, otherwise encoded user_data); emergency kill-switch via DOTNET_SYSTEM_NET_SOCKETS_IO_URING_DISABLE_MULTISHOT_ACCEPT=1
  7. Multishot recv (kernel 6.0+) -- persistent recv with provided buffer selection, early-data buffering via _persistentMultishotRecvDataQueue
  8. Provided buffer rings -- kernel-managed buffer pool for recv, with deferred recycle publish batching
  9. Adaptive buffer sizing -- runtime adjustment of provided buffer size based on utilization (defaults to OFF)
  10. Registered buffers (IORING_REGISTER_BUFFERS) -- pre-registered I/O vectors for fixed-buffer recv
  11. Fixed-buffer recv (READ_FIXED) -- kernel reads directly into registered buffers
  12. Zero-copy send (SEND_ZC, kernel 6.0+) -- avoids kernel buffer copies for large payloads (>16KB)
  13. Zero-copy sendmsg (SENDMSG_ZC, kernel 6.1+) -- zero-copy for vectored/message sends
  14. Registered files -- file descriptor table registration (used for eventfd)
  15. Registered ring fd (IORING_REGISTER_RING_FD) -- eliminates fget/fput on io_uring_enter itself
  16. DEFER_TASKRUN -- completions processed on the event loop thread, improving cache locality
  17. SINGLE_ISSUER -- kernel optimization for single-threaded submission
  18. SQPOLL (kernel 5.11+, unprivileged 5.12+) -- kernel-side submission thread polls the SQ ring; mutually exclusive with DEFER_TASKRUN; requires dual opt-in (AppContext [FeatureSwitchDefinition] + env var); JIT-eliminable when switch is false
  19. EXT_ARG bounded wait -- 50ms timeout on io_uring_enter for responsive event loops
  20. Eventfd cross-thread wakeup -- MPSC queues + eventfd for thread-safe operation submission
  21. ASYNC_CANCEL -- kernel-level cancellation of in-flight operations
  22. Opcode probing (IORING_REGISTER_PROBE) -- runtime feature detection per opcode
  23. Completion slot pool -- SoA arrays with 24-byte explicit layout, generation-based ABA protection
  24. 40-bit generation field -- ~1.1 trillion incarnations per slot before wrap
  25. 16-bit slot index -- supports up to 65,536 completion slots per engine
  26. Precomputed dispatch kind -- IoUringCompletionDispatchKind eliminates virtual dispatch on the CQE hot path
  27. CLOEXEC ring fd -- IORING_SETUP_CLOEXEC flag with static assert in shim; fcntl fallback; dedicated test
  28. CQ overflow recovery -- three-branch state machine with post-recovery stale tracked operation sweep
  29. Test hook injection -- forced EAGAIN/ECANCELED results (gated behind #if DEBUG), per-opcode mask; forced EPERM on submit; forced EINTR retry limit exhaustion; forced kernel version unsupported; forced provided-buffer-ring OOM
  30. Thread-affinity assertions -- [Conditional("DEBUG")] AssertSingleThreadAccess at CQE dispatch entry points; mmap offset bounds validation
  31. Comprehensive telemetry -- 12 stable PollingCounters + 29 diagnostic backing fields + structured logging
  32. Multi-engine topology -- one engine per physical core, event loop thread pinned to representative CPU via sched_setaffinity
  33. CPU-aware socket migration -- SO_INCOMING_CPU lookup on first receive completion, one-shot migration to CPU-local engine
  34. SO_REUSEPORT accept distribution -- shadow listener sockets on non-primary engines with cross-engine fd forwarding; fd engine affinity tracking via s_fdEngineAffinity array
  35. EINTR retry circuit breaker -- native shim retries io_uring_enter on EINTR up to 1,024 iterations before returning the error

5. Configuration Surface

Production Environment Variables

| Variable | Values | Default | Purpose |
| --- | --- | --- | --- |
| DOTNET_SYSTEM_NET_SOCKETS_IO_URING | "1" to enable | Disabled | Master enable switch |
| DOTNET_SYSTEM_NET_SOCKETS_IO_URING_SQPOLL | "1" to enable | Disabled | SQPOLL kernel-side polling (also requires AppContext switch) |
| DOTNET_SYSTEM_NET_SOCKETS_IO_URING_DISABLE_MULTISHOT_ACCEPT | "1" to disable | Enabled | Emergency kill-switch for multishot accept |
| DOTNET_SYSTEM_NET_SOCKETS_IO_URING_DISABLE_REUSEPORT_ACCEPT | "1" to disable | Enabled | Emergency kill-switch for SO_REUSEPORT accept distribution |

Production AppContext Switches

| Switch Name | Type | Default | Purpose |
| --- | --- | --- | --- |
| System.Net.Sockets.UseIoUring | Boolean | false | Master enable switch ([FeatureSwitchDefinition]) |
| System.Net.Sockets.UseIoUringSqPoll | Boolean | false | SQPOLL dual opt-in ([FeatureSwitchDefinition] enables JIT elimination) |
| System.Net.Sockets.IoUringAdaptiveBufferSizing | Boolean | false | Adaptive provided-buffer ring sizing |

Precedence: Environment variable wins over AppContext switch for the master gate. SQPOLL requires both surfaces enabled (dual opt-in).

SQPOLL dual opt-in: Both the AppContext switch AND the environment variable must be enabled. The AppContext switch is the outer gate -- if false, IsSqPollRequested() returns immediately without checking the env var, and the JIT can statically eliminate all SQPOLL branches.

Debug-Only Test Controls

All DOTNET_SYSTEM_NET_SOCKETS_IO_URING_TEST_* environment variables are gated behind #if DEBUG:

  • TEST_DIRECT_SQE (0/1): disable/enable direct SQE submission
  • TEST_ZERO_COPY_SEND (0/1): disable/enable zero-copy send
  • TEST_REGISTER_BUFFERS: control registered buffer behavior
  • TEST_PROVIDED_BUFFER_SIZE: override provided buffer size
  • TEST_ADAPTIVE_BUFFER_SIZING (1): force adaptive sizing on
  • TEST_PREPARE_QUEUE_CAPACITY: override prepare queue capacity
  • TEST_QUEUE_ENTRIES: override SQ ring size (must be power of 2, 2-1024)
  • TEST_EVENT_BUFFER_COUNT: override event buffer count for deterministic diagnostics coverage
  • TEST_FORCE_EAGAIN_ONCE_MASK: comma-separated opcode names for forced EAGAIN
  • TEST_FORCE_ECANCELED_ONCE_MASK: comma-separated opcode names for forced ECANCELED
  • TEST_FORCE_SUBMIT_EPERM_ONCE (1): force a single io_uring_enter submission to return EPERM
  • TEST_FORCE_ENTER_EINTR_RETRY_LIMIT_ONCE (1): force the native EINTR retry circuit breaker to trigger once
  • TEST_FORCE_KERNEL_VERSION_UNSUPPORTED (1): force kernel version check to fail
  • TEST_FORCE_PROVIDED_BUFFER_RING_OOM_ONCE (1): force provided buffer ring allocation to fail once

6. Safety and Correctness Measures

Fd Lifetime Management

Every direct SQE preparation takes a DangerousAddRef on the socket's SafeSocketHandle, stored in _completionSlotStorage[slotIndex].DangerousRefSocketHandle. This keeps the fd alive from SQE prep through CQE retirement, preventing fd-reuse races after close. The ref is released in FreeCompletionSlot.

Stale CQE Protection

Generation-based ABA protection. Each completion slot starts at generation 1. On free, generation increments (wrapping from 2^40-1 to 1, skipping 0). CQE dispatch compares the CQE's encoded generation against the slot's current generation; mismatches are silently dropped as stale.

Zero-Copy Send Lifecycle

SEND_ZC produces two CQEs: a data completion and a NOTIF. The slot's IsZeroCopySend and ZeroCopyNotificationPending flags track this two-phase lifecycle. After the first CQE, the slot is kept alive and the tracked operation is reattached via TryReattachTrackedIoUringOperation (generation CAS from 0 to new generation, then operation CAS from null to operation). The NOTIF CQE triggers HandleZeroCopyNotification which frees the slot and releases the pin hold.

Multishot Accept Arming

The _multishotAcceptState field uses a three-state protocol: 0 (disarmed), 1 (arming -- SQE being written but user_data not yet published), or the encoded user_data value itself (armed). GetArmedMultishotAcceptUserDataForCancellation spins briefly if the arming transition is in flight.

Persistent Multishot Recv Guard

During CQE batch draining, persistent multishot recv completions check operation.IoUringUserData == 0 to detect operations that the ThreadPool has recycled (reset to Waiting state) before the event loop finishes the CQE batch. IoUringUserData is zeroed on the event-loop thread at completion and only restored during prepare-queue drain, making it a reliable recycled-operation sentinel independent of ThreadPool-driven state changes.

Teardown Ordering

LinuxFreeIoUringResources follows a strict multi-phase teardown:

  1. Unregister provided buffer ring (needs ring fd)
  2. Mark registered ring fd inactive
  3. Close wakeup eventfd
  4. Unmap rings via CleanupManagedRings (also closes ring fd, terminating SQPOLL thread)
  5. Disable managed flags
  6. Drain queued operations (DrainQueuedIoUringOperationsForTeardown runs twice -- once before and once after native port closure to catch late-arriving items)
  7. Drain tracked operations via DrainTrackedIoUringOperationsForTeardown
  8. Clear all aliasing pointers before NativeMemory.Free
  9. Zero all state fields and publish final diagnostics

CleanupManagedRings nulls all mmap-derived pointers before unmapping to prevent use-after-unmap.

Nullable Avoidance

The SQE retry drain path avoids wrapping SocketEventHandler (a struct) in Nullable<T>. Presence is tracked via a separate drainHandlerInitialized boolean, avoiding the extra Nullable<T> wrapper overhead on the hot path.

SQE Size Validation

TryGetNextManagedSqe checks ringInfo.SqeSize != (uint)sizeof(IoUringSqe) at runtime, catching 128-byte SQE kernels that would corrupt the ring. TryMmapRings additionally rejects SetupSqe128 negotiations.


7. Performance Optimizations

CQ Head Advance Batching

Outside of overflow recovery, CQ head advances are deferred: _managedCachedCqHead is incremented locally and the single Volatile.Write to *_managedCqHeadPtr happens once at the end of the drain batch (in the finally block). During overflow recovery, advances happen per-CQE to relieve kernel pressure.

SQE Zeroing

Each TryGetNextManagedSqe call writes Unsafe.WriteUnaligned(sqe, default(IoUringSqe)) for JIT-vectorized 64-byte zeroing before returning the SQE. This eliminates stale field concerns and enables each Write*Sqe method to write only the fields it needs.

SQE Writer Deduplication

Send-like operations share WriteSendLikeSqe (differing only by opcode: Send vs SendZc). Sendmsg-like operations share WriteSendMsgLikeSqe (SendMsg vs SendMsgZc). This reduces copy-paste without sacrificing readability.

SQE Acquire With Retry

TryAcquireManagedSqeWithRetry attempts up to MaxIoUringSqeAcquireSubmitAttempts (16) rounds. Between retries, it runs DrainCqeRingBatch to free CQ slots, then submits pending SQEs. The drain handler is lazily initialized to avoid struct construction on the fast path.

Completion Slot Drain Recovery

When AllocateCompletionSlot returns -1 (pool exhausted), the engine drains CQEs inline (guarded by _completionSlotDrainInProgress to prevent recursion) and retries allocation.

Provided Buffer Deferred Recycle

BeginDeferredRecyclePublish/EndDeferredRecyclePublish bracket the CQE drain loop. Buffer descriptor writes accumulate without individual Volatile.Write tail publishes. A single tail publish happens at EndDeferredRecyclePublish.

Diagnostics Polling

Diagnostic counters are polled every IoUringDiagnosticsPollInterval (64) event loop iterations, not on every CQE. Managed deltas are accumulated in per-engine fields and published in batch to SocketsTelemetry.

Lazy Lock Allocation

_multishotAcceptQueueGate, _persistentMultishotRecvDataGate, and _reusePortShadowListenersGate on SocketAsyncContext are lazy-initialized via EnsureLockInitialized (CAS from null). Most sockets never use these paths, so the Lock objects are only allocated when needed.

Event Loop Wait

The event loop first tries a non-blocking DrainCqeRingBatch. If no CQEs are available, it issues io_uring_enter with GETEVENTS and a 50ms EXT_ARG timeout (bounded wait). A secondary 1ms circuit-breaker timeout (WakeFailureFallbackWaitTimeoutNanos) is used after repeated eventfd wake failures. This trades worst-case latency for starvation resilience when eventfd wakes are missed or deferred.

Fd Engine Affinity

The s_fdEngineAffinity array maps file descriptor numbers to preferred engine indices. When a SO_REUSEPORT shadow listener accepts an fd, SetFdEngineAffinity records the accepting engine's index. Subsequent TryRegisterSocket calls consume this affinity hint via Interlocked.Exchange, placing the socket on the engine that accepted it. This avoids cross-engine cache pollution for accepted connections.


8. Telemetry and Observability

Stable PollingCounters (12)

Published when the EventSource is enabled on Linux. Counter names are centralized in IoUringCounterNames:

| Counter | What to watch for |
| --- | --- |
| io-uring-prepare-nonpinnable-fallbacks | Operations that couldn't use direct preparation |
| io-uring-socket-event-buffer-full | Event buffer capacity pressure |
| io-uring-cq-overflows | Event loop can't keep up with kernel completions |
| io-uring-cq-overflow-recoveries | Successful overflow recovery completions |
| io-uring-prepare-queue-overflows | Submission queue capacity pressure |
| io-uring-prepare-queue-overflow-fallbacks | Operations that fell back to readiness dispatch |
| io-uring-completion-slot-exhaustions | Slot capacity pressure |
| io-uring-completion-slot-high-water-mark | Peak concurrent slot usage |
| io-uring-cancellation-queue-overflows | Cancellation queue capacity pressure |
| io-uring-provided-buffer-depletions | Provided buffer ring ran out of buffers |
| io-uring-sqpoll-wakeups | SQPOLL kernel thread wakeups from idle |
| io-uring-sqpoll-submissions-skipped | Zero-syscall fast path hits (SQPOLL) |

Diagnostic Backing Fields (29)

These are written internally for structured logging and test access; they are not published as PollingCounters. They include:

  • Async cancel CQE counts
  • Completion requeue failures
  • Zero-copy notification pending slots gauge
  • Prepare queue depth
  • Completion slot drain recoveries
  • Provided buffer current size, recycles, resizes
  • Registered buffer initial/re-registration success and failure
  • Fixed recv selected/fallbacks
  • Persistent multishot recv reuse, termination, early data
  • SQPOLL wakeups and submissions skipped

Startup Events

  • ReportIoUringResolvedConfiguration (event ID 9): Logged once with all resolved config inputs, including validation warnings for misconfigured knobs
  • ReportSocketEngineBackendSelected (event ID 7): Reports io_uring_completion vs. epoll selection and SQPOLL status
  • ReportIoUringSqPollNegotiatedWarning (event ID 8): WARNING-level when SQPOLL is negotiated

Structured Logging

IoUringDiagnostics.Linux.cs centralizes managed diagnostic delta publication with PublishIoUringManagedDiagnosticsDelta:

  • Per-engine delta-based counter publishing (compares source and published baselines)
  • Zero-copy NOTIF pending slot gauge sampling from completion slot array
  • Non-pinnable prepare fallback delta publishing
  • Prepare queue overflow and depth tracking
  • Completion slot exhaustion and drain recovery deltas

Collectible via dotnet-counters, dotnet-trace, or any OpenTelemetry-compatible collector.


9. Test Coverage

Test Access Architecture

The test project does not use InternalsVisibleTo. Instead:

  1. IoUringTestAccessors.Linux.cs (1,048 lines) in IoUringTestInfrastructure/ defines all test-visible snapshot types and accessor methods inside SocketAsyncEngine (production assembly)
  2. InternalTestShims.Linux.cs (707 lines) in the test project mirrors these types and resolves them via reflection
  3. SocketAsyncEngine.IoUringTestHooks.Linux.cs (229 lines) in IoUringTestInfrastructure/ provides #if DEBUG-gated EAGAIN/ECANCELED forced result injection
  4. A [DynamicDependency(DynamicallyAccessedMemberTypes.All, "System.Net.Sockets.SocketAsyncEngine", "System.Net.Sockets")] attribute preserves all targets under trimming and AOT

Test Suite (159 test methods across 7,723 lines)

Coverage areas:

  • All operation types: send, recv, accept, connect, sendmsg, recvmsg
  • Completion mode vs. fallback: forced-fallback tests via environment variables
  • Per-opcode disable: env-var-driven opcode disabling for isolation
  • Forced-result injection: EAGAIN and ECANCELED injection per opcode (#if DEBUG); forced EPERM on submit; forced EINTR retry limit exhaustion; forced kernel version unsupported; forced provided-buffer-ring OOM
  • Multishot accept: basic flow, cancellation, queue drain, dispose-during-arming race, one-shot fallback (deterministic via reflection override)
  • Multishot recv: basic iteration, cancellation, peer close, early data buffering, multishot gating by socket type (datagram exclusion)
  • Provided buffers: depletion, recycling, adaptive sizing, registered buffer toggle
  • Zero-copy send: threshold behavior, notification lifecycle, mixed mode
  • SQPOLL mode: basic send/receive, fallback, idle wakeup, multishot recv, zero-copy send, telemetry, SQ_NEED_WAKEUP contract (7 dedicated tests)
  • CQ overflow recovery: five-test suite covering all three branches
    • Test 1: inject overflow, verify telemetry counter increment and slot/op settlement
    • Test 2 (branch a): multishot accept arming during overflow -- no silent drop
    • Test 3 (branch b): teardown under overflow -- no deadlock within 60s
    • Test 4: DEBUG single-issuer assertion fires on non-event-loop thread
    • Test 5 (branch c): sustained 10s adversarial overflow injection with concurrent workload
  • Layout contracts: NativeMsghdrLayoutContract_IsStable and CompletionSlotLayoutContract_IsStable verify ABI alignment via reflection
  • Reflection target stability: CqOverflow_ReflectionTargets_Stable ensures field names are documented and stable
  • CLOEXEC: RingFd_HasCloexecFlag_Set verifies the FD_CLOEXEC bit via fcntl
  • ARM64 and concurrency: ARM64 MPSC stress, generation-transition stress, concurrent resize-swap
  • Cancellation: concurrent cancel/submit contention, teardown drain
  • Buffer pressure: bounded queue capacity, slot exhaustion recovery
  • Telemetry: stable counter name contract validation, counter increment verification (876-line TelemetryTest.cs)
  • Config: dual opt-in SQPOLL validation, removed-knobs-default-enabled verification
  • Teardown: clean shutdown, resource cleanup
  • Non-pinnable fallback publication: concurrent publisher stress test via reflection shim
  • MPSC queue: dedicated 295-line test suite (MpscQueueTests.cs)
  • SO_REUSEPORT accept distribution: shadow listener arm/disarm, cross-engine fd forwarding, kill-switch validation
  • CPU migration: SO_INCOMING_CPU detection, one-shot migration guard

Hard to Test In-Process

  • True CQ overflow (requires kernel-level timing control; mitigated by managed overflow counter injection via reflection)
  • RLIMIT_MEMLOCK failures (requires container-level constraints)
  • Kernel version degradation (requires multiple kernel environments; partially mitigated by TEST_FORCE_KERNEL_VERSION_UNSUPPORTED)
  • SQPOLL CPU consumption (requires system-level profiling)
  • Real-world latency distributions (requires benchmark infrastructure)

10. Graceful Degradation

| Condition | Behavior |
| --- | --- |
| Kernel < 6.1 | Epoll used |
| Env var not set to "1" (and no AppContext switch) | Epoll used |
| io_uring_setup fails | Epoll fallback |
| SQPOLL not supported (EINVAL or EPERM) | Flag peeled; DEFER_TASKRUN added; engine continues |
| NO_SQARRAY unsupported | Flag peeled; SQ array identity-mapped |
| CLOEXEC unsupported | Flag peeled; fcntl FD_CLOEXEC fallback |
| Opcode probe fails | Advanced opcodes disabled; basic ops still work |
| Provided buffer ring fails | Multishot recv disabled; one-shot recv with inline buffers |
| RLIMIT_MEMLOCK prevents buffer registration | Engine continues without registered buffers |
| Completion slot exhaustion | Drain CQEs inline; retry allocation; fall back to readiness dispatch |
| Prepare queue overflow | Fall back to readiness dispatch for the overflowed op |
| CQ overflow detected | Three-branch recovery state machine; delayed stale sweep |
| SQE ring full | Retry with intermediate submit + CQ drain (up to 16 attempts) |
| NativeMsghdr layout unsupported (non-64-bit) | io_uring disabled entirely |
| CPU topology detection fails | Engines created without CPU pinning; round-robin socket assignment |
| SO_REUSEPORT not enabled on listener | Shadow listeners not created; accept handled by primary engine only |
| io_uring_enter EINTR | Native shim retries up to 1,024 iterations; returns error on exhaustion |

11. Remaining Open Items

The following areas represent future work rather than outstanding defects:

Performance Follow-Ups (Tier 2)

  • Provided-buffer ring tail batching: Further batching of PublishTail calls. Current implementation uses BeginDeferredRecyclePublish/EndDeferredRecyclePublish which partially addresses this.
  • MpscQueue segment cache expansion: Currently caches 4 unlinked segments (expanded from original 1). Epoch-based drain-side recycling remains as an alternative approach.
  • SQE zeroing optimization: Currently uses Unsafe.WriteUnaligned<IoUringSqe>(sqe, default) for JIT-vectorized zeroing. Per-field writes remain as a potential micro-optimization.

Path to Default-On

  1. Opt-in environment variable (current state)
  2. Extensive testing (CI, stress tests, TechEmpower)
  3. Default-on for kernel >= 6.1 with runtime capability detection
  4. Remove the gate; io_uring is the Linux backend

SQPOLL will likely remain opt-in permanently due to its CPU cost trade-off.

Future Kernel Features

  • Incremental buffer rings (kernel 6.12+): Partial buffer consumption without full ring cycle
  • RecvSend bundles (kernel 6.10+): Single SQE performs recv then send
  • Zero-copy RX (kernel 6.7+): True zero-copy receive sharing NIC ring buffers

Multi-Engine Evolution

A design comment in SocketAsyncEngine.Linux.cs notes that, within a single process, io_uring completion mode uses one engine per physical core. Future work may evaluate finer-grained socket affinity sharding when high-core throughput data justifies the additional complexity.


12. Distribution Readiness

Kernel Version Matrix

The minimum kernel cutoff is a single 6.1 requirement. All sub-features are detected at runtime via opcode probing.

| Distribution | Version | Kernel | io_uring (6.1+) |
| --- | --- | --- | --- |
| Ubuntu 24.04 LTS | GA | 6.8 | Yes |
| Ubuntu 22.04 LTS | GA | 5.15 | No (epoll fallback) |
| Ubuntu 22.04 LTS | HWE | 6.8 | Yes |
| RHEL 10 | GA | 6.12 | Yes |
| RHEL 9 | GA | 5.14 | No (epoll fallback) |
| Debian 13 (Trixie) | GA | 6.12 | Yes |
| Debian 12 (Bookworm) | GA | 6.1 | Yes |
| Azure Linux 3 | GA | 6.6 | Yes |
| Amazon Linux 2023 | Default | 6.1 | Yes |
| Amazon Linux 2 | Default | 5.10 | No (epoll fallback) |

Memory Overhead

| Component | Size | Notes |
| --- | --- | --- |
| SQ ring | ~16KB | 1024 entries |
| CQ ring | ~64KB | 4096 entries (4x SQ) |
| SQE array | ~64KB | 1024 entries * 64B |
| Provided buffer pool | ~4MB | 1024 * 4KB default |
| Completion slots (hot) | ~1.5MB | 65,536 slots * 24B |
| Tracked operations | varies | Parallel managed object array |
| Completion slot storage (cold) | varies | Managed object array |
| Native per-slot slab | varies | NativeMemory.AllocZeroed |
| Zero-copy pin holds | ~512KB | 65,536 * sizeof(MemoryHandle) |
| Total | ~7MB+ | Per engine instance (userspace) |

Memory scales linearly with the number of active engines (one per physical core).

Copilot AI review requested due to automatic review settings February 13, 2026 11:18
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Feb 13, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR implements an experimental opt-in io_uring-backed socket event engine for Linux as an alternative to epoll. The implementation is comprehensive, including both readiness-based polling (Phase 1) and completion-based I/O operations (Phase 2), along with extensive testing infrastructure and evidence collection tooling.

Changes:

  • Native layer: cmake configuration, PAL networking headers, and io_uring system call integration with graceful epoll fallback
  • Managed layer: socket async engine extensions for io_uring completion handling, operation lifecycle tracking, buffer pinning, and telemetry
  • Testing: comprehensive functional tests, layout contract validation, stress tests, and CI infrastructure for dual-mode test execution
  • Tooling: evidence collection and validation scripts for performance comparison and envelope testing

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 1 comment.

File Description
src/native/libs/configure.cmake Adds CMake configuration checks for io_uring header and poll32_events struct member
src/native/libs/System.Native/pal_networking.h Defines new io_uring interop structures (IoUringCompletion, IoUringSocketEventPortDiagnostics) and function signatures
src/native/libs/System.Native/entrypoints.c Registers new io_uring-related PAL export entry points
src/native/libs/Common/pal_config.h.in Adds CMake defines for io_uring feature detection
src/libraries/System.Net.Sockets/tests/FunctionalTests/TelemetryTest.cs Adds layout contract tests for io_uring interop structures and telemetry counter verification
src/libraries/System.Net.Sockets/tests/FunctionalTests/System.Net.Sockets.Tests.csproj Implements MSBuild infrastructure for creating io_uring test archive variants (enabled/disabled/default)
src/libraries/System.Net.Sockets/tests/FunctionalTests/IoUring.Unix.cs Adds comprehensive functional and stress tests for io_uring socket workflows
src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketsTelemetry.cs Adds 12 new PollingCounters for io_uring observability metrics
src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs Implements managed wrappers for io_uring prepare operations with error handling
src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs Core io_uring integration: submission batching, completion handling, operation tracking, and diagnostics polling
src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs Operation-level io_uring support: buffer pinning, user_data allocation, completion processing, and state machine
src/libraries/Common/src/Interop/Unix/System.Native/Interop.SocketEvent.cs Defines managed interop structures matching native layout for io_uring operations
eng/testing/io-uring/validate-collect-sockets-io-uring-evidence-smoke.sh Smoke validation script for evidence collection tooling
eng/testing/io-uring/collect-sockets-io-uring-evidence.sh Comprehensive evidence collection script for functional/perf validation and envelope testing
docs/workflow/testing/libraries/testing.md Adds references to io_uring-specific documentation
docs/workflow/testing/libraries/testing-linux-sockets-io-uring.md Detailed validation guide for io_uring backend testing
docs/workflow/testing/libraries/io-uring-pr-evidence-template.md PR evidence template for documenting io_uring validation results

Copilot AI review requested due to automatic review settings February 14, 2026 05:21
{
    get
    {
        Segment head = Volatile.Read(ref _head.Value)!;


This seems pretty computationally heavy; is there a reason you just can't have a single _count variable that you atomically increment/decrement that you just check for 0 here?

fixedRecvBufferId,
ref completionAuxiliaryData))
{
completionResultCode = -Interop.Sys.ConvertErrorPalToPlatform(Interop.Error.ENOBUFS);


Why the negation? I see you do it below as well. I did a quick search around the repo and only saw this referenced in one other place, and they did not do the negation; the folks referencing that code don't appear to be doing one either.

int32_t state = atomic_load_explicit(&s_forceEnterEintrRetryLimitOnce, memory_order_relaxed);
if (state < 0)
{
const char* configuredValue = getenv(SHIM_TEST_FORCE_ENTER_EINTR_RETRY_LIMIT_ONCE_ENV);


Should this be behind a #ifdef DEBUG?

private const string ConnectActivityName = ActivitySourceName + ".Connect";
private static readonly ActivitySource s_connectActivitySource = new ActivitySource(ActivitySourceName);

internal static class Keywords


Maybe IoUringKeywords would be a better description.

#if DEBUG
// Test-only knob to make wait-buffer saturation deterministic for io_uring diagnostics coverage.
// Only available in DEBUG builds so production code never reads test env vars.
if (OperatingSystem.IsLinux())
@deathly809 deathly809 Feb 19, 2026

Should you also check DOTNET_SYSTEM_NET_SOCKETS_IO_URING or do we assume that DOTNET_SYSTEM_NET_SOCKETS_IO_URING_TEST_EVENT_BUFFER_COUNT is only set when the feature flag is enabled?

try
{
RecordAndAssertEventLoopThreadIdentity();
LinuxEventLoopEnableRings();
@deathly809 deathly809 Feb 19, 2026

Wonder if these could be more generic. i.e.

LinuxEventLoopEnableRings -> EventLoopInit
LinuxEventLoopBeforeWait -> EventLoopBeforeWait
LinuxEventLoopTryCompletionWait -> EventLoopTryCompleteWait
etc.


Hmm, I guess it would be an issue if someone wanted to add their own "EventLoopInit" or equivalent for the other methods :)

}
else
{
Debug.Assert(


Does this mean we have not tested this on Kernels before 6.1?

{
// Snapshot the wakeup generation counter before entering the blocking syscall.
// After waking, we compare to detect wakeups that arrived during the syscall.
uint wakeGenBefore = Volatile.Read(ref _ioUringWakeupGeneration);


Going to need to define this outside the if statement so you can reference it after the if/else

Comment on lines 187 to 209
/// <summary>
/// Returns whether SQPOLL mode has been explicitly requested.
/// SQPOLL requires dual opt-in: AppContext switch + environment variable.
/// This is intentionally stricter than the primary io_uring gate
/// (`IsIoUringEnabled`), which accepts either source.
/// SQPOLL pins a kernel thread, so accidental activation should require
/// explicit confirmation from both configuration surfaces.
/// </summary>
private static bool IsSqPollRequested()
{
    IoUringConfigurationInputs inputs = ReadIoUringConfigurationInputs();
    return ResolveSqPollRequested(inputs);
}

private static bool ResolveSqPollRequested(in IoUringConfigurationInputs inputs)
{
    if (!inputs.SqPollFeatureSwitchEnabled)
    {
        return false;
    }

    return string.Equals(inputs.SqPollEnvironmentValue, "1", StringComparison.Ordinal);
}
Member

Dual knob feels unnecessary here since runtime configuration is typically supplied either via the environment variable or via the AppContext switch, not both. For example, setting DOTNET_Thread_DefaultStackSize too low fails to initialize threads, while setting it too high just burns resources. Users enabling this kind of feature are expected to understand the implications rather than being gated by added obscurity.

Also consider caching these like existing knobs in main. We typically perform a one-time static lookup per process and do not support changing this kind of configuration mid-process:

internal static bool Invariant { get; } = AppContextConfigHelper.GetBooleanConfig("System.Globalization.Invariant", "DOTNET_SYSTEM_GLOBALIZATION_INVARIANT");

Contributor

Agree. .NET already has knobs that can be activated from either environment variables or runtime configuration and that would degrade performance if used by more than a single process at the same time, such as Server GC.


@benaadams benaadams marked this pull request as draft February 23, 2026 05:19
@benaadams benaadams changed the title Use io_uring for sockets on Linux [WIP] Use io_uring for sockets on Linux Feb 23, 2026
@benaadams
Member Author

Putting back to draft as it still needs some work


Labels

area-System.Net.Sockets community-contribution Indicates that the PR has been added by a community member
