[WIP] Use io_uring for sockets on Linux #124374
Conversation
Pull request overview
This PR implements an experimental opt-in io_uring-backed socket event engine for Linux as an alternative to epoll. The implementation is comprehensive, including both readiness-based polling (Phase 1) and completion-based I/O operations (Phase 2), along with extensive testing infrastructure and evidence collection tooling.
Changes:
- Native layer: cmake configuration, PAL networking headers, and io_uring system call integration with graceful epoll fallback
- Managed layer: socket async engine extensions for io_uring completion handling, operation lifecycle tracking, buffer pinning, and telemetry
- Testing: comprehensive functional tests, layout contract validation, stress tests, and CI infrastructure for dual-mode test execution
- Tooling: evidence collection and validation scripts for performance comparison and envelope testing
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/native/libs/configure.cmake | Adds CMake configuration checks for io_uring header and poll32_events struct member |
| src/native/libs/System.Native/pal_networking.h | Defines new io_uring interop structures (IoUringCompletion, IoUringSocketEventPortDiagnostics) and function signatures |
| src/native/libs/System.Native/entrypoints.c | Registers new io_uring-related PAL export entry points |
| src/native/libs/Common/pal_config.h.in | Adds CMake defines for io_uring feature detection |
| src/libraries/System.Net.Sockets/tests/FunctionalTests/TelemetryTest.cs | Adds layout contract tests for io_uring interop structures and telemetry counter verification |
| src/libraries/System.Net.Sockets/tests/FunctionalTests/System.Net.Sockets.Tests.csproj | Implements MSBuild infrastructure for creating io_uring test archive variants (enabled/disabled/default) |
| src/libraries/System.Net.Sockets/tests/FunctionalTests/IoUring.Unix.cs | Adds comprehensive functional and stress tests for io_uring socket workflows |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketsTelemetry.cs | Adds 12 new PollingCounters for io_uring observability metrics |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs | Implements managed wrappers for io_uring prepare operations with error handling |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs | Core io_uring integration: submission batching, completion handling, operation tracking, and diagnostics polling |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs | Operation-level io_uring support: buffer pinning, user_data allocation, completion processing, and state machine |
| src/libraries/Common/src/Interop/Unix/System.Native/Interop.SocketEvent.cs | Defines managed interop structures matching native layout for io_uring operations |
| eng/testing/io-uring/validate-collect-sockets-io-uring-evidence-smoke.sh | Smoke validation script for evidence collection tooling |
| eng/testing/io-uring/collect-sockets-io-uring-evidence.sh | Comprehensive evidence collection script for functional/perf validation and envelope testing |
| docs/workflow/testing/libraries/testing.md | Adds references to io_uring-specific documentation |
| docs/workflow/testing/libraries/testing-linux-sockets-io-uring.md | Detailed validation guide for io_uring backend testing |
| docs/workflow/testing/libraries/io-uring-pr-evidence-template.md | PR evidence template for documenting io_uring validation results |
```csharp
{
    get
    {
        Segment head = Volatile.Read(ref _head.Value)!;
```

This seems pretty computationally heavy; is there a reason you can't just have a single `_count` variable that you atomically increment/decrement and check for 0 here?
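A minimal sketch of that suggestion, in C for brevity (the type and function names are hypothetical, not from the PR): one atomically maintained count next to the queue, whose zero check replaces the segment walk.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical sketch: a single atomic counter updated on enqueue/dequeue. */
typedef struct {
    atomic_long count;
} queue_counter;

static void on_enqueue(queue_counter *q)
{
    atomic_fetch_add_explicit(&q->count, 1, memory_order_release);
}

static void on_dequeue(queue_counter *q)
{
    atomic_fetch_sub_explicit(&q->count, 1, memory_order_release);
}

static int is_empty(queue_counter *q)
{
    /* Still only a snapshot: a concurrent enqueue may be mid-flight
     * at the moment this returns 1. */
    return atomic_load_explicit(&q->count, memory_order_acquire) == 0;
}
```

Like the segment-walking version, this trades linearizability for cheapness; the check stays O(1) regardless of segment count.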
```csharp
        fixedRecvBufferId,
        ref completionAuxiliaryData))
{
    completionResultCode = -Interop.Sys.ConvertErrorPalToPlatform(Interop.Error.ENOBUFS);
```

Why the negation? I see you do it below as well. A quick search around the repo found this referenced in only one other place, and that code did not negate, nor do the callers of that code appear to negate.
```c
int32_t state = atomic_load_explicit(&s_forceEnterEintrRetryLimitOnce, memory_order_relaxed);
if (state < 0)
{
    const char* configuredValue = getenv(SHIM_TEST_FORCE_ENTER_EINTR_RETRY_LIMIT_ONCE_ENV);
```

Should this be behind a `#ifdef DEBUG`?
```csharp
private const string ConnectActivityName = ActivitySourceName + ".Connect";
private static readonly ActivitySource s_connectActivitySource = new ActivitySource(ActivitySourceName);

internal static class Keywords
```

Maybe `IoUringKeywords` would be a more descriptive name.
```csharp
#if DEBUG
// Test-only knob to make wait-buffer saturation deterministic for io_uring diagnostics coverage.
// Only available in DEBUG builds so production code never reads test env vars.
if (OperatingSystem.IsLinux())
```

Should you also check `DOTNET_SYSTEM_NET_SOCKETS_IO_URING`, or do we assume that `DOTNET_SYSTEM_NET_SOCKETS_IO_URING_TEST_EVENT_BUFFER_COUNT` is only set when the feature flag is enabled?
```csharp
try
{
    RecordAndAssertEventLoopThreadIdentity();
    LinuxEventLoopEnableRings();
```

Wonder if these could be more generic, i.e.:

- `LinuxEventLoopEnableRings` -> `EventLoopInit`
- `LinuxEventLoopBeforeWait` -> `EventLoopBeforeWait`
- `LinuxEventLoopTryCompletionWait` -> `EventLoopTryCompleteWait`

etc.

Hmm, I guess it would be an issue if someone wanted to add their own "EventLoopInit" or equivalent for the other methods :)
```csharp
}
else
{
    Debug.Assert(
```

Does this mean we have not tested this on kernels before 6.1?
```csharp
{
    // Snapshot the wakeup generation counter before entering the blocking syscall.
    // After waking, we compare to detect wakeups that arrived during the syscall.
    uint wakeGenBefore = Volatile.Read(ref _ioUringWakeupGeneration);
```

You are going to need to define this outside the `if` statement so you can reference it after the if/else.
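The snapshot/compare pattern the snippet relies on can be shown in isolation (illustrative C; names are hypothetical): take the generation before blocking, and afterwards treat any change -- compared with `!=` so wraparound is harmless -- as a missed wakeup.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical sketch of a wakeup-generation counter. */
static atomic_uint s_wakeupGeneration;

static void request_wakeup(void)
{
    atomic_fetch_add(&s_wakeupGeneration, 1);   /* wrapping increment */
}

static uint32_t snapshot_generation(void)
{
    return atomic_load(&s_wakeupGeneration);
}

static int wakeup_arrived_since(uint32_t before)
{
    /* != rather than > so the comparison survives counter wraparound */
    return atomic_load(&s_wakeupGeneration) != before;
}
```

As the comment notes, the snapshot has to live in the enclosing scope so both the blocking branch and the post-wait comparison can see it.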
```csharp
/// <summary>
/// Returns whether SQPOLL mode has been explicitly requested.
/// SQPOLL requires dual opt-in: AppContext switch + environment variable.
/// This is intentionally stricter than the primary io_uring gate
/// (`IsIoUringEnabled`), which accepts either source.
/// SQPOLL pins a kernel thread, so accidental activation should require
/// explicit confirmation from both configuration surfaces.
/// </summary>
private static bool IsSqPollRequested()
{
    IoUringConfigurationInputs inputs = ReadIoUringConfigurationInputs();
    return ResolveSqPollRequested(inputs);
}

private static bool ResolveSqPollRequested(in IoUringConfigurationInputs inputs)
{
    if (!inputs.SqPollFeatureSwitchEnabled)
    {
        return false;
    }

    return string.Equals(inputs.SqPollEnvironmentValue, "1", StringComparison.Ordinal);
}
```

The dual knob feels unnecessary here, since runtime configuration is typically supplied either via the environment variable or via the AppContext switch, not both. For example, setting `DOTNET_Thread_DefaultStackSize` too low fails to initialize threads, while setting it too high just burns resources. Users enabling this kind of feature are expected to understand the implications rather than being gated by added obscurity.

Also consider caching these like existing knobs in main. We typically perform a one-time static lookup per process and do not support changing this kind of configuration mid-process.

Agreed -- .NET already has knobs that can be activated from either environment variables or runtime configuration, and that would degrade performance if used by more than a single process at the same time, such as Server GC.
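The one-time lookup the reviewers describe can be sketched like this (illustrative C; the helper name is hypothetical, and a real implementation would also make the initialization thread-safe): resolve the knob on first use, cache it, and ignore later environment changes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of a once-per-process cached knob read. */
static int is_io_uring_enabled(void)
{
    static int s_cached = -1;      /* -1 = not yet resolved */
    if (s_cached < 0)
    {
        const char *value = getenv("DOTNET_SYSTEM_NET_SOCKETS_IO_URING");
        s_cached = (value != NULL && strcmp(value, "1") == 0) ? 1 : 0;
    }
    return s_cached;
}
```

After the first call, changing the environment has no effect -- matching the "do not support changing this kind of configuration mid-process" expectation.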
Putting back to draft as it still needs some work.
Contributes to #753
Summary
This document describes the complete, production-grade io_uring socket I/O engine in .NET's `System.Net.Sockets` layer. When enabled via `DOTNET_SYSTEM_NET_SOCKETS_IO_URING=1` on Linux kernel 6.1+, the engine replaces epoll with a managed io_uring completion-mode backend that:

- reads `/sys/devices/system/cpu` topology, with CPU-aware socket migration on first receive completion

The native shim is intentionally minimal -- 537 lines of C wrapping the three io_uring syscalls (setup, enter, register) plus eventfd and mmap helpers. All ring management, SQE construction, CQE dispatch, operation lifecycle, feature negotiation, overflow recovery, and SQPOLL wakeup detection live in managed code.
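The shape of such a thin shim can be sketched as follows. This is illustrative C, not the actual `pal_io_uring_shim.c`: the enter function is injected so the bounded EINTR retry (the document cites a 1024-iteration circuit breaker) can be exercised without a live ring.

```c
#include <assert.h>
#include <errno.h>

/* Hedged sketch of a bounded EINTR retry around io_uring_enter-like calls. */
#define ENTER_EINTR_RETRY_LIMIT 1024

typedef int (*enter_fn)(void *state);

static int enter_with_eintr_retry(enter_fn enter, void *state)
{
    for (int attempt = 0; attempt < ENTER_EINTR_RETRY_LIMIT; attempt++)
    {
        int rc = enter(state);
        if (rc >= 0 || errno != EINTR)
            return rc;                 /* success, or a non-EINTR failure */
    }
    errno = EINTR;                     /* circuit breaker tripped */
    return -1;
}

/* Test double: fail with EINTR a few times, then report 7 submitted SQEs. */
static int flaky_enter(void *state)
{
    int *failures_left = state;
    if (*failures_left > 0)
    {
        (*failures_left)--;
        errno = EINTR;
        return -1;
    }
    return 7;
}
```

Bounding the retry loop keeps a pathological signal storm from wedging the event loop forever, at the cost of surfacing a rare spurious EINTR to the caller.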
The engine proper is organized as eight partial class files extending `SocketAsyncEngine`: the main file (`SocketAsyncEngine.Linux.cs`, 4,664 lines) holds ring setup, flag negotiation, CQE drain, SQE prep orchestration, completion slot layout, multi-engine topology detection, SO_REUSEPORT shadow listener management, CPU-affinity-based socket migration, and the event loop; the remaining seven partials handle ring mmap lifecycle (`IoUringRings`, 365 lines), completion slot pool management (`IoUringSlots`, 469 lines), SQE writing (`IoUringSqeWriters`, 249 lines), completion dispatch (`IoUringCompletionDispatch`, 847 lines), diagnostics logging (`IoUringDiagnostics`, 164 lines), configuration resolution (`IoUringConfiguration`, 429 lines), and debug test hook stubs (`IoUringTestHooks.Stubs`, 15 lines). A separate `IoUringTestAccessors.Linux.cs` file (1,048 lines) in the test infrastructure directory exposes all test-observable state through strongly-typed accessors. Tests access this surface through `InternalTestShims.Linux.cs` (707 lines), a centralized reflection shim with `[DynamicDependency]` annotations for trimmer/AOT safety.

Key metrics:
2. Architecture
Ring Ownership and Event Loop
The architecture follows the SINGLE_ISSUER contract: exactly one thread -- the event loop thread -- owns each io_uring instance. All ring mutations (SQE writes, CQ head advances, io_uring_enter calls) happen on this thread. Other threads communicate via two MPSC queues.
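The single-owner handoff contract can be sketched outside the engine. This is not the PR's `MpscQueue` -- just the underlying idea, in illustrative C: producers on any thread publish work with a CAS, and only the ring-owning thread detaches and processes the batch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative intrusive publish/drain pair; names are not from the PR. */
typedef struct work_item {
    struct work_item *next;
    int payload;
} work_item;

static _Atomic(work_item *) s_inbox;

static void producer_publish(work_item *item)   /* any worker thread */
{
    work_item *head = atomic_load(&s_inbox);
    do {
        item->next = head;
    } while (!atomic_compare_exchange_weak(&s_inbox, &head, item));
}

static work_item *owner_take_all(void)          /* event loop thread only */
{
    return atomic_exchange(&s_inbox, NULL);
}
```

Note this toy drains in LIFO order; the engine's segmented MPSC queues preserve FIFO order, but the ownership split -- many publishers, one consumer that touches the ring -- is the same.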
```mermaid
graph TD
    WT[Worker Threads] -->|"MpscQueue<IoUringPrepareWorkItem>"| EL[Event Loop Thread]
    WT -->|"MpscQueue<ulong> (cancel)"| EL
    WT -->|"eventfd write (wake)"| EL
    EL -->|"Writes SQEs / Drains CQEs / io_uring_enter"| K[Kernel - io_uring]
    K -->|"CQE completions"| EL
    EL -->|"ThreadPool.QueueUserWorkItem"| TP[ThreadPool]
```

Multi-Engine Topology
When io_uring is enabled, the engine array is sized according to detected physical core topology. `LinuxInitializeEngineAffinityTopology` reads `/sys/devices/system/cpu/cpu*/topology/{physical_package_id,core_id}` to discover physical core groups, then creates one engine per physical core (up to the configured engine count cap). Each engine's event loop thread is pinned to its representative CPU via `sched_setaffinity`. A `s_cpuToEngineIndex` mapping array enables CPU-aware socket placement.

When topology detection fails, the engine count falls back to `Math.Min(Environment.ProcessorCount, 32)` with no CPU pinning.

CPU-Aware Socket Migration
On the first receive completion for a connected socket, `TryMigrateIoUringEngineOnFirstReceiveCompletion` reads `SO_INCOMING_CPU` via `getsockopt` and looks up the target engine via `GetEngineIndexForCpu`. If the socket's current engine differs from the CPU-optimal engine, the socket migrates to the target engine. This one-shot migration (guarded by `_migrationState`) improves cache locality for workloads where the kernel's receive-side CPU selection is stable.

SO_REUSEPORT Accept Distribution
For listening sockets with `SO_REUSEPORT` enabled and multiple engines active, the engine arms shadow listener sockets on non-primary engines. Each shadow listener duplicates the primary listener's socket via `SO_REUSEPORT` and arms its own multishot accept SQE. Accepted file descriptors from shadow listeners are forwarded to the primary listener's pre-accept queue via `DispatchReusePortAcceptIoUringCompletion`, which enqueues a readiness fallback event on the primary engine. This distributes accept load across kernel completion queues without requiring the application to manage multiple listener sockets.

Shadow listener setup requests flow through `MpscQueue<ReusePortShadowSetupRequest>`, and accepted fd engine affinity is tracked in the `s_fdEngineAffinity` array so that subsequent `TryRegisterSocket` calls can place the accepted socket on the same engine that accepted it.

The `IoUringCompletionOperationKind.ReusePortAccept` variant distinguishes shadow-listener accept slots from primary-listener accept slots in the CQE dispatch path. Shadow accept slots carry cross-engine references (`ReusePortPrimaryContext`, `ReusePortPrimaryEngine`) in `IoUringCompletionSlotStorage`.

SO_REUSEPORT accept distribution is an emergency-killable feature via `DOTNET_SYSTEM_NET_SOCKETS_IO_URING_DISABLE_REUSEPORT_ACCEPT=1`.

The Thin Native Shim Approach
The native shim (`pal_io_uring_shim.c`, 537 lines) wraps exactly:

- `io_uring_setup` (via `syscall(__NR_io_uring_setup, ...)` with a `SYS_io_uring_setup` fallback)
- `io_uring_enter` (with and without EXT_ARG; EINTR retry with a 1024-iteration circuit breaker)
- `io_uring_register`
- `mmap`/`munmap` (for ring mapping)
- `eventfd`/`read`/`write` (for cross-thread wakeup; EINTR-looped)
- `uname` (for kernel version detection)
- `sched_setaffinity`/`sched_getaffinity` (for CPU pinning)

All ring pointer arithmetic, SQE field population, CQE parsing, SQPOLL wakeup detection (via
`Volatile.Read` on the mmap'd SQ flags word), overflow recovery, and operation lifecycle management all happen in managed C#. This is deliberate:

- only raw syscalls and `<linux/io_uring.h>` constants -- no liburing dependency
- layout drift is guarded (`_Static_assert(IORING_SETUP_CLOEXEC == (1U << 19), ...)` in the shim, and layout contract tests in C#)

Threading Model
Each engine's event loop thread owns:
_completionSlots[]/_completionSlotStorage[]arraysSQ_NEED_WAKEUPon the mmap'd SQ flags pointerWorker threads interact solely through:
TryEnqueueIoUringPreparation()-> MPSC prepare queue -> eventfd writeTryRequestIoUringCancellation()-> MPSC cancel queue -> eventfd writeVolatile.Readon_ioUringTeardownInitiatedto avoid publishing work after shutdownio_uring initialization is deferred to the event loop thread so that
`io_uring_setup` sets `submitter_task` to the event loop thread, as required by `DEFER_TASKRUN`. `TryRegisterSocket` waits on a `ManualResetEventSlim` (`_ioUringInitSignal`) before handing sockets to an engine, ensuring no socket registers before initialization completes.

Partial Class File Organization
- `SocketAsyncEngine.Linux.cs`
- `SocketAsyncEngine.IoUringSlots.Linux.cs`
- `SocketAsyncEngine.IoUringRings.Linux.cs` -- `TryMmapRings`: maps SQ/CQ/SQE regions, validates mmap offset bounds, derives all ring pointers. `CleanupManagedRings`: multi-step teardown. `LinuxFreeIoUringResources`: full teardown orchestration
- `SocketAsyncEngine.IoUringSqeWriters.Linux.cs` -- `Write*Sqe` methods: send, sendZc, recv, readFixed, providedBufferRecv, multishotRecv, accept, multishotAccept, sendMsg, sendMsgZc, recvMsg, connect, asyncCancel. Deduplicated via `WriteSendLikeSqe` and `WriteSendMsgLikeSqe`
- `SocketAsyncEngine.IoUringCompletionDispatch.Linux.cs` -- `SocketEventHandler` partial: `DispatchSingleIoUringCompletion`, `DispatchMultishotIoUringCompletion`, `DispatchZeroCopyIoUringNotification`, `DispatchReusePortAcceptIoUringCompletion`, multishot accept/recv dispatch, buffer materialization, completion result routing
- `SocketAsyncEngine.IoUringDiagnostics.Linux.cs` -- `PublishIoUringManagedDiagnosticsDelta`, periodic provided buffer ring resize evaluation, zero-copy NOTIF pending slot gauge sampling
- `SocketAsyncEngine.IoUringConfiguration.Linux.cs` -- `IsIoUringEnabled`, `IsSqPollRequested`, `IsZeroCopySendOptedIn`, `IsIoUringDirectSqeDisabled`, `IsMultishotAcceptDisabled`, `IsReusePortAcceptDisabled` with `[FeatureSwitchDefinition]` annotations for JIT-eliminable code paths; `LinuxInitializeEngineAffinityTopology` with physical core topology detection via sysfs; CPU pinning via `sched_setaffinity`
- `SocketAsyncEngine.IoUringTestHooks.Stubs.Linux.cs` -- `#if DEBUG`-gated test hook partials
- `IoUringTestAccessors.Linux.cs`

Submission Path: Standard vs. SQPOLL
In standard mode,
`io_uring_enter` submits pending SQEs and optionally waits for CQEs. In SQPOLL mode, a kernel thread continuously polls the SQ ring. Managed code detects idle via `Volatile.Read` on the mmap'd `_managedSqFlagsPtr`, checking for `IORING_SQ_NEED_WAKEUP`. When the kernel thread is awake, no `io_uring_enter` is needed for submission.

Flag Negotiation (Peel Loop)
Setup builds an initial flag set:
`CQSIZE | SUBMIT_ALL | COOP_TASKRUN | SINGLE_ISSUER | NO_SQARRAY | CLOEXEC`. SQPOLL (mutually exclusive with DEFER_TASKRUN) or DEFER_TASKRUN is added based on configuration. On `EINVAL`, flags are peeled in order: `NO_SQARRAY` first, then `CLOEXEC`. `EPERM` is never retried (respects seccomp/kernel policy). After setup, `FD_CLOEXEC` is set as a fallback via `fcntl` for kernels where `IORING_SETUP_CLOEXEC` was peeled.

CQ Overflow Recovery State Machine
CQ overflow is detected on every
`DrainCqeRingBatch` entry via `ObserveManagedCqOverflowCounter`, which compares the mmap'd overflow counter against the last-observed value using wrapping uint32 delta arithmetic. When a delta is seen, the engine enters a three-branch recovery state machine:

- `_liveAcceptCompletionSlotCount > 0` and not in teardown. Defers multishot accept re-arm nudges until post-drain.
- `_ioUringTeardownInitiated` is set. Teardown owns recovery completion.

During overflow recovery, CQ head advances happen per-CQE (not batched) to relieve kernel pressure immediately. Recovery completes when the CQ ring is fully drained and no new overflow delta is observed. On completion:
`AssertCompletionSlotPoolConsistency` validates free-list integrity, telemetry is incremented, and for the MultishotAcceptArming branch, `TryQueueDeferredMultishotAcceptRearmAfterRecovery` nudges accept contexts.

After recovery completes, a delayed sweep (
`TrySweepStaleTrackedIoUringOperationsAfterCqOverflowRecovery`) fires 250ms later to retire tracked operations whose CQEs were dropped. The sweep skips intentionally long-lived multishot accept and persistent multishot recv slots. Operations still in the waiting state are canceled; already-transitioned operations are detached and their slots freed.

3. Key Data Structures
Completion Slot Pool
Four parallel SoA arrays, all indexed by slot index:
- `IoUringCompletionSlot[]` (hot, 24 bytes each, `[StructLayout(LayoutKind.Explicit, Size = 24)]`):
  - `Generation` (ulong) -- 40-bit generation field
  - `FreeListNext` (int) -- intrusive free list, -1 = end
  - `_packedState` (uint) -- `IoUringCompletionOperationKind` in low 8 bits, boolean flags `IsZeroCopySend`/`ZeroCopyNotificationPending`/`UsesFixedRecvBuffer` in bits 8-10
  - `FixedRecvBufferId` (ushort)
  - (`#if DEBUG` only): `TestForcedResult` (int)
- `IoUringTrackedOperationState[]`: per-slot tracked operation reference (`TrackedOperation`, `TrackedOperationGeneration`) for ABA-safe operation tracking.
- `IoUringCompletionSlotStorage[]` (cold): `DangerousRefSocketHandle` for fd lifetime, pre-allocated native inline storage slab (NativeMsghdr + 4 IOVectors + 128B socket addr + 128B control + socklen_t), message writeback pointers for recvmsg, and cross-engine references for SO_REUSEPORT accept forwarding (`ReusePortPrimaryContext`, `ReusePortPrimaryEngine`).
- `MemoryHandle[]` (zero-copy pin holds): one `System.Buffers.MemoryHandle` per slot index, holding the pin for SEND_ZC payloads until the NOTIF CQE arrives.
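The layout and encoding contracts in this section can be mirrored in a C sketch (the real types are C#; the field order and padding here are assumptions): a 24-byte hot slot, and the 56-bit `user_data` payload split into a 16-bit slot index and a 40-bit generation under an 8-bit tag.

```c
#include <assert.h>
#include <stdint.h>

/* Hedged C mirror of the documented contracts, not the actual managed types. */
typedef struct {
    uint64_t generation;        /* only the low 40 bits are used */
    int32_t  free_list_next;    /* intrusive free list, -1 = end */
    uint32_t packed_state;      /* kind in low 8 bits, flags in bits 8-10 */
    uint16_t fixed_recv_buffer_id;
    uint8_t  padding[6];        /* explicit padding up to the 24-byte contract */
} completion_slot;

_Static_assert(sizeof(completion_slot) == 24, "layout contract: 24 bytes");

#define SLOT_INDEX_BITS  16
#define GENERATION_BITS  40
#define GENERATION_MASK  ((1ULL << GENERATION_BITS) - 1ULL)

static uint64_t pack_user_data(uint8_t tag, uint64_t generation, uint16_t slot)
{
    return ((uint64_t)tag << 56)
         | ((generation & GENERATION_MASK) << SLOT_INDEX_BITS)
         | slot;
}

static uint64_t next_generation(uint64_t g)
{
    /* wrap from 2^40-1 back to 1, skipping 0 so stale CQEs never match */
    g = (g + 1) & GENERATION_MASK;
    return g == 0 ? 1 : g;
}
```

Skipping generation 0 on wrap is the detail that makes the ABA protection airtight: a zero-initialized or stale `user_data` can never alias a live slot.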
IoUringCompletionSlotfield offsets and the 24-byte total size via reflection on every test run. ADebug.AssertinInitializeCompletionSlotPoolfires if the size drifts.Generation Encoding
16-bit slot index (
SlotIndexBits = 16, capacity 65,536) and 40-bit generation (GenerationBits = 56 - 16 = 40,GenerationMask = (1UL << 40) - 1UL) packed into the 56-bituser_datapayload. The upper 8 bits of user_data carry a tag byte (2 = reserved completion, 3 = wakeup signal). Generation is initialized to 1 (not 0) so stale CQEs referencing generation 0 are rejected. On wrap, generation remaps from2^40-1back to 1, skipping zero.IoUringCompletionOperationKind
A 4-variant enum (
None,Accept,Message,ReusePortAccept) stored in the packed state of eachIoUringCompletionSlot. This determines per-completion post-processing behavior: accept completions read sockaddr length from the native slab; message completions copy writeback data from the native msghdr; reuse-port accept completions forward accepted fds to the primary listener's pre-accept queue on its owning engine.IoUringCompletionDispatchKind
A 10-variant enum (
Default,ReadOperation,WriteOperation,SendOperation,BufferListSendOperation,BufferMemoryReceiveOperation,BufferListReceiveOperation,ReceiveMessageFromOperation,AcceptOperation,ConnectOperation) stored as a packed integer inside eachAsyncOperation, set at operation creation time and consumed at CQE dispatch to route completions without virtual dispatch. Defined in the shared Unix partial class (SocketAsyncContext.Unix.cs) so it compiles on all Unix TFMs.MPSC Queue
MpscQueue<T>is a lock-free segmented queue with cache-line-padded head/tail pointers and anEnqueueIndexcounter per segment. Features:Lock) to reduce allocation pressure during burst enqueue patternsTryEnqueueFast/TryDequeueFast) inlined for the common non-full/non-empty caseIsEmptyproperty is snapshot-based, not linearizable -- a return of true can mean an enqueue is mid-flightProvided Buffer Ring
IoUringProvidedBufferRing(1,115 lines): Kernel-registered buffer pool for recv operations. Features:IORING_REGISTER_PBUF_RINGDebug.Assert(IsCurrentThreadEventLoopThread())on resize evaluationBeginDeferredRecyclePublish/EndDeferredRecyclePublishbracket the CQE drain loop to batchPublishTailcallsEvaluateProvidedBufferRingResize, gated bySystem.Net.Sockets.IoUringAdaptiveBufferSizingAppContext switchInUseCount == 0and_trackedIoUringOperationCount == 0before swapIORING_REGISTER_BUFFERSfor fixed-buffer recv viaREAD_FIXEDopcodeLinuxIoUringCapabilities
An immutable
readonly structsnapshot captured after ring setup and stored as_ioUringCapabilities. Uses bitfield packing (uint _flagswith seven single-bit flags). ExposesIsIoUringPort,Mode,SupportsMultishotRecv,SupportsMultishotAccept,SupportsZeroCopySend,SqPollEnabled,SupportsProvidedBufferRings, andHasRegisteredBuffers. Each capability flag is immutable after construction viaWith*builder methods. Eliminates scattered per-capability flag reads; the entire capability set is decided once at initialization and updated only for provided-buffer state changes.IoUringResolvedConfiguration
An immutable
readonly structcapturing all resolved configuration inputs at startup:IoUringEnabled,SqPollRequested,DirectSqeDisabled,ZeroCopySendOptedIn,RegisterBuffersEnabled,AdaptiveProvidedBufferSizingEnabled,ProvidedBufferSize,PrepareQueueCapacity,CancellationQueueCapacity. IncludesIoUringConfigurationWarningFlagsdetection for misconfiguration scenarios (e.g. SQPOLL requested without io_uring enabled). Logged once viaSocketsTelemetry.Log.ReportIoUringResolvedConfigurationandNetEventSource.Info.SocketIOEventQueue
The event queue type is
SocketIOEventQueue(replacingConcurrentQueue<SocketIOEvent>), providing the inter-thread channel between event loop threads and the ThreadPool work item processing path.4. Feature Inventory
Complete Feature Stack
IoUringSqe*pointers via mmap'd ring_multishotAcceptState(0=disarmed, 1=arming, otherwise encoded user_data); emergency kill-switch viaDOTNET_SYSTEM_NET_SOCKETS_IO_URING_DISABLE_MULTISHOT_ACCEPT=1_persistentMultishotRecvDataQueue[FeatureSwitchDefinition]+ env var); JIT-eliminable when switch is falseIoUringCompletionDispatchKindeliminates virtual dispatch on the CQE hot pathIORING_SETUP_CLOEXECflag with static assert in shim; fcntl fallback; dedicated test#if DEBUG), per-opcode mask; forced EPERM on submit; forced EINTR retry limit exhaustion; forced kernel version unsupported; forced provided-buffer-ring OOM[Conditional("DEBUG")]AssertSingleThreadAccessat CQE dispatch entry points; mmap offset bounds validationsched_setaffinitySO_INCOMING_CPUlookup on first receive completion, one-shot migration to CPU-local engines_fdEngineAffinityarray5. Configuration Surface
Production Environment Variables
- `DOTNET_SYSTEM_NET_SOCKETS_IO_URING` -- `"1"` to enable
- `DOTNET_SYSTEM_NET_SOCKETS_IO_URING_SQPOLL` -- `"1"` to enable
- `DOTNET_SYSTEM_NET_SOCKETS_IO_URING_DISABLE_MULTISHOT_ACCEPT` -- `"1"` to disable
- `DOTNET_SYSTEM_NET_SOCKETS_IO_URING_DISABLE_REUSEPORT_ACCEPT` -- `"1"` to disable

Production AppContext Switches
- `System.Net.Sockets.UseIoUring` -- default `false` (`[FeatureSwitchDefinition]`)
- `System.Net.Sockets.UseIoUringSqPoll` -- default `false` (`[FeatureSwitchDefinition]` enables JIT elimination)
- `System.Net.Sockets.IoUringAdaptiveBufferSizing` -- default `false`

Precedence: the environment variable wins over the AppContext switch for the master gate. SQPOLL requires both surfaces enabled (dual opt-in).
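The two precedence rules can be captured in a small sketch (illustrative C; plain booleans stand in for the real AppContext/environment reads, and the names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Hedged sketch of the documented precedence rules. */
typedef struct {
    bool env_present;      /* DOTNET_SYSTEM_NET_SOCKETS_IO_URING set at all */
    bool env_enabled;      /* ...and equal to "1" */
    bool switch_enabled;   /* System.Net.Sockets.UseIoUring */
    bool sqpoll_env;       /* DOTNET_SYSTEM_NET_SOCKETS_IO_URING_SQPOLL == "1" */
    bool sqpoll_switch;    /* System.Net.Sockets.UseIoUringSqPoll */
} config_inputs;

/* Master gate: the environment variable, when present, wins over the switch. */
static bool io_uring_enabled(const config_inputs *c)
{
    return c->env_present ? c->env_enabled : c->switch_enabled;
}

/* SQPOLL: dual opt-in -- both surfaces must agree. */
static bool sqpoll_requested(const config_inputs *c)
{
    return c->sqpoll_switch && c->sqpoll_env;
}
```

The asymmetry is deliberate: the master gate is an either-or override, while SQPOLL's AND keeps a kernel-thread-pinning mode from activating off a single stray setting.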
SQPOLL dual opt-in: Both the AppContext switch AND the environment variable must be enabled. The AppContext switch is the outer gate -- if false,
`IsSqPollRequested()` returns immediately without checking the env var, and the JIT can statically eliminate all SQPOLL branches.

Debug-Only Test Controls
All
`DOTNET_SYSTEM_NET_SOCKETS_IO_URING_TEST_*` environment variables are gated behind `#if DEBUG`:

- `TEST_DIRECT_SQE` (0/1): disable/enable direct SQE submission
- `TEST_ZERO_COPY_SEND` (0/1): disable/enable zero-copy send
- `TEST_REGISTER_BUFFERS`: control registered buffer behavior
- `TEST_PROVIDED_BUFFER_SIZE`: override provided buffer size
- `TEST_ADAPTIVE_BUFFER_SIZING` (1): force adaptive sizing on
- `TEST_PREPARE_QUEUE_CAPACITY`: override prepare queue capacity
- `TEST_QUEUE_ENTRIES`: override SQ ring size (must be a power of 2, 2-1024)
- `TEST_EVENT_BUFFER_COUNT`: override event buffer count for deterministic diagnostics coverage
- `TEST_FORCE_EAGAIN_ONCE_MASK`: comma-separated opcode names for forced EAGAIN
- `TEST_FORCE_ECANCELED_ONCE_MASK`: comma-separated opcode names for forced ECANCELED
- `TEST_FORCE_SUBMIT_EPERM_ONCE` (1): force a single `io_uring_enter` submission to return EPERM
- `TEST_FORCE_ENTER_EINTR_RETRY_LIMIT_ONCE` (1): force the native EINTR retry circuit breaker to trigger once
- `TEST_FORCE_KERNEL_VERSION_UNSUPPORTED` (1): force kernel version check to fail
- `TEST_FORCE_PROVIDED_BUFFER_RING_OOM_ONCE` (1): force provided buffer ring allocation to fail once

6. Safety and Correctness Measures
Fd Lifetime Management
Every direct SQE preparation takes a
DangerousAddRefon the socket'sSafeSocketHandle, stored in_completionSlotStorage[slotIndex].DangerousRefSocketHandle. This keeps the fd alive from SQE prep through CQE retirement, preventing fd-reuse races after close. The ref is released inFreeCompletionSlot.Stale CQE Protection
Generation-based ABA protection. Each completion slot starts at generation 1. On free, generation increments (wrapping from
`2^40-1` to 1, skipping 0). CQE dispatch compares the CQE's encoded generation against the slot's current generation; mismatches are silently dropped as stale.

Zero-Copy Send Lifecycle
SEND_ZC produces two CQEs: a data completion and a NOTIF. The slot's
IsZeroCopySendandZeroCopyNotificationPendingflags track this two-phase lifecycle. After the first CQE, the slot is kept alive and the tracked operation is reattached viaTryReattachTrackedIoUringOperation(generation CAS from 0 to new generation, then operation CAS from null to operation). The NOTIF CQE triggersHandleZeroCopyNotificationwhich frees the slot and releases the pin hold.Multishot Accept Arming
The
_multishotAcceptStatefield uses a three-state protocol:0(disarmed),1(arming -- SQE being written but user_data not yet published), or the encoded user_data value itself (armed).GetArmedMultishotAcceptUserDataForCancellationspins briefly if the arming transition is in flight.Persistent Multishot Recv Guard
During CQE batch draining, persistent multishot recv completions check
`operation.IoUringUserData == 0` to detect operations that the ThreadPool has recycled (reset to Waiting state) before the event loop finishes the CQE batch. `IoUringUserData` is zeroed on the event-loop thread at completion and only restored during prepare-queue drain, making it a reliable recycled-operation sentinel independent of ThreadPool-driven state changes.

Teardown Ordering
LinuxFreeIoUringResourcesfollows a strict multi-phase teardown:CleanupManagedRings(also closes ring fd, terminating SQPOLL thread)DrainQueuedIoUringOperationsForTeardownruns twice -- once before and once after native port closure to catch late-arriving items)DrainTrackedIoUringOperationsForTeardownNativeMemory.FreeCleanupManagedRingsnulls all mmap-derived pointers before unmapping to prevent use-after-unmap.Nullable Avoidance
The SQE retry drain path avoids wrapping
SocketEventHandler(a struct) in aNullable<T>wrapper. Presence is tracked via a separatedrainHandlerInitializedboolean, avoiding boxing pressure on the hot path.SQE Size Validation
TryGetNextManagedSqechecksringInfo.SqeSize != (uint)sizeof(IoUringSqe)at runtime, catching 128-byte SQE kernels that would corrupt the ring.TryMmapRingsadditionally rejectsSetupSqe128negotiations.7. Performance Optimizations
CQ Head Advance Batching
Outside of overflow recovery, CQ head advances are deferred:
`_managedCachedCqHead` is incremented locally and the single `Volatile.Write` to `*_managedCqHeadPtr` happens once at the end of the drain batch (in the `finally` block). During overflow recovery, advances happen per-CQE to relieve kernel pressure.

SQE Zeroing
Each
`TryGetNextManagedSqe` call writes `Unsafe.WriteUnaligned(sqe, default(IoUringSqe))` for JIT-vectorized 64-byte zeroing before returning the SQE. This eliminates stale field concerns and enables each `Write*Sqe` method to write only the fields it needs.

SQE Writer Deduplication
Send-like operations share
WriteSendLikeSqe(differing only by opcode:SendvsSendZc). Sendmsg-like operations shareWriteSendMsgLikeSqe(SendMsgvsSendMsgZc). This reduces copy-paste without sacrificing readability.SQE Acquire With Retry
TryAcquireManagedSqeWithRetryattempts up toMaxIoUringSqeAcquireSubmitAttempts(16) rounds. Between retries, it runsDrainCqeRingBatchto free CQ slots, then submits pending SQEs. The drain handler is lazily initialized to avoid struct construction on the fast path.Completion Slot Drain Recovery
When
AllocateCompletionSlotreturns -1 (pool exhausted), the engine drains CQEs inline (guarded by_completionSlotDrainInProgressto prevent recursion) and retries allocation.Provided Buffer Deferred Recycle
BeginDeferredRecyclePublish/EndDeferredRecyclePublishbracket the CQE drain loop. Buffer descriptor writes accumulate without individualVolatile.Writetail publishes. A single tail publish happens atEndDeferredRecyclePublish.Diagnostics Polling
#### Diagnostics Polling

Diagnostic counters are polled every `IoUringDiagnosticsPollInterval` (64) event loop iterations, not on every CQE. Managed deltas are accumulated in per-engine fields and published in batch to `SocketsTelemetry`.

#### Lazy Lock Allocation
`_multishotAcceptQueueGate`, `_persistentMultishotRecvDataGate`, and `_reusePortShadowListenersGate` on `SocketAsyncContext` are lazy-initialized via `EnsureLockInitialized` (CAS from null). Most sockets never use these paths, so the `Lock` objects are allocated only when needed.
#### Event Loop Wait

The event loop first tries a non-blocking `DrainCqeRingBatch`. If no CQEs are available, it issues `io_uring_enter` with `GETEVENTS` and a 50 ms EXT_ARG timeout (bounded wait). A secondary 1 ms circuit-breaker timeout (`WakeFailureFallbackWaitTimeoutNanos`) is used after repeated eventfd wake failures. This trades worst-case latency for starvation resilience when eventfd wakes are missed or deferred.
#### Fd Engine Affinity

The `s_fdEngineAffinity` array maps file descriptor numbers to preferred engine indices. When a SO_REUSEPORT shadow listener accepts an fd, `SetFdEngineAffinity` records the accepting engine's index. Subsequent `TryRegisterSocket` calls consume this affinity hint via `Interlocked.Exchange`, placing the socket on the engine that accepted it. This avoids cross-engine cache pollution for accepted connections.
### 8. Telemetry and Observability

#### Stable PollingCounters (12)

Published when the EventSource is enabled on Linux. Counter names are centralized in `IoUringCounterNames`:

- `io-uring-prepare-nonpinnable-fallbacks`
- `io-uring-socket-event-buffer-full`
- `io-uring-cq-overflows`
- `io-uring-cq-overflow-recoveries`
- `io-uring-prepare-queue-overflows`
- `io-uring-prepare-queue-overflow-fallbacks`
- `io-uring-completion-slot-exhaustions`
- `io-uring-completion-slot-high-water-mark`
- `io-uring-cancellation-queue-overflows`
- `io-uring-provided-buffer-depletions`
- `io-uring-sqpoll-wakeups`
- `io-uring-sqpoll-submissions-skipped`

#### Diagnostic Backing Fields (29)
Written internally for structured logging and test access; not published as PollingCounters. Include:

#### Startup Events

- `ReportIoUringResolvedConfiguration` (event ID 9): logged once with all resolved config inputs, including validation warnings for misconfigured knobs
- `ReportSocketEngineBackendSelected` (event ID 7): reports io_uring_completion vs. epoll selection and SQPOLL status
- `ReportIoUringSqPollNegotiatedWarning` (event ID 8): WARNING-level event emitted when SQPOLL is negotiated

#### Structured Logging
`IoUringDiagnostics.Linux.cs` centralizes managed diagnostic delta publication with `PublishIoUringManagedDiagnosticsDelta`. Counters are collectible via `dotnet-counters`, `dotnet-trace`, or any OpenTelemetry-compatible collector.

### 9. Test Coverage
#### Test Access Architecture

The test project does not use `InternalsVisibleTo`. Instead:

- `IoUringTestAccessors.Linux.cs` (1,048 lines) in `IoUringTestInfrastructure/` defines all test-visible snapshot types and accessor methods inside `SocketAsyncEngine` (production assembly)
- `InternalTestShims.Linux.cs` (707 lines) in the test project mirrors these types and resolves them via reflection
- `SocketAsyncEngine.IoUringTestHooks.Linux.cs` (229 lines) in `IoUringTestInfrastructure/` provides `#if DEBUG`-gated EAGAIN/ECANCELED forced result injection
- A `[DynamicDependency(DynamicallyAccessedMemberTypes.All, "System.Net.Sockets.SocketAsyncEngine", "System.Net.Sockets")]` attribute preserves all targets under trimming and AOT

#### Test Suite (159 test methods across 7,723 lines)
Coverage areas:

- Forced result injection (`#if DEBUG`): forced EPERM on submit; forced EINTR retry-limit exhaustion; forced kernel version unsupported; forced provided-buffer-ring OOM
- `NativeMsghdrLayoutContract_IsStable` and `CompletionSlotLayoutContract_IsStable` verify ABI alignment via reflection
- `CqOverflow_ReflectionTargets_Stable` ensures field names are documented and stable
- `RingFd_HasCloexecFlag_Set` verifies the `FD_CLOEXEC` bit via `fcntl`
- Telemetry counter verification (`TelemetryTest.cs`)
- MPSC queue tests (`MpscQueueTests.cs`)
- `SO_INCOMING_CPU` detection; one-shot migration guard

#### Hard to Test In-Process
- Forced kernel version unsupported (`TEST_FORCE_KERNEL_VERSION_UNSUPPORTED`)

### 10. Graceful Degradation
### 11. Remaining Open Items

The following areas represent future work rather than outstanding defects:

#### Performance Follow-Ups (Tier 2)

- Further batching of `PublishTail` calls: the current implementation uses `BeginDeferredRecyclePublish`/`EndDeferredRecyclePublish`, which partially addresses this.
- SQE zeroing currently relies on `Unsafe.WriteUnaligned<IoUringSqe>(sqe, default)` for JIT-vectorized zeroing; per-field writes remain a potential micro-optimization.

#### Path to Default-On
SQPOLL will likely remain opt-in permanently due to its CPU-cost trade-off.

#### Future Kernel Features

#### Multi-Engine Evolution

The single-process comment in `SocketAsyncEngine.Linux.cs` notes that io_uring completion mode uses one engine per physical core. Future work may evaluate finer-grained socket-affinity sharding when high-core throughput data justifies the additional complexity.

### 12. Distribution Readiness
#### Kernel Version Matrix

The minimum kernel cutoff is a single 6.1 requirement. All sub-features are detected at runtime via opcode probing.
#### Memory Overhead

Memory scales linearly with the number of active engines (one per physical core).