This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Move Overlapped implementation to managed code in CoreLib#23029

Closed
filipnavara wants to merge 19 commits into dotnet:master from filipnavara:overlapped1

Conversation

@filipnavara
Member

@filipnavara filipnavara commented Mar 5, 2019

Adapts the managed implementation of Overlapped from CoreRT for use in CoreCLR. Eventually the goal is to move the code to the shared partition, but there are a few small things that need to be sorted out first.

I adapted code from the CoreFX tests to see whether using GCHandle would result in a measurable performance impact. The benchmark showed that for _userObject pointing to a single object or an array of three objects there was no significant performance impact.

Before:

| Method | Mean | Error | StdDev |
| --- | --- | --- | --- |
| AllocOverlappedNull | 23.69 us | 0.4541 us | 0.4248 us |
| AllocOverlappedObject | 23.92 us | 0.4692 us | 0.9370 us |
| AllocOverlappedArray | 25.25 us | 0.8678 us | 2.5586 us |

After:

| Method | Mean | Error | StdDev |
| --- | --- | --- | --- |
| AllocOverlappedNull | 23.71 us | 0.4676 us | 0.4593 us |
| AllocOverlappedObject | 23.80 us | 0.4592 us | 0.6130 us |
| AllocOverlappedArray | 24.35 us | 0.4937 us | 0.8112 us |

Missing things:

*(GCHandle*)(_pNativeOverlapped + 1) = GCHandle.Alloc(this);

//if (ETW_EVENT_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, ThreadPoolIODequeue))
// FireEtwThreadPoolIOPack(lpOverlapped, overlappedUNSAFE, GetClrInstanceId());
Member Author

@filipnavara filipnavara Mar 5, 2019

I'm not sure how to translate this to managed code. I looked into FrameworkEventSource, but the definitions there are not really aligned with ClrEtwAll.man, so I am not sure how to proceed.


The appropriate thing to do is to create an event in FrameworkEventSource that logs when an I/O packing happens (as you guessed). Yes, there is no particular correspondence, and the update may break the causality reconstruction logic in PerfView (it will have to be updated to use the new event). I can help with that. (You can see the current use in https://github.com/Microsoft/perfview/blob/master/src/TraceEvent/Computers/ActivityComputer.cs#L158.)

But the first thing to do is to have an event to replace it. Fundamentally we want to link when async I/O is requested and when it completes (by some ID such as the address of the NativeOverlapped structure, or anything else that is shared uniquely between the I/O request and its completion).

Member Author

Ah, I just realised why the event is needed. I was always testing code that raised both the ThreadPoolIOPack and ThreadPoolIOEnqueue events, not the native (file/socket) overlapped I/O where only ThreadPoolIOPack is raised. I will look into it tomorrow to see how to add the event back.

Member Author

I gave it a whirl, but I am not happy with the result. Now ThreadPoolIOEnqueue and ThreadPoolIOPack appear in different providers. They can be correlated based on the parameters, but it's a bit suboptimal. Moreover, PerfView always picks the manifest for FrameworkEventSource from the full NetFX instead of the CoreFX version, so the new event is not properly decoded.

Member

Would it be too bad to keep emitting the event using FCall?

Member Author

@jkotas Probably not. I have a prototype ready, but I haven't benchmarked it yet. I also tried emitting the exact same event from managed code, but could not avoid getting some information from the runtime and requiring an FCall anyway.


If you need to make an FCall anyway, it would be nice to simply log the event in native code, which avoids breaking compat.

Member Author

@filipnavara filipnavara Mar 8, 2019

I need an FCall if I try to log the event in the same format as before. It's not necessary if I use FrameworkEventSource.

I ran some benchmarks, but the results were not very appealing and were quite inconsistent (some runs were 2x as slow as others). I suspect there's something wrong either with the implementation or with the benchmark... I'll look into it more.

Member Author

Apparently it happens on the machine even for the unchanged CoreCLR, so I guess it is some power management issue :-/

@filipnavara
Member Author

The tests are failing because the ArgumentException has a different parameter name. I can re-wrap the exception or change the CoreFX test. Neither name is actually the correct name of the parameter. The CoreFX test expects null, which CoreCLR used to return. In the new code GCHandle throws the exception and uses value as the parameter name. The actual name of the parameter in the public function is userData.

@filipnavara
Member Author

filipnavara commented Mar 5, 2019

The last remaining native function is CheckVMForIOPacket. As far as I can see, it is an optimization that tries to improve performance when multiple packets are processed in succession. It seems it could be avoided completely if the check is moved just behind this line and kept directly in the thread pool:

((LPOVERLAPPED_COMPLETION_ROUTINE) key)(errorCode, numBytes, pOverlapped);

(or in BindIoCompletionCallBack_Worker)

I am not sure what condition is handled by the if (overlapped->m_callback == NULL) branch:

if (overlapped->m_callback == NULL)
{
//We're not initialized yet, go back to the Vm, and process the packet there.
ThreadpoolMgr::StoreOverlappedInfoInThread(pThread, *errorCode, *numBytes, key, *lpOverlapped);
*lpOverlapped = NULL;
return;
}

/cc @jkotas

@jkotas
Member

jkotas commented Mar 5, 2019

> CheckVMForIOPacket ... As far as I can see it is an optimization

Yes, it is a perf optimization. As with any perf optimization, any changes to it should be measured.

@jkotas
Member

jkotas commented Mar 5, 2019

> change the CoreFX test.

It is fine to update the CoreFX tests to expect the correct argument name. (Disable them in https://github.com/dotnet/coreclr/blob/master/tests/CoreFX/CoreFX.issues.json to make the CoreCLR PR green.)

@jkotas
Member

jkotas commented Mar 5, 2019

cc @kouvel

filipnavara added a commit to filipnavara/corefx that referenced this pull request Mar 6, 2019
@filipnavara filipnavara changed the title WIP: Move most of Overlapped code to managed CoreLib Move Overlapped implementation to managed code in CoreLib Mar 6, 2019
@@ -1440,27 +1412,7 @@ void GCToEEInterface::WalkAsyncPinned(Object* object, void* context, void (*call
assert(object != nullptr);
Member

We can just assert false here. This method should be unreachable now.

Member Author

Apparently it's still called. The CI fails on it big time.

Member

Ok, I see why. Could you please change it back?

Member Author

Sure

ThreadPoolDequeueWork((long)*((void**)Unsafe.AsPointer(ref workID)));
}

[Event(32, Level = EventLevel.Verbose, Keywords = Keywords.ThreadPool)]
Member Author

I am not sure how the event IDs are assigned. I looked up what CoreFX and NetFX had and used the next available one.


The only important thing about the ID is that it is unique. The only thing that makes it at all tricky is that we have two versions of this file (.NET Core and .NET Desktop), and we want the ID to be unique considering both of them. As long as we add events to .NET Core every time we do to Desktop, everything is OK (and I took a look, and this seems to be the case).

The long and the short of this is that 32 is OK to use.

Could you put in a comment that we should be using the IDs in the range from 33 to 149 first for any new events?

FreeNativeOverlapped();

if (success && FrameworkEventSource.Log.IsEnabled(EventLevel.Verbose, FrameworkEventSource.Keywords.ThreadPool))
System.Diagnostics.Tracing.FrameworkEventSource.Log.ThreadPoolIOPackWork((long)(IntPtr)_pNativeOverlapped);
Member Author

This needs to be checked to see if it produces the same ID as the native code to allow the events to be properly correlated.

Member Author

It doesn't contain all the relevant data yet. I finally got to check it... working on it.

@filipnavara
Member Author

My benchmark machine is broken, so I cannot re-run the benchmark to test for regression. Code is linked in the PR description if anyone wants to give it a go.


HELPER_METHOD_FRAME_BEGIN_RET_0();

if (ETW_EVENT_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, ThreadPoolIODequeue))
Member

The check of this condition can be moved before HELPER_METHOD_FRAME_BEGIN to minimize overhead when tracing is off.

Member Author

I wanted to ask about that.

@filipnavara
Member Author

filipnavara commented Mar 9, 2019

The benchmark is definitely triggering some code that results in quite unreliable performance. Here are some results I managed to get:

Before:

| Method | Mean | Error | StdDev |
| --- | --- | --- | --- |
| AllocOverlappedNull | 13.00 us | 0.2599 us | 0.6176 us |
| AllocOverlappedObject | 13.03 us | 0.3933 us | 1.0222 us |
| AllocOverlappedArray | 13.75 us | 0.4887 us | 1.3623 us |

After:

| Method | Mean | Error | StdDev |
| --- | --- | --- | --- |
| AllocOverlappedNull | 12.01 us | 0.1646 us | 0.1459 us |
| AllocOverlappedObject | 12.08 us | 0.2172 us | 0.2032 us |
| AllocOverlappedArray | 13.21 us | 0.2812 us | 0.8114 us |

The problem is that during the benchmark runs there are some quite visible stalls like this:

WorkloadActual  27: 65536 op, 791229500.00 ns, 12.0732 us/op
WorkloadActual  28: 65536 op, 807996900.00 ns, 12.3291 us/op
WorkloadActual  29: 65536 op, 3896247700.00 ns, 59.4520 us/op
WorkloadActual  30: 65536 op, 788609100.00 ns, 12.0332 us/op
WorkloadActual  31: 65536 op, 972309200.00 ns, 14.8363 us/op

While most of the results are within 11-15 us/op, there is a significant number of outliers at 49-72 us/op. It could be the thread pool code, the GC, or something else, but it happens both before and after my changes.

@jkotas
Member

jkotas commented Mar 10, 2019

> While most of the results are within 11-15 us/op there's significant number of outliers which are 49-72 us/op

The benchmark seems to be dominated by the thread pool overhead, which can vary for the reasons you have mentioned. How much of this number is the threadpool overhead vs. the actual cost of Overlapped? What would the numbers be for raw alloc/free (e.g. #18360 (comment))?

The fixed cost of Overlapped did show up during socket performance investigations, e.g. #10302 or #21320.

@filipnavara
Member Author

I'll benchmark pure Pack/Unpack, but it's not fully representative since it will not show the costs incurred during garbage collection by the old code.

@filipnavara
Member Author

Reduced/fixed benchmark code:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Threading;

namespace OverlappedBench
{
    [InProcess]
    public class Program
    {
        [Benchmark]
        public unsafe void AllocOverlappedNull() => AllocOverlapped(null);
        [Benchmark]
        public unsafe void AllocOverlappedObject() => AllocOverlapped(IntPtr.Zero);
        [Benchmark]
        public unsafe void AllocOverlappedArray() => AllocOverlapped(new[] { IntPtr.Zero, IntPtr.Zero, IntPtr.Zero });

        private unsafe void AllocOverlapped(object userObject)
        {
            Overlapped ov = new Overlapped();
            NativeOverlapped* nativeOverlapped = ov.Pack(null, userObject);
            Overlapped.Unpack(nativeOverlapped);
            Overlapped.Free(nativeOverlapped);
        }

        static void Main(string[] args)
        {
            BenchmarkRunner.Run<Program>();
        }
    }
}

Before:

| Method | Mean | Error | StdDev |
| --- | --- | --- | --- |
| AllocOverlappedNull | 162.1 ns | 1.1808 ns | 0.9219 ns |
| AllocOverlappedObject | 173.7 ns | 0.9135 ns | 0.8545 ns |
| AllocOverlappedArray | 175.8 ns | 0.9807 ns | 0.9174 ns |

After:

| Method | Mean | Error | StdDev |
| --- | --- | --- | --- |
| AllocOverlappedNull | 161.4 ns | 1.7787 ns | 1.6638 ns |
| AllocOverlappedObject | 172.0 ns | 0.6487 ns | 0.5417 ns |
| AllocOverlappedArray | 176.4 ns | 0.7456 ns | 0.6975 ns |

@filipnavara
Member Author

Note that even with the updated benchmark I still get at least one time spike within a run of the three benchmarks, so there could still be something wrong. It happens both before and after the changes, and it is filtered out by BDN. Here's one example from the warmup phase, but it happened during actual benchmark runs too:

WorkloadWarmup   1: 4194304 op, 703022900.00 ns, 167.6137 ns/op
WorkloadWarmup   2: 4194304 op, 691389500.00 ns, 164.8401 ns/op
WorkloadWarmup   3: 4194304 op, 2717302000.00 ns, 647.8553 ns/op
WorkloadWarmup   4: 4194304 op, 2700309200.00 ns, 643.8039 ns/op
WorkloadWarmup   5: 4194304 op, 2711021400.00 ns, 646.3579 ns/op
WorkloadWarmup   6: 4194304 op, 683177200.00 ns, 162.8821 ns/op

@jkotas
Member

jkotas commented Mar 11, 2019

Does the AllocOverlappedArray actually hit the object[] array path? I think it needs to be

byte[] buffer = new byte[1];
AllocOverlapped(new object[] { buffer, buffer, buffer });

@jkotas
Member

jkotas commented Mar 11, 2019

Also, I am not able to replicate these results with low-tech microbenchmarks. For example:

int start = Environment.TickCount;
for (int i = 0; i < 10000000; i++)
{
    object userObject = new IntPtr();
    Overlapped ov = new Overlapped();
    NativeOverlapped* nativeOverlapped = ov.Pack(null, userObject);
    Overlapped.Unpack(nativeOverlapped);
    Overlapped.Free(nativeOverlapped);
}
int end = Environment.TickCount;
Console.WriteLine((end-start).ToString());

Before: 1.9s
After: 12.2s

@filipnavara
Member Author

I'd expect the performance to be different, so I think there's something wrong with my benchmarking. BDN 0.14 dropped support for direct configurations with a local CoreCLR, so I had to switch to using dotnet run -c Release -- --coreRun <path>. I wasn't sure it was working, so I started printing the layout of the Overlapped class using reflection to ensure that it was really different before and after the changes.

I'll run the low-tech benchmark locally, but the difference on your run seems too big to be explained by tiered compilation.

@filipnavara
Member Author

Low-tech benchmark confirms your numbers. I get the same numbers as you on my machine.

I'll just close this as an unsuccessful attempt and perhaps revisit it some other day. I'll also try to get my BDN infrastructure working again, because it obviously broke with updates to the latest .NET Core / BDN versions and my attempts to get it working didn't succeed (the --coreRun trick from https://benchmarkdotnet.org/articles/configs/toolchains.html no longer works on my machine).

Thanks for all the help and sorry for wasting so much time on this.

@jkotas
Member

jkotas commented Mar 11, 2019

No problem. It was an interesting experiment.

jkotas added a commit to jkotas/runtime that referenced this pull request Aug 24, 2022
This change was attempted before in dotnet/coreclr#23029 and rejected due to performance impact. Things have changed since then that make it feasible now.

Sockets and file I/O do not use the pinning feature of Overlapped anymore. They pin memory on their own using `{ReadOnly}Memory<T>.Pin` instead. It means that the async pinned handles are typically not pinning anything. The async pinned handles come with some extra overhead in this common use case. Also, they cause confusion during GC behavior drill-downs. This change removes the support for async pinned handles from the GC:
- It makes the current most common Overlapped use cheaper. It is hard to measure the impact of eliminating async pinned handles exactly, since they are just a small part of the total GC costs. The unified fully managed implementation enabled simplification of the implementation and reduced allocations.
- It gets rid of confusing async pinned handles behavior. The change was actually motivated by a recent discussion with a customer who was surprised by the async pinned handles not pinning anything. They were not sure whether it is expected behavior or whether it is a bug in the diagnostic tools.

Micro-benchmarks for the pinning feature of Overlapped are going to regress with this change. The regression in a micro-benchmark that runs Overlapped.Pack/Unpack in a tight loop is about 20% for each pinned object. If there is 3rd-party code still using the pinning feature of Overlapped, Overlapped.Pack/Unpack is expected to be a tiny part of the end-to-end async flow and the regression for end-to-end scenarios is expected to be in the noise range.
jkotas added a commit to dotnet/runtime that referenced this pull request Aug 25, 2022