Move Overlapped implementation to managed code in CoreLib#23029
filipnavara wants to merge 19 commits into `dotnet:master`
Conversation
```csharp
*(GCHandle*)(_pNativeOverlapped + 1) = GCHandle.Alloc(this);

//if (ETW_EVENT_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, ThreadPoolIODequeue))
//    FireEtwThreadPoolIOPack(lpOverlapped, overlappedUNSAFE, GetClrInstanceId());
```
Not sure how to translate this to managed code? I looked into FrameworkEventSource, but the definitions there are not really aligned with ClrEtwAll.man, so I am not sure how to proceed.
The appropriate thing to do is to create an event that logs when an I/O packing happens as an event in FrameworkEventSource (as you guessed). Yes, there is no particular correspondence, and the update may break causality-reconstruction logic in PerfView (it will have to be updated to use the new event). I can help with that. (You can see the current use in https://github.com/Microsoft/perfview/blob/master/src/TraceEvent/Computers/ActivityComputer.cs#L158.)
But the first thing to do is to have an event to replace it. Fundamentally we want to link when async I/O is requested and when it completes (by some ID, like the address of the NativeOverlapped structure, or anything else that is shared uniquely between the I/O request and its completion).
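The pairing idea could be sketched like this (a hypothetical EventSource; the names and IDs here are illustrative, not the actual FrameworkEventSource surface):

```csharp
using System;
using System.Diagnostics.Tracing;

// Hypothetical sketch: log an ID shared between the request ("pack") and its
// completion ("dequeue"), e.g. the address of the NativeOverlapped structure,
// so a profiler can reconstruct causality between the two events.
[EventSource(Name = "Demo-AsyncIOCorrelation")]
sealed class AsyncIoEventSource : EventSource
{
    public static readonly AsyncIoEventSource Log = new AsyncIoEventSource();

    // Fired when the async I/O is requested (packed).
    [Event(1, Level = EventLevel.Verbose)]
    public void IOPack(long nativeOverlapped) => WriteEvent(1, nativeOverlapped);

    // Fired when the completion is dequeued; the same ID links it to the request.
    [Event(2, Level = EventLevel.Verbose)]
    public void IODequeue(long nativeOverlapped) => WriteEvent(2, nativeOverlapped);
}
```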
Ah, I just realised why the event is needed. I was always testing code that called both the ThreadPoolIOPack and ThreadPoolIOEnqueue events and not the native (file/socket) overlapped I/O where only ThreadPoolIOPack is triggered. I will look into it tomorrow to see how to add the event back.
I gave it a whirl, but I am not happy with the result. Now ThreadPoolIOEnqueue and ThreadPoolIOPack appear in different providers. They can be correlated based on the parameters, but it's a bit suboptimal. Moreover, PerfView always picks the manifest for FrameworkEventSource from full NetFX instead of the CoreFX version, so the new event is not properly decoded.
Would it be too bad to keep emitting the event using FCall?
@jkotas Probably not. I have a prototype ready, but I didn't benchmark it yet. I also tried emitting the exact same event from managed code, but could not avoid getting some information from the runtime and requiring an FCall anyway.
If you need to make a FCall anyway, it would be nice to simply log the event in native code, which avoids breaking compat.
I need the FCall if I try to log the event in the same format as before. It's not necessary if I use FrameworkEventSource.

I ran some benchmarks, but the results were not very appealing and they were quite inconsistent (some runs were 2x as slow as others). I suspect there's something wrong either with the implementation or the benchmark... I'll look into it more.
Apparently it happens on the machine even for the unchanged CoreCLR, so I guess it's some power management issue :-/
The tests are failing because …
The last remaining native function is in `coreclr/src/vm/win32threadpool.cpp`, line 3617 in 1079ba8. Not sure what condition is handled by `coreclr/src/vm/nativeoverlapped.cpp`, lines 51 to 58 in 1079ba8.

/cc @jkotas
Yes, it is a perf optimization. As with any perf optimization, any changes to it should be measured.
It is fine to update the CoreFX tests to expect the correct argument name. (Disable them in https://github.com/dotnet/coreclr/blob/master/tests/CoreFX/CoreFX.issues.json to make the CoreCLR PR green.) |
cc @kouvel |
…. It is no longer used since commit 5a7fb90 and thus is not performance critical. This allows removing the knowledge of the object layout from the unmanaged code.
```cpp
@@ -1440,27 +1412,7 @@ void GCToEEInterface::WalkAsyncPinned(Object* object, void* context, void (*call
    assert(object != nullptr);
```
We can just assert false here. This method should be unreachable now.
Apparently it's still called. The CI fails on it big time.
Ok, I see why. Could you please change it back?
```csharp
    ThreadPoolDequeueWork((long)*((void**)Unsafe.AsPointer(ref workID)));
}

[Event(32, Level = EventLevel.Verbose, Keywords = Keywords.ThreadPool)]
```
Not sure how the event IDs are assigned. I looked up what CoreFX and NetFX had and used the next available one.
The only important thing about the ID is that it is unique. The only thing that makes it at all tricky is that we have two versions of this file (.NET Core and .NET Desktop), and we want it to be unique considering both of them. As long as we add events to .NET Core every time we do to Desktop, everything is OK (and I took a look, and this seems to be the case).

The long and the short of this is that 32 is OK to use.

Could you put in a comment that we should be using the IDs in the range from 33 to 149 first for any new events?
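A sketch of what such a comment might look like next to the event definition (the surrounding EventSource class, keyword value, and method body here are hypothetical placeholders, not the real FrameworkEventSource):

```csharp
using System.Diagnostics.Tracing;

[EventSource(Name = "Demo-FrameworkEventSource")]
sealed class DemoFrameworkEventSource : EventSource
{
    // EventSource convention: keywords are declared in a nested class named Keywords.
    public static class Keywords
    {
        public const EventKeywords ThreadPool = (EventKeywords)0x0002;
    }

    public static readonly DemoFrameworkEventSource Log = new DemoFrameworkEventSource();

    // Event IDs must be unique across both the .NET Core and .NET Desktop
    // copies of this file; use IDs in the range 33-149 first for any new events.
    [Event(32, Level = EventLevel.Verbose, Keywords = Keywords.ThreadPool)]
    public void ThreadPoolIOPackWork(long nativeOverlapped) => WriteEvent(32, nativeOverlapped);
}
```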
```csharp
FreeNativeOverlapped();

if (success && FrameworkEventSource.Log.IsEnabled(EventLevel.Verbose, FrameworkEventSource.Keywords.ThreadPool))
    System.Diagnostics.Tracing.FrameworkEventSource.Log.ThreadPoolIOPackWork((long)(IntPtr)_pNativeOverlapped);
```
This needs to be checked to see if it produces the same ID as the native code to allow the events to be properly correlated.
It doesn't contain all the relevant data yet. I finally got to check it... Working on it.
My benchmark machine is broken, so I cannot re-run the benchmark to test for regression. Code is linked in the PR description if anyone wants to give it a go.
```cpp
HELPER_METHOD_FRAME_BEGIN_RET_0();

if (ETW_EVENT_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, ThreadPoolIODequeue))
```
The check of this condition can be moved before the HELPER_METHOD_FRAME_BEGIN to minimize overhead when tracing is off.
Wanted to ask about that.
The benchmark is definitely triggering some code that results in quite unreliable performance. Here are some results I managed to get:

Before:

After:

The problem is that during the benchmark runs there are some quite visible stalls. While most of the results are within 11-15 us/op, there is a significant number of outliers at 49-72 us/op. It could be the thread pool code, GC, or something else, but it happens both before and after my changes.
The benchmark seems to be dominated by the thread pool overhead, which can vary for the reasons you have mentioned. How much of this number is threadpool overhead vs. the actual cost of Overlapped? What would the numbers be for raw alloc/free (e.g. #18360 (comment))? The fixed cost of Overlapped did show up during socket performance investigations, e.g. #10302 or #21320.
I'll benchmark the pure Pack/Unpack, but it's not fully representative, since it will not show the costs incurred during garbage collection for the old code.
Reduced/fixed benchmark code:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Threading;

namespace OverlappedBench
{
    [InProcess]
    public class Program
    {
        [Benchmark]
        public unsafe void AllocOverlappedNull() => AllocOverlapped(null);

        [Benchmark]
        public unsafe void AllocOverlappedObject() => AllocOverlapped(IntPtr.Zero);

        [Benchmark]
        public unsafe void AllocOverlappedArray() => AllocOverlapped(new[] { IntPtr.Zero, IntPtr.Zero, IntPtr.Zero });

        private unsafe void AllocOverlapped(object userObject)
        {
            Overlapped ov = new Overlapped();
            NativeOverlapped* nativeOverlapped = ov.Pack(null, userObject);
            Overlapped.Unpack(nativeOverlapped);
            Overlapped.Free(nativeOverlapped);
        }

        static void Main(string[] args)
        {
            BenchmarkRunner.Run<Program>();
        }
    }
}
```

Before:

After:
Note that even with the updated benchmark I still get at least one time spike within the run of three benchmarks, so there could still be something wrong. It happens both before and after the changes, and it is filtered out by BDN. Here's one example from the warmup phase, but it happened during the actual benchmark run too:
Does the AllocOverlappedArray actually hit the …
Also, I am not able to replicate these results with low-tech microbenchmarks. For example: Before: 1.9s
I'd expect the performance to be different, so I think there's something wrong with my benchmarking. BDN 0.14 dropped support for direct configurations with a local CoreCLR, so I had to switch to using the …

I'll run the low-tech benchmark locally, but the difference on your run seems to be too big to be explained by tiered compilation.
Low-tech benchmark confirms your numbers. I get the same numbers as you on my machine. I'll just close this as an unsuccessful attempt and perhaps revisit it some other day. I'll also try to get my BDN infrastructure working again, because it obviously broke with updates to the latest .NET Core / BDN versions and my attempts to get it working didn't succeed.

Thanks for all the help, and sorry for wasting so much time on this.
No problem. It was an interesting experiment.
This change was attempted before in dotnet/coreclr#23029 and rejected due to performance impact. Things have changed since then that make it feasible now. Sockets and file I/O do not use the pinning feature of Overlapped anymore. They pin memory on their own using `{ReadOnly}Memory<T>.Pin` instead. It means that the async pinned handles are typically not pinning anything. The async pinned handles come with some extra overhead in this common use case. Also, they cause confusion during GC behavior drill-downs. This change removes the support for async pinned handles from the GC:

- It makes the current most common Overlapped use cheaper. It is hard to measure the impact of eliminating async pinned handles exactly, since they are just a small part of the total GC costs. The unified, fully managed implementation enabled simplification of the implementation and reduced allocations.
- It gets rid of confusing async pinned handle behavior. The change was actually motivated by a recent discussion with a customer who was surprised by the async pinned handles not pinning anything. They were not sure whether it is expected behavior or a bug in the diagnostic tools.

Micro-benchmarks for the pinning feature of Overlapped are going to regress with this change. The regression in a micro-benchmark that runs Overlapped.Pack/Unpack in a tight loop is about 20% for each pinned object. If there is 3rd-party code still using the pinning feature of Overlapped, Overlapped.Pack/Unpack is expected to be a tiny part of the end-to-end async flow, and the regression for end-to-end scenarios is expected to be in the noise range.
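The `{ReadOnly}Memory<T>.Pin` pattern the description refers to can be sketched as follows (the `PinExample` wrapper is illustrative; real I/O code would issue the native operation while the handle is held and dispose it on completion):

```csharp
using System;
using System.Buffers;

static class PinExample
{
    // Pin the caller's buffer directly via Memory<T>.Pin, instead of relying
    // on Overlapped's pinning support, so the async pinned handle pins nothing.
    public static unsafe IntPtr PinAndGetPointer(Memory<byte> buffer)
    {
        using (MemoryHandle handle = buffer.Pin())
        {
            // handle.Pointer is stable (the buffer cannot move) until Dispose;
            // a real caller would keep the handle alive for the I/O's duration.
            return (IntPtr)handle.Pointer;
        }
    }
}
```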
Adapts the managed implementation of `Overlapped` from CoreRT for use in CoreCLR. Eventually the goal is to move the code to the shared partition, but there are a few small things that need to be sorted out first.

I adapted code from CoreFX tests to see whether using `GCHandle` will result in measurable performance impact. The benchmark showed that for `_userObject` pointing to a single object or an array of three objects there was no significant performance impact.

Before:

After:

Missing things:

- `Overlapped.Pack` …
- `Overlapped` and `OverlappedData` …
- Move code to shared partition (will submit as a separate PR)