This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Move Overlapped implementation to managed code in CoreLib#23029

Closed
filipnavara wants to merge 19 commits into dotnet:master from filipnavara:overlapped1

Conversation

@filipnavara
Member

@filipnavara filipnavara commented Mar 5, 2019

Adapts the managed implementation of Overlapped from CoreRT for use in CoreCLR. Eventually the goal is to move the code to the shared partition, but there are a few small things that need to be sorted out first.

I adapted code from the CoreFX tests to see whether using GCHandle would result in a measurable performance impact. The benchmark showed that for _userObject pointing to a single object or an array of three objects there was no significant performance impact.

Before:

| Method | Mean | Error | StdDev |
| --- | --- | --- | --- |
| AllocOverlappedNull | 23.69 us | 0.4541 us | 0.4248 us |
| AllocOverlappedObject | 23.92 us | 0.4692 us | 0.9370 us |
| AllocOverlappedArray | 25.25 us | 0.8678 us | 2.5586 us |

After:

| Method | Mean | Error | StdDev |
| --- | --- | --- | --- |
| AllocOverlappedNull | 23.71 us | 0.4676 us | 0.4593 us |
| AllocOverlappedObject | 23.80 us | 0.4592 us | 0.6130 us |
| AllocOverlappedArray | 24.35 us | 0.4937 us | 0.8112 us |

Missing things:

*(GCHandle*)(_pNativeOverlapped + 1) = GCHandle.Alloc(this);

//if (ETW_EVENT_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, ThreadPoolIODequeue))
// FireEtwThreadPoolIOPack(lpOverlapped, overlappedUNSAFE, GetClrInstanceId());
Member Author

@filipnavara filipnavara Mar 5, 2019

I'm not sure how to translate this to managed code. I looked into FrameworkEventSource, but the definitions there are not really aligned with ClrEtwAll.man, so I am not sure how to proceed.


The appropriate thing to do is to create an event in FrameworkEventSource that logs when an I/O packing happens (as you guessed). Yes, there is no particular correspondence, and the update may break the causality reconstruction logic in PerfView (it will have to be updated to use the new event). I can help with that. (You can see the current use in https://github.com/Microsoft/perfview/blob/master/src/TraceEvent/Computers/ActivityComputer.cs#L158.)

But the first thing to do is to have an event to replace it. Fundamentally we want to link when async I/O is requested and when it completes (by some ID such as the address of the NativeOverlapped structure, or anything else that is shared uniquely between the I/O request and its completion).

Member Author

Ah, I just realised why the event is needed. I was always testing code that raised both the ThreadPoolIOPack and ThreadPoolIOEnqueue events, not the native (file/socket) overlapped I/O where only ThreadPoolIOPack is raised. I will look into it tomorrow to see how to add the event back.

Member Author

I gave it a whirl, but I am not happy with the result. Now ThreadPoolIOEnqueue and ThreadPoolIOPack appear in different providers. They can be correlated based on the parameters, but it's a bit suboptimal. Moreover, PerfView always picks the manifest for FrameworkEventSource from the full NetFX instead of the CoreFX version, so the new event is not properly decoded.

Member

Would it be too bad to keep emitting the event using FCall?

Member Author

@jkotas Probably not. I have a prototype ready, but I haven't benchmarked it yet. I also tried emitting the exact same event from managed code, but could not avoid getting some information from the runtime and requiring an FCall anyway.


If you need to make an FCall anyway, it would be nice to simply log the event in native code, which avoids breaking compat.

Member Author

@filipnavara filipnavara Mar 8, 2019

I need an FCall if I try to log the event in the same format as before. It's not necessary if I use FrameworkEventSource.

I ran some benchmarks, but the results were not very appealing and were quite inconsistent (some runs were 2x as slow as others). I suspect there's something wrong either with the implementation or with the benchmark... I'll look into it more.

Member Author

Apparently it happens on the machine even for the unchanged CoreCLR, so I guess it is some power management issue :-/

@filipnavara
Member Author

The tests are failing because the ArgumentException has a different parameter name. I can re-wrap the exception or change the CoreFX test. Neither name is actually the correct name of the parameter. The CoreFX test expects null, which CoreCLR used to return. In the new code GCHandle throws the exception and uses value as the parameter name. The actual name of the parameter in the public function is userData.

@filipnavara
Member Author

filipnavara commented Mar 5, 2019

The last remaining native function is CheckVMForIOPacket. As far as I can see, it is an optimization that tries to improve performance when multiple packets are processed in succession. It seems it could be avoided completely if the check is moved just behind this line and kept directly in the thread pool:

((LPOVERLAPPED_COMPLETION_ROUTINE) key)(errorCode, numBytes, pOverlapped);

(or in BindIoCompletionCallBack_Worker)

I am not sure what condition is handled by the if (overlapped->m_callback == NULL) branch:

if (overlapped->m_callback == NULL)
{
//We're not initialized yet, go back to the Vm, and process the packet there.
ThreadpoolMgr::StoreOverlappedInfoInThread(pThread, *errorCode, *numBytes, key, *lpOverlapped);
*lpOverlapped = NULL;
return;
}

/cc @jkotas

@jkotas
Member

jkotas commented Mar 5, 2019

> CheckVMForIOPacket ... As far as I can see it is an optimization

Yes, it is a perf optimization. As with any perf optimization, any changes to it should be measured.

@jkotas
Member

jkotas commented Mar 5, 2019

> change the CoreFX test.

It is fine to update the CoreFX tests to expect the correct argument name. (Disable them in https://github.com/dotnet/coreclr/blob/master/tests/CoreFX/CoreFX.issues.json to make the CoreCLR PR green.)

@jkotas
Member

jkotas commented Mar 5, 2019

cc @kouvel

filipnavara added a commit to filipnavara/corefx that referenced this pull request Mar 6, 2019
@filipnavara filipnavara changed the title WIP: Move most of Overlapped code to managed CoreLib Move Overlapped implementation to managed code in CoreLib Mar 6, 2019
@@ -1440,27 +1412,7 @@ void GCToEEInterface::WalkAsyncPinned(Object* object, void* context, void (*call
assert(object != nullptr);
Member

We can just assert false here. This method should be unreachable now.

Member Author

Apparently it's still called. The CI fails on it big time.

Member

Ok, I see why. Could you please change it back?

Member Author

Sure

ThreadPoolDequeueWork((long)*((void**)Unsafe.AsPointer(ref workID)));
}

[Event(32, Level = EventLevel.Verbose, Keywords = Keywords.ThreadPool)]
Member Author

I am not sure how the event IDs are assigned. I looked up what CoreFX and NetFX had and used the next available one.


The only important thing about the ID is that it is unique. The only thing that makes it at all tricky is that we have two versions of this file (.NET Core and .NET Desktop), and we want the ID to be unique considering both of them. As long as we add events to .NET Core every time we do to Desktop, everything is OK (and I took a look, and this seems to be the case).

The long and the short of this is that 32 is OK to use.

Could you put in a comment that we should be using the IDs in the range from 33 to 149 first for any new events?

FreeNativeOverlapped();

if (success && FrameworkEventSource.Log.IsEnabled(EventLevel.Verbose, FrameworkEventSource.Keywords.ThreadPool))
System.Diagnostics.Tracing.FrameworkEventSource.Log.ThreadPoolIOPackWork((long)(IntPtr)_pNativeOverlapped);
Member Author

This needs to be checked to see if it produces the same ID as the native code to allow the events to be properly correlated.

Member Author

It doesn't contain all the relevant data yet. I finally got to check it... working on it.

@filipnavara
Member Author

My benchmark machine is broken, so I cannot re-run the benchmark to test for regression. Code is linked in the PR description if anyone wants to give it a go.


HELPER_METHOD_FRAME_BEGIN_RET_0();

if (ETW_EVENT_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_Context, ThreadPoolIODequeue))
Member

The check of this condition can be moved before HELPER_METHOD_FRAME_BEGIN to minimize overhead when tracing is off.

Member Author

I wanted to ask about that.

@filipnavara
Member Author

filipnavara commented Mar 9, 2019

The benchmark is definitely triggering some code that results in quite unreliable performance. Here are some results I managed to get:

Before:

| Method | Mean | Error | StdDev |
| --- | --- | --- | --- |
| AllocOverlappedNull | 13.00 us | 0.2599 us | 0.6176 us |
| AllocOverlappedObject | 13.03 us | 0.3933 us | 1.0222 us |
| AllocOverlappedArray | 13.75 us | 0.4887 us | 1.3623 us |

After:

| Method | Mean | Error | StdDev |
| --- | --- | --- | --- |
| AllocOverlappedNull | 12.01 us | 0.1646 us | 0.1459 us |
| AllocOverlappedObject | 12.08 us | 0.2172 us | 0.2032 us |
| AllocOverlappedArray | 13.21 us | 0.2812 us | 0.8114 us |

The problem is that during the benchmark runs there are some quite visible stalls like this:

WorkloadActual  27: 65536 op, 791229500.00 ns, 12.0732 us/op
WorkloadActual  28: 65536 op, 807996900.00 ns, 12.3291 us/op
WorkloadActual  29: 65536 op, 3896247700.00 ns, 59.4520 us/op
WorkloadActual  30: 65536 op, 788609100.00 ns, 12.0332 us/op
WorkloadActual  31: 65536 op, 972309200.00 ns, 14.8363 us/op

While most of the results are within 11-15 us/op, there is a significant number of outliers at 49-72 us/op. It could be the thread pool code, the GC, or something else, but it happens both before and after my changes.

@jkotas
Member

jkotas commented Mar 10, 2019

> While most of the results are within 11-15 us/op there's significant number of outliers which are 49-72 us/op

The benchmark seems to be dominated by the thread pool overhead, which can vary for the reasons you have mentioned. How much of this number is the threadpool overhead vs. the actual cost of Overlapped? What would the numbers be for raw alloc/free (e.g. #18360 (comment))?

The fixed cost of Overlapped did show up during socket performance investigations, e.g. #10302 or #21320.

@filipnavara
Member Author

I'll benchmark pure Pack/Unpack, but it's not fully representative since it will not show the costs incurred during garbage collection by the old code.

@filipnavara
Member Author

Reduced/fixed benchmark code:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Threading;

namespace OverlappedBench
{
    [InProcess]
    public class Program
    {
        [Benchmark]
        public unsafe void AllocOverlappedNull() => AllocOverlapped(null);
        [Benchmark]
        public unsafe void AllocOverlappedObject() => AllocOverlapped(IntPtr.Zero);
        [Benchmark]
        public unsafe void AllocOverlappedArray() => AllocOverlapped(new[] { IntPtr.Zero, IntPtr.Zero, IntPtr.Zero });

        private unsafe void AllocOverlapped(object userObject)
        {
            Overlapped ov = new Overlapped();
            NativeOverlapped* nativeOverlapped = ov.Pack(null, userObject);
            Overlapped.Unpack(nativeOverlapped);
            Overlapped.Free(nativeOverlapped);
        }

        static void Main(string[] args)
        {
            BenchmarkRunner.Run<Program>();
        }
    }
}

Before:

| Method | Mean | Error | StdDev |
| --- | --- | --- | --- |
| AllocOverlappedNull | 162.1 ns | 1.1808 ns | 0.9219 ns |
| AllocOverlappedObject | 173.7 ns | 0.9135 ns | 0.8545 ns |
| AllocOverlappedArray | 175.8 ns | 0.9807 ns | 0.9174 ns |

After:

| Method | Mean | Error | StdDev |
| --- | --- | --- | --- |
| AllocOverlappedNull | 161.4 ns | 1.7787 ns | 1.6638 ns |
| AllocOverlappedObject | 172.0 ns | 0.6487 ns | 0.5417 ns |
| AllocOverlappedArray | 176.4 ns | 0.7456 ns | 0.6975 ns |

@filipnavara
Member Author

Note that even with the updated benchmark I still get at least one time spike within a run of the three benchmarks, so there could still be something wrong. It happens both before and after the changes, and it is filtered out by BDN. Here's one example from the warmup phase, but it happened during actual benchmark runs too:

WorkloadWarmup   1: 4194304 op, 703022900.00 ns, 167.6137 ns/op
WorkloadWarmup   2: 4194304 op, 691389500.00 ns, 164.8401 ns/op
WorkloadWarmup   3: 4194304 op, 2717302000.00 ns, 647.8553 ns/op
WorkloadWarmup   4: 4194304 op, 2700309200.00 ns, 643.8039 ns/op
WorkloadWarmup   5: 4194304 op, 2711021400.00 ns, 646.3579 ns/op
WorkloadWarmup   6: 4194304 op, 683177200.00 ns, 162.8821 ns/op

@jkotas
Member

jkotas commented Mar 11, 2019

Does the AllocOverlappedArray actually hit the object[] array path? I think it needs to be

byte[] buffer = new byte[1];
AllocOverlapped(new object[] { buffer, buffer, buffer });

@jkotas
Member

jkotas commented Mar 11, 2019

Also, I am not able to replicate these results with low-tech microbenchmarks. For example:

int start = Environment.TickCount;
for (int i = 0; i < 10000000; i++)
{
    object userObject = new IntPtr();
    Overlapped ov = new Overlapped();
    NativeOverlapped* nativeOverlapped = ov.Pack(null, userObject);
    Overlapped.Unpack(nativeOverlapped);
    Overlapped.Free(nativeOverlapped);
}
int end = Environment.TickCount;
Console.WriteLine((end-start).ToString());

Before: 1.9s
After: 12.2s

@filipnavara
Member Author

I'd expect the performance to be different, so I think there's something wrong with my benchmarking. BDN 0.14 dropped support for direct configurations with a local CoreCLR, so I had to switch to using dotnet run -c Release -- --coreRun <path>. I wasn't sure it was working, so I started printing the layout of the Overlapped class using reflection to ensure that it was really different before and after the changes.

I'll run the low-tech benchmark locally, but the difference on your run seems too big to be explained by tiered compilation.

@filipnavara
Member Author

Low-tech benchmark confirms your numbers. I get the same numbers as you on my machine.

I'll just close this as an unsuccessful attempt and perhaps revisit it some other day. I'll also try to get my BDN infrastructure working again, because it obviously broke with updates to the latest .NET Core / BDN versions and my attempts to get it working didn't succeed (the --coreRun trick from https://benchmarkdotnet.org/articles/configs/toolchains.html no longer works on my machine).

Thanks for all the help and sorry for wasting so much time on this.

@jkotas
Member

jkotas commented Mar 11, 2019

No problem. It was an interesting experiment.

jkotas added a commit to jkotas/runtime that referenced this pull request Aug 24, 2022
This change was attempted before in dotnet/coreclr#23029 and rejected due to performance impact. Things have changed since then that make it feasible now.

Sockets and file I/O do not use the pinning feature of Overlapped anymore. They pin memory on their own using `{ReadOnly}Memory<T>.Pin` instead. It means that the async pinned handles are typically not pinning anything. The async pinned handles come with some extra overhead in this common use case. Also, they cause confusion during GC behavior drill-downs. This change removes the support for async pinned handles from the GC:
- It makes the current most common Overlapped use cheaper. It is hard to measure the impact of eliminating async pinned handles exactly, since they are just a small part of the total GC costs. The unified fully managed implementation enabled simplification of the implementation and reduced allocations.
- It gets rid of confusing async pinned handles behavior. The change was actually motivated by a recent discussion with a customer who was surprised by the async pinned handles not pinning anything. They were not sure whether it is expected behavior or whether it is a bug in the diagnostic tools.

Micro-benchmarks for the pinning feature of Overlapped are going to regress with this change. The regression in a micro-benchmark that runs Overlapped.Pack/Unpack in a tight loop is about 20% for each pinned object. If there is 3rd-party code still using the pinning feature of Overlapped, Overlapped.Pack/Unpack is expected to be a tiny part of the end-to-end async flow and the regression for end-to-end scenarios is expected to be in the noise range.
jkotas added a commit to dotnet/runtime that referenced this pull request Aug 25, 2022