Skip to content

Conversation

@dotnet-maestro
Copy link
Contributor

@dotnet-maestro dotnet-maestro bot commented Dec 12, 2024

This pull request updates the following dependencies

From https://github.com/dotnet/runtime

  • Subscription: d9f5b309-084f-43b5-02de-08d8b80548e4
  • Build: 20241217.3
  • Date Produced: December 17, 2024 6:40:45 PM UTC
  • Commit: 040cbe276907174316e2cc07b35814b3069874a6
  • Branch: refs/heads/main
Microsoft Reviewers: Open in CodeFlow

…1211.9

Microsoft.Internal.Runtime.WindowsDesktop.Transport , Microsoft.NET.Sdk.IL , Microsoft.NETCore.App.Ref , Microsoft.NETCore.App.Runtime.win-x64 , Microsoft.NETCore.ILAsm , Microsoft.NETCore.ILDAsm , Microsoft.NETCore.Platforms , Microsoft.Win32.Registry.AccessControl , Microsoft.Win32.SystemEvents , runtime.win-x64.Microsoft.NETCore.ILAsm , runtime.win-x86.Microsoft.NETCore.ILAsm , System.CodeDom , System.ComponentModel.Composition , System.ComponentModel.Composition.Registration , System.Configuration.ConfigurationManager , System.Data.Odbc , System.Data.OleDb , System.Diagnostics.EventLog , System.Diagnostics.PerformanceCounter , System.DirectoryServices , System.DirectoryServices.AccountManagement , System.DirectoryServices.Protocols , System.Formats.Nrbf , System.IO.Hashing , System.IO.Packaging , System.IO.Ports , System.Management , System.Reflection.Context , System.Reflection.MetadataLoadContext , System.Resources.Extensions , System.Runtime.Caching , System.Runtime.Serialization.Formatters , System.Security.Cryptography.Pkcs , System.Security.Cryptography.ProtectedData , System.Security.Cryptography.Xml , System.Security.Permissions , System.ServiceModel.Syndication , System.ServiceProcess.ServiceController , System.Speech , System.Text.Encoding.CodePages , System.Text.Encodings.Web , System.Text.Json , System.Threading.AccessControl , System.Windows.Extensions , VS.Redist.Common.NetCore.SharedFramework.x64.10.0
 From Version 10.0.0-alpha.1.24611.1 -> To Version 10.0.0-alpha.1.24611.9
@dotnet-maestro dotnet-maestro bot requested a review from a team as a code owner December 12, 2024 13:02
Copy link
Contributor

@dotnet-policy-service dotnet-policy-service bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go, you big red fire engine!

…1213.1

Microsoft.Internal.Runtime.WindowsDesktop.Transport , Microsoft.NET.Sdk.IL , Microsoft.NETCore.App.Ref , Microsoft.NETCore.App.Runtime.win-x64 , Microsoft.NETCore.ILAsm , Microsoft.NETCore.ILDAsm , Microsoft.NETCore.Platforms , Microsoft.Win32.Registry.AccessControl , Microsoft.Win32.SystemEvents , runtime.win-x64.Microsoft.NETCore.ILAsm , runtime.win-x86.Microsoft.NETCore.ILAsm , System.CodeDom , System.ComponentModel.Composition , System.ComponentModel.Composition.Registration , System.Configuration.ConfigurationManager , System.Data.Odbc , System.Data.OleDb , System.Diagnostics.EventLog , System.Diagnostics.PerformanceCounter , System.DirectoryServices , System.DirectoryServices.AccountManagement , System.DirectoryServices.Protocols , System.Formats.Nrbf , System.IO.Hashing , System.IO.Packaging , System.IO.Ports , System.Management , System.Reflection.Context , System.Reflection.MetadataLoadContext , System.Resources.Extensions , System.Runtime.Caching , System.Runtime.Serialization.Formatters , System.Security.Cryptography.Pkcs , System.Security.Cryptography.ProtectedData , System.Security.Cryptography.Xml , System.Security.Permissions , System.ServiceModel.Syndication , System.ServiceProcess.ServiceController , System.Speech , System.Text.Encoding.CodePages , System.Text.Encodings.Web , System.Text.Json , System.Threading.AccessControl , System.Windows.Extensions , VS.Redist.Common.NetCore.SharedFramework.x64.10.0
 From Version 10.0.0-alpha.1.24611.1 -> To Version 10.0.0-alpha.1.24613.1
…1213.10

Microsoft.Internal.Runtime.WindowsDesktop.Transport , Microsoft.NET.Sdk.IL , Microsoft.NETCore.App.Ref , Microsoft.NETCore.App.Runtime.win-x64 , Microsoft.NETCore.ILAsm , Microsoft.NETCore.ILDAsm , Microsoft.NETCore.Platforms , Microsoft.Win32.Registry.AccessControl , Microsoft.Win32.SystemEvents , runtime.win-x64.Microsoft.NETCore.ILAsm , runtime.win-x86.Microsoft.NETCore.ILAsm , System.CodeDom , System.ComponentModel.Composition , System.ComponentModel.Composition.Registration , System.Configuration.ConfigurationManager , System.Data.Odbc , System.Data.OleDb , System.Diagnostics.EventLog , System.Diagnostics.PerformanceCounter , System.DirectoryServices , System.DirectoryServices.AccountManagement , System.DirectoryServices.Protocols , System.Formats.Nrbf , System.IO.Hashing , System.IO.Packaging , System.IO.Ports , System.Management , System.Reflection.Context , System.Reflection.MetadataLoadContext , System.Resources.Extensions , System.Runtime.Caching , System.Runtime.Serialization.Formatters , System.Security.Cryptography.Pkcs , System.Security.Cryptography.ProtectedData , System.Security.Cryptography.Xml , System.Security.Permissions , System.ServiceModel.Syndication , System.ServiceProcess.ServiceController , System.Speech , System.Text.Encoding.CodePages , System.Text.Encodings.Web , System.Text.Json , System.Threading.AccessControl , System.Windows.Extensions , VS.Redist.Common.NetCore.SharedFramework.x64.10.0
 From Version 10.0.0-alpha.1.24611.1 -> To Version 10.0.0-alpha.1.24613.10
…1216.1

Microsoft.Internal.Runtime.WindowsDesktop.Transport , Microsoft.NET.Sdk.IL , Microsoft.NETCore.App.Ref , Microsoft.NETCore.App.Runtime.win-x64 , Microsoft.NETCore.ILAsm , Microsoft.NETCore.ILDAsm , Microsoft.NETCore.Platforms , Microsoft.Win32.Registry.AccessControl , Microsoft.Win32.SystemEvents , runtime.win-x64.Microsoft.NETCore.ILAsm , runtime.win-x86.Microsoft.NETCore.ILAsm , System.CodeDom , System.ComponentModel.Composition , System.ComponentModel.Composition.Registration , System.Configuration.ConfigurationManager , System.Data.Odbc , System.Data.OleDb , System.Diagnostics.EventLog , System.Diagnostics.PerformanceCounter , System.DirectoryServices , System.DirectoryServices.AccountManagement , System.DirectoryServices.Protocols , System.Formats.Nrbf , System.IO.Hashing , System.IO.Packaging , System.IO.Ports , System.Management , System.Reflection.Context , System.Reflection.MetadataLoadContext , System.Resources.Extensions , System.Runtime.Caching , System.Runtime.Serialization.Formatters , System.Security.Cryptography.Pkcs , System.Security.Cryptography.ProtectedData , System.Security.Cryptography.Xml , System.Security.Permissions , System.ServiceModel.Syndication , System.ServiceProcess.ServiceController , System.Speech , System.Text.Encoding.CodePages , System.Text.Encodings.Web , System.Text.Json , System.Threading.AccessControl , System.Windows.Extensions , VS.Redist.Common.NetCore.SharedFramework.x64.10.0
 From Version 10.0.0-alpha.1.24611.1 -> To Version 10.0.0-alpha.1.24616.1
…1216.9

Microsoft.Internal.Runtime.WindowsDesktop.Transport , Microsoft.NET.Sdk.IL , Microsoft.NETCore.App.Ref , Microsoft.NETCore.App.Runtime.win-x64 , Microsoft.NETCore.ILAsm , Microsoft.NETCore.ILDAsm , Microsoft.NETCore.Platforms , Microsoft.Win32.Registry.AccessControl , Microsoft.Win32.SystemEvents , runtime.win-x64.Microsoft.NETCore.ILAsm , runtime.win-x86.Microsoft.NETCore.ILAsm , System.CodeDom , System.ComponentModel.Composition , System.ComponentModel.Composition.Registration , System.Configuration.ConfigurationManager , System.Data.Odbc , System.Data.OleDb , System.Diagnostics.EventLog , System.Diagnostics.PerformanceCounter , System.DirectoryServices , System.DirectoryServices.AccountManagement , System.DirectoryServices.Protocols , System.Formats.Nrbf , System.IO.Hashing , System.IO.Packaging , System.IO.Ports , System.Management , System.Reflection.Context , System.Reflection.MetadataLoadContext , System.Resources.Extensions , System.Runtime.Caching , System.Runtime.Serialization.Formatters , System.Security.Cryptography.Pkcs , System.Security.Cryptography.ProtectedData , System.Security.Cryptography.Xml , System.Security.Permissions , System.ServiceModel.Syndication , System.ServiceProcess.ServiceController , System.Speech , System.Text.Encoding.CodePages , System.Text.Encodings.Web , System.Text.Json , System.Threading.AccessControl , System.Windows.Extensions , VS.Redist.Common.NetCore.SharedFramework.x64.10.0
 From Version 10.0.0-alpha.1.24611.1 -> To Version 10.0.0-alpha.1.24616.9
@lonitra
Copy link
Member

lonitra commented Dec 17, 2024

We are observing access violations / ExecutionEngineExceptions in our tests with this update. I tried to investigate further by debugging our ScrollBarTests suite, which is where I first saw tests aborting. Unfortunately, it is inconsistent on where it is seen surfacing.. I have seen it surface when xunit tries to invoke test methods, doing simple null checks, when trying to create new objects. Whenever there is an access violation typically message says something like:

Exception thrown at 0x00007FF825E2C0D6 (coreclr.dll) in testhost.exe: 0xC0000005: Access violation reading location 0x0000000000000020.

These are the commits included in the first dependency update dotnet/runtime@aa9cd3b...c646425 but it is not obvious to me what may have caused this behavior. @ericstj are you aware of any changes that could have caused this or who can help investigate?

@ericstj
Copy link
Member

ericstj commented Dec 17, 2024

@lonitra any chance you have a dump?

Of those commits identified, the one that seems the most likely is dotnet/runtime@69d8076 cc @VSadov @jkotas.

@jkotas
Copy link
Member

jkotas commented Dec 18, 2024

Of those commits identified, the one that seems the most likely is dotnet/runtime@69d8076

I have confirmed that it is caused by this change. I am reverting it in dotnet/runtime#110801

The crash repros fairly reliably for me locally. It is not obvious what the problem is.

@VSadov
Copy link
Member

VSadov commented Dec 18, 2024

My guess - either winforms scenarios use FLS in a conflicting manner, or something that we do on thread detach in CoreCLR is somehow dependent on holding the loader lock (that would be a bit inconvenient).

…1217.3

Microsoft.Internal.Runtime.WindowsDesktop.Transport , Microsoft.NET.Sdk.IL , Microsoft.NETCore.App.Ref , Microsoft.NETCore.App.Runtime.win-x64 , Microsoft.NETCore.ILAsm , Microsoft.NETCore.ILDAsm , Microsoft.NETCore.Platforms , Microsoft.Win32.Registry.AccessControl , Microsoft.Win32.SystemEvents , runtime.win-x64.Microsoft.NETCore.ILAsm , runtime.win-x86.Microsoft.NETCore.ILAsm , System.CodeDom , System.ComponentModel.Composition , System.ComponentModel.Composition.Registration , System.Configuration.ConfigurationManager , System.Data.Odbc , System.Data.OleDb , System.Diagnostics.EventLog , System.Diagnostics.PerformanceCounter , System.DirectoryServices , System.DirectoryServices.AccountManagement , System.DirectoryServices.Protocols , System.Formats.Nrbf , System.IO.Hashing , System.IO.Packaging , System.IO.Ports , System.Management , System.Reflection.Context , System.Reflection.MetadataLoadContext , System.Resources.Extensions , System.Runtime.Caching , System.Runtime.Serialization.Formatters , System.Security.Cryptography.Pkcs , System.Security.Cryptography.ProtectedData , System.Security.Cryptography.Xml , System.Security.Permissions , System.ServiceModel.Syndication , System.ServiceProcess.ServiceController , System.Speech , System.Text.Encoding.CodePages , System.Text.Encodings.Web , System.Text.Json , System.Threading.AccessControl , System.Windows.Extensions , VS.Redist.Common.NetCore.SharedFramework.x64.10.0
 From Version 10.0.0-alpha.1.24611.1 -> To Version 10.0.0-alpha.1.24617.3
@jkotas
Copy link
Member

jkotas commented Dec 18, 2024

@VSadov The symptoms of this crash are very similar to dotnet/runtime#110442 . One possible explanation is that there is a subtle race condition where we forget to unregister the dying thread and the FLS change just made it more likely to be hit.

@jkotas
Copy link
Member

jkotas commented Dec 18, 2024

I was able to catch the problem under debugger. The problem is that COM apartment shutdown ends up calling into managed code and re-initializes the Thread at the callstack below. The re-initialized Thread won't be unregistered and we will crash later when trying to access it:

combase!PeekTillDone+0x47
combase!OXIDEntry::WaitForApartmentShutdown+0x2b
combase!OXIDEntry::StopServer+0x52
combase!CComApartment::StopServer+0x31
combase!StopThread+0x3d
combase!ApartmentUninitialize+0x91
combase!wCoUninitialize+0x2cf
combase!CoUninitialize+0x197
combase!DoThreadSpecificCleanup+0x5d
combase!FlsThreadCleanupCallback+0x29
ntdll!RtlpFlsCallbackPerform+0xa
ntdll!RtlpFlsDataCleanup+0xff
ntdll!RtlProcessFlsData+0x15
ntdll!LdrShutdownThread+0x49
ntdll!RtlExitUserThread+0x46

@VSadov
Copy link
Member

VSadov commented Dec 18, 2024

By the looks of FlsThreadCleanupCallback it seems like combase uses similar trick with FLS as we do.

Are you seeing a scenario when we are called first (and detach) and then their callback gets called and they end up calling managed code? (it is not clear from the stack what calls managed code)

@jkotas
Copy link
Member

jkotas commented Dec 18, 2024

Are you seeing a scenario when we are called first (and detach) and then their callback gets called and they end up calling managed code?

Yes.

@VSadov
Copy link
Member

VSadov commented Dec 18, 2024

Our assumption that a thread cannot call into managed code after calling our FLS callback is not correct then.
~destructors of c++ thread-local objects could, in theory, have the same issue - anyone can set up one and do whatever in in it.

@jkotas
Copy link
Member

jkotas commented Dec 19, 2024

Our assumption that a thread cannot call into managed code after calling our FLS callback

Dtto for regular TLS callbacks - I guess that it is the problem in dotnet/runtime#110442. We wish that it can declared as unsupported scenario, but we have not done a good job failing deterministically in this situation, so there is likely quite a bit out code that depends on it "working".

@VSadov
Copy link
Member

VSadov commented Dec 19, 2024

At high level dotnet/runtime#110442 seems similar - it also seems to be an issue with thread destruction notification. Although it is a different platform with different notification mechanism and from the discussion it appears the notification just did not arrive (or was not acted upon) before TLS was destroyed and then GC happened and crashed. Here the notification did arrive, but the thread found a way to do a reverse pinvoke after that.

Unix issue feels simpler - we should be able to be notified and detach before TLS is destroyed. Something did not work as expected.

The issue with thread termination callback itself entering managed code seems a bit harder to deal with. I do not have any good ideas at the moment.

FLS callbacks are not very commonly used and the documentation says that they should be simple, basically just memory deallocation. Perhaps there is a way to prevent entering managed code from that?

@jkotas
Copy link
Member

jkotas commented Dec 19, 2024

Here the notification did arrive, but the thread found a way to do a reverse pinvoke after that.

My guess is that the Unix issue is the same: The notification did arrive, but the thread found a way to do a reverse pinvoke after that via thread static in some native library.

FLS callbacks are not very commonly used and the documentation says that they should be simple, basically just memory deallocation. Perhaps there is a way to prevent entering managed code from that?

We can fail-fast if the thread tries to re-enter managed code after the thread destruction notification (e.g. we can have a bool thread static that is set in the thread destruction notification). But it would be a severe breaking change in combination with the FLS change on Windows.

@jkotas
Copy link
Member

jkotas commented Dec 19, 2024

FLS callbacks are not very commonly used and the documentation says that they should be simple, basically just memory deallocation.

The COM FLS fallback is anything but simple. It pumps Windows messages... .

it would be a severe breaking change in combination with the FLS change on Windows.

Brainstorming about possible solutions:

  • Ensure that our FLS slot is allocated after COM FLS slot, so that our callback comes after the COM callback. It would not be pretty, but I am sure that it would fix the winforms crash.
    (or)
  • Do a full attach/detach on attempt to enter managed code after thread destruction notification. It would make JIT_ReversePInvokeExit a bit slower (we would have to check a flag and to do full detach if the flag is set)

@VSadov
Copy link
Member

VSadov commented Dec 19, 2024

it would be a severe breaking change in combination with the FLS change on Windows.

On that I agree. Poisoning a thread after detaching will get us in trouble in exactly same scenarios.

Do a full attach/detach on attempt to enter managed code after thread destruction notification.

In theory it could work. Everything is synchronous as it is local to the thread, so no races to be concerned about.
Possible problems could be:

  • I am not sure if registering an FLS slot from a detach callback is something that API expects. (same applies to any other kind of thread termination notifications - is it ok to reregister in a callback?).
  • might be some subtle issues with managed thread identity - is it the "same" thread after reattachment, is the managed ID the same, are thread statics resurrected somehow, does the thread still own the same locks (although terminating thread while owning locks is mostly UB anyways...).
    It feels like it might need to be a new managed thread, while native part would still remember old things.

It would make JIT_ReversePInvokeExit a bit slower

If the idea is to not reregister with OS at all and just detach on every PInvokeExit, then it eliminates the concern #1 (can we reregister at all), while the concern #2 (thread's identity) may get worse.

Ensure that our FLS slot is allocated after COM FLS slot, so that our callback comes after the COM callback.

If registering FLS slot earlier guarantees that callback is called later (i.e. LIFO), this might be sufficient.

We'd need just to be sure that the FLS is allocated before everything else when a thread is attached.

(another assumption is that anything that allocates FLS slots before calling managed code will not call managed code in its callback - not the 100% robust, but plausible assumption, I think)

@jkotas
Copy link
Member

jkotas commented Dec 19, 2024

If registering FLS slot earlier guarantees that callback is called later (i.e. LIFO), this might be sufficient.

It is the other way around: Callbacks registered earlier get called earlier. (As long as there is no FLS slot reuse.) We registered our FLS callback before the COM one, so it got called before the COM one.

@jkotas
Copy link
Member

jkotas commented Dec 19, 2024

It feels like it might need to be a new managed thread, while native part would still remember old things.

I think it can be the same managed/native combo. We can use the thread static callback that gets executed under loader lock to tell signal that the managed/native combo is not needed anymore (without entering the coop mode).

@VSadov
Copy link
Member

VSadov commented Dec 19, 2024

Also wonder why we did not see this before.
I guess it is just because FLS callbacks are called before DllMain (or whatever calls threadlocal destructors).

@VSadov
Copy link
Member

VSadov commented Dec 19, 2024

What exactly managed code do they need to run on apartment shutdown? Can we move that to native or fake it and not run at all (conditionally - if thread is detached) ?

@jkotas
Copy link
Member

jkotas commented Dec 19, 2024

What exactly managed code do they need to run on apartment shutdown?

The managed code that I have seen running under WaitForApartmentShutdown was a managed Windows Proc registered by WinForms. It run because of the thread was pumping windows messages and the message happened to be for the window owned by WinForms.

@lonitra
Copy link
Member

lonitra commented Dec 20, 2024

Thank you for investigating this! I'm closing this PR as #12655 has superseded this and contains the fix.

@lonitra lonitra closed this Dec 20, 2024
@dotnet-maestro dotnet-maestro bot deleted the darc-main-c234ee6c-30cc-4517-9cda-e9d43eb82c42 branch December 20, 2024 17:15
@VSadov
Copy link
Member

VSadov commented Dec 22, 2024

I was able to catch the problem under debugger.

@jkotas what was your way to repro this? Did you run a particular test or a directed repro sample?

@jkotas
Copy link
Member

jkotas commented Dec 22, 2024

@jkotas what was your way to repro this? Did you run a particular test or a directed repro sample?

Build and run all winforms tests against the changes in this PR. The crash occurs when running the WinForms unit tests (System.Windows.Forms.Tests). I have not isolated the specific test that triggers the crash. I have used coreclr.dll with custom instrumentation to figure out what's going on.

@VSadov
Copy link
Member

VSadov commented Jan 7, 2025

As I run winforms tests I see a failure that does not happen without my changes. It must be related, but it looks different from scenario described above.
It looks like a GC hole when ScanThreadStaticRoots reports roots to GC. The thread that is being scanned does not look like exited though. Perhaps there are some indirection steps that lead to this state - thread exits, then reenters somehow, then GC happens ...
More digging is needed.

@VSadov
Copy link
Member

VSadov commented Jan 10, 2025

Ensure that our FLS slot is allocated after COM FLS slot, so that our callback comes after the COM callback. It would not be pretty, but I am sure that it would fix the winforms crash.

I can confirm that this approach works.
(i.e. the following two examples lead to fixing the failures in WinForm tests).

@VSadov
Copy link
Member

VSadov commented Jan 10, 2025

One way to ensure that COM FLS slot has lower index is by skipping a few slots when allocating ours.

Like:

void InitFlsSlot()
{
    // We use fiber detach callbacks to run our thread shutdown code because the fiber detach
    // callback is made without the OS loader lock

    // alloc some slots and then release, so that if COM needs a slot, it could reuse one that we have released.

    DWORD indices[42];
    for (int i = 0; i < 42; i++)
    {
        indices[i] = FlsAlloc(0);
    }

    g_flsIndex = FlsAlloc(FiberDetachCallback);
    if (g_flsIndex == FLS_OUT_OF_INDEXES)
    {
        COMPlusThrowWin32();
    }

    for (int i = 0; i < 42; i++)
    {
        FlsFree(indices[i]);
    }
}

The problem here is that we do not know how many slots to skip.

A typical program will use just 10-20 FLS slots, that is why the original limit for the slot count was 128, but some programs (like something with numerous dynamically loaded native plugins) could use a lot - that is why the limit, which is just an abitrary number, was increased to 4000 in Win10 18312 (https://blogs.windows.com/windows-insider/2019/01/09/announcing-windows-10-insider-preview-build-18312/)

@VSadov
Copy link
Member

VSadov commented Jan 10, 2025

Another way to ensure that COM FLS slot is created before ours is by pulsing CoInitialize just before we create our FLS slot.

void InitFlsSlot()
{
    // We use fiber detach callbacks to run our thread shutdown code because the fiber detach
    // callback is made without the OS loader lock

    // TODO: WinRT uses different APIs
    // Pulse CoInitialize to ensure that COM FLS slot exists
    if (SUCCEEDED(::CoInitialize(NULL)))
    {
        ::CoUninitialize();
    }

    g_flsIndex = FlsAlloc(FiberDetachCallback);
    if (g_flsIndex == FLS_OUT_OF_INDEXES)
    {
        COMPlusThrowWin32();
    }
}

One inconvenience here is that if EE is initialized on some special thread that can't do COM, then CoInitialize may fail and not allocate a slot.
That is probably an unusual corner case, but it is a possibility.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 10, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants