-
Notifications
You must be signed in to change notification settings - Fork 5.3k
This is an attempt to fix the remaining GC stress bugs on ARM64. #40931
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
This is an attempt to fix the remaining GC stress bugs on ARM64. #40931
Conversation
I found a dump for an AV in background_sweep where we had crashed due to a null method table, yet the object pointed at was a valid free object. So the theory is that a foreground GC allocated an object in gen 2, but the background GC thread doesn't have an up-to-date picture of memory. The synchronization between foreground and background GC is via allow_fgc() which switches from cooperative GC mode to preemptive mode and back. If the timing is such that the background GC thread never sees m_fPreemptiveGCDisabled being true, then we may not execute a a memory barrier. The fix tries to remedy this situation by introducing an interlocked instruction in DisablePreemptiveGC for non Intel-architecture targets. Whether this actually fixes the issues we are seeing is not clear yet.
|
Tagging subscribers to this area: @dotnet/gc |
|
I triggered gcstress-extra job - https://dev.azure.com/dnceng/public/_build/results?buildId=775013&view=results Let's see if it passes with this change |
|
No, we still observe the same failures, e.g. in the type generator tests. Back to the drawing board... |
|
Wonder if we are hitting this case in the inlined pinvoke stubs too -- we flip the thread in and out of preemptive mode there too by simply doing stores. |
| m_fPreemptiveGCDisabled.StoreWithoutBarrier(1); | ||
| #else | ||
| // weaker memory models need an interlocked operation to ensure consistency | ||
| FastInterlockOr(&m_fPreemptiveGCDisabled, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not be using any barrier here even for weak architectures. It would have significant performance consequences. @jkotas mentioned just recently that it is not unusual to see workloads that flip this flag million or more times per second. Reading that flag happens on many orders slower rate, basically just during GC suspension.
So our synchronization model is based on writing to the m_fPreemptiveGCDisabled without any barrier and executing process wide memory barrier (FlushProcessWriteBuffers) at the read time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right we don't need to do this. if a barrier is actually needed (which I'm not sure that it is...still thinking), we can add one on the GC side only when a suspension is required.
In reply to: 472863596 [](ancestors = 472863596)
|
As it turns out, the AV in gc_heap::background_sweep had a different cause - a race condition caused by inserting an object in GCHeap::StressHeap. Nevertheless, I still worry about issues with on architectures with weak ordering constraints either with background_gc or pinvoke. For this to become an issue, we'd have to just miss g_TrapReturningThreads being true on both the way in and the way out - if we see as being true, we will execute an interlocked instruction. I actually don't know if there's a problem - I'll see if I can write a test that provokes it. |
I found a dump for an AV in background_sweep where we had crashed due to a null method table, yet the object pointed at was a valid free object. The AV was at coreclr!WKS::gc_heap::background_sweep+0x668, we load the method table to compute the size of the object:
As a free object can only be generated by the GC itself, the theory is that a foreground GC allocated an object in gen 2 and created a free object for the unused space, but the background GC thread doesn't have an up-to-date picture of memory. The synchronization between foreground and background GC is via allow_fgc() which switches from cooperative GC mode to preemptive mode and back. If the timing is such that the background GC thread never sees m_fPreemptiveGCDisabled being true, then we may not execute a a memory barrier.
The fix tries to remedy this situation by introducing an interlocked instruction in DisablePreemptiveGC for non Intel-architecture targets. Whether this actually fixes the issues we are seeing is not clear yet.