
Conversation

@PeterSolMS
Contributor

I found a dump for an AV in background_sweep where we had crashed due to a null method table, even though the object pointed at was a valid free object. The AV was at coreclr!WKS::gc_heap::background_sweep+0x668, where we load the method table to compute the size of the object:

            while ((o < end) && !background_object_marked (o, FALSE))
            {
                o = o + Align (size (o), align_const);    // <-- AV here
                current_num_objs++;
                if (current_num_objs >= num_objs)
                {
                    current_sweep_pos = plug_end;
                    dprintf (1234, ("f: swept till %Ix", current_sweep_pos));
                    allow_fgc();
                    current_num_objs = 0;
                }
            }
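To make the failure mode concrete: computing size(o) has to dereference the object's method table pointer, so a null or stale method table at o faults immediately. Below is a minimal sketch of that dependency, not the real CoreCLR object layout; MethodTable, ObjHeader, object_size, and align_up are simplified stand-ins invented for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Simplified model (NOT the real CoreCLR layout): every object begins with a
// method-table pointer, and computing the object's size must dereference it.
struct MethodTable { std::size_t baseSize; };

struct ObjHeader { MethodTable* pMT; };

// Analogous to size(o) in the sweep loop: if pMT is null or stale because this
// thread's view of memory is behind, this load is where the AV happens.
std::size_t object_size(const ObjHeader* o)
{
    return o->pMT->baseSize;
}

// Analogous to Align(): round a size up to the given power-of-two alignment.
std::size_t align_up(std::size_t s, std::size_t align)
{
    return (s + align - 1) & ~(align - 1);
}
```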

As a free object can only be generated by the GC itself, the theory is that a foreground GC allocated an object in gen 2 and created a free object for the unused space, but the background GC thread doesn't have an up-to-date picture of memory. The synchronization between foreground and background GC is via allow_fgc(), which switches from cooperative GC mode to preemptive mode and back. If the timing is such that the background GC thread never sees m_fPreemptiveGCDisabled being true, then we may not execute a memory barrier.
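The hazard described above can be sketched with std::atomic standing in for the runtime's flag: a plain (relaxed) store of m_fPreemptiveGCDisabled gives no ordering guarantee on weakly ordered CPUs, while an interlocked operation does. This is a hedged illustration of the two write styles, not the runtime's actual code; the function names are invented for the sketch.

```cpp
#include <atomic>

// Stand-in for the thread's cooperative-mode flag.
std::atomic<int> m_fPreemptiveGCDisabled{0};

// Analogous to StoreWithoutBarrier: a plain store. On weakly ordered
// architectures (e.g. ARM), earlier heap writes are not guaranteed to be
// visible to another thread before this flag write is.
void disable_preemptive_gc_plain()
{
    m_fPreemptiveGCDisabled.store(1, std::memory_order_relaxed);
}

// Analogous to FastInterlockOr: an interlocked read-modify-write, which acts
// as a full barrier on all architectures.
void disable_preemptive_gc_interlocked()
{
    m_fPreemptiveGCDisabled.fetch_or(1, std::memory_order_seq_cst);
}
```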

The fix tries to remedy this situation by introducing an interlocked instruction in DisablePreemptiveGC for non-Intel architecture targets. Whether this actually fixes the issues we are seeing is not yet clear.

@ghost

ghost commented Aug 17, 2020

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@echesakov
Contributor

@PeterSolMS
Contributor Author

No, we still observe the same failures, e.g. in the type generator tests. Back to the drawing board...

@AndyAyersMS
Member

Wonder if we are hitting this case in the inlined pinvoke stubs too -- there we also flip the thread in and out of preemptive mode by simply doing stores.

    m_fPreemptiveGCDisabled.StoreWithoutBarrier(1);
#else
    // weaker memory models need an interlocked operation to ensure consistency
    FastInterlockOr(&m_fPreemptiveGCDisabled, 1);
Member

We should not be using any barrier here, even for weak architectures. It would have significant performance consequences. @jkotas mentioned just recently that it is not unusual to see workloads that flip this flag a million or more times per second. Reading the flag happens at a rate many orders of magnitude slower, basically just during GC suspension.
So our synchronization model is based on writing to m_fPreemptiveGCDisabled without any barrier and executing a process-wide memory barrier (FlushProcessWriteBuffers) at read time.
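That asymmetric scheme can be sketched as follows: the hot path (every mode flip) writes the flag with no barrier, and the rare reader (GC suspension) pays for all the ordering by forcing a process-wide barrier before reading every thread's flag. In this sketch, a seq_cst fence stands in for FlushProcessWriteBuffers, which in the real runtime forces a barrier on all threads (e.g. via membarrier on Linux); ThreadState and the function names are invented for illustration.

```cpp
#include <atomic>
#include <vector>

// One per managed thread; models m_fPreemptiveGCDisabled.
struct ThreadState { std::atomic<int> preemptiveGCDisabled{0}; };

// Hot path, executed up to millions of times per second: a plain store,
// deliberately with no barrier.
void enter_cooperative(ThreadState& t)
{
    t.preemptiveGCDisabled.store(1, std::memory_order_relaxed);
}

void leave_cooperative(ThreadState& t)
{
    t.preemptiveGCDisabled.store(0, std::memory_order_relaxed);
}

// Rare path, executed only during GC suspension: force the ordering once,
// process-wide, then read every thread's flag.
bool any_in_cooperative(std::vector<ThreadState>& threads)
{
    // Stand-in for FlushProcessWriteBuffers; the real primitive makes every
    // thread's prior writes visible before the loads below.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    for (auto& t : threads)
        if (t.preemptiveGCDisabled.load(std::memory_order_relaxed))
            return true;
    return false;
}
```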

Member

Right, we don't need to do this. If a barrier is actually needed (which I'm not sure it is... still thinking), we can add one on the GC side, only when a suspension is required.


In reply to: 472863596

@PeterSolMS
Contributor Author

As it turns out, the AV in gc_heap::background_sweep had a different cause - a race condition caused by inserting an object in GCHeap::StressHeap.

Nevertheless, I still worry about issues on architectures with weak ordering constraints, either with background GC or with pinvoke. For this to become a problem, we'd have to miss g_TrapReturningThreads being true both on the way in and on the way out - if we see it as true, we will execute an interlocked instruction.
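The fast/slow path split described here can be sketched as follows: the transition is a plain store plus a check of the trap flag, and only when the trap flag is observed set does the thread fall back to a properly synchronized path. This is a simplified model assuming that shape, with names modeled on the runtime's but logic invented for the sketch; the worry above is the window where both checks miss the flag.

```cpp
#include <atomic>

// Stand-ins for the runtime's globals (simplified model, not real VM code).
std::atomic<int> g_TrapReturningThreads{0};
std::atomic<int> m_fPreemptiveGCDisabled{1};
int g_rareDisablePathTaken = 0;   // instrumentation for this sketch only

// Leaving cooperative mode, e.g. on the way into a pinvoke call: a plain store.
void enable_preemptive_gc()
{
    m_fPreemptiveGCDisabled.store(0, std::memory_order_relaxed);
}

// Returning to cooperative mode on the way out.
void disable_preemptive_gc()
{
    m_fPreemptiveGCDisabled.store(1, std::memory_order_relaxed);
    if (g_TrapReturningThreads.load(std::memory_order_relaxed))
    {
        // Rare path: a suspension is pending, so synchronize properly with an
        // interlocked operation. The hazard is the window where this check
        // misses the flag both entering and leaving.
        m_fPreemptiveGCDisabled.exchange(1, std::memory_order_seq_cst);
        ++g_rareDisablePathTaken;
    }
}
```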

I actually don't know if there's a problem - I'll see if I can write a test that provokes it.

@PeterSolMS PeterSolMS closed this Aug 27, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020