
Conversation

@PeterSolMS
Contributor

I found a dump for an AV in background_sweep where we had crashed due to a null method table, even though the object pointed at was a valid free object. The AV was at coreclr!WKS::gc_heap::background_sweep+0x668, where we load the method table to compute the size of the object:

            while ((o < end) && !background_object_marked (o, FALSE))
            {
                o = o + Align (size (o), align_const);    // <-- AV here
                current_num_objs++;
                if (current_num_objs >= num_objs)
                {
                    current_sweep_pos = plug_end;
                    dprintf (1234, ("f: swept till %Ix", current_sweep_pos));
                    allow_fgc();
                    current_num_objs = 0;
                }
            }
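To make the failure mode concrete: computing size(o) has to dereference the object's method table pointer, so a null or stale method table at o faults immediately. Below is a minimal sketch of that dependency, not the real CoreCLR object layout; MethodTable, ObjHeader, object_size, and align_up are simplified stand-ins invented for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Simplified model (NOT the real CoreCLR layout): every object begins with a
// method-table pointer, and computing the object's size must dereference it.
struct MethodTable { std::size_t baseSize; };

struct ObjHeader { MethodTable* pMT; };

// Analogous to size(o) in the sweep loop: if pMT is null or stale because this
// thread's view of memory is behind, this load is where the AV happens.
std::size_t object_size(const ObjHeader* o)
{
    return o->pMT->baseSize;
}

// Analogous to Align(): round a size up to the given power-of-two alignment.
std::size_t align_up(std::size_t s, std::size_t align)
{
    return (s + align - 1) & ~(align - 1);
}
```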

As a free object can only be generated by the GC itself, the theory is that a foreground GC allocated an object in gen 2 and created a free object for the unused space, but the background GC thread doesn't have an up-to-date picture of memory. The synchronization between foreground and background GC is via allow_fgc(), which switches from cooperative GC mode to preemptive mode and back. If the timing is such that the background GC thread never sees m_fPreemptiveGCDisabled being true, then we may not execute a memory barrier.
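The hazard described above can be sketched with std::atomic standing in for the runtime's flag: a plain (relaxed) store of m_fPreemptiveGCDisabled gives no ordering guarantee on weakly ordered CPUs, while an interlocked operation does. This is a hedged illustration of the two write styles, not the runtime's actual code; the function names are invented for the sketch.

```cpp
#include <atomic>

// Stand-in for the thread's cooperative-mode flag.
std::atomic<int> m_fPreemptiveGCDisabled{0};

// Analogous to StoreWithoutBarrier: a plain store. On weakly ordered
// architectures (e.g. ARM), earlier heap writes are not guaranteed to be
// visible to another thread before this flag write is.
void disable_preemptive_gc_plain()
{
    m_fPreemptiveGCDisabled.store(1, std::memory_order_relaxed);
}

// Analogous to FastInterlockOr: an interlocked read-modify-write, which acts
// as a full barrier on all architectures.
void disable_preemptive_gc_interlocked()
{
    m_fPreemptiveGCDisabled.fetch_or(1, std::memory_order_seq_cst);
}
```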

The fix tries to remedy this situation by introducing an interlocked instruction in DisablePreemptiveGC for non-Intel architecture targets. Whether this actually fixes the issues we are seeing is not yet clear.

@ghost

ghost commented Aug 17, 2020

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@echesakov
Contributor

@PeterSolMS
Contributor Author

No, we still observe the same failures, e.g. in the type generator tests. Back to the drawing board...

@AndyAyersMS
Member

Wonder if we are hitting this case in the inlined pinvoke stubs too -- there we also flip the thread in and out of preemptive mode by simply doing stores.

    m_fPreemptiveGCDisabled.StoreWithoutBarrier(1);
#else
    // weaker memory models need an interlocked operation to ensure consistency
    FastInterlockOr(&m_fPreemptiveGCDisabled, 1);
Member

We should not be using any barrier here, even for weak architectures. It would have significant performance consequences. @jkotas mentioned just recently that it is not unusual to see workloads that flip this flag a million or more times per second. Reading the flag happens at a rate many orders of magnitude slower, basically just during GC suspension.
So our synchronization model is based on writing to m_fPreemptiveGCDisabled without any barrier and executing a process-wide memory barrier (FlushProcessWriteBuffers) at read time.
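That asymmetric scheme can be sketched as follows: the hot path (every mode flip) writes the flag with no barrier, and the rare reader (GC suspension) pays for all the ordering by forcing a process-wide barrier before reading every thread's flag. In this sketch, a seq_cst fence stands in for FlushProcessWriteBuffers, which in the real runtime forces a barrier on all threads (e.g. via membarrier on Linux); ThreadState and the function names are invented for illustration.

```cpp
#include <atomic>
#include <vector>

// One per managed thread; models m_fPreemptiveGCDisabled.
struct ThreadState { std::atomic<int> preemptiveGCDisabled{0}; };

// Hot path, executed up to millions of times per second: a plain store,
// deliberately with no barrier.
void enter_cooperative(ThreadState& t)
{
    t.preemptiveGCDisabled.store(1, std::memory_order_relaxed);
}

void leave_cooperative(ThreadState& t)
{
    t.preemptiveGCDisabled.store(0, std::memory_order_relaxed);
}

// Rare path, executed only during GC suspension: force the ordering once,
// process-wide, then read every thread's flag.
bool any_in_cooperative(std::vector<ThreadState>& threads)
{
    // Stand-in for FlushProcessWriteBuffers; the real primitive makes every
    // thread's prior writes visible before the loads below.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    for (auto& t : threads)
        if (t.preemptiveGCDisabled.load(std::memory_order_relaxed))
            return true;
    return false;
}
```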

Member

Right, we don't need to do this. If a barrier is actually needed (which I'm not sure it is... still thinking), we can add one on the GC side, only when a suspension is required.


In reply to: 472863596

@PeterSolMS
Contributor Author

As it turns out, the AV in gc_heap::background_sweep had a different cause - a race condition caused by inserting an object in GCHeap::StressHeap.

Nevertheless, I still worry about issues on architectures with weak ordering constraints, either with background GC or with pinvoke. For this to become a problem, we'd have to miss g_TrapReturningThreads being true both on the way in and on the way out - if we see it as true, we will execute an interlocked instruction.
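The fast/slow path split described here can be sketched as follows: the transition is a plain store plus a check of the trap flag, and only when the trap flag is observed set does the thread fall back to a properly synchronized path. This is a simplified model assuming that shape, with names modeled on the runtime's but logic invented for the sketch; the worry above is the window where both checks miss the flag.

```cpp
#include <atomic>

// Stand-ins for the runtime's globals (simplified model, not real VM code).
std::atomic<int> g_TrapReturningThreads{0};
std::atomic<int> m_fPreemptiveGCDisabled{1};
int g_rareDisablePathTaken = 0;   // instrumentation for this sketch only

// Leaving cooperative mode, e.g. on the way into a pinvoke call: a plain store.
void enable_preemptive_gc()
{
    m_fPreemptiveGCDisabled.store(0, std::memory_order_relaxed);
}

// Returning to cooperative mode on the way out.
void disable_preemptive_gc()
{
    m_fPreemptiveGCDisabled.store(1, std::memory_order_relaxed);
    if (g_TrapReturningThreads.load(std::memory_order_relaxed))
    {
        // Rare path: a suspension is pending, so synchronize properly with an
        // interlocked operation. The hazard is the window where this check
        // misses the flag both entering and leaving.
        m_fPreemptiveGCDisabled.exchange(1, std::memory_order_seq_cst);
        ++g_rareDisablePathTaken;
    }
}
```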

I actually don't know if there's a problem - I'll see if I can write a test that provokes it.

@PeterSolMS PeterSolMS closed this Aug 27, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020