[GC] Fix freelist crash - don't modify during check-only operation#104876
[GC] Fix freelist crash - don't modify during check-only operation#104876markples merged 1 commit intodotnet:mainfrom
Conversation
|
Tagging subscribers to this area: @dotnet/gc |
|
The build analysis failure is #104883, which wasn't open yet when this testing ran but is closed now so rerunning just build analysis doesn't work. |
|
|
||
| #ifdef USE_REGIONS | ||
| if (!try_get_new_free_region()) | ||
| if (!check_only_p) |
There was a problem hiding this comment.
NIT: Using && to avoid changing many lines?
There was a problem hiding this comment.
That annoyed me a bit too, but I followed the pattern at line 21544 instead. I assumed that it was to make it clear by separating the "mode" of the function and the check itself, which made sense to me.
|
/ba-g see above comment about #104883 |
|
/backport to release/8.0-staging |
|
Started backporting to release/8.0-staging: https://github.com/dotnet/runtime/actions/runs/9962984209 |
…104992) Backport of #104876 to release/8.0-staging /cc @markples ## Customer Impact - [x] Customer reported - [ ] Found internally GC can crash in region code if GC.RegisterForFullGCNotification is used. ## Regression - [x] Yes - [ ] No Regression is part of general move to regions, but specific code was introduced in #48691 ## Testing The behavior is very difficult to repro and likely hadn't occurred before because it relies on a race condition, a region allocation failing, and GC.RegisterForFullGCNotification be used. I built a custom GC that always triggers the region allocation failure path and was able to repro and verify the fix in GCPerfSim. ## Risk Low: only impacts GC.RegisterForFullGCNotification path; only impacts notifications; may cause a notification for a full GC to be skipped, but only in a case where no other reasons for full gc occur -and- the code would be subject to the race condition and crash
RegisterForFullGCNotificationsetsgc_heap::fgn_maxgen_percent, which guards calls tocheck_for_full_gcto determine if a notification should be sent.gc_heap::generation_to_condemn. One part ofcheck_for_full_gccallsgeneration_to_condemnwith the parametercheck_only_pset toTRUE.generation_to_condemncallstry_get_new_free_region, which ensures that a free region is available. It checks the freelist, but if none are found it creates a new one and adds it to the freelist.check_for_full_gcdoes not (and should not) contain the necessary synchronization to make this safe. This can lead to corruption of the basic region freelist, often ending up withnullptrlinks between nodes.This caused crashed in production and the cause was determined by using a custom GC build with additional runtime checks. The fix is to simply honor the
check_only_pparameter and not calltry_get_new_free_regionwhen it is set. The problem can be reproed by changingtry_get_new_free_regionto always create a new region. This fix handles that repro.This fix does not impact normal GC operation. It can cause
check_for_full_gcto no longer detect that full collection is going to occur, but only if the failure to create or initialize a new region is the only reason why that full collection is going to occur. However, those cases are open to the race condition that can cause the freelist corruption is possible.