Skip to content

[GC] Fix freelist crash - don't modify during check-only operation#104876

Merged
markples merged 1 commit intodotnet:mainfrom
markples:freelist-fix
Jul 15, 2024
Merged

[GC] Fix freelist crash - don't modify during check-only operation#104876
markples merged 1 commit intodotnet:mainfrom
markples:freelist-fix

Conversation

@markples
Copy link
Contributor

RegisterForFullGCNotification sets gc_heap::fgn_maxgen_percent, which guards calls to check_for_full_gc to determine if a notification should be sent. gc_heap::generation_to_condemn. One part of check_for_full_gc calls generation_to_condemn with the parameter check_only_p set to TRUE. generation_to_condemn calls try_get_new_free_region, which ensures that a free region is available. It checks the freelist, but if none are found it creates a new one and adds it to the freelist. check_for_full_gc does not (and should not) contain the necessary synchronization to make this safe. This can lead to corruption of the basic region freelist, often ending up with nullptr links between nodes.

This caused crashed in production and the cause was determined by using a custom GC build with additional runtime checks. The fix is to simply honor the check_only_p parameter and not call try_get_new_free_region when it is set. The problem can be reproed by changing try_get_new_free_region to always create a new region. This fix handles that repro.

This fix does not impact normal GC operation. It can cause check_for_full_gc to no longer detect that full collection is going to occur, but only if the failure to create or initialize a new region is the only reason why that full collection is going to occur. However, those cases are open to the race condition that can cause the freelist corruption is possible.

@ghost ghost added the area-GC-coreclr label Jul 14, 2024
@markples markples self-assigned this Jul 14, 2024
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@markples
Copy link
Contributor Author

The build analysis failure is #104883, which wasn't open yet when this testing ran but is closed now so rerunning just build analysis doesn't work.


#ifdef USE_REGIONS
if (!try_get_new_free_region())
if (!check_only_p)
Copy link
Contributor

@cshung cshung Jul 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Using && to avoid changing many lines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That annoyed me a bit too, but I followed the pattern at line 21544 instead. I assumed that it was to make it clear by separating the "mode" of the function and the check itself, which made sense to me.

Copy link
Contributor

@cshung cshung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@Maoni0 Maoni0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@markples
Copy link
Contributor Author

/ba-g see above comment about #104883

@markples markples merged commit f38e661 into dotnet:main Jul 15, 2024
@markples
Copy link
Contributor Author

/backport to release/8.0-staging

@github-actions
Copy link
Contributor

markples pushed a commit that referenced this pull request Jul 25, 2024
…104992)

Backport of #104876 to release/8.0-staging

/cc @markples

## Customer Impact

- [x] Customer reported
- [ ] Found internally

GC can crash in region code if GC.RegisterForFullGCNotification is used.

## Regression

- [x] Yes
- [ ] No

Regression is part of general move to regions, but specific code was introduced in #48691

## Testing

The behavior is very difficult to repro and likely hadn't occurred before because it relies on a race condition, a region allocation failing, and GC.RegisterForFullGCNotification be used.  I built a custom GC that always triggers the region allocation failure path and was able to repro and verify the fix in GCPerfSim.

## Risk

Low: only impacts GC.RegisterForFullGCNotification path; only impacts notifications; may cause a notification for a full GC to be skipped, but only in a case where no other reasons for full gc occur -and- the code would be subject to the race condition and crash
@github-actions github-actions bot locked and limited conversation to collaborators Aug 16, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants