
Fix SegmentResortChains synchronization #107468

Merged
ChrisAhna merged 1 commit into dotnet:main from ChrisAhna:chrisahna-resortchainsfix
Sep 6, 2024

Conversation

@ChrisAhna
Member

Fixes #107467.

See the issue for details on the problem that is being fixed.

With this change in place, the repro app discussed in the issue runs indefinitely without ever failing (in the same environment where, without this change, the repro app always gets stuck in an infinite loop within the first several minutes of execution). This observed behavior is consistent across Checked and Release builds of the runtime.

Note that the new lock acquire statement in StandardSegmentIterator is identical to the ones which already run in FullSegmentIterator under certain conditions.

The new code translates to an additional lock acquire only during Gen1-or-larger foreground GCs that occur in an environment where preceding handle table activity in the process has driven "whole-block" allocs/frees and has therefore forced "fResortChains" back to true.

I believe that the handle table lock is generally lightly contended while foreground GC is in progress. So even in scenarios which manage to chronically drive "whole-block" allocs/frees in the steady state (and thus push the new lock acquire to the worst-case "one per handle table per Gen1-or-larger foreground GC" rate), I believe that the cost of the new lock acquire will generally be negligible.

(Other than adding a new lock acquire, I believe the only other option for fixing this problem would be to somehow rework the system to eliminate and ban any notion that handle table slots can be created or destroyed by the coreclr!Thread ctor or any other preemptive-mode code.)
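The pattern being fixed can be sketched in simplified form. This is an illustrative model only, not coreclr code: `fResortChains` is the real flag name from the handle table, but the `HandleTable` type and method names below are hypothetical stand-ins. The point is that the iterator must take the table lock before resorting the segment chain, as FullSegmentIterator already did, because preemptive-mode handle activity can mutate the chain and set the flag concurrently.

```cpp
#include <algorithm>
#include <atomic>
#include <mutex>
#include <vector>

// Hypothetical simplification of a handle table: segments live in a chain
// that must be re-sorted after whole-block allocs/frees occur.
struct HandleTable {
    std::mutex lock;                       // the per-table lock
    std::atomic<bool> fResortChains{false};
    std::vector<int> segmentChain;         // stands in for the segment chain

    // Preemptive-mode code (e.g. handle creation from the Thread ctor)
    // can mutate the chain and set the flag at any time.
    void OnWholeBlockAllocOrFree(int seg) {
        std::lock_guard<std::mutex> g(lock);
        segmentChain.push_back(seg);
        fResortChains.store(true, std::memory_order_release);
    }

    // The fixed iterator path: acquire the table lock before resorting,
    // so a concurrent mutation cannot corrupt the chain and leave the
    // iterator spinning forever.
    void StandardSegmentIterator_Prepare() {
        if (fResortChains.load(std::memory_order_acquire)) {
            std::lock_guard<std::mutex> g(lock);   // the new lock acquire
            std::sort(segmentChain.begin(), segmentChain.end());
            fResortChains.store(false, std::memory_order_release);
        }
    }
};
```

Without the lock in `StandardSegmentIterator_Prepare`, the sort could interleave with `OnWholeBlockAllocOrFree` and observe a half-updated chain, which is the class of corruption the infinite loop in the issue traced back to.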

@jkotas
Member

jkotas commented Sep 6, 2024

Great catch!

> rework the system to eliminate and ban any notion that handle table slots can be created or destroyed by the coreclr!Thread ctor or any other preemptive-mode code

The several VM callsites that create/destroy GC handles in preemptive mode are likely a corner-case bug farm. We may want to open an issue on refactoring the code so that the GC handles are created/destroyed in cooperative mode only.
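The suggested refactoring could enforce the invariant mechanically. A minimal sketch, with entirely hypothetical names (a thread-local flag and a `GCCoopGuard` RAII type stand in for the real thread state and coreclr's mode-switch helpers): handle creation simply refuses preemptive-mode callers, turning a rare race into an immediate, debuggable error.

```cpp
#include <stdexcept>

// Hypothetical illustration (not coreclr code): a thread-local flag stands
// in for the real per-thread GC mode.
thread_local bool g_cooperativeMode = false;

// RAII switch into cooperative mode, restoring the prior mode on exit,
// loosely modeled on coreclr's mode-switch helpers.
struct GCCoopGuard {
    bool prev;
    GCCoopGuard() : prev(g_cooperativeMode) { g_cooperativeMode = true; }
    ~GCCoopGuard() { g_cooperativeMode = prev; }
};

// Ban preemptive-mode callers outright instead of racing the GC.
int CreateHandle() {
    if (!g_cooperativeMode)
        throw std::logic_error("GC handles must be created in cooperative mode");
    static int next = 1;   // stand-in for real handle allocation
    return next++;
}
```

With such a guard in place, every VM callsite that creates or destroys handles in preemptive mode would fail fast during testing rather than corrupting shared state intermittently.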

This fix LGTM as something that is easy to backport.

@jkotas jkotas left a comment
Member

Thank you!

@ChrisAhna
Member Author

Thanks for the reviews!

@ChrisAhna ChrisAhna merged commit 1c4755d into dotnet:main Sep 6, 2024
@marklio marklio added the tenet-reliability Reliability/stability related issue (stress, load problems, etc.) label Sep 13, 2024
jtschuster pushed a commit to jtschuster/runtime that referenced this pull request Sep 17, 2024
sirntar pushed a commit to sirntar/runtime that referenced this pull request Sep 30, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Oct 13, 2024

Labels

area-VM-coreclr tenet-reliability Reliability/stability related issue (stress, load problems, etc.)


Development

Successfully merging this pull request may close these issues.

Rare failures can occur when SegmentResortChains races with coreclr!Thread construction

4 participants