Skip to content
This repository was archived by the owner on Feb 25, 2025. It is now read-only.

Conversation

@jason-simmons
Copy link
Member

@flutter-dashboard
Copy link

It looks like this pull request may not have tests. Please make sure to add tests before merging. If you need an exemption to this rule, contact Hixie on the #hackers channel in Chat (don't just cc him here, he won't see it! He's on Discord!).

If you are not sure if you need tests, consider this rule of thumb: the purpose of a test is to make sure someone doesn't accidentally revert the fix. Ask yourself, is there anything in your PR that you feel it is important we not accidentally revert back to how it was before your fix?

Reviewers: Read the Tree Hygiene page and make sure this patch meets those guidelines before LGTMing.

Copy link
Member

@zanderso zanderso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to write a test for this?

@jason-simmons
Copy link
Member Author

This is a race between the use of the CFRunLoop by CFRunLoopStop and the deletion of the CFRunLoop during thread shutdown, and I don't know of a way to force that race to happen.

As we have seen, the race will eventually occur when repeatedly running the existing tests that call MessageLoopDarwin::Terminate.

@jason-simmons jason-simmons added the autosubmit Merge PR when tree becomes green via auto submit App label Jul 19, 2022
@chinmaygarde
Copy link
Member

I must admit I don't understand this fix. The thread won't stop until the run loop has exited though right?

Fortunately, CoreFoundation is open-source, and reading the code, it looks like CFRunLoopStop acquires the loop lock. Deallocating (via the final CFRelease) also acquires the same lock. If the dealloc happens on the thread, then it won't join and we have a valid reference to our C++ wrapper. If it happens on the tests thread, then we have a live reference to the loop. Either way, I don't see how this fixes anything. If it doesn't, the ref count bumps are harmless though.

@auto-submit auto-submit bot merged commit b71b935 into flutter:main Jul 19, 2022
@jason-simmons
Copy link
Member Author

Without this fix, at best the code is relying on implementation details of CoreFoundation for safety. And AFAICT the CoreFoundation documentation does not promise that this usage pattern is allowed.

In these tests, thread A is calling MessageLoopDarwin::Terminate in order to stop thread B's message loop and allow thread B to shut down.

Before this fix, the only reference to the CFRunLoop was held by MessageLoopDarwin::loop_, and the MessageLoopDarwin instance lived inside Thread B's thread-local storage.
When Thread A calls CFRunLoopStop, Thread B can then finish its call to MessageLoopDarwin::Run, delete its objects in thread-local storage, and shut down.

So there is now a race between Thread A's potential use of the CFRunLoop until CFRunLoopStop completes, versus Thread B's deletion of the CFRunLoop. If Thread A holds its own reference to the CFRunLoop until after CFRunLoopStop exits, then this race is prevented.

I was able to reproduce this crash reasonably often by running this with a host_debug_unopt build on a MacBook:

python3 ../../third_party/gtest-parallel/gtest-parallel ./fml_unittests --gtest_filter=\*TaskRunnerCheckerTests\* --repeat=200`

When the crash occurred I captured a core dump with this backtrace:

* thread #1
  * frame #0: 0x00007ff80f5f0f4e CoreFoundation`_CFAssertMismatchedTypeID + 110
    frame #1: 0x00007ff80f49ecd7 CoreFoundation`CFRunLoopWakeUp + 462
    frame #2: 0x00000001080699d2 fml_unittests`fml::MessageLoopDarwin::Terminate(this=0x0000600000a74400) at message_loop_darwin.mm:66:3
    frame #3: 0x000000010803cf2a fml_unittests`fml::MessageLoopImpl::DoTerminate(this=0x0000600000a74400) at message_loop_impl.cc:115:3
    frame #4: 0x000000010803ba2b fml_unittests`fml::MessageLoop::Terminate(this=0x0000600001d781c0) at message_loop.cc:53:10
    frame #5: 0x0000000107e7bfef fml_unittests`fml::testing::TaskRunnerCheckerTests_FailsTheCheckIfOnDifferentTaskRunner_Test::TestBody(this=0x0000600001d6c110) at task_runner_checker_unittest.cc:40:9
    frame #6: 0x000000010ac00ee9 fml_unittests`void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(object=0x0000600001d6c110, method=21 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00, location="the test body")(), char const*) at gtest.cc:2631:10
    frame #7: 0x000000010abd6515 fml_unittests`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(object=0x0000600001d6c110, method=21 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00, location="the test body")(), char const*) at gtest.cc:2686:12
    frame #8: 0x000000010abd6431 fml_unittests`testing::Test::Run(this=0x0000600001d6c110) at gtest.cc:2706:5
  ...

This implied that the CFRunLoop was becoming invalid midway through Thread A's call to MessageLoopDarwin::Terminate and CFRunLoopWakeUp.

After applying this PR I have not been able to reproduce the crash with the command above.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

autosubmit Merge PR when tree becomes green via auto submit App needs tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Mac Unopt is 6% flaky on step test: Host Tests for host_debug_unopt

3 participants