Skip to content

Conversation

@stuartmorgan-g
Copy link
Collaborator

@stuartmorgan-g stuartmorgan-g commented Jan 14, 2023

Partially migrates the macOS-host Pigeon tests to LUCI:

While having the tests split is not ideal, it allows us to at least begin the process of moving these tests to LUCI (rather than leaving the failing bringup-mode tasks indefinitely) while we try to figure out why macOS tests are hanging in CI. And this allows us to eliminate the legacy iOS test harness, reducing tech debt.

This does mean we'll double-run other package's custom tests for now, but those other tests take ~30s in CI, so the cost is irrelevant.

Part of the consolidation of test harnesses that is covered by flutter/flutter#111505

@stuartmorgan-g stuartmorgan-g marked this pull request as draft January 14, 2023 14:44
@stuartmorgan-g
Copy link
Collaborator Author

It looks like these tests are hanging in LUCI, even though they run fine on Cirrus. @godofredoc Is there a way to connect to a machine while it's running to see if we can tell what process is hanging?

@stuartmorgan-g stuartmorgan-g changed the title [ci] Enable all tests that are in bringup mode [ci] Enable LUCI Pigeon tests Jan 31, 2023
@stuartmorgan-g
Copy link
Collaborator Author

@godofredoc Ping on the question above about debugging hangs. This is the last macOS LUCI migration, and I don't know how to make progress on it without infra help.

@stuartmorgan-g
Copy link
Collaborator Author

stuartmorgan-g commented Feb 2, 2023

This may well be flutter/flutter#119750 rather than an actual hang.

@stuartmorgan-g
Copy link
Collaborator Author

It's not 119750; I've been experimenting with Intel in #3149 and it hangs there too. So either there's a real hang, or a process left around that makes LUCI think there's a hang, and I'm still not sure how to debug that further.

@stuartmorgan-g
Copy link
Collaborator Author

Based on my experiments in #3149 it seems to be specific to the macOS tests, so perhaps something related to launching the macOS application. But we're doing that successfully on LUCI machines in flutter/plugins.

@stuartmorgan-g
Copy link
Collaborator Author

Watching a run over VNC I'm still not sure what's going on, but it looks like despite the missing output the macOS application was built (it's on disk) and probably ran tests (I see test data on disk, although I'm not sure how to inspect it to be sure it's as expected). And since I don't see xcodebuild in the process list, that suggests it has completed, so maybe somehow we're getting stuck waiting for the output stream?

@stuartmorgan-g
Copy link
Collaborator Author

maybe somehow we're getting stuck waiting for the output stream?

Disabling output streaming (and thus the waits for it to complete) didn't help; it was apparently stuck on await process.exitCode.

@godofredoc
Copy link
Contributor

maybe somehow we're getting stuck waiting for the output stream?

Disabling output streaming (and thus the waits for it to complete) didn't help; it was apparently stuck on await process.exitCode.

We can try running the test, getting the list of dart subprocesses and their children and killing one by one of the child subprocesses to see if at some point the main dart subprocess recovers and exits.

We've seen this behavior with pub get when some of the https calls keep waiting to establish a socket connection indefinitely.

@stuartmorgan-g
Copy link
Collaborator Author

This time the tests passed, with the change being that I had added some prints for debugging, so now I'm even more confused. Some kind of mistake in the await setup perhaps, causing a race? I'll keep experimenting with this code a bit more first.

(I didn't see any obvious subprocesses still running the last time I was connected and watch. I was just looking at the reverse-PID list of active processes though, so maybe I missed something.)

@stuartmorgan-g
Copy link
Collaborator Author

Ah, I think I confused myself about what I had previously tested, and the print statements were unrelated. If I run just the macOS unit tests, everything is fine. If I run that same test after running the iOS tests, it times out, but after the logs show the iOS tests as having run successfully.

I'll try just the macOS integration tests and see if that also passes, but it seems like it's a combination of tests that triggers this.

@stuartmorgan-g
Copy link
Collaborator Author

I'll try just the macOS integration tests and see if that also passes, but it seems like it's a combination of tests that triggers this.

Yes, that worked too.

So any individual test is fine, but either macOS test will hang in LUCI if another test (iOS or macOS) runs first.

@stuartmorgan-g
Copy link
Collaborator Author

We can try running the test, getting the list of dart subprocesses and their children and killing one by one of the child subprocesses to see if at some point the main dart subprocess recovers and exits.

I just tried this again; according to the hierarchical view in Activity Monitor there are no processes under the test process when the hang happens. It really seems like somehow the problem is in the awaits in the Dart code itself rather than with the process being run, but I'm not seeing what it would be.

@stuartmorgan-g
Copy link
Collaborator Author

I replaced all of the code that handles awaits process and process stream awaiting with process_runner, to try to rule out await mistakes in the Pigeon test code, and it doesn't seem to have helped.

I've exhausted everything I can think to try at the moment.

@stuartmorgan-g stuartmorgan-g changed the title [ci] Enable LUCI Pigeon tests [ci] Partially migrate macOS-host Pigeon tests to LUCI Feb 8, 2023
@stuartmorgan-g stuartmorgan-g marked this pull request as ready for review February 8, 2023 16:29
@stuartmorgan-g
Copy link
Collaborator Author

@tarrinneal This is ready for review as a hybrid, interim solution that gets the tests that do run successfully on LUCI running there, while leaving the ones that don't on Cirrus for now. Once this lands I'll make a new draft PR to fully switch that we can used to continue debugging flutter/flutter#120231.

@stuartmorgan-g stuartmorgan-g added the autosubmit Merge PR when tree becomes green via auto submit App label Feb 8, 2023
@stuartmorgan-g
Copy link
Collaborator Author

Landing on red since once again this PR seems to somehow have affected post-submit tests for the tree before it landed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autosubmit Merge PR when tree becomes green via auto submit App p: pigeon

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants