Skip to content

Conversation

@dnfield
Copy link
Contributor

@dnfield dnfield commented Feb 25, 2020

The integration tests in the tool invoke Flutter to:

  • Create project(s) in temp directories
  • Run those projects directly via flutter_tester
  • Make sure that you can attach a debugger, evaluate expressions, etc.

These processes leak. The shell script/batch script files are not set up to wait for the dart process they invoke to finish, and thus they can exit (and complete the process.exitCode future) before the Dart VM process they fork actually finishes. On Windows, this is particularly bad, since the leaked process(es) end up holding an exclusive lock on the temp directory, meaning we fail to clean it up. We've been swallowing that under the assumption that it was a permissions issue, but it appears to not be a permissions issue and just be an issue of leaked processes.

This patch adds some additional waits/attempts to terminate the Dart VM(s) that the tests spawn. In cases where we cannot be sure that the spawned VM has terminated, it adds some additional OS specific waiting to check that the PID the VM created is no longer active (on Windows in tasklist and on POSIX in response to kill -0).

We should be able to further improve the test_driver logic if we can get dart-lang/sdk#40759 resolved. Alternatively, we could look at making flutter.bat wait for Dart to exit when it launches, but I'm a bit scared to touch that logic since it has some comments about how it has to work in a bunch of scenarios I'm not sure we have tests for (/cc @goderbauer). The Dart feature would be useful regardless of this particular instance of usage though.

@fluttergithubbot fluttergithubbot added engine flutter/engine related. See also e: labels. tool Affects the "flutter" command-line tool. See also t: labels. labels Feb 25, 2020
@fluttergithubbot
Copy link
Contributor

It looks like this pull request may not have tests. Please make sure to add tests before merging. If you need an exemption to this rule, contact Hixie.

Reviewers: Read the Tree Hygiene page and make sure this patch meets those guidelines before LGTMing.

@dnfield dnfield added team: flakes and removed engine flutter/engine related. See also e: labels. labels Feb 25, 2020
} on FileSystemException catch (error) {
print('Failed to delete ${directory.path}: $error');
}
directory.deleteSync(recursive: true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to leave this for now in case we see flakes around this pop up again. Should I add a TODO to remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual reason these files were failing to delete was because we had processes that weren't being killed before trying to delete files in use, which on Windows causes exceptions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you link to the Dart SDK bug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if that's the right bug to link here - if anything I'd say file a new bug to track removing this code once we're convinced the current change hasn't re-introduced old flakes.

@dnfield dnfield changed the title roll engine plus fix leak Fix process leaks in tool tests Feb 25, 2020

// TODO(dnfield): This is racy. If dart-lang/sdk#40759 can be resolved, we
// should remove this.
Future<void> _ensureDead(int pid, int tries) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uber nit: shouldn't be called ensureX if it can't ensure it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to tryWaitForPidDeath, and moved it to test_utils.dart so it can be reused in the daemon_mode_test for the same reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

@jonahwilliams
Copy link
Contributor

jonahwilliams commented Feb 25, 2020

This seems reasonable to me, but I'd wait for @zanderso to take a look too

@zanderso
Copy link
Member

Initial thoughts:

  1. Please expand the PR description to explain what problem this is solving (or link to the github issue), and why it solves it, and if there are any remaining issues.
  2. I think we need to take a step back and think about what is being bought by going through the shell script indirection here rather than invoking Dart directly. It seems like if the test driver had a Process object for the VM, then some difficulty here could be avoided, and we wouldn't need a feature request against killPid.

ca-certificates \
gnupg
gnupg \
procps
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christopherfujino - should I split this into a separate PR?

This is needed so that kill is available in the container.

@fluttergithubbot fluttergithubbot added the c: contributor-productivity Team-specific productivity, code health, technical debt. label Feb 25, 2020
@dnfield
Copy link
Contributor Author

dnfield commented Feb 25, 2020

@zanderso - I updated the PR description.

The closest bug I can think of to this is #29141. The resolution of that issue was to make an outer harness just kill processes at the end of it. We can't quite do that in these tests.

As to whether it would be good enough to avoid the flutter.bat/flutter shell scripts, I'd leave that to @Hixie - I believe he wrote most of these tests.

@dnfield
Copy link
Contributor Author

dnfield commented Feb 25, 2020

We also might still have a real bug such that in certain debugging scenarios, the Windows VM leaks. That seems to be the case in these tests, without explicitly killing it.

@dnfield
Copy link
Contributor Author

dnfield commented Feb 25, 2020

Hm. This seemed to work locally on a Windows box, but I think the Windows instance we're using is just slower... Not sure why the process is still sticking around.

@dnfield
Copy link
Contributor Author

dnfield commented Feb 25, 2020

Discussed offline. This approach will still be flaky, and it's not clear that it's any less flaky than the current approach (even though it may avoid flakes in many common cases).

I'm going to file an issue specifically about some of the things I've discovered here. I'll leave this open for now - hopefully @Hixie might have some good input on this as the author of many of these tests.

@dnfield
Copy link
Contributor Author

dnfield commented Mar 3, 2020

I'm going to close this - it's not ready as is and I don't have a good sense of when it will be.

@dnfield dnfield closed this Mar 3, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 1, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

c: contributor-productivity Team-specific productivity, code health, technical debt. tool Affects the "flutter" command-line tool. See also t: labels.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants