Skip to content

Conversation

@Hixie
Copy link
Contributor

@Hixie Hixie commented Oct 25, 2018

These are essentially self-inflicted race conditions. Instead of timeouts we're going to try a more verbose logging mechanism that points out when things are taking a long time.

@Hixie Hixie changed the title Remove many timeouts. [O] Remove many timeouts. Oct 25, 2018
@Hixie
Copy link
Contributor Author

Hixie commented Oct 25, 2018

cc @tvolkert

@Hixie
Copy link
Contributor Author

Hixie commented Oct 25, 2018

The tests will likely fail until #23529 lands. They may be broken after that too, I can't run them locally until that is fixed.

@Hixie Hixie force-pushed the flakes branch 2 times, most recently from 8436e7e to b138fae Compare October 25, 2018 20:58
@tvolkert
Copy link
Contributor

In many (most?) cases, these timeouts were added to alert the developer that something is likely wrong. That use case is solved by logging trace statements (or warnings), but this PR doesn't always ensure that such logging will take place.

What if we introduced something like the following:

typedef LogFunction = void Function(int attempt, Duration interval);

void _defaultMonitorLog(int attempt, Duration interval) {
  final Duration elapsed = attempt & interval;
  print('Future is taking too long ($elapsed)');
}

/// Monitors the specified [future], logging a message (via [log])
/// periodically (according to [interval]) if the future is taking too
/// long to complete.
Future<T> monitor(
  Future<T> future,
  Duration interval, {
  LogFunction log = _defaultMonitorLog,
}) async {
  assert(future != null);
  assert(interval != null);
  assert(interval > Duration.zero);
  int attempt = 1;
  while (true) {
    try {
      return await future.timeout(interval);
    } on TimeoutException {
      log(attempt);
      attempt++;
    }
  }
}

Then in cases where logging could be useful (most cases I'm guessing), we could call:

Before:

await foo.timeout(const Duration(minutes: 1));

After:

await monitor(foo, const Duration(minutes: 1));

@Hixie
Copy link
Contributor Author

Hixie commented Oct 27, 2018

What are the cases where it doesn't happen? Any time where we're timing out but not showing a progress indicator, we have another bug, namely, that we're not showing a progress indicator. I'd rather fix that bug than only fix the part where we're not showing a message when it times out.

@Hixie Hixie force-pushed the flakes branch 4 times, most recently from c93ec39 to 0cc9fc5 Compare November 27, 2018 23:02
@zoechi zoechi added a: tests "flutter test", flutter_test, or one of our tests c: contributor-productivity Team-specific productivity, code health, technical debt. tool Affects the "flutter" command-line tool. See also t: labels. t: gradle "flutter build" and "flutter run" on Android labels Nov 28, 2018
@Hixie Hixie force-pushed the flakes branch 2 times, most recently from e5ef756 to 70c695c Compare November 30, 2018 02:59
@Hixie
Copy link
Contributor Author

Hixie commented Nov 30, 2018

cc @tvolkert @DanTup

There might still be some issues, I haven't rereviewed the PR myself yet, but if you are curious to review it, go ahead. It should be reasonably solid at this point.

There's one or two random delays that I haven't tried to remove yet, might do that in another PR though as this has gotten quite big already.

@Hixie
Copy link
Contributor Author

Hixie commented Nov 30, 2018

ok, still some flakiness i need to shake out.

Copy link
Contributor

@DanTup DanTup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some minor questions/comments, but otherwise LGTM (assuming the builds pass and are not flaky, ofc :-))

Copy link
Contributor

@DanTup DanTup Dec 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to return future to avoid completing when the warning is written? (there are things that await the return of _warnIfSlow)

Scratch that, didn't see the cascade ..!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit confusing:

Failures should be indicated by completing the future with a string

It sounds like the future will complete successfully with a String but the return type is Future<void>. Should it say that they complete with an error/throw a String?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Clarified.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any significance to the 2 worth commenting (or making a const)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the exit code for the process (documented in the superclass, which declares the abstract method implemented here). We seem to use 1 for user error and 2 for failure. I'll add a comment to the superclass about that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for changing from printError to printTrace? Could this stop the user from getting feedback that it failed? (Seems like in the Daemon's logger we just drop printTrace events and only send printErrors- "This is a lot of traffic to send over the wire").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole function is run in the context of a Status progress indicator, so we can't print anything to the console until it's stopped. Also, we should never show a stack trace to the user.

The message is still reported, just after the progress indicator is stopped.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it matters for any of the tests we have, but this code looks racy if the code with the breakpoint only runs once. Is it possible the code for the breakpoint has already run by the time the breakpoint is added, so it never pauses?

(the original code used reloads/restarts to avoid this, but it was kinda clumsy!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a comment about this. It's certainly possible to use it badly.

@DanTup
Copy link
Contributor

DanTup commented Dec 15, 2018

LGTM with the changes (though some of the attach tests seem to be timing out).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a documentation block about when the completed future returns and the fact that it explicitly will not time out, so if a developer wants that, they need to do it explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the point of this delay?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spoke to the person who added it, and they weren't sure but seemed to think it was prophylactic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support this PR, but it's worth understanding the context in which this timeout was added (in #16762) -- we were seeing cases where the CI container was timing out after indefinitely waiting for a future to complete, and when that happened, the driver test harness wasn't getting a chance to clean up and write out its logs, so we had no context as to what was hanging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add flushing or something in that case, to make sure we're writing the logs promptly. A crash would presumably have the same problem, and a timeout won't help there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another unexplained delay. This was a smell to begin with, but I'm curious why it existed and if we're going to re-introduce an old problem by removing it without correspondingly fixing some underlying issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite possible, but I've no idea what to do about it. Leaving the delay in doesn't seem reasonable either, it's just making the original bug harder to track down.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not guaranteed that timeout is set here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just remove the .timeout() call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed (also removed the argument to the method).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeout may not be set - looks like we should just remove the .timeout() call here too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed (also removed the argument to the method).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert(timeout != null)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@tvolkert tvolkert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Hixie
Copy link
Contributor Author

Hixie commented Dec 20, 2018

Thanks for the review. Will land on green.

@Hixie
Copy link
Contributor Author

Hixie commented Dec 20, 2018

Looks like test/integration/flutter_attach_test.dart and test/integration/lifetime_test.dart may be flaky even with my patch.

@DanTup
Copy link
Contributor

DanTup commented Dec 20, 2018

The failures are all "Test timed out after 30 seconds.". I think this is just because these tests can be slow - they used to have longer timeouts but I think this changes them. I think we need to put back the increased timeouts for these tests.

(I can't repro any failures locally, but I presume my machine is faster what we're getting on the CI bots).

These are essentially self-inflicted race conditions. Instead of timeouts we're going to try a more verbose logging mechanism that points out when things are taking a long time.
@Hixie
Copy link
Contributor Author

Hixie commented Dec 21, 2018

@DanTup Yeah, I reproduced the errors locally by running a lot of them at once. I've added a comment next to the extended timeouts explaining why, so that I'm not tempted to remove them again. We really should make devfs sync faster (that's the long pole here).

Tests seem good now. Will land on green.

@Hixie Hixie merged commit 76f7081 into flutter:master Dec 21, 2018
@Hixie Hixie deleted the flakes branch December 21, 2018 01:10
Hixie added a commit that referenced this pull request Dec 21, 2018
Hixie added a commit that referenced this pull request Dec 21, 2018
kangwang1988 added a commit to XianyuTech/flutter that referenced this pull request Dec 26, 2018
* flt_master: (143 commits)
  Roll engine 215ca15..d8c5ec0 (12 commits) (flutter#25728)
  Provide some more locations for the FAB. (flutter#24736)
  Undeprecated BigInteger support, but document what it actually does. (flutter#24511)
  ClipPath.shape and related fixes (flutter#24816)
  Handle errors in `compute()` by propagating them to the Future. (flutter#24848)
  Fix merge conflict. (flutter#25718)
  Some minor tweaks to InputDecoration (mainly docs). (flutter#24643)
  Expose font fallback API in TextStyle, Roll engine 54a3577..215ca15 (8 commits) (flutter#25585)
  Updated Shrine demo (flutter#25674)
  Pin the goldens repo to a specific commit in the android_views test. (flutter#25678)
  Friendlier flags for Dart compilation training. (flutter#25645)
  Revert dependency upgrade to see if it helps with build times and APK size (flutter#25642)
  Depend on the goldens repo through git. (flutter#25479)
  no period after an alone link in see also section (flutter#25604)
  Update links for China help (flutter#25238)
  Roll engine 6a90418..54a3577 (23 commits) (flutter#25649)
  Roll engine e859296..6a90418 (4 commits) (flutter#25643)
  Adding support for android app bundle - Issue flutter#17829 (flutter#24440)
  Revert "[O] Remove many timeouts. (flutter#23531)" (flutter#25646)
  [O] Remove many timeouts. (flutter#23531)
  ...
Hixie added a commit to Hixie/flutter that referenced this pull request Jan 18, 2019
* Remove many timeouts.

These are essentially self-inflicted race conditions. Instead of timeouts we're going to try a more verbose logging mechanism that points out when things are taking a long time.

* Get the attach tests to pass.

* Apply review comments from Todd

* More review comment fixes

* Put back the extended timeouts here now that I know why we have them...
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

a: tests "flutter test", flutter_test, or one of our tests c: contributor-productivity Team-specific productivity, code health, technical debt. t: gradle "flutter build" and "flutter run" on Android tool Affects the "flutter" command-line tool. See also t: labels.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants