
Conversation

yjbanov (Contributor) commented Jun 13, 2017

Add the following benchmarks to the devicelab:

  • flutter analyze --flutter-repo --benchmark
  • flutter analyze --flutter-repo --benchmark --watch
  • flutter analyze --benchmark (on mega gallery app)
  • flutter analyze --benchmark --watch (on mega gallery app)
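
For reference, a minimal sketch (not the actual devicelab task) of how such invocations could be timed from a standalone Dart script. The timeAnalyze helper, the external Stopwatch timing, and the mega gallery path are assumptions; the flags come from the list above, and the --watch variants are omitted since they need a different driving mechanism.

// Sketch only, not the devicelab task. Times one `flutter analyze`
// invocation externally with a Stopwatch.
import 'dart:io';

Future<Duration> timeAnalyze(List<String> args, {String? workingDirectory}) async {
  final Stopwatch watch = Stopwatch()..start();
  final ProcessResult result = await Process.run(
    'flutter',
    <String>['analyze', ...args],
    workingDirectory: workingDirectory,
  );
  watch.stop();
  if (result.exitCode != 0) {
    throw Exception('flutter analyze failed:\n${result.stderr}');
  }
  return watch.elapsed;
}

Future<void> main() async {
  // Analyze the flutter repository itself.
  print(await timeAnalyze(<String>['--flutter-repo', '--benchmark']));
  // Analyze a generated mega gallery app (path is a placeholder).
  print(await timeAnalyze(<String>['--benchmark'],
      workingDirectory: '/path/to/mega_gallery'));
}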

@Hixie reviewed this snippet:

  return new AnalyzerCliTask(sdk, commit, timestamp);
}

/// Run each benchmark this many times and pick the best result.
const int _kRunsPerBenchmark = 3;
Hixie (Contributor):

why the best result rather than the average?

yjbanov (Contributor Author):

That's how it used to be. Happy to change to average.

yjbanov (Contributor Author):

Done.

A second review thread covered this snippet:

double minValue;
for (int i = 0; i < _kRunsPerBenchmark; i++) {
  // Delete cached analysis results.
  rmTree(dir('${Platform.environment['HOME']}/.dartServer'));
Hixie (Contributor):

I think we have to do this before each run of the analyzer, not before each group of runs.

yjbanov (Contributor Author):

Yep. It's inside the benchmark loop, so it deletes the directory before each flutter analyze command.

Hixie (Contributor):

Oh wow I totally misread that. LGTM.

Hixie (Contributor) commented Jun 13, 2017

LGTM modulo that I think doing the average (or maybe the worst time? but probably the average) would be more representative, and that we should remove the cache before each run of the analysis subprocess rather than once before the group of four analyses.
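
A rough sketch of the loop shape being asked for here, assuming a hypothetical runAnalyze() callback that performs one flutter analyze invocation and returns its wall-clock time in milliseconds; the real task's helpers (rmTree, dir) are replaced with plain dart:io calls.

// Sketch only: average across runs, clearing the analysis cache before
// every run rather than once per group of runs.
import 'dart:io';

const int kRunsPerBenchmark = 3;

Future<double> benchmarkAverage(Future<double> Function() runAnalyze) async {
  double totalMs = 0.0;
  for (int i = 0; i < kRunsPerBenchmark; i++) {
    // Delete cached analysis results before *each* run so that every run
    // pays the same cold-cache cost.
    final Directory cache =
        Directory('${Platform.environment['HOME']}/.dartServer');
    if (cache.existsSync()) {
      cache.deleteSync(recursive: true);
    }
    totalMs += await runAnalyze();
  }
  return totalMs / kRunsPerBenchmark;
}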

Hixie (Contributor) commented Jun 13, 2017

LGTM

yjbanov merged commit fde985b into flutter:master on Jun 13, 2017
skybrian (Contributor):

<bikeshed>
There's an argument that taking the best time is the most representative of what the benchmark actually measures, since it includes everything that happens on every benchmark run but excludes things that slow the benchmark down only some of the time (for example, caches not being warmed up, or the machine it's running on having a bit more load than usual). To accurately measure the worst case, you want to make sure it happens on every run (by clearing caches, etc.) rather than trusting to chance.
</bikeshed>

Hixie (Contributor) commented Jun 14, 2017

Taking the smallest time would entirely hide cases where the code takes substantially longer on the first or second run (e.g. because the code has a bi-modal bug where odd-numbered runs take a different code path than even-numbered runs).

yjbanov (Contributor Author) commented Jun 14, 2017

We are actually intentionally clearing the .dartServer cache between runs. Also, we're not interested in the absolute numbers so much as in regressions.

skybrian (Contributor) commented Jun 14, 2017

Yes, clearing the cache every time is good.

It's true that taking the minimum hides bi-modal bugs, but I claim that's a feature. Any average will hide variation between test runs to some extent (that's kind of the point). For benchmarking, taking the minimum seems to do a better job of it: it reveals regressions that affect every test run, and those tend to be easier to work on.

Showing variation between runs is also useful for detecting some bugs and for understanding noise added by the environment, but I think it's less confusing to plot it separately. If there's enough data, plotting the median, 90th, 99th percentile, and so on would be useful. In this case I think there are only three points, so maybe just plot all of them?

(But I don't think it's important to make a change now; just theorizing.)
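
For illustration, a small sketch of the aggregations mentioned in this thread (minimum, average, nearest-rank percentiles) over a list of per-run times; the helper names and the numbers in main() are made up.

// Sketch of the aggregations discussed above, for a list of per-run times
// in milliseconds. With only three runs per benchmark, percentiles beyond
// the median aren't meaningful, so plotting all raw points is an option too.
double minOf(List<double> times) =>
    times.reduce((double a, double b) => a < b ? a : b);

double averageOf(List<double> times) =>
    times.reduce((double a, double b) => a + b) / times.length;

/// Nearest-rank percentile, e.g. p = 0.5 for the median, 0.9 for the 90th.
double percentileOf(List<double> times, double p) {
  final List<double> sorted = List<double>.of(times)..sort();
  final int index = (p * (sorted.length - 1)).round();
  return sorted[index];
}

void main() {
  const List<double> runsMs = <double>[1234.0, 1180.0, 1420.0]; // made-up numbers
  print('min: ${minOf(runsMs)}  avg: ${averageOf(runsMs).toStringAsFixed(1)}  '
      'p50: ${percentileOf(runsMs, 0.5)}  worst: ${percentileOf(runsMs, 1.0)}');
}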

yjbanov (Contributor Author) commented Jun 14, 2017

@skybrian good points. Next time I get cycles to work on benchmarks, I'll see if I can come up with a more scalable system. Right now it's very simple: a chart card for every metric. That's great for maintenance, but it doesn't scale well if we want to collect more numbers. That's the only reason we limit the metrics we publish.

Hixie (Contributor) commented Jun 14, 2017

For some of the other benchmarks we actually plot both the average and the worst time. The worst time is (obviously) very noisy, but it is still a useful metric to track, and it has shown regressions. It's also useful for seeing the actual worst-case variance (e.g. worst_frame_build_time_millis in complex_layout_scroll_perf_ios__timeline_summary shows that our worst-case times are regularly 2x the normal ones).

The github-actions bot locked the conversation as resolved and limited it to collaborators on Aug 13, 2021.