
Conversation

yjbanov (Contributor) commented Jun 13, 2017

Add the following benchmarks to the devicelab:

  • flutter analyze --flutter-repo --benchmark
  • flutter analyze --flutter-repo --benchmark --watch
  • flutter analyze --benchmark (on mega gallery app)
  • flutter analyze --benchmark --watch (on mega gallery app)
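
For reference, a minimal sketch (not the actual devicelab task) of how such invocations could be timed from a standalone Dart script. The timeAnalyze helper, the external Stopwatch timing, and the mega gallery path are assumptions; the flags come from the list above, and the --watch variants are omitted since they need a different driving mechanism.

// Sketch only, not the devicelab task. Times one `flutter analyze`
// invocation externally with a Stopwatch.
import 'dart:io';

Future<Duration> timeAnalyze(List<String> args, {String? workingDirectory}) async {
  final Stopwatch watch = Stopwatch()..start();
  final ProcessResult result = await Process.run(
    'flutter',
    <String>['analyze', ...args],
    workingDirectory: workingDirectory,
  );
  watch.stop();
  if (result.exitCode != 0) {
    throw Exception('flutter analyze failed:\n${result.stderr}');
  }
  return watch.elapsed;
}

Future<void> main() async {
  // Analyze the flutter repository itself.
  print(await timeAnalyze(<String>['--flutter-repo', '--benchmark']));
  // Analyze a generated mega gallery app (path is a placeholder).
  print(await timeAnalyze(<String>['--benchmark'],
      workingDirectory: '/path/to/mega_gallery'));
}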

@Hixie reviewed this snippet:

  return new AnalyzerCliTask(sdk, commit, timestamp);
}

/// Run each benchmark this many times and pick the best result.
const int _kRunsPerBenchmark = 3;
Hixie (Contributor):

why the best result rather than the average?

yjbanov (Contributor Author):

That's how it used to be. Happy to change to average.

yjbanov (Contributor Author):

Done.

A second review thread covered this snippet:

double minValue;
for (int i = 0; i < _kRunsPerBenchmark; i++) {
  // Delete cached analysis results.
  rmTree(dir('${Platform.environment['HOME']}/.dartServer'));
Hixie (Contributor):

I think we have to do this before each run of the analyzer, not before each group of runs.

yjbanov (Contributor Author):

Yep. It's inside the benchmark loop, so it deletes the directory before each flutter analyze command.

Hixie (Contributor):

Oh wow I totally misread that. LGTM.

Hixie (Contributor) commented Jun 13, 2017

LGTM modulo that I think doing the average (or maybe the worst time? but probably the average) would be more representative, and that we should remove the cache before each run of the analysis subprocess rather than once before the group of four analyses.
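
A rough sketch of the loop shape being asked for here, assuming a hypothetical runAnalyze() callback that performs one flutter analyze invocation and returns its wall-clock time in milliseconds; the real task's helpers (rmTree, dir) are replaced with plain dart:io calls.

// Sketch only: average across runs, clearing the analysis cache before
// every run rather than once per group of runs.
import 'dart:io';

const int kRunsPerBenchmark = 3;

Future<double> benchmarkAverage(Future<double> Function() runAnalyze) async {
  double totalMs = 0.0;
  for (int i = 0; i < kRunsPerBenchmark; i++) {
    // Delete cached analysis results before *each* run so that every run
    // pays the same cold-cache cost.
    final Directory cache =
        Directory('${Platform.environment['HOME']}/.dartServer');
    if (cache.existsSync()) {
      cache.deleteSync(recursive: true);
    }
    totalMs += await runAnalyze();
  }
  return totalMs / kRunsPerBenchmark;
}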

Hixie (Contributor) commented Jun 13, 2017

LGTM

yjbanov merged commit fde985b into flutter:master on Jun 13, 2017
skybrian (Contributor):

<bikeshed>
There's an argument that taking the best time is the most representative of what the benchmark actually measures, since it includes everything that happens on every benchmark run but excludes things that slow the benchmark down only some of the time (for example, caches not being warmed up, or the machine it's running on having a bit more load than usual). To accurately measure the worst case, you want to make sure it happens on every run (by clearing caches, etc.) rather than trusting to chance.
</bikeshed>

Hixie (Contributor) commented Jun 14, 2017

Taking the smallest time would entirely hide cases where the code takes substantially longer on the first or second run (e.g. because the code has a bi-modal bug where odd-numbered runs take a different code path than even-numbered runs).

yjbanov (Contributor Author) commented Jun 14, 2017

We are actually intentionally clearing the .dartServer cache between runs. Also, we're not interested in the absolute numbers so much as in regressions.

skybrian (Contributor) commented Jun 14, 2017

Yes, clearing the cache every time is good.

It's true that taking the minimum hides bi-modal bugs, but I claim that's a feature. Any average will hide variation between test runs to some extent (that's kind of the point). For benchmarking, taking the minimum seems to do a better job of it: it reveals regressions that affect every test run, and those tend to be easier to work on.

Showing variation between runs is also useful for detecting some bugs and for understanding noise added by the environment, but I think it's less confusing to plot it separately. If there's enough data, plotting the median, 90th, 99th percentile, and so on would be useful. In this case I think there are only three points, so maybe just plot all of them?

(But I don't think it's important to make a change now; just theorizing.)
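
For illustration, a small sketch of the aggregations mentioned in this thread (minimum, average, nearest-rank percentiles) over a list of per-run times; the helper names and the numbers in main() are made up.

// Sketch of the aggregations discussed above, for a list of per-run times
// in milliseconds. With only three runs per benchmark, percentiles beyond
// the median aren't meaningful, so plotting all raw points is an option too.
double minOf(List<double> times) =>
    times.reduce((double a, double b) => a < b ? a : b);

double averageOf(List<double> times) =>
    times.reduce((double a, double b) => a + b) / times.length;

/// Nearest-rank percentile, e.g. p = 0.5 for the median, 0.9 for the 90th.
double percentileOf(List<double> times, double p) {
  final List<double> sorted = List<double>.of(times)..sort();
  final int index = (p * (sorted.length - 1)).round();
  return sorted[index];
}

void main() {
  const List<double> runsMs = <double>[1234.0, 1180.0, 1420.0]; // made-up numbers
  print('min: ${minOf(runsMs)}  avg: ${averageOf(runsMs).toStringAsFixed(1)}  '
      'p50: ${percentileOf(runsMs, 0.5)}  worst: ${percentileOf(runsMs, 1.0)}');
}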

yjbanov (Contributor Author) commented Jun 14, 2017

@skybrian good points. Next time I get cycles to work on benchmarks, I'll see if I can come up with a more scalable system. Right now it's very simple: a chart card for every metric. That's great for maintenance, but it doesn't scale well if we want to collect more numbers. That's the only reason we limit the metrics we publish.

Hixie (Contributor) commented Jun 14, 2017

For some of the other benchmarks we actually plot both the average and the worst time. The worst time is (obviously) very noisy, but it is still a useful metric to track, and it has shown regressions. It's also useful for seeing the actual worst-case variance (e.g. worst_frame_build_time_millis in complex_layout_scroll_perf_ios__timeline_summary shows that our worst-case times are regularly 2x the normal ones).

The github-actions bot locked the conversation as resolved and limited it to collaborators on Aug 13, 2021.