resurrect analyzer benchmarks #10668

Conversation
```dart
  return new AnalyzerCliTask(sdk, commit, timestamp);
}

/// Run each benchmark this many times and pick the best result.
const int _kRunsPerBenchmark = 3;
```
why the best result rather than the average?
That's how it used to be. Happy to change to average.
Done.
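(For illustration, a hedged sketch of the agreed change: summarizing by average instead of by best result. `samples` and `summarize` are hypothetical names, not from the patch:)

```dart
// Report the average of the collected run times instead of the
// single best (minimum) one. `samples` holds one elapsed time per run.
double summarize(List<double> samples) {
  final double sum = samples.fold(0.0, (double a, double b) => a + b);
  return sum / samples.length; // previously: the minimum across runs
}
```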
```dart
double minValue;
for (int i = 0; i < _kRunsPerBenchmark; i++) {
  // Delete cached analysis results.
  rmTree(dir('${Platform.environment['HOME']}/.dartServer'));
```
I think we have to do this before each run of the analyzer, not before each group of runs.
Yep. It's inside the benchmark loop, so it deletes the directory before each flutter analyze command.
Oh wow I totally misread that. LGTM.
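(The hunk above is cut off; here is a minimal sketch of the full loop under discussion, assuming the agreed switch to averaging. `_runAnalyze` is a hypothetical stand-in for one timed `flutter analyze` invocation; `rmTree`, `dir`, and `_kRunsPerBenchmark` come from the diff:)

```dart
Future<double> _benchmarkAverage() async {
  double sum = 0.0;
  for (int i = 0; i < _kRunsPerBenchmark; i++) {
    // Delete cached analysis results before *each* run, so every
    // `flutter analyze` invocation starts from a cold cache.
    rmTree(dir('${Platform.environment['HOME']}/.dartServer'));
    sum += await _runAnalyze(); // hypothetical: one timed analysis run
  }
  return sum / _kRunsPerBenchmark;
}
```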
LGTM modulo that I think doing the average (or maybe the worst time? but probably the average) would be more representative, and that we should remove the cache before each run of the analysis subprocess rather than before the run of four analyses.

<bikeshed>

Taking the smallest time would entirely hide cases where the code takes substantially longer on the first or second run (e.g. because the code has a bi-modal bug where odd-numbered runs take a different code path than even-numbered runs).
We are actually intentionally clearing the cache.
Yes, clearing the cache every time is good. It's true that taking the minimum hides bi-modal bugs, but I claim that's a feature. Any average will hide variation between test runs to some extent (that's kind of the point). For benchmarking, taking the minimum seems to do a better job of it, revealing regressions that affect every test run, which tend to be easier to work on. Showing variation between runs is also useful for detecting some bugs and understanding noise added by the environment, but I think it's less confusing to plot it separately. If there's enough data, plotting the median, 90%, 99%, and so on would be useful. In this case I think there are only three points, so maybe just plot all of them? (But I don't think it's important to make a change now; just theorizing.)
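(A hedged sketch of what reporting the spread could look like, using nearest-rank percentiles; `percentile` and `report` are hypothetical names, not part of the devicelab:)

```dart
// Nearest-rank percentile over a sorted list of run times.
double percentile(List<double> sorted, double p) {
  final int index = ((sorted.length - 1) * p).round();
  return sorted[index];
}

void report(List<double> samples) {
  final List<double> sorted = new List<double>.from(samples)..sort();
  print('min: ${sorted.first}');
  print('median: ${percentile(sorted, 0.50)}');
  print('p90: ${percentile(sorted, 0.90)}');
  print('p99: ${percentile(sorted, 0.99)}');
}
```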
@skybrian good points. Next time I get cycles to work on benchmarks, I'll see if I can come up with a more scalable system. Right now it's very simple: a chart card for every metric. That's great for maintenance, but it doesn't scale well if we want to collect more numbers. That's the only reason we limit the metrics we publish.

For some of the other benchmarks we actually plot the average and the worst time. The worst time is (obviously) very noisy, but it is still a useful metric to track, and it has caught regressions. It's also useful for seeing the actual worst-case variance (e.g. complex_layout_scroll_perf_ios__timeline_summary worst_frame_build_time_millis shows that our worst-case times are regularly 2x the normal ones).
Add the following benchmarks to the devicelab:

- `flutter analyze --flutter-repo --benchmark`
- `flutter analyze --flutter-repo --benchmark --watch`
- `flutter analyze --benchmark` (on mega gallery app)
- `flutter analyze --benchmark --watch` (on mega gallery app)

@Hixie