Add performance regression comparison in CI #2370
Conversation
I agree with the idea of having different scales for different benchmarks, but I am not sure about the numbers. How about running this check in CI a couple of times to collect running-time data, and then using that data to decide the scale numbers?
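As a rough illustration of that idea (the file layout and metric names here are hypothetical, not benchcomp's actual output format), the spread per benchmark across a few repeated CI runs could be summarized like this:

```python
import json
import statistics
from pathlib import Path

# Hypothetical layout: each CI run saved a JSON file mapping benchmark
# name to its solver/symex runtime in seconds. The real benchcomp
# output format may differ.
runs = [json.loads(p.read_text()) for p in Path("ci-runs").glob("*.json")]

benchmarks = set().union(*(run.keys() for run in runs))
for name in sorted(benchmarks):
    times = [run[name] for run in runs if name in run]
    if len(times) < 2:
        continue
    mean = statistics.mean(times)
    # Coefficient of variation: the relative spread across repeated runs,
    # which hints at how large each benchmark's threshold needs to be.
    spread = statistics.stdev(times) / mean if mean else 0.0
    print(f"{name}: mean={mean:.2f}s spread={spread:.1%}")
```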
celinval left a comment
Just a nit: this PR adds a new workflow for PRs, but it isn't blocking yet. AFAIK, making it blocking is a GitHub configuration, and I think we should activate it only after we adjust the scale so that this job won't give too many false positives.
Thanks, I've changed the commit message. Agreed about leaving this to bake for a while before we make it blocking.
@qinheping here are my thoughts on which numbers to use: I've tried a few experiments with CI and locally. I get a lot less variance locally; often all benchmarks (or all but one) have a variance <10%. This is when comparing identical Kanis. In CI, the numbers that @adpaco-aws suggested seem to work well for now. I do think it would be a good idea to optimize these numbers, but I would prefer to see what the range is on realistic PRs---that is, PRs that actually change the implementation of Kani and so cause an actual performance difference. What I propose:
I suggest that we merge this PR first, though. I would be inclined to merge soon and start getting data as quickly as we can, rather than trying to get the numbers exactly right up front, especially since we don't have data for realistic runs.
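To make that concrete, here is one purely illustrative way the collected data could later be turned into per-benchmark thresholds; the variable names, sample values, and the 20% safety margin are assumptions of mine, not anything agreed in this thread:

```python
# Hypothetical data: per benchmark, the largest relative change observed
# on PRs that should not have affected performance.
observed_noise = {"bench_a": 0.07, "bench_b": 0.18}

def pick_threshold(noise: float, floor: float = 0.10) -> float:
    # Allow 20% headroom above the worst observed noise, but never go
    # below the 10% default used in this PR.
    return max(floor, noise * 1.2)

thresholds = {name: pick_threshold(noise) for name, noise in observed_noise.items()}
print(thresholds)
```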
tautschnig left a comment
I think we need to make sure we have access to metrics with very low noise.
Sounds great! Thank you @karkhaz.
tautschnig left a comment
Got convinced that this is a start, but I do maintain that we need to dig deeper to figure out what the sources of performance variability are.
I agree. I think we should merge the PR but only make it required once we figure that out.
This commit adds a CI job that runs the `benchcomp` tool on the
performance regression suite, comparing the HEAD of the pull request to
the branch that the PR targets.
The CI job fails if any performance benchmark regresses when run using
the HEAD version of Kani with respect to the 'base' branch. Regression
is defined as a regression in symex or solver time of 10% for any
benchmark for which this value is >2s, or if any performance benchmark
fails with the HEAD version while passing with the base.
This fixes #2277.
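A minimal sketch of that regression rule as a Python predicate (the function, field names, and record layout are mine, not benchcomp's actual API):

```python
def regressed(old: dict, new: dict) -> bool:
    """Return True if the HEAD run should fail the CI job.

    `old` and `new` are hypothetical per-benchmark records with a
    `success` flag and symex/solver times in seconds; benchcomp's
    actual data model may differ.
    """
    # A benchmark that passed on the base branch but fails on HEAD is
    # always a regression.
    if old["success"] and not new["success"]:
        return True
    for metric in ("symex_time", "solver_time"):
        # Only compare values large enough to be meaningful (>2s), and
        # flag an increase of more than 10%.
        if old[metric] > 2.0 and new[metric] > old[metric] * 1.10:
            return True
    return False
```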
Call-outs:
This is currently running the 'old' Kani on the old perf suite, and the 'new' Kani on the new suite.
There seems to be a problem with finding the stdlib otherwise. This means that benchmarks that get renamed will not be checked in the PR that renames them (though they will start to be checked again once the PR is merged).
Testing:
Here is a run that failed because one of the benchmarks regressed. Compared with that run, I've changed this PR so that numerical metrics that are less than 10 are not compared between the old and new checkouts.
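If the comparison is written as a per-metric test like the sketch above, that change amounts to skipping small base values altogether (names and layout hypothetical):

```python
# Hypothetical per-metric test: metrics below 10 on the base checkout
# are not compared at all; otherwise a >10% increase counts as a
# regression.
def metric_regressed(old: float, new: float) -> bool:
    return old >= 10 and new > old * 1.10
```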
Checklist
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 and MIT licenses.