For complex processes which resist modeling, a practical option is to use historical data from the same process. We actually used to do this for ClickHouse: for each tested commit, we measured the run times of each test query and saved them into a database. We could then compare the patched server to these reference values, build graphs of changes over time, and so on. The main problem with this approach is the systematic error introduced by the environment. Sometimes the performance testing task lands on a machine with a dying HDD, or `atop` gets updated to a broken version that doubles the cost of every kernel call, et cetera, ad infinitum. This is why we now employ another approach.
We run the reference version of the server process and the tested version simultaneously on the same machine, and run the test queries on each of them in turn, one by one. This way we eliminate most systematic errors, because both servers are equally influenced by them. We can then compare the set of results we got from the reference server process and the set from the test server process, to see whether they look the same. Comparing the distributions using two samples is a very interesting problem in itself. We use a non-parametric bootstrap method to build a randomization distribution for the observed difference of median query run times. This method is described in detail in [[1]](#ref1), where they apply it to see how changing a fertilizer mixture changes the yield of tomato plants. ClickHouse is not much different from tomatoes, only we have to check how changes in code influence the performance.
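To make the idea concrete, here is a minimal sketch of such a randomization (permutation) test on the difference of medians, with made-up run times — not ClickHouse's actual implementation, just the general technique:

```python
import random

def randomization_test(ref, test, n_resamples=10000, rng=None):
    """Permutation test for the difference of median run times.

    Pools the two samples, repeatedly reshuffles the pool into two
    groups of the original sizes, and counts how often a random split
    produces a median difference at least as large as the observed one.
    """
    rng = rng or random.Random(0)

    def median(xs):
        s = sorted(xs)
        n = len(s)
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

    observed = abs(median(test) - median(ref))
    pooled = list(ref) + list(test)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(median(pooled[:len(ref)]) - median(pooled[len(ref):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_resamples

# Hypothetical run times (seconds) for one query on two servers, 7 runs each.
ref  = [1.02, 0.98, 1.05, 1.01, 0.99, 1.03, 1.00]
test = [1.10, 1.12, 1.08, 1.15, 1.09, 1.11, 1.13]
p = randomization_test(ref, test)
```

If the two servers behave the same, splits of the pooled sample are exchangeable, so a small `p` suggests a real difference rather than noise.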
> We use a non-parametric bootstrap method to build a randomization distribution
It is not so clear why, actually. The good question, in general, is what we can do to compare two distributions, and why methods like Student's t-test (the distribution is never normal), Mann–Whitney (observations should be independent) or Wilcoxon (why?) don't apply. What is the expected number of measurements for some reasonable p-values (which ones?), and why does bootstrap help?
Wilcoxon signed-rank test should be suitable. The added benefit of using the randomization distribution is that it allows you to directly estimate how unstable the test is (e.g. how big a difference you can see at a given p-value even without any changes). But the main reason I use it is that it doesn't involve any math and is easy to understand.
Not sure about the best number of runs and how to calculate it, I just took the least number where most tests became relatively stable: 7 runs on each server give expected change below 10% for p=0.01.
Probably we need another article that explains these choices, written by a statistics expert, not me...
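The "how big a difference can you see even without changes" idea above can be sketched directly: take run times from identical servers, build the randomization distribution of the median difference, and read off its upper quantile as the smallest relative change the test could distinguish at that p-value. The numbers below are hypothetical:

```python
import random
from statistics import median

def detectable_change(runs, n_resamples=10000, p=0.01, rng=None):
    """Estimate the smallest relative median change detectable at
    significance level p, from the randomization distribution of the
    data under the no-change hypothesis."""
    rng = rng or random.Random(0)
    pooled = list(runs)
    half = len(pooled) // 2
    diffs = []
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diffs.append(abs(median(pooled[:half]) - median(pooled[half:])))
    diffs.sort()
    threshold = diffs[int((1 - p) * n_resamples) - 1]  # ~(1-p) quantile
    return threshold / median(runs)

# Hypothetical: 14 run times for the same query on identical servers.
runs = [1.00, 1.02, 0.98, 1.05, 1.01, 0.97, 1.03,
        0.99, 1.04, 1.00, 1.02, 0.98, 1.01, 1.03]
rel = detectable_change(runs)
```

A change smaller than `rel` (relative to the median) would be indistinguishable from noise at this p-value, which is one way to pick the number of runs.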
I mean, at 7 runs even the p=0.01 doesn't really make sense, because you can't actually calculate the 99th percentile. The randomization distribution is discrete, and in this case it has like 16 values or something, so the max value takes more than one percent. But it works somehow :) In master, we use 13 runs for each server, for better precision.
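The discreteness is easy to check by brute force: with 7 runs per server there are C(14, 7) = 3432 ways to split the pooled sample, but the difference-of-medians statistic collapses to far fewer distinct values (a sketch with stand-in integer run times, not real measurements):

```python
from itertools import combinations
from statistics import median

# Stand-in for 14 pooled run times (distinct values, for counting only).
runs = list(range(14))
all_items = set(runs)

# Enumerate every 7/7 split and collect the distinct median differences.
values = set()
for group in combinations(runs, 7):
    other = all_items - set(group)
    values.add(median(group) - median(other))
```

Since the randomization distribution has only `len(values)` distinct atoms, extreme percentiles like the 99th are necessarily coarse at this sample size, which matches the comment above.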
alexey-milovidov left a comment
I've read the article, everything is perfect!