For complex processes which resist modeling, a practical option is to use historical data from the same process. We actually used to do this for ClickHouse: for each tested commit, we measured the run times of each test query and saved them into a database. We could then compare the patched server to these reference values, build graphs of changes over time, and so on. The main problem with this approach is the systematic error introduced by the environment. Sometimes the performance testing task lands on a machine with a dying HDD, or `atop` gets updated to a broken version that doubles the cost of every kernel call, et cetera, ad infinitum. This is why we now employ another approach.
We run the reference version of the server process and the tested version simultaneously on the same machine, and run the test queries on each of them in turn, one by one. This way we eliminate most systematic errors, because both servers are equally influenced by them. We can then compare the set of results we got from the reference server process and the set from the test server process, to see whether they look the same. Comparing the distributions using two samples is a very interesting problem in itself. We use a non-parametric bootstrap method to build a randomization distribution for the observed difference of median query run times. This method is described in detail in [[1]](#ref1), where they apply it to see how changing a fertilizer mixture changes the yield of tomato plants. ClickHouse is not much different from tomatoes, only we have to check how changes in code influence the performance.
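To make the idea concrete, here is a minimal sketch of such a randomization (permutation) test on the difference of medians, with made-up run times — not ClickHouse's actual implementation, just the general technique:

```python
import random

def randomization_test(ref, test, n_resamples=10000, rng=None):
    """Permutation test for the difference of median run times.

    Pools the two samples, repeatedly reshuffles the pool into two
    groups of the original sizes, and counts how often a random split
    produces a median difference at least as large as the observed one.
    """
    rng = rng or random.Random(0)

    def median(xs):
        s = sorted(xs)
        n = len(s)
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

    observed = abs(median(test) - median(ref))
    pooled = list(ref) + list(test)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(median(pooled[:len(ref)]) - median(pooled[len(ref):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_resamples

# Hypothetical run times (seconds) for one query on two servers, 7 runs each.
ref  = [1.02, 0.98, 1.05, 1.01, 0.99, 1.03, 1.00]
test = [1.10, 1.12, 1.08, 1.15, 1.09, 1.11, 1.13]
p = randomization_test(ref, test)
```

If the two servers behave the same, splits of the pooled sample are exchangeable, so a small `p` suggests a real difference rather than noise.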
> We use a non-parametric bootstrap method to build a randomization distribution
It is not so clear why, actually. The good question, in general, is what we can do to compare two distributions, and why methods like Student's t-test (the distribution is never normal), Mann–Whitney (observations should be independent) or Wilcoxon (why?) don't apply. What is the expected number of measurements for some reasonable p-values (which ones?), and why does bootstrap help?
Wilcoxon signed-rank test should be suitable. The added benefit of using the randomization distribution is that it allows you to directly estimate how unstable the test is (e.g. how big a difference you can see at a given p-value even without any changes). But the main reason I use it is that it doesn't involve any math and is easy to understand.
Not sure about the best number of runs and how to calculate it, I just took the least number where most tests became relatively stable: 7 runs on each server give expected change below 10% for p=0.01.
Probably we need another article that explains these choices, written by a statistics expert, not me...
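The "how big a difference can you see even without changes" idea above can be sketched directly: take run times from identical servers, build the randomization distribution of the median difference, and read off its upper quantile as the smallest relative change the test could distinguish at that p-value. The numbers below are hypothetical:

```python
import random
from statistics import median

def detectable_change(runs, n_resamples=10000, p=0.01, rng=None):
    """Estimate the smallest relative median change detectable at
    significance level p, from the randomization distribution of the
    data under the no-change hypothesis."""
    rng = rng or random.Random(0)
    pooled = list(runs)
    half = len(pooled) // 2
    diffs = []
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diffs.append(abs(median(pooled[:half]) - median(pooled[half:])))
    diffs.sort()
    threshold = diffs[int((1 - p) * n_resamples) - 1]  # ~(1-p) quantile
    return threshold / median(runs)

# Hypothetical: 14 run times for the same query on identical servers.
runs = [1.00, 1.02, 0.98, 1.05, 1.01, 0.97, 1.03,
        0.99, 1.04, 1.00, 1.02, 0.98, 1.01, 1.03]
rel = detectable_change(runs)
```

A change smaller than `rel` (relative to the median) would be indistinguishable from noise at this p-value, which is one way to pick the number of runs.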
I mean, at 7 runs even the p=0.01 doesn't really make sense, because you can't actually calculate the 99th percentile. The randomization distribution is discrete, and in this case it has like 16 values or something, so the max value takes more than one percent. But it works somehow :) In master, we use 13 runs for each server, for better precision.
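The discreteness is easy to check by brute force: with 7 runs per server there are C(14, 7) = 3432 ways to split the pooled sample, but the difference-of-medians statistic collapses to far fewer distinct values (a sketch with stand-in integer run times, not real measurements):

```python
from itertools import combinations
from statistics import median

# Stand-in for 14 pooled run times (distinct values, for counting only).
runs = list(range(14))
all_items = set(runs)

# Enumerate every 7/7 split and collect the distinct median differences.
values = set()
for group in combinations(runs, 7):
    other = all_items - set(group)
    values.add(median(group) - median(other))
```

Since the randomization distribution has only `len(values)` distinct atoms, extreme percentiles like the 99th are necessarily coarse at this sample size, which matches the comment above.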
alexey-milovidov left a comment
I've read the article, everything is perfect!