Conversation
|
@4ertus2 Do you have any performance benchmarks already comparing std::sort versus your custom radix sort on data that is the result of the If you were going to replace the sorting algorithm, my first hunch would be something like Timsort. Reason: it already takes into account the natural increasing or decreasing runs that I would expect to find in the resulting pre-sorted series, as there are probably multiple entries from the same block (which would usually already be sorted provided that the last entry of the primary key of the table is the timestamp) that get appended to the SortedLookupVector in order. |
Something like this (on my laptop). It's with clang7. https://pastebin.com/b1CMYS9B
We could make sort algo for ASOF JOIN as a setting. |
|
That's a bigger improvement than I was expecting - anyway, given that Clickhouse already includes boost, it might be worth it to run your benchmark with spreadsort (a radix sort), pdqsort and spinsort from https://www.boost.org/doc/libs/1_70_0/libs/sort/doc/html/index.html#sort.introduction.single_thread_algorithms as those are already well tested and should be quicker than std::sort. |
|
My test is wrong cause of multiple lines with same value in right ASOF column. I've re-checked it with correct queries and radix returns the same result as std::sort. |
|
How do the other sorts included in boost perform on your dataset? There is quite probably a pattern that appears in the order that the data gets put into the SortedLookupVector, the question is just to find the right sort to exploit that pattern for extra performance :) |
I haven't test the others yet. I'm on duty this week and going to have holidays the next two weeks so probably I cannot check it till the second half of may. It would be great if you have time to check what sort is the best ;) |
|
Cool. I'll have to try to dig up a representative (non-proprietary) dataset somewhere. |
|
@4ertus2 Found some time and ran some basic tests: the join performance does seem to be dependent on the data. It might be an option to come up with some representative use cases (which the one below is not) and then go for a single implementation, or add the join algorithm as a setting? (in my testing I just passed it in as an env var). Luckily there don't seem to be absolutely massive differences so it would probably be pretty low priority. |
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Category (leave one):
Short description (up to few sentences):
Use radixSort instead of std::sort for ASOF JOIN values.