Add SingeliSort#6

Merged
Voultapher merged 1 commit into main from add-singeli-sort
Jun 11, 2023
Conversation

@Voultapher
Owner

Still WIP: SingeliSort crashes for certain patterns, and the u64 sort is missing.

@Voultapher
Owner Author

The crash can be reproduced with OVERRIDE_SEED=1215502009629807262 cargo t descending_saw. See the project README on how to build the code.

@Voultapher Voultapher mentioned this pull request Jun 11, 2023
@mlochbaum

Not super surprising: various components, like parity_cuts for doing branchless merging on any length combination, are new, and none of it is that well tested. Is there a way to log the data being sorted, or build the sort with debug symbols?

@dzaima

dzaima commented Jun 11, 2023

Missing a *4 (i.e. sizeof(int32_t)) on the aux size argument to sort32. That is, this passes the test in question:

void singelisort_i32(int32_t* data, size_t len) {
  std::vector<int32_t> aux_memory{};
  aux_memory.reserve(aux_alloc_size(len));
  sort32(data, static_cast<uint64_t>(len), aux_memory.data(),
         aux_memory.capacity()*4);
}

(also, debug diff)

@mlochbaum

And I've changed sort64 to unsigned sort_u64 for a new compiled/sort.c. The obvious singelisort_u64 implementation seems to work; I don't know how to interpret all the test output but nothing looks wrong.

@Voultapher
Owner Author

@dzaima good catch. You are right, the aux len is passed as a size in bytes, not as a size in elements. That's a kind of unfortunate inconsistency compared to how the input len is passed.

@Voultapher
Owner Author

And I've changed sort64 to unsigned sort_u64 for a new compiled/sort.c. The obvious singelisort_u64 implementation seems to work; I don't know how to interpret all the test output but nothing looks wrong.

Thanks, will check it out.

@Voultapher
Owner Author

Not super surprising: various components like parity_cuts to do branchless merging on any length combination are new,

I have a generalized version of parity merge: https://github.com/Voultapher/sort-research-rs/blob/main/src/graveyard/ipn_stable_with_probe_common.rs#L1064. If you don't care about Ord safety you can even remove some of the checks.

@mlochbaum

Aux is interpreted as many different types by different algorithms, so I felt like bytes was the only measure that made sense.

The parity_cuts method is different from your version (or glidesort's) in that every merge step it does is unguarded. Those checks in the bi_directional_merge while loop look potentially costly. My version is kind of over-engineered though. If the size imbalance is large enough, it's better off skipping the cuts and doing a galloping-style merge that skips through the larger argument quickly. Probably only one set of cuts is useful.
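To illustrate what "unguarded" means here, below is a minimal sketch of a branchless parity merge for the special case of two equal-length sorted runs. This is an illustration only, not the SingeliSort kernel (parity_cuts extends the idea to arbitrary length combinations). No step in the loop checks bounds: after k < n forward steps each head index has advanced at most k, so it cannot leave its run, and the backward tails mirror this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Branchless parity merge of two equal-length sorted runs stored
// contiguously in src: left run in src[0..n), right run in src[n..2n).
// The n forward steps produce the n smallest outputs, the n backward
// steps the n largest; no step needs a bounds check.
void parity_merge(const int32_t* src, size_t n, int32_t* dst) {
    size_t l = 0, r = n;                 // forward heads
    ptrdiff_t le = (ptrdiff_t)n - 1;     // backward tails (may pass -1
    ptrdiff_t re = 2 * (ptrdiff_t)n - 1; // after the final decrement)
    for (size_t k = 0; k < n; k++) {
        // Forward step: emit the smaller head (<= keeps it stable).
        bool take_l = src[l] <= src[r];
        dst[k] = take_l ? src[l] : src[r];
        l += take_l;
        r += !take_l;
        // Backward step: emit the larger tail.
        bool take_r = src[re] >= src[le];
        dst[2 * n - 1 - k] = take_r ? src[re] : src[le];
        re -= take_r;
        le -= !take_r;
    }
}
```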

@Voultapher
Owner Author

Tests now pass, but I didn't do any fancy fuzzing or the like.

@Voultapher
Owner Author

Running benchmarks now, will post results here.

@Voultapher Voultapher merged commit 8b5d72c into main Jun 11, 2023
@Voultapher
Owner Author

Aux is interpreted as many different types by different algorithms, so I felt like bytes was the only measure that made sense.

That is an internal story, you could easily divide by sizeof(T). From an API standpoint I suggest you follow the approach other implementations use: uniform metrics. Also, it would be great to get an error if the aux size is too small instead of a crash. I took the aux size from the benchmark code; it probably could do with some wrappers that take care of the allocation.
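A minimal sketch of that suggested API shape: a checked entry point that reports an undersized aux buffer instead of crashing, plus an allocating wrapper that owns the buffer. The names, the error convention, and the sort32/aux_alloc_size stand-ins below are all hypothetical; the real sort32 comes from compiled/sort.c and the real aux_alloc_size from the test harness.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Self-contained stand-ins so the sketch compiles on its own; these are
// NOT the real SingeliSort functions.
static void sort32(int32_t* data, uint64_t len, void* aux, uint64_t aux_bytes) {
    (void)aux; (void)aux_bytes;
    std::sort(data, data + len);
}
static size_t aux_alloc_size(size_t len) { return len; } // in elements

// Checked entry: refuse to run with too little aux instead of crashing.
bool singelisort_i32_checked(int32_t* data, size_t len,
                             void* aux, size_t aux_bytes) {
    if (aux_bytes < aux_alloc_size(len) * sizeof(int32_t)) return false;
    sort32(data, (uint64_t)len, aux, (uint64_t)aux_bytes);
    return true;
}

// Allocating wrapper: owns the aux buffer and passes its size in bytes,
// the unit sort32 expects.
void singelisort_i32_alloc(int32_t* data, size_t len) {
    std::vector<int32_t> aux(aux_alloc_size(len));
    bool ok = singelisort_i32_checked(data, len, aux.data(),
                                      aux.size() * sizeof(int32_t));
    assert(ok);
    (void)ok;
}
```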

@Voultapher
Owner Author

Interactive graphs analysis.tar.gz

Results look impressive. Surprisingly though, the pattern performance shows large differences between i32 and u64; random_d20 and random_z1 seem to be affected the most.

@mlochbaum

SingeliSort is modular, so if you want say a pure counting sort you can make a variation on sort.singeli and compile one for yourself. Building in a division makes this more confusing. Also aux is consistently a void pointer and I just think it's weird to measure it in varying units—it's bench.c that's sloppy about this, not SingeliSort. There's a lot to clean up about aux usage, and I'd like to be able to operate with less space for large arrays. The distribution base cases generally check aux size; merging in glidesort should use swaps to cut down on space if there isn't enough. I'll add an allocating version at some point.

I can reproduce the random_d20 slowdown in your benchmarks, but not in C (including compiled as C++) using rand()%20. There u64 is a little better than half as fast as i32, as expected: it goes to counting sort and most of the time is just writing the results. And it should perform the same on any random-ish list of integers in [0,20), no matter how they're distributed. Any ideas as to what would be different?

@Voultapher
Owner Author

Any ideas as to what would be different?

I guess the way I produce my u64:

.map(|val| -> u64 {
    // Extends the value into the 64 bit range,
    // while preserving input order.
    let x = ((*val as i64) + (i32::MAX as i64) + 1) as u64;
    x.checked_mul(i32::MAX as u64).unwrap()
})

@mlochbaum

mlochbaum commented Jun 11, 2023

All right, so it no longer fits into a small range. Using INT32_MAX*((T)(rand()%20) + ((T)INT32_MAX+1)) I get similar timings (incidentally, I believe let x = ((*val as i64) - (i32::MIN as i64)) as u64 is equivalent and find it clearer).

The slower results from Zipf make sense given that Robin Hood sorting doesn't apply to such a distribution, so it has to actually do some merging. RH doesn't apply to d20 either, but quicksort's low-cardinality handling does. For Zipf there's apparently not enough duplication; it improves to be nearly equal to uniform random when the size is bumped to a million.

I'd suggest having at least one benchmark that does some distribution over a small number of values picked randomly from the entire range (or some large range in the case of strings etc.).

@Voultapher
Owner Author

I'd suggest having at least one benchmark that does some distribution over a small number of values picked randomly from the entire range (or some large range in the case of strings etc.).

I have a collection of additional patterns

let mut extra_pattern_providers: Vec<(&'static str, fn(usize) -> Vec<i32>)> = vec![

I'm totally open to adding new patterns there. In terms of default patterns, I want to have 8 at most, for reasons of readability and color palette. Which means any new default pattern has to replace an existing one; I've done that a couple times in the past. But it needs a good reason why that new pattern is more meaningful than one of the existing ones.

I believe let x = ((*val as i64) - (i32::MIN as i64)) as u64 is equivalent and find it clearer).

That doesn't seem to be true: 0 -> 4611686016279904256 vs 2147483648, and 3 -> 4611686022722355197 vs 2147483651.

@mlochbaum

I'd suggest randomizing the range of z1 then. Zipf is the most like "generic data", and a sort optimizing it just because it fits in a small [0,k) range isn't very interesting; it's better accounted for in the d20 case anyway. I think it makes sense to have variations with a fixed cardinality and random range (random_c20 etc. maybe) but there's no reason to put them in the main set of 8. Unrelated comment, I doubt descending benchmarks are relevant very often in real use, certainly much less than ascending ones. But people may start asking questions if you leave that case out.

Sorry for the confusion: the i32::MIN expression just replaces one line, and you still need the multiplication after.
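A quick check of the claimed equivalence: adding INT32_MAX + 1 and subtracting INT32_MIN are the same shift into the non-negative range, since INT32_MAX + 1 == -(int64_t)INT32_MIN == 2147483648, and both forms are then multiplied by INT32_MAX. (C++ rendering of the Rust expressions above; the function names are just for illustration.)

```cpp
#include <cstdint>

// Original form: (v + i32::MAX + 1) * i32::MAX
static uint64_t extend_add(int32_t v) {
    return (uint64_t)((int64_t)v + (int64_t)INT32_MAX + 1) * (uint64_t)INT32_MAX;
}
// Suggested form: (v - i32::MIN) * i32::MAX
static uint64_t extend_sub(int32_t v) {
    return (uint64_t)((int64_t)v - (int64_t)INT32_MIN) * (uint64_t)INT32_MAX;
}
```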

@mlochbaum

mlochbaum commented Jun 12, 2023

The 99p_zero_1p_random-10000 test uncovered a segfault caused by parity_cuts, which I've fixed, allowing all EXTRA_PATTERNS=1 tests (on 10000 elements at least) to be run. I'm still seeing wrong results on this data in my testing, so more fixes to come.

SingeliSort does poorly with repeated element tests like u64-random_d1024-10000, somewhat worse than fluxsort and more like pdq. This is a case where Robin Hood sorting is used but is detrimental. The ratio of ~10 unique copies of each element is the worst case because it doesn't get caught by a sample but ends up with about 5 collisions per hash insertion, which is a lot. On the other hand if there were a way of dealing with exact duplicates, probably gated on some sort of collision counter, then RH would end up much faster than other sorts. I tested out a method using a hash table with 4-byte ints in BQN, and it came out even better than a radix sort. This would be a not-even-unstable optimization like counting sort.

For larger numbers of duplicates as in u64-random_d64-10000, switching to merge sort so quickly (<192) is detrimental because it doesn't adapt to duplicates. Also that threshold was tuned with i32 and it looks like something around 96 is better for u64. When the number gets large enough like u64-random_d20-10000 this no longer matters, but SingeliSort is still a little behind fluxsort. Annoying since I designed fluxsort's low-cardinality handling in the first place (with reference to pdqsort). I think SingeliSort is doing too much work when it gets to a partition that has all equal values. [I'll just edit this bit in: wrong, that part could maybe be improved but is about as good as fluxsort. The cost is a mix of switching to merge sort or other base cases, and overhead from doing a sqrt(n) sample which becomes relevant because the total cost is so low.]

I haven't said it yet: thanks for all your work on these tests!

@Voultapher
Owner Author

I'd suggest randomizing the range of z1 then. Zipf is the most like "generic data", and a sort optimizing it just because it fits in a small [0,k) range isn't very interesting; it's better accounted for in the d20 case anyway. I think it makes sense to have variations with a fixed cardinality and random range (random_c20 etc. maybe) but there's no reason to put them in the main set of 8. Unrelated comment, I doubt descending benchmarks are relevant very often in real use, certainly much less than ascending ones. But people may start asking questions if you leave that case out.

I want to look into that in detail, thanks for the suggestion. Not sure when I'll find the time.

The 99p_zero_1p_random-10000 test uncovered a segfault caused by parity_cuts, which I've fixed allowing all EXTRA_PATTERNS=1 tests (on 10000 elements at least) to be run. I'm still seeing wrong results on this data in my testing, so more fixes to come.

Interesting, you are right: in addition to my test suite, the benchmark suite, by virtue of running many times and with additional patterns, can be seen as a kind of test.

SingeliSort does poorly with repeated element tests like u64-random_d1024-10000, somewhat worse than fluxsort and more like pdq. This is a case where Robin Hood sorting is used but is detrimental. The ratio of ~10 unique copies of each element is the worst case because it doesn't get caught by a sample but ends up with about 5 collisions per hash insertion, which is a lot. On the other hand if there were a way of dealing with exact duplicates, probably gated on some sort of collision counter, then RH would end up much faster than other sorts. I tested out a method using a hash table with 4-byte ints in BQN, and it came out even better than a radix sort. This would be a not-even-unstable optimization like counting sort.

I have to admit that I haven't looked closely at radix or counting sorts yet, so I only understood half of that.

For larger numbers of duplicates as in u64-random_d64-10000, switching to merge sort so quickly (<192) is detrimental because it doesn't adapt to duplicates. Also that threshold was tuned with i32 and it looks like something around 96 is better for u64. When the number gets large enough like u64-random_d20-10000 this no longer matters, but SingeliSort is still a little behind fluxsort.

One idea I can provide, which also fixes some of the issues in glidesort, is to have a sqrt(len) minimum run length below which runs are rejected and unsorted runs of that minimum length are created instead.
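A minimal sketch of that idea, with hypothetical names (Run, next_run; this is not glidesort or SingeliSort code): the run scanner rejects any natural run shorter than the minimum and hands back a fixed-size "unsorted" chunk for a small-sort to handle, so scattered short runs can't push everything onto the merge path.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

struct Run { size_t start; size_t len; bool sorted; };

// Length of the non-decreasing run starting at index i.
static size_t ascending_run_len(const int32_t* a, size_t i, size_t n) {
    size_t j = i + 1;
    while (j < n && a[j - 1] <= a[j]) j++;
    return j - i;
}

// Accept a natural run only if it meets min_run (e.g. ~sqrt(len));
// otherwise emit a min_run-sized unsorted chunk instead.
Run next_run(const int32_t* a, size_t i, size_t n, size_t min_run) {
    size_t r = ascending_run_len(a, i, n);
    if (r >= min_run) return Run{i, r, true};
    return Run{i, std::min(min_run, n - i), false};
}
```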

Annoying since I designed fluxsort's low-cardinality handling in the first place (with reference to pdqsort). I think SingeliSort is doing too much work when it gets to a partition that has all equal values.

I wasn't aware you were involved in the development of fluxsort; you are not credited as far as I can tell.

@mlochbaum

mlochbaum commented Jun 12, 2023

Glidesort's using a sqrt(n/2) threshold as of v0.1.2: commit. This seems not very principled to me. Working it out now, if merge is 20% slower than partition and the whole array is made of length-r runs, you'd want to switch over when 1.2 log(n/r) = log(n), or 0.2 log(n) = 1.2 log(r), which gives r = n^(1/6). Square root would be appropriate for a factor of 2 difference, but the more important concern is low-cardinality versus natural runs where the length of the run is just not enough information. And for only one run, even making the top step a merge costs 0.2 n, so O(n), for a savings of O(sqrt(n)log(n)) so the asymptotics are still not right there.
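A numeric check of that crossover: with merging 20% slower than partitioning, merging an array made of length-r sorted runs costs about 1.2·log(n/r) passes versus log(n) for sorting from scratch, and 1.2·log(n/r) = log(n) gives 0.2·log(n) = 1.2·log(r), i.e. r = n^(1/6). (The function names below are just for this check.)

```cpp
#include <cmath>

// Relative pass counts, in the units used above: merging pre-sorted
// runs of length r is 20% slower per pass but needs fewer passes.
static double merge_passes(double n, double r) { return 1.2 * std::log(n / r); }
static double sort_passes(double n) { return std::log(n); }
```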

I've mangled glidesort quite a bit. The run finder actually always searches until it finds a long enough run, so unsorted sections all get grouped together. Then in the logical merge part it ignores sortedness if it would merge an unsorted section with a substantially smaller sorted one. Still not perfect, but one run in the middle can't force everything to use merges.

I just thought of a simple test for a single repeated element: once you find the run, compare the elements 1/3 and 2/3 along it: they're always equal if 2/3 of elements are the same and never if less than 1/3 are. Not a full solution but it seems nice and practical.
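A minimal sketch of that probe (the function name is hypothetical): in a sorted run, equal elements are contiguous, so if at least 2/3 of the run is one repeated value the elements one third and two thirds of the way in both land inside that block and compare equal, while if no value covers 1/3 of the run they never can. The boundary cases in between are what make this "not a full solution".

```cpp
#include <cstddef>
#include <cstdint>

// Probe a sorted run for a dominant repeated element by comparing the
// elements 1/3 and 2/3 of the way along it.
bool probe_repeated(const int32_t* run, size_t len) {
    return len >= 3 && run[len / 3] == run[(2 * len) / 3];
}
```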

The issue with u64-random_d64-10000 was quicksort switching to the base merge though: the glide layer never gets involved as there aren't natural runs.

There's a comment in fluxsort's source. I don't care about attribution though, and have told Scandum that.

@Voultapher Voultapher deleted the add-singeli-sort branch July 22, 2023 14:09