-
Notifications
You must be signed in to change notification settings - Fork 25
Description
Have had this half-finished for a while; I'm writing up what I can even though there's more to investigate. CC @scandum, @Voultapher as I believe ipn's picked up glidesort's candidate approach.
I looked into how the choice of pivot candidates affects median accuracy using a simulation on random piecewise-linear inputs. These seem like they capture one type of order that might be expected in an input, and I don't have any other promising ideas for testing that.
I found that the accuracy depends not only on the way candidates are grouped into sets of 3 (what I initially wanted to test) but also the positions of those candidates (unexpected, but obvious in retrospect). So I measured both for an even distribution, and Glidesort's recursive selection, using midpoints of the nine intervals that come out of the recursion. I copied the arithmetic for this by hand; it looks right but it's possible I made a mistake here. I've also tested various positions not shown here. Error is measured as cost, assuming sorting each of the two partitions takes time proportional to n log(n). For context, the baseline cost is 8965784, so that a relatively high cost difference of 30000 adds 0.33% to total sorting cost due to that step (it'll compound at every partition). The tables below show cost for the best and worst groupings after testing every possibility, as well as some others:
- 000111222 is the current glidesort scheme, which I find is fairly poor regardless of positions
- 012012012 transposes it, and seems generally good
- 012102120 stands out as consistently very good
The arrangement 011201220 that I've used and recommended previously does badly, often worse than 000111222.
Beyond any particular choice of grouping it doesn't seem like Glidesort's positions do well: the "true of 9" row is the true median of those and is worse than most pseudomedians with even spacing. The middle 4 makes Glidesort's positions skewed and prevents perfect performance on 1-piece (totally linear) inputs, but changing it to 3.5 isn't much of an improvement (in fact the true median is worse for 2-piece inputs). So I think it's mainly the clustering that weakens it, although I can't say exactly why the effect is so strong. Something that seems a little better is to use 3 as the midpoint for one or two intervals and 4 for the others.
I also did some theoretical analysis on the median-of-median-of-... idea. I find that 3^k candidates processed with recursive medians have about as much power as a true median of (π/2)*2.25^k pivots. For example the pseudomedian of 243 for a 32768-element list is worth a true median of 91. The probability distributions for the median value found this way with random candidates are equal at the exact midpoint, and the shapes seem similar (if anything the pseudomedian is wider). I can write the math up if you're interested.
Even spacing (0.5/9, ..., 8.5/9):
| Partition | Diagram | 1 | 2 | 3 | 7 |
|---|---|---|---|---|---|
| True of 3 | ~147 | 0 | 13413 | 25575 | 77986 |
| True of 9 | 012345678 | 0 | 1557 | 2940 | 8960 |
| 012102120 | 048/136/257 | 0 | 1557 | 3174 | 16728 |
| 012012012 | 036/147/258 | 0 | 1557 | 4276 | 18077 |
| 000111222 | 012/345/678 | 0 | 13413 | 20636 | 28802 |
| 001112022 | 016/234/578 | 35848 | 31942 | 29785 | 30945 |
Glidesort arrangement (approx 0.008, 0.07, 0.117, 0.508, 0.57, 0.617, 0.883, 0.945, 0.992):
| Partition | Diagram | 1 | 2 | 3 | 7 |
|---|---|---|---|---|---|
| True of 3 | ~147 | 0 | 13413 | 25575 | 77986 |
| True of 9 | 012345678 | 14184 | 69439 | 77943 | 74393 |
| 001122102 | 017/236/458 | 184 | 26029 | 43807 | 74776 |
| 012102120 | 048/136/257 | 14184 | 69439 | 79653 | 81007 |
| 012012012 | 036/147/258 | 14184 | 69439 | 84625 | 85439 |
| 000111222 | 012/345/678 | 14184 | 79120 | 97432 | 98255 |
| 011120220 | 058/123/467 | 39866 | 141889 | 139266 | 99083 |
Source for this, run with CBQN. I can translate to some other language on request. The inputs are created with •rand which changes between runs, but the results above don't change significantly.
# All partitions of 9 candidates into 3+3+3
part ← (⊐⊸≡∧(3⥊3)≡/⁼)¨⊸/⥊↕3⌊1+↕9
pfmt ← > ('0'⊸+ ⋈ ·∾⟜"/"⊸∾´ '0'+⊔)¨ part # Display
# Hard-coded partitions of interest
pint ← "000111222"‿"012012012"‿"012102120" #‿"011201220"
# Make random piecewise-linear functions
GetFn ← {
R ← •rand.Range
p←0∾1∾˜∧(𝕩-2)R 0 ⋄ q←𝕩 R 0 # x endpoints, y endpoints
m←q÷○(«⊸-)p ⋄ y←q-m×p # Slope, y intercept
{(⊏⟜y+𝕩×⊏⟜m)(1↓p)⍋𝕩}
}
dist ← GetFn¨ 1e3/≍2‿3‿4‿8 # 1e3 sets with each of 2, 3, 4, 8 vertices
Sample ← dist {𝕎𝕩}¨ < # Sample values given positions
# Candidate sampling
Mid ← {(0.5+↕𝕩)÷𝕩} # Midpoints of equal-sized intervals, 𝕩 total
glide ← {t←0‿4‿7÷8 ⋄ ⥊t+⌜(t+÷16)÷8} # Midpoints of glidesort candidate intervals
# Scoring: cost increase relative to true median on 1e3 elements
list ← Sample Mid l←1e3
ScoreAll ← {+˝˘ (2×{𝕩×2⋆⁼𝕩}l÷2) -˜ +○{𝕩×2⋆⁼𝕩}⟜(l⊸-) list +´∘≤¨⎉∞‿¯1 𝕩}
MakeTable ← {
cand ← Sample 𝕩
med ← part (1⊑∧){𝔽𝔽¨}∘⊔⌜ cand # Pseudomedians
score ← +˝˘ scoremat ← ScoreAll med
t39 ← > (⌊2÷˜3‿9) ⊑⟜∧¨¨ ⟨Sample Mid 3, cand⟩ # True medians
∾⟨
["True of 3"‿"147","True of 9"‿('0'+↕9)] ∾˘ ⌊ ScoreAll t39
((⌽⊸∨0=↕∘≠)∨pint∊˜⊏˘)⊸/ score ⍋⊸⊏ (⌊scoremat) ∾˘˜ pfmt
⟩
}
•Show∘MakeTable¨ ⟨Mid 9, glide⟩