Median candidate analysis

Have had this half-finished for a while; I'm writing up what I can even though there's more to investigate. CC @scandum, @Voultapher as I believe ipn's picked up glidesort's candidate approach.

I looked into how the choice of pivot candidates affects median accuracy using a simulation on random piecewise-linear inputs. These seem like they capture one type of order that might be expected in an input, and I don't have any other promising ideas for testing that.

I found that the accuracy depends not only on the way candidates are grouped into sets of 3 (what I initially wanted to test) but also the positions of those candidates (unexpected, but obvious in retrospect). So I measured both for an even distribution, and Glidesort's recursive selection, using midpoints of the nine intervals that come out of the recursion. I copied the arithmetic for this by hand; it looks right but it's possible I made a mistake here. I've also tested various positions not shown here. Error is measured as cost, assuming sorting each of the two partitions takes time proportional to n log(n). For context, the baseline cost is 8965784, so that a relatively high cost difference of 30000 adds 0.33% to total sorting cost due to that step (it'll compound at every partition). The tables below show cost for the best and worst groupings after testing every possibility, as well as some others:
- 000111222 is the current glidesort scheme, which I find is fairly poor regardless of positions
- 012012012 transposes it, and seems generally good
- 012102120 stands out as consistently very good

The arrangement 011201220 that I've used and recommended previously does badly, often worse than 000111222.

Beyond any particular choice of grouping it doesn't seem like Glidesort's positions do well: the "true of 9" row is the true median of those and is worse than most pseudomedians with even spacing. The middle 4 makes Glidesort's positions skewed and prevents perfect performance on 1-piece (totally linear) inputs, but changing it to 3.5 isn't much of an improvement (in fact the true median is worse for 2-piece inputs). So I think it's mainly the clustering that weakens it, although I can't say exactly why the effect is so strong. Something that seems a little better is to use 3 as the midpoint for one or two intervals and 4 for the others.

I also did some theoretical analysis on the median-of-median-of-... idea. I find that 3^k candidates processed with recursive medians have about as much power as a true median of (π/2)*2.25^k pivots. For example the pseudomedian of 243 for a 32768-element list is worth a true median of 91. The probability distributions for the median value found this way with random candidates are equal at the exact midpoint, and the shapes seem similar (if anything the pseudomedian is wider). I can write the math up if you're interested.

Even spacing (0.5/9, ..., 8.5/9):

Partition | Diagram     |     1 |      2 |      3 |     7
----------|:-----------:|------:|-------:|-------:|-----:
True of 3 | ~147        |     0 |  13413 |  25575 | 77986
True of 9 | 012345678   |     0 |   1557 |   2940 |  8960
012102120 | 048/136/257 |     0 |   1557 |   3174 | 16728
012012012 | 036/147/258 |     0 |   1557 |   4276 | 18077
000111222 | 012/345/678 |     0 |  13413 |  20636 | 28802
001112022 | 016/234/578 | 35848 |  31942 |  29785 | 30945

Glidesort arrangement (approx 0.008, 0.07, 0.117, 0.508, 0.57, 0.617, 0.883, 0.945, 0.992):

Partition | Diagram     |     1 |      2 |      3 |     7
----------|:-----------:|------:|-------:|-------:|-----:
True of 3 | ~147        |     0 |  13413 |  25575 | 77986
True of 9 | 012345678   | 14184 |  69439 |  77943 | 74393
001122102 | 017/236/458 |   184 |  26029 |  43807 | 74776
012102120 | 048/136/257 | 14184 |  69439 |  79653 | 81007
012012012 | 036/147/258 | 14184 |  69439 |  84625 | 85439
000111222 | 012/345/678 | 14184 |  79120 |  97432 | 98255
011120220 | 058/123/467 | 39866 | 141889 | 139266 | 99083

Source for this, run with [CBQN](https://github.com/dzaima/CBQN). I can translate to some other language on request. The inputs are created with `•rand` which changes between runs, but the results above don't change significantly.
```
# All partitions of 9 candidates into 3+3+3
part ← (⊐⊸≡∧(3⥊3)≡/⁼)¨⊸/⥊↕3⌊1+↕9
pfmt ← > ('0'⊸+ ⋈ ·∾⟜"/"⊸∾´ '0'+⊔)¨ part  # Display
# Hard-coded partitions of interest
pint ← "000111222"‿"012012012"‿"012102120" #‿"011201220"

# Make random piecewise-linear functions
GetFn ← {
  R ← •rand.Range
  p←0∾1∾˜∧(𝕩-2)R 0 ⋄ q←𝕩 R 0  # x endpoints, y endpoints
  m←q÷○(«⊸-)p ⋄ y←q-m×p       # Slope, y intercept
  {(⊏⟜y+𝕩×⊏⟜m)(1↓p)⍋𝕩}
}
dist ← GetFn¨ 1e3/≍2‿3‿4‿8    # 1e3 sets with each of 2, 3, 4, 8 vertices
Sample ← dist {𝕎𝕩}¨ <         # Sample values given positions

# Candidate sampling
Mid ← {(0.5+↕𝕩)÷𝕩}                  # Midpoints of equal-sized intervals, 𝕩 total
glide ← {t←0‿4‿7÷8 ⋄ ⥊t+⌜(t+÷16)÷8} # Midpoints of glidesort candidate intervals

# Scoring: cost increase relative to true median on 1e3 elements
list ← Sample Mid l←1e3
ScoreAll ← {+˝˘ (2×{𝕩×2⋆⁼𝕩}l÷2) -˜ +○{𝕩×2⋆⁼𝕩}⟜(l⊸-) list +´∘≤¨⎉∞‿¯1 𝕩}
MakeTable ← {
  cand ← Sample 𝕩
  med ← part (1⊑∧){𝔽𝔽¨}∘⊔⌜ cand                # Pseudomedians
  score ← +˝˘ scoremat ← ScoreAll med
  t39 ← > (⌊2÷˜3‿9) ⊑⟜∧¨¨ ⟨Sample Mid 3, cand⟩ # True medians
  ∾⟨
    ["True of 3"‿"147","True of 9"‿('0'+↕9)] ∾˘ ⌊ ScoreAll t39
    ((⌽⊸∨0=↕∘≠)∨pint∊˜⊏˘)⊸/ score ⍋⊸⊏ (⌊scoremat) ∾˘˜ pfmt
  ⟩
}
•Show∘MakeTable¨ ⟨Mid 9, glide⟩
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Median candidate analysis #10

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Partition	Diagram	1	2	3	7
True of 3	~147	0	13413	25575	77986
True of 9	012345678	0	1557	2940	8960
012102120	048/136/257	0	1557	3174	16728
012012012	036/147/258	0	1557	4276	18077
000111222	012/345/678	0	13413	20636	28802
001112022	016/234/578	35848	31942	29785	30945

Median candidate analysis #10

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions