Skip to content

Median candidate analysis #10

@mlochbaum

Description

@mlochbaum

Have had this half-finished for a while; I'm writing up what I can even though there's more to investigate. CC @scandum, @Voultapher as I believe ipn's picked up glidesort's candidate approach.

I looked into how the choice of pivot candidates affects median accuracy using a simulation on random piecewise-linear inputs. These seem like they capture one type of order that might be expected in an input, and I don't have any other promising ideas for testing that.

I found that the accuracy depends not only on the way candidates are grouped into sets of 3 (what I initially wanted to test) but also the positions of those candidates (unexpected, but obvious in retrospect). So I measured both for an even distribution, and Glidesort's recursive selection, using midpoints of the nine intervals that come out of the recursion. I copied the arithmetic for this by hand; it looks right but it's possible I made a mistake here. I've also tested various positions not shown here. Error is measured as cost, assuming sorting each of the two partitions takes time proportional to n log(n). For context, the baseline cost is 8965784, so that a relatively high cost difference of 30000 adds 0.33% to total sorting cost due to that step (it'll compound at every partition). The tables below show cost for the best and worst groupings after testing every possibility, as well as some others:

  • 000111222 is the current glidesort scheme, which I find is fairly poor regardless of positions
  • 012012012 transposes it, and seems generally good
  • 012102120 stands out as consistently very good

The arrangement 011201220 that I've used and recommended previously does badly, often worse than 000111222.

Beyond any particular choice of grouping it doesn't seem like Glidesort's positions do well: the "true of 9" row is the true median of those and is worse than most pseudomedians with even spacing. The middle 4 makes Glidesort's positions skewed and prevents perfect performance on 1-piece (totally linear) inputs, but changing it to 3.5 isn't much of an improvement (in fact the true median is worse for 2-piece inputs). So I think it's mainly the clustering that weakens it, although I can't say exactly why the effect is so strong. Something that seems a little better is to use 3 as the midpoint for one or two intervals and 4 for the others.

I also did some theoretical analysis on the median-of-median-of-... idea. I find that 3^k candidates processed with recursive medians have about as much power as a true median of (π/2)*2.25^k pivots. For example the pseudomedian of 243 for a 32768-element list is worth a true median of 91. The probability distributions for the median value found this way with random candidates are equal at the exact midpoint, and the shapes seem similar (if anything the pseudomedian is wider). I can write the math up if you're interested.

Even spacing (0.5/9, ..., 8.5/9):

Partition Diagram 1 2 3 7
True of 3 ~147 0 13413 25575 77986
True of 9 012345678 0 1557 2940 8960
012102120 048/136/257 0 1557 3174 16728
012012012 036/147/258 0 1557 4276 18077
000111222 012/345/678 0 13413 20636 28802
001112022 016/234/578 35848 31942 29785 30945

Glidesort arrangement (approx 0.008, 0.07, 0.117, 0.508, 0.57, 0.617, 0.883, 0.945, 0.992):

Partition Diagram 1 2 3 7
True of 3 ~147 0 13413 25575 77986
True of 9 012345678 14184 69439 77943 74393
001122102 017/236/458 184 26029 43807 74776
012102120 048/136/257 14184 69439 79653 81007
012012012 036/147/258 14184 69439 84625 85439
000111222 012/345/678 14184 79120 97432 98255
011120220 058/123/467 39866 141889 139266 99083

Source for this, run with CBQN. I can translate to some other language on request. The inputs are created with •rand which changes between runs, but the results above don't change significantly.

# All partitions of 9 candidates into 3+3+3
part ← (⊐⊸≡∧(3⥊3)≡/⁼)¨⊸/⥊↕3⌊1+↕9
pfmt ← > ('0'⊸+ ⋈ ·∾⟜"/"⊸∾´ '0'+⊔)¨ part  # Display
# Hard-coded partitions of interest
pint ← "000111222"‿"012012012"‿"012102120" #‿"011201220"

# Make random piecewise-linear functions
GetFn ← {
  R ← •rand.Range
  p←0∾1∾˜∧(𝕩-2)R 0 ⋄ q←𝕩 R 0  # x endpoints, y endpoints
  m←q÷○(«⊸-)p ⋄ y←q-m×p       # Slope, y intercept
  {(⊏⟜y+𝕩×⊏⟜m)(1↓p)⍋𝕩}
}
dist ← GetFn¨ 1e3/≍2‿3‿4‿8    # 1e3 sets with each of 2, 3, 4, 8 vertices
Sample ← dist {𝕎𝕩}¨ <         # Sample values given positions

# Candidate sampling
Mid ← {(0.5+↕𝕩)÷𝕩}                  # Midpoints of equal-sized intervals, 𝕩 total
glide ← {t←0‿4‿7÷8 ⋄ ⥊t+⌜(t+÷16)÷8} # Midpoints of glidesort candidate intervals

# Scoring: cost increase relative to true median on 1e3 elements
list ← Sample Mid l←1e3
ScoreAll ← {+˝˘ (2×{𝕩×2⋆⁼𝕩}l÷2) -˜ +○{𝕩×2⋆⁼𝕩}⟜(l⊸-) list +´∘≤¨⎉∞‿¯1 𝕩}
MakeTable ← {
  cand ← Sample 𝕩
  med ← part (1⊑∧){𝔽𝔽¨}∘⊔⌜ cand                # Pseudomedians
  score ← +˝˘ scoremat ← ScoreAll med
  t39 ← > (⌊2÷˜3‿9) ⊑⟜∧¨¨ ⟨Sample Mid 3, cand⟩ # True medians
  ∾⟨
    ["True of 3"‿"147","True of 9"‿('0'+↕9)] ∾˘ ⌊ ScoreAll t39
    ((⌽⊸∨0=↕∘≠)∨pint∊˜⊏˘)⊸/ score ⍋⊸⊏ (⌊scoremat) ∾˘˜ pfmt
  ⟩
}
•Show∘MakeTable¨ ⟨Mid 9, glide⟩

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions