Unlocking Boxes: A Prisoner Puzzle Variant

I had an interesting discussion about the recent prisoner puzzle post, which reminded me of another nice problem that has a similar feel.  This problem is of the sort that can be tackled from a few different angles, from computer simulation to cocktail napkin.  I will wait until next week (or until solutions in the comments) to discuss the useful nugget of mathematics inside this one.

You are being held prisoner by the usual bloodthirsty-yet-eccentric pirates, who offer the following scenario.  In a room are 100 boxes.  For each of the boxes, which lock upon closing, there is one key which opens only that box.  The 100 keys are distributed randomly in the boxes, one key per box, and the boxes are then closed.  You may select any 50 of the boxes, which will then be broken open so that you can retrieve the keys inside, possibly allowing you to open additional boxes.  If you are able to open all of the boxes, the pirates will set you free; otherwise, you will be forced to walk the plank.

What is your probability of survival?  (As usual, generalize to breaking open k of n boxes.)

An additional variant: how does your situation change if you do not have to select all of the boxes to break open ahead of time?  That is, you are allowed to select and break open a single box, unlock any additional boxes with the resulting keys, then select and break open a second box, etc.
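If you want to take the computer-simulation angle first, here is a quick Monte Carlo sketch of the original (choose-all-boxes-up-front) version.  The function names are mine, and by symmetry I fix the k broken boxes to be the first k:

```python
import random

def survives(n, k):
    """One trial: keys are a random permutation; boxes 0..k-1 are broken open."""
    key = list(range(n))
    random.shuffle(key)          # key[i] = the key found inside box i
    opened = set(range(k))       # the k boxes broken open up front
    frontier = list(opened)
    while frontier:              # follow chains of keys out of opened boxes
        found = key[frontier.pop()]
        if found not in opened:
            opened.add(found)
            frontier.append(found)
    return len(opened) == n      # survive iff every box gets opened

def estimate(n=100, k=50, trials=100_000):
    return sum(survives(n, k) for _ in range(trials)) / trials

print(estimate())
```

I will leave the output as an exercise, so as not to spoil the cocktail-napkin version.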

(Unfortunately, I do not recall where I first encountered this problem.  I think it may be from one of West’s combinatorics texts.)

Re-Measuring Morton

“Unconscious manipulation of data may be a scientific norm.” – Stephen Jay Gould

“You’re just a woman with a small brain.  With a brain a third the size of us.  It’s science.” – Ron Burgundy, Anchorman

The subject this week is a recent paper by Lewis et al. titled “The Mismeasure of Science: Stephen Jay Gould versus Samuel George Morton on Skulls and Bias.”  The paper refers to the 1978 paper by Stephen Jay Gould, “Morton’s Ranking of Races by Cranial Capacity,” which is in turn a critique of the work by Samuel George Morton in the mid-19th century.

Background

Let’s start at the beginning.  Morton lived in a time when an important question was whether the various human populations were separately divinely created species (polygenesis), or a single divinely created species that subsequently diverged (monogenesis).  This was before Darwin, remember.  To address this question, Morton collected over 600 skulls from around the world, and measured their cranial capacity, with results showing variation in average capacity between groups, supporting his argument for polygenesis.

Gould argued instead that Morton skewed his data to support his a priori convictions, and re-analyzed Morton’s measurements, showing that “there are no differences to speak of among Morton’s races.”  Based in large part on Gould’s work, which is “widely read, frequently cited, and still commonly assigned in university courses,” Morton is now– or was, anyway– regarded as “a canonical example of scientific misconduct.”

Which brings us to this most recent study, in which a group of anthropologists actually re-measured the skulls in Morton’s collection (which Gould never did).  Their results contradict Gould, showing that Morton’s data and methodology were reliable “despite his clear bias.”  Indeed, “Gould’s own analysis of Morton is likely the stronger example of a bias influencing results.”

Thoughts

Why is this important?  I think that both Gould and Lewis et al. discuss two related but very different issues, and I am really only interested in one of them here:

  1. Whether Morton’s conclusions were wrong; and
  2. Whether Morton’s conclusions were wrong due to a priori bias influencing methodology.

I am less interested in (1); although Morton’s reputation as an empiricist deserves vindication, I think the actual subject of cranial capacity variation and its support for monogenesis or polygenesis are no longer terribly relevant (unless you are still a creationist, in which case I am curious as to how this issue gets resolved).  I instead want to focus on (2).  The valid question raised by Gould is: is it even possible to conduct scientific study and obtain objective data and conclusions, when we are always influenced by our own preconceptions, biases, and beliefs?

(Aside: This same question has come up here before, particularly in the comments on this early post.  Interestingly, on re-reading that post, I see some similarities with another of Gould’s ideas, “non-overlapping magisteria” (NOMA).  But I think my conclusion differs from Gould’s, in that mine is not quite as friendly to actual practiced religions.)

Morton’s study is a vivid example of this question.  To a reader of his work today, the racist views common in the 19th century seem outrageous, and would seem to clearly indicate a priori bias.  And he is certainly not the only example of such bias; as Gould points out, Isaac Newton and Gregor Mendel also both “fudged” results to support some of their theories.

An interesting related question is: how much does it matter, at least in the end game?  As Gould puts it, “After all, Newton and Mendel were right.”  In other words, the world works a certain way no matter whether or how we observe or comment on it.  There are “right” answers out there, independent of our questionable motives, clumsy methodologies, or erroneous refutations.  As another example, consider one creationist argument that rejects the theory of evolution because Darwin was motivated by a desire to reject God.  On one hand, I think this is simply wrong, or at least a misrepresentation of the man.  But on the other more important hand, who cares?  Motivation doesn’t make a theory wrong, observation does.

In the end, based on their re-analysis of Morton’s work, Lewis et al. reject Gould’s suggestion that “unconscious or dimly perceived finagling is probably endemic in science, since scientists are human beings rooted in cultural contexts, not automatons directed toward external truth.”  I share their optimism and confidence in our ability to be objective despite our very human biases… but I also think that Gould’s warning is well worth heeding.  Our greatest disease as a scientific community today is an insupportable pressure to publish positive, unambiguous results, when the overwhelming majority of experimental results should be expected to be negative or ambiguous.  Many of us probably learned the scientific method in school as a short sequence of simple steps; in reality, as I find I am very fond of saying, science is messy.

Stephen Jay Gould died in 2002.  It is extremely unfortunate that he is no longer with us.  I would be interested to hear his response to these latest developments.

Reddit’s comment ranking algorithm

On a web site where users rate the quality of– well, just about anything, from consumer products to comments on news articles– with positive or negative “votes,” how should the rated items be sorted so as to present the higher-rated items near the top of the page and the lower-rated items near the bottom?

I recently learned how Reddit handles this problem when sorting user comments on a link post.  (I was lured to the Reddit link by the title’s suggestion that Randall Munroe of xkcd fame had a hand in the algorithm.)  There is some interesting mathematics involved… but there also appear to be some equally interesting bugs in the implementation, which I will discuss here.

[Edit: Two of the three problems discussed here have been fixed.]

First, some background, in chronological order: Evan Miller has a great write-up titled “How Not To Sort By Average Rating.”  He shows why a few simple solutions do not work very well, and instead proposes using the lower bound of the Wilson score confidence interval to sort ratings, including Ruby source code implementing the formula.

Evan’s post was in turn referenced by a Reddit blog guest post by Randall, describing Reddit’s new “best” comment ranking algorithm around the time it was implemented.  And finally, a more recent post by Amir Salihefendic on his amix blog describes Reddit’s ranking algorithms in excellent detail, including inspection of the Reddit source code.  (To be clear, Reddit uses different ranking algorithms for “stories” (i.e., link posts) vs. user comments.  We are focusing on the latter here.)

I had initially planned to discuss confidence intervals in general, and some interesting issues with binomial proportion estimation in particular, of which the Wilson score interval is just one example.  There frequently seems to be confusion about just what information is conveyed by a confidence interval.  For example, Evan describes the Wilson score lower bound as the answer to the question, “Given the ratings I have, there is a 95% chance that the “real” fraction of positive ratings is at least what?”  This is misleading, since no claim can be made about the probability that the “real” fraction of positive ratings lies within a particular computed interval.  (Wikipedia actually does a great job of explaining this.)

But let’s skip that and instead focus on the actual implementation in the Reddit source code.  I’ll take Amir’s approach of presenting a Python translation, since the original is in the less common Pyrex:

# From /r2/r2/lib/db/_sorts.pyx
from math import sqrt

def confidence(ups, downs):
    n = ups + downs
    if n == 0:
        return 0
    z = 1.0 #1.0 = 85%, 1.6 = 95%
    phat = float(ups) / n
    return sqrt(phat+z*z/(2*n)-z*((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)

Compare this with the formula for the endpoints of the Wilson score interval:

\frac{\hat{p}+\frac{z^2}{2n}\pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}+\frac{z^2}{4n^2}}}{1+\frac{z^2}{n}}

where \hat{p} is the fraction of observed votes that are positive, and z is the appropriate quantile of the standard normal distribution (more on this shortly).

We can make several observations at this point.  First, and most importantly, the square root is in the wrong place in the code!  I am not sure how this error crept in, since the form of the expression and even the variable names are nearly identical to the correct Ruby example from Evan’s original post.  At any rate, whatever Reddit is computing, it isn’t the Wilson score interval.

Second, it is worth clarifying for exactly what confidence interval we are trying to compute endpoints.  Both Evan and Amir present the above formula with the plus/minus notation, and use \alpha/2 in their z-subscripts, which together strongly suggest that the two-sided Wilson score interval of the form [c-\Delta,c+\Delta] is intended.  However, the text in Evan’s post and the actual constants used for z in the code comments suggest a one-sided interval of the form [u,1], where u is the value returned by the function.  Since the distinction only matters when we talk about the actual level of confidence in the resulting interval (e.g., is it 95% or 90%, 85% or 70%, etc.), to avoid further confusion I will use the code as a guide and refer to levels of confidence assuming one-sided intervals.
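To make the distinction concrete, the relationship between the z constants in the code comments and the one- vs. two-sided confidence levels can be checked using nothing but the standard normal CDF (the helper name here is mine):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# One-sided level for a lower bound computed with quantile z:
print(round(phi(1.0), 4))      # 0.8413 -- roughly the "85%" in the code comment
print(round(phi(1.6), 4))      # 0.9452
print(round(phi(1.64485), 4))  # 0.95
# The same z = 1.0 interpreted two-sided gives a different level:
print(round(2 * phi(1.0) - 1, 4))  # 0.6827
```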

Finally, one interesting side effect of the current incorrect implementation is that the function always returns positive values, even for zero upvotes.  More generally, the resulting interval frequently does not even contain the maximum likelihood estimate of the true fraction of positive ratings!  That’s the bad news… but the good news is that the function “does the right thing” for zero upvotes anyway, by returning smaller values for larger numbers of downvotes.  The true Wilson score interval lower bound does not have this nice property, since the lower bound is zero for \hat{p}=0, independent of n.

(Actually, I am not sure if this is ever a problem, at least on Reddit, since every comment seems to begin its life with one “automatic” upvote.)
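To see these side effects numerically, take the function as published above: with zero upvotes the “lower bound” is strictly positive, so the interval [u,1] excludes the maximum likelihood estimate \hat{p}=0 entirely, and yet the value still shrinks as downvotes accumulate:

```python
from math import sqrt

def confidence(ups, downs):
    # Reddit's implementation as published (square root misplaced)
    n = ups + downs
    if n == 0:
        return 0
    z = 1.0
    phat = float(ups) / n
    return sqrt(phat+z*z/(2*n)-z*((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)

print(round(confidence(0, 5), 3))   # 0.25  -- positive despite zero upvotes
print(round(confidence(0, 10), 3))  # 0.198 -- but more downvotes still rank lower
```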

So, how should we fix things?  I would recommend the following modification to the algorithm:

# Modified from /r2/r2/lib/db/_sorts.pyx
from math import sqrt

def confidence_fixed(ups, downs):
    if ups == 0:
        return -downs
    n = ups + downs
    z = 1.64485 # one-sided 95% confidence
    phat = float(ups) / n
    return (phat+z*z/(2*n)-z*sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)

This addresses each of the three issues described above:

First, we fix the square root in the formula, so that we are indeed computing the lower bound of the Wilson score interval (usually, anyway; see below).

Second, we change the constant z to use the 95% confidence interval instead of the original 85%.  This is really just a fine-tuning to keep the resulting ordering “closer,” in an eyeball-norm sense, to how the current algorithm performs.

Finally, we extend the special-case behavior from just returning zero for zero total votes, to returning negative values for zero upvotes.  This maintains the current property that more downvotes will always push items farther down the list.
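As a quick sanity check of the modified function, here are a few vote tallies (made up for illustration) ranked from best to worst:

```python
from math import sqrt

def confidence_fixed(ups, downs):
    if ups == 0:
        return -downs
    n = ups + downs
    z = 1.64485  # one-sided 95% confidence
    phat = float(ups) / n
    return (phat+z*z/(2*n)-z*sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)

# A few made-up vote tallies, ranked from best to worst:
votes = [(0, 5), (0, 0), (1, 1), (10, 2), (100, 20)]
for ups, downs in sorted(votes, key=lambda v: confidence_fixed(*v), reverse=True):
    print(ups, downs, round(confidence_fixed(ups, downs), 3))
```

Note that more votes at the same ratio produce a higher lower bound (compare (10, 2) with (100, 20)), which is exactly the conservatism we want from the Wilson bound, and that among the zero-upvote tallies, more downvotes always rank lower.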