Historical probability of picking a perfect bracket 1985-2025

For the past six years, I have been maintaining (on GitHub) a machine-readable record of outcomes of the NCAA men’s basketball tournaments, over the now four decades since the current 64-team format began in 1985. (I continue to refuse to acknowledge the four Tuesday-Wednesday play-in games.) This effort was originally motivated by the annual question of the probability of picking a “perfect bracket,” i.e., correctly guessing the winners of all 63 games in all six rounds of the tournament. Despite some close calls over the years, no one has ever verifiably done this, and it is unlikely that anyone ever will.

This past post describes a method of estimating this probability for any given bracket, so that, for example, a “chalk” bracket (where a higher seeded team is always selected to beat a lower seed) is a much more likely overall outcome– albeit still unlikely as a specific outcome– than a bracket picking, say, all of the #1 seeds to be beaten by #16 seeds in the first round.

We can also apply this method to historical data, aggregating and weighting each year’s games and upsets into a single number that we can use to compare the overall prior (un)likelihood of tournament outcomes across years, as shown in the following figure.

Probability of perfect bracket 1985-2025.

The black line at the bottom is 2^{-63}, the often-quoted “one in 9.2 quintillion” probability of correctly guessing all 63 games by simply flipping a coin. The blue and red lines at the top are probabilities of chalk brackets, using two different models of individual game probabilities as a function of difference in “strength” of each team. As discussed in more detail here, blue indicates a strength following a normal distribution density, and red indicates a simpler linear strength function.
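To make the aggregation concrete, here is a minimal sketch: a model assigns each of the 63 games a probability that the picked team wins, and the prior probability of the whole bracket is just the product. (The 67% chalk figure below is purely illustrative, chosen only to land in a plausible range; it is not either of the two strength models in the figure.)

```python
from math import prod

# Coin-flipping assigns probability 1/2 to every one of the 63 games:
coin_flip = [0.5] * 63
print(1 / prod(coin_flip))   # 2^63, the "one in 9.2 quintillion"

# A hypothetical model giving chalk picks a 67% average chance per game
# would put a chalk bracket at roughly one in 90 billion:
print(1 / 0.67 ** 63)
```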

From the above figure, we see that this year’s tournament was the second most likely ever in the history of the current format. This makes sense: the Final Four teams were all #1 seeds (only the second time this has happened– the other was in 2008 with champion Kansas), and there were only 11 upsets– the fewest ever, as shown in the figure below– of a higher seed losing to a lower seed. The only tournament outcome with a higher prior probability was in 2007, with a then-fewest 12 total upsets, and #1 seeds Ohio State and champion Florida beating #2 seeds Georgetown and UCLA, respectively, in the Final Four.

Number of upsets (lower seed beating higher seed) 1985-2025.

So, how (un)likely is it that someone will eventually pick a verifiable perfect bracket? The probability of a chalk bracket– the mode of the distribution– being correct is one in roughly 100 to 200 billion. Even then, there are eight different chalk brackets, since the Final Four pits #1 seeds against each other, leaving 2^3 equally “chalky” ways to pick those last three games… and an astronomically larger number of brackets with the dozen or so upsets that we expect to see in a given year.

Expected time to complete a WOW board row or column

I recently saw a WOW board, which seems to be a popular approach to rewarding positive behaviors in elementary classrooms. The idea is pretty simple: start with an n \times n grid (usually n=10) of empty squares. When a student exhibits some positive behavior, academic achievement, etc., they write their name or otherwise fill in a randomly chosen square. When the grid is “filled,” something cool happens, such as a prize drawing, extra recess, gold watches for everyone, whatever.

In most descriptions of WOW boards that I found online, the students must fill the entire grid of all 100 squares. But this particular variant was more mathematically interesting: the grid squares were numbered 1 to 100, and the teacher kept a container of slips also numbered 1 to 100; each time a student filled in a square, they didn’t simply pick any open spot on the board, but instead drew one of the slips from the container (without replacement) to determine which square to write their name in. The cool-something happens when the students first complete any row or column of the grid.

How long, on average, does this take? That is, what is the expected value of X, the number of slips drawn until first completing a row or column of the grid?

This is essentially a special case of the game of Bingo discussed here before. However, the analysis approaches presented there required computing sums with an exponential number of terms: 2^{n^2} for the most brute-force approach, which we knocked down to 2^{2n}. This is manageable for Bingo cards with n=5, but is already unpleasant for a WOW board with n=10.

It’s a nice problem to show that we can more efficiently compute the probability distribution– and thus the expected value– of X in polynomial time using the formula

P(X>k)=\frac{1}{{n^2 \choose k}} \sum\limits_{i=0}^n \sum\limits_{j=0}^n (-1)^{i+j} {n \choose i}{n \choose j}{n^2-ni-nj+ij \choose k-ni-nj+ij}

E(X)=\sum\limits_{k=0}^{n(n-1)} P(X>k)
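These sums are straightforward to evaluate; here is a sketch in Python using exact rational arithmetic (math.comb conveniently returns zero whenever the upper index falls short, handling the vanishing terms):

```python
from fractions import Fraction
from math import comb

def p_exceed(n, k):
    """P(X > k): probability no row or column is complete after k draws."""
    total = 0
    for i in range(n + 1):
        for j in range(n + 1):
            m = k - n * i - n * j + i * j
            if m < 0:
                continue
            total += ((-1) ** (i + j) * comb(n, i) * comb(n, j)
                      * comb(n * n - n * i - n * j + i * j, m))
    return Fraction(total, comb(n * n, k))

def expected_draws(n):
    """E(X) = sum of P(X > k) for k = 0 .. n(n-1)."""
    return sum(p_exceed(n, k) for k in range(n * (n - 1) + 1))

print(float(expected_draws(10)))   # about 71.6
```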

The resulting cumulative distribution for a 10×10 grid is shown below in blue, with the expected number of draws– about 71.6– shown in red.

An open question that I don’t know how to answer: what is the value of

\lim\limits_{n \to \infty} \frac{E(X)}{n^2}

That is, as the WOW board grid size grows large, do we expect to need to fill closer and closer to 100% of the squares before first filling a row or column?

I suspect the answer is yes, but I’m not sure how to prove it.

Optimal strategy maximizing expected value for splitting pairs

Introduction

I think the game of blackjack would be relatively boring, mathematically speaking, if not for splitting pairs. But that one additional playing strategy option makes analyzing the game much more challenging and interesting: if you are dealt a pair of cards with the same rank, should you “split” the pair– at a cost of another wager– and play each single-card hand separately? If you are dealt yet another pair card, should you “re-split” to a third, or even fourth, additional hand?

When evaluating splitting strategy, we usually trade some optimality for speed and simplicity. That is, we can compute the exact expected return from splitting a pair very quickly– in milliseconds– if we are willing to constrain the player’s strategy to depend only on the cards in the current hand (and maybe some limited additional contextual information), not all of the cards observed in the entire round.

But there are some analysis questions where it would be useful to know the speed of light: how much can you possibly win with truly perfect play, with every strategy decision accounting for every card dealt from the shoe? The motivation for this post is to capture my notes on an algorithm– code is on GitHub– to compute this optimal pair-splitting strategy and corresponding maximum expected return. See [2] and [3] for related work; the objective here is to significantly reduce execution time and extend the space of feasibly computable resplits.

More retrograde analysis

I was able to borrow a lot of code from the recent solution of the children’s game Gobblet Gobblers. The idea is the same here: we compute the expected value of a pair split in two steps:

  1. Breadth-first search the decision tree of all possible states of the round.
  2. Traverse the tree backward from leaves to the root, in order of decreasing depth, computing the expected value of each state in terms of its children.

In the first step, the breadth-first search trades the memory required to store all possible states for the speed of not repeatedly re-visiting states as in a recursive (effectively depth-first) traversal. Storing all of those states requires some careful packing; more on this shortly.

In the second step, we compute the expected value of each game state in terms of its children. For example, the value of hitting (or doubling down) from a given state is a linear combination of the values of the states reached from each possible drawn card, weighted by their probabilities; the value of a decision state is the maximum of the values obtained by standing vs. hitting, etc. By “solving” for each state in order of decreasing depth, we ensure that we have already computed the expected values of any dependent child states. We start at the bottom, with leaf states, corresponding to the player having completed all split hands in the round, which we “solve” in terms of the probabilities of outcomes of the dealer’s hand, as discussed here before.
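As a toy illustration of this leaves-first ordering (an invented mini-game for this purpose, not the post’s blackjack code): suppose a lone hand total may either stand, scoring its total, or hit, drawing a rank 1 through 10 uniformly from an infinite deck, with totals over 21 scoring zero. Hitting always increases the total, so solving totals in decreasing order guarantees that every child state is already solved when we need it:

```python
from fractions import Fraction

value = {}
for t in range(21, 1, -1):                       # "decreasing depth"
    # Value of hitting: probability-weighted average of child values,
    # with busted totals (> 21) worth nothing.
    hit = sum(Fraction(1, 10) * value.get(t + c, 0) for c in range(1, 11))
    # Value of a decision state: the better of standing vs. hitting.
    value[t] = max(Fraction(t), hit)

print(value[11])   # 847/50, i.e., stand on 14 or better, hit 13 or less
```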

Storing packed game states

The most interesting part of this problem was the memory management. As with Gobblet, we need a hash map from each game state to its corresponding (expected) value. And once again, we’re using the lean and mean “mask, step, index” MSI hash map implementation described by Chris Wellons [1]. (Thanks to Chris for very helpful discussions about other optimizations as well!)

But now we need even more states– well over half a billion depending on the rules and number of decks– and even more memory per state. Following is the 16-byte representation of a single game state:

struct State
{
    std::int8_t cards[11];
    std::int8_t hands[4];
    std::int8_t current;

    bool operator==(const State& rhs) const
    {
        return std::memcmp(this, &rhs, sizeof(State)) == 0;
    }
};

Elements cards[1] through cards[10] count the cards of each rank observed during the round, including the dealer’s up card and the initial two pair cards being split. hands[current] holds the current hand total, negated to indicate a soft total. Finally, cards[0] holds the number of cards in the current hand: either 1 for a single split pair card, 2 for a two-card hand, or 3 for more than two cards.

Elements hands[0] through hands[current-1] hold totals for hands that are already completed (via stand, double down, or bust), but with a slightly different convention than hands[current]: any total less than 17 is “clamped” to 16; totals of 17 through 21 are stored as-is; and any busted hand is recorded as 22. In each case, negation indicates that the hand was doubled down. Furthermore, this subarray of completed hand values is sorted.

The result of all of this clamping and sorting is that we cannot in general reconstruct from a State the chronological “history” of decisions and composition of cards drawn to each hand in a round. But that’s okay: the objective is to preserve only information that is exploitable by strategy decisions. When trying to determine what to do next, we don’t care whether that already-completed hand total is hard or soft, but we do care whether we doubled our wager on it, etc.
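Here is a small Python illustration of why the clamp-and-sort convention shrinks the state space (a hypothetical re-creation of the convention, not the post’s C++ code):

```python
def completed_hand(total, busted=False, doubled=False):
    """Encode a finished hand: totals below 17 clamp to 16, busts to 22;
    negation marks a doubled-down hand."""
    t = 22 if busted else (16 if total < 17 else total)
    return -t if doubled else t

# Two different histories -- standing on hard 12 vs. standing on hard 16 --
# encode identically, since the difference cannot affect any later decision:
history1 = sorted([completed_hand(12), completed_hand(20, doubled=True)])
history2 = sorted([completed_hand(16), completed_hand(20, doubled=True)])
print(history1 == history2)   # True
```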

But we’re not done compressing just yet. Each element of the MSI hash map array contains not just a game state “key,” but also its expected value:

struct Solved
{
    State state;
    union
    {
        Solved* next;
        double value;
    };
};

I wrote my first computer program over four decades ago, and this is the first time I’ve found myself using a union in C++. During the first breadth-first search exploration of all possible game states, we use these 8 bytes as linked list pointers to group states by depth. Then in the second step we traverse those linked lists of states, overwriting the list pointers as we compute each state’s expected value.

One interesting thing that I learned from this project is that just zeroing memory takes a long time (“only” gigabytes per second). For this reason, you can input the desired size of the hash map, to speed up scripted evaluation of large numbers of “fast” scenarios that don’t require a large number of states. For example, to evaluate splitting 2-2 against a dealer 4 in 1D, H17, DOA, DAS, SPL1, we only need a couple of million states:

Memory allocation exponent (29=12GB, 30=24GB, etc.): 21
European no hole card rule (1=yes, 0=no): 0
Dealer hits soft 17 (1=yes, 0=no): 1
Double down after split (1=yes, 0=no): 1
Double down on any two cards (1=yes, 0=no): 1
Maximum number of splits (3=SPL3): 1
Shoe (ace through ten, -1 to exit): 4 4 4 4 4 4 4 4 4 16
Pair card: 2
Dealer up card: 4

Split 2 vs. 4:
    Searched 1247392 states to depth 17, 59% capacity.
    Solving depth 0, 100% complete.
    Elapsed time 2.675 seconds.
E(split)= 0.12079971803625923

References:

  1. Wellons, C., The quick and practical “MSI” hash table [nullprogram.com]
  2. ChemMeister iCountNTrack, Composition Dependent Combinatorial Analyzer [moderator on blackjacktheforum]
  3. kc, Optimal Expected Values for a single split [bjstrat.net]

Another Secret Santa puzzle

Suppose that n couples want to participate in a Secret Santa gift exchange, where each person is secretly assigned to give a gift to one other person in the group. We need a procedure to randomize the gift-giving assignments, subject to some desired constraints:

  1. Every person receives one gift; that is, the function mapping each person to their assigned gift recipient is a bijection.
  2. No person is assigned to give a gift to themselves; that is, the function is a derangement.
  3. No person is assigned to give a gift to their partner.
  4. Every assignment generated by the procedure that satisfies the above constraints is equally likely.

A common approach to generating a random assignment is for each person to write their name on a slip of paper, put all of the slips into a hat, and pass the hat around, with each person drawing a slip to determine their assigned gift recipient. We have discussed this approach before: for example, if a person “peeks” at their slip as it is drawn and sees their own name– violating constraint (2)– then it might be tempting to “fix” the procedure inline by returning the slip to the hat and drawing another. But this turns out to violate constraint (4).

We can preserve constraint (4) by rejection sampling: if someone draws their own name– or that of their partner– then everyone immediately returns their slips to the hat, and the drawing starts again from the beginning. We can compute the probability of “accepting” such a drawing, with the added wrinkle of constraint (3) addressed here.

This post is motivated by a slightly different procedure: suppose that we have a moderator who uniformly randomly permutes the list of 2n participants, and each person is assigned to give a gift to the person immediately following them in the shuffled list (viewed cyclically, so that the last person in the list gives a gift to the first person in the list).

This procedure satisfies constraints (1), (2), and (4), without any need for rejection re-sampling. However, what is the probability that this procedure also satisfies constraint (3)?

It’s a nice problem to compute this probability exactly; but it’s also interesting to show that this probability approaches 1/e as n grows large.
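For small n we can at least check the exact values by brute force; here is a sketch (p_valid is an illustrative helper; the partner of person p is taken to be p ^ 1):

```python
from fractions import Fraction
from itertools import permutations

def p_valid(n):
    """Probability that a uniformly random cyclic ordering of n couples
    assigns no one to give a gift to their own partner."""
    ok = total = 0
    for perm in permutations(range(2 * n)):
        total += 1
        # perm[i] gives to perm[(i+1) % 2n]; partner of p is p ^ 1
        if all(perm[(i + 1) % (2 * n)] != perm[i] ^ 1 for i in range(2 * n)):
            ok += 1
    return Fraction(ok, total)

print(p_valid(2), p_valid(3))   # 1/3 and 4/15
```

Note that the directed check suffices: since “partner of” is an involution, a person is followed by their partner exactly when their partner is preceded by them.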

An algorithm incorrectness proof

Introduction

Let’s play a card game for money. For example, I (the “house”) will thoroughly shuffle together six decks of playing cards, and deal a round of blackjack against you (the “player”). That is the scenario motivating this post, but the details of the rules of the game don’t matter much for our purpose here, so we can alternatively consider a much simpler game: I will shuffle a single deck of cards, and deal one card to you and one card to me. High card wins (aces are low, face cards have value 10), with a tie going to the house.

Because we are playing for money, both the player and the house have an interest in knowing the exact (dis)advantage in the game. We can write a program to compute the player’s expected return, as a function of the composition of ranks of face-down cards in the shuffled “shoe.” For example, we might write the following Python code to compute the expected return for the high card game:

from fractions import Fraction
import numpy

def expected_value(shoe):
    """Expected value of high card game as fraction of initial wager."""
    ev = 0
    for card1 in range(10):
        for card2 in range(10):
            s = list(shoe)
            p = Fraction(s[card1]) / sum(s)
            p = p * s[card2] / sum(s)
            ev = ev + p * (1 if card1 > card2 else -1)
    return ev

print(expected_value((4, 4, 4, 4, 4, 4, 4, 4, 4, 16)))

The result is -25/169, or about 14.8% of the player’s wager going to the house.

Or is it? There is an error in the above code. Let’s leave it as an exercise for the reader to find the bug (hint: one line is missing)… and instead prove that there must be a bug, even without identifying or fixing it.

The Extended True Count Theorem

The idea is to suppose that we shuffle the deck as usual, but then “burn” the top card of the deck, setting it aside face down and unused, before starting the game by dealing the player’s and dealer’s “hands.” Intuitively, this action should not affect the player’s expected return: neither the player nor the dealer sees the burn card, and at least in this simple high card game it wouldn’t matter even if they did, since there is no “playing strategy” that might be influenced by knowledge of the burn card’s value.

We can verify this by computing the average of the expected returns played from the resulting 51-card deck, weighted by the probability of each possible burned top card:

def average_ev(shoe):
    """Average expected value after 'burning' the top card."""
    ev = 0
    for card in range(10):
        s = list(shoe)
        p = Fraction(s[card]) / sum(s)
        s[card] = s[card] - 1
        ev = ev + p * expected_value(s)
    return ev

print(average_ev((4, 4, 4, 4, 4, 4, 4, 4, 4, 16)))

The resulting average expected return is -557/3757: still a roughly 14.8% house edge, but not exactly the same -25/169 that we should expect, thus proving that our original implementation of expected_value(shoe) must be in error.

This approach to verifying the correctness of expected_value(shoe) applies much more generally, as described in this paper, as well as here and here. For example, instead of just burning the top card, modify average_ev(shoe) to play the game twice, computing the average expected return from the second round (effectively burning the top two cards instead of just one). Or deal cards from the top of the deck, stopping when you have either dealt an ace or a maximum of ten cards, then play the game from the resulting depleted deck. Or repeatedly flip a coin, dealing a card for each heads, stopping when you flip tails or have flipped heads a maximum of ten times, then play the game. In each case, if we ever find that average_ev(shoe) != expected_value(shoe), then we know that something is wrong.
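To make the test concrete, here is a sketch applying the burn-card check to a correct implementation of the high card game, dealing without replacement (exact_ev and burned_ev are illustrative names, not from the post). A correct expected value is unchanged by burning any number of unseen cards, so the averages match exactly:

```python
from fractions import Fraction

def exact_ev(shoe):
    """Exact EV of the high card game, dealing two cards without replacement."""
    ev, total = Fraction(0), sum(shoe)
    for c1 in range(10):
        for c2 in range(10):
            s = list(shoe)
            p = Fraction(s[c1], total)
            s[c1] -= 1                       # second card drawn from 51
            p *= Fraction(s[c2], total - 1)
            ev += p * (1 if c1 > c2 else -1)
    return ev

def burned_ev(shoe, burns):
    """Average EV after burning the given number of top cards."""
    if burns == 0:
        return exact_ev(shoe)
    ev = Fraction(0)
    for c in range(10):
        if shoe[c] == 0:
            continue
        s = list(shoe)
        p = Fraction(s[c], sum(s))
        s[c] -= 1
        ev += p * burned_ev(s, burns - 1)
    return ev

deck = (4,) * 9 + (16,)
print(exact_ev(deck))        # -29/221
print(burned_ev(deck, 1))    # -29/221: burning one card changes nothing
print(burned_ev(deck, 2))    # -29/221: nor does burning two
```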

(The converse, however, is not true: matching expected values do not imply that the implementation of expected_value(shoe) is necessarily correct. For example, let’s change the game slightly so that ties are a push instead of going to the house, with the following change to expected_value (but otherwise leaving the original bug as-is):

            ev = ev + p * numpy.sign(card1 - card2)

The reader can verify that the discrepancy goes away, with average_ev(shoe) == expected_value(shoe) == 0: not only do the values match, but they are correct, despite the bug remaining in the implementation.)

Application to blackjack

Coming back now to the original motivation for this post, let’s consider a round of blackjack with some specified rules (6D, S17, DOA, DAS, SPL3, LS), played using CDZ- strategy optimized for the full shoe. Using my blackjack software (on GitHub), we can compute the corresponding expected return from a round played off the top of the shoe (think expected_value(shoe)), as well as the average expected return after removing a single card, but playing the same fixed “full-shoe” strategy (think average_ev(shoe)):

#include "blackjack.h"
#include <iostream>

int main() {
    BJShoe shoe{6};
    BJRules rules;
    BJStrategy cdz;
    BJProgress progress;
    BJPlayer *basic = new BJPlayer{shoe, rules, cdz, progress};
    std::cout << basic->getValue() << std::endl;

    BJReal ev = 0;
    for (int card = 1; card <= 10; ++card) {
        shoe.deal(card);
        BJPlayer* eor = new BJPlayer{shoe, rules, *basic, progress};
        shoe.reset();
        ev += eor->getValue() * shoe.getProbability(card);
    }
    std::cout << ev << std::endl;
}

with the following results (as fraction of initial wager):

-0.0033026803733750346
-0.0033026803733748255

These two values are close, but not exactly equal. Granted, these calculations were done using the limited precision of 64-bit floating-point arithmetic, with the accompanying rounding errors. But how can we tell whether the differences are due to limited numeric precision, or due to errors in the algorithm itself?

To answer this question, I recently updated the code to typedef the numeric data type (BJReal in the above example code) used for computing probabilities and expected values. This type defaults to double, but we can use the arbitrary-precision math::Rational instead– but otherwise using exactly the same algorithm– and verify that the expected return in the above scenario is exactly:

-\frac{636491186124390779995774550739146028101753}{192719583540554771243108474049609438884038125}

and that we get exactly this same result whether played from the full shoe or averaged over each possible burn card. In other words, the discrepancy in the lowest order digits above was due entirely to double-precision rounding error.

Of course, as shown in the high card “ties push” excursion described above, this doesn’t prove that the algorithm is correct. But inequality would have proved that the algorithm is incorrect. For this reason, I think this is a useful additional test to consider when evaluating algorithms like these.

Analysis of adversarial binary search game

Introduction

Alice and Bob are at it again, this time playing a game called Binary Search. Alice begins by secretly selecting an integer key from 1 to n=100, then Bob makes a series of integer guesses, to each of which Alice responds indicating whether the guess is correct, or higher than her selected key value, or lower. Bob must pay Alice one dollar for each guess (including the final correct guess). How much should Alice be willing to pay to play this game with Bob?

This post is motivated by a recent blog post by John Graham-Cumming, discussed on Hacker News, about a similar game apparently used by Steve Ballmer in Microsoft job interviews. In the linked clip, Ballmer plays the role of Alice, and offers $6 to play the game. You’re Bob interviewing for a job: should you take the bet? Ballmer suggests that the answer is no, since as he puts it, “I [Alice/Steve] can pick numbers specifically that are hard for you [Bob] to get.”

As many have noted in the subsequent discussion, this comment certainly seems to allow for Alice/Steve picking numbers adversarially, that is, not necessarily uniformly at random. But presumably Bob is then allowed similar leeway, not being forced to execute textbook binary search that always guesses the midpoint of the interval of remaining possible numbers. So who has the greater strategic advantage? I’m not sure. I think this is a hard problem, as I’ll try to make the case below.

Solving smaller problems

I wrote some Mathematica code (on GitHub) to compute exact optimal strategies for Alice and Bob for smaller versions of this problem, up to n=13. Alice’s pure strategies are simply the set of n possible secret keys. Bob is more interesting; we can identify each possible pure search strategy as an in-order binary tree with n vertices. The following function enumerates these (it’s always a nice surprise when my favorite sequence turns up in another unexpected place):

(* List all possible search strategies with key space[n]. *)
strategies[n_] := strategies[n] = Module[
   {guess},
   If[n === 0, {{}},
    Join @@ Table[Join @@ Outer[
        {guess, #1, #2} &,
        strategies[guess - 1],
        guess + strategies[n - guess],
        1
        ],
      {guess, 1, n}
      ]]]

Armed with Alice’s and Bob’s pure strategies, we can compute the matrix of payoffs resulting from each possible strategy match-up, then solve the corresponding linear programming problem to compute the expected value of the game and the Nash equilibrium mixed strategies for each player.
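The same setup can be sketched in Python (the post’s actual code is in Mathematica): enumerate Bob’s pure strategies as in-order binary trees, then tabulate the payoff– the number of guesses– for each key-versus-tree pairing. The strategy counts are the Catalan numbers:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def strategies(lo, hi):
    """All pure search strategies for keys lo..hi as (guess, left, right) trees."""
    if lo > hi:
        return (None,)
    return tuple((g, left, right)
                 for g in range(lo, hi + 1)
                 for left in strategies(lo, g - 1)
                 for right in strategies(g + 1, hi))

def guesses(tree, key):
    """Number of guesses Bob pays for, using this tree against this key."""
    g, left, right = tree
    if key == g:
        return 1
    return 1 + guesses(left if key < g else right, key)

print([len(strategies(1, n)) for n in range(1, 8)])
# Catalan numbers: [1, 2, 5, 14, 42, 132, 429]

# Payoff matrix for n = 2 (rows: Alice's key, columns: Bob's tree):
print([[guesses(t, k) for t in strategies(1, 2)] for k in (1, 2)])
```

Solving the zero-sum game is then a linear program over this matrix; for n=2 the value is 3/2, with both players mixing 50/50.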

Results and conjectures

The results for n up to 13 exhibit a consistent pattern suggesting that an optimal strategy for Alice is to pick numbers “almost uniformly” at random… but twice as likely to pick a number on the edge. More precisely, pick 1 or n each with probability 2/(n+2), and pick any other number with probability 1/(n+2). I think it is likely that this is an optimal strategy for Alice in the n=100 case as well. (However, it’s interesting to observe that the presumed “Ballmer strategy” of picking uniformly from the subset of numbers for which the normal binary search would yield the maximum number of guesses \lfloor\log_2 n\rfloor + 1 has the same expected payoff.)

Bob’s strategy, on the other hand, is harder to describe in simple terms. Let’s consider n=13 for a specific example. Of the 742,900 possible pure search strategies, an optimal mixed strategy for Bob is to select from just the following baker’s dozen with the indicated probabilities, where each search strategy is specified as the pre-order traversal of the corresponding binary tree:

Prob.      Strategy
------------------------------------------------
0.381579   7  3  1  2  5  4  6 11  9  8 10 13 12
0.131579   8  4  2  1  3  6  5  7 12 10  9 11 13
0.131579   6  2  1  4  3  5 10  8  7  9 12 11 13
0.0625     9  4  1  2  3  6  5  8  7 12 10 11 13
0.060855   5  2  1  4  3 10  8  6  7  9 13 12 11
0.055921   9  4  2  1  3  6  5  8  7 12 10 11 13
0.055921   8  3  1  2  5  4  6  7 12 10  9 11 13
0.055921   5  2  1  4  3 10  8  6  7  9 12 11 13
0.055099   6  2  1  4  3  5 11  9  8  7 10 13 12
0.003289   9  5  2  1  4  3  7  6  8 12 10 11 13
0.003289   5  2  1  4  3  9  7  6  8 12 10 11 13
0.001645   5  2  1  4  3 10  7  6  9  8 13 12 11
0.000822   7  2  1  4  3  5  6 11  9  8 10 13 12

In other words, roughly 38% of the time Bob should conduct the usual even-split binary search, guessing the midpoint 7 to start. However, sometimes Bob might start with an initial guess as low as 5 or as high as 9.

In the original game where n=100, I think it’s certainly the case that Bob should similarly be willing to start with guesses north of 51 or south of 50, but it’s unclear how extreme those departures might be.

Finally, what is Alice/Steve’s expected return in this game? Let’s set things up similarly to the Ballmer interview version of the problem, and suppose that Alice pays \lfloor\log_2 n\rfloor dollars to play, and wins back a dollar for each of Bob’s guesses. Then the following figure shows Alice/Steve’s expected return as a function of the number n of possible keys to pick from.

My suspicion is that this sawtooth behavior persists, continuing to straddle zero expected return, so that it seems difficult to predict whether the particular game at n=100 has Alice in the red or the black, when both she and Bob use truly optimal strategies. This simple brute force computational approach certainly won’t work: the matrix of payoffs has 100 rows and roughly 10^57 columns.

Edit 2024-09-09: I was referred to Konstantin Gukov’s very interesting article about this game, where they compute an explicit strategy for the n=100 case that proves that Alice/Ballmer’s expected return is indeed negative in that case. They do this by restricting Bob the Guesser’s set of available binary search strategies, from all C(2n,n)/(n+1) \approx 10^{57} possible strategies to just 586 carefully chosen ones, so that it is feasible to solve the corresponding linear programming problem. Since Alice’s expected return is negative against this handcuffed version of Bob, it must also be negative against a truly optimal guesser.

Gukov’s subset of strategies is indeed well-chosen. I ran Gukov’s Python code to generate their search strategies for all n up to 100, then computed the resulting expected return in each case. (Recall that we are assuming that Alice pays \lfloor\log_2 n\rfloor dollars to play, so that we can compare performance across a range of key space sizes.) The results are shown in the figure below, with Ballmer’s return against Gukov’s restricted guesser shown in red, and the performance against an optimal guesser shown in blue (essentially a reproduction of the earlier figure above).

This gives more evidence for our earlier speculation about how the advantage varies with n, with a clearer picture of the pattern: it seems that Ballmer would have been safest by offering to play the game choosing numbers from 1 to 127– or 1 to 63, or 1 to 31, etc.– instead of 1 to 100. Powers of two are the worst for Ballmer, but as the key space increases from there, it eventually becomes advantageous again… until reaching the next power of two.

Robot simulator updated to VPython 7.6.5

Years ago, I wrote a Python module using the VPython 3D graphics library to simulate a three-wheeled robot modeled after the Parallax Scribbler, that could execute Logo-like “turtle” motion commands. It worked really well as an educational tool for programming students, since using VPython’s graphics meant that you could manipulate the robot from the interpreter prompt, a line at a time, and rotate and zoom the 3D view even while the robot is moving. No window event loop, no multithreading to think about.

I soon added a stall sensor and proximity sensors for detecting and navigating around obstacles, such as the walls of a maze as shown in the screenshot below.

from vturtle import *
robot = Robot(obstacles=maze())

In the intervening 14-ish years, VPython has undergone a significant refactoring. I’ve recently updated the robot simulator code accordingly, so that it now works with the latest VPython version, 7.6.5. The code is on GitHub.

Another whodunit logic puzzle

Once again, you are a detective investigating a robbery, with a cast of suspects having made the following statements:

  • Alice says, “Bob could claim that I did it.”
  • Bob says, “Charlie could claim that I did it.”
  • Charlie says, “David could claim that I did it.”
  • David says, “Zoe could claim that I did it.”
  • Zoe says, “At least one of us is innocent.”

You do not know which, nor even how many, of the suspects were involved in the crime. However, you do know that every guilty suspect is lying, and every innocent suspect is telling the truth. Which suspect or suspects committed the crime?

Problem 2: What if instead Zoe had said, “At least two of us are innocent”?

(As usual, I like these both as math problems and as programming problems for students. The added wrinkle here is the need to encode the somewhat “meta” nature of most of the statements, talking about the feasibility of someone else’s hypothetical assertions.)
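One way to encode the “could claim” statements, as a sketch (running it reveals the answers): on the usual reading, an innocent truth-teller could claim only true statements, and a guilty liar only false ones, so “X could claim Y” is true exactly when X’s innocence matches Y’s truth:

```python
from itertools import product

def could_claim(speaker_innocent, claim):
    # An innocent (truthful) speaker could claim only true statements;
    # a guilty (lying) speaker could claim only false ones.
    return speaker_innocent == claim

def solutions(zoe_threshold):
    """Guilt assignments (Alice, Bob, Charlie, David, Zoe) consistent with
    all five statements; Zoe claims at least `zoe_threshold` are innocent."""
    sols = []
    for guilty in product([False, True], repeat=5):
        innocent = [not g for g in guilty]
        statements = [
            could_claim(innocent[1], guilty[0]),  # "Bob could claim that I did it."
            could_claim(innocent[2], guilty[1]),  # "Charlie could claim that I did it."
            could_claim(innocent[3], guilty[2]),  # "David could claim that I did it."
            could_claim(innocent[4], guilty[3]),  # "Zoe could claim that I did it."
            sum(innocent) >= zoe_threshold,       # "At least ... of us are innocent."
        ]
        # Every innocent suspect tells the truth; every guilty suspect lies.
        if all(stmt == inn for stmt, inn in zip(statements, innocent)):
            sols.append(guilty)
    return sols

print(solutions(1))   # Problem 1
print(solutions(2))   # Problem 2
```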

Strongly solving Gobblet Gobblers using retrograde analysis

Introduction

Despite my nephews and nieces growing older, it seems like I continue to encounter mathematically interesting children’s games. I recently played Gobblet Gobblers, a deceptively complex variant of Tic Tac Toe marketed for “ages 5 and up.”

The rules are simple:

  • Like Tic Tac Toe, the game is played on a 3×3 board, with the winner being the first player to get 3 of their pieces in a row, column, or diagonal.
  • Each player starts with 6 pieces: 2 pieces in each of 3 sizes. Each piece has a hollow dome shape, so that a smaller piece may be covered by a larger piece played on top of it.
  • At each player’s turn, they may either play a new piece, or move one of their pieces already on the board (possibly uncovering a smaller piece underneath).
  • Whether playing a new piece or moving one already played, the piece may be played/moved either to an empty square, or over a smaller piece (of either player’s) already on the board.

The neat wrinkle is being able to move pieces around without having to add a new piece to the board with each move. This not only increases the game’s state space complexity– there are 341,024,631 possible states up to board and player symmetry– but also complicates a straightforward minimax search, which must deal with the possibilities of repeated game states, and game tree paths hundreds of moves deep even without ever repeating a game state.

This post describes my approach to computing optimal strategy using retrograde analysis– a term from the chess world that I wasn’t familiar with; I would have just called it dynamic programming with some extra sauce. The C++ implementation is on GitHub, with which you can play the game, show optimal moves for either player at any point in the game, and “rewind” to experiment with alternate playouts.

Playing the game and its variants

When you run the program, you are first asked for three parameters specifying the supported rule variations:

  • The number of piece sizes (1, 2, or 3)
  • The number of pieces of each size (1 to 9)
  • A 0-1 Boolean indicating whether pieces may be moved after they are initially played.

For example, (3,2,1) corresponds to Gobblet, while (1,5,0) corresponds to plain old Tic Tac Toe, as shown below:

Enter rules (num_sizes, num_per_size, allow_move): 1 5 0
Searching... found 765 states.
Solving... solved 614 win/loss states.
      |      |
      |      |
     0|     1|     2
------|------|------
      |      |
      |      |
     3|     4|     5
------|------|------
      |      |
      |      |
     6|     7|     8
Player 1, enter move (-size | start, end), or (0, 0) for best move, or (-1, -1) to undo move: -1 4

Solving Tic Tac Toe is trivial and quick, and most other rule variants take at most a couple of minutes to “think” before beginning the game. But Gobblet (3,2,1) is by far the most complex variant, taking about 25 minutes on my laptop to compute optimal strategy… but that’s a one-time cost, and includes saving the strategy to disk, so that you can quickly reload it to play again later without having to sit through the calculation again.

A move is specified by a pair of integers, indicating the piece to play-or-move and where to move it, with non-negative values 0 through 8 indicating a square on the board as labeled above. A negative value (-1, -2, or -3) indicates playing a new piece of the corresponding negated size. Thus, the example move (-1, 4) above indicates Player 1 (X) starting the game by playing in the center square.

Hidden pieces

Although technically a game of perfect information, in practice there is a memory component: as the rules warn, “The first player to align 3 pieces in a row wins, so when you wish to move a piece try to remember what is under it!” The displayed board preserves this practical difficulty, only showing the topmost piece in each square, indicating its owner (X or O) and size (1, 2, or 3). In the example below, O responds to X’s opening move by covering the smaller X1 with their “medium” sized piece O2:

Enter rules (num_sizes, num_per_size, allow_move): 3 2 1
Loading from gobblet_3_2_1.dat
      |      |
      |      |
     0|     1|     2
------|------|------
      |      |
      |      |
     3|     4|     5
------|------|------
      |      |
      |      |
     6|     7|     8
Player 1, enter move (-size | start, end), or (0, 0) for best move, or (-1, -1) to undo move: -1 4
      |      |
      |      |
     0|     1|     2
------|------|------
      |      |
      |  X1  |
     3|     4|     5
------|------|------
      |      |
      |      |
     6|     7|     8
Player 2, enter move (-size | start, end), or (0, 0) for best move, or (-1, -1) to undo move: -2 4
      |      |
      |      |
     0|     1|     2
------|------|------
      |      |
      |  O2  |
     3|     4|     5
------|------|------
      |      |
      |      |
     6|     7|     8

Optimal strategy

Gobblet is a win for the first player in 13 moves (plies), meaning that optimal strategy does indeed involve moving some pieces around the board. X’s first move can be anything but a medium piece (“X2”); the only caveat is that X3 to an edge, or X1 to the center or a corner, allows the second player O to delay losing for 15 moves instead of 13.

(There is a Racket version of the game, with a much better user interface, that appears to implement optimal first player strategy in its “smart mode,” but seemingly only for its specific choice of principal variation, beginning with X3 to the center. For example, manually playing X1 to the center instead, then using this implementation’s optimal strategy for the second player– enter 0 0 for the move at any point to see a best move– against subsequent Racket auto-play moves for the first player, results in a win for O.)

If we prohibit moving pieces already played– that is, rule variant (3,2,0), with the next largest space complexity of 148,599,441 states– then the game is a draw, where X’s first move must be either X3 to the center or X1 anywhere.

The only other rule variants that are not a draw are (2,3,0) and (2,3,1)– that is, still six pieces per player, but in two sizes of three pieces each instead of the other way around– both of which are also first player wins, in 9 and 11 moves, respectively.

Game state bitboards

This was a really fun programming exercise. If there is any moral to this story, I think it is to remember that the standard library was written for everyone’s problems, not for your problem.

We use a std::uint64_t to represent each game state as a 54-bit bitboard: each of the 3×3=9 squares requires 6 bits, 2 bits for each piece size (332211), indicating whether there is a piece of that size in the corresponding square:

  • 00 indicates no piece of that size.
  • 01 indicates a piece of the player to move.
  • 10 indicates the opponent’s piece.

Gobblet is symmetric, allowing us to reduce the size of the state space by “swapping seats” after each move, so that the state is always interpreted with respect to the player to move next. We can further reduce the number of states by roughly another factor of 8 with the usual trick of rotating and reflecting the board, and using the minimum value among all resulting states.
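Under this encoding, the seat swap is cheap: since every 2-bit field holds 01 for the player to move and 10 for the opponent, exchanging the two bits of each pair swaps the players everywhere at once. A minimal sketch, assuming the squares occupy the low 54 bits (I’ve named it swap_seats to avoid colliding with std::swap; the actual implementation may differ):

```cpp
#include <cstdint>

using State = std::uint64_t;

// Low bit of every 2-bit field in the 54-bit board (bits 0, 2, ..., 52).
const State ODD_BITS = 0x15555555555555ull;

// Swap seats: exchange the two bits of each 2-bit field, so that every
// 01 (player to move) becomes 10 (opponent) and vice versa.
State swap_seats(State s)
{
    return ((s & ODD_BITS) << 1) | ((s >> 1) & ODD_BITS);
}
```

Applying the swap twice is the identity, which makes for a convenient sanity check.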

I think I got nerd-sniped here a bit: rotating the board was awkward using this bitboard encoding, and renumbering the squares to make rotations simpler just made reflections even worse. I finally settled instead on alternating between two reflections– one about the middle row, the other about the off-diagonal– chosen to minimize the number of “shift-and-mask” bitwise operations in each.
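For concreteness, here is what one of these reflections might look like– a sketch assuming the squares are packed row-major with square 0 in the low 6 bits, so each row of the board occupies 18 contiguous bits (the actual GitHub code may lay things out differently):

```cpp
#include <cstdint>

using State = std::uint64_t;

// Reflect the board about its middle row: rows 0 and 2 trade places.
// Each row occupies 18 bits (3 squares x 6 bits per square).
State flipud(State s)
{
    const State ROW = 0x3FFFFull;          // low 18 bits
    return ((s & ROW) << 36)               // row 0 -> row 2
         | (s & (ROW << 18))               // row 1 stays put
         | ((s >> 36) & ROW);              // row 2 -> row 0
}
```

Three shift-and-mask terms, and like any reflection it is its own inverse.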

As a first ding on the C++ standard library, I found it interesting that the final implementation, shown below, is different at all from– let alone noticeably faster than– what I initially started with using the clearer min_s = std::min(min_s, s):

State canonical(State s)
{
    State min_s = s;
    s = flipud(s);        min_s = s < min_s ? s : min_s;
    s = antitranspose(s); min_s = s < min_s ? s : min_s;
    s = flipud(s);        min_s = s < min_s ? s : min_s;
    s = antitranspose(s); min_s = s < min_s ? s : min_s;
    s = flipud(s);        min_s = s < min_s ? s : min_s;
    s = antitranspose(s); min_s = s < min_s ? s : min_s;
    s = flipud(s);        min_s = s < min_s ? s : min_s;
    return min_s;
}

Chris Wellons’s MSI hash map

Retrograde analysis involves mapping every possible game state to its corresponding win/loss value (more on this below). To do this, I started out with a std::map (or std::unordered_map if you like), starting development with smaller versions of the game that executed more quickly, such as rule variant (2,2,1), with only 252,238 states, then on to variant (2,3,0), with 1,964,786 states.

But with this approach, available memory was dwindling fast. At the time, before I knew Gobblet’s state space complexity exactly, a reasonable back-of-the-envelope estimate was

\frac{1}{8}(\sum\limits_{a=0}^2 \sum\limits_{b=0}^2 {9 \choose a}{9-a \choose b})^3

or well over 300 million states. I just couldn’t afford the overhead of red-black tree pointers, node colors, or hash bucket linked list pointers in the standard library’s map implementations.
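That estimate is quick to verify numerically. The following sketch reflects one plausible reading of the formula– for each piece size, choose up to 2 visible squares for each player’s pieces of that size, cube for the 3 sizes, and divide by 8 for board symmetry– with choose and estimate being names of my own invention:

```cpp
#include <cstdint>

// Binomial coefficient, small arguments only.
std::uint64_t choose(int n, int k)
{
    std::uint64_t c = 1;
    for (int i = 0; i < k; ++i) { c = c * (n - i) / (i + 1); }
    return c;
}

// Back-of-the-envelope bound on the number of game states: for each
// piece size, place up to 2 of each player's pieces among the 9
// squares; cube for the 3 sizes, and divide by 8 for board symmetry.
std::uint64_t estimate()
{
    std::uint64_t sum = 0;
    for (int a = 0; a <= 2; ++a)
        for (int b = 0; b <= 2; ++b)
            sum += choose(9, a) * choose(9 - a, b);
    return sum * sum * sum / 8;  // inner sum is 1423
}
```

The result is about 360 million, comfortably “well over 300 million.”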

Enter the “mask, step, index” (MSI) hash map described by Chris Wellons [1]. This worked like a charm, being not just lean but fast. Here is the entire implementation, in place of std::map, squeezing all 2.5 GB worth of game states into a 4 GB array:

const int HASH_EXP = 29;
const std::size_t HASH_MASK = (1ull << HASH_EXP) - 1;
const State STATE_EMPTY = 0x3; // 0x0 is the (valid) initial board state
const State STATE_MASK = (1ull << 54) - 1;
std::vector<State> hash_map(HASH_MASK + 1, STATE_EMPTY);

// Return pointer to hash map entry for given game state.
State* lookup(State s)
{
    std::uint64_t h = hash(s);
    std::size_t step = (h >> (64 - HASH_EXP)) | 1;
    for (std::size_t i = h;;)
    {
        i = (i + step) & HASH_MASK;
        if (hash_map[i] == STATE_EMPTY || (hash_map[i] & STATE_MASK) == s)
        {
            return &hash_map[i];
        }
    }
}

That’s it. Or almost– we also need the hash(s) function used to compute the starting index and step size when skipping past collisions. I ended up using SplitMix64, which I was expecting to be an overly big hammer, but turned out to result in significantly faster execution than all of the simpler hashes that I tried. I wonder if this is due to the rather heavy loading (we’re using over 63.5% of the allocated space)?
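For reference, the SplitMix64 mixing function is itself tiny– the hash(s) used above might look like the following (these are the standard published multiplier constants; whether this exactly matches the GitHub code is not guaranteed):

```cpp
#include <cstdint>

// SplitMix64 finalizer: a strong, invertible 64-bit bit mixer built
// from alternating xor-shifts and multiplications by odd constants.
std::uint64_t hash(std::uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ull;
    x ^= x >> 27; x *= 0x94d049bb133111ebull;
    x ^= x >> 31;
    return x;
}
```

Each step is invertible, so the whole mixer is a bijection on 64-bit values– distinct states can never collide in the full hash, only after masking down to the table index.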

At this point, we really have just a hash table. We need a hash map, from the 54-bit state as the key, to the win-or-lose value of the game played optimally from that point. This is where the STATE_MASK above comes in: we will pack the not-yet-computed value for the 54-bit state key into the remaining upper 10 bits of the std::uint64_t.

Retrograde analysis

Which brings us, finally, to the algorithm for computing the values of all possible game states, implicitly also computing the optimal strategy from any state (by choosing the move that leads to the highest value state). To do this, we map each game state to an ordered pair (value, moves), where value indicates +1 for a win, -1 for a loss, or 0 for a draw for the player to move, assuming optimal play from that point; and moves indicates the “depth to win,” or number of moves (plies) from that point to the end of the game assuming optimal play.

We “solve” for these values using retrograde analysis, starting with the terminal win or loss states that are initialized with their endgame value and moves=0, and working backward from there, identifying immediately previous game states and propagating solved win/loss values, incrementing moves accordingly.
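To make the backward propagation concrete, here is the same idea applied to a much simpler game (purely illustrative, not the Gobblet solver): the subtraction game, where players alternately take 1 to 3 tokens from a pile and whoever takes the last token wins. We initialize the terminal state (empty pile, a loss for the player to move, with moves=0) and sweep backward: a state is a win if some move reaches a solved loss (depth one more than the shallowest such loss), and a loss if every move reaches a win (depth one more than the deepest such win):

```cpp
#include <algorithm>
#include <vector>

// Solve the take-1-2-or-3 subtraction game by retrograde analysis:
// value[s] is +1 (win) or -1 (loss) for the player to move with s
// tokens left; moves[s] is the number of plies to the end of the game
// under optimal play.
void solve(int n, std::vector<int>& value, std::vector<int>& moves)
{
    value.assign(n + 1, 0);
    moves.assign(n + 1, 0);
    value[0] = -1;  // terminal state: no tokens left, player to move loses
    for (int s = 1; s <= n; ++s)
    {
        int win_depth = -1, loss_depth = -1;
        for (int take = 1; take <= std::min(3, s); ++take)
        {
            int d = moves[s - take];
            if (value[s - take] < 0)    // move to a solved loss: we can win
                win_depth = (win_depth < 0) ? d : std::min(win_depth, d);
            else                        // move to a solved win for opponent
                loss_depth = std::max(loss_depth, d);
        }
        if (win_depth >= 0) { value[s] = +1; moves[s] = win_depth + 1; }
        else                { value[s] = -1; moves[s] = loss_depth + 1; }
    }
}
```

Multiples of 4 come out as losses for the player to move, matching the well-known strategy for this game. The Gobblet solver differs mainly in bookkeeping: states are discovered by forward search, predecessors are found by un-doing moves, and states that never acquire a win/loss label are draws.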

In the process, I ended up using a slightly weird encoding of (value, moves) into the upper 10 bits of the std::uint64_t for each game state. Consider the function to compute the best move from a given game state:

Move best_move(State s)
{
    Move best{};
    State max_next = 0;
    for (auto& m : get_moves(s))
    {
        State next = *lookup(canonical(swap(move(s, m))));
        if (next > max_next)
        {
            max_next = next;
            best = m;
        }
    }
    return best;
}

In other words, optimal strategy is simply to choose the move leading to the state with the largest possible std::uint64_t value, which is in turn dominated by the (value, moves) pair encoded in the upper 10 bits. The value is in the highest 2 bits, and the moves are in the lower 8 bits… but negated (and offset by one) using two’s complement for losses. That is:

  • 0100000000 indicates (+1, 0), an immediate win for the player to move.
  • 0100000001 indicates (+1, 1), the player to move can win in 1 move.
  • 0100000010 indicates (+1, 2), the player to move can win in 2 moves.
  • 10######## indicates (0, #), game is a draw.
  • 1111111101 indicates (-1, 2), the player to move will lose in 2 moves.
  • 1111111110 indicates (-1, 1), the player to move will lose in 1 move.
  • 1111111111 indicates (-1, 0), an immediate loss for the player to move.

If we are computing the best move by choosing from among these values as next possible states, the “player to move” is our opponent. Thus, the ideal move would lead to the largest value with high order bits equal to 1111111111, an immediate loss for our opponent– or barring that, a loss for our opponent in the fewest possible moves. If we can’t guarantee a win, then we want to force a draw. And if we can’t force a draw, then our opponent has a winning strategy, but we want to delay the win for as many moves as possible.
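Packing a solved (value, moves) pair into that 10-bit field might look like the following sketch, with encode a hypothetical helper name and the shifts assuming the 54-bit board occupies the low bits:

```cpp
#include <cstdint>

// Pack (value, moves) into the 10 bits above the 54-bit board state,
// so that plain unsigned comparison of whole states orders successor
// positions from worst to best for the player choosing among them:
// opponent wins (01) < draws (10) < opponent losses (11..., two's
// complement, with sooner losses larger).
std::uint64_t encode(int value, int moves)
{
    std::uint64_t field;
    if (value > 0)       field = (0x1ull << 8) | moves;          // 01########
    else if (value == 0) field = (0x2ull << 8) | moves;          // 10########
    else                 field = (0x400ull - 1 - moves) & 0x3FF; // 11...
    return field << 54;
}
```

With this layout, best_move above needs no special cases at all: maximizing the raw std::uint64_t does the right thing.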

Reference:

  1. Wellons, C., The quick and practical “MSI” hash table [HTML]

Another coin flip puzzle

Alice and Bob are playing a game. They flip a fair coin 100 times, noting the resulting sequence of 100 heads (H) or tails (T). For each HH in the sequence, Alice scores a point; for each HT, Bob scores a point. For example, given the sequence THHHT, Alice scores 2 points and Bob scores 1 point. Whoever scores the most points wins the game. Who is most likely to win?

I was referred to this problem, which I suspect originally came from this recent tweet. I really like this problem… but not just for the usual reasons of involving probability and having a very unintuitive solution. (Alice’s expected total score is exactly equal to Bob’s expected score, so maybe it’s most likely a tie? On the other hand, Alice’s maximum score is 99, since her pattern can overlap with itself, while Bob can score at most 50 points, so maybe Alice is most likely to win?)

I also like this problem because I didn’t even recognize it as a special case of a problem that I had not only seen before, but discussed here before.

Because I didn’t recognize it, I ended up solving it in a different way, this time using generating functions. It’s a nice obligatory follow-on puzzle to do this in a one-liner in Mathematica or Wolfram Alpha, but it’s not too bad in Python either using SymPy:

from sympy import symbols, expand
n = 100
x = symbols('x')
# h and t are generating functions for sequences ending in heads or
# tails, respectively, where the coefficient of x^k counts sequences
# with (Alice's score) - (Bob's score) equal to k. The empty sequence
# counts as "ending in tails," since it ends no pattern.
h, t = 0, 1
for k in range(n):
    # Appending H to ...H completes HH, scoring for Alice (times x);
    # appending T to ...H completes HT, scoring for Bob (times 1/x).
    h, t = expand(h * x + t), expand(h / x + t)
g = h + t
# Positive powers of x: Alice wins; negative: Bob wins; x^0: a tie.
alice, bob, tie = (sum(g.coeff(x, k) for k in range(1, n)),
                   sum(g.coeff(x, k) for k in range(-n, 0)),
                   g.coeff(x, 0))

This counts the number of equally probable outcomes where Alice wins, Bob wins, or they tie.