Analysis of Biscuits dice game

Introduction

In the dice game “Biscuits,” a player rolls 12d6+1d8+1d10+1d12, then selects one or more of the rolled dice to set aside, rolling the remainder again, continuing this process until eventually all 15 dice have been set aside. The player’s score is the sum of the final values shown on the dice, with the highest score winning the game.

A rolled six on a d6– or an eight on the d8, or more generally the maximum value on a die– is referred to as a biscuit. As the rules suggest, “you want biscuits and lots of them!”, and “sometimes you’ll roll several biscuits all at once, you can set them all aside.”

You can set them all aside… but should you? That is, suppose that you roll sixes on multiple d6s at once, but relatively small values for the d8, d10, and/or d12. Should you set aside all of those biscuits, banking the highest possible score for those dice, or is it worth it to “give up” some of that guaranteed score in exchange for more future opportunities to roll better scores for the d8, d10, and d12?

Results

Not to beat around the bush: the latter turns out to be the case, and strategy in this game appears to be pretty interesting. Let’s suppose that our objective is to maximize our expected total score (more on this later). Python code is available on GitHub to evaluate all possible moves that we might make from any given roll. As an extreme example, suppose that on the initial roll we get biscuits for all 12d6, but we roll a 1 on each of the other dice:

>>> analyze(((0, 0, 0, 0, 0, 12),
             (1, 0, 0, 0, 0, 0, 0, 0),
             (1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
             (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)))
 93.836 points by keeping d6 (6 ), 
 93.761 points by keeping d6 (6 6 ), 
 93.691 points by keeping d6 (6 6 6 ), 
 93.625 points by keeping d6 (6 6 6 6 ), 
 93.565 points by keeping d6 (6 6 6 6 6 ), 
 93.509 points by keeping d6 (6 6 6 6 6 6 ), 
 93.462 points by keeping d6 (6 6 6 6 6 6 6 ), 
 93.449 points by keeping d6 (6 6 6 6 6 6 6 6 6 6 6 6 ), 
 93.427 points by keeping d6 (6 6 6 6 6 6 6 6 6 6 6 ), 
 93.426 points by keeping d6 (6 6 6 6 6 6 6 6 ), 
 93.407 points by keeping d6 (6 6 6 6 6 6 6 6 6 ), 
 93.407 points by keeping d6 (6 6 6 6 6 6 6 6 6 6 ), 
 87.292 points by keeping d6 (6 6 6 6 6 6 6 6 6 6 6 6 ), d8 (1 ), 
 87.167 points by keeping d6 (6 6 6 6 6 6 6 6 6 6 6 ), d8 (1 ), 
 87.064 points by keeping d8 (1 ), 
 87.037 points by keeping d6 (6 6 6 6 6 6 6 6 6 6 ), d8 (1 ), 
...

Optimal strategy is to keep just a single d6, sacrificing all 11 other biscuits in exchange for more future rolls to hopefully improve the contribution from the d8, d10, and d12. Even more interesting is that the expected total score isn’t monotonic in how many biscuits we might choose to set aside: keeping one d6 is best, and keeping all 12 is worse… but not as bad as keeping just 10 of them.

The figure below shows the distribution of possible total scores using this strategy, sampled from 10,000 simulated games.

Distribution of final total score in Biscuits using expectation-maximizing strategy, with expected score of about 93.9 (out of 102 possible) shown in red.

I was surprised at how well this strategy performs, with an expected score of about 93.9 (shown in red), and in a few simulated games even reaching the maximum possible score of 102.

Implementation

As discussed previously here, we can efficiently enumerate all possible outcomes from rolling a given subset of the dice as weak compositions of the total number of dice of each type. To compute optimal strategy for a given roll, we consider all possible non-empty subsets of dice to keep, recursively evaluating the expected score of each such move, memoizing results as we go.
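To make this concrete, here is a minimal sketch of the stars-and-bars enumeration (hypothetical helper names, not the GitHub implementation), together with the multinomial probability weight of each outcome:

```python
from itertools import combinations
from math import comb, factorial, prod

def rolls(n, d):
    """Yield every outcome of rolling n d-sided dice as a weak composition:
    a tuple (c_1, ..., c_d) summing to n, where c_v counts dice showing v."""
    for bars in combinations(range(n + d - 1), d - 1):
        fences = (-1,) + bars + (n + d - 1,)
        yield tuple(b - a - 1 for a, b in zip(fences, fences[1:]))

def weight(roll, d):
    """Probability of a given composition: multinomial coefficient / d**n."""
    n = sum(roll)
    return factorial(n) // prod(map(factorial, roll)) / d ** n

outcomes = list(rolls(2, 6))
assert len(outcomes) == comb(2 + 6 - 1, 6 - 1)      # C(7, 5) = 21 outcomes
assert all(sum(roll) == 2 for roll in outcomes)
assert abs(sum(weight(roll, 6) for roll in outcomes) - 1) < 1e-12
```

Grouping dice by face value this way keeps the state space small: there are only C(n+d-1, d-1) distinguishable outcomes per die type, rather than d^n ordered rolls.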

As an aside, the Python code was not the starting point. Quite often I find it useful to do my “thinking” in Mathematica, then convert the resulting algorithm into a more widely accessible form. It’s always interesting to see how much more concise– but often much less readable– the Mathematica implementation is, as shown below.

rolls[n_, d_] := Map[
  Differences[Append[Prepend[#, 0], n + d]] - 1 &,
  Subsets[Range[n + d - 1], {d - 1}]
  ]

score[ns_, ds_] := score[ns, ds] = Total@Map[
  (Times @@ Multinomial @@@ #) Max[0, Map[
    Total[Total /@ (Range[ds] #)] + score[ns - (Total /@ #), ds] &,
    Rest[Tuples[Tuples /@ Range[0, #]]]
    ]] &,
  Tuples[MapThread[rolls, {ns, ds}]]
  ] / Times @@ (ds^ns)

score[{12, 1, 1, 1}, {6, 8, 10, 12}]

Open questions

I think there are a lot of interesting additional questions to pursue here. For example, is there a concise description of optimal strategy, or even an approximation of it, that a human player can implement with simple mental calculation?

Can the calculation of optimal strategy be made more efficient? Interestingly, both the Mathematica and Python implementations took about a day and a half to execute on my laptop. Add any more dice and things would get out of hand.

Finally, the strategy analyzed above is only optimal in the sense of maximizing expected score. This is not the same thing as maximizing the probability of beating another player. Similar to the approach utilized here, ideally we would evaluate the second player’s strategy first, with the utility function parameterized by the first player’s hypothetical final score. Then, once we know all of the possible expected values for the second player, we could compute the first player’s optimal strategy and overall value of the game, using the (negated) second player’s expected returns as the utility function. How different do the resulting strategies look as a result, compared with maximizing expectation?

Analysis of Mastermind-like bottle match game

During the holidays, my niece introduced me to a “bottle match” game, variants of which have apparently made the rounds online for a while now, but it was new to me. She secretly arranges seven wooden bottle-shaped pegs, each of a different color, in a row. I then make a series of guesses of the ordering of the colors (arranging a second set of seven bottles visible to everyone), with each guess scored with a number indicating how many bottles are in the correct position. The objective is to get all seven bottles in the correct order in as few guesses as possible.

This is very similar to the board game Mastermind: one player arranges a secret “code,” and another player makes a series of guesses, with scores providing partial information about how close each guess is to the solution.

Watching family members play the bottle match game, I wondered how inefficient we were being in our guesses. Our exploration of the space of possible solutions seemed pretty timid, making “small” single-transposition changes to the arrangement of bottles at each step, even backtracking to previous guesses at times just to confirm remembered scores.

With n=7 bottles, a coarse lower bound argument shows that at least \lceil\log_{n-1}n!\rceil=5 guesses are necessary in the worst case. How close to this bound can we get?
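That bound is a quick computation: a guess that is not the secret itself receives one of n-1 possible scores, so a decision tree distinguishing all n! orderings needs depth at least \log_{n-1}n!.

```python
from math import ceil, factorial, log

# A non-winning guess receives one of n - 1 scores (0 through n - 2;
# a score of exactly n - 1 is impossible for permutations), so we need
# at least ceil(log base (n - 1) of n!) guesses in the worst case.
n = 7
bound = ceil(log(factorial(n), n - 1))
assert bound == 5
```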

I wrote a program to try to answer this question, taking advantage of the similarity to Mastermind to analyze both games in the process. The resulting ~100 lines of Python code are on GitHub.

The greedy heuristic strategy used for both games is that described by Knuth (2): at each turn, make a guess that minimizes the maximum number of possible secret codes that might remain, over all possible scores that the guess might receive. Break any ties by first preferring codes that are possible (so that we have a chance of winning on the next move)… and in the case of the bottle match game, prefer a code that is “closest” to the previous guess, in the sense of requiring the least amount of additional shifting bottles around to turn the previous guess into the new guess.
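A minimal sketch of this minimax guess selection (hypothetical helper names, not the GitHub code), where `score` counts bottles in the correct position:

```python
from itertools import permutations

def score(guess, code):
    """Number of bottles in the correct position."""
    return sum(g == c for g, c in zip(guess, code))

def best_guess(candidates, possible):
    """Knuth-style greedy minimax: choose the guess whose worst-case
    number of still-consistent codes is smallest, breaking ties by
    preferring guesses that could themselves be the secret code."""
    def worst_case(guess):
        counts = {}
        for code in possible:
            s = score(guess, code)
            counts[s] = counts.get(s, 0) + 1
        return max(counts.values())
    return min(candidates, key=lambda g: (worst_case(g), g not in possible, g))

# Tiny example with 3 bottles: every opening guess is minimax-equivalent
# (at worst 3 of the 6 codes remain), so tie-breaking picks the first.
perms = list(permutations(range(3)))
assert best_guess(perms, perms) == (0, 1, 2)
```

(The additional bottle-match tie-breaker, minimizing the block-interchange distance from the previous guess, is omitted here for brevity.)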

This distance metric on permutations can be hard to compute, depending on what sort of shifting operations are allowed. The “block interchange” metric described by Christie (1) seems reasonably realistic for this problem, while being efficiently computable: one operation involves taking two “blocks” of consecutive bottles and swapping them, as shown in the example in the figure below.

Example of a single block interchange operation, swapping the block of 3 bottles in the gray box with the block of 2 bottles in the black box.

As a sanity check, evaluating the resulting strategy for the Mastermind game across all 6^4=1296 possible secret codes replicates the results in Knuth’s paper: an expected number of guesses of 5801/1296, or about 4.476, never requiring more than a maximum of 5 guesses.

For the bottle match game, we can guess the secret code in 34169/5040, or about 6.78 guesses on average, never requiring more than a maximum of 8 guesses. It’s interesting how often it is advantageous– at least in this greedy heuristic minimax sense– to guess a code that is not consistent with the scores observed so far, as shown in the example output below.

>>> import mastermind
>>> game = mastermind.ColorMatch()
>>> game.play((6, 5, 4, 3, 2, 1, 0), show=True)
Guess (0, 1, 2, 3, 4, 5, 6) in 5040 remaining: score 1
Guess (2, 3, 4, 5, 6, 1, 0) not in 1855 remaining: score 3
Guess (2, 5, 0, 1, 6, 3, 4) not in 106 remaining: score 1
Guess (4, 3, 0, 1, 6, 5, 2) not in 31 remaining: score 0
Guess (2, 1, 4, 5, 0, 3, 6) not in 8 remaining: score 1
Guess (6, 5, 4, 3, 2, 1, 0) in 2 remaining: score 7

References:

  1. Christie, D., Sorting permutations by block-interchanges, Information Processing Letters, 60(4) November 1996, p. 165-169.
  2. Knuth, D., The Computer as Master Mind, Journal of Recreational Mathematics, 9(1) 1976-77.

Riddler Solution: Can You Skillfully Ski The Slopes?

Introduction

In Zach Wissner-Gross’s Fiddler on the Proof, he mentions the following problem from nearly five years ago, when it was still FiveThirtyEight’s The Riddler:

You are one of n=2 finalists in a skiing championship. There are two rounds: in the first round, each finalist skis down the mountain, recording their finishing time; this is repeated in the second round, with the least total time for both runs winning the championship. This is the top talent in the world, so the time for every run is independent and identically distributed. Having learned that you have the best time in the first round, what is the probability that you will win the championship? For extra credit, what if n=30?

In the original problem, the finishing times were normally distributed. The neat feature of this problem is a cocktail-napkin solution for n=2 that does not depend on this distribution. However, the extra credit for larger n– with normally distributed run times– appears to require resorting to simulation or approximation, although Josh Silverman provides some really interesting recent asymptotic analysis.

The motivation for this post is to capture my notes on the exact solution for any n for the case where finishing times are uniformly distributed. (This feels to me like a slightly more natural version of the problem, without the unrealistic negative tails.)

Solution

Define random variables X and Y to be the finishing times for our first and second run, respectively, and similarly define R_i and S_i to be the first and second round times for each competitor i \in \{1,2,\ldots,n-1\}. Define event A_i to be X+Y<R_i+S_i (i.e., we beat competitor i overall) and event B_i to be X<R_i (i.e., we beat competitor i in the first round). Then we want to compute the probability

P(\bigcap\limits_{i=1}^{n-1}A_i | \bigcap\limits_{i=1}^{n-1}B_i)

Applying Bayes’ theorem, and re-ordering and re-grouping the intersection of events in the numerator, we have

\frac{P(\bigcap\limits_{i=1}^{n-1}A_i \cap B_i)}{P(\bigcap\limits_{i=1}^{n-1}B_i)}

There are two key observations: first, by symmetry, the probability in the denominator– that we are the leader after the first round– is 1/n. Second, in the numerator, if we condition on fixed times for our two runs, say for example X=x=0.3 and Y=y=0.8, then the n-1 events A_i \cap B_i are independent– so we can multiply them– and they all have the same probability.

The following figure shows this example situation, where the area of the shaded region is the probability P(A_i \cap B_i)=P(x+y<R_i+S_i \land x<R_i) that we beat a particular competitor in both rounds.

This shaded region has two different shapes depending on whether x+y is less than or greater than 1; putting everything together, the desired probability is

n \int_{x=0}^1(\int_{y=0}^{1-x}(1-x-\frac{1}{2}y^2)^{n-1}\mathrm{d}y + \int_{y=1-x}^1((1-x)(1-y)+\frac{1}{2}(1-x)^2)^{n-1}\mathrm{d}y)\mathrm{d}x

Applying binomial expansion to the powers and moving the integrals inside the resulting summation gives the following formula:

n\sum\limits_{k=0}^{n-1}{n-1 \choose k}\frac{1}{2^k}(\frac{(-1)^k}{(2k+1)(n+1+k)}+\frac{(n-1-k)!(n-1+k)!}{(2n)!})
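As a check, evaluating this formula with exact rational arithmetic (a quick sketch; `p_win` is a hypothetical helper name) reproduces both the distribution-independent cocktail-napkin answer of 3/4 for n=2 and the n=30 value quoted below:

```python
from fractions import Fraction
from math import comb, factorial

def p_win(n):
    """P(winning overall | leading after the first round), uniform times."""
    return n * sum(
        Fraction(comb(n - 1, k), 2 ** k) *
        (Fraction((-1) ** k, (2 * k + 1) * (n + 1 + k)) +
         Fraction(factorial(n - 1 - k) * factorial(n - 1 + k),
                  factorial(2 * n)))
        for k in range(n))

assert p_win(2) == Fraction(3, 4)
assert abs(float(p_win(30)) - 0.226) < 0.001
```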

The figure below shows this probability vs. the number of skiers n.

For 5 or 6 skiers, it’s roughly a coin toss whether you retain your first-round lead. When n=30, the probability is about 0.226.

Variants of Fitch Cheney’s Trick

Nearly 15 years ago, I wrote about the following trick attributed to William “Fitch” Cheney, Jr., in [1]:

Effect 1: While the magician turns their back or even leaves the room, a spectator shuffles a standard deck of n=52 cards, selects any (unordered) subset of m=5 cards, and hands them to the magician’s assistant. For example:

Cards selected by the spectator.

The assistant arranges k=4 of these cards in a row left to right, as shown below, keeping the remaining selected card hidden.

Cards presented to the magician.

Upon observing this arrangement, the magician announces the fifth selected card, in this case the ace of diamonds.

The article linked above describes how the trick works in practice (including Python source code demonstrating both the assistant and the magician), as well as a proof that the following condition on (n,m,k) is necessary and sufficient for the trick to be feasible, at least in principle:

{n \choose m} \leq (n)_k

Intuitively, the LHS binomial coefficient counts the number of spectator selections that may be presented to the assistant, and the RHS falling factorial counts the number of arrangements that may be presented to the magician.

The motivation for this post is to observe that this basic argument, applying the König-Egerváry theorem to a bi-regular graph to find a saturating matching, is more generally applicable: it should be possible to perform other “Fitch-like” tricks, whose feasibility rests on similar counting arguments. For example:

Effect 2: A spectator rolls m=6 dice each with n=6 sides (maybe from their Farkle set), and arranges the dice in a row left to right, in any desired order. The assistant covers one of the dice by placing a cup over it, leaving the ordered arrangement otherwise undisturbed. The magician observes this arrangement with the k=m-1 values shown, and announces the value on the covered di(c)e.

This trick is feasible if and only if:

n^m \leq n^k {m \choose k}

For another example of this same form, (n,m,k)=(10,10,9) corresponds to writing down a 10-digit number, the assistant covering one of the digits, and the magician announcing the hidden digit.

The linked proof shows that both of these tricks are feasible in principle– but it’s a separately interesting challenge to come up with a practical scheme for performing them. In this case, there is a relatively simple solution left as a puzzle for the reader; but things can get complicated quickly, as in this puzzling.stackexchange.com post that discusses the (6,9,7) case.

Effect 3: A spectator selects any m=27 cards from a deck of n=52 cards, and arranges them face-up in a row left to right, in any desired order. (More practically, imagine shuffling and cutting strictly more than half the deck, and fanning the cards face-up in a row.) The assistant flips one of the cards face-down, leaving the ordered arrangement otherwise undisturbed. The magician observes this arrangement with the k=m-1 face-up cards shown, and announces the face-down card(s).

In this case, the trick is feasible if and only if:

(n)_m \leq (n)_k {m \choose k}
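All three counting conditions are easy to verify directly (using `math.perm` for the falling factorial):

```python
from math import comb, perm

# Effect 1 (Fitch Cheney, n=52, m=5, k=4): C(52, 5) possible selections
# vs. (52)_4 arrangements the assistant can present.
assert comb(52, 5) <= perm(52, 4)

# Effect 2 (n=6, m=6, k=5): n^m dice sequences vs. n^k visible values
# times C(m, k) choices of which die to cover -- exactly equal here.
assert 6 ** 6 == 6 ** 5 * comb(6, 5)
assert 10 ** 10 == 10 ** 9 * comb(10, 9)   # the ten-digit-number variant

# Effect 3 (n=52, m=27, k=26): (52)_27 ordered selections vs. (52)_26
# face-up arrangements times C(27, 26) choices of card to flip.
assert perm(52, 27) <= perm(52, 26) * comb(27, 26)
```

Note that for the six-dice trick (and the ten-digit variant) the two counts are exactly equal: any working scheme must use every available arrangement, with no slack at all.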

Of these two variants, I think this one would be the more powerful effect. Note that unlike Fitch’s original trick, the spectator gets to not only choose the cards, but to arrange them in any chosen order as well. The assistant’s only freedom is in which card(s) to flip face-down.

Here again, the obligatory puzzle is to determine a practical scheme for performing this trick. I’m not sure how much it helps as a hint to note that, in this particular case of k=m-1 where the assistant flips exactly one card face-down, for the trick to be possible we need the spectator to select and arrange strictly more than half of the cards in the deck. (So that, for example, it is perhaps useful that the spectator is guaranteed to select at least one pair of cards of the same rank and color in their arrangement?)

Reference:

  1. Kleber, M., The Best Card Trick, Mathematical Intelligencer, 24 (2002). [PDF]

The criminal coupon collector

Introduction

Alice and Bob ride their bicycles together to work each day. They each secure their bicycles using chain combination locks with m=6 dials, with 10 decimal digits on each dial.

“I notice you always take longer than I do to lock your bike,” Alice says, “and I spin each of my dials independently and uniformly after locking– making sure I don’t accidentally leave it unlocked. What takes you so long?”

“I spin my dials after locking, too,” Bob says, “but I also make sure that none of the individual dials on the lock happens to be correct, giving it extra random spins if necessary. No sense in helping a thief get lucky.” (In other words, for example, if Bob’s combination is 123456, then he may leave the combination spun to 451234, but not 151234, since the first digit would be correct.)

Eve is the neighborhood thief, who has noticed these two expensive-looking bicycles at this same location every morning for a while now, and she happens to overhear this conversation. After watching Alice and Bob lock their bikes and enter their building, Eve casually strolls over and observes the sequence of digits on each of the locks. She has an idea.

Puzzle 1: Assuming Eve is able to observe the combination on Alice’s lock and try exactly one combination each morning after Alice leaves her bike for work, what is the expected number of days until she is able to steal the bike?

Puzzle 2: Similarly observing the displayed combination on Bob’s lock, what is the expected number of days until Eve is able to steal Bob’s bike?

Multiple coupon collectors

Stealing Alice’s bike is an interesting problem in its own right. But the focus of and motivation for this post is to capture my notes on the second problem of stealing Bob’s bike. The practice of ensuring that none of the individual combination dials is randomly accidentally correct, while well-intentioned, ends up making Eve’s job much easier. Despite the lock having one million possible combinations, Eve needs less than 40 days’ worth of observations on average to steal the bike.

We can rephrase the problem in terms of multiple coupon collectors: there are m collectors, one for each dial, each independently trying to collect s=10-1=9 coupon types, one for each incorrect digit on that dial. Each day, all m collectors draw a coupon; we want to compute the expected number of days until all collectors have each obtained all coupon types. Put another way, we want the expected value of the maximum of m independent random variables, each distributed according to the standard solo coupon collector problem with s coupon types.

The cumulative distribution for this maximum N is

P(N \leq n) = (\sum\limits_{k=0}^s (-1)^k{s \choose k}(1-\frac{k}{s})^n)^m

with expected value

E(N) = -\sum\limits_{\mathbf{k}>\mathbf{0}} \frac{\prod\limits_{i=1}^m (-1)^{k_i}{s \choose k_i}}{1-\prod\limits_{i=1}^m (1-\frac{k_i}{s})}

where the summation ranges over tuples \mathbf{k} of m non-negative integers that are not all zero, corresponding to a number of “unknown incorrect” digits for each dial. In Python:

from fractions import Fraction
from itertools import islice, product
from math import comb, prod

def probability_coupon_draws(n, s, m=1):
    """CDF(# of coupon draws from s types for all of m collectors)."""
    return sum((-1) ** k * comb(s, k) * (1 - Fraction(k, s)) ** n
               for k in range(s + 1)) ** m

def expected_coupon_draws(s, m=1):
    """E(# of coupon draws from s types for all of m collectors)."""
    return -sum(prod((-1) ** k_i * comb(s, k_i) for k_i in k) /
                (1 - prod(1 - Fraction(k_i, s) for k_i in k))
                for k in islice(product(range(s + 1), repeat=m), 1, None))

The result is about 39.5891 observations on average until Eve knows all 9 incorrect digits for each dial, and thus knows the entire correct combination.
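As a sanity check of the formula (repeating the definition above so the snippet is self-contained): for a single collector the expectation must reduce to the classical s H_s, and a small two-collector case can be verified by hand.

```python
from fractions import Fraction
from itertools import islice, product
from math import comb, prod

def expected_coupon_draws(s, m=1):
    """E(# of coupon draws from s types for all of m collectors)."""
    return -sum(prod((-1) ** k_i * comb(s, k_i) for k_i in k) /
                (1 - prod(1 - Fraction(k_i, s) for k_i in k))
                for k in islice(product(range(s + 1), repeat=m), 1, None))

# One collector: reduces to the classical coupon collector, s * H_s.
assert expected_coupon_draws(3) == 3 * sum(Fraction(1, i) for i in (1, 2, 3))

# Two collectors, two types each: E(max) = 11/3, verifiable by hand from
# P(N <= n) = (1 - 2**(1 - n))**2 for n >= 1.
assert expected_coupon_draws(2, m=2) == Fraction(11, 3)
```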

Binomial proportion estimation with unknown and varying number of trials

Here is a neat trick: our friend, Paul, has asked us for help estimating his skill at the shooting range. Every day for the past several years, he has diligently recorded the number x_i of successful shots he took that day. Assuming that every shot is an iid Bernoulli trial with success probability p, can we estimate p?

Unfortunately, no. Paul only recorded the number of successful shots each day. If he hit the target x_i=20 times on a given day, was that out of 20 attempts, or 25, or 150? Not only did Paul not record that corresponding total number n_i of attempted trials each day, but that total varied from one day to the next: some days it might be just one “stand” of 25 shots, other days it might be several hours with hundreds of shots.

It might seem like we’re stuck. However, Paul mentions that he doesn’t shoot alone. His friend, Quinn, has accompanied him on each of these trips to the shooting range. In friendly competition, they each take the same number of shots, and whoever has the most hits buys a round on the way home. Fortunately, Quinn has kept a similar record of his daily hits (but again, not daily total trials).

This is interesting: Quinn takes each shot with some different success probability q, that is just as unknown to us as p. How can similarly incomplete information about Quinn possibly help us learn something about Paul?

It turns out that it can. Following is Python code implementing one of the simpler of several possible estimates of Paul’s success probability p, given a list of pairs of Paul’s and Quinn’s daily hit counts.

def estimate_p(xy: list[tuple[int, int]]) -> float:
    """Estimate Bernoulli probability from pairs of success counts."""
    sx  = sum(x     for x, y in xy)
    sy  = sum(y     for x, y in xy)
    sxy = sum(x * y for x, y in xy)
    sx2 = sum(x * x for x, y in xy)
    return 1 - sx2 / sx + sxy / sy
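A seeded simulation (all parameters here are arbitrary illustrative choices) suggests why this works: the hidden daily trial counts n_i never appear in the computation, yet the estimate lands near the true p.

```python
import random

def estimate_p(xy):
    """Estimate Paul's success probability from (Paul, Quinn) hit pairs."""
    sx  = sum(x     for x, y in xy)
    sy  = sum(y     for x, y in xy)
    sxy = sum(x * y for x, y in xy)
    sx2 = sum(x * x for x, y in xy)
    return 1 - sx2 / sx + sxy / sy

def binomial(n, p, rng):
    """Number of successes in n Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(17)
p, q = 0.7, 0.5                     # true (unknown) success probabilities
trials = [rng.randint(10, 50) for _ in range(20000)]   # hidden daily n_i
xy = [(binomial(n, p, rng), binomial(n, q, rng)) for n in trials]
estimate = estimate_p(xy)           # should land near p = 0.7
```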

The figure below shows this estimate vs. Paul’s true success probability, for each of 100 simulated experiments, where we randomly choose true success probabilities for Paul and Quinn.

Simulation of 100 experiments (one point each) with uniformly random p and q, showing resulting estimated vs. true probability p.

Code for these experiments and plotting results is on GitHub.

A trick-taking game

There has not been a puzzle here in a while, so…

Alice is playing a card game with her nephew, Bob, using a standard deck of playing cards. Alice starts with the 26 red cards in her hand, and Bob starts with the 26 black cards in his hand. In each of 26 rounds (or tricks), Alice and Bob simultaneously select and play one card from their remaining hand. The player with the higher-ranked card takes the trick; cards of the same rank are discarded with neither player taking the trick. After all cards have been played, Alice wins (loses) one point for every trick won (lost).

At this point, by symmetry, this seems like a fair game. To skew things slightly in her nephew’s favor, Alice suggests a wrinkle: aces are low… except for one of Bob’s aces, the ace of spades, which is high, ranking higher than all other cards.

  1. What are Alice’s and Bob’s optimal strategies and corresponding expected score in this game?
  2. Suppose that instead of a point for every trick, the player with the most tricks wins the game (with a score of +1, or 0 if both players take the same number of tricks). What are Alice’s and Bob’s optimal strategies and expected score?
  3. How does this generalize to arbitrary initial hands? That is, suppose Alice and Bob start with known card ranks a_1 \leq a_2 \leq \ldots \leq a_n and b_1 \leq b_2 \leq \ldots \leq b_n, respectively (with all cards known to both players).

Fiddler Solution: Can You Hack Bowling?

The following problem is from the Fiddler (the “spiritual successor to FiveThirtyEight’s The Riddler column”), involving the game of tenpin bowling with traditional scoring:

If you knock down 100 pins over the course of a game, you are guaranteed to have a score that’s at least 100. But what’s the minimum total number of pins you need to knock down such that you can attain a score of at least 100?

I thought this was an interesting problem, in part because traditional scoring in bowling is a bit of a mess. Every pin that you knock down scores at least one point, but some pins may yield additional bonus points… based on previously knocked down pins.

My first thought was to transform the puzzle into an integer linear programming (ILP) problem. In hindsight, the solution ends up having a nice enough form to suggest a more elegant cocktail-napkin proof. But those scoring rules are just messy enough to want some more brute-force verification from the computer. Or at least that’s what I told myself to justify the interesting coding problem.

The Mathematica code is on GitHub; it’s not much, just a few dozen lines. The solution is simple to describe: we can score at least 100 points by knocking down just 44 pins, rolling four consecutive strikes, immediately followed by knocking down four pins, and no more, for a total score of 30+30+24+14+4=102.
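We can verify this claim with a short scorer for traditional scoring (a sketch, not the GitHub code); the roll sequence below is four strikes, then a 4, then misses the rest of the way:

```python
def score(rolls):
    """Total score for a complete game under traditional scoring."""
    total, i = 0, 0
    for _ in range(10):
        if rolls[i] == 10:                     # strike: 10 + next two rolls
            total += 10 + rolls[i + 1] + rolls[i + 2]
            i += 1
        elif rolls[i] + rolls[i + 1] == 10:    # spare: 10 + next roll
            total += 10 + rolls[i + 2]
            i += 2
        else:                                  # open frame
            total += rolls[i] + rolls[i + 1]
            i += 2
    return total

game = [10, 10, 10, 10, 4, 0] + [0] * 10       # frames 5-10: a 4, then misses
assert sum(game) == 44                         # total pins knocked down
assert score(game) == 102                      # 30 + 30 + 24 + 14 + 4
```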

We certainly can’t fully brute-force the problem by examining all possible games: the number of distinct “scoreboards” in a game of bowling is

{12 \choose 2}^9({12 \choose 2}+10 + 10 \cdot 11 + {12 \choose 2}-11) = 5,726,805,883,325,784,576
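A quick verification of this count, where C(12,2) = 66 is the number of two-roll frame outcomes (a, b) with a + b at most 10:

```python
from math import comb

frame = comb(12, 2)            # 66 two-roll outcomes (a, b) with a + b <= 10
tenth = frame + 10 + 10 * 11 + frame - 11   # 241, grouped as in the formula
assert frame ** 9 * tenth == 5_726_805_883_325_784_576
```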

However, at the other extreme, I also wasn’t able to reduce the problem to just a single ILP instance. Not only is the total score (which we are bounding) not a linear function of the variable numbers of pins knocked down by each roll, but even the total number of pins knocked down (which we are trying to minimize) is not a simple linear function of the variables, because of the possible extra frames after a strike or spare in the tenth frame.

Instead, we can trade execution time for preserved code simplicity, by evaluating not just one ILP instance, but 3^9 \cdot 4=78,732 of them, by grouping game outcomes according to the sequence of frame “types”: strike, spare, or open. For example, the eventual solution described above occurs in a game of the form (strike, strike, strike, strike, open, open, open, open, open, open). Having fixed the type of each frame in the game, now we can describe the number of pins knocked down and the resulting total score as linear functions of variables indicating the number of pins knocked down by each roll.

A maximally inefficient Monte Carlo estimate of pi

Introduction

Let’s play a game: I will repeatedly flip a fair coin, showing you the result of each flip, until you say to stop, at which point you win an amount equal to the fraction of observed flips that were heads. What is your strategy for deciding when to stop?

This weekend 6/28 is “Two-Pi Day” (or Tau Day, if you like). Celebration in the classroom can take many forms: for example, we can write a program to calculate an approximation of the value of \pi using Monte Carlo simulation. Or we can find some mathematical problem where the occurrence of \pi is surprising or unexpected. I think the game described above is an interesting case where we can do both at the same time.

Throwing darts

The most common approach to estimating \pi using Monte Carlo simulation involves selecting random points uniformly distributed in a square; imagine throwing darts at a square dartboard, as shown in the figure below.

Monte Carlo simulation of 5000 uniformly random points in a square. Red points lie within the inscribed circular sector, and the remaining blue points lie outside the sector.

Let the indicator random variable X_1 equal 1 if a dart lands within the inscribed circle– or within the inscribed 90-degree circular sector shown above, which has the same area– otherwise X_1=0. Then E(X_1)=P(X_1=1) is the ratio of areas \pi / 4, so that we can estimate \pi by Monte Carlo estimation of E(Y_1), where Y_1=4X_1. In Python:

import random

def dart():
    x, y = random.random(), random.random()
    return 4 * int(x * x + y * y < 1)

n = 1000000
print(sum(dart() for _ in range(n)) / n)

Note how simple the simulation of dart() is, just a couple of random numbers and a few arithmetic operations. Indeed, this might seem too simple: we don’t really need the randomness of Monte Carlo simulation here, and could instead integrate over an evenly spaced grid of thrown darts. And despite the simple arithmetic implementation of the simulation, the “pi-ness” of the process– there is a circle clearly visible in the diagram– is rather unsurprising.
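That grid alternative is a one-liner as well, using the midpoint rule on an n-by-n grid; only the cells straddling the quarter circle contribute error, which is on the order of 1/n:

```python
# Deterministic alternative to random darts: midpoint rule on an
# n-by-n grid of sample points in the unit square.
n = 1000
inside = sum((i + 0.5) ** 2 + (j + 0.5) ** 2 < n * n
             for i in range(n) for j in range(n))
pi_grid = 4 * inside / (n * n)   # accurate to roughly 1/n
```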

Throwing noodles

Another common approach to estimating \pi is Buffon’s noodle. Imagine throwing uncooked straight spaghetti noodles, each one foot long, onto a floor marked with parallel lines spaced one foot apart, with the random variable X_2 equal to the number of times a particular noodle crosses a line on the floor.

Monte Carlo simulation of 100 Buffon’s noodles. Red noodles cross a line, and blue noodles do not.

It turns out that E(X_2)=2/\pi. (Note this is an expected value, not just a probability, because it works even if we cook the noodles, so that X_2 can take on non-negative integer values beyond just 0 and 1.) We can similarly estimate \pi by Monte Carlo simulation of this noodle-throwing process, appropriately transforming the resulting estimate of E(X_2). Again in Python:

import random
import math

def noodle():
    y = random.random()
    angle = math.pi * random.random()
    angle_0 = math.asin(y)
    return int(angle_0 < angle < math.pi - angle_0)

n = 1000000
print(2 / (sum(noodle() for _ in range(n)) / n))

As a method of computing an approximation of \pi, this feels a bit circular (!) to me. We are using math.pi, the very value we are trying to estimate, in the implementation of the simulation! (I’m also skipping over the error analysis, which is more complicated than for the other simulation approaches discussed here, due to the reciprocal in the estimate.)

Flipping coins

Coming back now to the Chow-Robbins coin-flipping game, unfortunately fully specifying an optimal strategy with maximum possible expected return (known to be approximately 0.793) remains a complex open problem. But let’s consider a strategy that, although it is suboptimal, is very simple to describe, still performs reasonably well, and turns out to have interesting behavior relevant to this discussion: let’s suppose that we stop when the number of heads h first exceeds the number of tails t=h-1, so that we win X_3=h/(h+t)=(t+1)/(2t+1).

Then it’s a nice exercise to show that our expected return using this strategy is

E(X_3)=\sum\limits_{t=0}^{\infty} \frac{1}{t+1}{2t \choose t}\frac{1}{2^{2t+1}}\frac{t+1}{2t+1} = \frac{\pi}{4} \approx 0.7854
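We can check this sum numerically: writing c_t = {2t \choose t}/4^t, each term simplifies to c_t/(2(2t+1)), and c_{t+1} = c_t (2t+1)/(2t+2), so partial sums are cheap to compute (though convergence is slow, with the tail after T terms on the order of 1/\sqrt{T}):

```python
from math import pi

# c tracks C(2t, t) / 4**t, starting at t = 0; each term of the series
# is c / (2 * (2t + 1)).
total, c = 0.0, 1.0
for t in range(10 ** 6):
    total += c / (2 * (2 * t + 1))
    c *= (2 * t + 1) / (2 * t + 2)

assert abs(total - pi / 4) < 1e-3
```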

So we can estimate \pi in the same way we did with the dartboard. In Python:

import random

def coin_flips():
    # Flip a fair coin until heads first outnumber tails, then return
    # 4 * (fraction of heads), whose expected value is 4 * (pi/4) = pi.
    heads = 0
    flips = 0
    while heads <= flips - heads:  # i.e., while heads <= tails
        heads += random.getrandbits(1)
        flips += 1
    return 4 * heads / flips

total = 0
n = 0
while True:
    total += coin_flips()
    n += 1
    print(f'Estimate after {n} iterations = {total / n}')

This works… sort of. There are a lot of interesting things happening here. First, the variance of our estimate is less than one-third that for a dart throw, so that we should need correspondingly fewer Monte Carlo iterations to achieve comparable accuracy.
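To put a number on that variance claim (again my own check, not from the original analysis): a dart throw contributes X=4\cdot 1_{\text{inside}}, with variance 16\cdot\frac{\pi}{4}(1-\frac{\pi}{4})=\pi(4-\pi), while the coin game returns 4(t+1)/(2t+1) with probability \frac{1}{t+1}{2t \choose t}2^{-(2t+1)}, so its variance can be computed from the same kind of series.

```python
import math

# Variance of one dart throw: X = 4 * Bernoulli(pi/4).
dart_var = 16 * (math.pi / 4) * (1 - math.pi / 4)  # = pi * (4 - pi)

# Variance of one coin game: it returns 4*(t+1)/(2t+1) with probability
# C(2t,t) / ((t+1) * 2^(2t+1)), so E(X^2) = sum_t 8 * c_t * (t+1)/(2t+1)^2,
# where c_t = C(2t,t)/4^t is tracked iteratively.
ex2 = 0.0
c = 1.0
for t in range(10**6):
    ex2 += 8 * c * (t + 1) / (2 * t + 1) ** 2
    c *= (2 * t + 1) / (2 * t + 2)
coin_var = ex2 - math.pi ** 2  # both estimators have mean pi

print(coin_var / dart_var)  # the ratio comes out a bit under 1/3
```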

Also, each individual coin flip is a lot less computationally expensive than each dart throw. The latter requires 128 random bits to generate two floating-point values for the (x,y) position of a dart, while a coin flip needs just a single random bit. Whether– and how quickly– a single random bit can even be extracted varies from language to language and library to library. For example, even sticking to Python, it’s interesting to compare the above random.getrandbits(1) with the theoretically equivalent but much slower random.randint(0, 1).

But these potential benefits are dwarfed by the overall cost of each Monte Carlo iteration simulating a single playout of the game, which generally involves more than just a single coin flip. You may notice that the output from the above code scrolls by with not-really-blazing speed… but periodically there is a significant pause, sometimes for a long time.

The problem is that the expected number of coin flips– that is, the average duration of a single Monte Carlo iteration– is infinite. In other words, even if we know that we need, say, roughly 12 million iterations of the game to achieve 3 digits of accuracy in our estimate of \pi, we can’t even estimate how long those 12 million iterations will take to execute.
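To see why the expectation diverges (a sketch of the missing step): the game lasts exactly 2t+1 flips with probability given by the Catalan number term from the series above, and Stirling's approximation gives

\frac{1}{t+1}{2t \choose t}\frac{1}{2^{2t+1}} \sim \frac{1}{2\sqrt{\pi}\,t^{3/2}}

so the expected number of flips \sum_t (2t+1)\cdot\frac{1}{t+1}{2t \choose t}\frac{1}{2^{2t+1}} behaves like \sum_t \frac{1}{\sqrt{t}}, which diverges.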

All is not entirely lost, however. Note that although the expected duration of a game is infinite, the program above is never in any danger of being well and truly hung in the inner loop. That is, even during one of those long pauses, we are guaranteed– with probability 1– to observe eventual completion of the current game and subsequent starting of the next new game.

We are right on the razor’s edge of hanging, though. That is, suppose that we play the game with a biased coin, so that the probability of heads is p<1/2. Then there is a positive probability (1-2p)/(1-p) that a game will never end, with a meandering deficit of heads that is never overcome. (Which means that the endless loop through games above is guaranteed, with probability 1, to eventually encounter and get stuck on such a never-ending game.)

On the other hand, if p>1/2, then the situation improves. For example, let’s suppose that we play the game with a biased coin with p=3/4. Then the expected return from this game is \pi/(2\sqrt{3}), so we can still estimate \pi by playing this modified game:

import random
import math

def coin_flips():
    # The same game with a biased coin: getrandbits(2) is nonzero with
    # probability 3/4, so heads now comes up with probability p = 3/4.
    heads = 0
    flips = 0
    while heads <= flips - heads:
        heads += int(random.getrandbits(2) != 0)
        flips += 1
    return 2 * math.sqrt(3) * heads / flips

n = 120000
print(sum(coin_flips() for _ in range(n)) / n)
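A quick numerical check of that \pi/(2\sqrt{3}) claim (my own sketch): the biased game ends with t tails (and t+1 heads) with probability C_t\,p^{t+1}q^t, where C_t is the t-th Catalan number, and then returns (t+1)/(2t+1). Successive terms shrink roughly geometrically (by about 4pq=3/4), so a couple hundred terms suffice.

```python
import math

# Expected return of the biased game: sum over t of
# C_t * p^(t+1) * q^t * (t+1)/(2t+1), using the Catalan number ratio
# C_{t+1}/C_t = 2*(2t+1)/(t+2) to update the term probability in place.
p, q = 3 / 4, 1 / 4
prob = p        # t = 0 term: C_0 * p^1 * q^0
expected = 0.0
for t in range(200):
    expected += prob * (t + 1) / (2 * t + 1)
    prob *= 2 * (2 * t + 1) / (t + 2) * p * q

print(expected, math.pi / (2 * math.sqrt(3)))
```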

Better yet, not only is the expected number of coin flips (i.e., random bits) per game finite, it’s just 2 flips on average per game, so that each game is relatively fast to simulate. Better still, the variance is now less than 12% of the dartboard’s, so that we don’t even have to play as many games to achieve the same accuracy.
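That 2-flips-per-game figure follows from Wald's identity (a sketch): if T is the number of flips when heads first outnumber tails, then the excess of heads over tails at time T is always exactly 1, so

E(\text{heads}-\text{tails})=(2p-1)E(T)=1

and thus E(T)=\frac{1}{2p-1}, which is 2 at p=3/4 (and which diverges as p \to 1/2, consistent with the fair-coin case above).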

How shallow is the cut card effect?

Introduction

This is a follow-up to an article from nearly a decade ago, answering what I didn’t recognize at the time as an interesting question, buried in an otherwise offhand remark. To repeat the setup, suppose that we shuffle together d=6 decks of playing cards, and play a series of rounds of blackjack, each using the same fixed CDZ- playing strategy, stopping at the “cut card”, a marker inserted in the shoe indicating when to reshuffle. Let’s place the cut card at “penetration” p=0.75, that is, stop and reshuffle when at most d(1-p)=1.5 decks or 78 cards remain in the shoe. Let the random variable X_n be the outcome of the n-th round. The figure below shows estimates of E(X_n), in percent of initial wager, from just 1000 simulated shuffled shoes.

Expected return vs. number of rounds dealt from the shoe, playing fixed CDZ- strategy optimized for full 6-deck shoe. The mean of 1000 randomly shuffled shoes is shown in blue, with standard error of the sample mean shown in gray.

That steep drop in expected return late in the shoe is the “cut card effect.” An intuitive, hand-waving explanation goes something like this: although the position of the cut card is fixed, the number of rounds needed to reach it varies from one shuffled shoe to the next. If near the end of the shoe we have played a larger-than-typical number of rounds, then each of those rounds must have consumed a smaller-than-typical number of cards on average, and so was likely rich in tens. The remainder of the shoe is thus poor in tens, a disadvantage for the player.

Inspection of the above figure suggests that the cut card effect “begins” at approximately round n=42 or so (for this particular choice of rules, playing strategy, and penetration). The motivation for this post is to show that the cut card effect actually “begins” much earlier in the shoe, at round n=19 in this case.

Extended True Count Theorem

The cut card effect is interesting because, at first glance, it seems to contradict a very useful theorem, first proved by Thorp, that is easy to state incorrectly, so for now I’ll state it vaguely:

Theorem: For each round n played with the same fixed strategy, E(X_n)=E(X_1), provided we do not run out of cards.

In other words, compute the exact expected return from playing our fixed strategy off the top of the shoe: this is E(X_1). Now compute the expected return from playing a second round, after having played a first round off the top of the shoe: this is E(X_2), and the above theorem states that these two expected values are equal. Similarly, the expected return from playing the third round into the shoe is E(X_3)=E(X_2)=E(X_1), etc.

Looking again at the figure above, the small variations in estimated expected return for small n are purely due to sampling error. If we could in principle evaluate not just 1000 randomly shuffled arrangements of cards in the shoe, but all (52d)! possible arrangements, computing each expected return exactly, then the above theorem states that this curve would be exactly constant for those small n.
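The symmetry at work here can be illustrated with a toy example (a hypothetical five-card mini-deck, not blackjack): if each "round" just deals the next card, then averaging exactly over all possible shuffles, the expected value of the card dealt in round n is the same for every n.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical mini-deck; a "round" consumes exactly one card, and the
# "return" of round n is the value of the n-th card dealt.
deck = [1, 2, 3, 10, 10]
totals = [Fraction(0)] * len(deck)
count = 0
for shuffle in permutations(deck):
    count += 1
    for n, card in enumerate(shuffle):
        totals[n] += card
expected = [total / count for total in totals]
print(expected)  # the same value, 26/5, for every round
```

Real rounds consume a variable number of cards, but the same exchangeability of shuffled arrangements drives Thorp's theorem, as long as every round is guaranteed to start and complete.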

But clearly something different is happening for n \geq 42, so what counts as “small” n?

Integer linear programming

This is where the vague “not running out of cards” comes in. There are two conditions that must be satisfied to guarantee E(X_n)=E(X_1): we must be guaranteed to reach round n, and we must be guaranteed to be able to complete round n.

I think the latter condition is simpler to describe. We need to ensure that, no matter how the previous n-1 rounds play out, we have enough cards left in the shoe to complete the subsequent round n, with neither the player nor the dealer running out of cards. For these rules and CDZ- playing strategy, we will consume at most 33 cards in any single round, so as long as we don’t start the round before the cut card, there is no danger of “running out of cards.”

It’s the former condition that is trickier, and is the heart of the problem here: we must, with probability 1, complete n-1 prior rounds before reaching the cut card. Otherwise, to even define the random variable X_n, we must condition the population of possible shuffled arrangements of cards in the shoe to exclude those cases where we reach the cut card too soon.

So, how many rounds are we guaranteed to be able to play before reaching the cut card? This is an integer linear programming problem: index by k the 3,054,067 possible subsets of card ranks that can be consumed in a round, and let a_{r,k} be the number of cards of rank r consumed in subset k. Then the problem is to find non-negative integers x_k, the number of rounds consuming each subset k:

minimizing \sum_k x_k

subject to \sum_k a_{r,k}x_k \leq 4d, 1 \leq r < 10

\sum_k a_{10,k}x_k \leq 16d

\sum_k (\sum_r a_{r,k})x_k \geq 52dp

The code is on GitHub: C++ to compute all possible subsets of cards comprising the player hands in a round, and Mathematica to do the same for the dealer, to compute the outer product of player and dealer outcomes, and to solve the linear programming problem.

Interpreting results

The resulting minimum value of the objective function is 18. This means that it is possible– albeit unlikely– to reach the cut card after having played just 18 rounds. If we could eliminate the sampling error in the above figure, the actual exact expected return would be constant only through the first 18 rounds: E(X_1)=E(X_2)=\ldots =E(X_{18}).

But the extended true count theorem has nothing to say about the expected return from round n=19 or later. Empirically, these mid-shoe departures from E(X_1) are almost certainly very small… but they are just as almost certainly non-zero.

At a blackjack table, imagine next to the dealer is a display indicating the number n of the round to be played, that increments by one after each round, and resets to 1 after each reshuffle. Further imagine that we walk up to the table in the middle of the shoe, and observe the displayed round number, with no information about any cards that have been dealt since the last shuffle. If we observe n \leq 18, then we can be confident that the expected return from playing the upcoming round is the same as if we walked up to a full shoe: E(X_n)=E(X_1). On the other hand, if we observe, say, n=19, then the expected return E(X_{19}) is almost certainly not exactly equal to E(X_1)… but we can’t say for sure, let alone determine the magnitude or even the sign of the difference.

In practice, such a display does not exist; if we walk up to a table cold we don’t know how many rounds have been played up to that point. But suppose instead that we observe not the number of the round, but the number of cards left in the shoe before the cut card. How can we tell when we are in the “safe zone” of no cut card effect, with expected return exactly equal to E(X_1)?

If there are 234 cards left in the shoe, before the 75% penetration cut card, that’s a full shoe, so the expected return from the round is E(X_1). But suppose that there are as many as 162 cards still left in the shoe. Seems like that should still be plenty… but with even that many cards remaining, we are already “in the cut card effect,” where the extended true count theorem provides no information about the expected return from the next round. (It’s an exercise for the reader to show that this minimum of 162 cards remaining is the solution to another linear programming problem, but one that we can solve in our head.)
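For what it's worth, here is that in-our-head arithmetic as a sketch, the only added assumption being that a round of blackjack consumes at least 4 cards (2 for the player, 2 for the dealer):

```python
# 6 decks with the cut card at 75% penetration: 52 * 6 * 0.75 = 234 cards
# are dealt before the reshuffle.
playable = 52 * 6 * 3 // 4  # 234

# The ILP says as few as 18 rounds can reach the cut card; assuming each
# round consumes at least 4 cards, those rounds may have consumed only 72.
min_consumed = 18 * 4

# With this many (or fewer) cards left before the cut card, the next round
# could already be round 19, outside the theorem's safe zone.
print(playable - min_consumed)  # 162
```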