Introduction
Let’s play a card game for money. For example, I (the “house”) will thoroughly shuffle together six decks of playing cards, and deal a round of blackjack against you (the “player”). That is the scenario motivating this post, but the details of the rules of the game don’t matter much for our purpose here, so we can alternatively consider a much simpler game: I will shuffle a single deck of cards, and deal one card to you and one card to me. High card wins (aces are low, face cards have value 10), with a tie going to the house.
Because we are playing for money, both the player and the house have an interest in knowing the exact (dis)advantage in the game. We can write a program to compute the player’s expected return, as a function of the composition of ranks of face-down cards in the shuffled “shoe.” For example, we might write the following Python code to compute the expected return for the high card game:
from fractions import Fraction
import numpy

def expected_value(shoe):
    """Expected value of high card game as fraction of initial wager."""
    ev = 0
    for card1 in range(10):
        for card2 in range(10):
            s = list(shoe)
            p = Fraction(s[card1]) / sum(s)
            p = p * s[card2] / sum(s)
            ev = ev + p * (1 if card1 > card2 else -1)
    return ev

print(expected_value((4, 4, 4, 4, 4, 4, 4, 4, 4, 16)))
The result is -25/169, or about 14.8% of the player’s wager going to the house.
Or is it? There is an error in the above code. Let’s leave it as an exercise for the reader to find the bug (hint: one line is missing) … and instead prove that there must be a bug, even without identifying or fixing it.
The Extended True Count Theorem
The idea is to suppose that we shuffle the deck as usual, but then “burn” the top card of the deck, setting it aside face down and unused, before starting the game by dealing the player’s and dealer’s “hands.” Intuitively, this action should not affect the player’s expected return: neither the player nor the dealer sees the burn card, and at least in this simple high card game it wouldn’t matter even if they did, since there is no “playing strategy” that might be influenced by knowledge of the burn card’s value.
We can verify this by computing the average expected return played from the resulting 51-card deck, weighted by the probability of each possible burned top card:
def average_ev(shoe):
    """Average expected value after 'burning' the top card."""
    ev = 0
    for card in range(10):
        s = list(shoe)
        p = Fraction(s[card]) / sum(s)
        s[card] = s[card] - 1
        ev = ev + p * expected_value(s)
    return ev

print(average_ev((4, 4, 4, 4, 4, 4, 4, 4, 4, 16)))
The resulting average expected return is -557/3757: still a roughly 14.8% house edge, but not exactly the same -25/169 that we should expect, thus proving that our original implementation of expected_value(shoe) must be in error.
This approach to verifying the correctness of expected_value(shoe) applies much more generally, as described in this paper, as well as here and here. For example, instead of just burning the top card, modify average_ev(shoe) to play the game twice, computing the average expected return from the second round (effectively burning the top two cards instead of just one). Or deal cards from the top of the deck, stopping when you have either dealt an ace or a maximum of ten cards, then play the game from the resulting depleted deck. Or repeatedly flip a coin, dealing a card for each heads, stopping when you flip tails or have flipped heads a maximum of ten times, then play the game. In each case, if we ever find that average_ev(shoe) != expected_value(shoe), then we know that something is wrong.
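As a sketch of the first of these variants, the following hypothetical check burns the top two cards and averages over both, taking the expected-value function as a parameter so that any implementation can be plugged in. The helper names (brute_force_ev, average_ev_burning_two) are invented here for illustration, and brute_force_ev deliberately uses a different pair-counting formulation of the high card game, so it leaves the exercise above unspoiled:

```python
from fractions import Fraction

def brute_force_ev(shoe):
    """High card EV (ties to the house) by counting ordered card pairs."""
    total = sum(shoe)
    ev = Fraction(0)
    for c1 in range(10):
        for c2 in range(10):
            # number of ordered (player, dealer) deals with these two values
            pairs = shoe[c1] * (shoe[c2] - (1 if c1 == c2 else 0))
            ev += Fraction(pairs, total * (total - 1)) * (1 if c1 > c2 else -1)
    return ev

def average_ev_burning_two(shoe, ev_fn):
    """Average EV over all ways to burn the top two cards, then play."""
    ev = Fraction(0)
    for b1 in range(10):
        for b2 in range(10):
            s = list(shoe)
            if s[b1] == 0:
                continue
            p = Fraction(s[b1], sum(s))  # probability of the first burn card
            s[b1] -= 1
            if s[b2] == 0:
                continue
            p *= Fraction(s[b2], sum(s))  # probability of the second burn card
            s[b2] -= 1
            ev += p * ev_fn(s)  # play from the depleted 50-card deck
    return ev

shoe = (4, 4, 4, 4, 4, 4, 4, 4, 4, 16)
assert brute_force_ev(shoe) == average_ev_burning_two(shoe, brute_force_ev)
```

With a correct expected-value function the two quantities match exactly (the exact rational arithmetic makes the comparison meaningful); passing the buggy expected_value above as ev_fn instead exposes a discrepancy, just as the single-burn check did.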
(The converse, however, is not true: matching expected values do not imply that the implementation of expected_value(shoe) is necessarily correct. For example, let’s change the game slightly so that ties are a push instead of going to the house, with the following change to expected_value (but otherwise leaving the original bug as-is):
            ev = ev + p * numpy.sign(card1 - card2)
The reader can verify that the discrepancy goes away, with average_ev(shoe) == expected_value(shoe) == 0: not only do the values match, but they are correct, despite the bug remaining in the implementation.)
Application to blackjack
Coming back now to the original motivation for this post, let’s consider a round of blackjack with some specified rules (6D, S17, DOA, DAS, SPL3, LS), played using CDZ- strategy optimized for the full shoe. Using my blackjack software (on GitHub), we can compute the corresponding expected return from a round played off the top of the shoe (think expected_value(shoe)), as well as the average expected return after removing a single card, but playing the same fixed “full-shoe” strategy (think average_ev(shoe)):
#include "blackjack.h"
#include <iostream>

int main() {
    BJShoe shoe{6};
    BJRules rules;
    BJStrategy cdz;
    BJProgress progress;
    BJPlayer *basic = new BJPlayer{shoe, rules, cdz, progress};
    std::cout << basic->getValue() << std::endl;
    BJReal ev = 0;
    for (int card = 1; card <= 10; ++card) {
        shoe.deal(card);
        BJPlayer *eor = new BJPlayer{shoe, rules, *basic, progress};
        shoe.reset();
        ev += eor->getValue() * shoe.getProbability(card);
    }
    std::cout << ev << std::endl;
}
with the following results (as fraction of initial wager):
-0.0033026803733750346
-0.0033026803733748255
These two values are close, but not exactly equal. Granted, these calculations were done using the limited precision of 64-bit floating-point arithmetic, with the accompanying rounding errors. But how can we tell whether the differences are due to limited numeric precision, or due to errors in the algorithm itself?
To answer this question, I recently updated the code to typedef the numeric data type (BJReal in the above example code) used for computing probabilities and expected values. This type defaults to double, but we can use the arbitrary-precision math::Rational instead, otherwise using exactly the same algorithm, and verify that the expected return in the above scenario is exactly:
and that we get exactly this same result whether played from the full shoe or averaged over each possible burn card. In other words, the discrepancy in the lowest order digits above was due entirely to double-precision rounding error.
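The same trick can be sketched in Python: parameterize the computation on its numeric type, in the spirit of the BJReal typedef. This is an illustrative sketch using a pair-counting formulation of the simple high card game (high_card_ev is a name invented here), not the blackjack code itself:

```python
from fractions import Fraction

def high_card_ev(shoe, number=Fraction):
    """High card EV (ties to the house) in a caller-chosen numeric type.

    `number` plays the role of the BJReal typedef: pass float for fast
    machine arithmetic, or Fraction for exact rational arithmetic.
    """
    total = sum(shoe)
    ev = number(0)
    for c1 in range(10):
        for c2 in range(10):
            # number of ordered (player, dealer) deals with these two values
            pairs = shoe[c1] * (shoe[c2] - (1 if c1 == c2 else 0))
            ev += number(pairs) / (total * (total - 1)) * (1 if c1 > c2 else -1)
    return ev

shoe = (4, 4, 4, 4, 4, 4, 4, 4, 4, 16)
exact = high_card_ev(shoe, Fraction)  # exact rational, no rounding at any step
approx = high_card_ev(shoe, float)    # same algorithm, rounded at every step
```

If the float result agrees with the exact rational to within a few ulps, any remaining low-order discrepancy can be attributed to rounding; an exact-rational mismatch, on the other hand, would indict the algorithm itself.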
Of course, as shown in the high card “ties push” excursion described above, this doesn’t prove that the algorithm is correct. But inequality would have proved that the algorithm is incorrect. For this reason, I think this is a useful additional test to consider when evaluating algorithms like these.