An updated Google Books word frequency list

Introduction

I think I nerd-sniped myself. This started with the objective of writing a simple program to play Hangman, as a demonstration of a potential programming exercise for students. (Don’t assign the problem if you don’t know the solution.) But in the process of creating the word list for the game, I found that last year Google released an updated export of its Google Books Ngrams data, which I’ve used and discussed here several times before. So I thought it would be interesting to revisit that data reduction exercise, and see what’s changed since the last release.

First, let’s get the Hangman game out of the way. The Python code is on GitHub; it’s only about 30 lines (plus a list of 10,000 words)… but I think it’s an interesting 30 lines. This could be a great problem for students, either to implement the game themselves, or even just to play this version, and inspect the code to figure out what’s “evil” about this particular computer player. I think it’s an interesting problem to implement even this simple approach efficiently, let alone to significantly improve on the computer player’s performance (see Reference 3 below).

The word list for the game is selected from a longer list– also on GitHub— of 458,343 words and their corresponding frequency of occurrence in the Google Books dataset. I built this longer list in two steps, as described below.

Google Books 1-grams

First, I downloaded the entire list of Google Books 1-grams (roughly, whitespace-delimited tokens) from the new English version 20200217 release, and aggregated the total number of occurrences of each 1-gram containing only the case-insensitive letters ‘A’ through ‘Z’, with no other special characters, but otherwise without any restrictions on word length or frequency of occurrence.

(Aside: This is the same filtering approach that I used for the previous 20120701 release, although the file organization and format changed in this new version. Peter Norvig did a similar analysis of the earlier data, but I was unable to reproduce his results; both I and at least one commenter on his site observed identical frequency counts that are, interestingly, almost-but-not-quite exactly half of his values.)

The result is 14,808,229 tokens and corresponding frequency counts. This is roughly triple the 4,999,714 tokens from the 2012 release, although it’s interesting that this new data set is not a proper superset of the old: there are 57,754 tokens missing in the new release, three of which are valid Collins Scrabble words (more on this later): alcaicerias (a Spanish bazaar), initiatrices (female initiators), and nouritures (nourishment).

More interesting are the new words that have been added in the last decade or so since the 2012 release. Scanning the 250 most frequently occurring new tokens yields a technological trip down memory lane: instagram, blockchain, bitcoin, hadoop, brexit, icloud, crowdfunding, pinterest, wikileaks, obamacare, gamification, hashtag, github, selfie, airbnb, kinect, tumblr, crispr, sexting, whatsapp, snapchat, spotify, microservices, cryptocurrency, tensorflow, emoji, cisgender.

An updated word frequency list

Armed with this list of nearly 15 million tokens and corresponding frequencies, the second step was to reduce the list of tokens to a more manageable “dictionary” of “words.” To do this, I used the union of the following five word lists:

  • The ENABLE2k word list, containing 173,528 words.
  • The North American Scrabble Players Association (NASPA) Word List 2018 (NWL2018), used in Scrabble tournaments in the United States and Canada, containing 192,111 words.
  • The Collins Scrabble Words 2019 (CSW19) list, used in Scrabble tournaments pretty much everywhere else, containing 279,496 words.
  • The Spell Checker Oriented Word List (SCOWL) by Kevin Atkinson, containing 430,590 words. (See the repository for details on the configurable parameters of this word list.)
  • [Edit 2021-09-11 to add] the Princeton WordNet Lexical Database for English version 3.1, containing 63,745 words constrained to all lowercase letters without any spaces or punctuation.

The SCOWL is included as a sort of intentional overkill, a compromise between the size of the dataset and the hope that it will contain as a subset whatever dictionary you might want to use for your application. Note that this comes at a cost of including tokens that are definitely not words in any reasonable dictionary; for example, all 26 single-letter tokens are present, not just the two words a and I.

The result is a single tab-separated text file with 458,343 rows, one for each word, and three columns: the word, followed by the number of occurrences in the 20120701 and 20200217 Google Books datasets, respectively. The entire list is sorted in decreasing order of frequency in the latest 20200217 dataset.

References:

  1. Google Books Ngram Viewer Exports, English versions 20120701 and 20200217
  2. Atkinson, K., Spell Checker Oriented Word List (SCOWL) Copyright 2000-2018 by Kevin Atkinson
  3. Barbay, J. and Subercaseaux, B., The Computational Complexity of Evil Hangman, arXiv:2003.10000
  4. Princeton University “About WordNet.” WordNet. Princeton University. 2011.

Balanced clock hands

Given an analog clock with sweeping hour, minute, and second hands, at what times are the hands most evenly “balanced”– that is, most equally separated in angle?

It’s a standard high school algebra problem to compute all of the times at which just the hour and minute hands are aligned on top of each other. And it’s relatively straightforward to show that all three hands, including the second hand, are only exactly aligned twice per day, at 12 o’clock. But this problem is different: here, we would like the three angles made by the three hands to be as equal– that is, as close to 120 degrees each– as possible.

As we will see shortly, there is no time at which all three angles are exactly 120 degrees. (This is another nice exercise to prove just using pencil and paper.) So, how close can we get? This post was an admittedly somewhat frivolous excuse to experiment with SymPy, the Python symbolic mathematics library bundled with SciPy. My approach was pretty brute-force, letting SymPy do all of the heavy lifting with very few lines of code, including evaluating multiple different metrics defining what we mean by “balanced.” (All of the code discussed here is available on GitHub.)

The problem is complicated by the floor functions that would be buried in the cost function if we were to use a single parameter to represent the continuously varying time over the entire 12-hour domain. Instead, let’s divide the domain into 12×60=720 one-minute intervals, each specified by a fixed integer hour and minute, and for each, only let the second hand sweep through the single revolution of that one minute of time, computing the resulting three angles between consecutive hands:

def hand_positions(hour, minute, second):
    """Return positions in [0, 360) of clock hands."""
    r_60 = sym.Rational(60, 1)
    return (360 * (hour + minute / r_60 + second / (r_60 * r_60)) / 12,
            360 * (minute + second / r_60) / r_60,
            360 * second / r_60)

def hand_angles(hands):
    """Return angles between clock hands."""
    x, y, z = sorted(hands)
    return (y - x, z - y, x - z + 360)

At any given time, what yardstick should we use to measure how “balanced” the clock hands are? There are at least a couple of reasonable alternatives, even if we restrict our attention to (piecewise) linear cost functions. The first that occurred to me was to “maximize the minimum” angle: by the pigeonhole principle, at least one of the angles is at most 120 degrees, so let’s try to make that smallest angle as large as possible, measuring the deficit:

def max_min(*time):
    """(120 minus) minimum angle between clock hands."""
    return 120 - min(hand_angles(hand_positions(*time)))

Rather than maximizing the minimum angle, another option is to “minimize the maximum” absolute deviation of each angle from the desired optimal 120 degrees. This cost is implemented below. (I had initially also implemented the Manhattan distance, i.e., the sum (or equivalently, the average) of absolute deviations from 120 degrees. But it’s another nice problem to show that this is unnecessary: the sum of absolute deviations yields the same ordering as the maximum of absolute deviations… but this would not generally be true if our clocks somehow had more than just three hands (why?).)

def min_max(*time):
    """Max. deviation from 120 deg. of angles between clock hands."""
    return max(abs(a - 120) for a in hand_angles(hand_positions(*time)))

At this point, the key observation is that the only possible candidate times at which these cost functions are optimized are at their critical points (or at endpoints of the domain). And because these costs are piecewise linear, the critical points are easy to enumerate: either two of the angles are equal (when maximizing the minimum angle), or one of the angles is exactly 120 degrees (when minimizing the absolute deviation). The following function lumps all of these possibilities together into one list:

h, m, s = [sym.Symbol(v) for v in 'hms']

def critical_points():
    """Generate possible critical positions of sweeping second hand."""
    yield 0
    for x, y, z in permutations(hand_positions(h, m, s)):
        a, b, c = (y - x, z - y, x - z + 360)
        for lhs, rhs in ((a, b), (a, c), (b, c), (a, 120), (b, 120), (c, 120)):
            yield from sym.solve(lhs - rhs, [s])

The resulting “most balanced” times are shown below. There are a couple of interesting observations. First, a solution won’t be unique; times come in “mirror image” pairs with the same cost. Second, although the two cost functions considered here do yield slightly different optimal times, they differ by less than 3 hundredths of a second, and the same four or five best mirror pairs are all clustered within about 8 hundredths of a second of each other– all at approximately 2:54:34, along with its mirror time 9:05:25.

The two “mirror” times with the same minimum maximum absolute deviation of clock hand angles from 120 degrees: 2:54:34 394/719 (at left) and 9:05:25 325/719 (at right).

Finally, what if the second hand doesn’t sweep, but ticks from one integer second to the next? (When I was a kid, this was the magic method to distinguish between the real and fake Rolex watch, neither of which any of us had actually ever seen.) In this case, the most balanced times– that also appear within about a tenth of a second of the nearby overall top ten times– are at 5:49:09, along with its mirror time 6:10:51, as shown below.

The two “mirror” times that are most balanced if we constrain the second hand to “tick” at integer numbers of seconds: 5:49:09 (at left) and 6:10:51 (at right).