Getting matplotlib’s default colors

When you make plots with Python’s matplotlib package, the package chooses some default colors for you. In the code below, we draw the lines y = x^2 + i for i = 1, \dots, 11.

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0.0, 2.0, 0.01)

for i in range(1, 12):
    y = x**2 + i
    plt.plot(x, y)

plt.show()

Notice that the first and the 11th line have the same color. That’s because matplotlib only has 10 default colors. We can use the code snippet below to get the hex codes for these 10 colors:

print(plt.rcParams['axes.prop_cycle'].by_key()['color'])
# ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
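
Because the property cycle wraps around, the i-th plotted line (0-indexed) gets color i mod 10. A quick sketch of this, with the hex codes hard-coded from the output above so it runs without matplotlib (the function name is mine):

```python
# The ten default colors from matplotlib's property cycle,
# hard-coded from the printed output above.
default_colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
                  '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']

def cycle_color(i):
    """Color assigned to the i-th plotted line (0-indexed)."""
    return default_colors[i % len(default_colors)]

# The 1st and 11th lines (indices 0 and 10) share a color.
print(cycle_color(0), cycle_color(10))  # both '#1f77b4'
```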

The code snippet below is an example of how to pick out specific default colors. For example, if we wanted all the lines to be green (the 3rd color in the list above):

default_colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
for i in range(1, 12):
    y = x**2 + i
    plt.plot(x, y, color=default_colors[2])

plt.show()

Update (2023-05-27): As commenter jared f points out, there are several other ways to define colors and get the default matplotlib colors. This reference describes all the different ways to do so.

References:

  1. Stack Overflow. “Get default line colour cycle.”

What is a horizon chart?

A horizon chart is a compact version of an area chart. In the words of Jonathan Schwabish (Reference 1, page 164), it is

… an area chart that is sliced into equal horizontal intervals and collapsed down into single bands, which makes the graph more compact and similar to a heatmap…

What are horizon charts good for? Here is Schwabish again:

Horizon charts are especially useful when you are visualizing time series data that are so close in value so that the data marks in, for example, a line chart, would lie atop each other. … the purpose of the horizon chart is not necessarily to enable readers to pick out specific values, but instead to easily spot general trends and identify extreme values.

Horizon charts can be a bit difficult to interpret at first, partly because the way to create them is a bit involved. Let’s illustrate how horizon charts are created with an example using the EuStockMarkets dataset from the datasets R package. (Most of this code is derived from the documentation for the horizonplot function in the latticeExtra package.)
The dataset contains the daily closing prices of 4 European stock indices from 1991-1998.

library(latticeExtra)

str(EuStockMarkets)
# Time-Series [1:1860, 1:4] from 1991 to 1999: 1629 1614 1607 1621 1618 ...
# - attr(*, "dimnames")=List of 2
#  ..$ : NULL
#  ..$ : chr [1:4] "DAX" "SMI" "CAC" "FTSE"

Let’s make a line plot of just Germany’s DAX:

# look at just DAX
dax_ts <- EuStockMarkets[, 1]

# raw plot
plot(dax_ts, ylab = "DAX", main = "Plot of DAX (1991-1998)")

First, choose a baseline level B and some bandwidth \Delta. Color each horizontal band between B + i\Delta and B + (i+1)\Delta with a different color (i is an integer, usually between -3 and 2 inclusive). In the code below, we choose a baseline of 4,000 and a bandwidth of 1,000. Bands above 4,000 are in increasingly darker shades of blue, while bands below 4,000 are in increasingly darker shades of red.
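
In other words, a value y falls in the band with index i = \lfloor (y - B) / \Delta \rfloor. A small Python sketch of this indexing (the function name and defaults are my own, matching the baseline of 4,000 and bandwidth of 1,000 used below):

```python
import math

def band_index(y, baseline=4000, bandwidth=1000):
    """Index of the horizontal band containing y: band i spans
    [baseline + i*bandwidth, baseline + (i+1)*bandwidth)."""
    return math.floor((y - baseline) / bandwidth)

# Values above the baseline get indices 0, 1, 2, ...;
# values below it get -1, -2, -3.
print(band_index(4500))  # 0
print(band_index(6200))  # 2
print(band_index(3800))  # -1
```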

xyplot(dax_ts,
       panel = function(x, y, ...) {
         col <-
           c("#B41414","#E03231","#F7A99C","#9FC8DC","#468CC8","#0165B3")
         for (i in c(-3:-1, 2:0)) {
           if (i >= 0)
             yi <- pmax(4000, pmin(y, 4000 + 1000 * (i+1)))
           if (i < 0)
             yi <- pmin(4000, pmax(y, 4000 + 1000 * i))
           panel.xyarea(x, yi, origin = 4000,
                        col = col[i+4], border = NA)
         }
         panel.lines(x, y, col = "black")
         panel.abline(h = 4000, lty = 2)
       },
       ylab = "DAX",
       main = "Plot of DAX (1991-1998)")

Second, flip all the bands below the baseline along the baseline, then stack all the bands on top of each other, with the darker colors right on top… and that’s a horizon chart!
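
The slice-flip-stack operation can be written out directly: for each band, clip the series into that band, shift it down to the origin, and (for bands below the baseline) reflect it upward. A minimal Python sketch of the per-band height (the function and its details are my own, not from latticeExtra):

```python
def horizon_height(y, band, baseline=4000, bandwidth=1000):
    """Height (between 0 and bandwidth) contributed by `band` at value y.
    Non-negative bands (0, 1, 2) cover values above the baseline; negative
    bands (-1, -2, -3) cover values below it, flipped upward."""
    if band >= 0:
        # clip y into [baseline + band*bw, baseline + (band+1)*bw], shift to 0
        return min(max(y - baseline - band * bandwidth, 0), bandwidth)
    else:
        # mirror values below the baseline across it
        return min(max(baseline + (band + 1) * bandwidth - y, 0), bandwidth)

# y = 6200 fills bands 0 and 1 completely, and reaches 200 into band 2:
print([horizon_height(6200, b) for b in range(3)])  # [1000, 1000, 200]
```

Overlaying these heights, with darker colors drawn on top, reproduces the stacked bands in the horizon chart.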

horizonplot(dax_ts, colorkey = TRUE,
            origin = 4000, horizonscale = 1000,
            col.regions = c("#B41414","#E03231","#F7A99C","#9FC8DC",
                            "#468CC8","#0165B3"),
            main = "Plot of DAX: 1991-1998")

The main advantage of the horizon chart is its (vertical) compactness. The horizon chart above has dimensions 600 x 200, while the earlier time series plots had dimensions 600 x 400. For comparison, here is the line plot with dimensions 600 x 200:

The real power of horizon charts comes when we want to compare multiple time series against each other. Here is the time series plot for all 4 EU stock indices (dimensions 600 x 400):

plot(EuStockMarkets, main = "Plot of EU Stock Indices (1991-1998)")

Here’s a horizon chart for the same data with the same dimensions (600 x 400). Notice how it’s easier to see the tiny fluctuations in each time series and to compare across time series.

## these layers show scale and origin in each panel
infolayers <-
  layer(panel.scaleArrow(x = 0.99, digits = 1, col = "grey",
                         srt = 90, cex = 0.7)) +
  layer(lim <- current.panel.limits(),
        panel.text(lim$x[1], lim$y[1], round(lim$y[1],1), font = 2,
                   cex = 0.7, adj = c(-0.5,-0.5), col = "#9FC8DC"))

## horizon chart for EU stock indices
horizonplot(EuStockMarkets, colorkey = TRUE,
            col.regions = c("#B41414","#E03231","#F7A99C","#9FC8DC",
                            "#468CC8","#0165B3"),
            origin = 4000, horizonscale = 1000,
            main = "Plot of EU Stock Indices (1991-1998)") +
  infolayers

In the horizon chart above, we set the same baseline (4,000) and same bandwidth (1,000) across the 4 stock indices. This doesn’t make a whole lot of sense because the stock indices are on different scales. The code below generates a horizon chart with each index having its own baseline and bandwidth:

horizonplot(EuStockMarkets, colorkey = TRUE,
            col.regions = c("#B41414","#E03231","#F7A99C","#9FC8DC",
                            "#468CC8","#0165B3"),
            main = "Plot of EU Stock Indices (1991-1998)") +
  infolayers

Let’s finish off this post with an example of a horizon chart from Reference 1. We want to compare public health spending (as a percentage of GDP) over time for 10 countries. Here’s what a time series plot (line chart) looks like:

And here is the horizon chart for the same data:

I think it’s much easier to make comparisons and see trends in the horizon chart as compared to the line chart.

References:

  1. Schwabish, J. (2021). Better Data Visualizations.

Verifying a stat from The Athletic NBA Show

A few weeks ago, I was listening to The Athletic NBA Show podcast (Episode 581: “5 Players I was wrong about, 20 Games in Contenders, and Sam Vecenie on the 2021 Rookie Class”) and the following statistic caught my attention:

Question: Since the 2000-2001 NBA finals, there have been 42 teams in the NBA finals. (This episode aired in Dec 2021, so the most recent finals was the 2020-2021 edition.) At the 20-game mark in the regular season, how many of these 42 teams were below 0.500 (that is, less than 50% winning percentage)?

Answer: Zero (!) (One of the guests on the show guessed 6.)

(If you want to hear the full segment on the stat, it’s the first topic of the podcast, right after the recap of the week.) Since a full regular season in the NBA is 82 games per team, 20 games represents just under a quarter of the season. It seemed a little surprising that NBA finalists were so consistently >= 0.500 so early in the season, so I wanted to verify the statistic for myself.

What I found was even more surprising: none of the teams in the NBA finals since the 1981-1982 edition were below 0.500 at the 20-game mark! That’s 40 years of NBA finals, almost double the number considered in the podcast. The 1980-1981 Houston Rockets were the most recent NBA finalists to be under 0.500 at the 20-game mark: they were 9-11.

The rest of this post goes through the data and R code to verify this statistic. The full R script is available here.

To verify this statistic, I needed data from two different sources. The first data source is Tristan Malherbe’s NBA Finals and MVPs dataset hosted on data.world. This tells us which teams were in the NBA finals each year. The dataset only goes up to the 2017-2018 season, so I manually added NBA finals data up to the 2020-2021 season (inclusive). The second data source is Wyatt Walsh’s Basketball Dataset hosted on Kaggle. This is an extremely comprehensive dataset on the NBA stored as an SQLite database; we only use the Game table, which contains statistics on every regular season game.

(Note: If you play around with the data, you might find some oddities/errors. For example, one or two teams have more than 82 regular season games recorded, when 82 should be the maximum. There is also some missing data. Since this is a quick analysis for a blog post, I didn’t try to ensure 100% correctness; that is something one should strive for in higher-stakes analyses.)

First, let’s load the regular season data. I’m being a bit lazy by importing the whole Game table, but since the data isn’t very big we don’t suffer a huge efficiency loss.

library(DBI)
library(tidyverse)

theme_set(theme_bw())

firstYear <- 1980  # represents 1980-1981 season

# these filepaths are specific to my local drive
mainFile <- "../data/nba-kaggle-wyattowalsh/basketball.sqlite"
finalsFile <- "../data/datatouille-nba-finals-and-mvps/data/data-updated.csv"

# get all regular season games (only relevant columns 
# selected)
mydb <- dbConnect(RSQLite::SQLite(), mainFile)
df <- dbGetQuery(mydb, "SELECT * FROM Game")
dbDisconnect(mydb)
regular_df <- df %>%
  mutate(GAME_DATE = as.Date(GAME_DATE),
         SEASON = as.numeric(SEASON)) %>%
  filter(SEASON >= firstYear) %>%
  select(SEASON, GAME_DATE, TEAM_NAME_HOME, TEAM_NAME_AWAY, WL_HOME, WL_AWAY) %>%
  arrange(SEASON, GAME_DATE)

head(regular_df)
#   SEASON  GAME_DATE      TEAM_NAME_HOME         TEAM_NAME_AWAY WL_HOME WL_AWAY
# 1   1980 1980-10-10      Boston Celtics    Cleveland Cavaliers       W       L
# 2   1980 1980-10-10     Detroit Pistons     Washington Bullets       L       W
# 3   1980 1980-10-10 Seattle SuperSonics     Los Angeles Lakers       L       W
# 4   1980 1980-10-10        Phoenix Suns  Golden State Warriors       W       L
# 5   1980 1980-10-10  Philadelphia 76ers        Milwaukee Bucks       L       W
# 6   1980 1980-10-10           Utah Jazz Portland Trail Blazers       W       L

Next, let’s load the NBA finals data. Note that this data uses a different convention for labeling the season (e.g. 2021 refers to the 2020-2021 season), so we have to subtract 1 in order for the season column to match across the two data sources. For the rest of this analysis, season x refers to the season starting in the year x.

# get NBA finals info
# season = year - 1 to match regular_df
finals_df <- read_csv(finalsFile) %>%
  select(season = year, nba_champion, nba_vice_champion) %>%
  mutate(season = season - 1) %>%
  filter(season >= firstYear)
names(finals_df) <- toupper(names(finals_df))

head(finals_df)
# # A tibble: 6 × 3
#   SEASON NBA_CHAMPION       NBA_VICE_CHAMPION 
#    <dbl> <chr>              <chr>             
# 1   1980 Boston Celtics     Houston Rockets   
# 2   1981 Los Angeles Lakers Philadelphia 76ers
# 3   1982 Philadelphia 76ers Los Angeles Lakers
# 4   1983 Boston Celtics     Los Angeles Lakers
# 5   1984 Los Angeles Lakers Boston Celtics    
# 6   1985 Boston Celtics     Houston Rockets 

I’ve never heard the term “vice champion” before! Since the dataset uses that to refer to the team that was in the finals but didn’t win, I’ll use that term too. Join the two data frames and keep just the games that an NBA finalist played in:

joined_df <- regular_df %>%
  inner_join(finals_df, by = "SEASON") %>%
  filter(TEAM_NAME_HOME == NBA_CHAMPION |
           TEAM_NAME_HOME == NBA_VICE_CHAMPION |
           TEAM_NAME_AWAY == NBA_CHAMPION |
           TEAM_NAME_AWAY == NBA_VICE_CHAMPION) %>%
  mutate(FOR_CHAMPION = TEAM_NAME_HOME == NBA_CHAMPION |
                                 TEAM_NAME_AWAY == NBA_CHAMPION,
         FOR_VICE_CHAMPION = TEAM_NAME_HOME == NBA_VICE_CHAMPION |
           TEAM_NAME_AWAY == NBA_VICE_CHAMPION)

Next, let’s compute the win percentages across games for the NBA finalists. I did this separately for the champions and vice champions, then combined the two datasets, since that made the code easier to write.

# Compute win percentages across games for champions
champion_df <- joined_df %>% filter(FOR_CHAMPION) %>%
  select(SEASON, GAME_DATE, TEAM_NAME_HOME, WL_HOME, NBA_CHAMPION) %>%
  group_by(SEASON) %>%
  mutate(GAME = 1,
         WIN = ifelse((TEAM_NAME_HOME == NBA_CHAMPION & WL_HOME == "W") |
                        (TEAM_NAME_HOME != NBA_CHAMPION & WL_HOME == "L"),
                      1, 0)) %>%
  transmute(TEAM_TYPE = "Champion",
            SEASON = factor(SEASON),
            GAMES_PLAYED = cumsum(GAME),
            WINS = cumsum(WIN),
            WIN_PCT = WINS / GAMES_PLAYED)

# Compute win percentages across games for vice champions
vice_champion_df <- joined_df %>% filter(FOR_VICE_CHAMPION) %>%
  select(SEASON, GAME_DATE, TEAM_NAME_HOME, WL_HOME, NBA_VICE_CHAMPION) %>%
  group_by(SEASON) %>%
  mutate(GAME = 1,
         WIN = ifelse((TEAM_NAME_HOME == NBA_VICE_CHAMPION & WL_HOME == "W") |
                        (TEAM_NAME_HOME != NBA_VICE_CHAMPION & WL_HOME == "L"),
                      1, 0)) %>%
  transmute(TEAM_TYPE = "Vice Champion", 
            SEASON = factor(SEASON),
            GAMES_PLAYED = cumsum(GAME),
            WINS = cumsum(WIN),
            WIN_PCT = WINS / GAMES_PLAYED)

# put champion & vice champion data together
# final_win_pct_df is used for the year labels in the plots
win_pct_df <- rbind(champion_df, vice_champion_df)
final_win_pct_df <- win_pct_df %>% group_by(TEAM_TYPE, SEASON) %>%
  slice_tail(n = 1)

head(win_pct_df)
# # A tibble: 6 × 5
# # Groups:   SEASON [1]
#   TEAM_TYPE SEASON GAMES_PLAYED  WINS WIN_PCT
#   <chr>     <fct>         <dbl> <dbl>   <dbl>
# 1 Champion  1980              1     1   1    
# 2 Champion  1980              2     1   0.5  
# 3 Champion  1980              3     2   0.667
# 4 Champion  1980              4     2   0.5  
# 5 Champion  1980              5     3   0.6  
# 6 Champion  1980              6     3   0.5 
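
The core of the pipeline above is just a cumulative sum of wins divided by a cumulative count of games (cumsum(WIN) / cumsum(GAME)). For reference, a minimal Python sketch of the same running win percentage, independent of the R code:

```python
def running_win_pct(results):
    """Running win percentage from a sequence of 'W'/'L' results."""
    wins, pcts = 0, []
    for games_played, r in enumerate(results, start=1):
        wins += (r == 'W')
        pcts.append(wins / games_played)
    return pcts

# Matches the first six rows of win_pct_df above (W, L, W, L, W, L):
print(running_win_pct("WLWLWL"))  # [1.0, 0.5, 0.667, 0.5, 0.6, 0.5]
```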

We are now ready to make the plot! First, let’s make the plot of win percentage vs. games played for the NBA champions:

ggplot(champion_df, aes(x = GAMES_PLAYED, y = WIN_PCT, col = SEASON)) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  geom_vline(xintercept = 20, linetype = "dashed") +
  geom_text(data = filter(final_win_pct_df, TEAM_TYPE == "Champion"), 
            aes(x = max(GAMES_PLAYED), y = WIN_PCT, label = SEASON),
            hjust = 0, size = 3) +
  coord_cartesian(ylim = c(0, 1)) +
  theme(legend.position = "none") +
  labs(title = "Win percentage by games played (champions)",
       x = "Games played", y = "Win percentage")

Indeed, at 20 games all NBA champions were 0.500 or better. Only one team was exactly at 0.500: the 2005-2006 Miami Heat.

Let’s make the same plot for the vice champions:

ggplot(vice_champion_df, aes(x = GAMES_PLAYED, y = WIN_PCT, col = SEASON)) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  geom_vline(xintercept = 20, linetype = "dashed") +
  geom_text(data = filter(final_win_pct_df, TEAM_TYPE == "Vice Champion"), 
            aes(x = max(GAMES_PLAYED), y = WIN_PCT, label = SEASON),
            hjust = 0, size = 3) +
  coord_cartesian(ylim = c(0, 1)) +
  theme(legend.position = "none") +
  labs(title = "Win percentage by games played (vice champions)",
       x = "Games played", y = "Win percentage")

Again, at 20 games all NBA vice champions after the 1980-1981 season were 0.500 or better. The only team at exactly 0.500 was the 2004-2005 Detroit Pistons. As we can see from the plot, the most recent NBA finalist below 0.500 at the 20-game mark was the 1980-1981 Houston Rockets, who also happen to be the only NBA finalist in this time window to finish the entire regular season under 0.500.

Using different fonts with ggplot2

I was recently asked to convert all the fonts in my ggplot2-generated figures for a paper to Times New Roman. It turns out that this is easy, but it brought up a whole host of questions that I don’t have the full answer to.

If you want to go all out with using custom fonts, I suggest looking into the extrafont and showtext packages. This post will focus on what you can do without importing additional packages.

Let’s make a basic plot and see its default look (I am generating this on a Mac with the Quartz device):

library(ggplot2)
base_fig <- ggplot(data = economics, aes(date, pop)) +
  geom_line() +
  labs(title = "Total US population over time",
       subtitle = "Population in thousands",
       x = "Date",
       y = "Total population (in thousands)")

base_fig

To change all text in the figure to Times New Roman, we just need to update the text option of the theme as follows:

base_fig +
  theme(text = element_text(family = "Times New Roman"))

ggplot allows you to change the font of each part of the figure: you just need to know the correct option to modify in the theme. (For a full list of customizable components of the theme, see this documentation.)

base_fig +
  theme(plot.title    = element_text(family = "mono"),
        plot.subtitle = element_text(family = "sans"),
        axis.title.x  = element_text(family = "Comic Sans MS"),
        axis.title.y  = element_text(family = "AppleGothic"),
        axis.text.x   = element_text(family = "Optima"),
        axis.text.y   = element_text(family = "Luminari"))

For standard fonts (i.e. fonts widely used across academia and industry), the code above will suffice.

What fonts can I pass to element_text()? I couldn’t find an easy answer to this question. It depends on the OS you are using and the graphics device used to render the figure.

On a Mac, you can go to the “Font Book” application and have a look at the list of fonts there. Most fonts there should work. One exception I noticed is fonts based on a different alphabet. (For example, using family = "Klee" in the code above did not work for me.)

What’s a graphics device? It’s the engine that renders your plot. Common graphics devices include Quartz and X11. This is typically something the user does not need to care about. See this link and ?Devices for more details.

What are “mono”, “sans” and “serif”? These are categories or “types” of fonts (see details here). When the user specifies one of these 3 keywords instead of a full font name (e.g. “Times New Roman”), the graphics engine uses its default font associated with that keyword.

For the Quartz device, you can use quartzFonts() to see what the default font for each of these keywords is:

quartzFonts()
# $serif
# [1] "Times-Roman"      "Times-Bold"       "Times-Italic"     "Times-BoldItalic"
# 
# $sans
# [1] "Helvetica"             "Helvetica-Bold"        "Helvetica-Oblique"     "Helvetica-BoldOblique"
# 
# $mono
# [1] "Courier"             "Courier-Bold"        "Courier-Oblique"     "Courier-BoldOblique"

NBA playoffs: Visualizing win percentage by seeding

Background

With the NBA playoffs going on, I’ve been thinking about the following question:

A and B are about to play a game. We know that among all players, A has rank/seed a and B has rank/seed b. (A higher ranking/seeding corresponds to a smaller value of a, with a = 1 being the best. Players with higher rank/seed are better players.) Knowing only these two ranks, what is a reasonable model for the probability that the higher seed wins the game?

On one hand, if the probability is always 0.5, it means that ranking is uninformative: the result of the game is like an unbiased coin flip. On the other hand, if the probability is 1, it means that ranking is perfectly informative: a higher-ranked player will always beat a lower-ranked player. Reality is probably somewhere in between.
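
To make the question concrete, here is one toy parameterization (my own illustration, not a claim about any real sport): a Bradley-Terry-style model in which seed s has strength 1/s, so the probability that seed a beats seed b is (1/a) / (1/a + 1/b) = b / (a + b):

```python
def p_higher_seed_wins(a, b):
    """Toy Bradley-Terry model: seed s has strength 1/s, so
    P(seed a beats seed b) = b / (a + b). Purely illustrative."""
    return b / (a + b)

# Seed 1 vs seed 8 is heavily favored; seed 4 vs seed 5 is near a coin flip.
print(p_higher_seed_wins(1, 8))  # ~0.889
print(p_higher_seed_wins(4, 5))  # ~0.556
```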

Of course, there is no single correct answer for this. The answer could also vary greatly across sports and games, and within a single sport it could also vary from league to league. The idea is just to come up with something that can approximate reality somewhat closely.

Unfortunately I will not have an answer for this question in this post. I will, however, show you some data that I scraped for the NBA playoffs and some interesting data visualizations.

NBA playoffs: An abbreviated introduction

There are two conferences in the NBA. At the end of the regular season, the top 8 teams in each conference make it to the playoff round where they play best-of-7 series to determine who advances. The teams are ranked within each conference according to their win-loss records in the regular season. The image below is the playoff bracket for the 2019-2020 season, giving you an idea of how a team’s seeding affects which opponents it will face in different playoff rounds.

Data scraping

My initial idea was the following: scrape NBA playoff data, keeping track of how many times seed a played against seed b, as well as how many times seed a won. This could be the raw data we use to estimate/model the win probabilities.

Since the seedings only make sense within each conference, for each playoff year, I excluded the NBA final (where the winner of each conference faces the other). I was fortunate to find that Land of BasketBall.com presents the information I need in a systematic manner, making it easy to scrape. The R scraping script and RDS data files I saved are available here.

Determining how many years of playoff data should go into my dataset was a surprisingly difficult question. See this Wikipedia section for a full history of the NBA playoff format. I was surprised to find out that ranking by win-loss record within each conference was only instituted from the 2016 playoffs onwards. Before that, more complicated seeding rules meant that a team could be seeded above another team with a better win-loss record. Thus, if I wanted ranking to be as pure as possible, I could only use data from 2016-2020 (the latest playoffs with data at the time of writing). As you will see in the rest of this post, this is not enough data for good estimation. On the other hand, going back further means that the idea of ranking could be diluted.

The rest of the post is based on just 2016-2020 data. You can download data for 1984-2020 (1984 was when the playoffs first went from 12 teams to 16 teams) here, rerun the analysis and compare. (You could even try to go back further to 1947, when the playoffs first started!)

Data visualization

Let’s have a look at the raw data:

library(tidyverse)
readRDS("nba_playoffs_data_2016-2020.rds") %>% head()
#   year    team1 wins1        team2 wins2 seed1 seed2
# 1 2016 Warriors     4      Rockets     1     1     8
# 2 2016 Clippers     2 TrailBlazers     4     4     5
# 3 2016  Thunder     4    Mavericks     1     3     6
# 4 2016    Spurs     4    Grizzlies     0     2     7
# 5 2016 Warriors     4 TrailBlazers     1     1     5
# 6 2016  Thunder     4        Spurs     2     3     2

Each row corresponds to one best-of-7 series. Let’s add some more columns that will be useful:

results_df <- readRDS("nba_playoffs_data_2016-2020.rds") %>%
  mutate(higher_seed = pmin(seed1, seed2),
         lower_seed = pmax(seed1, seed2),
         n_games = wins1 + wins2) %>%
  mutate(seed_diff = lower_seed - higher_seed,
         higher_seed_wins = ifelse(higher_seed == seed1, wins1, wins2),
         lower_seed_wins = ifelse(higher_seed == seed1, wins2, wins1)) %>%
  mutate(series_winner = ifelse(wins1 > wins2, higher_seed, lower_seed))

In this first plot, we plot the number of series played between each pair of seeds (we always order the seeds so that the higher seed is first/on the x-axis).

theme_set(theme_bw())
results_df %>%
  group_by(higher_seed, lower_seed) %>%
  summarize(n_series = n()) %>%
  ggplot(aes(x = higher_seed, y = lower_seed)) +
  geom_tile(aes(fill = n_series)) +
  geom_text(aes(label = n_series), col = "white") +
  scale_x_continuous(breaks = 1:8) +
  scale_y_continuous(breaks = 1:8) +
  coord_fixed(ratio = 1) +
  scale_fill_gradient(low = "red", high = "blue", name = "# series") +
  labs(x = "Higher seed", y = "Lower seed",
       title = "# series played between seeds") +
  theme(panel.grid.major = element_blank())

Note that the x-axis starts with the first seed while the y-axis starts with the second seed, since the first seed can never be the lower seed in any matchup. We also won’t have any data in the bottom-right half of the plot since we always put the higher seed on the x-axis. Notice that we always have 10 series between seeds 1-8, 2-7, 3-6 and 4-5: these are the first-round matchups that always happen. Notice also that some matchups have never happened (e.g. 3-4, and anything involving two of seeds 5, 6, 7 and 8).
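
The guaranteed counts follow directly from the bracket structure: the first-round pairs in an 8-team conference are exactly the pairs of seeds summing to 9, and each occurs once per conference per year, so 5 years times 2 conferences gives 10 series. A quick check:

```python
# First-round pairs in an 8-team conference bracket: seeds summing to 9.
first_round = [(h, l) for h in range(1, 9) for l in range(1, 9)
               if h < l and h + l == 9]
print(first_round)  # [(1, 8), (2, 7), (3, 6), (4, 5)]

n_years, n_conferences = 5, 2  # 2016-2020, East and West
print(n_years * n_conferences)  # 10 series per first-round pair
```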

Next, we plot the number of games played between each pair of seeds:

results_df %>%
  group_by(higher_seed, lower_seed) %>%
  summarize(n_games = sum(n_games)) %>%
  ggplot(aes(x = higher_seed, y = lower_seed)) +
  geom_tile(aes(fill = n_games)) +
  geom_text(aes(label = n_games), col = "white") +
  scale_x_continuous(breaks = 1:8) +
  scale_y_continuous(breaks = 1:8) +
  coord_fixed(ratio = 1) +
  scale_fill_gradient(low = "red", high = "blue", name = "# games") +
  labs(x = "Higher seed", y = "Lower seed",
       title = "# games played between seeds") +
  theme(panel.grid.major = element_blank())

From these numbers, it is clear that we won’t be able to get good estimates for the win probability between pairs of seeds due to small sample size, except possibly for the 1-8, 2-7, 3-6, 4-5 and 2-3 matchups.
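
To put a rough number on “small sample size” (my own back-of-envelope, not from the post): treating each game as an independent coin flip, the standard error of an estimated win probability \hat{p} over n games is \sqrt{\hat{p}(1-\hat{p})/n}, which stays large for the game counts seen here:

```python
import math

def win_pct_se(p_hat, n):
    """Standard error of a binomial proportion estimate over n games."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

# Even ~50 games leaves a sizable standard error; ~10 games is huge.
print(round(win_pct_se(0.7, 50), 3))  # 0.065
print(round(win_pct_se(0.7, 10), 3))  # 0.145
```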

Next, let’s look at the win percentage for the higher seed in each pair:

win_pct_df <- results_df %>%
  group_by(higher_seed, lower_seed) %>%
  summarize(higher_seed_wins = sum(higher_seed_wins),
            lower_seed_wins = sum(lower_seed_wins)) %>%
  mutate(higher_seed_win_pct = higher_seed_wins / 
           (higher_seed_wins + lower_seed_wins)) %>%
  select(higher_seed, lower_seed, higher_seed_win_pct)

head(win_pct_df)
# # A tibble: 6 x 3
# # Groups:   higher_seed [2]
#   higher_seed lower_seed higher_seed_win_pct
#         <dbl>      <dbl>               <dbl>
# 1           1          2               0.5  
# 2           1          3               0.75 
# 3           1          4               0.645
# 4           1          5               0.684
# 5           1          8               0.8  
# 6           2          3               0.554

ggplot(win_pct_df, aes(x = higher_seed, y = lower_seed)) +
  geom_tile(aes(fill = higher_seed_win_pct)) +
  geom_text(aes(label = round(higher_seed_win_pct, 2))) +
  scale_x_continuous(breaks = 1:8) +
  scale_y_continuous(breaks = 1:8) +
  coord_fixed(ratio = 1) +
  scale_fill_gradient2(low = "red", high = "blue", midpoint = 0.5, 
                       name = "Higher seed win %") +
  labs(x = "Higher seed", y = "Lower seed",
       title = "Higher seed win percentage") +
  theme(panel.grid.major = element_blank())

We shouldn’t give these numbers too much credence because of the small sample size, but there are a couple of things of note:

  1. We would expect the higher seed to have more than a 50% chance to beat a lower seed. That’s not what we see here (2-4, 3-5).
  2. Fixing the higher seed, we would expect that the lower seeded the opponent, the higher the win percentage. That’s not what we see when we look down each column. For example, seed 1 beat seed 3 75% of the time, but only beat seed 4 65% of the time.
  3. A very simple win probability model would be to say that the win probability only depends on the difference between the ranks. If that is the case, then win probabilities along each northeast-southwest diagonal should be about the same. That is clearly not the case (especially 1-3, 2-4 and 3-5).
  4. The win probabilities for 1-8, 2-7, 3-6 and 4-5 make sense, and that’s probably because of the larger sample size.

I would have expected a win probability diagram that looked more like this:

expand.grid(higher_seed = 1:8, lower_seed = 1:8) %>%
  filter(higher_seed < lower_seed) %>%
  mutate(seed_diff = lower_seed - higher_seed,
         higher_seed_win_pct = 0.5 + seed_diff * 0.4 / 7) %>%
  ggplot(aes(x = higher_seed, y = lower_seed)) +
  geom_tile(aes(fill = higher_seed_win_pct)) +
  geom_text(aes(label = round(higher_seed_win_pct, 2))) +
  scale_x_continuous(breaks = 1:8) +
  scale_y_continuous(breaks = 1:8) +
  coord_fixed(ratio = 1) +
  scale_fill_gradient2(low = "red", high = "blue", midpoint = 0.5, 
                       name = "Higher seed win %") +
  labs(x = "Higher seed", y = "Lower seed",
       title = "Expected higher seed win percentage") +
  theme(panel.grid.major = element_blank())

Having seen the data, what might you do next in trying to answer my original question?

Is the EPL getting more unequal?

I recently heard that Manchester City were so far ahead in the English Premier League (EPL) that the race for first was basically over, even though there were still about 6-7 games to go (out of a total of 38). At the other end of the table, I heard that Sheffield United were so far behind that they were all but certain to be relegated. This got me wondering: has the EPL become more unequal over time? I guess there are several ways to interpret what “unequal” means, and this post is my attempt at quantifying it.

The data for this post is available here, while all the code is available in one place here. (I also have a previous blog post from 2018 on the histogram of points scored in the EPL over time.)

Let’s start by looking at a density plot of the points scored by the 20 teams and how that has changed over the last 12 years (note that the latest year is at the top of the chart):

library(tidyverse)
df <- read_csv("../data/epl_standings.csv")

theme_set(theme_bw())

# joy plot
library(ggridges)
ggplot(df, aes(x = Points, y = factor(Season))) +
    geom_density_ridges(scale = 3) +
    labs(y = "Season")

It looks like the spread for the more recent seasons is greater than that in earlier seasons, evidenced by the bulk of the distribution widening as we go from the bottom to the top of the chart.

Another way to view the data is to draw a boxplot for each season. In this view, it’s harder to see the spread that we saw in the joy plot.

# boxplot
ggplot(df, aes(x = Season, y = Points, group = Season)) +
    geom_boxplot() +
    labs(title = "Distribution of points by season")

One possible way to define “unequal” is to look at the difference between the number of points the top team earned vs. that for the bottom team: the larger the difference, the more unequal the EPL is becoming. The plot below shows the maximum and minimum number of points earned by a team across the years, as well as the difference between the two. We also show the linear regression lines for each of these values. With the upward slopes, it looks like the gap between the best and the worst is certainly increasing.

# plot of maximum and minimum
df %>% group_by(Season) %>%
    summarize(max = max(Points), min = min(Points)) %>%
    mutate(range = max - min) %>%
    pivot_longer(max:range) %>%
    ggplot(aes(x = Season, y = value, col = name)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_continuous(breaks = seq(min(df$Season), max(df$Season), by = 2)) +
    labs(title = "Max/min points by season", y = "Points") +
    theme(legend.title = element_blank(), legend.position = "bottom")

The problem with using range is that it only tracks the difference between the best and the worst teams while ignoring all the teams in the middle. A measure that takes the middle into account is variance. Note that the number of points that a team can score lies in the range 0 to 38 \times 3 = 114. For random variables bounded in the interval [0, 114], the smallest possible variance is 0 (all teams scoring the same number of points), while the maximum possible variance is (114 - 0)^2 / 4 = 3249 (half the teams scoring 114 points, half the teams scoring 0 points). Based on the configurations attaining the extremes, it seems reasonable to interpret the scores having a higher variance as the league being more unequal.
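
We can sanity-check that endpoint figure: with half the teams at 114 points and half at 0, every score deviates from the mean (57) by exactly 57, giving a population variance of 57^2 = 3249. A quick check in Python (note that R’s var() uses the n-1 denominator, so it would report a slightly larger value for the same 20 scores):

```python
import statistics

# Extreme configuration: 10 teams at the maximum 114 points, 10 at 0.
scores = [114] * 10 + [0] * 10
print(statistics.pvariance(scores))  # 3249 = (114 - 0)**2 / 4
```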

Here is a plot of point variance over time along with the linear regression line:

df %>% group_by(Season) %>%
    summarize(var = var(Points)) %>%
    ggplot(aes(x = Season, y = var)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    scale_x_continuous(breaks = seq(min(df$Season), max(df$Season), by = 2)) +
    labs(title = "Variance of point distribution by season", y = "Variance")

There’s quite a lot of fluctuation from year to year, but there seems to be an increasing trend. However, notice that the y-axis only goes from 250 to 450, while our min and max variances were 0 and 3249. Perhaps it’s better to have the y-axis go from 0 to 3249 to have a better sense of scale?

Before doing that, note that it’s actually not possible for half the teams to score the maximum number of points. In the EPL, every team plays every other team exactly twice, so at most one team can earn maximum points: doing so means everyone else loses to that team. I don’t have a proof, but I believe the maximum variance is attained when team 1 beats everyone, team 2 beats everyone except team 1, team 3 beats everyone except teams 1 and 2, and so on. With this configuration, the variance attained is just 1260.

Here’s the same line plot but with the y-axis going from 0 to 1260. With this scale, the change in variance over time looks pretty flat.

# point totals for the "team i beats every team below it" configuration
max_var_dist <- seq(0, 38 * 3, by = 6)
max_var <- var(max_var_dist)
df %>% group_by(Season) %>%
    summarize(var = var(Points)) %>%
    ggplot(aes(x = Season, y = var)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    geom_hline(yintercept = c(0, max_var), col = "red", linetype = "dashed") +
    scale_x_continuous(breaks = seq(min(df$Season), max(df$Season), by = 2)) +
    labs(title = "Variance of point distribution by season", y = "Variance")

Which is the correct interpretation? Is the change in variance over time important enough that we can say the EPL has become more unequal, or is it essentially the same over time? This is where I think domain expertise comes into play. 1260 is a theoretical maximum for the variance, but my guess is that the layperson looking at two tables, one with variance 300 and one with variance 900, would be able to tell them apart and say that the latter is more unequal. Can the layperson really tell the difference between variances of 250 and 450? I would generate several tables having these variances and test if people could tell them apart.

Finally, the Gini coefficient is another measure of inequality, with 0 being the most equal and 1 the most unequal.
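As a quick sanity check on what the Gini coefficient measures, here is a direct implementation of its definition: half the mean absolute difference between all pairs of values, divided by the mean. (The plot below uses DescTools::Gini, which by default applies a small-sample bias correction, so its values can differ slightly from this plain version.)

```r
# Gini coefficient from its definition: half the mean absolute
# pairwise difference, divided by the mean
gini <- function(x) {
    n <- length(x)
    sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}

gini(rep(5, 20))                 # 0: everyone equal
gini(c(rep(0, 10), rep(1, 10)))  # 0.5: half have everything
```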

# plot of Gini
library(DescTools)
df %>% group_by(Season) %>%
    summarize(gini = DescTools::Gini(Points)) %>%
    ggplot(aes(x = Season, y = gini)) +
    geom_line() +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    geom_hline(yintercept = c(0, 1), col = "red", linetype = "dashed") +
    scale_x_continuous(breaks = seq(min(df$Season), max(df$Season), by = 2)) +
    labs(title = "Gini coefficient of point dist by season", 
         y = "Gini coefficient")

Here are the plots with different scales for the y-axis:

As with variance, the different scales give very different interpretations. It will require some research to figure out if a change of Gini coefficient from 0.17 to 0.22 is perceptible to the viewer.

What are spaghetti plots and lasagna plots?

I recently heard of lasagna plots in a talk and I hadn’t heard of them before, so I looked it up. It turns out that the lasagna plot and its close cousin, the spaghetti plot, are plots that you’ve probably come across before. They are just technical terms that are used in certain domains with certain connotations.

According to Reference 1,

[the spaghetti plot] plots a subject’s values for the repeated outcome measure (vertical axis) versus time (horizontal axis) and connects the dots chronologically, using lines of uniform color and intensity for each subject.

Sounds like a line plot to me! A line plot with one line per subject and with time as the x-axis. The figure below demonstrates what a spaghetti plot is and what could go wrong. The left panel shows the time series for a single individual (one strand of spaghetti). Unfortunately, the entire dataset consists of 118 subjects, so plotting all the strands of spaghetti results in a hot mess.

Swihart et al. (2010) (Reference 1) introduce the lasagna plot as a potential remedy for the spaghetti plot’s shortcomings. The lasagna plot is a heatmap with some special connotations. Instead of one line per subject in the spaghetti plot, each subject gets one horizontal slice. Instead of depicting values using the y-axis, they are presented through the color of each part of the slice. Here is an example of a lasagna plot with 4 subjects and 6 timepoints:
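As an illustration (this is not the figure from the paper), a minimal lasagna plot can be drawn in base R with image(), putting one subject per horizontal slice and encoding values as color:

```r
# toy data: 4 subjects (rows) observed at 6 timepoints (columns)
set.seed(1)
values <- matrix(rnorm(4 * 6), nrow = 4)

# image() maps rows of z to the x-axis, so transpose; reverse the subject
# rows so that subject 1 appears at the top, as in the paper's figures
image(x = 1:6, y = 1:4, z = t(values[rev(1:4), ]),
      col = hcl.colors(12, "YlOrRd", rev = TRUE),
      xlab = "Time", ylab = "Subject", yaxt = "n")
axis(2, at = 1:4, labels = 4:1)
```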

Here is the lasagna plot for the same data which the spaghetti plot made a mess of. Since there are 118 subjects, each horizontal slice is very thin.

Certainly more palatable than the spaghetti plots, with colors to make it look like a real lasagna too! The panel on the left shows the plot with subjects labeled according to their position in the dataset. The right panel shows the plot with subjects ordered according to disease status, and within each disease status, ordered by date of recording. Some trends are obvious now: the top half tend to have lower values, and there seems to be a systematic pattern of missing data for some subjects.

Reference 1 goes into more detail about the possible ways to sort the rows and columns of the lasagna plot, as well as pointers on how to choose a suitable color palette.

Note: All figures were taken from Reference 1.

References:

  1. Swihart, B. J., et al. (2010). Lasagna plots: A saucy alternative to spaghetti plots. Epidemiology, 21(5), 621-625.

What is a sunflower plot?

A sunflower plot is a type of scatterplot that tries to reduce overplotting. When multiple points have the same (x, y) values, a sunflower plot draws just one point there, but adds little edges (or “petals”) coming out of the point to indicate how many observations are really there.

It’s best to see this via an example. Here is a plot of carb vs. gear from the mtcars dataset:

plot(mtcars$gear, mtcars$carb,
     main = "Plot of carb vs. gear")

From the plot it looks like there are only 11 data points. However, if we check the Environment tab in RStudio we see that there are actually 32 observations in the dataset: it’s just that some of the observations share the same (gear, carb) values.

Let’s see how a sunflower plot deals with this overplotting. It turns out that base R comes with a sunflower plot function that does just this:

sunflowerplot(mtcars$gear, mtcars$carb,
              main = "Plot of carb vs. gear")

This tells us, for example, that there are 3 observations with (gear, carb) = (3, 1).

We can change the color of the “petals” by specifying seg.col:

sunflowerplot(mtcars$gear, mtcars$carb,
              seg.col = "blue",
              main = "Plot of carb vs. gear")

By default, the first petal always points up. To have the first petal point in random directions, specify rotate = TRUE:

set.seed(1)
sunflowerplot(mtcars$gear, mtcars$carb,
              rotate = TRUE,
              main = "Plot of carb vs. gear")

When given (x, y) values, sunflowerplot counts the number of times each (x, y) value appears to determine the number of petals it needs to draw. It is possible to override this behavior by passing a number argument, as the next code snippet shows:

set.seed(1)
sunflowerplot(1:10, 1:10, number = 1:10,
              main = "n observations at (n, n)")

Sunflower plots aren’t the only way to reduce overplotting. Another common technique is jittering, where random noise is added to each point. The code below shows how you can do this in base R. Of course, a drawback of this is that the points are not plotted at the exact location of the data.

set.seed(1)
plot(jitter(mtcars$gear), jitter(mtcars$carb),
     main = "Plot of carb vs. gear")

Sunflower plots will not solve all your overplotting issues. Here is an example (on the diamonds dataset) where it does a horrendous job:

library(ggplot2)
data(diamonds)
sunflowerplot(diamonds$carat, diamonds$price,
              main = "Plot of price vs. carat")

A better solution here would be to change the transparency of the points. The code snippet below shows how this can be done in base R:

plot(diamonds$carat, diamonds$price,
     col = rgb(0, 0, 0, alpha = 0.05),
     main = "Plot of price vs. carat")

A shiny app for exploratory data analysis

I recently learnt how to build basic R Shiny apps. To practice using Shiny, I created a simple app for performing exploratory data analysis. You can use the app here to play around with the diamonds dataset from the ggplot2 package. To use the app for your own dataset, download the raw R code here (just the app.R file) and assign your dataset to raw_df. In the rest of this post, I outline how to use this app.

(Credits: I worked off the source code for the “Diamonds Explorer” app. There are a few versions of this app out there and I can’t find the exact source I used, but it was very close to the source code of this version.)

As you can see from the screenshot below, the main panel (on the right) has 4 tabs. The last two tabs simply give the output of calling the summary and str functions on the entire dataset; they are not affected by the controls in the side panel. The “Data Snippet” panel prints up to 15 rows of the dataset for a peek into what the dataset looks like. (These 15 rows are the first 15 rows of the dataset used to create the plot on the “Plot” tab.)

The most interesting tab is probably the “Plot” tab. First let me describe how the app selects the dataset it makes the plot for. By default, it picks 1000 random observations, or all of the observations if the dataset has fewer than 1000 rows. The user can input the random seed for reproducibility. The user can also control the number of observations using the slider, and choose whether the observations are picked randomly or taken from the top of the dataset.
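That selection logic is simple enough to sketch in a few lines. (This is an illustrative, hypothetical helper, not the app’s actual code.)

```r
# pick at most n rows from df, either at random (seeded) or from the top
select_rows <- function(df, n = 1000, random = TRUE, seed = 1) {
    n <- min(n, nrow(df))
    if (random) {
        set.seed(seed)
        df[sample(nrow(df), n), , drop = FALSE]
    } else {
        head(df, n)
    }
}

nrow(select_rows(mtcars, n = 10))    # 10
nrow(select_rows(mtcars, n = 1000))  # 32: mtcars only has 32 rows
```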

The type of plot the app makes depends on the types of the variables given to it. In the screenshot above, one numeric variable and one non-numeric variable are given, so the app makes a boxplot. If two numeric variables are given, it makes a scatterplot:

For scatterplots, the user has the option to jitter the points and/or to add a smoothing line:

If two non-numeric variables are given, the app makes a heatmap depicting how often each combination is present in the data:

The plots above depict the joint distribution of two variables in the dataset. If the user wants a visualization for just one variable, the user can set the “Y” variable to “None”. If the “X” variable is numeric, the app plots a histogram:

If the “X” variable is non-numeric, the app plots a bar plot showing counts:
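The type-based dispatch described above can be sketched as follows (a hypothetical helper, not the app’s actual source):

```r
# decide which plot to draw from the classes of the X and (optional) Y columns
choose_plot_type <- function(x, y = NULL) {
    if (is.null(y)) {
        if (is.numeric(x)) "histogram" else "bar plot"
    } else if (is.numeric(x) && is.numeric(y)) {
        "scatterplot"
    } else if (is.numeric(x) || is.numeric(y)) {
        "boxplot"
    } else {
        "heatmap"
    }
}

choose_plot_type(iris$Sepal.Length, iris$Petal.Length)  # "scatterplot"
choose_plot_type(iris$Sepal.Length, iris$Species)       # "boxplot"
choose_plot_type(iris$Species)                          # "bar plot"
```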

Finally, let’s talk about color. For simplicity, the app only allows plots to be colored by non-numeric variables. Below is an example of a colored scatterplot:

As the screenshot below shows, color works for boxplots too. (Color does not work for heatmaps.)

Color can be used for the one-dimensional plots as well:

There are certainly several improvements that can be made to this app. For one, it would be nice if the user could upload their dataset through the app (instead of downloading my source code and assigning their dataset to the raw_df variable). There could also be more controls to let the user change certain aspects of the plot (e.g. transparency of points), but at some point the UI might become too overwhelming for the user.

Happy exploring!

Simulating paths from a random walk

If you’ve ever visited this blog at wordpress.com, you might have noticed a header image that looks like this:

Ever wonder how it was generated? The image depicts 100 simulations of an asymmetric random walk. In this post, I’ll go through the code used to generate this image. All the code can also be found here.

For t = 1, 2, \dots, consider a series of i.i.d. random variables X_1, X_2, \dots such that for any t, X_t = 1 with probability p, and X_t = -1 with probability 1-p. (p \in [0,1] is a parameter that we get to choose.) A random walk simply tracks the cumulative sum of these random variables, i.e.

S_0 = 0, \quad S_t = \sum_{s = 1}^t X_s.

In my image, I let the random walk run until it hits a fixed upper limit or a fixed lower limit. Here is an R function that generates one realization of this random walk:

# returns the random walk path values as a vector 
# (random walk always starts at 0)
# p: probability of increasing by 1
# stop if path value hits either `lower` or `upper`
run <- function(p, lower, upper) {
    values <- c(0)
    current <- 0
    while (current > lower & current < upper) {
        current <- current + ifelse(runif(1) < p, 1, -1)
        values <- c(values, current)
    }
    values
}

(There might be more efficient ways of doing this, but since computation is relatively fast this is good enough for our purposes.)
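For what it’s worth, here is one faster variant: simulate the steps in large batches with cumsum() and cut the path at the first time it leaves (lower, upper), instead of growing the vector one step at a time. (This is a sketch of an alternative, not the code used for the image.)

```r
run_fast <- function(p, lower, upper, batch = 10000) {
    values <- 0
    repeat {
        # simulate a whole batch of +/-1 steps at once
        steps <- ifelse(runif(batch) < p, 1, -1)
        path <- tail(values, 1) + cumsum(steps)
        hit <- which(path <= lower | path >= upper)
        if (length(hit) > 0) {
            # keep the path only up to the first time it hits a limit
            return(c(values, path[seq_len(hit[1])]))
        }
        values <- c(values, path)
    }
}
```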

The code below creates a list of 100 elements, each element being a simulated random walk. The probability of going up at any one step is 0.48, and we stop the random walk once we hit -50 or 50.

N <- 100  # no. of paths to simulate
p <- 0.48
lower <- -50
upper <- 50

# simulate paths
set.seed(1055)
# simplify = FALSE ensures vlist is a list even if all paths happen
# to have the same length
vlist <- replicate(N, run(p, lower, upper), simplify = FALSE)

We can plot these paths along with the x-axis and the upper & lower limits:

# get length of longest path
max_length <- max(sapply(vlist, length))

# make plot
par(mar = rep(0, 4))  # no margins
plot(c(1, max_length), c(lower, upper), type = "n")
for (i in 1:N) {
    lines(1:length(vlist[[i]]), vlist[[i]])
}
abline(h = 0, lty = "dashed")
abline(h = lower, lwd = 2)
abline(h = upper, lwd = 2)

Without color, it’s hard to make much sense of the image. To introduce color, let’s create a function that picks a color for each path based on (i) whether the path hit the upper or lower limit, and (ii) how long the path is compared to the longest path in the image.

colorPicker <- function(values, max_length,
                        ls_color = c(178, 34, 34), ll_color = c(255, 204, 0),
                        us_color = c(0, 0, 102), ul_color = c(102, 204, 225)) {
    l <- length(values)
    if (values[l] < 0) {
        rgb_values <- (ls_color + (ll_color - ls_color) * l / max_length) / 255
    } else {
        rgb_values <- (us_color + (ul_color - us_color) * l / max_length) / 255
    }
    rgb(rgb_values[1], rgb_values[2], rgb_values[3])
}

If a path hits the lower limit and is very short, its color will be close to ls_color / 255; if it hits the lower limit and has the longest length among all our simulated paths, it will have color ll_color / 255 exactly. A similar relationship holds for paths hitting the upper limit with us_color and ul_color. (You can look up these colors at a website such as this.)

We can now color our black-and-white image:

plot(c(1, max_length), c(lower, upper), type = "n")
for (i in 1:N) {
    lines(1:length(vlist[[i]]), vlist[[i]], 
          col = colorPicker(vlist[[i]], max_length), lwd = 0.5)
}
abline(h = 0, lty = "dashed")
abline(h = lower, lwd = 2)
abline(h = upper, lwd = 2)

From the image it is now obvious that most of the paths hit the lower limit first, and fairly quickly at that! Here is the same image, but with different choices of color scale:

plot(c(1, max_length), c(lower, upper), type = "n")
for (i in 1:N) {
    lines(1:length(vlist[[i]]), vlist[[i]], 
          col = colorPicker(vlist[[i]], max_length,
                            ls_color = c(230, 230, 230), 
                            ll_color = c(166, 166, 166),
                            us_color = c(255, 0, 0),
                            ul_color = c(0, 0, 255)),
          lwd = 0.5)
}
abline(h = 0, lty = "dashed")
abline(h = lower, lwd = 2)
abline(h = upper, lwd = 2)

The possibilities are endless!

If you haven’t seen random walks before, you might be surprised at how a slightly biased walk (p = 0.48 instead of p = 0.5) results in so many more paths hitting the lower limit before the upper limit, even though the two limits are the same distance from 0. This is an example of 100 simulations when p = 0.5:

This is an example of 100 simulations when p = 0.52:

Isn’t it interesting how slight deviations in the probability of going one step up (instead of down) completely change the dynamics? It turns out that there is a closed form for the probability of hitting one limit (as opposed to hitting the other). If the upper limit is x = b and the lower limit is x = -a, and q = 1-p, then

\begin{aligned} \mathbb{P} \{ \text{hit } x = b \text{ before } x = -a \} = \begin{cases} \frac{1 - (q/p)^a}{1 - (q/p)^{a+b}} &\text{if } p \neq 1/2, \\ \frac{a}{a+b} &\text{if } p = 1/2. \end{cases} \end{aligned}

For my blog header image, p = 0.48, a = b = 50, which means the probability of hitting the upper limit is approximately 0.0179: not high at all!
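We can compute this probability directly from the closed-form expression and check it against the number above:

```r
# probability of hitting x = b before x = -a, starting from 0,
# for a random walk with up-probability p
hit_upper_prob <- function(p, a, b) {
    if (p == 0.5) return(a / (a + b))
    r <- (1 - p) / p
    (1 - r^a) / (1 - r^(a + b))
}

hit_upper_prob(0.48, a = 50, b = 50)  # approximately 0.0179
hit_upper_prob(0.50, a = 50, b = 50)  # 0.5
```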