
Friday, May 1, 2020

Tornado plots with matplotlib

Lately, charts where the values of a time series are plotted against their point-by-point change have been getting some attention, thanks to this rather colorful and cluttered Tornado plot.

In this post we will see how to make one of those charts with our favorite plotting library, matplotlib, and we'll also try to understand how to read them.
Let's start by loading the records of the concentration of CO2 in the atmosphere and aggregating the values at the month level. After that, we can plot straight away the value of the concentration against the change compared to the previous month:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

data_url = 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt'
co2_data = pd.read_csv(data_url, sep='\s+', comment='#', na_values=-999.99,
                       names=['year', 'month', 'day', 'decimal', 'ppm', 
                       'days', '1_yr_ago',  '10_yr_ago', 'since_1800'])
co2_data = co2_data.groupby([co2_data.year, co2_data.month]).mean()
idx = co2_data.index.to_flat_index()
co2_data['timestamp'] = [pd.Timestamp(year=y, month=m, day=1) for y, m in idx]
co2_data.set_index('timestamp', inplace=True)
co2_data = co2_data['2018-04-01':]


plt.plot(co2_data.ppm.diff(), co2_data.ppm, '-o')


The result is nicely loopy. This is because we plotted data for 2 years and the time series presents a yearly seasonality. Still, apart from the loop, it's quite hard to understand anything without adding a few details. Let's print the names of the months and highlight the beginning and the end of the time series:
from calendar import month_abbr

plt.figure(figsize=(8, 12))
diffs = co2_data.ppm.diff()
plt.plot(diffs, co2_data.ppm, '-o', alpha=.6, zorder=0)
plt.scatter([diffs.iloc[1]], [co2_data.ppm.iloc[1]],
            c='C0', s=140)
plt.scatter([diffs.iloc[-1]], [co2_data.ppm.iloc[-1]],
            c='C0', s=140)

for d, v, ts in zip(diffs,
                    co2_data.ppm,
                    co2_data.index):
    plt.text(d, v-.15, '%s\n%d' % (month_abbr[ts.month], ts.year),
             horizontalalignment='left', verticalalignment='top')

plt.xlim([-np.abs(diffs).max()-.1,
          np.abs(diffs).max()+.1])
plt.ylabel('CO2 concentrations (ppm)')
plt.xlabel('change from previous month (ppm)')


The situation is much improved. Now we can see that from June to September (values on the left) the concentrations decrease, while from October to May (values on the right) they increase. If we look at the points where the line of zero change is crossed, we can spot when there's a change of trend. We see that the trend changes from negative to positive from September to October, and the opposite from May to June.
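As a quick check, we can also locate these crossings programmatically. Here's a minimal sketch that flags the months where the month-over-month change flips sign (it assumes the co2_data DataFrame built above):

import numpy as np

# sign of the month-over-month change: +1 rising, -1 falling
sign = np.sign(co2_data.ppm.diff())
# a trend change happens where the sign differs from the previous month
flips = sign.diff().fillna(0) != 0
print(co2_data.index[flips.values].strftime('%b %Y').tolist())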

Would it have been easier to observe all of this with time on the x axis? You can judge for yourself from the conventional view of the same series sketched below.
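Here's a minimal sketch of that conventional view, reusing the co2_data DataFrame from above:

plt.figure(figsize=(8, 6))
plt.plot(co2_data.index, co2_data.ppm, '-o')
plt.ylabel('CO2 concentrations (ppm)')
plt.xlabel('time')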

Saturday, March 23, 2019

Visualizing the trend of a time series with Pandas

The trend of a time series is the general direction in which the values change. In this post we will focus on how to use rolling windows to isolate it. Let's download from Google Trends the interest over time for the search term Pancakes and see what we can do with it:
import pandas as pd
import matplotlib.pyplot as plt
url = './data/pancakes.csv' # downloaded from https://trends.google.com
data = pd.read_csv(url, skiprows=2, parse_dates=['Month'], index_col=['Month'])
plt.plot(data)


Looking at the data we notice that there's some seasonality (Pancakes day! yay!) and an increasing trend. What if we want to visualize just the trend of this curve? We only need to slide a rolling window through the data and compute the average at each step. This can be done in just one line if we use the method rolling:

y_mean = data.rolling('365D').mean()
plt.plot(y_mean)


The parameter '365D' passed to rolling means that our rolling window will be 365 days wide. Check out the documentation of the method to know more.
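Note that rolling also accepts an integer window, which counts rows rather than days. Since this series has one sample per month, a sketch along these lines should give roughly the same curve:

y_mean_12 = data.rolling(12).mean()  # 12 monthly samples ~ 365 days
plt.plot(y_mean_12)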
We can also highlight the variation each year by adding to the chart a shaded area with the amplitude of the standard deviation:

y_std = data.rolling('365D').std()
plt.plot(y_mean)
plt.fill_between(y_mean.index,
                 (y_mean - y_std).values.T[0],
                 (y_mean + y_std).values.T[0], alpha=.5)


Warning: the visualization above assumes that the distribution of the data each year follows a normal distribution, which is not entirely true.
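If that assumption bothers you, a distribution-free alternative is to shade between rolling quantiles instead of mean ± standard deviation. A minimal sketch, assuming the same data as above:

y_median = data.rolling('365D').median()
y_low = data.rolling('365D').quantile(.1)   # 10th percentile
y_high = data.rolling('365D').quantile(.9)  # 90th percentile
plt.plot(y_median)
plt.fill_between(y_median.index,
                 y_low.values.T[0],
                 y_high.values.T[0], alpha=.5)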

Thursday, July 13, 2017

Dates in Pandas Cheatsheet

Lately I've been working a lot with dates in Pandas so I decided to make this little cheatsheet with the commands I use the most.

Importing a csv using a custom function to parse dates

import pandas as pd
from datetime import datetime

def parse_month(month):
    """
    Converts a string in the format YYYYMmm to a datetime.
    Example: parse_month("2007M02") returns datetime(2007, 2, 1)
    """
    return datetime(int(month[:4]), int(month[-2:]), 1)

# note: date_parser is deprecated in recent pandas in favor of date_format
temperature = pd.read_csv('TempUSA.csv', parse_dates=['Date'], 
                          date_parser=parse_month, 
                          index_col=['Date'], # will become an index
                          # use a subset of the columns
                          usecols=['Date', 
                                   'LosAngelesMax', 'LosAngelesMin'])
print(temperature)
            LosAngelesMax  LosAngelesMin
Date                                    
2000-01-01           19.6           10.0
2000-02-01           18.9           10.1
2000-03-01           18.6           10.1
2000-04-01           20.2           12.5
2000-05-01           21.9           14.2

Format the dates in a chart

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
plt.plot(temperature['LosAngelesMax'])
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.show()

Here's the reference for the date format directives. The ISO-compliant format is %Y-%m-%dT%H:%M:%S.

Group the DataFrame by month

print(temperature.groupby([temperature.index.month]).mean())
      LosAngelesMax  LosAngelesMin
Date                              
1         20.092308       8.992308
2         19.223077       9.276923
3         19.253846      10.492308
4         19.992308      11.461538
5         21.076923      13.761538
6         22.123077      15.800000
7         23.892308      17.315385
8         24.246154      17.530769
9         24.384615      16.846154
10        23.330769      14.630769
11        21.950000      11.241667
12        19.241667       8.683333
The resulting DataFrame is indexed by month.

Merging two DataFrames indexed with timestamps that don't match exactly

import numpy as np

date_range_a = pd.date_range('2007-01-01 01:00', 
                             '2007-01-01 03:00', freq='1h')
date_range_b = date_range_a + pd.Timedelta(10, 'm')
df_a = pd.DataFrame(np.arange(len(date_range_a)), 
                    columns=['a'], index=date_range_a)
df_b = pd.DataFrame(['x', 'y', 'z'], 
                    columns=['b'], index=date_range_b)

print('left DataFrame')
print(df_a)
print('\nright DataFrame')
print(df_b)
print('\nmerge_AsOf result')
print(pd.merge_asof(df_a, df_b, direction='nearest', 
                    left_index=True, right_index=True))
left DataFrame
                     a
2007-01-01 01:00:00  0
2007-01-01 02:00:00  1
2007-01-01 03:00:00  2

right DataFrame
                     b
2007-01-01 01:10:00  x
2007-01-01 02:10:00  y
2007-01-01 03:10:00  z

merge_AsOf result
                     a  b
2007-01-01 01:00:00  0  x
2007-01-01 02:00:00  1  y
2007-01-01 03:00:00  2  z
The DataFrames have been aligned according to the index on the left.
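merge_asof also accepts a tolerance parameter to reject matches that are too far apart. Here, for instance, a 5 minute tolerance leaves every row of b unmatched, since the gap between the indices is always 10 minutes:

print(pd.merge_asof(df_a, df_b, direction='nearest', 
                    left_index=True, right_index=True,
                    tolerance=pd.Timedelta(minutes=5)))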

Aligning two DataFrames

aligned = df_a.align(df_b)

print('left aligned')
print(aligned[0])
print('\nright aligned')
print(aligned[1])
print('\ncombination')
aligned[0]['b'] = aligned[1]['b']
print(aligned[0])
left aligned
                       a   b
2007-01-01 01:00:00  0.0 NaN
2007-01-01 01:10:00  NaN NaN
2007-01-01 02:00:00  1.0 NaN
2007-01-01 02:10:00  NaN NaN
2007-01-01 03:00:00  2.0 NaN
2007-01-01 03:10:00  NaN NaN

right aligned
                      a    b
2007-01-01 01:00:00 NaN  NaN
2007-01-01 01:10:00 NaN    x
2007-01-01 02:00:00 NaN  NaN
2007-01-01 02:10:00 NaN    y
2007-01-01 03:00:00 NaN  NaN
2007-01-01 03:10:00 NaN    z

combination
                       a    b
2007-01-01 01:00:00  0.0  NaN
2007-01-01 01:10:00  NaN    x
2007-01-01 02:00:00  1.0  NaN
2007-01-01 02:10:00  NaN    y
2007-01-01 03:00:00  2.0  NaN
2007-01-01 03:10:00  NaN    z
The timestamps are now aligned across both DataFrames and the unknown values have been filled with NaNs. When working with numeric values, the missing entries can be filled with interpolation:
print(aligned[0].a.interpolate())
2007-01-01 01:00:00    0.0
2007-01-01 01:10:00    0.5
2007-01-01 02:00:00    1.0
2007-01-01 02:10:00    1.5
2007-01-01 03:00:00    2.0
2007-01-01 03:10:00    2.0
Name: a, dtype: float64
Categorical values can be filled using the bfill (backward fill) method:
print(aligned[1].b.bfill())
2007-01-01 01:00:00    x
2007-01-01 01:10:00    x
2007-01-01 02:00:00    y
2007-01-01 02:10:00    y
2007-01-01 03:00:00    z
2007-01-01 03:10:00    z
Name: b, dtype: object
The method bfill propagates the next valid observation backward, while ffill propagates the last valid observation forward.
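For comparison, applying ffill to the same column leaves the first value missing and carries each observation forward:
print(aligned[1].b.ffill())
2007-01-01 01:00:00    NaN
2007-01-01 01:10:00      x
2007-01-01 02:00:00      x
2007-01-01 02:10:00      y
2007-01-01 03:00:00      y
2007-01-01 03:10:00      z
Name: b, dtype: object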

Convert a Timedelta to hours

td = pd.Timestamp('2017-07-05 16:00') - pd.Timestamp('2017-07-05 12:00')
print(td / pd.Timedelta(1, unit='h'))
4.0
To convert to days, minutes, seconds and so on, one just needs to change the unit. Here are the accepted values: D, h, m, s, ms, us, ns.
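For instance, the same difference expressed in minutes:
print(td / pd.Timedelta(1, unit='m'))
240.0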

Convert pandas timestamps to unix timestamps

unix_ts = pd.date_range('2017-01-01 1:00', 
                        '2017-01-01 2:00', 
                        freq='30min').astype(np.int64) // 10**9
print(unix_ts)
Int64Index([1483232400, 1483234200, 1483236000], dtype='int64')
To convert to milliseconds, divide by 10**6 instead of 10**9.

Convert unix timestamps to pandas timestamps

print(pd.to_datetime(unix_ts, unit='s'))
DatetimeIndex(['2017-01-01 01:00:00', '2017-01-01 01:30:00',
               '2017-01-01 02:00:00'],
              dtype='datetime64[ns]', freq=None)
To convert from timestamps in milliseconds, change the unit to 'ms'.
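For instance, scaling the same values to milliseconds and converting back gives the same DatetimeIndex as above:
print(pd.to_datetime(unix_ts * 1000, unit='ms'))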

Tuesday, January 27, 2015

Forecasting beer consumption with sklearn

In this post we will see how to implement a straightforward forecasting model based on the linear regression object of sklearn. The model that we are going to build is based on the idea that past observations are good predictors of a future value. Using some symbols, given $x_{n-k}, \dots, x_{n-2}, x_{n-1}$ we want to estimate $x_{n+h}$, where $h$ is the forecast horizon, just using the given values. The estimation that we are going to apply is the following:

$$\hat{x}_{n+h} = w_0 + \sum_{i=1}^{k} w_i \, x_{n-i}$$

where $x_{n-k}$ and $x_{n-1}$ are respectively the oldest and the newest observations we consider for the forecast. The weights $w_k, \dots, w_1, w_0$ are chosen in order to minimize the sum of squared errors on the training data,

$$\sum_{j=1}^{m} \left( \hat{x}_j - x_j \right)^2$$

where $m$ is the number of periods available to train our model. This model is often referred to as a regression model with lagged explanatory variables, and $k$ is called the lag order.

Before implementing the model let's load a time series to forecast:
import pandas as pd
df = pd.read_csv('NZAlcoholConsumption.csv')
to_forecast = df.TotalBeer.values
dates = df.DATE.values
The time series represents the total alcohol consumed per quarter, in millions of litres, from the 1st quarter of 2000 to the 3rd quarter of 2012. The data comes from the New Zealand government and can be downloaded in csv from here. We will focus on the forecast of beer consumption.
First, we need to organize our data into windows that contain the previous observations:
import numpy as np

def organize_data(to_forecast, window, horizon):
    """
     Input:
      to_forecast, univariate time series organized as numpy array
      window, number of items to use in the forecast window
      horizon, horizon of the forecast
     Output:
      X, a matrix where each row contains a forecast window
      y, the target values for each row of X
    """
    shape = to_forecast.shape[:-1] + \
            (to_forecast.shape[-1] - window + 1, window)
    strides = to_forecast.strides + (to_forecast.strides[-1],)
    X = np.lib.stride_tricks.as_strided(to_forecast, 
                                        shape=shape, 
                                        strides=strides)
    y = np.array([X[i+horizon][-1] for i in range(len(X)-horizon)])
    return X[:-horizon], y

k = 4   # number of previous observations to use
h = 1   # forecast horizon
X,y = organize_data(to_forecast, k, h)
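Before training, a quick sanity check on a toy series shows how the data is laid out: each row of X is a window and each entry of y is the value that comes horizon steps after it.
toy = np.arange(8)
X_toy, y_toy = organize_data(toy, 3, 1)
print(X_toy[0], y_toy[0])  # prints: [0 1 2] 3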
Now, X is a matrix where the i-th row contains the lagged variables $x_{n-k}, \dots, x_{n-2}, x_{n-1}$ and y[i] contains the i-th target value. We are ready to train our forecasting model:
from sklearn.linear_model import LinearRegression
 
m = 10 # number of samples to take in account
# note: the 'normalize' option was removed in recent versions of
# scikit-learn; there, standardize the features with StandardScaler instead
regressor = LinearRegression(normalize=True)
regressor.fit(X[:m], y[:m])
We trained our model using the first 10 observations, which means that we used the data from the 1st quarter of 2000 to the 2nd quarter of 2002. We used a lag order of one year and a forecast horizon of 1 quarter. To estimate the error of the model we will use the mean absolute percentage error (MAPE). Computing this metric to compare the forecasts for the remaining observations of the time series with the actual observations, we have:
def mape(ypred, ytrue):
    """ returns the mean absolute percentage error """
    idx = ytrue != 0.0
    return 100*np.mean(np.abs(ypred[idx]-ytrue[idx])/ytrue[idx])

print('The error is %0.2f%%' % mape(regressor.predict(X[m:]), y[m:]))
The error is 6.15%
This means that, on average, the forecast provided by our model differs from the target value by only 6.15%. Let's compare the forecast and the observed values visually:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.plot(y, label='True demand', color='#377EB8', linewidth=2)
plt.plot(regressor.predict(X), 
         '--', color='#EB3737', linewidth=3, label='Prediction')
plt.plot(y[:m], label='Train data', color='#3700B8', linewidth=2)
plt.xticks(np.arange(len(dates))[1::4], dates[1::4], rotation=45)
plt.legend(loc='upper right')
plt.ylabel('beer consumed (millions of litres)')
plt.show()

We note that the forecast is very close to the target values and that the model was able to learn the trends and anticipate them in many cases.

Wednesday, November 26, 2014

Comparing strikers statistics

Here we compare the scoring statistics of four of the best strikers in recent football history: Del Piero, Trezeguet, Ronaldo and Vieri. The statistics that we will look at are the scoring trajectory, the scoring rate and the number of appearances.
To compute these values we need to scrape the career statistics (number of goals and appearances per season) on the Wikipedia pages of the players:
import numpy as np
from bs4 import BeautifulSoup
from urllib.request import urlopen

def get_total_goals(url):
    """
    Given the url of a wikipedia page about a football striker
    returns three numpy arrays:
    - years, each element corresponds to a season
    - appearances, contains the number of appearances each season
    - goals, contains the number of goals scored each season
    
    Unfortunately this function is able to parse 
    only the pages of a few strikers.
    """
    soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
    table = soup.find("table", { "class" : "wikitable" })
    years = []
    apps = []
    goals = []
    for row in table.findAll("tr"):
        cells = row.findAll("td")
        if len(cells) > 1:
            years.append(int(cells[0].text[:4]))
            apps.append(int(cells[len(cells)-2].text))
            goals.append(int(cells[len(cells)-1].text))
    return (np.array(years), 
            np.array(apps, dtype='float'), 
            np.array(goals))

ronaldo = get_total_goals('http://en.wikipedia.org/wiki/Ronaldo')
vieri = get_total_goals('http://en.wikipedia.org/wiki/Christian_Vieri')
delpiero = get_total_goals('http://en.wikipedia.org/wiki/Alessandro_Del_Piero')
trezeguet = get_total_goals('http://en.wikipedia.org/wiki/David_Trezeguet')
Now we are ready to compute our statistics. For each statistic we will produce an interactive chart using plotly.

Scoring trajectory

import plotly.plotly as py  # in recent plotly this module moved to chart_studio.plotly
from plotly.graph_objs import *
py.sign_in("sexyusername", "mypassword")

data = Data([
    Scatter(x=delpiero[0], y=np.cumsum(delpiero[2]), 
            name='Del Piero', mode='lines'),
    Scatter(x=trezeguet[0], y=np.cumsum(trezeguet[2]), 
            name='Trezeguet', mode='lines'),
    Scatter(x=ronaldo[0], y=np.cumsum(ronaldo[2]), 
            name='Ronaldo', mode='lines'),
    Scatter(x=vieri[0], y=np.cumsum(vieri[2]), 
            name='Vieri', mode='lines'),
])

layout = Layout(
    title='Scoring Trajectory',
    xaxis=XAxis(title='Year'),
    yaxis=YAxis(title='Cumulative goals'),
    legend=Legend(x=0.0,y=1.0))

fig = Figure(data=data, layout=layout)

py.iplot(fig, filename='cumulative-goals')
The scoring trajectory is given by the yearly cumulative total of goals scored. From the scoring trajectories we can see that Ronaldo was a goal machine from his first professional season, and that his worst period was from 1999 to 2001. Del Piero and Trezeguet have the longest careers (and they're still playing!). Vieri had the shortest career, but it's impressive to see that the number of goals he scored increased almost constantly from 1996 to 2004.

Scoring rate

data = Data([
    Bar(
        x=['Ronaldo', 'Vieri', 'Trezeguet', 'Del Piero'],
        y=[np.sum(ronaldo[2])/np.sum(ronaldo[1]), 
           np.sum(vieri[2])/np.sum(vieri[1]),
           np.sum(trezeguet[2])/np.sum(trezeguet[1]),
           np.sum(delpiero[2])/np.sum(delpiero[1])]
    )
])
py.iplot(data, filename='goal-average')
The scoring rate is the number of goals scored divided by the number of appearances. Ronaldo has a terrific 0.67 scoring rate, meaning that, on average, he scored more than three goals every five games. Vieri and Trezeguet have very similar scoring rates, almost one goal every two games, while Del Piero's is 0.40, two goals every five games.

Appearances

data = Data([
    Bar(
        x=['Del Piero', 'Trezeguet', 'Ronaldo', 'Vieri'],
        y=[np.sum(delpiero[1]),
           np.sum(trezeguet[1]),
           np.sum(ronaldo[1]),
           np.sum(vieri[1])]
    )
])
py.iplot(data, filename='appearances')
The number of Del Piero's appearances on a football field is impressive. At the moment I'm writing, he has played 773 games. None of the other players managed to play even 70% of the games played by the Italian numero 10.