Showing posts with label visualization.

Wednesday, November 11, 2020

Visualize the Dictionary of Obscure Words with T-SNE

I recently published a Python wrapper around The Dictionary of Obscure Words (originally from http://phrontistery.info) and in this post we'll see how to create a visualization that highlights a few entries from the dictionary using the dimensionality reduction technique called T-SNE. The dictionary is available on github at https://github.com/JustGlowing/obscure_words and can be installed as follows:
pip install git+https://github.com/JustGlowing/obscure_words
We can now import the dictionary and create a vectorial representation of each word:
import matplotlib.pyplot as plt
import numpy as np
from obscure_words import load_obscure_words
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.manifold import TSNE

obscure_dict = load_obscure_words()
words = np.array(list(obscure_dict.keys()))
definitions = np.array(list(obscure_dict.values()))

vectorizer = TfidfVectorizer(stop_words=None)
X = vectorizer.fit_transform(definitions)

projector = TSNE(random_state=0)
XX = projector.fit_transform(X)
In the snippet above, we compute a Tf-Idf representation using the definition of each word. This gives us a vector for each word in our dictionary, but each of these vectors has as many elements as the total number of words used in all the definitions. Since we can't plot all the features extracted, we reduce our data to 2 dimensions using T-SNE. We now have a mapping that allows us to place each word at a point in a bi-dimensional space. One problem remains: how can we plot the words in a way that we can still read them? Here's a solution:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances

def textscatter(x, y, text, k=10):
    X = np.array([x, y]).T
    clustering = KMeans(n_clusters=k)
    scaler = StandardScaler()
    clustering.fit(scaler.fit_transform(X))
    centers = scaler.inverse_transform(clustering.cluster_centers_)
    selected = np.argmin(pairwise_distances(X, centers), axis=0)
    plt.scatter(x, y, s=6, c=clustering.predict(scaler.transform(X)), alpha=.05)
    for i in selected:
        plt.text(x[i], y[i], text[i], fontsize=10)

plt.figure(figsize=(16, 16))
textscatter(XX[:, 0], XX[:, 1], 
            [w+'\n'+d for w, d in zip(words, definitions)], 20)
plt.show()
In the function textscatter we segment all the points created in the previous steps into k clusters using K-Means, then we plot the word closest to the center of each cluster (along with its definition). Given the properties of K-Means we know that the centers are distant from each other, and with the right choice of k we can maximize the number of words we can display. This is the result of the snippet above:
(click on the figure to see the entire chart)
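As a side note, if the labels are hard to read on a dense figure, the same cluster-center trick can be used to simply print a handful of representative entries. Here is a small sketch (not part of the original snippet) that reuses words, definitions and the projection XX computed above; the value k=20 mirrors the call to textscatter:
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import StandardScaler

# pick, for each of the k clusters, the entry closest to the cluster center
k = 20
P = StandardScaler().fit_transform(XX)
centers = KMeans(n_clusters=k).fit(P).cluster_centers_
representatives = np.argmin(pairwise_distances(P, centers), axis=0)
for i in representatives:
    print(words[i], '->', definitions[i])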

Sunday, May 17, 2020

Neural Networks Regularization made easy with sklearn and matplotlib

Using regularization has many benefits, the most common being the reduction of overfitting and the mitigation of multicollinearity issues. All of this is covered very well in the literature, especially in (Hastie et al.). However, without touching too many details, we can give a very straightforward interpretation of regularization: it is a way to constrain a model so that it learns less from the data. In this post we will experimentally show the effects of regularization on a Neural Network (Multilayer Perceptron) to validate this interpretation.

Let's define a goal for our Neural Network. We have a dataset H and we want to build a model able to reconstruct the same data. More formally, we want to build a function f that takes H as input and returns the same values, or an approximation as close as possible to H. We can say that we want f to minimize the following error
E = || f(H) - H ||^2
To begin our experiment we build a data matrix H that contains the coordinate of a stylized star:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

x= [np.cos(np.pi/2), 2/5*np.cos(np.pi/2+np.pi/5), np.cos(np.pi/2+2*np.pi/5), 
    2/5*np.cos(np.pi/2+3*np.pi/5), np.cos(np.pi/2+4*np.pi/5), 
    2/5*np.cos(3*np.pi/2), np.cos(np.pi/2+6*np.pi/5), 
    2/5*np.cos(np.pi/2+7*np.pi/5), np.cos(np.pi/2+8*np.pi/5),
    2/5*np.cos(np.pi/2+9*np.pi/5), np.cos(np.pi/2)]

y=[np.sin(np.pi/2), 2/5*np.sin(np.pi/2+np.pi/5), np.sin(np.pi/2+2*np.pi/5),
   2/5*np.sin(np.pi/2+3*np.pi/5), np.sin(np.pi/2+4*np.pi/5),
   2/5*np.sin(3*np.pi/2), np.sin(np.pi/2+6*np.pi/5),
   2/5*np.sin(np.pi/2+7*np.pi/5), np.sin(np.pi/2+8*np.pi/5),
   2/5*np.sin(np.pi/2+9*np.pi/5), np.sin(np.pi/2)]

# interpolate to get intermediate points along the outline
xp = np.linspace(0, 1, len(x))
x = np.log(np.interp(np.linspace(0, 1, len(x)*10), xp, x)+1)

yp = np.linspace(0, 1, len(y))
y = np.interp(np.linspace(0, 1, len(y)*10), yp, y)
#y[::2] += .1

H = np.array([x, y]).T
plt.plot(H[:,0], H[:,1], '-o')


Now the matrix H contains the x coordinates of our star in the first column and the y coordinates in the second. With the help of sklearn, we can now train a Neural Network and plot the result:
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import minmax_scale

H = minmax_scale(H)

plt.figure()
f = MLPRegressor(hidden_layer_sizes=(200, 200, 200), alpha=0)
f.fit(H, H)
result = f.predict(H)
plt.plot(result[:,0], result[:,1], 'C3', linewidth=3, label='Neural Network')
plt.plot(H[:,0], H[:,1], 'o', alpha=.3, label='original')
plt.legend()
#plt.xlim([-0.1, 1.1])
#plt.ylim([-0.1, 1.1])
plt.show()



In the snippet above we created a Neural Network with 3 layers of 200 neurons. Then, we trained the model using H as both input and output data. In the chart we compare the original data with the estimation. It's easy to see that there are small differences between the two stars but they are very close.
It's important to notice that we initialized MLPRegressor with alpha=0. This parameter controls the regularization of the model: the higher its value, the more regularization we apply. To understand how alpha affects the learning of the model, we need to add a term to the error just introduced:
E = || f(H) - H ||^2 + alpha * || W ||^2
where W is a matrix of all the weights in the network. The error now takes into account not only the difference between the output of the function and the data, but also the size of the weights of the connections of the network. Hence, the higher alpha, the less the model is free to learn. If we set alpha to 20 in our experiment we have the following result:



We still achieve an approximation of the star, but the output of our model is smaller than before and the edges of the star are smoother.
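To make the new term concrete, here is a small sketch (an illustration of the idea, not sklearn's exact internal formulation of the penalty) that evaluates both pieces of the regularized error for the network f trained above:
# reconstruction error of the model trained with alpha=0
reconstruction_error = np.mean((f.predict(H) - H)**2)
# size of the weights, read from the fitted coefficient matrices
weights_norm = sum(np.sum(W**2) for W in f.coefs_)
alpha = 20
print('reconstruction error: %.4f' % reconstruction_error)
print('penalty term alpha*||W||^2: %.2f' % (alpha * weights_norm))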
Increasing alpha gradually is a good way to understand the effects of regularization:
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

fig, ax = plt.subplots()
ln, = plt.plot([], [], 'C3', linewidth=3, label='Neural Network')
plt.plot(H[:,0], H[:,1], 'o', alpha=.3, label='original')
plt.legend()

def update(frame):
    f = MLPRegressor(hidden_layer_sizes=(200, 200, 200), alpha=frame)
    f.fit(H, H)
    result = f.predict(H)
    ln.set_data(result[:,0], result[:,1])
    plt.title('alpha = %.2f' % frame)
    return ln,

ani = FuncAnimation(fig, update, frames=np.linspace(0, 40, 100), blit=True)
HTML(ani.to_html5_video())



Here we vary alpha from 0 to 40 and plot the result for each value. We notice that not only does the star get smaller and smoother as alpha increases, but also that the network tends to preserve the long lines as much as possible while drifting away from the edges, which weigh less in the error function. Finally, the result degenerates into a single point when alpha is too high.
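A quick way to quantify what the animation shows is to sweep a few values of alpha and print the reconstruction error of each model. This is a rough sketch (results will vary from run to run because of the random initialization of the network):
for a in [0, 1, 5, 20, 40]:
    f = MLPRegressor(hidden_layer_sizes=(200, 200, 200), alpha=a)
    f.fit(H, H)
    print('alpha=%5.1f  reconstruction MSE=%.4f' % (a, np.mean((f.predict(H) - H)**2)))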

Friday, May 1, 2020

Tornado plots with matplotlib

Lately there's been a bit of attention around charts where the values of a time series are plotted against their point-by-point change, thanks to this rather colorful and cluttered Tornado plot.

In this post we will see how to make one of those charts with our favorite plotting library, matplotlib, and we'll also try to understand how to read them.
Let's start by loading the records of the concentration of CO2 in the atmosphere and aggregating the values at month level. After that we can plot straight away the value of the concentration against the change compared to the previous month:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

data_url = 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt'
co2_data = pd.read_csv(data_url, sep='\s+', comment='#', na_values=-999.99,
                       names=['year', 'month', 'day', 'decimal', 'ppm', 
                       'days', '1_yr_ago',  '10_yr_ago', 'since_1800'])
co2_data = co2_data.groupby([co2_data.year, co2_data.month]).mean()
idx = co2_data.index.to_flat_index()
co2_data['timestamp'] = [pd.Timestamp(year=y, month=m, day=1) for y, m in idx]
co2_data.set_index('timestamp', inplace=True)
co2_data = co2_data['2018-04-01':]


plt.plot(co2_data.ppm.diff(), co2_data.ppm, '-o')


The result is nicely loopy. This is because we plotted data for 2 years and the time series presents a yearly seasonality. Still, apart from the loop, it's quite hard to understand anything without adding a few details. Let's print the names of the months and highlight the beginning and the end of the time series:
from calendar import month_abbr

plt.figure(figsize=(8, 12))
diffs = co2_data.ppm.diff()
plt.plot(diffs, co2_data.ppm, '-o', alpha=.6, zorder=0)
plt.scatter([diffs[1]], [co2_data.ppm[1]],
            c='C0', s=140)
plt.scatter([diffs[-1]], [co2_data.ppm[-1]],
            c='C0', s=140)

for d, v, ts in zip(diffs,
                    co2_data.ppm,
                    co2_data.index):
    plt.text(d, v-.15, '%s\n%d' % (month_abbr[ts.month], ts.year),
             horizontalalignment='left', verticalalignment='top')

plt.xlim([-np.abs(diffs).max()-.1,
          np.abs(diffs).max()+.1])
plt.ylabel('CO2 concentrations (ppm)')
plt.xlabel('change from previous month (ppm)')


The situation is much improved. Now we can see that from June to September (values on the left) the concentrations decrease, while from October to May (values on the right) they increase. If we look at the points where the line of zero change is crossed, we can spot when there's a change of trend. We see that the trend changes from negative to positive between September and October, and the opposite happens between May and June.
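Those trend changes can also be listed programmatically. Here is a small sketch (not in the original post) that prints the months where the monthly change crosses zero, reusing the diffs series computed above:
# a trend change happens where the sign of the monthly change flips
sign_flip = np.sign(diffs).diff().fillna(0) != 0
print(co2_data.index[sign_flip.values].strftime('%b %Y'))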

Would it have been easier to observe this with time on the y axis? That's easy to judge looking at the chart below.

Tuesday, March 17, 2020

Ridgeline plots in pure matplotlib

A Ridgeline plot (also called Joyplot) allows us to compare several statistical distributions. In this plot each distribution is shown with a density plot, and all the distributions are aligned to the same horizontal axis and, sometimes, presented with a slight overlap.

There are many options to make a Ridgeline plot in Python (joypy being one of them) but I decided to make my own function using matplotlib to have full flexibility and minimal dependencies:
from scipy.stats import gaussian_kde
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt

def ridgeline(data, overlap=0, fill=True, labels=None, n_points=150):
    """
    Creates a standard ridgeline plot.

    data, list of lists.
    overlap, overlap between distributions. 1 max overlap, 0 no overlap.
    fill, matplotlib color to fill the distributions.
    n_points, number of points to evaluate each distribution function.
    labels, values to place on the y axis to describe the distributions.
    """
    if overlap > 1 or overlap < 0:
        raise ValueError('overlap must be in [0 1]')
    xx = np.linspace(np.min(np.concatenate(data)),
                     np.max(np.concatenate(data)), n_points)
    curves = []
    ys = []
    for i, d in enumerate(data):
        pdf = gaussian_kde(d)
        y = i*(1.0-overlap)
        ys.append(y)
        curve = pdf(xx)
        if fill:
            plt.fill_between(xx, np.ones(n_points)*y, 
                             curve+y, zorder=len(data)-i+1, color=fill)
        plt.plot(xx, curve+y, c='k', zorder=len(data)-i+1)
    if labels:
        plt.yticks(ys, labels)
The function takes as input a list of datasets where each dataset contains the values to derive a single distribution. Each distribution is estimated using Kernel Density Estimation, just as we've seen previously, and plotted at an increasing y value.

Let's generate data from a few normal distributions with different means and have a look at the output of the function:
data = [norm.rvs(loc=i, scale=2, size=50) for i in range(8)]
ridgeline(data, overlap=.85, fill='y')


Not too bad, we can clearly see that each distribution has a different mean. Let's apply the function to real world data:
import pandas as pd
data_url = 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_weekly_mlo.txt'
co2_data = pd.read_csv(data_url, sep='\s+', comment='#', na_values=-999.99,
                       names=['year', 'month', 'day', 'decimal', 'ppm', 
                       'days', '1_yr_ago',  '10_yr_ago', 'since_1800'])
co2_data = co2_data[co2_data.year >= 2000]
co2_data = co2_data[co2_data.year != 2020]

plt.figure(figsize=(8, 10))
grouped = [(y, g.ppm.dropna().values) for y, g in co2_data.groupby('year')]
years, data = zip(*grouped)
ridgeline(data, labels=years, overlap=.85, fill='tomato')
plt.title('Distribution of CO2 levels per year since 2000',
          loc='left', fontsize=18, color='gray')
plt.gca().spines['left'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.xlabel('ppm')
plt.xlim((co2_data.ppm.min(), co2_data.ppm.max()))
plt.ylim((0, 3.1))
plt.grid(zorder=0)
plt.show()


In the snippet above we downloaded the measurements of the concentration of CO2 in the atmosphere (the same data was also used here) and grouped the values by year. Then, we generated a Ridgeline plot that shows the distribution of CO2 levels each year since 2000. We easily note that the average concentration went from 370ppm to 420ppm, gradually increasing over the 19 years observed. We also note that the span of each distribution is approximately 10ppm.
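Those figures can be checked numerically. Here is a small sketch (not in the original post) that reuses the grouped list built above to print the mean and span of each yearly distribution:
for y, values in grouped:
    print('%d  mean %.1f ppm  span %.1f ppm' % (y, values.mean(),
                                                values.max() - values.min()))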

Wednesday, September 11, 2019

Organizing movie covers with Neural Networks

In this post we will see how to organize a set of movie covers by similarity on a 2D grid using a particular type of Neural Network called Self Organizing Map (SOM). First, let's load the movie covers of the top 100 movies according to IMDB (the files can be downloaded here) and convert the images into samples that we can use to feed the Neural Network:
import numpy as np
import imageio
from glob import glob
from sklearn.preprocessing import StandardScaler

# covers of the top 100 movies on www.imdb.com/chart/top 
# (the 13th of August 2019)
# images downloaded from www.themoviedb.org
data = []
all_covers = glob('movie_covers/*.jpg')
for cover_jpg in all_covers:
    cover = imageio.imread(cover_jpg)
    data.append(cover.reshape(np.prod(cover.shape)))
    
original_shape = imageio.imread(all_covers[0]).shape

scaler = StandardScaler()
data = scaler.fit_transform(data)
In the snippet above we load every image and, for each of them, we stack the color values of all the pixels into a one dimensional vector. After loading all the images, a standard scaling is applied so that all the features have mean 0 and standard deviation equal to 1. This scaling strategy often turns out to be quite successful when working with SOMs. Now we can train our model:
from minisom import MiniSom

w = 10
h = 10
som = MiniSom(h, w, len(data[0]), learning_rate=0.5,
              sigma=3, neighborhood_function='triangle')

som.train_random(data, 2500, verbose=True)
win_map = som.win_map(data)
Here we use MiniSom, a lean implementation of the SOM, to build a 10-by-10 map of neurons. Each movie cover is mapped to a neuron and we can display the results as follows:
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid

fig = plt.figure(figsize=(30, 20))
grid = ImageGrid(fig, 111,
                 nrows_ncols=(h, w), axes_pad=0)

def place_image(i, img):
    img = (scaler.inverse_transform(img)).astype(int)
    grid[i].imshow(img.reshape(original_shape))
    grid[i].axis('off')

to_fill = []
collided = []

for i in range(w*h):
    position = np.unravel_index(i, (h, w))
    if position in win_map:
        img = win_map[position][0]
        collided += win_map[position][1:]
        place_image(i, img)
    else:
        to_fill.append(i)

collided = collided[::-1]
for i in to_fill:
    position = np.unravel_index(i, (h, w))
    img = collided.pop()
    place_image(i, img)

plt.show()
Since some images can be mapped to the same neuron, we first draw all the covers picking only one per neuron, then we fill the empty cells of the map with covers that have been mapped to nearby neurons but have not been plotted yet.
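To get an idea of how crowded the map is before filling it, the collisions can be counted directly from win_map. A small sketch (not in the original post):
# win_map maps each neuron position to the list of covers assigned to it
occupied = len(win_map)
shared = sum(len(samples) - 1 for samples in win_map.values() if len(samples) > 1)
print('%d of %d neurons are occupied, %d covers share a neuron' % (occupied, w*h, shared))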

This is the result:



Where to go next:
  • Read more about how Self Organizing Maps work here.
  • Check out how to install Minisom here.

Sunday, August 11, 2019

Visualizing distributions with scatter plots in matplotlib

Let's say that we want to study the time between the end of a marked point and the next serve in a tennis game. After gathering our data, the first thing we can do is draw a histogram of the variable we are interested in:

import pandas as pd
import matplotlib.pyplot as plt

url = 'https://raw.githubusercontent.com/fivethirtyeight'
url += '/data/master/tennis-time/serve_times.csv'
event = pd.read_csv(url)

plt.hist(event.seconds_before_next_point, bins=10)
plt.xlabel('Seconds before next serve')
plt.show()


The histogram reveals some interesting aspects of the distribution: we can see that the data is slightly skewed to the right and that on average the server takes about 20 seconds. However, we can't tell how many times the serve happens before 10 seconds or after 35. Of course, one could increase the number of bins, but this would lead to a chart which is not particularly elegant and that might hide some other details.
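If all we need is those raw counts, they can of course be computed directly from the data. A quick sketch (not in the original post):
quick = (event.seconds_before_next_point < 10).sum()
slow = (event.seconds_before_next_point > 35).sum()
print('serves before 10 seconds: %d, after 35 seconds: %d' % (quick, slow))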

To have a better understanding of the situation we can draw a scatter plot of the variable we are studying:
import numpy as np
from scipy.stats import gaussian_kde

def distribution_scatter(x, symmetric=True, cmap=None, size=None):
    """
    Plot the distribution of x showing all the points.
    The x axis represents the samples in x
    and the y axis is function of the probability of x
    and random assignment.
    
    Returns the position on the y axis.
    """
    pdf = gaussian_kde(x)
    w = np.random.rand(len(x))
    if symmetric:
        w = w*2-1
    pseudo_y = pdf(x) * w
    if cmap:
        plt.scatter(x, pseudo_y, c=x, cmap=cmap, s=size)
    else:
        plt.scatter(x, pseudo_y, s=size)
    return pseudo_y


In this chart each sample is represented with a point and the spread of the points in the y direction depends on the probability of occurrence. In this case we can easily see that 4 serves happened before 10 seconds and 3 after 35.

Since we're not really interested in the values on the y axis but only in the spread, we can remove the axis and add a few details about the outliers to enrich the chart:

url = 'https://raw.githubusercontent.com/fivethirtyeight'
url += '/data/master/tennis-time/serve_times.csv'
event = pd.read_csv(url)

plt.figure(figsize=(7, 11))
title = 'Time in seconds between'
title += '\nend of marked point and next serve'
title += '\nat 2015 French Open'
plt.title(title, loc='left', fontsize=18, color='gray')
py = distribution_scatter(event.seconds_before_next_point, cmap='cool');


cut_h = np.percentile(event.seconds_before_next_point, 98)
outliers = event.seconds_before_next_point> cut_h


ha = {True: 'right', False: 'left'}
for x, y, c in zip(event[outliers].seconds_before_next_point,
                   py[outliers],
                   event[outliers].server):
    plt.text(x, y+.0005, c,
             ha=ha[x<0], va='bottom', fontsize=12)

plt.xlabel('Seconds before next serve', fontsize=15)
plt.gca().spines['left'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.yticks([])
plt.xticks(np.arange(5, 41, 5))
plt.xlim([5, 40])
plt.show()



Friday, May 17, 2019

Feelings toward immigration of people from other EU Member States in November 2018

In this post we will see a snippet showing how to plot part of the results of the Eurobarometer survey released last March. In particular, we will focus on the responses to the following question:
Please tell me whether the following statement evokes a positive or negative feeling for you: Immigration of people from other EU Member States.
The data from the main spreadsheet reporting the results country by country was isolated in a csv file (then uploaded on github) so that it could be easily loaded in Pandas as follows:
import pandas as pd

# github gist
gist = 'https://gist.githubusercontent.com/JustGlowing/'
gist += '2c25b9b153192baf573ce3b744ea6a65/raw/'
gist += '5f3888f7f42caca58b2418ec5822425083b6d559/'
gist += 'immigration_from_EU_eurobarometer_2018.csv'
df = pd.read_csv(gist, index_col=0)
df = df[df.index.map(lambda x: not '\n' in x)]
df.sort_values(by=["Total 'Positive'"], inplace=True)

# from https://ec.europa.eu/eurostat/statistics-explained/index.php
country_names = {'BE' : 'Belgium',
'BG' : 'Bulgaria',
'CZ' : 'Czechia',
'DK' : 'Denmark',
'DE' : 'Germany',
'EE' : 'Estonia',
'IE' : 'Ireland',
'EL' : 'Greece',
'ES' : 'Spain',
'FR' : 'France',
'HR' : 'Croatia',
'IT' : 'Italy',
'CY' : 'Cyprus',
'LV' : 'Latvia',
'LT' : 'Lithuania',
'LU' : 'Luxembourg',
'HU' : 'Hungary',
'MT' : 'Malta',
'NL' : 'Netherlands',
'AT' : 'Austria',
'PL' : 'Poland',
'PT' : 'Portugal',
'RO' : 'Romania',
'SI' : 'Slovenia',
'SK' : 'Slovakia',
'FI' : 'Finland',
'SE' : 'Sweden',
'UK' : 'United Kingdom'}

df.index = df.index.map(country_names.get)
The idea is to create a bar chart with two sides, positive responses on the right and negative ones on the left. To do this, we can use the function barh, whose left parameter can be used to stack the two subsets of responses ("Fairly positive/negative" and "Very positive/negative"). The xticks also need to be adapted to reflect that the left side of the axis doesn't report values below zero. Here's the snippet:
import matplotlib.pyplot as plt
import numpy as np

country_idx = range(len(df))

plt.figure(figsize=(11, 14))
plt.barh(country_idx, df['Fairly positive'],
         color='deepskyblue',label='Fairly positive')
plt.barh(country_idx, df['Very positive'], left=df['Fairly positive'],
         color='dodgerblue', label='Very positive')
plt.barh(country_idx, -df['Fairly negative'],
         color='tomato', label='Fairly negative')
plt.barh(country_idx, -df['Very negative'], left=-df['Fairly negative'],
         color='firebrick', label='Very negative')

plt.yticks(country_idx, df.index)
plt.xlim([-100, 100])
plt.xticks(np.arange(-100, 101, 25), np.abs(np.arange(-100, 101, 25)))
plt.ylim([-.5, len(df)-.5])
title = 'Feelings toward immigration of people from\n'
title += 'other EU Member States in November 2018'
plt.title(title)
xlbl = 'negative            <<<       % responses       >>>            positive'
plt.xlabel(xlbl)
plt.legend(loc='lower right')

bbox_props = dict(fc="white", ec="k", lw=2) 
plt.text(-95, 27, 'twitter: @justglowing \nhttps://glowingpython.blogspot.com',
         ha="left", va="center", size=11, bbox=bbox_props)
plt.show()


From the chart we note that the percentage of positive responses per country is mostly above 50%, while the negative ones reach 50% only in two cases. We also see that Ireland and Sweden are the countries with the most positive responses, while Czechia (yes, that's the Czech Republic :) is the country with the most negative responses, though Cypriots gave a similar number of "Very negative" responses.
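The first observation can also be verified with one line on the dataframe loaded above. A small sketch (not in the original post; it assumes the Total 'Positive' column holds numeric percentages):
majority = (df["Total 'Positive'"] > 50).sum()
print('%d of %d countries have a majority of positive responses' % (majority, len(df)))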

Friday, June 29, 2018

Plotting a calendar in matplotlib

And here's a function to plot a compact calendar with matplotlib:
import calendar
import numpy as np
from matplotlib.patches import Rectangle
import matplotlib.pyplot as plt

def plot_calendar(days, months):
    plt.figure(figsize=(9, 3))
    # days that don't exist (e.g. February 30) are grayed out
    ax = plt.gca().axes
    ax.add_patch(Rectangle((29, 2), width=.8, height=.8, 
                           color='gray', alpha=.3))
    ax.add_patch(Rectangle((30, 2), width=.8, height=.8,
                           color='gray', alpha=.5))
    ax.add_patch(Rectangle((31, 2), width=.8, height=.8,
                           color='gray', alpha=.5))
    ax.add_patch(Rectangle((31, 4), width=.8, height=.8,
                           color='gray', alpha=.5))
    ax.add_patch(Rectangle((31, 6), width=.8, height=.8,
                           color='gray', alpha=.5))
    ax.add_patch(Rectangle((31, 9), width=.8, height=.8,
                           color='gray', alpha=.5))
    ax.add_patch(Rectangle((31, 11), width=.8, height=.8,
                           color='gray', alpha=.5))
    for d, m in zip(days, months):
        ax.add_patch(Rectangle((d, m), 
                               width=.8, height=.8, color='C0'))
    plt.yticks(np.arange(1, 13)+.5, list(calendar.month_abbr)[1:])
    plt.xticks(np.arange(1,32)+.5, np.arange(1,32))
    plt.xlim(1, 32)
    plt.ylim(1, 13)
    plt.gca().invert_yaxis()
    # remove borders and ticks
    for spine in plt.gca().spines.values():
        spine.set_visible(False)
    plt.tick_params(top=False, bottom=False, left=False, right=False)
    plt.show()


You can use it to highlight the days with full moon in 2018:
full_moon_day = [2, 31, 2, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22]
full_moon_month = [1, 1, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
plot_calendar(full_moon_day, full_moon_month)

Or just highlight all the weekends:
from datetime import datetime, timedelta

def get_weekends(year):
    weekend_day = []
    weekend_month = [] 
    start = datetime(year, 1, 1)
    for i in range(365):
        day_of_the_year = start + timedelta(days=i)
        if day_of_the_year.weekday() > 4:
            weekend_day.append(day_of_the_year.day)
            weekend_month.append(day_of_the_year.month)
    return weekend_day, weekend_month

weekend_day, weekend_month = get_weekends(2018)
plot_calendar(weekend_day, weekend_month)

Friday, June 16, 2017

A heatmap of male to female ratios with Seaborn

In this post we will see how to create a heatmap with seaborn. We'll use a dataset from the Wittgenstein Centre Data Explorer. The data extracted is also reported here in csv format. It contains the ratio of males to females in the population by age, from 1970 to 2015 (data reported after this period is projected). First, we import the data using Pandas:
import pandas as pd
import numpy as np

sex_ratios = pd.read_csv('m2f_ratios.csv', skiprows=8)

age_code = {a: i for i,a in enumerate(sex_ratios.Age.unique())}
age_label = {i: a for i,a in enumerate(sex_ratios.Age.unique())}
sex_ratios['AgeCode'] = sex_ratios.Age.apply(lambda x: age_code[x])

area_idx = sex_ratios.Area == \
           'United Kingdom of Great Britain and Northern Ireland'
years_idx = sex_ratios.Year <= 2015
sex_ratios_uk = sex_ratios[np.logical_and(years_idx, area_idx)]
Here we take care of the age coding and isolate the data for the United Kingdom and Northern Ireland. Now we can rearrange the data to see the ratio per year and age group using a pivot table, and then visualize the result using the heatmap function from seaborn:
import matplotlib.pyplot as plt
import seaborn as sns

pivot_uk = sex_ratios_uk.pivot_table(values='Ratio', 
                                     index='AgeCode', 
                                     columns='Year')
pivot_uk.index = [age_label[a] for a in pivot_uk.index.values]

plt.figure(figsize=(10, 8))
plt.title('Sex ratio per year and age groups')
sns.heatmap(pivot_uk, annot=True)
plt.show()

For each year we see that the ratio was above 1 (in favor of males) for young ages; it then becomes lower than 1 during adulthood and keeps decreasing with age. It also seems that over time the ratio decreases more slowly. For example, we see that the age group 70-74 had a ratio of 0.63 in 1970, while the ratio in 2015 was 0.9.
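That last comparison can be read directly from the pivot table. A small sketch (not in the original post, assuming the years appear as integer column labels):
# how much the ratio changed between the first and the last observed year,
# for each age group
print((pivot_uk[2015] - pivot_uk[1970]).round(2))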

Thursday, April 9, 2015

Stacked area plots with matplotlib

In a stacked area plot, the values on the y axis are accumulated at each x position and the area between the resulting values is then filled. These plots are helpful when it comes to comparing quantities through time. For example, consider the monthly totals of the number of new cases of measles, mumps, and chicken pox for New York City during the years 1931-1971 (data that we already considered here). We can compare the number of cases of each disease month by month. First, we need to load and organize the data properly:
import numpy as np
from scipy.io import loadmat

NYCdiseases = loadmat('NYCDiseases.mat') # loading a matlab file
chickenpox = np.sum(NYCdiseases['chickenPox'],axis=0)
mumps = np.sum(NYCdiseases['mumps'],axis=0)
measles = np.sum(NYCdiseases['measles'],axis=0)
In the snippet above we read the data from a Matlab file and summed the number of cases for each month. We are now ready to visualize our values using the function stackplot:
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt

plt.stackplot(np.arange(12)+1,
          [chickenpox, mumps, measles], 
          colors=['#377EB8','#55BA87','#7E1137'])
plt.xlim(1,12)

# creating the legend manually
plt.legend([mpatches.Patch(color='#377EB8'),  
            mpatches.Patch(color='#55BA87'), 
            mpatches.Patch(color='#7E1137')], 
           ['chickenpox','mumps','measles'])
plt.show()
The result is as follows:


We note that the highest number of cases happens between January and July; we also see that measles cases are more common than mumps and chicken pox cases.
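As a side note, the accumulation described at the beginning of the post can be reproduced by hand, which may help clarify what stackplot does under the hood. A small sketch (not in the original post) using the same series and colors:
# manual equivalent of stackplot: cumulative sums filled on top of each other
months = np.arange(12) + 1
cumulated = np.cumsum([chickenpox, mumps, measles], axis=0)
plt.fill_between(months, 0, cumulated[0], color='#377EB8')
plt.fill_between(months, cumulated[0], cumulated[1], color='#55BA87')
plt.fill_between(months, cumulated[1], cumulated[2], color='#7E1137')
plt.xlim(1, 12)
plt.show()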

Wednesday, August 27, 2014

Visualizing electricity prices with Plotly

We have already mentioned Plotly many times (here are two other posts about it) and this time we'll see how to use it to build an interactive visualization of the latest data about domestic electricity prices provided by the International Energy Agency (IEA).

The chart we are going to make will show the domestic electricity prices of the countries monitored by the IEA in 2013, with a bar for each country displaying the electricity price and the fraction of the price represented by taxes.

First, we import the data (the full data is available here, in this post we'll use only Table 5.5.1 in csv format) using pandas:
import pandas as pd
ieaprices = pd.read_csv('iea_prices.csv',
                        na_values=('..','+','-','+/-'))
ieaprices = ieaprices.dropna()
ieaprices.set_index(['Country'],inplace=True)
countries = ieaprices.sort('2013_with_tax').index
Then, we arrange the data in order to create a Plotly bar chart:
from plotly.graph_objs import Bar,Data,Layout,Figure
from plotly.graph_objs import XAxis,YAxis,Marker,Scatter,Legend

prices_bars = []

# computing the taxes
taxes = ieaprices['2013_with_tax']-ieaprices['2013_no_tax']

# adding the prices to the chart
prices_bars.append(Bar(x=countries.values, 
             y=ieaprices['2013_no_tax'].ix[countries].values,
             marker=Marker(color='#0074D9'),
             name='price without taxes'))

# adding the taxes to the chart
prices_bars.append(Bar(x=countries.values, 
             y=taxes.ix[countries].values,
             marker=Marker(color='#0099D9'),name='taxes'))
And now we are ready to submit the data to the plotly server to render the chart:
import plotly.plotly as py

py.sign_in("SexyUser", "asexykeyforasexyuser")

median_line = Scatter(
    x=countries.values,
    y=np.ones(len(countries))*ieaprices['2013_with_tax'].median(),
    marker=Marker(color='rgb(40, 40, 40)'),
    opacity=0.5,
    mode='lines',
    name='Median')

data = Data(prices_bars+[median_line])

layout = Layout(
    title='Domestic electricity prices in the IEA in 2013',
    xaxis=XAxis(type='category'),
    yaxis=YAxis(title='Price (Pence per Kwh)'),
    legend=Legend(x=0.0,y=1.0),
    barmode='stack',
    hovermode='closest')

fig = Figure(data=data, layout=layout)

# this line will work only in ipython
# use py.plot() in other environments
plot_url = py.iplot(fig, filename='ieaprices2013') 
The result should look like this:

Looking at the chart we note that, during 2013, the average domestic electricity prices, including taxes, in Denmark and Germany were the highest in the IEA. We also note that in Denmark the fraction of taxes paid is higher than the actual electricity price, whereas in Germany the actual electricity price and the taxes are almost the same. Interestingly, the USA has the lowest price and the lowest taxation.
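The tax observation can be double-checked on the dataframe. A small sketch (not in the original post; note that sort_values is the modern pandas spelling of the older sort used above):
# fraction of the final price that goes into taxes, highest first
tax_share = taxes / ieaprices['2013_with_tax']
print(tax_share.sort_values(ascending=False).head())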

This post shows how to create one of the charts commented on here, where more insights about the IEA data are provided.

Thursday, December 12, 2013

Multiple axes and subplots in Plotly

Some time ago we saw how to visualize 2D histograms with Plotly and in this post we will see how to use one of the most interesting new features introduced by the Plotly guys: multiple axes in subplots. This feature lets us couple subplots, so when you zoom or pan in one subplot, it zooms and pans in the other subplots, just like the graphs produced by D3.
Here's how to plot a subplot matrix where each cell is a scatterplot between two features of the Iris dataset that we already used here. The first thing we need to do is convert the data into the format required by the Plotly API:
from sklearn.datasets import load_iris
iris = load_iris()

attr = [f.replace(' (cm)', '') for f in iris.feature_names]
colors = {'setosa': 'rgb(31, 119, 180)', 
          'versicolor': 'rgb(255, 127, 14)', 
          'virginica': 'rgb(44, 160, 44)'}

data = []
for i in range(4):
    for j in range(4):
        for t,flower in enumerate(iris.target_names):
            data.append({"name": flower, 
                         "x": iris.data[iris.target == t,i],
                         "y": iris.data[iris.target == t,j],
                         "type":"scatter", "mode":"markers",
                         'marker': {'color': colors[flower], 
                                    'opacity':0.7},
                         "xaxis": "x"+(str(i) if i!=0 else ''),
                         "yaxis": "y"+(str(j) if j!=0 else '')})
Then, we create a layout to adjust the look and feel:
d = 0.04; # padding
dms = [[i*d+i*(1-3*d)/4,i*d+((i+1)*(1-3*d)/4)] for i in range(4)]

layout = {
    "xaxis":{"domain":dms[0], "title":attr[0], 
             'zeroline':False,'showline':False},
    "yaxis":{"domain":dms[0], "title":attr[0], 
             'zeroline':False,'showline':False},
    "xaxis1":{"domain":dms[1], "title":attr[1], 
              'zeroline':False,'showline':False},
    "yaxis1":{"domain":dms[1], "title":attr[1], 
              'zeroline':False,'showline':False},
    "xaxis2":{"domain":dms[2], "title":attr[2], 
              'zeroline':False,'showline':False},
    "yaxis2":{"domain":dms[2], "title":attr[2], 
              'zeroline':False,'showline':False},
    "xaxis3":{"domain":dms[3], "title":attr[3], 
              'zeroline':False,'showline':False},
    "yaxis3":{"domain":dms[3], "title":attr[3], 
              'zeroline':False,'showline':False},
    "showlegend":False,
    "width": 500,
    "height": 550,
    "title":"Iris flower data set",
    "titlefont":{'color':'rgb(67,67,67)', 'size': 20}
    }
Finally, we import the plotly module (see this page for more details about the installation) and we are ready to invoke the Plotly remote service:
import plotly
p = plotly.plotly('supersexyusername', 'mysecretkey')
# iplot shows the graph in the ipython notebook
# use plot if you're outside of the notebook
p.iplot(data,layout=layout, width=500,height=550)
The result should be as follows. This interactive graph of the Iris data set was inspired by this wonderful D3 example by Mike Bostock. Find more examples of Plotly visualizations in Python inside the IPython notebook here.

Sunday, September 15, 2013

Self Organizing Maps

Notice: for an updated tutorial on how to use MiniSom refer to the examples in the official documentation.
The Self Organizing Maps (SOM), also known as Kohonen maps, are a type of Artificial Neural Networks able to convert complex, nonlinear statistical relationships between high-dimensional data items into simple geometric relationships on a low-dimensional display. In a SOM the neurons are organized in a bidimensional lattice and each neuron is fully connected to all the source nodes in the input layer. An illustration of the SOM by Haykin (1999) is the following


Each neuron n has an associated vector wn of weights. The process for training a SOM involves stepping through several training iterations until the items in your dataset are learnt by the SOM. For each pattern x one neuron n will "win" (which means that wn is the weights vector most similar to x) and this winning neuron will have its weights adjusted so that it will have a stronger response to the input the next time it sees it (which means that the distance between x and wn will be smaller). As different neurons win for different patterns, their ability to recognize that particular pattern will increase. The training algorithm can be summarized as follows:
  1. Initialize the weights of each neuron.
  2. Initialize t = 0
  3. Randomly pick an input x from the dataset
  4. Determine the winning neuron i as the neuron whose weights are closest to x:
     i = argmin_n || x - w_n(t) ||
  5. Adapt the weights of each neuron n according to the following rule:
     w_n(t+1) = w_n(t) + η(t) h(i)(t) (x - w_n(t))
  6. Increment t by 1
  7. if t < tmax go to step 3
Here η(t) is called the learning rate and h(i) is called the neighborhood function, which has high values for i and for the neurons close to i on the lattice (a Gaussian centered on i is a good example of neighborhood function). As t increases, η decreases and h shrinks its spread. This way, at each training step the weights of the neurons close to the winning one are adjusted to have a stronger response to the current pattern. After the training process, the locations of the neurons become ordered and a meaningful coordinate system for the input features is created on the lattice. So, if we consider the coordinates of the associated winning neuron for each pattern, the SOM forms a topographic map of the input patterns.
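To make the update rule above more tangible, here is a deliberately simplified sketch of a single training step in plain numpy, with a Gaussian neighborhood function (it illustrates the algorithm only; MiniSom's actual implementation differs in the details):
import numpy as np

def som_update(weights, x, eta, sigma):
    """ One SOM step: weights has shape (rows, cols, dim), x has shape (dim,) """
    dists = np.linalg.norm(weights - x, axis=2)
    winner = np.unravel_index(np.argmin(dists), dists.shape)   # step 4
    rows, cols = np.indices(dists.shape)
    grid_dist2 = (rows - winner[0])**2 + (cols - winner[1])**2
    h = np.exp(-grid_dist2 / (2 * sigma**2))                   # neighborhood function
    weights += eta * h[..., None] * (x - weights)              # step 5
    return weights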

MiniSom is a minimalistic and Numpy based implementation of the SOM. I made it during the experiments for my thesis in order to have a fully hackable SOM algorithm, and lately I decided to release it on GitHub. The next part of this post will show how to train MiniSom on the Iris dataset and how to visualize the result. The first step is to import and normalize the data:
from numpy import genfromtxt,array,linalg,zeros,apply_along_axis

# reading the iris dataset in the csv format    
# (downloaded from http://aima.cs.berkeley.edu/data/iris.csv)
data = genfromtxt('iris.csv', delimiter=',',usecols=(0,1,2,3))
# normalization to unity of each pattern in the data
data = apply_along_axis(lambda x: x/linalg.norm(x),1,data)
The snippet above reads the dataset from a CSV and creates a matrix where each row corresponds to a pattern. In this case, each pattern has 4 dimensions. (Note that only the first 4 columns of the file are used because the fifth column contains the labels). The training process can be started as follows:
from minisom import MiniSom
### Initialization and training ###
som = MiniSom(7,7,4,sigma=1.0,learning_rate=0.5)
som.random_weights_init(data)
print("Training...")
som.train_random(data,100) # training with 100 iterations
print("\n...ready!")
Now we have a 7-by-7 SOM trained on our dataset. MiniSom uses a Gaussian as neighborhood function and its initial spread is specified with the parameter sigma, while the parameter learning_rate specifies the initial learning rate. The training algorithm decreases both parameters as training progresses. This allows a rapid initial training of the neural network that is then "fine tuned" as training progresses.
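As an illustration of what "decreases as training progresses" can look like, here is one common decay schedule (a sketch of the idea, not necessarily the exact formula used inside MiniSom):
def asymptotic_decay(initial_value, t, max_iter):
    # the value shrinks as t grows; after max_iter/2 steps it is halved
    return initial_value / (1 + t / (max_iter / 2.0))

# e.g. the learning rate 0.5 used above would decay like this over 100 iterations:
print([round(asymptotic_decay(0.5, t, 100), 3) for t in (0, 25, 50, 100)])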
To visualize the result of the training we can plot the average distance map of the weights and the coordinates of the associated winning neuron for each pattern:
from pylab import plot,axis,show,pcolor,colorbar,bone
bone()
pcolor(som.distance_map().T) # distance map as background
colorbar()
# loading the labels
target = genfromtxt('iris.csv',
                    delimiter=',',usecols=(4),dtype=str)
t = zeros(len(target),dtype=int)
t[target == 'setosa'] = 0
t[target == 'versicolor'] = 1
t[target == 'virginica'] = 2
# use different colors and markers for each label
markers = ['o','s','D']
colors = ['r','g','b']
for cnt,xx in enumerate(data):
 w = som.winner(xx) # getting the winner
 # place a marker on the winning position for the sample xx
 plot(w[0]+.5,w[1]+.5,markers[t[cnt]],markerfacecolor='None',
   markeredgecolor=colors[t[cnt]],markersize=12,markeredgewidth=2)
axis([0,som.weights.shape[0],0,som.weights.shape[1]])
show() # show the figure
The result should be like the following:


For each pattern in the dataset the corresponding winning neuron has been marked. Each type of marker represents a class of the Iris data (the classes are setosa, versicolor and virginica and they are respectively represented with red, green and blue colors). The average distance map of the weights is used as background (the values are shown in the colorbar on the right). As expected from previous studies on this dataset, the patterns are grouped according to the class they belong to, and a small fraction of Iris virginica is mixed with Iris versicolor.

For a more detailed explanation of the SOM algorithm you can look at its inventor's paper.

Friday, July 12, 2013

Hopalong fractals

Have you ever wondered what happens if you pick a point (x0,y0) and compute hundreds of points using these equations?
x_{n+1} = y_n - sign(x_n) * sqrt(| b*x_n + c |)
y_{n+1} = a - x_n
Well, you get a hopalong fractal.
Let's plot this fractal using Pylab. The following function computes n points using the equations above:
from __future__ import division
from numpy import sqrt,power

def hopalong(x0,y0,n,a=-55,b=-1,c=-42):
 def update(x,y):
  x1 = y-x/abs(x)*sqrt(abs(b*x+c))
  y1 = a-x
  return x1,y1
 xx = []
 yy = []
 for _ in xrange(n):
  x0,y0 = update(x0,y0) 
  xx.append(x0)
  yy.append(y0)
 return xx,yy
and this snippet computes 40000 points starting from (-1,10):
from pylab import scatter,show, cm, axis
from numpy import array,mean
x = -1
y = 10
n = 40000
xx,yy = hopalong(x,y,n)
cr = sqrt(power(array(xx)-mean(xx),2)+power(array(yy)-mean(yy),2))
scatter(xx, yy, marker='.', c=cr/max(cr), 
        edgecolor='w', cmap=cm.Dark2, s=50)
axis('equal')
show()
Here we have one of the possible hopalong fractals:


Varying the starting point and the values of a, b and c we have different fractals. Here are some of them:


(x=-1,y=0,a=.1,b=5,c=1)


(x=-1,y=0,a=5,b=1,c=5)
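The two variants above can be reproduced with the hopalong function defined earlier. A small sketch (not in the original post; the exact look of each fractal depends on these parameters and on the number of points):
from pylab import scatter, subplot, cm, axis, show
from numpy import array, sqrt, power, mean

for i, (a, b, c) in enumerate([(.1, 5, 1), (5, 1, 5)]):
    xx, yy = hopalong(-1, 0, 40000, a=a, b=b, c=c)
    cr = sqrt(power(array(xx)-mean(xx), 2)+power(array(yy)-mean(yy), 2))
    subplot(1, 2, i+1)
    scatter(xx, yy, marker='.', c=cr/max(cr),
            edgecolor='w', cmap=cm.Dark2, s=50)
    axis('equal')
show()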

Friday, June 28, 2013

Shape Matching Experiments

Often, in Computer Vision tasks we have a model of an interesting shape and we want to know if this model matches with a target shape found through the analysis of the edges of an image. The target shape is never identical to the model we have, because it could be a representation of the model in a different pose or it could match only partially our model, so we have to find a proper transformation of the model in order to match the target. In this post we will see how to find the closest shape model to a target shape using the fmin function provided by Scipy.
We can represent a shape with a matrix like the following:
X = | x1  x2  ...  xn |
    | y1  y2  ...  yn |
where each pair (xi,yi) is a "landmark" point of the shape and we can transform every generic point (x,y) using this function:
T(x, y) = s * | cos(θ)  -sin(θ) | * | x |  +  | tx |
              | sin(θ)   cos(θ) |   | y |     | ty |
In this transformation we have that θ is a rotation angle, tx and ty are the translation along the x and the y axis respectively, while s is a scaling coefficient. Applying this transformation to every point of the shape described by X we get a new shape which is a transformation of the original one according to the parameters used in T. Given a matrix as defined above, the following Python function is able to apply the transformation to every point of the shape represented by the matrix:
import math
import numpy as np
import pylab as pl
from scipy.optimize import fmin

def transform(X,theta,tx=0,ty=0,s=1):
 """ Performs scaling, rotation and translation
     on the set of points in the matrix X 
     
    Input:
      X, points (each column has to be a point [x, y])
      theta, rotation angle (in radiants)
      tx, translation along x
      ty, translation along y
      s, scaling coefficient
    Return:
      Y, transformed set of points
 """
 a = math.cos(theta)
 b = math.sin(theta)
 R = np.mat([[a*s, -b*s], [b*s, a*s]])
 Y = R*X # rotation and scaling
 Y[0,:] = Y[0,:]+tx # translation
 Y[1,:] = Y[1,:]+ty
 return Y
Now, we want to find the best pose (translation, scale and rotation) to match a model shape X to a target shape Y. We can solve this problem by minimizing the sum of squared distances between the points of X and the ones of Y, which means that we want to find
argmin over (θ, tx, ty, s) of || T'(X) - Y ||^2
where T' is a function that applies T to all the points of the input shape. This task turns out to be pretty easy using the fmin function provided by Scipy:
def error_fun(p,X,Y):
 """ cost function to minimize """
 return np.linalg.norm(transform(X,p[0],p[1],p[2],p[3])-Y)

def match(X,Y,p0):
 """ Match the model X with the shape Y
     returns the set of parameters to transform
     X into Y and a set of partial solutions.
 """
 p_opt = fmin(error_fun, p0, args=(X,Y),retall=True)
 return p_opt[0],p_opt[1] # optimal solutions, other solutions
In the code above we have defined a cost function which measures how close the two shapes are and another function that minimizes the difference between the two shapes and returns the transformation parameters. Now, we can try to match the trifoil shape starting from one of its transformations using the functions defined above:
# generating a shape to match
t = np.linspace(0,2*np.pi,50)
x = 2*np.cos(t)-.5*np.cos( 4*t )
y = 2*np.sin(t)-.5*np.sin( 4*t )
A = np.array([x,y]) # model shape
# transformation parameters
p = [-np.pi/2,.5,.8,1.5] # theta,tx,ty,s
Ar = transform(A,p[0],p[1],p[2],p[3]) # target shape
p0 = np.random.rand(4) # random starting parameters
p_opt, allsol = match(A,Ar,p0) # matching
It's interesting to visualize the error at each iteration:
pl.plot([error_fun(s,A,Ar) for s in allsol])
pl.show()


and the value of the parameters of the transformation at each iteration:
pl.plot(allsol)
pl.show()


From the graphs above we can observe that the minimization process found its solution after about 150 iterations. Indeed, after 150 iterations the error becomes very close to zero and there is almost no variation in the parameters. Now we can visualize the minimization process plotting some of the solutions in order to compare them with the target shape:
def plot_transform(A,p):
 Afound = transform(A,p[0],p[1],p[2],p[3])
 pl.plot(Afound[0,:].T,Afound[1,:].T,'-')

for k,i in enumerate(range(0,len(allsol),len(allsol)/15)):
 pl.subplot(4,4,k+1)
 pl.plot(Ar[0,:].T,Ar[1,:].T,'--',linewidth=2) # target shape
 plot_transform(A,allsol[i])
 pl.axis('equal')
pl.show()


In the graph above we can see that the initial solutions (the green ones) are very different from the shape we are trying to match (the dashed blue ones) and that during the minimization process they get closer to the target shape until they fit it completely. In the example above we used two shapes of the same form. Let's see what happens when the starting shape has a different form from the target model:
t = np.linspace(0,2*np.pi,50)
x = np.cos(t)
y = np.sin(t)
Ac = np.array([x,y]) # the circle (model shape)
p0 = np.random.rand(4) # random starting parameters
# the target shape is the trifoil again
p_opt, allsol = match(Ac,Ar,p0) # matching
In the example above we used a circle as model shape and a trifoil as target shape. Let's see what happened during the minimization process:
for k,i in enumerate(range(0,len(allsol),len(allsol)/15)):
 pl.subplot(4,4,k+1)
 pl.plot(Ar[0,:].T,Ar[1,:].T,'--',linewidth=2) # target shape
 plot_transform(Ac,allsol[i])
 pl.axis('equal')
pl.show()


In this case, where the model shape can't match the target shape completely, we see that the minimization process is able to find the circle that is closest to the trifoil.

Sunday, June 16, 2013

2D Histograms with Plotly

Plotly is an online tool that lets us create wonderful interactive visualizations of our data. It can plot data from csv files, spreadsheets, etc. but it also has a Python sandbox where we can put our Python snippets! In this post we will see a simple example that shows how to plot a 2D histogram in Plotly.

First, we need a snippet to generate some random sets of data:
from numpy import *
 
# generate some random sets of data
y0 = random.randn(100)/5. + 0.5 
x0 = random.randn(100)/5. + 0.5 
 
y1 = random.rayleigh(size=20)/7. + 0.1
x1 = random.rayleigh(size=20)/8. + 1.1
 
y2 = random.randn(50)/10. + 0.9
x2 = random.rayleigh(size=50)/10. + 0.1
 
y3 = random.randn(50)/8. + 0.1
x3 = random.randn(50)/8. + 0.1
 
y = concatenate([y0,y1,y2,y3])
x = concatenate([x0,x1,x2,x3])

The distribution of the variable x looks like:


The distribution of the variable y looks like:

And the 2D histogram of both variables looks like this:


As shown in the colorbar, cells with lighter colors correspond to the high density areas of our distribution.

All the plots above were made with Plotly inside their Python sandbox using the following code:
## place the data into Plotly's dict format

# histograms
histx = {'x': x, 'type':'histogramx'}
histy = {'y': y, 'type':'histogramy'}
hist2d = {'x': x, 'y': y, 'type':'histogram2d'}

# scatter plots above the 1D histograms
# "jitter" the scatter plot points to make their distribution easier to distinguish
jitterx = {'x': x, 'y': 60+3*random.rand((len(x))), 'type':'scatter','mode':'markers','marker':{'size':4,'opacity':0.5,'symbol':'square'}}

jittery = {'x': y, 'y': 35+3*random.rand((len(x))), 'type':'scatter','mode':'markers','marker':{'size':4,'opacity':0.5,'symbol':'square'}}

# scatter points in the 2D histogram
xy = {'x': x, 'y': y, 'type':'scatter','mode':'markers','marker':{'size':5,'opacity':0.5,'symbol':'square'}}

# NOTE: the following lines plot all the graph above
plot([histx, jitterx], layout={'title': 'Distribution of Variable 1'})
plot([histy, jittery], layout={'title': 'Distribution of Variable 2'})
plot([hist2d,xy], layout={'title': 'Distribution of Variable 1 and Variable 2'})
Plots made with Plotly automatically provide interactions (click-drag to zoom, double-click to autoscale, shift-click to pan) and are very easy to embed in a web page using the embedding snippet.

Thanks to the Plotly guys for providing the code of this post and this amazing tool :)

Saturday, May 25, 2013

A colorful Trifoil Knot

Some time ago I was looking for a cool way to visualize parametric equations and I started working on the trefoil knot. It is the simplest example of a nontrivial knot and it can be defined with the following parametric equations:
x(φ) = sin(φ) + 2*sin(2φ)
y(φ) = cos(φ) - 2*cos(2φ)
z(φ) = -sin(3φ),   with φ in [0, 2π]
After a while this snippet came out:
from numpy import linspace,pi,cos,sin
phi = linspace(0,2*pi,100)
x = sin(phi)+2*sin(2*phi)
y = cos(phi)-2*cos(2*phi)
z = -sin(3*phi)

from pylab import scatter,subplot,cm,show
from numpy import abs
s=abs(((y+3)+2)*10)
scatter(x,y,c=x,s=s,edgecolor='w',cmap=cm.gist_rainbow)
scatter(x/1.4,y/1.2,c=x,s=s,edgecolor='w',cmap=cm.gist_rainbow_r)
scatter(x/1.8,y/1.5,c=x,s=s,edgecolor='w',cmap=cm.gist_rainbow)
show()
So, here's my colorful version of the trefoil knot:

Thursday, April 11, 2013

Odd-Even sort visualized

The Odd/Even sort is a sorting algorithm which uses the concept of the Bubble Sort to move elements around. Unlike Bubble Sort, the Odd/Even sort compares disjoint pairs by using alternating odd and even index values, splitting the sorting into different phases. In the odd phase all the elements with an odd index i are compared with the adjacent element at index i+1, and in the even phase all the elements with an even index i are compared with the adjacent element at index i+1. These two phases are repeated until no exchanges are required. This is a Python snippet that lets us visualize the behavior of the algorithm:
import pylab
from random import shuffle

def oddeven_anim(a):
 imgidx = 0
 x = range(len(a))
 sort = False; 
 while not sort:
  sort = True;
  for i in range(1,len(a)-1,2): # odd phase
   if a[i] > a[i+1]:
    a[i+1], a[i] = a[i], a[i+1]
    sort = False;
  for i in range(0,len(a)-1,2): # even phase
   if a[i] > a[i+1]:
    a[i+1], a[i] = a[i], a[i+1]
    sort = False;
  pylab.plot(x,a,'k.',markersize=6)
  pylab.savefig("oddevensort/img" + '%04d' % imgidx + ".png")
  pylab.clf()
  imgidx =  imgidx + 1

# run the algorithm
a = range(300)
shuffle(a)
oddeven_anim(a)
As in the other examples of sorting visualization on this blog, we have an image for each step of the algorithm. The following video has been produced joining all the images (here is explained how):

Monday, April 1, 2013

Real-time Twitter analysis

The Twitter API is a great tool for analyzing tweets programmatically. In particular, the streaming API gives real time access to the global stream of tweets and, unlike a conventional REST API, it is used through a continuous connection to the Twitter servers. So it requires a persistent HTTP connection that stays open as long as you need to collect tweets. The typical workflow of an application which uses this API is the following:


The easiest way to handle HTTP streaming in Python is to use PyCurl, the Python bindings for the famous Curl network library. PyCurl allows you to provide a callback function that will be executed every time a new block of data is available. The following code is a simple demonstration of how we can use HTTP streaming with PyCurl in order to analyze a stream of tweets:
from __future__ import division
from collections import defaultdict
from pylab import barh,show,yticks
import pycurl
import simplejson
import sys
import nltk
import re

def plot_histogram(freq, mean):
 # using dict comprehensions to remove not frequent words
 topwords = {word : count 
             for word,count in freq.items() 
             if count > round(2*mean)}
 # plotting
 y = topwords.values()
 x = range(len(y))
 labels = topwords.keys()
 barh(x,y,align='center')
 yticks(x,labels)
 show()

class TwitterAnalyzer:
 def __init__(self):
  self.freq = defaultdict(int)
  self.cnt = 0
  self.mean = 0.0
  # composing the twitter stream url
  nyc_area = 'locations=-74,40,-73,41'
  self.url = "https://stream.twitter.com/1.1/statuses/filter.json?"+nyc_area

 def begin(self,usr,pws):
  """ 
    init and start the connection with twitter using pycurl
    usr and pws must be valid twitter credentials
  """
  self.conn = pycurl.Curl()
  # we use the basic authentication, 
  # in future oauth2 could be required
  self.conn.setopt(pycurl.USERPWD, "%s:%s" % (usr, pws))
  self.conn.setopt(pycurl.URL, self.url)
  self.conn.setopt(pycurl.WRITEFUNCTION, self.on_receive)
  self.conn.perform()

 def on_receive(self,data):
  """ Handles the arrive of a single tweet """
  self.cnt += 1
  tweet = simplejson.loads(data)
  # a little bit of natural language processing
  tokens = nltk.word_tokenize(tweet['text']) # tokenize
  tagged_sent = nltk.pos_tag(tokens) # Part Of Speech tagging
  for word,tag in tagged_sent:
   # filter sigle chars words and symbols
   if len(word) > 1 and re.match('[A-Za-z0-9 ]+',word):
    # consider only adjectives and nouns
    if tag == 'JJ' or tag == 'NN':
     self.freq[word] += 1 # keep the count
  # print the statistics every 50 tweets
  if self.cnt % 50 == 0:
   self.print_stats()

 def print_stats(self):
  maxc = 0
  sumc = 0
  for word,count in self.freq.items():
   if maxc < count:
    maxc = count
   sumc += count
  self.mean = sumc/len(self.freq)
  print '-------------------------------'
  print ' tweets analyzed:', self.cnt
  print ' words extracted:', len(self.freq)
  print '   max frequency:', maxc
  print '  mean frequency:', self.mean

 def close_and_plot(self,signal,frame):
  print ' Plotting...'  
  plot_histogram(self.freq,self.mean)
  sys.exit(0)
In the constructor of this class we initialize a dictionary that will contain the frequency of each word, a string that contains the url of the service we need to call (composed in order to query twitter for the tweets in NYC) and the variables cnt and mean to keep track of the number of tweets analyzed and of the mean frequency over all the words.
In the method begin, we use the PyCurl library for the authentication with Twitter and start the connection. In particular, we set the method on_receive as the callback function in charge of processing the incoming data. In this method the actual analysis is done: every tweet is split into tokens and a part of speech tagging is performed. Then, the frequency of all the words that are adjectives or nouns is updated.
The method print_stats is used to print our statistics on the console while close_and_plot plots a histogram using the frequencies in the dictionary and closes the program.
Let's use this class:
import signal
usr = 'supersexytwitteruser'
pws = "yessosexyiam"

ta = TwitterAnalyzer()
# invoke the close_and_plot() method when a SIGINT (Ctrl-C) arrives
signal.signal(signal.SIGINT, ta.close_and_plot)
ta.begin(usr,pws) # run the analysis
In the code above, a TwitterAnalyzer object is instantiated, its method close_and_plot is registered as handler for the SIGINT signal and, finally, begin is invoked.
This code starts a program which analyzes all the tweet of the New York Area in real time and prints the statistics every 50 tweets, just like follows:
-------------------------------
 tweets analyzed: 50
 words extracted: 110
   max frequency: 8
  mean frequency: 1.12727272727
-------------------------------
 tweets analyzed: 100
 words extracted: 200
   max frequency: 22
  mean frequency: 1.235
-------------------------------
 tweets analyzed: 150
 words extracted: 286
   max frequency: 29
  mean frequency: 1.26573426573
-------------------------------
 tweets analyzed: 200
 words extracted: 407
   max frequency: 39
  mean frequency: 1.31203931204
-------------------------------
 tweets analyzed: 250
 words extracted: 495
   max frequency: 49
  mean frequency: 1.37575757576
Pressing Ctrl-C we can stop the program and plot a bar chart of the adjectives and nouns detected. This is what I got in the morning of March 21, 2013:


We see that it is very common to post a link in a tweet (it turned out that http is considered a noun most of the time) and that the words day, today, good and morning were the most used during the analysis.

Monday, March 18, 2013

Binary Plots

A binary plot of an integer sequence is a plot of the binary representations of the successive terms, where each term is represented as a sequence of bits with 1s colored black and 0s colored white. The representations are then stacked to form a table where each entry represents a bit in the sequence.
To make a binary plot of a sequence we need to convert each term of the sequence into its binary representation, then we have to put this representation in a form which lets us build the plot easily. The following function converts the sequence into a matrix where the element (i, j) is the bit j of the element i of the sequence:
from numpy import zeros,uint8
from pylab import imshow,subplot,show

def seq_to_bitmatrix(s):
 """ Returns a matrix where the row i 
     is the binary representation of s[i] 
     s must be a list of integers (also long) """
 # maximum number of bits used in this sequence
 # (len(bin(.)) includes the '0b' prefix, hence the -2)
 maxbit = len(bin(max(s)))-2
 M = zeros((len(s),maxbit),dtype=uint8)
 for i,e in enumerate(s):
  bits = bin(e)[2:] # bin() -> 0b100, we drop 2 chars
  for j,b in enumerate(bits):
   M[i,maxbit-len(bits)+j] = int(b) # fill the matrix
 return M       
The matrix returned by the function above represents each bit with an integer. Of course, this is a waste of space, but it lets us use the function imshow to draw the binary plot:
def show_binary_matrix(M):
 imshow(M,cmap='Greys', interpolation='nearest')
 show()
Now we have everything we need. Let's plot the factorial sequence:
from math import factorial
s = [factorial(n) for n in range(200)]
show_binary_matrix(seq_to_bitmatrix(s))
The following graph shows the result:


From this graph we can observe two interesting things. The first is that we need more than 1200 bits to represent the biggest number of the sequence we generated, which is 199!, and the second is that as n grows the bits on the right tend to be unused (they are all zeros).
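The unused bits on the right have a neat explanation: they are the trailing zeros of n!, and their number equals the exponent of 2 in the factorization of n! (Legendre's formula). A small sketch (not in the original post) to check it:
def trailing_zero_bits(n):
    # exponent of 2 in n!: floor(n/2) + floor(n/4) + floor(n/8) + ...
    e, p = 0, 2
    while p <= n:
        e += n // p
        p *= 2
    return e

print(trailing_zero_bits(199))  # trailing zero bits in the last row of the plot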
Let's see what happens with the famous Fibonacci sequence:
fib_cache = {0:0,1:1}
def fib_memo(n):
 """ Returns the first n Fibonacci numbers """
 if n not in fib_cache:
  fib_cache[n] = fib_memo(n-1)+fib_memo(n-2)
 return fib_cache[n]

s = [fib_memo(n) for n in range(1,150)]
show_binary_matrix(seq_to_bitmatrix(s))
In the snippet above we compute the first 149 Fibonacci numbers using memoization, then we plot the sequence in the same way as the factorial. This time the result is surprising:


Indeed, it is very interesting to observe that the graph reveals a fractal like pattern of hollow and filled triangles.
Another sequence that shows a fractal like series of patterns is the sum of the first n natural numbers, which we can plot as follows:
s = [n*(n+1)/2 for n in range(124)]
show_binary_matrix(seq_to_bitmatrix(s).T)
In this case the Gauss formula has been used to compute the terms of the sequence and this is the result:


This time the matrix has been transposed to get a vertical visualization: each column is the binary representation of a term of the sequence. In conclusion, we plot the sequence n^p for p = 1,2,3,4:
# n^p
for p in range(1,5):
    s = [pow(n,p) for n in range(124)]
    subplot(4,1,p)
    imshow(seq_to_bitmatrix(s).T,cmap='Greys',  interpolation='nearest')
show()
And we have this:


We can see that for p=1,2 we have something that looks like the graph we obtained with the sequence of the sum of the first n natural numbers.