
Wednesday, November 11, 2020

Visualize the Dictionary of Obscure Words with T-SNE

I recently published a wrapper around The Dictionary of Obscure Words (originally from the website http://phrontistery.info) for Python, and in this post we'll see how to create a visualization that highlights a few entries from the dictionary using the dimensionality reduction technique called T-SNE. The dictionary is available on github at https://github.com/JustGlowing/obscure_words and can be installed as follows:
pip install git+https://github.com/JustGlowing/obscure_words
We can now import the dictionary and create a vectorial representation of each word:
import matplotlib.pyplot as plt
import numpy as np
from obscure_words import load_obscure_words
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

obscure_dict = load_obscure_words()
words = np.array(list(obscure_dict.keys()))
definitions = np.array(list(obscure_dict.values()))

vectorizer = TfidfVectorizer(stop_words=None)
X = vectorizer.fit_transform(definitions)

projector = TSNE(random_state=0)
# T-SNE doesn't accept sparse input with its default settings,
# so we convert the Tf-Idf matrix into a dense array first
XX = projector.fit_transform(X.toarray())
In the snippet above, we compute a Tf-Idf representation using the definition of each word. This gives us a vector for each word in our dictionary, but each of these vectors has as many elements as the total number of words used in all the definitions. Since we can't plot all the features extracted, we reduce our data to 2 dimensions using T-SNE. We now have a mapping that allows us to place each word at a point of a bi-dimensional space. One problem remains: how can we plot the words in a way that we can still read them? Here's a solution:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances

def textscatter(x, y, text, k=10):
    X = np.array([x, y]).T
    clustering = KMeans(n_clusters=k)
    scaler = StandardScaler()
    clustering.fit(scaler.fit_transform(X))
    centers = scaler.inverse_transform(clustering.cluster_centers_)
    selected = np.argmin(pairwise_distances(X, centers), axis=0)
    plt.scatter(x, y, s=6, c=clustering.predict(scaler.transform(X)), alpha=.05)
    for i in selected:
        plt.text(x[i], y[i], text[i], fontsize=10)

plt.figure(figsize=(16, 16))
textscatter(XX[:, 0], XX[:, 1], 
            [w+'\n'+d for w, d in zip(words, definitions)], 20)
plt.show()
In the function textscatter we segment all the points created in the previous steps into k clusters using K-Means, then we plot the word closest to each cluster center (together with its definition). Given the properties of K-Means, we know that the centers are distant from each other, and with the right choice of k we can maximize the number of words we can display. This is the result of the snippet above:
(click on the figure to see the entire chart)
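If the full definitions make the chart too crowded, the same function can also be called with the words alone (a minimal variation of the call above, reusing the variables already defined):
# plot only the selected words, without their definitions
plt.figure(figsize=(16, 16))
textscatter(XX[:, 0], XX[:, 1], words, 20)
plt.show()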

Sunday, May 17, 2020

Neural Networks Regularization made easy with sklearn and matplotlib

Using regularization has many benefits, the most common being the reduction of overfitting and the mitigation of multicollinearity issues. All of this is covered very well in the literature, especially in (Hastie et al.). However, without touching too many details, we can give a very straightforward interpretation of regularization: regularization is a way to constrain a model so that it learns less from the data. In this post we will experimentally show the effects of regularization on a Neural Network (a Multilayer Perceptron), validating this interpretation.

Let's define a goal for our Neural Network. We have a dataset H and we want to build a model able to reconstruct the same data. More formally, we want to build a function f that takes H as input and returns the same values, or an approximation as close as possible to H. We can say that we want f to minimize the following error:

E = || f(H) - H ||^2

To begin our experiment, we build a data matrix H that contains the coordinates of a stylized star:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

x= [np.cos(np.pi/2), 2/5*np.cos(np.pi/2+np.pi/5), np.cos(np.pi/2+2*np.pi/5), 
    2/5*np.cos(np.pi/2+3*np.pi/5), np.cos(np.pi/2+4*np.pi/5), 
    2/5*np.cos(3*np.pi/2), np.cos(np.pi/2+6*np.pi/5), 
    2/5*np.cos(np.pi/2+7*np.pi/5), np.cos(np.pi/2+8*np.pi/5),
    2/5*np.cos(np.pi/2+9*np.pi/5), np.cos(np.pi/2)]

y=[np.sin(np.pi/2), 2/5*np.sin(np.pi/2+np.pi/5), np.sin(np.pi/2+2*np.pi/5),
   2/5*np.sin(np.pi/2+3*np.pi/5), np.sin(np.pi/2+4*np.pi/5),
   2/5*np.sin(3*np.pi/2), np.sin(np.pi/2+6*np.pi/5),
   2/5*np.sin(np.pi/2+7*np.pi/5), np.sin(np.pi/2+8*np.pi/5),
   2/5*np.sin(np.pi/2+9*np.pi/5), np.sin(np.pi/2)]

# upsample the outline of the star via linear interpolation
xp = np.linspace(0, 1, len(x))
x = np.log(np.interp(np.linspace(0, 1, len(x)*10), xp, x)+1)

yp = np.linspace(0, 1, len(y))
y = np.interp(np.linspace(0, 1, len(y)*10), yp, y)
#y[::2] += .1

H = np.array([x, y]).T
plt.plot(H[:,0], H[:,1], '-o')


Now the matrix H contains the x coordinates of our star in the first column and the y coordinates in the second. With the help of sklearn, we can now train a Neural Network and plot the result:
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import scale

H = scale(H)

plt.figure()
f = MLPRegressor(hidden_layer_sizes=(200, 200, 200), alpha=0)
f.fit(H, H)
result = f.predict(H)
plt.plot(result[:,0], result[:,1], 'C3', linewidth=3, label='Neural Network')
plt.plot(H[:,0], H[:,1], 'o', alpha=.3, label='original')
plt.legend()
#plt.xlim([-0.1, 1.1])
#plt.ylim([-0.1, 1.1])
plt.show()



In the snippet above we created a Neural Network with 3 hidden layers of 200 neurons each. Then, we trained the model using H as both input and output data. In the chart we compare the original data with the estimation. It's easy to see that there are small differences between the two stars, but they are very close.
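As a quick check (a sketch reusing the variables already in scope), we can also put a number on how close the reconstruction is by computing the error defined at the beginning of the post:
# sum of squared differences between the network output and the original data
reconstruction_error = np.sum((result - H)**2)
print(reconstruction_error)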
It's important to notice here that we initialized MLPRegressor using alpha=0. This parameter controls the regularization of the model and the higher its value, the more regularization we apply. To understand how alpha affects the learning of the model, we need to add a term to the error just introduced:

E = || f(H) - H ||^2 + alpha * || W ||^2

where W is a matrix of all the weights in the network. The error not only takes into account the difference between the output of the function and the data, but also the size of the weights of the connections of the network. Hence, the higher alpha, the less the model is free to learn. If we set alpha to 20 in our experiment we have the following result:



We still achieve an approximation of the star, but the output of our model is smaller than before and the edges of the star are smoother.
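This result can be reproduced with the same code used earlier, just changing the value of alpha (a sketch reusing H and the imports above):
# same model as before, this time with a strong regularization term
f = MLPRegressor(hidden_layer_sizes=(200, 200, 200), alpha=20)
f.fit(H, H)
result = f.predict(H)

plt.figure()
plt.plot(result[:,0], result[:,1], 'C3', linewidth=3, label='Neural Network')
plt.plot(H[:,0], H[:,1], 'o', alpha=.3, label='original')
plt.legend()
plt.show()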
Increasing alpha gradually is a good way to understand the effects of regularization:
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

fig, ax = plt.subplots()
ln, = plt.plot([], [], 'C3', linewidth=3, label='Neural Network')
plt.plot(H[:,0], H[:,1], 'o', alpha=.3, label='original')
plt.legend()

def update(frame):
    f = MLPRegressor(hidden_layer_sizes=(200, 200, 200), alpha=frame)
    f.fit(H, H)
    result = f.predict(H)
    ln.set_data(result[:,0], result[:,1])
    plt.title('alpha = %.2f' % frame)
    return ln,

ani = FuncAnimation(fig, update, frames=np.linspace(0, 40, 100), blit=True)
HTML(ani.to_html5_video())



Here we vary alpha from 0 to 40 and plot the result for each value. We notice that not only does the star get smaller and smoother as alpha increases, but also that the network tends to preserve the long lines as much as possible, drifting away from the corners, which weigh less on the error function. Finally, we see that the result degenerates into a single point when alpha is too high.

Friday, June 7, 2019

Exporting Decision Trees in textual format with sklearn

In the past we have covered Decision Trees, showing how interpretable these models can be (see the tutorials here). In the previous tutorials we exported the rules of the models using the function export_graphviz from sklearn and visualized its output in a graphical way with an external tool which is not easy to install in some cases. Luckily, since version 0.21.2, scikit-learn offers the possibility to export Decision Trees in a textual format (I implemented this feature personally ^_^) and in this post we will see an example of how to use this new feature.

Let's train a tree with 2 layers on the famous iris dataset using all the data and print the resulting rules using the brand new function export_text:
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text
from sklearn.datasets import load_iris

iris = load_iris()
X = iris['data']
y = ['setosa']*50+['versicolor']*50+['virginica']*50
decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
decision_tree = decision_tree.fit(X, y)
r = export_text(decision_tree, feature_names=iris['feature_names'])
print(r)
|--- petal width (cm) <= 0.80
|   |--- class: setosa
|--- petal width (cm) >  0.80
|   |--- petal width (cm) <= 1.75
|   |   |--- class: versicolor
|   |--- petal width (cm) >  1.75
|   |   |--- class: virginica
Reading them, we note that if the feature petal width is less than or equal to 0.8 cm, the samples are always classified as setosa. Otherwise, if the petal width is less than or equal to 1.75 cm they're classified as versicolor, or as virginica if the petal width is greater than 1.75 cm. This model might well suffer from overfitting, but it tells us some important details of the data. It's easy to note that the petal width is the only feature used; we could even say that the petal width is small for setosa samples, medium for versicolor and large for virginica.
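To double check this reading of the rules, we can ask the fitted tree to classify a few hypothetical samples (the values below are made up; given the rules above, only the last feature, the petal width, matters):
# three made-up samples with petal width of 0.2, 1.5 and 2.3 cm respectively
samples = [[5.0, 3.5, 1.5, 0.2],
           [6.0, 3.0, 4.5, 1.5],
           [6.5, 3.0, 5.5, 2.3]]
print(decision_tree.predict(samples))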

To understand how the rules separate the labels we can also print the number of samples from each class (class weights) on the leaves:
r = export_text(decision_tree, feature_names=iris['feature_names'],
                decimals=0, show_weights=True)
print(r)
|--- petal width (cm) <= 1
|   |--- weights: [50, 0, 0] class: setosa
|--- petal width (cm) >  1
|   |--- petal width (cm) <= 2
|   |   |--- weights: [0, 49, 5] class: versicolor
|   |--- petal width (cm) >  2
|   |   |--- weights: [0, 1, 45] class: virginica
Here we have the number of samples per class in square brackets. Recalling that we have 50 samples per class, we see that all the samples labeled as setosa are correctly modelled by the tree, while for 5 virginica and 1 versicolor samples the model fails to capture the information given by the label.
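Since 144 of the 150 training samples end up in a leaf with the right label, we expect a training accuracy of 144/150 = 0.96, which we can verify directly using the model trained above:
# fraction of training samples classified correctly by the tree
print(decision_tree.score(X, y))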

Check out the documentation of the function export_text to discover all its capabilities here.

Monday, October 9, 2017

Spotting outliers with Isolation Forest using sklearn

Isolation Forest is an algorithm to detect outliers. It partitions the data using a set of trees and provides an anomaly score based on how isolated each point is in the resulting structure; the anomaly score is then used to tell outliers apart from normal observations. In this post we will see an example of how IsolationForest behaves in a simple case. First, we will generate 1-dimensional data from a bimodal distribution, then we will compare the anomaly score with the distribution of the data and highlight the regions where the outliers fall.

To start, let's generate the data and plot the histogram:
import numpy as np
import matplotlib.pyplot as plt

x = np.concatenate((np.random.normal(loc=-2, scale=.5,size=500), 
                    np.random.normal(loc=2, scale=.5, size=500)))

plt.hist(x, density=True)
plt.xlim([-5, 5])
plt.show()

Here we note that there are three regions where the data has a low probability to appear: one on the right side of the distribution, another one on the left and another around zero. Let's see if using IsolationForest we are able to identify these three regions:

from sklearn.ensemble import IsolationForest

isolation_forest = IsolationForest(n_estimators=100)
isolation_forest.fit(x.reshape(-1, 1))

xx = np.linspace(-6, 6, 100).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)

plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score), 
                 where=outlier==-1, color='r', 
                 alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('x')
plt.xlim([-5, 5])
plt.show()

In the snippet above we have trained our IsolationForest using the data generated, computed the anomaly score over a grid of values covering the range of the data, and classified each point of the grid as outlier or non-outlier. The chart shows the anomaly score and the regions where the outliers are. As expected, the anomaly score reflects the shape of the underlying distribution and the outlier regions correspond to low probability areas.
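The same fitted model can also be used to check individual values; given the chart above, we would expect a point near zero to be flagged as an outlier (-1) and a point near one of the two modes to be considered normal (1). A minimal sketch, reusing the model trained above:
# predict returns -1 for outliers and 1 for normal observations
print(isolation_forest.predict([[0.0], [2.0]]))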