K Nearest Neighbor
1. KNN Theory
1.1 Type of algorithm
KNN can be used for both classification and regression predictive problems.
KNN falls in the supervised learning family of algorithms.
Informally, this means that we are given a labelled dataset consisting of training observations (x, y) and would like to capture the relationship between x and y. More formally, our goal is to learn a function h : X → Y so that, given an unseen observation x, h(x) can confidently predict the corresponding output y.
1.2 Distance measure
In the classification setting, the K-nearest neighbor algorithm essentially boils down to forming a majority vote between the K most
similar instances to a given “unseen” observation.
Similarity is defined according to a distance metric between two data points.
The k-nearest-neighbor classifier is commonly based on the Euclidean distance between a test sample and the specified training
samples.
Let $x_i$ be an input sample with $p$ features $(x_{i1}, x_{i2}, \ldots, x_{ip})$, and let $n$ be the total number of input samples $(i = 1, 2, \ldots, n)$. The Euclidean distance between sample $x_i$ and sample $x_l$ is defined as:

$$d(x_i, x_l) = \sqrt{(x_{i1} - x_{l1})^2 + (x_{i2} - x_{l2})^2 + \ldots + (x_{ip} - x_{lp})^2}$$
Sometimes other measures can be more suitable for a given setting and include the Manhattan, Chebyshev and Hamming distance.
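To make these distance measures concrete, here is a small illustrative sketch (my own addition, not part of the original kernel) computing the Euclidean and Manhattan distances between two hypothetical samples a and b with NumPy:
import numpy as np

# two hypothetical samples with p = 4 features each
a = np.array([5.1, 3.5, 1.4, 0.2])
b = np.array([6.3, 3.3, 6.0, 2.5])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # square root of the summed squared differences
manhattan = np.sum(np.abs(a - b))          # sum of the absolute differences
print(euclidean, manhattan)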
1.3 Algorithm steps
STEP 1: Choose the number K of neighbors
STEP 2: Take the K nearest neighbors of the new data point, according to your distance metric
STEP 3: Among these K neighbors, count the number of data points in each category
STEP 4: Assign the new data point to the category where you counted the most neighbors (a rough sketch of these steps is given below)
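A rough sketch of these four steps (my own illustration; it assumes NumPy arrays X_train and y_train plus a single new sample x_new; the kernel's fuller implementation appears in section 5):
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # STEP 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the K nearest training points (K was chosen in STEP 1)
    nearest = np.argsort(distances)[:k]
    # STEP 3: count how many neighbors fall into each category
    votes = Counter(y_train[nearest])
    # STEP 4: assign the most common category among the K neighbors
    return votes.most_common(1)[0][0]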
2. Importing and preparation of data
2.1 Import libraries
In [1]:
import numpy as np
import pandas as pd
2.2 Load dataset
NOTE: The Iris dataset includes three iris species with 50 samples each, as well as some properties about each flower.
One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.
In [2]:
# Importing the dataset
dataset = pd.read_csv('../input/Iris.csv')
2.3 Summarize the Dataset
In [3]:
# We can get a quick idea of how many instances (rows) and how many attributes (columns)
# the data contains with the shape property.
dataset.shape
Out[3]:
(150, 6)
In [4]:
dataset.head(5)
Out[4]:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
In [5]:
dataset.describe()
Out[5]:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
count 150.000000 150.000000 150.000000 150.000000 150.000000
mean 75.500000 5.843333 3.054000 3.758667 1.198667
std 43.445368 0.828066 0.433594 1.764420 0.763161
min 1.000000 4.300000 2.000000 1.000000 0.100000
25% 38.250000 5.100000 2.800000 1.600000 0.300000
50% 75.500000 5.800000 3.000000 4.350000 1.300000
75% 112.750000 6.400000 3.300000 5.100000 1.800000
max 150.000000 7.900000 4.400000 6.900000 2.500000
In [6]:
# Let’s now take a look at the number of instances (rows) that belong to each class.
# We can view this as an absolute count.
dataset.groupby('Species').size()
Out[6]:
Species
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
dtype: int64
2.4 Dividing data into features and labels
NOTE: As we can see, the dataset contains six columns: Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm and Species.
The actual features are described by columns 1-4.
The last column contains the sample labels.
First, we need to split the data into two arrays: X (features) and y (labels).
In [7]:
feature_columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm','PetalWidthCm']
X = dataset[feature_columns].values
y = dataset['Species'].values
# Alternative way of selecting features and labels arrays:
# X = dataset.iloc[:, 1:5].values
# y = dataset.iloc[:, 5].values
2.5 Label encoding
NOTE: As we can see, the labels are categorical.
KNeighborsClassifier does not accept string labels.
We need to use LabelEncoder to transform them into numbers.
Iris-setosa corresponds to 0, Iris-versicolor to 1 and Iris-virginica to 2.
In [8]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
2.6 Splitting dataset into training set and test set
Let's split the dataset into a training set and a test set, to check later whether our classifier works correctly.
In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
Lastly, because the feature values are of the same order of magnitude, there is no need for feature scaling.
Nevertheless, in other circumstances it is extremely important to apply feature scaling before running classification algorithms.
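If scaling were needed, a minimal sketch (my own addition, not part of the original kernel) using scikit-learn's StandardScaler could look like the following; the scaler is fit on the training data only and then reused for the test data:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on the training data only
X_test_scaled = scaler.transform(X_test)        # apply the same scaling to the test data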
3. Data Visualization
In [10]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
3.1. Parallel Coordinates
Parallel coordinates is a plotting technique for plotting multivariate data.
It allows one to see clusters in data and to estimate other statistics visually.
Using parallel coordinates, points are represented as connected line segments.
Each vertical line represents one attribute.
One set of connected line segments represents one data point.
Points that tend to cluster will appear closer together.
In [11]:
from pandas.plotting import parallel_coordinates
plt.figure(figsize=(15,10))
parallel_coordinates(dataset.drop("Id", axis=1), "Species")
plt.title('Parallel Coordinates Plot', fontsize=20, fontweight='bold')
plt.xlabel('Features', fontsize=15)
plt.ylabel('Features values', fontsize=15)
plt.legend(loc=1, prop={'size': 15}, frameon=True,shadow=True, facecolor="white", edgecolor="black")
plt.show()
3.2. Andrews Curves
Andrews curves allow one to plot multivariate data as a large number of curves that are created using the attributes of samples as coefficients for a Fourier series.
By coloring these curves differently for each class, it is possible to visualize data clustering.
Curves belonging to samples of the same class will usually be closer together and form larger structures.
In [12]:
from pandas.plotting import andrews_curves
plt.figure(figsize=(15,10))
andrews_curves(dataset.drop("Id", axis=1), "Species")
plt.title('Andrews Curves Plot', fontsize=20, fontweight='bold')
plt.legend(loc=1, prop={'size': 15}, frameon=True,shadow=True, facecolor="white", edgecolor="black")
plt.show()
3.3. Pairplot
A pairplot is useful when you want to visualize the distribution of a variable or the relationship between multiple variables separately within subsets of your dataset.
In [13]:
plt.figure()
sns.pairplot(dataset.drop("Id", axis=1), hue = "Species", size=3, markers=["o", "s", "D"])
plt.show()
3.4. Boxplots
In [14]:
plt.figure()
dataset.drop("Id", axis=1).boxplot(by="Species", figsize=(15, 10))
plt.show()
3.5. 3D visualization
You can also try to visualize high-dimensional datasets in 3D using color, shape, size and other properties of 3D and 2D objects.
In this plot I used marker sizes to visualize the fourth dimension, which is Petal Width [cm].
In [15]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(1, figsize=(20, 15))
ax = Axes3D(fig, elev=48, azim=134)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y,
           cmap=plt.cm.Set1, edgecolor='k', s=X[:, 3]*50)
# label-to-name mapping follows the LabelEncoder output: setosa=0, versicolor=1, virginica=2
for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean(),
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'), size=25)
ax.set_title("3D visualization", fontsize=40)
ax.set_xlabel("Sepal Length [cm]", fontsize=25)
ax.w_xaxis.set_ticklabels([])
ax.set_ylabel("Sepal Width [cm]", fontsize=25)
ax.w_yaxis.set_ticklabels([])
ax.set_zlabel("Petal Length [cm]", fontsize=25)
ax.w_zaxis.set_ticklabels([])
plt.show()
4. Using KNN for classification
4.1. Making predictions
In [16]:
# Fitting classifier to the Training set
# Loading libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_score
# Instantiate learning model (k = 3)
classifier = KNeighborsClassifier(n_neighbors=3)
# Fitting the model
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
4.2. Evaluating predictions
Building confusion matrix:
In [17]:
cm = confusion_matrix(y_test, y_pred)
cm
Out[17]:
array([[11,  0,  0],
       [ 0, 12,  1],
       [ 0,  0,  6]])
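The diagonal of this matrix holds the correctly classified test samples, so the accuracy can be checked by hand: (11 + 12 + 6) / 30 = 29/30 ≈ 96.67 %, which matches the score computed below.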
Calculating model accuracy:
In [18]:
accuracy = accuracy_score(y_test, y_pred)*100
print('Accuracy of our model is equal to ' + str(round(accuracy, 2)) + ' %.')
Accuracy of our model is equal to 96.67 %.
4.3. Using cross-validation for parameter tuning:
In [19]:
# creating list of K for KNN
k_list = list(range(1,50,2))
# creating list of cv scores
cv_scores = []
# perform 10-fold cross validation
for k in k_list:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
In [20]:
# changing to misclassification error
MSE = [1 - x for x in cv_scores]
plt.figure()
plt.figure(figsize=(15,10))
plt.title('The optimal number of neighbors', fontsize=20, fontweight='bold')
plt.xlabel('Number of Neighbors K', fontsize=15)
plt.ylabel('Misclassification Error', fontsize=15)
sns.set_style("whitegrid")
plt.plot(k_list, MSE)
plt.show()
In [21]:
# finding best k
best_k = k_list[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d." % best_k)
The optimal number of neighbors is 9.
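As a follow-up, a minimal sketch (my own addition; tuned_classifier and tuned_accuracy are hypothetical names) refits the classifier with this tuned K and evaluates it on the held-out test set, reusing the objects already created above:
# refit with the K selected by cross-validation (best_k found above)
tuned_classifier = KNeighborsClassifier(n_neighbors=best_k)
tuned_classifier.fit(X_train, y_train)
tuned_accuracy = accuracy_score(y_test, tuned_classifier.predict(X_test)) * 100
print('Accuracy with K = %d is %.2f %%.' % (best_k, tuned_accuracy))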
5. My own KNN implementation
In [22]:
import numpy as np
import pandas as pd
import scipy as sp
class MyKNeighborsClassifier():
    """
    My implementation of KNN algorithm.
    """
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        """
        Fit the model using X as array of features and y as array of labels.
        """
        n_samples = X.shape[0]
        # number of neighbors can't be larger than number of samples
        if self.n_neighbors > n_samples:
            raise ValueError("Number of neighbors can't be larger than number of samples in training set.")
        # X and y need to have the same number of samples
        if X.shape[0] != y.shape[0]:
            raise ValueError("Number of samples in X and y need to be equal.")
        # finding and saving all possible class labels
        self.classes_ = np.unique(y)
        self.X = X
        self.y = y

    def predict(self, X_test):
        # number of predictions to make and number of features inside single sample
        n_predictions, n_features = X_test.shape
        # allocating space for array of predictions
        predictions = np.empty(n_predictions, dtype=int)
        # loop over all observations
        for i in range(n_predictions):
            # calculation of single prediction
            predictions[i] = single_prediction(self.X, self.y, X_test[i, :], self.n_neighbors)
        return(predictions)
In [23]:
def single_prediction(X, y, x_train, k):
    # number of samples inside training set
    n_samples = X.shape[0]
    # create array for distances and targets
    distances = np.empty(n_samples, dtype=np.float64)
    # distance calculation (squared Euclidean distance)
    for i in range(n_samples):
        distances[i] = (x_train - X[i]).dot(x_train - X[i])
    # combining arrays as columns
    distances = np.c_[distances, y]
    # sorting array by value of first column
    sorted_distances = distances[distances[:, 0].argsort()]
    # selecting labels associated with k smallest distances
    targets = sorted_distances[0:k, 1]
    unique, counts = np.unique(targets, return_counts=True)
    return(unique[np.argmax(counts)])
In [24]:
# Instantiate learning model (k = 3)
my_classifier = MyKNeighborsClassifier(n_neighbors=3)
# Fitting the model
my_classifier.fit(X_train, y_train)
# Predicting the Test set results
my_y_pred = my_classifier.predict(X_test)
In [25]:
accuracy = accuracy_score(y_test, my_y_pred)*100
print('Accuracy of our model is equal to ' + str(round(accuracy, 2)) + ' %.')
Accuracy of our model is equal to 96.67 %.
6. Bibliography
1. MIT Lecture: https://www.youtube.com/watch?v=09mb78oiPkA
2. Iris dataset: https://www.kaggle.com/uciml/iris
3. Theory: http://www.scholarpedia.org/article/K-nearest_neighbor
4. https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
5. https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/
6. https://www.analyticsvidhya.com/blog/2014/10/introduction-k-neighbours-algorithm-clustering/