Data Mining - Practical, Fourth Class - IT, Dr. Eman Hato
5. Data Classification
Classification is a data analysis technique used to categorize data into different
classes. It is a supervised learning technique that works in two phases: a training
phase and a testing phase. During the training phase (also known as the learning
phase), a classifier model is constructed by training a classification algorithm with a
predetermined set of training data inputs. In the testing phase, the classifier model
is used to predict the class labels of the test data.
K-Nearest Neighbors Algorithm (KNN)
The k-nearest neighbors (KNN) algorithm is a simple supervised machine learning
algorithm that can be used to solve both classification and regression problems.
KNN calculates the similarity between an input test sample and each training sample
in order to assign the test sample to the category it is most similar to among the
categories of the training set. KNN selects the specified number of samples (K)
closest to the input test sample and then votes for the most frequent category among
them. The right K for a data set is chosen by trying several values of K and picking
the one that works best.
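This K-selection process can be sketched in Python (an illustrative sketch only; the tiny data set, the held-out samples, and the candidate K values below are invented for the example):

```python
import math
from collections import Counter

def predict(train, labels, test, k):
    # Majority class among the k training samples closest to `test`.
    nearest = sorted(zip((math.dist(s, test) for s in train), labels))[:k]
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]

# Training data and held-out samples with known labels; each candidate K
# is scored by how many held-out samples it classifies correctly.
train = [(1, 1), (2, 1), (8, 8), (9, 9), (1, 2), (8, 9)]
labels = ["A", "A", "B", "B", "A", "B"]
held_out = [((2, 2), "A"), ((9, 8), "B"), ((1, 3), "A")]

best_k = max([1, 3, 5],
             key=lambda k: sum(predict(train, labels, x, k) == y
                               for x, y in held_out))
print(best_k)
```

In practice the held-out set would be much larger, and the candidate K values are usually odd numbers to avoid ties in the vote.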
Advantages
The algorithm is simple to implement.
The algorithm does not need to build a model.
The algorithm is versatile. It can be used for classification, regression, and search.
Disadvantages
Computationally expensive.
High memory requirement because it stores all of the training data.
Sensitive to irrelevant features and the scale of the data.
KNN Algorithm
The KNN model can be implemented by following these steps:
1. Load the dataset: load the training data as well as the test data.
2. Initialize the value of K (the number of nearest data points).
3. To get the predicted class of a test sample in the test data, do the following:
3.1. Calculate the distance between the test sample and each sample of the training
data (any distance measure can be used).
3.2. Sort the distances and their indices in ascending order of the distance
values.
3.3. Get the top K entries from the sorted array.
3.4. Assign a class to the test sample based on the most frequent class of these
entries.
4. End.
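The steps above can be sketched in Python (an illustrative sketch, not the course's C# implementation given later):

```python
import math
from collections import Counter

def knn_predict(train, labels, test, k=3):
    """Predict the class of `test` from the K nearest training samples."""
    # Step 3.1: distance between the test sample and every training sample
    # (Euclidean distance is used here, but any distance measure works).
    distances = [math.dist(sample, test) for sample in train]
    # Steps 3.2-3.3: sort the indices by distance and keep the top K.
    nearest = sorted(range(len(train)), key=lambda i: distances[i])[:k]
    # Step 3.4: majority vote among the labels of the K nearest samples.
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```

Note that no model is built in advance: all the work happens at prediction time, which is exactly the "no training" advantage and the "computationally expensive" disadvantage listed above.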
KNN Example
[Figure: an example of KNN classification, showing the data samples with the neighborhoods for K=3 (inner circle) and K=5 (outer circle).]
The test sample (green dot) should be classified either as a blue square or as a
red triangle. If K = 3, it is assigned to the red triangles because there are 2
triangles and only 1 square inside the inner circle. If K = 5, it is assigned to the
blue squares (3 squares vs. 2 triangles inside the outer circle).
Numerical Example of KNN Algorithm
Suppose the training data has four objects, each with two attributes (Time,
Strength), classified as good or bad as shown in the table below:

Training Sample | Time | Strength | Classification
1               | 7    | 7        | Bad
2               | 7    | 4        | Bad
3               | 3    | 4        | Good
4               | 1    | 4        | Good

The goal is to classify the test sample Test (Time = 3, Strength = 7) as good or
bad.
1. Determine the value of K (the nearest data points), K=3.
2. Calculate the distance between the test sample and all the training samples.
Training Sample | Test Sample | Distance                      | Classification
1 (7,7)         | (3,7)       | D = √((7−3)² + (7−7)²) = 4   | Bad
2 (7,4)         | (3,7)       | D = √((7−3)² + (4−7)²) = 5   | Bad
3 (3,4)         | (3,7)       | D = √((3−3)² + (4−7)²) = 3   | Good
4 (1,4)         | (3,7)       | D = √((1−3)² + (4−7)²) = 3.6 | Good
3. Sort the calculated distances in ascending order based on distance values.
Training Sample | Test Sample | Ascending Order of Distance | Classification
3 (3,4)         | (3,7)       | 3                           | Good
4 (1,4)         | (3,7)       | 3.6                         | Good
1 (7,7)         | (3,7)       | 4                           | Bad
2 (7,4)         | (3,7)       | 5                           | Bad
4. Get the top K items from the sorted array; here K = 3.
Training Sample | Test Sample | Ascending Order of Distance | Classification
3 (3,4)         | (3,7)       | 3                           | Good
4 (1,4)         | (3,7)       | 3.6                         | Good
1 (7,7)         | (3,7)       | 4                           | Bad
5. Assign a class to the test sample based on the most frequent class of the top K
items. The classification labels are (2 Good and 1 Bad), so the predicted class label is Good.

Most Frequent Class of Nearest Neighbors (K=3) | Classification
Good, Good, Bad                                | Good

The class of the test sample (3,7) is Good.
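The worked example can be checked with a short Python sketch (illustrative only; the attribute values are taken from the table above):

```python
import math
from collections import Counter

train = [(7, 7), (7, 4), (3, 4), (1, 4)]   # (Time, Strength) of samples 1-4
labels = ["Bad", "Bad", "Good", "Good"]
test = (3, 7)

# Euclidean distance from the test sample to each training sample.
distances = [math.dist(s, test) for s in train]
print([round(d, 1) for d in distances])    # [4.0, 5.0, 3.0, 3.6]

# Labels of the K = 3 nearest samples, then the majority vote.
nearest = sorted(zip(distances, labels))[:3]
print(Counter(label for _, label in nearest).most_common(1)[0][0])  # Good
```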
The Code
Form Design: the form contains a button (button1) to run the classification and a text box (textBox1) to display the predicted class.
Form CS:
using System;
using System.Linq;
using System.Windows.Forms;

namespace DM_KNN_Algorithm
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        // Initialize the training data: five samples, four attributes each.
        int[,] Train = new int[5, 4] { { 10, 20, 30, 40 },
                                       { 91, 92, 93, 94 },
                                       { 81, 82, 83, 84 },
                                       { 95, 96, 97, 98 },
                                       { 11, 17, 25, 36 } };
        // Class label of each training sample.
        int[] Lable = new int[5] { 1, 2, 2, 2, 1 };

        // Initialize the value of K (the number of nearest data points).
        int K = 3;

        // Compute the Euclidean distance between two vectors.
        public double Euclidean(int[] X, int[] Y)
        {
            double Sum = 0;
            int len = X.Length;
            for (int i = 0; i < len; i++)
            {
                Sum += Math.Pow(X[i] - Y[i], 2.0);
            }
            double Dist = Math.Sqrt(Sum);
            return Dist;
        }

        // Compute the most frequent item in an array.
        public int MostFrequnt(int[] input)
        {
            int[] Fritem;
            int Max = 0;
            // Remove the redundant items from the array.
            int[] Items = input.Distinct().ToArray();
            int len = Items.Length;
            // Give an initial value to the most frequent item.
            int MostFreq = Items[0];
            // Find the number of occurrences of each item.
            for (int i = 0; i < len; i++)
            {
                Fritem = Array.FindAll(input, x => x == Items[i]);
                if (Max < Fritem.Length)
                {
                    Max = Fritem.Length;
                    MostFreq = Items[i];
                }
            }
            return MostFreq;
        }

        private void button1_Click(object sender, EventArgs e)
        {
            // The test sample.
            int[] Test = new int[4] { 40, 30, 20, 10 };
            // Get the number of rows and columns.
            int Norow = Train.GetLength(0);
            int Nocol = Train.GetLength(1);
            // Define the vector that holds one training sample.
            int[] Tsample = new int[Nocol];
            // Define the vector that holds the distance values.
            double[] D = new double[Norow];

            // Calculate the distance between the test sample and each
            // sample of the training data.
            for (int i = 0; i < Norow; i++)
            {
                for (int j = 0; j < Nocol; j++)
                {
                    Tsample[j] = Train[i, j];
                }
                D[i] = Euclidean(Tsample, Test);
            }

            // Sort the labels in ascending order of their distance values.
            Array.Sort(D, Lable);
            // Get the top K items from the sorted array.
            int[] temp = new int[K];
            Array.Copy(Lable, temp, K);
            // Assign a class to the test sample: the most frequent class
            // of the top K items.
            int Class = MostFrequnt(temp);
            // Display the predicted class.
            textBox1.Text = Class.ToString();
        }

        private void textBox1_TextChanged(object sender, EventArgs e)
        {
        }
    }
}