Data Mining - Lecture 6

Data Mining - Practical - Fourth Class (IT) - Dr. Eman Hato

5. Data Classification
Classification is a data analysis technique used to categorize data into different
classes. Classification is a supervised learning technique that works in two phases:
a training phase and a testing phase. During the training phase (also known as the
learning phase), a classifier model is constructed by training a classification
algorithm on a predetermined set of training data. In the testing phase, the
classifier model is used to predict the class labels of the test data.

K-Nearest Neighbors Algorithm (KNN)


The k-nearest neighbors (KNN) algorithm is a simple supervised machine learning
algorithm that can be used to solve both classification and regression problems.
KNN calculates the similarity between an input test sample and each training sample
in order to assign the new test sample to the category it is most similar to among
the available categories of the training set. KNN selects the specified number of
samples (K) closest to the input test sample and then votes for the most frequent
category. The right K for a data set is chosen by trying several values of K and
picking the one that works best, as the sketch below illustrates.
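As a concrete illustration of picking K, here is a minimal console sketch; it is
separate from the lecture's Windows Forms code given later, and the data, the
candidate K values, and the Classify helper are all hypothetical:

using System;
using System.Linq;

class ChooseK
{
    // Hypothetical training data: each row is one sample (two features) with a class label.
    static double[][] trainX = { new double[] { 7, 7 }, new double[] { 7, 4 },
                                 new double[] { 3, 4 }, new double[] { 1, 4 } };
    static int[] trainY = { 0, 0, 1, 1 };        // 0 and 1 are made-up class labels

    // A hypothetical held-out validation set used to score each candidate K.
    static double[][] valX = { new double[] { 3, 7 }, new double[] { 6, 5 } };
    static int[] valY = { 1, 0 };

    // Classify one sample: take the K nearest training samples and vote.
    static int Classify(double[] x, int k)
    {
        return trainX
            .Select((t, i) => new
            {
                Dist = Math.Sqrt(t.Zip(x, (a, b) => (a - b) * (a - b)).Sum()),
                Label = trainY[i]
            })
            .OrderBy(p => p.Dist)
            .Take(k)
            .GroupBy(p => p.Label)
            .OrderByDescending(g => g.Count())
            .First().Key;
    }

    static void Main()
    {
        // Try several Ks and keep the one with the highest validation accuracy.
        int bestK = 1;
        double bestAcc = -1;
        foreach (int k in new[] { 1, 3 })        // odd candidates help avoid ties
        {
            int correct = valX.Select((x, i) => Classify(x, k) == valY[i] ? 1 : 0).Sum();
            double acc = (double)correct / valX.Length;
            Console.WriteLine($"K = {k}: validation accuracy = {acc:P0}");
            if (acc > bestAcc) { bestAcc = acc; bestK = k; }
        }
        Console.WriteLine($"Best K = {bestK}");
    }
}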

Advantages

• The algorithm is simple to implement.
• The algorithm does not need to build a model.
• The algorithm is versatile: it can be used for classification, regression, and search.

Disadvantages

• Computationally expensive, since it calculates a distance to every training sample.
• High memory requirements, because it stores all of the training data.
• Sensitive to irrelevant features and to the scale of the data (see the sketch after this list).
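Because of the sensitivity to scale, features are usually rescaled to a comparable
range before distances are computed. Below is a minimal sketch of min-max scaling;
the data values are made up, and this preprocessing step is not part of the lecture
code:

using System;
using System.Linq;

class MinMaxScaling
{
    static void Main()
    {
        // Hypothetical data: column 0 lies in [1, 7], column 1 in [400, 700];
        // without scaling, column 1 would dominate the Euclidean distance.
        double[][] data = { new double[] { 7, 700 }, new double[] { 7, 400 },
                            new double[] { 3, 400 }, new double[] { 1, 400 } };
        int cols = data[0].Length;

        // Min-max scale each column to [0, 1]: x' = (x - min) / (max - min).
        for (int j = 0; j < cols; j++)
        {
            double min = data.Min(row => row[j]);
            double max = data.Max(row => row[j]);
            foreach (var row in data)
                row[j] = (row[j] - min) / (max - min);
        }

        foreach (var row in data)
            Console.WriteLine(string.Join(", ", row.Select(v => v.ToString("0.00"))));
    }
}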


KNN Algorithm
The KNN model can be implemented by following these steps:
1. Load the dataset: load the training data as well as the test data.
2. Initialize the value of K (the number of nearest data points).
3. To get the predicted class of a test sample in the test data, do the following:
3.1. Calculate the distance between the test sample and each sample of the training
data (any distance measure can be used).
3.2. Sort the distances and indices in ascending order based on the distance
values.
3.3. Get the top K entries from the sorted array.
3.4. Assign a class to the test sample based on the most frequent class of these
entries.
4. End.

KNN Example

[Figure: example of KNN classification - the data samples shown with the K = 3
(inner circle) and K = 5 (outer circle) neighborhoods]


The test sample (green dot) should be classified as either a blue square or a
red triangle. If K = 3, it is assigned to the red triangles because there are 2
triangles and only 1 square inside the inner circle. If K = 5, it is assigned to
the blue squares (3 squares vs. 2 triangles inside the outer circle).


Numerical Example of KNN Algorithm

Suppose the training data have four objects, each with two attributes
(Time, Strength), classified as either Good or Bad as shown in the table
below:
Training Sample   Time   Strength   Classification
1                 7      7          Bad
2                 7      4          Bad
3                 3      4          Good
4                 1      4          Good

The goal is to classify the test sample (Time = 3, Strength = 7) as Good or Bad.
1. Determine the value of K (the number of nearest data points); here K = 3.
2. Calculate the distance between the test sample and all the training samples.

Training sample   Test sample   Distance                       Classification
1 (7,7)           (3,7)         D = √((7-3)² + (7-7)²) = 4     Bad
2 (7,4)           (3,7)         D = √((7-3)² + (4-7)²) = 5     Bad
3 (3,4)           (3,7)         D = √((3-3)² + (4-7)²) = 3     Good
4 (1,4)           (3,7)         D = √((1-3)² + (4-7)²) = 3.6   Good

3. Sort the calculated distances in ascending order based on distance values.

Training sample   Test sample   Distance (ascending)   Classification
3 (3,4)           (3,7)         3                      Good
4 (1,4)           (3,7)         3.6                    Good
1 (7,7)           (3,7)         4                      Bad
2 (7,4)           (3,7)         5                      Bad

4. Get the top K items from the sorted array; here K = 3.

Training sample   Test sample   Distance (ascending)   Classification
3 (3,4)           (3,7)         3                      Good
4 (1,4)           (3,7)         3.6                    Good
1 (7,7)           (3,7)         4                      Bad


5. Assign a class to the test sample based on the most frequent class of the top K
items. The class labels of the three nearest neighbors are (Good, Good, Bad), so
the predicted class label is Good.

Classes of the nearest neighbors (K = 3)   Most frequent class
Good, Good, Bad                            Good

The class of the test sample (3,7) is Good.
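The same steps can be checked with a short console sketch; this is a separate
illustration, not the Windows Forms implementation shown in the next section, and
the names are illustrative:

using System;
using System.Linq;

class KnnWorkedExample
{
    static void Main()
    {
        // Training samples (Time, Strength) and their classes, as in the table above.
        double[][] train = { new double[] { 7, 7 }, new double[] { 7, 4 },
                             new double[] { 3, 4 }, new double[] { 1, 4 } };
        string[] labels = { "Bad", "Bad", "Good", "Good" };
        double[] test = { 3, 7 };
        int K = 3;

        // Steps 2-3: compute the Euclidean distances and sort them in ascending order.
        var neighbors = train
            .Select((t, i) => new
            {
                Dist = Math.Sqrt(Math.Pow(t[0] - test[0], 2) + Math.Pow(t[1] - test[1], 2)),
                Label = labels[i]
            })
            .OrderBy(n => n.Dist)
            .ToArray();
        foreach (var n in neighbors)
            Console.WriteLine($"{n.Dist:0.0}  {n.Label}");   // 3.0 Good, 3.6 Good, 4.0 Bad, 5.0 Bad

        // Steps 4-5: take the top K and vote for the most frequent class.
        string predicted = neighbors.Take(K)
            .GroupBy(n => n.Label)
            .OrderByDescending(g => g.Count())
            .First().Key;
        Console.WriteLine($"Predicted class: {predicted}");  // Good
    }
}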

The Code
Form Design: a Windows form containing a button (button1) that runs the
classification and a text box (textBox1) that displays the predicted class.

Form1.cs:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace DM_KNN_Algorithm
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        // Initialize the training data: 5 samples with 4 attributes each.
        int[,] Train = new int[5, 4] { { 10, 20, 30, 40 },
                                       { 91, 92, 93, 94 },
                                       { 81, 82, 83, 84 },
                                       { 95, 96, 97, 98 },
                                       { 11, 17, 25, 36 } };

        // The class label of each training sample.
        int[] Lable = new int[5] { 1, 2, 2, 2, 1 };

        // Initialize the value of K (the number of nearest data points).
        int K = 3;

        // Compute the Euclidean distance between two vectors.
        public double Euclidean(int[] X, int[] Y)
        {
            double Sum = 0;
            int len = X.Length;

            for (int i = 0; i < len; i++)
            {
                Sum += Math.Pow((X[i] - Y[i]), 2.0);
            }

            double Dist = Math.Sqrt(Sum);
            return Dist;
        }

        // Find the most frequent item in an array.
        public int MostFrequnt(int[] input)
        {
            int[] Fritem;
            int Max = 0;

            // Remove the duplicate items from the array.
            int[] Items = input.Distinct().ToArray();
            int len = Items.Length;

            // Give an initial value to the most frequent item.
            int MostFreq = Items[0];

            // Count the occurrences of each distinct item.
            for (int i = 0; i < len; i++)
            {
                Fritem = Array.FindAll(input, x => x == Items[i]);
                if (Max < Fritem.Length)
                {
                    Max = Fritem.Length;
                    MostFreq = Items[i];
                }
            }
            return MostFreq;
        }

        private void button1_Click(object sender, EventArgs e)
        {
            // The test sample.
            int[] Test = new int[4] { 40, 30, 20, 10 };

            // Get the number of rows and columns of the training data.
            int Norow = Train.GetLength(0);
            int Nocol = Train.GetLength(1);

            // Define the vector that holds one training sample at a time.
            int[] Tsample = new int[Nocol];

            // Define the vector that holds the distance values.
            double[] D = new double[Norow];

            // Calculate the distance between the test sample and each training sample.
            for (int i = 0; i < Norow; i++)
            {
                for (int j = 0; j < Nocol; j++)
                {
                    Tsample[j] = Train[i, j];
                }
                D[i] = Euclidean(Tsample, Test);
            }

            // Sort the distances in ascending order. A copy of the labels is sorted
            // in parallel so the original Lable array keeps matching the Train rows
            // on later button clicks.
            int[] SortedLable = (int[])Lable.Clone();
            Array.Sort(D, SortedLable);

            // Get the top K items from the sorted array.
            int[] temp = new int[K];
            Array.Copy(SortedLable, temp, K);

            // Assign a class to the test sample based on the most frequent class
            // of the top K items.
            int Class = MostFrequnt(temp);

            // Display the predicted class.
            textBox1.Text = Class.ToString();
        }

        private void textBox1_TextChanged(object sender, EventArgs e)
        {

        }
    }
}
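With the data hard-coded above, the three nearest training samples to the test
sample {40, 30, 20, 10} are {11, 17, 25, 36} and {10, 20, 30, 40} (both labeled 1),
followed by {81, 82, 83, 84} (labeled 2), so clicking the button displays 1.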
