Assignment Report: K-Nearest Neighbors (KNN)
1. Introduction
The K-Nearest Neighbors (KNN) algorithm is a simple yet powerful supervised learning method used for both
classification and regression problems. It classifies a data point based on the labels of its nearest neighbors
in the training set. The method is non-parametric, meaning it makes no assumptions about the underlying
distribution of the data.
This assignment involves:
- Implementing the KNN algorithm.
- Applying KNN for classification on the Iris dataset.
- Using KNN for imputing missing values.
2. KNN Algorithm - Conceptual Overview
The KNN algorithm works as follows:
1. Store the training dataset.
2. For each test sample:
- Calculate the Euclidean distance to all training points.
- Identify the k-nearest neighbors.
- Assign the label by majority vote among the neighbors (for classification) or by averaging their values (for regression).
Euclidean Distance Formula:
d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
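A minimal from-scratch sketch of these steps in Python follows; the function and variable names here are illustrative and not necessarily those of the original implementation:

    import numpy as np
    from collections import Counter

    def euclidean_distance(x, y):
        # d(x, y) = sqrt((x1 - y1)^2 + ... + (xn - yn)^2)
        return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

    def knn_predict(X_train, y_train, x_test, k=3):
        # Step 1: distance from the test sample to every training point.
        distances = [euclidean_distance(x, x_test) for x in X_train]
        # Step 2: indices of the k nearest neighbors.
        nearest = np.argsort(distances)[:k]
        # Step 3: majority vote among the neighbors' labels (classification).
        votes = [y_train[i] for i in nearest]
        return Counter(votes).most_common(1)[0][0]

For regression, the final step would instead return the mean of the neighbors' target values.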
3. KNN Classification - Application on Iris Dataset
The Iris dataset is a classical dataset in pattern recognition and machine learning. It includes 150 samples
from three species of Iris (Setosa, Versicolor, and Virginica), with four features:
- Sepal length
- Sepal width
- Petal length
- Petal width
Experiment Details:
- The KNN algorithm was implemented from scratch using Python.
- Features and labels were extracted from the Iris dataset.
- The same data was used for both training and testing in order to demonstrate perfect classification accuracy
(note: this is for demonstration only and is not a valid protocol for real-world evaluation); a sketch follows the results below.
Result:
- Accuracy = 100% when using the same dataset for training and testing.
- This confirms the correctness of the implementation but not the generalization capability.
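A sketch of this check, reusing the knn_predict function from Section 2 and assuming scikit-learn's copy of the dataset (the original report may have loaded the data differently). With k = 1, every point's nearest neighbor is itself at distance 0, so 100% accuracy on the training data follows by construction:

    import numpy as np
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)

    # Demonstration only: train and test sets are identical, so the score
    # checks implementation correctness, not generalization.
    predictions = [knn_predict(X, y, x, k=1) for x in X]
    print(f"Accuracy: {np.mean(np.array(predictions) == y):.0%}")  # expected: 100%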
4. KNN for Missing Value Imputation
In addition to classification, KNN can be used for missing value imputation.
Process:
1. Identify records with missing values.
2. Treat the attribute with missing values as the target.
3. Use the remaining features to predict the missing value using KNN.
Application:
- A subset of the Iris dataset was modified to simulate missing values in the sepal_width column.
- The KNN algorithm was used to predict and impute these missing values using the other three features
(sepal_length, petal_length, petal_width).
- After imputation, no missing values remained in the column, confirming that the procedure worked as intended.
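A sketch of this procedure under the setup described above, reusing the distance idea from Section 2 as a regressor (neighbor values are averaged). The random missing mask and helper names are illustrative assumptions, not taken from the original code:

    import numpy as np
    from sklearn.datasets import load_iris

    def knn_impute_value(X_known, y_known, x_query, k=3):
        # Average the target attribute over the k nearest neighbors (regression).
        distances = np.sqrt(np.sum((X_known - x_query) ** 2, axis=1))
        return float(np.mean(y_known[np.argsort(distances)[:k]]))

    X = load_iris().data.copy()
    rng = np.random.default_rng(0)
    X[rng.choice(len(X), size=10, replace=False), 1] = np.nan  # simulate missing sepal_width

    predictors = [0, 2, 3]            # sepal_length, petal_length, petal_width
    known = ~np.isnan(X[:, 1])        # rows where sepal_width is observed
    for i in np.where(~known)[0]:
        # Treat sepal_width as the target and predict it from the other features.
        X[i, 1] = knn_impute_value(X[known][:, predictors], X[known, 1], X[i, predictors])

    assert not np.isnan(X).any()      # no missing values remain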
5. Conclusion
- The KNN algorithm was successfully implemented and tested.
- It was applied to the Iris dataset, achieving perfect accuracy under controlled conditions.
- KNN was also effectively used to impute missing values, showcasing its versatility.
6. Key Takeaways
- KNN is simple, intuitive, and effective for small to medium-sized datasets.
- It is sensitive to the choice of k and to the scale of the features (a scaling sketch follows this list).
- It is a powerful method not only for classification but also for tasks like missing data handling.
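As a brief illustration of the sensitivity to feature scale (not part of the original assignment), scikit-learn's KNeighborsClassifier can be combined with standardization and a proper train/test split:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Standardizing first keeps any one feature's scale from dominating the distance.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    model.fit(X_train, y_train)
    print(f"Held-out accuracy: {model.score(X_test, y_test):.2%}")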