MACHINE LEARNING LAB
Paper Code: CIE-421P
Faculty Name: Dr. Sudha Narang (Associate Professor)
Student Name: Jatin Bansal
Roll No: 03596402722
Semester: 7
Group: 7AIML-3C
Maharaja Agrasen Institute of Technology, PSP Area,
Sector – 22, Rohini, New Delhi - 110085
MAHARAJA AGRASEN INSTITUTE OF TECHNOLOGY
COMPUTER SCIENCE & ENGINEERING DEPARTMENT
VISION
"To attain global excellence through education, innovation, research, and work ethics in the
field of Computer Science and engineering with the commitment to serve humanity."
MISSION
M1: To lead in the advancement of computer science and engineering through internationally
recognized research and education.
M2: To prepare students for full and ethical participation in a diverse society and encourage
lifelong learning.
M3: To foster development of problem solving and communication skills as an integral
component of the profession.
M4: To impart knowledge, skills and cultivate an environment supporting incubation, product
development, technology transfer, capacity building and entrepreneurship in the field of
computer science and engineering.
M5: To encourage faculty and student networking with alumni, industry, institutions, and other
stakeholders for collective engagement.
Rubrics for Lab Assessment (10 Marks):
Each rubric is graded as 0 Marks (No), 1 Mark (Partially), or 2 Marks (Completely), and maps to the POs and PSOs covered.

R1: Is the student able to identify and define the objective of the given problem? (PO1, PO2; PSO1, PSO2)
R2: Does the proposed design/procedure/algorithm solve the problem? (PO1, PO2, PO3; PSO1, PSO2)
R3: Does the student understand the tool/programming language used to implement the proposed solution? (PO1, PO3, PO5; PSO1, PSO2)
R4: Are the result(s) verified using sufficient test data to support the conclusions? (PO2, PO4, PO5; PSO2)
R5: Individuality of submission. (PO8, PO12; PSO1, PSO3)
INDEX

Columns: S.No | Date of Performance | Experiment | R1 | R2 | R3 | R4 | R5 | Total Marks | Faculty Signature
(R1–R5 are the five rubrics listed above, each worth 2 marks.)
EXPERIMENT NO. 1
AIM: Introduction to JUPYTER IDE and its libraries Pandas and NumPy.
THEORY:
The Jupyter Notebook has become an integral tool in data science, machine learning, and
scientific research. It is an open-source web application that allows you to create and share
documents containing live code, equations, visualizations, and narrative text. It is part of
Project Jupyter, a non-profit, open-source project that evolved from the IPython project in
2014. The name 'Jupyter' refers to the three core languages it was originally designed to
support: Julia, Python, and R; today it supports over 40 programming languages.
NumPy (Numerical Python) is a fundamental Python library used for scientific computing.
It provides:
Support for multi-dimensional arrays (ndarray).
Fast mathematical and statistical operations.
Functions for linear algebra, random numbers, and numerical analysis.
Much faster performance than normal Python lists when working with large datasets.
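A minimal sketch illustrating these NumPy features (assuming NumPy is installed; the array contents below are purely illustrative):

import numpy as np

# Create a 2-D array (ndarray) and inspect its shape
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
print(data.shape)          # (2, 3)

# Fast vectorized mathematical and statistical operations
print(data.mean(axis=0))   # column-wise means
print(data * 10)           # element-wise scaling

# Linear algebra and random-number utilities
rng = np.random.default_rng(seed=0)
noise = rng.normal(size=(2, 3))
print(np.linalg.norm(data + noise))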
Pandas is a Python library built on top of NumPy, mainly used for data analysis and
manipulation.
It provides two main data structures:
Series → One-dimensional labeled array (like a column in Excel).
DataFrame → Two-dimensional labeled data structure (like an Excel table).
With Pandas, one can:
Import/export data (CSV, Excel, SQL, JSON, etc.).
Clean and transform datasets.
Perform filtering, grouping, merging, and statistical analysis.
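A short Pandas sketch of these operations (the column names and values are made up for demonstration; the CSV calls are commented out since no file is assumed to exist):

import pandas as pd

# A Series is a one-dimensional labeled array
marks = pd.Series([85, 92, 78], index=["exp1", "exp2", "exp3"])

# A DataFrame is a two-dimensional labeled structure
df = pd.DataFrame({
    "student": ["A", "B", "C", "D"],
    "group":   ["3C", "3C", "3D", "3D"],
    "score":   [85, 92, 78, 88],
})

# Filtering, grouping, and simple statistics
passed = df[df["score"] >= 80]
print(passed)
print(df.groupby("group")["score"].mean())

# Import/export, e.g. CSV (file name is illustrative)
# df.to_csv("scores.csv", index=False)
# df = pd.read_csv("scores.csv")
print(marks.describe())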
Features of Jupyter Notebook
1. Interactive Development Environment
2. Rich Text and Media Support
3. Collaboration and Sharing
4. Extensibility and Customization
5. Reproducibility and Portability
6. Data Science and Visualization
7. Open Source and Community-Driven
JUPYTER NOTEBOOK:
EXPERIMENT NO. 2
AIM: Program to demonstrate Simple Linear Regression.
THEORY:
Linear regression predicts the relationship between two variables by assuming a linear
connection between the independent and dependent variables. It seeks the optimal line that
minimizes the sum of squared differences between predicted and actual values. In a simple
linear regression, there is one independent variable and one dependent variable. The model
estimates the slope and intercept of the line of best fit, which represents the relationship
between the variables. The slope represents the change in the dependent variable for each unit
change in the independent variable, while the intercept represents the predicted value of the
dependent variable when the independent variable is zero. To calculate the best-fit line, linear
regression uses the traditional slope-intercept form given below, where β0 is the intercept and
β1 is the slope: Yi = β0 + β1Xi
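A minimal simple linear regression sketch using scikit-learn on synthetic data (the data is generated only for illustration, with true slope 2 and intercept 5):

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y ≈ 2x + 5 with Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))            # one independent variable
y = 2.0 * X.ravel() + 5.0 + rng.normal(0, 1, 100)

# Fit the line that minimizes the sum of squared differences
model = LinearRegression()
model.fit(X, y)

print("slope (beta1):", model.coef_[0])
print("intercept (beta0):", model.intercept_)
print("prediction at x = 4:", model.predict([[4.0]])[0])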
JUPYTER NOTEBOOK:
EXPERIMENT NO. 3
AIM: Program to demonstrate Logistic Regression.
THEORY:
Logistic regression is a supervised machine learning algorithm that accomplishes binary
classification tasks by predicting the probability of an outcome, event, or observation. The
model delivers a binary or dichotomous outcome limited to two possible outcomes: yes/no,
0/1, or true/false. Logistic regression analyses the relationship between one or more
independent variables and classifies data into discrete classes. It is extensively used in
predictive modelling, where the model estimates the mathematical probability of whether an
instance belongs to a specific category or not.
Logistic regression uses a logistic function called a sigmoid function to map predictions and
their probabilities. The sigmoid function refers to an S-shaped curve that converts any real
value to a range between 0 and 1.
Moreover, if the output of the sigmoid function (estimated probability) is greater than a
predefined threshold on the graph, the model predicts that the instance belongs to that class.
If the estimated probability is less than the predefined threshold, the model predicts that the
instance does not belong to the class.
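A short logistic regression sketch on the breast cancer dataset bundled with scikit-learn (the dataset choice and the 0.5 threshold are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Binary classification data: malignant vs. benign
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize features so the solver converges quickly
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression()
clf.fit(scaler.transform(X_train), y_train)

# The sigmoid gives estimated probabilities; a 0.5 threshold turns them into labels
probs = clf.predict_proba(scaler.transform(X_test))[:, 1]
preds = (probs >= 0.5).astype(int)
print("accuracy:", accuracy_score(y_test, preds))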
JUPYTER NOTEBOOK:
EXPERIMENT NO. 4
AIM: Program to demonstrate Decision Tree-ID3 Algorithm.
THEORY:
A decision tree is a structure that contains nodes (rectangular boxes) and edges (arrows) and
is built from a dataset (a table whose columns represent features/attributes and whose rows
correspond to records). Each node either makes a decision (a decision node) or represents an
outcome (a leaf node). ID3 stands for Iterative Dichotomiser 3 and is
named such because the algorithm iteratively (repeatedly) dichotomizes (divides) features
into two or more groups at each step. Invented by Ross Quinlan, ID3 uses a top-down greedy
approach to build a decision tree.
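A small sketch of the information-gain computation at the heart of ID3, plus an entropy-based tree on the Iris dataset (note: scikit-learn's DecisionTreeClassifier implements CART with an entropy option rather than textbook ID3, so it is shown only as an approximation; the toy labels are made up):

import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, groups):
    """Parent entropy minus the weighted entropy of the child groups."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Toy example: a perfect split of six labels on some attribute
parent = ["yes", "yes", "no", "no", "yes", "no"]
split = [["yes", "yes", "yes"], ["no", "no", "no"]]
print("information gain:", information_gain(parent, split))   # 1.0

# Entropy-criterion tree on the Iris dataset
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)
print("training accuracy:", tree.score(X, y))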
JUPYTER NOTEBOOK:
EXPERIMENT NO. 5
AIM: To demonstrate k-Nearest Neighbor flowers classification.
THEORY:
The K-Nearest Neighbors (KNN) algorithm is a popular machine learning technique used for
classification and regression tasks. It relies on the idea that similar data points tend to have
similar labels or values. During the training phase, the KNN algorithm stores the entire
training dataset as a reference. When making predictions, it calculates the distance between
the input data point and all the training examples, using a chosen distance metric such as
Euclidean distance. Next, the algorithm identifies the K nearest neighbors to the input data
point based on their distances. In the case of classification, the algorithm assigns the most
common class label among the K neighbors as the predicted label for the input data point. For
regression, it calculates the average or weighted average of the target values of the K
neighbors to predict the value for the input data point. KNN Algorithm can be used for both
classification and regression predictive problems.
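A minimal k-NN sketch for flower classification on the Iris dataset (k = 5 is an arbitrary illustrative choice; Euclidean distance is scikit-learn's default metric):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Store the training set and classify each test point by majority vote
# among its 5 nearest neighbors under Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, knn.predict(X_test)))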
JUPYTER NOTEBOOK:
EXPERIMENT NO. 6
AIM: Program to demonstrate Naive-Bayes Classifier.
THEORY:
The Naïve Bayes Classifier is a probabilistic machine learning model used for classification
tasks based on Bayes’ Theorem, assuming independence among features.
Bayes’ theorem states:
P(C | X) = P(X | C) · P(C) / P(X)
where:
P(C | X): Posterior probability of the class given the features
P(X | C): Likelihood
P(C): Prior probability of the class
P(X): Evidence (constant for comparison)
The “naïve” assumption simplifies the computation by treating all features as independent,
making it efficient for large datasets.
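A short Gaussian Naïve Bayes sketch on the Iris dataset (GaussianNB additionally assumes each feature is normally distributed within a class; the dataset choice is an assumption for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit class priors P(C) and per-feature likelihoods P(X|C),
# then predict the class with the largest posterior P(C|X)
nb = GaussianNB()
nb.fit(X_train, y_train)

print("class priors:", nb.class_prior_)
print("accuracy:", accuracy_score(y_test, nb.predict(X_test)))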
JUPYTER NOTEBOOK:
EXPERIMENT NO. 7
AIM: To demonstrate Principal Component Analysis (PCA) and Linear Discriminant
Analysis (LDA) on the Iris dataset.
THEORY:
PCA (Principal Component Analysis):
PCA is a dimensionality reduction technique that transforms a large set of variables into a
smaller one while retaining most of the variance.
It works by:
1. Standardizing the data.
2. Computing the covariance matrix.
3. Calculating eigenvalues and eigenvectors.
4. Choosing the top k eigenvectors (principal components).
5. Projecting data onto the new feature space.
LDA (Linear Discriminant Analysis):
LDA is a supervised dimensionality reduction technique that aims to maximize class
separability.
It projects data onto a lower-dimensional space where the ratio of between-class variance to
within-class variance is maximized.
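A compact sketch applying both PCA and LDA to the Iris dataset, reducing to two components each (the data is standardized before PCA, following the steps above; two components is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, keeps the directions of maximum variance
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
print("explained variance ratio:", pca.explained_variance_ratio_)

# LDA: supervised, maximizes between-class vs. within-class variance
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print("PCA shape:", X_pca.shape, "LDA shape:", X_lda.shape)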
JUPYTER NOTEBOOK:
EXPERIMENT NO. 8
AIM: Program to demonstrate DBSCAN clustering algorithm
THEORY:
DBSCAN is a density-based clustering algorithm that groups together closely packed points
and marks points in low-density areas as outliers.
Key Parameters:
eps: Maximum distance between two samples for them to be considered as neighbors.
min_samples: Minimum number of points to form a dense region.
Advantages:
Can find arbitrarily shaped clusters.
Robust to noise.
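A minimal DBSCAN sketch on synthetic two-moon data (the eps and min_samples values are illustrative and would normally need tuning for real data):

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Non-spherical clusters that centroid-based methods struggle with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# Label -1 marks points in low-density regions (noise/outliers)
print("clusters found:", len(set(labels) - {-1}))
print("noise points:", list(labels).count(-1))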
JUPYTER NOTEBOOK:
EXPERIMENT NO. 9
AIM: Program to demonstrate K-Medoid clustering algorithm
THEORY:
K-Medoids is a partition-based clustering algorithm similar to K-Means but uses actual data
points as cluster centers (medoids), making it more robust to outliers.
Steps:
1. Choose k random medoids.
2. Assign each data point to the nearest medoid.
3. Update medoids by minimizing total dissimilarity.
4. Repeat until convergence.
Difference from K-Means:
K-Means uses centroids (mean points),
K-Medoids uses actual points → more stable with noisy data.
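K-Medoids is not included in scikit-learn itself, so below is a tiny NumPy sketch of the alternating assign/update loop described above (a simplified PAM-style variant for illustration, not an optimized implementation; the toy blob data is made up):

import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest medoid
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            # New medoid = member minimizing total dissimilarity within the cluster
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):   # convergence
            break
        medoids = new_medoids
    return medoids, labels

# Toy data: two well-separated blobs
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(5, 0.5, (20, 2))])
medoids, labels = k_medoids(X, k=2)
print("medoid points:\n", X[medoids])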
JUPYTER NOTEBOOK:
EXPERIMENT NO. 10
AIM: Program to demonstrate K-Means Clustering Algorithm on Handwritten Dataset
THEORY:
K-Means is an unsupervised learning algorithm used to partition a dataset into k clusters
based on feature similarity.
Steps:
1. Select k initial centroids randomly.
2. Assign each data point to the nearest centroid.
3. Recalculate centroids as the mean of all assigned points.
4. Repeat until centroids stabilize.
Applications:
Image compression, customer segmentation, pattern recognition.
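A short K-Means sketch on the handwritten digits dataset bundled with scikit-learn (k = 10 matches the ten digit classes; n_init is set explicitly so behavior is the same across scikit-learn versions):

from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# 8x8 grayscale images of handwritten digits, flattened to 64 features
X, y = load_digits(return_X_y=True)

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Cluster labels are arbitrary, so compare against the true digits with a
# label-permutation-invariant score
print("adjusted Rand index:", adjusted_rand_score(y, labels))
print("centroid shape:", kmeans.cluster_centers_.shape)   # (10, 64)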
JUPYTER NOTEBOOK: