MACHINE LEARNING WITH PYTHON
SEMESTER 5
UNIT - 1
INTRODUCTION TO MACHINE LEARNING
Machine learning is a rapidly growing field in artificial intelligence (AI) that
focuses on enabling computers to learn and improve from experience without
being explicitly programmed. In other words, machine learning algorithms can
automatically learn and make predictions or decisions based on data.
Machine learning algorithms are trained on large datasets, which they use to
learn patterns and relationships in the data. The algorithms then use this
knowledge to make predictions or decisions about new, unseen data.
WHY MACHINE LEARNING?
1. Large amounts of data: With the rise of big data, there is an overwhelming
amount of data being generated every day. Machine learning algorithms can
process and analyze this data to extract insights and make predictions.
2. Accurate predictions: Machine learning algorithms can make accurate
predictions based on historical data, which can help businesses make informed
decisions. For example, in finance, machine learning algorithms can predict
stock prices or credit risk.
3. Faster decision-making: Machine learning algorithms can make decisions
faster than humans, as they can process large amounts of data quickly and
accurately. This can help businesses respond quickly to changing market
conditions or customer needs.
4. Personalization: Machine learning algorithms can personalize products and
services based on a user's preferences and past behavior, which can improve
the user experience and increase customer satisfaction.
5. Cost savings: By automating tasks that would typically require human
intervention, machine learning algorithms can save businesses time and
money. For example, in healthcare, machine learning algorithms can help
diagnose diseases faster and more accurately than humans, which can save
lives and reduce healthcare costs.
6. Improved efficiency: Machine learning algorithms can improve the efficiency
of various processes by automating repetitive tasks or identifying areas for
improvement. For example, in manufacturing, machine learning algorithms can
optimize production processes by predicting equipment failures before they
occur.
TYPES OF MACHINE LEARNING PROBLEMS
Machine learning algorithms can be applied to various types of problems,
depending on the nature of the data and the desired output. Here are some
common types of machine learning problems:
1. Classification: In classification problems, the algorithm is trained to predict
the category or class of a new, unseen data point based on its features. For
example, in image recognition, the algorithm is trained to classify images into
different categories, such as cats or dogs.
2. Regression: In regression problems, the algorithm is trained to predict a
continuous numerical value based on input features. For example, in finance,
the algorithm can predict stock prices based on historical data.
3. Clustering: In clustering problems, the algorithm is trained to group similar
data points together based on their features. For example, in customer
segmentation, the algorithm can group customers with similar buying
behaviors together.
4. Anomaly detection: In anomaly detection problems, the algorithm is trained
to identify unusual or abnormal data points that do not fit the normal pattern
or distribution of the data. For example, in fraud detection, the algorithm can
identify unusual credit card transactions that may be fraudulent.
5. Dimensionality reduction: In dimensionality reduction problems, the
algorithm is trained to reduce the number of input features while preserving
most of the important information in the data. This can help simplify complex
problems and make them easier to understand and analyze.
6. Reinforcement learning: In reinforcement learning problems, the algorithm
learns by interacting with its environment and receiving feedback in the form
of rewards or penalties. The goal is to find a policy that maximizes the
cumulative reward over time. For example, in robotics, the algorithm can learn
to navigate a maze by interacting with it and receiving feedback from sensors.
APPLICATIONS OF MACHINE LEARNING
Machine learning has a wide range of applications across various industries and
domains. Here are some examples:
1. Finance: Machine learning algorithms are used in finance for tasks such as
fraud detection, credit scoring, stock price prediction, and portfolio
optimization.
2. Healthcare: Machine learning algorithms are used in healthcare for tasks such
as medical image analysis, disease diagnosis, drug discovery, and personalized
medicine.
3. Retail: Machine learning algorithms are used in retail for tasks such as
demand forecasting, inventory optimization, personalized recommendations,
and pricing optimization.
4. Manufacturing: Machine learning algorithms are used in manufacturing for
tasks such as predictive maintenance, quality control, and supply chain
optimization.
5. Transportation: Machine learning algorithms are used in transportation for
tasks such as traffic prediction, route optimization, and autonomous driving.
6. Education: Machine learning algorithms are used in education for tasks such
as student performance prediction, personalized learning, and intelligent
tutoring systems.
7. Energy: Machine learning algorithms are used in energy for tasks such as
wind turbine performance prediction, energy demand forecasting, and smart
grid optimization.
8. Cybersecurity: Machine learning algorithms are used in cybersecurity for tasks
such as network intrusion detection, malware detection, and anomaly
detection.
SUPERVISED MACHINE LEARNING: REGRESSION AND CLASSIFICATION
Supervised machine learning is a type of machine learning algorithm that is
trained on labeled data to make predictions or decisions for new, unseen data
points. In supervised learning, the algorithm learns a function that maps input
features to output labels or values. There are two main types of supervised
learning problems: regression and classification.
1. Regression: In regression problems, the output label is a continuous
numerical value. The goal is to predict this value based on the input features.
For example, in finance, regression can be used to predict stock prices based on
historical data. The output label is the stock price, and the input features might
include factors such as company earnings, economic indicators, and market
trends.
2. Classification: In classification problems, the output label is a category or
class. The goal is to predict which category a new, unseen data point belongs to
based on its input features. For example, in image recognition, the algorithm is
trained to classify images into different categories, such as cats or dogs. The
input features might include pixel values or features extracted from the image
using techniques such as convolutional neural networks (CNNs).
BINARY CLASSIFIER, MULTICLASS CLASSIFICATION,
MULTILABEL CLASSIFICATION
In binary classification, the output label is a binary value, indicating whether a
data point belongs to one of two classes or not. For example, in spam filtering,
the algorithm is trained to classify emails as either spam or not spam.
In multiclass classification, the output label is a categorical value, indicating
which of multiple classes a data point belongs to. For example, in image
recognition, the algorithm is trained to classify images into one of several
categories, such as cats, dogs, birds, and cars.
In multilabel classification, the output label is a set of binary values, indicating
which of multiple labels apply to a data point. For example, in document
classification, the algorithm is trained to classify documents into multiple
categories simultaneously.
The techniques used for training and evaluating binary classifiers can be
adapted for multiclass and multilabel classification problems as well. However,
multiclass and multilabel classification problems can be more challenging than
binary classification due to the increased number of classes and labels that
need to be considered. Techniques such as one-vs-rest (OVR) and one-vs-one
(OVO) are commonly used for multiclass classification, while techniques such as
binary relevance (BR) and label powerset (LP) are commonly used for multilabel
classification.
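As an illustration, here is a minimal sketch of the one-vs-rest strategy with
scikit-learn, assuming its built-in iris dataset as a small three-class problem:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Load a small three-class dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One-vs-rest: one binary logistic regression is fitted per class
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out data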
PERFORMANCE MEASURES: CONFUSION MATRIX, ACCURACY, PRECISION & RECALL, ROC CURVE
In supervised machine learning, the performance of a binary classifier can be
evaluated using various metrics, depending on the specific problem and the
desired outcome. Some commonly used performance measures are:
1. Confusion Matrix: A confusion matrix is a tabular representation of the
performance of a binary classifier on a test set. It shows the number of true
positives (TP), true negatives (TN), false positives (FP), and false negatives (FN)
for the classifier's predictions.
2. Accuracy: Accuracy is the fraction of correctly classified data points in the test
set. It is calculated as (TP + TN) / (TP + TN + FP + FN). Accuracy is a simple and
intuitive metric, but it may not be meaningful for imbalanced datasets or when
the cost of misclassification is unequal for different classes.
3. Precision & Recall: Precision and recall are measures of how well the classifier
performs for each class separately. Precision is the fraction of true positives
among all positive predictions, while recall is the fraction of true positives
among all actual positives. They are calculated as precision = TP / (TP + FP) and
recall = TP / (TP + FN), respectively. Precision and recall are important because
precision tells us how trustworthy the classifier's positive predictions are, while
recall tells us how many of the actual positive examples the classifier finds.
4. ROC Curve: The receiver operating characteristic (ROC) curve is a graphical
representation of the trade-off between true positive rate (TPR = recall) and
false positive rate (FPR = 1 - TNR = 1 - specificity) for different threshold values
used to classify data points as positive or negative. The area under the ROC
curve (AUC) is a single metric that summarizes the overall performance of the
classifier, with higher values indicating better performance. The ROC curve is
particularly useful for imbalanced datasets or when the cost of misclassification
is unequal for different classes, as it allows us to evaluate the classifier's ability
to distinguish between positive and negative examples while controlling for
false positives.
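These measures are all available in scikit-learn. A minimal sketch on a small
synthetic binary problem (the dataset here is generated, not real):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Small synthetic binary classification problem
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

print(confusion_matrix(y_test, y_pred))  # [[TN, FP], [FN, TP]]
print(accuracy_score(y_test, y_pred))    # (TP + TN) / total
print(precision_score(y_test, y_pred))   # TP / (TP + FP)
print(recall_score(y_test, y_pred))      # TP / (TP + FN)
print(roc_auc_score(y_test, y_scores))   # area under the ROC curve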
ADVANCED PYTHON: NUMPY, PANDAS
NumPy and Pandas are two popular libraries in Python that are widely used for
data manipulation, analysis, and visualization. Here's a brief overview of their
features:
1. NumPy: NumPy (Numerical Python) is a library for the Python programming
language, adding support for large, multi-dimensional arrays and matrices,
along with a large collection of high-level mathematical functions to operate
on these arrays. Some key features of NumPy include:
N-dimensional array objects (ndarrays)
Fast arithmetic operations on arrays
Broadcasting (automatic extension of arrays to match the shape required by
an operation)
Ufunc (universal functions) for element-wise mathematical operations
Linear algebra functions (matrix multiplication, determinant,
eigenvalues/vectors, etc.)
Fourier transforms (FFT)
Random number generation
2. Pandas: Pandas is a library for data manipulation and analysis in Python. It
provides data structures designed to make working with structured and
labeled data easier and more efficient. Some key features of Pandas include:
DataFrame: A 2D labeled data structure with columns of potentially
different data types. It is a flexible and generalized extension of the concept
of a spreadsheet or SQL table.
Series: A 1D labeled data structure with an index and values of any data type.
It is similar to a column in a DataFrame or a vector in NumPy.
Index: A label or key for rows or columns in a DataFrame or Series. It can be
integer, float, or string based, and can be used to efficiently slice and index
DataFrames and Series.
Merging and joining: Pandas provides efficient methods for merging and
joining multiple DataFrames based on common columns or keys.
Time series: Pandas has built-in support for time series data, including
resampling, rolling window calculations, and date/time handling functions.
Grouping: Pandas provides powerful grouping functions that allow us to
group rows based on one or more columns and perform aggregate
calculations on each group.
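As a brief illustration of both libraries, here is a minimal sketch (the data is
made up):

import numpy as np
import pandas as pd

# NumPy: ndarrays, broadcasting, ufuncs, linear algebra, random numbers
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([10.0, 20.0])
print(a + b)             # broadcasting: b is stretched across the rows of a
print(np.exp(a))         # ufunc: element-wise exponential
print(a @ a)             # matrix multiplication
print(np.linalg.det(a))  # determinant
print(np.random.default_rng(0).normal(size=3))  # random number generation

# Pandas: DataFrame, Series, boolean indexing, grouping
df = pd.DataFrame({
    "city": ["Pune", "Mumbai", "Pune", "Mumbai"],
    "sales": [100, 250, 150, 300],
})
print(df["sales"])                        # a Series (one labeled column)
print(df[df["sales"] > 120])              # boolean indexing
print(df.groupby("city")["sales"].sum())  # grouping and aggregation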
PYTHON MACHINE LEARNING LIBRARY: SCIKIT-LEARN
Scikit-Learn is a popular open-source machine learning library for Python. It
provides a wide range of tools for data preprocessing, model selection, and
model fitting, as well as support for grid search and cross-validation. Some key
features of Scikit-Learn include:
1. Supervised Learning: Scikit-Learn provides a variety of supervised learning
algorithms for regression and classification tasks, including linear regression,
logistic regression, decision trees, random forests, support vector machines
(SVMs), and neural networks.
2. Unsupervised Learning: Scikit-Learn also includes a range of unsupervised
learning algorithms for clustering and dimensionality reduction tasks, such as
k-means clustering, hierarchical clustering, principal component analysis (PCA),
and t-distributed stochastic neighbor embedding (t-SNE).
3. Model Selection: Scikit-Learn provides tools for model selection, including
grid search and randomized search to find the best hyperparameters for a
given algorithm.
4. Preprocessing: Scikit-Learn includes functions for data preprocessing, such
as scaling, normalization, feature selection, and missing value imputation.
5. Evaluation: Scikit-Learn provides functions for evaluating the performance of
machine learning models, such as confusion matrix, classification report, and
ROC curve.
6. Cross-validation: Scikit-Learn supports cross-validation techniques such as k-
fold cross-validation, leave-one-out cross-validation (LOOCV), and stratified k-
fold cross-validation for model evaluation and selection.
7. Ensemble Methods: Scikit-Learn includes ensemble methods such as
bagging (bootstrap aggregating) and boosting to improve the performance of
machine learning models by combining multiple weak learners into a strong
learner.
8. Pipelines: Scikit-Learn provides a pipeline interface to chain together
multiple transformers and estimators in a single object for easy model training
and evaluation.
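A minimal sketch tying several of these features together: a pipeline that
chains scaling with an SVM, evaluated by 5-fold cross-validation on the
built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Chain preprocessing and an estimator into a single object
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# 5-fold cross-validation of the whole pipeline
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())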
LINEAR REGRESSION WITH ONE VARIABLE
Linear regression is a statistical technique used to model the relationship
between a dependent variable (y) and an independent variable (x) by fitting a
linear equation to the data. In this case, we'll be discussing linear regression
with one variable.
The formula for linear regression with one variable is:
y = mx + b
where m is the slope of the line, b is the y-intercept, and x is the independent
variable. The coefficient of determination (R²) is used to measure the goodness
of fit of the model.
In Python, we can use the NumPy and SciPy libraries to perform linear
regression with one variable. Here's an example:
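A minimal sketch along those lines, with a small made-up dataset:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Sample dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.0, 9.9])

# Fit the line y = m*x + b; linregress also returns the correlation r
result = stats.linregress(x, y)
m, b, r = result.slope, result.intercept, result.rvalue
print("slope:", m, "intercept:", b, "R^2:", r**2)

# Plot the data and the regression line
plt.scatter(x, y, label="data")
plt.plot(x, m * x + b, color="red", label="fitted line")
plt.legend()
plt.show()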
In this example, we first import the necessary libraries: NumPy, SciPy, and
matplotlib. We then create a sample dataset with x and y values. Next, we fit
the line using SciPy's linregress function, which returns the slope, the
intercept, and the correlation coefficient r (squaring r gives R²), along with
some other statistics. Finally, we plot the data and the regression line using
matplotlib.
LINEAR REGRESSION WITH MULTIPLE VARIABLES
Linear regression with multiple variables is used to model the relationship
between a dependent variable (y) and multiple independent variables (x1, x2, ...,
xn). The formula for linear regression with multiple variables is:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
where b0 is the y-intercept, and bi (i = 1, 2, ..., n) are the coefficients for each
independent variable.
In Python, we can use the NumPy and pandas libraries to perform linear
regression with multiple variables. Here's an example:
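A minimal sketch using NumPy's least-squares solver on a small made-up
dataset:

import numpy as np
import pandas as pd

# Sample dataset with two input variables
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 5.0],
    "y":  [6.1, 6.9, 12.2, 12.8, 17.1],
})

# Design matrix: a column of ones (for the intercept) plus x1 and x2
X = np.column_stack([np.ones(len(df)), df["x1"], df["x2"]])
y = df["y"].to_numpy()

# Solve for b0, b1, b2 jointly by least squares
coeffs, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coeffs
print("intercept:", b0, "b1:", b1, "b2:", b2)

# R^2 from the fitted values
y_hat = X @ coeffs
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
print("R^2:", r2)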
In this example, we first import NumPy and pandas and create a sample
dataset in a pandas DataFrame with x1, x2 and y values. Next, we build a
design matrix containing a column of ones (for the intercept b0) alongside the
x1 and x2 columns, and solve for the coefficients b0, b1 and b2 jointly by least
squares. Fitting all the variables together, rather than running a separate
single-variable regression for each input, correctly accounts for their
combined effect on y. Finally, we compute R² from the fitted values to
measure the goodness of fit.
LOGISTIC REGRESSION
Logistic regression is a statistical analysis technique used to predict the
probability of a binary outcome (dependent variable) based on one or more
independent variables. Let's consider a simple example to understand logistic
regression theoretically.
Let's say we want to predict whether a person will buy a product or not based
on their age and income. We have a dataset with these variables and the binary
outcome (whether the person bought the product or not).
First, we calculate the odds of buying the product for each age and income
level using the formula:
Odds = Probability of buying / Probability of not buying
For example, if the probability of buying for people aged 25 with an income of
$25,000 is 0.6, then the odds would be:
Odds = 0.6 / (1 - 0.6) = 1.5
Next, we calculate the logarithm of these odds using the natural logarithm (ln)
to get the logit values. The logit is the logarithm of the odds and is used as a
linear predictor in logistic regression:
Logit = ln(Odds) = ln(Probability of buying / Probability of not buying)
For example, if the odds for people aged 25 with an income of $25,000 are 1.5,
then the logit would be:
Logit = ln(1.5) ≈ 0.405
Now, we can use linear regression to find the relationship between these logit
values and age and income levels:
Logit = b0 + b1*Age + b2*Income
Here, b0 is the intercept (the logit value when age and income are both zero),
b1 is the coefficient for age, and b2 is the coefficient for income. By
exponentiating both sides of this equation, we recover the odds,
Odds = e^(b0 + b1*Age + b2*Income), which can then be converted to a
probability:
Probability = Odds / (1 + Odds)
= e^(b0 + b1*Age + b2*Income) / (1 + e^(b0 + b1*Age + b2*Income))
This formula gives us the probability of buying by dividing the odds by (1 +
odds). By fitting this model to our data using logistic regression, we can predict
the probability of buying based on age and income levels for new individuals.
This helps us understand which factors are most important in predicting
whether someone will buy our product and how we can target our marketing
efforts accordingly.
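In practice, scikit-learn's LogisticRegression fits these coefficients for us. A
minimal sketch with made-up age and income data (income in thousands of
dollars):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: [age, income in $1000s]; label 1 = bought the product
X = np.array([[25, 25], [30, 40], [45, 80],
              [22, 20], [50, 90], [35, 60]], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1])

model = LogisticRegression(max_iter=1000).fit(X, y)

# intercept_ corresponds to b0; coef_ holds b1 (age) and b2 (income)
print(model.intercept_, model.coef_)

# Predicted probability of buying for a new person aged 28 earning 30,000
print(model.predict_proba([[28, 30]])[:, 1])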
MACHINE LEARNING WITH PYTHON
SEMESTER 5
UNIT - 2
SUPERVISED LEARNING ALGORITHMS:
Supervised learning algorithms are a type of machine learning technique that
use labeled data to train a model to make predictions or decisions for new,
unseen data. Here are some examples of commonly used supervised learning
algorithms:
1. Decision Trees: Decision trees are a popular supervised learning algorithm
used for both classification and regression tasks. They work by recursively
splitting the data into smaller subsets based on the values of its features, until a
decision is reached. The resulting tree can be used to make predictions by
following the path from the root node to a leaf node based on the feature
values of the input data.
2. Tree Pruning: Tree pruning is a technique used to improve the performance
of decision trees by removing unnecessary branches or nodes from the tree.
This helps to reduce overfitting and improve generalization performance.
3. Rule-based Classification: Rule-based classification represents a classifier as
a set of IF-THEN rules; in a decision tree, each path from the root node to a leaf
node corresponds to one such rule. This approach is particularly useful for
interpretability and explainability, as the rules can be easily understood by
humans.
4. Naïve Bayes: Naïve Bayes is a probabilistic algorithm used primarily for
classification tasks. It works by calculating the probability of each class given
the feature values of the input data, based on Bayes' theorem and the "naïve"
assumption that the features are conditionally independent given the class.
The class with the highest probability is then selected as the prediction.
5. Bayesian Networks: Bayesian networks are a type of graphical model used
for probabilistic reasoning and decision making. They consist of a directed
acyclic graph (DAG) representing the conditional dependencies between
variables, along with a set of probability distributions for each variable given its
parents in the graph. Bayesian networks can be used for both classification and
regression tasks, as well as more complex tasks such as causal inference and
uncertainty propagation.
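As a brief illustration, a minimal sketch of a decision tree and naïve Bayes in
scikit-learn, using its built-in iris dataset (here max_depth acts as a simple
form of pre-pruning):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decision tree; limiting the depth keeps the tree small (pre-pruning)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("tree accuracy:", tree.score(X_test, y_test))

# Gaussian naïve Bayes applies Bayes' theorem with independent features
nb = GaussianNB().fit(X_train, y_train)
print("naive Bayes accuracy:", nb.score(X_test, y_test))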
SUPPORT VECTOR MACHINES
Support Vector Machines (SVMs) are a type of supervised learning algorithm
used for both classification and regression tasks. SVMs work by finding the best
hyperplane (a line or plane in a high-dimensional space) that separates the data
into two classes, with the largest margin possible between the hyperplane and
the data points.
The margin is the distance between the hyperplane and the closest data points,
called support vectors. By maximizing the margin, SVMs can achieve better
generalization performance and reduce overfitting.
SVMs can also handle non-linear decision boundaries by mapping the input
data into a higher-dimensional space using a kernel function, such as a
polynomial, radial basis function (RBF), or sigmoid function. This allows SVMs to
separate complex, non-linear decision boundaries.
In classification tasks, SVMs output a label for each new input data point based
on which side of the hyperplane it falls on. In regression tasks (support vector
regression, SVR), the model instead fits a function that keeps most data points
within a margin of tolerance around it and predicts a continuous value from
that function.
SVMs have several advantages over other supervised learning algorithms, such
as their ability to handle high-dimensional input spaces, their robustness to
outliers and noise, and their ability to provide interpretable decision
boundaries. However, SVMs can be computationally expensive for large
datasets due to their optimization requirements.
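A minimal sketch with scikit-learn's SVC and an RBF kernel on a toy
non-linear dataset:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A toy dataset that is not linearly separable
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional space
clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(len(clf.support_))  # number of support vectors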
K-NEAREST NEIGHBOUR
Imagine you're lost in a forest and need to find your way to a specific tree. You
wouldn't draw a map, right? Instead, you'd look around and ask nearby trees
which way to go.
k-Nearest Neighbours (k-NN) is like that! It's a simple way to make predictions
in machine learning. Here's how:
1. Imagine your data points as trees: Each point has features (like color, size) and
a label (like type of tree).
2. You're lost: You have a new data point (a lost person) with features but no
label (no idea what tree they are).
3. Ask the neighbours: Find the k closest trees (k-NN) to the lost person based
on their features.
4. Follow the majority: Look at the labels of the k closest trees. The most
common label becomes the prediction for the lost person!
Example:
Imagine you want to classify fruits as apples or oranges. You have a basket of
fruits with features like size and color. When you encounter a new fruit, k-NN
would:
* Find the 3 closest fruits (k=3) in the basket based on size and color.
* If 2 are apples and 1 is an orange, the new fruit is likely an apple!
k-NN is simple but powerful for many tasks, such as image recognition and
recommendation systems. It's like asking your friends for advice when you're
unsure, just in a more mathematical way.
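A minimal sketch of the fruit example with scikit-learn (the feature values
are made up):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up fruits: [size in cm, colour score (0 = green/red, 1 = orange)]
X = np.array([[7.0, 0.20], [7.5, 0.30], [8.0, 0.25],   # apples
              [6.5, 0.90], [7.0, 0.95], [6.8, 0.85]])  # oranges
y = np.array(["apple", "apple", "apple", "orange", "orange", "orange"])

# Ask the 3 nearest neighbours and take the majority vote
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[7.2, 0.30]]))  # likely 'apple'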
For a deeper treatment, refer to: Machine Learning | K-Nearest Neighbor (KNN).
ENSEMBLE LEARNING
Ensemble learning is a machine learning technique that combines multiple
machine learning models to improve the accuracy and robustness of
predictions. Ensemble methods can be used for both classification and
regression tasks.
There are several types of ensemble learning algorithms:
1. Bagging (Bootstrap Aggregating): Bagging is a technique that creates
multiple models on different subsets of the training data, called bootstrap
samples, and combines their predictions using a voting or averaging scheme.
This helps to reduce overfitting and improve the stability of the model.
2. Boosting: Boosting is a technique that iteratively trains multiple models, with
each subsequent model focusing on the misclassified or mispredicted
examples from the previous model. This helps to improve the accuracy of the
model by focusing on the most difficult examples.
3. Stacking: Stacking is a technique that combines multiple models trained on
different feature subsets or different algorithms, and uses a meta-model to
combine their predictions. This helps to improve the accuracy and robustness
of the model by leveraging the strengths of multiple algorithms.
4. Gradient Boosting: Gradient Boosting is a type of boosting algorithm that
iteratively adds new trees to the model, with each tree focusing on correcting
the errors of the previous tree. This helps to improve the accuracy and
interpretability of the model, as each tree can be interpreted as a correction to
the previous tree's errors.
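A minimal sketch of bagging and boosting with scikit-learn, using its built-in
breast cancer dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: many trees on bootstrap samples, combined by voting
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=0)
print(cross_val_score(bagging, X, y, cv=5).mean())

# Boosting: trees added sequentially, each correcting the previous errors
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)
print(cross_val_score(boosting, X, y, cv=5).mean())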
RANDOM FOREST ALGORITHM
The Random Forest algorithm is a versatile and powerful tool in the machine
learning toolbox. Imagine it like a team of detectives, each with their own
unique approach to solving a mystery.
Here's the basic idea:
1. Building the Forest
Instead of relying on one detective, you gather a whole team (the forest).
Each detective builds their own "case file" (decision tree) based on a random
subset of features and data points from the investigation. This diversity
prevents any one detective from getting stuck on irrelevant details.
2. Investigating the Evidence:
When you encounter a new "suspect" (data point), each detective examines
it through their unique lens, asking different questions and analyzing
different clues (features).
3. Collaborating for Answers:
No single detective has all the answers, so they share their findings. Each
tree makes a prediction (e.g., guilty or innocent) based on its own analysis.
The final verdict? The majority vote wins! The most common prediction from
the forest becomes the overall prediction for the new data point.
Benefits of the Random Forest:
- Accuracy Boost: By combining diverse perspectives, the forest reduces the
risk of overfitting and improves overall accuracy compared to single decision
trees.
- Robustness: Even if some detectives make mistakes, the majority vote can
still lead to a correct conclusion.
- Versatility: The forest can handle various tasks, including classification (e.g.,
spam vs. not spam) and regression (e.g., predicting house prices).
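A minimal sketch with scikit-learn, using its built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 "detectives": each tree sees a bootstrap sample and random features
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))  # majority-vote accuracy
print(forest.feature_importances_)   # which clues mattered most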
MACHINE LEARNING WITH PYTHON
SEMESTER 5
UNIT - 3
ARTIFICIAL NEURAL NETWORKS
Artificial Neural Networks (ANN) are a type of machine learning model
inspired by the structure and functioning of the human brain.
They consist of interconnected artificial neurons, also known as nodes or
units, that work together to process and learn from data.
Each node takes in input values, applies certain transformations to them,
and produces an output value.
The connections between the nodes are represented by weights, which
determine how the input values influence the output values.
Training an ANN involves adjusting these weights based on the input data
and the desired output.
ANN can be used for a variety of tasks, such as classification, regression, and
pattern recognition.
Let's take an example of recognizing handwritten digits using an ANN. Each
pixel of the digit image is treated as an input value, and the weights
determine how strongly each pixel contributes to identifying the digit. The
output of the ANN would be the digit class with the highest probability.
A popular Python library for implementing ANNs is TensorFlow, which
provides useful tools and functions to build, train, and evaluate neural
networks.
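As a brief illustration, a minimal sketch of the handwritten-digit example
using TensorFlow's Keras API and its built-in MNIST dataset (the layer sizes
and number of epochs here are arbitrary choices):

import tensorflow as tf

# Load the handwritten digit dataset (28x28 grayscale images, labels 0-9)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

# Each pixel is an input; the output layer gives one probability per digit
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=3)
model.evaluate(x_test, y_test)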
HEBBNET
HebbNet, also known as Hebbian Networks, is a type of artificial neural network
(ANN) that follows the Hebbian learning rule.
The Hebbian learning rule states that "neurons that fire together wire
together". It means that when two connected neurons have correlated
activities, the strength of their connection is increased. This rule captures
the idea that the connections between neurons are strengthened through
repeated activation.
In HebbNet, the weights between neurons are adjusted based on the
correlation of their activities. If two connected neurons are active at the
same time, the weight between them is increased. Conversely, if their
activities are opposed, the weight is decreased.
HebbNet can learn to recognize patterns by detecting and strengthening
the correlations between the input features. For example, if a network is
trained on a dataset of images, it can learn to recognize common features
and patterns shared by different images.
Compared to other ANN architectures, HebbNet is relatively simple and
biologically plausible. However, it also has limitations. It can easily overfit the
training data, and it may struggle with handling noisy or incomplete inputs.
Implementation of HebbNet using Python would require defining the
network structure, initializing the weights, providing the input patterns, and
updating the weights based on the Hebbian learning rule.
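A minimal sketch of such an implementation in NumPy, using the classic
textbook setup of learning the logical AND function with bipolar (+1/-1)
values:

import numpy as np

# Hebb rule: strengthen weights between units that are active together
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
t = np.array([1, -1, -1, -1])  # AND function in bipolar form

w = np.zeros(2)
b = 0.0
for x_i, t_i in zip(X, t):
    w += x_i * t_i  # "neurons that fire together wire together"
    b += t_i

print("weights:", w, "bias:", b)
print("outputs:", np.sign(X @ w + b))  # reproduces t for this example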
PERCEPTRON
A perceptron is a simple type of artificial neural network (ANN) that can be
used for binary classification tasks.
The perceptron consists of a single artificial neuron that takes in multiple
inputs, applies weights to them, and produces an output.
Each input is multiplied by its corresponding weight, and these weighted
inputs are summed up.
The summed value is then passed through an activation function, typically a
step function or a sigmoid function, to produce a binary output (0 or 1).
The activation function helps in separating the inputs into two classes and
making a decision based on the calculated value.
Initially, the weights of the perceptron are randomly assigned, and the
model learns by iteratively adjusting the weights based on the correctness
of its predictions.
The adjustment of weights is done using a learning rule called the
perceptron learning rule, which updates the weights proportionally to the
input data and the error in prediction.
To train a perceptron, labeled training data is required, consisting of input
features and corresponding target labels (0 or 1).
The training process involves presenting the input data to the perceptron,
comparing the predicted output with the target output, and updating the
weights accordingly.
Once trained, a perceptron can be used to classify new unseen data by
feeding it through the trained model and observing the output.
Example: Let's say we have a perceptron model that aims to classify whether an
email is spam or not. The input features could include the number of words, the
presence of specific keywords, etc. The perceptron would learn the weights
based on labeled training data (emails labeled as spam or not). Then, when
given a new email input, the perceptron will predict whether it is spam or not
based on the learned weights and the activation function.
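A minimal sketch of the perceptron learning rule in NumPy, with made-up
spam-like features (word count in hundreds, plus a keyword flag):

import numpy as np

# Made-up data: [word_count/100, keyword_flag]; label 1 = spam, 0 = not spam
X = np.array([[2.0, 1], [1.5, 1], [0.5, 0], [0.3, 0]])
y = np.array([1, 1, 0, 0])

w = np.zeros(X.shape[1])  # weights start at zero (random also works)
b = 0.0
lr = 0.1  # learning rate

for epoch in range(10):
    for x_i, y_i in zip(X, y):
        y_pred = 1 if x_i @ w + b > 0 else 0  # step activation
        error = y_i - y_pred
        w += lr * error * x_i  # perceptron learning rule
        b += lr * error

print(w, b)
print([1 if x @ w + b > 0 else 0 for x in X])  # should match y here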
ADALINE
Adaline (Adaptive Linear Neuron) is another type of artificial neural network
(ANN) that is closely related to the perceptron. While the perceptron focuses on
binary classification, Adaline is primarily used for regression tasks or continuous
output prediction.
Here are some key characteristics of Adaline:
Adaline consists of a single-layer neural network with one or more input
neurons, a weight adjustment mechanism, and a linear activation function.
Unlike the perceptron, Adaline does not use a step function or sigmoid
function for activation; instead, it uses the identity (linear) activation
function.
Training Adaline involves a continuous weight adjustment process that
minimizes the difference between the predicted output and the desired
output value.
The key learning algorithm for Adaline is the Widrow-Hoff rule, also known
as the Least Mean Squares (LMS) algorithm or Delta Rule.
The LMS algorithm updates the weights in a way that minimizes the mean
squared error between the predicted output and the actual output.
The weight adjustments are conducted based on the gradient descent
method, where the weights are modified in the opposite direction of the
gradient to approach the optimal solution.
Adaline networks are capable of learning linear relationships between input
variables and the continuous output variable, making them suitable for
regression problems.
Similar to the perceptron, Adaline's weights are initialized randomly before
training and they are updated iteratively until the model converges or
reaches a desired level of accuracy.
Once Adaline is trained, it can be used to predict the output for new input
data by applying the learned weights and the linear activation function.
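A minimal sketch of the LMS (delta) rule in NumPy, on a small noiseless
dataset that follows y = 2x + 1:

import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # underlying relation: y = 2x + 1

w = np.zeros(1)
b = 0.0
lr = 0.05

for epoch in range(100):
    for x_i, y_i in zip(X, y):
        output = x_i @ w + b   # linear (identity) activation
        error = y_i - output
        w += lr * error * x_i  # Widrow-Hoff / LMS update
        b += lr * error

print(w, b)  # approaches w = 2, b = 1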
MULTILAYER NEURAL NETWORK
A multilayer neural network, also known as a multi-layer perceptron (MLP), is a
type of artificial neural network (ANN) that consists of multiple layers of
interconnected neurons. It is a nonlinear model that can be used for both
classification and regression tasks.
Here are some key characteristics of a multilayer neural network:
Structure: A multilayer neural network typically consists of an input layer,
one or more hidden layers, and an output layer. Each layer is composed of
several neurons or nodes.
Neuron connectivity: In a multilayer neural network, neurons are organized
in a layered manner, with each neuron in one layer connected to every
neuron in the previous layer and every neuron in the next layer.
Activation function: Each neuron in the hidden layers and the output layer
applies an activation function to its inputs, transforming the input values
into a more useful form. Common activation functions used in MLPs include
sigmoid, hyperbolic tangent (tanh), and rectified linear unit (ReLU).
Training: The weights and biases of the network are initially assigned
randomly, and the network goes through a training process to adjust these
parameters. Backpropagation, a popular training algorithm, is commonly
used in MLPs. It calculates the gradient of the network's error with respect to
its weights and biases and then updates these parameters iteratively to
minimize the error.
Learning nonlinearity: The presence of hidden layers and the use of
nonlinear activation functions enable multilayer neural networks to learn
complex nonlinear relationships between the input and output variables.
Universal approximator: A well-designed multilayer neural network with
sufficient hidden neurons can approximate any continuous function to
arbitrary precision, given enough training data and training time.
Overfitting: Multilayer neural networks are prone to overfitting, which occurs
when the model becomes too complex and starts to remember the noise
and specifics of the training data instead of learning general patterns.
ARCHITECTURE
The architecture of a neural network refers to the structure and arrangement of
its layers, neurons, and connections. It determines how information flows
through the network and how computations are performed.
Here are some key elements of neural network architecture:
1. Input Layer: The input layer accepts the input data and passes it to the next
layer. The number of neurons in the input layer depends on the number of
input features or variables.
2. Hidden Layers: Hidden layers are intermediary layers between the input and
output layers. They play a crucial role in capturing and learning complex
patterns and representations from the input data. The number of hidden layers
and the number of neurons in each hidden layer are design choices and
depend on the specific problem and data.
3. Output Layer: The output layer produces the final predictions or outputs of
the neural network. The number of neurons in the output layer depends on the
type of task. For example, in binary classification, there may be a single neuron
with a sigmoid activation function, while in multi-class classification, there may
be multiple neurons with softmax activation.
4. Neurons and Activation Functions: Neurons are the fundamental units in a
neural network that compute and transmit information. Each neuron applies
an activation function to its input to produce an output. Common activation
functions include sigmoid, hyperbolic tangent (tanh), and rectified linear unit
(ReLU). The choice of activation function depends on the nature of the problem
and the desired properties of the network.
5. Connections and Weights: Connections are the links or pathways through
which information flows between neurons. Each connection is associated with
a weight, which determines the strength or importance of the connection.
During the training process, the weights are adjusted using optimization
algorithms like backpropagation in order to minimize the error or loss of the
network.
6. Bias Units: Bias units are additional neurons that provide an offset or bias to
the input of a neuron in the next layer. They help in better representation and
flexibility of the learned models.
7. Dropouts: Dropouts are a regularization technique used in deep neural
networks to mitigate overfitting. Randomly selected neurons are ignored, or
their outputs are set to zero, during training. This prevents the network from
relying too heavily on any individual neuron or feature.
8. Architectural Choices: The design of a neural network architecture depends
on the specific problem and data. Various architectural choices, such as the
number of layers, number of neurons, type of activation functions,
regularization techniques, and optimization algorithms, impact the network's
capacity to learn and its generalization ability.
Example: A common architecture for a simple neural network might have an
input layer with the number of neurons equal to the number of input features,
followed by one or more hidden layers with varying numbers of neurons, and
finally, an output layer with the number of neurons corresponding to the
number of classes in a classification task.
ACTIVATION FUNCTION
Activation functions are an essential component of neural networks. They
introduce nonlinearity into the network, allowing it to learn and model
complex relationships between inputs and outputs. Here are some commonly
used activation functions:
1. Sigmoid Activation Function: The sigmoid function applies a sigmoid curve
to squash the input values between 0 and 1. It has a smooth and continuous
output and is useful for binary classification tasks. However, its gradient
saturates for large input values, which can lead to slow learning.
Formula: f(x) = 1 / (1 + exp(-x))
2. Hyperbolic Tangent (tanh) Activation Function: The tanh function is similar
to the sigmoid function but squashes the input values between -1 and 1. It also
suffers from the saturation problem, but its output is zero-centered, leading to
faster convergence compared to the sigmoid function.
Formula: f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
3. Rectified Linear Unit (ReLU) Activation Function: ReLU is a widely used
activation function for deep neural networks. It returns the input as the output
if the input is positive, and zero otherwise. ReLU overcomes the saturation
problem and speeds up training compared to sigmoid and tanh functions.
However, it can cause dead neurons, i.e., neurons that output zero and stop
learning any further.
Formula: f(x) = max(0, x)
4. Leaky ReLU Activation Function: Leaky ReLU is a modification of the ReLU
function that solves the dead neurons issue. It introduces a small slope for
negative inputs, allowing backward gradients to flow even for negative inputs.
Formula: f(x) = max(0.01x, x) (or any small positive value instead of 0.01)
5. Softmax Activation Function: Softmax is commonly used in multi-class
classification tasks. It converts a vector of real numbers into a probability
distribution by exponentiating and normalizing the values. Each output
represents the probability of the corresponding class.
Formula: f(x_i) = exp(x_i) / sum(exp(x_j)) for each element x_i in the input
vector x
6. Linear Activation Function: The linear function simply computes the
weighted sum of the inputs without any nonlinearity. It is commonly used in
regression tasks or as the output activation function when the network needs
to predict continuous values.
Formula: f(x) = x
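Minimal NumPy implementations of these functions, as a sketch:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x), softmax(x), sep="\n")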
LOSS FUNCTION
A loss function, also known as a cost function or objective function, measures
the discrepancy between the predicted output of a model and the true target
values. It quantifies how well the model is performing and guides the learning
process by updating the model's parameters to minimize the loss.
The choice of the loss function depends on the type of task being solved. Here
are some commonly used loss functions for different types of problems:
1. Mean Squared Error (MSE): MSE is a common loss function for regression
problems, where the aim is to predict continuous values. It calculates the
average squared difference between the predicted and true values.
Formula: MSE = (1/n) * sum((y_pred - y_true)^2)
2. Mean Absolute Error (MAE): MAE is another loss function for regression tasks.
It calculates the average absolute difference between the predicted and true
values. MAE provides a more interpretable measure as compared to MSE.
Formula: MAE = (1/n) * sum(abs(y_pred - y_true))
3. Binary Cross-Entropy (or Logistic Loss): Binary cross-entropy is commonly
used in binary classification tasks, where the aim is to predict one of two
classes. It measures the dissimilarity between the predicted probabilities and
the true binary labels.
Formula: BCE = - ( y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred) )
4. Categorical Cross-Entropy: Categorical cross-entropy is used in multi-class
classification problems, where more than two classes need to be predicted. It
calculates the average of the cross-entropy loss for each class.
Formula: CCE = - sum( y_true * log(y_pred) )
5. Sparse Categorical Cross-Entropy: Sparse categorical cross-entropy is similar
to categorical cross-entropy but is used when the target labels are integers
instead of one-hot-encoded vectors. This is commonly used when the number
of classes is large.
6. Hinge Loss: Hinge loss is used in binary and multi-class classification tasks,
particularly in support vector machines (SVMs). It encourages correct
classification by penalizing misclassifications based on a margin.
Formula: Hinge Loss = max(0, 1 - y_true * y_pred)
7. Kullback-Leibler Divergence (KL Divergence): KL divergence is a loss function
used in variational autoencoders (VAEs) and generative models. It measures the
difference between two probability distributions, typically the predicted
distribution and a target distribution.
Formula: KL Divergence = sum( P(x) * log(P(x) / Q(x)) )
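Minimal NumPy implementations of a few of these losses, as a sketch:

import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

def hinge(y_true, y_pred):
    # y_true in {-1, +1}; y_pred is the raw (unthresholded) score
    return np.mean(np.maximum(0, 1 - y_true * y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(mse(y_true, y_pred), mae(y_true, y_pred),
      binary_cross_entropy(y_true, y_pred))
print(hinge(np.array([1, -1, 1]), np.array([0.8, -0.5, 0.3])))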
HYPER PARAMETERS
Hyperparameters are the configuration parameters of a machine learning
model that are not learned from the data but set before the training process.
These parameters control the behavior and performance of the model and are
typically set by the user or determined through trial and error. Examples of
hyperparameters include:
1. Learning Rate: This hyperparameter controls the step size or rate at which the
model updates its parameters during training. A larger learning rate can lead to
faster convergence, but if it is too large, the model may overshoot the optimal
solution. Conversely, a too small learning rate can slow down or prevent
convergence.
2. Number of Hidden Layers and Neurons: The architecture of a neural network
consists of hidden layers and neurons. The number of hidden layers and
neurons in each layer is a hyperparameter that defines the capacity and
complexity of the network. Increasing the number of layers and neurons can
potentially improve the model's ability to capture complex patterns, but it can
also increase overfitting and slow down training.
3. Activation Functions: Different activation functions (such as sigmoid, tanh,
ReLU) can be chosen for different layers of a neural network. The selection of
activation functions is a hyperparameter that influences the nonlinearity and
expressiveness of the model.
4. Regularization Parameters: Regularization techniques such as L1 or L2
regularization are used to prevent overfitting by penalizing large parameter
values. The hyperparameters associated with regularization methods, such as
the strength of regularization (lambda), control the trade-off between fitting
the training data and maintaining simplicity in the model.
5. Dropout Rate: Dropout is a regularization technique that randomly drops out
a fraction of neurons during training, preventing the model from becoming too
dependent on any specific set of neurons. The dropout rate hyperparameter
determines the fraction of neurons that will be dropped out.
6. Batch Size: During training, the data is divided into batches for efficiency. The
batch size hyperparameter specifies the number of samples to be processed
before updating the model's parameters. Larger batch sizes can accelerate
training, but smaller batch sizes can lead to better generalization and faster
convergence for certain datasets.
7. Number of Epochs: An epoch is a full pass over the training data. The number
of epochs is a hyperparameter that determines how many times the entire
dataset will be passed through the model during training. Too few epochs may
result in underfitting, while too many epochs can lead to overfitting.
GRADIENT DESCENT
Gradient descent is an optimization algorithm commonly used in machine
learning and deep learning to minimize the loss function and find the optimal
values for the model's parameters. It works by iteratively adjusting the
parameters in the direction of the steepest descent of the loss function.
The main idea behind gradient descent is to compute the gradient of the loss
function with respect to each parameter. The gradient represents the direction
of the steepest ascent of the loss function. Since we want to minimize the loss,
we move in the opposite direction of the gradient, which is the direction of
steepest descent.
Here is a high-level overview of the gradient descent algorithm:
1. Initialize the model's parameters with some initial values.
2. Compute the loss function by applying the current parameters to the training
data.
3. Compute the gradient of the loss function with respect to each parameter.
This is done using techniques like backpropagation in neural networks.
4. Update the parameters by taking a small step in the opposite direction of the
gradient. This step is determined by the learning rate hyperparameter, which
controls the step size or rate of convergence.
5. Repeat steps 2-4 until the loss function converges or a stopping criterion is
met, such as reaching a maximum number of iterations or a desired level of
loss.
Batch gradient descent computes the gradient and updates the parameters
based on the entire training dataset. This can be computationally expensive for
large datasets but can provide a more accurate estimate of the gradient.
Stochastic gradient descent computes the gradient and updates the
parameters for each individual data sample. It is much faster but can result
in noisy updates.
Mini-batch gradient descent is a compromise between batch and stochastic
gradient descent. It computes the gradient and updates the parameters
using a small batch of data samples. It is commonly used because it
combines the benefits of both approaches in terms of faster convergence
and more accurate estimates.
Gradient descent is an iterative process that continues until the loss function is
minimized or until a stopping criterion is met. By updating the parameters in
the direction of the negative gradient, the algorithm slowly converges towards
the optimal parameter values that minimize the loss function and improve the
model's performance.
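A minimal sketch of batch gradient descent in NumPy, fitting a straight line
by minimizing MSE (the learning rate and step count are arbitrary choices):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])  # roughly y = 2x + 1

m, b = 0.0, 0.0  # initialize parameters
lr = 0.02        # learning rate (step size)

for step in range(2000):
    y_pred = m * X + b
    # Gradients of the MSE loss with respect to m and b
    grad_m = -2 * np.mean((y - y_pred) * X)
    grad_b = -2 * np.mean(y - y_pred)
    # Step in the direction opposite the gradient
    m -= lr * grad_m
    b -= lr * grad_b

print(m, b)  # close to 2 and 1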
BACKPROPAGATION
Backpropagation, short for "backward propagation of errors," is a key algorithm
used to train artificial neural networks. It is a technique to compute the
gradients of the loss function with respect to the parameters of a neural
network efficiently. These gradients are then used to update the parameters
during the optimization process, typically using gradient descent.
The backpropagation algorithm consists of two main steps: forward
propagation and backward propagation.
1. Forward Propagation:
During forward propagation, the input data is passed through the neural
network, layer by layer, to compute the predicted output. Each layer
performs a weighted sum of the inputs, applies an activation function, and
passes the result to the next layer.
The computed output is compared with the ground truth labels to calculate
the loss (or cost) function, which quantifies the discrepancy between the
predicted and actual outputs.
2. Backward Propagation:
In backward propagation, the goal is to compute the gradients of the loss
function with respect to the parameters of the neural network.
Initially, the gradient of the loss with respect to the final layer's activations is
calculated using the derivative of the loss function.
Then, starting from the last layer and moving backward through the
network, the gradients of the loss with respect to the parameters of each
layer are computed using the chain rule and the calculated gradients from
the subsequent layers.
The obtained gradients are used to update the parameters using an
optimization algorithm such as gradient descent.
The backpropagation algorithm can be visualized as propagating the error
calculated at the output of the network backward through each layer. The
errors are distributed to each neuron in proportion to their contribution to the
overall error.
Backpropagation commonly employs the gradient of the activation function
and the chain rule to calculate the gradients efficiently. Different activation
functions have different derivatives, so their choice can impact the
performance of backpropagation. Additionally, techniques like regularization
and dropout can be incorporated during the backward pass for more robust
training.
Backpropagation continues to iterate through the forward and backward
passes, adjusting the parameters using the calculated gradients, until the loss
function is minimized and the neural network achieves the desired
performance on the training data.
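A minimal sketch of both passes in NumPy: a tiny one-hidden-layer network
trained on the XOR problem with a squared-error loss (the layer size,
learning rate, and step count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)  # hidden layer
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)  # output layer
lr = 0.5

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for step in range(10000):
    # Forward propagation
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward propagation (chain rule) for the squared-error loss
    d_out = (out - y) * out * (1 - out)  # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)   # propagated back to the hidden layer

    # Gradient-descent parameter updates
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(out.round(2))  # typically approaches [[0], [1], [1], [0]]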
VARIANTS OF BACKPROPAGATION
There are several variants of backpropagation that have been developed over
the years to improve the training efficiency or address specific challenges in
neural network training. Some of the commonly used variants include:
1. Stochastic Backpropagation:
In traditional backpropagation, the gradients are computed using the entire
training dataset (batch gradient descent). In stochastic backpropagation,
the gradients are computed and updated for each individual training
sample.
This variant is computationally more efficient and works well for large
datasets. However, the updates can be noisy as they are based on individual
samples, which can introduce more variability in the optimization process.
2. Mini-Batch Backpropagation:
Mini-batch backpropagation is a compromise between batch gradient
descent and stochastic backpropagation.
Instead of using the entire training dataset or just a single sample, this
variant computes the gradients and updates the parameters using a small
batch of training samples.
This approach balances the computational efficiency of stochastic
backpropagation and the stability of batch gradient descent.
3. Hessian Backpropagation:
Hessian backpropagation takes into account the curvature information of
the loss function by computing and using the Hessian matrix during the
parameter updates.
The Hessian matrix provides additional information about the shape of the
loss function and can lead to more efficient optimization.
However, computing and manipulating the Hessian matrix can be
computationally expensive, especially for large neural networks.
4. Resilient Backpropagation (RProp):
RProp is a variant of backpropagation that modifies the parameter update
rules to adaptively adjust the step size based on the sign of the gradient.
It adjusts the step size independently for each parameter and can result in
faster convergence and improved robustness compared to plain gradient
descent with a fixed learning rate.
5. Levenberg-Marquardt Algorithm:
The Levenberg-Marquardt algorithm is a specialized method for training
neural networks that blends gradient descent with the Gauss-Newton
method, using gradients computed by backpropagation.
This algorithm uses an approximation of the Hessian matrix, built from
first-order derivatives, to compute efficient weight updates.
It is especially useful in situations where the neural network is trained on
continuous-valued output data and the underlying model is non-linear.
AVOIDING OVERFITTING THROUGH REGULARIZATION
Overfitting occurs when a model learns the training data too well, leading to
poor generalization on new, unseen data. Regularization is a technique used to
prevent overfitting by adding a penalty term to the loss function during
training. This penalty discourages the model from learning complex patterns
that might be specific to the training data but don't generalize well.
Here are some common regularization techniques to avoid overfitting:
1. L1 and L2 Regularization:
L1 and L2 regularization are two widely used techniques.
L1 regularization adds the sum of the absolute values of the model weights
multiplied by a regularization parameter to the loss function. It encourages
the model to have sparse weights, i.e., many weights become close to zero,
effectively performing feature selection.
L2 regularization adds the sum of squares of the model weights multiplied
by a regularization parameter to the loss function. It encourages the model
to have small weights, spreading the influence of each feature across
multiple weights.
Both regularization techniques help prevent overfitting by reducing the
model's reliance on any single feature and mitigating the impact of noisy or
irrelevant features (see the sketch after this list).
2. Dropout:
Dropout is a regularization technique introduced to combat overfitting in
neural networks.
During training, dropout randomly sets a proportion of the outputs of the
neurons in a layer to zero. Essentially, it "drops out" random neurons
temporarily, forcing the model to learn more robust representations.
Dropout acts as a form of ensemble learning, as different subsets of neurons
are active in each forward pass, resulting in multiple different models being
trained simultaneously.
At testing time, the entire network is used without dropout, but the weights
are usually scaled down to account for the missing neurons' effect.
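A minimal NumPy sketch of the idea follows. Most modern implementations use 'inverted' dropout, shown here, which rescales the kept activations at training time so that no adjustment is needed at test time; the layer shape and drop probability are illustrative:

import numpy as np

def dropout_forward(activations, drop_prob, rng, training=True):
    # Inverted dropout: scale kept units at train time; identity at test time
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((4, 5))                     # stand-in for a layer's outputs
print(dropout_forward(h, 0.5, rng))     # roughly half the units zeroed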
3. Early Stopping:
Early stopping is a simple yet effective regularization technique based on
monitoring the model's performance on a validation dataset during training.
The training process is stopped early if the model's performance on the
validation dataset starts deteriorating.
This prevents overfitting by stopping at the point where further training
would begin to hurt generalization.
Early stopping saves computation time as the model does not have to
continue training until convergence but rather stops at an optimal point.
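In scikit-learn, for example, MLPClassifier supports early stopping directly; the sketch below uses synthetic data, and the hyperparameter values are illustrative:

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)

# early_stopping=True holds out validation_fraction of the training data and
# stops once the validation score fails to improve for n_iter_no_change epochs
clf = MLPClassifier(hidden_layer_sizes=(32,), early_stopping=True,
                    validation_fraction=0.1, n_iter_no_change=10,
                    max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.n_iter_)   # number of epochs actually run before stopping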
4. Data Augmentation:
Data augmentation is a technique commonly used in computer vision tasks
to prevent overfitting.
It involves generating additional training examples by applying random
transformations to the existing data, such as random rotations, flips,
translations, or adding noise.
Data augmentation increases the size and diversity of the training dataset,
making the model more robust and less prone to overfitting to specific
patterns in the original data.
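A minimal NumPy sketch, assuming grayscale images stored as arrays with values in [0, 1]; the flip probability and noise scale are illustrative choices:

import numpy as np

def augment(image, rng):
    # Randomly flip horizontally and add Gaussian pixel noise
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]
    out = out + rng.normal(scale=0.05, size=out.shape)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((28, 28))                       # stand-in for an image
extra = [augment(img, rng) for _ in range(4)]    # 4 extra training examples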
5. Batch Normalization:
Batch normalization is a regularization technique often used in deep neural
networks.
It normalizes the intermediate outputs of the network by subtracting the
batch mean and dividing by the batch standard deviation, then applies
learned scale and shift parameters.
Batch normalization reduces the internal covariate shift, helping the model
converge faster and making it less sensitive to the weight initialization.
By providing a form of regularization, batch normalization can also help
prevent overfitting.
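A sketch of the batch-normalization forward pass in NumPy, with gamma and beta standing in for the learned scale and shift parameters (fixed here for simplicity):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch, then apply learned scale/shift
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3))   # ~0 per feature
print(out.std(axis=0).round(3))    # ~1 per feature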
APPLICATIONS OF NEURAL NETWORKS
Image and Video Recognition:
Neural networks excel in image classification, object detection, and
segmentation.
Convolutional Neural Networks (CNNs) stand out, achieving top
performance in computer vision tasks.
Applications span facial recognition, self-driving cars, surveillance, and
medical image analysis.
Natural Language Processing (NLP):
Neural networks are pivotal in language translation, sentiment analysis,
text generation, and speech recognition.
Recurrent Neural Networks (RNNs) and LSTM networks handle
sequential data processing.
NLP applications cover chatbots, machine translation, text-to-speech
synthesis, sentiment analysis, and unstructured text analysis.
Speech and Speaker Recognition:
Neural networks excel in speech recognition and speaker identification
tasks.
Deep neural networks, like CNNs and RNNs, model speech signals and
extract pertinent features.
Applications include automatic speech recognition, voice-controlled
interfaces, speaker verification, and audio transcription.
Recommender Systems:
Neural networks power recommender systems predicting user
preferences for personalized recommendations.
Collaborative Filtering methods like matrix factorization and neural
collaborative filtering analyze user-item interactions, offering precise
suggestions.
Applications encompass product recommendations, movie/music
suggestions, and personalized content delivery.
Financial Forecasting:
Neural networks find applications in financial markets for tasks like stock
market prediction, asset price forecasting, and credit scoring.
Models employing recurrent or feedforward neural networks capture
patterns and trends in financial data.
Applications span stock market prediction, algorithmic trading, credit
risk assessment, fraud detection, and investment portfolio optimization.
UNIT - 4
UNSUPERVISED LEARNING ALGORITHMS:
INTRODUCTION TO CLUSTERING
Clustering is a fundamental technique in unsupervised learning that involves
grouping similar data points together. It is used to explore and uncover
patterns or structures within a dataset without any predefined labels or target
variables. The goal of clustering is to partition the data into distinct groups, or
clusters, such that data points within the same cluster are more similar to each
other than those in different clusters.
K-MEANS CLUSTERING
K-means clustering is a popular and widely used algorithm for partitioning data
into distinct clusters. It is an iterative algorithm that aims to minimize the
within-cluster variance, also known as the sum of squared distances between
data points and their assigned cluster centroid.
Here's how the K-means algorithm works:
1. Initialization:
Begin by choosing the number of clusters, denoted as 'k', that you want to
identify in your data.
Randomly initialize the centroids of these 'k' clusters by selecting 'k' data
points from the dataset.
2. Assignment:
For each data point, calculate the distance between the data point and
each centroid.
Assign the data point to the cluster associated with the nearest centroid.
This is typically done based on Euclidean distance, but other distance
metrics can also be used.
3. Update:
Once all data points are assigned to a cluster, compute the new centroids of
each cluster. The centroids are calculated as the mean of all the data points
assigned to that cluster.
4. Iteration:
Repeat steps 2 and 3 until convergence is achieved. Convergence occurs
when the assignments of data points to clusters no longer change
significantly or when a maximum number of iterations is reached.
5. Result:
After convergence, the K-means algorithm will have identified 'k' clusters,
with each data point belonging to one of the clusters.
The final result will include the cluster centers (representing the centroid of
each cluster) and the assignment of data points to their respective clusters.
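In practice the algorithm is rarely coded from scratch; a typical scikit-learn usage on synthetic data looks like the sketch below (the number of clusters and the random seed are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)   # centroid of each of the 3 clusters
print(km.labels_[:10])       # cluster assignment of the first 10 points
print(km.inertia_)           # within-cluster sum of squared distances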
HIERARCHICAL CLUSTERING
Hierarchical clustering is a clustering algorithm that builds a hierarchical
structure of clusters by iteratively merging or splitting clusters. It does not
require a predefined number of clusters, unlike the K-means algorithm.
Hierarchical clustering assigns a data point to a cluster at each level of the
hierarchy, creating a tree-like structure called a dendrogram.
There are two main types of hierarchical clustering:
Agglomerative Hierarchical Clustering:
Agglomerative clustering initiates by treating each data point as a
distinct cluster.
In subsequent iterations, it merges the two closest clusters based on a
predefined distance measure.
This merging process persists until either all data points unite into a
single cluster or a defined stopping condition is fulfilled.
The outcome is a dendrogram illustrating the merging sequence and
hierarchical relationships among clusters.
The dendrogram allows cutting at a desired similarity level to derive a
specific number of clusters.
Divisive Hierarchical Clustering:
Divisive clustering adopts the contrary strategy compared to
agglomerative clustering.
It commences with all data points grouped into a single cluster.
Through each iteration, it recursively divides a cluster into two based on
a specified criterion.
This division process continues until either each data point forms its own
cluster or a designated stopping condition is met.
Similar to agglomerative clustering, the result is a dendrogram; however,
it displays the splitting process within the hierarchy.
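A short SciPy sketch of agglomerative clustering; the linkage method and the number of clusters used to cut the dendrogram are illustrative choices:

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

Z = linkage(X, method="average")                  # full merge history
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree at 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) plots the tree if matplotlib is available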
KOHONEN SELF-ORGANIZING MAPS
A Kohonen Self-Organizing Map (SOM), also known as a Self-Organizing
Feature Map, is an unsupervised learning algorithm that creates a low-
dimensional representation of high-dimensional data. It was developed by
Teuvo Kohonen in the 1980s.
SOMs are often used for visualizing and analyzing complex, high-dimensional
data. The algorithm maps the input data onto a grid of neurons, each with a
weight vector associated with it. The grid can have any shape, but it is usually
organized as a two-dimensional grid.
Here's how the Kohonen SOM algorithm works:
1. Initialization:
Randomly initialize the weight vectors of the neurons in the grid. These
weight vectors have the same dimensionality as the input data.
2. Training:
Select a random input vector from the dataset.
Compute the Euclidean distance between the input vector and the weight
vectors of all neurons.
Identify the best-matching unit (BMU) - the neuron with the closest weight
vector to the input vector.
3. Update:
Update the weight vectors of the BMU and its neighboring neurons to make
them more similar to the input vector.
The magnitude of the update depends on the learning rate, which is initially
high and gradually decreases over time.
The neighborhood size also decreases over time, allowing the algorithm to
refine the representation.
4. Iteration:
Repeat steps 2 and 3 for a specified number of iterations or until
convergence is achieved.
After training, the SOM represents the data in a low-dimensional grid where
similar input vectors are placed close together. This allows for visual analysis of
the data and identification of clusters or patterns. Each neuron in the SOM grid
can be associated with a specific cluster or category.
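A minimal from-scratch sketch in NumPy, mapping 3-dimensional points (e.g., RGB colors) onto a 10 x 10 grid; the grid size, iteration count, and linear decay schedules are illustrative, and a Gaussian neighborhood function is assumed:

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 3))                    # 500 points in 3 dimensions
grid_h, grid_w = 10, 10
weights = rng.random((grid_h, grid_w, 3))      # one weight vector per neuron

n_iters = 2000
for t in range(n_iters):
    x = data[rng.integers(len(data))]          # pick a random input vector
    # Best-matching unit: neuron whose weight vector is closest to x
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Learning rate and neighborhood radius both decay over time
    lr = 0.5 * (1 - t / n_iters)
    radius = max(1.0, 5.0 * (1 - t / n_iters))
    rows, cols = np.indices((grid_h, grid_w))
    grid_dist2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    influence = np.exp(-grid_dist2 / (2 * radius ** 2))
    # Pull the BMU and its neighbors toward the input vector
    weights += lr * influence[..., None] * (x - weights)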
Some key properties and benefits of Kohonen SOMs include:
1. Dimensionality Reduction: SOMs enable the visualization and exploration of
high-dimensional data by mapping it onto a lower-dimensional grid.
2. Unsupervised Learning: SOMs do not require labeled data and can discover
patterns and relationships in an unsupervised manner.
3. Topological Preservation: SOMs preserve the topological structure of the
input data, meaning nearby input vectors remain close in the SOM grid.
4. Robustness: SOMs are robust to noise, as a noisy input vector will still find its
place within the grid based on its relationship to other vectors.
Kohonen Self-Organizing Maps have been widely used in various domains,
including data visualization, clustering analysis, image compression, anomaly
detection, and pattern recognition. They provide a powerful tool for
understanding complex datasets and finding underlying structures.
IMPLEMENTATION OF UNSUPERVISED ALGORITHMS
Unsupervised learning algorithms are used to discover patterns, clusters, and
relationships in data without the need for labeled examples. Here are some
common implementations of unsupervised algorithms (two of them are
sketched in code after this list):
1. K-means Clustering:
One of the most popular clustering algorithms.
Divides data points into K clusters, where K is a user-specified parameter.
Each data point is assigned to the cluster with the closest centroid.
The algorithm iteratively updates the cluster centroids until convergence.
2. Hierarchical Clustering:
As discussed earlier, this algorithm builds a hierarchy of clusters based on
the similarity between data points.
The choice of linkage method (e.g., single-linkage, complete-linkage,
average-linkage) and distance metric impacts the results.
Dendrograms can be used to visualize the clustering process.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
DBSCAN clusters data points based on proximity and the presence of a
minimum number of neighbors within a specified distance.
It excels in detecting clusters of various shapes and is robust against
noisy data.
Points not assigned to any cluster are considered outliers or noise.
4. Gaussian Mixture Models (GMM):
Assumes data points are generated from a blend of Gaussian
distributions.
Models data by combining K Gaussian distributions, where the value of K
is specified by the user.
The algorithm estimates Gaussian component parameters through an
Expectation-Maximization (EM) approach.
5. Self-Organizing Maps (SOM):
SOMs offer a low-dimensional representation of high-dimensional data
to aid visualization and pattern recognition.
The algorithm maps input data onto a grid of neurons and updates their
weights based on the similarity to the input vectors.
6. Principal Component Analysis (PCA):
A technique for reducing the dimensionality of data, projecting it from a
high-dimensional space to a lower-dimensional one.
It identifies principal components—directions capturing the most
variance in the data.
The lower-dimensional representation is achieved by projecting the data
onto these principal components.
7. t-SNE (t-Distributed Stochastic Neighbor Embedding):
Another dimensionality reduction technique primarily used for
visualization purposes.
It maps high-dimensional data into a two- or three-dimensional space
while preserving pairwise distances between points.
Particularly effective at preserving local structure and clustering patterns
in the data.
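As a brief illustration of two of these, the sketch below runs DBSCAN and a Gaussian mixture model on the two-moons dataset; the eps, min_samples, and n_components values are illustrative:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))            # cluster ids; -1 marks noise points

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X)[:10])        # hard cluster assignments
print(gmm.predict_proba(X)[:3])   # soft (probabilistic) assignments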
FEATURE SELECTION AND DIMENSIONALITY REDUCTION
Feature selection and dimensionality reduction are techniques used to reduce
the number of features or dimensions in a dataset. They are important
preprocessing steps in machine learning, as they can improve model
performance, reduce overfitting, and enhance interpretability. Here's a brief
explanation of feature selection and dimensionality reduction:
Feature Selection:
Feature selection is the process of selecting a subset of relevant features from
the original feature set. The goal is to choose features that contribute the most
to the target variable while discarding irrelevant or redundant features. Benefits
of feature selection include reducing computation time, improving model
interpretability, and mitigating the risk of overfitting.
Popular techniques for feature selection include the following (a brief code
sketch follows the list):
1. Filter Methods:
Use statistical measures like correlation, chi-square, or mutual information
to rank features.
Select the top-ranked features based on a predefined threshold or a fixed
number.
2. Wrapper Methods:
Involve evaluating the performance of different feature subsets using an
external machine learning algorithm.
Search for the best subset of features, typically through a backward or
forward selection process.
3. Embedded Methods:
Incorporate feature selection directly into the learning algorithm.
Model-specific techniques that determine feature importance during the
training phase, e.g., LASSO regularization or decision tree-based importance.
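A brief scikit-learn sketch of all three approaches on synthetic data (the number of features kept, the scoring function, and the regularization strength are illustrative choices):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Filter method: rank features by mutual information, keep the top 5
filt = SelectKBest(mutual_info_classif, k=5).fit(X, y)

# Wrapper method: recursive feature elimination around a logistic regression
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Embedded method: L1-penalized logistic regression zeroes out weak features
emb = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

print(filt.get_support())         # boolean mask of features kept by the filter
print(wrap.support_)              # boolean mask of features kept by RFE
print((emb.coef_ != 0).sum())     # number of features the L1 model kept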
DIMENSIONALITY REDUCTION
Dimensionality reduction aims to decrease the number of dimensions in a
dataset while retaining as much information as possible. It is particularly
useful for high-dimensional data suffering from the curse of dimensionality,
or for visualizing data in lower-dimensional spaces. A brief code sketch
follows the list of techniques below.
Common Dimensionality Reduction Techniques:
a. Principal Component Analysis (PCA):
Identifies orthogonal axes (principal components) capturing
maximum data variance.
Enables projection onto a lower-dimensional space with minimal
information loss.
b. Singular Value Decomposition (SVD):
Decomposes the data matrix into three distinct matrices.
Yields a low-rank approximation that preserves critical features.
c. t-Distributed Stochastic Neighbor Embedding (t-SNE):
Proficient for visualizing high-dimensional data in reduced
dimensions.
Constructs a probability distribution over pairs of high-dimensional
points, learning a lower-dimensional map that retains pairwise
similarities.
d. Autoencoders:
Neural network architectures trained to reconstruct their input from a
compressed intermediate representation.
The bottleneck layer learns a lower-dimensional encoding that can be
used in place of the original features.
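A short scikit-learn sketch applying three of these techniques to the 64-dimensional digits dataset; t-SNE in particular is intended for visualization rather than general-purpose preprocessing:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)    # 1797 samples, 64 features each

X_pca = PCA(n_components=2).fit_transform(X)
X_svd = TruncatedSVD(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

print(X_pca.shape, X_svd.shape, X_tsne.shape)   # all (1797, 2)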
PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis (PCA) is a popular dimensionality reduction
technique used to transform a high-dimensional dataset into a lower-
dimensional space. It identifies the principal components, which are the
directions that capture the maximum variance in the data.
Here is a step-by-step explanation of how PCA works:
1. Standardize the data:
- PCA requires the data to be centered (zero mean).
- If the features are on different scales, it is also recommended to scale
them to unit variance.
2. Compute the covariance matrix:
- The covariance matrix measures the relationship between each pair of
features.
- For a standardized data matrix X with observations in rows, it is computed
as (X^T X) / (n - 1), where n is the number of observations.
3. Compute the eigenvectors and eigenvalues of the covariance matrix:
- The eigenvectors represent the principal components, and the eigenvalues
represent their corresponding variances.
- They are computed using linear algebra techniques.
4. Select the principal components:
- Principal components are chosen based on their corresponding eigenvalues.
- The components with higher eigenvalues capture more variance and are
considered more important.
5. Project the data onto the selected principal components:
- The original data is transformed into a lower-dimensional space by
projecting it onto the selected principal components.
- This is done by taking the dot product of the standardized data matrix with
the eigenvector matrix.
6. Determine the explained variance:
- The proportion of variance explained by each principal component can be
calculated using the corresponding eigenvalues.
- This information helps in understanding how much information is retained
in the lower-dimensional representation.
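The steps above translate almost line for line into NumPy; the sketch below uses synthetic data and keeps k = 2 components as an illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# 1. Standardize: zero mean and unit variance per feature
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix (features in columns)
cov = Xs.T @ Xs / (len(Xs) - 1)

# 3. Eigen-decomposition (eigh, since the covariance matrix is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort components by decreasing eigenvalue and keep the top k
order = np.argsort(eigvals)[::-1]
k = 2
components = eigvecs[:, order[:k]]

# 5. Project the data onto the selected principal components
X_reduced = Xs @ components

# 6. Explained variance ratio of the kept components
explained = eigvals[order[:k]] / eigvals.sum()
print(X_reduced.shape, explained)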