Unit 1 notes
Historical Trends in Deep Learning:
Deep learning has seen three waves of development. The first wave started with
cybernetics in the 1940s-1960s, with the development of theories of biological
learning and implementations of the first models such as the perceptron allowing
the training of a single neuron. The second wave started with the connectionist
approach of the 1980-1995 period, with back-propagation to train a neural network
with one or two hidden layers. The current and third wave, deep learning, started
around 2006.
Deep learning history timeline:
Deep learning is a branch of machine learning (another hot topic) that uses
algorithms to, for example, recognize objects and understand human speech. Scientists
have used deep learning algorithms with multiple processing layers (hence “deep”)
to make better models from large quantities of unlabelled data (such as photos with
no description, voice recordings or videos on YouTube).
It’s one kind of supervised machine learning, in which a computer is provided a
training set of examples to learn a function, where each example is a pair of an
input and an output from the function. Very simply: if we give the computer a
picture of a cat and a picture of a ball, and show it which one is the cat, we can
then ask it to decide if subsequent pictures are cats. The computer compares the
image to its training set and produces an answer. Today’s algorithms can also do this
unsupervised; that is, they don’t need every decision to be pre-programmed.
Of course, the more complex the task, the bigger the training set has to
be. Google’s voice recognition algorithms operate with a massive training set —
yet it’s not nearly big enough to predict every possible word or phrase or question
you could put to it.
But it’s getting there. Deep learning is responsible for recent advances in computer
vision, speech recognition, natural language processing, and audio recognition.
Deep learning is based on the concept of artificial neural networks, or
computational systems that mimic the way the human brain functions. And so, our
brief history of deep learning must start with those neural networks.
1943: Warren McCulloch and Walter Pitts create a computational model for neural
networks based on mathematics and algorithms called threshold logic.
1958: Frank Rosenblatt creates the perceptron, an algorithm for pattern recognition
based on a two-layer computer neural network using simple addition and
subtraction. He also proposed additional layers with mathematical notations, but
these wouldn’t be realised until 1975.
1980: Kunihiko Fukushima proposes the Neocognitron, a hierarchical, multilayered
artificial neural network that has been used for handwriting recognition and other
pattern recognition problems.
1989: Scientists were able to create algorithms that used deep neural networks, but
training times for the systems were measured in days, making them impractical for
real-world use.
1992: Juyang Weng publishes Cresceptron, a method for performing 3-D object
recognition automatically from cluttered scenes.
Mid-2000s: The term “deep learning” begins to gain popularity after a paper by
Geoffrey Hinton and Ruslan Salakhutdinov showed how a many-layered neural
network could be pre-trained one layer at a time.
2009: NIPS Workshop on Deep Learning for Speech Recognition discovers that
with a large enough data set, the neural networks don’t need pre-training, and the
error rates drop significantly.
2012: Artificial pattern-recognition algorithms achieve human-level performance
on certain tasks. And Google’s deep learning algorithm discovers cats.
2014: Google buys UK artificial intelligence startup DeepMind for £400m.
2015: Facebook puts deep learning technology – called DeepFace – into operation
to automatically tag and identify Facebook users in photographs. Algorithms
perform superior face recognition tasks using deep networks that take into account
120 million parameters.
2016: Google DeepMind’s algorithm AlphaGo masters the art of the complex
board game Go and beats the professional Go player Lee Sedol at a highly
publicised tournament in Seoul.
The promise of deep learning is not that computers will start to think like humans.
That’s a bit like asking an apple to become an orange. Rather, it demonstrates that
given a large enough data set, fast enough processors, and a sophisticated enough
algorithm, computers can begin to accomplish tasks that used to be completely left
in the realm of human perception — like recognising cat videos on the web (and
other, perhaps more useful purposes).
Machine learning basics:
Machine Learning is an application of artificial intelligence where a
computer/machine learns from the past experiences (input data) and makes future
predictions. The performance of such a system should be at least human level.
A more technical definition was given by Tom M. Mitchell (1997): “A computer
program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E.” Example:
A handwriting recognition learning problem: Task T: recognizing and classifying
handwritten words within images; Performance measure P: percentage of words
correctly classified (accuracy); Training experience E: a dataset of handwritten
words with given classifications.
Machine Learning Categories
Machine Learning is generally categorized into three types: Supervised Learning,
Unsupervised Learning, Reinforcement learning
Supervised Learning:
In supervised learning the machine experiences the examples along with the labels
or targets for each example. The labels in the data help the algorithm to correlate
the features with the targets.
Two of the most common supervised machine learning tasks
are classification and regression.
In classification problems the machine must learn to predict discrete values. That
is, the machine must predict the most probable category, class, or label for new
examples. Applications of classification include predicting whether a stock's price
will rise or fall, or deciding if a news article belongs to the politics or leisure
section. In regression problems the machine must predict the value of a continuous
response variable. Examples of regression problems include predicting the sales for
a new product, or the salary for a job based on its description.
Unsupervised Learning:
When we have unclassified and unlabeled data, the system attempts to uncover
patterns from the data. There is no label or target given for the examples. One
common task is to group similar examples together, which is called clustering.
Reinforcement Learning:
Reinforcement learning refers to goal-oriented algorithms, which learn how to
attain a complex objective (goal) or maximize along a particular dimension over
many steps. This method allows machines and software agents to automatically
determine the ideal behavior within a specific context in order to maximize its
performance. Simple reward feedback is required for the agent to learn which
action is best; this is known as the reinforcement signal. For example, maximize
the points won in a game over many moves.
Steps used in Machine Learning:
There are 5 basic steps used to perform a machine learning task:
1. Collecting data: Be it raw data from Excel, Access, text files etc., this step
(gathering past data) forms the foundation of the future learning. The better
the variety, density and volume of relevant data, the better the learning prospects
for the machine become.
2. Preparing the data: Any analytical process thrives on the quality of the data
used. One needs to spend time determining the quality of the data and then
taking steps to fix issues such as missing data and the treatment of
outliers. Exploratory analysis is one way to study the nuances of
the data in detail and thereby improve its usefulness.
3. Training a model: This step involves choosing the appropriate algorithm and
representation of the data in the form of a model. The cleaned data is split into
two parts – train and test (the proportion depending on the prerequisites); the
first part (training data) is used for developing the model, and the second part
(test data) is held back as a reference (a short sketch follows this list).
4. Evaluating the model: To test the accuracy, the second part of the data
(holdout / test data) is used. This step determines how well the chosen
algorithm performs, based on the outcome. A better test of a model’s
accuracy is its performance on data that was not used at all during
model building.
5. Improving the performance: This step might involve choosing a different
model altogether or introducing more variables to improve the efficiency.
That’s why a significant amount of time needs to be spent on data collection
and preparation.
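A minimal sketch of steps 3 and 4 with scikit-learn (the library, the Iris dataset and the 80/20 split are illustrative assumptions, not part of these notes):

# Illustrative sketch: split cleaned data, train a model, evaluate it on the held-out part
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # steps 1-2: collected, already-prepared data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(max_depth=3)       # step 3: choose an algorithm and fit it on the training part
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                    # step 4: evaluate on the test (holdout) part
print("test accuracy:", accuracy_score(y_test, y_pred))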
Applications of Machine Learning:
It is very interesting to know the applications of machine learning. Google and
Facebook use ML extensively to push their respective ads to the relevant users.
Here are a few applications that you should know:
• Banking & Financial services: ML can be used to predict which customers
are likely to default on loans or credit card bills. This is of
paramount importance, as machine learning helps banks identify
the customers who can be granted loans and credit cards.
• Healthcare: It is used to diagnose deadly diseases (e.g. cancer) based on the
symptoms of patients, by comparing them with past data from similar
patients.
• Retail: It is used to identify fast-moving and slow-moving products, which
helps retailers decide what kind of products to introduce or remove from the
shelf. Machine learning algorithms can also be used to find which two, three
or more products sell together. This is used to design customer loyalty
initiatives, which in turn helps retailers develop and maintain loyal
customers.
Difference between machine learning and Artificial Intelligence:
o Artificial intelligence is a technology using which we can create intelligent
systems that can simulate human intelligence, whereas Machine learning is a
subfield of artificial intelligence, which enables machines to learn from past
data or experiences.
o Artificial Intelligence is a technology used to create an intelligent system
that enables a machine to simulate human behavior. Whereas, Machine
Learning is a branch of AI which helps a machine to learn from experience
without being explicitly programmed.
o AI aims to build intelligent, human-like computer systems to solve complex
problems, whereas ML is used to obtain accurate predictions from past data
or experience.
o AI can be divided into Weak AI, General AI, and Strong AI, whereas ML
can be divided into Supervised learning, Unsupervised learning, and
Reinforcement learning.
o Each AI agent includes learning, reasoning, and self-correction. Each ML
model includes learning and self-correction when introduced with new data.
o AI deals with Structured, semi-structured, and unstructured data. ML deals
with Structured and semi-structured data.
o Applications of AI: Siri, customer support using chatbots, expert systems,
online game playing, intelligent humanoid robots, etc. Applications of
ML: online recommender systems, Google search algorithms, Facebook auto
friend tagging suggestions, etc.
Learning algorithm: supervised and unsupervised training:
Supervised learning is a machine learning approach that’s defined by its use of
labeled datasets. These datasets are designed to train or “supervise” algorithms into
classifying data or predicting outcomes accurately. Using labeled inputs and
outputs, the model can measure its accuracy and learn over time.
Supervised learning can be separated into two types of problems when data
mining: classification and regression:
• Classification problems use an algorithm to accurately assign test data into
specific categories, such as separating apples from oranges. Or, in the real
world, supervised learning algorithms can be used to classify spam in a
separate folder from your inbox. Linear classifiers, support vector machines,
decision trees and random forest are all common types of classification
algorithms.
• Regression is another type of supervised learning method that uses an
algorithm to understand the relationship between dependent and independent
variables. Regression models are helpful for predicting numerical values
based on different data points, such as sales revenue projections for a given
business. Some popular regression algorithms are linear regression, polynomial
regression and logistic regression (although, despite its name, logistic regression
is mostly used for classification). A brief sketch of both problem types follows.
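An illustrative sketch only (the toy data and the scikit-learn estimators are assumptions made for this example):

# Classification predicts a discrete label; regression predicts a continuous value
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# classification on toy data: labels are 0 or 1
X_cls = np.array([[1.0], [2.0], [3.0], [4.0]])
y_cls = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[2.5]]))        # most probable class for a new example

# regression on toy data: targets are real numbers
X_reg = np.array([[1.0], [2.0], [3.0], [4.0]])
y_reg = np.array([1.9, 4.1, 6.0, 8.1])
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[5.0]]))        # estimated numeric response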
Unsupervised learning uses machine learning algorithms to analyze and cluster
unlabeled data sets. These algorithms discover hidden patterns in data without the
need for human intervention (hence, they are “unsupervised”).
Unsupervised learning models are used for three main tasks: clustering, association
and dimensionality reduction:
• Clustering is a data mining technique for grouping unlabeled data based on
their similarities or differences. For example, K-means clustering algorithms
assign similar data points into groups, where the K value represents the
number of groups and hence the granularity of the clustering (see the short
example after this list). This technique is helpful for market
segmentation, image compression, etc.
• Association is another type of unsupervised learning method that uses
different rules to find relationships between variables in a given dataset.
These methods are frequently used for market basket analysis and
recommendation engines, along the lines of “Customers Who Bought This
Item Also Bought” recommendations.
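A small K-means sketch (scikit-learn and the made-up points are assumptions for illustration):

# K-means groups unlabeled points into K clusters
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster index assigned to each unlabeled point
print(kmeans.cluster_centers_)  # the learned centroids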
Supervised vs. Unsupervised Learning:
• Process: In a supervised learning model, both input and output variables are
given; in an unsupervised learning model, only input data is given.
• Input Data: Supervised algorithms are trained using labeled data;
unsupervised algorithms are used against data which is not labeled.
• Algorithms Used: Supervised – support vector machines, neural networks,
linear and logistic regression, random forests, and classification trees;
Unsupervised – clustering algorithms such as K-means, hierarchical
clustering, etc.
• Computational Complexity: Supervised learning is a simpler method;
unsupervised learning is computationally complex.
• Use of Data: A supervised learning model uses training data to learn a link
between the inputs and the outputs; unsupervised learning does not use
output data.
• Accuracy of Results: Supervised learning is a highly accurate and
trustworthy method; unsupervised learning is less accurate and trustworthy.
• Real-Time Learning: Supervised learning takes place offline; unsupervised
learning can take place in real time.
• Number of Classes: In supervised learning the number of classes is known;
in unsupervised learning it is not known.
• Main Drawback: Classifying big data can be a real challenge in supervised
learning; in unsupervised learning you cannot get precise information
regarding data sorting, because the data is not labeled and the desired
output is not known.
Linear algebra for machine learning:
The first step towards learning Math for ML is to learn linear algebra. Linear
Algebra is the mathematical foundation that solves the problem of representing
data as well as computations in machine learning models. It is the math of arrays —
technically referred to as vectors, matrices and tensors.
Important areas of application that are enabled by linear algebra are:
1. data and learned model representation
2. word embeddings
3. dimensionality reduction
Data Representation : The fuel of ML models, that is data, needs to be converted
into arrays before you can feed it into your models. The computations performed
on these arrays include operations like matrix multiplication (dot product). This
further returns the output that is also represented as a transformed matrix/tensor of
numbers.
Word embeddings: This is about representing large-dimensional data (think of a
huge number of variables in your data) with a smaller-dimensional vector. Natural
Language Processing (NLP) deals with textual data. Dealing with text means
comprehending the meaning of a large corpus of words. Each word represents a
different meaning which might be similar to another word. Vector embeddings in
linear algebra allow us to represent these words more efficiently.
Eigenvectors (SVD): Finally, concepts like eigenvectors allow us to reduce the
number of features or dimensions of the data while keeping the essence of all of
them using something called principal component analysis.
Linear algebra basically deals with vectors and matrices (different shapes of
arrays) and operations on these arrays. In NumPy, vectors are basically a 1-
dimensional array of numbers but geometrically, they have both magnitude and
direction.
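A tiny NumPy illustration of these objects (the particular numbers are arbitrary, chosen only for the example):

# Vectors and matrices as NumPy arrays, plus the dot product used throughout ML
import numpy as np

v = np.array([1.0, 2.0, 3.0])            # a 1-dimensional array: a vector
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])           # a 2-dimensional array: a matrix
print(np.linalg.norm(v))                  # the vector's magnitude (length)
print(M @ v)                              # matrix-vector product (a weighted combination of v's entries)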
Dimensionality Reduction — Vector Space Transformation: When it comes to
embeddings, you can basically think of an n-dimensional vector being replaced
with another vector that belongs to a lower-dimensional space. This is more
meaningful and it's the one that overcomes computational complexities.
For example, a 3-dimensional vector can be replaced by a vector in a 2-dimensional
space. You can extrapolate this to a real-world scenario where you have a very
large number of dimensions.
Reducing dimensions doesn’t mean dropping features from the data. Instead, it's
about finding new features that are linear functions of the original features and
preserving the variance of the original features.
Finding these new variables (features) translates to finding the principal
components (PCs). This then reduces to solving an eigenvector and eigenvalue
problem.
Examples of Linear Algebra in Machine Learning:
1. Datasets and data files
In machine learning, you fit a model to a dataset. A dataset is a table-like set of
numbers where each row represents an observation and each column represents a
characteristic (feature) of the observation.
Below is a fragment of the Iris Flower Dataset:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
…
This data is stored in the main data structure of linear algebra: the matrix. When you
partition the data into inputs and outputs to fit a supervised machine learning
model, mapping the flower measurements to the species, you have a matrix (X) and a
vector (y).
The vector is another important data structure in linear algebra. Each row has the same
length, i.e., the same number of columns, so we can say that the data is
vectorized: rows can be provided to a model one at a time or in batches, and the
model can be pre-configured to expect rows of a fixed width.
2. Images and photos
Perhaps you are accustomed to working with images or photographs in computer
vision applications.
Each image you work with is a table structure with a width and a height and one pixel
value in each cell for black-and-white images, or three pixel values per cell for
color images. A photo is another example of a matrix from linear algebra. Operations on
the image, such as cropping, scaling, shearing, and so on are all described using the
notation and operations of linear algebra.
3. One Hot Encoding
Sometimes you work with categorical data in machine learning: perhaps the class
labels for classification problems, or perhaps categorical input variables. It is
common to encode categorical variables to make them easier to work with and learn
from for some techniques. A popular encoding for categorical variables is the one-hot
encoding. A one-hot encoding is where a table is created to represent the variable
with one column for each category and a row for each example in the dataset. A
check or one-value is added in the column for the categorical value for a given
row, and a zero-value is added to all other columns. For example, consider a
color variable with the 3 rows:
Red
Green
Blue
It can be encoded as follows:
red, green, blue
1, 0, 0
0, 1, 0
0, 0, 1
Each row is encoded as a binary vector, a vector with zero or one values and this is
an example of sparse representation, a whole sub-field of linear algebra.
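A hand-rolled sketch of this encoding with NumPy (the library choice and helper names are assumptions for the example, not from the notes):

# Build a one-hot table: one column per category, one row per example, a single 1 per row
import numpy as np

values = ["Red", "Green", "Blue"]
categories = sorted(set(values))                 # the distinct categories (columns)
one_hot = np.zeros((len(values), len(categories)), dtype=int)
for row, value in enumerate(values):
    one_hot[row, categories.index(value)] = 1    # mark the column of this row's category
print(categories)
print(one_hot)                                   # each row is a sparse binary vector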
4. Linear Regression
Linear regression is an old method from statistics for describing the relationships
between variables. It is often used in machine learning for predicting numerical values
in simpler regression problems. There are several ways to describe and solve the
linear regression problem, i.e., to find a set of coefficients such that, when each input
variable is multiplied by its coefficient and the results are added together, the output
variable is predicted best. If you have used a machine learning tool or library, the most
common way of solving linear regression is via a least-squares optimization that is
solved using matrix factorization methods from linear algebra, such as an LU
decomposition or a singular-value decomposition (SVD). Even the common way
of summarizing the linear regression equation uses linear algebra notation: y = A ·
b, where y is the output variable, A is the dataset and b are the model coefficients.
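A small sketch of solving y = A · b by least squares with NumPy (the numbers are made up for illustration; np.linalg.lstsq uses an SVD-based solver):

# Least-squares solution of y = A · b
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])            # dataset, with a column of ones for the intercept
y = np.array([2.1, 3.9, 6.2, 8.0])    # output variable
b, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(b)                              # the fitted model coefficients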
Testing-cross validation:
In machine learning (ML), generalization usually refers to the ability of an
algorithm to be effective across various inputs. It means that the ML model does
not encounter performance degradation on the new inputs from the same
distribution of the training data.
For human beings generalization is the most natural thing possible. We can classify
on the fly. For example, we would definitely recognize a dog even if we didn’t see
this breed before. Nevertheless, it might be quite a challenge for an ML model.
That’s why checking the algorithm’s ability to generalize is an important task that
requires a lot of attention when building the model.
Cross-validation is a technique for evaluating a machine learning model and testing
its performance. CV is commonly used in applied ML tasks. It helps to compare
and select an appropriate model for the specific predictive modeling problem.
CV is easy to understand, easy to implement, and it tends to have a lower bias than
other methods used to estimate the model’s efficiency scores. All this makes cross-
validation a powerful tool for selecting the best model for the specific task.
There are a lot of different techniques that may be used to cross-validate a model.
Still, all of them have a similar algorithm:
1. Divide the dataset into two parts: one for training, the other for testing
2. Train the model on the training set
3. Validate the model on the test set
4. Repeat steps 1–3 several times. The number of repetitions depends on the CV
method that you are using
Why cross-validation?
CV provides the ability to estimate model performance on unseen data not used
while training.
Data scientists use cross-validation for several reasons when building
Machine Learning (ML) models: for instance, tuning the
model hyper-parameters, testing different properties of the overall dataset, and
iterating the training process. It is also useful when the training dataset is small, and
splitting it into training, validation, and testing sets would significantly affect
training accuracy.
Cross validation techniques:
There are plenty of CV techniques:
• Hold-out
• K-folds
• Leave-one-out
• Leave-p-out
• Stratified K-folds
Hold-out cross-validation: Hold-out cross-validation is the simplest and most
common technique. You might not know that it is a hold-out method but you
certainly use it every day.
The algorithm of hold-out technique:
1. Divide the dataset into two parts: the training set and the test set. Usually,
80% of the dataset goes to the training set and 20% to the test set but you
may choose any splitting that suits you better
2. Train the model on the training set
3. Validate on the test set
4. Save the result of the validation
k-Fold cross-validation: k-Fold cross-validation is a technique that minimizes the
disadvantages of the hold-out method. k-Fold introduces a new way of splitting the
dataset which helps to overcome the “test only once bottleneck”.
The algorithm of the k-Fold technique:
1. Pick a number of folds – k. Usually, k is 5 or 10 but you can choose any
number which is less than the dataset’s length.
2. Split the dataset into k equal (if possible) parts (they are called folds)
3. Choose k – 1 folds as the training set. The remaining fold will be the test set
4. Train the model on the training set. On each iteration of cross-validation,
you must train a new model independently of the model trained on the
previous iteration
5. Validate on the test set
6. Save the result of the validation
7. Repeat steps 3 – 6 k times. Each time use the remaining fold as the test set.
In the end, you should have validated the model on every fold that you have.
8. To get the final score average the results that you got on step 6.
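A minimal k-Fold sketch with scikit-learn, assuming k = 5, the Iris dataset and logistic regression as illustrative choices:

# k-Fold CV: train a fresh model on k-1 folds, validate on the remaining fold, average the scores
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores)          # one validation score per fold
print(scores.mean())   # the final score: the average over all k folds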
Leave-one-out cross-validation: Leave-one-out сross-validation (LOOCV) is an
extreme case of k-Fold CV. Imagine if k is equal to n where n is the number of
samples in the dataset. Such a k-Fold case is equivalent to the Leave-one-out technique.
The algorithm of LOOCV technique:
1. Choose one sample from the dataset which will be the test set
2. The remaining n – 1 samples will be the training set
3. Train the model on the training set. On each iteration, a new model must be
trained
4. Validate on the test set
5. Save the result of the validation
6. Repeat steps 1 – 5 n times as for n samples we have n different training and
test sets
7. To get the final score average the results that you got on step 5.
Leave-p-out cross-validation: Leave-p-out cross-validation (LpOC) is similar
to Leave-one-out CV as it creates all the possible training and test sets by
using p samples as the test set. Everything mentioned about LOOCV is also true for LpOC.
Still, it is worth mentioning that, unlike for LOOCV and k-Fold, the test sets of LpOC
will overlap if p is greater than 1.
The algorithm of LpOC technique:
1. Choose p samples from the dataset which will be the test set
2. The remaining n – p samples will be the training set
3. Train the model on the training set. On each iteration, a new model must be
trained
4. Validate on the test set
5. Save the result of the validation
6. Repeat steps 1 – 5 C(n, p) times, i.e. once for every possible choice of p test
samples out of n
7. To get the final score average the results that you got on step 5
Stratified k-Fold cross-validation: Sometimes we may face a large imbalance of the
target value in the dataset. For example, in a dataset concerning wristwatch prices,
there might be a larger number of wristwatches having a high price. In the case of
classification, in a cats-and-dogs dataset there might be a large shift towards the dog
class.
Stratified k-Fold is a variation of the standard k-Fold CV technique which is
designed to be effective in such cases of target imbalance.
It works as follows. Stratified k-Fold splits the dataset on k folds such that each
fold contains approximately the same percentage of samples of each target class as
the complete set. In the case of regression, Stratified k-Fold makes sure that the
mean target value is approximately equal in all the folds.
The algorithm of Stratified k-Fold technique:
1. Pick a number of folds – k
2. Split the dataset into k folds. Each fold must contain approximately the same
percentage of samples of each target class as the complete set
3. Choose k – 1 folds which will be the training set. The remaining fold will be
the test set
4. Train the model on the training set. On each iteration a new model must be
trained
5. Validate on the test set
6. Save the result of the validation
7. Repeat steps 3 – 6 k times. Each time use the remaining fold as the test set.
In the end, you should have validated the model on every fold that you have.
8. To get the final score average the results that you got on step 6.
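A brief scikit-learn sketch of the stratified split, using a small made-up imbalanced target (the data and the choice of k = 2 are assumptions for illustration):

# Stratified k-Fold keeps the class proportions roughly equal in every fold
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 2)              # imbalanced target: 80% class 0, 20% class 1
skf = StratifiedKFold(n_splits=2)
for train_idx, test_idx in skf.split(X, y):
    print("test fold labels:", y[test_idx])  # each test fold preserves the 4:1 class ratio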
Dimensionality reduction:
Dimensionality reduction is the task of reducing the number of features in a
dataset. In machine learning tasks like regression or classification, there are often
too many variables to work with. These variables are also called features. The
higher the number of features, the more difficult it is to model them; this is known
as the curse of dimensionality. This will be discussed in detail in the next section.
Additionally, some of these features can be quite redundant, adding noise to the
dataset, and it makes no sense to have them in the training data. This is where
the feature space needs to be reduced.
The process of dimensionality reduction essentially transforms data from high-
dimensional feature space to a low-dimensional feature space. Simultaneously, it is
also important that meaningful properties present in the data are not lost during the
transformation.
Dimensionality reduction is commonly used in data visualization to understand and
interpret the data, and in machine learning or deep learning techniques to simplify
the task at hand.
Issues that arise with high dimensional data are:
1. Running a risk of overfitting the machine learning model.
2. Difficulty in clustering similar features.
3. Increased space and computational time complexity.
The curse of dimensionality, first introduced by Bellman, describes how, in order to
estimate an arbitrary function with a certain accuracy, the amount of data required
grows exponentially with the number of features or dimensions. This is especially
relevant for big data, which tends to be sparse.
Sparsity in data is usually referred to as the features having a value of zero; this
doesn’t mean that the value is missing. If the data has a lot of sparse features then
the space and computational complexity increase.
Dimensionality reduction is also used for cleaning the data and for feature extraction.
Advantages of Dimensionality Reduction
• It helps in data compression, and hence reduces the required storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.
• Improved Visualization: High dimensional data is difficult to visualize, and
dimensionality reduction techniques can help in visualizing the data in 2D or
3D, which can help in better understanding and analysis.
• Overfitting Prevention: High dimensional data may lead to overfitting in
machine learning models, which can lead to poor generalization
performance. Dimensionality reduction can help in reducing the complexity
of the data, and hence prevent overfitting.
• Feature Extraction: Dimensionality reduction can help in extracting
important features from high dimensional data, which can be useful in
feature selection for machine learning models.
• Data Preprocessing: Dimensionality reduction can be used as a
preprocessing step before applying machine learning algorithms to reduce
the dimensionality of the data and hence improve the performance of the
model.
• Improved Performance: Dimensionality reduction can help in improving the
performance of machine learning models by reducing the complexity of the
data, and hence reducing the noise and irrelevant information in the data.
Disadvantages of Dimensionality Reduction
• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is sometimes
undesirable.
• PCA fails in cases where mean and covariance are not enough to define
datasets.
• We may not know how many principal components to keep; in practice,
some rules of thumb are applied.
• Interpretability: The reduced dimensions may not be easily interpretable, and
it may be difficult to understand the relationship between the original
features and the reduced dimensions.
• Overfitting: In some cases, dimensionality reduction may lead to overfitting,
especially when the number of components is chosen based on the training
data.
• Sensitivity to outliers: Some dimensionality reduction techniques are
sensitive to outliers, which can result in a biased representation of the data.
• Computational complexity: Some dimensionality reduction techniques, such
as manifold learning, can be computationally intensive, especially when
dealing with large datasets
Principal component analysis:
Principal Component Analysis, or PCA, is a dimensionality-reduction method that
finds a lower-dimensional space while preserving as much of the variance measured
in the high-dimensional input space as possible. It is an unsupervised method for
dimensionality reduction.
PCA transformations are linear transformations. It involves the process of finding
the principal components, which is the decomposition of the feature matrix into
eigenvectors. This means that PCA will not be effective when the distribution of
the dataset is non-linear.
PCA implementation is quite straightforward. We can define the whole process
into just four steps:
• Standardization: The data has to be transformed to a common scale by
subtracting the mean of the whole dataset from the original data. This
makes the distribution zero-centred.
• Finding covariance: The covariance matrix tells us how the features of the
centred data vary together, i.e. the relationships between pairs of features.
• Determining the principal components: Principal components can be
determined by calculating the eigenvectors and eigenvalues of the covariance
matrix. The eigenvectors are a special set of vectors that describe the structure
of the data and become the principal components, while the eigenvalues
measure how much variance each of them explains. The eigenvectors with the
highest eigenvalues form the most important principal components.
• Final output: This is the dot product of the standardized matrix and the
selected eigenvectors. Note that the number of columns (features) is reduced.
Reducing the number of variables of a dataset reduces complexity, but it usually
costs some accuracy. However, with a smaller number of features the data is easier to
explore, visualize and analyze, and machine learning algorithms become
computationally less expensive. In simple words, the idea of PCA is to reduce the
number of variables of a data set, while preserving as much information as possible.
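The four steps above, written out as a NumPy sketch on made-up 2-D data (the data and the choice to keep a single component are assumptions for the example):

# PCA by hand: centre, covariance, eigendecomposition, projection
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

X_centered = X - X.mean(axis=0)             # 1. standardization: make the data zero-centred
cov = np.cov(X_centered, rowvar=False)      # 2. covariance matrix of the features
eig_vals, eig_vecs = np.linalg.eigh(cov)    # 3. eigenvalues and eigenvectors
order = np.argsort(eig_vals)[::-1]          #    highest eigenvalue = most important component
components = eig_vecs[:, order[:1]]         #    keep only the top principal component
X_reduced = X_centered @ components         # 4. final output: dot product with the eigenvectors
print(X_reduced)                            # the 2-D data projected down to 1 dimension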
Overfitting and underfitting:
Underfitting occurs when a model is not able to make accurate predictions based
on training data and hence, doesn’t have the capacity to generalize well on new
data. Another case of underfitting is when a model is not able to learn enough from
training data, making it difficult to capture the dominating trend (the model is
unable to create a mapping between the input and the target variable).
Machine learning models with underfitting tend to have poor performance both in
training and testing sets (like the child who learned only addition and was not able
to solve problems related to other basic arithmetic operations both from his math
problem book and during the math exam). Underfitting models usually have high
bias and low variance.
A model is considered overfitting when it does extremely well on training data but
fails to perform on the same level on the validation data (like the child who
memorized every math problem in the problem book and would struggle when
facing problems from anywhere else). An overfitting model fails to generalize
well, as it learns the noise and patterns of the training data to the point where it
negatively impacts the performance of the model on new data. If the model is
overfitting, even a slight change in the data will cause the model to change
significantly. Models that are overfitting usually have low bias and high variance.
How to avoid underfitting
There are several things you can do to prevent underfitting in AI and machine
learning models:
1) Train a more complex model – Lack of model complexity in terms of data
characteristics is the main reason behind underfitting models. For example, you
may have data with upwards of 100000 rows and more than 30 parameters. If you
train data with the Random Forest model and set max depth (max depth determines
the maximum depth of the tree) to a small number (for example, 2), your model
will definitely be underfitting. Training a more complex model (in this respect, a
model with a higher value of max depth) will help us solve the problem of
underfitting.
2) More time for training - Early training termination may cause underfitting. As
a machine learning engineer, you can increase the number of epochs or increase
the duration of training to get better results.
3) Eliminate noise from data – Another cause of underfitting is the existence of
outliers and incorrect values in the dataset. Data cleaning techniques can help deal
with this problem.
4) Adjust regularization parameters – an inappropriate regularization coefficient can
cause both overfitting and underfitting.
5) Try a different model – if none of the above-mentioned principles work, you
can try a different model (usually, the new model must be more complex by its
nature). For example, you can try to replace the linear model with a higher-order
polynomial model.
How to prevent overfitting
There are numerous ways to overcome overfitting in machine learning models.
Some of those methods are listed below.
1) Adding more data – Most of the time, adding more data can help machine
learning models detect the “true” pattern of the model, generalize better, and
prevent overfitting. However, this is not always the case, as adding more data that
is inaccurate or has many missing values can lead to even worse results.
2) Early stopping – In iterative algorithms, it is possible to measure how the
model performs at each iteration. Up until a certain number of iterations, new iterations
improve the model. After that point, however, the model’s ability to generalize can
deteriorate as it begins to overfit the training data. Early stopping refers to stopping
the training process before the learner passes that point.
3) Data augmentation – In machine learning, data augmentation techniques
increase the amount of data by slightly changing previously existing data and
adding new data points or by producing synthetic data from a previously existing
dataset.
4) Remove features – You can remove irrelevant aspects from data to improve the
model. Many characteristics in a dataset may not contribute much to prediction.
Removing non-essential characteristics can enhance accuracy and decrease
overfitting.
5) Regularization – Regularization refers to a variety of techniques to push your
model to be simpler. The approach you choose will be determined by the model
you are training. For example, you can add a penalty parameter for a regression
(L1 and L2 regularization), prune a decision tree or use dropout on a neural
network.
6) Ensembling – Ensembling methods merge predictions from numerous different
models. These methods not only deal with overfitting but also assist in solving
complex machine learning problems (like combining pictures taken from different
angles into the overall view of the surroundings). The most popular ensembling
methods are boosting and bagging.
• Boosting – In the boosting method, you train a large number of weak learners
(constrained models) in sequence, and each sequence learns from the
mistakes of the previous sequence. Then you combine all weak learners into
a single strong learner.
• Bagging is another technique to reduce overfitting. It trains a large number
of strong learners (unconstrained models) and then combines them all in
order to optimize their predictions.
Hyperparameters:
Hyperparameters are parameters whose values control the learning process and
determine the values of model parameters that a learning algorithm ends up
learning. The prefix ‘hyper’ suggests that they are ‘top-level’ parameters that
control the learning process and the model parameters that result from it.
Basically, anything in machine learning and deep learning whose value or
configuration you decide before training begins, and which remains unchanged
while training runs, is a hyperparameter.
Here are some common examples:
• Train-test split ratio
• Learning rate in optimization algorithms (e.g. gradient descent)
• Choice of optimization algorithm (e.g., gradient descent, stochastic gradient
descent, or Adam optimizer)
• Choice of activation function in a neural network (nn) layer (e.g. Sigmoid,
ReLU, Tanh)
• The choice of cost or loss function the model will use
• Number of hidden layers in a nn
• Number of activation units in each layer
• The drop-out rate in nn (dropout probability)
• Number of iterations (epochs) in training a nn
• Number of clusters in a clustering task
• Kernel or filter size in convolutional layers
• Pooling size
• Batch size
Parameters
Parameters on the other hand are internal to the model. That is, they are learned or
estimated purely from the data during training as the algorithm used tries to learn
the mapping between the input features and the labels or targets.
Model training typically starts with parameters being initialized to some values
(random values or set to zeros). As training/learning progresses the initial values
are updated using an optimization algorithm (e.g. gradient descent). The learning
algorithm continuously updates the parameter values as learning progresses, but the
hyperparameter values set by the model designer remain unchanged.
At the end of the learning process, model parameters are what constitute the model
itself.
Examples of parameters
• The coefficients (or weights) of linear and logistic regression models.
• Weights and biases of a nn
• The cluster centroids in clustering
Simply put, parameters in machine learning and deep learning are the values your
learning algorithm can change independently as it learns and these values are
affected by the choice of hyperparameters you provide. So you set the
hyperparameters before training begins and the learning algorithm uses them to
learn the parameters. Behind the training scene, parameters are continuously being
updated and the final ones at the end of the training constitute your model.
Therefore, setting the right hyperparameter values is very important because it
directly impacts the performance of the model that will result from them being
used during model training. The process of choosing the best hyperparameters for
your model is called hyperparameter tuning.
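A short sketch of the distinction (scikit-learn and the specific settings are illustrative assumptions):

# Hyperparameters are chosen before training; parameters are learned from the data
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(C=0.5, max_iter=500)   # C and max_iter: hyperparameters set by us
model.fit(X, y)                                   # training estimates the parameters
print(model.coef_, model.intercept_)              # learned parameters: weights and biases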
Validation sets: A validation set is a set of data used to train artificial intelligence
(AI) with the goal of finding and optimizing the best model to solve a given
problem. Validation sets are also known as dev sets.
Supervised learning and machine learning models are trained on very large sets of
labeled data, in which validation data sets play an important role in their creation.
Training, tuning, model selection and testing are performed with three different
sets of data: train, test and validation. Validation sets are used to select and tune the
AI model.
Validation data sets use a sample of data that is withheld from training. That data is
then used to evaluate any apparent errors, so that machine learning engineers can
tune the model's hyperparameters – the adjustable settings used to
control the behavior of the model. The validation set thus acts as an independent
data set for comparing the models' performance.
Bias and variance:
Bias: In general, a machine learning model analyses the data, finds patterns in it and
makes predictions. While training, the model learns these patterns in the dataset and
applies them to test data for prediction. While making predictions, a difference
occurs between the prediction values made by the model and the actual/expected
values, and this difference is known as bias error, or error due to bias. It can be
defined as the inability of machine learning algorithms such as Linear Regression to
capture the true relationship between the data points. Each algorithm begins with
some amount of bias, because bias arises from assumptions in the model that
make the target function simpler to learn. A model has either:
o Low Bias: A low bias model will make fewer assumptions about the form of
the target function.
o High Bias: A model with a high bias makes more assumptions, and the
model becomes unable to capture the important features of our dataset. A
high bias model also cannot perform well on new data.
Generally, a linear algorithm has high bias, which lets it learn fast. The
simpler the algorithm, the higher the bias that is likely to be introduced, whereas a
nonlinear algorithm often has low bias.
Some examples of machine learning algorithms with low bias are Decision Trees,
k-Nearest Neighbours and Support Vector Machines. Algorithms with high bias
include Linear Regression, Linear Discriminant Analysis and
Logistic Regression.
Variance Error: The variance specifies how much the prediction would vary if
different training data were used. In simple words, variance tells
how much a random variable differs from its expected value. Ideally, a
model should not vary too much from one training dataset to another, which means
the algorithm should be good in understanding the hidden mapping between inputs
and output variables. Variance errors are either of low variance or high variance.
Low variance means there is a small variation in the prediction of the target
function with changes in the training data set. At the same time, High
variance shows a large variation in the prediction of the target function with
changes in the training dataset.
A model that shows high variance learns a lot and performs well on the training
dataset, but does not generalize well to unseen data. As a result, such a
model gives good results with the training dataset but shows high error rates on the
test dataset.
Since, with high variance, the model learns too much from the dataset, it leads to
overfitting of the model. A model with high variance has the below problems:
o A high variance model leads to overfitting.
o It increases model complexity.
Usually, nonlinear algorithms, which have a lot of flexibility to fit the data, have high
variance.
Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of
bias and variance in order to avoid overfitting and underfitting in the model. If the
model is very simple with fewer parameters, it may have low variance and high
bias. Whereas, if the model has a large number of parameters, it will have high
variance and low bias. So, it is required to make a balance between bias and
variance errors, and this balance between the bias error and variance error is
known as the Bias-Variance trade-off.
For an accurate prediction of the model, algorithms need a low variance and low
bias. But this is not possible because bias and variance are related to each other:
o If we decrease the variance, it will increase the bias.
o If we decrease the bias, it will increase the variance.
Bias-Variance trade-off is a central issue in supervised learning. Ideally, we need a
model that accurately captures the regularities in training data and simultaneously
generalizes well to unseen data. Unfortunately, it is not possible to do both at once:
a high variance algorithm may perform well on training
data but may overfit to noisy data, whereas a high bias algorithm
produces a much simpler model that may not even capture important regularities
the data. So, we need to find a sweet spot between bias and variance to make an
optimal model.
Hence, the Bias-Variance trade-off is about finding the sweet spot to make a
balance between bias and variance errors.
Loss function –regularization:
Regularization is a technique used in machine learning and deep learning to
prevent overfitting and improve the generalization performance of a model. It
involves adding a penalty term to the loss function during training.
This penalty discourages the model from becoming too complex or having large
parameter values, which helps in controlling the model’s ability to fit noise in the
training data. Regularization methods include L1 and L2 regularization, dropout,
early stopping, and more. By applying regularization, models become more robust
and better at making accurate predictions on unseen data.
The “something” we’re making regular in our ML context is the “objective
function”, something we try to minimize during the optimization problem.
To put it simply, in regularization, information is added to an objective function.
We use regularization because we want to add some bias into our model to prevent
it from overfitting to our training data. After adding regularization, we end up with a
machine learning model that performs well on the training data and has a good
ability to generalize to new examples that it has not seen during training.
In order to get the “best” implementation of our model, we can use an optimization
algorithm to identify the set of inputs that maximizes – or minimizes – the
objective function. Generally, in machine learning we want to minimize the
objective function to lower the error of our model. This is why the objective
function is called the loss function amongst practitioners, but it can also be called
the cost function.
L1 regularization : L1 regularization, also known as L1 norm or Lasso (in
regression problems), combats overfitting by shrinking the parameters towards 0.
This makes some features obsolete.
It’s a form of feature selection, because when we assign a feature with a 0 weight,
we’re multiplying the feature values by 0 which returns 0, eradicating the
significance of that feature. If the input features of our model have weights closer
to 0, our L1 norm would be sparse. A selection of the input features would have
weights equal to zero, and the rest would be non-zero.
For example, imagine we want to predict housing prices using machine learning.
Consider the following features:
• Street – road access,
• Neighborhood – property location,
• Accessibility – transport access,
• Year Built – year the house was built in,
• Rooms – number of rooms,
• Kitchens – number of kitchens,
• Fireplaces – number of fireplaces in the house.
When predicting the value of a house, intuition tells us that different input features
won’t have the same influence on the price. For example, it’s highly likely that the
neighborhood or the number of rooms have a higher influence on the price of the
property than the number of fireplaces.
So, our L1 regularization technique would assign the fireplaces feature with a zero
weight, because it doesn’t have a significant effect on the price. We can expect the
neighborhood and the number of rooms to be assigned non-zero weights, because
these features influence the price of a property significantly.
Mathematically, we express L1 regularization by extending our loss function with a
penalty on the absolute values of the weights (in its standard form,
Loss = Error(y, ŷ) + λ Σ |wᵢ|, where λ controls the strength of the penalty).
Essentially, when we use L1 regularization, we are penalizing the absolute value of
the weights.
In real world environments, we often have features that are highly correlated. For
example, the year our home was built and the number of rooms in the home may
have a high correlation. Something to consider when using L1 regularization is that
when we have highly correlated features, the L1 norm will arbitrarily select only one
of the features from the group of correlated features, which is
something that we might not want.
Nonetheless, for our example regression problem, Lasso regression (Linear
Regression with L1 regularization) would produce a model that is highly
interpretable, and only uses a subset of input features, thus reducing the complexity
of the model.
L2 regularization : L2 regularization, or the L2 norm, or Ridge (in regression
problems), combats overfitting by forcing weights to be small, but not making
them exactly 0.
So, if we’re predicting house prices again, this means the less significant features
for predicting the house price would still have some influence over the final
prediction, but it would only be a small influence.
The regularization term that we add to the loss function when performing L2
regularization is the sum of the squares of all of the feature weights (in its standard
form, Loss = Error(y, ŷ) + λ Σ wᵢ²).
So, L2 regularization returns a non-sparse solution, since the weights will be non-
zero (although some may be close to 0).
A major snag to consider when using L2 regularization is that it’s not robust to
outliers. The squared terms will blow up the differences in the error of the outliers.
The regularization would then attempt to fix this by penalizing the weights.
The differences between L1 and L2 regularization:
• L1 regularization penalizes the sum of absolute values of the weights,
whereas L2 regularization penalizes the sum of squares of the weights.
• The L1 regularization solution is sparse. The L2 regularization solution is
non-sparse.
• L2 regularization doesn’t perform feature selection, since weights are only
reduced to values near 0 instead of 0. L1 regularization has built-in feature
selection.
• L1 regularization is robust to outliers, L2 regularization is not.
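A brief comparison in code, using scikit-learn's Lasso (L1) and Ridge (L2) on synthetic data where only the first two features matter (all of the specifics here are assumptions for illustration):

# Lasso drives irrelevant weights to exactly 0; Ridge only shrinks them towards 0
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.randn(100)   # only features 0 and 1 are informative

print(Lasso(alpha=0.1).fit(X, y).coef_)   # sparse solution: some coefficients are exactly 0
print(Ridge(alpha=0.1).fit(X, y).coef_)   # non-sparse solution: small but non-zero coefficients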
Biological neuron-idea of computational units: In living organisms, the brain is
the control unit of the neural network, and it has different subunits that take care of
vision, senses, movement, and hearing. The brain is connected with a dense
network of nerves to the rest of the body’s sensors and actors. There are
approximately 10¹¹ neurons in the brain, and these are the building blocks of the
complete central nervous system of the living body.
The neuron is the fundamental building block of neural networks. In the biological
systems, a neuron is a cell just like any other cell of the body, which has a DNA
code and is generated in the same way as the other cells. Though it might have
different DNA, the function is similar in all the organisms. A neuron comprises
three major parts: the cell body (also called Soma), the dendrites, and the axon. The
dendrites are like fibers branched in different directions and are connected to many
cells in that cluster.
Dendrites receive the signals from surrounding neurons, and the axon transmits the
signal to the other neurons. At the ending terminal of the axon, the contact with the
dendrite is made through a synapse. The axon is a long fiber that transports the output
signal as electric impulses along its length. Each neuron has one axon. Axons pass
impulses from one neuron to another like a domino effect.
Why Understand Biological Neural Networks?
For creating mathematical models for artificial neural networks, theoretical
analysis of biological neural networks is essential as they have a very close
relationship. And this understanding of the brain’s neural networks has opened
horizons for the development of artificial neural network systems and adaptive
systems designed to learn and adapt to the situations and inputs.
Neuron in an Artificial Neural Network :After going through the biological neuron,
let’s move to the artificial neuron.
An artificial neuron, or neural node, is a mathematical model. In most cases, it
computes a weighted sum of its inputs and then adds a bias to it. After that, it
passes the resulting term through an activation function. This activation function is
a nonlinear function such as the sigmoid function that accepts a linear input and
gives a nonlinear output.
A typical artificial neuron can be described as follows:
1. An artificial neural network can be viewed as a weighted directed graph in which
artificial neurons are nodes, and directed edges with weights are connections between neuron outputs
and neuron inputs.
2. The Artificial Neural Network receives information from the external world in
the form of patterns and images, represented as vectors. These inputs are designated
by the notation x(n) for n inputs.
3. Every input is multiplied by its specific weights, which serve as crucial
information for the neural network to solve problems. These weights essentially
represent the strength of the connections between neurons within the neural
network.
4. The weighted inputs are all summed up inside the computing unit (artificial
neuron). If the weighted sum is zero, a bias is added to make the output non-
zero, or to scale up the system response. The bias can be thought of as an extra
input fixed at 1 with its own weight.
5. The sum can correspond to any numerical value from 0 to infinity. To limit
the response to the desired range, a threshold value is set. For this,
the sum is forwarded through an activation function.
6. The activation function serves as the transfer function that produces the desired
output. There are linear as well as nonlinear activation functions.
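A minimal sketch of one such neuron in NumPy, assuming a sigmoid activation and made-up inputs, weights and bias:

# One artificial neuron: weighted sum of inputs plus bias, passed through a nonlinear activation
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squashes the sum into the range (0, 1)

x = np.array([0.5, 0.3, 0.2])          # inputs x(n)
w = np.array([0.4, 0.7, -0.2])         # one weight per input
b = 0.1                                # bias term (its input is fixed at 1)
output = sigmoid(np.dot(w, x) + b)     # weighted sum + bias, then the activation function
print(output)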
McCulloch-Pitts units and Thresholding logic:
The McCulloch-Pitts neural model, which was the earliest ANN model, has only
two types of inputs — Excitatory and Inhibitory. The excitatory inputs have
weights of positive magnitude and the inhibitory inputs have weights of negative
magnitude. The inputs of the McCulloch-Pitts neuron can be either 0 or 1. It has
a threshold function as an activation function. So, the output signal yout is 1 if the
input ysum is greater than or equal to a given threshold value, else 0. The
diagrammatic representation of the model is as follows:
McCulloch-Pitts Model:
Simple McCulloch-Pitts neurons can be used to design logical operations. For that
purpose, the connection weights need to be chosen correctly, along with the
threshold value of the activation function. For better understanding, consider an
example: John carries an umbrella if it is sunny or if it is raining. There are four
given situations, and I need to decide when John will carry the umbrella. The
situations are as follows:
• First scenario: It is not raining, nor is it sunny
• Second scenario: It is not raining, but it is sunny
• Third scenario: It is raining, and it is not sunny
• Fourth scenario: It is raining as well as sunny
To analyse the situations using the McCulloch-Pitts neural model, I can consider
the input signals as follows:
• X1: Is it raining?
• X2: Is it sunny?
So, the value of each input can be either 0 or 1. We can set both weights, on X1 and
X2, to 1 and the threshold to 1. So, the neural network model will look like:
Truth Table for this case will be:
Situation x1 x2 ysum yout
1 0 0 0 0
2 0 1 1 1
3 1 0 1 1
4 1 1 2 1
So, I can say that, with g(x1, x2, ..., xn) = g(x) = x1 + x2 + ... + xn (the weighted
sum of the inputs, here with unit weights),
y = f(g(x)) = 1 if g(x) ≥ θ, and y = 0 if g(x) < θ, where θ is the threshold (here θ = 1).
The truth table built with respect to the problem is depicted above. From the truth
table, I can conclude that in the situations where the value of yout is 1, John needs to
carry an umbrella. Hence, he will need to carry an umbrella in scenarios 2, 3 and
4.
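A tiny sketch of this unit in Python, with both weights equal to 1 and a threshold of 1 exactly as in the example, reproduces the truth table above:

def mcculloch_pitts(inputs, weights, threshold):
    # Output 1 if the weighted sum ysum reaches the threshold, else 0
    ysum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if ysum >= threshold else 0

# The four scenarios (x1: is it raining?, x2: is it sunny?)
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, mcculloch_pitts([x1, x2], weights=[1, 1], threshold=1))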
Linear perceptron: Perceptron was introduced by Frank Rosenblatt in 1957. He
proposed a Perceptron learning rule based on the original MCP neuron. A
Perceptron is an algorithm for supervised learning of binary classifiers. This
algorithm enables neurons to learn and process elements in the training set one at
a time.
Basic Components of Perceptron
Perceptron is a type of artificial neural network, which is a fundamental concept in
machine learning. The basic components of a perceptron are:
1. Input Layer: The input layer consists of one or more input neurons, which
receive input signals from the external world or from other layers of the
neural network.
2. Weights: Each input neuron is associated with a weight, which represents
the strength of the connection between the input neuron and the output
neuron.
3. Bias: A bias term is added to the input layer to provide the perceptron with
additional flexibility in modeling complex patterns in the input data.
4. Activation Function: The activation function determines the output of the
perceptron based on the weighted sum of the inputs and the bias term.
Common activation functions used in perceptrons include the step function,
sigmoid function, and ReLU function.
5. Output: The output of the perceptron is a single binary value, either 0 or 1,
which indicates the class or category to which the input data belongs.
6. Training Algorithm: The perceptron is typically trained using a supervised
learning algorithm such as the perceptron learning algorithm or
backpropagation. During training, the weights and biases of the perceptron
are adjusted to minimize the error between the predicted output and the true
output for a given set of training examples.
Overall, the perceptron is a simple yet powerful algorithm that can be used to
perform binary classification tasks and has paved the way for the more complex
neural networks used in deep learning today.
Types of Perceptron:
1. Single layer: Single layer perceptron can learn only linearly separable
patterns.
2. Multilayer: A multilayer perceptron has two or more layers and therefore
greater processing power.
The Perceptron algorithm learns the weights for the input signals in order to draw a
linear decision boundary.
Advantages:
• A multi-layered perceptron model can solve complex non-linear problems.
• It works well with both small and large input data.
• It helps us obtain quick predictions after training.
• It helps us obtain the same accuracy ratio with big and small data.
Disadvantages:
• In a multi-layered perceptron model, computations are time-consuming and
complex.
• It is tough to predict how much each independent variable affects the
dependent variable.
• The functioning of the model depends on the quality of the training.
Characteristics of the Perceptron Model:
1. It is a machine learning algorithm that uses supervised learning of binary
classifiers.
2. In Perceptron, the weight coefficient is automatically learned.
3. Initially, the weights are multiplied with the input features, and then a decision is
made as to whether the neuron fires or not.
4. The activation function applies a step rule to check whether the weighted sum is
greater than zero.
5. A linear decision boundary is drawn, enabling the distinction between the two
linearly separable classes, +1 and -1.
6. If the sum of all input values is more than the threshold value, the neuron
produces an output signal (fires); otherwise, no output is shown.
Working:
The perceptron works in these simple steps:
a. All the inputs x are multiplied by their weights w; call each of these products k.
b. Add all the multiplied values together; the result is called the weighted sum.
c. Apply that weighted sum to the chosen activation function, for example the unit
step activation function, as in the sketch below.
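A minimal sketch of steps a-c in Python with the unit step activation (the example inputs, weights, and bias are made up for illustration):

def unit_step(z):
    # Unit step activation: 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

def perceptron_output(x, w, b):
    k = [wi * xi for wi, xi in zip(w, x)]   # step a: multiply inputs by weights
    weighted_sum = sum(k) + b               # step b: sum the products (plus the bias)
    return unit_step(weighted_sum)          # step c: apply the activation function

print(perceptron_output([1, 0], w=[0.6, 0.6], b=-0.5))   # example call, outputs 1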
Perceptron Learning Algorithm:
The perceptron learning algorithm is also understood as an artificial neuron or
neural network unit that helps detect certain patterns in input data, for example in
business intelligence applications. The perceptron learning algorithm is regarded as
the most straightforward artificial neural network. It is a supervised learning
algorithm for binary classifiers. Hence, it is a single-layer neural network with four
main parameters, i.e., input values, weights and bias, net sum, and an activation
function.
There are four significant steps in a perceptron learning algorithm:
1. First, multiply all input values with their corresponding weight values and then
add them up to determine the weighted sum. Mathematically, the weighted sum is
∑ wi∗xi = w1∗x1 + w2∗x2 + … + wn∗xn. Add another essential term called the
bias 'b' to the weighted sum to improve the model's performance: ∑ wi∗xi + b.
2. Next, an activation function is applied to this weighted sum, producing a binary
or a continuous-valued output: Y = f(∑ wi∗xi + b).
3. Next, the difference between this output and the actual target value is computed
to get the error term E, generally as a squared error: E = (Y − Yactual)². The steps
up to this point form the forward-propagation part of the algorithm.
4. Finally, we minimize this error (the loss function) using an optimization
algorithm. Generally, some form of gradient descent is used to find the optimal
values of the weights and the bias (the learning rate itself is a hyperparameter
chosen beforehand). This step forms the backward-propagation part of the
algorithm. A sketch of the full loop is given below.
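A minimal sketch of this loop in Python, learning the AND gate. Because the step activation is not differentiable, the sketch uses the classic error-driven perceptron update w ← w + η(Yactual − Y)·x in place of a general gradient-descent optimizer; the data and the learning rate of 1 (chosen so the arithmetic stays exact) are illustrative assumptions:

def step(z):
    return 1 if z >= 0 else 0

# AND-gate training data: (inputs, target)
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b, eta = [0, 0], 0, 1

for epoch in range(10):
    for x, target in data:
        y = step(sum(wi * xi for wi, xi in zip(w, x)) + b)   # steps 1-2: weighted sum + activation
        error = target - y                                   # step 3: error term
        w = [wi + eta * error * xi for wi, xi in zip(w, x)]  # step 4: adjust the weights
        b = b + eta * error                                  #          and the bias

print(w, b)   # a separating line for AND, here weights [2, 1] and bias -3
print([step(sum(wi * xi for wi, xi in zip(w, x)) + b) for x, _ in data])   # expected: [0, 0, 0, 1]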
Convergence theorem for perceptron learning algorithm:
Variables and parameters:
x(n) = (m+1)-by-1 input vector = [+1, x1(n), x2(n), ..., xm(n)]ᵀ
w(n) = (m+1)-by-1 weight vector = [b(n), w1(n), w2(n), ..., wm(n)]ᵀ
b(n) = bias
y(n) = actual response
d(n) = desired response
η = learning-rate parameter, a positive constant less than unity
1. Initialization: Set w(0) = 0, then perform the following computations for time
steps n = 1, 2, ...
2. Activation: At time step n, activate the perceptron by applying the input vector
x(n) and the desired response d(n).
3. Computation of actual response: Compute the actual response of the perceptron:
y(n) = sgn[wᵀ(n)x(n)]
4. Adaptation of weight vector: Update the weight vector of the perceptron:
w(n+1) = w(n) + η[d(n) − y(n)]x(n)
5. Continuation: Increment time step n by 1 and go to step 2. A code sketch of this
procedure is given below.
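A sketch of this procedure in Python (the two-class points and η = 0.5 are invented for illustration; the bias is folded into the weight vector as w[0] via the fixed +1 in the augmented input, as in the variable definitions above):

import numpy as np

def sgn(v):
    return 1 if v >= 0 else -1

# Small linearly separable training set: class C1 has d = +1, class C2 has d = -1
samples = [(np.array([2.0, 1.0]), +1), (np.array([3.0, 3.0]), +1),
           (np.array([-1.0, -2.0]), -1), (np.array([-2.0, 1.0]), -1)]

eta = 0.5                # learning-rate parameter, a positive constant less than unity
w = np.zeros(3)          # step 1: w(0) = 0; w[0] plays the role of the bias b(n)

for n in range(100):                          # time steps n = 1, 2, ...
    x_raw, d = samples[n % len(samples)]      # step 2: apply x(n) and the desired response d(n)
    x = np.concatenate(([1.0], x_raw))        # augmented input [+1, x1(n), x2(n)]^T
    y = sgn(w @ x)                            # step 3: y(n) = sgn[w^T(n) x(n)]
    w = w + eta * (d - y) * x                 # step 4: w(n+1) = w(n) + eta*[d(n) - y(n)]*x(n)

print(w)                                                            # a separating weight vector
print([sgn(w @ np.concatenate(([1.0], p))) for p, _ in samples])    # expected: [1, 1, -1, -1]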
PERCEPTRON CONVERGENCE THEOREM: This says that if there is a weight
vector w* such that f(w*ᵀp(q)) = t(q) for all q, then for any starting vector w, the
perceptron learning rule will converge to a weight vector (not necessarily unique
and not necessarily w*) that gives the correct response for all training patterns, and
it will do so in a finite number of steps.
IDEA OF THE PROOF: The idea is to find upper and lower bounds on the length
of the weight vector. If the length is finite, then the perceptron has converged,
which also implies that the weights have changed a finite number of times.
PROOF: 1) Assume that the inputs to the perceptron originate from two linearly
separable classes; that is, the classes can be distinguished by a perceptron. Let X1
be the subset of training vectors belonging to C1, that is, p(1), p(2), ..., and let X2
be the subset of training vectors belonging to C2. Then X1 ∪ X2 is the complete
training set X.
2) Given the sets of vectors X1 and X2 used to train the perceptron, the training
process (as we have seen) involves adjusting the weight vector w so that C1 and C2
are separated. That is, there exists some w such that
wᵀp > 0 for every input vector p ∈ C1
wᵀp < 0 for every input vector p ∈ C2    (1)
3) What we need to do is find some w such that the above is satisfied; this is the
purpose of the perceptron algorithm. One algorithm for adapting the weight vector
of the perceptron can be formulated as follows (there are others):
a. If the kth member of the training set, p(k), is correctly classified at the kth
iteration, no correction is made to w. This is done according to the following rule:
w(k+1) = w(k) if wᵀp(k) > 0 and p(k) ∈ C1
w(k+1) = w(k) if wᵀp(k) < 0 and p(k) ∈ C2    (2)
b. Otherwise, the weight vector of the perceptron is updated according to the
following rule:
w(k+1) = w(k) − p(k) if wᵀp(k) > 0 and p(k) ∈ C2
w(k+1) = w(k) + p(k) if wᵀp(k) < 0 and p(k) ∈ C1    (3)
4) The proof begins by assuming that w(1) = 0 (i.e., the zero vector). Suppose that
wᵀ(k)p(k) < 0 for k = 1, 2, ..., with all the input vectors p(k) ∈ X1 (or C1). Then the
perceptron incorrectly classifies the vectors p(1), p(2), ..., since the condition in
Equation (1) for vectors in C1 is violated (i.e., wᵀ(k)p(k) should be greater than 0).
Now we can use Equation (3) to write the adjustments to the weight vector
according to the perceptron learning rule:
w(k+1) = w(k) + p(k) for p(k) ∈ C1    (4)
Recall that what we are doing is modifying the weight vector so that it points in the
right direction.
5) Given the initial condition w(1) = 0, we can iteratively solve for w(k+1),
obtaining: w(k+1) = p(1) + p(2) + ... + p(k)    (5)
CLS EXERCISE: Show w(k+1) = p(1) + p(2) + ... + p(k)
ANSWER: w(1) = 0
w(2) = w(1) + p(1) = p(1)
w(3) = w(2) + p(2) = w(1) + p(1) + p(2) = p(1) + p(2)
w(4) = w(3) + p(3) = w(2) + p(2) + p(3) = w(1) + p(1) + p(2) + p(3) = p(1) + p(2) +
p(3)
So, in general: w(k+1) = p(1) + p(2) + ... + p(k)
Since C1 and C2 are assumed to be linearly separable, there exists a wo that can
correctly classify all input vectors belonging to C1 and C2. In this proof, we have
assumed the existence of a wo for which woᵀp(k) > 0 if p(k) ∈ C1 and
woᵀp(k) < 0 if p(k) ∈ C2. This is equivalent to the existence of a weight vector wo
for which woᵀp(k) > 0 for every p(k) in C1 ∪ C2', i.e., in X.
The reason is that the training set can be considered to consist of two parts:
C1 = {p such that the target value is 1} and C2 = {p such that the target value is 0},
so we can think of the training set X as X = C1 ∪ C2', where C2' = {−p such that
p ∈ C2}. For a misclassified p(k) ∈ C2 (i.e., wᵀp(k) > 0), the update from rule b is
w(k+1) = w(k) − p(k) = w(k) + (−p(k)); in terms of the transformed vector
−p(k) ∈ C2', this is the same '+' update used for vectors in C1. And since
woᵀp(k) < 0 for p(k) ∈ C2, it follows that woᵀ(−p(k)) > 0, so woᵀp > 0 indeed holds
for every vector in X = C1 ∪ C2'.
For the solution wo, we can define some α > 0 as
α = min p(k)∈X woᵀp(k)    (6)
which is just the minimum (scalar) value of woᵀp(k) over all p ∈ X.
6) By multiplying each side of Equation (5) by woᵀ, we get:
woᵀw(k+1) = woᵀp(1) + woᵀp(2) + ... + woᵀp(k)
7) And, by applying Equation (6), we get:
woᵀw(k+1) ≥ kα    (7)
since woᵀw(k+1) has to be greater than or equal to k × min p(k)∈X woᵀp(k).
8) Next, use the Cauchy-Schwarz inequality for wo and w(k+1), which states that
for any two vectors x and y: [x ⋅ y]² ≤ ||x||² ||y||², or equivalently
||x||² ≥ [x ⋅ y]² / ||y||², where ||x|| is the Euclidean norm.
Thus, we can say: ||wo||² ||w(k+1)||² ≥ [woᵀw(k+1)]²
9) From Equation (7) we know that woᵀw(k+1) ≥ kα, so by applying
Cauchy-Schwarz, ||wo||² ||w(k+1)||² ≥ [kα]² = k²α², and we get:
||w(k+1)||² ≥ k²α² / ||wo||²    (9)
which shows that the squared length of the weight vector, ||w(k+1)||², grows at least
as fast as k², where k is the number of times the weights have changed.
10) So, what we have established with Equation (9) is a lower bound on the squared
Euclidean norm of the weight vector w at iteration k + 1. But there is another aspect
we must consider: to show that the weights cannot continue to grow indefinitely,
we must also establish an upper bound for the weight vector w.
11) To find an upper bound, we now do the following. Write Equation (4) as
(where q is the number of patterns in X):
w(k+1) = w(k) + p(k) for k = 1, ..., q and p(k) ∈ X
12) Taking the squared Euclidean norm of both sides, we get:
||w(k+1)||² = ||w(k)||² + 2wᵀ(k)p(k) + ||p(k)||²    (10)
13) But assuming that the perceptron incorrectly classifies an input vector
belonging to X (i.e., we have to make the adjustment w(k+1) = w(k) + p(k)), we
have wᵀ(k)p(k) < 0, and from Equation (10) it follows that:
||w(k+1)||² ≤ ||w(k)||² + ||p(k)||²
CLS EXERCISE: Why can we make this claim?
ANSWER: Because the input vector p(k) was incorrectly classified, we have
wᵀ(k)p(k) < 0. Given that ||w(k+1)||² = ||w(k)||² + 2wᵀ(k)p(k) + ||p(k)||², the term
2wᵀ(k)p(k) must then also be < 0. Thus, we can claim that
||w(k+1)||² ≤ ||w(k)||² + ||p(k)||².
We can rewrite the above as: ||w(k+1)||² − ||w(k)||² ≤ ||p(k)||²
14) Adding these inequalities for k = 1, ..., q and using the initial condition
w(1) = 0, it can be shown that:
||w(k+1)||² ≤ ∑ (n=1 to k) ||p(n)||² ≤ kβ    (13)
where β > 0 is defined by:
β = max p(k)∈X ||p(k)||²
15) Equation (13) says that the squared Euclidean norm of the weight vector
w(k+1) grows at most linearly with the number of iterations k, given the input
patterns. In other words:
||w(k+1)||² ≤ kβ    (14)
16) Thus, we have established both upper and lower bounds on ||w(k+1)||² for a
perceptron classifying correctly the input patterns p(k), k = 1, ..., q, for the two
classes C1 and C2, given that they are linearly separable.
17) Observe that the upper bound ||w(k+1)||² ≤ kβ of Equation (14) conflicts with
the earlier lower bound ||w(k+1)||² ≥ k²α²/||wo||² of Equation (9): the two cannot
both hold for ||w(k+1)|| for large values of k.
So we can state that k cannot be larger than some kmax for which Equations (9)
and (14) are both satisfied; setting the two bounds equal determines the maximum
number of iterations needed for the two linearly separable classes to be separated.
Setting Equations (9) and (14) equal to each other and solving for kmax, we get:
kmax²α²/||wo||² = kmaxβ, so kmax = β||wo||²/α²
which means that the adjustment of the perceptron's weights must terminate after
at most kmax iterations, i.e., the machine solves the (linearly separable) problem
correctly in a finite number of steps.
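For reference, the two bounds and the resulting iteration limit can be written compactly in LaTeX (a restatement of Equations (6), (9), (13), and (14) above, with the same notation):

\[
\frac{k^2\alpha^2}{\lVert w_o\rVert^2} \;\le\; \lVert w(k+1)\rVert^2 \;\le\; k\beta,
\qquad
\alpha = \min_{p(k)\in X} w_o^{T} p(k),
\qquad
\beta = \max_{p(k)\in X} \lVert p(k)\rVert^2,
\]
\[
\text{so that}\qquad k \;\le\; k_{\max} \;=\; \frac{\beta\,\lVert w_o\rVert^2}{\alpha^2}.
\]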
Linear separability:
The concept of separability applies to binary classification problems. In them, we
have two classes: one positive and the other negative. We say they’re separable if
there’s a classifier whose decision boundary separates the positive objects from the
negative ones.
If such a decision boundary is a linear function of the features, we say that the
classes are linearly separable.
We say a two-dimensional dataset is linearly separable if we can separate the
positive from the negative objects with a straight line.
It doesn’t matter if more than one such line exists. For linear separability, it’s
sufficient to find only one:
If the data are linearly separable, we can find the decision boundary's equation by
fitting a linear model to the data. For example, a linear Support Vector
Machine classifier finds the hyperplane with the widest margins.
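As a small sketch of this idea (scikit-learn is assumed to be available, and the toy points below are invented for illustration), fitting a linear SVM variant to linearly separable 2-D data recovers such a separating line:

from sklearn.svm import LinearSVC

# Toy 2-D data: two clusters that a straight line can separate
X = [[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]]
y = [0, 0, 0, 1, 1, 1]

clf = LinearSVC(C=1.0)
clf.fit(X, y)

# coef_ and intercept_ describe the line w1*x1 + w2*x2 + b = 0 separating the classes
print(clf.coef_, clf.intercept_)
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))   # expected: class 0 and class 1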
Linear models come with three advantages. First, they’re simple and operate with
original features. So, it’s easier to interpret them than the more complex non-linear
models. Second, we can derive analytical solutions to the optimization problems
that arise while fitting the models. In contrast, we can rely only on numerical
methods to train a general non-linear model. And finally, it’s easier to apply
numerical methods to linear optimization problems than non-linear ones.
However, if the data aren’t linearly separable, we can’t enjoy the advantages of
linear models.
Multilayer Perceptron:
It is a neural network where the mapping between inputs and output is non-linear.
A Multilayer Perceptron has input and output layers, and one or more hidden
layers with many neurons stacked together. And while in the Perceptron the neuron
must have an activation function that imposes a threshold, like ReLU or sigmoid,
neurons in a Multilayer Perceptron can use any arbitrary activation function.
Multilayer Perceptron falls under the category of feedforward algorithms, because
inputs are combined with the initial weights in a weighted sum and subjected to the
activation function, just like in the Perceptron. But the difference is that each linear
combination is propagated to the next layer.
Each layer feeds the next one with the result of its computation, that is, its internal
representation of the data. This goes all the way through the hidden layers to the
output layer.
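A minimal sketch of this layer-by-layer forward pass in Python with NumPy (the layer sizes, the randomly drawn weights, and the choice of the sigmoid activation are illustrative assumptions, not taken from the text):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    # Propagate the input through a list of (W, b) layers, applying the activation at each layer
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)   # each layer feeds its output to the next one
    return a

rng = np.random.default_rng(0)
# A network with 2 inputs, one hidden layer of 3 neurons, and a single output neuron
layers = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
          (rng.standard_normal((1, 3)), rng.standard_normal(1))]
print(forward(np.array([0.5, -1.0]), layers))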
Backpropagation
Backpropagation is the learning mechanism that allows the Multilayer Perceptron
to iteratively adjust the weights in the network, with the goal of minimizing the
cost function.
There is one hard requirement for backpropagation to work properly: the function
that combines inputs and weights in a neuron (for instance, the weighted sum) and
the activation function (for instance, the sigmoid) must be differentiable, because
Gradient Descent is typically the optimization algorithm used in a Multilayer
Perceptron. (ReLU is differentiable everywhere except at zero, where a subgradient
is used in practice.)
In each iteration, after the weighted sums are forwarded through all layers, the
gradient of the Mean Squared Error is computed across all input and output pairs.
Then this gradient is propagated backwards, layer by layer, and the weights of each
layer, down to the first hidden layer, are updated using the value of the gradient.
That is how the error signal travels back to the starting point of the neural network.
This process keeps going until the gradient for each input-output pair has
converged, meaning the newly computed gradient hasn't changed more than a
specified convergence threshold compared to the previous iteration.
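A compact sketch of this training loop for a one-hidden-layer network, using NumPy, sigmoid activations, gradient descent on the mean squared error, and an XOR-style toy dataset (all of these choices are illustrative assumptions, not prescribed by the text):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset (XOR), which a single perceptron cannot learn but an MLP can
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((2, 8)), np.zeros(8)   # hidden layer: 8 neurons
W2, b2 = rng.standard_normal((8, 1)), np.zeros(1)   # output layer: 1 neuron
lr = 1.0                                            # learning rate

for epoch in range(10000):
    # Forward pass: weighted sums forwarded through all layers
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    error = y_hat - Y                       # derivative of the squared error w.r.t. y_hat (up to a constant)

    # Backward pass: propagate the gradient from the output layer back to the hidden layer
    d_out = error * y_hat * (1 - y_hat)     # sigmoid derivative at the output layer
    d_hid = (d_out @ W2.T) * h * (1 - h)    # gradient reaching the hidden layer

    # Gradient-descent updates of weights and biases
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_hid)
    b1 -= lr * d_hid.sum(axis=0)

print(y_hat.round(3))   # should approach [0, 1, 1, 0] after training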