Machine Learning Using Watson Studio
Machine Learning, as the name says, is all about machines learning automatically without being explicitly
programmed or learning without any direct human intervention. This machine learning process starts with
feeding them good quality data and then training the machines by building various machine learning
models using the data and different algorithms. The choice of algorithms depends on what type of data
we have and what kind of task we are trying to automate.
As for the formal definition of Machine Learning, we can say that a Machine Learning algorithm learns
from experience E with respect to some type of task T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E.
For example, suppose a machine learning algorithm is used to play chess. Then the experience E is playing many
games of chess, the task T is playing chess against many players, and the performance measure P is the
probability that the algorithm will win a game of chess.
Advantages of Machine Learning
1. Improved Accuracy and Precision
One of the most significant benefits of machine learning is its ability to improve accuracy and precision in
various tasks. ML models can process vast amounts of data and identify patterns that might be overlooked
by humans. For instance, in medical diagnostics, ML algorithms can analyze medical images or patient data
to detect diseases with a high degree of accuracy.
2. Automation of Repetitive Tasks
Machine learning enables the automation of repetitive and mundane tasks, freeing up human resources
for more complex and creative endeavors. In industries like manufacturing and customer service, ML-
driven automation can handle routine tasks such as quality control, data entry, and customer inquiries,
resulting in increased productivity and efficiency.
3. Enhanced Decision-Making
ML models can analyze large datasets and provide insights that aid in decision-making. By identifying
trends, correlations, and anomalies, machine learning helps businesses and organizations make data-
driven decisions. This is particularly valuable in sectors like finance, where ML can be used for risk
assessment, fraud detection, and investment strategies.
4. Personalization and Customer Experience
Machine learning enables the personalization of products and services, enhancing customer experience.
In e-commerce, ML algorithms analyze customer behavior and preferences to recommend products
tailored to individual needs. Similarly, streaming services use ML to suggest content based on user viewing
history, improving user engagement and satisfaction.
5. Predictive Analytics
Predictive analytics is a powerful application of machine learning that helps forecast future events based
on historical data. Businesses use predictive models to anticipate customer demand, optimize inventory,
and improve supply chain management. In healthcare, predictive analytics can identify potential outbreaks
of diseases and help in preventive measures.
6. Scalability
Machine learning models can handle large volumes of data and scale efficiently as data grows. This
scalability is essential for businesses dealing with big data, such as social media platforms and online
retailers. ML algorithms can process and analyze data in real-time, providing timely insights and responses.
7. Improved Security
Machine learning also strengthens security. ML-based systems can monitor network traffic and user behaviour to detect anomalies, fraud, and cyber threats in real time, enabling faster responses to attacks.
8. Cost Reduction
By automating processes and improving efficiency, machine learning can lead to significant cost
reductions. In manufacturing, ML-driven predictive maintenance helps identify equipment issues before
they become costly failures, reducing downtime and maintenance costs. In customer service, chatbots
powered by ML reduce the need for human agents, lowering operational expenses.
9. Innovation and Competitive Advantage
Adopting machine learning fosters innovation and provides a competitive edge. Companies that leverage
ML for product development, marketing strategies, and customer insights are better positioned to respond
to market changes and meet customer demands. ML-driven innovation can lead to the creation of new
products and services, opening up new revenue streams.
10. Augmented Human Capabilities
Machine learning augments human capabilities by providing tools and insights that enhance performance.
In fields like healthcare, ML assists doctors in diagnosing and treating patients more effectively. In research,
ML accelerates the discovery process by analyzing vast datasets and identifying potential breakthroughs.
Disadvantages of Machine Learning
1. Data Dependency
Machine learning models require vast amounts of data to train effectively. The quality, quantity, and
diversity of the data significantly impact the model’s performance. Insufficient or biased data can lead to
inaccurate predictions and poor decision-making. Additionally, obtaining and curating large datasets can
be time-consuming and costly.
3. Lack of Interpretability
Many machine learning models, particularly deep neural networks, function as black boxes. Their
complexity makes it difficult to interpret how they arrive at specific decisions. This lack of transparency
poses challenges in fields where understanding the decision-making process is critical, such as healthcare
and finance.
4. Overfitting and Underfitting
Machine learning models can suffer from overfitting or underfitting. Overfitting occurs when a model
learns the training data too well, capturing noise and anomalies, which reduces its generalization ability
to new data. Underfitting happens when a model is too simple to capture the underlying patterns in the
data, leading to poor performance on both training and test data.
5. Ethical Concerns
ML applications can raise ethical issues, particularly concerning privacy and bias. Data privacy is a
significant concern, as ML models often require access to sensitive and personal information. Bias in
training data can lead to biased models, perpetuating existing inequalities and unfair treatment of certain
groups.
6. Lack of Generalization
Machine learning models are typically designed for specific tasks and may struggle to generalize across
different domains or datasets. Transfer learning techniques can mitigate this issue to some extent, but
developing models that perform well in diverse scenarios remains a challenge.
7. Dependency on Expertise
Developing and deploying machine learning models require specialized knowledge and expertise. This
includes understanding algorithms, data preprocessing, model training, and evaluation. The scarcity of
skilled professionals in the field can hinder the adoption and implementation of ML solutions.
8. Security Vulnerabilities
ML models are susceptible to adversarial attacks, where malicious actors manipulate input data to deceive
the model into making incorrect predictions. This vulnerability poses significant risks in critical applications
such as autonomous driving, cybersecurity, and financial fraud detection.
9. Maintenance and Updates
ML models require continuous monitoring, maintenance, and updates to ensure they remain accurate and
effective over time. Changes in the underlying data distribution, known as data drift, can degrade model
performance, necessitating frequent retraining and validation.
Statistics is a key component of machine learning, with broad applicability in various fields.
• Feature engineering relies heavily on statistics to convert geometric features into meaningful
predictors for machine learning algorithms.
• In image processing tasks like object recognition and segmentation, statistics accurately reflect the
shape and structure of objects in images.
• Anomaly detection and quality control benefit from statistics by identifying deviations from
norms, aiding in the detection of defects in manufacturing processes.
• Environmental observation and geospatial mapping leverage statistical analysis to monitor land
cover patterns and ecological trends effectively.
There are several types of machine learning, each with special characteristics and applications. Some of
the main types of machine learning algorithms are as follows:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning

1. Supervised Machine Learning
Supervised learning is defined as training a model on a "labelled dataset", that is, a dataset that contains both
the input and the corresponding output parameters. In supervised learning, algorithms learn to map inputs to
the correct outputs, and both the training and validation datasets are labelled.
Example: Consider a scenario where you have to build an image classifier to differentiate between cats
and dogs. If you feed labelled images of dogs and cats to the algorithm, the machine will learn to classify
a dog or a cat from these labelled images. When we input new dog or cat images that it has never seen
before, it will use what it has learned to predict whether the image shows a dog or a cat. This is how
supervised learning works, and this particular task is image classification.
There are two main categories of supervised learning that are mentioned below:
• Classification
• Regression
Classification
Classification deals with predicting categorical target variables, which represent discrete classes or labels.
For instance, classifying emails as spam or not spam, or predicting whether a patient has a high risk of
heart disease. Classification algorithms learn to map the input features to one of the predefined classes. Some common classification algorithms are listed below (a minimal sketch follows the list):
• Logistic Regression
• Decision Tree
• Naive Bayes
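As a minimal illustration of supervised classification (assuming scikit-learn is available; the Iris dataset and the decision-tree choice here are purely for demonstration):

# Minimal supervised classification sketch: a decision tree learns to map
# labelled inputs to one of the predefined classes.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # labelled dataset: features X, classes y
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_tr, y_tr)                          # learn the mapping from inputs to labels

y_pred = clf.predict(X_te)                   # predict classes for unseen inputs
print("Test accuracy:", accuracy_score(y_te, y_pred))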
Regression
Regression, on the other hand, deals with predicting continuous target variables, which represent
numerical values. For example, predicting the price of a house based on its size, location, and amenities,
or forecasting the sales of a product. Regression algorithms learn to map the input features to a continuous
numerical value. Some common regression algorithms include:
• Linear Regression
• Polynomial Regression
• Ridge Regression
• Lasso Regression
• Decision tree
• Random Forest
Advantages of Supervised Machine Learning:
• Supervised Learning models can have high accuracy as they are trained on labelled data.
• It can often be used in pre-trained models which saves time and resources when developing new
models from scratch.
Disadvantages of Supervised Machine Learning:
• It has limitations in recognising patterns and may struggle with unseen or unexpected patterns that
are not present in the training data.
Applications of Supervised Learning:
• Natural language processing: Extract information from text, such as sentiment, entities, and
relationships.
• Speech recognition: Convert spoken language into text.
• Predictive analytics: Predict outcomes, such as sales, customer churn, and stock prices.
• Weather forecasting: Make predictions for temperature, precipitation, and other meteorological
parameters.
• Sports analytics: Analyze player performance, make game predictions, and optimize strategies.
2. Unsupervised Machine Learning
Unsupervised learning is a technique in which the model is trained on unlabelled data: the algorithm must
discover patterns, groupings, and structure in the data on its own, without any labelled outputs to guide it.
There are two main categories of unsupervised learning that are mentioned below:
• Clustering
• Association
Clustering
Clustering is the process of grouping data points into clusters based on their similarity. This technique is
useful for identifying patterns and relationships in data without the need for labeled examples. Some common clustering algorithms include:
• Mean-shift algorithm
• DBSCAN Algorithm
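As a minimal clustering illustration (assuming scikit-learn and NumPy are available; the synthetic two-group data below is made up for demonstration):

# Minimal clustering sketch: group unlabelled points by similarity.
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled 2-D points: two loose groups around (0, 0) and (5, 5).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])

# Group the points into 2 clusters based on similarity; no labels are used.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster centres:\n", kmeans.cluster_centers_)
print("First ten assignments:", kmeans.labels_[:10])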
Association
Association rule learning is a technique for discovering relationships between items in a dataset. It
identifies rules that indicate the presence of one item implies the presence of another item with a specific
probability. Some common association rule learning algorithms include:
• Apriori Algorithm
• Eclat
• FP-growth Algorithm
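As a small illustration of the idea behind association rule mining (plain Python with a made-up basket of transactions; real systems would use Apriori or FP-growth implementations):

# Tiny Apriori-flavoured sketch: count itemset support and estimate the
# confidence of the rule {bread} -> {butter}.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

# Support counts for every single item and every pair of items.
support = Counter()
for t in transactions:
    for k in (1, 2):
        for itemset in combinations(sorted(t), k):
            support[itemset] += 1

n = len(transactions)
sup_bread = support[("bread",)] / n
sup_pair = support[("bread", "butter")] / n
confidence = sup_pair / sup_bread  # estimate of P(butter | bread)
print(f"support(bread, butter) = {sup_pair:.2f}, confidence(bread -> butter) = {confidence:.2f}")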
Advantages of Unsupervised Machine Learning:
• It helps to discover hidden patterns and various relationships between the data.
• Used for tasks such as customer segmentation, anomaly detection, and data exploration.
• It does not require labeled data and reduces the effort of data labeling.
Disadvantages of Unsupervised Machine Learning:
• Without using labels, it may be difficult to predict the quality of the model’s output.
• Cluster Interpretability may not be clear and may not have meaningful interpretations.
• It has techniques such as autoencoders and dimensionality reduction that can be used to extract
meaningful features from raw data.
Applications of Unsupervised Learning:
• Dimensionality reduction: Reduce the dimensionality of data while preserving its essential
information.
• Recommendation systems: Suggest products, movies, or content to users based on their historical
behavior or preferences.
• Image and video compression: Reduce the amount of storage required for multimedia content.
• Data preprocessing: Help with data preprocessing tasks such as data cleaning, imputation of
missing values, and data scaling.
• Genomic data analysis: Identify patterns or group genes with similar expression profiles.
• Customer behavior analysis: Uncover patterns and insights for better marketing and product
recommendations.
• Content recommendation: Classify and tag content to make it easier to recommend similar items
to users.
• Exploratory data analysis (EDA): Explore data and gain insights before defining specific tasks.
3. Semi-Supervised Learning
Semi-supervised learning is a machine learning approach that sits between supervised and unsupervised
learning, so it uses both labelled and unlabelled data. It is particularly useful when obtaining labelled data
is costly, time-consuming, or resource-intensive, and it is chosen when labelling data requires skills and
relevant resources in order to train or learn from it.
We use these techniques when only a small portion of the data is labelled and the large remaining portion
is unlabelled. We can use unsupervised techniques to predict labels and then feed these labels to
supervised techniques. This approach is especially applicable to image datasets, where usually not all
images are labelled.
Example: Consider building a language translation model; having labelled translations for every sentence
pair can be resource-intensive. Semi-supervised learning allows the model to learn from both labelled and
unlabelled sentence pairs, making it more accurate. This technique has led to significant improvements in
the quality of machine translation services.
There are a number of different semi-supervised learning methods each with its own characteristics. Some
of the most common ones include:
• Graph-based semi-supervised learning: This approach uses a graph to represent the relationships
between the data points. The graph is then used to propagate labels from the labeled data points
to the unlabeled data points.
• Label propagation: This approach iteratively propagates labels from the labeled data points to the
unlabeled data points, based on the similarities between the data points.
• Co-training: This approach trains two different machine learning models on different views (feature
subsets) of the data. Each model then labels unlabelled examples for the other, and confident predictions
are added to the other model’s training set.
• Self-training: This approach trains a machine learning model on the labelled data and then uses
the model to predict labels for the unlabelled data. The model is then retrained on the labelled
data together with the predicted (pseudo-)labels for the unlabelled data (a minimal sketch follows this list).
• Generative adversarial networks (GANs): GANs are a type of deep learning algorithm that can be
used to generate synthetic data. GANs can be used to generate unlabeled data for semi-supervised
learning by training two neural networks, a generator and a discriminator.
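As a minimal self-training sketch (assuming scikit-learn is available; the synthetic data, the 10% labelling rate, and the confidence threshold are illustrative assumptions):

# Minimal self-training (pseudo-labelling) sketch.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Build a dataset where only about 10% of the labels are kept; the rest are
# marked as unlabelled with -1, the convention scikit-learn expects.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) > 0.1] = -1

# The base classifier is trained on the labelled part, predicts labels for the
# unlabelled part, and its confident predictions are added to the training set.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.8)
model.fit(X, y_partial)
print("Accuracy against all true labels:", model.score(X, y))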
Advantages of Semi-Supervised Machine Learning:
• It leads to better generalization as compared to supervised learning, as it takes both labeled and
unlabeled data.
Disadvantages of Semi-Supervised Machine Learning:
• It still requires some labeled data that might not always be available or easy to obtain.
Applications of Semi-Supervised Learning:
• Image Classification and Object Recognition: Improve the accuracy of models by combining a small
set of labeled images with a larger set of unlabeled images.
• Natural Language Processing (NLP): Enhance the performance of language models and classifiers
by combining a small set of labeled text data with a vast amount of unlabeled text.
• Speech Recognition: Improve the accuracy of speech recognition by leveraging a limited amount
of transcribed speech data and a more extensive set of unlabeled audios.
• Healthcare and Medical Imaging: Enhance medical image analysis by utilizing a small set of labeled
medical images alongside a larger set of unlabeled images.
4. Reinforcement Learning
Reinforcement learning is a method in which an algorithm learns by interacting with an environment:
it produces actions, observes the results, and discovers its errors. Trial and error, and delayed reward, are
the most relevant characteristics of reinforcement learning. In this technique, the model keeps improving
its performance using reward feedback to learn the behaviour or pattern. These algorithms are tailored
to a particular problem, e.g. the Google self-driving car, or AlphaGo, where a bot competes with humans
and even itself to become a better and better Go player. Each time the agent acts on new data, it learns
and adds that experience to its knowledge, which serves as its training data; the more it learns, the better
trained and more experienced it becomes.
Here are some of the most common reinforcement learning algorithms:
• Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function, which maps states to
actions. The Q-function estimates the expected reward of taking a particular action in a given
state. (A minimal tabular sketch follows this list.)
• Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep learning. Deep Q-
learning uses a neural network to represent the Q-function, which allows it to learn complex
relationships between states and actions.
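To make the Q-learning update concrete, here is a minimal, self-contained tabular sketch on a made-up 5-state corridor environment (the environment, the +1 reward at the goal, and the hyperparameter values are illustrative assumptions, not something defined in the text above):

# Minimal tabular Q-learning sketch on a toy 5-state corridor:
# the agent starts in state 0 and receives a reward of +1 on reaching state 4.
import numpy as np

n_states, n_actions = 5, 2             # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))    # Q-table: estimated value of each (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0                              # start at the left end; state 4 is the goal
    while s != 4:
        # Epsilon-greedy action selection (explore when unsure, exploit otherwise).
        if rng.random() < epsilon or not Q[s].any():
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == 4 else 0.0
        # Q-learning update: move Q(s, a) toward reward + discounted best future value.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

# After training, the 'right' action should have the higher value in every non-terminal state.
print(np.round(Q, 2))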
Example: Consider training an AI agent to play a game like chess. The agent explores different moves and
receives positive or negative feedback based on the outcome. Reinforcement learning also finds
applications in robotics and control, where agents learn to perform tasks by interacting with their surroundings.
There are two main types of reinforcement:
Positive reinforcement: the agent is rewarded for a desired behaviour, making that behaviour more likely to be repeated.
• Examples: Giving a treat to a dog for sitting, providing a point in a game for a correct answer.
Negative reinforcement: a desired behaviour is strengthened by removing an unpleasant condition when the behaviour occurs, for example switching off an annoying alarm once the correct action has been taken.
Advantages of Reinforcement Learning:
• It supports autonomous decision-making and is well-suited for tasks that require learning a
sequence of decisions, like robotics and game-playing.
• This technique is preferred for achieving long-term results that are otherwise very difficult to achieve.
Disadvantages of Reinforcement Learning:
• It needs a lot of data and a lot of computation, which can make it impractical and costly.
Applications of Reinforcement Learning:
• Game Playing: RL can teach agents to play games, even complex ones.
• Autonomous Vehicles: RL can help self-driving cars navigate and make decisions.
• Natural Language Processing (NLP): RL can be used in dialogue systems and chatbots.
• Supply Chain and Inventory Management: RL can be used to optimize supply chain operations.
• Game AI: RL can be used to create more intelligent and adaptive NPCs in video games.
• Virtual Reality (VR) and Augmented Reality (AR): RL can be used to create immersive and
interactive experiences.
Supervised learning algorithms utilize the information on the class membership of each training instance.
This information allows supervised learning algorithms to detect pattern misclassifications as feedback to
themselves. Unsupervised learning algorithms, by contrast, work with unlabeled instances and process
them blindly or heuristically. Unsupervised learning algorithms often have less computational complexity
and less accuracy than supervised learning algorithms.
Supervised Learning vs Unsupervised Learning:
• Computational complexity: Supervised learning is more computationally complex; unsupervised learning is less computationally complex.
• Output data: In supervised learning the desired output is given; in unsupervised learning the desired output is not given.
• Test of model: In supervised learning we can test our model; in unsupervised learning we cannot test the model.
Common unsupervised learning tasks and example algorithms:
• Clustering: When data points need to be categorized into similar groupings or clusters, this is called a clustering problem. Example algorithms: K-Means, DBSCAN, Hierarchical clustering, Gaussian mixture models, BIRCH.
• Data generation: When data such as images, videos, articles, or posts needs to be generated, the problem is called a data generation problem. Example algorithms: Generative adversarial networks (GANs), Hidden Markov models.
Moving ahead, now let’s check out the basic differences between artificial intelligence and machine
learning.
The development of AI and ML has the potential to transform various industries and improve people’s lives
in many ways. AI systems can be used to diagnose diseases, detect fraud, analyze financial data, and
optimize manufacturing processes. ML algorithms can help to personalize content and services, improve
customer experiences, and even help to solve some of the world’s most pressing environmental challenges.
Artificial Intelligence vs Machine Learning:
• AI aims to increase the chance of success, not accuracy; ML aims to increase accuracy, without being concerned with the chance of success.
• AI works as a computer program that does smart work; ML systems take data and learn from the data.
• AI can work with structured, semi-structured, and unstructured data; ML can work only with structured and semi-structured data.
• AI is used for tasks such as recognizing images and sounds, making decisions, and solving complex problems; ML is used to analyze data and make predictions, decisions, and recommendations.
• AI is a broader concept that encompasses many different applications, including robotics, natural language processing, speech recognition, and autonomous vehicles, and AI systems can be used to solve complex problems in various fields, such as healthcare, finance, and transportation; ML, on the other hand, is primarily used for pattern recognition, predictive modeling, and decision-making in fields such as marketing, fraud detection, and credit scoring.
An Artificial Neural Network has a huge number of interconnected processing elements, also known as nodes.
These nodes are connected to other nodes using connection links. Each connection link carries a weight, and
these weights contain the information about the input signal. Each iteration over the inputs updates these
weights. After all the data instances from the training data set have been fed in, the final weights of the
neural network, together with its architecture, are known as the trained neural network. This process is
called training of neural networks. The trained network can then solve the specific problem defined in the
problem statement.
Types of tasks that can be solved using an artificial neural network include Classification problems, Pattern
Matching, Data Clustering, etc.
We use artificial neural networks because they learn very efficiently and adaptively. They have the
capability to learn “how” to solve a specific problem from the training data it receives. After learning, it
can be used to solve that specific problem very quickly and efficiently with high accuracy.
Some real-life applications of neural networks include Air Traffic Control, Optical Character Recognition as
used by some scanning apps like Google Lens, Voice Recognition, etc.
• Identifying objects, faces, and understanding spoken language in applications like self-driving cars
and voice assistants.
• Analyzing and understanding human language, enabling sentiment analysis, chatbots, language
translation, and text generation.
• Diagnosing diseases from medical images, predicting patient outcomes, and drug discovery.
• Predicting stock prices, credit risk assessment, fraud detection, and algorithmic trading.
• Powering robotics and autonomous vehicles by processing sensor data and making real-time
decisions.
• Enhancing game AI, generating realistic graphics, and creating immersive virtual environments.
• Monitoring and optimizing manufacturing processes, predictive maintenance, and quality control.
• Analyzing complex datasets, simulating scientific phenomena, and aiding in research across
disciplines.
Artificial Neural Networks (ANN): In this narrow sense, an ANN is a feed-forward neural network, because the
inputs are processed only in the forward direction. It can also contain hidden layers, which can make the model
even denser. The inputs have a fixed length as specified by the programmer. It is used for textual or tabular
data; a widely used real-life application is facial recognition. It is comparatively less powerful than CNNs and RNNs.
Convolutional Neural Networks (CNN): CNNs are mainly used for image data and for computer vision tasks;
real-life applications include object detection in autonomous vehicles. A CNN contains a combination of
convolutional layers and neurons. It is more powerful than both ANNs and RNNs.
Recurrent Neural Networks (RNN): RNNs are used to process and interpret time series data. In this type of
model, the output from a processing node is fed back into nodes in the same or previous layers. The best-known
type of RNN is the LSTM (Long Short-Term Memory) network.
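As a minimal sketch of the forward pass through a small feed-forward network of the kind described above (NumPy only; the layer sizes, random weights, and activation choices are illustrative assumptions, since real weights would be learned during training):

# Minimal forward pass through a tiny fully connected network with one hidden layer.
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                         # one input example with 4 features

# Weights and biases; in practice these are the values learned during training.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)  # input layer -> hidden layer (3 nodes)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)  # hidden layer -> output node

hidden = relu(W1 @ x + b1)            # signals flow forward through the hidden layer
output = sigmoid(W2 @ hidden + b2)    # final prediction, e.g. a class probability
print("Predicted probability:", output.item())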
Now that we know the basics about Neural Networks, we know that Neural Networks’ learning capability
is what makes it interesting.
What is Probability?
Probability can be defined as the ratio of the number of favorable outcomes to the total number of
outcomes of an event. For an experiment having n possible outcomes, of which x are favorable, the
probability of the event is:
P(Event) = x / n
Joint Probability: the probability of two random events occurring simultaneously. For two independent events A and B:
P(A ∩ B) = P(A) · P(B)
where P(A) and P(B) are the individual probabilities of A and B, and P(A ∩ B) is the probability that both occur.
Bayes theorem (also known as the Bayes Rule or Bayes Law) is used to determine the conditional
probability of event A when event B has already occurred.
The general statement of Bayes’ theorem is: “The conditional probability of an event A, given the
occurrence of another event B, is equal to the product of the probability of B given A and the probability
of A, divided by the probability of event B,” i.e.
P(A|B) = P(B|A) · P(A) / P(B)
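As a quick numeric illustration of Bayes’ theorem (the disease prevalence and test rates below are made-up numbers):

# Hypothetical numbers: a disease affects 1% of a population; a test detects it
# 95% of the time (sensitivity) and gives a false positive 5% of the time.
p_disease = 0.01            # P(A): prior probability of disease
p_pos_given_disease = 0.95  # P(B|A): test positive given disease
p_pos_given_healthy = 0.05  # P(B|not A): false positive rate

# Total probability of a positive test, P(B).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # roughly 0.161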
Vector Calculus is a branch of mathematics that deals with the operations of calculus i.e. differentiation
and integration of vector field usually in a 3 Dimensional physical space also called Euclidean Space. The
applicability of Vector calculus is extended to partial differentiation and multiple integration. Vector Field
refers to a point in space that has magnitude and direction. These Vector Fields are nothing but Vector
Functions. Vector calculus is also known as vector analysis.
The vector fields are the vector functions whose domain and range are not dimensionally related to each
other. The branch of Vector Calculus corresponds to the multivariable calculus which deals with partial
differentiation and multiple integration. This differentiation and integration of vector is done for a quantity
in 3D physical space represented as R3. For n-dimensional space, it is represented as Rn.
Vector calculus, also known as vector analysis or vector differential calculus, is a branch of mathematics
that deals with vector fields and the differentiation and integration of vector functions
Vector Calculus often called Vector Analysis deals with vector quantities i.e. the quantities that have both
magnitude as well as direction. Since we know that Vector Calculus deals
with differentiation and integration of functions, there are three types of integrals dealt with in Vector
Calculus that are
• Line Integral
• Surface Integral
• Volume Integral
Line Integral
Line Integral in mathematics is the integration of a function along the line of the curve. The function can
be a scalar or vector whose line integral is given by summing up the values of the field at all points on a
curve weighted by some scalar function on the curve. Line Integral is also called Path Integral and is
represented by Φ = ∫Lf. Line Integral has got its application in physics. For Example, Work Done by Force is
along a path given as W = ∫LF(s).ds because we know that work done is given as the product of force and
distance covered.
Surface Integral
Surface Integral in mathematics is the integration of a function along the whole region or space that is not
flat. In Surface integral, the surfaces are assumed of small points hence, the integration is given by
summing up all the small points present on the surface. The surface integral is equivalent to the double
integration of a line integral. Surface Integral has got its application in Electromagnetism and many more
branches of physics where the vector function is spread over the surface. Surface Integral is represented
as ∬sf(x,y)dA.
Volume Integral
A volume integral, also known as a triple integral, is a mathematical concept used in calculus and vector
calculus to calculate the volume of a three-dimensional region within a space. It is an extension of the
concept of a definite integral in one dimension to three dimensions.
Mathematically, the volume integral of a scalar function f(x, y, z) over a region R in three-dimensional
space is denoted as:
∭R f(x, y, z) dV
where dV is the infinitesimal volume element (dV = dx dy dz) and R is the region of integration.
Operations on Vectors
Some of the operations performed on vector quantities are listed below with their notation and meaning:
• Scalar multiplication, q·r1: multiplying a vector r1 by a scalar q results in a vector.
• Scalar triple product, r1 · (r2 ⨯ r3): the dot product of one vector with the cross product of two other vectors.
• Vector triple product, r1 ⨯ (r2 ⨯ r3): the cross product of one vector with the cross product of two other vectors.
Applications of vector calculus include:
• Navigation
• Sports
• Three-dimensional geometry
Decision theory
Decision theory is an interdisciplinary field that deals with the logic and methodology of making choices,
particularly under conditions of uncertainty. It is a branch of applied probability theory and analytic
philosophy that involves assigning probabilities to various factors and numerical consequences to
outcomes.
At its core, decision theory is the study of choices under uncertainty. It seeks to identify the optimal action
from a set of possible actions by evaluating the outcomes of each decision. There are two primary
branches of decision theory:
• Normative decision theory focuses on identifying the optimal decision, assuming the decision-
maker is rational and has complete information.
• Descriptive decision theory examines how decisions are actually made in practice, often dealing
with cognitive limitations and psychological biases.
To choose courses of action that maximize predicted benefit, decision theory blends probability, the
likelihood of outcomes, with utility, the worth of outcomes. In this way, AI systems imitate human
decision-making, even in the face of erroneous or inadequate evidence.
AI systems often use decision theory in two primary ways: supervised learning and reinforcement
learning.
1. Supervised Learning
In supervised learning, AI systems are trained using labeled data to make predictions or decisions. Decision
theory helps optimize the classification or regression tasks by evaluating the trade-offs between false
positives, false negatives, and other outcomes based on the utility of each result.
For instance, in medical diagnosis, the utility of correctly identifying a disease may be far higher than the
cost of a false alarm, leading the AI to favor sensitivity over specificity.
2. Reinforcement Learning
Reinforcement learning (RL) is one of the key areas where decision theory shines in AI. In RL, agents learn
to make decisions through trial and error, receiving feedback from their environment in the form of
rewards or penalties.
• In Markov Decision Processes (MDPs), a common formalism for reinforcement learning, the agent needs
to choose actions that optimize future rewards, which aligns with the decision-theoretic concept of
maximizing expected utility.
Key Components of Decision Theory
1. Agents and Actions: In decision theory, an agent is an entity that makes decisions. The agent has
a set of possible actions or decisions to choose from.
2. States of the World: These represent the possible conditions or scenarios that may affect the
outcome of the agent’s decision. The agent often has incomplete knowledge about the current or
future states of the world.
3. Outcomes and Consequences: Every decision leads to an outcome. Outcomes can be desirable,
neutral, or undesirable, depending on the goals of the agent.
4. Probabilities: Since outcomes are often uncertain, decision theory involves assigning probabilities
to different states or outcomes based on available data.
5. Utility Function: This is a measure of the desirability of an outcome. A utility function quantifies
how much an agent values a specific result, helping in ranking outcomes to guide decisions.
6. Decision Rules: These are the guidelines the agent follows to choose the best action. Examples
include Maximization of Expected Utility (MEU), where an agent selects the action that offers
the highest expected utility (a small numeric sketch follows this list).
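As a small numeric sketch of the MEU rule (the states, probabilities, and utility values are made up for illustration):

# Maximisation of Expected Utility (MEU) with made-up numbers: an agent chooses
# between two actions given probabilities over two weather states.
probabilities = {"rain": 0.3, "sun": 0.7}

# Utility of each (action, state) outcome, on an arbitrary scale.
utilities = {
    "take_umbrella": {"rain": 8, "sun": 5},
    "leave_umbrella": {"rain": 0, "sun": 10},
}

expected_utility = {
    action: sum(probabilities[s] * u for s, u in outcomes.items())
    for action, outcomes in utilities.items()
}
best_action = max(expected_utility, key=expected_utility.get)
print(expected_utility)        # expected utilities: take ~ 5.9, leave = 7.0
print("MEU decision:", best_action)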
Waymo uses decision theory to help its vehicles make safe, rational choices under uncertainty. The AI
system processes data from a variety of sensors (LIDAR, radar, and cameras) to assess the vehicle’s
environment. By assigning probabilities to different events (such as a pedestrian stepping into the road or
a nearby vehicle swerving), the system evaluates multiple actions, such as slowing down, changing lanes,
or stopping, and selects the one that maximizes safety while minimizing disruptions.
• States of the world: Traffic signals, vehicle speeds, weather conditions, and positions of other
road users.
• Utility: Prioritizing passenger safety and compliance with traffic laws while maintaining a smooth
and efficient ride.
While decision theory provides a robust framework for decision-making in AI, there are several
challenges associated with its implementation:
1. Complexity and Computation: Calculating optimal decisions based on large-scale data with many
variables is computationally expensive. Approximation algorithms and heuristics are often used in
practice to overcome these limitations.
Information Theory
Information theory is a mathematical framework for quantifying information, data compression, and
transmission. In machine learning, information theory provides powerful tools for analyzing and improving
algorithms.
Key Concepts of Information Theory
1. Entropy
Entropy measures the uncertainty or unpredictability of a random variable. In machine learning, entropy
quantifies the amount of information required to describe a dataset. For a discrete random variable X with
probability mass function p(x), the entropy is H(X) = −Σx p(x) log p(x).
• Interpretation: Higher entropy indicates greater unpredictability, while lower entropy indicates
more predictability.
2. Mutual Information
Mutual information measures the amount of information obtained about one random variable through
another random variable. It quantifies the dependency between variables; formally, I(X; Y) = H(X) − H(X|Y).
Applications of Information Theory in Machine Learning
1. Feature Selection
Feature selection aims to identify the most relevant features for building a predictive model. Information-
theoretic measures like mutual information can quantify the relevance of each feature with respect to the
target variable.
• Method: Calculate the mutual information between each feature and the target variable. Select
features with the highest mutual information values.
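A minimal sketch of this method (assuming scikit-learn is available; the Iris dataset is used purely for illustration):

# Feature selection via mutual information between each feature and the target.
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

# Estimate the mutual information of each feature with the class label.
mi = mutual_info_classif(X, y, random_state=0)

# Rank features; those with the highest mutual information are the most relevant.
for name, score in sorted(zip(feature_names, mi), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")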
2. Decision Trees
Decision trees use entropy and information gain to split nodes and build a tree structure. Information gain,
based on entropy, measures the reduction in uncertainty after splitting a node.
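A minimal sketch of entropy and information gain for one candidate split (NumPy only; the toy labels and split below are made up for illustration):

# Entropy and information gain for a candidate decision-tree split.
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Toy node with 10 samples: 5 positives and 5 negatives.
parent = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# A candidate split producing two child nodes.
left, right = np.array([1, 1, 1, 1, 0]), np.array([1, 0, 0, 0, 0])

weighted_child_entropy = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted_child_entropy
print(f"parent entropy = {entropy(parent):.3f}, information gain = {info_gain:.3f}")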
3. Regularization with KL Divergence
KL (Kullback–Leibler) divergence is used in regularization techniques like variational inference in Bayesian neural networks.
By minimizing KL divergence between the approximate and true posterior distributions, we achieve better
model regularization.
• Example: Variational Autoencoders (VAEs) use KL divergence to regularize the latent space
distribution, ensuring it follows a standard normal distribution.
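A minimal sketch of computing KL divergence between two discrete distributions (NumPy only; the example distributions are made up):

# KL divergence between two discrete probability distributions.
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.7, 0.2, 0.1]   # e.g. an approximate posterior
q = [0.5, 0.3, 0.2]   # e.g. the target (prior) distribution

print(f"KL(P || Q) = {kl_divergence(p, q):.4f}")
print(f"KL(Q || P) = {kl_divergence(q, p):.4f}  # note: KL divergence is not symmetric")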
4. Information Bottleneck
The information bottleneck method aims to find a compressed representation of the input data that
retains maximal information about the output.
• Objective: Maximize mutual information between the compressed representation and the output
while minimizing mutual information between the input and the compressed representation.
Generative Algorithms
Generative algorithms are designed to simulate the joint probability distribution of the input features and
labels. In order to create fresh samples, their goal is to learn the underlying data distribution.
To capture the statistical characteristics of the full dataset, generative models use a variety of strategies,
such as Gaussian Mixture Models (GMMs) or Hidden Markov Models (HMMs). The combined probability
distribution is modeled to give generative algorithms a comprehensive grasp of the data.
After learning the distribution, generative models can create artificial samples that mirror the training set.
They are useful for jobs like text generation, where the model picks up the grammar and creates logical
text sequences.
• Text Generation and Language Modelling: Recurrent Neural Networks (RNNs) and Transformers
are two examples of generative models that have excelled in text generation tasks. They create
fresh, meaningful sequences by learning the statistical patterns in text data.
• Image and Video Synthesis: The discipline of image synthesis has undergone a revolution thanks
to Generative Adversarial Networks (GANs). GANs create convincing, lifelike visuals by pitting a
generator against a discriminator. They are useful for creating virtual characters, synthetic videos,
and artwork.
Because they model the entire data distribution, generative models are useful for addressing missing or
incomplete data. They might have trouble with discrimination tasks in complicated datasets, though. They
might not be particularly good at distinguishing between classes or categories because their concentration
is on modeling the total data distribution.
Aiming to simulate the combined probability distribution of the input features (X) and the related class
labels (Y), generative methods are used to create new data. To create fresh samples, they learn the
probability distribution for each class and use it. By selecting a sample from the learned distribution, these
algorithms can produce new data points. Additionally, they employ the Bayes theorem to estimate the
conditional probability of a class given the input features. Gaussian Mixture Models (GMMs), and Hidden
Markov Models (HMMs) are a few examples of generative algorithms.
The goal of generative algorithms is to model P(X, Y), the joint probability distribution of the input data (X)
and the accompanying class labels (Y). Generative algorithms can produce fresh samples and learn about
the underlying data by estimating this joint distribution. Estimating the
prior probability of each class, P(Y), as well as the class-conditional probability distribution, P(X|Y), is
important to the mathematical reasoning behind generative algorithms. Utilizing methods like maximum
likelihood estimation (MLE) or maximum a posteriori (MAP) estimation, these estimations can be derived.
Once these probabilities have been learned, the posterior probability of the class given the input features,
P(Y|X), is computed using Bayes’ theorem. It is possible to categorize new data points using this posterior
probability.
Discriminative Algorithms
Discriminative algorithms are primarily concerned with simulating the conditional probability distribution
of the output labels given the input features. Their goal is to understand the line of judgment that
delineates various classes or categories.
Learning the Decision Boundary
Discriminative models, such as Logistic Regression, Support Vector Machines (SVMs), and Neural Networks,
learn the decision boundary that best distinguishes the various classes in the data. They are trained to
produce predictions on the basis of the input features and their associated labels.
• Sentiment Analysis: Discriminative models perform exceptionally well in tasks involving sentiment
analysis, where the goal is to ascertain the sentiment of text data. These models make it possible
for applications like sentiment analysis in social media or customer feedback analysis by teaching
the link between text elements and sentiment labels.
Discriminative models excel at tasks that require a clear distinction between classes or categories. They
perform incredibly well in categorization problems by concentrating on the decision border. As they rely
on labeled samples for training, they could struggle with data that is incomplete or missing.
Discriminative algorithms are designed to directly represent the decision boundary rather than implicitly
modeling the underlying probability distribution. In light of the input features, they concentrate on
estimating the conditional probability of the class label. The classes in the input feature space are divided
by a decision boundary learned by these algorithms. Support vector machines (SVMs), neural networks,
and logistic regression are a few examples of discriminative algorithms. Discriminative models are
frequently utilized when the decision boundary is complex or when there is a lot of training data because
they typically perform well in classification tasks.
Mathematical Intuitions
Discriminative algorithms seek to directly represent the line where two classes diverge without explicitly
modeling the probability distribution that underlies that line. They concentrate on estimating the
conditional probability of the class label given the input features, represented as P(Y|X), rather than
calculating the joint distribution. Learning the variables or weights that specify the decision boundary is
essential to understanding the mathematical intuition behind discriminative algorithms. The use of
optimization techniques like gradient descent and maximum likelihood estimation is common in this
learning process. The objective is to identify the parameters that maximize the likelihood of the observed
data given the model while minimizing the classification error. Discriminative algorithms can instantly
categorize fresh data points after learning the parameters by calculating the conditional probability P(Y|X),
then selecting the class label with the highest probability.
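A minimal sketch contrasting the two families on the same data (assuming scikit-learn; Gaussian Naive Bayes stands in for a generative classifier and logistic regression for a discriminative one, trained on made-up synthetic data):

# Generative vs discriminative classifiers on the same synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# GaussianNB models P(X | Y) and P(Y); LogisticRegression models P(Y | X) directly.
generative = GaussianNB().fit(X_tr, y_tr)
discriminative = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("Generative (GaussianNB) accuracy:", generative.score(X_te, y_te))
print("Discriminative (LogisticRegression) accuracy:", discriminative.score(X_te, y_te))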
Generative vs Discriminative Models:
• Methodology: Generative models create new samples by learning the distribution of the underlying data; discriminative models learn the decision boundary that distinguishes various classes or categories.
• Application: Generative models are used for generative tasks such as text generation and image synthesis; discriminative models are used in activities like sentiment analysis and image categorization.
• Weakness: Generative models may have trouble distinguishing different classes in large datasets; discriminative models are less useful when dealing with incomplete or missing data.
Generative Models Explained
Generative models are a cornerstone in the world of artificial intelligence (AI). Their primary function is to
understand and capture the underlying patterns or distributions from a given set of data. Once these
patterns are learned, the model can then generate new data that shares similar characteristics with the
original dataset.
Imagine you're teaching a child to draw animals. After showing them several pictures of different animals,
the child begins to understand the general features of each animal. Given some time, the child might draw
an animal they've never seen before, combining features they've learned. This is analogous to how a
generative model operates: it learns from the data it's exposed to and then creates something new based
on that knowledge.
The distinction between generative and discriminative models is fundamental in machine learning:
Generative models: These models focus on understanding how the data is generated. They aim to learn
the distribution of the data itself. For instance, if we're looking at pictures of cats and dogs, a generative
model would try to understand what makes a cat look like a cat and a dog look like a dog. It would then
be able to generate new images that resemble either cats or dogs.
Discriminative models: These models, on the other hand, focus on distinguishing between different types
of data. They don't necessarily learn or understand how the data is generated; instead, they learn the
boundaries that separate one class of data from another. Using the same example of cats and dogs, a
discriminative model would learn to tell the difference between the two, but it wouldn't necessarily be
able to generate a new image of a cat or dog on its own.
In the realm of AI, generative models play a pivotal role in tasks that require the creation of new content.
This could be in the form of synthesizing realistic human faces, composing music, or even generating
textual content. Their ability to "dream up" new data makes them invaluable in scenarios where original
content is needed, or where the augmentation of existing datasets is beneficial.
In essence, while discriminative models excel at classification tasks, generative models shine in their ability
to create. This creative prowess, combined with their deep understanding of data distributions, positions
generative models as a powerful tool in the AI toolkit.
Generative models come in various forms, each with its unique approach to understanding and generating
data. Here's a more comprehensive list of some of the most prominent types:
• Bayesian networks. These are graphical models that represent the probabilistic relationships
among a set of variables. They're particularly useful in scenarios where understanding causal
relationships is crucial. For example, in medical diagnosis, a Bayesian network might help
determine the likelihood of a disease given a set of symptoms.
• Diffusion models. These models describe how things spread or evolve over time. They're often
used in scenarios like understanding how a rumor spreads in a network or predicting the spread
of a virus in a population.
• Generative Adversarial Networks (GANs). GANs consist of two neural networks, the generator
and the discriminator, that are trained together. The generator tries to produce data, while the
discriminator attempts to distinguish between real and generated data. Over time, the generator
becomes so good that the discriminator can't tell the difference. GANs are popular in image
generation tasks, such as creating realistic human faces or artworks.
• Variational Autoencoders (VAEs). VAEs are a type of autoencoder that produces a compressed
representation of input data, then decodes it to generate new data. They're often used in tasks
like image denoising or generating new images that share characteristics with the input data.
• Restricted Boltzmann Machines (RBMs). RBMs are neural networks with two layers that can learn
a probability distribution over its set of inputs. They've been used in recommendation systems,
like suggesting movies on streaming platforms based on user preferences.
• Pixel Recurrent Neural Networks (PixelRNNs). These models generate images pixel by pixel, using
the context of previous pixels to predict the next one. They're particularly useful in tasks where
the sequential generation of data is crucial, like drawing an image line by line.
• Markov chains. These are models that predict future states based solely on the current state,
without considering the states that preceded it. They're often used in text generation, where the
next word in a sentence is predicted based on the current word.
• Normalizing flows. These are a series of invertible transformations applied to simple probability
distributions to produce more complex distributions. They're useful in tasks where understanding
the transformation of data is crucial, like in financial modeling.
Real-World Use Cases of Generative Models
Generative models have penetrated mainstream consumption, revolutionizing the way we interact with
technology and experience content, for example:
• Art creation. Artists and musicians are using generative models to create new pieces of art or
compositions, based on styles they feed into the model. For example, Midjourney is a very
popular tool that is used to generate artwork.
• Drug discovery. Scientists can use generative models to predict molecular structures for new
potential drugs.
• Content creation. Website owners leverage generative models to speed up the content creation
process. For example, Hubspot's AI content writer helps marketers generate blog posts, landing
page copy and social media posts.
• Video games. Game designers use generative models to create diverse and unpredictable game
environments or characters.
Regression predicts continuous output variables based on independent input variables, for example
the prediction of house prices from parameters such as house age, distance from the main road,
location, and area.
Linear regression is a type of supervised machine learning algorithm that computes the linear
relationship between the dependent variable and one or more independent features by fitting a
linear equation to observed data.
Simple Linear Regression
This is the simplest form of linear regression, and it involves only one independent variable and
one dependent variable. The equation for simple linear regression is:
y=β0+β1X
where:
• β0 is the intercept
• β1 is the slope
Multiple Linear Regression
This involves more than one independent variable and one dependent variable. The equation for
multiple linear regression is:
y = β0 + β1X1 + β2X2 + … + βnXn
where:
• β0 is the intercept
• β1, β2, …, βn are the coefficients (slopes) of the independent variables X1, X2, …, Xn
The goal of the algorithm is to find the best Fit Line equation that can predict the values based
on the independent variables.
In regression, a set of records with X and Y values is available, and these values are used to learn a
function so that Y can be predicted for a new, unknown X. In regression we have to find a continuous
value of Y, so a function is required that predicts Y given X as the independent features.
Our primary objective while using linear regression is to locate the best-fit line, which implies that
the error between the predicted and actual values should be kept to a minimum. There will be the
least error in the best-fit line.
The best Fit Line equation provides a straight line that represents the relationship between the
dependent and independent variables. The slope of the line indicates how much the dependent
variable changes for a unit change in the independent variable(s).
Here Y is called a dependent or target variable and X is called an independent variable also known
as the predictor of Y. There are many types of functions or modules that can be used for regression.
A linear function is the simplest type of function. Here, X may be a single feature or multiple
features representing the problem.
Linear regression performs the task of predicting a dependent variable value (y) based on a given
independent variable (x); hence the name linear regression. For example, if X (input) is work experience
and Y (output) is a person’s salary, the regression line is the best-fit line for our model.
We utilize a cost function to compute the best parameter values, since different values for the weights
(coefficients) of the line result in different regression lines; a common choice is the mean squared error
between the predicted and actual values.
Least squares problems fall into two categories, linear and non-linear, depending on the linearity or
non-linearity of the residuals. Linear problems are often seen in regression analysis in statistics, while
non-linear problems are generally handled by an iterative method of refinement, in which the model is
approximated by a linear one at each iteration.
In linear regression, the line of best fit is a straight line.
The residuals (offsets) of the data points from the line are minimized. Vertical offsets are generally used
in surface, polynomial and hyperplane problems, while perpendicular offsets are utilized in common practice.
The least-square method states that the curve that best fits a given set of observations, is said to be a
curve having a minimum sum of the squared residuals (or deviations or errors) from the given data
points. Let us assume that the given points of data are (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) in which all x’s
are independent variables, while all y’s are dependent ones. Also, suppose that f(x) is the fitting curve
and d represents error or deviation from each given point.
d1 = y1 − f(x1)
d2 = y2 − f(x2)
d3 = y3 − f(x3)
…..
dn = yn – f(xn)
The least-squares criterion states that the best-fitting curve has the property that the sum of squares of
all the deviations from the given values, d1² + d2² + … + dn², is a minimum. For fitting a straight line
y = a + bx, this leads to the two normal equations:
∑y = na + b∑x
∑xy = a∑x + b∑x²
Solving these two normal equations, we can get the required trend line equation.
Solved Example
The Least Squares Model for a set of data (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) passes through the point (xa,
ya) where xa is the average of the xi‘s and ya is the average of the yi‘s. The below example explains how to
find the equation of a straight line or a least square line using the least square method.
Question:
xi 8 3 2 10 11 3 6 5 6 8
yi 4 12 1 12 9 4 9 6 1 14
Use the least square method to determine the equation of line of best fit for the data. Then plot the line.
Solution:
The two normal equations are:
∑y = na + b∑x
∑xy = a∑x + b∑x²
First, tabulate the required sums:

x       y       x²      xy
8       4       64      32
3       12      9       36
2       1       4       2
10      12      100     120
11      9       121     99
3       4       9       12
6       9       36      54
5       6       25      30
6       1       36      6
8       14      64      112
∑x=62   ∑y=72   ∑x²=468 ∑xy=503

With n = 10, the normal equations become:
10a + 62b = 72
62a + 468b = 503
Multiplying the first equation by 62 and the second by 10, then subtracting:
-836b = -566
b = 566/836
b = 283/418
b = 0.677
10a + 62(0.677) = 72
10a + 41.974 = 72
10a = 72 – 41.974
10a = 30.026
a = 30.026/10
a = 3.0026
y = a + bx
y = 3.0026 + 0.677x
Now, we can find the fitted values from y = 3.0026 + 0.677x and compute the sum of squares of the deviations (residuals) from them; a short computational check follows.
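A short computational check of the example above (assuming NumPy is available):

# Fit y = a + b*x by least squares and report the sum of squared deviations.
import numpy as np

x = np.array([8, 3, 2, 10, 11, 3, 6, 5, 6, 8], dtype=float)
y = np.array([4, 12, 1, 12, 9, 4, 9, 6, 1, 14], dtype=float)

# np.polyfit returns the slope and intercept of the least-squares line.
b, a = np.polyfit(x, y, deg=1)
print(f"a = {a:.4f}, b = {b:.4f}")   # matches a = 3.0026, b = 0.677 above

residuals = y - (a + b * x)
print("Sum of squared deviations:", round(float(np.sum(residuals ** 2)), 3))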
Underfitting in Machine Learning
A statistical model or a machine learning algorithm is said to underfit when the model is too simple to
capture the complexities of the data. It represents the inability of the model to learn the training data
effectively, resulting in poor performance on both the training and the testing data. In simple terms, an
underfit model is inaccurate, especially when applied to new, unseen examples. It mainly happens when we
use a very simple model with overly simplified assumptions. To address the underfitting problem, we need
to use more complex models, with enhanced feature representation, and less regularization.
Note: The underfitting model has High bias and low variance.
Reasons for underfitting:
1. The model is too simple, so it may not be capable of representing the complexities in the data.
2. The input features used to train the model are not adequate representations of the underlying
factors influencing the target variable.
3. Excessive regularization is used to prevent overfitting, which constrains the model from capturing
the data well.
In addition to the remedies mentioned above, increasing the number of epochs or the duration of training
can help reduce underfitting.
Overfitting in Machine Learning
A statistical model is said to be overfitted when the model does not make accurate predictions on testing
data. When a model gets trained with so much data, it starts learning from the noise and inaccurate data
entries in our data set. And when testing with test data results in High variance. Then the model does not
categorize the data correctly, because of too many details and noise. The causes of overfitting are the non-
parametric and non-linear methods because these types of machine learning algorithms have more
freedom in building the model based on the dataset and therefore, they can really build unrealistic models.
A solution to avoid overfitting is using a linear algorithm if we have linear data or using the parameters like
the maximal depth if we are using decision trees.
In a nutshell, Overfitting is a problem where the evaluation of machine learning algorithms on training
data is different from unseen data.
Techniques to reduce overfitting:
1. Improving the quality of training data reduces overfitting by focusing on meaningful patterns and
mitigating the risk of fitting noise or irrelevant features.
2. Increasing the amount of training data can improve the model’s ability to generalize to unseen data
and reduce the likelihood of overfitting.
3. Early stopping during the training phase (keep an eye on the loss over the training period; as soon
as the loss begins to increase, stop training).
The main purpose of cross validation is to prevent overfitting, which occurs when a model is trained too
well on the training data and performs poorly on new, unseen data. By evaluating the model on multiple
validation sets, cross validation provides a more realistic estimate of the model’s generalization
performance, i.e., its ability to perform well on new, unseen data.
Types of Cross-Validation
There are several types of cross validation techniques, including k-fold cross validation, leave-one-out
cross validation, and Holdout validation, Stratified Cross-Validation. The choice of technique depends on
the size and nature of the data, as well as the specific requirements of the modeling problem.
1. Holdout Validation
In holdout validation, we perform training on 50% of the given dataset and the remaining 50% is used for
testing. It’s a simple and quick way to evaluate a model. The major drawback of this method is that, because
we train on only 50% of the dataset, the remaining 50% may contain important information that the model
never sees while training, which can lead to higher bias.
2. LOOCV (Leave One Out Cross Validation)
In this method, we perform training on the whole dataset but leave out one data point at a time, iterating
over every data point. In LOOCV, the model is trained on n−1 samples and tested
on the one omitted sample, repeating this process for each data point in the dataset. It has some
advantages as well as disadvantages also.
An advantage of using this method is that we make use of all data points and hence it is low bias.
The major drawback of this method is that it leads to higher variation in the testing of the model, as we
test against a single data point each time; if that data point is an outlier, it can lead to higher variation.
Another drawback is that it takes a lot of execution time, as it iterates once for every data point in the dataset.
3. Stratified Cross-Validation
It is a technique used in machine learning to ensure that each fold of the cross-validation process maintains
the same class distribution as the entire dataset. This is particularly important when dealing with
imbalanced datasets, where certain classes may be underrepresented. In this method,
1. The dataset is divided into k folds while maintaining the proportion of classes in each fold.
2. During each iteration, one-fold is used for testing, and the remaining folds are used for training.
3. The process is repeated k times, with each fold serving as the test set exactly once.
4. K-Fold Cross-Validation
In k-fold cross-validation, we split the dataset into k subsets (known as folds), then we perform training on k−1 of the subsets and leave one subset out for evaluation of the trained model. We iterate k times, with a different subset reserved for testing each time.
Note: A value of k = 10 is usually suggested, since a lower value of k moves the procedure towards a simple train/validation split, while a higher value of k approaches the LOOCV method.
Example of K Fold Cross Validation
The diagram below shows an example of the training subsets and evaluation subsets generated in k-fold cross-validation. Here, we have 25 instances in total. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (instances [1-5] for testing and [6-25] for training), while in the second iteration we use the second subset of 20 percent for evaluation and the remaining subsets for training (instances [6-10] for testing and [1-5 and 11-25] for training), and so on (a short scikit-learn sketch follows).
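As a brief illustration (not part of the original example), the same 25-instance, 5-fold setup can be run with scikit-learn; the synthetic dataset and the logistic-regression model used here are arbitrary stand-ins for whatever data and model are actually at hand.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=25, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5 folds of 5 instances each, mirroring the 25-instance example above
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kfold)
print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())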
What is overfitting? In machine learning, overfitting occurs when an algorithm fits too closely or even
exactly to its training data, resulting in a model that can't make accurate predictions or conclusions from
any data other than the training data.
Underfitting is a scenario in data science where a data model is unable to capture the relationship between
the input and output variables accurately, generating a high error rate on both the training set and unseen
data.
Lasso regression is a regularized regression algorithm that uses shrinkage to produce simple and sparse models (i.e. models with fewer parameters). In shrinkage, data values are shrunk towards a central point such as the mean. Lasso performs L1 regularization, which adds a penalty equal to the absolute value of the magnitude of the coefficients.
“LASSO” stands for Least Absolute Shrinkage and Selection Operator. Lasso regression is well suited to models showing high levels of multicollinearity, or when you want to automate parts of model selection such as variable selection or parameter elimination. Lasso solutions are quadratic programming problems that can be solved with software like RStudio, MATLAB, etc. It has the ability to select predictors.
The Lasso objective, in its penalized form, is
minimize over β:  ∑i ( yi − β0 − ∑j βj xij )²  +  λ ∑j |βj|
Here:
• xij is the value of the j-th predictor for the i-th observation, yi is the response, and βj are the regression coefficients.
The algorithm minimizes the sum of squares under this L1 constraint; some βj are shrunk exactly to zero, which results in a sparser regression model. A tuning parameter lambda (λ ≥ 0) controls the strength of the L1 regularization penalty; lambda is basically the amount of shrinkage:
• As lambda increases, more and more coefficients are set to zero and eliminated, and bias increases.
• As lambda decreases towards zero, fewer coefficients are eliminated, the solution approaches ordinary least squares, and variance increases (see the sketch below).
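A minimal scikit-learn sketch of this shrinkage effect on a synthetic regression dataset; note that scikit-learn exposes the lambda of the text as the alpha parameter, and the alpha values below are arbitrary choices for illustration.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=5.0, random_state=0)

for alpha in (0.1, 1.0, 10.0):           # alpha plays the role of lambda
    model = Lasso(alpha=alpha).fit(X, y)
    n_zero = np.sum(model.coef_ == 0)    # coefficients shrunk exactly to zero
    print(f"alpha={alpha}: {n_zero} of {len(model.coef_)} coefficients are zero")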
Logistic regression
Logistic regression is a supervised machine learning algorithm used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class. It is a statistical algorithm that analyzes the relationship between a set of independent variables and a categorical outcome. Logistic regression is used for binary classification through the sigmoid function, which takes the independent variables as input and produces a probability value between 0 and 1.
For example, with two classes, Class 0 and Class 1: if the value of the logistic function for an input is greater than 0.5 (the threshold value), the input belongs to Class 1; otherwise it belongs to Class 0. It is referred to as regression because it is an extension of linear regression, but it is mainly used for classification problems.
Key Points:
• Logistic regression predicts the output of a categorical dependent variable. Therefore, the
outcome must be a categorical or discrete value.
• It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0 and
1, it gives the probabilistic values which lie between 0 and 1.
• In logistic regression, instead of fitting a straight regression line, we fit an “S”-shaped logistic function, whose output is bounded by the two limiting values 0 and 1.
• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• It maps any real value into another value within a range of 0 and 1. The value of the logistic
regression must be between 0 and 1, which cannot go beyond this limit, so it forms a curve like
the “S” form.
• The S-form curve is called the Sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which decides between the classes 0 and 1: probability values above the threshold tend towards class 1, and values below the threshold tend towards class 0 (see the sketch below).
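The sketch below illustrates these points with scikit-learn on a synthetic dataset (an assumption for illustration): the sigmoid maps any real value into (0, 1), and the 0.5 threshold from the text turns probabilities into class labels.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map any real value into the (0, 1) range -- the S-shaped curve."""
    return 1.0 / (1.0 + np.exp(-z))

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

probs = model.predict_proba(X[:5])[:, 1]    # probability of Class 1 for five samples
labels = (probs > 0.5).astype(int)          # apply the 0.5 threshold
print(probs, labels)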
On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as “cat”, “dog”, or “sheep”.
3. Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as “low”, “medium”, or “high”.
Gradient Descent
Gradient Descent is an iterative optimization algorithm that tries to find the optimum value
(Minimum/Maximum) of an objective function. It is one of the most used optimization techniques in
machine learning projects for updating the parameters of a model in order to minimize a cost function.
The main aim of gradient descent is to find the parameters of a model that give the best accuracy on the training as well as the testing dataset. In gradient descent, the gradient is a vector that points in the direction of the steepest increase of the function at a specific point. Moving in the opposite direction of the gradient allows the algorithm to gradually descend towards lower values of the function and eventually reach its minimum.
• Step 1: Initialize the model parameters (for example with random values) and choose a learning rate.
• Step 2: Compute the gradient of the cost function with respect to each parameter. This involves taking the partial derivative of the cost function with respect to each parameter.
• Step 3: Update the parameters of the model by taking a step in the opposite direction of the gradient. Here we choose a hyperparameter called the learning rate, denoted by alpha, which decides the size of each step.
• Step 4: Repeat steps 2 and 3 iteratively to get the best parameters for the defined model (a worked sketch follows).
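The following sketch applies these four steps to a simple linear-regression cost. The synthetic data (y ≈ 3x + 2 plus noise), the learning rate and the iteration count are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 2 + rng.normal(0, 1, 100)   # assumed "true" relationship for the demo

w, b = 0.0, 0.0          # Step 1: initialize parameters
alpha = 0.01             # learning rate

for _ in range(1000):
    y_pred = w * x + b
    # Step 2: gradients of the mean-squared-error cost with respect to w and b
    dw = (2 / len(x)) * np.sum((y_pred - y) * x)
    db = (2 / len(x)) * np.sum(y_pred - y)
    # Step 3: move against the gradient, scaled by the learning rate
    w -= alpha * dw
    b -= alpha * db

print(f"learned w={w:.2f}, b={b:.2f}")   # should end up approximately 3 and 2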
The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is
called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes using a single straight line, the data is termed linearly separable and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified using a straight line, the data is termed non-linear and the classifier used is called a Non-linear SVM classifier.
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional
space, but we need to find out the best decision boundary that helps to classify the data points. This
best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the features present in the dataset, which means if there
are 2 features (as shown in image), then hyperplane will be a straight line. And if there are 3 features,
then hyperplane will be a 2-dimension plane.
We always create a hyperplane that has a maximum margin, which means the maximum distance between the hyperplane and the nearest data points of either class.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position of the
hyperplane are termed as Support Vector. Since these vectors support the hyperplane, hence called a
Support vector.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset
that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that
can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below image:
So as it is 2-d space so by just using a straight line, we can easily separate these two classes. But there
can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region
is called as a hyperplane. SVM algorithm finds the closest point of the lines from both the classes. These
points are called support vectors. The distance between the vectors and the hyperplane is called
as margin. And the goal of SVM is to maximize this margin. The hyperplane with maximum margin is
called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we
cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used
two dimensions x and y, so for non-linear data, we will add a third-dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we convert it in 2d space
with z=1, then it will become as:
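In practice, kernels perform this kind of mapping implicitly. The sketch below, using scikit-learn and a synthetic "circles" dataset (both assumptions for illustration), contrasts a linear SVM with an RBF-kernel SVM; the kernel plays the same role as the hand-crafted z = x² + y² feature described above.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not separable by a straight line in 2-D
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)    # implicit non-linear mapping

print("linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF SVM accuracy:   ", rbf_svm.score(X_test, y_test))
print("support vectors per class:", rbf_svm.n_support_)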
Advantages:
1. Instead of estimating for the entire instance set, local approximations can be made to the target
function.
2. This algorithm can adapt easily to new data, which is collected as we go.
Disadvantages:
1. Classification costs are high.
2. A large amount of memory is required to store the data, and each query involves building a local model from scratch.
5. Case-Based Reasoning
o K-Nearest Neighbor is one of the simplest Machine Learning algorithms based on Supervised
Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and available cases and put
the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying
data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately
instead it stores the dataset and at the time of classification, it performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new data, then it
classifies that data into a category that is much similar to the new data.
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance between the new data point and the existing data points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
Manhattan Distance metric is generally used when we are interested in the total distance traveled by the
object instead of the displacement. This metric is calculated by summing the absolute difference between
the coordinates of the points in n-dimensions.
Minkowski Distance
We can say that the Euclidean as well as the Manhattan distance are special cases of the Minkowski distance, which for two n-dimensional points x and y is defined as
d(x, y) = ( ∑i |xi − yi|^p )^(1/p)
From this formula we can see that when p = 2 it reduces to the Euclidean distance, and when p = 1 we obtain the Manhattan distance (see the sketch below).
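A small scikit-learn sketch tying these pieces together: the same K-NN classifier is run with p = 1 (Manhattan) and p = 2 (Euclidean) through the Minkowski metric. The synthetic dataset and K = 5 are arbitrary choices for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# p=1 gives Manhattan distance, p=2 gives Euclidean (both via Minkowski)
for p in (1, 2):
    knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=p)
    knn.fit(X_train, y_train)    # "lazy" training: essentially stores the data
    print(f"p={p} accuracy:", knn.score(X_test, y_test))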
• Adapts Easily – As per the working of the KNN algorithm it stores all the data in memory storage
and hence whenever a new example or data point is added then the algorithm adjusts itself as per
that new example and has its contribution to the future predictions as well.
• Few Hyperparameters – The only parameters required when training a KNN algorithm are the value of K and the choice of the distance metric.
• Does not scale – KNN is often called a lazy algorithm: it defers all the real computation to prediction time, which requires a lot of computing power and data storage. This makes the algorithm both time-consuming and resource-exhausting on large datasets.
• Curse of Dimensionality – Because of the so-called peaking phenomenon, the KNN algorithm is affected by the curse of dimensionality: it has a hard time classifying data points properly when the dimensionality is too high.
• Prone to Overfitting – As the algorithm is affected by the curse of dimensionality, it is also prone to overfitting. Hence feature selection as well as dimensionality reduction techniques are generally applied to deal with this problem.
Tree Based Machine Learning Algorithms
Tree-based algorithms are a class of supervised machine learning models that construct decision trees to partition the feature space into regions, enabling a hierarchical representation of complex relationships between input variables and output labels.
o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision based
on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are the two
reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.
• Root Node: The root node is where the decision tree starts. It represents the entire dataset, which then gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further after reaching a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the
child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of
the tree. This algorithm compares the values of root attribute with the record (real dataset) attribute and,
based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and move
further. It continues the process until it reaches the leaf node of the tree. The complete process can be
better understood using the below algorithm:
o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; the final nodes are then called leaf nodes.
Attribute Selection Measures
While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. With this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) − [ (weighted average) × Entropy(each feature) ]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = − P(yes) log₂ P(yes) − P(no) log₂ P(no)
Where,
o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
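A tiny numerical sketch of these formulas; the parent and child counts below (9 yes / 5 no split into 6/2 and 3/3) are hypothetical and chosen only to show the calculation.

import numpy as np

def entropy(p_yes):
    """Entropy of a binary node given the probability of 'yes'."""
    p_no = 1 - p_yes
    terms = [p * np.log2(p) for p in (p_yes, p_no) if p > 0]
    return -sum(terms)

parent = entropy(9 / 14)                                       # entropy before the split
children = (8 / 14) * entropy(6 / 8) + (6 / 14) * entropy(3 / 6)  # weighted child entropy
info_gain = parent - children
print(f"information gain of this split: {info_gain:.3f}")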
Advantages of the Decision Tree:
o It is simple to understand, as it follows the same process a human follows while making a decision in real life.
Disadvantages of the Decision Tree:
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
o CART is a predictive algorithm used in machine learning; it describes how the target variable’s values can be predicted from the other variables. It is a decision tree in which each fork is a split on a predictor variable and each leaf node holds a prediction for the target variable.
o The term CART serves as a generic term for the following categories of decision trees:
o Classification Trees: The tree is used to determine which “class” the target variable is most likely to fall into when the target is categorical.
o Regression Trees: The tree is used to predict the value of the target variable when it is continuous.
CART Algorithm
Classification and Regression Trees (CART) is a decision tree algorithm that is used for both classification
and regression tasks. It is a supervised learning algorithm that learns from labelled data to predict unseen
data.
• Tree structure: CART builds a tree-like structure consisting of nodes and branches. The nodes
represent different decision points, and the branches represent the possible outcomes of those
decisions. The leaf nodes in the tree contain a predicted class label or value for the target variable.
• Splitting criteria: CART uses a greedy approach to split the data at each node. It evaluates all
possible splits and selects the one that best reduces the impurity of the resulting subsets. For
classification tasks, CART uses Gini impurity as the splitting criterion. The lower the Gini impurity,
the purer the subset is. For regression tasks, CART uses the reduction in residual (squared) error as the splitting criterion: the split that achieves the greatest reduction in residual error fits the data best.
• Pruning: To prevent overfitting of the data, pruning is a technique used to remove the nodes that
contribute little to the model accuracy. Cost complexity pruning and information gain pruning are
two popular pruning techniques. Cost complexity pruning involves calculating the cost of each
node and removing nodes that have a negative cost. Information gain pruning involves calculating
the information gain of each node and removing nodes that have a low information gain.
• The best split point of each input variable is obtained.
• Based on the best split points of each input in the previous step, the new “best” split point is identified and the data is partitioned there.
• Continue splitting until a stopping rule is satisfied or no further desirable split is available.
The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does so by searching for the best homogeneity of the sub-nodes with the help of the Gini index criterion.
The Gini index is a metric for classification tasks in CART. It is based on the sum of squared class probabilities:
Gini Index = 1 − ∑i (pi)², where pi is the probability of an element being classified into class i.
It measures how often a randomly chosen element would be wrongly classified if it were labelled randomly according to the class distribution, and it is a variation of the Gini coefficient. It works on categorical variables, gives outcomes of either “success” or “failure”, and hence conducts binary splitting only.
• A Gini index of 0 indicates that all the elements belong to a single class (the node is pure).
• A Gini index close to 1 indicates a high level of impurity, where each class contains only a very small fraction of the elements.
• A value of 1 − 1/n occurs when the elements are uniformly distributed into n classes, each with probability 1/n. For example, with two classes the Gini impurity is 1 − 1/2 = 0.5.
In conclusion, Gini impurity is the probability of misclassification, assuming independent selection of the
element and its class based on the class probabilities.
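A few lines of Python reproduce the boundary cases described above (a pure node, two balanced classes, and n balanced classes).

def gini(probabilities):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1 - sum(p ** 2 for p in probabilities)

print(gini([1.0, 0.0]))        # pure node            -> 0.0
print(gini([0.5, 0.5]))        # two balanced classes -> 0.5  (= 1 - 1/2)
print(gini([1/3, 1/3, 1/3]))   # three balanced classes -> about 0.667 (= 1 - 1/3)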
CART for Classification
A classification tree is an algorithm where the target variable is categorical. The algorithm is then used to
identify the “Class” within which the target variable is most likely to fall. Classification trees are used when
the dataset needs to be split into classes that belong to the response variable (like yes or no)
CART for classification is a decision tree learning method that creates a tree-like structure to predict class labels. The tree consists of nodes, which represent different decision points, and branches, which represent the possible results of those decisions. Predicted class labels are stored at each leaf node of the tree.
CART for classification works by recursively splitting the training data into smaller and smaller subsets
based on certain criteria. The goal is to split the data in a way that minimizes the impurity within each
subset. Impurity is a measure of how mixed up the data is in a particular subset. For classification tasks,
CART uses Gini impurity
• Gini Impurity- Gini impurity measures the probability of misclassifying a random instance from a
subset labeled according to the majority class. Lower Gini impurity means more purity of the
subset.
• Splitting Criteria- The CART algorithm evaluates all potential splits at every node and chooses the
one that best decreases the Gini impurity of the resultant subsets. This process continues until a
stopping criterion is reached, like a maximum tree depth or a minimum number of instances in a
leaf node.
A Regression tree is an algorithm where the target variable is continuous and the tree is used to predict
its value. Regression trees are used when the response variable is continuous. For example, if the response
variable is the temperature of the day.
CART for regression is a decision tree learning method that creates a tree-like structure to predict
continuous target variables. The tree consists of nodes that represent different decision points and
branches that represent the possible outcomes of those decisions. Predicted values for the target variable
are stored in each leaf node of the tree.
Regression CART works by splitting the training data recursively into smaller subsets based on specific criteria. The objective is to split the data in a way that minimizes the residual error within each subset.
• Residual Reduction – Residual reduction measures how much the average squared difference between the predicted values and the actual values of the target variable is reduced by splitting the subset. The greater the residual reduction, the better the split fits the data (a short scikit-learn sketch of both tree types follows this list).
• Splitting Criteria- CART evaluates every possible split at each node and selects the one that results
in the greatest reduction of residual error in the resulting subsets. This process is repeated until a
stopping criterion is met, such as reaching the maximum tree depth or having too few instances
in a leaf node.
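The sketch below fits both kinds of CART tree with scikit-learn on synthetic data: the Gini criterion for classification and squared-error (variance) reduction for regression. The dataset sizes and max_depth of 3 are arbitrary, and criterion="squared_error" assumes a recent scikit-learn version.

from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: Gini impurity as the splitting criterion
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(Xc, yc)
print("classification accuracy on training data:", clf.score(Xc, yc))

# Regression tree: squared-error (variance) reduction as the splitting criterion
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=0).fit(Xr, yr)
print("regression R^2 on training data:", reg.score(Xr, yr))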
Ensemble Learning
Ensemble learning is a machine learning technique that combines the predictions from multiple
models to create a more accurate and stable prediction. It is an approach that leverages the
collective intelligence of multiple models to improve the overall performance of the learning
system.
1. Bagging (Bootstrap Aggregating): This method involves training multiple models on random
subsets of the training data. The predictions from the individual models are then combined,
typically by averaging.
2. Boosting: This method involves training a sequence of models, where each subsequent model
focuses on the errors made by the previous model. The predictions are combined using a weighted
voting scheme.
3. Stacking: This method involves using the predictions from one set of models as input features for
another model. The final prediction is made by the second-level model.
Bagging
Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-algorithm designed
to improve the stability and accuracy of machine learning algorithms used in statistical classification and
regression. It decreases the variance and helps to avoid overfitting. It is usually applied to decision tree
methods. Bagging is a special case of the model averaging approach.
Suppose a set D of d tuples. At each iteration i, a training set Di of d tuples is selected from D via row sampling with replacement (i.e., a bootstrap sample, so tuples can repeat). A classifier model Mi is then learned on each training set Di. Each classifier Mi returns its class prediction, and the bagged classifier M* counts the votes and assigns the class with the most votes to X (the unknown sample).
• Step 1: Multiple subsets are created from the original data set with equal numbers of tuples, selecting observations with replacement.
• Step 2: A base model (weak model) is created on each of these subsets.
• Step 3: Each model is learned in parallel on its own training set, independently of the others.
• Step 4: The final predictions are determined by combining the predictions from all the models.
An illustration for the concept of bootstrap aggregating (Bagging)
Example of Bagging
The Random Forest model uses Bagging, where decision tree models with higher variance are present. It
makes random feature selection to grow trees. Several random trees make a Random Forest.
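A minimal sketch of bagging and of its Random Forest variant with scikit-learn. The synthetic dataset and 50 estimators are arbitrary, and the estimator= keyword assumes scikit-learn 1.2 or later (older versions call it base_estimator).

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Generic bagging: bootstrap samples + parallel decision trees + majority vote
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=0)
bag.fit(X_train, y_train)

# Random Forest: bagging plus random feature selection at each split
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

print("bagging accuracy:      ", bag.score(X_test, y_test))
print("random forest accuracy:", rf.score(X_test, y_test))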
Boosting
Boosting is an ensemble modeling technique designed to create a strong classifier by combining multiple
weak classifiers. The process involves building models sequentially, where each new model aims to correct
the errors made by the previous ones.
• An initial model is trained on the data with every instance given equal weight; subsequent models are then trained to address the mistakes of their predecessors.
• Higher weights: Instances that were misclassified by the previous model receive higher weights.
• Lower weights: Instances that were correctly classified receive lower weights.
• Training on weighted data: The subsequent model learns from the weighted dataset, focusing its
attention on harder-to-learn examples (those with higher weights).
Boosting Algorithms
There are several boosting algorithms. The original ones, proposed by Robert Schapire and Yoav
Freund were not adaptive and could not take full advantage of the weak learners. Schapire and Freund
then developed AdaBoost, an adaptive boosting algorithm that won the prestigious Gödel Prize. AdaBoost
was the first really successful boosting algorithm developed for the purpose of binary classification.
AdaBoost is short for Adaptive Boosting and is a very popular boosting technique that combines multiple
“weak classifiers” into a single “strong classifier”.
Algorithm:
1. Initialize the dataset and assign an equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weights of the wrongly classified data points and decrease the weights of the correctly classified data points, then normalize the weights of all data points.
4. If the required results have been achieved, go to step 5; otherwise go back to step 2 and train the next weak classifier on the re-weighted data.
5. End
An illustration presenting the intuition behind the boosting algorithm, consisting of the sequentially trained learners and the weighted datasets (a minimal AdaBoost sketch follows).
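A minimal AdaBoost sketch with scikit-learn, using depth-1 decision stumps as the weak classifiers. The dataset and the choice of 100 estimators are arbitrary, and the estimator= keyword again assumes scikit-learn 1.2 or later.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decision stumps (depth-1 trees) are the classic weak classifiers for AdaBoost
stump = DecisionTreeClassifier(max_depth=1)
ada = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))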
Similarities Between Bagging and Boosting
Bagging and boosting, both being commonly used methods, share the universal similarity of being classified as ensemble methods. In particular:
1. Both build N learners from a single base learner.
2. Both generate several training data sets by random sampling.
3. Both make the final decision by aggregating the N learners, either by averaging or by taking the majority of them (majority voting).
Random Forest
A random forest is an ensemble learning method that combines the predictions from multiple decision
trees to produce a more accurate and stable prediction. It is a type of supervised learning algorithm that
can be used for both classification and regression tasks.
Every decision tree has high variance, but when we combine all of them in parallel then the resultant
variance is low as each decision tree gets perfectly trained on that particular sample data, and hence the
output doesn’t depend on one decision tree but on multiple decision trees. In the case of a classification
problem, the final output is taken by using the majority voting classifier. In the case of a regression
problem, the final output is the mean of all the outputs. This part is called Aggregation.
Random Forest Regression Model Working
Random Forest has multiple decision trees as base learning models. We randomly perform row sampling
and feature sampling from the dataset forming sample datasets for every model. This part is called
Bootstrap.
We need to approach the Random Forest regression technique like any other machine learning technique.
• Design a specific question or data and get the source to determine the required data.
• Make sure the data is in an accessible format else convert it to the required format.
• Specify all noticeable anomalies and missing data points that may be required to achieve the
required data.
• Now compare the performance metrics of both the test data and the data predicted by the model.
• If it doesn’t satisfy your expectations, you can try improving your model accordingly, updating your data, or using another data modeling technique.
• At this stage, you interpret the data you have gained and report accordingly (a minimal sketch of this workflow follows).
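A minimal Random Forest regression sketch following the workflow above, on a synthetic dataset; all sizes and hyperparameters here are illustrative assumptions.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 200 trees is trained on a bootstrap sample with random feature subsets
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

pred = model.predict(X_test)
print("test MSE:", mean_squared_error(y_test, pred))
print("test R^2:", model.score(X_test, y_test))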
Classification Accuracy
Classification accuracy is a fundamental metric for evaluating the performance of a classification model,
providing a quick snapshot of how well the model is performing in terms of correct predictions. This is
calculated as the ratio of correct predictions to the total number of input samples.
Area Under the Curve (AUC)
AUC is one of the widely used metrics and is basically used for binary classification. The AUC of a classifier is defined as the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. Before going into AUC further, let me make you comfortable with a few basic terms.
True Positive Rate (Sensitivity)
Also termed sensitivity, the True Positive Rate is the portion of positive data points that are correctly classified as positive, with respect to all data points that are positive.
True Negative Rate (Specificity)
Also termed specificity, the True Negative Rate is the portion of negative data points that are correctly classified as negative, with respect to all data points that are negative.
False Positive Rate
The False Positive Rate is the proportion of actual negatives that are incorrectly identified as positives (equivalently, 1 − specificity); the False Negative Rate is, correspondingly, the proportion of actual positives that are incorrectly identified as negatives.
F1 Score
It is a harmonic mean between recall and precision. Its range is [0,1]. This metric usually tells us how
precise (It correctly classifies how many instances) and robust (does not miss any significant number of
instances) our classifier is.
Precision
There is another metric named Precision. Precision is a measure of a model’s performance that tells you
how many of the positive predictions made by the model are actually correct. It is calculated as the number
of true positive predictions divided by the number of true positive and false positive predictions.
Lower recall with higher precision gives you very precise predictions, but the classifier then misses a large number of positive instances. The higher the F1 score, the better the performance. It can be expressed mathematically as:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Confusion Matrix
It creates an N × N matrix, where N is the number of classes or categories to be predicted. Here we have N = 2, so we get a 2 × 2 matrix. Suppose our task is a binary classification problem whose samples belong to either Yes or No. We build a classifier that predicts the class for each new input sample, test the model with 165 samples, and get the following result (a short sketch computing these quantities with scikit-learn follows the list of terms below).
There are 4 terms you should keep in mind:
1. True Positives: It is the case where we predicted Yes and the real output was also yes.
2. True Negatives: It is the case where we predicted No and the real output was also No.
3. False Positives: It is the case where we predicted Yes but it was actually No.
4. False Negatives: It is the case where we predicted No but it was actually Yes.
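The sketch below computes the confusion-matrix terms and the metrics discussed above with scikit-learn; the ten labels and predictions are made up purely for illustration.

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions (1 = Yes, 0 = No)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("accuracy: ", accuracy_score(y_true, y_pred))    # (TP + TN) / total
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score: ", f1_score(y_true, y_pred))          # harmonic mean of precision and recall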
Clustering in Machine Learning
Introduction to Clustering: It is basically a type of unsupervised learning method. An
unsupervised learning method is a method in which we draw references from datasets consisting
of input data without labeled responses. Generally, it is used as a process to find meaningful
structure, explanatory underlying processes, generative features, and groupings inherent in a set
of examples.
Clustering is the task of dividing the population or data points into a number of groups such that
data points in the same groups are more similar to other data points in the same group and
dissimilar to the data points in other groups. It is basically a collection of objects on the basis of
similarity and dissimilarity between them.
For example, the data points in the graph below clustered together can be classified into one
single group. We can distinguish the clusters, and we can identify that there are 3 clusters in the
below picture.
Fundamentally, all clustering methods use the same approach i.e. first we calculate similarities
and then we use it to cluster the data points into groups or batches. Here we will focus on
the Density-based spatial clustering of applications with noise (DBSCAN) clustering
method.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Clusters are dense regions in the data space, separated by regions of the lower density of points.
The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea
is that for each point of a cluster, the neighborhood of a given radius has to contain at least a
minimum number of points.
Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding
spherical-shaped clusters or convex clusters. In other words, they are suitable only for compact
and well-separated clusters. Moreover, they are also severely affected by the presence of noise
and outliers in the data.
Real-life data may contain irregularities, like:
1. Clusters can be of arbitrary shape such as those shown in the figure below.
2. Data may contain noise.
The figure above shows a data set containing non-convex shape clusters and outliers. Given such
data, the k-means algorithm has difficulties in identifying these clusters with arbitrary shapes.
1. eps: It defines the neighborhood around a data point i.e. if the distance between two points
is lower or equal to ‘eps’ then they are considered neighbors. If the eps value is chosen too
small then a large part of the data will be considered as an outlier. If it is chosen very large
then the clusters will merge and the majority of the data points will be in the same clusters.
One way to find the eps value is based on the k-distance graph.
2. MinPts: Minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that must be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts ≥ D + 1. The minimum value of MinPts must be at least 3.
3. In this algorithm, we have 3 types of data points:
Core Point: A point is a core point if it has more than MinPts points within eps.
Border Point: A point which has fewer than MinPts points within eps but lies in the neighborhood of a core point.
Noise or outlier: A point which is neither a core point nor a border point.
Steps Used in DBSCAN Algorithm
1. Find all the neighbor points within eps and identify the core points, i.e. the points that have more than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new cluster.
3. Find recursively all its density-connected points and assign them to the same cluster as the core point. Points a and b are said to be density-connected if there exists a point c which has a sufficient number of points in its neighborhood and both a and b are within the eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, this implies that b is a neighbor of a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster are noise (a minimal scikit-learn sketch follows).
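A minimal DBSCAN sketch with scikit-learn on the classic two-moons dataset, an arbitrary-shaped, non-convex example; the eps and min_samples values are illustrative, not recommendations.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: arbitrary-shaped clusters that K-means struggles with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)   # eps and MinPts as described above
labels = db.labels_                          # cluster id per point; -1 marks noise

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:  ", list(labels).count(-1))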
Distance-Based Clustering (K-Means)
The dataset is divided into a set of k groups, where K defines the number of pre-defined groups. The cluster centers are created in such a way that the distance between the data points and the centroid of their own cluster is minimal compared to the distance to the other cluster centroids.
It allows us to cluster the data into different groups and a convenient way to discover the
categories of groups in the unlabeled dataset on its own without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of
clusters, and repeats the process until it does not find the best clusters. The value of k
should be predetermined in this algorithm.
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids (they can be points other than those of the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, go back to Step-4; otherwise go to FINISH.
Step-7: The model is ready.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables
is given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into
different clusters. It means here we will try to group these datasets into two different
clusters.
o We need to choose some random k points or centroid to form the cluster. These points
can be either the points from the dataset or any other point. So, here we are selecting the
below two points as k points, which are not the part of our dataset. Consider the below
image:
o Now we will assign each data point of the scatter plot to its closest K-point or centroid.
We will compute it by applying some mathematics that we have studied to calculate the
distance between two points. So, we will draw a median between both the centroids.
Consider the below image:
From the above image, it is clear that the points on the left side of the line are closer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, so we will repeat the process by choosing a new
centroid. To choose the new centroids, we will compute the center of gravity of these
centroids, and will find new centroids as below:
o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the
same process of finding a median line. The median will be like below image:
From the above image, we can see, one yellow point is on the left side of the line, and
two blue points are right to the line. So, these three points will be assigned to new
centroids.
As reassignment has taken place, so we will again go to the step-4, which is finding new
centroids or K-points.
o We will repeat the process by finding the center of gravity of centroids, so the new
centroids will be as shown in the below image:
o As we got the new centroids so again will draw the median line and reassign the data
points. So, the image will be:
o We can see in the above image; there are no dissimilar data points on either side of the
line, which means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final
clusters will be as shown in the below image:
How to choose the value of "K number of clusters"
in K-means Clustering?
The performance of the K-means clustering algorithm depends upon highly efficient
clusters that it forms. But choosing the optimal number of clusters is a big task. There are
some different ways to find the optimal number of clusters, but here we are discussing
the most appropriate method to find the number of clusters or value of K. The method is
given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the
value of WCSS (for 3 clusters) is given below:
WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)² + ∑(Pi in Cluster3) distance(Pi, C3)²
Here, ∑(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point and its centroid within cluster 1, and the same holds for the other two terms.
To measure the distance between data points and centroid, we can use any method such
as Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes the K-means clustering on a given dataset for different K values (ranges from
1-10).
o For each value of K, calculates the WCSS value.
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend in this plot (the point where the curve looks like an arm’s elbow) is considered the best value of K (see the sketch below).
Since the graph shows the sharp bend, which looks like an elbow, hence it is known as
the elbow method. The graph for the elbow method looks like the below image:
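A short sketch of the elbow method with scikit-learn, where WCSS is exposed as the inertia_ attribute; the synthetic three-blob dataset is an assumption for illustration, so the elbow should appear around K = 3.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# WCSS (inertia_) for K = 1..10
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

for k, value in enumerate(wcss, start=1):
    print(f"K={k:2d}  WCSS={value:10.1f}")   # look for the sharp bend ("elbow")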
Hierarchical Clustering
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as a dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two differ in how they work; in particular, there is no requirement to predetermine the number of clusters as there is in the K-means algorithm.
The hierarchical clustering technique has two approaches: agglomerative (bottom-up, which starts from individual data points and merges clusters) and divisive (top-down, which starts from one big cluster and splits it). The agglomerative approach works as follows:
o Step-1: Treat each data point as a single cluster, so that at the start there are N clusters.
o Step-2: Take the two closest data points or clusters and merge them to form one cluster, so there will now be N−1 clusters.
o Step-3: Again take the two closest clusters and merge them to form one cluster; there will be N−2 clusters.
o Step-4: Repeat Step-3 until only one cluster is left, giving the following sequence of clusters. Consider the below images:
o Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram
to divide the clusters as per the problem.
1. Single Linkage: It is the Shortest Distance between the closest points of the clusters.
Consider the below image:
2. Complete Linkage: It is the farthest distance between the two points of two different
clusters. It is one of the popular linkage methods as it forms tighter clusters than single-
linkage.
3. Average Linkage: It is the linkage method in which the distance between each pair of
datasets is added up and then divided by the total number of datasets to calculate the
average distance between two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between the centroid
of the clusters is calculated. Consider the below image:
From the above-given approaches, we can apply any of them according to the type of
problem or business requirement.
The working of the dendrogram can be explained using the below diagram:
In the above diagram, the left part is showing how clusters are created in agglomerative
clustering, and the right part is showing the corresponding dendrogram.
o As we have discussed above, firstly, the datapoints P2 and P3 combine together and form
a cluster, correspondingly a dendrogram is created, which connects P2 and P3 with a
rectangular shape. The height is decided according to the Euclidean distance between the
data points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram,
and P4, P5, and P6, in another dendrogram.
o At last, the final dendrogram is created that combines all the data points together (a short SciPy sketch of this process follows).
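A minimal SciPy sketch of agglomerative clustering and the merge records that a dendrogram visualizes; the dataset and the choice of average linkage are illustrative assumptions.

from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20, centers=3, random_state=0)

# Agglomerative clustering with average linkage ('single', 'complete' and
# 'centroid' are the other linkage options described above)
Z = linkage(X, method="average")

# Each row of Z records one merge: the two clusters joined, their distance,
# and the size of the new cluster -- exactly what a dendrogram draws.
print(Z[:5])

# Cut the tree into 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)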
Cluster Validity
Why do we need cluster validity indices?
o Internal cluster validation: The clustering result is evaluated based on the data clustered
itself (internal information) without reference to external information.
o External cluster validation: Clustering results are evaluated based on some externally
known result, such as externally provided class labels.
o Relative cluster validation: The clustering results are evaluated by varying different
parameters for the same algorithm (e.g. changing the number of clusters).
o Besides the term cluster validity index, we need to know about the inter-cluster distance d(a, b) between two clusters a and b, and the intra-cluster index D(a) of a cluster a.
Inter-cluster distance d(a, b) between two clusters a and b can be –
o Single linkage distance: Closest distance between two objects belonging
to a and b respectively.
o Complete linkage distance: Distance between two most remote objects belonging
to a and b respectively.
o Average linkage distance: Average distance between all the objects belonging
to a and b respectively.
Intra-cluster distance D(a) of a cluster a can be –
o Complete diameter linkage distance: Distance between the two farthest objects belonging to cluster a.
o Average diameter linkage distance: Average distance between all the objects belonging
to cluster a.
o Centroid diameter linkage distance: Twice the average distance between all the objects
and the centroid of the cluster a.
Dimensionality Reduction
Feature selection involves selecting a subset of the original features that are most relevant
to the problem at hand. The goal is to reduce the dimensionality of the dataset while
retaining the most important features. There are several methods for feature selection,
including filter methods, wrapper methods, and embedded methods. Filter methods rank
the features based on their relevance to the target variable, wrapper methods use the
model performance as the criteria for selecting features, and embedded methods combine
feature selection with the model training process.
• Feature selection: In this, we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem. It usually involves three ways:
1. Filter
2. Wrapper
3. Embedded
• Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space, i.e. a space with a smaller number of dimensions.
Dimensionality reduction may be both linear and non-linear, depending upon the method
used. The prime linear method, called Principal Component Analysis, or PCA, is discussed
below.
This method was introduced by Karl Pearson. It works on the condition that while the data
in a higher dimensional space is mapped to data in a lower dimension space, the variance
of the data in the lower dimensional space should be maximum.
It involves the following steps:
• Standardize the data and construct the covariance matrix of the variables.
• Compute the eigenvectors and eigenvalues of this covariance matrix.
• Sort the eigenvectors by decreasing eigenvalue and keep only the eigenvectors corresponding to the largest eigenvalues.
Hence, we are left with a smaller number of eigenvectors, and there may have been some data loss in the process. But the most important variance should be retained by the remaining eigenvectors (see the sketch below).
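A minimal PCA sketch with scikit-learn following these steps (standardization, then projection onto the top eigenvectors); the Iris dataset and the choice of two components are assumptions made for illustration.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                        # 4 original features
X_std = StandardScaler().fit_transform(X)   # standardize before PCA

pca = PCA(n_components=2)                   # keep the top-2 eigenvectors
X_reduced = pca.fit_transform(X_std)

print("reduced shape:", X_reduced.shape)
print("variance retained per component:", pca.explained_variance_ratio_)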
Recommendation Systems
Recommender systems, also known as recommendation systems, are machine learning
algorithms that use data to recommend items or content to users based on their
preferences, past behavior, or their combination. These systems can recommend various
items, such as movies, books, music, products, etc.,
The two main kinds are content-based filtering (which takes into account the
characteristics of products and user profiles) and collaborative filtering (which generates
recommendations based on user behaviour and preferences). Hybrid strategies that
integrate the two approaches are also popular. These kinds of systems improve user
experiences, boost user involvement, and propel corporate expansion.
Item profile
TF-IDF Vectorizer
• Term Frequency (TF): Term frequency, or TF for short, is a key idea in information
retrieval and natural language processing. It displays the regularity with which a certain
term or word occurs in a text corpus or document. TF is used to rank terms in a
document according to their relative value or significance.
The term frequency can be calculated as:
TF(i, j) = f(i, j) / max_k f(k, j)
where f(i, j) is the number of times term i appears in document j, normalized by the count of the most frequent term in that document.
• Inverse Document Frequency (IDF): IDF down-weights terms that appear in many documents. It can be calculated as:
IDF(i) = log(N / ni)
where ni is the number of documents that mention term i and N is the total number of documents. The TF-IDF score of a term in a document is the product TF × IDF (see the sketch below).
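A minimal sketch of building item profiles with scikit-learn's TfidfVectorizer; the three item descriptions are made up, and scikit-learn's TF-IDF normalization differs slightly from the plain formulas above.

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical item descriptions used to build item profiles
docs = [
    "space adventure with aliens and spaceships",
    "romantic comedy set in a small town",
    "alien invasion action adventure in space",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # rows = items, columns = terms

print(tfidf.shape)                           # (3 items, vocabulary size)
print(vectorizer.get_feature_names_out())    # the terms behind each column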
User profile
The user profile is a vector that describes the user’s preferences. During the creation of the user’s profile, we use a utility matrix that describes the relationship between users and items. From this information, the best estimate of which items the user will like is some aggregation of the profiles of those items.
• Advantages:
o No need for data on other users when applying to similar users.
o Able to recommend to users with unique tastes.
o Able to recommend new & popular items
o Explanations for recommended items.
• Disadvantages:
o Finding the appropriate feature is hard.
o Doesn’t recommend items outside the user profile.
Collaborative Filtering
Collaborative filtering is based on the idea that similar people (based on the data) generally tend
to like similar things. It predicts which item a user will like based on the item preferences of other
similar users.
Collaborative filtering uses a user-item matrix to generate recommendations. This matrix contains
the values that indicate a user’s preference towards a given item. These values can represent either
explicit feedback (direct user ratings) or implicit feedback (indirect user behavior such as listening,
purchasing, watching).
• Explicit Feedback: Data that users provide deliberately when they choose to do so, for example ratings. Users often choose not to provide it, so this data is scarce and sometimes costly to collect.
• Implicit Feedback: In implicit feedback, we track user behavior to predict their preference
• Advantages:
o No need for domain knowledge, because the embeddings are learned automatically.
o Capture inherent subtle characteristics.
• Disadvantages:
o Cannot handle fresh items, due to the cold-start problem.
o Hard to add any new features that may improve the quality of the model (a toy sketch of user-based collaborative filtering follows).
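A toy sketch of user-based collaborative filtering on a hypothetical 4x4 rating matrix: user similarity is measured with cosine similarity and an unknown rating is predicted as a similarity-weighted average of the other users' ratings. All numbers are invented for illustration.

import numpy as np

# Hypothetical user-item rating matrix (rows = users, columns = items, 0 = unrated)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# User-user cosine similarity on the rating vectors
norms = np.linalg.norm(ratings, axis=1, keepdims=True)
similarity = (ratings @ ratings.T) / (norms @ norms.T)

# Predict user 0's rating for item 2 as a similarity-weighted average of
# the ratings given by the other users who rated that item
others = [u for u in range(len(ratings)) if u != 0 and ratings[u, 2] > 0]
weights = similarity[0, others]
prediction = np.dot(weights, ratings[others, 2]) / weights.sum()
print(f"predicted rating of user 0 for item 2: {prediction:.2f}")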
EM Algorithm
o The Expectation-Maximization (EM) algorithm is an iterative optimization method
that combines different unsupervised machine learning algorithms to find
maximum likelihood or maximum posterior estimates of parameters in statistical
models that involve unobserved latent variables. The EM algorithm is commonly
used for latent variable models and can handle missing data. It consists of an
estimation step (E-step) and a maximization step (M-step), forming an iterative
process to improve model fit.
o In the E step, the algorithm computes the latent variables i.e. expectation of the
log-likelihood using the current parameter estimates.
o In the M step, the algorithm determines the parameters that maximize the
expected log-likelihood obtained in the E step, and corresponding model
parameters are updated based on the estimated latent variables.
Expectation-Maximization in the EM Algorithm (illustration).
o By iteratively repeating these steps, the EM algorithm seeks to maximize the
likelihood of the observed data. It is commonly used for unsupervised learning
tasks, such as clustering, where latent variables are inferred and has applications in
various fields, including machine learning, computer vision, and natural language
processing.
o Some of the most commonly used key terms in the Expectation-Maximization (EM)
Algorithm are as follows:
o Latent Variables: Latent variables are unobserved variables in statistical models
that can only be inferred indirectly through their effects on observable variables.
They cannot be directly measured but can be detected by their impact on the
observable variables.
o Likelihood: It is the probability of observing the given data given the parameters
of the model. In the EM algorithm, the goal is to find the parameters that maximize
the likelihood.
o Log-Likelihood: It is the logarithm of the likelihood function, which measures the
goodness of fit between the observed data and the model. EM algorithm seeks to
maximize the log-likelihood.
o Maximum Likelihood Estimation (MLE): MLE is a method to estimate the
parameters of a statistical model by finding the parameter values that maximize
the likelihood function, which measures how well the model explains the observed
data.
o Posterior Probability: In the context of Bayesian inference, the EM algorithm can
be extended to estimate the maximum a posteriori (MAP) estimates, where the
posterior probability of the parameters is calculated based on the prior distribution
and the likelihood function.
o Expectation (E) Step: The E-step of the EM algorithm computes the expected
value or posterior probability of the latent variables given the observed data and
current parameter estimates. It involves calculating the probabilities of each latent
variable for each data point.
o Maximization (M) Step: The M-step of the EM algorithm updates the parameter
estimates by maximizing the expected log-likelihood obtained from the E-step. It
involves finding the parameter values that optimize the likelihood function,
typically through numerical optimization methods.
o Convergence: Convergence refers to the condition when the EM algorithm has
reached a stable solution. It is typically determined by checking if the change in
the log-likelihood or the parameter estimates falls below a predefined threshold.
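A minimal sketch of EM in practice: scikit-learn's GaussianMixture fits a two-component Gaussian mixture by alternating E and M steps. The synthetic blob data and the component count are assumptions for illustration.

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=2, random_state=0)

# E-step: compute each point's responsibility for every component;
# M-step: re-estimate means, covariances and weights from those responsibilities.
gmm = GaussianMixture(n_components=2, max_iter=100, random_state=0).fit(X)

print("converged:", gmm.converged_)
print("EM iterations used:", gmm.n_iter_)
print("component means:\n", gmm.means_)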
Reinforcement Learning
Reinforcement Learning (RL) is a branch of machine learning focused on making decisions to
maximize cumulative rewards in a given situation. Unlike supervised learning, which relies on a
training dataset with predefined answers, RL involves learning through experience. In RL, an agent
learns to achieve a goal in an uncertain, potentially complex environment by performing actions
and receiving feedback through rewards or penalties.
RL operates on the principle of learning optimal behavior through trial and error. The agent takes
actions within the environment, receives rewards or penalties, and adjusts its behavior to maximize
the cumulative reward. This learning process is characterized by the following elements:
• Policy: A strategy used by the agent to determine the next action based on the current
state.
• Reward Function: A function that provides a scalar feedback signal based on the state
and action.
• Value Function: A function that estimates the expected cumulative reward from a given
state.
The problem is as follows: we have an agent and a reward, with many hurdles in between. The
agent is supposed to find the best possible path to reach the reward. The following example
explains the problem more clearly.
Consider a grid containing a robot, a diamond, and fire. The goal of the robot is to get the reward,
that is, the diamond, while avoiding the hurdles, that is, the fire. The robot learns by trying all the
possible paths and then choosing the path that gives it the reward with the fewest hurdles. Each
right step gives the robot a reward and each wrong step subtracts from the robot's reward. The
total reward is calculated when it reaches the final reward, that is, the diamond.
• Input: The input should be an initial state from which the model will start.
• Output: There are many possible outputs, as there are a variety of solutions to a particular
problem.
• Training: The training is based upon the input; the model returns a state and the user
decides whether to reward or punish the model based on its output. A minimal Q-learning
sketch of this trial-and-error loop is shown below.
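The following sketch implements the trial-and-error idea with tabular Q-learning on a toy five-state corridor: the agent starts at one end and earns a reward for reaching the "diamond" at the other end. The environment, rewards, and hyperparameters are illustrative assumptions, not part of the original example.

```python
# Minimal tabular Q-learning sketch for a toy 5-state corridor.
# The agent starts in state 0 and receives +1 for reaching state 4 (the "diamond").
import numpy as np

n_states, n_actions = 5, 2              # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount rate, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy policy: explore with probability epsilon, otherwise act greedily
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else -0.01    # small penalty per step
        # Q-learning update: move the estimate toward reward + discounted best future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))   # learned greedy action per state (expected to be mostly "right")
```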
Types of Reinforcement:
1. Positive Reinforcement: an event that occurs because of a particular behavior and increases
the strength and frequency of that behavior.
• Maximizes performance.
• Sustains change for a long period of time.
• Too much reinforcement can lead to an overload of states, which can diminish the
results.
2. Negative Reinforcement: the strengthening of behavior because a negative condition is
stopped or avoided.
• Increases behavior.
• Provides defiance to a minimum standard of performance.
• It only provides enough to meet the minimum behavior.
Elements of Reinforcement Learning:
i) Policy: Defines the agent's way of behaving at a given time.
ii) Reward Function: Defines the goal of the RL problem by providing feedback.
iii) Value Function: Estimates the expected long-term reward obtainable from a state.
iv) Model of the Environment: Helps in predicting future states and rewards for planning.
Model-Based Learning
In model-based learning, the training data is used to create a model that can be generalized to
new data. The model is typically created using statistical algorithms such as linear regression,
logistic regression, decision trees, and neural networks. These algorithms use the training data to
create a mathematical model that can be used to predict outcomes.
Disadvantages of model-based learning:
1. Requires a large dataset: Model-based learning requires a large dataset to train the model.
This can be a disadvantage if you have a small dataset.
2. Requires expert knowledge: Model-based learning requires expert knowledge of statistical
algorithms and mathematical modeling. This can be a disadvantage if you don’t have the
expertise to create the model.
An example of model-based learning is predicting the price of a house based on its size, number
of rooms, location, and other features. In this case, a model could be created using linear
regression to predict the price of the house based on these features. The model would be trained
on a dataset of house prices and features and then used to make predictions on new data.
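A minimal sketch of that house-price example follows, using scikit-learn's LinearRegression. The feature values and prices are made up purely for illustration.

```python
# Hedged sketch of the house-price example: a linear regression model trained on
# a tiny, made-up dataset of (size in square feet, number of rooms) -> price.
from sklearn.linear_model import LinearRegression

X_train = [[1400, 3], [1600, 3], [1700, 4], [1875, 4], [2350, 5]]   # illustrative features
y_train = [245000, 312000, 279000, 308000, 405000]                  # illustrative prices

model = LinearRegression().fit(X_train, y_train)
print(model.predict([[2000, 4]]))   # predicted price for a new, unseen house
```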
Temporal Difference (TD) learning is a model-free reinforcement learning technique that aims to
align the expected prediction with the latest prediction, matching expectations with actual
outcomes and progressively enhancing the accuracy of the overall prediction chain. It also seeks
to predict a combination of the immediate reward and its own reward prediction at the same
moment.
In temporal difference learning, the signal used for training a prediction comes from a future
prediction. This approach is a combination of the Monte Carlo (MC) technique and the Dynamic
Programming (DP) technique. Monte Carlo methods modify their estimates only after the final
result is known, whereas temporal difference techniques adjust predictions to match later, more
precise predictions for the future, well before knowing the final outcome. This is essentially a type
of bootstrapping.
Parameters used in Temporal Difference Learning
• Alpha (α) − This indicates the learning rate, which varies between 0 and 1. It determines
how much our estimates should be adjusted based on the error.
• Gamma (γ) − This implies the discount rate, which varies between 0 and 1. A large discount
rate signifies that future rewards are valued to a greater extent.
• Epsilon (ε) − This means examining new possibilities with a likelihood of ε and remaining
at the existing maximum with a likelihood of 1 − ε. A greater ε indicates that more
exploration takes place during training.
The main goal of Temporal Difference (TD) learning is to estimate the value
function V(s), which represents the expected future reward starting from the state s.
Following is the list of algorithms used in TD learning −
1. TD(λ) Algorithm
TD(λ) is a reinforcement learning algorithm that combines concepts from both Monte
Carlo methods and TD(0). It calculates the value function by taking a weighted average of
n-step returns from the agent's trajectory, with the weights determined by λ.
• When λ = 0, it corresponds to TD(0), where only the latest reward and the value of the next
state are considered in updating the estimate.
• When λ = 1, it indicates the use of Monte Carlo methods, which involve updating the
value based on the total return from a state until the episode ends.
• If λ lies between 0 and 1, TD(λ) combines short-term TD(0) updates and Monte Carlo
methods, emphasizing the latest rewards.
2. TD(0) Algorithm
The simplest form of TD learning is the TD(0) algorithm (one-step TD learning), where
the value of a state is updated based on the immediate reward and the estimated value of
the next state. The update rule is:
V(st) ← V(st) + α [Rt+1 + γ V(st+1) − V(st)]
where α is the learning rate and γ is the discount rate.
The rule adjusts the current estimate based on the difference between the predicted return
(using V(st+1)) and the actual return (using Rt+1).
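The sketch below applies that TD(0) update to a toy random walk so the bootstrapped update can be seen in isolation; the environment and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the TD(0) update V(s) <- V(s) + alpha * (R + gamma * V(s') - V(s)),
# applied to a random walk over 5 states; the environment here is purely illustrative.
import numpy as np

n_states = 5
V = np.zeros(n_states)
alpha, gamma = 0.1, 0.9
rng = np.random.default_rng(0)

for episode in range(1000):
    s = 2                                        # always start in the middle state
    while 0 < s < n_states - 1:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the right terminal state
        # TD error: difference between the bootstrapped target and the current estimate
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print(np.round(V, 2))   # state values increase toward the rewarding terminal state
```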
3. TD(1) Algorithm
TD(1) corresponds to the Monte Carlo end of the spectrum: the value of a state is updated
using the complete return observed until the end of the episode rather than a bootstrapped
one-step estimate.
Forward Propagation in a Neural Network
3. Activation: The result of the linear transformation (denoted as z) is then passed through
an activation function. The activation function is crucial because it introduces non-
linearity into the system, enabling the network to learn more complex patterns. Popular
activation functions include ReLU, sigmoid, and tanh.
Backpropagation
After forward propagation, the network evaluates its performance using a loss function, which
measures the difference between the actual output and the predicted output. The goal of training
is to minimize this loss. This is where backpropagation comes into play:
1. Loss Calculation: The network calculates the loss, which provides a measure of error in
the predictions. The loss function could vary; common choices are mean squared error for
regression tasks or cross-entropy loss for classification.
2. Gradient Calculation: The network computes the gradients of the loss function with
respect to each weight and bias in the network. This involves applying the chain rule of
calculus to find out how much each part of the output error can be attributed to each
weight and bias.
3. Weight Update: Once the gradients are calculated, the weights and biases are updated
using an optimization algorithm like stochastic gradient descent (SGD). The weights are
adjusted in the opposite direction of the gradient to minimize the loss. The size of the step
taken in each update is determined by the learning rate.
Iteration
This process of forward propagation, loss calculation, backpropagation, and weight update is
repeated for many iterations over the dataset. Over time, this iterative process reduces the loss,
and the network’s predictions become more accurate.
Through these steps, neural networks can adapt their parameters to better approximate the
relationships in the data, thereby improving their performance on tasks such as classification,
regression, or any other predictive modeling.
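To tie forward propagation, loss calculation, backpropagation, and weight updates together, here is a small NumPy sketch of a one-hidden-layer network trained on made-up data. The architecture, learning rate, and data are illustrative assumptions, not a prescribed implementation.

```python
# Illustrative NumPy sketch of a one-hidden-layer network trained with forward
# propagation, loss calculation, backpropagation, and gradient-descent updates.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)     # toy binary labels

W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)
lr = 0.5                                                     # learning rate

for epoch in range(200):
    # Forward propagation
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0)                                   # ReLU activation
    z2 = a1 @ W2 + b2
    p = 1 / (1 + np.exp(-z2))                                # sigmoid output

    # Loss calculation (binary cross-entropy)
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

    # Backpropagation: gradients via the chain rule
    dz2 = (p - y) / len(X)
    dW2, db2 = a1.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (z1 > 0)
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

    # Weight update: step in the opposite direction of the gradient
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final loss: {loss:.3f}")   # the loss shrinks over the iterations
```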
Learning of a Neural Network
1. Learning with Supervised Learning
In supervised learning, a neural network learns from labeled input-output pairs provided by a
teacher. The network generates outputs based on inputs, and by comparing these outputs to the
known desired outputs, an error signal is created. The network iteratively adjusts its parameters
to minimize errors until it reaches an acceptable performance level.
2. Learning with Unsupervised Learning
Unsupervised learning involves data without labeled output variables. The primary goal is to
understand the underlying structure of the input data (X). Unlike supervised learning, there is no
instructor to guide the process. Instead, the focus is on modeling data patterns and relationships,
with techniques like clustering and association commonly used.
3. Learning with Reinforcement Learning
Reinforcement learning enables a neural network to learn through interaction with its
environment. The network receives feedback in the form of rewards or penalties, guiding it to
find an optimal policy or strategy that maximizes cumulative rewards over time. This approach is
widely used in applications like gaming and decision-making.
Types of Neural Networks
Several types of neural networks are commonly used, including the following.
• Feedforward Networks: A feedforward neural network is a simple artificial neural
network architecture in which data moves from input to output in a single direction.
• Multilayer Perceptron (MLP): MLP is a type of feedforward neural network with three
or more layers, including an input layer, one or more hidden layers, and an output layer. It
uses nonlinear activation functions.
• Convolutional Neural Network (CNN): A Convolutional Neural Network (CNN) is a
specialized artificial neural network designed for image processing. It employs
convolutional layers to automatically learn hierarchical features from input images,
enabling effective image recognition and classification.
• Recurrent Neural Network (RNN): A Recurrent Neural Network (RNN) is an artificial
neural network designed for sequential data processing. Because it uses feedback loops
that allow information to persist within the network, it is appropriate for applications
where contextual dependencies are critical, such as time series prediction and natural
language processing.
• Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed to
overcome the vanishing gradient problem in training RNNs. It uses memory cells and
gates to selectively read, write, and erase information.
Advantages of Neural Networks
Neural networks are widely used in many different applications because of their many benefits:
• Adaptability: Neural networks are useful for activities where the link between inputs and
outputs is complex or not well defined because they can adapt to new situations and learn
from data.
• Pattern Recognition: Their proficiency in pattern recognition makes them effective in
tasks such as audio and image identification, natural language processing, and other
intricate data patterns.
• Parallel Processing: Because neural networks are capable of parallel processing by
nature, they can process numerous jobs at once, which speeds up and improves the
efficiency of computations.
• Non-Linearity: Neural networks are able to model and comprehend complicated
relationships in data by virtue of the non-linear activation functions found in neurons,
which overcome the drawbacks of linear models.
Disadvantages of Neural Networks
Neural networks, while powerful, are not without drawbacks and difficulties:
• Computational Intensity: Training large neural networks can be a laborious and
computationally intensive process that requires substantial computing power.
• Black box Nature: As “black box” models, neural networks pose a problem in important
applications since it is difficult to understand how they make decisions.
• Overfitting: Overfitting is a phenomenon in which neural networks commit training
material to memory rather than identifying patterns in the data. Although regularization
approaches help to alleviate this, the problem still exists.
• Need for Large datasets: For efficient training, neural networks frequently need sizable,
labeled datasets; otherwise, their performance may suffer from incomplete or skewed
data.
Applications of Neural Networks
Neural networks have numerous applications across various fields:
1. Image and Video Recognition: CNNs are extensively used in applications such as facial
recognition, autonomous driving, and medical image analysis.
2. Natural Language Processing (NLP): RNNs and transformers power language
translation, chatbots, and sentiment analysis.
3. Finance: Predicting stock prices, fraud detection, and risk management.
4. Healthcare: Neural networks assist in diagnosing diseases, analyzing medical images,
and personalizing treatment plans.
5. Gaming and Autonomous Systems: Neural networks enable real-time decision-making,
enhancing user experience in video games and enabling autonomous systems like self-
driving cars.
What is Perceptron?
Perceptron is a type of neural network that performs binary classification, mapping input features
to an output decision, usually classifying data into one of two categories, such as 0 or 1.
A perceptron consists of a single layer of input nodes that are fully connected to a layer of output
nodes. It is particularly good at learning linearly separable patterns. It utilizes a variation of
artificial neurons called Threshold Logic Units (TLU), which were first introduced by Warren
McCulloch and Walter Pitts in the 1940s. This foundational model has played a crucial role in the
development of more advanced neural networks and machine learning algorithms.
Types of Perceptron
1. Single-Layer Perceptron is a type of perceptron that is limited to learning linearly separable
patterns. It is effective for tasks where the data can be divided into distinct categories
by a straight line. While powerful in its simplicity, it struggles with more complex
problems where the relationship between inputs and outputs is non-linear.
2. Multi-Layer Perceptrons possess enhanced processing capabilities as they consist of two
or more layers, which makes them adept at handling more complex patterns and relationships
within the data.
Basic Components of Perceptron
A Perceptron is composed of key components that work together to process information and
make predictions.
• Input Features: The perceptron takes multiple input features, each representing a
characteristic of the input data.
• Weights: Each input feature is assigned a weight that determines its influence on the
output. These weights are adjusted during training to find the optimal values.
• Summation Function: The perceptron calculates the weighted sum of its inputs,
combining them with their respective weights.
• Activation Function: The weighted sum is passed through the Heaviside step function,
comparing it to a threshold to produce a binary output (0 or 1).
• Output: The final output is determined by the activation function, often used for binary
classification tasks.
• Bias: The bias term helps the perceptron make adjustments independent of the input,
improving its flexibility in learning.
• Learning Algorithm: The perceptron adjusts its weights and bias using a learning
algorithm, such as the Perceptron Learning Rule, to minimize prediction errors.
These components enable the perceptron to learn from data and make predictions. While a single
perceptron can handle simple binary classification, complex tasks require multiple perceptrons
organized into layers, forming a neural network.
How does Perceptron work?
A weight is assigned to each input node of a perceptron, indicating the importance of that input
in determining the output. The Perceptron’s output is calculated as a weighted sum of the inputs,
which is then passed through an activation function to decide whether the Perceptron will fire.
The weighted sum is computed as:
z = w1x1 + w2x2 + ... + wnxn + b
The step function compares this weighted sum to a threshold. If the input is larger than the
threshold value, the output is 1; otherwise, it is 0. The most common activation function used in
perceptrons is the Heaviside step function:
h(z) = 0 if z < 0, and h(z) = 1 if z ≥ 0
A perceptron consists of a single layer of Threshold Logic Units (TLU), with each TLU fully
connected to all input nodes.
In a fully connected layer, also known as a dense layer, all neurons in one layer are connected to
every neuron in the previous layer.
The output of the fully connected layer is computed as:
fW,b(X) = h(XW + b)
where X is the input matrix, W is the weight matrix for the input neurons, b is the bias, and h is
the step function.
During training, the Perceptron’s weights are adjusted to minimize the difference between the
predicted output and the actual output. This is achieved using supervised learning algorithms like
the delta rule or the Perceptron learning rule.
The weight update formula is:
wi,j = wi,j + η (yj − ŷj) xi
where wi,j is the weight between the i-th input and the j-th output neuron, η is the learning rate,
xi is the i-th input value, yj is the desired output, and ŷj is the predicted output.
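Below is a minimal NumPy sketch of the perceptron learning rule in action. The AND-gate dataset is an illustrative assumption chosen because it is linearly separable.

```python
# Minimal perceptron sketch using the update rule w <- w + eta * (y - y_hat) * x.
# The AND-gate dataset below is an illustrative, linearly separable example.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])              # logical AND

w = np.zeros(2)
b = 0.0
eta = 0.1                               # learning rate

for epoch in range(10):
    for xi, target in zip(X, y):
        y_hat = 1 if xi @ w + b >= 0 else 0     # Heaviside step activation
        error = target - y_hat
        w += eta * error * xi                   # perceptron learning rule
        b += eta * error

print(w, b)                                      # learned weights and bias
print([1 if xi @ w + b >= 0 else 0 for xi in X]) # predictions: [0, 0, 0, 1]
```

Because the data is linearly separable, the perceptron convergence theorem guarantees this loop settles on a correct decision boundary after a finite number of updates.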
What is a Multilayer Perceptron?
A Multi-Layer Perceptron (MLP) consists of fully connected dense layers that transform input
data from one dimension to another. It is called “multi-layer” because it contains an input layer,
one or more hidden layers, and an output layer. The purpose of an MLP is to model complex
relationships between inputs and outputs, making it a powerful tool for various machine learning
tasks.
The key components of Multi-Layer Perceptron include:
• Input Layer: Each neuron (or node) in this layer corresponds to an input feature. For
instance, if you have three input features, the input layer will have three neurons.
• Hidden Layers: An MLP can have any number of hidden layers, with each layer
containing any number of nodes. These layers process the information received from the
input layer.
• Output Layer: The output layer generates the final prediction or result. If there are
multiple outputs, the output layer will have a corresponding number of neurons.
Every connection in the diagram is a representation of the fully connected nature of an MLP. This
means that every node in one layer connects to every node in the next layer. As the data moves
through the network, each layer transforms it until the final output is generated in the output layer.
Working of Multi-Layer Perceptron
Let’s delve into the working of the multi-layer perceptron, focusing on the key mechanisms:
forward propagation, the loss function, backpropagation, and optimization.
Step 1: Forward Propagation
In forward propagation, the data flows from the input layer to the output layer, passing through
any hidden layers. Each neuron in the hidden layers processes the input as follows:
1. Weighted Sum: The neuron computes the weighted sum of the inputs:
z = w1x1 + w2x2 + ... + wnxn + b
2. Activation: The weighted sum is passed through an activation function to introduce
non-linearity.
Step 2: Loss Function
The network's prediction is compared with the true label using a loss function, such as mean
squared error for regression or cross-entropy for classification, which quantifies the prediction
error to be minimized.
Step 3: Backpropagation
The goal of training an MLP is to minimize the loss function by adjusting the network’s weights
and biases. This is achieved through backpropagation:
1. Gradient Calculation: The gradients of the loss function with respect to each weight and
bias are calculated using the chain rule of calculus.
2. Error Propagation: The error is propagated back through the network, layer by layer.
• Gradient Descent: The network updates the weights and biases by moving in the
opposite direction of the gradient to reduce the loss:
w = w − η ∂L/∂w
where:
o w is the weight.
o η is the learning rate.
o ∂L/∂w is the gradient of the loss function with respect to the weight.
Gradient Descent
Gradient Descent is an optimization algorithm used to minimize the loss function by iteratively
updating the weights in the direction of the negative gradient. Common variants of gradient descent
include:
• Batch Gradient Descent: Updates weights after computing the gradient over the entire
dataset.
• Stochastic Gradient Descent (SGD): Updates weights for each training example
individually.
• Mini-batch Gradient Descent: Updates weights after computing the gradient over a small
batch of training examples.
Evaluation of Feedforward neural network
Evaluating the performance of the trained model involves several metrics:
• Accuracy: The proportion of correctly classified instances out of the total instances.
• Precision: The ratio of true positive predictions to the total predicted positives.
• Recall: The ratio of true positive predictions to the actual positives.
• F1 Score: The harmonic mean of precision and recall, providing a balance between the
two.
• Confusion Matrix: A table used to describe the performance of a classification model,
showing the true positives, true negatives, false positives, and false negatives.
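The metrics above can be computed with scikit-learn, as in the following hedged sketch; the true and predicted labels are made up for illustration.

```python
# Hedged example of the evaluation metrics above, computed with scikit-learn
# on made-up true/predicted labels.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```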
What is Backpropagation?
Backpropagation is a powerful algorithm in deep learning, primarily used to train artificial neural
networks, particularly feed-forward networks. It works iteratively, minimizing the cost function
by adjusting weights and biases.
In each epoch, the model adapts these parameters, reducing loss by following the error gradient.
Backpropagation often utilizes optimization algorithms like gradient descent or stochastic
gradient descent. The algorithm computes the gradient using the chain rule from calculus,
allowing it to effectively navigate complex layers in the neural network to minimize the cost
function.
fig(a) A simple illustration of how the backpropagation works by adjustments of weights
Why is Backpropagation Important?
Backpropagation plays a critical role in how neural networks improve over time. Here's why:
1. Efficient Weight Update: It computes the gradient of the loss function with respect to each
weight using the chain rule, making it possible to update weights efficiently.
2. Scalability: The backpropagation algorithm scales well to networks with multiple layers
and complex architectures, making deep learning feasible.
3. Automated Learning: With backpropagation, the learning process becomes automated,
and the model can adjust itself to optimize its performance.
Working of Backpropagation Algorithm
The Backpropagation algorithm involves two main steps: the Forward Pass and the Backward
Pass.
How Does the Forward Pass Work?
In the forward pass, the input data is fed into the input layer. These inputs, combined with their
respective weights, are passed to hidden layers.
For example, in a network with two hidden layers (h1 and h2 as shown in Fig. (a)), the output from
h1 serves as the input to h2. Before applying an activation function, a bias is added to the weighted
inputs.
Each hidden layer applies an activation function like ReLU (Rectified Linear Unit), which
returns the input if it’s positive and zero otherwise. This adds non-linearity, allowing the model to
learn complex relationships in the data. Finally, the outputs from the last hidden layer are passed
to the output layer, where an activation function, such as softmax, converts the weighted outputs
into probabilities for classification.
Types of Activation Functions
1. Linear Activation Function
The Linear Activation Function, or Identity Function, returns the input as the output.
2. Non-Linear Activation Functions
1. Sigmoid Function
The Sigmoid Activation Function is characterized by its ‘S’ shape. It is mathematically defined as:
σ(x) = 1 / (1 + e^(−x))
This formula ensures a smooth and continuous output that is essential for gradient-based
optimization methods.
• It allows neural networks to handle and model complex patterns that linear equations
cannot.
• The output ranges between 0 and 1, hence it is useful for binary classification.
• The function exhibits a steep gradient when x values are between -2 and 2. This sensitivity
means that small changes in input x can cause significant changes in output y, which is
critical during the training process.
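A tiny sketch of the sigmoid and its gradient follows; the sample input values are illustrative.

```python
# Small sketch of the sigmoid activation sigma(x) = 1 / (1 + e^(-x)) and its
# steep region around x in [-2, 2]; the sample inputs are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(np.round(sigmoid(xs), 3))                      # outputs stay between 0 and 1
# The derivative sigma(x) * (1 - sigma(x)) is largest near x = 0,
# where training is most sensitive to changes in the input.
print(np.round(sigmoid(xs) * (1 - sigmoid(xs)), 3))
```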
Deep learning
Deep learning is a type of machine learning that uses artificial neural networks to learn from data.
Artificial neural networks are inspired by the human brain, and they can be used to solve a wide
variety of problems, including image recognition, natural language processing, and speech
recognition.
What is Deep Learning?
The definition of Deep learning is that it is the branch of machine learning that is based on artificial
neural network architecture. An artificial neural network or ANN uses layers of interconnected
nodes called neurons that work together to process and learn from the input data.
In a fully connected Deep neural network, there is an input layer and one or more hidden layers
connected one after the other. Each neuron receives input from the previous layer neurons or the
input layer. The output of one neuron becomes the input to other neurons in the next layer of the
network, and this process continues until the final layer produces the output of the network. The
layers of the neural network transform the input data through a series of nonlinear transformations,
allowing the network to learn complex representations of the input data.
Convolution Neural Network
A Convolutional Neural Network (CNN) is a type of Deep Learning neural network architecture
commonly used in Computer Vision. Computer vision is a field of Artificial Intelligence that
enables a computer to understand and interpret the image or visual data.
When it comes to machine learning, Artificial Neural Networks perform really well. Neural
networks are used on various kinds of data, such as images, audio, and text. Different types of
neural networks are used for different purposes: for example, for predicting a sequence of words we
use Recurrent Neural Networks, more precisely an LSTM, and similarly for image classification we
use Convolutional Neural Networks. This section covers the basic building blocks of a CNN.
A Convolutional Neural Network (CNN) is an extended version of the artificial neural network (ANN)
that is predominantly used to extract features from grid-like matrix datasets, for example visual
datasets such as images or videos, where spatial data patterns play an extensive role.
CNN Architecture
Convolutional Neural Network consists of multiple layers like the input layer, Convolutional layer,
Pooling layer, and fully connected layers.
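A hedged sketch of that layer stack, written with tf.keras, is shown below; the input shape, filter counts, and class count are assumptions for illustration, not values from the text.

```python
# Hedged sketch of the CNN layer stack described above (convolution -> pooling ->
# fully connected), written with tf.keras; layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                      # e.g. grayscale images
    layers.Conv2D(32, kernel_size=3, activation="relu"),  # convolutional layer learns filters
    layers.MaxPooling2D(pool_size=2),                     # pooling layer downsamples feature maps
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),                  # fully connected layer
    layers.Dense(10, activation="softmax"),               # output layer for 10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```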
Recurrent Neural Network (RNN)
1. Recurrent Neurons
The fundamental processing unit in a Recurrent Neural Network (RNN) is a Recurrent Unit, which
is not explicitly called a “Recurrent Neuron.” Recurrent units hold a hidden state that maintains
information about previous inputs in a sequence. Recurrent units can “remember” information
from prior steps by feeding back their hidden state, allowing them to capture dependencies across
time.
Recurrent Neuron
2. RNN Unfolding
RNN unfolding, or “unrolling,” is the process of expanding the recurrent structure over time steps.
During unfolding, each step of the sequence is represented as a separate layer in a series,
illustrating how information flows across each time step. This unrolling enables backpropagation
through time (BPTT), a learning process where errors are propagated across time steps to adjust
the network’s weights, enhancing the RNN’s ability to learn dependencies within sequential data.
RNN Unfolding
Types Of Recurrent Neural Networks
There are four types of RNNs based on the number of inputs and outputs in the network:
1. One-to-One RNN
A One-to-One RNN behaves like a vanilla neural network and is the simplest type of neural network
architecture. In this setup, there is a single input and a single output. It is commonly used for
straightforward classification tasks where input data points do not depend on previous elements.
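To make the idea of a hidden state carried across time steps concrete, here is a small NumPy sketch of a single recurrent unit; the weights, sequence length, and inputs are random placeholders, not part of the source material.

```python
# Illustrative NumPy sketch of a single recurrent unit: the hidden state h carries
# information from earlier time steps; weights and inputs are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5

W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (recurrence)
b_h = np.zeros(hidden_size)

h = np.zeros(hidden_size)                  # initial hidden state
for t in range(seq_len):
    x_t = rng.normal(size=input_size)      # one time step of the input sequence
    # The same weights are reused at every step; h feeds back into the next update
    h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
    print(f"t={t}, hidden state:", np.round(h, 3))
```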
Benefits of IBM Watson Studio
Optimize AI and cloud economics
Put multicloud AI to work for business. Use flexible consumption models. Build and deploy AI
anywhere.
Predict outcomes and prescribe actions
Optimize schedules, plans and resource allocations using predictions. Simplify optimization
modeling with a natural language interface.
Synchronize apps and AI
Unite and cross-train developers and data scientists. Push models through REST API across any
cloud. Save time and cost managing disparate tools.
Unify tools and increase productivity for ModelOps
Operationalize enterprise AI across clouds. Govern and secure data science projects at scale.
Deliver explainable AI
Reduce model monitoring efforts by 35% to 50%.¹ Increase model accuracy by 15% to
30%.² Increase net profits on a data and AI platform.
Manage risks and regulatory compliance
Protect against exposure and regulatory penalties. Simplify AI model risk management through
automated validation.
Features
AutoAI for faster experimentation
Automatically build model pipelines. Prepare data and select model types. Generate and rank
model pipelines.
Advanced data refinery
Cleanse and shape data with a graphical flow editor. Apply interactive templates to code
operations, functions and logical operators.
Open-source notebook support
Create a notebook file, use a sample notebook or bring your own notebook. Code and run a
notebook.
Integrated visual tooling
Prepare data quickly and develop models visually with IBM SPSS Modeler in Watson Studio.
Model training and development
Build experiments quickly and enhance training by optimizing pipelines and identifying the right
combination of data.
Extensive open-source frameworks
Bring your model of choice to production. Track and retrain models using production feedback.
Embedded decision optimization
Combine predictive and prescriptive models. Use predictions to optimize decisions. Create and
edit models in Python, in OPL or with natural language.
Model management and monitoring
Monitor quality, fairness and drift metrics. Select and configure deployment for model insights.
Customize model monitors and metrics.
Model risk management
Compare and evaluate models. Evaluate and select models with new data. Examine the key model
metrics side-by-side.
Data science methodology
IBM has defined a lightweight IBM Cloud Garage Method that includes a process model to map
individual technology components to the reference architecture. This method does not include any
requirement engineering or design thinking tasks. Because it can be hard to initially define the
architecture of a project, this method supports architectural changes during the process model.
Each stage plays a vital role in the context of the overall methodology. At a certain level of
abstraction, it can be seen as a refinement of the workflow outlined by the CRISP-DM method for
data mining.
According to both methodologies, every project starts with Business understanding, where the
problem and objectives are defined. This is followed in the IBM Data Science Method by
the Analytical approach phase, where the data scientist can define the approach to solving the
problem. The IBM Data Science Method then continues with three phases called Data
requirements, Data collection, and Data understanding, which in CRISP-DM are presented by a
single Data understanding phase.
After the data scientist has an understanding of the data and has sufficient data to get started, they
move to the Data preparation phase. This phase is usually very time-consuming. A data scientist
spends about 80% of their time in this phase, performing tasks such as data cleansing and feature
engineering. The term "data wrangling" is often used in this context. During and after cleansing
the data, the data scientist generally performs exploration, such as descriptive statistics to get an
overall feel for the data, and clustering to look at the relationships and latent structure of the data.
This process is often iterated several times until the data scientist is satisfied with their data set.
The model training stage is where machine learning is used in building a predictive model. The
model is trained and then evaluated by statistical measures such as prediction accuracy, sensitivity,
and specificity. After the model is deemed sufficient, it is deployed and used for scoring on unseen
data. The IBM Data Science Methodology adds an additional Feedback stage for obtaining
feedback from using the model, which is then used to improve the model. Both methods are highly
iterative by nature.
IBM Watson Studio
IBM Watson Studio gives you the environment and tools to solve business problems by
collaboratively working with data. You can choose the tools needed to analyze and visualize data;
to cleanse and shape the data; to ingest streaming data; or to create, train, and deploy machine
learning models.
In the context of data science, IBM Watson Studio can be viewed as an integrated, multirole
collaboration platform that supports the developer, data engineer, business analyst, and the data
scientist in the process of solving a data science problem. For the developer role, other components
of the IBM Cloud platform might be relevant as well in building applications that use machine
learning services. However, the data scientist can build machine learning models using a variety
of tools, ranging from:
• AutoAI Model Builder: A graphical tool requiring no programming skills
Using IBM Watson Machine Learning, you can deploy machine learning models, scripts, functions,
and prompt templates for generative AI models. After you create deployments, you can
test and manage them, and prepare your assets for deployment into pre-production and production
environments to generate predictions and insights.
Service: The administrator must provision the Watson Machine Learning service on the Cloud Pak for
Data as a Service platform to use its capabilities.
Analyzing data and building models with Watson Studio, Watson Machine Learning, and
other supplemental services
You can analyze data and build models with the Watson Studio service. Supplemental services to
Watson Studio, such as Watson Machine Learning, add tools and compute resources to projects.
Service: The Watson Studio, Watson Machine Learning, and other supplemental services are not
available by default. An administrator must install these services on the IBM Cloud Pak for Data
platform. To determine whether a service is installed, open the Services catalog and check whether
the service is enabled.
To start analyzing data with Watson Studio:
You can use the data analytics and model building methods that are listed in the following table
with Watson Studio plus the other listed services.
Method and supplementary services to Watson Studio:
• Analyze data by writing code in Jupyter notebooks or scripts, including code notebooks and
Python scripts in the JupyterLab IDE with Git integration: no supplementary service listed
• Visualize and prepare data in Data Refinery: no supplementary service listed
• [Link] Prompt Lab: Watson Machine Learning
• [Link] Tuning Studio: Watson Machine Learning
• Develop Shiny applications in the RStudio IDE: RStudio Server with R 3.6
• Visualize your data without coding with Cognos dashboards: Cognos Dashboards
• Run analytic workloads with Spark environments or Spark APIs: Analytics Engine powered by
Apache Spark
• Analyze data on Apache Hadoop clusters: Execution Engine for Apache Hadoop
• Analyze data with SQL queries on Hadoop clusters or cloud object stores: Db2 Big SQL
• Build models with AutoAI: Watson Machine Learning
• Train models with federated learning: Watson Machine Learning
• Build models in notebooks: Watson Machine Learning
• Run deep learning experiments: Watson Machine Learning, Watson Machine Learning Accelerator
• Solve Decision Optimization models: Watson Machine Learning, Decision Optimization
• Build models with SPSS Modeler: SPSS Modeler
Table 1. Data analysis and model building methods with Watson Studio and supplementary services
If you don't have the Watson Studio service that is installed, you can use the data analytics methods
with the services that are listed in the following table.
Method and required service:
• Run analytic workloads with Spark APIs: Analytics Engine powered by Apache Spark
• Analyze data with SQL queries on Hadoop clusters or cloud object stores: Db2 Big SQL
Table 2. Data analysis methods and the required services
Create a deployment space to collaborate with stakeholders and deploy and manage your AI assets.
To manage your assets within a deployment space, you must promote your assets from a project
to your deployment space. You can also import or export assets from your deployment space. For
more information, see Deployment spaces.
The typical activities to deploy AI assets are described in the following sections.
Deployment spaces
Deployment spaces contain deployable assets, deployments, deployment jobs, associated input and
output data, and the associated environments. You can use spaces to deploy various assets and manage
your deployments.
Deployment spaces are not associated with projects. You can promote assets from multiple
projects to a space, and you can deploy assets to more than one space. For example, you might
have a test space for evaluating deployments, and a production space for deployments that you
want to deploy in business applications.
When you open a deployment space from Cloud Pak for Data, you see these components:
• Overview: Use the Overview tab to view all space activity, such as machine learning and
generative AI assets within the deployment space and the status of deployments and jobs.
• Assets: The Assets tab provides details about assets that are created or imported in the
deployment space. You can organize assets by types. For example, the Data access view
provides information about connections to DataStax Enterprise, Google BigQuery, Oracle,
and more.
• Deployments: Use the Deployments tab to monitor the status of your deployments in the
deployment space. You must promote assets from your project to the deployment space
before they can be deployed.
• Jobs: Use the Jobs tab to monitor jobs that are associated with your batch deployments.
• Manage: Use the Manage tab to access and edit details about your deployment space. You
can share a space with collaborators. When you add collaborators to a deployment space,
you can specify which actions they can do by assigning them access levels. You can also
create new environments and manage resource usage.
Required permissions:
All users in your IBM Cloud account with the Editor IAM platform access role for all IAM-enabled
services or for Cloud Pak for Data can create deployment spaces. For more information, see IAM
Platform access roles.
1. From the navigation menu, select Deployments > New deployment space. Enter a name
for your deployment space.
2. Optional: Add a description and tags.
3. Select a storage service to store your space assets.
o If you have a Cloud Object Storage repository that is associated with your IBM Cloud
account, choose a repository from the list to store your space assets.
o If you do not have a Cloud Object Storage repository that is associated with your IBM
Cloud account, you are prompted to create one.
4. Optional: If you want to deploy assets from your space, select a machine learning service
instance to associate with your deployment space.
To associate a machine learning instance to a space, you must:
o Be a space administrator.
o Have view access to the machine learning service instance that you want to
associate with the space.
Tip: If you want to evaluate assets in the space, switch to the Manage tab and associate a
Watson OpenScale instance.
5. Optional: Assign the space to a deployment stage. Deployment stages are used for MLOps,
to manage access for assets in various stages of the AI lifecycle. They are also used in
governance, for tracking assets. Choose from:
o Development for assets under development. Assets that are tracked for governance
are displayed in the Develop stage of their associated use case.
o Testing for assets that are being validated. Assets that are tracked for governance
are displayed in the Validate stage of their associated use case.
o Production for assets in production. Assets that are tracked for governance are
displayed in the Operate stage of their associated use case.
6. Optional: Upload space assets, such as an exported project or an exported space. If the
imported space is encrypted, you must enter the password.
Tip: If you get an import error, clear your browser cookies and then try again.
7. Click Create.
• To view all deployment spaces that you can access, click Deployments on the navigation
menu.
• To view any of the details about the space after you create it, such as the associated service
instance or storage ID, open your deployment space and then click the Manage tab.
• Your space assets are stored in a Cloud Object Storage repository. You can access this
repository from IBM Cloud. To find the bucket ID, open your deployment space, and click
the Manage tab.
Spaces that are not used for 90 days are automatically archived to preserve system resources. If
you access an archived space, either through the user interface or programmatically, you must wait
for the space to be restored before you can use it.
Note: You cannot promote or add assets to an archived space. Restore the space first, then promote
or add assets.
• Use a no-code approach: You can use a no-code approach to deploy and manage assets in
a deployment space. For more information, see Deploying and managing assets in deployment
spaces.
• Use a custom-code approach: You can use a custom-code approach to deploy and manage
assets programmatically by using:
o Python client
o Watson Machine Learning API
For additional Cloud Pak for Data as a Service APIs, see Cloud Pak for Data APIs.
Types of deployments
Depending on your organization's needs, you can create an online or a batch deployment:
To use your deployed asset in applications for making predictions, retrieve the endpoint URL for
your online or batch deployment. The model endpoint provides access to an interface to invoke
and manage model deployments.
For more information, see Retrieving the endpoint for an online deployment or Retrieving the endpoint
for a batch deployment.
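As an illustration of calling an online deployment's scoring endpoint over REST, consider the sketch below. The endpoint URL, bearer token, and field names are placeholders you would replace with values from your own deployment space, and the payload shape follows the commonly documented Watson Machine Learning "input_data" format; verify both against the product documentation for your release before relying on them.

```python
# Hedged sketch: scoring an online deployment through its REST endpoint.
# URL, token, and feature names are placeholders, not real values.
import requests

scoring_url = "https://<region>.ml.cloud.ibm.com/ml/v4/deployments/<deployment_id>/predictions?version=2020-09-01"
token = "<IAM access token>"          # obtained from IBM Cloud IAM

payload = {
    "input_data": [{
        "fields": ["size", "rooms"],  # hypothetical feature names
        "values": [[2000, 4]],        # one row to score
    }]
}

response = requests.post(
    scoring_url,
    json=payload,
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
)
print(response.status_code, response.json())
```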
• Foundation model assets: You can deploy foundation model assets such as tuned models
or prompt template assets with [Link]. For more information, see Deploying foundation
model assets.
• Machine Learning assets: You can deploy Machine Learning assets such as Python
functions, R Shiny applications, NLP models, scripts, and more with Watson Machine
Learning. For more information, see Deploying Machine Learning assets.
• Decision Optimization models: You can deploy Decision Optimization models with
Watson Machine Learning.
Managing deployments
You can access, update, scale, delete, and monitor the performance for your deployment in your
deployment space:
• Accessing a deployment: You can access details that are related to your deployment, such
as stage type, which describes whether the deployment space is for preproduction or
production purposes.
• Updating a deployment: You can update your deployment details such as deployment
name, software specification, and more. For more information, see Updating a deployment.
• Scaling a deployment: You can create multiple copies of your deployment to increase
scalability and availability for a larger volume of scoring requests. For more information,
see Scaling a deployment.
• Deleting a deployment: Delete your deployment when you no longer need it to free up
resources. For more information, see Deleting a deployment.
• Monitor deployment performance: You can evaluate your deployments to measure
performance and understand model predictions by provisioning a Watson OpenScale
instance and configuring monitors for fairness, quality, drift, and explainability.
Runtime environments provide the necessary functions that are required to run your deployment.
Important: You must use the same runtime environment to build and deploy your model.
You can use predefined runtime environments or create custom runtime environments to include
more components, depending on your use case. To create a custom runtime environment for your
deployment, you must create a Dockerfile and add a base image. Further, you can add
the docker commands to build the runtime environment for your deployment. For more
information, see Customizing Watson Machine Learning deployment runtimes.
Deploying an asset makes it available for testing or for productive use via an endpoint.
The following graphic describes the process of deploying your model, automating path to
production, and monitoring and managing AI lifecycle after you build your model:
Deploy assets
You can deploy assets from your deployment space by using [Link] Runtime. To deploy your
assets, you must promote these assets from a project to your deployment space or import these
assets directly to your deployment space.
Automate pipelines
You can automate the path to production by building a pipeline to automate parts of the AI lifecycle
from building the model to deployment by using Orchestration Pipelines.
Short Answers
1. What is Conditional Probability?
2. How does Machine Learning differ from traditional learning?
3. How does classification differ from clustering? In which situations can clustering be used?
4. What are the different types of deployments?
5. How can overfitting be avoided in a Machine Learning model?
6. What are the different types of decision theory?
7. What are the various distance metrics used in KNN?
8. What is the role of filters (kernels) in CNNs?
9. How does a generative model differ from a discriminative model?
10. How do you monitor deployment activity?
11. Why is dimensionality reduction required in machine learning techniques?
12. What do you mean by pruning?
13. What is dimensionality reduction in ML? List its advantages.
14. What is model-based learning?
15. Define Deep Learning. How does it differ from ML?
16. Draw the diagrammatic representation of a multi-layer perceptron, and write the mathematical
notation for the output calculation.
17. Define Perceptron.
18. How can you choose the K value in clustering?
19. How do you monitor deployment activity?
20. What are the ways to deploy assets?
Big Answers
1. Explain in detail vector calculation and optimization techniques.
2. Explain in detail decision theory and information theory with their applications.
3. Explain in detail principal component analysis with a suitable example and list its
applications.
4. Explain in detail various ensemble learning methods with suitable examples, and
differentiate among them.
5. Explain in detail the different types of Machine Learning with their advantages,
disadvantages, and applications.
6. How does Machine Learning differ from Artificial Intelligence?
7. Explain in detail K-Means clustering with a suitable example.
8. Explain in detail hierarchical clustering and its types with suitable examples.
9. Compare supervised learning, unsupervised learning, and reinforcement learning.
10. Explain in detail the Decision Tree used in supervised learning for classification.
11. Explain in detail the Convolutional Neural Network with its applications.
12. Explain in detail the Feed Forward Network and the Backpropagation algorithm with a suitable
example.
13. Explain in detail DBSCAN clustering with a suitable example.
14. Explain in detail recommendation systems, their types, and their advantages and
disadvantages.
15. Explain in detail the Support Vector Machine used in supervised learning for classification.
16. Explain the different types of deployable assets in IBM Watson Studio.
17. How do you manage runtime environments for deployments in IBM Watson Studio?
18. How do you evaluate classification algorithms in Machine Learning?
19. Explain in detail the different types of Machine Learning with their advantages, disadvantages,
and applications.
20. Describe RNN and its various types.
21. Explain in detail the Multilayer Perceptron.
22. Explain the various IBM Watson Machine Learning services.
23. How do you manage runtime environments for deployments in IBM Watson Studio?
24. Explain in detail K-Means clustering with a suitable example.
25. Explain various cross-validation methods with suitable examples.
26. How do you monitor deployment activity in IBM Watson Studio?
Dimension Reduction-
• It is the process of converting a data set with a large number of dimensions into a data set
with fewer dimensions.
• It ensures that the converted data set conveys similar information concisely.
Example-
In machine learning,
• We convert the dimensions of data from 2 dimensions (x1 and x2) to 1 dimension (z1).
• It makes the data relatively easier to explain.
Benefits-
• It compresses the data and thus reduces the storage space requirements.
• It reduces the time required for computation, since fewer dimensions require less computation.
• It eliminates the redundant features.
• It improves the model performance.
Problem-01:
Given data = { 2, 3, 4, 5, 6, 7 ; 1, 5, 3, 6, 7, 8 }.
OR
Consider the two-dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8).
OR
CLASS 1
X=2,3,4
Y=1,5,3
CLASS 2
X=5,6,7
Y=6,7,8
Solution-
Step-01:
Get data.
• x1 = (2, 1)
• x2 = (3, 5)
• x3 = (4, 3)
• x4 = (5, 6)
• x5 = (6, 7)
• x6 = (7, 8)
Step-02:
Calculate the mean vector (µ).
Mean vector µ = ((2 + 3 + 4 + 5 + 6 + 7) / 6, (1 + 5 + 3 + 6 + 7 + 8) / 6)
= (4.5, 5)
Thus, µ = (4.5, 5).
Step-03:
Subtract the mean vector from each data point to obtain the mean-deviation vectors (xi − µ).
Step-04:
Calculate the covariance matrix.
Covariance matrix M = (m1 + m2 + m3 + m4 + m5 + m6) / 6, where each mi = (xi − µ)(xi − µ)^T.
On adding the above matrices and dividing by 6, we get:
M = [ 2.92 3.67 ; 3.67 5.67 ]
Step-05:
Calculate the eigen values and eigen vectors of the covariance matrix.
Solving det(M − λI) = 0, we have:
(2.92 − λ)(5.67 − λ) − (3.67)(3.67) = 0
From here,
λ2 – 8.59λ + 3.09 = 0
Solving this quadratic equation, we get λ = 8.22, 0.38.
Clearly, the second eigen value is very small compared to the first eigen value.
The eigen vector corresponding to the greatest eigen value is the principal component for the given
data set. So, we find the eigen vector corresponding to eigen value λ1 by solving:
MX = λX
where
• M = Covariance Matrix
• X = Eigen vector
• λ = Eigen value
Substituting λ1 = 8.22 and solving the resulting equations, on simplification we get the eigen
vector X1 proportional to (2.55, 3.67).
Lastly, we project the data points onto the new subspace by taking the dot product of each
mean-deviation vector with the (normalized) principal eigen vector.
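The worked numbers above can be checked with a short NumPy sketch; the only assumption is the population covariance convention (dividing by n = 6), matching the steps above.

```python
# NumPy check of the worked PCA example: covariance matrix, eigenvalues,
# and projection onto the first principal component (population covariance, /6).
import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)
mean = X.mean(axis=0)                       # (4.5, 5)
centered = X - mean

cov = centered.T @ centered / len(X)        # divide by n = 6, as in Step-04
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns eigenvalues in ascending order

print(np.round(cov, 2))                     # approx [[2.92, 3.67], [3.67, 5.67]]
print(np.round(eigvals, 2))                 # approx [0.38, 8.21], matching 0.38 and ~8.22 above
pc1 = eigvecs[:, -1]                        # principal component (largest eigenvalue)
print(np.round(centered @ pc1, 2))          # data projected onto the first principal component
```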
Problem-02:
Use PCA Algorithm to transform the pattern (2, 1) onto the eigen vector in the previous question.
Solution-
Assignment -1
Build a decision tree using ID3 algorithm for the given training data in the table (Buy Computer data),
and predict the class of the following new example: age<=30, income=medium, student=yes, credit-
rating=fair