
Machine Learning Using Watson Studio

DR N KUMARAN, ASSOCIATE PROFESSOR/CSE

SCHOOL OF ENGINEERING AND TECHNOLOGY, DHANALAKSHMI SRINIVASAN UNIVERSITY | Samayapuram, Trichy
Unit 1
What is Machine Learning?

Machine Learning, as the name suggests, is about machines learning automatically without being explicitly
programmed, that is, without direct human intervention. The process starts with feeding the machines
good-quality data and then training them by building machine learning models on that data using
different algorithms. The choice of algorithm depends on the type of data available and the kind of task
we are trying to automate.

As for the formal definition of Machine Learning, we can say that a Machine Learning algorithm learns
from experience E with respect to some class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E.

For example, if a Machine Learning algorithm is used to play chess, then the experience E is playing many
games of chess, the task T is playing chess, and the performance measure P is the probability that the
algorithm will win a game of chess.

Advantages of Machine Learning

1. Improved Accuracy and Precision

One of the most significant benefits of machine learning is its ability to improve accuracy and precision in
various tasks. ML models can process vast amounts of data and identify patterns that might be overlooked
by humans. For instance, in medical diagnostics, ML algorithms can analyze medical images or patient data
to detect diseases with a high degree of accuracy.

2. Automation of Repetitive Tasks

Machine learning enables the automation of repetitive and mundane tasks, freeing up human resources
for more complex and creative endeavors. In industries like manufacturing and customer service, ML-
driven automation can handle routine tasks such as quality control, data entry, and customer inquiries,
resulting in increased productivity and efficiency.

3. Enhanced Decision-Making

ML models can analyze large datasets and provide insights that aid in decision-making. By identifying
trends, correlations, and anomalies, machine learning helps businesses and organizations make data-
driven decisions. This is particularly valuable in sectors like finance, where ML can be used for risk
assessment, fraud detection, and investment strategies.

4. Personalization and Customer Experience

Machine learning enables the personalization of products and services, enhancing customer experience.
In e-commerce, ML algorithms analyze customer behavior and preferences to recommend products
tailored to individual needs. Similarly, streaming services use ML to suggest content based on user viewing
history, improving user engagement and satisfaction.
5. Predictive Analytics

Predictive analytics is a powerful application of machine learning that helps forecast future events based
on historical data. Businesses use predictive models to anticipate customer demand, optimize inventory,
and improve supply chain management. In healthcare, predictive analytics can identify potential outbreaks
of diseases and help in preventive measures.

6. Scalability

Machine learning models can handle large volumes of data and scale efficiently as data grows. This
scalability is essential for businesses dealing with big data, such as social media platforms and online
retailers. ML algorithms can process and analyze data in real-time, providing timely insights and responses.

7. Improved Security

ML enhances security measures by detecting and responding to threats in real-time. In cybersecurity, ML


algorithms analyze network traffic patterns to identify unusual activities indicative of cyberattacks.
Similarly, financial institutions use ML for fraud detection by monitoring transactions for suspicious
behavior.

8. Cost Reduction

By automating processes and improving efficiency, machine learning can lead to significant cost
reductions. In manufacturing, ML-driven predictive maintenance helps identify equipment issues before
they become costly failures, reducing downtime and maintenance costs. In customer service, chatbots
powered by ML reduce the need for human agents, lowering operational expenses.

9. Innovation and Competitive Advantage

Adopting machine learning fosters innovation and provides a competitive edge. Companies that leverage
ML for product development, marketing strategies, and customer insights are better positioned to respond
to market changes and meet customer demands. ML-driven innovation can lead to the creation of new
products and services, opening up new revenue streams.

10. Enhanced Human Capabilities

Machine learning augments human capabilities by providing tools and insights that enhance performance.
In fields like healthcare, ML assists doctors in diagnosing and treating patients more effectively. In research,
ML accelerates the discovery process by analyzing vast datasets and identifying potential breakthroughs.

Disadvantages of Machine Learning

1. Data Dependency

Machine learning models require vast amounts of data to train effectively. The quality, quantity, and
diversity of the data significantly impact the model’s performance. Insufficient or biased data can lead to
inaccurate predictions and poor decision-making. Additionally, obtaining and curating large datasets can
be time-consuming and costly.

2. High Computational Costs


Training ML models, especially deep learning algorithms, demands significant computational resources.
High-performance hardware such as GPUs and TPUs is often required, which can be expensive. The
energy consumption associated with training large models is also substantial, raising concerns about the
environmental impact.

3. Complexity and Interpretability

Many machine learning models, particularly deep neural networks, function as black boxes. Their
complexity makes it difficult to interpret how they arrive at specific decisions. This lack of transparency
poses challenges in fields where understanding the decision-making process is critical, such as healthcare
and finance.

4. Overfitting and Underfitting

Machine learning models can suffer from overfitting or underfitting. Overfitting occurs when a model
learns the training data too well, capturing noise and anomalies, which reduces its generalization ability
to new data. Underfitting happens when a model is too simple to capture the underlying patterns in the
data, leading to poor performance on both training and test data.

5. Ethical Concerns

ML applications can raise ethical issues, particularly concerning privacy and bias. Data privacy is a
significant concern, as ML models often require access to sensitive and personal information. Bias in
training data can lead to biased models, perpetuating existing inequalities and unfair treatment of certain
groups.

6. Lack of Generalization

Machine learning models are typically designed for specific tasks and may struggle to generalize across
different domains or datasets. Transfer learning techniques can mitigate this issue to some extent, but
developing models that perform well in diverse scenarios remains a challenge.

7. Dependency on Expertise

Developing and deploying machine learning models require specialized knowledge and expertise. This
includes understanding algorithms, data preprocessing, model training, and evaluation. The scarcity of
skilled professionals in the field can hinder the adoption and implementation of ML solutions.

8. Security Vulnerabilities

ML models are susceptible to adversarial attacks, where malicious actors manipulate input data to deceive
the model into making incorrect predictions. This vulnerability poses significant risks in critical applications
such as autonomous driving, cybersecurity, and financial fraud detection.

9. Maintenance and Updates

ML models require continuous monitoring, maintenance, and updates to ensure they remain accurate and
effective over time. Changes in the underlying data distribution, known as data drift, can degrade model
performance, necessitating frequent retraining and validation.

10. Legal and Regulatory Challenges


The deployment of ML applications often encounters legal and regulatory hurdles. Compliance with data
protection laws, such as GDPR, requires careful handling of user data. Additionally, the lack of clear
regulations specific to ML can create uncertainty and challenges for businesses and developers.

Applications of Statistics in Machine Learning

Statistics is a key component of machine learning, with broad applicability in various fields.

• Feature engineering relies heavily on statistics to convert geometric features into meaningful
predictors for machine learning algorithms.

• In image processing tasks like object recognition and segmentation, statistical descriptors capture the
shape and structure of objects in images.

• Anomaly detection and quality control benefit from statistics by identifying deviations from
norms, aiding in the detection of defects in manufacturing processes.

• Environmental observation and geospatial mapping leverage statistical analysis to monitor land
cover patterns and ecological trends effectively.

Types of Machine Learning

There are several types of machine learning, each with special characteristics and applications. Some of
the main types of machine learning algorithms are as follows:

1. Supervised Machine Learning

2. Unsupervised Machine Learning

3. Semi-Supervised Machine Learning

4. Reinforcement Learning

1. Supervised Machine Learning

Supervised learning is when a model is trained on a “labelled dataset”, i.e., a dataset that contains both
input and output parameters. Supervised learning algorithms learn to map inputs to the correct outputs;
both the training and validation datasets are labelled.

Let’s understand it with the help of an example.

Example: Consider a scenario where you have to build an image classifier to differentiate between cats
and dogs. If you feed the algorithm a dataset of labelled images of dogs and cats, the machine learns to
classify a given image as a dog or a cat. When we input new dog or cat images that it has never seen
before, it uses the learned model to predict whether each image shows a dog or a cat. This is how
supervised learning works; this particular task is image classification.

There are two main categories of supervised learning that are mentioned below:

• Classification

• Regression

Classification

Classification deals with predicting categorical target variables, which represent discrete classes or labels.
For instance, classifying emails as spam or not spam, or predicting whether a patient has a high risk of
heart disease. Classification algorithms learn to map the input features to one of the predefined classes.

Here are some classification algorithms:

• Logistic Regression

• Support Vector Machine


• Random Forest

• Decision Tree

• K-Nearest Neighbors (KNN)

• Naive Bayes
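
As a minimal sketch of one algorithm from the list above, assuming scikit-learn is available, here is
logistic regression trained on a synthetic labelled dataset (the data and parameters are illustrative, not
part of the original material):

    # Minimal classification sketch with scikit-learn's LogisticRegression.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Synthetic binary-classification data standing in for a real labelled dataset.
    X, y = make_classification(n_samples=200, n_features=4, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    model = LogisticRegression()
    model.fit(X_train, y_train)              # learn the input-to-label mapping
    print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))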

Regression

Regression, on the other hand, deals with predicting continuous target variables, which represent
numerical values. For example, predicting the price of a house based on its size, location, and amenities,
or forecasting the sales of a product. Regression algorithms learn to map the input features to a continuous
numerical value.

Here are some regression algorithms:

• Linear Regression

• Polynomial Regression

• Ridge Regression

• Lasso Regression

• Decision Tree

• Random Forest
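
A matching regression sketch, again assuming scikit-learn; the house-size and price numbers below are
made up purely for illustration:

    # Minimal regression sketch: predicting a continuous target with LinearRegression.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical data: house size (sq. ft) vs. price; the numbers are made up.
    sizes = np.array([[750], [900], [1100], [1500], [2000]])
    prices = np.array([150_000, 180_000, 215_000, 290_000, 380_000])

    model = LinearRegression().fit(sizes, prices)
    print(model.predict([[1300]]))   # estimated price for a 1300 sq. ft house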

Advantages of Supervised Machine Learning

• Supervised Learning models can have high accuracy as they are trained on labelled data.

• The process of decision-making in supervised learning models is often interpretable.

• Pre-trained models can often be reused, which saves time and resources compared with developing new
models from scratch.

Disadvantages of Supervised Machine Learning

• It may struggle with unseen or unexpected patterns that are not present in the training data.

• It can be time-consuming and costly, as it relies on labelled data.

• It may generalize poorly to new data.

Applications of Supervised Learning

Supervised learning is used in a wide variety of applications, including:

• Image classification: Identify objects, faces, and other features in images.

• Natural language processing: Extract information from text, such as sentiment, entities, and
relationships.
• Speech recognition: Convert spoken language into text.

• Recommendation systems: Make personalized recommendations to users.

• Predictive analytics: Predict outcomes, such as sales, customer churn, and stock prices.

• Medical diagnosis: Detect diseases and other medical conditions.

• Fraud detection: Identify fraudulent transactions.

• Autonomous vehicles: Recognize and respond to objects in the environment.

• Email spam detection: Classify emails as spam or not spam.

• Quality control in manufacturing: Inspect products for defects.

• Credit scoring: Assess the risk of a borrower defaulting on a loan.

• Gaming: Recognize characters, analyze player behavior, and create NPCs.

• Customer support: Automate customer support tasks.

• Weather forecasting: Make predictions for temperature, precipitation, and other meteorological
parameters.

• Sports analytics: Analyze player performance, make game predictions, and optimize strategies.

2. Unsupervised Machine Learning

Unsupervised learning is a type of machine learning technique in which an algorithm discovers patterns
and relationships using unlabeled data. Unlike supervised learning, unsupervised learning doesn’t involve
providing the algorithm with labeled target outputs. The primary goal of unsupervised learning is often to
discover hidden patterns, similarities, or clusters within the data, which can then be used for purposes
such as data exploration, visualization, dimensionality reduction, and more.


Let’s understand it with the help of an example.


Example: Consider a dataset containing information about purchases made from a shop. Through
clustering, the algorithm can group customers with similar purchasing behaviour, revealing customer
segments without predefined labels. This kind of information can help businesses target customers as
well as identify outliers.

There are two main categories of unsupervised learning that are mentioned below:

• Clustering

• Association

Clustering

Clustering is the process of grouping data points into clusters based on their similarity. This technique is
useful for identifying patterns and relationships in data without the need for labeled examples.

Here are some clustering and related unsupervised algorithms:

• K-Means Clustering algorithm

• Mean-shift algorithm

• DBSCAN Algorithm

• Principal Component Analysis

• Independent Component Analysis
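
As a minimal sketch of clustering with one algorithm from the list above, here is K-Means on synthetic
unlabelled data (scikit-learn assumed; all parameters are illustrative):

    # Minimal clustering sketch with K-Means on unlabelled data.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # labels are ignored
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(kmeans.labels_[:10])       # cluster assignment of the first 10 points
    print(kmeans.cluster_centers_)   # coordinates of the discovered cluster centres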

Association

Association rule learning is a technique for discovering relationships between items in a dataset. It
identifies rules that indicate the presence of one item implies the presence of another item with a specific
probability.

Here are some association rule learning algorithms:

• Apriori Algorithm

• Eclat

• FP-growth Algorithm
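
A sketch of association rule mining with the Apriori algorithm, assuming the third-party mlxtend library
is installed (pip install mlxtend); the tiny basket data is made up for illustration:

    # Association rule mining sketch (assumes mlxtend is installed).
    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # One-hot encoded market-basket data: rows are transactions, columns are items.
    baskets = pd.DataFrame(
        [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 0, 0]],
        columns=["bread", "butter", "milk"],
    ).astype(bool)

    frequent = apriori(baskets, min_support=0.5, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
    print(rules[["antecedents", "consequents", "support", "confidence"]])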

Advantages of Unsupervised Machine Learning

• It helps to discover hidden patterns and various relationships between the data.

• Used for tasks such as customer segmentation, anomaly detection, and data exploration.

• It does not require labeled data and reduces the effort of data labeling.

Disadvantages of Unsupervised Machine Learning

• Without using labels, it may be difficult to predict the quality of the model’s output.

• Cluster Interpretability may not be clear and may not have meaningful interpretations.
• Extracting meaningful features from raw data typically requires additional techniques such as
autoencoders and dimensionality reduction.

Applications of Unsupervised Learning

Here are some common applications of unsupervised learning:

• Clustering: Group similar data points into clusters.

• Anomaly detection: Identify outliers or anomalies in data.

• Dimensionality reduction: Reduce the dimensionality of data while preserving its essential
information.

• Recommendation systems: Suggest products, movies, or content to users based on their historical
behavior or preferences.

• Topic modeling: Discover latent topics within a collection of documents.

• Density estimation: Estimate the probability density function of data.

• Image and video compression: Reduce the amount of storage required for multimedia content.

• Data preprocessing: Help with data preprocessing tasks such as data cleaning, imputation of
missing values, and data scaling.

• Market basket analysis: Discover associations between products.

• Genomic data analysis: Identify patterns or group genes with similar expression profiles.

• Image segmentation: Segment images into meaningful regions.

• Community detection in social networks: Identify communities or groups of individuals with similar
interests or connections.

• Customer behavior analysis: Uncover patterns and insights for better marketing and product
recommendations.

• Content recommendation: Classify and tag content to make it easier to recommend similar items
to users.

• Exploratory data analysis (EDA): Explore data and gain insights before defining specific tasks.

3. Semi-Supervised Learning

Semi-supervised learning is a machine learning approach that sits between supervised and unsupervised
learning: it uses both labelled and unlabelled data. It is particularly useful when obtaining labelled data is
costly, time-consuming, or resource-intensive, for example when labelling requires specialist skills and
relevant resources.

We use these techniques when only a small portion of the data is labelled and the large remainder is
unlabelled. We can use unsupervised techniques to predict labels for the unlabelled portion and then feed
those labels to supervised techniques. This is especially applicable to image datasets, where usually not
all images are labelled.


Let’s understand it with the help of an example.

Example: Consider building a language translation model: obtaining labelled translations for every
sentence pair can be resource-intensive. Semi-supervised learning allows the model to learn from both
labelled and unlabelled sentence pairs, making it more accurate. This technique has led to significant
improvements in the quality of machine translation services.

Types of Semi-Supervised Learning Methods

There are a number of different semi-supervised learning methods each with its own characteristics. Some
of the most common ones include:

• Graph-based semi-supervised learning: This approach uses a graph to represent the relationships
between the data points. The graph is then used to propagate labels from the labeled data points
to the unlabeled data points.

• Label propagation: This approach iteratively propagates labels from the labeled data points to the
unlabeled data points, based on the similarities between the data points.

• Co-training: This approach trains two different machine learning models on different views (feature
subsets) of the data. Each model then labels unlabelled examples for the other.
• Self-training: This approach trains a machine learning model on the labeled data and then uses
the model to predict labels for the unlabeled data. The model is then retrained on the labeled
data and the predicted labels for the unlabeled data.

• Generative adversarial networks (GANs): GANs are a type of deep learning algorithm that can be
used to generate synthetic data. GANs can be used to generate unlabeled data for semi-supervised
learning by training two neural networks, a generator and a discriminator.
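
As a minimal sketch of self-training, one of the methods above, scikit-learn's SelfTrainingClassifier treats
labels marked -1 as missing (the synthetic data and masking rate are illustrative):

    # Self-training sketch with scikit-learn; unlabelled points are marked -1.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.semi_supervised import SelfTrainingClassifier

    X, y = make_classification(n_samples=300, random_state=42)
    y_partial = y.copy()
    rng = np.random.default_rng(42)
    y_partial[rng.random(len(y)) < 0.8] = -1   # pretend 80% of the labels are missing

    model = SelfTrainingClassifier(LogisticRegression())
    model.fit(X, y_partial)                    # iteratively pseudo-labels the -1 entries
    print("Points labelled after self-training:", (model.transduction_ != -1).sum())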

Advantages of Semi-Supervised Machine Learning

• It leads to better generalization than supervised learning alone, as it uses both labeled and
unlabeled data.

• Can be applied to a wide range of data.

Disadvantages of Semi-Supervised Machine Learning

• Semi-supervised methods can be more complex to implement compared to other approaches.

• It still requires some labeled data that might not always be available or easy to obtain.

• Noisy or unrepresentative unlabelled data can degrade model performance.

Applications of Semi-Supervised Learning

Here are some common applications of semi-supervised learning:

• Image Classification and Object Recognition: Improve the accuracy of models by combining a small
set of labeled images with a larger set of unlabeled images.

• Natural Language Processing (NLP): Enhance the performance of language models and classifiers
by combining a small set of labeled text data with a vast amount of unlabeled text.

• Speech Recognition: Improve the accuracy of speech recognition by leveraging a limited amount
of transcribed speech data and a more extensive set of unlabelled audio.

• Recommendation Systems: Improve the accuracy of personalized recommendations by supplementing a
sparse set of user-item interactions (labeled data) with a wealth of unlabeled user behavior data.

• Healthcare and Medical Imaging: Enhance medical image analysis by utilizing a small set of labeled
medical images alongside a larger set of unlabeled images.

4. Reinforcement Machine Learning

Reinforcement learning is a method in which an agent interacts with its environment by producing actions
and discovering errors. Trial, error, and delayed reward are the most relevant characteristics of
reinforcement learning. In this technique, the model keeps improving its performance using reward
feedback to learn the behavior or pattern. These algorithms are tuned to a particular problem, e.g.,
Google's self-driving car, or AlphaGo, where a bot competes with humans and even itself to become a
better and better Go player. Each time the agent acts, it learns and adds the new experience to its
knowledge, which becomes its training data. So the more it learns, the better trained, and hence more
experienced, it becomes.
Here are some of the most common reinforcement learning algorithms:

• Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function, which maps state-action
pairs to values. The Q-function estimates the expected reward of taking a particular action in a given
state.

• SARSA (State-Action-Reward-State-Action): SARSA is another model-free RL algorithm that learns a
Q-function. However, unlike Q-learning, SARSA updates the Q-function for the action that was actually
taken, rather than the optimal action.

• Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep learning. Deep Q-
learning uses a neural network to represent the Q-function, which allows it to learn complex
relationships between states and actions.
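
To make the Q-learning update concrete, here is a toy tabular sketch on a hypothetical 5-state chain
environment; the environment, rewards, and hyperparameters are entirely made up for illustration:

    # Toy tabular Q-learning on a 5-state chain: move right to reach a reward.
    import numpy as np

    n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
    Q = np.zeros((n_states, n_actions))
    alpha, gamma, epsilon = 0.1, 0.9, 0.2
    rng = np.random.default_rng(0)

    for episode in range(500):
        s = 0
        while s != n_states - 1:          # state 4 is the goal
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the goal
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next

    print(Q)   # the learned values favour action 1 (right) in every state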


Let’s understand it with the help of examples.

Example: Consider training an AI agent to play a game like chess. The agent explores different moves and
receives positive or negative feedback based on the outcome. Reinforcement learning also finds
applications in robotics and control, where agents learn to perform tasks by interacting with their
surroundings.

Types of Reinforcement Machine Learning

There are two main types of reinforcement learning:

Positive reinforcement

• Rewards the agent for taking a desired action.

• Encourages the agent to repeat the behavior.

• Examples: Giving a treat to a dog for sitting, providing a point in a game for a correct answer.

Negative reinforcement

• Removes an undesirable stimulus to encourage a desired behavior.

• Encourages the agent to repeat the behavior that removes the unpleasant stimulus.


• Examples: Turning off a loud buzzer when a lever is pressed, avoiding a penalty by completing a
task.

Advantages of Reinforcement Machine Learning

• Its autonomous decision-making is well suited to tasks that require learning a sequence of decisions,
such as robotics and game-playing.

• This technique is preferred for achieving long-term results that are otherwise very difficult to achieve.

• It is used to solve complex problems that cannot be solved by conventional techniques.

Disadvantages of Reinforcement Machine Learning

• Training Reinforcement Learning agents can be computationally expensive and time-consuming.

• Reinforcement learning is not preferable for solving simple problems.

• It needs a lot of data and a lot of computation, which makes it impractical and costly.

Applications of Reinforcement Machine Learning

Here are some applications of reinforcement learning:

• Game Playing: RL can teach agents to play games, even complex ones.

• Robotics: RL can teach robots to perform tasks autonomously.

• Autonomous Vehicles: RL can help self-driving cars navigate and make decisions.

• Recommendation Systems: RL can enhance recommendation algorithms by learning user preferences.

• Healthcare: RL can be used to optimize treatment plans and drug discovery.

• Natural Language Processing (NLP): RL can be used in dialogue systems and chatbots.

• Finance and Trading: RL can be used for algorithmic trading.

• Supply Chain and Inventory Management: RL can be used to optimize supply chain operations.

• Energy Management: RL can be used to optimize energy consumption.

• Game AI: RL can be used to create more intelligent and adaptive NPCs in video games.

• Adaptive Personal Assistants: RL can be used to improve personal assistants.

• Virtual Reality (VR) and Augmented Reality (AR): RL can be used to create immersive and
interactive experiences.

• Industrial Control: RL can be used to optimize industrial processes.

• Education: RL can be used to create adaptive learning systems.

• Agriculture: RL can be used to optimize agricultural operations.


Difference between Supervised and Unsupervised Learning
The distinction between supervised and unsupervised learning depends on whether the learning
algorithm uses pattern-class information. Supervised learning assumes the availability of a teacher or
supervisor who classifies the training examples, whereas unsupervised learning must identify the pattern-
class information as a part of the learning process.

Supervised learning algorithms utilize the information on the class membership of each training instance.
This information allows supervised learning algorithms to detect pattern misclassifications as feedback to
themselves. Unsupervised learning algorithms, by contrast, use unlabelled instances and process them
blindly or heuristically. Unsupervised learning algorithms often have less computational complexity and
less accuracy than supervised learning algorithms.

• Input data: Supervised learning uses known and labelled data as input; unsupervised learning uses
unknown (unlabelled) data as input.

• Computational complexity: Supervised learning has less computational complexity; unsupervised
learning is more computationally complex.

• Real-time: Supervised learning uses off-line analysis; unsupervised learning uses real-time analysis
of data.

• Number of classes: In supervised learning the number of classes is known; in unsupervised learning
the number of classes is not known.

• Accuracy of results: Supervised learning gives accurate and reliable results; unsupervised learning
gives moderately accurate and reliable results.

• Output data: In supervised learning the desired output is given; in unsupervised learning the desired
output is not given.

• Model: In supervised learning it is not possible to learn larger and more complex models than in
unsupervised learning; in unsupervised learning it is possible to learn larger and more complex models
than in supervised learning.

• Training data: In supervised learning, training data is used to infer the model; in unsupervised
learning, labelled training data is not used.

• Another name: Supervised learning is also called classification; unsupervised learning is also called
clustering.

• Test of model: In supervised learning we can test our model; in unsupervised learning we cannot test
our model.

• Example: Optical character recognition (supervised); finding a face in an image (unsupervised).

Most Common Types of Machine Learning Problems

• Regression: When the need is to predict numerical values, the problem is called a regression problem,
for example house price prediction. Algorithms: linear regression, k-NN, random forest, neural networks.

• Classification: When there is a need to classify the data into different classes, it is called a
classification problem. With two classes it is a binary classification problem; with multiple classes it
is multinomial classification. For example, classify whether a person is suffering from a disease or not,
or classify whether a stock is a “buy”, “sell”, or “hold”. Algorithms: logistic regression, random
forest, k-NN, gradient boosting classifier, neural networks.

• Clustering: When there is a need to categorize the data points into similar groupings or clusters, this
is called a clustering problem. Algorithms: K-Means, DBSCAN, hierarchical clustering, Gaussian mixture
models, BIRCH.

• Time-series forecasting: When there is a need to predict a number based on time-series data, it is
called a time-series forecasting problem. A time series is a sequence of numerical data points in
successive order, i.e., data recorded over particular time periods or intervals. For example, forecasting
the sales demand for a product based on previous sales figures, consumer sentiment, and weather; demand
forecasting is another example. Algorithms: ARIMA, SARIMA, LSTM, exponential smoothing, Prophet, GARCH,
TBATS, dynamic linear models.

• Anomaly detection: When there is a need to find the outliers in a dataset, the problem is called an
anomaly detection problem. In other words, if a given record can be classified as an outlier or an
unexpected event/item, this is anomaly detection; credit card fraud detection is a classic example.
Algorithms: Isolation Forest, minimum covariance determinant, local outlier factor, one-class SVM.

• Ranking: When there is a need to order the results of a request or query based on some criteria, the
problem is a ranking problem. We rank the outputs of query execution based on scores assigned to each
output by a ranking algorithm. Recommendation engines make use of ranking algorithms to recommend the
next items. Algorithms: bipartite ranking (Bipartite RankBoost, Bipartite RankSVM).

• Recommendation: When there is a need to recommend the “next item” to buy, the “next video” to watch,
or the “next song” to listen to, the problem is called a recommendation problem; the solutions to such
problems are called recommender systems. Algorithms: content-based and collaborative filtering methods.

• Data generation: When there is a need to generate data such as images, videos, articles, or posts, the
problem is called a data generation problem. Algorithms: generative adversarial networks (GANs), hidden
Markov models.

• Optimization: When there is a need to generate a set of outputs that optimize outcomes related to some
objective (an objective function), the problem is called an optimization problem. Algorithms: linear
programming methods, genetic programming.

Artificial Intelligence vs Machine Learning

Moving ahead, now let’s check out the basic differences between artificial intelligence and machine
learning.

The development of AI and ML has the potential to transform various industries and improve people’s lives
in many ways. AI systems can be used to diagnose diseases, detect fraud, analyze financial data, and
optimize manufacturing processes. ML algorithms can help to personalize content and services, improve
customer experiences, and even help to solve some of the world’s most pressing environmental challenges.
1. AI: The term “Artificial Intelligence” was originally used by John McCarthy in 1956, who also hosted
the first AI conference.
ML: The term “Machine Learning” was first used in 1952 by IBM computer scientist Arthur Samuel, a pioneer
in artificial intelligence and computer games.

2. AI: AI stands for Artificial Intelligence, where intelligence is defined as the ability to acquire and
apply knowledge.
ML: ML stands for Machine Learning, which is defined as the acquisition of knowledge or skill.

3. AI: AI is the broader family consisting of ML and DL as its components.
ML: Machine Learning is a subset of Artificial Intelligence.

4. AI: The aim is to increase the chance of success, not accuracy.
ML: The aim is to increase accuracy; it does not care about success.

5. AI: AI aims to develop an intelligent system capable of performing a variety of complex jobs,
including decision-making.
ML: Machine learning attempts to construct machines that can only accomplish the jobs for which they
have been trained.

6. AI: It works as a computer program that does smart work.
ML: Here, the system takes data and learns from the data.

7. AI: The goal is to simulate natural intelligence to solve complex problems.
ML: The goal is to learn from data on certain tasks to maximize the performance on those tasks.

8. AI: AI has a very broad variety of applications.
ML: The scope of machine learning is constrained.

9. AI: AI is decision-making.
ML: ML allows systems to learn new things from data.

10. AI: It is developing a system that mimics humans to solve problems.
ML: It involves creating self-learning algorithms.

11. AI: AI is a broader family consisting of ML and DL as its components.
ML: ML is a subset of AI.

12. AI: Three broad categories of AI are Artificial Narrow Intelligence (ANI), Artificial General
Intelligence (AGI), and Artificial Super Intelligence (ASI).
ML: Three broad categories of ML are Supervised Learning, Unsupervised Learning, and Reinforcement
Learning.

13. AI: AI can work with structured, semi-structured, and unstructured data.
ML: ML can work with only structured and semi-structured data.

14. AI: Key uses include Siri and customer service via chatbots, expert systems, machine translation
such as Google Translate, and intelligent humanoid robots such as Sophia.
ML: The most common uses include Facebook’s automatic friend suggestions, Google’s search algorithms,
banking fraud analysis, stock price forecasts, and online recommender systems.

15. AI: AI refers to the broad field of creating machines that can simulate human intelligence and
perform tasks such as understanding natural language, recognizing images and sounds, making decisions,
and solving complex problems.
ML: ML is a subset of AI that involves training algorithms on data to make predictions, decisions, and
recommendations.

16. AI: AI is a broad concept that includes various methods for creating intelligent machines, including
rule-based systems, expert systems, and machine learning algorithms. AI systems can be programmed to
follow specific rules, make logical inferences, or learn from data using ML.
ML: ML focuses on teaching machines how to learn from data without being explicitly programmed, using
algorithms such as neural networks, decision trees, and clustering.

17. AI: AI systems can be built using both structured and unstructured data, including text, images,
video, and audio. AI algorithms can work with data in a variety of formats, and they can analyze and
process data to extract meaningful insights.
ML: In contrast, ML algorithms require large amounts of structured data to learn and improve their
performance. The quality and quantity of the data used to train ML algorithms are critical factors in
determining the accuracy and effectiveness of the system.

18. AI: AI is a broader concept that encompasses many different applications, including robotics,
natural language processing, speech recognition, and autonomous vehicles. AI systems can be used to
solve complex problems in various fields, such as healthcare, finance, and transportation.
ML: ML, on the other hand, is primarily used for pattern recognition, predictive modeling, and
decision-making in fields such as marketing, fraud detection, and credit scoring.

19. AI: AI systems can be designed to work autonomously or with minimal human intervention, depending
on the complexity of the task. AI systems can make decisions and take actions based on the data and
rules provided to them.
ML: In contrast, ML algorithms require human involvement to set up, train, and optimize the system. ML
algorithms require the expertise of data scientists, engineers, and other professionals to design and
implement the system.

What are Neural Networks?

Neural networks mimic the basic functioning of the human brain and are inspired by how the human brain
interprets information. They solve various real-time tasks because of their ability to perform
computations quickly and respond fast.

An artificial neural network has a huge number of interconnected processing elements, also known as nodes.
These nodes are connected to other nodes via connection links. Each connection link carries a weight, and
these weights contain information about the input signal. Each iteration over each input updates these
weights. After all the data instances from the training dataset have been presented, the final weights of
the neural network, along with its architecture, constitute the trained neural network. This process is
called training a neural network. The trained network then solves the specific problem defined in the
problem statement.

Types of tasks that can be solved using an artificial neural network include Classification problems, Pattern
Matching, Data Clustering, etc.

Importance of Neural Networks

We use artificial neural networks because they learn very efficiently and adaptively. They have the
capability to learn “how” to solve a specific problem from the training data they receive. After learning,
the network can be used to solve that specific problem very quickly and efficiently with high accuracy.

Some real-life applications of neural networks include Air Traffic Control, Optical Character Recognition as
used by some scanning apps like Google Lens, Voice Recognition, etc.

What are Neural Networks Used For?

Neural networks are employed across various domains for:

• Identifying objects, faces, and understanding spoken language in applications like self-driving cars
and voice assistants.

• Analyzing and understanding human language, enabling sentiment analysis, chatbots, language
translation, and text generation.

• Diagnosing diseases from medical images, predicting patient outcomes, and drug discovery.
• Predicting stock prices, credit risk assessment, fraud detection, and algorithmic trading.

• Personalizing content and recommendations in e-commerce, streaming platforms, and social media.

• Powering robotics and autonomous vehicles by processing sensor data and making real-time
decisions.

• Enhancing game AI, generating realistic graphics, and creating immersive virtual environments.

• Monitoring and optimizing manufacturing processes, predictive maintenance, and quality control.

• Analyzing complex datasets, simulating scientific phenomena, and aiding in research across
disciplines.

• Generating music, art, and other creative content.

Types of Neural Networks in Machine Learning

Explore different kinds of neural networks in machine learning in this section:

Artificial Neural Network (ANN)

ANN is also known as an artificial neural network. It is a feed-forward neural network because the inputs
are sent in the forward direction. It can also contain hidden layers which can make the model even denser.
They have a fixed length as specified by the programmer. It is used for Textual Data or Tabular Data. A
widely used real-life application is Facial Recognition. It is comparatively less powerful than CNN and RNN.

Convolutional Neural Network (CNN)

CNNs are mainly used for image data and computer vision. Real-life applications include object detection
in autonomous vehicles. A CNN contains a combination of convolutional layers and neurons, and is more
powerful than both ANNs and RNNs.

Recurrent Neural Network (RNN)

RNNs are used to process and interpret time-series data. In this type of model, the output from a
processing node is fed back into nodes in the same or previous layers. The best-known type of RNN is the
LSTM (Long Short-Term Memory) network.

Now that we know the basics of neural networks, it is their learning capability that makes them so
interesting.
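
To make the feed-forward idea above concrete, here is a minimal sketch using scikit-learn's
MLPClassifier; the dataset and layer sizes are illustrative choices, not part of the original material:

    # Minimal feed-forward neural network (ANN) sketch with MLPClassifier.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)              # 8x8 digit images as tabular data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=42)
    net.fit(X_train, y_train)                        # weights are updated each iteration
    print("Test accuracy:", net.score(X_test, y_test))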

What is Probability?

Probability can be defined as the ratio of the number of favorable outcomes to the total number of
outcomes of an event. For an experiment with n possible outcomes, the number of favorable outcomes can be
denoted by x. The formula to calculate the probability of an event is as follows.

Probability (Event) = Favorable Outcomes/Total Outcomes = x/n

Joint Probability: The joint probability is the probability of two random events occurring simultaneously.
For independent events A and B:
P (A ∩ B) = P(A) · P(B)

where:

P (A ∩ B) = probability of both events A and B occurring

P (A) = probability of event A

P (B) = probability of event B
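
As a quick illustration of the joint-probability formula for independent events, the sketch below compares
the formula against a simulation; the two-dice scenario is illustrative:

    # Joint probability of two independent events: both dice showing a six.
    # By the formula: P(A ∩ B) = P(A) * P(B) = (1/6) * (1/6) = 1/36.
    import random

    p_formula = (1 / 6) * (1 / 6)
    trials = 100_000
    hits = sum(1 for _ in range(trials)
               if random.randint(1, 6) == 6 and random.randint(1, 6) == 6)
    print(p_formula, hits / trials)   # the simulated frequency approaches 1/36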

What is Bayes’ Theorem?

Bayes’ theorem (also known as the Bayes Rule or Bayes Law) is used to determine the conditional
probability of event A when event B has already occurred.

The general statement of Bayes’ theorem is: “The conditional probability of an event A, given the
occurrence of another event B, is equal to the product of the probability of B given A and the probability
of A, divided by the probability of event B,” i.e.

P(A | B) = P(B | A) · P(A) / P(B)
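
A short worked example of this formula, with made-up numbers for a diagnostic test (the 1%, 95%, and 5%
figures are purely illustrative):

    # Worked Bayes' theorem example: P(disease | positive test).
    p_disease = 0.01            # prior P(A)
    p_pos_given_disease = 0.95  # P(B | A): test sensitivity
    p_pos_given_healthy = 0.05  # false-positive rate

    # Total probability of a positive test, P(B)
    p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

    # Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
    p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
    print(round(p_disease_given_pos, 3))   # about 0.161 despite the 95% sensitivity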

What is Vector Calculus?

Vector Calculus is a branch of mathematics that deals with the operations of calculus, i.e.,
differentiation and integration, applied to vector fields, usually in 3-dimensional physical space, also
called Euclidean space. The applicability of vector calculus extends to partial differentiation and
multiple integration. A vector field assigns to each point in space a quantity that has both magnitude and
direction; vector fields are nothing but vector functions. Vector calculus is also known as vector
analysis.

Vector fields are vector functions whose domain and range need not be dimensionally related to each other.
Vector calculus corresponds to multivariable calculus, which deals with partial differentiation and
multiple integration. The differentiation and integration of vectors is done for quantities in 3D physical
space, represented as R3; n-dimensional space is represented as Rn.

Vector Calculus Definition

Vector calculus, also known as vector analysis or vector differential calculus, is a branch of mathematics
that deals with vector fields and the differentiation and integration of vector functions

Vector Calculus often called Vector Analysis deals with vector quantities i.e. the quantities that have both
magnitude as well as direction. Since we know that Vector Calculus deals
with differentiation and integration of functions, there are three types of integrals dealt with in Vector
Calculus that are

• Line Integral

• Surface Integral

• Volume Integral

Let’s learn about these integrals in detail.

Line Integral

A line integral in mathematics is the integration of a function along a curve. The function can be scalar-
or vector-valued; its line integral is obtained by summing the values of the field at all points on the
curve, weighted by some scalar function on the curve. The line integral is also called a path integral and
is represented by Φ = ∫L f. Line integrals have applications in physics: for example, the work done by a
force along a path is given as W = ∫L F(s) · ds, because work done is the product of force and distance
covered.

Surface Integral

A surface integral in mathematics is the integration of a function over a whole region or surface that
need not be flat. The surface is treated as made up of small elements, and the integration sums the
contributions of all these elements over the surface; a surface integral is evaluated as a double
integral. Surface integrals have applications in electromagnetism and many other branches of physics where
a vector function is spread over a surface. A surface integral is represented as ∬S f(x, y) dA.

Volume Integral

A volume integral, also known as a triple integral, is a mathematical concept used in calculus and vector
calculus to calculate the volume of a three-dimensional region within a space. It is an extension of the
concept of a definite integral in one dimension to three dimensions.

Mathematically, the volume integral of a scalar function f(x, y, z) over a region R in three-dimensional
space is denoted as:

∭Rf(x,y,z) dV
where

• dV represents an infinitesimal volume element, and

• the integral is taken over the region R.

Operation in Vector

The different operations performed with vector quantities are tabulated below with their notation and
illustration.

Operation               Notation          Illustration
Vector addition         r1 + r2           Addition of two vectors gives a vector
Scalar multiplication   q · r1            Multiplying a vector r1 by a scalar q gives a vector
Dot product             r1 · r2           Dot product of two vectors gives a scalar
Cross product           r1 ⨯ r2           Cross product of two vectors gives a vector
Scalar triple product   r1 · (r2 ⨯ r3)    Dot product of a vector with the cross product of two vectors
Vector triple product   r1 ⨯ (r2 ⨯ r3)    Cross product of a vector with the cross product of two vectors
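
The operations in the table above can be checked numerically; here is a sketch using NumPy, with the
standard basis vectors chosen purely for illustration:

    # The vector operations from the table above, computed with NumPy.
    import numpy as np

    r1, r2, r3 = np.array([1, 0, 0]), np.array([0, 1, 0]), np.array([0, 0, 1])

    print(r1 + r2)                         # vector addition -> vector
    print(3 * r1)                          # scalar multiplication -> vector
    print(np.dot(r1, r2))                  # dot product -> scalar
    print(np.cross(r1, r2))                # cross product -> vector
    print(np.dot(r1, np.cross(r2, r3)))    # scalar triple product -> scalar
    print(np.cross(r1, np.cross(r2, r3)))  # vector triple product -> vector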

Vector Calculus Applications in the real world

• Navigation

• Sports

• Partial differential equation

• Three-dimensional geometry

• Used in heat transfer

Decision theory
Decision theory is an interdisciplinary field that deals with the logic and methodology of making choices,
particularly under conditions of uncertainty. It is a branch of applied probability theory and analytic
philosophy that involves assigning probabilities to various factors and numerical consequences to
outcomes.

What is Decision Theory?

At its core, decision theory is the study of choices under uncertainty. It seeks to identify the optimal action
from a set of possible actions by evaluating the outcomes of each decision. There are two primary
branches of decision theory:

• Normative decision theory focuses on identifying the optimal decision, assuming the decision-
maker is rational and has complete information.

• Descriptive decision theory examines how decisions are actually made in practice, often dealing
with cognitive limitations and psychological biases.

To choose courses of action that maximize predicted benefit, decision theory blends probability (the
likelihood of outcomes) and utility (the worth of outcomes). In this way, AI systems imitate human
decision-making, even in the face of erroneous or inadequate evidence.

AI systems often use decision theory in two primary ways: supervised learning and reinforcement
learning.

1. Supervised Learning

In supervised learning, AI systems are trained using labeled data to make predictions or decisions. Decision
theory helps optimize the classification or regression tasks by evaluating the trade-offs between false
positives, false negatives, and other outcomes based on the utility of each result.

For instance, in medical diagnosis, the utility of correctly identifying a disease may be far higher than the
cost of a false alarm, leading the AI to favor sensitivity over specificity.

2. Reinforcement Learning

Reinforcement learning (RL) is one of the key areas where decision theory shines in AI. In RL, agents learn
to make decisions through trial and error, receiving feedback from their environment in the form of
rewards or penalties.

• Markov Decision Processes (MDPs) are a common formalism for decision-making in reinforcement learning,
where decision theory principles help in navigating uncertainty and maximizing long-term rewards.

• In MDPs, the agent needs to choose actions that optimize future rewards, which aligns with the
decision-theoretic concept of maximizing expected utility.

Key Components of Decision Theory

1. Agents and Actions: In decision theory, an agent is an entity that makes decisions. The agent has
a set of possible actions or decisions to choose from.
2. States of the World: These represent the possible conditions or scenarios that may affect the
outcome of the agent’s decision. The agent often has incomplete knowledge about the current or
future states of the world.

3. Outcomes and Consequences: Every decision leads to an outcome. Outcomes can be desirable,
neutral, or undesirable, depending on the goals of the agent.

4. Probabilities: Since outcomes are often uncertain, decision theory involves assigning probabilities
to different states or outcomes based on available data.

5. Utility Function: This is a measure of the desirability of an outcome. A utility function quantifies
how much an agent values a specific result, helping in ranking outcomes to guide decisions.

6. Decision Rules: These are the guidelines the agent follows to choose the best action. Examples
include the Maximization of Expected Utility (MEU), where an agent selects the action that offers
the highest expected utility.
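
To make the Maximization of Expected Utility (MEU) rule concrete, here is a small sketch with
hypothetical states, probabilities, and utilities; all numbers are made up for illustration:

    # Sketch of the Maximization of Expected Utility (MEU) decision rule.
    import numpy as np

    states = ["pedestrian crosses", "road clear"]
    p_states = np.array([0.3, 0.7])          # probabilities of the states of the world

    # utilities[action][state]: desirability of each outcome (hypothetical values)
    utilities = {
        "brake":    np.array([100, 40]),     # safe either way, but slows the ride
        "continue": np.array([-1000, 60]),   # disastrous if a pedestrian crosses
    }

    expected = {a: float(p_states @ u) for a, u in utilities.items()}
    best = max(expected, key=expected.get)   # MEU: pick the highest expected utility
    print(expected, "->", best)              # brake: 58.0, continue: -258.0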

Application of Decision Theory in Waymo's Autonomous Vehicles

Waymo uses decision theory to help its vehicles make safe, rational choices under uncertainty. The AI
system processes data from a variety of sensors (LIDAR, radar, and cameras) to assess the vehicle’s
environment. By assigning probabilities to different events (such as a pedestrian stepping into the road or
a nearby vehicle swerving), the system evaluates multiple actions, such as slowing down, changing lanes,
or stopping, and selects the one that maximizes safety while minimizing disruptions.

• States of the world: Traffic signals, vehicle speeds, weather conditions, and positions of other
road users.

• Actions: Speed up, slow down, change lanes, stop.

• Probabilities: The likelihood of a pedestrian crossing, a vehicle stopping suddenly, or an emergency
braking event.

• Utility: Prioritizing passenger safety and compliance with traffic laws while maintaining a smooth
and efficient ride.

Challenges in Applying Decision Theory to AI

While decision theory provides a robust framework for decision-making in AI, there are several
challenges associated with its implementation:

1. Complexity and Computation: Calculating optimal decisions based on large-scale data with many
variables is computationally expensive. Approximation algorithms and heuristics are often used in
practice to overcome these limitations.

2. Uncertainty and Incomplete Information: In many real-world scenarios, the probabilities of various
outcomes are not known. This requires AI systems to handle uncertainty with limited data and make
reasonable decisions based on estimations.
3. Ethics and Value Alignment: In applications like healthcare, autonomous vehicles, or defense,
decision theory raises ethical concerns. How should AI systems weigh human lives versus property
damage, or long-term environmental impact versus short-term gains?

Information Theory

Information theory is a mathematical framework for quantifying information, data compression, and
transmission. In machine learning, information theory provides powerful tools for analyzing and improving
algorithms.
Key Concepts of Information Theory

1. Entropy

Entropy measures the uncertainty or unpredictability of a random variable. In machine learning, entropy
quantifies the amount of information required to describe a dataset.

• Definition: For a discrete random variable X with possible values x1, x2, ..., xn and a probability
mass function P(X), the entropy H(X) is defined as:

H(X) = − Σi P(xi) log2 P(xi)

• Interpretation: Higher entropy indicates greater unpredictability, while lower entropy indicates
more predictability.
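
A minimal sketch of this entropy formula in Python; the example distributions are illustrative:

    # Computing entropy H(X) = -sum(p * log2(p)) for a discrete distribution.
    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                    # terms with p = 0 contribute nothing
        return -np.sum(p * np.log2(p))

    print(entropy([0.5, 0.5]))          # 1.0 bit: a fair coin is maximally unpredictable
    print(entropy([0.9, 0.1]))          # ~0.469 bits: a biased coin is more predictable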

2. Mutual Information

Mutual information measures the amount of information obtained about one random variable through another
random variable; it quantifies the dependency between the variables. Formally,
I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X).
Applications of Information Theory in Machine Learning

1. Feature Selection

Feature selection aims to identify the most relevant features for building a predictive model. Information-
theoretic measures like mutual information can quantify the relevance of each feature with respect to the
target variable.

• Method: Calculate the mutual information between each feature and the target variable. Select
features with the highest mutual information values.

• Benefit: Helps in reducing dimensionality and improving model performance by removing irrelevant or
redundant features.
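
As a minimal sketch of this method, assuming scikit-learn is available, the snippet below scores the Iris
features by mutual information with the class label:

    # Ranking features by mutual information with the target.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import mutual_info_classif

    X, y = load_iris(return_X_y=True)
    scores = mutual_info_classif(X, y, random_state=42)
    for name, score in zip(load_iris().feature_names, scores):
        print(f"{name}: {score:.3f}")   # higher score = more informative feature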

2. Decision Trees

Decision trees use entropy and information gain to split nodes and build a tree structure. Information gain,
based on entropy, measures the reduction in uncertainty after splitting a node.
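
A small sketch of computing information gain from entropies, using plain NumPy; the toy labels and the
split are illustrative:

    # Information gain of a candidate split: reduction in entropy after splitting.
    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(parent, left, right):
        n = len(parent)
        weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        return entropy(parent) - weighted

    parent = np.array([0, 0, 0, 1, 1, 1])
    print(information_gain(parent, parent[:3], parent[3:]))   # 1.0: a perfect split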

3. Regularization and Model Selection

KL (Kullback-Leibler) divergence is used in regularization techniques like variational inference in
Bayesian neural networks.
By minimizing KL divergence between the approximate and true posterior distributions, we achieve better
model regularization.

• Example: Variational Autoencoders (VAEs) use KL divergence to regularize the latent space
distribution, ensuring it follows a standard normal distribution.
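
For concreteness, here is a direct computation of the KL divergence between two discrete distributions;
the distributions themselves are illustrative:

    # KL divergence D(P || Q) between two discrete distributions.
    import numpy as np

    def kl_divergence(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0                      # terms with p = 0 contribute nothing
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    p = [0.6, 0.3, 0.1]
    q = [1 / 3, 1 / 3, 1 / 3]             # e.g. a uniform "target" distribution
    print(kl_divergence(p, q))            # > 0; equals 0 only when p == q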
4. Information Bottleneck

The information bottleneck method aims to find a compressed representation of the input data that
retains maximal information about the output.

• Objective: Maximize mutual information between the compressed representation and the output
while minimizing mutual information between the input and the compressed representation.

• Applications: Used in deep learning for learning efficient representations.


Algorithms for Generative Machine Learning

Generative algorithms are designed to model the joint probability distribution of the input features and
labels. Their goal is to learn the underlying data distribution so that fresh samples can be created.

Learning the Data Distribution

To capture the statistical characteristics of the full dataset, generative models use a variety of strategies,
such as Gaussian Mixture Models (GMMs) or Hidden Markov Models (HMMs). The combined probability
distribution is modeled to give generative algorithms a comprehensive grasp of the data.

Producing New Samples

After learning the distribution, generative models can create artificial samples that mirror the training set.
They are useful for jobs like text generation, where the model picks up the grammar and creates logical
text sequences.

Applications of Generative Machine Learning

• Text Generation and Language Modelling: Recurrent Neural Networks (RNNs) and Transformers
are two examples of generative models that have excelled in text generation tasks. They create
fresh, meaningful sequences by learning the statistical patterns in text data.

• Image and Video Synthesis: The discipline of image synthesis has undergone a revolution thanks to
Generative Adversarial Networks (GANs). GANs create convincing, lifelike visuals by pitting a generator
against a discriminator. They are useful for creating virtual characters, deepfake videos, and artwork.

Because they model the entire data distribution, generative models are useful for addressing missing or
incomplete data. They might have trouble with discrimination tasks in complicated datasets, though. They
might not be particularly good at distinguishing between classes or categories because their concentration
is on modeling the total data distribution.

Generative methods aim to model the joint probability distribution of the input features (X) and the
related class labels (Y) and use it to create new data. They learn the probability distribution for each
class and sample from the learned distribution to produce new data points. Additionally, they employ
Bayes’ theorem to estimate the conditional probability of a class given the input features. Gaussian
Mixture Models (GMMs) and Hidden Markov Models (HMMs) are a few examples of generative algorithms.

Mathematical Intuitions of Generative Algorithms

The goal of generative algorithms is to model P(X, Y), the joint probability distribution of the input data (X) and the accompanying class labels (Y). Generative algorithms can produce fresh samples and learn about the underlying data by estimating this joint distribution. Estimating the
prior probability of each class, P(Y), as well as the class-conditional probability distribution, P(X|Y), is
important to the mathematical reasoning behind generative algorithms. Utilizing methods like maximum
likelihood estimation (MLE) or maximum a posteriori (MAP) estimation, these estimations can be derived.
Once these probabilities have been learned, the posterior probability of the class given the input features,
P(Y|X), is computed using Bayes’ theorem. It is possible to categorize new data points using this posterior
probability.
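This recipe (estimate P(Y) and P(X|Y), then apply Bayes' theorem to obtain P(Y|X)) is essentially what a Gaussian Naive Bayes classifier does; a brief scikit-learn sketch:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

model = GaussianNB().fit(X, y)     # estimates P(Y) and a Gaussian P(X|Y) per class
print(model.class_prior_)          # learned prior probabilities P(Y)
print(model.predict_proba(X[:2]))  # posterior probabilities P(Y|X) via Bayes' theorem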

Algorithms for Discriminative Machine Learning:

Discriminative algorithms are primarily concerned with simulating the conditional probability distribution
of the output labels given the input features. Their goal is to understand the line of judgment that
delineates various classes or categories.

Learning the Decision Boundary

Discriminative models, such as Logistic Regression, Support Vector Machines (SVMs), and Neural Networks, learn the decision boundary that best separates the various classes in the data. They are trained on the input features and their associated labels to produce predictions.

Applications of Discriminative Algorithms

• Image Classification: Discriminative algorithms, particularly Convolutional Neural Networks (CNNs), have revolutionized image classification. CNNs can accurately classify images into many categories and extract useful characteristics from them, enabling applications like object recognition and autonomous driving.

• Sentiment Analysis: Discriminative models perform exceptionally well in tasks involving sentiment
analysis, where the goal is to ascertain the sentiment of text data. These models make it possible
for applications like sentiment analysis in social media or customer feedback analysis by teaching
the link between text elements and sentiment labels.

Discriminative models excel at tasks that require a clear distinction between classes or categories. They perform incredibly well in categorization problems by concentrating on the decision boundary. However, as they rely on labeled samples for training, they can struggle with data that is incomplete or missing.

Discriminative algorithms are designed to directly represent the decision boundary rather than implicitly
modeling the underlying probability distribution. In light of the input features, they concentrate on
estimating the conditional probability of the class label. The classes in the input feature space are divided
by a decision boundary learned by these algorithms. Support vector machines (SVMs), neural networks,
and logistic regression are a few examples of discriminative algorithms. Discriminative models are
frequently utilized when the decision boundary is complex or when there is a lot of training data because
they typically perform well in classification tasks.

Mathematical Intuitions

Discriminative algorithms seek to directly represent the line where two classes diverge without explicitly
modeling the probability distribution that underlies that line. They concentrate on estimating the
conditional probability of the class label given the input features, represented as P(Y|X), rather than
calculating the joint distribution. Learning the variables or weights that specify the decision boundary is
essential to understanding the mathematical intuition behind discriminative algorithms. The use of
optimization techniques like gradient descent and maximum likelihood estimation is common in this
learning process. The objective is to identify the parameters that maximize the likelihood of the observed
data given the model while minimizing the classification error. Discriminative algorithms can instantly
categorize fresh data points after learning the parameters by calculating the conditional probability P(Y|X),
then selecting the class label with the highest probability.

Difference Between Generative and Discriminative Machine Learning Algorithms

Params | Generative Algorithm | Discriminative Algorithm

Objective | Models the joint probability distribution of the input features and labels. | Models the conditional probability distribution of labels given the input attributes.

Methodology | Creates new samples by learning the distribution of the underlying data. | Learns the decision boundary that distinguishes the various classes or categories.

Application | Generative tasks such as text generation and image synthesis. | Tasks such as sentiment analysis and image categorization.

Strength | Effective with inadequate or missing data. | Excellent at distinguishing between classes or categories.

Weakness | May have trouble distinguishing different classes in large datasets. | Less useful when dealing with incomplete or missing data.
Generative Models Explained

Generative models are a cornerstone in the world of artificial intelligence (AI). Their primary function is to
understand and capture the underlying patterns or distributions from a given set of data. Once these
patterns are learned, the model can then generate new data that shares similar characteristics with the
original dataset.

Imagine you're teaching a child to draw animals. After showing them several pictures of different animals,
the child begins to understand the general features of each animal. Given some time, the child might draw
an animal they've never seen before, combining features they've learned. This is analogous to how a
generative model operates: it learns from the data it's exposed to and then creates something new based
on that knowledge.

The distinction between generative and discriminative models is fundamental in machine learning:

Generative models: These models focus on understanding how the data is generated. They aim to learn
the distribution of the data itself. For instance, if we're looking at pictures of cats and dogs, a generative
model would try to understand what makes a cat look like a cat and a dog look like a dog. It would then
be able to generate new images that resemble either cats or dogs.

Discriminative models: These models, on the other hand, focus on distinguishing between different types
of data. They don't necessarily learn or understand how the data is generated; instead, they learn the
boundaries that separate one class of data from another. Using the same example of cats and dogs, a
discriminative model would learn to tell the difference between the two, but it wouldn't necessarily be
able to generate a new image of a cat or dog on its own.

In the realm of AI, generative models play a pivotal role in tasks that require the creation of new content.
This could be in the form of synthesizing realistic human faces, composing music, or even generating
textual content. Their ability to "dream up" new data makes them invaluable in scenarios where original
content is needed, or where the augmentation of existing datasets is beneficial.

In essence, while discriminative models excel at classification tasks, generative models shine in their ability
to create. This creative prowess, combined with their deep understanding of data distributions, positions
generative models as a powerful tool in the AI toolkit.

Types of Generative Models

Generative models come in various forms, each with its unique approach to understanding and generating
data. Here's a more comprehensive list of some of the most prominent types:

• Bayesian networks. These are graphical models that represent the probabilistic relationships
among a set of variables. They're particularly useful in scenarios where understanding causal
relationships is crucial. For example, in medical diagnosis, a Bayesian network might help
determine the likelihood of a disease given a set of symptoms.

• Diffusion models. In generative AI, these models create data by gradually corrupting training samples with noise and then learning to reverse that process step by step, producing new samples. They underpin many modern image-generation systems.

• Generative Adversarial Networks (GANs). GANs consist of two neural networks, the generator
and the discriminator, that are trained together. The generator tries to produce data, while the
discriminator attempts to distinguish between real and generated data. Over time, the generator
becomes so good that the discriminator can't tell the difference. GANs are popular in image
generation tasks, such as creating realistic human faces or artworks.

• Variational Autoencoders (VAEs). VAEs are a type of autoencoder that produces a compressed
representation of input data, then decodes it to generate new data. They're often used in tasks
like image denoising or generating new images that share characteristics with the input data.

• Restricted Boltzmann Machines (RBMs). RBMs are neural networks with two layers that can learn
a probability distribution over its set of inputs. They've been used in recommendation systems,
like suggesting movies on streaming platforms based on user preferences.

• Pixel Recurrent Neural Networks (PixelRNNs). These models generate images pixel by pixel, using
the context of previous pixels to predict the next one. They're particularly useful in tasks where
the sequential generation of data is crucial, like drawing an image line by line.

• Markov chains. These are models that predict future states based solely on the current state,
without considering the states that preceded it. They're often used in text generation, where the
next word in a sentence is predicted based on the current word.

• Normalizing flows. These are a series of invertible transformations applied to simple probability
distributions to produce more complex distributions. They're useful in tasks where understanding
the transformation of data is crucial, like in financial modeling.
Real-World Use Cases of Generative Models

Generative models have penetrated mainstream consumption, revolutionizing the way we interact with
technology and experience content, for example:

• Art creation. Artists and musicians are using generative models to create new pieces of art or
compositions, based on styles they feed into the model. For example, Midjourney is a very
popular tool that is used to generate artwork.

• Drug discovery. Scientists can use generative models to predict molecular structures for new
potential drugs.

• Content creation. Website owners leverage generative models to speed up the content creation process. For example, HubSpot's AI content writer helps marketers generate blog posts, landing page copy and social media posts.

• Video games. Game designers use generative models to create diverse and unpredictable game
environments or characters.

Linear Regression in Machine learning


Linear regression is a supervised machine-learning algorithm that learns from labelled datasets and maps the data points to the most optimized linear function, which can then be used for prediction on new datasets.

Regression predicts continuous output variables based on independent input variables, for example the prediction of house prices based on parameters such as house age, distance from the main road, location, and area.

Linear regression is a type of supervised machine learning algorithm that computes the linear
relationship between the dependent variable and one or more independent features by fitting a
linear equation to observed data.

Types of Linear Regression

There are two main types of linear regression:

Simple Linear Regression

This is the simplest form of linear regression, and it involves only one independent variable and
one dependent variable. The equation for simple linear regression is:
y = β0 + β1X
where:

• Y is the dependent variable

• X is the independent variable

• β0 is the intercept

• β1 is the slope
Multiple Linear Regression

This involves more than one independent variable and one dependent variable. The equation for
multiple linear regression is:
y = β0 + β1X1 + β2X2 + … + βnXn
where:

• Y is the dependent variable

• X1, X2, …, Xn are the independent variables

• β0 is the intercept

• β1, β2, …, βn are the slopes

The goal of the algorithm is to find the best Fit Line equation that can predict the values based
on the independent variables.

In regression, a set of records with X and Y values is available, and these values are used to learn a function, so that Y can be predicted for an unseen X using that learned function. In regression we have to find the value of Y, so a function that predicts continuous Y given X as independent features is required.
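As a quick illustration of such a learned function, here is a minimal scikit-learn sketch (the experience/salary numbers below are made up):

import numpy as np
from sklearn.linear_model import LinearRegression

# X: years of work experience, y: salary in thousands (toy values)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([30, 35, 41, 44, 50])

reg = LinearRegression().fit(X, y)
print(reg.intercept_, reg.coef_)   # learned β0 (intercept) and β1 (slope)
print(reg.predict([[6]]))          # predict continuous Y for an unseen X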

What is the best Fit Line?

Our primary objective while using linear regression is to locate the best-fit line, which implies that
the error between the predicted and actual values should be kept to a minimum. There will be the
least error in the best-fit line.

The best Fit Line equation provides a straight line that represents the relationship between the
dependent and independent variables. The slope of the line indicates how much the dependent
variable changes for a unit change in the independent variable(s).
Here Y is called a dependent or target variable and X is called an independent variable also known
as the predictor of Y. There are many types of functions or modules that can be used for regression.
A linear function is the simplest type of function. Here, X may be a single feature or multiple
features representing the problem.

Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x); hence the name Linear Regression. For example, X (input) could be the work experience and Y (output) the salary of a person; the regression line is then the best-fit line for the model.

We utilize the cost function to compute the best values of the weights (the coefficients of the line) in order to get the best-fit line, since different coefficient values result in different regression lines.

Least Square Method


The least square method is the process of finding the best-fitting curve or line of best fit for a set of data points by reducing the sum of the squares of the offsets (residuals) of the points from the curve. The method of least squares defines the solution as the minimization of the sum of the squared deviations (errors) across all the equations. The formula for the sum of squares of errors helps quantify the variation in the observed data.

There are two basic categories of least-squares problems:

• Ordinary or linear least squares

• Nonlinear least squares

These depend upon linearity or nonlinearity of the residuals. The linear problems are often seen in
regression analysis in statistics. On the other hand, the non-linear problems are generally used in the
iterative method of refinement in which the model is approximated to the linear one with each iteration.

Least Square Method Graph

In linear regression, the line of best fit is a straight line. The residuals (offsets) of the data points from this line are what the method minimizes. Vertical offsets are generally used in practice for line, polynomial, surface and hyperplane fitting, while perpendicular offsets are used less commonly.

Least Square Method Formula

The least-square method states that the curve that best fits a given set of observations, is said to be a
curve having a minimum sum of the squared residuals (or deviations or errors) from the given data
points. Let us assume that the given points of data are (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) in which all x’s
are independent variables, while all y’s are dependent ones. Also, suppose that f(x) is the fitting curve
and d represents error or deviation from each given point.

Now, we can write:

d1 = y1 − f(x1)

d2 = y2 − f(x2)

d3 = y3 − f(x3)

…..

dn = yn – f(xn)

The least-squares principle states that the curve that best fits the data has the property that the sum of squares of all the deviations from the given values must be a minimum, i.e.:

S = d1² + d2² + d3² + … + dn² = ∑ [yi − f(xi)]² = minimum


When we have to determine the equation of the line of best fit for the given data, we use the following formulas.

The equation of least square line is given by Y = a + bX

Normal equation for ‘a’:

∑Y = na + b∑X

Normal equation for ‘b’:

∑XY = a∑X + b∑X²

Solving these two normal equations we can get the required trend line equation.

Thus, we can get the line of best fit with the formula y = a + bx.

Solved Example

The Least Squares Model for a set of data (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) passes through the point (xa,
ya) where xa is the average of the xi‘s and ya is the average of the yi‘s. The below example explains how to
find the equation of a straight line or a least square line using the least square method.

Question:

Consider the time series data given below:

xi 8 3 2 10 11 3 6 5 6 8

yi 4 12 1 12 9 4 9 6 1 14

Use the least square method to determine the equation of line of best fit for the data. Then plot the line.

Solution:

Mean of xi values = (8 + 3 + 2 + 10 + 11 + 3 + 6 + 5 + 6 + 8)/10 = 62/10 = 6.2

Mean of yi values = (4 + 12 + 1 + 12 + 9 + 4 + 9 + 6 + 1 + 14)/10 = 72/10 = 7.2

Straight line equation is y = a + bx.

The normal equations are

∑y = an + b∑x

∑xy = a∑x + b∑x²


x | y | x² | xy
8 | 4 | 64 | 32
3 | 12 | 9 | 36
2 | 1 | 4 | 2
10 | 12 | 100 | 120
11 | 9 | 121 | 99
3 | 4 | 9 | 12
6 | 9 | 36 | 54
5 | 6 | 25 | 30
6 | 1 | 36 | 6
8 | 14 | 64 | 112
∑x = 62 | ∑y = 72 | ∑x² = 468 | ∑xy = 503

Substituting these values in the normal equations,

10a + 62b = 72….(1)

62a + 468b = 503….(2)

(1) × 62 – (2) × 10,

620a + 3844b – (620a + 4680b) = 4464 – 5030

-836b = -566
b = 566/836

b = 283/418

b = 0.677

Substituting b = 0.677 in equation (1),

10a + 62(0.677) = 72

10a + 41.974 = 72

10a = 72 – 41.974

10a = 30.026

a = 30.026/10

a = 3.0026

Therefore, the equation becomes,

y = a + bx

y = 3.0026 + 0.677x

This is the required trend line equation.

Now, we can find the sum of squares of deviations from the obtained values as:

d1 = [4 – (3.0026 + 0.677*8)] = (-4.4186)


d2 = [12 – (3.0026 + 0.677*3)] = (6.9664)

d3 = [1 – (3.0026 + 0.677*2)] = (-3.3566)

d4 = [12 – (3.0026 + 0.677*10)] = (2.2274)

d5 = [9 – (3.0026 + 0.677*11)] =(-1.4496)

d6 = [4 – (3.0026 + 0.677*3)] = (-1.0336)

d7 = [9 – (3.0026 + 0.677*6)] = (1.9354)

d8 = [6 – (3.0026 + 0.677*5)] = (-0.3876)

d9 = [1 – (3.0026 + 0.677*6)] = (-6.0646)

d10 = [14 – (3.0026 + 0.677*8)] = (5.5814)

∑d² = (−4.4186)² + (6.9664)² + (−3.3566)² + (2.2274)² + (−1.4496)² + (−1.0336)² + (1.9354)² + (−0.3876)² + (−6.0646)² + (5.5814)² = 159.2799
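The hand computation above can be checked numerically; the following NumPy sketch solves the same two normal equations:

import numpy as np

x = np.array([8, 3, 2, 10, 11, 3, 6, 5, 6, 8])
y = np.array([4, 12, 1, 12, 9, 4, 9, 6, 1, 14])

# Normal equations: [[n, Σx], [Σx, Σx²]] · [a, b]ᵀ = [Σy, Σxy]ᵀ
A = np.array([[len(x), x.sum()], [x.sum(), (x**2).sum()]])
c = np.array([y.sum(), (x * y).sum()])
a, b = np.linalg.solve(A, c)
print(a, b)                       # approx. 3.0026 and 0.677, matching the solution above

residuals = y - (a + b * x)
print((residuals**2).sum())       # approx. 159.28, matching ∑d²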

Underfitting in Machine Learning

A statistical model or a machine learning algorithm is said to underfit when the model is too simple to capture the complexities of the data. It represents the inability of the model to learn the training data effectively, resulting in poor performance on both the training and testing data. In simple terms, an underfit model is inaccurate, especially when applied to new, unseen examples. It mainly happens when we use a very simple model with overly simplified assumptions. To address the underfitting problem, we need to use more complex models, with enhanced feature representation and less regularization.

Note: The underfitting model has High bias and low variance.

Reasons for Underfitting

1. The model is too simple, so it may not be capable of representing the complexities in the data.

2. The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.

3. The size of the training dataset used is not enough.

4. Excessive regularization is used to prevent overfitting, which constrains the model from capturing the data well.

5. Features are not scaled.

Techniques to Reduce Underfitting

1. Increase model complexity.

2. Increase the number of features, performing feature engineering.

3. Remove noise from the data.

4. Increase the number of epochs or increase the duration of training to get better results.
Overfitting in Machine Learning

A statistical model is said to be overfitted when it does not make accurate predictions on testing data. When a model is trained on so much data, it starts learning from the noise and inaccurate entries in the data set, and testing on test data then results in high variance. The model fails to categorize the data correctly because of too many details and noise. Overfitting is often caused by non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore build unrealistic models. A solution to avoid overfitting is to use a linear algorithm if we have linear data, or to use parameters like the maximal depth if we are using decision trees.

In a nutshell, Overfitting is a problem where the evaluation of machine learning algorithms on training
data is different from unseen data.

Reasons for Overfitting:

1. High variance and low bias.

2. The model is too complex.

3. The size of the training data.

Techniques to Reduce Overfitting

1. Improving the quality of training data reduces overfitting by focusing on meaningful patterns, mitigating the risk of fitting noise or irrelevant features.

2. Increasing the training data can improve the model's ability to generalize to unseen data and reduce the likelihood of overfitting.

3. Reduce model complexity.

4. Early stopping during the training phase (monitor the loss over the training period; as soon as the loss begins to increase, stop training).

5. Ridge Regularization and Lasso Regularization.

6. Use dropout for neural networks to tackle overfitting.

What is cross-validation used for?

The main purpose of cross validation is to prevent overfitting, which occurs when a model is trained too
well on the training data and performs poorly on new, unseen data. By evaluating the model on multiple
validation sets, cross validation provides a more realistic estimate of the model’s generalization
performance, i.e., its ability to perform well on new, unseen data.

Types of Cross-Validation

There are several types of cross validation techniques, including k-fold cross validation, leave-one-out
cross validation, and Holdout validation, Stratified Cross-Validation. The choice of technique depends on
the size and nature of the data, as well as the specific requirements of the modeling problem.
1. Holdout Validation

In Holdout Validation, we perform training on 50% of the given dataset and the remaining 50% is used for testing. It's a simple and quick way to evaluate a model. The major drawback of this method is that, because we train on only 50% of the dataset, the other 50% may contain important information that the model never sees during training, leading to higher bias.

2. LOOCV (Leave One Out Cross Validation)

In this method, we perform training on the whole dataset but leave out only one data point, and we iterate this for each data point. In LOOCV, the model is trained on n−1 samples and tested on the one omitted sample, repeating this process for each data point in the dataset. It has both advantages and disadvantages.

An advantage of using this method is that we make use of all data points and hence it is low bias.

The major drawback of this method is that it leads to higher variation in the testing model as we are testing
against one data point. If the data point is an outlier it can lead to higher variation. Another drawback is
it takes a lot of execution time as it iterates over ‘the number of data points’ times.

3. Stratified Cross-Validation

It is a technique used in machine learning to ensure that each fold of the cross-validation process maintains
the same class distribution as the entire dataset. This is particularly important when dealing with
imbalanced datasets, where certain classes may be underrepresented. In this method,

1. The dataset is divided into k folds while maintaining the proportion of classes in each fold.

2. During each iteration, one-fold is used for testing, and the remaining folds are used for training.

3. The process is repeated k times, with each fold serving as the test set exactly once.

4. Stratified Cross-Validation is essential when dealing with classification problems where


maintaining the balance of class distribution is crucial for the model to generalize well to unseen
data.

4. K-Fold Cross Validation

In K-Fold Cross Validation, we split the dataset into k subsets (known as folds), train on k−1 of the subsets, and leave one subset out for the evaluation of the trained model. We iterate k times, with a different subset reserved for testing each time.

Note: It is always suggested that the value of k should be 10 as the lower value of k takes towards validation
and higher value of k leads to LOOCV method.
Example of K Fold Cross Validation

The following illustrates the training and evaluation subsets generated in k-fold cross-validation. Here, we have 25 instances in total. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training ([1-5] testing and [6-25] training), while in the second iteration we use the second subset of 20 percent for evaluation and the remaining subsets for training ([6-10] testing and [1-5 and 11-25] training), and so on.
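A minimal scikit-learn sketch of 5-fold cross-validation (the dataset and model choices here are illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds serves as the evaluation set exactly once
scores = cross_val_score(model, X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # averaged estimate of generalization performance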

What is overfitting? In machine learning, overfitting occurs when an algorithm fits too closely or even
exactly to its training data, resulting in a model that can't make accurate predictions or conclusions from
any data other than the training data.

Underfitting is a scenario in data science where a data model is unable to capture the relationship between
the input and output variables accurately, generating a high error rate on both the training set and unseen
data.

Lasso Regression

Lasso regression is a regularized regression algorithm that uses shrinkage to produce simple and sparse models (i.e., models with fewer parameters). In shrinkage, data values are shrunk towards a central point like the mean. Lasso regression performs L1 regularization, which adds a penalty equal to the absolute value of the magnitude of the coefficients.

“LASSO” stands for Least Absolute Shrinkage and Selection Operator. Lasso regression is good for models showing high levels of multicollinearity, or when you want to automate certain parts of model selection, i.e., variable selection or parameter elimination. Lasso regression solutions are quadratic programming problems that are best solved with software like RStudio, Matlab, etc. It has the ability to select predictors.

The lasso objective minimizes the residual sum of squares together with an L1 penalty on the coefficients:

Minimize: (1/(2N)) ∑i=1..N (yi − β0 − ∑j=1..p xij βj)² + λ ∑j=1..p |βj|

Here:

• N: is the number of observations.

• p: is the number of predictors.


• yi : is the response variable for the i-th observation.

• xij : is the value of the j-th predictor for the i-th observation.

• β0: is the intercept term.

• βj : are the coefficients for the predictors.

• λ : is the regularization parameter controlling the strength of the L1 penalty term.

The algorithm minimizes the sum of squares subject to this penalty. Some β coefficients are shrunk exactly to zero, which results in a sparser regression model. A tuning parameter, lambda, controls the strength of the L1 regularization penalty; lambda is basically the amount of shrinkage:

• When lambda = 0, no parameters are eliminated.

• As lambda increases, more and more coefficients are set to zero and eliminated & bias
increases.

• When lambda = infinity, all coefficients are eliminated.

• As lambda decreases, variance increases.
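The shrinkage behaviour can be observed directly; a small scikit-learn sketch on synthetic data (alpha plays the role of lambda):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data where only 3 of the 10 features actually matter
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for alpha in [0.1, 1.0, 10.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    # More coefficients are driven to exactly zero as alpha grows
    print(alpha, np.sum(lasso.coef_ == 0))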

Logistic regression

Logistic regression is a supervised machine learning algorithm used for classification tasks where the goal is to predict the probability that an instance belongs to a given class or not. Logistic regression is a statistical algorithm which analyzes the relationship between two data factors.

Logistic regression is used for binary classification, where we use the sigmoid function, which takes the independent variables as input and produces a probability value between 0 and 1.

For example, we have two classes Class 0 and Class 1 if the value of the logistic function for an
input is greater than 0.5 (threshold value) then it belongs to Class 1 otherwise it belongs to Class
0. It’s referred to as regression because it is the extension of linear regression but is mainly used
for classification problems.

Key Points:

• Logistic regression predicts the output of a categorical dependent variable. Therefore, the
outcome must be a categorical or discrete value.

• It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0 and
1, it gives the probabilistic values which lie between 0 and 1.

• In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic function,
which predicts two maximum values (0 or 1).

Logistic Function – Sigmoid Function

• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• It maps any real value into another value within a range of 0 and 1. The output of logistic regression must lie between 0 and 1 and cannot go beyond this limit, so it forms an “S”-shaped curve.

• The S-form curve is called the Sigmoid function or the logistic function.

• In logistic regression, we use the concept of the threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.

Types of Logistic Regression

On the basis of the categories, Logistic Regression can be classified into three types:

1. Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.

2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered


types of the dependent variable, such as “cat”, “dogs”, or “sheep”

3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as “low”, “Medium”, or “High”.

How does Logistic Regression work?


The logistic regression model transforms the continuous output of the linear regression function into a categorical output using a sigmoid function, which maps any real-valued combination of the independent input variables into a value between 0 and 1. This function is known as the logistic function: the linear combination z = β0 + β1x1 + … + βnxn of the input features is mapped to the probability σ(z) = 1/(1 + e^(−z)).
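A minimal sketch of this pipeline, assuming NumPy and scikit-learn (the toy feature values are made up):

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # Maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5, the usual decision threshold

# Toy binary classification with a single feature
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[3.5]]))   # probabilities for class 0 and class 1
print(clf.predict([[3.5]]))         # label after applying the 0.5 threshold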

Gradient Descent
Gradient Descent is an iterative optimization algorithm that tries to find the optimum value
(Minimum/Maximum) of an objective function. It is one of the most used optimization techniques in
machine learning projects for updating the parameters of a model in order to minimize a cost function.

The main aim of gradient descent is to find the best parameters of a model which gives the highest
accuracy on training as well as testing datasets. In gradient descent, the gradient is a vector that points in
the direction of the steepest increase of the function at a specific point. Moving in the opposite direction
of the gradient allows the algorithm to gradually descend towards lower values of the function, and
eventually reaching to the minimum of the function.

Steps Required in Gradient Descent Algorithm

• Step 1: Initialize the parameters of the model randomly.

• Step 2: Compute the gradient of the cost function with respect to each parameter. This involves taking the partial derivative of the cost function with respect to each parameter.

• Step 3: Update the parameters of the model by taking steps in the opposite direction of the gradient. Here we choose a hyperparameter called the learning rate, denoted by alpha, which decides the step size of each gradient update.

• Step 4: Repeat steps 2 and 3 iteratively to get the best parameters for the defined model.
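The four steps above in a short NumPy sketch, minimizing the mean squared error of a straight-line fit (the learning rate and iteration count are arbitrary illustrative choices):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # underlying relationship: y = 1 + 2x

# Step 1: initialize the parameters randomly
rng = np.random.default_rng(0)
w, b = rng.normal(), rng.normal()
alpha = 0.01                               # learning rate

for _ in range(5000):                      # Step 4: repeat steps 2 and 3
    y_hat = w * x + b
    # Step 2: partial derivatives of the MSE cost with respect to w and b
    dw = (2 / len(x)) * np.sum((y_hat - y) * x)
    db = (2 / len(x)) * np.sum(y_hat - y)
    # Step 3: move in the opposite direction of the gradient
    w -= alpha * dw
    b -= alpha * db

print(w, b)   # converges towards 2 and 1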

Support Vector Machine Algorithm


Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is
called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. For example, two different categories can be separated using such a decision boundary or hyperplane.
Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called the Linear SVM classifier.

o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called the Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional
space, but we need to find out the best decision boundary that helps to classify the data points. This
best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the features present in the dataset: if there are 2 features, the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.

We always create a hyperplane that has a maximum margin, which means the maximum distance between the hyperplane and the nearest data points.

Support Vectors:
The data points or vectors that are closest to the hyperplane and affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and two features x1 and x2. We want a classifier that can classify a pair (x1, x2) of coordinates as either green or blue. Since this is 2-D space, we can separate these two classes with a straight line, but there can be multiple lines that separate them.

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line.

So to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space becomes three-dimensional, and SVM can divide the datasets into classes with a separating surface. Since we are in 3-D space, this surface looks like a plane parallel to the x-axis. If we convert it back into 2-D space with z = 1, it becomes a circle of radius 1. Hence, we get a circumference of radius 1 in the case of non-linear data.
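In practice this extra dimension is added implicitly through a kernel; a minimal scikit-learn sketch on circular (non-linear) data:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1).astype(int)   # inner circle vs. outside

# The RBF kernel lets SVM find a non-linear boundary without
# explicitly constructing z = x² + y²
clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))               # training accuracy
print(clf.support_vectors_.shape)    # the support vectors that define the boundary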


Instance-based learning
Instance-based learning systems learn the training examples by heart and then generalize to new instances based on some similarity measure. It is called instance-based because it builds the hypotheses from the training instances. It is also known as memory-based learning or lazy learning (because such systems delay processing until a new instance must be classified).

Advantages:

1. Instead of estimating for the entire instance set, local approximations can be made to the target
function.

2. This algorithm can adapt easily to new data, which is collected as we go.

Disadvantages:

1. Classification costs are high

2. Large amount of memory required to store the data, and each query involves starting the
identification of a local model from scratch.

Some of the instance-based learning algorithms are:

1. K Nearest Neighbor (KNN)

2. Self-Organizing Map (SOM)

3. Learning Vector Quantization (LVQ)

4. Locally Weighted Learning (LWL)

5. Case-Based Reasoning

K Nearest Neighbor (KNN)


K-Nearest Neighbor (KNN) Algorithm for Machine Learning

o K-Nearest Neighbor is one of the simplest Machine Learning algorithms based on Supervised
Learning technique.

o K-NN algorithm assumes the similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.

o K-NN algorithm stores all the available data and classifies a new data point based on the similarity. This means that when new data appears, it can easily be classified into a well-suited category by using the K-NN algorithm.

o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.

o K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying
data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately
instead it stores the dataset and at the time of classification, it performs an action on the dataset.

o KNN algorithm at the training phase just stores the dataset and when it gets new data, then it
classifies that data into a category that is much similar to the new data.

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors

o Step-2: Calculate the Euclidean distance of K number of neighbors

o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

o Step-4: Among these k neighbors, count the number of the data points in each category.

o Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.

o Step-6: Our model is ready.
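These steps correspond to a short scikit-learn sketch (k = 5; Euclidean distance is the default metric):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: choose K; Steps 2-5 happen inside predict()
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)          # "training" only stores the dataset (lazy learner)
print(knn.score(X_test, y_test))   # accuracy of the majority-vote classification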

Distance Metrics Used in KNN Algorithm


As we know that the KNN algorithm helps us identify the nearest points or the groups for a query point.
But to determine the closest groups or the nearest points for a query point we need some metric. For this
purpose, we use the distance metrics below:
Euclidean Distance
This is nothing but the cartesian distance between the two points which are in the
plane/hyperplane. Euclidean distance can also be visualized as the length of the straight line that joins the
two points which are into consideration. This metric helps us calculate the net displacement done between
the two states of an object.
Manhattan Distance

Manhattan Distance metric is generally used when we are interested in the total distance traveled by the
object instead of the displacement. This metric is calculated by summing the absolute difference between
the coordinates of the points in n-dimensions.

Minkowski Distance

The Euclidean as well as the Manhattan distance are special cases of the Minkowski distance. For two points x and y in n dimensions, it is defined as:

d(x, y) = (∑i |xi − yi|^p)^(1/p)

From this formula we can see that when p = 2 it is the same as the formula for the Euclidean distance, and when p = 1 we obtain the formula for the Manhattan distance.
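A quick numerical check of this relationship, assuming SciPy:

from scipy.spatial.distance import cityblock, euclidean, minkowski

a, b = [0, 0], [3, 4]
print(minkowski(a, b, p=2), euclidean(a, b))   # both 5.0 (Euclidean)
print(minkowski(a, b, p=1), cityblock(a, b))   # both 7.0 (Manhattan)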

Advantages of the KNN Algorithm

• Easy to implement as the complexity of the algorithm is not that high.

• Adapts Easily – As per the working of the KNN algorithm it stores all the data in memory storage
and hence whenever a new example or data point is added then the algorithm adjusts itself as per
that new example and has its contribution to the future predictions as well.

• Few Hyperparameters – The only parameters which are required in the training of a KNN algorithm
are the value of k and the choice of the distance metric which we would like to choose from our
evaluation metric.

Disadvantages of the KNN Algorithm

• Does not scale – The KNN algorithm is considered a lazy algorithm. The main significance of this term is that it takes a lot of computing power as well as data storage, which makes the algorithm both time-consuming and resource-exhausting.

• Curse of Dimensionality – There is a term known as the peaking phenomenon, according to which the KNN algorithm is affected by the curse of dimensionality: the algorithm has a hard time classifying data points properly when the dimensionality is too high.

• Prone to Overfitting – Because the algorithm is affected by the curse of dimensionality, it is also prone to overfitting. Hence feature selection as well as dimensionality reduction techniques are generally applied to deal with this problem.
Tree Based Machine Learning Algorithms

Tree-based algorithms are a class of supervised machine learning models that construct decision trees to partition the feature space into regions, enabling a hierarchical representation of complex relationships between input variables and output labels.

Decision Tree Classification Algorithm

o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.

o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.

o The decisions or the test are performed on the basis of features of the given dataset.

o It is a graphical representation for getting all the possible solutions to a problem/decision based
on given conditions.

o It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure.

o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.

o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are the two
reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.

o The logic behind the decision tree can be easily understood because it shows a tree-like structure.

Decision Tree Terminologies


• Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.

• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.

• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.

• Branch/Sub Tree: A tree formed by splitting the tree.

• Pruning: Pruning is the process of removing the unwanted branches from the tree.

• Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the
child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of
the tree. This algorithm compares the values of root attribute with the record (real dataset) attribute and,
based on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the algorithm below:

o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.

o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).

o Step-3: Divide S into subsets that contain possible values of the best attribute.

o Step-4: Generate the decision tree node, which contains the best attribute.

o Step-5: Recursively make new decision trees using the subsets of the dataset created in step 3. Continue this process until a stage is reached where you cannot classify the nodes any further; the final nodes are then called leaf nodes.
Attribute Selection Measures

While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). With this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:

o Information Gain

o Gini Index

1. Information Gain:

o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.

o It calculates how much information a feature provides us about a class.

o According to the value of information gain, we split the node and build the decision tree.

o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in data.
Entropy can be calculated as:

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,

o S= Total number of samples

o P(yes)= probability of yes

o P(no)= probability of no
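These formulas translate directly into a short computation; the sketch below (assuming NumPy, with made-up yes/no counts) is purely illustrative:

import numpy as np

def entropy(p_yes):
    # Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
    p_no = 1 - p_yes
    return -sum(p * np.log2(p) for p in (p_yes, p_no) if p > 0)

# Parent node: 9 "yes" and 5 "no" samples
parent = entropy(9 / 14)

# Hypothetical attribute splitting the node into two children
left = entropy(6 / 8)    # 8 samples: 6 yes, 2 no
right = entropy(3 / 6)   # 6 samples: 3 yes, 3 no

weighted_avg = (8 / 14) * left + (6 / 14) * right
info_gain = parent - weighted_avg
print(parent, info_gain)   # approx. 0.940 and 0.048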

Advantages of the Decision Tree

o It is simple to understand as it follows the same process which a human follow while making any
decision in real-life.

o It can be very useful for solving decision-related problems.

o It helps to think about all the possible outcomes for a problem.

o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

o The decision tree contains lots of layers, which makes it complex.

o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

CART (Classification and Regression Tree) in Machine Learning

o CART is a predictive algorithm used in Machine Learning that explains how the target variable's values can be predicted from other variables. It is a decision tree in which each fork is a split on a predictor variable and each leaf node contains a prediction for the target variable.

o The term CART serves as a generic term for the following categories of decision trees:

o Classification Trees: The tree is used to determine which “class” the target variable is most likely to fall into when the target is categorical.

o Regression trees: These are used to predict a continuous variable’s value.

CART Algorithm

Classification and Regression Trees (CART) is a decision tree algorithm that is used for both classification
and regression tasks. It is a supervised learning algorithm that learns from labelled data to predict unseen
data.

• Tree structure: CART builds a tree-like structure consisting of nodes and branches. The nodes
represent different decision points, and the branches represent the possible outcomes of those
decisions. The leaf nodes in the tree contain a predicted class label or value for the target variable.

• Splitting criteria: CART uses a greedy approach to split the data at each node. It evaluates all
possible splits and selects the one that best reduces the impurity of the resulting subsets. For
classification tasks, CART uses Gini impurity as the splitting criterion. The lower the Gini impurity,
the purer the subset is. For regression tasks, CART uses residual reduction as the splitting criterion; the greater the reduction in residual error, the better the split fits the data.

• Pruning: To prevent overfitting of the data, pruning is a technique used to remove the nodes that
contribute little to the model accuracy. Cost complexity pruning and information gain pruning are
two popular pruning techniques. Cost complexity pruning involves calculating the cost of each
node and removing nodes that have a negative cost. Information gain pruning involves calculating
the information gain of each node and removing nodes that have a low information gain.

How does CART algorithm work?

The CART algorithm works via the following process:

• The best-split point of each input is obtained.

• Based on the best-split points of each input in Step 1, the new “best” split point is identified.

• Split the chosen input according to the “best” split point.

• Continue splitting until a stopping rule is satisfied or no further desirable splitting is available.
The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does that by searching for the best homogeneity of the sub-nodes, with the help of the Gini index criterion.

Gini index/Gini impurity

The Gini index is a metric for classification tasks in CART. It stores the sum of squared probabilities of each class. It computes the probability that a randomly chosen element is wrongly classified, and it is a variation of the Gini coefficient. It works on categorical variables, provides outcomes of either “success” or “failure”, and hence conducts binary splitting only.

The degree of the Gini index varies from 0 to 1,

• Where 0 depicts that all the elements are allied to a certain class, or only one class exists there.

• Gini index close to 1 means a high level of impurity, where each class contains a very small fraction
of elements, and

• A value of 1-1/n occurs when the elements are uniformly distributed into n classes and each class
has an equal probability of 1/n. For example, with two classes, the Gini impurity is 1 – 1/2 = 0.5.

Mathematically, we can write Gini impurity as follows:

Gini = 1 − ∑i pi²

where pi is the probability of an element belonging to class i. In conclusion, Gini impurity is the probability of misclassification, assuming independent selection of the element and its class based on the class probabilities.
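The same formula as a short NumPy function, for illustration:

import numpy as np

def gini_impurity(class_probs):
    # Gini = 1 - sum_i (p_i ** 2)
    p = np.asarray(class_probs, dtype=float)
    return 1.0 - float(np.sum(p**2))

print(gini_impurity([1.0, 0.0]))   # 0.0 -> a pure node
print(gini_impurity([0.5, 0.5]))   # 0.5 -> maximum impurity for two classes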
CART for Classification

A classification tree is an algorithm where the target variable is categorical. The algorithm is then used to
identify the “Class” within which the target variable is most likely to fall. Classification trees are used when
the dataset needs to be split into classes that belong to the response variable (like yes or no)

For classification in decision tree learning algorithm that creates a tree-like structure to predict class labels.
The tree consists of nodes, which represent different decision points, and branches, which represent the
possible result of those decisions. Predicted class labels are present at each leaf node of the tree.

How Does CART for Classification Work?

CART for classification works by recursively splitting the training data into smaller and smaller subsets
based on certain criteria. The goal is to split the data in a way that minimizes the impurity within each
subset. Impurity is a measure of how mixed up the data is in a particular subset. For classification tasks,
CART uses Gini impurity

• Gini Impurity- Gini impurity measures the probability of misclassifying a random instance from a
subset labeled according to the majority class. Lower Gini impurity means more purity of the
subset.

• Splitting Criteria- The CART algorithm evaluates all potential splits at every node and chooses the
one that best decreases the Gini impurity of the resultant subsets. This process continues until a
stopping criterion is reached, like a maximum tree depth or a minimum number of instances in a
leaf node.

CART for Regression

A Regression tree is an algorithm where the target variable is continuous and the tree is used to predict
its value. Regression trees are used when the response variable is continuous. For example, if the response
variable is the temperature of the day.

CART for regression is a decision tree learning method that creates a tree-like structure to predict
continuous target variables. The tree consists of nodes that represent different decision points and
branches that represent the possible outcomes of those decisions. Predicted values for the target variable
are stored in each leaf node of the tree.

How Does CART work for Regression?

Regression CART works by splitting the training data recursively into smaller subsets based on specific criteria. The objective is to split the data in a way that minimizes the residual error within each subset.

• Residual Reduction – Residual reduction is a measure of how much the average squared difference between the predicted values and the actual values for the target variable is reduced by splitting the subset. The greater the residual reduction, the better the model fits the data.

• Splitting Criteria- CART evaluates every possible split at each node and selects the one that results
in the greatest reduction of residual error in the resulting subsets. This process is repeated until a
stopping criterion is met, such as reaching the maximum tree depth or having too few instances
in a leaf node.
Ensemble Learning
Ensemble learning is a machine learning technique that combines the predictions from multiple
models to create a more accurate and stable prediction. It is an approach that leverages the
collective intelligence of multiple models to improve the overall performance of the learning
system.

Types of Ensemble Methods

There are various types of ensembles learning methods, including:

1. Bagging (Bootstrap Aggregating): This method involves training multiple models on random
subsets of the training data. The predictions from the individual models are then combined,
typically by averaging.

2. Boosting: This method involves training a sequence of models, where each subsequent model
focuses on the errors made by the previous model. The predictions are combined using a weighted
voting scheme.

3. Stacking: This method involves using the predictions from one set of models as input features for
another model. The final prediction is made by the second-level model.

Bagging

Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-algorithm designed
to improve the stability and accuracy of machine learning algorithms used in statistical classification and
regression. It decreases the variance and helps to avoid overfitting. It is usually applied to decision tree
methods. Bagging is a special case of the model averaging approach.

Description of the Technique

Suppose a set D of d tuples. At each iteration i, a training set Di of d tuples is selected via row sampling with replacement (i.e., there can be repeated elements from D) from D (i.e., a bootstrap sample). Then a classifier model Mi is learned for each training set Di. Each classifier Mi returns its class prediction. The bagged classifier M* counts the votes and assigns the class with the most votes to X (an unknown sample).

Implementation Steps of Bagging

• Step 1: Multiple subsets are created from the original data set with equal tuples, selecting
observations with replacement.

• Step 2: A base model is created on each of these subsets.

• Step 3: Each model is learned in parallel from its own training set, independently of the others.

• Step 4: The final predictions are determined by combining the predictions from all the models.
[Figure: an illustration of the concept of bootstrap aggregating (bagging).]

Example of Bagging

The Random Forest model uses Bagging, where decision tree models with higher variance are present. It
makes random feature selection to grow trees. Several random trees make a Random Forest.
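As a concrete sketch, the bagging procedure described above can be tried with scikit-learn's BaggingClassifier, whose default base model is a decision tree; the dataset and parameter values below are illustrative assumptions, not part of the original discussion.

# Minimal bagging sketch (illustrative parameters; assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 10 base models, each fit on a bootstrap sample (rows drawn with replacement);
# predictions are combined by majority vote.
bagging = BaggingClassifier(n_estimators=10, bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)
print("Test accuracy:", bagging.score(X_test, y_test))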

Boosting

Boosting is an ensemble modeling technique designed to create a strong classifier by combining multiple
weak classifiers. The process involves building models sequentially, where each new model aims to correct
the errors made by the previous ones.

• Initially, a model is built using the training data.

• Subsequent models are then trained to address the mistakes of their predecessors.

• To correct those mistakes, boosting assigns weights to the data points in the original dataset.

• Higher weights: Instances that were misclassified by the previous model receive higher weights.

• Lower weights: Instances that were correctly classified receive lower weights.

• Training on weighted data: The subsequent model learns from the weighted dataset, focusing its
attention on harder-to-learn examples (those with higher weights).

• This iterative process continues until:

o The entire training dataset is accurately predicted, or


o A predefined maximum number of models is reached.

Boosting Algorithms

There are several boosting algorithms. The original ones, proposed by Robert Schapire and Yoav Freund, were not adaptive and could not take full advantage of the weak learners. Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that won the prestigious Gödel Prize. AdaBoost (short for Adaptive Boosting) was the first truly successful boosting algorithm developed for binary classification. It is a very popular boosting technique that combines multiple “weak classifiers” into a single “strong classifier”.

Algorithm:

1. Initialize the dataset and assign equal weight to each data point.

2. Provide this as input to the model and identify the wrongly classified data points.

3. Increase the weights of the wrongly classified data points and decrease the weights of the correctly classified data points. Then normalize the weights of all data points.

4. If the required results have been obtained, go to step 5; otherwise, go to step 2.

5. End
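The same loop can be sketched with scikit-learn's AdaBoostClassifier, which by default boosts decision stumps; the synthetic dataset and parameters below are illustrative assumptions.

# Minimal AdaBoost sketch (illustrative parameters; assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each round re-weights the training points so that the next weak learner
# focuses on the examples misclassified so far (steps 1-4 above).
ada = AdaBoostClassifier(n_estimators=50, random_state=42)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))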

[Figure: an illustration of the intuition behind the boosting algorithm, showing the sequential learners and the weighted dataset.]
Similarities Between Bagging and Boosting

Bagging and boosting are both commonly used methods that share the universal similarity of being classified as ensemble methods. Their main similarities are listed below.

1. Both are ensemble methods to get N learners from 1 learner.

2. Both generate several training data sets by random sampling.

3. Both make the final decision by averaging the N learners (or taking the majority of them i.e
Majority Voting).

4. Both are good at reducing variance and provide higher stability.

Differences Between Bagging and Boosting

1. In bagging, the base models are trained in parallel on independent bootstrap samples; in boosting, they are trained sequentially, each one focusing on the errors of its predecessors.
2. In bagging, every model receives equal weight in the final vote; in boosting, models are weighted according to their performance.
3. Bagging mainly reduces variance, whereas boosting mainly reduces bias.
4. Bagging is less prone to overfitting, while boosting can overfit if too many models are added.

Random Forest

A random forest is an ensemble learning method that combines the predictions from multiple decision
trees to produce a more accurate and stable prediction. It is a type of supervised learning algorithm that
can be used for both classification and regression tasks.

Every individual decision tree has high variance, but when we combine all of them in parallel, the resultant variance is low: each decision tree is trained on its own sample of the data, so the output does not depend on one decision tree but on multiple decision trees. In the case of a classification problem, the final output is obtained using the majority voting classifier. In the case of a regression problem, the final output is the mean of all the individual outputs. This part is called Aggregation.
Random Forest Regression Model Working

What is Random Forest Regression?

Random Forest Regression in machine learning is an ensemble technique capable of performing both regression and classification tasks with the use of multiple decision trees and a technique called Bootstrap and Aggregation, commonly known as bagging. The basic idea behind this is to combine multiple decision trees in determining the final output rather than relying on individual decision trees.

Random Forest has multiple decision trees as base learning models. We randomly perform row sampling
and feature sampling from the dataset forming sample datasets for every model. This part is called
Bootstrap.

We need to approach the Random Forest regression technique like any other machine learning technique.

• Design a specific question or data and get the source to determine the required data.

• Make sure the data is in an accessible format else convert it to the required format.

• Identify all noticeable anomalies and missing data points that must be handled to obtain the required data.

• Create a machine-learning model.

• Set the baseline model that you want to achieve

• Train the machine learning model on the data.


• Evaluate the model by providing it with test data.

• Now compare the performance metrics of both the test data and the predicted data from the
model.

• If it doesn’t satisfy your expectations, you can try improving your model accordingly, updating your data, or using another data modeling technique.

• At this stage, you interpret the data you have gained and report accordingly.

Evaluation of Classification Algorithms


To evaluate a classification algorithm, common metrics like accuracy, precision, recall, F1-score, the confusion matrix, and the area under the Receiver Operating Characteristic (ROC) curve are used. These assess how well the model categorizes data into different classes, considering true positives, false positives, true negatives, and false negatives, and together provide a comprehensive view of its performance across different classes.

Classification Accuracy

Classification accuracy is a fundamental metric for evaluating the performance of a classification model, providing a quick snapshot of how well the model is performing in terms of correct predictions. It is calculated as the ratio of correct predictions to the total number of input samples.
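Expressed with the confusion-matrix counts defined later in this section, this can be written as:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]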

Area Under Curve (AUC)

It is one of the widely used metrics, basically used for binary classification. The AUC of a classifier is defined as the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. Before going into AUC further, let us define a few basic terms.

True positive rate:

Also termed sensitivity. The True Positive Rate is the proportion of positive data points that are correctly classified as positive, with respect to all data points that are actually positive: TPR = TP / (TP + FN).

True Negative Rate

Also termed specificity. The True Negative Rate is the proportion of negative data points that are correctly classified as negative, with respect to all data points that are actually negative: TNR = TN / (TN + FP).
False-positive Rate

The False Positive Rate is the proportion of actual negatives that are incorrectly classified as positives: FPR = FP / (FP + TN).

F1 Score

It is the harmonic mean of recall and precision. Its range is [0, 1]. This metric tells us how precise our classifier is (how many of the instances it classifies are correct) and how robust it is (it does not miss a significant number of instances).

Precision

There is another metric named Precision. Precision is a measure of a model’s performance that tells you how many of the positive predictions made by the model are actually correct. It is calculated as the number of true positive predictions divided by the number of true positive and false positive predictions: Precision = TP / (TP + FP).

Lower recall and higher precision give you very precise predictions, but the model then misses a large number of positive instances. The higher the F1 score, the better the performance. It can be expressed mathematically as:

\[ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

Confusion Matrix

A confusion matrix is a two-dimensional matrix used in classification experiments to evaluate the performance of a system by showing the number of correctly and wrongly classified samples, helping to identify which classes are most often confused.

It is an N x N matrix, where N is the number of classes or categories to be predicted. Here we have N = 2, so we get a 2 x 2 matrix. Suppose our problem is a binary classification in which samples belong to either class Yes or class No. We build a classifier that predicts the class for a new input sample, and then test the model with 165 samples. An illustrative result might look like this:

                  Predicted: No    Predicted: Yes
Actual: No             50                10
Actual: Yes             5               100
There are 4 terms you should keep in mind:

1. True Positives: It is the case where we predicted Yes and the real output was also Yes.

2. True Negatives: It is the case where we predicted No and the real output was also No.

3. False Positives: It is the case where we predicted Yes but it was actually No.

4. False Negatives: It is the case where we predicted No but it was actually Yes.
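All of the metrics above can be computed in a few lines; the sketch below uses scikit-learn's metrics module on made-up true and predicted labels, purely for illustration.

# Computing the evaluation metrics above (assumes scikit-learn; labels are made up).
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))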
Clustering in Machine Learning
Introduction to Clustering: It is basically a type of unsupervised learning method. An
unsupervised learning method is a method in which we draw references from datasets consisting
of input data without labeled responses. Generally, it is used as a process to find meaningful
structure, explanatory underlying processes, generative features, and groupings inherent in a set
of examples.

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to one another and dissimilar to the data points in other groups. It is basically a grouping of objects on the basis of the similarity and dissimilarity between them.

For example, the data points in the graph below clustered together can be classified into one
single group. We can distinguish the clusters, and we can identify that there are 3 clusters in the
below picture.

It is not necessary for clusters to be spherical. [Figure: clusters of arbitrary shape, as found by DBSCAN (Density-Based Spatial Clustering of Applications with Noise).] In distance-based clustering, data points are grouped using the basic concept that a data point lies within a given constraint (distance) from the cluster center; various distance methods and techniques are used for the identification of outliers.
Why Clustering?
Clustering is very much important as it determines the intrinsic grouping among the unlabeled
data present. There are no criteria for good clustering. It depends on the user, and what criteria
they may use which satisfy their need. For instance, we could be interested in finding
representatives for homogeneous groups (data reduction), finding “natural clusters” and
describing their unknown properties (“natural” data types), in finding useful and suitable
groupings (“useful” data classes) or in finding unusual data objects (outlier detection). Every clustering algorithm must make some assumptions about what constitutes the similarity of points, and each assumption yields different and equally valid clusters.
Clustering Methods:
• Density-Based Methods: These methods consider the clusters as the dense region
having some similarities and differences from the lower dense region of the space. These
methods have good accuracy and the ability to merge two clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points to Identify Clustering Structure), etc.
• Hierarchical Based Methods: The clusters formed in this method form a tree-type
structure based on the hierarchy. New clusters are formed using the previously formed one.
It is divided into two categories
• Agglomerative (bottom-up approach)
• Divisive (top-down approach)
Examples CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing
Clustering and using Hierarchies), etc.
• Partitioning Methods: These methods partition the objects into k clusters, and each partition forms one cluster. They optimize an objective criterion similarity function, for instance when distance is a major parameter. Examples: K-means, CLARANS (Clustering Large Applications based upon Randomized Search), etc.
• Grid-based Methods: In these methods, the data space is formulated into a finite number of cells that form a grid-like structure. All the clustering operations done on these grids are fast and independent of the number of data objects. Examples: STING (Statistical Information Grid), WaveCluster, CLIQUE (Clustering In Quest), etc.
Clustering Algorithms: K-means clustering algorithm – It is the simplest unsupervised learning algorithm that solves the clustering problem. The K-means algorithm partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean serving as the prototype of the cluster.

Applications of Clustering in different fields:


1. Marketing: It can be used to characterize & discover customer segments for marketing
purposes.
2. Biology: It can be used for classification among different species of plants and animals.
3. Libraries: It is used in clustering different books on the basis of topics and information.
4. Insurance: It is used to acknowledge the customers, their policies and identifying the frauds.
5. City Planning: It is used to make groups of houses and to study their values based on their
geographical locations and other factors present.
6. Earthquake studies: By learning the earthquake-affected areas we can determine the
dangerous zones.
7. Image Processing: Clustering can be used to group similar images together, classify
images based on content, and identify patterns in image data.
8. Genetics: Clustering is used to group genes that have similar expression patterns and
identify gene networks that work together in biological processes.
9. Finance: Clustering is used to identify market segments based on customer behavior,
identify patterns in stock market data, and analyze risk in investment portfolios.
10. Customer Service: Clustering is used to group customer inquiries and complaints into
categories, identify common issues, and develop targeted solutions.
11. Manufacturing: Clustering is used to group similar products together, optimize production
processes, and identify defects in manufacturing processes.
12. Medical diagnosis: Clustering is used to group patients with similar symptoms or diseases,
which helps in making accurate diagnoses and identifying effective treatments.
13. Fraud detection: Clustering is used to identify suspicious patterns or anomalies in
financial transactions, which can help in detecting fraud or other financial crimes.
14. Traffic analysis: Clustering is used to group similar patterns of traffic data, such as peak
hours, routes, and speeds, which can help in improving transportation planning and
infrastructure.
15. Social network analysis: Clustering is used to identify communities or groups within
social networks, which can help in understanding social behavior, influence, and trends.
16. Cybersecurity: Clustering is used to group similar patterns of network traffic or system
behavior, which can help in detecting and preventing cyberattacks.
17. Climate analysis: Clustering is used to group similar patterns of climate data, such as
temperature, precipitation, and wind, which can help in understanding climate change and
its impact on the environment.
18. Sports analysis: Clustering is used to group similar patterns of player or team performance
data, which can help in analyzing player or team strengths and weaknesses and making
strategic decisions.
19. Crime analysis: Clustering is used to group similar patterns of crime data, such as
location, time, and type, which can help in identifying crime hotspots, predicting future crime
trends, and improving crime prevention strategies.

Fundamentally, all clustering methods use the same approach: first we calculate similarities, and then we use them to cluster the data points into groups or batches. Here we will focus on the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering method.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Clusters are dense regions in the data space, separated by regions of the lower density of points.
The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key idea
is that for each point of a cluster, the neighborhood of a given radius has to contain at least a
minimum number of points.

Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding
spherical-shaped clusters or convex clusters. In other words, they are suitable only for compact
and well-separated clusters. Moreover, they are also severely affected by the presence of noise
and outliers in the data.
Real-life data may contain irregularities, like:
1. Clusters can be of arbitrary shape such as those shown in the figure below.
2. Data may contain noise.

The figure above shows a data set containing non-convex shape clusters and outliers. Given such
data, the k-means algorithm has difficulties in identifying these clusters with arbitrary shapes.

Parameters Required for DBSCAN Algorithm

1. eps: It defines the neighborhood around a data point i.e. if the distance between two points
is lower or equal to ‘eps’ then they are considered neighbors. If the eps value is chosen too
small then a large part of the data will be considered as an outlier. If it is chosen very large
then the clusters will merge and the majority of the data points will be in the same clusters.
One way to find the eps value is based on the k-distance graph.
2. MinPts: Minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1. The minimum value of MinPts must be at least 3.
3. In this algorithm, we have 3 types of data points:
Core Point: A point is a core point if it has more than MinPts points within eps.
Border Point: A point which has fewer than MinPts points within eps but lies in the neighborhood of a core point.
Noise or outlier: A point which is neither a core point nor a border point.
Steps Used in DBSCAN Algorithm

1. Find all the neighbor points within eps and identify the core points, i.e., the points with more than MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all the density-connected points of each core point and assign them to the same cluster as the core point. Points a and b are said to be density connected if there exists a point c which has a sufficient number of points in its neighborhood and both a and b are within the eps distance of c. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is connected to a.

4. Iterate through the remaining unvisited points in the dataset. Those points that do not belong
to any cluster are noise.
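A minimal DBSCAN sketch is shown below, assuming scikit-learn; eps and min_samples are illustrative values that would normally be tuned (e.g., via the k-distance graph mentioned above), and make_moons supplies the kind of non-spherical clusters DBSCAN handles well.

# Minimal DBSCAN sketch on non-convex data (illustrative parameters).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # cluster index per point; -1 marks noise/outliers
print("Clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("Noise points  :", list(labels).count(-1))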

Distance based
The dataset is divided into a set of k groups, where k defines the number of pre-defined groups. The cluster centers are created in such a way that the distance between the data points of one cluster and its centroid is minimal compared to the distance to the other cluster centroids.

What is K-Means Algorithm?


K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled
dataset into different clusters. Here K defines the number of pre-defined clusters that
need to be created in the process, as if K=2, there will be two clusters, and for K=3, there
will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.

It allows us to cluster the data into different groups and a convenient way to discover the
categories of groups in the unlabeled dataset on its own without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
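Before walking through the visual plots, here is a minimal sketch of these steps using scikit-learn's KMeans; the six 2-D points are illustrative assumptions.

# Minimal K-means sketch for K = 2 (illustrative data; assumes scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print("Cluster labels  :", kmeans.labels_)  # one cluster index per point
print("Final centroids :\n", kmeans.cluster_centers_)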

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables
is given below:

o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into
different clusters. It means here we will try to group these datasets into two different
clusters.
o We need to choose some random k points or centroid to form the cluster. These points
can be either the points from the dataset or any other point. So, here we are selecting the
below two points as k points, which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or centroid.
We will compute it by applying some mathematics that we have studied to calculate the
distance between two points. So, we will draw a median between both the centroids.
Consider the below image:

From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points on the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, so we will repeat the process by choosing a new
centroid. To choose the new centroids, we will compute the center of gravity of these
centroids, and will find new centroids as below:

o Next, we will reassign each datapoint to the new centroid. For this, we will repeat the
same process of finding a median line. The median will be like below image:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are on the right of the line. So, these three points will be assigned to new centroids.

Since reassignment has taken place, we go back to step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of centroids, so the new
centroids will be as shown in the below image:

o As we have the new centroids, we will again draw the median line and reassign the data points. So, the image will be:

o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:

How to choose the value of "K number of clusters"
in K-means Clustering?
The performance of the K-means clustering algorithm depends upon the highly efficient clusters that it forms, but choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters; here we discuss the most appropriate method to find the number of clusters, or value of K. The method is given below:

Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the
value of WCSS (for 3 clusters) is given below:

\[ WCSS = \sum_{P_i \in Cluster_1} \text{distance}(P_i, C_1)^2 + \sum_{P_i \in Cluster_2} \text{distance}(P_i, C_2)^2 + \sum_{P_i \in Cluster_3} \text{distance}(P_i, C_3)^2 \]

In the above formula of WCSS, \( \sum_{P_i \in Cluster_1} \text{distance}(P_i, C_1)^2 \) is the sum of the squared distances between each data point and its centroid within cluster 1, and the same holds for the other two terms.

To measure the distance between data points and centroid, we can use any method such
as Euclidean distance or Manhattan distance.

To find the optimal value of clusters, the elbow method follows the below steps:

o It executes the K-means clustering on a given dataset for different K values (ranges from
1-10).
o For each value of K, calculates the WCSS value.
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend, at which the plot looks like an arm, is considered the best value of K.

Since the graph shows a sharp bend that looks like an elbow, this is known as the elbow method. The graph for the elbow method looks like the below image:
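In code, the elbow method reduces to a simple loop over K, assuming scikit-learn and matplotlib are available; KMeans exposes the WCSS value as its inertia_ attribute, and the data here is illustrative.

# Elbow method sketch: WCSS (inertia_) versus K (illustrative data).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this K

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()  # the sharp bend (elbow) suggests the best K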

Hierarchical Clustering in Machine Learning


Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis or HCA.

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-
shaped structure is known as the dendrogram.

Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ in how they work: in hierarchical clustering there is no requirement to predetermine the number of clusters as we did in the K-means algorithm.
The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts with taking all data points as single clusters and merging them until one cluster is left.
2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach.

Why hierarchical clustering?


As we already have other clustering algorithms such as K-Means Clustering, why do we need hierarchical clustering? As we have seen, K-means clustering has some challenges: it requires a predetermined number of clusters, and it always tries to create clusters of the same size. To solve these two challenges, we can opt for the hierarchical clustering algorithm, because in this algorithm we don't need to know the predefined number of clusters.

In this topic, we will discuss the Agglomerative Hierarchical clustering algorithm.

Agglomerative Hierarchical clustering


The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the datasets into clusters, it follows the bottom-up approach. This means the algorithm considers each data point as a single cluster at the beginning and then starts combining the closest pairs of clusters. It does this until all the clusters are merged into a single cluster that contains all the data points.

This hierarchy of clusters is represented in the form of the dendrogram.

How Does the Agglomerative Hierarchical Clustering Work?
The working of the AHC algorithm can be explained using the below steps:
o Step-1: Create each data point as a single cluster. Let's say there are N data points, so
the number of clusters will also be N.

o Step-2: Take two closest data points or clusters and merge them to form one cluster. So,
there will now be N-1 clusters.

o Step-3: Again, take the two closest clusters and merge them together to form one
cluster. There will be N-2 clusters.

o Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following clusters. Consider the below images:
o Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram
to divide the clusters as per the problem.

Measure for the distance between two clusters


As we have seen, the closest distance between the two clusters is crucial for the
hierarchical clustering. There are various ways to calculate the distance between two
clusters, and these ways decide the rule for clustering. These measures are called Linkage
methods. Some of the popular linkage methods are given below:

1. Single Linkage: It is the shortest distance between the closest points of the two clusters. Consider the below image:

2. Complete Linkage: It is the farthest distance between the two points of two different
clusters. It is one of the popular linkage methods as it forms tighter clusters than single-
linkage.
3. Average Linkage: It is the linkage method in which the distance between each pair of data points (one from each cluster) is added up and then divided by the total number of pairs to calculate the average distance between two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between the centroid
of the clusters is calculated. Consider the below image:

From the above-given approaches, we can apply any of them according to the type of
problem or business requirement.
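As a small sketch of agglomerative clustering with one of these linkage methods, SciPy's linkage and dendrogram functions can be used; the six 2-D points are illustrative, and the dendrogram itself is discussed next.

# Agglomerative clustering sketch with single linkage (assumes SciPy/matplotlib).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[1, 1], [1.2, 1.1], [3, 3],
              [3.1, 2.9], [7, 7], [7.2, 6.8]])

Z = linkage(X, method="single")  # repeatedly merges the two closest clusters
dendrogram(Z)                    # tree of merges; height = merge distance
plt.ylabel("Merge distance")
plt.show()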

Working of Dendrogram in Hierarchical clustering


The dendrogram is a tree-like structure that is mainly used to store each step as a memory
that the HC algorithm performs. In the dendrogram plot, the Y-axis shows the Euclidean
distances between the data points, and the x-axis shows all the data points of the given
dataset.

The working of the dendrogram can be explained using the below diagram:

In the above diagram, the left part is showing how clusters are created in agglomerative
clustering, and the right part is showing the corresponding dendrogram.
o As we have discussed above, first the data points P2 and P3 combine and form a cluster; correspondingly, a dendrogram is created which connects P2 and P3 with a rectangular shape. The height is decided according to the Euclidean distance between the data points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram,
and P4, P5, and P6, in another dendrogram.
o At last, the final dendrogram is created that combines all the data points together.

Cluster Validity
Why do we need cluster validity indices?

o To compare clustering algorithms.


o To compare two sets of clusters.
o To compare two clusters, i.e., to determine which one is better in terms of compactness and connectedness.
o To determine whether random structure exists in the data due to noise.
o Generally, cluster validity measures are categorized into 3 classes:

o Internal cluster validation: The clustering result is evaluated based on the data clustered
itself (internal information) without reference to external information.
o External cluster validation: Clustering results are evaluated based on some externally
known result, such as externally provided class labels.
o Relative cluster validation: The clustering results are evaluated by varying different
parameters for the same algorithm (e.g. changing the number of clusters).
o Besides the term cluster validity index, we need to know about the inter-cluster distance d(a, b) between two clusters a and b, and the intra-cluster index D(a) of a cluster a.
Inter-cluster distance d(a, b) between two clusters a and b can be –
o Single linkage distance: Closest distance between two objects belonging
to a and b respectively.
o Complete linkage distance: Distance between two most remote objects belonging
to a and b respectively.
o Average linkage distance: Average distance between all the objects belonging
to a and b respectively.

o Centroid linkage distance: Distance between the centroids of the two clusters a and b respectively.

o Intra-cluster distance D(a) of a cluster a can be –

o Complete diameter linkage distance: Distance between two farthest objects belonging to
cluster a.

o Average diameter linkage distance: Average distance between all the objects belonging
to cluster a.

o Centroid diameter linkage distance: Twice the average distance between all the objects
and the centroid of the cluster a.

Dimensionality Reduction

o Dimensionality reduction is a technique used to reduce the number of features in a dataset


while retaining as much of the important information as possible. In other words, it is a
process of transforming high-dimensional data into a lower-dimensional space that still
preserves the essence of the original data.
o In machine learning, high-dimensional data refers to data with a large number of features
or variables. The curse of dimensionality is a common problem in machine learning, where
the performance of the model deteriorates as the number of features increases. This is
because the complexity of the model increases with the number of features, and it
becomes more difficult to find a good solution. In addition, high-dimensional data can
also lead to overfitting, where the model fits the training data too closely and does not
generalize well to new data.
o Dimensionality reduction can help to mitigate these problems by reducing the complexity
of the model and improving its generalization performance. There are two main
approaches to dimensionality reduction: feature selection and feature extraction.
o Feature Selection

Feature selection involves selecting a subset of the original features that are most relevant
to the problem at hand. The goal is to reduce the dimensionality of the dataset while
retaining the most important features. There are several methods for feature selection,
including filter methods, wrapper methods, and embedded methods. Filter methods rank
the features based on their relevance to the target variable, wrapper methods use the
model performance as the criteria for selecting features, and embedded methods combine
feature selection with the model training process.
o Feature Extraction:

Feature extraction involves creating new features by combining or transforming the original features. The goal is to create a set of features that captures the essence of the original data in a lower-dimensional space. There are several methods for feature extraction, including principal component analysis (PCA), linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE). PCA is a popular technique that projects the original features onto a lower-dimensional space while preserving as much of the variance as possible.

Why is Dimensionality Reduction important in Machine Learning and Predictive Modeling?

An intuitive example of dimensionality reduction can be discussed through a simple e-mail classification problem, where we need to classify whether the e-mail is spam or not.
This can involve a large number of features, such as whether or not the e-mail has a generic
title, the content of the e-mail, whether the e-mail uses a template, etc. However, some of
these features may overlap. In another condition, a classification problem that relies on
both humidity and rainfall can be collapsed into just one underlying feature, since both of
the aforementioned are correlated to a high degree. Hence, we can reduce the number of
features in such problems. A 3-D classification problem can be hard to visualize, whereas
a 2-D one can be mapped to a simple 2-dimensional space, and a 1-D problem to a simple
line. The below figure illustrates this concept, where a 3-D feature space is split into two
2-D feature spaces, and later, if found to be correlated, the number of features can be
reduced even further.
Components of Dimensionality Reduction

There are two components of dimensionality reduction:

• Feature selection: In this, we try to find a subset of the original set of variables, or features,
to get a smaller subset which can be used to model the problem. It usually involves three
ways:
1. Filter
2. Wrapper
3. Embedded
• Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space, i.e., a space with a smaller number of dimensions.

Methods of Dimensionality Reduction

The various methods used for dimensionality reduction include:

• Principal Component Analysis (PCA)


• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)

Dimensionality reduction may be both linear and non-linear, depending upon the method
used. The prime linear method, called Principal Component Analysis, or PCA, is discussed
below.

Principal Component Analysis

This method was introduced by Karl Pearson. It works on the condition that while the data
in a higher dimensional space is mapped to data in a lower dimension space, the variance
of the data in the lower dimensional space should be maximum.
It involves the following steps:

• Construct the covariance matrix of the data.


• Compute the eigenvectors of this matrix.
• Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large
fraction of variance of the original data.

Hence, we are left with a lesser number of eigenvectors, and there might have been some
data loss in the process. But, the most important variances should be retained by the
remaining eigenvectors.
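A minimal PCA sketch along these lines, assuming scikit-learn, is shown below; standardizing first is a common practice because PCA is sensitive to feature scales, and the dataset is illustrative.

# PCA sketch: project 4-D data onto the top-2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)       # built on the eigenvectors of the covariance matrix
print("Explained variance ratio:", pca.explained_variance_ratio_)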

Advantages of Dimensionality Reduction

• It helps in data compression, and hence reduced storage space.


• It reduces computation time.
• It also helps remove redundant features, if any.
• Improved Visualization: High dimensional data is difficult to visualize, and dimensionality
reduction techniques can help in visualizing the data in 2D or 3D, which can help in better
understanding and analysis.
• Overfitting Prevention: High dimensional data may lead to overfitting in machine
learning models, which can lead to poor generalization performance. Dimensionality
reduction can help in reducing the complexity of the data, and hence prevent overfitting.
• Feature Extraction: Dimensionality reduction can help in extracting important features
from high dimensional data, which can be useful in feature selection for machine learning
models.
• Data Preprocessing: Dimensionality reduction can be used as a preprocessing step before
applying machine learning algorithms to reduce the dimensionality of the data and hence
improve the performance of the model.
• Improved Performance: Dimensionality reduction can help in improving the performance
of machine learning models by reducing the complexity of the data, and hence reducing
the noise and irrelevant information in the data.

Disadvantages of Dimensionality Reduction

• It may lead to some amount of data loss.


• PCA tends to find linear correlations between variables, which is sometimes undesirable.
• PCA fails in cases where mean and covariance are not enough to define datasets.
• We may not know how many principal components to keep; in practice, some rules of thumb are applied.
• Interpretability: The reduced dimensions may not be easily interpretable, and it may be
difficult to understand the relationship between the original features and the reduced
dimensions.
• Overfitting: In some cases, dimensionality reduction may lead to overfitting, especially
when the number of components is chosen based on the training data.
• Sensitivity to outliers: Some dimensionality reduction techniques are sensitive to outliers,
which can result in a biased representation of the data.
• Computational complexity: Some dimensionality reduction techniques, such as manifold
learning, can be computationally intensive, especially when dealing with large datasets.

Recommendation Systems
Recommender systems, also known as recommendation systems, are machine learning
algorithms that use data to recommend items or content to users based on their
preferences, past behavior, or a combination of the two. These systems can recommend various items, such as movies, books, music, and products.

The two main kinds are content-based filtering (which takes into account the
characteristics of products and user profiles) and collaborative filtering (which generates
recommendations based on user behaviour and preferences). Hybrid strategies that
integrate the two approaches are also popular. These kinds of systems improve user
experiences, boost user involvement, and propel corporate expansion.

Recommender System is of different types:

• Content-Based Recommendation: It is a supervised machine learning approach used to induce a classifier to discriminate between interesting and uninteresting items for the user.

• Collaborative Filtering: Collaborative Filtering recommends items based on similarity measures between users and/or items. The basic assumption behind the algorithm is that users with similar interests have common preferences.

Content-Based Recommendation System

Content-based systems recommend items to the customer that are similar to items previously rated highly by the customer. They use the features and properties of the items; from these properties, they can calculate the similarity between items.

In a content-based recommendation system, we first need to create a profile for each item, which represents the properties of that item. A user profile is then inferred for a particular user. We use these user profiles to recommend items to the users from the catalog.

Item profile

In a content-based recommendation system, we need to build a profile for each item, which contains the important properties of that item. For example, if a movie is an item, then its actors, director, release year, and genre are its important properties; for a document, the important properties are the type of content and the set of important words in it.

Let’s have a look at how to create an item profile. First, we need to perform TF-IDF vectorization. Here, the TF (term frequency) of a word is the number of times it appears in a document, and the IDF (inverse document frequency) of a word is a measure of how significant that term is in the whole corpus.

TF-IDF Vectorizer

• Term Frequency (TF): Term frequency, or TF for short, is a key idea in information
retrieval and natural language processing. It displays the regularity with which a certain
term or word occurs in a text corpus or document. TF is used to rank terms in a
document according to their relative value or significance.
The term frequency can be calculated as:

\[ TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}} \]

where f_{ij} is the frequency of term (feature) i in document (item) j, normalized by the frequency of the most frequent term in that document.


For a variety of text analysis tasks, such as information retrieval, document classification,
and sentiment analysis, the yielded TF value can be used to identify important terms in a
document. It offers a framework for figuring out how relevant a word is in a particular
situation.

• Inverse-document Frequency(IDF): The measure known as Inverse Document


Frequency (IDF) is employed in text analysis and information retrieval to evaluate the
significance of phrases within a set of documents. IDF measures how uncommon or unique
a term is in the corpus. To compute it, take the reciprocal of the fraction of documents
that include the term and logarithmize it. Common terms have lower IDF values, while rare
terms have higher values. IDF is an essential part of the TF-IDF (Term Frequency-Inverse
Document Frequency) method, which uses it to assess the relative importance of terms in
different documents. To improve information representation and retrieval from massive
text datasets, IDF is used in tasks including document ranking, categorization, and text
mining.
The inverse document frequency can be calculated as:

\[ IDF_i = \log_2\left(\frac{N}{n_i}\right) \]

where n_i is the number of documents that mention term i, and N is the total number of documents.

A numerical statistic called Term Frequency-Inverse Document Frequency (TF-IDF) is


employed in information retrieval and natural language processing. The term’s significance
within a document is assessed in relation to a group of documents (the corpus). TF
emphasizes terms with greater frequencies by measuring a term’s frequency of occurrence
in a document. IDF evaluates a term’s rarity within the corpus, emphasizing terms that are
distinct. A weighted score is produced for each term in a document by multiplying TF and
IDF together to compute TF-IDF.

Therefore, the total formula is:

\[ \text{TF-IDF score}(i, j) = TF_{ij} \times IDF_i \]

User profile

The user profile is a vector that describes the user's preferences. During the creation of the user's profile, we use a utility matrix that describes the relationship between users and items. From this information, the best estimate of which items the user likes is some aggregation of the profiles of those items.
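A hedged end-to-end sketch of this idea is shown below: TF-IDF item profiles are built from made-up movie descriptions, and items similar to one the user liked are found via cosine similarity (scikit-learn is assumed).

# Content-based sketch: TF-IDF item profiles + cosine similarity (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "Movie A": "space adventure aliens spaceship",
    "Movie B": "romantic comedy love wedding",
    "Movie C": "space battle aliens galaxy war",
}
profiles = TfidfVectorizer().fit_transform(items.values())

sim = cosine_similarity(profiles)  # item-to-item similarity matrix
ranked = sim[0].argsort()[::-1]    # suppose the user liked "Movie A" (index 0)
print("Most similar to Movie A:", list(items)[ranked[1]])  # skip the item itself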

Advantages and Disadvantages

• Advantages:
o No need for data on other users when applying to similar users.
o Able to recommend to users with unique tastes.
o Able to recommend new & popular items
o Explanations for recommended items.

• Disadvantages:
o Finding the appropriate feature is hard.
o Doesn’t recommend items outside the user profile.

Collaborative Filtering

Collaborative filtering is based on the idea that similar people (based on the data) generally tend
to like similar things. It predicts which item a user will like based on the item preferences of other
similar users.

Collaborative filtering uses a user-item matrix to generate recommendations. This matrix contains
the values that indicate a user’s preference towards a given item. These values can represent either
explicit feedback (direct user ratings) or implicit feedback (indirect user behavior such as listening,
purchasing, watching).

• Explicit Feedback: Data that is collected from users when they choose to provide it. Often, users choose not to provide such data, so this data is scarce and sometimes costly to collect. For example, ratings from the user.

• Implicit Feedback: In implicit feedback, we track user behavior to predict the user's preferences.

Advantages and Disadvantages

• Advantages:
o No need for domain knowledge, because the embeddings are learned automatically.
o Captures inherent, subtle characteristics.

• Disadvantages:
o Cannot handle fresh items due to the cold-start problem.
o Hard to add any new features that may improve the quality of the model.
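A minimal user-based collaborative filtering sketch on a tiny explicit-feedback matrix (rows are users, columns are items, 0 means unrated) might look like the following; the ratings are invented for illustration.

# User-based collaborative filtering sketch (toy ratings; assumes scikit-learn).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)

user_sim = cosine_similarity(ratings)  # user-to-user similarity
target = 0                             # recommend for user 0
scores = user_sim[target] @ ratings    # similarity-weighted rating sums
scores[ratings[target] > 0] = -np.inf  # mask items the user already rated
print("Recommend item:", int(np.argmax(scores)))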

EM algorithm.
o The Expectation-Maximization (EM) algorithm is an iterative optimization method
that combines different unsupervised machine learning algorithms to find
maximum likelihood or maximum posterior estimates of parameters in statistical
models that involve unobserved latent variables. The EM algorithm is commonly
used for latent variable models and can handle missing data. It consists of an
estimation step (E-step) and a maximization step (M-step), forming an iterative
process to improve model fit.
o In the E step, the algorithm computes the latent variables i.e. expectation of the
log-likelihood using the current parameter estimates.
o In the M step, the algorithm determines the parameters that maximize the
expected log-likelihood obtained in the E step, and corresponding model
parameters are updated based on the estimated latent variables.
[Figure: Expectation-Maximization in the EM algorithm.]
o By iteratively repeating these steps, the EM algorithm seeks to maximize the
likelihood of the observed data. It is commonly used for unsupervised learning
tasks, such as clustering, where latent variables are inferred and has applications in
various fields, including machine learning, computer vision, and natural language
processing.

Key Terms in Expectation-Maximization (EM) Algorithm

o Some of the most commonly used key terms in the Expectation-Maximization (EM)
Algorithm are as follows:
o Latent Variables: Latent variables are unobserved variables in statistical models
that can only be inferred indirectly through their effects on observable variables.
They cannot be directly measured but can be detected by their impact on the
observable variables.
o Likelihood: It is the probability of observing the given data given the parameters
of the model. In the EM algorithm, the goal is to find the parameters that maximize
the likelihood.
o Log-Likelihood: It is the logarithm of the likelihood function, which measures the
goodness of fit between the observed data and the model. EM algorithm seeks to
maximize the log-likelihood.
o Maximum Likelihood Estimation (MLE): MLE is a method to estimate the
parameters of a statistical model by finding the parameter values that maximize
the likelihood function, which measures how well the model explains the observed
data.
o Posterior Probability: In the context of Bayesian inference, the EM algorithm can
be extended to estimate the maximum a posteriori (MAP) estimates, where the
posterior probability of the parameters is calculated based on the prior distribution
and the likelihood function.
o Expectation (E) Step: The E-step of the EM algorithm computes the expected
value or posterior probability of the latent variables given the observed data and
current parameter estimates. It involves calculating the probabilities of each latent
variable for each data point.
o Maximization (M) Step: The M-step of the EM algorithm updates the parameter
estimates by maximizing the expected log-likelihood obtained from the E-step. It
involves finding the parameter values that optimize the likelihood function,
typically through numerical optimization methods.
o Convergence: Convergence refers to the condition when the EM algorithm has
reached a stable solution. It is typically determined by checking if the change in
the log-likelihood or the parameter estimates falls below a predefined threshold.

How Expectation-Maximization (EM) Algorithm Works:

o The essence of the Expectation-Maximization algorithm is to use the available observed data of the dataset to estimate the missing data and then use that data to update the values of the parameters. Let us understand the EM algorithm in detail.
o Initialization:
o Initially, a set of initial values of the parameters are considered. A set of incomplete
observed data is given to the system with the assumption that the observed data
comes from a specific model.
o E-Step (Expectation Step): In this step, we use the observed data in order to
estimate or guess the values of the missing or incomplete data. It is basically used
to update the variables.
o Compute the posterior probability or responsibility of each latent variable given
the observed data and current parameter estimates.
o Estimate the missing or incomplete data values using the current parameter
estimates.
o Compute the log-likelihood of the observed data based on the current parameter
estimates and estimated missing data.
o M-step (Maximization Step): In this step, we use the complete data generated in
the preceding “Expectation” – step in order to update the values of the parameters.
It is basically used to update the hypothesis.
o Update the parameters of the model by maximizing the expected complete data
log-likelihood obtained from the E-step.
o This typically involves solving optimization problems to find the parameter values
that maximize the log-likelihood.
o The specific optimization technique used depends on the nature of the problem
and the model being used.
o Convergence: In this step, it is checked whether the values are converging or not; if yes, then stop, otherwise repeat the E-step and the M-step until convergence occurs.
o Check for convergence by comparing the change in log-likelihood or the
parameter values between iterations.
o If the change is below a predefined threshold, stop and consider the algorithm
converged.
o Otherwise, go back to the E-step and repeat the process until convergence is
achieved.
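In practice, the E and M steps above are implemented by libraries such as scikit-learn; the sketch below fits a Gaussian mixture model with EM on an illustrative synthetic dataset.

# EM in practice: Gaussian mixture fitting (illustrative data; assumes scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
print("Converged:", gmm.converged_)          # the convergence check described above
print("Avg. log-likelihood:", gmm.score(X))  # the quantity EM seeks to maximize
print("Mixture means:\n", gmm.means_)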

Reinforcement Learning
Reinforcement Learning (RL) is a branch of machine learning focused on making decisions to
maximize cumulative rewards in a given situation. Unlike supervised learning, which relies on a
training dataset with predefined answers, RL involves learning through experience. In RL, an agent
learns to achieve a goal in an uncertain, potentially complex environment by performing actions
and receiving feedback through rewards or penalties.

Key Concepts of Reinforcement Learning

• Agent: The learner or decision-maker.

• Environment: Everything the agent interacts with.

• State: A specific situation in which the agent finds itself.

• Action: All possible moves the agent can make.

• Reward: Feedback from the environment based on the action taken.


How Reinforcement Learning Works

RL operates on the principle of learning optimal behavior through trial and error. The agent takes
actions within the environment, receives rewards or penalties, and adjusts its behavior to maximize
the cumulative reward. This learning process is characterized by the following elements:

• Policy: A strategy used by the agent to determine the next action based on the current
state.

• Reward Function: A function that provides a scalar feedback signal based on the state
and action.

• Value Function: A function that estimates the expected cumulative reward from a given
state.

• Model of the Environment: A representation of the environment that helps in planning by predicting future states and rewards.

Example: Navigating a Maze

The problem is as follows: we have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward. The following example explains the problem more clearly.
The above image shows the robot, diamond, and fire. The goal of the robot is to get the reward, the diamond, and avoid the hurdles, the fire. The robot learns by trying all the possible paths and then choosing the path which gives it the reward with the fewest hurdles. Each right step gives the robot a reward, and each wrong step subtracts from the robot's reward. The total reward is calculated when it reaches the final reward, the diamond.

Main points in Reinforcement learning –

• Input: The input should be an initial state from which the model will start

• Output: There are many possible outputs as there are a variety of solutions to a particular
problem

• Training: The training is based upon the input; the model will return a state, and the user will decide to reward or punish the model based on its output.

• The model continues to learn.

• The best solution is decided based on the maximum reward.

Types of Reinforcement:

1. Positive: Positive reinforcement is defined as an event, occurring due to a particular behavior, that increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.
Advantages of positive reinforcement:

• Maximizes performance
• Sustains change for a long period of time
A drawback is that too much reinforcement can lead to an overload of states, which can diminish the results.

2. Negative: Negative reinforcement is defined as the strengthening of a behavior because a negative condition is stopped or avoided.
Advantages of negative reinforcement:

• Increases behavior
• Provides defiance to a minimum standard of performance
A drawback is that it only provides enough to meet the minimum behavior.

Elements of Reinforcement Learning

i) Policy: Defines the agent’s behavior at a given time.

ii) Reward Function: Defines the goal of the RL problem by providing feedback.

iii) Value Function: Estimates long-term rewards from a state.

iv) Model of the Environment: Helps in predicting future states and rewards for planning.

Difference between Reinforcement learning and Supervised learning:

• Reinforcement learning is all about making decisions sequentially; in simple words, the output depends on the state of the current input, and the next input depends on the output of the previous input. In supervised learning, the decision is made on the initial input, i.e., the input given at the start.

• In reinforcement learning, decisions are dependent, so we give labels to sequences of dependent decisions. In supervised learning, the decisions are independent of each other, so labels are given to each decision.

• Examples of reinforcement learning: a chess game, text summarization. Examples of supervised learning: object recognition, spam detection.
Model based Learning
Model-based learning involves creating a mathematical model that can predict outcomes based
on input data. The model is trained on a large dataset and then used to make predictions on new
data. The model can be thought of as a set of rules that the machine uses to make predictions.

In model-based learning, the training data is used to create a model that can be generalized to
new data. The model is typically created using statistical algorithms such as linear regression,
logistic regression, decision trees, and neural networks. These algorithms use the training data to
create a mathematical model that can be used to predict outcomes.

Advantages of Model-Based Learning

1. Faster predictions: Model-based learning is typically faster than instance-based learning


because the model is already created and can be used to make predictions quickly.
2. More accurate predictions: Model-based learning can often make more accurate
predictions than instance-based learning because the model is trained on a large dataset
and can generalize to new data.
3. Better understanding of data: Model-based learning allows you to gain a better understanding of the relationships between input and output variables. This can help identify which variables are most important in making predictions.
Disadvantages of Model-Based Learning

1. Requires a large dataset: model-based learning requires a large dataset to train the model.
This can be a disadvantage if you have a small dataset.
2. Requires expert knowledge: Model-based learning requires expert knowledge of statistical
algorithms and mathematical modeling. This can be a disadvantage if you don’t have the
expertise to create the model.

Example of Model-Based Learning

An example of model-based learning is predicting the price of a house based on its size, number
of rooms, location, and other features. In this case, a model could be created using linear
regression to predict the price of the house based on these features. The model would be trained
on a dataset of house prices and features and then used to make predictions on new data.
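
A minimal sketch of this example using scikit-learn is shown below; the feature columns and the tiny in-line dataset are made-up illustrations, since the original example does not specify actual data.

from sklearn.linear_model import LinearRegression

# Hypothetical training data: [size in sq. ft., number of rooms] -> price
X_train = [[1400, 3], [1600, 3], [1700, 4], [1875, 4], [2350, 5]]
y_train = [245000, 312000, 279000, 308000, 450000]

model = LinearRegression()
model.fit(X_train, y_train)                 # learn the coefficients of the linear model

# Use the trained model to predict the price of a new, unseen house
print(model.predict([[1500, 3]]))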

Temporal Difference Learning.

Temporal Difference (TD) learning is a model-free reinforcement learning technique that aims to align the expected prediction with the latest prediction, matching expectations with actual outcomes and progressively enhancing the accuracy of the overall prediction chain. It also seeks to predict a combination of the immediate reward and its own reward prediction at the same moment.

In temporal difference learning, the signal used for training a prediction comes from a future
prediction. This approach is a combination of the Monte Carlo (MC) technique and the Dynamic
Programming (DP) technique. Monte Carlo methods modify their estimates only after the final
result is known, whereas temporal difference techniques adjust predictions to match later, more
precise predictions for the future, well before knowing the final outcome. This is essentially a type
of bootstrapping.
Parameters used in Temporal Difference Learning

The most common parameters used in temporal difference learning are −

• Alpha (α) − The learning rate, which varies between 0 and 1. It determines how much our estimates should be adjusted based on the error.
• Gamma (γ) − The discount rate, which varies between 0 and 1. A large discount rate signifies that future rewards are valued to a greater extent.
• Epsilon (ε) − The exploration rate: the agent examines new possibilities with probability ε and sticks with the existing maximum with probability 1 − ε. A greater ε indicates that more exploration takes place during training.

Temporal Difference Learning Algorithms

The main goal of Temporal Difference (TD) learning is to estimate the value function V(s), which represents the expected future reward starting from the state s.
Following is the list of algorithms used in TD learning −

1. TD(λ) Algorithm

TD(λ) is a reinforcement learning algorithm that combines concepts from both Monte Carlo methods and TD(0). It calculates the value function by taking a weighted average of the n-step returns from the agent's trajectory, with the weights determined by λ.

• When λ = 0, it corresponds to TD(0), where only the latest reward and the value of the next state are considered in updating the estimate.
• When λ = 1, it corresponds to Monte Carlo methods, which update the value based on the total return from a state until the episode ends.
• If λ lies between 0 and 1, TD(λ) blends short-term TD(0) updates and Monte Carlo returns, with higher weight on more recent rewards.

2. TD (0) Algorithm

The simplest form of TD learning is the TD(0) algorithm (one-step TD learning), where the value of a state is updated based on the immediate reward and the estimated value of the next state. The update rule is:

V(st) ← V(st) + α [Rt+1 + γ V(st+1) − V(st)]

Where,

• V(st) represents the current estimate of the value of state st


• Rt+1 represents the rewards received after transitioning from state st.
• γ is the discount factor
• V(st+1) represents the estimated value of next state.
• α is the learning rate.

The rule adjusts the current estimate based on the difference between the bootstrapped return (Rt+1 + γ V(st+1)) and the current estimate V(st); this difference is known as the TD error.
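
The update rule above translates almost directly into code. Below is a minimal sketch of TD(0) value estimation on a simple five-state random walk; the environment and hyperparameter values are illustrative assumptions.

import random

# Hypothetical 5-state random walk: episodes start in the middle and move
# left or right at random. Falling off the right end yields reward 1,
# off the left end yields reward 0; every other step yields reward 0.
N_STATES, ALPHA, GAMMA = 5, 0.1, 1.0
V = [0.0] * N_STATES                        # value estimates V(s)

for episode in range(1000):
    s = N_STATES // 2                       # start in the center state
    while True:
        s_next = s + random.choice([-1, 1])
        if s_next < 0:                      # terminated off the left end
            reward, v_next, done = 0.0, 0.0, True
        elif s_next >= N_STATES:            # terminated off the right end
            reward, v_next, done = 1.0, 0.0, True
        else:
            reward, v_next, done = 0.0, V[s_next], False
        # TD(0) update: V(s) <- V(s) + alpha * [R + gamma * V(s') - V(s)]
        V[s] += ALPHA * (reward + GAMMA * v_next - V[s])
        if done:
            break
        s = s_next

print([round(v, 2) for v in V])             # approaches [0.17, 0.33, 0.5, 0.67, 0.83]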

3. TD (1) Algorithm

Temporal Difference learning with a trace length of 1 is known as TD(1), which combines Monte Carlo techniques and Dynamic Programming in a reinforcement learning setting. It is a generalized version of TD(0). The main idea behind TD(1) is to adjust the value function using the latest reward together with the prediction of upcoming rewards.
Unit-IV
Neural Networks
Neural networks are machine learning models that mimic the complex functions of the human
brain. These models consist of interconnected nodes or neurons that process data, learn patterns,
and enable tasks such as pattern recognition and decision-making.

Understanding Neural Networks in Deep Learning


Neural networks are capable of learning and identifying patterns directly from data without pre-
defined rules. These networks are built from several key components:
1. Neurons: The basic units that receive inputs, each neuron is governed by a threshold and
an activation function.
2. Connections: Links between neurons that carry information, regulated by weights and
biases.
3. Weights and Biases: These parameters determine the strength and influence of
connections.
4. Propagation Functions: Mechanisms that help process and transfer data across layers of
neurons.
5. Learning Rule: The method that adjusts weights and biases over time to improve
accuracy.
Learning in neural networks follows a structured, three-stage process:
1. Input Computation: Data is fed into the network.
2. Output Generation: Based on the current parameters, the network generates an output.
3. Iterative Refinement: The network refines its output by adjusting weights and biases,
gradually improving its performance on diverse tasks.

In an adaptive learning environment:


• The neural network is exposed to a simulated scenario or dataset.
• Parameters such as weights and biases are updated in response to new data or conditions.
• With each adjustment, the network’s response evolves, allowing it to adapt effectively to
different tasks or environments.
A common illustration draws an analogy between a biological neuron and an artificial neuron, showing how inputs are received and processed to produce outputs in both systems.
Importance of Neural Networks
Neural networks are pivotal in identifying complex patterns, solving intricate challenges, and
adapting to dynamic environments. Their ability to learn from vast amounts of data is
transformative, impacting technologies like natural language processing, self-driving vehicles,
and automated decision-making.
Neural networks streamline processes, increase efficiency, and support decision-making across
various industries. As a backbone of artificial intelligence, they continue to drive innovation,
shaping the future of technology.
Layers in Neural Network Architecture
1. Input Layer: This is where the network receives its input data. Each input neuron in the
layer corresponds to a feature in the input data.
2. Hidden Layers: These layers perform most of the computational heavy lifting. A neural
network can have one or multiple hidden layers. Each layer consists of units (neurons)
that transform the inputs into something that the output layer can use.
3. Output Layer: The final layer produces the output of the model. The format of these
outputs varies depending on the specific task (e.g., classification, regression).
Working of Neural Networks
Forward Propagation
When data is input into the network, it passes through the network in the forward direction, from
the input layer through the hidden layers to the output layer. This process is known as forward
propagation. Here’s what happens during this phase:
1. Linear Transformation: Each neuron in a layer receives inputs, which are multiplied by the weights associated with the connections. These products are summed together, and a bias is added to the sum. This can be represented mathematically as:

z = w1x1 + w2x2 + … + wnxn + b

2. Activation: The result of the linear transformation (denoted as z) is then passed through an activation function. The activation function is crucial because it introduces non-linearity into the system, enabling the network to learn more complex patterns. Popular activation functions include ReLU, sigmoid, and tanh; a short code sketch of this step follows.
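
A minimal NumPy sketch of this forward step through a single layer is shown below; the layer sizes and random inputs are arbitrary assumptions chosen for illustration.

import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=3)                      # 3 input features
W = rng.normal(size=(4, 3))                 # weights of a layer with 4 neurons
b = np.zeros(4)                             # one bias per neuron

z = W @ x + b                               # linear transformation: z = Wx + b
a = np.maximum(0, z)                        # ReLU activation introduces non-linearity
print(a)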
Backpropagation
After forward propagation, the network evaluates its performance using a loss function, which
measures the difference between the actual output and the predicted output. The goal of training
is to minimize this loss. This is where backpropagation comes into play:
1. Loss Calculation: The network calculates the loss, which provides a measure of error in
the predictions. The loss function could vary; common choices are mean squared error for
regression tasks or cross-entropy loss for classification.
2. Gradient Calculation: The network computes the gradients of the loss function with
respect to each weight and bias in the network. This involves applying the chain rule of
calculus to find out how much each part of the output error can be attributed to each
weight and bias.
3. Weight Update: Once the gradients are calculated, the weights and biases are updated
using an optimization algorithm like stochastic gradient descent (SGD). The weights are
adjusted in the opposite direction of the gradient to minimize the loss. The size of the step
taken in each update is determined by the learning rate.
Iteration
This process of forward propagation, loss calculation, backpropagation, and weight update is
repeated for many iterations over the dataset. Over time, this iterative process reduces the loss,
and the network’s predictions become more accurate.
Through these steps, neural networks can adapt their parameters to better approximate the
relationships in the data, thereby improving their performance on tasks such as classification,
regression, or any other predictive modeling.
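
Putting the steps together, the sketch below trains a single-neuron network (equivalent to logistic regression) by hand: forward propagation, loss calculation, backpropagation via the chain rule, and a gradient-descent weight update. The toy OR-gate data and the learning rate are illustrative assumptions.

import numpy as np

# Toy dataset: the OR gate (an illustrative assumption)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)

w, b, lr = np.zeros(2), 0.0, 0.5

for epoch in range(1000):
    # Forward propagation: weighted sum, then sigmoid activation
    z = X @ w + b
    p = 1.0 / (1.0 + np.exp(-z))
    # Loss calculation: binary cross-entropy
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    # Backpropagation: gradients of the loss w.r.t. w and b (chain rule)
    grad_z = (p - y) / len(y)
    grad_w = X.T @ grad_z
    grad_b = grad_z.sum()
    # Weight update: step in the opposite direction of the gradient
    w -= lr * grad_w
    b -= lr * grad_b

print(np.round(p, 2))                       # predictions approach [0, 1, 1, 1]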
Learning of a Neural Network
1. Learning with Supervised Learning
In supervised learning, a neural network learns from labeled input-output pairs provided by a
teacher. The network generates outputs based on inputs, and by comparing these outputs to the
known desired outputs, an error signal is created. The network iteratively adjusts its parameters
to minimize errors until it reaches an acceptable performance level.
2. Learning with Unsupervised Learning
Unsupervised learning involves data without labeled output variables. The primary goal is to
understand the underlying structure of the input data (X). Unlike supervised learning, there is no
instructor to guide the process. Instead, the focus is on modeling data patterns and relationships,
with techniques like clustering and association commonly used.
3. Learning with Reinforcement Learning
Reinforcement learning enables a neural network to learn through interaction with its
environment. The network receives feedback in the form of rewards or penalties, guiding it to
find an optimal policy or strategy that maximizes cumulative rewards over time. This approach is
widely used in applications like gaming and decision-making.
Types of Neural Networks
Commonly used types of neural networks include:
• Feedforward Networks: A feedforward neural network is a simple artificial neural
network architecture in which data moves from input to output in a single direction.
• Multilayer Perceptron (MLP): MLP is a type of feedforward neural network with three
or more layers, including an input layer, one or more hidden layers, and an output layer. It
uses nonlinear activation functions.
• Convolutional Neural Network (CNN): A Convolutional Neural Network (CNN) is a
specialized artificial neural network designed for image processing. It employs
convolutional layers to automatically learn hierarchical features from input images,
enabling effective image recognition and classification.
• Recurrent Neural Network (RNN): An artificial neural network type intended for
sequential data processing is called a Recurrent Neural Network (RNN). It is appropriate
for applications where contextual dependencies are critical, such as time series prediction
and natural language processing, since it makes use of feedback loops, which enable
information to survive within the network.
• Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed to
overcome the vanishing gradient problem in training RNNs. It uses memory cells and
gates to selectively read, write, and erase information.
Advantages of Neural Networks
Neural networks are widely used in many different applications because of their many benefits:
• Adaptability: Neural networks are useful for activities where the link between inputs and
outputs is complex or not well defined because they can adapt to new situations and learn
from data.
• Pattern Recognition: Their proficiency in pattern recognition makes them effective in tasks such as audio and image identification, natural language processing, and other intricate data patterns.
• Parallel Processing: Because neural networks are capable of parallel processing by
nature, they can process numerous jobs at once, which speeds up and improves the
efficiency of computations.
• Non-Linearity: Neural networks are able to model and comprehend complicated
relationships in data by virtue of the non-linear activation functions found in neurons,
which overcome the drawbacks of linear models.
Disadvantages of Neural Networks
Neural networks, while powerful, are not without drawbacks and difficulties:
• Computational Intensity: Training large neural networks can be a laborious and computationally intensive process that demands a lot of computing power.
• Black box Nature: As “black box” models, neural networks pose a problem in important
applications since it is difficult to understand how they make decisions.
• Overfitting: Overfitting is a phenomenon in which neural networks commit training
material to memory rather than identifying patterns in the data. Although regularization
approaches help to alleviate this, the problem still exists.
• Need for Large datasets: For efficient training, neural networks frequently need sizable,
labeled datasets; otherwise, their performance may suffer from incomplete or skewed
data.
Applications of Neural Networks
Neural networks have numerous applications across various fields:
1. Image and Video Recognition: CNNs are extensively used in applications such as facial
recognition, autonomous driving, and medical image analysis.
2. Natural Language Processing (NLP): RNNs and transformers power language
translation, chatbots, and sentiment analysis.
3. Finance: Predicting stock prices, fraud detection, and risk management.
4. Healthcare: Neural networks assist in diagnosing diseases, analyzing medical images,
and personalizing treatment plans.
5. Gaming and Autonomous Systems: Neural networks enable real-time decision-making,
enhancing user experience in video games and enabling autonomous systems like self-
driving cars.

What is Perceptron?
Perceptron is a type of neural network that performs binary classification, mapping input features to an output decision and usually classifying data into one of two categories, such as 0 or 1.
Perceptron consists of a single layer of input nodes that are fully connected to a layer of output
nodes. It is particularly good at learning linearly separable patterns. It utilizes a variation of
artificial neurons called Threshold Logic Units (TLU), which were first introduced by Warren McCulloch and Walter Pitts in the 1940s. This foundational model has played a crucial role in the
development of more advanced neural networks and machine learning algorithms.
Types of Perceptron
1. Single-Layer Perceptron: This type of perceptron is limited to learning linearly separable patterns. It is effective for tasks where the data can be divided into distinct categories by a straight line. While powerful in its simplicity, it struggles with more complex problems where the relationship between inputs and outputs is non-linear.
2. Multi-Layer Perceptron: Multi-layer perceptrons possess enhanced processing capabilities, as they consist of two or more layers and are adept at handling more complex patterns and relationships within the data.
Basic Components of Perceptron
A Perceptron is composed of key components that work together to process information and
make predictions.
• Input Features: The perceptron takes multiple input features, each representing a
characteristic of the input data.
• Weights: Each input feature is assigned a weight that determines its influence on the
output. These weights are adjusted during training to find the optimal values.
• Summation Function: The perceptron calculates the weighted sum of its inputs,
combining them with their respective weights.
• Activation Function: The weighted sum is passed through the Heaviside step function,
comparing it to a threshold to produce a binary output (0 or 1).
• Output: The final output is determined by the activation function, often used for binary
classification tasks.
• Bias: The bias term helps the perceptron make adjustments independent of the input,
improving its flexibility in learning.
• Learning Algorithm: The perceptron adjusts its weights and bias using a learning
algorithm, such as the Perceptron Learning Rule, to minimize prediction errors.
These components enable the perceptron to learn from data and make predictions. While a single
perceptron can handle simple binary classification, complex tasks require multiple perceptrons
organized into layers, forming a neural network.
How does Perceptron work?
A weight is assigned to each input node of a perceptron, indicating the importance of that input
in determining the output. The Perceptron’s output is calculated as a weighted sum of the inputs,
which is then passed through an activation function to decide whether the Perceptron will fire.
The weighted sum is computed as:

z = w1x1 + w2x2 + … + wnxn

The step function compares this weighted sum to a threshold. If the input is larger than the threshold value, the output is 1; otherwise, it is 0. The most common activation function used in perceptrons is the Heaviside step function:

h(z) = 1 if z ≥ threshold, otherwise 0
A perceptron consists of a single layer of Threshold Logic Units (TLU), with each TLU fully
connected to all input nodes.

In a fully connected layer, also known as a dense layer, all neurons in one layer are connected to
every neuron in the previous layer.
The output of the fully connected layer is computed as:

fW,b(X) = h(XW + b)

where X is the input, W is the weight matrix for the input neurons, b is the bias, and h is the step function.
During training, the Perceptron’s weights are adjusted to minimize the difference between the
predicted output and the actual output. This is achieved using supervised learning algorithms like
the delta rule or the Perceptron learning rule.
The weight update formula is:

wi,j := wi,j + η (yj − ŷj) xi

where η is the learning rate, yj is the target output, ŷj is the perceptron's output, and xi is the i-th input value.
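
A minimal from-scratch sketch of this learning rule is shown below; the AND-gate training data, learning rate, and epoch count are illustrative assumptions.

import numpy as np

# Toy dataset: the AND gate (an illustrative assumption)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b, eta = np.zeros(2), 0.0, 0.1

def predict(x):
    # Heaviside step function applied to the weighted sum
    return 1 if x @ w + b >= 0 else 0

for epoch in range(20):
    for xi, yi in zip(X, y):
        y_hat = predict(xi)
        # Perceptron learning rule: w <- w + eta * (y - y_hat) * x
        w += eta * (yi - y_hat) * xi
        b += eta * (yi - y_hat)

print([predict(xi) for xi in X])            # converges to [0, 0, 0, 1]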
What is a Multilayer Perceptron?
A Multi-Layer Perceptron (MLP) consists of fully connected dense layers that transform input
data from one dimension to another. It is called “multi-layer” because it contains an input layer,
one or more hidden layers, and an output layer. The purpose of an MLP is to model complex
relationships between inputs and outputs, making it a powerful tool for various machine learning
tasks.
The key components of Multi-Layer Perceptron include:
• Input Layer: Each neuron (or node) in this layer corresponds to an input feature. For
instance, if you have three input features, the input layer will have three neurons.
• Hidden Layers: An MLP can have any number of hidden layers, with each layer
containing any number of nodes. These layers process the information received from the
input layer.
• Output Layer: The output layer generates the final prediction or result. If there are
multiple outputs, the output layer will have a corresponding number of neurons.

An MLP is fully connected: every node in one layer connects to every node in the next layer. As the data moves through the network, each layer transforms it until the final output is generated in the output layer.
Working of Multi-Layer Perceptron
Let’s delve in to the working of the multi-layer perceptron. The key mechanisms such as forward
propagation, loss function, backpropagation, and optimization.
Step 1: Forward Propagation
In forward propagation, the data flows from the input layer to the output layer, passing through
any hidden layers. Each neuron in the hidden layers processes the input as follows:
1. Weighted Sum: The neuron computes the weighted sum of the inputs:

z = w1x1 + w2x2 + … + wnxn + b

2. Activation: The weighted sum z is then passed through a non-linear activation function to produce the neuron's output.

Step 2: Loss Function
After forward propagation, the network's prediction is compared with the true target using a loss function, such as mean squared error for regression or cross-entropy for classification, which quantifies the error to be minimized.
Step 3: Backpropagation
The goal of training an MLP is to minimize the loss function by adjusting the network’s weights
and biases. This is achieved through backpropagation:
1. Gradient Calculation: The gradients of the loss function with respect to each weight and
bias are calculated using the chain rule of calculus.
2. Error Propagation: The error is propagated back through the network, layer by layer.
• Gradient Descent: The network updates the weights and biases by moving in the opposite direction of the gradient to reduce the loss:

w := w − η ∂L/∂w

Where:
o w is the weight.
o η is the learning rate.
o ∂L/∂w is the gradient of the loss function with respect to the weight.
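
A minimal sketch of a complete MLP in scikit-learn is shown below; the XOR-style toy data, the hidden-layer size, and the solver settings are illustrative assumptions.

from sklearn.neural_network import MLPClassifier

# Toy XOR problem: not linearly separable, so a hidden layer is required
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# One hidden layer of 8 tanh neurons; lbfgs suits very small datasets
clf = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    solver="lbfgs", random_state=1, max_iter=2000)
clf.fit(X, y)
print(clf.predict(X))                       # expected: [0 1 1 0]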

Feed Forward Network


What is a Feedforward Neural Network?
A Feedforward Neural Network (FNN) is a type of artificial neural network where connections
between the nodes do not form cycles. This characteristic differentiates it from recurrent neural
networks (RNNs). The network consists of an input layer, one or more hidden layers, and an output
layer. Information flows in one direction—from input to output—hence the name "feedforward."
Structure of a Feedforward Neural Network
1. Input Layer: The input layer consists of neurons that receive the input data. Each neuron
in the input layer represents a feature of the input data.
2. Hidden Layers: One or more hidden layers are placed between the input and output layers.
These layers are responsible for learning the complex patterns in the data. Each neuron in
a hidden layer applies a weighted sum of inputs followed by a non-linear activation
function.
3. Output Layer: The output layer provides the final output of the network. The number of
neurons in this layer corresponds to the number of classes in a classification problem or
the number of outputs in a regression problem.
Each connection between neurons in these layers has an associated weight that is adjusted during
the training process to minimize the error in predictions.

Feed Forward Neural Network


Activation Functions
Activation functions introduce non-linearity into the network, enabling it to learn and model complex data patterns. Common activation functions include ReLU, sigmoid, and tanh, which are described in detail later in this unit.
Training a Feedforward Neural Network
Training a Feedforward Neural Network involves adjusting the weights of the neurons to minimize
the error between the predicted output and the actual output. This process is typically performed
using backpropagation and gradient descent.
1. Forward Propagation: During forward propagation, the input data passes through the
network, and the output is calculated.
2. Loss Calculation: The loss (or error) is calculated using a loss function such as Mean
Squared Error (MSE) for regression tasks or Cross-Entropy Loss for classification tasks.
3. Backpropagation: In backpropagation, the error is propagated back through the network
to update the weights. The gradient of the loss function with respect to each weight is
calculated, and the weights are adjusted using gradient descent.


Gradient Descent
Gradient Descent is an optimization algorithm used to minimize the loss function by iteratively
updating the weights in the direction of the negative gradient. Common variants of gradient descent
include:
• Batch Gradient Descent: Updates weights after computing the gradient over the entire
dataset.
• Stochastic Gradient Descent (SGD): Updates weights for each training example
individually.
• Mini-batch Gradient Descent: Updates weights after computing the gradient over a small
batch of training examples.
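
The sketch below makes the distinction concrete with a mini-batch SGD loop; the toy linear-regression data, batch size, and learning rate are illustrative assumptions. Setting batch_size to the full dataset length gives batch gradient descent, and setting it to 1 gives plain SGD.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3x plus noise (an illustrative assumption)
X = rng.uniform(-1, 1, size=200)
y = 3 * X + rng.normal(scale=0.1, size=200)

w, lr, batch_size = 0.0, 0.1, 16

for epoch in range(50):
    idx = rng.permutation(len(X))           # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]   # indices of one mini-batch
        grad = np.mean(2 * (w * X[b] - y[b]) * X[b])   # d(MSE)/dw on the batch
        w -= lr * grad                      # step against the gradient

print(round(w, 2))                          # approaches 3.0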
Evaluation of Feedforward neural network
Evaluating the performance of the trained model involves several metrics:
• Accuracy: The proportion of correctly classified instances out of the total instances.
• Precision: The ratio of true positive predictions to the total predicted positives.
• Recall: The ratio of true positive predictions to the actual positives.
• F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
• Confusion Matrix: A table used to describe the performance of a classification model,
showing the true positives, true negatives, false positives, and false negatives.
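
These metrics can be computed directly with scikit-learn, as in the sketch below; the hard-coded label vectors are made-up illustrations.

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Hypothetical true and predicted labels from a binary classifier
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))     # rows: actual class, columns: predicted class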

What is Backpropagation?
Backpropagation is a powerful algorithm in deep learning, primarily used to train artificial neural
networks, particularly feed-forward networks. It works iteratively, minimizing the cost function
by adjusting weights and biases.
In each epoch, the model adapts these parameters, reducing loss by following the error gradient.
Backpropagation often utilizes optimization algorithms like gradient descent or stochastic
gradient descent. The algorithm computes the gradient using the chain rule from calculus,
allowing it to effectively navigate complex layers in the neural network to minimize the cost
function.
fig(a) A simple illustration of how the backpropagation works by adjustments of weights
Why is Backpropagation Important?
Backpropagation plays a critical role in how neural networks improve over time. Here's why:
1. Efficient Weight Update: It computes the gradient of the loss function with respect to each
weight using the chain rule, making it possible to update weights efficiently.
2. Scalability: The backpropagation algorithm scales well to networks with multiple layers
and complex architectures, making deep learning feasible.
3. Automated Learning: With backpropagation, the learning process becomes automated,
and the model can adjust itself to optimize its performance.
Working of Backpropagation Algorithm
The Backpropagation algorithm involves two main steps: the Forward Pass and the Backward
Pass.
How Does the Forward Pass Work?
In the forward pass, the input data is fed into the input layer. These inputs, combined with their
respective weights, are passed to hidden layers.
For example, in a network with two hidden layers (h1 and h2 as shown in Fig. (a)), the output from
h1 serves as the input to h2. Before applying an activation function, a bias is added to the weighted
inputs.
Each hidden layer applies an activation function like ReLU (Rectified Linear Unit), which
returns the input if it’s positive and zero otherwise. This adds non-linearity, allowing the model to
learn complex relationships in the data. Finally, the outputs from the last hidden layer are passed
to the output layer, where an activation function, such as softmax, converts the weighted outputs
into probabilities for classification.

The forward pass using weights and biases


How Does the Backward Pass Work?
In the backward pass, the error (the difference between the predicted and actual output) is
propagated back through the network to adjust the weights and biases. One common method for
error calculation is the Mean Squared Error (MSE), given by:
MSE = (Predicted Output − Actual Output)²
Once the error is calculated, the network adjusts weights using gradients, which are computed
with the chain rule. These gradients indicate how much each weight and bias should be adjusted
to minimize the error in the next iteration. The backward pass continues layer by layer, ensuring
that the network learns and improves its performance. The activation function, through its
derivative, plays a crucial role in computing these gradients during backpropagation.
What is an Activation Function?
An activation function is a mathematical function applied to the output of a neuron. The activation
function decides whether a neuron should be activated by calculating the weighted sum of inputs
and adding a bias term. This helps the model make complex decisions and predictions by
introducing non-linearities to the output of each neuron.
Types of Activation Functions
1. Linear Activation Function
The Linear Activation Function resembles a straight line, defined by y = x. No matter how many layers the neural network contains, if they all use linear activation functions, the output is a linear combination of the input.
• The range of the output spans (−∞, +∞).
• The linear activation function is used in just one place: the output layer.
• Using linear activation across all layers makes the network’s ability to learn complex
patterns limited.
Linear activation functions are useful for specific tasks but must be combined with non-linear
functions to enhance the neural network’s learning and predictive capabilities.

Linear Activation Function or Identity Function returns the input as the output
2. Non-Linear Activation Functions
1. Sigmoid Function
The Sigmoid Activation Function is characterized by its ‘S’ shape. It is mathematically defined as:

σ(x) = 1 / (1 + e^(−x))

This formula ensures a smooth and continuous output that is essential for gradient-based optimization methods.
• It allows neural networks to handle and model complex patterns that linear equations
cannot.
• The output ranges between 0 and 1, hence useful for binary classification.
• The function exhibits a steep gradient when x values are between -2 and 2. This sensitivity
means that small changes in input x can cause significant changes in output y, which is
critical during the training process.

Sigmoid or Logistic Activation Function Graph


2. Tanh Activation Function
The tanh function, or hyperbolic tangent function, is a shifted version of the sigmoid, allowing it to stretch across the y-axis. It is defined as:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Alternatively, it can be expressed using the sigmoid function:

tanh(x) = 2 · sigmoid(2x) − 1

• Value Range: Outputs values from -1 to +1.
• Non-linear: Enables modeling of complex data patterns.
• Use in Hidden Layers: Commonly used in hidden layers due to its zero-centered output,
facilitating easier learning for subsequent layers.

Tanh Activation Function


3. ReLU (Rectified Linear Unit) Function
ReLU activation is defined by A(x) = max(0, x): if the input x is positive, ReLU returns x; if the input is negative, it returns 0.
• Value Range: [0, ∞), meaning the function only outputs non-negative values.
• Nature: It is a non-linear activation function, allowing neural networks to learn complex
patterns and making backpropagation more efficient.
• Advantage over other activations: ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. Only a few neurons are activated at a time, making the network sparse and therefore efficient and easy to compute.
ReLU Activation Function
3. Exponential Linear Units
1. Softmax Function
Softmax function is designed to handle multi-class classification problems. It transforms raw
output scores from a neural network into probabilities. It works by squashing the output values of
each class into the range of 0 to 1, while ensuring that the sum of all probabilities equals 1.
• Softmax is a non-linear activation function.
• The Softmax function ensures that each class is assigned a probability, helping to identify
which class the input belongs to.
Softmax Activation Function
2. SoftPlus Function
The Softplus function is defined mathematically as A(x) = log(1 + e^x). This equation ensures that the output is always positive and differentiable at all points, which is an advantage over the traditional ReLU function.
• Nature: The Softplus function is non-linear.
• Range: The function outputs values in the range (0,∞) similar to ReLU, but without the
hard zero threshold that ReLU has.
• Smoothness: Softplus is a smooth, continuous function, meaning it avoids the sharp
discontinuities of ReLU, which can sometimes lead to problems during optimization.
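
The sketch below implements the activation functions from this section in NumPy so their outputs can be compared side by side; the sample input values are arbitrary.

import numpy as np

def sigmoid(x):  return 1 / (1 + np.exp(-x))
def relu(x):     return np.maximum(0, x)
def softplus(x): return np.log1p(np.exp(x))             # log(1 + e^x)

def softmax(z):
    e = np.exp(z - np.max(z))               # subtract the max for numerical stability
    return e / e.sum()                      # probabilities that sum to 1

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid :", np.round(sigmoid(x), 3))
print("tanh    :", np.round(np.tanh(x), 3))
print("relu    :", relu(x))
print("softplus:", np.round(softplus(x), 3))
print("softmax :", np.round(softmax(x), 3))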
Limitations of Machine Learning
Machine learning, a method that enables computers to learn from data and make predictions or
judgments without being explicitly programmed, has grown in popularity in artificial intelligence
(AI). Machine learning has its limitations, just like any other technology, and these must be considered before applying it in practical situations. The main machine learning limitations that every data scientist, researcher, and engineer should be aware of are covered below.
1. Lack of Transparency and Interpretability

One of the main drawbacks of machine learning is the lack of transparency and interpretability. Machine learning algorithms are frequently called "black boxes" because they do not reveal how a judgment was made or how a conclusion was reached. This makes it challenging to comprehend how a certain model arrived at its result and can be problematic when explanations are required, with substantial ramifications in practical applications. For instance, in healthcare, understanding the reasoning behind a particular diagnosis is difficult without transparency and interpretability.
Transparency and interpretability can be increased by providing a more thorough description of the decision-making process through explanations. Several explanation formats are available, such as natural language explanations and decision trees. Natural language explanations offer a human-readable description of the decision-making process, making it simpler for non-experts to comprehend. A visual representation of the decision-making process, such as a decision tree, can likewise increase transparency and interpretability.
2. Bias and Discrimination
The possibility for bias and discrimination is a significant flaw in machine learning. Large datasets,
which may have data biases, are used to train machine learning systems. If these biases are not
addressed, the machine learning system may reinforce them, producing biased results.
The algorithms used in facial recognition are one instance of bias in machine learning. Research has shown that facial recognition software performs worse on people with darker skin tones, producing higher false positive and false negative rates for those groups. This bias may have significant consequences, particularly in law enforcement and security applications, where false positives may result in unjustified arrests or other undesirable outcomes.
Finally, it is critical to understand that biases and discrimination in machine learning algorithms
frequently emerge from larger social and cultural biases. To address these biases, there has to be a
larger push for inclusion and diversity in the design and use of machine learning algorithms.
3. Overfitting and Underfitting
Machine learning algorithms frequently face two limitations: overfitting and underfitting. Overfitting is a condition where a machine learning model performs poorly on new, unknown data because it is too complex and has been fitted too closely to the training data. On the other side, underfitting happens when a machine learning model is overly simplistic and unable to recognize the underlying patterns in the data, resulting in subpar performance on both the training data and fresh data.
Regularization, cross-validation, and ensemble approaches are examples of techniques that can be
used to alleviate overfitting and underfitting. When a model is regularised, a penalty term is added
to the loss function to prevent the model from growing too complex. Cross-validation includes
splitting the data into training and validation sets so that the model's performance can be assessed
and its hyperparameters can be adjusted. To enhance performance, ensemble approaches combine
several models.
Overfitting and underfitting are frequent problems when developing predictive models with machine learning. Overfitting occurs when a model is overtrained and excessively sophisticated relative to a small dataset, which results in good performance on the training data but poor generalization to new data. Conversely, underfitting occurs when a model is too simple to adequately represent the underlying relationships in the data, resulting in subpar performance on both training and test data. Using regularization methods like L1 and L2 regularization is one way to prevent overfitting. During regularization, the objective function receives a penalty term that restricts the magnitude of the model's parameters. Another method is early stopping, in which training is halted when the model's performance on a validation set stops improving.
A common method for assessing a machine learning model's performance and fine-tuning its
hyperparameters is cross-validation. The dataset is divided into folds, and the model is trained and
tested on each fold. Overfitting can be prevented, and a more precise estimate of the model's
performance can be obtained.
4. Limited Data Availability
A major challenge for machine learning is the limited availability of data. Machine learning algorithms need a lot of data to learn and produce precise predictions. However, in many fields there may be little data available, or access to it may be restricted. Due to privacy considerations, it can be difficult to obtain medical data, while data from sporadic events, such as natural catastrophes, may be of restricted scope.
Researchers are looking into novel techniques for creating synthetic data that may be used to
supplement small datasets to address this constraint. To expand the amount of data accessible for
training machine learning algorithms, efforts are also being made to enhance data sharing and
collaboration across enterprises.
Limited data availability is a major obstacle to machine learning. Addressing this restriction will require a concerted effort across industries and disciplines to improve data collection, sharing, and augmentation, ensuring that machine learning algorithms can continue to be helpful in a variety of applications.
5. Computational Resources
Machine learning algorithms can be computationally expensive and may require a lot of resources to train successfully. This can be a major barrier, particularly for individuals or smaller companies that lack access to high-performance computing resources. Distributed and cloud computing can be used to get around this restriction; however, the project's cost might go up.
For huge datasets and complex models, machine learning approaches can be computationally
expensive. The scalability and feasibility of machine learning algorithms may be hampered by the
need for significant processing resources. The availability of computational resources like
processor speed, memory, and storage is another limitation on machine learning.
Using cloud computing is one way to overcome the computational resource barrier. Users can scale
up or decrease their use of computer resources according to their demands using cloud computing
platforms like Amazon Web Services (AWS) and Microsoft Azure, which offer on-demand access
to computing resources. The cost and difficulty of maintaining computational resources can be
greatly decreased.
To lower the computing demands, optimizing the data preprocessing pipelines and machine
learning algorithms is crucial. This may entail the use of more effective algorithms, a decrease in
the data's dimensionality, and the removal of pointless or redundant information.
6. Lack of Causality
Predictions based on correlations in the data are frequently made using machine learning
algorithms. Machine learning algorithms may not shed light on the underlying causal links in the
data because correlation does not always imply causation. This may reduce our capacity for precise
prediction when causality is crucial.
The absence of causation is one of machine learning's main drawbacks. The main purpose of
machine learning algorithms is to find patterns and correlations in data; however, they cannot
establish causal links between different variables. In other words, machine learning models can
forecast future events based on seen data, but they cannot explain why such events occur.
A major drawback of using machine learning models to judge is the absence of causality. For
instance, if a machine learning model is used to forecast the likelihood that a consumer would buy
a product, it may find factors like age, income, and gender that are connected with buying behavior.
The model, however, is unable to determine if these variables are the source of the buying behavior
or whether there are further underlying causes.
To get over this restriction, machine learning may need to be integrated with other methodologies
like experimental design. Researchers can identify causal relationships by manipulating variables
and observing how those changes impact a result using an experimental design. However,
compared to traditional machine learning techniques, this approach may require more time and
resources.
Machine learning can be a useful tool for predicting outcomes from observable data, but it's crucial
to be aware of its limitations when making decisions based on these predictions. The lack of
causation is a basic flaw in machine learning systems. To establish causation, it could be necessary
to use methods other than machine learning.
7. Ethical Considerations
Machine learning models can have major social, ethical, and legal repercussions when used to
make judgments that affect people's lives. Machine learning models, for instance, may have a
differential effect on groups of individuals when used to make employment or lending choices.
Privacy, security, and data ownership must also be addressed when adopting machine learning
models.
The ethical issue of bias and discrimination is a major one. If the training data is biased or the
algorithms are not created in a fair and inclusive manner, biases and discrimination in society may
be perpetuated and even amplified by machine learning algorithms.
Another important ethical factor is privacy. Machine learning algorithms can collect and process
large amounts of personal data, which raises questions about how that data is utilized and
safeguarded.
Accountability and transparency are also crucial ethical factors. It is essential to ensure that
machine learning algorithms are visible and understandable and that systems are in place to hold
the creators and users of these algorithms responsible for their actions.
Finally, there are ethical issues around how machine learning will affect society. More
sophisticated machine learning algorithms may have far-reaching social, economic, and political
repercussions that require careful analysis and regulation.

Deep learning
Deep learning is a type of machine learning that uses artificial neural networks to learn from data.
Artificial neural networks are inspired by the human brain, and they can be used to solve a wide
variety of problems, including image recognition, natural language processing, and speech
recognition.
What is Deep Learning?
The definition of Deep learning is that it is the branch of machine learning that is based on artificial
neural network architecture. An artificial neural network or ANN uses layers of interconnected
nodes called neurons that work together to process and learn from the input data.
In a fully connected Deep neural network, there is an input layer and one or more hidden layers
connected one after the other. Each neuron receives input from the previous layer neurons or the
input layer. The output of one neuron becomes the input to other neurons in the next layer of the
network, and this process continues until the final layer produces the output of the network. The
layers of the neural network transform the input data through a series of nonlinear transformations,
allowing the network to learn complex representations of the input data.
Convolution Neural Network
A Convolutional Neural Network (CNN) is a type of Deep Learning neural network architecture
commonly used in Computer Vision. Computer vision is a field of Artificial Intelligence that
enables a computer to understand and interpret the image or visual data.
When it comes to Machine Learning, Artificial Neural Networks perform really well. Neural networks are used on various datasets such as images, audio, and text. Different types of neural networks are used for different purposes: for predicting a sequence of words we use Recurrent Neural Networks (more precisely, an LSTM), and for image classification we use Convolutional Neural Networks. In this section, we build the basic building block of a CNN.
Convolution Neural Network
Convolutional Neural Network (CNN) is the extended version of artificial neural networks (ANN)
which is predominantly used to extract the feature from the grid-like matrix dataset. For example
visual datasets like images or videos where data patterns play an extensive role.
CNN Architecture
Convolutional Neural Network consists of multiple layers like the input layer, Convolutional layer,
Pooling layer, and fully connected layers.

Simple CNN architecture


The Convolutional layer applies filters to the input image to extract features, the Pooling layer
down samples the image to reduce computation, and the fully connected layer makes the final
prediction. The network learns the optimal filters through backpropagation and gradient descent.
How Convolutional Layers Works?
Convolutional Neural Networks, or covnets, are neural networks that share their parameters. Imagine you have an image. It can be represented as a cuboid having a length and width (the dimensions of the image) and a height (i.e., the channels, as images generally have red, green, and blue channels).
Now imagine taking a small patch of this image and running a small neural network, called a filter
or kernel on it, with say, K outputs and representing them vertically. Now slide that neural network
across the whole image, as a result, we will get another image with different widths, heights, and
depths. Instead of just R, G, and B channels now we have more channels but lesser width and
height. This operation is called Convolution. If the patch size is the same as that of the image it
will be a regular neural network. Because of this small patch, we have fewer weights.



Mathematical Overview of Convolution
Now let’s talk about a bit of mathematics that is involved in the whole convolution process.
• Convolution layers consist of a set of learnable filters (or kernels) having small widths and
heights and the same depth as that of input volume (3 if the input layer is image input).
• For example, if we have to run a convolution on an image with dimensions 34x34x3, the possible filter sizes are a x a x 3, where ‘a’ can be anything like 3, 5, or 7, but smaller than the image dimension.
• During the forward pass, we slide each filter across the whole input volume step by step
where each step is called stride (which can have a value of 2, 3, or even 4 for high-
dimensional images) and compute the dot product between the kernel weights and patch
from input volume.
• As we slide our filters, we get a 2-D output for each filter; stacking these together gives an output volume with a depth equal to the number of filters. The network will learn all the filters.
Layers Used to Build ConvNets
A complete Convolutional Neural Network architecture is also known as a covnet. A covnet is a sequence of layers, and every layer transforms one volume to another through a differentiable function.
Types of layers:
Let's take an example by running a covnet on an image of dimension 32 x 32 x 3.
• Input Layers: It’s the layer in which we give input to our model. In CNN, Generally, the
input will be an image or a sequence of images. This layer holds the raw input of the image
with width 32, height 32, and depth 3.
• Convolutional Layers: This is the layer used to extract features from the input dataset. It applies a set of learnable filters, known as kernels, to the input images. The filters/kernels are smaller matrices, usually of 2×2, 3×3, or 5×5 shape. Each kernel slides over the input image data and computes the dot product between the kernel weights and the corresponding input image patch. The output of this layer is referred to as feature maps. Suppose we use a total of 12 filters for this layer; we'll get an output volume of dimension 32 x 32 x 12.
• Activation Layer: By adding an activation function to the output of the preceding layer, activation layers add nonlinearity to the network. An element-wise activation function is applied to the output of the convolution layer. Some common activation functions are ReLU: max(0, x), Tanh, Leaky ReLU, etc. The volume remains unchanged, hence the output volume will have dimensions 32 x 32 x 12.
• Pooling Layer: This layer is periodically inserted in the covnet. Its main function is to reduce the size of the volume, which makes the computation faster, reduces memory use, and also prevents overfitting. Two common types of pooling layers are max pooling and average pooling. If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16 x 16 x 12.
• Flattening: The resulting feature maps are flattened into a one-dimensional vector after the convolution and pooling layers so they can be passed into a fully connected layer for classification or regression.
• Fully Connected Layers: It takes the input from the previous layer and computes the final
classification or regression task.



• Output Layer: The output from the fully connected layers is then fed into a logistic function for classification tasks, such as sigmoid or softmax, which converts the raw score of each class into a probability.
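
A minimal sketch of this running example in Keras is shown below; the 32 x 32 x 3 input and 12-filter convolution follow the text, while the dense-layer width and the 10-class output are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),                    # input layer: 32 x 32 x 3 image
    layers.Conv2D(12, kernel_size=3, padding="same",
                  activation="relu"),                   # convolution + activation: 32 x 32 x 12
    layers.MaxPooling2D(pool_size=2),                   # max pooling: 16 x 16 x 12
    layers.Flatten(),                                   # flatten feature maps to a vector
    layers.Dense(64, activation="relu"),                # fully connected layer
    layers.Dense(10, activation="softmax"),             # output layer: class probabilities
])
model.summary()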
Recurrent Neural Networks (RNNs)
In traditional neural networks, inputs and outputs are treated independently. However, tasks like
predicting the next word in a sentence require information from previous words to make accurate
predictions. To address this limitation, Recurrent Neural Networks (RNNs) were developed.
Recurrent Neural Networks introduce a mechanism where the output from one step is fed back as
input to the next, allowing them to retain information from previous inputs. This design makes
RNNs well-suited for tasks where context from earlier steps is essential, such as predicting the
next word in a sentence.
The defining feature of RNNs is their hidden state—also called the memory state—which
preserves essential information from previous inputs in the sequence. By using the same
parameters across all steps, RNNs perform consistently across inputs, reducing parameter
complexity compared to traditional neural networks. This capability makes RNNs highly effective
for sequential tasks.

Recurrent Neural Network


In simple terms, RNNs apply the same network to each element in a sequence; by preserving and passing on relevant information, they can learn temporal dependencies that conventional neural networks cannot.
How RNN Differs from Feedforward Neural Networks
Feedforward Neural Networks (FNNs) process data in one direction, from input to output, without
retaining information from previous inputs. This makes them suitable for tasks with independent
inputs, like image classification. However, FNNs struggle with sequential data since they lack
memory.
Recurrent Neural Networks (RNNs) solve this by incorporating loops that allow information from
previous steps to be fed back into the network. This feedback enables RNNs to remember prior
inputs, making them ideal for tasks where context is important.

1. Recurrent Neurons
The fundamental processing unit in a Recurrent Neural Network (RNN) is a Recurrent Unit, which
is not explicitly called a “Recurrent Neuron.” Recurrent units hold a hidden state that maintains
information about previous inputs in a sequence. Recurrent units can “remember” information
from prior steps by feeding back their hidden state, allowing them to capture dependencies across
time.

Recurrent Neuron
2. RNN Unfolding
RNN unfolding, or “unrolling,” is the process of expanding the recurrent structure over time steps.
During unfolding, each step of the sequence is represented as a separate layer in a series,
illustrating how information flows across each time step. This unrolling enables backpropagation
through time (BPTT), a learning process where errors are propagated across time steps to adjust
the network’s weights, enhancing the RNN’s ability to learn dependencies within sequential data.

RNN Unfolding
Types Of Recurrent Neural Networks
There are four types of RNNs based on the number of inputs and outputs in the network:
1. One-to-One RNN
A One-to-One RNN behaves like a vanilla neural network and is the simplest type of neural network architecture. In this setup, there is a single input and a single output. It is commonly used for straightforward classification tasks where input data points do not depend on previous elements.

One to One RNN


2. One-to-Many RNN
In a One-to-Many RNN, the network processes a single input to produce multiple outputs over
time. This setup is beneficial when a single input element should generate a sequence of
predictions.
For example, in an image captioning task, a single image is given as input, and the model predicts a sequence of words as a caption.

One to Many RNN


3. Many-to-One RNN
The Many-to-One RNN receives a sequence of inputs and generates a single output. This type is
useful when the overall context of the input sequence is needed to make one prediction.
In sentiment analysis, the model receives a sequence of words (like a sentence) and produces a
single output, which is the sentiment of the sentence (positive, negative, or neutral).

Many to One RNN


4. Many-to-Many RNN
The Many-to-Many RNN type processes a sequence of inputs and generates a sequence of outputs.
This configuration is ideal for tasks where the input and output sequences need to align over time,
often in a one-to-one or many-to-many mapping.
In a language translation task, a sequence of words in one language is given as input, and a corresponding sequence in another language is generated as output.

Many to Many RNN


Variants of Recurrent Neural Networks (RNNs)
There are several variations of RNNs, each designed to address specific challenges or optimize for
certain tasks:
1. Vanilla RNN
This simplest form of RNN consists of a single hidden layer, where weights are shared across time
steps. Vanilla RNNs are suitable for learning short-term dependencies but are limited by the
vanishing gradient problem, which hampers long-sequence learning.
2. Bidirectional RNNs
Bidirectional RNNs process inputs in both forward and backward directions, capturing both past
and future context for each time step. This architecture is ideal for tasks where the entire sequence
is available, such as named entity recognition and question answering.
3. Long Short-Term Memory Networks (LSTMs)
Long Short-Term Memory Networks (LSTMs) introduce a memory mechanism to overcome the
vanishing gradient problem. Each LSTM cell has three gates:
• Input Gate: Controls how much new information should be added to the cell state.
• Forget Gate: Decides what past information should be discarded.
• Output Gate: Regulates what information should be output at the current step.
This selective memory enables LSTMs to handle long-term dependencies, making them ideal
for tasks where earlier context is critical; a minimal sketch of one cell step follows.
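
As a rough illustration (not a production implementation), the following NumPy sketch computes
one LSTM cell step; the weight names W_i, W_f, W_o, W_c are made up for this example, and
bias terms are omitted for brevity:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_i, W_f, W_o, W_c):
    z = np.concatenate([x, h_prev])   # current input joined with previous hidden state
    i = sigmoid(W_i @ z)              # input gate: how much new information to add
    f = sigmoid(W_f @ z)              # forget gate: what past information to discard
    o = sigmoid(W_o @ z)              # output gate: what to expose at this step
    c_tilde = np.tanh(W_c @ z)        # candidate cell state
    c = f * c_prev + i * c_tilde      # selective memory update of the cell state
    h = o * np.tanh(c)                # hidden state output
    return h, c

rng = np.random.default_rng(1)
n_in, n_h = 3, 4
weights = [rng.normal(scale=0.1, size=(n_h, n_in + n_h)) for _ in range(4)]
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), *weights)

The cell state c acts as the long-term memory channel: the forget gate scales it down and the
input gate writes into it, which is what lets information survive across many time steps.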
4. Gated Recurrent Units (GRUs)
Gated Recurrent Units (GRUs) simplify LSTMs by combining the input and forget gates into a
single update gate and streamlining the output mechanism. This design is computationally
efficient, often performing similarly to LSTMs, and is useful in tasks where simplicity and faster
training are beneficial.
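
For comparison, a similarly minimal GRU step sketch (again with illustrative weight names and
no biases) shows the single update gate z playing the combined role of the LSTM's input and
forget gates:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, W_z, W_r, W_h):
    u = np.concatenate([x, h_prev])
    z = sigmoid(W_z @ u)    # update gate: merges the input/forget roles
    r = sigmoid(W_r @ u)    # reset gate: how much past state feeds the candidate
    h_tilde = np.tanh(W_h @ np.concatenate([x, r * h_prev]))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde  # blend old state with the candidate

Because there is no separate cell state and one fewer gate, a GRU has fewer parameters than an
LSTM of the same hidden size, which is where its speed advantage comes from.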
Unit -V
Introduction to IBM Watson Studio
IBM Watson® Studio empowers data scientists, developers and analysts to build, run and manage
AI models, and optimize decisions anywhere on IBM Cloud Pak® for Data.

Benefits
Optimize AI and cloud economics
Put multicloud AI to work for business. Use flexible consumption models. Build and deploy AI
anywhere.
Predict outcomes and prescribe actions
Optimize schedules, plans and resource allocations using predictions. Simplify optimization
modeling with a natural language interface.
Synchronize apps and AI
Unite and cross-train developers and data scientists. Push models through REST API across any
cloud. Save time and cost managing disparate tools.
Unify tools and increase productivity for ModelOps
Operationalize enterprise AI across clouds. Govern and secure data science projects at scale.
Deliver explainable AI
Reduce model monitoring efforts by 35% to 50%.¹ Increase model accuracy by 15% to
30%.² Increase net profits on a data and AI platform.
Manage risks and regulatory compliance
Protect against exposure and regulatory penalties. Simplify AI model risk management through
automated validation.

Features
AutoAI for faster experimentation
Automatically build model pipelines. Prepare data and select model types. Generate and rank
model pipelines.
Advanced data refinery
Cleanse and shape data with a graphical flow editor. Apply interactive templates to code
operations, functions and logical operators.
Open-source notebook support
Create a notebook file, use a sample notebook or bring your own notebook. Code and run a
notebook.
Integrated visual tooling
Prepare data quickly and develop models visually with IBM SPSS Modeler in Watson Studio.
Model training and development
Build experiments quickly and enhance training by optimizing pipelines and identifying the right
combination of data.
Extensive open-source frameworks
Bring your model of choice to production. Track and retrain models using production feedback.
Embedded decision optimization
Combine predictive and prescriptive models. Use predictions to optimize decisions. Create and
edit models in Python, in OPL or with natural language.
Model management and monitoring
Monitor quality, fairness and drift metrics. Select and configure deployment for model insights.
Customize model monitors and metrics.
Model risk management
Compare and evaluate models. Evaluate and select models with new data. Examine the key model
metrics side-by-side.
Data science methodology
IBM has defined a lightweight IBM Cloud Garage Method that includes a process model to map
individual technology components to the reference architecture. This method does not include any
requirement engineering or design thinking tasks. Because it can be hard to initially define the
architecture of a project, this method supports architectural changes during the process model.
Each stage plays a vital role in the context of the overall methodology. At a certain level of
abstraction, it can be seen as a refinement of the workflow outlined by the CRISP-DM method for
data mining.
According to both methodologies, every project starts with Business understanding, where the
problem and objectives are defined. This is followed in the IBM Data Science Method by
the Analytical approach phase, where the data scientist can define the approach to solving the
problem. The IBM Data Science Method then continues with three phases called Data
requirements, Data collection, and Data understanding, which in CRISP-DM are represented by
a single Data understanding phase.
After the data scientist has an understanding of the data and has sufficient data to get started, they
move to the Data preparation phase. This phase is usually very time-consuming. A data scientist
spends about 80% of their time in this phase, performing tasks such as data cleansing and feature
engineering. The term "data wrangling" is often used in this context. During and after cleansing
the data, the data scientist generally performs exploration, such as descriptive statistics to get an
overall feel for the data, and clustering to look at the relationships and latent structure of the data.
This process is often iterated several times until the data scientist is satisfied with their data set.
The model training stage is where machine learning is used in building a predictive model. The
model is trained and then evaluated by statistical measures such as prediction accuracy, sensitivity,
and specificity. After the model is deemed sufficient, it is deployed and used for scoring on unseen
data. The IBM Data Science Methodology adds an additional Feedback stage for obtaining
feedback from using the model, which is then used to improve the model. Both methods are highly
iterative by nature.
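
To make the evaluation step concrete, a short scikit-learn sketch can compute accuracy,
sensitivity, and specificity from a confusion matrix; the labels below are made up for illustration:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes on unseen data
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # the model's predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate
print(accuracy, sensitivity, specificity)   # 0.75 0.75 0.75 for this toy data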
IBM Watson Studio
IBM Watson Studio gives you the environment and tools to solve business problems by
collaboratively working with data. You can choose the tools needed to analyze and visualize data;
to cleanse and shape the data; to ingest streaming data; or to create, train, and deploy machine
learning models.

With IBM Watson Studio, you can:


• Create projects to organize the resources (such as data connections, data assets,
collaborators, and notebooks) to achieve an analytics goal.
• Access data from connections to your cloud or on-premises data sources.
• Upload files to the project’s object storage.
• Create and maintain data catalogs to discover, index, and share data.
• Refine data by cleansing and shaping the data to prepare it for analysis.
• Perform data science tasks by creating Jupyter Notebooks for Python or Scala to run code
that processes data and then view the results inline. Alternatively, you can use RStudio
for R.
• Create, test, and deploy machine learning and deep learning models.
• Visualize your data.
Technically, IBM Watson Studio is based on a variety of open source technologies and IBM
products, as shown in the following figure.

In the context of data science, IBM Watson Studio can be viewed as an integrated, multirole
collaboration platform that supports the developer, data engineer, business analyst, and the data
scientist in the process of solving a data science problem. For the developer role, other components
of the IBM Cloud platform might be relevant as well in building applications that use machine
learning services. However, the data scientist can build machine learning models using a variety
of tools, ranging from:
• AutoAI Model Builder: A graphical tool requiring no programming skills

• SPSS Modeler Flows: Adopts a diagrammatic style

• RStudio and Jupyter Notebooks: Using a programmatic style


IBM Watson Machine Learning
Using IBM Watson Machine Learning, you can build analytic models and neural networks, which
are trained with your own data, that you can deploy for use in applications. Watson Machine
Learning provides a full range of tools and services, so you can build, train, and deploy Machine
Learning models. Choose from tools that fully automate the training process for rapid prototyping
to tools that give you complete control to create a model that matches your needs.

IBM Watson Machine Learning service


A key component of IBM Watson Studio is the IBM Watson Machine Learning service and its set
of REST APIs that can be called from any programming language to interact with a machine
learning model. The focus of the IBM Watson Machine Learning service is deployment, but you
can use IBM SPSS Modeler or IBM Watson Studio to author and work with models and pipelines.
Both SPSS Modeler and IBM Watson Studio use Spark MLlib and Python scikit-learn and offer
various modeling methods that are taken from machine learning, artificial intelligence, and
statistics.
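
As a simple illustration of the programmatic style, the following scikit-learn sketch trains the kind
of classifier a Watson Studio notebook might produce; the dataset here is a generic stand-in, not
tied to any Watson example:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small toy dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=200).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

A model object like this can then be stored in the Watson Machine Learning repository and
deployed through the REST APIs mentioned above.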

Using IBM Watson Machine Learning, you can deploy machine learning models, scripts,
functions, and prompt templates for generative AI models. After you create deployments, you can
test and manage them, and prepare your assets to deploy into pre-production and production
environments to generate predictions and insights.

Required service: The administrator must provision the Watson Machine Learning service on the
Cloud Pak for Data as a Service platform before you can use its capabilities.

Analyzing data and building models with Watson Studio, Watson Machine Learning, and
other supplemental services

You can analyze data and build models with the Watson Studio service. Supplemental services to
Watson Studio, such as Watson Machine Learning, add tools and compute resources to projects.

Required services: The Watson Studio, Watson Machine Learning, and other supplemental
services are not available by default. An administrator must install these services on the IBM
Cloud Pak for Data platform. To determine whether a service is installed, open the Services
catalog and check whether the service is enabled.
To start analyzing data with Watson Studio:

1. Create or open a project:


o To create a project, choose Projects > All projects from the main menu and then
click New project on the Projects page. See Creating a project.
o To open an existing project, choose Projects > All projects from the main menu
and then click the name of the project.
2. Add data to the project. Alternatively, you can add data from within a tool.
3. Analyze data or build models. Find out how to choose the right tool in Watson Studio.

You can use the data analytics and model building methods that are listed in the following table
with Watson Studio plus the other listed services.

Method (supplementary service to Watson Studio in parentheses):

• Analyze data by writing code in Jupyter notebooks or scripts (Watson Studio only)
• Code notebooks and Python scripts in the JupyterLab IDE with Git integration (Watson Studio only)
• Visualize and prepare data in Data Refinery (Watson Studio only)
• Prompt Lab (Watson Machine Learning)
• Tuning Studio (Watson Machine Learning)
• Develop Shiny applications in the RStudio IDE (RStudio Server with R 3.6)
• Visualize your data without coding with Cognos dashboards (Cognos Dashboards)
• Run analytic workloads with Spark environments or Spark APIs (Analytics Engine powered by Apache Spark)
• Analyze data on Apache Hadoop clusters (Execution Engine for Apache Hadoop)
• Analyze data with SQL queries on Hadoop clusters or cloud object stores (Db2 Big SQL)
• Build models with AutoAI (Watson Machine Learning)
• Train models with federated learning (Watson Machine Learning)
• Build models in notebooks (Watson Machine Learning)
• Run deep learning experiments (Watson Machine Learning, Watson Machine Learning Accelerator)
• Solve Decision Optimization models (Watson Machine Learning, Decision Optimization)
• Build models with SPSS Modeler (SPSS Modeler)

Table 1. Data analysis and model building methods with Watson Studio and supplementary services

Analyzing data with other services

If you don't have the Watson Studio service that is installed, you can use the data analytics methods
with the services that are listed in the following table.

Method (required service in parentheses):

• Run analytic workloads with Spark APIs (Analytics Engine powered by Apache Spark)
• Analyze data with SQL queries on Hadoop clusters or cloud object stores (Db2 Big SQL)

Table 2. Data analysis methods and the required services

Deploying and managing assets in deployment spaces

Create a deployment space to collaborate with stakeholders and deploy and manage your AI assets.
To manage your assets within a deployment space, you must promote your assets from a project
to your deployment space. You can also import or export assets from your deployment space. For
more information, see Deployment spaces.
The typical activities to deploy AI assets are described in the sections that follow.

Deployment spaces
Deployment spaces contain deployable assets, deployments, deployment jobs, associated input and
output data, and the associated environments. You can use spaces to deploy various assets and manage
your deployments.

Promoting assets to a deployment space

Deployment spaces are not associated with projects. You can promote assets from multiple
projects to a space, and you can deploy assets to more than one space. For example, you might
have a test space for evaluating deployments, and a production space for deployments that you
want to deploy in business applications.

Components of a deployment space

When you open a deployment space from Cloud Pak for Data, you see these components:

• Overview: Use the Overview tab to view all space activity, such as machine learning and
generative AI assets within the deployment space and the status of deployments and jobs.
• Assets: The Assets tab provides details about assets that are created or imported in the
deployment space. You can organize assets by types. For example, the Data access view
provides information about connections to DataStax Enterprise, Google BigQuery, Oracle,
and more.
• Deployments: Use the Deployments tab to monitor the status of your deployments in the
deployment space. You must promote assets from your project to the deployment space
before they can be deployed.
• Jobs: Use the Jobs tab to monitor jobs that are associated with your batch deployments.
• Manage: Use the Manage tab to access and edit details about your deployment space. You
can share a space with collaborators. When you add collaborators to a deployment space,
you can specify which actions they can perform by assigning them access levels. You can
also create new environments and manage resource usage.

The following graphic shows the elements of a deployment space:

Creating deployment spaces


Create a deployment space to store your assets, deploy assets, and manage your deployments.

Required permissions:
All users in your IBM Cloud account with the Editor IAM platform access role, either for all
IAM-enabled services or for Cloud Pak for Data, can create deployment spaces. For more
information, see IAM Platform access roles.

A deployment space is not associated with a project. You can publish assets from multiple projects
to a space. For example, you might have a test space for evaluating deployments, and a production
space for deployments you want to deploy in business applications.

Follow these steps to create a deployment space:

1. From the navigation menu, select Deployments > New deployment space. Enter a name
for your deployment space.
2. Optional: Add a description and tags.
3. Select a storage service to store your space assets.
o If you have a Cloud Object Storage repository that is associated with your IBM Cloud
account, choose a repository from the list to store your space assets.
o If you do not have a Cloud Object Storage repository that is associated with your IBM
Cloud account, you are prompted to create one.
4. Optional: If you want to deploy assets from your space, select a machine learning service
instance to associate with your deployment space.
To associate a machine learning instance to a space, you must:
o Be a space administrator.
o Have view access to the machine learning service instance that you want to
associate with the space.

Tip: If you want to evaluate assets in the space, switch to the Manage tab and associate a
Watson OpenScale instance.

5. Optional: Assign the space to a deployment stage. Deployment stages are used for MLOps,
to manage access for assets in various stages of the AI lifecycle. They are also used in
governance, for tracking assets. Choose from:
o Development for assets under development. Assets that are tracked for governance
are displayed in the Develop stage of their associated use case.
o Testing for assets that are being validated. Assets that are tracked for governance
are displayed in the Validate stage of their associated use case.
o Production for assets in production. Assets that are tracked for governance are
displayed in the Operate stage of their associated use case.
6. Optional: Upload space assets, such as an exported project or an exported space. If the
imported space is encrypted, you must enter the password.

Tip: If you get an import error, clear your browser cookies and then try again.

7. Click Create.

Viewing and managing deployment spaces

• To view all deployment spaces that you can access, click Deployments on the navigation
menu.
• To view any of the details about the space after you create it, such as the associated service
instance or storage ID, open your deployment space and then click the Manage tab.
• Your space assets are stored in a Cloud Object Storage repository. You can access this
repository from IBM Cloud. To find the bucket ID, open your deployment space, and click
the Manage tab.

Automatic archiving of spaces

Spaces that are not used for 90 days are automatically archived to preserve system resources. If
you access an archived space, either through the user interface or programmatically, you must wait
for the space to be restored before you can use it.
Note: You cannot promote or add assets to an archived space. Restore the space first, then promote
or add assets.


Ways to deploy assets


You can deploy and manage your assets in the following ways:

• Use a no-code approach: You can use a no-code approach to deploy and manage assets in
a deployment space. For more information, see Deploying and managing assets in deployment
spaces.
• Use a custom-code approach: You can use a custom-code approach to deploy and manage
assets programmatically (see the sketch after this list) by using:
o Python client
o Watson Machine Learning API

For additional Cloud Pak for Data as a Service APIs, see Cloud Pak for Data APIs.
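
As a hedged sketch of the custom-code approach, the snippet below assumes the
ibm-watson-machine-learning Python client package; the URL, API key, space ID, deployment
ID, and field names are all placeholders, and exact metadata can vary by release, so confirm
against the client documentation for your platform version:

from ibm_watson_machine_learning import APIClient

wml_credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",   # example region endpoint
    "apikey": "<YOUR_IBM_CLOUD_API_KEY>",         # placeholder credential
}
client = APIClient(wml_credentials)

client.set.default_space("<DEPLOYMENT_SPACE_ID>")  # work inside one deployment space
client.deployments.list()                          # inspect deployments in the space

# Score an online deployment using the input_data payload format.
payload = {"input_data": [{"fields": ["age", "income"], "values": [[35, "medium"]]}]}
result = client.deployments.score("<DEPLOYMENT_ID>", payload)
print(result)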

Types of deployments
Depending on your organization's needs, you can create an online or a batch deployment:

• Online deployment: Create an online deployment to process input data in real time. To
test the online deployment, you can submit new customer data to the deployment endpoint
and get a prediction back immediately, entering the test data in a form or through JSON
code. For more information, see Creating online deployments in Watson Machine Learning.
• Batch deployment: Create a batch deployment to process a large batch of input data from
a data source and write the output to a selected destination. To test your batch deployment,
you must create a batch deployment job. You can configure the batch deployment job by
providing details about the input data, output file, and information about running the job
on a schedule or on demand. For more information, see Creating batch deployments in
Watson Machine Learning.
• Application deployment: Create an application deployment to deploy your application
assets, such as R Shiny applications. For more information, see Deploying Shiny apps in
Watson Machine Learning.

Retrieving deployment endpoints

To use your deployed asset in applications for making predictions, retrieve the endpoint URL for
your online or batch deployment. The model endpoint provides an interface to invoke and manage
model deployments.
For more information, see Retrieving the endpoint for an online deployment or Retrieving the endpoint
for a batch deployment.
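
For illustration, invoking an online deployment endpoint over REST might look like the following
sketch; the URL, version date, bearer token, and field names are placeholders, and the exact
values should be copied from the deployment details page in your space:

import requests

endpoint = "https://us-south.ml.cloud.ibm.com/ml/v4/deployments/<DEPLOYMENT_ID>/predictions"
headers = {
    "Authorization": "Bearer <IAM_ACCESS_TOKEN>",   # placeholder access token
    "Content-Type": "application/json",
}
payload = {"input_data": [{"fields": ["age", "income"], "values": [[35, "medium"]]}]}

# The v4 API expects a version date as a query parameter.
response = requests.post(endpoint, params={"version": "2021-05-01"},
                         headers=headers, json=payload)
print(response.json())   # predictions returned by the deployed model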

Types of deployable assets


Some assets can be used to create both online and batch deployments, while others support only
one deployment type. For example, both online and batch deployments support assets such as
Python functions, scripts, and models (for example, AutoAI or Decision Optimization models).
However, for models that are imported from a file, you can create online deployments only. The
different types of deployable assets are as follows:

• Foundation model assets: You can deploy foundation model assets such as tuned model
or prompt template assets with [Link]. For more information, see Deploying foundation
model assets.
• Machine Learning assets: You can deploy Machine Learning assets such as Python
functions, R Shiny applications, NLP models, scripts, and more with Watson Machine
Learning. For more information, see Deploying Machine Learning assets.
• Decision Optimization models: You can deploy Decision Optimization models with
Watson Machine Learning.

Managing deployments
You can access, update, scale, delete, and monitor the performance of your deployments in your
deployment space:

• Accessing a deployment: You can access details that are related to your deployment, such
as stage type, which describes whether the deployment space is for preproduction or
production purposes.
• Updating a deployment: You can update your deployment details such as deployment
name, software specification, and more. For more information, see Updating a deployment.
• Scaling a deployment: You can create multiple copies of your deployment to increase
scalability and availability for a larger volume of scoring requests. For more information,
see Scaling a deployment.
• Deleting a deployment: Delete your deployment when you no longer need it to free up
resources. For more information, see Deleting a deployment.
• Monitoring deployment performance: You can evaluate your deployments to measure
performance and understand model predictions by provisioning a Watson OpenScale
instance and configuring monitors for fairness, quality, drift, and explainability.

Monitoring deployment activity


Use the deployments dashboard to get an aggregate view of your deployments and monitor
deployment activity. You can use the dashboard to monitor the status of your batch deployment
jobs, such as active and finished runs based on the job schedule that you defined when you created
the job. You can also get information about the number of successful and failed online
deployments. For more information, see Deployments dashboard.

Managing runtime environments for deployments

Runtime environments provide the necessary functions that are required to run your deployment.

Important: You must use the same runtime environment to build and deploy your model.

You can use predefined runtime environments or create custom runtime environments to include
more components, depending on your use case. To create a custom runtime environment for your
deployment, you must create a Dockerfile, add a base image, and then add the Docker commands
that build the runtime environment for your deployment. For more information, see Customizing
Watson Machine Learning deployment runtimes.

Deploying and managing AI assets


Use IBM [Link] Runtime to deploy and manage AI assets, and put them into pre-production
and production environments. Manage and update deployed assets. You can also automate part of
the AI lifecycle using IBM Orchestration Pipelines.

Deploying AI assets and orchestrating pipelines

Deploying an asset makes it available for testing or for productive use via an endpoint.

After you build your model, the typical process involves deploying the model, automating the
path to production, and monitoring and managing the AI lifecycle:
Deploy assets

You can deploy assets from your deployment space by using [Link] Runtime. To deploy your
assets, you must promote these assets from a project to your deployment space or import these
assets directly to your deployment space.

For more information, see Deploying AI assets.

Automate pipelines

You can automate the path to production by building a pipeline to automate parts of the AI lifecycle
from building the model to deployment by using Orchestration Pipelines.
Short Answers
1. What is Conditional Probability?
2. How does Machine Learning differ from traditional learning?
3. How does classification differ from clustering? In which situations can clustering be used?
4. What are the different types of deployments?
5. How can overfitting be avoided in Machine Learning?
6. What are the different types of decision theory?
7. What are the various distance metrics used in KNN?
8. What is the role of filters (kernels) in CNNs?
9. How do generative models differ from discriminative models?
10. How do you monitor deployment activity?
11. Why is dimension reduction required in machine learning techniques?
12. What do you mean by pruning?
13. What is dimensionality reduction in ML? List its advantages.
14. What is model-based learning?
15. Define Deep Learning. How does it differ from ML?
16. Draw the diagrammatic representation of a multi-layer perceptron, and write the mathematical
notation for the output calculation.
17. Define Perceptron.
18. How can you choose the K value in clustering?
19. How do you monitor deployment activity?
20. What are the ways to deploy assets?

Big Answers
1. Explain in detail vector calculation and optimization techniques.
2. Explain in detail Decision Theory and Information Theory with their applications.
3. Explain in detail Principal Component Analysis with a suitable example and list its
applications.
4. Explain in detail the various Ensemble Learning methods with suitable examples, and
differentiate among them.
5. Explain in detail the different types of Machine Learning with their advantages,
disadvantages, and applications.
6. How does Machine Learning differ from Artificial Intelligence?
7. Explain in detail K-Means clustering with a suitable example.
8. Explain in detail Hierarchical clustering and its types with suitable examples.
9. Compare supervised learning, unsupervised learning, and reinforcement learning.
10. Explain in detail the Decision Tree used in supervised learning for classification.
11. Explain in detail the Convolutional Neural Network with its applications.
12. Explain in detail the Feed Forward Network and Backpropagation algorithm with a suitable
example.
13. Explain in detail DBSCAN clustering with a suitable example.
14. Explain in detail Recommendation Systems, their types, and their advantages and
disadvantages.
15. Explain in detail the Support Vector Machine used in supervised learning for classification.
16. Explain the different types of deployable assets in IBM Watson Studio.
17. How do you manage runtime environments for deployments in IBM Watson Studio?
18. How do you evaluate classification algorithms in Machine Learning?
19. Explain in detail the different types of Machine Learning with their advantages, disadvantages,
and applications.
20. Describe RNNs and their various types.
21. Explain in detail the Multilayer Perceptron.
22. Explain the various IBM Watson Machine Learning services.
23. How do you manage runtime environments for deployments in IBM Watson Studio?
24. Explain in detail K-Means clustering with a suitable example.
25. Explain various cross-validation methods with suitable examples.
26. How do you monitor deployment activity in IBM Watson Studio?
Dimension Reduction-

In pattern recognition, Dimension Reduction is defined as-

• It is the process of converting a data set with a large number of dimensions into a data set
with fewer dimensions.
• It ensures that the converted data set conveys similar information concisely.

Example-

Consider the following example-

• The following graph shows two dimensions x1 and x2.


• x1 represents the measurement of several objects in cm.
• x2 represents the measurement of several objects in inches.

In machine learning,

• Both of these dimensions convey similar information.
• Also, they introduce a lot of noise into the system.
• So, it is better to use just one dimension.
Using dimension reduction techniques-

• We convert the dimensions of data from 2 dimensions (x1 and x2) to 1 dimension (z1).
• It makes the data relatively easier to explain.
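
A tiny NumPy check makes the redundancy concrete (the specific lengths are made up for
illustration):

import numpy as np

x1_cm = np.array([10.0, 25.0, 40.0, 55.0])   # object lengths in centimetres
x2_in = x1_cm / 2.54                          # the same lengths in inches

print(np.corrcoef(x1_cm, x2_in)[0, 1])        # 1.0: the two dimensions are fully redundant

Since the correlation is exactly 1, a single derived dimension z1 preserves all the information.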
Benefits-

Dimension reduction offers several benefits such as-

• It compresses the data and thus reduces storage space requirements.
• It reduces computation time, since fewer dimensions require less computation.
• It eliminates redundant features.
• It improves model performance.

Dimension Reduction Techniques-

The two popular and well-known dimension reduction techniques are-

1. Principal Component Analysis (PCA)


2. Fisher Linear Discriminant Analysis (LDA)

In this article, we will discuss about Principal Component Analysis.

Principal Component Analysis-

• Principal Component Analysis is a well-known dimension reduction technique.


• It transforms the variables into a new set of variables called principal components.
• These principal components are linear combinations of the original variables and are orthogonal.
• The first principal component accounts for most of the possible variation in the original data.
• The second principal component captures as much of the remaining variance as possible.
• There can be only two principal components for a two-dimensional data set.
PRACTICE PROBLEMS BASED ON PRINCIPAL COMPONENT ANALYSIS

PCA Algorithm-

The steps involved in PCA Algorithm are as follows-

Step-01: Get data.
Step-02: Compute the mean vector (µ).
Step-03: Subtract the mean from the given data.
Step-04: Calculate the covariance matrix.
Step-05: Calculate the eigen vectors and eigen values of the covariance matrix.
Step-06: Choose components and form a feature vector.
Step-07: Derive the new data set.
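
These steps can be cross-checked with a short NumPy sketch; it uses the data of Problem-01
below, and the printed eigen values match the hand computation up to rounding:

import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)

mu = X.mean(axis=0)                    # Step-02: mean vector, (4.5, 5)
D = X - mu                             # Step-03: subtract the mean
C = (D.T @ D) / len(X)                 # Step-04: covariance matrix (divide by n)

eigvals, eigvecs = np.linalg.eigh(C)   # Step-05: eigen decomposition
top = eigvecs[:, np.argmax(eigvals)]   # Step-06: keep the dominant component

Z = D @ top                            # Step-07: project onto the principal component
print(eigvals)                         # approximately [0.38, 8.21]
print(Z)                               # the one-dimensional representation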

Problem-01:

Given data = { 2, 3, 4, 5, 6, 7 ; 1, 5, 3, 6, 7, 8 }.

Compute the principal component using PCA Algorithm.

OR

Consider the two-dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8).

Compute the principal component using PCA Algorithm.

OR

Compute the principal component of following data-

CLASS 1

X=2,3,4
Y=1,5,3

CLASS 2

X=5,6,7

Y=6,7,8

Solution-

We use the above discussed PCA Algorithm-

Step-01:

Get data.

The given feature vectors are-

• x1 = (2, 1)
• x2 = (3, 5)
• x3 = (4, 3)
• x4 = (5, 6)
• x5 = (6, 7)
• x6 = (7, 8)

Step-02:

Calculate the mean vector (µ).

Mean vector (µ)

= ((2 + 3 + 4 + 5 + 6 + 7) / 6, (1 + 5 + 3 + 6 + 7 + 8) / 6)

= (4.5, 5)

Thus, µ = (4.5, 5).

Step-03:

Subtract mean vector (µ) from the given feature vectors.

• x1 – µ = (2 – 4.5, 1 – 5) = (-2.5, -4)


• x2 – µ = (3 – 4.5, 5 – 5) = (-1.5, 0)
• x3 – µ = (4 – 4.5, 3 – 5) = (-0.5, -2)
• x4 – µ = (5 – 4.5, 6 – 5) = (0.5, 1)
• x5 – µ = (6 – 4.5, 7 – 5) = (1.5, 2)
• x6 – µ = (7 – 4.5, 8 – 5) = (2.5, 3)

These are the mean-subtracted feature vectors (xi – µ) used in the next step.

Step-04:

Calculate the covariance matrix.

The covariance matrix is given by

C = (1/n) Σ (xi – µ)(xi – µ)ᵀ

where each mi = (xi – µ)(xi – µ)ᵀ is a 2 × 2 matrix formed from one mean-subtracted feature
vector. So,

Covariance matrix = (m1 + m2 + m3 + m4 + m5 + m6) / 6

On adding the above matrices and dividing by 6, we get

C = | 2.92  3.67 |
    | 3.67  5.67 |

Step-05:

Calculate the eigen values and eigen vectors of the covariance matrix.

λ is an eigen value for a matrix M if it is a solution of the characteristic equation |M – λI| = 0.

So, we have

| 2.92 – λ    3.67     |
| 3.67        5.67 – λ |  = 0

From here,

(2.92 – λ)(5.67 – λ) – (3.67 × 3.67) = 0

16.56 – 2.92λ – 5.67λ + λ² – 13.47 = 0

λ² – 8.59λ + 3.09 = 0

Solving this quadratic equation, we get λ = 8.22, 0.38.

Thus, the two eigen values are λ1 = 8.22 and λ2 = 0.38.

Clearly, the second eigen value is very small compared to the first eigen value.

So, the second eigen vector can be left out.

The eigen vector corresponding to the greatest eigen value is the principal component for the given
data set. So, we find the eigen vector corresponding to the eigen value λ1.

We use the following equation to find the eigen vector-

MX = λX

where-

• M = Covariance Matrix
• X = Eigen vector
• λ = Eigen value

Substituting the values into MX = λ1X, we get

2.92X1 + 3.67X2 = 8.22X1

3.67X1 + 5.67X2 = 8.22X2

On simplification, we get

5.3X1 = 3.67X2 ………(1)

3.67X1 = 2.55X2 ………(2)

From (1) and (2), X1 = 0.69X2.

Taking X2 = 1, the eigen vector is X = (0.69, 1), which normalized to unit length is approximately
(0.57, 0.82).

Thus, the principal component for the given data set is the direction (0.69, 1).

Lastly, we project each mean-subtracted data point onto this one-dimensional subspace.
Problem-02:

Use PCA Algorithm to transform the pattern (2, 1) onto the eigen vector in the previous question.

Solution-

The given feature vector is (2, 1).

The feature vector is transformed to

z = eᵀ (x – µ), i.e., Transpose of Eigen vector × (Feature Vector – Mean Vector)
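
Completing the computation by hand (the original result is not shown, so the numbers below use
the unit eigen vector from Problem-01, e ≈ (0.57, 0.82), and µ = (4.5, 5)):

z ≈ 0.57 × (2 – 4.5) + 0.82 × (1 – 5) ≈ –1.43 – 3.28 ≈ –4.71

So the pattern (2, 1) projects to approximately –4.71 on the principal component.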


CP4252 MACHINE LEARNING L T PC
3 0 2 4
COURSE OBJECTIVES:
• To understand the concepts and mathematical foundations of machine learning and the types of
problems tackled by machine learning
• To explore the different supervised learning techniques including ensemble methods
• To learn different aspects of unsupervised learning and reinforcement learning
• To learn the role of probabilistic methods for machine learning
• To understand the basic concepts of neural networks and deep learning

UNIT I INTRODUCTION AND MATHEMATICAL FOUNDATIONS 9


What is Machine Learning? Need –History – Definitions – Applications - Advantages, Disadvantages
& Challenges -Types of Machine Learning Problems – Mathematical Foundations - Linear Algebra &
Analytical Geometry -Probability and Statistics- Bayesian Conditional Probability -Vector Calculus &
Optimization - Decision Theory - Information theory

UNIT II SUPERVISED LEARNING 9


Introduction-Discriminative and Generative Models -Linear Regression - Least Squares -Under-fitting
/ Overfitting -Cross-Validation – Lasso Regression- Classification - Logistic Regression- Gradient
Linear Models -Support Vector Machines –Kernel Methods -Instance based Methods - K-Nearest
Neighbours - Tree based Methods –Decision Trees –ID3 – CART - Ensemble Methods –Random
Forest - Evaluation of Classification Algorithms

UNIT III UNSUPERVISED LEARNING AND REINFORCEMENT LEARNING 9


Introduction - Clustering Algorithms -K – Means – Hierarchical Clustering - Cluster Validity -
Dimensionality Reduction –Principal Component Analysis – Recommendation Systems - EM
algorithm. Reinforcement Learning – Elements -Model based Learning – Temporal Difference
Learning

UNIT IV PROBABILISTIC METHODS FOR LEARNING- 9


Introduction -Naïve Bayes Algorithm -Maximum Likelihood -Maximum Apriori -Bayesian Belief
Networks -Probabilistic Modelling of Problems -Inference in Bayesian Belief Networks – Probability
Density Estimation - Sequence Models – Markov Models – Hidden Markov Models

UNIT V NEURAL NETWORKS AND DEEP LEARNING 9


Neural Networks – Biological Motivation- Perceptron – Multi-layer Perceptron – Feed Forward
Network – Back Propagation-Activation and Loss Functions- Limitations of Machine Learning – Deep
Learning– Convolution Neural Networks – Recurrent Neural Networks – Use cases
45 PERIODS
SUGGESTED ACTIVITIES:
1. Give an example from our daily life for each type of machine learning problem
2. Study at least 3 Tools available for Machine Learning and discuss pros & cons of each
3. Take an example of a classification problem. Draw different decision trees for the example
and explain the pros and cons of each decision variable at each level of the tree
4. Outline 10 machine learning applications in healthcare
5. Give 5 examples where sequential models are suitable.
6. Give at least 5 recent applications of CNN

Assignment -1
Build a decision tree using ID3 algorithm for the given training data in the table (Buy Computer data),
and predict the class of the following new example: age<=30, income=medium, student=yes, credit-
rating=fair

age income student Credit rating Buys computer

<=30 high no fair no

<=30 high no excellent no

31…40 high no fair yes

>40 medium no fair yes

>40 low yes fair yes

>40 low yes excellent no

31…40 low yes excellent yes

<=30 medium no fair no

<=30 low yes fair yes

>40 medium yes fair yes

<=30 medium yes excellent yes

31…40 medium no excellent yes

31…40 high yes fair yes

>40 medium no excellent no


Assignment-2 Build a decision tree using ID3 algorithm for the given training data in the table COVID-19
infection.
