UNIT – II
Bayesian Learning: Probability Theory and Bayes Rule, Naive Bayes Learning Algorithm, Bayes Nets
Bayes Theorem:
Let E1, E2, …, En be a set of events associated with a sample space S, where all the events E1, E2, …, En
have nonzero probability of occurrence and form a partition of S. Let A be any event associated with S
with P(A) > 0. Then, according to Bayes' theorem, for i = 1, 2, …, n:

P(Ei | A) = P(Ei) · P(A | Ei) / [ P(E1) · P(A | E1) + P(E2) · P(A | E2) + … + P(En) · P(A | En) ]
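For instance, Bayes' theorem lets us "invert" a conditional probability. Below is a minimal sketch in Python, using invented numbers for a diagnostic-test scenario, that computes the posterior P(E1 | A) from the priors and the likelihoods:

# Minimal sketch of Bayes' rule; all numbers are hypothetical.
# E1 = person has the condition, E2 = person does not (E1, E2 partition S)
# A  = the test comes back positive
p_e1 = 0.01                 # prior P(E1)
p_e2 = 0.99                 # prior P(E2)
p_a_given_e1 = 0.95         # likelihood P(A | E1)
p_a_given_e2 = 0.05         # likelihood P(A | E2), the false-positive rate

# Denominator: total probability P(A), summed over the partition
p_a = p_a_given_e1 * p_e1 + p_a_given_e2 * p_e2

# Bayes' rule: P(E1 | A) = P(E1) * P(A | E1) / P(A)
p_e1_given_a = (p_e1 * p_a_given_e1) / p_a
print(round(p_e1_given_a, 4))   # 0.161 -- small despite the accurate test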
Naive Bayes Classifiers:
Naive Bayes classifiers are a family of algorithms based on Bayes' theorem. Despite the "naive" assumption
of feature independence, these classifiers are widely used for their simplicity and efficiency in machine
learning. Naive Bayes is not a single algorithm but a family of algorithms that all share a common principle:
every pair of features being classified is independent of each other, given the class.
One of the simplest and most effective classification algorithms, the Naive Bayes classifier aids in the rapid
development of machine learning models with fast prediction capabilities.
The Naive Bayes algorithm is used for classification problems and is heavily used in text classification. In
text classification tasks the data is high-dimensional, as each word represents one feature. It is used in
spam filtering, sentiment detection, rating classification, etc. The advantage of Naive Bayes is its speed:
it is fast to train, and making predictions is easy even with high-dimensional data.
The model predicts the probability that an instance belongs to a class, given a set of feature values. It is a
probabilistic classifier, and it is called "naive" because it assumes that each feature in the model is
independent of the others given the class. In other words, each feature contributes to the prediction with no
relation to the other features. In the real world this condition is rarely satisfied. The algorithm uses Bayes'
theorem for both training and prediction.
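Concretely, under this independence assumption the posterior factorizes, so the classifier picks the class that maximizes the prior times the product of the per-feature likelihoods:

P(c | x1, …, xn) ∝ P(c) · P(x1 | c) · P(x2 | c) · … · P(xn | c)

predicted class = argmax over c of P(c) · Πi P(xi | c)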
Types of Naive Bayes Model
There are three main types of Naive Bayes classifiers, each designed to handle different types of data
distributions and features:
1. Gaussian Naive Bayes (GaussianNB)
● Purpose: Used when the features are continuous and are assumed to follow a Gaussian (normal)
distribution.
● Assumption: It assumes that the likelihood of the features given the class follows a normal
distribution. For each feature, the model estimates the mean and variance from the training data for
each class.
● Formula: The probability density function of a Gaussian distribution for a feature xi given class c is:

  P(xi | c) = (1 / √(2π σc²)) · exp( −(xi − μc)² / (2σc²) )

  where μc and σc² are the mean and variance of the feature for class c.
● Example: Gaussian Naive Bayes is often used in applications where continuous features such as
height, weight, or age are common and can be modeled with a normal distribution.
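As a quick illustration, here is a minimal sketch using scikit-learn's GaussianNB on invented continuous data; the measurements, labels, and query point are all hypothetical:

# Gaussian Naive Bayes on toy continuous features (height cm, weight kg)
from sklearn.naive_bayes import GaussianNB

X = [[180, 80], [175, 77], [160, 55], [158, 52]]   # [height, weight]
y = ["male", "male", "female", "female"]           # toy class labels

model = GaussianNB()
model.fit(X, y)                # estimates per-class mean and variance of each feature
print(model.predict([[170, 68]]))        # class with the highest posterior
print(model.predict_proba([[170, 68]]))  # posterior probability of each class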
2. Multinomial Naive Bayes
● Purpose: Primarily used for discrete data, especially when dealing with document classification or
word count features.
● Assumption: The features are counts (positive integers) and are distributed according to a
multinomial distribution. It works best for problems where the features represent frequencies, such
as word counts in a text document.
● Formula: The likelihood of the feature counts x = (x1, …, xn) given a class c is:

  P(x | c) ∝ θc1^x1 · θc2^x2 · … · θcn^xn

  where θci = P(feature i | c) is the per-class probability of feature i, typically estimated from training
  counts with Laplace (add-one) smoothing.
● Example: Multinomial Naive Bayes is commonly used in Natural Language Processing (NLP) tasks
like spam detection, text classification, and sentiment analysis, where the feature set is often the
frequency of words or n-grams in documents.
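A minimal sketch of Multinomial Naive Bayes for spam filtering, assuming scikit-learn is available; the example messages and labels are invented for illustration:

# Multinomial Naive Bayes on word-count features
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win money now", "limited offer win prize",
         "meeting at noon", "project status report"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()            # features = word counts per message
X = vectorizer.fit_transform(texts)

model = MultinomialNB()                   # alpha=1.0 gives Laplace smoothing
model.fit(X, labels)
print(model.predict(vectorizer.transform(["win a prize now"])))  # likely 'spam'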
3. Bernoulli Naive Bayes
● Purpose: Used for binary/Boolean feature vectors, where features are represented as 0s or 1s
(true/false or presence/absence of a feature).
● Assumption: Each feature is binary and follows a Bernoulli distribution, meaning that it can take only
two possible outcomes: 0 or 1.
● Formula: The likelihood of a binary feature xi given a class c is:

  P(xi | c) = pci · xi + (1 − pci) · (1 − xi)

  where pci = P(xi = 1 | c) is estimated from the training data.
● Example: Bernoulli Naive Bayes is commonly used in text classification tasks where the feature set
consists of binary indicators for the presence or absence of words in a document (e.g., whether the
word “spam” is present in an email).
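A minimal sketch using scikit-learn's BernoulliNB on binary word-presence features; the vocabulary, feature matrix, and labels are invented:

# Bernoulli Naive Bayes on presence/absence features
from sklearn.naive_bayes import BernoulliNB

# Columns: ["free", "win", "meeting"] -- 1 if the word occurs in the message
X = [[1, 1, 0],
     [1, 0, 0],
     [0, 0, 1],
     [0, 1, 1]]
y = ["spam", "spam", "ham", "ham"]

model = BernoulliNB()
model.fit(X, y)                  # estimates P(feature = 1 | class) per column
print(model.predict([[1, 1, 1]]))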
Summary of Differences:
Classifier     Feature Type        Assumed Distribution   Example Applications
GaussianNB     Continuous          Gaussian (Normal)      Continuous data, e.g., medical diagnosis based on measurements like height, weight, etc.
MultinomialNB  Discrete (counts)   Multinomial            Text classification, word frequency analysis, sentiment analysis
BernoulliNB    Binary (0/1)        Bernoulli              Text classification with binary features (word presence/absence)
Bayesian Belief Network:
A Bayesian Belief Network (BBN), also known as a Bayesian Network (BN), is a probabilistic graphical
model that represents a set of variables and their conditional dependencies via a directed acyclic graph
(DAG). It provides a compact and efficient way of representing the joint probability distribution of random
variables.
"A Bayesian network is a probabilistic graphical model which represents a set of variables and their
conditional dependencies using a directed acyclic graph."
Bayesian networks are probabilistic, because these networks are built from a probability distribution, and
also use probability theory for prediction and anomaly detection.
Real-world applications are probabilistic in nature, and to represent the relationships between multiple
events we need a Bayesian network. Bayesian networks can be used in various tasks, including prediction,
anomaly detection, diagnostics, automated insight, reasoning, time-series prediction, and decision making
under uncertainty.
A Bayesian network can be used for building models from data and experts' opinions, and it consists of two
parts:
o A directed acyclic graph
o A table of conditional probabilities
The generalized form of a Bayesian network that represents and solves decision problems under
uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:
o Each node corresponds to a random variable, which can be continuous or discrete.
o Arcs (directed arrows) represent causal relationships or conditional dependencies between random
variables. These directed links connect pairs of nodes in the graph. A link indicates that one node
directly influences the other; if there is no directed link from one node to another, the first node has
no direct influence on the second.
o For example, consider a graph in which A, B, C, and D are random variables represented by the
nodes of the network.
o If node B is connected to node A by a directed arrow from A to B, then node A is called the parent
of node B.
o Node C is independent of node A.
Note: The Bayesian network graph does not contain any cycles. Hence, it is known as a directed
acyclic graph, or DAG.
The Bayesian network has two main components:
o The causal component (the graph structure)
o The actual numbers (the probabilities)
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which
quantifies the effect of the parents on that node.
A Bayesian network is based on the joint probability distribution and conditional probability, so let's first
understand the joint probability distribution:
Joint probability distribution:
If we have variables x1, x2, x3, …, xn, then the probabilities of the different combinations of x1, x2, x3, …, xn
are known as the joint probability distribution. By the chain rule, P[x1, x2, x3, …, xn] can be written as
follows:

P[x1, x2, x3, …, xn] = P[x1 | x2, x3, …, xn] · P[x2, x3, …, xn]
= P[x1 | x2, x3, …, xn] · P[x2 | x3, …, xn] · … · P[xn−1 | xn] · P[xn]
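As a small numeric sanity check of this chain rule, the Python sketch below defines an invented two-variable joint distribution and verifies that P[x1, x2] = P[x1 | x2] · P[x2]:

# Chain-rule check on a made-up joint distribution over (x1, x2)
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Marginal P[x2], obtained by summing x1 out of the joint
p_x2 = {v: sum(p for (x1, x2), p in joint.items() if x2 == v) for v in (0, 1)}

# Conditional P[x1 | x2] = P[x1, x2] / P[x2]
cond = {(x1, x2): p / p_x2[x2] for (x1, x2), p in joint.items()}

# Reconstruct each joint entry from the factorization and compare
for (x1, x2), p in joint.items():
    print((x1, x2), p, cond[(x1, x2)] * p_x2[x2])   # the two values match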
Explanation of Bayesian network:
Let's understand the Bayesian network through an example by creating a directed acyclic graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm reliably responds to
a burglary, but it also responds to minor earthquakes. Harry has two neighbors, David and Sophia, who have
taken responsibility for informing Harry at work when they hear the alarm. David always calls Harry when he
hears the alarm, but sometimes he confuses the phone ringing with the alarm and calls then as well. On the
other hand, Sophia likes to listen to loud music, so sometimes she misses the alarm altogether. Here we
would like to compute the probability of the burglar alarm going off.
Problem:
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has
occurred, and both David and Sophia called Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure shows that
Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the
alarm going off, whereas David's and Sophia's calls depend on the Alarm node.
o The network represents the assumptions that the neighbors do not directly perceive the burglary, do
not notice the minor earthquake, and do not confer with each other before calling.
o The conditional distribution for each node is given as a conditional probability table, or CPT.
o Each row in a CPT must sum to 1, because the entries in the row represent an exhaustive set of
cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents has 2^k independent probabilities (one per
combination of parent values). Hence, if there are two parents, the CPT will contain 4 probability values.
List of all events occurring in this network:
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can write the events of the problem statement in the form of a probability, P[D, S, A, B, E], and rewrite
this probability using the joint probability distribution:

P[D, S, A, B, E] = P[D | S, A, B, E] · P[S, A, B, E]
= P[D | S, A, B, E] · P[S | A, B, E] · P[A, B, E]
= P[D | A] · P[S | A, B, E] · P[A, B, E]
= P[D | A] · P[S | A] · P[A | B, E] · P[B, E]
= P[D | A] · P[S | A] · P[A | B, E] · P[B | E] · P[E]
= P[D | A] · P[S | A] · P[A | B, E] · P[B] · P[E], since Burglary and Earthquake are independent.
Let's take the observed probabilities for the Burglary and Earthquake components:
P(B = True) = 0.002, the probability of a burglary.
P(B = False) = 0.998, the probability of no burglary.
P(E = True) = 0.001, the probability of a minor earthquake.
P(E = False) = 0.999, the probability that no earthquake occurred.
We can provide the conditional probabilities in the tables below.
Conditional probability table for Alarm A:
The conditional probability of Alarm A depends on Burglary and Earthquake:
B E P(A= True) P(A= False)
True True 0.94 0.06
True False 0.95 0.05
False True 0.31 0.69
False False 0.001 0.999
Conditional probability table for David Calls:
The conditional probability that David calls depends on the state of the Alarm node.
A P(D= True) P(D= False)
True 0.91 0.09
False 0.05 0.95
Conditional probability table for Sophia Calls:
The conditional probability that Sophia calls depends on its parent node, "Alarm."
A P(S= True) P(S= False)
True 0.75 0.25
False 0.02 0.98
From the formula for the joint distribution, we can write the problem statement in the form of a probability:

P(S, D, A, ¬B, ¬E) = P(S | A) · P(D | A) · P(A | ¬B ∧ ¬E) · P(¬B) · P(¬E)
= 0.75 × 0.91 × 0.001 × 0.998 × 0.999
= 0.00068045
Hence, a Bayesian network can answer any query about the domain by using the joint distribution.
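The same computation can be reproduced in a few lines of Python using the CPT values given above:

# Query P(S, D, A, ~B, ~E) from the CPTs of the burglar-alarm network
p_b = 0.002                 # P(B = True)
p_e = 0.001                 # P(E = True)
p_a_given_nb_ne = 0.001     # P(A = True | B = False, E = False)
p_d_given_a = 0.91          # P(D = True | A = True)
p_s_given_a = 0.75          # P(S = True | A = True)

result = p_s_given_a * p_d_given_a * p_a_given_nb_ne * (1 - p_b) * (1 - p_e)
print(result)               # about 0.00068045, as computed above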
2. State the different probabilities used in a Bayesian Belief Network
In Bayesian Belief Networks (BBNs), several types of probabilities are used:
1. Prior Probabilities: These are the probabilities of each node (variable) in the network occurring
independently, without considering any relationships with other nodes. They represent the initial
beliefs or assumptions about the variables.
2. Conditional Probabilities: These represent the probability of a node (variable) occurring given the
values of its parent nodes in the network. They define the relationships between the variables.
3. Joint Probabilities: These are the probabilities of combinations of variables occurring together. They
can be calculated using the chain rule of probability and the conditional probabilities defined in the
BBN.
4. Posterior Probabilities: These are the probabilities of nodes (variables) occurring given the evidence
(observed values) in the network. They are calculated using inference algorithms, such as belief
propagation or sampling methods, to update the prior probabilities based on the evidence.
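As an illustration of how posterior probabilities are computed, the sketch below evaluates P(Burglary = True | David calls, Sophia calls) for the burglar-alarm network of the previous question by brute-force enumeration over the hidden variables. It reuses the CPT values given earlier and is meant as a didactic sketch, not an efficient inference algorithm:

# Posterior inference by enumeration on the burglar-alarm network
from itertools import product

P_B = {True: 0.002, False: 0.998}                 # prior P(B)
P_E = {True: 0.001, False: 0.999}                 # prior P(E)
P_A = {(True, True): 0.94, (True, False): 0.95,   # P(A = True | B, E)
       (False, True): 0.31, (False, False): 0.001}
P_D = {True: 0.91, False: 0.05}                   # P(D = True | A)
P_S = {True: 0.75, False: 0.02}                   # P(S = True | A)

def joint(b, e, a, d, s):
    """Chain-rule factorization P(B) P(E) P(A|B,E) P(D|A) P(S|A)."""
    p = P_B[b] * P_E[e]
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_D[a] if d else 1 - P_D[a]
    p *= P_S[a] if s else 1 - P_S[a]
    return p

# Evidence: D = True, S = True; sum out the hidden variables E and A
num = sum(joint(True, e, a, True, True) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True, True) for b, e, a in product([True, False], repeat=3))
print(num / den)   # posterior P(Burglary = True | both neighbors called)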