Bayesian Classification
A statistical classifier
Probabilistic prediction
Predict class membership probabilities
Based on Bayes’ Theorem
Naïve Bayesian classifier
Comparable performance with decision tree and selected neural network classifiers
Good accuracy and speed when applied to large databases
Incremental: each training example can incrementally update the estimated probabilities
Bayesian Classification
Naïve Bayesian Classifier
Class conditional independence: the effect of an attribute value on a given class is independent of the values of the other attributes
Simplifies computations
Bayesian Belief Networks
Graphical models
Represent dependencies among subsets of attributes
Bayesian Theorem: Basics
Let X be a data sample whose class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
P(H|X): posterior probability of H conditioned on X
P(H) (prior probability): the initial probability of H, before X is observed
P(X): probability that the sample data is observed
P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
Example: X is a round, red fruit and H is the hypothesis that X is an apple
Bayesian Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
P(H|X) = P(X|H) P(H) / P(X)
Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes
Practical difficulty: requires initial knowledge of many probabilities and incurs significant computational cost
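To make the update rule concrete, here is a minimal Python sketch applying Bayes' theorem to the round-red-fruit / apple example; the numeric prior, likelihood, and evidence values are illustrative assumptions, not values from the slides.

# Minimal sketch of Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X).
# The numbers below are illustrative assumptions, not values from the slides.

def posterior(p_x_given_h, p_h, p_x):
    """Return P(H|X) from the likelihood, prior, and evidence."""
    return p_x_given_h * p_h / p_x

# H: the fruit is an apple; X: the fruit is round and red (hypothetical values)
p_h = 0.30          # prior: 30% of fruits are apples
p_x_given_h = 0.90  # most apples are round and red
p_x = 0.40          # 40% of all fruits are round and red

print(posterior(p_x_given_h, p_h, p_x))  # ≈ 0.675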
Naïve Bayesian Classifier
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm
Classification derives the maximum a posteriori class, i.e., the class Ci with maximal P(Ci|X)
This can be derived from Bayes' theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Naïve Bayesian Classifier
Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized
If the classes are assumed equally likely, it suffices to maximize P(X|Ci)
A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
Derivation of Naïve Bayes Classifier
This greatly reduces the computation cost: only the class distribution needs to be counted
If Ak is categorical, P(xk|Ci) = sik / si, where sik is the number of tuples in Ci having value xk for Ak and si is the number of training tuples belonging to Ci
If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ: P(xk|Ci) = g(xk, μCi, σCi), where
g(x, μ, σ) = (1 / (√(2π) σ)) · e^(−(x−μ)² / (2σ²))
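As a small illustration of the continuous case, the sketch below estimates the class-conditional density of one attribute value with the Gaussian g(x, μ, σ); the helper names and the sample ages are assumptions for illustration, with μ and σ computed from the training values of class Ci.

import math

def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma) used for continuous attributes."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def continuous_likelihood(x_k, values_in_class):
    """Estimate P(x_k | Ci) from the attribute values observed in class Ci."""
    n = len(values_in_class)
    mu = sum(values_in_class) / n
    var = sum((v - mu) ** 2 for v in values_in_class) / n
    return gaussian(x_k, mu, math.sqrt(var))

# Hypothetical ages of customers in class buys_computer = 'yes'
ages_yes = [25, 32, 45, 38, 29, 41, 35, 30, 44]
print(continuous_likelihood(33, ages_yes))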
Example
Class: C1: buys_computer = ‘yes’; C2: buys_computer = ‘no’
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class buys_computer = “yes”
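A minimal sketch of the same computation, assuming the 14-tuple table above is encoded as a list of (age, income, student, credit_rating, buys_computer) tuples; the names data, X, and naive_bayes_predict are illustrative, not from the slides.

# Naive Bayes on the buys_computer example above (plain counting, no smoothing).
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

X = ("<=30", "medium", "yes", "fair")  # age, income, student, credit_rating

def naive_bayes_predict(data, x):
    scores = {}
    for c in {row[-1] for row in data}:
        rows_c = [row for row in data if row[-1] == c]
        score = len(rows_c) / len(data)               # prior P(Ci)
        for k, value in enumerate(x):                 # product of P(xk|Ci)
            score *= sum(1 for row in rows_c if row[k] == value) / len(rows_c)
        scores[c] = score
    return max(scores, key=scores.get), scores

print(naive_bayes_predict(data, X))  # 'yes' wins: ≈ 0.028 vs ≈ 0.007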
Avoiding the 0-Probability Problem
Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:
P(X|Ci) = ∏(k=1..n) P(xk|Ci)
Example: suppose a dataset with 1000 tuples has income = low (0 tuples), income = medium (990 tuples), and income = high (10 tuples)
Use the Laplacian correction (or Laplacian estimator): add 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The “corrected” probability estimates are close to their “uncorrected” counterparts
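A small sketch of the correction under the slide's scenario (the function name laplace_estimate is illustrative): add 1 to each value count and add the number of distinct attribute values to the denominator.

def laplace_estimate(count, total, num_values):
    """Laplacian-corrected estimate of P(value | class)."""
    return (count + 1) / (total + num_values)

# income over 1000 tuples: low = 0, medium = 990, high = 10 (three distinct values)
for name, count in [("low", 0), ("medium", 990), ("high", 10)]:
    print(name, laplace_estimate(count, 1000, 3))  # 1/1003, 991/1003, 11/1003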
Naïve Bayesian Classifier
Advantages
Easy to implement
Good results obtained in most cases
Disadvantages
Relies on the assumption of class conditional independence, which can cause a loss of accuracy
In practice, dependencies exist among variables
E.g., hospital patient data: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
Bayesian Belief Networks
Models dependencies between variables
Defined by two components:
Directed Acyclic Graph
Conditional Probability Table (CPT) for each variable
A Bayesian belief network allows a subset of the variables to be conditionally independent
Bayesian Belief Networks
A graphical model of causal relationships
Represents dependency among the variables
Gives a specification of joint probability distribution
Nodes: random variables
Links: dependency
Example with four nodes X, Y, Z, P: X and Y are the parents of Z, and Y is the parent of P; there is no direct dependency between Z and P
The graph has no loops or cycles
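One possible in-memory representation of this example graph, as a mapping from each node to its parents; a minimal sketch whose structure comes from the bullet above.

# Parents of each node in the example DAG above: X and Y are parents of Z, Y is the parent of P.
parents = {
    "X": [],
    "Y": [],
    "Z": ["X", "Y"],
    "P": ["Y"],
}

# The joint distribution then factors as P(X, Y, Z, P) = P(X) P(Y) P(Z | X, Y) P(P | Y).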
Bayesian Belief Network: An Example
Variables: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea; FamilyHistory and Smoker are the parents of LungCancer
The conditional probability table (CPT) for the variable LungCancer:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9

The CPT shows the conditional probability of the variable for each possible combination of the values of its parents
Derivation of the probability of a particular combination of values x1, …, xn of the variables from the CPT:
P(x1, …, xn) = ∏(i=1..n) P(xi | Parents(Yi))
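As an illustration of evaluating the factored joint probability, here is a minimal sketch for a reduced network containing only FamilyHistory, Smoker, and LungCancer; the priors for FamilyHistory and Smoker are made-up values, and only the LungCancer CPT comes from the slide.

# Joint probability P(fh, s, lc) = P(fh) * P(s) * P(lc | fh, s) for a 3-node network.
# Priors for FamilyHistory and Smoker are assumed for illustration.
p_fh = {True: 0.2, False: 0.8}
p_s = {True: 0.3, False: 0.7}

# CPT for LungCancer from the slide, indexed by (FamilyHistory, Smoker).
p_lc_given_parents = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

def joint(fh, s, lc):
    p_lc = p_lc_given_parents[(fh, s)]
    return p_fh[fh] * p_s[s] * (p_lc if lc else 1.0 - p_lc)

# Probability that family history, smoker, and lung cancer are all true
print(joint(True, True, True))  # 0.2 * 0.3 * 0.8 = 0.048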
Training Bayesian Networks
Several scenarios:
Given both the network structure and all variables observable: learn only the CPTs (see the counting sketch after this list)
Network structure known, some variables hidden: gradient descent (greedy hill-climbing) method, analogous to neural network learning
Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
Unknown structure, all variables hidden: no good algorithms known for this purpose
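When the structure is known and every variable is observed, each CPT entry is just a conditional relative frequency. A minimal counting sketch under that assumption; the variable names and sample data are hypothetical.

from collections import Counter

# Fully observed samples of (FamilyHistory, Smoker, LungCancer); values are hypothetical.
samples = [
    (True, True, True), (True, True, False), (True, False, False),
    (False, True, True), (False, True, False), (False, False, False),
    (False, False, False), (True, True, True),
]

def learn_cpt(samples):
    """Estimate P(LungCancer = True | FamilyHistory, Smoker) by counting."""
    totals, positives = Counter(), Counter()
    for fh, s, lc in samples:
        totals[(fh, s)] += 1
        if lc:
            positives[(fh, s)] += 1
    return {parents: positives[parents] / n for parents, n in totals.items()}

print(learn_cpt(samples))  # e.g. {(True, True): 0.666..., (True, False): 0.0, ...}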