Classification: Bayesian Classification
Dr. Manish Kumar
Associate Professor
Chair: Data Analytics Lab & M.Tech (Data Engg.)
Department of Information Technology
Indian Institute of Information Technology-Allahabad, Prayagraj
Bayesian Classification
What are Bayesian Classifiers?
▪ Statistical Classifiers
▪ Predict class membership probabilities
▪ Based on Bayes Theorem
▪ Naïve Bayesian Classifier
▪ Computationally Simple
▪ Performance comparable to decision tree (DT) and neural network (NN) classifiers
Bayesian Classification
▪ Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
▪ Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
Bayes Theorem
▪ Let X be a data sample whose class label is unknown
▪ Let H be the hypothesis that X belongs to a class C
▪ For classification, determine P(H|X)
▪ P(H|X) is the probability that H holds given the observed data sample X
▪ P(H|X) is the posterior probability of H conditioned on X
Bayes Theorem
Example: Sample space: all fruits, described by their color and shape
X is "round" and "red"
H = hypothesis that X is an apple
P(H|X) is our confidence that X is an apple given that X is "round" and "red"
▪ P(H) is the prior probability of H, i.e., the probability that any given data sample is an apple, regardless of how it looks
▪ P(H|X) is based on more information than P(H)
▪ Note that P(H) is independent of X
Bayes Theorem
Example: Sample space: all fruits
▪ P(X|H)?
▪ It is the probability that X is round and red, given that we know X is an apple
▪ P(X) is the prior probability of X = P(a data sample from our set of fruits is red and round)
Estimating Probabilities
▪ P(X), P(H), and P(X|H) may be estimated from the given data
▪ Bayes Theorem: P(H|X) = P(X|H) P(H) / P(X)
▪ Use of Bayes Theorem in the Naïve Bayesian Classifier!!
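As a minimal Python sketch of how these probabilities can be estimated from counts and combined via Bayes Theorem, using the fruit example above with purely hypothetical counts:

# Bayes Theorem on the fruit example; the counts are hypothetical, for illustration only.
total = 1000              # fruits in the sample space
apples = 200              # fruits that are apples
round_red = 300           # fruits that are round and red
round_red_apples = 180    # apples that are also round and red

p_h = apples / total                      # prior P(H)
p_x = round_red / total                   # prior P(X)
p_x_given_h = round_red_apples / apples   # likelihood P(X|H)

# Bayes Theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 2))              # 0.9 * 0.2 / 0.3 = 0.6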
Naïve Bayesian Classification
▪ Also called the Simple Bayesian Classifier
▪ Assumes class conditional independence: the effect of an attribute value on a given class is independent of the values of the other attributes
▪ This assumption simplifies computations
Naïve Bayesian Classification
Steps Involved
1. Each data sample is of the form X = (x1, x2, …, xn), where xi is the value of X for attribute Ai
2. Suppose there are m classes C1, C2, …, Cm.
   X ∈ Ci iff
   P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
   i.e., the classifier assigns X to the class Ci having the highest posterior probability conditioned on X
Naïve Bayesian Classification
The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.
From Bayes Theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
3. P(X) is constant for all classes, so only P(X|Ci) P(Ci) need be maximized.
▪ If the class prior probabilities are not known, assume all classes to be equally likely, i.e., P(C1) = P(C2) = … = P(Cm), and therefore maximize P(X|Ci)
▪ Otherwise, estimate P(Ci) = si/s, where si is the number of training samples of class Ci and s is the total number of training samples, and maximize P(X|Ci) P(Ci)
Problem: computing P(X|Ci) directly is computationally expensive (may be infeasible)
Naïve Bayesian Classification
4. Naïve assumption: attribute independence (class conditional independence)
   P(X|Ci) = P(x1, …, xn|Ci) = Π P(xk|Ci), the product taken over k = 1 to n
5. To classify an unknown sample X, evaluate P(X|Ci) P(Ci) for each class Ci. Sample X is assigned to the class Ci iff
   P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i
Naïve Bayesian Classification
Example
Age     Income   Student  Credit_rating  Class: Buys_comp
<=30    HIGH     N        FAIR           N
<=30    HIGH     N        EXCELLENT      N
31…40   HIGH     N        FAIR           Y
>40     MEDIUM   N        FAIR           Y
>40     LOW      Y        FAIR           Y
>40     LOW      Y        EXCELLENT      N
31…40   LOW      Y        EXCELLENT      Y
<=30    MEDIUM   N        FAIR           N
<=30    LOW      Y        FAIR           Y
>40     MEDIUM   Y        FAIR           Y
<=30    MEDIUM   Y        EXCELLENT      Y
31…40   MEDIUM   N        EXCELLENT      Y
31…40   HIGH     Y        FAIR           Y
>40     MEDIUM   N        EXCELLENT      N
Naïve Bayesian Classification
Example
X = (age<=30, income=MEDIUM, student=Y, credit_rating=FAIR); class Buys_comp = ?
We need to maximize
P(X|Ci) P(Ci) for i = 1, 2.
P(Ci) is computed from the training samples:
P(buys_comp=Y) = 9/14 = 0.643
P(buys_comp=N) = 5/14 = 0.357
How to calculate P(X|Ci) P(Ci) for i = 1, 2?
P(X|Ci) = P(x1, x2, x3, x4|Ci) = Π P(xk|Ci)
Naïve Bayesian Classification
Example
P(age<=30 | buys_comp=Y)=2/9=0.222
P(age<=30 | buys_comp=N)=3/5=0.600
P(income=medium | buys_comp=Y)=4/9=0.444
P(income=medium | buys_comp=N)=2/5=0.400
P(student=Y | buys_comp=Y)=6/9=0.667
P(student=Y | buys_comp=N)=1/5=0.200
P(credit_rating=FAIR | buys_comp=Y)=6/9=0.667
P(credit_rating=FAIR | buys_comp=N)=2/5=0.400
Naïve Bayesian Classification
Example
P(X | buys_comp=Y)=0.222*0.444*0.667*0.667=0.044
P(X | buys_comp=N)=0.600*0.400*0.200*0.400=0.019
P(X | buys_comp=Y)P(buys_comp=Y) = 0.044*0.643=0.028
P(X | buys_comp=N)P(buys_comp=N) = 0.019*0.357=0.007
CONCLUSION: The Naïve Bayesian classifier predicts buys_comp=Y for sample X, i.e., X buys a computer.
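The hand computation above can be reproduced with a short Python sketch (the table and attribute order follow the slides; the code itself is only illustrative):

# Naive Bayes on the buys_comp training data from the slides.
data = [
    ("<=30",  "HIGH",   "N", "FAIR",      "N"),
    ("<=30",  "HIGH",   "N", "EXCELLENT", "N"),
    ("31…40", "HIGH",   "N", "FAIR",      "Y"),
    (">40",   "MEDIUM", "N", "FAIR",      "Y"),
    (">40",   "LOW",    "Y", "FAIR",      "Y"),
    (">40",   "LOW",    "Y", "EXCELLENT", "N"),
    ("31…40", "LOW",    "Y", "EXCELLENT", "Y"),
    ("<=30",  "MEDIUM", "N", "FAIR",      "N"),
    ("<=30",  "LOW",    "Y", "FAIR",      "Y"),
    (">40",   "MEDIUM", "Y", "FAIR",      "Y"),
    ("<=30",  "MEDIUM", "Y", "EXCELLENT", "Y"),
    ("31…40", "MEDIUM", "N", "EXCELLENT", "Y"),
    ("31…40", "HIGH",   "Y", "FAIR",      "Y"),
    (">40",   "MEDIUM", "N", "EXCELLENT", "N"),
]
x = ("<=30", "MEDIUM", "Y", "FAIR")   # the unknown sample X

scores = {}
for c in ("Y", "N"):
    rows = [r for r in data if r[-1] == c]
    prior = len(rows) / len(data)                        # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(x):                        # P(xk|Ci) for each attribute
        likelihood *= sum(r[k] == value for r in rows) / len(rows)
    scores[c] = likelihood * prior                       # P(X|Ci) * P(Ci)

print(scores)                         # {'Y': ~0.028, 'N': ~0.007}
print(max(scores, key=scores.get))    # 'Y' -> X buys a computer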
Bayesian Belief Networks
▪ The Naïve BC assumes class conditional independence
▪ This assumption simplifies computations
▪ When the assumption holds true, the Naïve BC is the most accurate classifier in comparison with all other classifiers
▪ In real problems, dependencies do exist between variables
▪ Two approaches to overcome this limitation of the NBC:
▪ Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
▪ Decision trees, which reason on one attribute at a time, considering the most important attributes first
Bayesian Belief Networks
Also known as:
▪ Belief Networks
▪ Bayesian Networks
▪ Probabilistic Networks
A Bayesian belief network has two components:
▪ a Directed Acyclic Graph (DAG)
▪ a Conditional Probability Table (CPT) for each variable
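A minimal Python sketch of these two components for a hypothetical two-variable network (Smoker → LungCancer); the structure and numbers are invented purely to illustrate the DAG + CPT representation:

# Hypothetical belief network, purely illustrative.
# DAG: each variable lists its parents.
dag = {"Smoker": [], "LungCancer": ["Smoker"]}

# CPT: P(variable = True | parent configuration), one entry per configuration.
cpt = {
    "Smoker":     {(): 0.3},                          # P(Smoker=T)
    "LungCancer": {(True,): 0.1, (False,): 0.01},     # P(LungCancer=T | Smoker)
}

# The joint probability factorizes along the DAG:
# P(Smoker=T, LungCancer=T) = P(Smoker=T) * P(LungCancer=T | Smoker=T)
p_joint = cpt["Smoker"][()] * cpt["LungCancer"][(True,)]
print(p_joint)   # 0.03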
The k-Nearest Neighbor Algorithm
▪ All instances correspond to points in the n-dimensional space.
▪ The nearest neighbors are defined in terms of Euclidean distance.
▪ The Euclidean distance between two points X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) is
  d(X, Y) = √( Σ_{i=1}^{n} (xi − yi)² )
▪ The target function may be discrete- or real-valued.
▪ For a discrete-valued target, k-NN returns the most common value among the k training examples nearest to the query point xq.
▪ For continuous-valued target functions, the k-NN algorithm returns the mean value of the k nearest neighbors.
The k-Nearest Neighbor Algorithm
▪ Distance-weighted nearest neighbor algorithm
▪ Weight the contribution of each of the k neighbors according to their distance to the query point xq, giving greater weight to closer neighbors
▪ Nearest neighbor classifiers are lazy learners: they store all of the training samples and do not build a classifier until a new sample needs to be classified (a code sketch follows below)
▪ Robust to noisy data, since the prediction averages over the k nearest neighbors
▪ Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
▪ To overcome it, eliminate the least relevant attributes (or weight attributes by relevance)
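A minimal Python sketch of the k-NN classifier described above, for a discrete-valued target with optional distance weighting (the training points, query points, and the 1/d² weighting scheme are illustrative choices):

# k-NN for a discrete-valued target, following the slides.
import math
from collections import Counter

def euclidean(x, y):
    # Euclidean distance d(X, Y) = sqrt(sum_i (xi - yi)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(training, xq, k=3, distance_weighted=False):
    # training: list of (point, label); xq: query point
    neighbors = sorted(training, key=lambda p: euclidean(p[0], xq))[:k]
    votes = Counter()
    for point, label in neighbors:
        d = euclidean(point, xq)
        # plain majority vote, or weight each vote by 1/d^2 (closer neighbors count more)
        votes[label] += 1.0 / (d ** 2 + 1e-9) if distance_weighted else 1.0
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
print(knn_classify(training, (1.1, 1.0), k=3))                           # "A"
print(knn_classify(training, (3.9, 4.1), k=3, distance_weighted=True))   # "B"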
Case-Based Reasoning
▪ Also uses lazy evaluation and analysis of similar instances
▪ Difference: instances are not "points in a Euclidean space"
▪ Methodology
▪ Instances are represented by rich symbolic descriptions (e.g., function graphs)
▪ Multiple retrieved cases may be combined
▪ Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
▪ Research issues
▪ Indexing based on syntactic similarity measures and, when this fails, backtracking and adapting to additional cases
Classifier Accuracy
▪ How can it be measured?
▪ Holdout Method (Random Subsampling)
▪ k-fold Cross-Validation (a sketch follows below)
▪ Bootstrapping
▪ How can we improve classifier accuracy?
▪ Bagging
▪ Boosting
▪ Is accuracy enough to judge a classifier?
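A minimal sketch of k-fold cross-validation as an accuracy estimate; `classify(train, sample)` is a hypothetical stand-in for any classifier (e.g., the Naïve Bayes sketch above), and assigning folds by row index is one simple choice:

# k-fold cross-validation: each fold is held out once for testing.
def k_fold_accuracy(data, classify, k=5):
    # data: list of rows, each row = (attribute values..., class label)
    correct = 0
    for i in range(k):
        test = [row for j, row in enumerate(data) if j % k == i]    # held-out fold
        train = [row for j, row in enumerate(data) if j % k != i]   # remaining folds
        for row in test:
            sample, label = row[:-1], row[-1]
            if classify(train, sample) == label:
                correct += 1
    return correct / len(data)   # overall accuracy across all k folds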
Classification of Class-Imbalanced Data Sets
• Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil-spill, fault detection, etc.
• Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for class-imbalanced data
• Typical methods for imbalanced data in 2-class classification (a sketch of the resampling methods follows below):
– Oversampling: re-sample data from the positive (rare) class
– Under-sampling: randomly eliminate tuples from the negative class
– Threshold-moving: move the decision threshold t so that rare-class tuples are easier to classify, and hence there is less chance of false-negative errors
– Ensemble techniques: ensembles of the multiple classifiers introduced above
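A minimal sketch of random oversampling and under-sampling for a 2-class problem; the labels "pos"/"neg" and the `label_of` accessor are illustrative assumptions, not part of the original material:

# Random resampling to rebalance a 2-class data set.
import random

def rebalance(data, label_of, mode="oversample", seed=0):
    rng = random.Random(seed)
    pos = [r for r in data if label_of(r) == "pos"]   # rare (positive) class
    neg = [r for r in data if label_of(r) == "neg"]   # numerous (negative) class
    if mode == "oversample":
        # re-sample (with replacement) from the rare class until the classes match
        pos = pos + [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    else:
        # under-sample: randomly keep only as many negative tuples as there are positive ones
        neg = rng.sample(neg, len(pos))
    return pos + neg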