CLASSIFICATION
3.1. CLASSIFICATION
Classification predicts categorical class labels (discrete or nominal). It classifies the data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data.
Data classification is a two-step process:
1. Model Construction: describing a set of predetermined classes.
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
The set of tuples used for model construction is the training set.
The model is represented as classification rules, decision trees, or mathematical formulae.
2. Model Usage: for classifying future or unknown objects.
Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model.
The accuracy rate is the percentage of test set samples that are correctly classified by the model.
The test set is independent of the training set.
If the accuracy is acceptable, use the model to classify data tuples whose class labels are unknown.
Learning is broadly classified into two types:
Supervised learning:
The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
New data is classified based on the training set.
Unsupervised learning (clustering):
The class label of the training data is unknown. Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Figure 3.1. Data Classification Process: (a) Learning, (b) Classification
3.2. STATISTICAL-BASED ALGORITHMS
There are two types of statistical-based algorithms, which are as follows:
Regression: Regression deals with estimating an output value based on input values. When used for classification, the input values are values from the database and the output values define the classes. Regression can be used to solve classification problems, but it is also used for other applications, including forecasting. The elementary form of regression is simple linear regression, which involves only one predictor and a prediction.
Regression can be used to implement classification using two methods, which are as follows:
Division: the data are divided into regions based on class.
Prediction: formulas are created to predict the value of the output class.
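The two methods above can be combined in a small sketch: fit a simple linear regression by least squares, then threshold its prediction to assign a class. The toy data and the 0.5 threshold are illustrative assumptions, not from the text.

```python
# A minimal sketch: simple linear regression (one predictor) fitted by
# least squares, reused for two-class prediction by thresholding the output.

def fit_simple_linear(xs, ys):
    """Return slope and intercept minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Class labels encoded as 0/1 so the regression output can be thresholded.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]

slope, intercept = fit_simple_linear(xs, ys)

def classify(x, threshold=0.5):
    """Division step: the threshold splits the input range into class regions."""
    return 1 if slope * x + intercept >= threshold else 0

print(classify(1.5))  # 0: a small x falls in class 0's region
print(classify(5.5))  # 1: a large x falls in class 1's region
```

The threshold of 0.5 is the midpoint of the 0/1 label encoding; any input whose predicted value crosses it is assigned to class 1.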
Bayesian Classification: Statistical classifiers are used for the classification. Bayesian classification is based on Bayes' theorem. Bayesian classifiers exhibit high accuracy and speed when applied to large databases.
Bayes' Theorem:
Let X be a data tuple. In the Bayesian method, X is treated as "evidence". Let H be some hypothesis, such as that the data tuple X belongs to a particular class C. The probability P(H|X) is determined in order to classify the data. P(H|X) is the probability that hypothesis H holds given the "evidence", i.e. the observed data tuple X; it is the posterior probability of H conditioned on X.
For instance, suppose the data tuples are limited to users described by the attributes age and income, and that X is a 30-year-old user with an income of Rs. 20,000. Assume that H is the hypothesis that the user will purchase a computer. Then P(H|X) is the probability that user X will purchase a computer given that the user's age and income are known.
P(H) is the prior probability of H. For instance, this is the probability that any given user will purchase a computer, regardless of age, income, or any other information. The posterior probability P(H|X) is based on more information than the prior probability P(H), which is independent of X.
Likewise, P(X|H) is the posterior probability of X conditioned on H. It is the probability that a user X is 30 years old and earns Rs. 20,000, given that we know the user will purchase a computer.
P(H), P(X|H), and P(X) can be estimated from the given data. Bayes' theorem provides a way of computing the posterior probability P(H|X) from P(H), P(X|H), and P(X). It is given by:
P(H|X) = P(X|H) P(H) / P(X)
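A small numeric sketch of the theorem, using the computer-purchase example: the probabilities below are invented for illustration only and are not taken from the text.

```python
# Worked numeric sketch of Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X).
# H = "user buys a computer", X = "user is 30 years old with income 20,000".
# All three input probabilities are made-up illustrative values.

p_h = 0.4          # prior: fraction of all users who buy a computer
p_x_given_h = 0.3  # fraction of buyers who match X's age/income profile
p_x = 0.2          # fraction of all users who match X's profile

p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 2))  # 0.6: the evidence X raises the prior 0.4 to 0.6
```

Because users matching X's profile are relatively more common among buyers than in the population at large, observing X raises the probability of H.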
In data mining, a similarity measure is also a distance, with dimensions describing the features of the objects. That means if the distance between two data points is small, then there is a high degree of similarity between the objects, and vice versa. Similarity is subjective and depends heavily on the context and application. For example, similarity among vegetables can be determined from their taste, size, colour, etc.
3.3. THE DISTANCE-BASED ALGORITHMS IN DATA
MINING
These algorithms measure the distance between objects and use it to calculate a score. Distance measures play an important role in machine learning.
They provide the foundation for many popular and effective machine learning algorithms, such as KNN (K-Nearest Neighbours) for supervised learning and K-Means clustering for unsupervised learning.
Different distance measures must be chosen and used depending on the type of data, so it is important to know how to implement and calculate a range of popular distance measures, and to understand the intuitions behind the resulting scores.
The most commonly used distance measures in machine learning are as follows:
Hamming Distance
Euclidean Distance
Manhattan Distance
Minkowski Distance
Mahalanobis Distance
Cosine Distance
What is most important is knowing how to calculate each of these distance measures when implementing algorithms from scratch, and having an intuition for what is being calculated when using algorithms that make use of these distance measures.
3.3.1. HAMMING DISTANCE
Hamming
binary vectors, alsodistance
referred tocalculates the
The most
as
binary distance between
strings or bitstrings.
two
likely
performs One-Hot Encodeencountered binary
For
example, A set as categorical columnsstrings is when
of data.
when the
tne user
follows
Example set:
Column
Red
Green
Blue
After encoding:
Column  One-hot encoding
Red     [1, 0, 0]
Green   [0, 1, 0]
Blue    [0, 0, 1]
The distance between red and green could be calculated as
the sum or the average number of bit differences between the two
bit-strings. This is Hamming distance.
1 1 0 1 1 1 0 0
1 1 1 1 0 1 1 0
Hamming distance = 3
For a one-hot encoded string, it might make more sense to report the sum of the bit differences between the strings, each of which will always be a 0 or 1.
Hamming Distance = sum for i to N of abs(v1[i] - v2[i])
For bit-strings that may have many 1 bits, it is more common to calculate the average number of bit differences, giving a Hamming distance score between 0 (identical) and 1 (all different).
Hamming Distance = (sum for i to N of abs(v1[i] - v2[i])) / N
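The two formulas above can be sketched directly in code, using the worked bit-string example and the one-hot encoded colours:

```python
# Sketch of Hamming distance between two equal-length bit vectors,
# both as the raw count of differing bits and as the normalized average.

def hamming(v1, v2):
    """Count of positions where the two bit vectors differ."""
    assert len(v1) == len(v2)
    return sum(abs(a - b) for a, b in zip(v1, v2))

def hamming_normalized(v1, v2):
    """Average bit difference: 0 means identical, 1 means all bits differ."""
    return hamming(v1, v2) / len(v1)

# The bit-strings from the worked example above.
a = [1, 1, 0, 1, 1, 1, 0, 0]
b = [1, 1, 1, 1, 0, 1, 1, 0]
print(hamming(a, b))             # 3
print(hamming_normalized(a, b))  # 0.375

# One-hot encoded colours: red vs. green differ in exactly two bits.
print(hamming([1, 0, 0], [0, 1, 0]))  # 2
```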
3.3.2. Euclidean Distance:
Euclidean distance is considered the traditional metric for problems with geometry. It can be simply explained as the ordinary straight-line distance between two points. It is one of the most used measures in cluster analysis; one of the algorithms that uses this formula is K-means. Mathematically, it computes the square root of the sum of squared differences between the coordinates of two objects.
For points (x1, y1) and (x2, y2):
Euclidean distance = sqrt((x2 - x1)^2 + (y2 - y1)^2)
Figure 3.2. Euclidean Distance
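A minimal sketch of the computation, generalized to points of any dimension:

```python
# Euclidean distance between two N-dimensional points: the square root
# of the sum of squared coordinate differences.
from math import sqrt

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((1, 2), (4, 6)))  # 5.0 (a 3-4-5 right triangle)
```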
3.3.3. Manhattan Distance:
This determines the absolute difference between each pair of coordinates.
Suppose we have two points P and Q. To determine the distance between these points, we simply sum the separations of the points along the X-axis and the Y-axis. In a plane with P at coordinate (x1, y1) and Q at (x2, y2):
Manhattan distance between P and Q = |x1 - x2| + |y1 - y2|
Figure 3.3. Manhattan Distance
Here the total distance of the Red line gives the Manhattan
distance between both the points.
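The formula above is a one-line sum of absolute coordinate differences:

```python
# Manhattan distance: the sum of absolute coordinate differences,
# i.e. the length of an axis-aligned path between the two points.

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((1, 2), (4, 6)))  # 7 = |1 - 4| + |2 - 6|
```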
3.3.4. Jaccard Index:
The Jaccard index measures the similarity of two data sets as the size of the intersection of those sets divided by the size of their union; the Jaccard distance is one minus this value.
Jaccard coefficient:
J(A, B) = |A ∩ B| / |A ∪ B|
Figure 3.4. Jaccard Index
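On Python sets, the coefficient is a direct translation of the formula; the fruit sets below are an invented example:

```python
# Jaccard index |A ∩ B| / |A ∪ B| on sets, with the Jaccard distance
# as 1 minus the index.

def jaccard_index(a, b):
    return len(a & b) / len(a | b)

A = {"apple", "banana", "cherry"}
B = {"banana", "cherry", "date"}
print(jaccard_index(A, B))      # 0.5: 2 shared items out of 4 total
print(1 - jaccard_index(A, B))  # 0.5: the Jaccard distance
```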
3.3.5. Minkowski Distance
It is the generalized form of the Euclidean and Manhattan distance measures. In an N-dimensional space, a point is represented as (x1, x2, ..., xN).
Consider two points P1 and P2:
P1: (x1, x2, ..., xN)
P2: (y1, y2, ..., yN)
Then, the Minkowski distance between P1 and P2 is given as:
d(P1, P2) = (sum for i to N of |xi - yi|^p)^(1/p)
When p = 2, the Minkowski distance is the same as the Euclidean distance.
When p = 1, the Minkowski distance is the same as the Manhattan distance.
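A short sketch confirms the two special cases on the same pair of points:

```python
# Minkowski distance (sum |xi - yi|^p)^(1/p): p = 1 recovers Manhattan
# distance and p = 2 recovers Euclidean distance.

def minkowski(p1, p2, p):
    return sum(abs(a - b) ** p for a, b in zip(p1, p2)) ** (1 / p)

P1 = (1, 2)
P2 = (4, 6)
print(minkowski(P1, P2, 1))  # 7.0, the Manhattan distance
print(minkowski(P1, P2, 2))  # 5.0, the Euclidean distance
```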
3.3.6. Cosine Index:
The cosine distance measure for clustering determines the cosine of the angle between two vectors, given by the following formula:
cos(θ) = (A · B) / (|A| |B|)
Here θ (theta) gives the angle between the two vectors, and A and B are n-dimensional vectors.
Figure 3.5. Cosine Distance
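The formula translates to a dot product divided by the product of vector norms; the two 2-d vectors below are illustrative:

```python
# Cosine similarity cos(theta) = (A · B) / (|A| |B|); the cosine
# distance is commonly taken as 1 - cos(theta).
from math import sqrt

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

A = (1, 0)
B = (0, 1)
print(cosine_similarity(A, B))       # 0.0: perpendicular vectors
print(cosine_similarity(A, (2, 0)))  # 1.0: same direction
```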