CLASSIFICATION
3.1. CLASSIFICATION
Classification predicts categorical class labels (discrete or nominal). It classifies the data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data.
Data classification is a two-step process:
1. Model Construction: describing a set of predetermined classes.
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
The set of tuples used for model construction is the training set.
The model is represented as classification rules, decision trees, or mathematical formulae.
2. Model Usage: for classifying future or unknown objects.
Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model.
The accuracy rate is the percentage of test set samples that are correctly classified by the model.
The test set is independent of the training set.
If the accuracy is acceptable, use the model to classify data tuples whose class labels are unknown.
Learning is broadly classified into two types:
Supervised learning:
The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
New data is classified based on the training set.
Unsupervised learning (clustering):
The class label of the training data is unknown. Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Figure 3.1. Data Classification Process: (a) Learning, (b) Classification
3.2. STATISTICAL-BASED ALGORITHMS
There are two types of statistical-based algorithms, which are as follows:
Regression: Regression deals with estimating an output value based on input values. When used for classification, the input values are values from the database and the output values define the classes. Regression can be used to solve classification problems, but it is also used for other applications, including forecasting. The elementary form of regression is simple linear regression, which involves only one predictor and a prediction.
Regression can be used to implement classification using two methods, which are as follows:
Division: the data are divided into regions based on class.
Prediction: formulas are created to predict the value of the output class.
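The two methods above can be combined in a small sketch: fit a simple linear regression by least squares, then threshold its prediction to assign a class. The toy data and the 0.5 threshold are illustrative assumptions, not from the text.

```python
# A minimal sketch: simple linear regression (one predictor) fitted by
# least squares, reused for two-class prediction by thresholding the output.

def fit_simple_linear(xs, ys):
    """Return slope and intercept minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Class labels encoded as 0/1 so the regression output can be thresholded.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]

slope, intercept = fit_simple_linear(xs, ys)

def classify(x, threshold=0.5):
    """Division step: the threshold splits the input range into class regions."""
    return 1 if slope * x + intercept >= threshold else 0

print(classify(1.5))  # 0: a small x falls in class 0's region
print(classify(5.5))  # 1: a large x falls in class 1's region
```

The threshold of 0.5 is the midpoint of the 0/1 label encoding; any input whose predicted value crosses it is assigned to class 1.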
Bayesian Classification: Statistical classifiers are used for the classification. Bayesian classification is based on Bayes' theorem. Bayesian classifiers exhibit high accuracy and speed when applied to large databases.
Bayes' Theorem:
Let X be a data tuple. In the Bayesian method, X is treated as "evidence". Let H be some hypothesis, such as that the data tuple X belongs to a particular class C. The probability P(H|X) is determined in order to classify the data. P(H|X) is the probability that hypothesis H holds given the "evidence", i.e. the observed data tuple X; it is the posterior probability of H conditioned on X.
For instance, suppose the data tuples are limited to users described by the attributes age and income, and that X is a 30-year-old user with an income of Rs. 20,000. Assume that H is the hypothesis that the user will purchase a computer. Then P(H|X) is the probability that user X will purchase a computer given that the user's age and income are known.
P(H) is the prior probability of H. For instance, this is the probability that any given user will purchase a computer, regardless of age, income, or any other information. The posterior probability P(H|X) is based on more information than the prior probability P(H), which is independent of X.
Likewise, P(X|H) is the posterior probability of X conditioned on H. It is the probability that a user X is 30 years old and earns Rs. 20,000, given that we know the user will purchase a computer.
P(H), P(X|H), and P(X) can be estimated from the given data. Bayes' theorem provides a way of computing the posterior probability P(H|X) from P(H), P(X|H), and P(X). It is given by:
P(H|X) = P(X|H) P(H) / P(X)
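A small numeric sketch of the theorem, using the computer-purchase example: the probabilities below are invented for illustration only and are not taken from the text.

```python
# Worked numeric sketch of Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X).
# H = "user buys a computer", X = "user is 30 years old with income 20,000".
# All three input probabilities are made-up illustrative values.

p_h = 0.4          # prior: fraction of all users who buy a computer
p_x_given_h = 0.3  # fraction of buyers who match X's age/income profile
p_x = 0.2          # fraction of all users who match X's profile

p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 2))  # 0.6: the evidence X raises the prior 0.4 to 0.6
```

Because users matching X's profile are relatively more common among buyers than in the population at large, observing X raises the probability of H.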
In data mining, a similarity measure is also a distance, with dimensions describing the features of the objects. That means if the distance between two data points is small, then there is a high degree of similarity between the objects, and vice versa. Similarity is subjective and depends heavily on the context and application. For example, similarity among vegetables can be determined from their taste, size, colour, etc.
3.3. THE DISTANCE-BASED ALGORITHMS IN DATA
MINING
These algorithms measure the distance between objects and use it to calculate a score. Distance measures play an important role in machine learning.
They provide the foundation for many popular and effective machine learning algorithms, such as KNN (K-Nearest Neighbours) for supervised learning and K-Means clustering for unsupervised learning.
Different distance measures must be chosen and used depending on the type of data, so it is important to know how to implement and calculate a range of popular distance measures, and to understand the intuitions behind the resulting scores.
The most commonly used distance measures in machine learning are as follows:
Hamming Distance
Euclidean Distance
Manhattan Distance
Minkowski Distance
Mahalanobis Distance
Cosine Distance
What is most important is knowing how to calculate each of these distance measures when implementing algorithms from scratch, and having an intuition for what is being calculated when using algorithms that make use of these distance measures.
3.3.1. HAMMING DISTANCE
Hamming
binary vectors, alsodistance
referred tocalculates the
The most
as
binary distance between
strings or bitstrings.
two
likely
performs One-Hot Encodeencountered binary
For
example, A set as categorical columnsstrings is when
of data.
when the
tne user
follows
Example set:
Column
Red
Green
Blue
After encoding:
Column  One-hot encoding
Red     [1, 0, 0]
Green   [0, 1, 0]
Blue    [0, 0, 1]
The distance between red and green could be calculated as
the sum or the average number of bit differences between the two
bit-strings. This is Hamming distance.
1 1 0 1 1 1 0 0
1 1 1 1 0 1 1 0
Hamming distance = 3
For a one-hot encoded string, it might make more sense to report the sum of the bit differences between the strings, each of which will always be a 0 or 1.
Hamming Distance = sum for i to N of abs(v1[i] - v2[i])
For bit-strings that may have many 1 bits, it is more common to calculate the average number of bit differences, giving a Hamming distance score between 0 (identical) and 1 (all different).
Hamming Distance = (sum for i to N of abs(v1[i] - v2[i])) / N
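The two formulas above can be sketched directly in code, using the worked bit-string example and the one-hot encoded colours:

```python
# Sketch of Hamming distance between two equal-length bit vectors,
# both as the raw count of differing bits and as the normalized average.

def hamming(v1, v2):
    """Count of positions where the two bit vectors differ."""
    assert len(v1) == len(v2)
    return sum(abs(a - b) for a, b in zip(v1, v2))

def hamming_normalized(v1, v2):
    """Average bit difference: 0 means identical, 1 means all bits differ."""
    return hamming(v1, v2) / len(v1)

# The bit-strings from the worked example above.
a = [1, 1, 0, 1, 1, 1, 0, 0]
b = [1, 1, 1, 1, 0, 1, 1, 0]
print(hamming(a, b))             # 3
print(hamming_normalized(a, b))  # 0.375

# One-hot encoded colours: red vs. green differ in exactly two bits.
print(hamming([1, 0, 0], [0, 1, 0]))  # 2
```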
3.3.2. Euclidean Distance:
Euclidean distance is considered the traditional metric for problems with geometry. It can be simply explained as the ordinary straight-line distance between two points. It is one of the most used measures in cluster analysis; one of the algorithms that uses this formula is K-means. Mathematically, it computes the square root of the sum of squared differences between the coordinates of two objects.
For points (x1, y1) and (x2, y2):
Euclidean distance = sqrt((x2 - x1)^2 + (y2 - y1)^2)
Figure 3.2. Euclidean Distance
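A minimal sketch of the computation, generalized to points of any dimension:

```python
# Euclidean distance between two N-dimensional points: the square root
# of the sum of squared coordinate differences.
from math import sqrt

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((1, 2), (4, 6)))  # 5.0 (a 3-4-5 right triangle)
```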
3.3.3. Manhattan Distance:
This determines the absolute difference between each pair of coordinates.
Suppose we have two points P and Q. To determine the distance between these points, we simply sum the separations of the points along the X-axis and the Y-axis. In a plane with P at coordinate (x1, y1) and Q at (x2, y2):
Manhattan distance between P and Q = |x1 - x2| + |y1 - y2|
Figure 3.3. Manhattan Distance
Here the total distance of the Red line gives the Manhattan
distance between both the points.
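The formula above is a one-line sum of absolute coordinate differences:

```python
# Manhattan distance: the sum of absolute coordinate differences,
# i.e. the length of an axis-aligned path between the two points.

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((1, 2), (4, 6)))  # 7 = |1 - 4| + |2 - 6|
```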
3.3.4. Jaccard Index:
The Jaccard index measures the similarity of two data sets as the size of the intersection of those sets divided by the size of their union; the Jaccard distance is one minus this value.
Jaccard coefficient:
J(A, B) = |A ∩ B| / |A ∪ B|
Figure 3.4. Jaccard Index
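On Python sets, the coefficient is a direct translation of the formula; the fruit sets below are an invented example:

```python
# Jaccard index |A ∩ B| / |A ∪ B| on sets, with the Jaccard distance
# as 1 minus the index.

def jaccard_index(a, b):
    return len(a & b) / len(a | b)

A = {"apple", "banana", "cherry"}
B = {"banana", "cherry", "date"}
print(jaccard_index(A, B))      # 0.5: 2 shared items out of 4 total
print(1 - jaccard_index(A, B))  # 0.5: the Jaccard distance
```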
3.3.5. Minkowski Distance
It is the generalized form of the Euclidean and Manhattan distance measures. In an N-dimensional space, a point is represented as (x1, x2, ..., xN).
Consider two points P1 and P2:
P1: (x1, x2, ..., xN)
P2: (y1, y2, ..., yN)
Then, the Minkowski distance between P1 and P2 is given as:
d(P1, P2) = (sum for i to N of |xi - yi|^p)^(1/p)
When p = 2, the Minkowski distance is the same as the Euclidean distance.
When p = 1, the Minkowski distance is the same as the Manhattan distance.
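A short sketch confirms the two special cases on the same pair of points:

```python
# Minkowski distance (sum |xi - yi|^p)^(1/p): p = 1 recovers Manhattan
# distance and p = 2 recovers Euclidean distance.

def minkowski(p1, p2, p):
    return sum(abs(a - b) ** p for a, b in zip(p1, p2)) ** (1 / p)

P1 = (1, 2)
P2 = (4, 6)
print(minkowski(P1, P2, 1))  # 7.0, the Manhattan distance
print(minkowski(P1, P2, 2))  # 5.0, the Euclidean distance
```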
3.3.6. Cosine Index:
The cosine distance measure for clustering determines the cosine of the angle between two vectors, given by the following formula:
cos(θ) = (A · B) / (|A| |B|)
Here θ (theta) gives the angle between the two vectors, and A and B are n-dimensional vectors.
Figure 3.5. Cosine Distance
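The formula translates to a dot product divided by the product of vector norms; the two 2-d vectors below are illustrative:

```python
# Cosine similarity cos(theta) = (A · B) / (|A| |B|); the cosine
# distance is commonly taken as 1 - cos(theta).
from math import sqrt

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

A = (1, 0)
B = (0, 1)
print(cosine_similarity(A, B))       # 0.0: perpendicular vectors
print(cosine_similarity(A, (2, 0)))  # 1.0: same direction
```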