Unit 3 Data Mining

The document discusses two main forms of data analysis: classification, which predicts categorical class labels, and prediction, which forecasts continuous values. It outlines the processes involved in building classifiers, issues related to data preparation, and compares classification and prediction methods. Additionally, it covers decision tree induction, including its structure, algorithms, overfitting, and techniques for pruning to improve model accuracy.

Uploaded by Jothi B

Classification and Prediction
Introduction
• There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends. These two forms are as follows −
• Classification − classification models predict categorical class labels.
• Prediction − prediction models predict continuous-valued functions.
• Example − we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment, given their income and occupation.
What is classification?

• Following are examples of cases where the data analysis task is classification −
• A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to analyze whether a customer with a given profile will buy a new computer.
• In both of the above examples, a model or classifier is constructed to predict categorical labels. These labels are risky or safe for the loan application data and yes or no for the marketing data.
What is prediction?

• Following are examples of cases where the data analysis task is prediction −
• Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we are asked to predict a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function, or ordered value.
How Does Classification Work?

• With the help of the bank loan application that we have discussed
above, let us understand the working of classification. The Data
Classification process includes two steps −
• Building the Classifier or Model
• Using Classifier for Classification
Building the Classifier or Model

• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set, made up of database tuples and their associated class labels.
• Each tuple in the training set belongs to a predefined category or class. These tuples can also be referred to as samples, objects, or data points.
Using Classifier for Classification

• In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to new data tuples.
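The two-step process above can be sketched with a toy loan dataset and a hypothetical one-rule classifier. The attribute (income), threshold rule, and all numbers here are illustrative assumptions, not part of the original bank example:

```python
# Step 1: build the classifier from a labeled training set.
# Each training tuple is (income, class label); the one-rule learner
# below is an illustrative assumption, not a real bank's policy.
train = [(20, "risky"), (25, "risky"), (40, "safe"), (55, "safe"), (60, "safe")]

def build_classifier(training_set):
    """Learn a single income threshold separating the two classes."""
    risky = [x for x, y in training_set if y == "risky"]
    safe = [x for x, y in training_set if y == "safe"]
    # Midpoint between the highest risky income and the lowest safe income.
    return (max(risky) + min(safe)) / 2

def classify(model, income):
    return "safe" if income >= model else "risky"

# Step 2: estimate accuracy on held-out test tuples before using the model.
test = [(22, "risky"), (50, "safe"), (45, "safe"), (18, "risky")]
model = build_classifier(train)
correct = sum(1 for x, y in test if classify(model, x) == y)
accuracy = correct / len(test)
print(accuracy)  # 1.0 on this toy test set
```

If the estimated accuracy were unacceptable, the model would be rebuilt (e.g., with more data or a different algorithm) rather than applied to new tuples.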
Classification and Prediction Issues

• The major issue is preparing the data for classification and prediction. Preparing the data involves the following activities −
• Data Cleaning − Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
• Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
• Data Transformation and Reduction − The data can be transformed by any of the following methods.
• Normalization − The data is transformed using normalization. Normalization involves scaling all values for a given attribute so that they fall within a small specified range. Normalization is used when, in the learning step, neural networks or methods involving distance measurements are used.
• Generalization − The data can also be transformed by generalizing it to a higher concept. For this purpose we can use concept hierarchies.
• Note − Data can also be reduced by other methods such as wavelet transformation, binning, histogram analysis, and clustering.
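Min-max scaling is one common way to perform the normalization step described above. A minimal sketch, assuming the target range [0, 1] and made-up income values:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale all values of an attribute into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: avoid division by zero
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

incomes = [20000, 35000, 50000, 80000]   # illustrative attribute values
print(min_max_normalize(incomes))        # [0.0, 0.25, 0.5, 1.0]
```

After scaling, attributes measured on very different ranges (e.g., income vs. age) contribute comparably to distance-based methods and neural network training.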
Comparison of Classification and Prediction Methods

• Here are the criteria for comparing methods of classification and prediction −
• Accuracy − Accuracy of a classifier refers to its ability to predict the class label correctly; accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
• Speed − This refers to the computational cost of generating and using the classifier or predictor.
• Robustness − This refers to the ability of the classifier or predictor to make correct predictions from noisy data.
• Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently given a large amount of data.
• Interpretability − This refers to the extent to which the classifier or predictor can be understood.
Decision Tree Induction

• A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.
• A decision tree is a graph that uses a branching method to illustrate every possible outcome of a decision. Decision trees can be drawn by hand or created with a graphics program or specialized software. Informally, decision trees are useful for focusing discussion when a group must make a decision. Programmatically, they can be used to assign monetary, time, or other values to possible outcomes so that decisions can be automated. Decision tree software is used in data mining to simplify complex strategic challenges and evaluate the cost-effectiveness of research and business decisions. Variables in a decision tree are usually represented by circles.
• The following decision tree is for the concept buy_computer that
indicates whether a customer at a company is likely to buy a
computer or not. Each internal node represents a test on an attribute.
Each leaf node represents a class.
• The benefits of having a decision tree are as follows −
• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and
fast.
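The buy_computer tree described above can be sketched as a nested dictionary and classified by walking from the root to a leaf. The attributes and splits below (age, student, credit_rating) are assumptions modeled on the classic textbook version of this example, since the original figure is not reproduced here:

```python
# A decision tree as nested dicts: internal nodes test an attribute,
# leaves hold a class label ("yes"/"no" for buys_computer).
# Splits are assumed, following the classic buy_computer example.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {"attribute": "student",
                  "branches": {"yes": "yes", "no": "no"}},
        "middle_aged": "yes",
        "senior": {"attribute": "credit_rating",
                   "branches": {"fair": "yes", "excellent": "no"}},
    },
}

def classify(node, record):
    """Walk from the root to a leaf, following the branch chosen by each test."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

customer = {"age": "youth", "student": "yes", "credit_rating": "fair"}
print(classify(tree, customer))  # yes
```

The traversal mirrors the definition above: each dict is an internal node (a test), each dict key under "branches" is a branch (an outcome), and each string is a leaf (a class label).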
Decision Tree Induction Algorithm

• A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm in 1980 known as ID3 (Iterative Dichotomiser 3). Later, he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner.
Decision tree example
Overfitting
• Overfitting is a practical problem when building a decision tree model. A model is considered to be overfitting when the algorithm continues to go deeper and deeper into the tree to reduce the training set error but ends up with an increased test set error, i.e., the accuracy of prediction for our model goes down. It generally happens when the tree builds many branches due to outliers and irregularities in the data.
How to overcome overfitting?
• Pruning − the shortening of branches of the tree. Pruning is the process of reducing the size of the tree by turning some branch nodes into leaf nodes and removing the leaf nodes under the original branch. Pruning is useful because classification trees may fit the training data well but may do a poor job of classifying new values. A simpler tree often avoids overfitting.
Tree Pruning

• Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers. The pruned trees are smaller and less complex.
• Tree Pruning Approaches
• There are two approaches to prune a tree −
• Pre-pruning − The tree is pruned by halting its construction early.
• Post-pruning − This approach removes a sub-tree from a fully grown tree.
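On a nested-dict tree, post-pruning amounts to replacing a sub-tree with a single leaf holding the majority class at that node. A minimal sketch of the mechanics; the tree shape and the assumed majority label are illustrative, and a real pruner would only keep the change if accuracy on validation data does not drop:

```python
import copy

# Sub-trees are nested dicts; leaves are class labels (strings).
tree = {
    "attribute": "age",
    "branches": {
        "youth": {"attribute": "student",
                  "branches": {"yes": "yes", "no": "no"}},
        "middle_aged": "yes",
        "senior": "yes",
    },
}

def post_prune(tree, branch, majority_label):
    """Replace the sub-tree under `branch` with a leaf holding the
    majority class label observed at that node (assumed given here)."""
    pruned = copy.deepcopy(tree)     # keep the original tree intact
    pruned["branches"][branch] = majority_label
    return pruned

# Collapse the "youth" branch node into a leaf, assuming "no" is the
# majority class among youth training tuples.
smaller = post_prune(tree, "youth", "no")
print(smaller["branches"]["youth"])  # no
```

The pruned tree has fewer nodes and so classifies some youth customers differently; whether that trade-off is worthwhile is decided by comparing validation accuracy before and after.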
Entropy

• A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided, it has an entropy of one.
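The two boundary cases above can be checked directly with a small entropy function (a minimal sketch, taking class counts rather than raw tuples):

```python
import math

def entropy(class_counts):
    """Shannon entropy (in bits) of a sample, given the count per class."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count:                     # 0 * log2(0) is taken as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy([5, 5]))   # 1.0  (equally divided sample)
print(entropy([10, 0]))  # 0.0  (completely homogeneous sample)
```

Intermediate mixtures fall between these extremes; e.g., a 9-vs-5 split has entropy of about 0.94 bits.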
Information Gain
• The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
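A sketch of the gain computation. The class counts used here (a parent of 9 yes / 5 no split into three subsets by age) follow Quinlan's classic 14-tuple buys_computer example and are an illustrative assumption:

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) given counts per class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent_counts, child_counts_list):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(c) / total * entropy(c) for c in child_counts_list)
    return entropy(parent_counts) - weighted

# Parent: 9 "yes" / 5 "no". Splitting on age yields three subsets:
# youth (2 yes, 3 no), middle_aged (4 yes, 0 no), senior (3 yes, 2 no).
gain = information_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
print(round(gain, 3))  # 0.247
```

ID3 would compute this gain for every candidate attribute and split on the one with the highest value, which is what makes the resulting branches as homogeneous as possible.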
Cost Complexity

• The cost complexity is measured by the following two parameters −
• Number of leaves in the tree, and
• Error rate of the tree.
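These two parameters are typically combined into a single score, cost = error rate + α × number of leaves, where α is a complexity penalty; this combined formulation is a standard one but is an assumption here, since the slide lists only the two parameters. A minimal sketch on a nested-dict tree:

```python
def count_leaves(node):
    """Leaves are class labels (strings); internal nodes are dicts."""
    if not isinstance(node, dict):
        return 1
    return sum(count_leaves(child) for child in node["branches"].values())

def cost_complexity(error_rate, n_leaves, alpha):
    """Error rate penalized by tree size: smaller is better."""
    return error_rate + alpha * n_leaves

# Illustrative tree: 4 leaves in total.
tree = {"attribute": "age",
        "branches": {"youth": {"attribute": "student",
                               "branches": {"yes": "yes", "no": "no"}},
                     "middle_aged": "yes",
                     "senior": "no"}}

leaves = count_leaves(tree)
print(leaves)                                        # 4
print(round(cost_complexity(0.10, leaves, 0.01), 2))  # 0.14
```

A pruning procedure can then compare the cost of the full tree against the cost of each pruned candidate and keep whichever is lowest: larger α favors smaller trees.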
