Unit 3 Data Mining

The document discusses two main forms of data analysis: classification, which predicts categorical class labels, and prediction, which forecasts continuous values. It outlines the processes involved in building classifiers, issues related to data preparation, and compares classification and prediction methods. Additionally, it covers decision tree induction, including its structure, algorithms, overfitting, and techniques for pruning to improve model accuracy.

Uploaded by Jothi B

Classification and Prediction
Introduction
• There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends. These two forms are as follows −
• Classification − classification models predict categorical class labels.
• Prediction − prediction models predict continuous-valued functions.
• Example − we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment, given their income and occupation.
What is classification?

• Following are examples of cases where the data analysis task is classification −
• A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to analyze whether a customer with a given profile will buy a new computer.
• In both of the above examples, a model or classifier is constructed to predict categorical labels. These labels are risky or safe for the loan application data and yes or no for the marketing data.
What is prediction?

• Following are examples of cases where the data analysis task is prediction −
• Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we are asked to predict a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function, or ordered value.
How Does Classification Work?

• With the help of the bank loan application that we have discussed
above, let us understand the working of classification. The Data
Classification process includes two steps −
• Building the Classifier or Model
• Using Classifier for Classification
Building the Classifier or Model

• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set, made up of database tuples and their associated class labels.
• Each tuple in the training set belongs to a predefined category or class. These tuples can also be referred to as samples, objects, or data points.
Using Classifier for Classification

• In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to new data tuples.
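The two-step process above can be sketched with a toy loan dataset and a hypothetical one-rule classifier. The attribute (income), threshold rule, and all numbers here are illustrative assumptions, not part of the original bank example:

```python
# Step 1: build the classifier from a labeled training set.
# Each training tuple is (income, class label); the one-rule learner
# below is an illustrative assumption, not a real bank's policy.
train = [(20, "risky"), (25, "risky"), (40, "safe"), (55, "safe"), (60, "safe")]

def build_classifier(training_set):
    """Learn a single income threshold separating the two classes."""
    risky = [x for x, y in training_set if y == "risky"]
    safe = [x for x, y in training_set if y == "safe"]
    # Midpoint between the highest risky income and the lowest safe income.
    return (max(risky) + min(safe)) / 2

def classify(model, income):
    return "safe" if income >= model else "risky"

# Step 2: estimate accuracy on held-out test tuples before using the model.
test = [(22, "risky"), (50, "safe"), (45, "safe"), (18, "risky")]
model = build_classifier(train)
correct = sum(1 for x, y in test if classify(model, x) == y)
accuracy = correct / len(test)
print(accuracy)  # 1.0 on this toy test set
```

If the estimated accuracy were unacceptable, the model would be rebuilt (e.g., with more data or a different algorithm) rather than applied to new tuples.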
Classification and Prediction Issues

• The major issue is preparing the data for classification and prediction. Preparing the data involves the following activities −
• Data Cleaning − Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
• Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
• Data Transformation and Reduction − The data can be transformed by any of the following methods.
• Normalization − The data is transformed using normalization. Normalization involves scaling all values for a given attribute so that they fall within a small specified range. Normalization is used when, in the learning step, neural networks or methods involving distance measurements are used.
• Generalization − The data can also be transformed by generalizing it to a higher concept. For this purpose we can use concept hierarchies.
• Note − Data can also be reduced by other methods such as wavelet transformation, binning, histogram analysis, and clustering.
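Min-max scaling is one common way to perform the normalization step described above. A minimal sketch, assuming the target range [0, 1] and made-up income values:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale all values of an attribute into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: avoid division by zero
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

incomes = [20000, 35000, 50000, 80000]   # illustrative attribute values
print(min_max_normalize(incomes))        # [0.0, 0.25, 0.5, 1.0]
```

After scaling, attributes measured on very different ranges (e.g., income vs. age) contribute comparably to distance-based methods and neural network training.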
Comparison of Classification and Prediction Methods

• Here are the criteria for comparing methods of classification and prediction −
• Accuracy − Accuracy of a classifier refers to its ability to predict the class label correctly; accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
• Speed − This refers to the computational cost of generating and using the classifier or predictor.
• Robustness − This refers to the ability of the classifier or predictor to make correct predictions from noisy data.
• Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently given a large amount of data.
• Interpretability − This refers to the extent to which the classifier or predictor can be understood.
Decision Tree Induction

• A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.
• A decision tree is a graph that uses a branching method to illustrate every possible outcome of a decision. Decision trees can be drawn by hand or created with a graphics program or specialized software. Informally, decision trees are useful for focusing discussion when a group must make a decision. Programmatically, they can be used to assign monetary, time, or other values to possible outcomes so that decisions can be automated. Decision tree software is used in data mining to simplify complex strategic challenges and evaluate the cost-effectiveness of research and business decisions. Variables in a decision tree are usually represented by circles.
• The following decision tree is for the concept buy_computer that
indicates whether a customer at a company is likely to buy a
computer or not. Each internal node represents a test on an attribute.
Each leaf node represents a class.
• The benefits of having a decision tree are as follows −
• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and
fast.
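The buy_computer tree described above can be sketched as a nested dictionary and classified by walking from the root to a leaf. The attributes and splits below (age, student, credit_rating) are assumptions modeled on the classic textbook version of this example, since the original figure is not reproduced here:

```python
# A decision tree as nested dicts: internal nodes test an attribute,
# leaves hold a class label ("yes"/"no" for buys_computer).
# Splits are assumed, following the classic buy_computer example.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {"attribute": "student",
                  "branches": {"yes": "yes", "no": "no"}},
        "middle_aged": "yes",
        "senior": {"attribute": "credit_rating",
                   "branches": {"fair": "yes", "excellent": "no"}},
    },
}

def classify(node, record):
    """Walk from the root to a leaf, following the branch chosen by each test."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

customer = {"age": "youth", "student": "yes", "credit_rating": "fair"}
print(classify(tree, customer))  # yes
```

The traversal mirrors the definition above: each dict is an internal node (a test), each dict key under "branches" is a branch (an outcome), and each string is a leaf (a class label).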
Decision Tree Induction Algorithm

• A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm in 1980 known as ID3 (Iterative Dichotomiser 3). Later, he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner.
Decision tree example
Overfitting
• Overfitting is a practical problem when building a decision tree model. A model is considered to be overfitting when the algorithm continues to go deeper and deeper into the tree to reduce the training set error but ends up with an increased test set error, i.e., the accuracy of prediction for our model goes down. It generally happens when the tree builds many branches due to outliers and irregularities in the data.
How to overcome overfitting?
• Pruning − the shortening of branches of the tree. Pruning is the process of reducing the size of the tree by turning some branch nodes into leaf nodes and removing the leaf nodes under the original branch. Pruning is useful because classification trees may fit the training data well but may do a poor job of classifying new values. A simpler tree often avoids overfitting.
Tree Pruning

• Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers. The pruned trees are smaller and less complex.
• Tree Pruning Approaches
• There are two approaches to prune a tree −
• Pre-pruning − The tree is pruned by halting its construction early.
• Post-pruning − This approach removes a sub-tree from a fully grown tree.
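On a nested-dict tree, post-pruning amounts to replacing a sub-tree with a single leaf holding the majority class at that node. A minimal sketch of the mechanics; the tree shape and the assumed majority label are illustrative, and a real pruner would only keep the change if accuracy on validation data does not drop:

```python
import copy

# Sub-trees are nested dicts; leaves are class labels (strings).
tree = {
    "attribute": "age",
    "branches": {
        "youth": {"attribute": "student",
                  "branches": {"yes": "yes", "no": "no"}},
        "middle_aged": "yes",
        "senior": "yes",
    },
}

def post_prune(tree, branch, majority_label):
    """Replace the sub-tree under `branch` with a leaf holding the
    majority class label observed at that node (assumed given here)."""
    pruned = copy.deepcopy(tree)     # keep the original tree intact
    pruned["branches"][branch] = majority_label
    return pruned

# Collapse the "youth" branch node into a leaf, assuming "no" is the
# majority class among youth training tuples.
smaller = post_prune(tree, "youth", "no")
print(smaller["branches"]["youth"])  # no
```

The pruned tree has fewer nodes and so classifies some youth customers differently; whether that trade-off is worthwhile is decided by comparing validation accuracy before and after.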
Entropy

• A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided, it has an entropy of one.
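The two boundary cases above can be checked directly with a small entropy function (a minimal sketch, taking class counts rather than raw tuples):

```python
import math

def entropy(class_counts):
    """Shannon entropy (in bits) of a sample, given the count per class."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count:                     # 0 * log2(0) is taken as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy([5, 5]))   # 1.0  (equally divided sample)
print(entropy([10, 0]))  # 0.0  (completely homogeneous sample)
```

Intermediate mixtures fall between these extremes; e.g., a 9-vs-5 split has entropy of about 0.94 bits.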
Information Gain
• The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
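A sketch of the gain computation. The class counts used here (a parent of 9 yes / 5 no split into three subsets by age) follow Quinlan's classic 14-tuple buys_computer example and are an illustrative assumption:

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) given counts per class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent_counts, child_counts_list):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(c) / total * entropy(c) for c in child_counts_list)
    return entropy(parent_counts) - weighted

# Parent: 9 "yes" / 5 "no". Splitting on age yields three subsets:
# youth (2 yes, 3 no), middle_aged (4 yes, 0 no), senior (3 yes, 2 no).
gain = information_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
print(round(gain, 3))  # 0.247
```

ID3 would compute this gain for every candidate attribute and split on the one with the highest value, which is what makes the resulting branches as homogeneous as possible.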
Cost Complexity

• The cost complexity is measured by the following two parameters −
• Number of leaves in the tree, and
• Error rate of the tree.
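These two parameters are typically combined into a single score, cost = error rate + α × number of leaves, where α is a complexity penalty; this combined formulation is a standard one but is an assumption here, since the slide lists only the two parameters. A minimal sketch on a nested-dict tree:

```python
def count_leaves(node):
    """Leaves are class labels (strings); internal nodes are dicts."""
    if not isinstance(node, dict):
        return 1
    return sum(count_leaves(child) for child in node["branches"].values())

def cost_complexity(error_rate, n_leaves, alpha):
    """Error rate penalized by tree size: smaller is better."""
    return error_rate + alpha * n_leaves

# Illustrative tree: 4 leaves in total.
tree = {"attribute": "age",
        "branches": {"youth": {"attribute": "student",
                               "branches": {"yes": "yes", "no": "no"}},
                     "middle_aged": "yes",
                     "senior": "no"}}

leaves = count_leaves(tree)
print(leaves)                                        # 4
print(round(cost_complexity(0.10, leaves, 0.01), 2))  # 0.14
```

A pruning procedure can then compare the cost of the full tree against the cost of each pruned candidate and keep whichever is lowest: larger α favors smaller trees.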
