Data Mining Digital Notes
UNIT - I
Introduction to Data Mining: Introduction, What is Data Mining, Definition,
KDD, Challenges, Data Mining Tasks, Data Preprocessing, Data Cleaning,
Missing data, Dimensionality Reduction, Feature Subset Selection, Discretization and
Binarization, Data Transformation; Measures of Similarity and Dissimilarity - Basics.
UNIT - II
Association Rules: Problem Definition, Frequent Item Set Generation, The
APRIORI Principle, Support and Confidence Measures, Association Rule Generation;
APRIORI Algorithm, The Partition Algorithms, FP-Growth Algorithms, Compact
Representation of Frequent Item Set- Maximal Frequent Item Set, Closed Frequent Item
Set.
UNIT - III
Classification: Problem Definition, General Approaches to solving a classification
problem, Evaluation of Classifiers, Classification techniques, Decision Trees-
Decision tree Construction , Methods for Expressing attribute test conditions, Measures
for Selecting the Best Split, Algorithm for Decision tree Induction ; Naive- Bayes
Classifier, Bayesian Belief Networks; K- Nearest neighbor classification-Algorithm and
Characteristics.
UNIT - IV
Clustering: Problem Definition, Clustering Overview, Evaluation of Clustering Algorithms,
Partitioning Clustering-K-Means Algorithm, K-Means Additional issues, PAM Algorithm;
Hierarchical Clustering-Agglomerative Methods and divisive methods, Basic Agglomerative
Hierarchical Clustering Algorithm, Specific techniques, Key Issues in Hierarchical
Clustering, Strengths and Weaknesses; Outlier Detection.
UNIT - V
Web and Text Mining: Introduction, web mining, web content mining, web
structure mining, web usage mining, Text mining - unstructured text, episode rule
discovery for texts, hierarchy of categories, text clustering.
TEXT BOOKS:
1. Data Mining- Concepts and Techniques- Jiawei Han, Micheline
Kamber, Morgan Kaufmann Publishers, Elsevier, 2nd Edition, 2006.
2. Introduction to Data Mining, Pang-Ning Tan, Vipin Kumar,
Michael Steinbach, Pearson Education.
3. Data mining Techniques and Applications, Hongbo Du Cengage India Publishing
REFERENCE BOOKS:
1. Data Mining Techniques, Arun K Pujari, 3rd Edition, Universities Press.
2. Data Mining Principles & Applications - T.V. Suresh Kumar,
B. Eswara Reddy, Jagadish S Kalimani, Elsevier.
3. Data Mining, Vikram Pudi, P. Radha Krishna, Oxford University Press.
INDEX

UNIT - I
1.1 Introduction to Data Mining, Definition
1.2 Knowledge Discovery in Databases (KDD)
1.3 Challenges
1.4 Data Mining Tasks
1.5 Data Preprocessing
1.6 Data Cleaning
1.7 Dimensionality Reduction
1.8 Feature Subset Selection
1.9 Discretization and Binarization
1.10 Data Transformation
1.11 Measures of Similarity and Dissimilarity

UNIT - II
2.1 Problem Definition
2.2 Frequent Item Set Generation
2.3 The APRIORI Principle, Association Rule Generation, APRIORI Algorithm
2.4 The Partition Algorithms, FP-Growth Algorithms
2.5 Compact Representation of Frequent Item Sets - Maximal Frequent Item Set
2.6 Closed Frequent Item Set

UNIT - III
3.1 Classification: Problem Definition, General Approaches to Solving a Classification Problem
3.2 Evaluation of Classifiers, Classification Techniques
3.3 Decision Trees - Decision Tree Construction
3.4 Methods for Expressing Attribute Test Conditions, Measures for Selecting the Best Split
3.5 Algorithm for Decision Tree Induction, Naive Bayes Classifier
3.6 Bayesian Belief Networks
3.7 K-Nearest Neighbor Classification - Algorithm and Characteristics

UNIT - IV
4.1 Clustering: Problem Definition, Clustering Overview
4.2 Evaluation of Clustering Algorithms
4.3 Partitioning Clustering - K-Means Algorithm, K-Means Additional Issues
4.4 PAM Algorithm
4.5 Hierarchical Clustering - Agglomerative and Divisive Methods
4.6 Key Issues in Hierarchical Clustering, Strengths and Weaknesses
4.7 Outlier Detection

UNIT - V
5.1 Web and Text Mining: Introduction
5.2 Web Mining, Web Content Mining, Web Structure Mining, Web Usage Mining
5.3 Text Mining
5.4 Unstructured Text
5.5 Episode Rule Discovery for Texts, Hierarchy of Categories, Text Clustering

Tutorial Questions
Assignment Questions
Objective Questions
References
DATA MINING
Unit-I
Introduction to Data Mining
Data Mining is defined as the procedure of extracting information from huge sets of data.
There is a huge amount of data available in the Information Industry. This data is of no
use until it is converted into useful information. It is necessary to analyze this huge
amount of data and extract useful information from it.
Extraction of information is not the only process we need to perform; data mining also
involves other processes such as Data Cleaning, Data Integration, Data Transformation,
Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are over,
we would be able to use this information in many applications such as Fraud Detection,
Market Analysis, Production Control, Science Exploration, etc.
Growth of data due to cards: The growing use of credit cards and loyalty cards is an
important area of data growth. In the USA, there has been tremendous growth in the use of
loyalty cards. Even in Australia, the use of cards like FlyBuys has grown considerably.
Growth in data due to the web: E-commerce developments have resulted in information
about visitors to Web sites being captured, once again resulting in mountains of data for
some companies.
Growth in data due to other sources: There are many other sources of data.
Some of them are:
Telephone Transactions
Frequent flyer transactions
Medical transactions
Immigration and customs transactions
Banking transactions
Motor vehicle transactions
Utilities (e.g., electricity and gas) transactions
Shopping transactions
Growth in data storage capacity: Another way of illustrating data growth is to consider
annual disk storage sales over the last few years.
Decline in the cost of processing: The cost of computing hardware has declined rapidly
over the last 30 years, coupled with an increase in hardware performance. Not only do the
prices for processors continue to decline, but the prices for computer peripherals have
also been declining.
Fraud Detection
Data mining is also used in the fields of credit card services and telecommunications to
detect fraud. For fraudulent telephone calls, it helps identify the destination of the call,
the duration of the call, the time of day or week, and so on. It also analyzes patterns
that deviate from expected norms.
Challenges
Data mining is not an easy task, as the algorithms used can get very complex, and data are
not always available in one place; they need to be integrated from various heterogeneous
data sources. These factors also create some issues. Here, we will discuss the major issues
regarding:
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
Handling noisy or incomplete data − Data cleaning methods are required to handle noise
and incomplete objects while mining data regularities. Without such methods, the accuracy
of the discovered patterns will be poor.
Pattern evaluation − The patterns discovered may be uninteresting because they represent
common knowledge or lack novelty; evaluating the interestingness of patterns is therefore a challenge.
Performance Issues
There can be performance-related issues such as follows −
Parallel, distributed, and incremental mining algorithms − Factors such as the huge
size of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are processed in parallel; the results
from the partitions are then merged. Incremental algorithms update the knowledge already
mined rather than mining the data again from scratch.
Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For
example, in a company, the classes of items for sale include computers and printers, and
concepts of customers include big spenders and budget spenders. Such descriptions of a
class or a concept are called class/concept descriptions. These descriptions can be derived
in the following two ways −
Data Characterization − This refers to summarizing the data of the class under study.
This class under study is called the target class.
Data Discrimination − This refers to comparing the target class with one or more
predefined contrasting groups or classes.
Mining of Association
Associations are used in retail sales to identify items that are frequently purchased
together. This refers to the process of uncovering relationships among data and
determining association rules.
For example, a retailer generates an association rule showing that 70% of the time milk is
sold with bread and only 30% of the time biscuits are sold with bread.
Mining of Correlations
This is a kind of additional analysis performed to uncover interesting statistical correlations
between associated attribute-value pairs or between two itemsets, to analyze whether they
have a positive, negative, or no effect on each other.
Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different from the
objects in other clusters.
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts, so that the model can be used to predict the class
of objects whose class label is unknown. The derived model is based on the analysis of a
set of training data. The derived model can be presented in the following forms −
Classification (IF-THEN) Rules
Decision Trees
Mathematical Formulae
Neural Networks
The list of functions involved in these processes is as follows −
Classification − It predicts the class of objects whose class label is unknown. Its
objective is to find a derived model that describes and distinguishes data classes or
concepts. The derived model is based on the analysis of a set of training data, i.e., data
objects whose class labels are well known.
Prediction− It is used to predict missing or unavailable numerical data values
rather than class labels. Regression Analysis is generally used for prediction. Prediction
can also be used for identification of distribution trends based on available data.
Outlier Analysis− Outliers may be defined as the data objects that do not comply
with the general behavior or model of the data available.
Evolution Analysis − Evolution analysis refers to the description and modeling of
regularities or trends for objects whose behavior changes over time.
Data Preprocessing
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking
in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is
a proven method of resolving such issues. Data preprocessing prepares raw data for
further processing.
Data Cleaning
The quality of your data is critical to the final analysis. Data that are incomplete,
noisy, or inconsistent can affect your results.
Data cleaning in data mining is the process of detecting and removing corrupt or
inaccurate records from a record set, table or database.
Some data cleaning methods
Missing data
1. You can ignore the tuple. This is usually done when the class label is missing. This method is
not very effective unless the tuple contains several attributes with missing values.
2. You can fill in the missing value manually. This approach is effective for small data
sets with few missing values.
3. You can replace all missing attribute values with a global constant, such as a label like
"Unknown" or minus infinity.
4. You can use the attribute mean to fill in the missing value. For example, if the average
customer income is 25,000, you can use this value to replace a missing value for income.
5. You can use the most probable value to fill in the missing value (see the sketch below).
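A minimal sketch of methods 3-5 using the pandas library (an assumed dependency); the small table and its income values are purely illustrative.

```python
import pandas as pd

# Hypothetical customer table with missing income values (illustrative data only).
df = pd.DataFrame({"age": [23, 41, 35, 52], "income": [18000.0, None, 25000.0, None]})

# Method 3: replace missing values with a global constant.
filled_constant = df["income"].fillna(-1)  # or a label such as "Unknown" for categorical data

# Method 4: replace missing values with the attribute mean.
filled_mean = df["income"].fillna(df["income"].mean())

# Method 5 (one simple variant): use the most probable value, here the mode.
filled_mode = df["income"].fillna(df["income"].mode().iloc[0])

print(filled_constant.tolist(), filled_mean.tolist(), filled_mode.tolist())
```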
Noisy Data
Noise is a random error or variance in a measured variable. Noisy data may be due to
faulty data collection instruments, data entry problems, and technology limitations.
Binning
Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the
values around it. The sorted values are distributed into a number of "buckets," or bins.
For example, the data for price are first sorted and then partitioned into equal-frequency
bins of size 3.
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
In smoothing by bin boundaries, each bin value is replaced by the closest boundary value.
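The following is a small sketch of both smoothing variants; the nine price values are illustrative stand-ins for the figure that accompanied the original example.

```python
# Sorted data for price (illustrative values), partitioned into equal-frequency bins of size 3.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closest of the bin's min/max.
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```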
Regression
Data can be smoothed by fitting the data to a regression function.
Clustering
Outliers may be detected by clustering, where similar values are organized into groups, or
"clusters." Values that fall outside of the set of clusters may be considered outliers.
Dimensionality Reduction
Data reduction strategies include dimensionality reduction and numerosity reduction.
Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of random variables or
attributes under consideration. Dimensionality reduction methods include wavelet
transforms and principal components analysis,which transform or project the original data
onto a smaller space. Attribute subset selection is a method of dimensionality reduction in
which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and
removed.
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when
applied to a data vector X, transforms it to a numerically different vector, X 0 , of wavelet
coefficients. The two vectors are of the same length. When applying this technique to data
reduction, we consider each tuple as an n-dimensional data vector, that is, X =
(x1,x2,...,xn), depicting n measurements made on the tuple from n database attributes.3
“How can this technique be useful for data reduction if the wavelet transformed data are
of the same length as the original data?” The usefulness lies in the fact that the wavelet
transformed data can be truncated. A compressed approximation of the data can be
retained by storing only a small fraction of the strongest of the wavelet coefficients. For
example, all wavelet coefficients larger than some user-specified threshold can be
retained. All other coefficients are set to 0. The resulting data representation is therefore
very sparse, so that operations that can take advantage of data sparsity are
computationally very fast if performed in wavelet space. The technique also works to
remove noise without smoothing out the main features of the data, making it effective for
data cleaning as well. Given a set of coefficients, an approximation of the original data
can be constructed by applying the inverse of the DWT used. The DWT is closely related
to the discrete Fourier transform (DFT), a signal processing technique involving sines and
cosines. In general, however, the DWT achieves better lossy compression. That is, if the
same number of coefficients is retained for a DWT and a DFT of a given data vector, the
DWT version will provide a more accurate approximation of the original data. Hence, for
an equivalent approximation, the DWT requires less space than the DFT. Unlike the DFT,
wavelets are quite localized in space, contributing to the conservation of local detail.
There is only one DFT, yet there are several families of DWTs; some wavelet families are
shown in the accompanying figure. Popular wavelet transforms include the Haar-2,
Daubechies-4, and Daubechies-6. The general procedure for applying a discrete wavelet
transform uses a hierarchical pyramid algorithm that halves the data at each iteration,
resulting in fast computational speed. The method is as follows:
1. The length, L, of the input data vector must be an integer power of 2. This condition
can be met by padding the data vector with zeros as necessary (L ≥ n).
2. Each transform involves applying two functions. The first applies some data smoothing,
such as a sum or weighted average. The second performs a weighted difference, which acts
to bring out the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of
measurements (x2i, x2i+1). This results in two data sets of length L/2. In general, these
represent a smoothed or low-frequency version of the input data and its high-frequency
content, respectively.
4. The two functions are recursively applied to the data sets obtained in the previous
loop, until the resulting data sets are of length 2.
5. Selected values from the data sets obtained in the previous iterations are designated
the wavelet coefficients of the transformed data.
Fig. 1.5:
Equivalently, a matrix multiplication can be applied to the input data in order to obtain
the wavelet coefficients, where the matrix used depends on the given DWT. The matrix
must be orthonormal, meaning that the columns are unit vectors and are mutually
orthogonal, so that the matrix inverse is just its transpose. Although we do not have room
to discuss it here, this property allows the reconstruction of the data from the smooth and
smooth-difference data sets. By factoring the matrix used into a product of a few sparse
matrices, the resulting “fast DWT” algorithm has a complexity of O(n) for an input vector
of length n. Wavelet transforms can be applied to multidimensional data such as a data
cube. This is done by first applying the transform to the first dimension, then to the
second, and so on. The computational complexity involved is linear with respect to the
number of cells in the cube. Wavelet transforms give good results on sparse or skewed
data and on data with ordered attributes. Lossy compression by wavelets is reportedly
better than JPEG compression, the current commercial standard. Wavelet transforms have
many real-world applications, including the compression of fingerprint images, computer
vision, analysis of time-series data, and data cleaning.
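As a rough sketch of the pyramid procedure above, the following applies an unnormalized Haar transform: pairwise averages act as the smoothing function and pairwise (half-)differences as the detail function; the input length is assumed to be a power of 2.

```python
def haar_dwt(x):
    """Unnormalized Haar wavelet transform of a sequence whose length is a power of 2."""
    coeffs = []
    current = list(x)
    while len(current) > 1:
        smooth = [(current[2 * i] + current[2 * i + 1]) / 2 for i in range(len(current) // 2)]
        detail = [(current[2 * i] - current[2 * i + 1]) / 2 for i in range(len(current) // 2)]
        coeffs = detail + coeffs   # keep finer-level details after coarser ones
        current = smooth
    return current + coeffs        # overall average followed by the detail coefficients

data = [2, 2, 0, 2, 3, 5, 4, 4]
transformed = haar_dwt(data)
# Setting the smallest coefficients to 0 yields a sparse, approximate representation.
print(transformed)
```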
Principal Components Analysis (PCA)
The basic procedure is as follows:
1. The input data are normalized, so that each attribute falls within the same range. This
step helps ensure that attributes with large domains will not dominate attributes with
smaller domains.
2. PCA computes k orthonormal vectors that provide a basis for the normalized input
data. These are unit vectors that each point in a direction perpendicular to the others.
These vectors are referred to as the principal components. The input data are a linear
combination of the principal components.
3. The principal components are sorted in order of decreasing "significance" or strength.
They essentially serve as a new set of axes for the data, providing important information
about variance.
4. Because the components are sorted in decreasing order of "significance," the data size
can be reduced by eliminating the weaker components, that is, those with low variance.
Using the strongest principal components, it should be possible to reconstruct a good
approximation of the original data. PCA can be applied to ordered and unordered
attributes, and can handle sparse data and skewed data. Multidimensional data of more
than two dimensions can be handled by reducing the problem to two dimensions.
Principal components may be used as inputs to multiple regression and cluster analysis. In
comparison with wavelet transforms, PCA tends to be better at handling sparse data,
whereas wavelet transforms are more suitable for data of high dimensionality.
Fig. 1.6:
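A minimal sketch of these steps with NumPy (an assumed dependency); the tiny two-attribute matrix is illustrative only.

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# Step 1: normalize the input so no attribute dominates (here: center and scale by std).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: the principal components are the orthonormal eigenvectors of the covariance matrix.
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 3: sort components by decreasing "significance" (variance explained).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: reduce the data size by keeping only the strongest k components.
k = 1
reduced = Z @ eigvecs[:, :k]   # projected data in the reduced space
print(eigvals, reduced.shape)
```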
Attribute Subset Selection Data sets for analysis may contain hundreds of attributes,
many of which may be irrelevant to the mining task or redundant. For example, if the task
is to classify customers based on whether or not they are likely to purchase a popular new
CD at AllElectronics when notified of a sale, attributes such as the customer’s telephone
number are likely to be irrelevant, unlike attributes such as age or music taste. Although it
may be possible for a domain expert to pick out some of the useful attributes, this can be
a difficult and time-consuming task, especially when the data's behavior is not well
known. (Hence, a reason behind its analysis!) Leaving out relevant attributes or keeping
irrelevant attributes may be detrimental, causing confusion for the mining algorithm
employed. This can result in discovered patterns of poor quality. In addition, the added
volume of irrelevant or redundant attributes can slow down the mining process.
Heuristic (greedy) methods that explore a reduced search space are commonly used for
attribute subset selection. They are effective in practice and may come close to estimating
an optimal solution. The "best" (and "worst") attributes are typically determined using
tests of statistical significance, which assume that the attributes are independent of one
another. Many other attribute evaluation measures can be used, such as the information gain
measure used in building decision trees for classification. Basic heuristic methods of
attribute subset selection include the techniques that follow, some of which are
illustrated in Fig. 1.7.
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the
reduced set. The best of the original attributes is determined and added to the reduced set.
At each subsequent iteration or step, the best of the remaining original attributes is added
to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes.
At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each step, the
procedure selects the best attribute and removes the worst from among the remaining
attributes.
4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were
originally intended for classification. Decision tree induction constructs a flowchart-like
structure where each internal (nonleaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node denotes a class
prediction. At each node, the algorithm chooses the "best" attribute to partition the data
into individual classes. When decision tree induction is used for attribute subset selection,
a tree is constructed from the given data. All attributes that do not appear in the tree are
assumed to be irrelevant. The set of attributes appearing in the tree form the reduced
subset of attributes.
The stopping criteria for the methods may vary. The procedure may employ a threshold on
the measure used to determine when to stop the attribute selection process. In some cases,
we may want to create new attributes based on others. Such attribute construction can help
improve accuracy and understanding of structure in high-dimensional data. For example, we
may wish to add the attribute area based on the attributes height and width. By combining
attributes, attribute construction can discover missing information about the relationships
between data attributes that can be useful for knowledge discovery. (A sketch of stepwise
forward selection is given after the figure below.)
Fig. 1.7:
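Below is a rough sketch of stepwise forward selection. It assumes some externally supplied scoring function score(subset) (for example, cross-validated accuracy or information gain); the stop-when-no-improvement rule is just one possible threshold choice, and the attribute names are hypothetical.

```python
def forward_selection(all_attributes, score):
    """Greedy stepwise forward selection: start empty, repeatedly add the best attribute."""
    selected, best_score = [], float("-inf")
    remaining = list(all_attributes)
    while remaining:
        candidate_scores = {a: score(selected + [a]) for a in remaining}
        best_attr = max(candidate_scores, key=candidate_scores.get)
        if candidate_scores[best_attr] <= best_score:
            break                          # stop when no candidate improves the score
        selected.append(best_attr)
        best_score = candidate_scores[best_attr]
        remaining.remove(best_attr)
    return selected

# Illustrative scoring function: prefers the (hypothetical) informative attributes.
toy_score = lambda subset: len(set(subset) & {"age", "music_taste"}) - 0.1 * len(subset)
print(forward_selection(["age", "phone_number", "music_taste"], toy_score))
```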
Discretization by Binning
Binning is a top-down splitting technique based on a specified number of bins. Binning
methods are also used as discretization methods for data reduction and concept hierarchy
generation. For example, attribute values can be discretized by applying equal-width or
equal-frequency binning, and then replacing each bin value by the bin mean or median, as
in smoothing by bin means or smoothing by bin medians, respectively. These techniques
can be applied recursively to the resulting partitions to generate concept hierarchies.
Binning does not use class information and is therefore an unsupervised discretization
technique. It is sensitive to the user-specified number of bins, as well as the presence of
outliers.
Discretization by Histogram Analysis
Like binning, histogram analysis is an unsupervised discretization technique because it
does not use class information. A histogram partitions the values of an attribute, A, into
disjoint ranges called buckets or bins. In an equal-width histogram, the values are
partitioned into equal-size ranges; in an equal-frequency histogram, the values are
partitioned so that, ideally, each partition contains the same number of data tuples. The
histogram analysis algorithm can be applied recursively to each partition in order to
automatically generate a multilevel concept hierarchy, with the procedure terminating once
a prespecified number of concept levels has been reached. A
minimum interval size can also be used per level to control the recursive procedure. This
specifies the minimum width of a partition, or the minimum number of values for each
partition at each level. Histograms can also be partitioned based on cluster analysis of the
data distribution, as described next.
Discretization by Cluster, Decision Tree, and Correlation Analyses
Clustering, decision tree analysis, and correlation analysis can be used for data
discretization. We briefly study each of these
approaches. Cluster analysis is a popular data discretization method. A clustering
algorithm can be applied to discretize a numeric attribute, A, by partitioning the values of
A into clusters or groups. Clustering takes the distribution of A into consideration, as well
as the closeness of data points, and therefore is able to produce high-quality discretization
results. Clustering can be used to generate a concept hierarchy for A by following either a
top-down splitting strategy or a bottom-up merging strategy, where each cluster forms a
node of the concept hierarchy. In the former, each initial cluster or partition may be
further decomposed into several subclusters, forming a lower level of the hierarchy. In the
latter, clusters are formed by repeatedly grouping neighboring clusters in order to form
higher-level concepts. Techniques to generate decision trees for classification can be
applied to discretization. Such techniques employ a top-down splitting approach. Unlike
the other methods mentioned so far, decision tree approaches to discretization are
supervised, that is, they make use of class label information. For example, we may have a
data set of patient symptoms (the attributes) where each patient has an associated
diagnosis class label. Class distribution information is used in the calculation and
determination of split-points (data values for partitioning an attribute range). Intuitively,
the main idea is to select split-points so that a given resulting partition contains as many
tuples of the same class as possible. Entropy is the most commonly used measure for this
purpose. To discretize a numeric attribute, A, the method selects the value of A that has
the minimum entropy as a split-point, and recursively partitions the resulting intervals to
arrive at a hierarchical discretization. Such discretization forms a concept hierarchy for A.
Because decision tree–based discretization uses class information, it is more likely that
the interval boundaries (split-points) are defined to occur in places that may help improve
classification accuracy.
ChiMerge
ChiMerge is a χ²-based discretization method. It employs a bottom-up approach: the
intuition is that the relative class frequencies should be fairly consistent within an
interval, so if two adjacent intervals have a very similar distribution of classes,
then the intervals can be merged. Otherwise, they should remain separate. ChiMerge
proceeds as follows. Initially, each distinct value of a numeric attribute A is considered to
be one interval. χ² tests are performed for every pair of adjacent intervals. Adjacent
intervals with the least χ² values are merged together, because low χ² values for a pair
indicate similar class distributions. This merging process proceeds recursively until a
predefined stopping criterion is met.
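A rough sketch of the ChiMerge idea: each interval is represented by its per-class counts, the χ² statistic is computed for every pair of adjacent intervals, and the pair with the smallest value is merged until the smallest χ² reaches a user-chosen threshold. The threshold and the class counts below are illustrative assumptions.

```python
def chi2(a, b):
    """Chi-square statistic for two adjacent intervals given as {class: count} dicts."""
    classes = set(a) | set(b)
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a.get(c, 0) + b.get(c, 0)) / total
            if expected > 0:
                stat += (row.get(c, 0) - expected) ** 2 / expected
    return stat

def chimerge(intervals, threshold):
    """Merge adjacent intervals while the smallest chi-square value stays below threshold."""
    intervals = [dict(iv) for iv in intervals]
    while len(intervals) > 1:
        stats = [chi2(intervals[i], intervals[i + 1]) for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= threshold:
            break
        merged = {c: intervals[i].get(c, 0) + intervals[i + 1].get(c, 0)
                  for c in set(intervals[i]) | set(intervals[i + 1])}
        intervals[i:i + 2] = [merged]
    return intervals

# Illustrative class counts for initial single-value intervals of a numeric attribute.
print(chimerge([{"yes": 2}, {"yes": 1, "no": 1}, {"no": 3}], threshold=2.7))
```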
Data Transformation
Overview: In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning,
regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are constructed
and added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for data analysis at
multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range,
such as −1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).
The labels, in turn, can be recursively organized into higher-level concepts, resulting in
a concept hierarchy for the numeric attribute.
6. Concept hierarchy generation for nominal data, where attributes such as street can be
generalized to higher-level concepts, like city or country. Many hierarchies for nominal
attributes are implicit within the database schema and can be automatically defined at the
schema definition level.
Recall that there is much overlap between the major data preprocessing tasks. Smoothing is
a form of data cleaning; the data cleaning process also involves ETL tools, where users
specify transformations to correct data
inconsistencies. Discretization techniques can be categorized based on how the
discretization is performed, such as whether it uses class information or which direction it
proceeds (i.e., top-down vs. bottom-up). If the discretization process uses class
information, then we say it is supervised discretization. Otherwise, it is unsupervised. If
the process starts by first finding one or a few points (called split points or cut points) to
split the entire attribute range, and then repeats this recursively on the resulting intervals,
it is called top-down discretization or splitting. This contrasts with bottom-up
discretization or merging, which starts by considering all of the continuous values as
potential split-points, removes some by merging neighborhood values to form intervals,
and then recursively applies this process to the resulting intervals. Data discretization and
concept hierarchy generation are also forms of data reduction. The raw data are replaced
by a smaller number of interval or concept labels. This simplifies the original data and
makes the mining more efficient. The resulting patterns mined are typically easier to
understand. Concept hierarchies are also useful for mining at multiple abstraction levels.
Normalization
In min-max normalization, a linear transformation is performed on the original data.
Suppose that minA and maxA are the minimum and maximum values of an attribute, A. Min-max
normalization maps a value, vi, of A to v′i in the new range [new_minA, new_maxA] by computing

v′i = (vi − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA.

Min-max normalization preserves the relationships among the original data values. It will
encounter an "out-of-bounds" error if a future input case for normalization falls outside of
the original data range for A.
Example: Min-max normalization. Suppose that the minimum and maximum values for
the attribute income are $12,000 and $98,000, respectively. We would like to map income
to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is
transformed to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 0.716.
In z-score normalization (or zero-mean normalization), the values for an attribute, A, are
normalized based on the mean (i.e., average) and standard deviation of A. A value, vi, of
A is normalized to v′i by computing

v′i = (vi − Ā) / σA,

where Ā and σA are the mean and standard deviation, respectively, of attribute A. The
mean is Ā = (1/n)(v1 + v2 + ··· + vn), and σA is computed as the square root of the
variance of A. This method of normalization is useful when the actual minimum and maximum
of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
Example: z-score normalization. Suppose that the mean and standard deviation of the values
for the attribute income are $54,000 and $16,000, respectively. With z-score normalization,
a value of $73,600 for income is transformed to (73,600 − 54,000) / 16,000 = 1.225.
A variation of z-score normalization replaces the standard deviation by the mean absolute
deviation of A, sA = (1/n)(|v1 − Ā| + |v2 − Ā| + ··· + |vn − Ā|), so that v′i = (vi − Ā) / sA.
The mean absolute deviation, sA, is more robust to outliers than the standard deviation,
σA. When computing the mean absolute deviation, the deviations from the mean (i.e.,
|vi − Ā|) are not squared; hence, the effect of outliers is somewhat reduced.
Normalization by decimal scaling normalizes by moving the decimal point of values of
attribute A. The number of decimal points moved depends on the maximum absolute value of
A. A value, vi, of A is normalized to v′i by computing

v′i = vi / 10^j,

where j is the smallest integer such that max(|v′i|) < 1.
Decimal scaling
Suppose that the recorded values of A range from −986 to 917. The maximum absolute
value of A is 986. To normalize by decimal scaling, we therefore divide each value by
1000 (i.e., j = 3) so that −986 normalizes to −0.986 and 917 normalizes to 0.917. Note
that normalization can change the original data quite a bit, especially when using z-score
normalization or decimal scaling. It is also necessary to save the normalization
parameters (e.g., the mean and standard deviation if using z-score normalization) so that
future data can be normalized in a uniform manner.
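A minimal sketch of the three normalization methods, reusing the income and decimal-scaling figures from the examples above:

```python
import math

def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(v, max_abs):
    # j is the smallest integer such that max(|v'|) < 1 (assumes max_abs >= 1)
    j = math.floor(math.log10(max_abs)) + 1
    return v / (10 ** j)

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling(-986, 986), decimal_scaling(917, 986))  # -0.986  0.917
```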
Similarity Measure:
Numerical measure of how alike two data objects are. Often falls between 0
(no similarity) and 1 (complete similarity).
Dissimilarity Measure:
Numerical measure of how different two data objects are. Lower values indicate more
similar objects. The minimum dissimilarity is often 0, and the upper limit varies.
For a single attribute, with values p and q for two objects, common similarity (s) and
dissimilarity (d) measures by attribute type are:
Nominal: s = 1 if p = q, s = 0 if p ≠ q;  d = 0 if p = q, d = 1 if p ≠ q.
Ordinal (values mapped to integers 0 to n − 1, where n is the number of values):
s = 1 − |p − q| / (n − 1);  d = |p − q| / (n − 1).
Interval or Ratio: s = 1 − |p − q| or s = 1 / (1 + |p − q|);  d = |p − q|.
For multivariate objects, a weighted Euclidean distance can be used:

d_WE(i, j) = ( Σ_{k=1..p} w_k (x_ik − x_jk)² )^(1/2)

If the scales of the attributes differ substantially, standardization is necessary.
Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance. With the
measurements x_ik, i = 1, ..., N, k = 1, ..., p, the Minkowski distance is

d_M(i, j) = ( Σ_{k=1..p} |x_ik − x_jk|^λ )^(1/λ),

where λ ≥ 1. It is also called the L_λ metric.
λ = 1: L1 metric, Manhattan or city-block distance.
λ = 2: L2 metric, Euclidean distance.
λ → ∞: L∞ metric, supremum distance, with
lim_{λ→∞} d_M(i, j) = max(|x_i1 − x_j1|, ..., |x_ip − x_jp|).
Note that λ and p are two different parameters; the dimension p of the data matrix remains
finite.
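A small sketch of the L_λ family for two measurement vectors (plain Python, no external dependencies); the vectors are illustrative.

```python
def minkowski(x, y, lam):
    """L-lambda distance between equal-length vectors x and y (lam >= 1)."""
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1.0 / lam)

def supremum(x, y):
    """Limit lam -> infinity: the maximum attribute-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))

p, q = [1.0, 3.0, 5.0], [4.0, 1.0, 5.0]
print(minkowski(p, q, 1))   # Manhattan / city-block distance: 5.0
print(minkowski(p, q, 2))   # Euclidean distance: ~3.606
print(supremum(p, q))       # supremum (L-infinity) distance: 3.0
```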
Mahalanobis Distance
Let X be an N × p matrix. Then the ith row of X is x_i^T = (x_i1, ..., x_ip).
The Mahalanobis distance is

d_MH(i, j) = ( (x_i − x_j)^T Σ⁻¹ (x_i − x_j) )^(1/2),

where Σ is the p × p sample covariance matrix.
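A short sketch with NumPy (an assumed dependency): the rows of X are observations, and Σ is estimated as the sample covariance matrix of X.

```python
import numpy as np

X = np.array([[2.0, 1.0], [4.0, 3.0], [5.0, 4.0], [3.0, 5.0], [6.0, 2.0]])
cov = np.cov(X, rowvar=False)      # p x p sample covariance matrix
cov_inv = np.linalg.inv(cov)

def mahalanobis(xi, xj):
    d = xi - xj
    return float(np.sqrt(d @ cov_inv @ d))

print(mahalanobis(X[0], X[2]))
```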
Unit-II
Association Rules
Problem Definition
The problem of association rule mining is defined as follows. Let I = {i1, i2, ..., in} be
a set of n binary attributes called items, and let D = {t1, t2, ..., tm} be a set of
transactions called the database. Each transaction in D has a unique transaction ID and
contains a subset of the items in I. A rule is defined as an implication of the form
X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (for short, itemsets) X and Y are
called the antecedent (left-hand side or LHS) and consequent (right-hand side or RHS) of
the rule, respectively.
Example:
To illustrate the concepts, we use a small example from the supermarket domain. The set
of items is I = {milk, bread, butter, beer}, and a small database containing the items
(1 codes presence and 0 absence of an item in a transaction) is shown in the table below.

Transaction ID   milk   bread   butter   beer
1                1      1       0        0
2                0      0       1        0
3                0      0       0        1
4                1      1       1        0
5                0      1       0        0

The support supp(X) of an itemset X is the proportion of transactions in the database that
contain X, and the confidence of a rule is conf(X ⇒ Y) = supp(X ∪ Y) / supp(X).
For example, the rule {milk, bread} ⇒ {butter} has a support of 1/5 = 0.2 (it holds in
transaction 4) and a confidence of 0.2 / 0.4 = 0.5, since {milk, bread} occurs in two of
the five transactions and butter is bought with them in one of those two.
The conviction of a rule is defined as conv(X ⇒ Y) = (1 − supp(Y)) / (1 − conf(X ⇒ Y)) and
can be interpreted as the ratio of the expected frequency that X occurs without Y (that
is to say, the frequency that the rule makes an incorrect prediction) if X and Y were
independent, divided by the observed frequency of incorrect predictions.
This process analyzes customer buying habits by finding associations between the
different items that customers place in their shopping baskets. The discovery of such
associations can help retailers develop marketing strategies by gaining insight into which
items are frequently purchased together by customers.
Finding Frequent Item sets Using Candidate Generation: The Apriori Algorithm
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules. The name of the algorithm is based on the
fact that the algorithm uses prior knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item and collecting those items that satisfy minimum support. The
resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets,
which is used to find L3, and so on, until no more frequent k-itemsets can be found.
Finding each Lk requires one full scan of the database.
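A compact sketch of the level-wise search (candidate generation plus support counting). It assumes transactions are given as sets of items and that min_sup is an absolute support count; the tiny database reuses the supermarket items from the earlier example.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for all frequent itemsets (min_sup = absolute count)."""
    transactions = [frozenset(t) for t in transactions]
    items = {frozenset([i]) for t in transactions for i in t}
    k_frequent = {c: sum(c <= t for t in transactions) for c in items}
    k_frequent = {c: s for c, s in k_frequent.items() if s >= min_sup}   # L1
    frequent, k = {}, 2
    while k_frequent:
        frequent.update(k_frequent)
        # Candidate generation: join L(k-1) with itself, then keep only size-k sets whose
        # (k-1)-subsets are all frequent (the Apriori property).
        prev = list(k_frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in k_frequent for s in combinations(c, k - 1))}
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        k_frequent = {c: s for c, s in counts.items() if s >= min_sup}
        k += 1
    return frequent

db = [{"milk", "bread"}, {"butter"}, {"beer"}, {"milk", "bread", "butter"}, {"bread"}]
print(apriori(db, min_sup=2))   # {milk}, {bread}, {butter}, and {milk, bread} are frequent
```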
Once the frequent itemsets from the transactions in a database D have been found, it is
straightforward to generate strong association rules from them, where strong association
rules satisfy both minimum support and minimum confidence.
Partition Algorithms
The Partition algorithm needs only two scans of the database D to mine the frequent itemsets.
First scan: Subdivide the transactions of database D into n non-overlapping partitions and
find the local frequent itemsets within each partition. If the minimum (relative) support
in D is min_sup, then the minimum support count for a partition is min_sup × the number of
transactions in that partition.
Second scan: The local frequent itemsets from all partitions are combined to form the
global candidate itemsets, and their actual supports are counted against D to determine
the globally frequent itemsets.
Compact Representation of Frequent Itemsets: Maximal and Closed Frequent Itemsets
For example, consider an itemset lattice in which a border separates the frequent itemsets
from the infrequent ones. The itemsets marked in blue are the maximal frequent itemsets;
this can easily be seen, as none of the supersets of the blue itemsets lie above the border
(i.e., none of their supersets are frequent). However, one problem associated with maximal
frequent itemsets is that, even though we know that all their subsets are frequent, we do
not know their supports. Actual support information is very important for these itemsets
when we are deriving association rules. So now we shift our goal and try to find those
frequent itemsets for which no superset has the same support, since such itemsets preserve
the support information of the whole collection of frequent itemsets. This forms the
notion of a closed frequent itemset.
In the above process we have derived the set of frequent itemsets from the transactions.
Now let us represent the itemset lattice and identify the closed and maximal frequent itemsets. The
minimum support value in the current example is set to 2. The labels in red for each node
in the lattice represent the set of transactions in which the itemset occurs (its support);
labels are provided for only a few of the nodes, and they can be verified by checking
against the transactions. The itemsets C, D, and E are closed because none of their
supersets has the same support, but each of them has a frequent superset, so they are not
maximal. For CE and DE, however, both properties are satisfied, and hence they come under
the category of both closed and maximal frequent itemsets.
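Given the support of every frequent itemset (for example, the output of the Apriori sketch above), maximal and closed frequent itemsets can be identified as follows; the support table is an illustrative assumption chosen so that C, D, and E come out closed but not maximal, while CE and DE are both closed and maximal.

```python
def maximal_and_closed(supports):
    """supports: {frozenset: support count}. Returns (maximal, closed) frequent itemsets."""
    itemsets = list(supports)
    maximal, closed = [], []
    for x in itemsets:
        supersets = [y for y in itemsets if x < y]              # proper frequent supersets
        if not supersets:
            maximal.append(x)                                   # no frequent superset at all
        if all(supports[y] < supports[x] for y in supersets):
            closed.append(x)                                    # no superset with equal support
    return maximal, closed

# Illustrative supports with min_sup = 2 (C, D, E are single items; CE and DE are 2-itemsets).
supports = {
    frozenset("C"): 3, frozenset("D"): 4, frozenset("E"): 5,
    frozenset("CE"): 2, frozenset("DE"): 2,
}
maximal, closed = maximal_and_closed(supports)
print([set(m) for m in maximal])   # the largest frequent itemsets: {C, E} and {D, E}
print([set(c) for c in closed])    # itemsets whose support differs from all their supersets
```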
Unit-III
Classification
Problem Definition
There are two forms of data analysis that can be used for extracting models describing
important classes or to predict future data trends. These two forms are as follows −
Classification
Prediction
Classification models predict categorical class labels; and prediction models predict
continuous valued functions. For example, we can build a classification model to
categorize bank loan applications as either safe or risky, or a prediction model to predict
the expenditures in dollars of potential customers on computer equipment given their
income and occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
A bank loan officer wants to analyze the data in order to know which customers
(loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze whether a customer with a given
profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the
categorical labels. These labels are risky or safe for loan application data and yes or no
for marketing data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we are asked to predict a numeric value.
Therefore the data analysis task is an example of numeric prediction. In this case, a model
or a predictor will be constructed that predicts a continuous-valued-function or ordered
value.
Data Cleaning − Data cleaning involves removing noise and treating missing values. Noise
is removed by applying smoothing techniques, and the problem of missing values is solved
by replacing a missing value with the most commonly occurring value for that attribute.
Relevance Analysis − The database may also have irrelevant attributes. Correlation
analysis is used to know whether any two given attributes are related.
Data Transformation and Reduction − The data can be transformed by any of the
following methods.
Normalization − The data is transformed using normalization. Normalization involves
scaling all values for a given attribute in order to make them fall within a small specified
range. Normalization is used when, in the learning step, neural networks or methods
involving distance measurements are used.
Generalization − The data can also be transformed by generalizing it to a higher
concept. For this purpose we can use the concept hierarchies.
Although accuracy is a specific measure, the word "accuracy" is also used as a general
term to refer to a classifier's predictive abilities. Using training data to derive a classifier
and then estimating the accuracy of the resulting learned model can result in misleadingly
optimistic estimates due to overspecialization of the learning algorithm to the data.
(We will say more on this in a moment!) Instead, it is better to measure the classifier's
accuracy on a test set consisting of class-labeled tuples that were not used to train the
model. Before we discuss the various measures, we need to become comfortable with
some terminology. Recall that we can talk in terms of positive tuples (tuples of the main
class of interest) and negative tuples (all other tuples).
Attribute selection measures are also known as splitting rules because they determine how
the tuples at a given node are to be split. The attribute selection measure provides a
ranking for each attribute describing the given training tuples. The attribute having the
best score for the measure is chosen as the splitting attribute for the given tuples. If the
splitting attribute is continuous-valued or if we are restricted to binary trees, then,
respectively, either a split point or a splitting subset must also be determined as part of the
splitting criterion. The tree node created for partition D is labeled with the splitting
criterion, branches are grown for each outcome of the criterion, and the tuples are
partitioned accordingly. This section describes three popular attribute selection measures
—information gain, gain ratio, and Gini index.
The notation used herein is as follows. Let D, the data partition, be a training set of class-
labeled tuples. Suppose the class label attribute has m distinct values defining m distinct
classes, Ci (for i = 1,..., m). Let Ci,D be the set of tuples of class Ci in D. Let |D| and
|Ci,D| denote the number of tuples in D and Ci,D, respectively.
This approach minimizes the expected number of tests needed to classify a given tuple and
guarantees that a simple (but not necessarily the simplest) tree is found. The expected
information needed to classify a tuple in D is given by

Info(D) = − Σ_{i=1..m} p_i log2(p_i),

where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is
estimated by |Ci,D|/|D|. A log function to the base 2 is used, because the information is
encoded in bits. Info(D) is just the average amount of information needed to identify the
class label of a tuple in D. Note that, at this point, the information we have is based solely
on the proportions of tuples of each class.
Info(D): This is also known as the entropy of D. Now, suppose we were to partition the
tuples in D on some attribute A having v distinct values, {a1, a2,..., av }, as observed
from the training data. If A is discrete-valued, these values correspond directly to the v
outcomes of a test on A. Attribute A can be used to split D into v partitions or subsets,
{D1, D2,..., Dv }, where Dj contains those tuples in D that have outcome aj of A. These
partitions would correspond to the branches grown from node N. Ideally, we would like
this partitioning to produce an exact classification of the tuples. That is, we would like for
each partition to be pure. However, it is quite likely that the partitions will be impure
(e.g., where a partition may contain a collection of tuples from different classes rather
than from a single class). How much more information would we still need (after the
partitioning) to arrive at an exact classification? This amount is measured by

Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj).

The term |Dj|/|D| acts as the weight of the jth partition. InfoA(D) is the expected
information required to classify a tuple from D based on the partitioning by A. The
smaller the expected information (still) required, the greater the purity of the partitions.
Information gain is defined as the difference between the original information
requirement (i.e., based on just the proportion of classes) and the new requirement (i.e.,
obtained after partitioning on A). That is,

Gain(A) = Info(D) − Info_A(D).
In other words, Gain(A) tells us how much would be gained by branching on A. It is the
expected reduction in the information requirement caused by knowing the value of A. The
attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute
at node N. This is equivalent to saying that we want to partition on the attribute A that
would do the “best classification,” so that the amount of information still required to
finish classifying the tuples is minimal (i.e., minimum InfoA (D)).
Next, we need to compute the expected information requirement for each attribute. Let’s
start with the attribute age. We need to look at the distribution of yes and no tuples for
each category of age. For the age category “youth,” there are two yes tuples and three no
tuples. For the category “middle aged,” there are four yes tuples and zero no tuples. For
the category "senior," there are three yes tuples and two no tuples. The expected
information needed to classify a tuple in D, based solely on the class proportions (nine
yes and five no tuples), is Info(D) = −(9/14)log2(9/14) − (5/14)log2(5/14) = 0.940 bits.
The expected information needed to classify a tuple in D if the tuples are partitioned
according to age is

Info_age(D) = (5/14) × (−(2/5)log2(2/5) − (3/5)log2(3/5)) + (4/14) × (−(4/4)log2(4/4))
            + (5/14) × (−(3/5)log2(3/5) − (2/5)log2(2/5)) = 0.694 bits.

Hence, the gain in information from such a partitioning is
Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits.
Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and
Gain(credit rating) = 0.048 bits. Because age has the highest information gain among the
attributes, it is selected as the splitting attribute. Node N is labeled with age, and branches
are grown for each of the attribute’s values. The tuples are then partitioned accordingly,
as shown in Figure 8.5. Notice that the tuples falling into the partition for age = middle
aged all belong to the same class. Because they all belong to class “yes,” a leaf should
therefore be created at the end of this branch and labeled "yes." The final decision tree
returned by the algorithm was shown earlier.
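A small sketch that reproduces this kind of calculation. The overall class counts (9 yes / 5 no) and the per-partition counts for age (2/3, 4/0, 3/2) are the ones used in the example above.

```python
from math import log2

def entropy(counts):
    """Info for a list of class counts, in bits."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(class_counts, partitions):
    """Gain(A) = Info(D) - sum_j |Dj|/|D| * Info(Dj)."""
    total = sum(class_counts)
    info_a = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(class_counts) - info_a

overall = [9, 5]                     # buys_computer: yes / no
by_age = [[2, 3], [4, 0], [3, 2]]    # youth, middle_aged, senior
print(round(entropy(overall), 3))         # Info(D), about 0.940 bits
print(round(info_gain(overall, by_age), 3))  # Gain(age), about 0.25 bits
```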
scenario, we must determine the “best” split-point for A, where the split-point is a
threshold on A. We first sort the values of A in increasing order. Typically, the midpoint
between each pair of adjacent values is considered as a possible split-point. Therefore,
given v values of A, then v − 1 possible splits are evaluated. For example, the midpoint
between the values ai and ai+1 of A is (ai + ai+1) / 2.
If the values of A are sorted in advance, then determining the best split for A requires
only one pass through the values. For each possible split-point for A, we evaluate InfoA
(D), where the number of partitions is two, that is, v = 2 (or j = 1,2) in Eq. (8.2). The
point with the minimum expected information requirement for A is selected as the split
point for A. D1 is the set of tuples in D satisfying A ≤ split point, and D2 is the set of
tuples in D satisfying A > split point.
Gain Ratio The information gain measure is biased toward tests with many outcomes.
That is, it prefers to select attributes having a large number of values. For example,
consider an attribute that acts as a unique identifier such as product ID. A split on product
ID would result in a large number of partitions (as many as there are values), each one
containing just one tuple. Because each partition is pure, the information required to
classify data set D based on this partitioning would be Infoproduct ID(D) = 0. Therefore,
the information gained by partitioning on this attribute is maximal. Clearly, such a
partitioning is useless for classification. C4.5, a successor of ID3, uses an extension to
information gain known as gain ratio, which attempts to overcome this bias. It applies a
kind of normalization to information gain using a “split information” value defined
analogously with Info(D) as
This value represents the potential information generated by splitting the training data set,
D, into v partitions, corresponding to the v outcomes of a test on attribute A. Note that,
for each outcome, it considers the number of tuples having that outcome with respect to
the total number of tuples in D. It differs from information gain, which measures the
information with respect to classification that is acquired based on the same partitioning.
The gain ratio is defined as GainRatio(A) = Gain(A) / SplitInfo_A(D).
The attribute with the maximum gain ratio is selected as the splitting attribute. Note,
however, that as the split information approaches 0, the ratio becomes unstable. A
constraint is added to avoid this, whereby the information gain of the test selected must
be large—at least as great as the average gain over all tests examined.
Example:
Computation of gain ratio for the attribute income. A test on income splits the data of
Table 8.1 into three partitions, namely low, medium, and high, containing four, six, and
four tuples, respectively. To compute the gain ratio of income, we first use the split
information formula to obtain

SplitInfo_income(D) = −(4/14)log2(4/14) − (6/14)log2(6/14) − (4/14)log2(4/14) = 1.557.

From before, Gain(income) = 0.029; therefore, GainRatio(income) = 0.029 / 1.557 = 0.019.
Gini Index
The Gini index is used in CART. Using the notation previously described, the Gini index
measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 − Σ_{i=1..m} p_i²,

where pi is the probability that a tuple in D belongs to class Ci, estimated by |Ci,D|/|D|.
To determine the best binary split on A, we examine all the possible subsets that can be
formed using known values of A. Each subset, SA, can be considered as a binary test for
attribute A of the form "A ∈ SA?" Given a tuple, this test is satisfied if the value of A for
the tuple is among the values listed in SA. If A has v possible values, then there are 2^v
possible subsets. For example, if income has three possible values, namely {low, medium,
high}, then the possible subsets are {low, medium, high}, {low, medium}, {low, high},
{medium, high}, {low}, {medium}, {high}, and {}. We exclude the full set, {low, medium,
high}, and the empty set from consideration since, conceptually, they do not represent a
split. Therefore, there are 2^v − 2 possible ways to form two partitions of the data, D,
based on a binary split on A. When considering a binary split, we compute a weighted sum
of the impurity of each resulting partition. For example, if a binary split on A partitions
D into D1 and D2, the Gini index of D given that partitioning is

Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2).
For each attribute, each of the possible binary splits is considered. For a discrete-valued
attribute, the subset that gives the minimum Gini index for that attribute is selected as its
splitting subset. For continuous-valued attributes, each possible split-point must be
considered. The strategy is similar to that described earlier for information gain, where
the midpoint between each pair of (sorted) adjacent values is taken as a possible split-
point. The point giving the minimum Gini index for a given (continuous-valued) attribute
is taken as the split-point of that attribute. Recall that for a possible split-point of A, D1 is
the set of tuples in D satisfying A ≤ split point, and D2 is the set of tuples in D satisfying
A > split point. The reduction in impurity that would be incurred by a binary split on a
discrete- or continuous-valued attribute A is

∆Gini(A) = Gini(D) − Gini_A(D).
The attribute that maximizes the reduction in impurity (or, equivalently, has the minimum
Gini index) is selected as the splitting attribute. This attribute and either its splitting
subset (for a discrete-valued splitting attribute) or split-point (for a continuous-valued
splitting attribute) together form the splitting criterion.
Example
Induction of a decision tree using the Gini index. Let D be the training data shown earlier
in Table 8.1, where there are nine tuples belonging to the class buys computer = yes and
the remaining five tuples belong to the class buys computer = no. A (root) node N is
created for the tuples in D. We first compute the impurity of D:

Gini(D) = 1 − (9/14)² − (5/14)² = 0.459.

To find the splitting criterion for the tuples in D, we need to compute the Gini index for
each attribute. Let's start with the attribute income and consider each of the possible
splitting subsets. Consider the subset {low, medium}. This would result in 10 tuples in
partition D1 satisfying the condition "income ∈ {low, medium}." The remaining four
tuples of D would be assigned to partition D2. The Gini index value computed based on
this partitioning is

Gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
  = (10/14)(1 − (7/10)² − (3/10)²) + (4/14)(1 − (2/4)² − (2/4)²) = 0.443.
Similarly, the Gini index values for splits on the remaining subsets are 0.458 (for the
subsets {low, high} and {medium}) and 0.450 (for the subsets {medium, high} and
{low}). Therefore, the best binary split for attribute income is on {low, medium} (or
{high}) because it minimizes the Gini index. Evaluating age, we obtain {youth, senior}
(or {middle aged}) as the best split for age with a Gini index of 0.375; the attributes
student and credit rating are both binary, with Gini index values of 0.367 and 0.429,
respectively. The attribute age and splitting subset {youth, senior} therefore give the minimum Gini index overall, with a reduction in impurity of 0.459 − 0.357 = 0.102. The binary split “age ∈ {youth, senior}?” results in the maximum reduction in impurity of the tuples in D and is returned as the splitting criterion. Node N is labeled with the criterion,
two branches are grown from it, and the tuples are partitioned accordingly.
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise
or outliers. The pruned trees are smaller and less complex.
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, …, xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, …, An.
2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes’ theorem,
P(Ci|X) = P(X|Ci)·P(Ci) / P(X).
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are
equally likely, that is, P(C1) = P(C2) = …= P(Cm), and we would therefore maximize
P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).
4. To compute P(X|Ci), the naïve assumption of class-conditional independence is made, so that P(X|Ci) = Π_{k=1..n} P(xk|Ci). If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
5. In order to predict the class label of X, P(X|Ci)·P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci)·P(Ci) > P(X|Cj)·P(Cj) for 1 ≤ j ≤ m, j ≠ i.
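A minimal Python sketch of these steps for categorical attributes follows (hypothetical function and variable names; a practical implementation would also apply Laplace smoothing to avoid zero probabilities and use, for example, a Gaussian estimate for continuous attributes):

from collections import Counter, defaultdict

def train_naive_bayes(rows, label="class"):
    # estimate P(Ci) from class frequencies and P(xk | Ci) from per-class value frequencies
    class_counts = Counter(r[label] for r in rows)
    value_counts = defaultdict(Counter)      # (class, attribute) -> Counter of attribute values
    for r in rows:
        for attr, value in r.items():
            if attr != label:
                value_counts[(r[label], attr)][value] += 1
    return class_counts, value_counts

def predict(x, class_counts, value_counts):
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for ci, count in class_counts.items():
        score = count / n                                        # P(Ci)
        for attr, value in x.items():
            score *= value_counts[(ci, attr)][value] / count     # P(xk | Ci), independence assumed
        if score > best_score:
            best_class, best_score = ci, score
    return best_class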
The arc in the diagram allows representation of causal knowledge. For example, lung
cancer is influenced by a person's family history of lung cancer, as well as whether or not
the person is a smoker. It is worth noting that the variable PositiveXray is independent of
whether the patient has a family history of lung cancer or that the patient is a smoker,
given that we know the patient has lung cancer.
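To make the independence statement concrete, the joint probability of a tuple factorizes over the network. Under the parent structure described above (assuming FamilyHistory and Smoker have no parents, as in the usual textbook network), for example:
P(FamilyHistory, Smoker, LungCancer, PositiveXRay) = P(FamilyHistory) · P(Smoker) · P(LungCancer | FamilyHistory, Smoker) · P(PositiveXRay | LungCancer).
Because PositiveXRay has LungCancer as its only parent, knowing LungCancer makes PositiveXRay conditionally independent of FamilyHistory and Smoker, which is exactly the statement above.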
k-Nearest-Neighbor Classifier
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k nearest neighbors of the unknown tuple.
Closeness is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say, X1 = (x11, x12, … , x1n) and X2 = (x21, x22, … , x2n), is
dist(X1, X2) = √( Σ_{i=1..n} (x1i − x2i)² ).
In other words, for each numeric attribute, we take the difference between the corresponding values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it. The square root is taken of the total accumulated distance count.
When k = 1, the unknown tuple is assigned the class of the training tuple that is
closest to it in pattern space.
Nearest neighbor classifiers can also be used for prediction, that is, to return
a real-valued prediction for a given unknown tuple.
In this case, the classifier returns the average value of the real-valued labels
associated with the k nearest neighbors of the unknown tuple.
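The following short Python sketch puts both uses together (hypothetical function names; the training data is a made-up list of (attribute-tuple, label) pairs):

import math
from collections import Counter

def euclidean(x1, x2):
    # square each per-attribute difference, accumulate, then take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(train, query, k=3):
    # majority class among the k training tuples closest to the query tuple
    neighbors = sorted(train, key=lambda t: euclidean(t[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def knn_predict(train, query, k=3):
    # numeric prediction: average of the real-valued labels of the k nearest neighbors
    neighbors = sorted(train, key=lambda t: euclidean(t[0], query))[:k]
    return sum(label for _, label in neighbors) / k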
UNIT-IV
Clustering
Problem Definition
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
Clustering Overview
Clustering is the task of dividing the population or data points into a number of groups
such that data points in the same groups are more similar to other data points in the same
group than those in other groups. In simple words, the aim is to segregate groups with
similar traits and assign them into clusters.
Clustering is the process of making a group of abstract objects into classes of similar
objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into groups based
on data similarity and then assign the labels to the groups.
The main advantage of clustering over classification is that it is adaptable
to changes and helps single out useful features that distinguish different groups.
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’ partitions of the data. Each partition will represent a cluster and k ≤ n. This means that the method classifies the data into k groups, which satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember −
For a given number of partitions (say k), the partitioning method will create
an initial partitioning.
Then it uses the iterative relocation technique to improve the partitioning
by moving objects from one group to another.
K Means Clustering
K-means is an iterative clustering algorithm that converges to a locally optimal partitioning of the data. The algorithm works in the following steps:
1. Specify the desired number of clusters K : Let us choose k=2 for these 5
data points in 2-D space.
MALLA REDDY ENGINEERING COLLEGE FOR WOMEN (AUTONOMOUS- UGC,GOVT. Of INDIA ) Page 55
lOMoARcPSD|18595359
DATA MINING
2. Randomly assign each data point to a cluster : Let’s assign three points in cluster 1
shown using red color and two points in cluster 2 shown using grey color.
3. Compute cluster centroids : The centroid of data points in the red cluster is shown
using red cross and those in grey cluster using grey cross.
4. Re-assign each point to the closest cluster centroid : Note that the data point at the bottom was assigned to the red cluster even though it is closer to the centroid of the grey cluster. Thus, we re-assign that data point to the grey cluster.
5. Re-compute cluster centroids : Now, re-computing the centroids for both the
clusters.
6. Repeat steps 4 and 5 until no improvements are possible : Similarly, we repeat the 4th and 5th steps until the cluster assignments stop changing. When there is no further switching of data points between the two clusters for two successive iterations, the algorithm terminates (unless a maximum number of iterations is specified explicitly).
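A compact Python sketch of these steps is shown below (hypothetical names; it seeds the centroids with k randomly chosen data points rather than a random initial assignment, which is a common variant of step 2, and it stops when the centroids no longer move):

import math
import random

def kmeans(points, k, iterations=100):
    # points: list of numeric tuples; returns the final centroids and the clusters
    centroids = random.sample(points, k)                     # seed with k of the data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                      # step 4: assign to the closest centroid
            distances = [math.dist(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        new_centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
                         for i, c in enumerate(clusters)]     # step 5: recompute the centroids
        if new_centroids == centroids:                        # step 6: stop once nothing changes
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0)]
print(kmeans(points, k=2))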
PAM Algorithm
The PAM (Partitioning Around Medoids) algorithm was developed by Leonard Kaufman and Peter J. Rousseeuw. It is very similar to K-means, mostly because both are partitional algorithms: both break the dataset into groups (clusters) and both work by trying to minimize the error. However, PAM works with medoids, which are actual entities of the dataset that represent the group in which they are inserted, whereas K-means works with centroids, which are artificially created points that represent their cluster.
The PAM algorithm partitions the dataset of n objects into k clusters, where both the dataset and the number k are inputs of the algorithm. The algorithm works with a dissimilarity matrix, and its goal is to minimize the overall dissimilarity between the
representatives (medoids) of each cluster and the cluster's members. The algorithm uses the following model to solve the problem:
Minimize F(z) = Σ_{i=1..n} Σ_{j=1..n} d(i, j) · z_ij
Subject to:
1. Σ_{i=1..n} z_ij = 1,  j = 1, 2, ..., n
2. z_ij ≤ y_i,  i, j = 1, 2, ..., n
3. Σ_{i=1..n} y_i = k
4. y_i, z_ij ∈ {0, 1},  i, j = 1, 2, ..., n
Here F(z) is the main function to minimize, d(i, j) is the dissimilarity measurement between the entities i and j, and z_ij is a variable that ensures that only the dissimilarity between entities from the same cluster is counted in the main function. The other expressions are constraints with the following functions: (1) ensures that every single entity is assigned to one cluster and only one cluster, (2) ensures that an entity is assigned only to an entity y_i that has been selected as the medoid representing its cluster, (3) ensures that there are exactly k clusters, and (4) lets the decision variables assume just the values 0 or 1.
The PAM algorithm can work with two kinds of input: the first is the matrix representing every entity and the values of its variables, and the second is the dissimilarity matrix directly. In the latter case the user provides the dissimilarities directly as input to the algorithm, instead of the data matrix representing the entities. Either way the algorithm reaches a solution to the problem; in a general analysis it proceeds this way:
Build phase
1. Choose k entities to become the medoids, or in case these entities were provided
use them as the medoids;
2. Calculate the dissimilarity matrix if it was not informed;
3. Assign every entity to its closest medoid;
Swap phase
4. For each cluster, check whether any of the entities of the cluster lowers the average dissimilarity coefficient; if one does, select the entity that lowers this coefficient the most as the new medoid for this cluster;
5. If at least one medoid has changed go to (3), else end the algorithm.
As was said, the PAM algorithm works with a matrix of dissimilarity, and to calculate this matrix the algorithm can use two metrics. The first one is the Euclidean distance, which is the root of the sum of squared differences, while the second one is the Manhattan distance, which is the sum of absolute differences.
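The build/swap idea can be sketched in a few lines of Python (hypothetical names; the build phase here simply takes the first k entities for brevity, and the dissimilarity matrix is assumed to have been precomputed with one of the two metrics above):

def pam(dissimilarity, k):
    # dissimilarity: n x n matrix (list of lists); returns the medoid indices and assignments
    n = len(dissimilarity)
    medoids = list(range(k))                        # build phase (naive choice of the first k entities)
    while True:
        assign = [min(medoids, key=lambda m: dissimilarity[i][m]) for i in range(n)]
        cost = sum(dissimilarity[i][assign[i]] for i in range(n))
        best_cost, best_medoids = cost, medoids
        for mi, m in enumerate(medoids):            # swap phase: try every cluster member as medoid
            for cand in (i for i in range(n) if assign[i] == m):
                trial = medoids[:mi] + [cand] + medoids[mi + 1:]
                trial_cost = sum(min(dissimilarity[i][t] for t in trial) for i in range(n))
                if trial_cost < best_cost:
                    best_cost, best_medoids = trial_cost, trial
        if best_medoids == medoids:                 # no medoid changed: stop
            return medoids, assign
        medoids = best_medoids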
Hierarchical clustering
This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is
formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Methods
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to one
another. It keeps on doing so until all of the groups are merged into one or until the
termination condition holds.
Divisive Methods
This approach is also known as the top-down approach. In this, we start with all of the
objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters. This is done until each object is in its own cluster or the termination condition holds.
This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
At the bottom, we start with 25 data points, each assigned to separate clusters. Two
closest clusters are then merged till we have just one cluster at the top. The height in the
dendrogram at which two clusters are merged represents the distance between two
clusters in the data space. The number of clusters that best depicts the different groups can be chosen by observing the dendrogram. The best choice of the no. of clusters is the no. of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster.
In the above example, the best choice of no. of clusters will be 4 as the red horizontal line
in the dendrogram below covers maximum vertical distance AB.
This algorithm has been implemented above using bottom up approach. It is also
possible to follow top-down approach starting with all data points assigned in the same
cluster and recursively performing splits till each data point is assigned a separate cluster.
The decision of merging two clusters is taken on the basis of closeness of these
clusters. There are multiple metrics for deciding the closeness of two clusters :
o Euclidean distance: ||a − b||₂ = √(Σᵢ (aᵢ − bᵢ)²)
o Squared Euclidean distance: ||a − b||₂² = Σᵢ (aᵢ − bᵢ)²
o Manhattan distance: ||a − b||₁ = Σᵢ |aᵢ − bᵢ|
o Maximum distance: ||a − b||∞ = maxᵢ |aᵢ − bᵢ|
o Mahalanobis distance: √((a − b)ᵀ S⁻¹ (a − b)), where S is the covariance matrix
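Assuming SciPy is available, agglomerative clustering with these distance choices can be run in a few lines (made-up 2-D points for illustration; 'single' linkage uses the minimum pairwise Euclidean distance between clusters):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# five made-up 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [5.2, 4.9], [9.0, 1.0]])

# bottom-up merging; the result Z encodes the dendrogram
Z = linkage(X, method="single", metric="euclidean")

# cut the dendrogram so that at most 3 clusters remain
print(fcluster(Z, t=3, criterion="maxclust"))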
K Means is found to work well when the shape of the clusters is hyper spherical
(like circle in 2D, sphere in 3D).
K Means clustering requires prior knowledge of K i.e. no. of clusters you want to
divide your data into. But, you can stop at whatever number of clusters you find
appropriate in hierarchical clustering by interpreting the dendrogram.
Outlier Detection
Outliers: Assume that a given statistical process is used to generate a set of data objects.
An outlier is a data object that deviates significantly from the rest of the objects, as if it
were generated by a different mechanism. For ease of presentation within this chapter, we
may refer to data objects that are not outliers as “normal” or expected data. Similarly, we
may refer to outliers as “abnormal” data.
Outliers are different from noisy data. Noise is a random error or variance in a measured
variable. In general, noise is not interesting in data analysis, including outlier detection.
For example, in credit card fraud detection, a customer’s purchase behavior can be
modeled as a random variable. A customer may generate some “noise transactions” that
may seem like “random errors” or “variance,” such as by buying a bigger lunch one day,
or having one more cup of coffee than usual. Such transactions should not be treated as
outliers; otherwise, the credit card company would incur heavy costs from verifying that
many transactions. The company may also lose customers by bothering them with
multiple false alarms. As in many other data analysis and data mining tasks, noise should
be removed before outlier detection. Outliers are interesting because they are suspected of
not being generated by the same mechanisms as the rest of the data. Therefore, in outlier
detection, it is important to justify why the outliers detected are generated by some other
mechanisms. This is often achieved by making various assumptions on the rest of the data
and showing that the outliers detected violate those assumptions significantly. Outlier
detection is also related to novelty detection in evolving data sets. For example, by
monitoring a social media web site where new content is incoming, novelty detection may
identify new topics and trends in a timely manner. Novel topics may initially appear as
outliers. To this extent, outlier detection and novelty detection share some similarity in
modeling and detection methods. However, a critical difference between the two is that in
novelty detection, once new topics are confirmed, they are usually incorporated into the
model of normal behavior so that follow-up instances are not treated as outliers anymore.
Fig.4.4 : Outliers
Types of Outliers In general, outliers can be classified into three categories, namely
global outliers, contextual (or conditional) outliers, and collective outliers. Let’s examine
each of these categories.
Global Outliers In a given data set, a data object is a global outlier if it deviates
significantly from the rest of the data set. Global outliers are sometimes called point
anomalies, and are the simplest type of outliers. Most outlier detection methods are aimed
at finding global outliers.
Contextual Outliers: In a given data set, a data object is a contextual outlier if it deviates significantly with respect to a specific context of the object. The context is specified using two groups of attributes:
Contextual attributes: The contextual attributes of a data object define the object’s
context. In the temperature example, the contextual attributes may be date and location.
Behavioral attributes: These define the object’s characteristics, and are used to evaluate
whether the object is an outlier in the context to which it belongs. In the temperature
example, the behavioral attributes may be the temperature, humidity, and pressure
Unlike global outlier detection, in contextual outlier detection, whether a data object is an
outlier depends on not only the behavioral attributes but also the contextual attributes. A
configuration of behavioral attribute values may be considered an outlier in one context
(e.g., 28◦C is an outlier for a Toronto winter), but not an outlier in another context (e.g.,
28◦C is not an outlier for a Toronto summer). Contextual outliers are a generalization of
local outliers, a notion introduced in density-based outlier analysis approaches. An object
in a data set is a local outlier if its density significantly deviates from the local area in
which it occurs.
Collective Outliers: Given a data set, a subset of data objects forms a collective outlier if the objects as a whole deviate significantly from the entire data set. Importantly, the individual data objects may not be outliers.
Challenges of Outlier Detection Outlier detection is useful in many applications yet faces
many challenges such as the following:
Modeling normal objects and outliers effectively. Outlier detection quality highly
depends on the modeling of normal (nonoutlier) objects and outliers. Often, building a
comprehensive model for data normality is very challenging, if not impossible. This is
partly because it is hard to enumerate all possible normal behaviors in an application. The
border between data normality and abnormality (outliers) is often not clear cut. Instead,
there can be a wide range of gray area. Consequently, while some outlier detection
methods assign to each object in the input data set a label of either “normal” or “outlier,”
other methods assign to each object a score measuring the “outlier-ness” of the object.
Handling noise in outlier detection. As mentioned earlier, outliers are different from
noise. It is also well known that the quality of real data sets tends to be poor. Noise often
unavoidably exists in data collected in many applications. Noise may be present as
deviations in attribute values or even as missing values. Low data quality and the
presence of noise bring a huge challenge to outlier detection. They can distort the data,
blurring the distinction between normal objects and outliers. Moreover, noise and missing
data may “hide” outliers and reduce the effectiveness of outlier detection—an outlier may
appear “disguised” as a noise point, and an outlier detection method may mistakenly
identify a noise point as an outlier.
Understandability. In some application scenarios, a user may want to not only detect
outliers, but also understand why the detected objects are outliers. To meet the
understandability requirement, an outlier detection method has to provide some
justification of the detection. For example, a statistical method can be used to justify the
degree to which an object may be an outlier based on the likelihood that the object was
generated by the same mechanism that generated the majority of the data. The smaller the
likelihood, the more unlikely the object was generated by the same mechanism, and the
more likely the object is an outlier.
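For instance, a very simple statistical justification assumes the normal data follow one Gaussian distribution and scores each value by how many standard deviations it lies from the mean (a minimal sketch with made-up temperature readings; real methods are considerably more elaborate):

import statistics

def zscore_outliers(values, threshold=3.0):
    # flag values whose distance from the mean exceeds threshold standard deviations
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v, round((v - mu) / sigma, 2)) for v in values if abs(v - mu) > threshold * sigma]

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
print(zscore_outliers(temps, threshold=2.0))   # 24.0 is flagged; the rest are "normal"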
There are many outlier detection methods in the literature and in practice. Here, we
present two orthogonal ways to categorize outlier detection methods. First, we categorize
outlier detection methods according to whether the sample of data for analysis is given
with domain expert–provided labels that can be used to build an outlier detection model.
Second, we divide methods into groups according to their assumptions regarding normal
objects versus outliers.
UNIT V
WEB AND TEXT MINING
Introduction
Text Mining is the discovery by computer of new, previously unknown information, by
automatically extracting information from different written resources. A key element is
the linking together of the extracted information to form new facts or new
hypotheses to be explored further by more conventional means of experimentation. Text
mining is different from what we’re familiar with in web search. In search, the user is
typically looking for something that is already known and has been written by someone
else. The problem is pushing aside all the material that currently isn’t relevant to your
needs in order to find the relevant information. In text mining, the goal is to discover
heretofore unknown information, something that no one yet knows and so could not have
yet written down.
Another application is to aid in the automatic classification of texts. For example, it
is possible to “filter” out automatically most undesirable “junk email” based on certain
terms or words that are not likely to appear in legitimate messages, but instead identify
undesirable electronic mail. In this manner, such messages can automatically be
discarded. Such automatic systems for classifying electronic messages can also be useful for assigning documents to topic categories, following the scheme outlined below (a small code sketch follows the steps):
1. Train: Create an attribute dictionary where the attributes represent words from articles related to a particular topic. Choose only words that occur a minimum number of times.
2. Classify: Check each document to be classified for the presence and frequency of the chosen attributes. Classify the document under a particular topic if it contains a predetermined minimum number of references to the chosen attributes for the topic.
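The scheme can be sketched in a few lines of Python (hypothetical names; the dictionaries and threshold are chosen only for illustration):

def classify_document(words, topic_dictionaries, min_hits=2):
    # count how many of the chosen attribute words occur in the document, per topic,
    # and assign every topic that reaches the predetermined minimum number of references
    hits = {topic: sum(1 for w in words if w.lower() in vocab)
            for topic, vocab in topic_dictionaries.items()}
    return [topic for topic, h in hits.items() if h >= min_hits]

topics = {"cricket": {"wicket", "innings", "bowler"},
          "finance": {"stock", "dividend", "market"}}
doc = "the bowler took a wicket in the final innings of the match".split()
print(classify_document(doc, topics))   # ['cricket']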
Web Mining
Web mining is the process which applies various data mining techniques to extract knowledge from web data, categorized as web content, web structure and web usage data. It includes a process of discovering useful and previously unknown information from the web data.
1. Web Content
2. Web Structure
3. Web Usage
Text Mining
Text and data mining are considered complementary techniques required for efficient business management. Data mining and text mining tools have each secured their place in the marketplace. Natural Language Processing is a subset of text mining techniques used to define accurate and complete domain-specific taxonomies, which helps in effective metadata association. Text mining is more mature and efficient in comparison with the general data mining process; about 80 percent of the available information is in the form of text.
The objective of text mining is to exploit the information contained in textual documents, in the form of patterns and trends, associations among entities, and predictive rules. The results are then used for:
1. The analysis of a document collection.
2. Providing information for intelligent navigation and browsing methods.
Data cleansing allows you to extract and retain the valuable information hidden within the data and helps identify the roots of specific words. Typical steps are:
Convert all the relevant information extracted from unstructured data into
structured formats.
Analyse the patterns within the data via Management Information System (MIS).
Store all the valuable information into a secure database to drive trend analysis
and enhance the decision-making process of the organisation.
Some text mining tools are designed exclusively for analysing the performance of social media platforms. These help to track
and interpret the texts generated online from the news, blogs, emails, etc. Furthermore,
text mining tools can efficiently analyse the number of posts, likes, and followers of your
brand on social media, thereby allowing you to understand the reaction of people who are
interacting with your brand and online content. The analysis will enable you to
understand ‘what’s hot and what’s not’ for your target audience.
Unstructured text
Unstructured data is information, in many different forms, that doesn't hew to
conventional data models and thus typically isn't a good fit for a mainstream
relational database. Thanks to the emergence of alternative platforms for storing and
managing such data, it is increasingly prevalent in IT systems and is used by organizations
in a variety of business intelligence and analytics applications.
Traditional structured data, such as the transaction data in financial systems and other
business applications, conforms to a rigid format to ensure consistency in processing and
analyzing it. Sets of unstructured data, on the other hand, can be maintained in formats
that aren't uniform, freeing analytics teams to work with all of the available data without
necessarily having to consolidate and standardize it first. That enables more
comprehensive analyses than would otherwise be possible.
Types of unstructured data
One of the most common types of unstructured data is text. Unstructured text is generated
and collected in a wide range of forms, including Word documents, email messages,
PowerPoint presentations, survey responses, transcripts of call center interactions, and
posts from blogs and social media sites.
Other types of unstructured data include images, audio and video files. Machine data is
another category, one that's growing quickly in many organizations. For example, log
files from websites, servers, networks and applications -- particularly mobile ones -- yield
a trove of activity and performance data. In addition, companies increasingly capture and
analyze data from sensors on manufacturing equipment and other internet of things (IoT)
connected devices.
In some cases, such data may be considered to be semi-structured -- for example,
if metadata tags are added to provide information and context about the content of the
data. The line between unstructured and semi-structured data isn't absolute, though; some
data management consultants contend that all data, even the unstructured kind, has some
level of structure.
Hierarchy of categories
Concept hierarchies may also be defined by discretizing or grouping values for a given
dimension or attribute, resulting in a set-grouping hierarchy. A total or partial order can
be defined among groups of values. An example of a set-grouping hierarchy is shown in the figure for the dimension Root, where an interval (vacation…Work] denotes the range from vacation (exclusive) to Work (inclusive).
Text clustering
Text clustering is the application of cluster analysis to text-based documents. It uses
machine learning and natural language processing (NLP) to understand and categorize
unstructured, textual data.
Document Retrieval: To improve recall, start by adding other documents from the same
cluster.
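Assuming scikit-learn is available, a minimal text clustering sketch vectorizes the documents with TF-IDF and groups them with K-means (made-up documents and parameters, for illustration only):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the stock market rallied as dividends rose",
    "investors watched the market and bought stock",
    "the bowler took five wickets in the innings",
    "a brilliant innings won the cricket match",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)  # unstructured text -> features
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)   # documents on the same topic should receive the same cluster label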
TUTORIAL QUESTIONS
UNIT-I
UNIT-II
1. List the techniques of efficiency of Apriori algorithm?
2. Name the pruning strategies in mining closed frequent itemsets?
3. State maximal frequent itemset?
4. Name the steps in association rule mining?
5. State how can we mine closed frequent itemsets?
6. List the applications of pattern mining?
7. Classify the confidence rule for itemsets A and B?
8. Interpret the rule of support for itemsets A and B?
UNIT-III
1. Name the steps in data classification?
2. Differentiate supervised learning and unsupervised learning?
3. State gain ratio?
4. State Gini index?
5. List the Attribute Selection Measures?
6. Define Naïve Bayesian Classification?
7. Define Bayes’ Theorem?
8. Explain the IF-THEN rules for classification?
UNIT-IV
1. Define Clustering?
2. Define nominal, ordinal and ratio scaled variables?
3. Define CLARA and CLARANS?
4. State hierarchical method?
5. State K-Means method?
6. Define Outlier Detection?
7. Define Chameleon method?
8. List the fields in which clustering techniques are used?
UNIT-V
ASSIGNMENT QUESTIONS
UNIT-I
1. Describe data mining? In your answer, address the following:
a) Is it another hype?
b) Is it a simple transformation of Technology developed from databases, statistics,
and machine learning?
c) Explain how the evolution of database technology led to data mining.
d) Describe the steps involved in data mining when viewed as a process of knowledge
discovery.
2. Consider the following data for analysis, which includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20,
21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94 years.
c) Use normalization by decimal scaling to transform the value 35 for age.
d) Comment on which method you would prefer to use for the given data, giving reasons as to why.
UNIT-II
1. Discuss which algorithm is an influential algorithm for mining frequent itemsets
for boolean association rules? Explain with an example?
2. Apply the Apriori algorithm to the following database of five transactions. Let min_sup = 60% and min_conf = 80%.
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
UNIT-III
1. Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision
tree and then converting the pruned tree to rules. What advantage does (a) have over
(b)?
2. Write an algorithm for k-nearest-neighbor classification given k and n, the number
of attributes describing each tuple.
UNIT-IV
1. Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision
tree and then converting the pruned tree to rules. What advantage does (a) have over
(b)?
2. Write an algorithm for k-nearest-neighbor classification given k and n, the number
of attributes describing each tuple.
UNIT-V
1. Discuss the terminologies associated with Web Structure.
UNIT WISE IMPORTANT QUESTIONS
UNIT-I
PART-A (2 or 3 marks)
1. Define Data mining.
2. Write the differences of data base, data warehouse and data mining with real
time example?
3. Give some alternative terms for Data Mining.
4. What is KDD?
5. What are the steps involved in KDD process?
6. What is the use of knowledge base?
7. Mention some of the data mining techniques?
8. What is the purpose of Data mining Technique?
9. What is data mining and why it is required?
10. What are the applications of data mining?
11. Define Predictive model.
12. Define Descriptive model?
13. Define Data characterization.
14. Define Data discrimination.
15. What is meant by concept description?
16. Define data preprocessing.
17. Define data cleaning.
18. Define data reduction.
19. Define data transformation.
20. Define data integration.
21. What is binning?
22. What is missing data?
23. Define data aggregation.
24. What is meant by dimensionality reduction?
25. What is Feature Subset Selection?
26. What are the primitives that define a task?
27. What is DMQL?
28. List the different coupling schemes used in a data mining system.
29. What is the entropy of an attribute?
30. Give the formula for mean, median and mode.
UNIT-II
PART-A (2 or 3 marks)
1. Define Association Rule Mining.
2. Define frequent itemset.
3. When we can say the association rules are interesting?
4. Explain Association rule in mathematical notations.
5. Define support and confidence in Association rule mining.
6. How is association rules mined from large databases?
7. Describe the different classifications of Association rule mining.
8. What is the purpose of Apriori Algorithm?
9. What is boolean association rule?
10. How to generate association rules from frequent Item sets?
11. Give few techniques to improve the efficiency of Apriori algorithm.
12. What are the factors affecting the performance of the Apriori candidate generation technique?
13. Describe the method of generating frequent item sets without
candidate generation.
Transaction_ID Item_ID
T1 I2,I4,I8,I6
T2 I1,I3,I5
T3 I6,I7,I1
T4 I1,I3,I4
T5 I2,I4,I7
UNIT-III
PART-A (2 or 3 marks)
1. Define the concept of classification?
2. Define the concept of Prediction?
3. What is Decision Tree?
4. What is Attribute Selection Measure?
5. Describe Tree pruning methods.
6. Define Prepruning.
7. Define Post pruning.
8. Why is naïve Bayesian classification called ‘naïve’?
9. What is K-nearest neighbor?
10. What is Best-split?
11. What is a Belief network?
PART-B (5 -10 marks)
UNIT-IV
PART-A (2 or 3 marks)
1. Define Clustering.
2. What do you mean by Cluster Analysis?
3. What are the fields in which clustering techniques are used?
4. What are the requirements of cluster analysis?
5. What are the different types of data used for cluster analysis?
6. What are interval-scaled variables?
7. Define Binary variables. And what are the two types of binary variables?
8. Define nominal, ordinal and ratio scaled variables.
9. What is meant by partitioning method?
10. What is K-mean Algorithm?
11. What is PAM?
12. Define CLARA and CLARANS.
13. What is Hierarchical method?
14. Differentiate Agglomerative and Divisive Hierarchical Clustering.
15. Define Density based method.
16. What is a DBSCAN?
17. What is meant by Grid Based Method?
18. What is a STING?
19. Define Wave Cluster.
PART-B (5 -10 marks)
UNIT-V
PART-A (2 or 3 marks)
1. Explain about Web Mining.
2. Discuss about Text Mining.
3. Specify the Web Mining Classification.
4. Discuss about Web Content Mining.
5. Explain about the Web Structure Mining.
6. Discuss about the terminologies associated with Web Structure.
7. Explain about the Web Usage Mining and its Applications.
8. Discuss about the Application of Text Mining.
9. Discuss about Episode rule discovery for texts.
10. Illustrate about the Hierarchy of categories.
11. Discuss about Text Clustering.
OBJECTIVE QUESTIONS
UNIT-I
1. ________ is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management decisions.
A. Data Mining.
B. Data Warehousing.
C. Web Mining.
D. Text Mining.
4. The important aspect of the data warehouse environment is that data found within the data warehouse is ________.
A. subject-oriented.
B. time-variant.
C. integrated.
D. All of the above.
19. The key used in the operational environment may not have an element of ________.
A. time.
B. cost.
C. frequency.
D. quality.
UNIT-II
21. A record cannot be updated in ________.
A. OLTP
B. files
C. RDBMS
D. data warehouse
25. Bill Inmon has estimated that ________ of the time required to build a data warehouse is consumed in the conversion process.
A. 10 percent.
B. 20 percent.
C. 40 percent
D. 80 percent.
29. The biggest drawback of the level indicator in the classic star-schema is that it limits ________.
A. quantify.
B. qualify.
C. flexibility.
D. ability.
D. a process to upgrade the quality of data before it is moved into a data warehouse
UNIT-III
41. The type of relationship in a star schema is ________.
A. many-to-many.
B. one-to-one.
C. one-to-many.
D. many-to-one.
45. The data administration subsystem helps you perform all of the following, except ________.
A. backups and recovery.
B. query optimization.
C. security management.
D. create, change, and delete information.
46. The most common source of change data in refreshing a data warehouse is ________.
A. queryable change data.
B. cooperative change data.
C. logged change data.
D. snapshot change data.
47. ________ are responsible for running queries and reports against data warehouse tables.
A. Hardware.
B. Software.
C. End users.
D. Middle ware.
C. derived attributes.
D. composite attributes.
51. ________ is a method of incremental conceptual clustering.
A. CORBA.
B. OLAP.
C. COBWEB.
D. STING.
53. The main organizational justification for implementing a data warehouse is to provide ________.
A. cheaper ways of handling transportation.
B. decision support.
C. storing large volume of data.
D. access to data.
60. ________ are designed to overcome any limitations placed on the warehouse by the nature of the relational data model.
A. Operational database.
B. Relational database.
C. Multidimensional database.
D. Data repository.
UNIT-IV
61. ________ are designed to overcome any limitations placed on the warehouse by the nature of the relational data model.
A. Operational database.
B. Relational database.
C. Multidimensional database.
68. The term that is not associated with the data cleaning process is ________.
A. domain consistency.
B. deduplication.
C. disambiguation.
D. segmentation.
69. ________ are some popular OLAP tools.
A. Metacube, Informix.
B. Oracle Express, Essbase.
C. HOLAP.
D. MOLAP.
79. The first International conference on KDD was held in the year ________.
A. 1996.
B. 1997.
C. 1995.
D. 1994.
UNIT-V
81. ________ contains information that gives users an easy-to-understand perspective of the information stored in the data warehouse.
A. Business metadata.
B. Technical metadata.
C. Operational metadata.
D. Financial metadata.
82. ________ helps to integrate, maintain and view the contents of the data warehousing system.
A. Business directory.
B. Information directory.
C. Data dictionary.
D. Database.
84. Data marts that incorporate data mining tools to extract sets of data are called ________.
A. independent data mart.
B. dependent data marts.
C. intra-entry data mart.
D. inter-entry data mart.
85. ________ can generate programs itself, enabling it to carry out new tasks.
A. Automated system.
B. Decision making system.
C. Self-learning system.
D. Productivity system.
D. five.
90. ________ is data that is distilled from the low level of detail found at the current detailed level.
A. Highly summarized data.
B. Lightly summarized data.
C. Metadata.
D. Older detail data.
92. A directory to help the DSS analyst locate the contents of the data warehouse is seen in ________.
A. Current detail data.
B. Lightly summarized data.
C. Metadata.
D. Older detail data.
99. ________ of data means that the attributes within a given entity are fully dependent on the entire primary key of the entity.
A. Additivity.
B. Granularity.
C. Functional dependency.
D. Dimensionality.
100. A fact is said to be fully additive if ________.
A. it is additive over every dimension of its dimensionality.
B. additive over at least one but not all of the dimensions.
C. not additive over any dimension.
D. None of the above.
INTERNAL QUESTION PAPERS
MID-I (2017-2018)
DATA MINING (IV– CSE)
Part – A
Answer all questions
1. Design a Data Warehouse in words of Inmon.
2. List out the applications of data mining?
3. Write the differences of data base, data warehouse and data mining with real
time example?
4. Design OLAP Cube.
5. List out the OLAP operations in multidimensional data model?
Part – B
1. (a) Write the logic for binning method for using Smoothing Technique?
4 1 3 2 6 11 8 5 10 9 7 12 13
(Or)
(b) Design various star schema and snowflake schemas in Multi Dimensional Data
Model?
(Or)
(b) Design major difference between OLTP and OLAP.
MID-II (2017-2018)
DATA MINING
Part – A
Answer all questions
1. Define a classification process?
3. Define Cluster?
4. Give few techniques to improve the efficiency of Apriori algorithm?
Part – B
MID-I (2016-2017)
DATA MINING
Part – A
Answer all questions
1. Design a Operational Databases.
2. Design OLAP Cube.
3. Design a Data Mining.
4. Implement Market Basket Analysis.
5. Design Facts in Data Warehousing.
Part – B
2. (a) Write the logic for binning method for using Smoothing
Technique? 4 4 3 2 6 7 8 5 5 3 7 9 9
(Or)
(b) Design various schemas in Multi Dimensional Data Model?
Transaction_ID Item_ID
T1 I8,I5,I8,I9
T2 I2,I3,I6
T3 I6,I7,I5
T4 I5,I3,I8
T5 I2,I4,I8
(Or)
(b) Implement any two data reduction techniques?
MID-II (2016-2017)
DATA MINING
Part – A
Answer all questions
1. Design a concept of classification?
2. Why is naïve Bayesian classification called ‘naïve’?
3. Design Attribute Selection Measure?
4. Design Clustering and requirements of cluster analysis?
5. Implement Maximal Frequent item set and Closed Frequent item set?
Part – B
1. (a) Design classification by Decision tree induction Algorithm?
(Or)
(b) Design Measures for selecting best split and Estimate the Accuracy of
Classifier?
(Or)
EXTERNAL QUESTION PAPERS
DATA MINING
(Common to CSE & IT)
2.a) Explain any two of the schemas for multidimensional databases. [5+5]
OR
3.a) What are OLAP operations in the multidimensional data model? Explain.[5+5]
OR
5.a) List and describe the five primitives for specifying a data mining task. [5+5]
b) In real-world data, tuples with missing values for some attributes are a common
occurrence. Describe various methods for handling this problem.
6. Write the the Apriori algorithm for discovering frequent item sets for mining
Boolean association rules. [10]
OR
7.a) How can we mine closed frequent item sets? Explain. [5+5]
OR
9.a.What are the measures for selecting the Best Split? Explain. [5+5]
OR
PART - A
(25 Marks)
2. Explain in detail about 3 tier Data Warehousing architecture with a neat sketch.
[10]
OR
OR
OR
8.a)What are the general approaches to consider for solving classification problem?
OR
9.a.) How to evaluate any classifier model, which was build for classification. [5+5]
10. Explain with example how clustering can be with large databases. [10]
OR
11.Explain about Web Mining and Text Mining with an example. [10]
Time: 3 hours
Max. Marks: 70
UNIT – I
OR
UNIT – II
OR
UNIT – III
OR
UNIT – IV
OR
UNIT – V
OR
Time: 3 hours
Max. Marks: 75
Part A is compulsory which carries 25 marks. Answer all questions in Part A. Part B consists of 5 Units. Answer any one full question from each unit. Each question carries 10 marks.
2.a) Compare and contrast operational database systems with data warehouse.
OR
OR
a) Cosine measure
b) Euclidean distance
6. Find the frequent itemsets and strong association rules for the following
transactional database table using Apriori algorithm. Consider the thresholds
as
support = 30% and confidence = 40%. [10]
1  I1, I2, I3, I5
2  I2, I5, I7, I9
3  I1, I3, I5, I7
4  I2, I4, I6, I8
5  I1, I2, I3, I4
6  I2, I3, I4, I5
7  I3, I4, I5, I6
8  I4, I5, I6, I7
9  I5, I6, I7, I8, I9
10 I9, I1, I2, I5
11 I8, I2, I9, I7
12 I5, I6, I3, I2
OR
7.a) Demonstrate construction of FP-tree for the data from Question (6).
8.a) Define information gain and explain its importance in decision tree induction.
OR
9.a) State Bayes theorem. How can it be applied for data classification?
OR
---ooOoo---
Part A is compulsory which carries 25 marks. Answer all questions in Part A. Part B consists of 5 Units. Answer any one full question from each unit. Each question carries 10 marks.
2. What are the various components of data warehouse? Explain their functionality
in detail. [10]
OR
OR
5. What is the need of dimensionality reduction? Explain any two techniques for
Find all frequent item sets using Apriori algorithm. List all the strong association
rules. [10]
OR
8. Explain decision tree induction algorithm for classifying data tuples and discuss
suitable example. [10]
OR
10.What is the goal of clustering? How does partitioning around medoids algorithm
achieve this goal? [10]
OR
DATA MINING
DATA MINING
Time: 3 hours
Max. Marks: 70
PART – A
(Compulsory Question)*****
(c) Explain about decision tree. Explain its role in data mining process.
(f) List and explain in brief the attributes that are used to compare
classification and prediction methods.
(h) Explain in brief the process of mining temporal patterns in a data stream.
PART – B
UNIT – I
OR
3. List and explain the processing steps that may be applied to the data for data
mining.
UNIT – II
4. Explain the process of tree induction. Describe its implementation using Hunt’s
algorithm.
OR
5. Explain the term over fitting. Discuss how this over fitting can be addressed.
UNIT – III
6. Explain how Bayesian method is used for classification in data mining process.
OR
UNIT – IV
OR
UNIT – V
OR
11. Define cluster analysis. List and explain the applications of cluster analysis. Also
explain various types of clustering.
*****
DATA MINING
Time: 3 hours
Max. Marks: 75
Part A is compulsory which carries 25 marks. Answer all questions in Part A. Part B consists of 5 Units. Answer any one full question from each unit. Each question carries 10 marks.
PART - A
(25 Marks)
PART - B
(50 Marks)
2.a) Give examples for defining star, snowflake and fact constellation schemas.
OR
OR
6. Can we design a method that mines the complete set of frequent item sets without candidate generation? If so, explain.
OR
8. Why is naive Bayesian classification called “naive”? Briefly outline the major
ideas of naive Bayesian classification. Explain Naive-Bayes classification. [10]
OR
9.a) What are the methods for expressing attribute test conditions? Explain.
OR
–--ooOoo----
DATA MINING
c) How can you go about filling in the missing values for this attribute? [2]
PART – B
(50 Marks)
b) Discuss the star and snowflake schema in detail with suitable example. [5+5]
OR
3.a) Write the difference between designing a data warehouse and an OLAP cube.
OR
6. Make a comparison of Apriori and ECLAT algorithms for frequent item set
mining in transactional databases. Apply these algorithms to the following data:
[10]
OR
OR
9. How does the Naïve Bayesian classification works? Explain in detail. [10]
OR
11. What are the different clustering methods? Explain in detail. [10]
DATA MINING
Max. Marks: 70
Time: 3 hours
PART – A
(f) List and explain in brief the attributes that are used to compare classification and
prediction methods.
(h) Explain in brief the process of mining temporal pattern in data stream.
UNIT – I
OR
UNIT – II
OR
UNIT – III
6.Explain how Rule Based method is used for classification in data mining process.
OR
UNIT – IV
8.Discuss in detail various methods that improve the efficiency of Apriori algorithm.
OR
9. Illustrate with example in detail, the process of generating association rules from
frequent itemsets.
UNIT – V
OR
*****
References
1. Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers / Elsevier, 2nd Edition, 2006.