Data Mining Digital Notes
UNIT - I
Introduction to Data Mining: Introduction, What is Data Mining, Definition,
KDD, Challenges, Data Mining Tasks, Data Preprocessing, Data Cleaning,
Missing data, Dimensionality Reduction, Feature Subset Selection, Discretization and
Binarization, Data Transformation; Measures of Similarity and Dissimilarity - Basics.
UNIT - II
Association Rules: Problem Definition, Frequent Item Set Generation, The
APRIORI Principle, Support and Confidence Measures, Association Rule Generation;
APRIORI Algorithm, The Partition Algorithms, FP-Growth Algorithms, Compact
Representation of Frequent Item Set- Maximal Frequent Item Set, Closed Frequent Item
Set.
UNIT - III
Classification: Problem Definition, General Approaches to solving a classification
problem, Evaluation of Classifiers, Classification techniques, Decision Trees-
Decision tree Construction , Methods for Expressing attribute test conditions, Measures
for Selecting the Best Split, Algorithm for Decision tree Induction ; Naive- Bayes
Classifier, Bayesian Belief Networks; K- Nearest neighbor classification-Algorithm and
Characteristics.
UNIT - IV
Clustering: Problem Definition, Clustering Overview, Evaluation of Clustering Algorithms,
Partitioning Clustering-K-Means Algorithm, K-Means Additional issues, PAM Algorithm;
Hierarchical Clustering-Agglomerative Methods and divisive methods, Basic Agglomerative
Hierarchical Clustering Algorithm, Specific techniques, Key Issues in Hierarchical
Clustering, Strengths and Weaknesses; Outlier Detection.
UNIT - V
Web and Text Mining: Introduction, web mining, web content mining, web
structure mining, web usage mining, Text mining - unstructured text, episode rule
discovery for texts, hierarchy of categories, text clustering.
TEXT BOOKS:
1. Data Mining- Concepts and Techniques- Jiawei Han, Micheline
Kamber, Morgan Kaufmann Publishers, Elsevier, 2nd Edition, 2006.
2. Introduction to Data Mining, Pang-Ning Tan, Vipin Kumar,
Michael Steinbach, Pearson Education.
3. Data mining Techniques and Applications, Hongbo Du Cengage India Publishing
REFERENCE BOOKS:
1. Data Mining Techniques, Arun K Pujari, 3rd Edition, Universities Press.
2. Data Mining Principles & Applications - T.V. Suresh Kumar,
B. Eswara Reddy, Jagadish S Kalimani, Elsevier.
3. Data Mining, Vikram Pudi, P. Radha Krishna, Oxford University Press.
INDEX

UNIT - I
1.1 Introduction to Data Mining, Definition
1.2 Knowledge Discovery in Databases (KDD)
1.3 Challenges
1.4 Data Mining Tasks
1.5 Data Preprocessing
1.6 Data Cleaning
1.7 Dimensionality Reduction
1.8 Feature Subset Selection
1.9 Discretization and Binarization
1.10 Data Transformation
1.11 Measures of Similarity and Dissimilarity

UNIT - II
2.1 Problem Definition
2.2 Frequent Item Set Generation
2.3 The APRIORI Principle, Association Rule Generation, APRIORI Algorithm
2.4 The Partition Algorithms, FP-Growth Algorithms
2.5 Compact Representation of Frequent Item Sets - Maximal Frequent Item Set
2.6 Closed Frequent Item Set

UNIT - III
3.1 Classification: Problem Definition, General Approaches to Solving a Classification Problem
3.2 Evaluation of Classifiers, Classification Techniques
3.3 Decision Trees - Decision Tree Construction
3.4 Methods for Expressing Attribute Test Conditions, Measures for Selecting the Best Split
3.5 Algorithm for Decision Tree Induction, Naive Bayes Classifier
3.6 Bayesian Belief Networks
3.7 K-Nearest Neighbor Classification - Algorithm and Characteristics

UNIT - IV
4.1 Clustering: Problem Definition, Clustering Overview
4.2 Evaluation of Clustering Algorithms
4.3 Partitioning Clustering - K-Means Algorithm, K-Means Additional Issues
4.4 PAM Algorithm
4.5 Hierarchical Clustering - Agglomerative and Divisive Methods
4.6 Key Issues in Hierarchical Clustering, Strengths and Weaknesses
4.7 Outlier Detection

UNIT - V
5.1 Web and Text Mining: Introduction
5.2 Web Mining, Web Content Mining, Web Structure Mining, Web Usage Mining
5.3 Text Mining
5.4 Unstructured Text
5.5 Episode Rule Discovery for Texts, Hierarchy of Categories, Text Clustering

Tutorial Questions
Assignment Questions
Objective Questions
References
DATA MINING
Unit-I
Introduction to Data Mining
Data Mining is defined as the procedure of extracting information from huge sets of data.
There is a huge amount of data available in the Information Industry. This data is of no
use until it is converted into useful information. It is necessary to analyze this huge
amount of data and extract useful information from it.
Extraction of information is not the only process we need to perform; data mining also
involves other processes such as Data Cleaning, Data Integration, Data Transformation,
Data Mining, Pattern Evaluation and Data Presentation. Once all these processes are over,
we would be able to use this information in many applications such as Fraud Detection,
Market Analysis, Production Control, Science Exploration, etc.
Growth of data due to cards: The growing use of credit cards and loyalty cards is an
important area of data growth. In the USA, there has been tremendous growth in the use of
loyalty cards. Even in Australia, the use of cards like FlyBuys has grown considerably.
Growth in data due to the web: E-commerce developments have resulted in information
about visitors to Web sites being captured, once again resulting in mountains of data for
some companies.
Growth in data due to other sources: There are many other sources of data.
Some of them are:
Telephone Transactions
Frequent flyer transactions
Medical transactions
Immigration and customs transactions
Banking transactions
Motor vehicle transactions
Utilities (e.g., electricity and gas) transactions
Shopping transactions
Growth in data storage capacity: Another way of illustrating data growth is to consider
annual disk storage sales over the last few years.
Decline in the cost of processing: The cost of computing hardware has declined rapidly
over the last 30 years, coupled with an increase in hardware performance. Not only do the
prices for processors continue to decline, but the prices for computer peripherals have
also been declining.
Fraud Detection
Data mining is also used in the fields of credit card services and telecommunications to
detect fraud. For fraudulent telephone calls, it helps identify the destination of the call,
the duration of the call, the time of day or week, and so on. It also analyzes patterns
that deviate from expected norms.
Challenges
Data mining is not an easy task, as the algorithms used can get very complex, and data are
not always available in one place; they need to be integrated from various heterogeneous
data sources. These factors also create some issues. Here, we will discuss the major issues
regarding:
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
Handling noisy or incomplete data − Data cleaning methods are required to handle noise
and incomplete objects while mining data regularities. Without such methods, the accuracy
of the discovered patterns will be poor.
Pattern evaluation − The patterns discovered may be uninteresting because they represent
common knowledge or lack novelty; evaluating the interestingness of patterns is therefore a challenge.
Performance Issues
There can be performance-related issues such as follows −
Parallel, distributed, and incremental mining algorithms − Factors such as the huge
size of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are processed in parallel; the results
from the partitions are then merged. Incremental algorithms update the knowledge already
mined rather than mining the data again from scratch.
Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For
example, in a company, the classes of items for sale include computers and printers, and
concepts of customers include big spenders and budget spenders. Such descriptions of a
class or a concept are called class/concept descriptions. These descriptions can be derived
in the following two ways −
Data Characterization − This refers to summarizing the data of the class under study.
This class under study is called the target class.
Data Discrimination − This refers to comparing the target class with one or more
predefined contrasting groups or classes.
Mining of Association
Associations are used in retail sales to identify items that are frequently purchased
together. This refers to the process of uncovering relationships among data and
determining association rules.
For example, a retailer generates an association rule showing that 70% of the time milk is
sold with bread and only 30% of the time biscuits are sold with bread.
Mining of Correlations
This is a kind of additional analysis performed to uncover interesting statistical correlations
between associated attribute-value pairs or between two itemsets, to analyze whether they
have a positive, negative, or no effect on each other.
Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different from the
objects in other clusters.
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts, so that the model can be used to predict the class
of objects whose class label is unknown. The derived model is based on the analysis of a
set of training data. The derived model can be presented in the following forms −
Classification (IF-THEN) Rules
Decision Trees
Mathematical Formulae
Neural Networks
The list of functions involved in these processes is as follows −
Classification − It predicts the class of objects whose class label is unknown. Its
objective is to find a derived model that describes and distinguishes data classes or
concepts. The derived model is based on the analysis of a set of training data, i.e., data
objects whose class labels are well known.
Prediction− It is used to predict missing or unavailable numerical data values
rather than class labels. Regression Analysis is generally used for prediction. Prediction
can also be used for identification of distribution trends based on available data.
Outlier Analysis− Outliers may be defined as the data objects that do not comply
with the general behavior or model of the data available.
Evolution Analysis − Evolution analysis refers to the description and modeling of
regularities or trends for objects whose behavior changes over time.
Data Preprocessing
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking
in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is
a proven method of resolving such issues. Data preprocessing prepares raw data for
further processing.
Data Cleaning
The quality of your data is critical to the final analysis. Data that are incomplete,
noisy, or inconsistent can affect your results.
Data cleaning in data mining is the process of detecting and removing corrupt or
inaccurate records from a record set, table or database.
Some data cleaning methods
Missing data
1. You can ignore the tuple. This is usually done when the class label is missing. This method is
not very effective unless the tuple contains several attributes with missing values.
2. You can fill in the missing value manually. This approach is effective for small data
sets with few missing values.
3. You can replace all missing attribute values with a global constant, such as a label like
"Unknown" or minus infinity.
4. You can use the attribute mean to fill in the missing value. For example, if the average
customer income is 25,000, you can use this value to replace a missing value for income.
5. You can use the most probable value to fill in the missing value (see the sketch below).
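A minimal sketch of methods 3-5 using the pandas library (an assumed dependency); the small table and its income values are purely illustrative.

```python
import pandas as pd

# Hypothetical customer table with missing income values (illustrative data only).
df = pd.DataFrame({"age": [23, 41, 35, 52], "income": [18000.0, None, 25000.0, None]})

# Method 3: replace missing values with a global constant.
filled_constant = df["income"].fillna(-1)  # or a label such as "Unknown" for categorical data

# Method 4: replace missing values with the attribute mean.
filled_mean = df["income"].fillna(df["income"].mean())

# Method 5 (one simple variant): use the most probable value, here the mode.
filled_mode = df["income"].fillna(df["income"].mode().iloc[0])

print(filled_constant.tolist(), filled_mean.tolist(), filled_mode.tolist())
```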
Noisy Data
Noise is a random error or variance in a measured variable. Noisy data may be due to
faulty data collection instruments, data entry problems, and technology limitations.
Binning
Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the
values around it. The sorted values are distributed into a number of "buckets," or bins.
For example, the data for price are first sorted and then partitioned into equal-frequency
bins of size 3.
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
In smoothing by bin boundaries, each bin value is replaced by the closest boundary value.
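The following is a small sketch of both smoothing variants; the nine price values are illustrative stand-ins for the figure that accompanied the original example.

```python
# Sorted data for price (illustrative values), partitioned into equal-frequency bins of size 3.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closest of the bin's min/max.
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```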
Regression
Data can be smoothed by fitting the data to a regression function.
Clustering
Outliers may be detected by clustering, where similar values are organized into groups, or
"clusters." Values that fall outside of the set of clusters may be considered outliers.
Dimensionality Reduction
Data reduction strategies include dimensionality reduction and numerosity reduction.
Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of random variables or
attributes under consideration. Dimensionality reduction methods include wavelet
transforms and principal components analysis,which transform or project the original data
onto a smaller space. Attribute subset selection is a method of dimensionality reduction in
which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and
removed.
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when
applied to a data vector X, transforms it to a numerically different vector, X 0 , of wavelet
coefficients. The two vectors are of the same length. When applying this technique to data
reduction, we consider each tuple as an n-dimensional data vector, that is, X =
(x1,x2,...,xn), depicting n measurements made on the tuple from n database attributes.3
“How can this technique be useful for data reduction if the wavelet transformed data are
of the same length as the original data?” The usefulness lies in the fact that the wavelet
transformed data can be truncated. A compressed approximation of the data can be
retained by storing only a small fraction of the strongest of the wavelet coefficients. For
example, all wavelet coefficients larger than some user-specified threshold can be
retained. All other coefficients are set to 0. The resulting data representation is therefore
very sparse, so that operations that can take advantage of data sparsity are
computationally very fast if performed in wavelet space. The technique also works to
remove noise without smoothing out the main features of the data, making it effective for
data cleaning as well. Given a set of coefficients, an approximation of the original data
can be constructed by applying the inverse of the DWT used. The DWT is closely related
to the discrete Fourier transform (DFT), a signal processing technique involving sines and
cosines. In general, however, the DWT achieves better lossy compression. That is, if the
same number of coefficients is retained for a DWT and a DFT of a given data vector, the
DWT version will provide a more accurate approximation of the original data. Hence, for
an equivalent approximation, the DWT requires less space than the DFT. Unlike the DFT,
wavelets are quite localized in space, contributing to the conservation of local detail.
There is only one DFT, yet there are several families of DWTs; some wavelet families are
shown in the accompanying figure. Popular wavelet transforms include the Haar-2,
Daubechies-4, and Daubechies-6. The general procedure for applying a discrete wavelet
transform uses a hierarchical pyramid algorithm that halves the data at each iteration,
resulting in fast computational speed. The method is as follows:
1. The length, L, of the input data vector must be an integer power of 2. This condition
can be met by padding the data vector with zeros as necessary (L ≥ n).
2. Each transform involves applying two functions. The first applies some data smoothing,
such as a sum or weighted average. The second performs a weighted difference, which acts
to bring out the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of
measurements (x2i, x2i+1). This results in two data sets of length L/2. In general, these
represent a smoothed or low-frequency version of the input data and its high-frequency
content, respectively.
4. The two functions are recursively applied to the data sets obtained in the previous
loop, until the resulting data sets are of length 2.
5. Selected values from the data sets obtained in the previous iterations are designated
the wavelet coefficients of the transformed data.
Fig. 1.5:
Equivalently, a matrix multiplication can be applied to the input data in order to obtain
the wavelet coefficients, where the matrix used depends on the given DWT. The matrix
must be orthonormal, meaning that the columns are unit vectors and are mutually
orthogonal, so that the matrix inverse is just its transpose. Although we do not have room
to discuss it here, this property allows the reconstruction of the data from the smooth and
smooth-difference data sets. By factoring the matrix used into a product of a few sparse
matrices, the resulting “fast DWT” algorithm has a complexity of O(n) for an input vector
of length n. Wavelet transforms can be applied to multidimensional data such as a data
cube. This is done by first applying the transform to the first dimension, then to the
second, and so on. The computational complexity involved is linear with respect to the
number of cells in the cube. Wavelet transforms give good results on sparse or skewed
data and on data with ordered attributes. Lossy compression by wavelets is reportedly
better than JPEG compression, the current commercial standard. Wavelet transforms have
many real-world applications, including the compression of fingerprint images, computer
vision, analysis of time-series data, and data cleaning.
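As a rough sketch of the pyramid procedure above, the following applies an unnormalized Haar transform: pairwise averages act as the smoothing function and pairwise (half-)differences as the detail function; the input length is assumed to be a power of 2.

```python
def haar_dwt(x):
    """Unnormalized Haar wavelet transform of a sequence whose length is a power of 2."""
    coeffs = []
    current = list(x)
    while len(current) > 1:
        smooth = [(current[2 * i] + current[2 * i + 1]) / 2 for i in range(len(current) // 2)]
        detail = [(current[2 * i] - current[2 * i + 1]) / 2 for i in range(len(current) // 2)]
        coeffs = detail + coeffs   # keep finer-level details after coarser ones
        current = smooth
    return current + coeffs        # overall average followed by the detail coefficients

data = [2, 2, 0, 2, 3, 5, 4, 4]
transformed = haar_dwt(data)
# Setting the smallest coefficients to 0 yields a sparse, approximate representation.
print(transformed)
```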
Principal Components Analysis (PCA)
The basic procedure is as follows:
1. The input data are normalized, so that each attribute falls within the same range. This
step helps ensure that attributes with large domains will not dominate attributes with
smaller domains.
2. PCA computes k orthonormal vectors that provide a basis for the normalized input
data. These are unit vectors that each point in a direction perpendicular to the others.
These vectors are referred to as the principal components. The input data are a linear
combination of the principal components.
3. The principal components are sorted in order of decreasing "significance" or strength.
They essentially serve as a new set of axes for the data, providing important information
about variance.
4. Because the components are sorted in decreasing order of "significance," the data size
can be reduced by eliminating the weaker components, that is, those with low variance.
Using the strongest principal components, it should be possible to reconstruct a good
approximation of the original data. PCA can be applied to ordered and unordered
attributes, and can handle sparse data and skewed data. Multidimensional data of more
than two dimensions can be handled by reducing the problem to two dimensions.
Principal components may be used as inputs to multiple regression and cluster analysis. In
comparison with wavelet transforms, PCA tends to be better at handling sparse data,
whereas wavelet transforms are more suitable for data of high dimensionality.
Fig. 1.6:
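A minimal sketch of these steps with NumPy (an assumed dependency); the tiny two-attribute matrix is illustrative only.

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# Step 1: normalize the input so no attribute dominates (here: center and scale by std).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: the principal components are the orthonormal eigenvectors of the covariance matrix.
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 3: sort components by decreasing "significance" (variance explained).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: reduce the data size by keeping only the strongest k components.
k = 1
reduced = Z @ eigvecs[:, :k]   # projected data in the reduced space
print(eigvals, reduced.shape)
```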
Attribute Subset Selection Data sets for analysis may contain hundreds of attributes,
many of which may be irrelevant to the mining task or redundant. For example, if the task
is to classify customers based on whether or not they are likely to purchase a popular new
CD at AllElectronics when notified of a sale, attributes such as the customer’s telephone
number are likely to be irrelevant, unlike attributes such as age or music taste. Although it
may be possible for a domain expert to pick out some of the useful attributes, this can be
a difficult and time-consuming task, especially when the data's behavior is not well
known. (Hence, a reason behind its analysis!) Leaving out relevant attributes or keeping
irrelevant attributes may be detrimental, causing confusion for the mining algorithm
employed. This can result in discovered patterns of poor quality. In addition, the added
volume of irrelevant or redundant attributes can slow down the mining process.
Heuristic (greedy) methods that explore a reduced search space are commonly used for
attribute subset selection. They are effective in practice and may come close to estimating
an optimal solution. The "best" (and "worst") attributes are typically determined using
tests of statistical significance, which assume that the attributes are independent of one
another. Many other attribute evaluation measures can be used, such as the information gain
measure used in building decision trees for classification. Basic heuristic methods of
attribute subset selection include the techniques that follow, some of which are
illustrated in Fig. 1.7.
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the
reduced set. The best of the original attributes is determined and added to the reduced set.
At each subsequent iteration or step, the best of the remaining original attributes is added
to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes.
At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each step, the
procedure selects the best attribute and removes the worst from among the remaining
attributes.
4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were
originally intended for classification. Decision tree induction constructs a flowchart-like
structure where each internal (nonleaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node denotes a class
prediction. At each node, the algorithm chooses the "best" attribute to partition the data
into individual classes. When decision tree induction is used for attribute subset selection,
a tree is constructed from the given data. All attributes that do not appear in the tree are
assumed to be irrelevant. The set of attributes appearing in the tree form the reduced
subset of attributes.
The stopping criteria for the methods may vary. The procedure may employ a threshold on
the measure used to determine when to stop the attribute selection process. In some cases,
we may want to create new attributes based on others. Such attribute construction can help
improve accuracy and understanding of structure in high-dimensional data. For example, we
may wish to add the attribute area based on the attributes height and width. By combining
attributes, attribute construction can discover missing information about the relationships
between data attributes that can be useful for knowledge discovery. (A sketch of stepwise
forward selection is given after the figure below.)
Fig. 1.7:
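Below is a rough sketch of stepwise forward selection. It assumes some externally supplied scoring function score(subset) (for example, cross-validated accuracy or information gain); the stop-when-no-improvement rule is just one possible threshold choice, and the attribute names are hypothetical.

```python
def forward_selection(all_attributes, score):
    """Greedy stepwise forward selection: start empty, repeatedly add the best attribute."""
    selected, best_score = [], float("-inf")
    remaining = list(all_attributes)
    while remaining:
        candidate_scores = {a: score(selected + [a]) for a in remaining}
        best_attr = max(candidate_scores, key=candidate_scores.get)
        if candidate_scores[best_attr] <= best_score:
            break                          # stop when no candidate improves the score
        selected.append(best_attr)
        best_score = candidate_scores[best_attr]
        remaining.remove(best_attr)
    return selected

# Illustrative scoring function: prefers the (hypothetical) informative attributes.
toy_score = lambda subset: len(set(subset) & {"age", "music_taste"}) - 0.1 * len(subset)
print(forward_selection(["age", "phone_number", "music_taste"], toy_score))
```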
Discretization by Binning
Binning is a top-down splitting technique based on a specified number of bins. Binning
methods are also used as discretization methods for data reduction and concept hierarchy
generation. For example, attribute values can be discretized by applying equal-width or
equal-frequency binning, and then replacing each bin value by the bin mean or median, as
in smoothing by bin means or smoothing by bin medians, respectively. These techniques
can be applied recursively to the resulting partitions to generate concept hierarchies.
Binning does not use class information and is therefore an unsupervised discretization
technique. It is sensitive to the user-specified number of bins, as well as the presence of
outliers.
Discretization by Histogram Analysis
Like binning, histogram analysis is an unsupervised discretization technique because it
does not use class information. A histogram partitions the values of an attribute, A, into
disjoint ranges called buckets or bins. In an equal-width histogram, the values are
partitioned into equal-size ranges; in an equal-frequency histogram, the values are
partitioned so that, ideally, each partition contains the same number of data tuples. The
histogram analysis algorithm can be applied recursively to each partition in order to
automatically generate a multilevel concept hierarchy, with the procedure terminating once
a prespecified number of concept levels has been reached. A
minimum interval size can also be used per level to control the recursive procedure. This
specifies the minimum width of a partition, or the minimum number of values for each
partition at each level. Histograms can also be partitioned based on cluster analysis of the
data distribution, as described next.
Discretization by Cluster, Decision Tree, and Correlation Analyses
Clustering, decision tree analysis, and correlation analysis can be used for data
discretization. We briefly study each of these
approaches. Cluster analysis is a popular data discretization method. A clustering
algorithm can be applied to discretize a numeric attribute, A, by partitioning the values of
A into clusters or groups. Clustering takes the distribution of A into consideration, as well
as the closeness of data points, and therefore is able to produce high-quality discretization
results. Clustering can be used to generate a concept hierarchy for A by following either a
top-down splitting strategy or a bottom-up merging strategy, where each cluster forms a
node of the concept hierarchy. In the former, each initial cluster or partition may be
further decomposed into several subclusters, forming a lower level of the hierarchy. In the
latter, clusters are formed by repeatedly grouping neighboring clusters in order to form
higher-level concepts. Techniques to generate decision trees for classification can be
applied to discretization. Such techniques employ a top-down splitting approach. Unlike
the other methods mentioned so far, decision tree approaches to discretization are
supervised, that is, they make use of class label information. For example, we may have a
data set of patient symptoms (the attributes) where each patient has an associated
diagnosis class label. Class distribution information is used in the calculation and
determination of split-points (data values for partitioning an attribute range). Intuitively,
the main idea is to select split-points so that a given resulting partition contains as many
tuples of the same class as possible. Entropy is the most commonly used measure for this
purpose. To discretize a numeric attribute, A, the method selects the value of A that has
the minimum entropy as a split-point, and recursively partitions the resulting intervals to
arrive at a hierarchical discretization. Such discretization forms a concept hierarchy for A.
Because decision tree–based discretization uses class information, it is more likely that
the interval boundaries (split-points) are defined to occur in places that may help improve
classification accuracy.
ChiMerge
ChiMerge is a χ²-based discretization method. It employs a bottom-up approach: the
intuition is that the relative class frequencies should be fairly consistent within an
interval, so if two adjacent intervals have a very similar distribution of classes,
then the intervals can be merged. Otherwise, they should remain separate. ChiMerge
proceeds as follows. Initially, each distinct value of a numeric attribute A is considered to
be one interval. χ² tests are performed for every pair of adjacent intervals. Adjacent
intervals with the least χ² values are merged together, because low χ² values for a pair
indicate similar class distributions. This merging process proceeds recursively until a
predefined stopping criterion is met.
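A rough sketch of the ChiMerge idea: each interval is represented by its per-class counts, the χ² statistic is computed for every pair of adjacent intervals, and the pair with the smallest value is merged until the smallest χ² reaches a user-chosen threshold. The threshold and the class counts below are illustrative assumptions.

```python
def chi2(a, b):
    """Chi-square statistic for two adjacent intervals given as {class: count} dicts."""
    classes = set(a) | set(b)
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a.get(c, 0) + b.get(c, 0)) / total
            if expected > 0:
                stat += (row.get(c, 0) - expected) ** 2 / expected
    return stat

def chimerge(intervals, threshold):
    """Merge adjacent intervals while the smallest chi-square value stays below threshold."""
    intervals = [dict(iv) for iv in intervals]
    while len(intervals) > 1:
        stats = [chi2(intervals[i], intervals[i + 1]) for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= threshold:
            break
        merged = {c: intervals[i].get(c, 0) + intervals[i + 1].get(c, 0)
                  for c in set(intervals[i]) | set(intervals[i + 1])}
        intervals[i:i + 2] = [merged]
    return intervals

# Illustrative class counts for initial single-value intervals of a numeric attribute.
print(chimerge([{"yes": 2}, {"yes": 1, "no": 1}, {"no": 3}], threshold=2.7))
```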
Data Transformation
Overview: In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning,
regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are constructed
and added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for data analysis at
multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range,
such as −1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).
The labels, in turn, can be recursively organized into higher-level concepts, resulting in
a concept hierarchy for the numeric attribute.
6. Concept hierarchy generation for nominal data, where attributes such as street can be
generalized to higher-level concepts, like city or country. Many hierarchies for nominal
attributes are implicit within the database schema and can be automatically defined at the
schema definition level.
Recall that there is much overlap between the major data preprocessing tasks. Smoothing is
a form of data cleaning; the data cleaning process also involves ETL tools, where users
specify transformations to correct data
inconsistencies. Discretization techniques can be categorized based on how the
discretization is performed, such as whether it uses class information or which direction it
proceeds (i.e., top-down vs. bottom-up). If the discretization process uses class
information, then we say it is supervised discretization. Otherwise, it is unsupervised. If
the process starts by first finding one or a few points (called split points or cut points) to
split the entire attribute range, and then repeats this recursively on the resulting intervals,
it is called top-down discretization or splitting. This contrasts with bottom-up
discretization or merging, which starts by considering all of the continuous values as
potential split-points, removes some by merging neighborhood values to form intervals,
and then recursively applies this process to the resulting intervals. Data discretization and
concept hierarchy generation are also forms of data reduction. The raw data are replaced
by a smaller number of interval or concept labels. This simplifies the original data and
makes the mining more efficient. The resulting patterns mined are typically easier to
understand. Concept hierarchies are also useful for mining at multiple abstraction levels.
Normalization
In min-max normalization, a linear transformation is performed on the original data.
Suppose that minA and maxA are the minimum and maximum values of an attribute, A. Min-max
normalization maps a value, vi, of A to v′i in the new range [new_minA, new_maxA] by computing

v′i = (vi − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA.

Min-max normalization preserves the relationships among the original data values. It will
encounter an "out-of-bounds" error if a future input case for normalization falls outside of
the original data range for A.
Example: Min-max normalization. Suppose that the minimum and maximum values for
the attribute income are $12,000 and $98,000, respectively. We would like to map income
to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is
transformed to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 0.716.
In z-score normalization (or zero-mean normalization), the values for an attribute, A, are
normalized based on the mean (i.e., average) and standard deviation of A. A value, vi, of
A is normalized to v′i by computing

v′i = (vi − Ā) / σA,

where Ā and σA are the mean and standard deviation, respectively, of attribute A. The
mean is Ā = (1/n)(v1 + v2 + ··· + vn), and σA is computed as the square root of the
variance of A. This method of normalization is useful when the actual minimum and maximum
of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
Example: z-score normalization. Suppose that the mean and standard deviation of the values
for the attribute income are $54,000 and $16,000, respectively. With z-score normalization,
a value of $73,600 for income is transformed to (73,600 − 54,000) / 16,000 = 1.225.
A variation of z-score normalization replaces the standard deviation by the mean absolute
deviation of A, sA = (1/n)(|v1 − Ā| + |v2 − Ā| + ··· + |vn − Ā|), so that v′i = (vi − Ā) / sA.
The mean absolute deviation, sA, is more robust to outliers than the standard deviation,
σA. When computing the mean absolute deviation, the deviations from the mean (i.e.,
|vi − Ā|) are not squared; hence, the effect of outliers is somewhat reduced.
Normalization by decimal scaling normalizes by moving the decimal point of values of
attribute A. The number of decimal points moved depends on the maximum absolute value of
A. A value, vi, of A is normalized to v′i by computing

v′i = vi / 10^j,

where j is the smallest integer such that max(|v′i|) < 1.
Decimal scaling
Suppose that the recorded values of A range from −986 to 917. The maximum absolute
value of A is 986. To normalize by decimal scaling, we therefore divide each value by
1000 (i.e., j = 3) so that −986 normalizes to −0.986 and 917 normalizes to 0.917. Note
that normalization can change the original data quite a bit, especially when using z-score
normalization or decimal scaling. It is also necessary to save the normalization
parameters (e.g., the mean and standard deviation if using z-score normalization) so that
future data can be normalized in a uniform manner.
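A minimal sketch of the three normalization methods, reusing the income and decimal-scaling figures from the examples above:

```python
import math

def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(v, max_abs):
    # j is the smallest integer such that max(|v'|) < 1 (assumes max_abs >= 1)
    j = math.floor(math.log10(max_abs)) + 1
    return v / (10 ** j)

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling(-986, 986), decimal_scaling(917, 986))  # -0.986  0.917
```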
Similarity Measure:
Numerical measure of how alike two data objects are. Often falls between 0
(no similarity) and 1 (complete similarity).
Dissimilarity Measure:
Numerical measure of how different two data objects are. Lower values indicate more
similar objects. The minimum dissimilarity is often 0, and the upper limit varies.
For a single attribute, with values p and q for two objects, common similarity (s) and
dissimilarity (d) measures by attribute type are:
Nominal: s = 1 if p = q, s = 0 if p ≠ q;  d = 0 if p = q, d = 1 if p ≠ q.
Ordinal (values mapped to integers 0 to n − 1, where n is the number of values):
s = 1 − |p − q| / (n − 1);  d = |p − q| / (n − 1).
Interval or Ratio: s = 1 − |p − q| or s = 1 / (1 + |p − q|);  d = |p − q|.
For multivariate objects, a weighted Euclidean distance can be used:

d_WE(i, j) = ( Σ_{k=1..p} w_k (x_ik − x_jk)² )^(1/2)

If the scales of the attributes differ substantially, standardization is necessary.
Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance. With the
measurements x_ik, i = 1, ..., N, k = 1, ..., p, the Minkowski distance is

d_M(i, j) = ( Σ_{k=1..p} |x_ik − x_jk|^λ )^(1/λ),

where λ ≥ 1. It is also called the L_λ metric.
λ = 1: L1 metric, Manhattan or city-block distance.
λ = 2: L2 metric, Euclidean distance.
λ → ∞: L∞ metric, supremum distance, with
lim_{λ→∞} d_M(i, j) = max(|x_i1 − x_j1|, ..., |x_ip − x_jp|).
Note that λ and p are two different parameters; the dimension p of the data matrix remains
finite.
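A small sketch of the L_λ family for two measurement vectors (plain Python, no external dependencies); the vectors are illustrative.

```python
def minkowski(x, y, lam):
    """L-lambda distance between equal-length vectors x and y (lam >= 1)."""
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1.0 / lam)

def supremum(x, y):
    """Limit lam -> infinity: the maximum attribute-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))

p, q = [1.0, 3.0, 5.0], [4.0, 1.0, 5.0]
print(minkowski(p, q, 1))   # Manhattan / city-block distance: 5.0
print(minkowski(p, q, 2))   # Euclidean distance: ~3.606
print(supremum(p, q))       # supremum (L-infinity) distance: 3.0
```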
Mahalanobis Distance
Let X be an N × p matrix. Then the ith row of X is x_i^T = (x_i1, ..., x_ip).
The Mahalanobis distance is

d_MH(i, j) = ( (x_i − x_j)^T Σ⁻¹ (x_i − x_j) )^(1/2),

where Σ is the p × p sample covariance matrix.
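A short sketch with NumPy (an assumed dependency): the rows of X are observations, and Σ is estimated as the sample covariance matrix of X.

```python
import numpy as np

X = np.array([[2.0, 1.0], [4.0, 3.0], [5.0, 4.0], [3.0, 5.0], [6.0, 2.0]])
cov = np.cov(X, rowvar=False)      # p x p sample covariance matrix
cov_inv = np.linalg.inv(cov)

def mahalanobis(xi, xj):
    d = xi - xj
    return float(np.sqrt(d @ cov_inv @ d))

print(mahalanobis(X[0], X[2]))
```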
Unit-II
Association Rules
Problem Definition
The problem of association rule mining is defined as follows. Let I = {i1, i2, ..., in} be
a set of n binary attributes called items, and let D = {t1, t2, ..., tm} be a set of
transactions called the database. Each transaction in D has a unique transaction ID and
contains a subset of the items in I. A rule is defined as an implication of the form
X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (for short, itemsets) X and Y are
called the antecedent (left-hand side or LHS) and consequent (right-hand side or RHS) of
the rule, respectively.
Example:
To illustrate the concepts, we use a small example from the supermarket domain. The set
of items is I = {milk, bread, butter, beer}, and a small database containing the items
(1 codes presence and 0 absence of an item in a transaction) is shown in the table below.

Transaction ID   milk   bread   butter   beer
1                1      1       0        0
2                0      0       1        0
3                0      0       0        1
4                1      1       1        0
5                0      1       0        0

The support supp(X) of an itemset X is the proportion of transactions in the database that
contain X, and the confidence of a rule is conf(X ⇒ Y) = supp(X ∪ Y) / supp(X).
For example, the rule {milk, bread} ⇒ {butter} has a support of 1/5 = 0.2 (it holds in
transaction 4) and a confidence of 0.2 / 0.4 = 0.5, since {milk, bread} occurs in two of
the five transactions and butter is bought with them in one of those two.
The conviction of a rule is defined as conv(X ⇒ Y) = (1 − supp(Y)) / (1 − conf(X ⇒ Y)) and
can be interpreted as the ratio of the expected frequency that X occurs without Y (that
is to say, the frequency that the rule makes an incorrect prediction) if X and Y were
independent, divided by the observed frequency of incorrect predictions.
This process analyzes customer buying habits by finding associations between the
different items that customers place in their shopping baskets. The discovery of such
associations can help retailers develop marketing strategies by gaining insight into which
items are frequently purchased together by customers.
Finding Frequent Item sets Using Candidate Generation: The Apriori Algorithm
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules. The name of the algorithm is based on the
fact that the algorithm uses prior knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item and collecting those items that satisfy minimum support. The
resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets,
which is used to find L3, and so on, until no more frequent k-itemsets can be found.
Finding each Lk requires one full scan of the database.
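A compact sketch of the level-wise search (candidate generation plus support counting). It assumes transactions are given as sets of items and that min_sup is an absolute support count; the tiny database reuses the supermarket items from the earlier example.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for all frequent itemsets (min_sup = absolute count)."""
    transactions = [frozenset(t) for t in transactions]
    items = {frozenset([i]) for t in transactions for i in t}
    k_frequent = {c: sum(c <= t for t in transactions) for c in items}
    k_frequent = {c: s for c, s in k_frequent.items() if s >= min_sup}   # L1
    frequent, k = {}, 2
    while k_frequent:
        frequent.update(k_frequent)
        # Candidate generation: join L(k-1) with itself, then keep only size-k sets whose
        # (k-1)-subsets are all frequent (the Apriori property).
        prev = list(k_frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in k_frequent for s in combinations(c, k - 1))}
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        k_frequent = {c: s for c, s in counts.items() if s >= min_sup}
        k += 1
    return frequent

db = [{"milk", "bread"}, {"butter"}, {"beer"}, {"milk", "bread", "butter"}, {"bread"}]
print(apriori(db, min_sup=2))   # {milk}, {bread}, {butter}, and {milk, bread} are frequent
```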
Once the frequent itemsets from the transactions in a database D have been found, it is
straightforward to generate strong association rules from them, where strong association
rules satisfy both minimum support and minimum confidence.
Partition Algorithms
The Partition algorithm needs only two scans of the database D to mine the frequent itemsets.
First scan: Subdivide the transactions of database D into n non-overlapping partitions and
find the local frequent itemsets within each partition. If the minimum (relative) support
in D is min_sup, then the minimum support count for a partition is min_sup × the number of
transactions in that partition.
Second scan: The local frequent itemsets from all partitions are combined to form the
global candidate itemsets, and their actual supports are counted against D to determine
the globally frequent itemsets.
Compact Representation of Frequent Itemsets: Maximal and Closed Frequent Itemsets
For example, consider an itemset lattice in which a border separates the frequent itemsets
from the infrequent ones. The itemsets marked in blue are the maximal frequent itemsets;
this can easily be seen, as none of the supersets of the blue itemsets lie above the border
(i.e., none of their supersets are frequent). However, one problem associated with maximal
frequent itemsets is that, even though we know that all their subsets are frequent, we do
not know their supports. Actual support information is very important for these itemsets
when we are deriving association rules. So now we shift our goal and try to find those
frequent itemsets for which no superset has the same support, since such itemsets preserve
the support information of the whole collection of frequent itemsets. This forms the
notion of a closed frequent itemset.
In the above process we have derived the set of frequent itemsets from the transactions.
Now let us represent the itemset lattice and identify the closed and maximal frequent itemsets. The
minimum support value in the current example is set to 2. The labels in red for each node
in the lattice represent the set of transactions in which the itemset occurs (its support);
labels are provided for only a few of the nodes, and they can be verified by checking
against the transactions. The itemsets C, D, and E are closed because none of their
supersets has the same support, but each of them has a frequent superset, so they are not
maximal. For CE and DE, however, both properties are satisfied, and hence they come under
the category of both closed and maximal frequent itemsets.
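Given the support of every frequent itemset (for example, the output of the Apriori sketch above), maximal and closed frequent itemsets can be identified as follows; the support table is an illustrative assumption chosen so that C, D, and E come out closed but not maximal, while CE and DE are both closed and maximal.

```python
def maximal_and_closed(supports):
    """supports: {frozenset: support count}. Returns (maximal, closed) frequent itemsets."""
    itemsets = list(supports)
    maximal, closed = [], []
    for x in itemsets:
        supersets = [y for y in itemsets if x < y]              # proper frequent supersets
        if not supersets:
            maximal.append(x)                                   # no frequent superset at all
        if all(supports[y] < supports[x] for y in supersets):
            closed.append(x)                                    # no superset with equal support
    return maximal, closed

# Illustrative supports with min_sup = 2 (C, D, E are single items; CE and DE are 2-itemsets).
supports = {
    frozenset("C"): 3, frozenset("D"): 4, frozenset("E"): 5,
    frozenset("CE"): 2, frozenset("DE"): 2,
}
maximal, closed = maximal_and_closed(supports)
print([set(m) for m in maximal])   # the largest frequent itemsets: {C, E} and {D, E}
print([set(c) for c in closed])    # itemsets whose support differs from all their supersets
```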
Unit-III
Classification
Problem Definition
There are two forms of data analysis that can be used for extracting models describing
important classes or to predict future data trends. These two forms are as follows −
Classification
Prediction
Classification models predict categorical class labels; and prediction models predict
continuous valued functions. For example, we can build a classification model to
categorize bank loan applications as either safe or risky, or a prediction model to predict
the expenditures in dollars of potential customers on computer equipment given their
income and occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
A bank loan officer wants to analyze the data in order to know which customers
(loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze whether a customer with a given
profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the
categorical labels. These labels are risky or safe for loan application data and yes or no
for marketing data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend
during a sale at his company. In this example we are asked to predict a numeric value.
Therefore the data analysis task is an example of numeric prediction. In this case, a model
or a predictor will be constructed that predicts a continuous-valued-function or ordered
value.
Data Cleaning − Data cleaning involves removing noise and treating missing values. Noise
is removed by applying smoothing techniques, and the problem of missing values is solved
by replacing a missing value with the most commonly occurring value for that attribute.
Relevance Analysis − The database may also have irrelevant attributes. Correlation
analysis is used to know whether any two given attributes are related.
Data Transformation and Reduction − The data can be transformed by any of the
following methods.
Normalization − The data is transformed using normalization. Normalization involves
scaling all values for a given attribute in order to make them fall within a small specified
range. Normalization is used when, in the learning step, neural networks or methods
involving distance measurements are used.
Generalization − The data can also be transformed by generalizing it to a higher
concept. For this purpose we can use the concept hierarchies.
Although accuracy is a specific measure, the word "accuracy" is also used as a general
term to refer to a classifier's predictive abilities. Using training data to derive a classifier
and then estimating the accuracy of the resulting learned model can result in misleadingly
optimistic estimates due to overspecialization of the learning algorithm to the data.
(We will say more on this in a moment!) Instead, it is better to measure the classifier's
accuracy on a test set consisting of class-labeled tuples that were not used to train the
model. Before we discuss the various measures, we need to become comfortable with
some terminology. Recall that we can talk in terms of positive tuples (tuples of the main
class of interest) and negative tuples (all other tuples).
Attribute selection measures are also known as splitting rules because they determine how
the tuples at a given node are to be split. The attribute selection measure provides a
ranking for each attribute describing the given training tuples. The attribute having the
best score for the measure is chosen as the splitting attribute for the given tuples. If the
splitting attribute is continuous-valued or if we are restricted to binary trees, then,
respectively, either a split point or a splitting subset must also be determined as part of the
splitting criterion. The tree node created for partition D is labeled with the splitting
criterion, branches are grown for each outcome of the criterion, and the tuples are
partitioned accordingly. This section describes three popular attribute selection measures
—information gain, gain ratio, and Gini index.
The notation used herein is as follows. Let D, the data partition, be a training set of class-
labeled tuples. Suppose the class label attribute has m distinct values defining m distinct
classes, Ci (for i = 1,..., m). Let Ci,D be the set of tuples of class Ci in D. Let |D| and
|Ci,D| denote the number of tuples in D and Ci,D, respectively.
This approach minimizes the expected number of tests needed to classify a given tuple and
guarantees that a simple (but not necessarily the simplest) tree is found. The expected
information needed to classify a tuple in D is given by

Info(D) = − Σ_{i=1..m} p_i log2(p_i),

where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is
estimated by |Ci,D|/|D|. A log function to the base 2 is used, because the information is
encoded in bits. Info(D) is just the average amount of information needed to identify the
class label of a tuple in D. Note that, at this point, the information we have is based solely
on the proportions of tuples of each class.
Info(D): This is also known as the entropy of D. Now, suppose we were to partition the
tuples in D on some attribute A having v distinct values, {a1, a2,..., av }, as observed
from the training data. If A is discrete-valued, these values correspond directly to the v
outcomes of a test on A. Attribute A can be used to split D into v partitions or subsets,
{D1, D2,..., Dv }, where Dj contains those tuples in D that have outcome aj of A. These
partitions would correspond to the branches grown from node N. Ideally, we would like
this partitioning to produce an exact classification of the tuples. That is, we would like for
each partition to be pure. However, it is quite likely that the partitions will be impure
(e.g., where a partition may contain a collection of tuples from different classes rather
than from a single class). How much more information would we still need (after the
partitioning) to arrive at an exact classification? This amount is measured by

Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj).

The term |Dj|/|D| acts as the weight of the jth partition. InfoA(D) is the expected
information required to classify a tuple from D based on the partitioning by A. The
smaller the expected information (still) required, the greater the purity of the partitions.
Information gain is defined as the difference between the original information
requirement (i.e., based on just the proportion of classes) and the new requirement (i.e.,
obtained after partitioning on A). That is,

Gain(A) = Info(D) − Info_A(D).
In other words, Gain(A) tells us how much would be gained by branching on A. It is the
expected reduction in the information requirement caused by knowing the value of A. The
attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute
at node N. This is equivalent to saying that we want to partition on the attribute A that
would do the “best classification,” so that the amount of information still required to
finish classifying the tuples is minimal (i.e., minimum InfoA (D)).
Next, we need to compute the expected information requirement for each attribute. Let’s
start with the attribute age. We need to look at the distribution of yes and no tuples for
each category of age. For the age category “youth,” there are two yes tuples and three no
tuples. For the category “middle aged,” there are four yes tuples and zero no tuples. For
the category "senior," there are three yes tuples and two no tuples. The expected
information needed to classify a tuple in D, based solely on the class proportions (nine
yes and five no tuples), is Info(D) = −(9/14)log2(9/14) − (5/14)log2(5/14) = 0.940 bits.
The expected information needed to classify a tuple in D if the tuples are partitioned
according to age is

Info_age(D) = (5/14) × (−(2/5)log2(2/5) − (3/5)log2(3/5)) + (4/14) × (−(4/4)log2(4/4))
            + (5/14) × (−(3/5)log2(3/5) − (2/5)log2(2/5)) = 0.694 bits.

Hence, the gain in information from such a partitioning is
Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits.
Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and
Gain(credit rating) = 0.048 bits. Because age has the highest information gain among the
attributes, it is selected as the splitting attribute. Node N is labeled with age, and branches
are grown for each of the attribute’s values. The tuples are then partitioned accordingly,
as shown in Figure 8.5. Notice that the tuples falling into the partition for age = middle
aged all belong to the same class. Because they all belong to class “yes,” a leaf should
therefore be created at the end of this branch and labeled "yes." The final decision tree
returned by the algorithm was shown earlier.
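A small sketch that reproduces this kind of calculation. The overall class counts (9 yes / 5 no) and the per-partition counts for age (2/3, 4/0, 3/2) are the ones used in the example above.

```python
from math import log2

def entropy(counts):
    """Info for a list of class counts, in bits."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(class_counts, partitions):
    """Gain(A) = Info(D) - sum_j |Dj|/|D| * Info(Dj)."""
    total = sum(class_counts)
    info_a = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(class_counts) - info_a

overall = [9, 5]                     # buys_computer: yes / no
by_age = [[2, 3], [4, 0], [3, 2]]    # youth, middle_aged, senior
print(round(entropy(overall), 3))         # Info(D), about 0.940 bits
print(round(info_gain(overall, by_age), 3))  # Gain(age), about 0.25 bits
```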
scenario, we must determine the “best” split-point for A, where the split-point is a
threshold on A. We first sort the values of A in increasing order. Typically, the midpoint
between each pair of adjacent values is considered as a possible split-point. Therefore,
given v values of A, then v − 1 possible splits are evaluated. For example, the midpoint
between the values ai and ai+1 of A is (ai + ai+1) / 2.
If the values of A are sorted in advance, then determining the best split for A requires
only one pass through the values. For each possible split-point for A, we evaluate InfoA
(D), where the number of partitions is two, that is, v = 2 (or j = 1,2) in Eq. (8.2). The
point with the minimum expected information requirement for A is selected as the split
point for A. D1 is the set of tuples in D satisfying A ≤ split point, and D2 is the set of
tuples in D satisfying A > split point.
Gain Ratio The information gain measure is biased toward tests with many outcomes.
That is, it prefers to select attributes having a large number of values. For example,
consider an attribute that acts as a unique identifier such as product ID. A split on product
ID would result in a large number of partitions (as many as there are values), each one
containing just one tuple. Because each partition is pure, the information required to
classify data set D based on this partitioning would be Infoproduct ID(D) = 0. Therefore,
the information gained by partitioning on this attribute is maximal. Clearly, such a
partitioning is useless for classification. C4.5, a successor of ID3, uses an extension to
information gain known as gain ratio, which attempts to overcome this bias. It applies a
kind of normalization to information gain using a “split information” value defined
analogously with Info(D) as
This value represents the potential information generated by splitting the training data set,
D, into v partitions, corresponding to the v outcomes of a test on attribute A. Note that,
for each outcome, it considers the number of tuples having that outcome with respect to
the total number of tuples in D. It differs from information gain, which measures the
information with respect to classification that is acquired based on the same partitioning.
The gain ratio is defined as GainRatio(A) = Gain(A) / SplitInfo_A(D).
The attribute with the maximum gain ratio is selected as the splitting attribute. Note,
however, that as the split information approaches 0, the ratio becomes unstable. A
constraint is added to avoid this, whereby the information gain of the test selected must
be large—at least as great as the average gain over all tests examined.
Example:
Computation of gain ratio for the attribute income. A test on income splits the data of
Table 8.1 into three partitions, namely low, medium, and high, containing four, six, and
four tuples, respectively. To compute the gain ratio of income, we first use the split
information formula to obtain

SplitInfo_income(D) = −(4/14)log2(4/14) − (6/14)log2(6/14) − (4/14)log2(4/14) = 1.557.

From before, Gain(income) = 0.029; therefore, GainRatio(income) = 0.029 / 1.557 = 0.019.
Gini Index
The Gini index is used in CART. Using the notation previously described, the Gini index
measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 − Σ_{i=1..m} p_i²,

where pi is the probability that a tuple in D belongs to class Ci, estimated by |Ci,D|/|D|.
To determine the best binary split on A, we examine all the possible subsets that can be
formed using known values of A. Each subset, SA, can be considered as a binary test for
attribute A of the form "A ∈ SA?" Given a tuple, this test is satisfied if the value of A for
the tuple is among the values listed in SA. If A has v possible values, then there are 2^v
possible subsets. For example, if income has three possible values, namely {low, medium,
high}, then the possible subsets are {low, medium, high}, {low, medium}, {low, high},
{medium, high}, {low}, {medium}, {high}, and {}. We exclude the full set, {low, medium,
high}, and the empty set from consideration since, conceptually, they do not represent a
split. Therefore, there are 2^v − 2 possible ways to form two partitions of the data, D,
based on a binary split on A. When considering a binary split, we compute a weighted sum
of the impurity of each resulting partition. For example, if a binary split on A partitions
D into D1 and D2, the Gini index of D given that partitioning is

Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2).
For each attribute, each of the possible binary splits is considered. For a discrete-valued
attribute, the subset that gives the minimum Gini index for that attribute is selected as its
splitting subset. For continuous-valued attributes, each possible split-point must be
considered. The strategy is similar to that described earlier for information gain, where
the midpoint between each pair of (sorted) adjacent values is taken as a possible split-
point. The point giving the minimum Gini index for a given (continuous-valued) attribute
is taken as the split-point of that attribute. Recall that for a possible split-point of A, D1 is
the set of tuples in D satisfying A ≤ split point, and D2 is the set of tuples in D satisfying
A > split point. The reduction in impurity that would be incurred by a binary split on a
discrete- or continuous-valued attribute A is

∆Gini(A) = Gini(D) − Gini_A(D).
The attribute that maximizes the reduction in impurity (or, equivalently, has the minimum
Gini index) is selected as the splitting attribute. This attribute and either its splitting
subset (for a discrete-valued splitting attribute) or split-point (for a continuous-valued
splitting attribute) together form the splitting criterion.
Example
Induction of a decision tree using the Gini index. Let D be the training data shown earlier
in Table 8.1, where there are nine tuples belonging to the class buys computer = yes and
the remaining five tuples belong to the class buys computer = no. A (root) node N is
created for the tuples in D. We first compute the impurity of D:

Gini(D) = 1 − (9/14)² − (5/14)² = 0.459.

To find the splitting criterion for the tuples in D, we need to compute the Gini index for
each attribute. Let's start with the attribute income and consider each of the possible
splitting subsets. Consider the subset {low, medium}. This would result in 10 tuples in
partition D1 satisfying the condition "income ∈ {low, medium}." The remaining four
tuples of D would be assigned to partition D2. The Gini index value computed based on
this partitioning is

Gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
  = (10/14)(1 − (7/10)² − (3/10)²) + (4/14)(1 − (2/4)² − (2/4)²) = 0.443.
Similarly, the Gini index values for splits on the remaining subsets are 0.458 (for the
subsets {low, high} and {medium}) and 0.450 (for the subsets {medium, high} and
{low}). Therefore, the best binary split for attribute income is on {low, medium} (or
{high}) because it minimizes the Gini index. Evaluating age, we obtain {youth, senior}
(or {middle aged}) as the best split for age with a Gini index of 0.375; the attributes
student and credit rating are both binary, with Gini index values of 0.367 and 0.429,
respectively. The attribute age and splitting subset {youth, senior} therefore give the minimum Gini index overall, with a reduction in impurity of 0.459 − 0.357 = 0.102. The binary split “age ∈ {youth, senior}?” results in the maximum reduction in impurity of the tuples in D and is returned as the splitting criterion. Node N is labeled with the criterion,
two branches are grown from it, and the tuples are partitioned accordingly.
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise
or outliers. The pruned trees are smaller and less complex.
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, …, xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, …, An.
2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes’ theorem,
P(Ci|X) = P(X|Ci)·P(Ci) / P(X).
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are
equally likely, that is, P(C1) = P(C2) = …= P(Cm), and we would therefore maximize
P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).
4. To compute P(X|Ci), the naïve assumption of class-conditional independence is made, so that P(X|Ci) = Π_{k=1..n} P(xk|Ci). If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
5. In order to predict the class label of X, P(X|Ci)·P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci)·P(Ci) > P(X|Cj)·P(Cj) for 1 ≤ j ≤ m, j ≠ i.
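A minimal Python sketch of these steps for categorical attributes follows (hypothetical function and variable names; a practical implementation would also apply Laplace smoothing to avoid zero probabilities and use, for example, a Gaussian estimate for continuous attributes):

from collections import Counter, defaultdict

def train_naive_bayes(rows, label="class"):
    # estimate P(Ci) from class frequencies and P(xk | Ci) from per-class value frequencies
    class_counts = Counter(r[label] for r in rows)
    value_counts = defaultdict(Counter)      # (class, attribute) -> Counter of attribute values
    for r in rows:
        for attr, value in r.items():
            if attr != label:
                value_counts[(r[label], attr)][value] += 1
    return class_counts, value_counts

def predict(x, class_counts, value_counts):
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for ci, count in class_counts.items():
        score = count / n                                        # P(Ci)
        for attr, value in x.items():
            score *= value_counts[(ci, attr)][value] / count     # P(xk | Ci), independence assumed
        if score > best_score:
            best_class, best_score = ci, score
    return best_class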
The arc in the diagram allows representation of causal knowledge. For example, lung
cancer is influenced by a person's family history of lung cancer, as well as whether or not
the person is a smoker. It is worth noting that the variable PositiveXray is independent of
whether the patient has a family history of lung cancer or that the patient is a smoker,
given that we know the patient has lung cancer.
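To make the independence statement concrete, the joint probability of a tuple factorizes over the network. Under the parent structure described above (assuming FamilyHistory and Smoker have no parents, as in the usual textbook network), for example:
P(FamilyHistory, Smoker, LungCancer, PositiveXRay) = P(FamilyHistory) · P(Smoker) · P(LungCancer | FamilyHistory, Smoker) · P(PositiveXRay | LungCancer).
Because PositiveXRay has LungCancer as its only parent, knowing LungCancer makes PositiveXRay conditionally independent of FamilyHistory and Smoker, which is exactly the statement above.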
k-Nearest-Neighbor Classifier
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k nearest neighbors of the unknown tuple.
Closeness is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say, X1 = (x11, x12, … , x1n) and X2 = (x21, x22, … , x2n), is
dist(X1, X2) = √( Σ_{i=1..n} (x1i − x2i)² ).
In other words, for each numeric attribute, we take the difference between the corresponding values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it. The square root is taken of the total accumulated distance count.
When k = 1, the unknown tuple is assigned the class of the training tuple that is
closest to it in pattern space.
Nearest neighbor classifiers can also be used for prediction, that is, to return
a real-valued prediction for a given unknown tuple.
In this case, the classifier returns the average value of the real-valued labels
associated with the k nearest neighbors of the unknown tuple.
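The following short Python sketch puts both uses together (hypothetical function names; the training data is a made-up list of (attribute-tuple, label) pairs):

import math
from collections import Counter

def euclidean(x1, x2):
    # square each per-attribute difference, accumulate, then take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(train, query, k=3):
    # majority class among the k training tuples closest to the query tuple
    neighbors = sorted(train, key=lambda t: euclidean(t[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def knn_predict(train, query, k=3):
    # numeric prediction: average of the real-valued labels of the k nearest neighbors
    neighbors = sorted(train, key=lambda t: euclidean(t[0], query))[:k]
    return sum(label for _, label in neighbors) / k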
UNIT-IV
Clustering
Problem Definition
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
Clustering Overview
Clustering is the task of dividing the population or data points into a number of groups
such that data points in the same groups are more similar to other data points in the same
group than those in other groups. In simple words, the aim is to segregate groups with
similar traits and assign them into clusters.
Clustering is the process of making a group of abstract objects into classes of similar
objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into groups based
on data similarity and then assign the labels to the groups.
The main advantage of clustering over classification is that it is adaptable
to changes and helps single out useful features that distinguish different groups.
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’ partitions of the data. Each partition will represent a cluster and k ≤ n. This means that the method classifies the data into k groups, which satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember −
For a given number of partitions (say k), the partitioning method will create
an initial partitioning.
Then it uses the iterative relocation technique to improve the partitioning
by moving objects from one group to another.
K Means Clustering
K-means is an iterative clustering algorithm that converges to a locally optimal partitioning of the data. The algorithm works in the following steps:
1. Specify the desired number of clusters K : Let us choose k=2 for these 5
data points in 2-D space.
MALLA REDDY ENGINEERING COLLEGE FOR WOMEN (AUTONOMOUS- UGC,GOVT. Of INDIA ) Page 55
lOMoARcPSD|18595359
DATA MINING
2. Randomly assign each data point to a cluster : Let’s assign three points in cluster 1
shown using red color and two points in cluster 2 shown using grey color.
3. Compute cluster centroids : The centroid of data points in the red cluster is shown
using red cross and those in grey cluster using grey cross.
4. Re-assign each point to the closest cluster centroid : Note that the data point at the bottom was assigned to the red cluster even though it is closer to the centroid of the grey cluster. Thus, we re-assign that data point to the grey cluster.
5. Re-compute cluster centroids : Now, re-computing the centroids for both the
clusters.
6. Repeat steps 4 and 5 until no improvements are possible : Similarly, we repeat the 4th and 5th steps until the cluster assignments stop changing. When there is no further switching of data points between the two clusters for two successive iterations, the algorithm terminates (unless a maximum number of iterations is specified explicitly).
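A compact Python sketch of these steps is shown below (hypothetical names; it seeds the centroids with k randomly chosen data points rather than a random initial assignment, which is a common variant of step 2, and it stops when the centroids no longer move):

import math
import random

def kmeans(points, k, iterations=100):
    # points: list of numeric tuples; returns the final centroids and the clusters
    centroids = random.sample(points, k)                     # seed with k of the data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                      # step 4: assign to the closest centroid
            distances = [math.dist(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        new_centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
                         for i, c in enumerate(clusters)]     # step 5: recompute the centroids
        if new_centroids == centroids:                        # step 6: stop once nothing changes
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0)]
print(kmeans(points, k=2))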
PAM Algorithm
The PAM (Partitioning Around Medoids) algorithm was developed by Leonard Kaufman and Peter J. Rousseeuw. It is very similar to K-means, mostly because both are partitional algorithms: both break the dataset into groups (clusters) and both work by trying to minimize the error. However, PAM works with medoids, which are actual entities of the dataset that represent the group in which they are inserted, whereas K-means works with centroids, which are artificially created points that represent their cluster.
The PAM algorithm partitions the dataset of n objects into k clusters, where both the dataset and the number k are inputs of the algorithm. The algorithm works with a dissimilarity matrix, and its goal is to minimize the overall dissimilarity between the
representatives (medoids) of each cluster and the cluster's members. The algorithm uses the following model to solve the problem:
Minimize F(z) = Σ_{i=1..n} Σ_{j=1..n} d(i, j) · z_ij
Subject to:
1. Σ_{i=1..n} z_ij = 1,  j = 1, 2, ..., n
2. z_ij ≤ y_i,  i, j = 1, 2, ..., n
3. Σ_{i=1..n} y_i = k
4. y_i, z_ij ∈ {0, 1},  i, j = 1, 2, ..., n
Here F(z) is the main function to minimize, d(i, j) is the dissimilarity measurement between the entities i and j, and z_ij is a variable that ensures that only the dissimilarity between entities from the same cluster is counted in the main function. The other expressions are constraints with the following functions: (1) ensures that every single entity is assigned to one cluster and only one cluster, (2) ensures that an entity is assigned only to an entity y_i that has been selected as the medoid representing its cluster, (3) ensures that there are exactly k clusters, and (4) lets the decision variables assume just the values 0 or 1.
The PAM algorithm can work with two kinds of input: the first is the matrix representing every entity and the values of its variables, and the second is the dissimilarity matrix directly. In the latter case the user provides the dissimilarities directly as input to the algorithm, instead of the data matrix representing the entities. Either way the algorithm reaches a solution to the problem; in a general analysis it proceeds this way:
Build phase
1. Choose k entities to become the medoids, or in case these entities were provided
use them as the medoids;
2. Calculate the dissimilarity matrix if it was not informed;
3. Assign every entity to its closest medoid;
Swap phase
4. For each cluster, check whether any of the entities of the cluster lowers the average dissimilarity coefficient; if one does, select the entity that lowers this coefficient the most as the new medoid for this cluster;
5. If at least one medoid has changed go to (3), else end the algorithm.
As was said, the PAM algorithm works with a matrix of dissimilarity, and to calculate this matrix the algorithm can use two metrics. The first one is the Euclidean distance, which is the root of the sum of squared differences, while the second one is the Manhattan distance, which is the sum of absolute differences.
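The build/swap idea can be sketched in a few lines of Python (hypothetical names; the build phase here simply takes the first k entities for brevity, and the dissimilarity matrix is assumed to have been precomputed with one of the two metrics above):

def pam(dissimilarity, k):
    # dissimilarity: n x n matrix (list of lists); returns the medoid indices and assignments
    n = len(dissimilarity)
    medoids = list(range(k))                        # build phase (naive choice of the first k entities)
    while True:
        assign = [min(medoids, key=lambda m: dissimilarity[i][m]) for i in range(n)]
        cost = sum(dissimilarity[i][assign[i]] for i in range(n))
        best_cost, best_medoids = cost, medoids
        for mi, m in enumerate(medoids):            # swap phase: try every cluster member as medoid
            for cand in (i for i in range(n) if assign[i] == m):
                trial = medoids[:mi] + [cand] + medoids[mi + 1:]
                trial_cost = sum(min(dissimilarity[i][t] for t in trial) for i in range(n))
                if trial_cost < best_cost:
                    best_cost, best_medoids = trial_cost, trial
        if best_medoids == medoids:                 # no medoid changed: stop
            return medoids, assign
        medoids = best_medoids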
Hierarchical clustering
This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is
formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Methods
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to one
another. It keeps on doing so until all of the groups are merged into one or until the
termination condition holds.
Divisive Methods
This approach is also known as the top-down approach. In this, we start with all of the
objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters. This is done until each object is in its own cluster or the termination condition holds.
This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
At the bottom, we start with 25 data points, each assigned to separate clusters. Two
closest clusters are then merged till we have just one cluster at the top. The height in the
dendrogram at which two clusters are merged represents the distance between two
clusters in the data space. The number of clusters that best depicts the different groups can be chosen by observing the dendrogram. The best choice of the no. of clusters is the no. of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster.
In the above example, the best choice of no. of clusters will be 4 as the red horizontal line
in the dendrogram below covers maximum vertical distance AB.
This algorithm has been implemented above using bottom up approach. It is also
possible to follow top-down approach starting with all data points assigned in the same
cluster and recursively performing splits till each data point is assigned a separate cluster.
The decision of merging two clusters is taken on the basis of closeness of these
clusters. There are multiple metrics for deciding the closeness of two clusters :
o Euclidean distance: ||a − b||₂ = √(Σᵢ (aᵢ − bᵢ)²)
o Squared Euclidean distance: ||a − b||₂² = Σᵢ (aᵢ − bᵢ)²
o Manhattan distance: ||a − b||₁ = Σᵢ |aᵢ − bᵢ|
o Maximum distance: ||a − b||∞ = maxᵢ |aᵢ − bᵢ|
o Mahalanobis distance: √((a − b)ᵀ S⁻¹ (a − b)), where S is the covariance matrix
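Assuming SciPy is available, agglomerative clustering with these distance choices can be run in a few lines (made-up 2-D points for illustration; 'single' linkage uses the minimum pairwise Euclidean distance between clusters):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# five made-up 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [5.2, 4.9], [9.0, 1.0]])

# bottom-up merging; the result Z encodes the dendrogram
Z = linkage(X, method="single", metric="euclidean")

# cut the dendrogram so that at most 3 clusters remain
print(fcluster(Z, t=3, criterion="maxclust"))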
K Means is found to work well when the shape of the clusters is hyper spherical
(like circle in 2D, sphere in 3D).
K Means clustering requires prior knowledge of K i.e. no. of clusters you want to
divide your data into. But, you can stop at whatever number of clusters you find
appropriate in hierarchical clustering by interpreting the dendrogram.
Outlier Detection
Outliers: Assume that a given statistical process is used to generate a set of data objects.
An outlier is a data object that deviates significantly from the rest of the objects, as if it
were generated by a different mechanism. For ease of presentation within this chapter, we
may refer to data objects that are not outliers as “normal” or expected data. Similarly, we
may refer to outliers as “abnormal” data.
Outliers are different from noisy data. Noise is a random error or variance in a measured
variable. In general, noise is not interesting in data analysis, including outlier detection.
For example, in credit card fraud detection, a customer’s purchase behavior can be
modeled as a random variable. A customer may generate some “noise transactions” that
may seem like “random errors” or “variance,” such as by buying a bigger lunch one day,
or having one more cup of coffee than usual. Such transactions should not be treated as
outliers; otherwise, the credit card company would incur heavy costs from verifying that
many transactions. The company may also lose customers by bothering them with
multiple false alarms. As in many other data analysis and data mining tasks, noise should
be removed before outlier detection. Outliers are interesting because they are suspected of
not being generated by the same mechanisms as the rest of the data. Therefore, in outlier
detection, it is important to justify why the outliers detected are generated by some other
mechanisms. This is often achieved by making various assumptions on the rest of the data
and showing that the outliers detected violate those assumptions significantly. Outlier
detection is also related to novelty detection in evolving data sets. For example, by
monitoring a social media web site where new content is incoming, novelty detection may
identify new topics and trends in a timely manner. Novel topics may initially appear as
outliers. To this extent, outlier detection and novelty detection share some similarity in
modeling and detection methods. However, a critical difference between the two is that in
novelty detection, once new topics are confirmed, they are usually incorporated into the
model of normal behavior so that follow-up instances are not treated as outliers anymore.
Fig.4.4 : Outliers
Types of Outliers In general, outliers can be classified into three categories, namely
global outliers, contextual (or conditional) outliers, and collective outliers. Let’s examine
each of these categories.
Global Outliers In a given data set, a data object is a global outlier if it deviates
significantly from the rest of the data set. Global outliers are sometimes called point
anomalies, and are the simplest type of outliers. Most outlier detection methods are aimed
at finding global outliers.
Contextual Outliers: In a given data set, a data object is a contextual outlier if it deviates significantly with respect to a specific context of the object. The context is specified using two groups of attributes:
Contextual attributes: The contextual attributes of a data object define the object’s
context. In the temperature example, the contextual attributes may be date and location.
Behavioral attributes: These define the object’s characteristics, and are used to evaluate
whether the object is an outlier in the context to which it belongs. In the temperature
example, the behavioral attributes may be the temperature, humidity, and pressure
Unlike global outlier detection, in contextual outlier detection, whether a data object is an
outlier depends on not only the behavioral attributes but also the contextual attributes. A
configuration of behavioral attribute values may be considered an outlier in one context
(e.g., 28◦C is an outlier for a Toronto winter), but not an outlier in another context (e.g.,
28◦C is not an outlier for a Toronto summer). Contextual outliers are a generalization of
local outliers, a notion introduced in density-based outlier analysis approaches. An object
in a data set is a local outlier if its density significantly deviates from the local area in
which it occurs.
Collective Outliers: Given a data set, a subset of data objects forms a collective outlier if the objects as a whole deviate significantly from the entire data set. Importantly, the individual data objects may not be outliers.
Challenges of Outlier Detection Outlier detection is useful in many applications yet faces
many challenges such as the following:
Modeling normal objects and outliers effectively. Outlier detection quality highly
depends on the modeling of normal (nonoutlier) objects and outliers. Often, building a
comprehensive model for data normality is very challenging, if not impossible. This is
partly because it is hard to enumerate all possible normal behaviors in an application. The
border between data normality and abnormality (outliers) is often not clear cut. Instead,
there can be a wide range of gray area. Consequently, while some outlier detection
methods assign to each object in the input data set a label of either “normal” or “outlier,”
other methods assign to each object a score measuring the “outlier-ness” of the object.
Handling noise in outlier detection. As mentioned earlier, outliers are different from
noise. It is also well known that the quality of real data sets tends to be poor. Noise often
unavoidably exists in data collected in many applications. Noise may be present as
deviations in attribute values or even as missing values. Low data quality and the
presence of noise bring a huge challenge to outlier detection. They can distort the data,
blurring the distinction between normal objects and outliers. Moreover, noise and missing
data may “hide” outliers and reduce the effectiveness of outlier detection—an outlier may
appear “disguised” as a noise point, and an outlier detection method may mistakenly
identify a noise point as an outlier.
Understandability. In some application scenarios, a user may want to not only detect
outliers, but also understand why the detected objects are outliers. To meet the
understandability requirement, an outlier detection method has to provide some
justification of the detection. For example, a statistical method can be used to justify the
degree to which an object may be an outlier based on the likelihood that the object was
generated by the same mechanism that generated the majority of the data. The smaller the
likelihood, the more unlikely the object was generated by the same mechanism, and the
more likely the object is an outlier.
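For instance, a very simple statistical justification assumes the normal data follow one Gaussian distribution and scores each value by how many standard deviations it lies from the mean (a minimal sketch with made-up temperature readings; real methods are considerably more elaborate):

import statistics

def zscore_outliers(values, threshold=3.0):
    # flag values whose distance from the mean exceeds threshold standard deviations
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v, round((v - mu) / sigma, 2)) for v in values if abs(v - mu) > threshold * sigma]

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
print(zscore_outliers(temps, threshold=2.0))   # 24.0 is flagged; the rest are "normal"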
There are many outlier detection methods in the literature and in practice. Here, we
present two orthogonal ways to categorize outlier detection methods. First, we categorize
outlier detection methods according to whether the sample of data for analysis is given
with domain expert–provided labels that can be used to build an outlier detection model.
Second, we divide methods into groups according to their assumptions regarding normal
objects versus outliers.
UNIT V
WEB AND TEXT MINING
Introduction
Text Mining is the discovery by computer of new, previously unknown information, by
automatically extracting information from different written resources. A key element is
the linking together of the extracted information to form new facts or new
hypotheses to be explored further by more conventional means of experimentation. Text
mining is different from what we’re familiar with in web search. In search, the user is
typically looking for something that is already known and has been written by someone
else. The problem is pushing aside all the material that currently isn’t relevant to your
needs in order to find the relevant information. In text mining, the goal is to discover
heretofore unknown information, something that no one yet knows and so could not have
yet written down.
Another application is to aid in the automatic classification of texts. For example, it
is possible to “filter” out automatically most undesirable “junk email” based on certain
terms or words that are not likely to appear in legitimate messages, but instead identify
undesirable electronic mail. In this manner, such messages can automatically be
discarded. Such automatic systems for classifying electronic messages can also be useful for assigning documents to topic categories, following the scheme outlined below (a small code sketch follows the steps):
1. Train: Create an attribute dictionary where the attributes represent words from articles related to a particular topic. Choose only words that occur a minimum number of times.
2. Classify: Check each document to be classified for the presence and frequency of the chosen attributes. Classify the document under a particular topic if it contains a predetermined minimum number of references to the chosen attributes for the topic.
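The scheme can be sketched in a few lines of Python (hypothetical names; the dictionaries and threshold are chosen only for illustration):

def classify_document(words, topic_dictionaries, min_hits=2):
    # count how many of the chosen attribute words occur in the document, per topic,
    # and assign every topic that reaches the predetermined minimum number of references
    hits = {topic: sum(1 for w in words if w.lower() in vocab)
            for topic, vocab in topic_dictionaries.items()}
    return [topic for topic, h in hits.items() if h >= min_hits]

topics = {"cricket": {"wicket", "innings", "bowler"},
          "finance": {"stock", "dividend", "market"}}
doc = "the bowler took a wicket in the final innings of the match".split()
print(classify_document(doc, topics))   # ['cricket']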
Web Mining
Web mining is the process which applies various data mining techniques to extract knowledge from web data, categorized as web content, web structure and web usage data. It includes a process of discovering useful and previously unknown information from the web data.
1. Web Content
2. Web Structure
3. Web Usage
Text Mining
Text and data mining are considered complementary techniques required for efficient business management. Data mining and text mining tools have each secured their place in the marketplace. Natural Language Processing is a subset of text mining techniques used to define accurate and complete domain-specific taxonomies, which helps in effective metadata association. Text mining is more mature and efficient in comparison with the general data mining process; about 80 percent of the available information is in the form of text.
The objective of text mining is to exploit the information contained in textual documents, in the form of patterns and trends, associations among entities, and predictive rules. The results are then used for:
1. The analysis of a document collection.
2. Providing information for intelligent navigation and browsing methods.
Data cleansing allows you to extract and retain the valuable information hidden within the data and helps identify the roots of specific words. Typical steps are:
Convert all the relevant information extracted from unstructured data into
structured formats.
Analyse the patterns within the data via Management Information System (MIS).
Store all the valuable information into a secure database to drive trend analysis
and enhance the decision-making process of the organisation.
Some text mining tools are designed exclusively for analysing the performance of social media platforms. These help to track
and interpret the texts generated online from the news, blogs, emails, etc. Furthermore,
text mining tools can efficiently analyse the number of posts, likes, and followers of your
brand on social media, thereby allowing you to understand the reaction of people who are
interacting with your brand and online content. The analysis will enable you to
understand ‘what’s hot and what’s not’ for your target audience.
Unstructured text
Unstructured data is information, in many different forms, that doesn't hew to
conventional data models and thus typically isn't a good fit for a mainstream
relational database. Thanks to the emergence of alternative platforms for storing and
managing such data, it is increasingly prevalent in IT systems and is used by organizations
in a variety of business intelligence and analytics applications.
Traditional structured data, such as the transaction data in financial systems and other
business applications, conforms to a rigid format to ensure consistency in processing and
analyzing it. Sets of unstructured data, on the other hand, can be maintained in formats
that aren't uniform, freeing analytics teams to work with all of the available data without
necessarily having to consolidate and standardize it first. That enables more
comprehensive analyses than would otherwise be possible.
Types of unstructured data
One of the most common types of unstructured data is text. Unstructured text is generated
and collected in a wide range of forms, including Word documents, email messages,
PowerPoint presentations, survey responses, transcripts of call center interactions, and
posts from blogs and social media sites.
Other types of unstructured data include images, audio and video files. Machine data is
another category, one that's growing quickly in many organizations. For example, log
files from websites, servers, networks and applications -- particularly mobile ones -- yield
a trove of activity and performance data. In addition, companies increasingly capture and
analyze data from sensors on manufacturing equipment and other internet of things (IoT)
connected devices.
In some cases, such data may be considered to be semi-structured -- for example,
if metadata tags are added to provide information and context about the content of the
data. The line between unstructured and semi-structured data isn't absolute, though; some
data management consultants contend that all data, even the unstructured kind, has some
level of structure.
Hierarchy of categories
Concept hierarchies may also be defined by discretizing or grouping values for a given
dimension or attribute, resulting in a set-grouping hierarchy. A total or partial order can
be defined among groups of values. An example of a set-grouping hierarchy is shown in the figure for the dimension Root, where an interval (vacation…Work] denotes the range from vacation (exclusive) to Work (inclusive).
Text clustering
Text clustering is the application of cluster analysis to text-based documents. It uses
machine learning and natural language processing (NLP) to understand and categorize
unstructured, textual data.
Document Retrieval: To improve recall, start by adding other documents from the same
cluster.
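Assuming scikit-learn is available, a minimal text clustering sketch vectorizes the documents with TF-IDF and groups them with K-means (made-up documents and parameters, for illustration only):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the stock market rallied as dividends rose",
    "investors watched the market and bought stock",
    "the bowler took five wickets in the innings",
    "a brilliant innings won the cricket match",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)  # unstructured text -> features
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)   # documents on the same topic should receive the same cluster label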
TUTORIAL QUESTIONS
UNIT-I
UNIT-II
1. List the techniques of efficiency of Apriori algorithm?
2. Name the pruning strategies in mining closed frequent itemsets?
3. State maximal frequent itemset?
4. Name the steps in association rule mining?
5. State how can we mine closed frequent itemsets?
6. List the applications of pattern mining?
7. Classify the confidence rule for itemsets A and B?
8. Interpret the rule of support for itemsets A and B?
UNIT-III
1. Name the steps in data classification?
2. Differentiate supervised learning and unsupervised learning?
3. State gain ratio?
4. State Gini index?
5. List the Attribute Selection Measures?
6. Define Naïve Bayesian Classification?
7. Define Bayes’ Theorem?
8. Explain the IF-THEN rules for classification?
UNIT-IV
1. Define Clustering?
2. Define nominal, ordinal and ratio scaled variables?
3. Define CLARA and CLARANS?
4. State hierarchical method?
5. State K-Means method?
6. Define Outlier Detection?
7. Define Chameleon method?
8. List the fields in which clustering techniques are used?
UNIT-V
ASSIGNMENT QUESTIONS
UNIT-I
1. Describe data mining? In your answer, address the following:
a) Is it another hype?
b) Is it a simple transformation of Technology developed from databases, statistics,
and machine learning?
c) Explain how the evolution of database technology led to data mining.
d) Describe the steps involved in data mining when viewed as a process of knowledge
discovery.
2. Consider the following data for analysis, which includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20,
21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94 years.
c) Use normalization by decimal scaling to transform the value 35 for age.
d) Comment on which method you would prefer to use for the given data, giving reasons as to why.
UNIT-II
1. Discuss which algorithm is an influential algorithm for mining frequent itemsets
for boolean association rules? Explain with an example?
2. Apply the Apriori algorithm to the following database of five transactions. Let min_sup = 60% and min_conf = 80%.
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
UNIT-III
1. Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision
tree and then converting the pruned tree to rules. What advantage does (a) have over
(b)?
2. Write an algorithm for k-nearest-neighbor classification given k and n, the number
of attributes describing each tuple.
UNIT-IV
1. Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision
tree and then converting the pruned tree to rules. What advantage does (a) have over
(b)?
2. Write an algorithm for k-nearest-neighbor classification given k and n, the number
of attributes describing each tuple.
UNIT-V
1. Discuss the terminologies associated with Web Structure.
UNIT WISE IMPORTANT QUESTIONS
UNIT-I
PART-A (2 or 3 marks)
1. Define Data mining.
2. Write the differences of data base, data warehouse and data mining with real
time example?
3. Give some alternative terms for Data Mining.
4. What is KDD?
5. What are the steps involved in KDD process?
6. What is the use of knowledge base?
7. Mention some of the data mining techniques?
8. What is the purpose of Data mining Technique?
9. What is data mining and why it is required?
10. What are the applications of data mining?
11. Define Predictive model.
12. Define Descriptive model?
13. Define Data characterization.
14. Define Data discrimination.
15. What is meant by concept description?
16. Define data preprocessing.
17. Define data cleaning.
18. Define data reduction.
19. Define data transformation.
20. Define data integration.
21. What is binning?
22. What is missing data?
23. Define data aggregation.
24. What is meant by dimensionality reduction?
25. What is Feature Subset Selection?
26. What are the primitives that define a task?
27. What is DMQL?
28. List the different coupling schemes used in a data mining system.
29. What is the entropy of an attribute?
30. Give the formula for mean, median and mode.
UNIT-II
PART-A (2 or 3 marks)
1. Define Association Rule Mining.
2. Define frequent itemset.
3. When we can say the association rules are interesting?
4. Explain Association rule in mathematical notations.
5. Define support and confidence in Association rule mining.
6. How is association rules mined from large databases?
7. Describe the different classifications of Association rule mining.
8. What is the purpose of Apriori Algorithm?
9. What is boolean association rule?
10. How to generate association rules from frequent Item sets?
11. Give few techniques to improve the efficiency of Apriori algorithm.
12. What are the factors affecting the performance of the Apriori candidate generation technique?
13. Describe the method of generating frequent item sets without
candidate generation.
Transaction_ID Item_ID
T1 I2,I4,I8,I6
T2 I1,I3,I5
T3 I6,I7,I1
T4 I1,I3,I4
T5 I2,I4,I7
UNIT-III
PART-A (2 or 3 marks)
1. Define the concept of classification?
2. Define the concept of Prediction?
3. What is Decision Tree?
4. What is Attribute Selection Measure?
5. Describe Tree pruning methods.
6. Define Prepruning.
7. Define Post pruning.
8. Why is naïve Bayesian classification called ‘naïve’?
9. What is K-nearest neighbor?
10. What is Best-split?
11. What is a Belief network?
PART-B (5 -10 marks)
UNIT-IV
PART-A (2 or 3 marks)
1. Define Clustering.
2. What do you mean by Cluster Analysis?
3. What are the fields in which clustering techniques are used?
4. What are the requirements of cluster analysis?
5. What are the different types of data used for cluster analysis?
6. What are interval-scaled variables?
7. Define Binary variables. And what are the two types of binary variables?
8. Define nominal, ordinal and ratio scaled variables.
9. What is meant by partitioning method?
10. What is K-mean Algorithm?
11. What is PAM?
12. Define CLARA and CLARANS.
13. What is Hierarchical method?
14. Differentiate Agglomerative and Divisive Hierarchical Clustering.
15. Define Density based method.
16. What is a DBSCAN?
17. What is meant by Grid Based Method?
18. What is a STING?
19. Define Wave Cluster.
PART-B (5 -10 marks)
UNIT-V
PART-A (2 or 3 marks)
1. Explain about Web Mining.
2. Discuss about Text Mining.
3. Specify the Web Mining Classification.
4. Discuss about Web Content Mining.
5. Explain about the Web Structure Mining.
6. Discuss about the terminologies associated with Web Structure.
7. Explain about the Web Usage Mining and its Applications.
8. Discuss about the Application of Text Mining.
9. Discuss about Episode rule discovery for texts.
10. Illustrate about the Hierarchy of categories.
11. Discuss about Text Clustering.
OBJECTIVE QUESTIONS
UNIT-I
1. ________ is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management decisions.
A. Data Mining.
B. Data Warehousing.
C. Web Mining.
D. Text Mining.
4. The important aspect of the data warehouse environment is that data found within the data warehouse is ________.
A. subject-oriented.
B. time-variant.
C. integrated.
D. All of the above.
19. The key used in the operational environment may not have an element of ________.
A. time.
B. cost.
C. frequency.
D. quality.
UNIT-II
21. A record cannot be updated in ________.
A. OLTP
B. files
C. RDBMS
D. data warehouse
25. Bill Inmon has estimated that ________ of the time required to build a data warehouse is consumed in the conversion process.
A. 10 percent.
B. 20 percent.
C. 40 percent
D. 80 percent.
29. The biggest drawback of the level indicator in the classic star-schema is that it limits ________.
A. quantify.
B. qualify.
C. flexibility.
D. ability.
D. a process to upgrade the quality of data before it is moved into a data warehouse
UNIT-III
41. The type of relationship in a star schema is ________.
A. many-to-many.
B. one-to-one.
C. one-to-many.
D. many-to-one.
45. The data administration subsystem helps you perform all of the following, except ________.
A. backups and recovery.
B. query optimization.
C. security management.
D. create, change, and delete information.
46. The most common source of change data in refreshing a data warehouse is ________.
A. queryable change data.
B. cooperative change data.
C. logged change data.
D. snapshot change data.
47. ________ are responsible for running queries and reports against data warehouse tables.
A. Hardware.
B. Software.
C. End users.
D. Middle ware.
C. derived attributes.
D. composite attributes.
51. ________ is a method of incremental conceptual clustering.
A. CORBA.
B. OLAP.
C. COBWEB.
D. STING.
53. The main organizational justification for implementing a data warehouse is to provide ________.
A. cheaper ways of handling transportation.
B. decision support.
C. storing large volume of data.
D. access to data.
60. ________ are designed to overcome any limitations placed on the warehouse by the nature of the relational data model.
A. Operational database.
B. Relational database.
C. Multidimensional database.
D. Data repository.
UNIT-IV
61. ________ are designed to overcome any limitations placed on the warehouse by the nature of the relational data model.
A. Operational database.
B. Relational database.
C. Multidimensional database.
68. The term that is not associated with the data cleaning process is ________.
A. domain consistency.
B. deduplication.
C. disambiguation.
D. segmentation.
69. ________ are some popular OLAP tools.
A. Metacube, Informix.
B. Oracle Express, Essbase.
C. HOLAP.
D. MOLAP.
79. The first International conference on KDD was held in the year ________.
A. 1996.
B. 1997.
C. 1995.
D. 1994.
UNIT-V
81. ________ contains information that gives users an easy-to-understand perspective of the information stored in the data warehouse.
A. Business metadata.
B. Technical metadata.
C. Operational metadata.
D. Financial metadata.
82. ________ helps to integrate, maintain and view the contents of the data warehousing system.
A. Business directory.
B. Information directory.
C. Data dictionary.
D. Database.
84. Data marts that incorporate data mining tools to extract sets of data are called ________.
A. independent data mart.
B. dependent data marts.
C. intra-entry data mart.
D. inter-entry data mart.
85. ________ can generate programs itself, enabling it to carry out new tasks.
A. Automated system.
B. Decision making system.
C. Self-learning system.
D. Productivity system.
D. five.
90. ________ is data that is distilled from the low level of detail found at the current detailed level.
A. Highly summarized data.
B. Lightly summarized data.
C. Metadata.
D. Older detail data.
92. A directory to help the DSS analyst locate the contents of the data warehouse is seen in ________.
A. Current detail data.
B. Lightly summarized data.
C. Metadata.
D. Older detail data.
99. ________ of data means that the attributes within a given entity are fully dependent on the entire primary key of the entity.
A. Additivity.
B. Granularity.
C. Functional dependency.
D. Dimensionality.
100. A fact is said to be fully additive if ________.
A. it is additive over every dimension of its dimensionality.
B. additive over at least one but not all of the dimensions.
C. not additive over any dimension.
D. None of the above.
INTERNAL QUESTION PAPERS
MID-I (2017-2018)
DATA MINING (IV– CSE)
Part – A
Answer all questions
1. Design a Data Warehouse in words of Inmon.
2. List out the applications of data mining?
3. Write the differences of data base, data warehouse and data mining with real
time example?
4. Design OLAP Cube.
5. List out the OLAP operations in multidimensional data model?
Part – B
1. (a) Write the logic for binning method for using Smoothing Technique?
4 1 3 2 6 11 8 5 10 9 7 12 13
(Or)
(b) Design various star schema and snowflake schemas in Multi Dimensional Data
Model?
(Or)
(b) Design major difference between OLTP and OLAP.
MID-II (2017-2018)
DATA MINING
Part – A
Answer all questions
1. Define a classification process?
3. Define Cluster?
4. Give few techniques to improve the efficiency of Apriori algorithm?
Part – B
MID-I (2016-2017)
DATA MINING
Part – A
Answer all questions
1. Design a Operational Databases.
2. Design OLAP Cube.
3. Design a Data Mining.
4. Implement Market Basket Analysis.
5. Design Facts in Data Warehousing.
Part – B
2. (a) Write the logic for binning method for using Smoothing
Technique? 4 4 3 2 6 7 8 5 5 3 7 9 9
(Or)
(b) Design various schemas in Multi Dimensional Data Model?
Transaction_ID Item_ID
T1 I8,I5,I8,I9
T2 I2,I3,I6
T3 I6,I7,I5
T4 I5,I3,I8
T5 I2,I4,I8
(Or)
(b) Implement any two data reduction techniques?
MID-II (2016-2017)
DATA MINING
Part – A
Answer all questions
1. Design a concept of classification?
2. Why is naïve Bayesian classification called ‘naïve’?
3. Design Attribute Selection Measure?
4. Design Clustering and requirements of cluster analysis?
5. Implement Maximal Frequent item set and Closed Frequent item set?
Part – B
1. (a) Design classification by Decision tree induction Algorithm?
(Or)
(b) Design Measures for selecting best split and Estimate the Accuracy of
Classifier?
(Or)
EXTERNAL QUESTION PAPERS
DATA MINING
(Common to CSE & IT)
2.a) Explain any two of the schemas for multidimensional databases. [5+5]
OR
3.a) What are OLAP operations in the multidimensional data model? Explain.[5+5]
OR
5.a) List and describe the five primitives for specifying a data mining task. [5+5]
b) In real-world data, tuples with missing values for some attributes are a common
occurrence. Describe various methods for handling this problem.
6. Write the the Apriori algorithm for discovering frequent item sets for mining
Boolean association rules. [10]
OR
7.a) How can we mine closed frequent item sets? Explain. [5+5]
OR
9.a.What are the measures for selecting the Best Split? Explain. [5+5]
OR
PART - A
(25 Marks)
2. Explain in detail about 3 tier Data Warehousing architecture with a neat sketch.
[10]
OR
OR
OR
8.a)What are the general approaches to consider for solving classification problem?
OR
9.a.) How to evaluate any classifier model, which was build for classification. [5+5]
10. Explain with example how clustering can be with large databases. [10]
OR
11.Explain about Web Mining and Text Mining with an example. [10]
Time: 3 hours
Max. Marks: 70
UNIT – I
OR
UNIT – II
OR
UNIT – III
OR
UNIT – IV
OR
UNIT – V
OR
Time: 3 hours
Max. Marks: 75
Part A is compulsory which carries 25 marks. Answer all questions in Part A. Part B consists of 5 Units. Answer any one full question from each unit. Each question carries 10 marks.
2.a) Compare and contrast operational database systems with data warehouse.
OR
OR
a) Cosine measure
b) Euclidean distance
6. Find the frequent itemsets and strong association rules for the following
transactional database table using Apriori algorithm. Consider the thresholds
as
support = 30% and confidence = 40%. [10]
1  I1, I2, I3, I5
2  I2, I5, I7, I9
3  I1, I3, I5, I7
4  I2, I4, I6, I8
5  I1, I2, I3, I4
6  I2, I3, I4, I5
7  I3, I4, I5, I6
8  I4, I5, I6, I7
9  I5, I6, I7, I8, I9
10 I9, I1, I2, I5
11 I8, I2, I9, I7
12 I5, I6, I3, I2
OR
7.a) Demonstrate construction of FP-tree for the data from Question (6).
8.a) Define information gain and explain its importance in decision tree induction.
OR
9.a) State Bayes theorem. How can it be applied for data classification?
OR
---ooOoo---
Part A is compulsory which carries 25 marks. Answer all questions in Part A. Part B consists of 5 Units. Answer any one full question from each unit. Each question carries 10 marks.
2. What are the various components of data warehouse? Explain their functionality
in detail. [10]
OR
OR
5. What is the need of dimensionality reduction? Explain any two techniques for
Find all frequent item sets using Apriori algorithm. List all the strong association
rules. [10]
OR
8. Explain decision tree induction algorithm for classifying data tuples and discuss
suitable example. [10]
OR
10.What is the goal of clustering? How does partitioning around medoids algorithm
achieve this goal? [10]
OR
DATA MINING
DATA MINING
Time: 3 hours
Max. Marks: 70
PART – A
(Compulsory Question)*****
(c) Explain about decision tree. Explain its role in data mining process.
(f) List and explain in brief the attributes that are used to compare
classification and prediction methods.
(h) Explain in brief the process of mining temporal patterns in a data stream.
PART – B
UNIT – I
OR
3. List and explain the processing steps that may be applied to the data for data
mining.
UNIT – II
4. Explain the process of tree induction. Describe its implementation using Hunt’s
algorithm.
OR
5. Explain the term over fitting. Discuss how this over fitting can be addressed.
UNIT – III
6. Explain how Bayesian method is used for classification in data mining process.
OR
UNIT – IV
OR
UNIT – V
OR
11. Define cluster analysis. List and explain the applications of cluster analysis. Also
explain various types of clustering.
*****
DATA MINING
Time: 3 hours
Max. Marks: 75
Part A is compulsory which carries 25 marks. Answer all questions in Part A. Part B consists of 5 Units. Answer any one full question from each unit. Each question carries 10 marks.
PART - A
(25 Marks)
PART - B
(50 Marks)
2.a) Give examples for defining star, snowflake and fact constellation schemas.
OR
OR
6. Can we design a method that mines the complete set of frequent item sets without candidate generation? If so, explain.
OR
8. Why is naive Bayesian classification called “naive”? Briefly outline the major
ideas of naive Bayesian classification. Explain Naive-Bayes classification. [10]
OR
9.a) What are the methods for expressing attribute test conditions? Explain.
OR
–--ooOoo----
DATA MINING
c) How can you go about filling in the missing values for this attribute? [2]
PART – B
(50 Marks)
b) Discuss the star and snowflake schema in detail with suitable example. [5+5]
OR
3.a) Write the difference between designing a data warehouse and an OLAP cube.
OR
6. Make a comparison of Apriori and ECLAT algorithms for frequent item set
mining in transactional databases. Apply these algorithms to the following data:
[10]
OR
OR
9. How does the Naïve Bayesian classification works? Explain in detail. [10]
OR
11. What are the different clustering methods? Explain in detail. [10]
DATA MINING
Max. Marks: 70
Time: 3 hours
PART – A
(f) List and explain in brief the attributes that are used to compare classification and
prediction methods.
(h) Explain in brief the process of mining temporal pattern in data stream.
UNIT – I
OR
UNIT – II
OR
UNIT – III
6.Explain how Rule Based method is used for classification in data mining process.
OR
UNIT – IV
8.Discuss in detail various methods that improve the efficiency of Apriori algorithm.
OR
9. Illustrate with example in detail, the process of generating association rules from
frequent itemsets.
UNIT – V
OR
*****
References
1. Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers / Elsevier, 2nd Edition, 2006.