DMDW Full Notes
DATA MINING AND DATA WAREHOUSING (BR20), Department of CSE
UNIT – I
1) What is Data Mining? Explain data mining functionalities?
A)
• Data mining is the process of extracting information from huge sets of data in order to identify patterns, trends, and useful knowledge that allow a business to make data-driven decisions.
• Data mining functionalities are used to specify the kinds of patterns to be discovered in data mining tasks.
Classification: -
• Classification is the procedure of discovering a model that represents and distinguishes data classes or concepts, with the objective of using the model to predict the class of objects whose class label is unknown.
• The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
Prediction: -
• Prediction estimates unavailable data values or upcoming trends. An object can be anticipated based on the attribute values of the object and the attribute values of the classes.
• It can be a prediction of missing numerical values, or of increase/decrease trends in time-related data.
Clustering: -
• Clustering is like classification, but the classes are not predefined; they are derived from the data attributes. It is unsupervised learning.
• The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
Outlier analysis: -
• Outliers are data elements that cannot be grouped into any class or cluster. They are data objects whose behaviour is very different from the general behaviour of the other data objects.
• The analysis of this type of data can be essential for mining knowledge.
Evolution analysis: -
• It describes and models trends for objects whose behaviour changes over time.
Difference between a Database and a Data Warehouse: -

| Parameter | Database | Data Warehouse |
| Tables and joins | Tables and joins are complex because they are normalized. | Tables and joins are simple because they are denormalized. |
| Orientation | An application-oriented collection of data. | A subject-oriented collection of data. |
| Storage limit | Generally limited to a single application. | Stores data from any number of applications. |
| Availability | Data is available in real time. | Data is refreshed from source systems as and when needed. |
| Usage | ER modeling techniques are used for designing. | Data modeling techniques are used for designing. |
| Technique | Capture data. | Analyze data. |
| Data type | Data stored in the database is current and up to date. | Current and historical data is stored; it may not be up to date. |
| Storage of data | A flat relational approach is used for data storage. | A dimensional approach is used for the data structure, e.g., star and snowflake schemas. |
| Query type | Simple transaction queries are used. | Complex queries are used for analysis purposes. |
| Data summary | Detailed data is stored. | Highly summarized data is stored. |
UNIT – II
1) What is a Data Warehouse? Explain the data warehouse architecture?
A)
Data Warehouse: -
• A data warehouse is a type of data management system that facilitates and supports business intelligence (BI) activities, especially analysis.
Three-tier data warehouse architecture: -
Tier-1: -
• The bottom tier is a warehouse database server that is almost always a relational database system.
• Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources.
• These tools and utilities perform data extraction, cleaning, and transformation, as well as load and refresh functions to update the data warehouse.
• The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.
Tier-2: -
• The middle tier is an OLAP server that is typically implemented using either a relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
• A relational OLAP (ROLAP) model is an extended relational DBMS that maps operations on multidimensional data to standard relational operations.
• A multidimensional OLAP (MOLAP) model is a special-purpose server that directly implements multidimensional data and operations.
Tier-3: -
• The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools.
• An Online Analytical Processing (OLAP) server is based on the multidimensional data model.
• It allows managers and analysts to get insight into the information through fast, consistent, and interactive access to information.
UNIT – III
1) What is association analysis? Explain frequent itemset generation using the Apriori algorithm?
A)
• Association analysis is the task of finding interesting relationships in large datasets.
• These interesting relationships can take two forms:
1. Frequent item sets
2. Association rules
• Frequent itemsets are collections of items that frequently occur together.
• Association rules suggest that a strong relationship exists between two items.
Apriori algorithm: -
• The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work on databases that contain transactions.
• With the help of these association rules, it determines how strongly or how weakly two objects are connected.
Frequent Itemset: -
• Frequent itemsets are those itemsets whose support is greater than the threshold value or user-specified minimum support.
• It follows that if {A, B} is a frequent itemset, then A and B individually must also be frequent itemsets.
• For example, given the two transactions A = {1, 2, 3, 4, 5} and B = {2, 3, 7}, the items 2 and 3 occur in both transactions, so they are the frequent items in these two transactions.
Steps for the Apriori Algorithm: -
• Below are the steps of the Apriori algorithm:
• Step-1: Determine the support of each itemset in the transactional database, and select the minimum support and minimum confidence.
• Step-2: Take all the itemsets in the transactions whose support value is higher than the minimum (selected) support value.
• Step-3: Find all the rules over these subsets that have a confidence value higher than the threshold (minimum) confidence.
• Step-4: Sort the rules in decreasing order of lift.
Apriori Algorithm Working
Example:
• Suppose we have the following dataset of transactions; from this dataset, we need to find the frequent itemsets and generate the association rules using the Apriori algorithm:
Solution:
Step-1: Calculating C1 and L1: -
• In the first step, we create a table that contains the support count (the frequency of each item in the dataset) of each itemset in the given dataset. This table is called the candidate set C1.
• Next, we take out all the itemsets whose support count is greater than or equal to the minimum support (2). This gives us the table for the frequent itemset L1.
• Since all the itemsets except {E} have a support count greater than or equal to the minimum support, the itemset {E} is removed.
Step-2: Calculating C2 and L2: -
• The candidate set C2 is generated by joining L1 with itself, i.e., by forming all 2-itemsets of the frequent items.
• Again, we compare the support count of each C2 itemset with the minimum support count; after comparing, the itemsets with a lower support count are eliminated from the table C2.
• This gives us the table for L2.
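The frequent-itemset step above can be sketched in a few lines of Python. This is a minimal, illustrative level-wise implementation, not an optimized one: candidate k-itemsets are generated by unioning pairs of frequent (k-1)-itemsets, and infrequent candidates are pruned by their support count. The transactions below are the made-up two-transaction example from the Frequent Itemset definition, not the dataset from the missing figure.

```python
def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise frequent-itemset generation (Apriori principle: every
    subset of a frequent itemset must itself be frequent)."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Support count = number of transactions containing the itemset.
        return sum(1 for t in transactions if itemset <= t)

    # C1 -> L1: frequent individual items.
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support}
    frequent = {s: support(s) for s in level}
    k = 2
    while level:
        # Ck: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Lk: keep only the candidates meeting the minimum support.
        level = {c for c in candidates if support(c) >= min_support}
        frequent.update({s: support(s) for s in level})
        k += 1
    return frequent

# Hypothetical transactions; with min_support = 2, {2}, {3}, and {2, 3}
# come out frequent, matching the example above.
txns = [{1, 2, 3, 4, 5}, {2, 3, 7}]
print(apriori_frequent_itemsets(txns, min_support=2))
```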
2) Explain the FP-Growth algorithm?
A)
FP-Growth Algorithm: -
• FP-Growth is an alternative algorithm used to discover frequent itemsets.
• It encodes the dataset using a compact data structure called an FP-tree and extracts frequent itemsets directly from this structure.
FP-tree Representation: -
• An FP-tree is a compressed representation of the input data.
• It is constructed by reading the transactions one at a time and mapping each transaction onto a path in the FP-tree.
• As different transactions can have several items in common, their paths may overlap.
| TID | Items |
| 1 | A, B |
| 2 | B, C, D |
| 3 | A, C, D, E |
| 4 | A, D, E |
| 5 | A, B, C |
| 6 | A, B, C, D |
| 7 | A |
| 8 | A, B, C |
| 9 | A, B, D |
| 10 | B, C, E |
• The table above shows a dataset that contains 10 transactions and 5 items.
• The structure of the FP-tree after reading each transaction is also shown in the accompanying diagram.
• Each node in the tree contains the label of an item along with a counter that shows the number of transactions mapped onto the given path.
• Initially, the FP-tree contains only the root node, represented by the null symbol.
• The FP-tree is extended in the following way (a Python sketch follows the steps):
1. The data is scanned once to determine the support count of each item; the items are ordered by decreasing support (A → B → C → D → E).
2. After reading the first transaction {A, B}, the path null → A → B is formed, and every node along the path has a frequency count of 1.
3. After reading the second transaction {B, C, D}, the path null → B → C → D is formed, and every node along the path has a frequency count of 1.
4. The third transaction {A, C, D, E} shares the common prefix item A with the first transaction {A, B}. As a result, the path null → A → C → D → E overlaps with the path null → A → B, and because of this overlapping prefix the frequency count of A becomes 2.
5. This process continues until every transaction has been mapped onto one of the paths in the FP-tree.
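The two-pass FP-tree construction described above can be sketched in Python. The class and function names below are illustrative; the sketch builds the tree only and omits the mining phase (extracting frequent itemsets from the tree).

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item label (None for the root)
        self.parent = parent
        self.count = 0            # number of transactions mapped through this node
        self.children = {}        # item -> FPNode

def build_fp_tree(transactions, min_support):
    # Pass 1: scan the data once to determine the support count of each item.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    # Order frequent items by decreasing support (ties broken alphabetically).
    rank = {i: (-c, i) for i, c in counts.items() if c >= min_support}

    # Pass 2: insert each transaction as a path; shared prefixes overlap.
    root = FPNode(None, None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1       # e.g. A's count becomes 2 after transaction 3
    return root

txns = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
        {"A","B","C"}, {"A","B","C","D"}, {"A"}, {"A","B","C"},
        {"A","B","D"}, {"B","C","E"}]
tree = build_fp_tree(txns, min_support=1)
print({i: n.count for i, n in tree.children.items()})
```

Running it on the 10 transactions from the table reproduces the item ordering A → B → C → D → E and shows the shared-prefix overlap: the root has only two children, A (count 8) and B (count 2).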
UNIT – IV
1) What is classification? Explain decision tree induction and its algorithm.
A)
• Classification is the process of arranging things in groups or classes according to their
resemblances and affinities
• A decision tree is a structure that includes a root node, branches, and leaf nodes.
• Each internal node denotes a test on an attribute, each branch denotes the outcome of a
test, and each leaf node holds a class label.
• The topmost node in the tree is the root node.
Hunt's Algorithm: -
• In Hunt's algorithm, a decision tree is constructed in a recursive fashion by partitioning the training records into successively purer subsets.
• For example, consider training records for loan borrowers with attributes such as Home Owner and the class label Defaulted.
• The records are first divided into smaller subsets based on the outcomes of the Home Owner test condition.
• If Home Owner = Yes, then all of those records belong to the same class.
• If the test condition is No (i.e., Home Owner = No), the records belong to different classes, so Hunt's algorithm applies the step recursively.
• In the tree, the left child of the root node is labelled Defaulted = No; since its records are pure, it is not extended further.
• The right child of the root node is grown by applying the recursive step of Hunt's algorithm until all the records in each leaf belong to the same class.
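A minimal recursive sketch of Hunt's algorithm in Python, using a tiny hypothetical loan dataset in the spirit of the Home Owner example. For simplicity it consumes attributes in a fixed order; a real decision tree inducer would pick the best attribute at each node using a measure such as information gain or the Gini index (discussed below).

```python
from collections import Counter

def hunts(records, attributes):
    """Recursively partition the records until each leaf is pure."""
    labels = [r["Defaulted"] for r in records]
    if len(set(labels)) == 1:
        return labels[0]                        # all records in one class: leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # no tests left: majority class
    attr, rest = attributes[0], attributes[1:]  # simplification: fixed attribute order
    return {attr: {v: hunts([r for r in records if r[attr] == v], rest)
                   for v in sorted({r[attr] for r in records})}}

records = [  # hypothetical training records
    {"Home Owner": "Yes", "Marital Status": "Married", "Defaulted": "No"},
    {"Home Owner": "Yes", "Marital Status": "Single",  "Defaulted": "No"},
    {"Home Owner": "No",  "Marital Status": "Married", "Defaulted": "No"},
    {"Home Owner": "No",  "Marital Status": "Single",  "Defaulted": "Yes"},
]
print(hunts(records, ["Home Owner", "Marital Status"]))
# Home Owner = Yes becomes a pure leaf ("No"); Home Owner = No is split again.
```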
Bayesian Belief Networks: -
• Each node corresponds to a random variable, which can be continuous or discrete.
• Arcs (directed arrows) represent causal relationships or conditional dependencies between the random variables.
• These directed links or arrows connect pairs of nodes in the graph.
• Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which determines the effect of the parents on that node.
Joint probability distribution: -
• If we have variables x1, x2, x3, ..., xn, then the probability of a particular combination of values of x1, x2, x3, ..., xn is known as the joint probability distribution.
• By the chain rule, P[x1, x2, x3, ..., xn] can be written as:
P[x1, x2, ..., xn] = P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]
= P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn]
• In general, for each variable Xi we can write:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
• Hence the joint distribution factorizes as the product of P(Xi | Parents(Xi)) over all variables Xi.
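As a concrete sketch, the factorized joint probability can be evaluated directly from the conditional probability tables (CPTs). The network below (Burglary → Alarm ← Earthquake) and its numbers are the well-known textbook burglary-alarm example, used purely for illustration.

```python
# CPTs for a tiny hypothetical network: Burglary -> Alarm <- Earthquake.
P_B = {True: 0.001, False: 0.999}          # P(Burglary)
P_E = {True: 0.002, False: 0.998}          # P(Earthquake)
P_A_given_BE = {                           # P(Alarm=True | Burglary, Earthquake)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}

def joint(b, e, a):
    """P(b, e, a) = P(b) * P(e) * P(a | b, e): the chain rule simplified by the
    network structure, since Burglary and Earthquake have no parents."""
    p_a = P_A_given_BE[(b, e)] if a else 1 - P_A_given_BE[(b, e)]
    return P_B[b] * P_E[e] * p_a

# P(Burglary, no Earthquake, Alarm) = 0.001 * 0.998 * 0.94
print(joint(True, False, True))
```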
4) Explain the methods for expressing an attribute test condition and the measures for selecting the best split?
A)
Methods for expressing an attribute test condition:
• There are several methods for expressing an attribute test condition when building
decision trees or other machine learning models.
• The choice of method depends on the type and nature of the attribute being tested.
1. Binary or nominal attributes: For binary or nominal attributes, the test condition is
simply a binary decision based on the presence or absence of a specific attribute value.
For example, if the attribute is "gender" and has two possible values "male" and
"female", the test condition would be "is the gender equal to male?".
2. Ordinal attributes: For ordinal attributes, the test condition can be expressed as a
comparison between the attribute value and a threshold value. For example, if the
attribute is "age" and has values ranging from 1 to 100, the test condition might be "is
the age greater than or equal to 18?".
3. Continuous attributes: For continuous attributes, the test condition can be expressed
as a range or interval of values. This can be done by specifying a threshold value and
comparing the attribute value to it, or by dividing the range of values into multiple
intervals. For example, if the attribute is "income" and has a range of values from 0 to
100,000, the test condition might be "is the income between 50,000 and 100,000?".
4. Text or categorical attributes: For text or categorical attributes, the test condition can
be expressed as a comparison between the attribute value and a set of predefined
categories or keywords. For example, if the attribute is "product type" and has values
such as "electronics", "clothing", and "food", the test condition might be "does the
product type belong to the electronics category?".
5. Other methods: Other methods for expressing attribute test conditions include using
decision rules or decision trees, fuzzy logic, and neural networks.
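Each of the basic condition types above can be written as a simple boolean predicate over a record. A minimal Python sketch; the field names mirror the examples in the text and are illustrative only.

```python
# Attribute test conditions as predicates on a record (a plain dict).
is_male        = lambda r: r["gender"] == "male"                # binary/nominal
is_adult       = lambda r: r["age"] >= 18                       # ordinal threshold
mid_income     = lambda r: 50_000 <= r["income"] <= 100_000     # continuous interval
is_electronics = lambda r: r["product_type"] == "electronics"   # categorical

record = {"gender": "male", "age": 30, "income": 60_000,
          "product_type": "electronics"}
print([t(record) for t in (is_male, is_adult, mid_income, is_electronics)])
# [True, True, True, True]
```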
Measures for selecting the best split: -
• When building decision trees or other machine learning models, selecting the best split
at each node is crucial for achieving accurate and interpretable results.
• There are several measures or metrics that can be used to evaluate the quality of a split
and select the best attribute to split on.
• Some common measures are:
1) Information Gain (IG):
• Information gain is a measure of the reduction in entropy achieved by splitting the
data on an attribute.
• Entropy measures the impurity or randomness of the data.
• A split that produces two subsets with low entropy and high purity will have a high
information gain.
• Information gain is given by:
IG(S, A) = H(S) - Σ|Sj|/|S| H(Sj)
• where H(S) is the entropy of the parent node S, and H(Sj) is the entropy of the jth
child node of S produced by the split on attribute A.
2) Gain Ratio (GR):
• The gain ratio is a modification of information gain that adjusts for the number of
categories or values of the attribute being split on.
• It avoids bias towards attributes with many categories.
• The gain ratio is given by:
GR(S, A) = IG(S, A) / SplitInfo(S, A)
• where SplitInfo(S, A) is a measure of the potential information generated by splitting
on attribute A, given by:
SplitInfo(S, A) = -Σ|Sj|/|S| log2(|Sj|/|S|)
3) Gini Index:
• The Gini index measures the impurity or randomness of the data by calculating the
probability of misclassifying a randomly chosen sample from the set.
• A split that produces two subsets with low Gini index will have a high quality. Gini
index is given by:
Gini(S) = 1 - Σ(p(i))^2
• where p(i) is the proportion of samples in S that belong to class i. The Gini index of a
split on attribute A is given by:
Gini(S, A) = Σ|Sj|/|S| Gini(Sj)
4) Chi-squared Test:
• The chi-squared test is a statistical test used to evaluate the independence of
categorical variables.
• It measures the difference between the observed and expected frequencies of each
category in the attribute being split on.
• A significant difference between the observed and expected frequencies indicates that
the attribute is informative for the classification task.
These measures can be used to evaluate the quality of a split on an attribute and select the
attribute with the highest score as the best attribute to split on at each node of the decision
tree. The choice of measure depends on the nature of the data and the specific problem being
addressed.
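The impurity-based measures above (entropy/information gain, gain ratio, and the Gini index) can be computed with a few helper functions. A minimal Python sketch for illustration; a split is represented as the parent node's class labels plus the list of label lists in each child.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_scores(parent, children):
    """parent: class labels at node S; children: labels of each child Sj."""
    n = len(parent)
    weighted = lambda f: sum(len(c) / n * f(c) for c in children)
    ig = entropy(parent) - weighted(entropy)                    # IG(S, A)
    split_info = -sum(len(c) / n * math.log2(len(c) / n) for c in children)
    gain_ratio = ig / split_info if split_info > 0 else 0.0     # GR(S, A)
    return {"IG": ig, "GainRatio": gain_ratio, "GiniSplit": weighted(gini)}

# A perfect binary split: IG = 1.0, GainRatio = 1.0, weighted Gini = 0.0.
print(split_scores(["y", "y", "n", "n"], [["y", "y"], ["n", "n"]]))
```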
UNIT – V
1) What is cluster analysis? Explain the types of clustering?
A)
Clustering: -
• Clustering is the process of grouping abstract objects into classes of similar objects.
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
• There are several types of clustering algorithms, some of which are:
K-means clustering:
1. This is one of the most popular clustering algorithms.
2. It partitions the data into a fixed number of clusters (K) based on their similarity, with
each cluster represented by its centroid.
Hierarchical clustering:
1. This clustering algorithm creates a tree-like structure of clusters by recursively
merging or dividing them.
2. There are two types of hierarchical clustering
• agglomerative, which starts with individual data points and combines them into
larger clusters
• divisive, which starts with all data points in one cluster and divides them into
smaller clusters.
Density-based clustering:
1. This clustering algorithm is based on the idea that clusters are dense regions of data
points separated by regions of lower density.
2. Density-based clustering algorithms, such as DBSCAN, group together data points
that are within a certain density threshold of each other.
Model-based clustering:
1. This type of clustering algorithm assumes that the data is generated by a probabilistic
model.
2. It estimates the parameters of the model to identify the most likely clusters in the data.
3. Examples of model-based clustering algorithms include Bayesian clustering.
Fuzzy clustering:
1. In this type of clustering, data points are assigned to multiple clusters with different
degrees of membership.
2. Fuzzy clustering algorithms, such as Fuzzy C-Means, allow for more flexible
clustering where data points can belong to more than one cluster with different degrees
of similarity.
2) Explain the K-means algorithm and the bisecting K-means algorithm?
A)
• Clustering forms groups of objects such that the objects within a group (cluster) are similar to one another and different from (or unrelated to) the objects in other groups.
• A cluster is a collection of data objects that are "similar" to one another and can thus be treated collectively as one group.
• Cluster analysis is a data mining function that may be used to form groups of data objects (clusters) based only on information found in the data that describes the objects and their relationships.
K-means: -
• K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.
• K-means clustering aims to partition the observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
• This results in a partitioning of the data space into Voronoi cells.
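A minimal NumPy sketch of the standard K-means iteration (Lloyd's algorithm), for illustration only; production implementations such as scikit-learn's KMeans add smarter initialization (k-means++) and better empty-cluster handling.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Alternate two steps: assign points to the nearest centroid,
    then recompute each centroid as the mean of its points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iters):
        # Distance from every point to every centroid, then nearest assignment.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New centroid = mean of assigned points (keep old one if a cluster empties).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break                                  # converged
        centroids = new
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```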
Bisecting K-means: -
• The bisecting K-means algorithm is an extension of the basic K-means algorithm.
• The idea is to obtain K clusters by repeated splitting: the set of all points is first split into two clusters, one of the existing clusters is then chosen and split, and so on, until K clusters have been produced.
• The details of bisecting K-means are given by the algorithm below, followed by a short Python sketch.
Algorithm for Bisecting K-means: -
1. Initialize the list of clusters to contain a single cluster consisting of all points.
2. repeat
3. Select a cluster from the list of clusters (for example, the largest cluster, or the one with the highest SSE).
4. for i = 1 to number of trial bisections do
5. Bisect the selected cluster using the basic K-means (with K = 2).
6. end for
7. Add the two clusters from the bisection with the lowest total SSE to the list of clusters.
8. until the list of clusters contains K clusters.
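A minimal Python sketch of the algorithm above, using scikit-learn's KMeans for the 2-way bisection step. Choosing which cluster to split is a policy decision; this sketch picks the cluster with the largest SSE, and the number of trial bisections is a parameter.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, n_trials=5, seed=0):
    clusters = [X]                      # step 1: one cluster containing all points
    while len(clusters) < k:            # repeat ... until K clusters
        # Step 3: select the cluster to split (here: the one with the largest SSE).
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        # Steps 4-6: several trial bisections with basic 2-means.
        best, best_sse = None, np.inf
        for t in range(n_trials):
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + t).fit(target)
            if km.inertia_ < best_sse:  # inertia_ = total SSE of this bisection
                best_sse = km.inertia_
                best = [target[km.labels_ == 0], target[km.labels_ == 1]]
        clusters.extend(best)           # step 7: keep the lowest-SSE bisection
    return clusters

X = np.random.default_rng(0).normal(size=(60, 2))
parts = bisecting_kmeans(X, k=3)
print([len(p) for p in parts])
```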