DMDW Full Notes

The document outlines the syllabus for a course on Data Mining and Data Warehousing at Bonam Venkata Chalamayya Engineering College, detailing five units covering topics such as data mining functionalities, data pre-processing, association analysis, classification, and cluster analysis. It also includes important questions and answers related to each unit, discussing concepts like data mining task primitives, the architecture of data mining systems, and the differences between databases and data warehouses. The course aims to equip students with the knowledge and skills necessary for data analysis and decision-making in business contexts.


DATA MINING AND DATA WAREHOUSING (BR20) DEPARTMENT OF CSE

BONAM VENKATA CHALAMAYYA ENGINEERING COLLEGE ODALAREVU


(Autonomous)
DATA MINING AND DATA WAREHOUSING Subject Code: 20CS6T13
UNIT I
Introduction: What Motivated Data Mining? Why Is It Important, Data Mining—On What Kind of Data, Data Mining
Functionalities—What Kinds of Patterns can Be Mined? Are All of the Patterns Interesting? Data Mining Task
Primitives, Major Issues in Data Mining. (Han & Kamber)
UNIT II
Data Pre-processing: Why Pre-process the Data? Descriptive Data Summarization, Data Cleaning, Data Integration
and Transformation, Data Reduction, Data Discretization and Concept Hierarchy Generation, Data Warehouse and
OLAP Technology: An Overview: What Is a Data Warehouse? A Multidimensional Data Model, Data Warehouse
Architecture, (Han & Kamber)
UNIT III
Association Analysis: Basic Concepts and Algorithms: Introduction, Frequent Item Set generation, Rule generation,
compact representation of frequent item sets, FP-Growth Algorithm. (Tan & Vipin)
UNIT IV
Classification: Basic Concepts, General Approach to solving a classification problem, Decision Tree Induction:
Working of Decision Tree, building a decision tree, methods for expressing attribute test conditions, measures for
selecting the best split, Algorithm for decision tree induction. Classification: Alternative Techniques, Bayes’ Theorem,
Naïve Bayesian Classification, Bayesian Belief Networks
UNIT V
Cluster Analysis: Basic Concepts and Algorithms: What Is Cluster Analysis? Different Types of Clustering, Different
Types of Clusters, K-means, The Basic K-means Algorithm, K-means: Additional Issues, Bisecting K-means, K-
means and Different Types of Clusters, Strengths and Weaknesses, K-means as an Optimization Problem,
Agglomerative Hierarchical clustering, Basic Agglomerative Hierarchical Clustering Algorithm, Specific Techniques,
DBSCAN, Traditional Density: center-Based Approach, The DBSCAN Algorithm, Strengths and Weaknesses. (Tan
&Vipin)

Important Questions (DMDW): -


UNIT – I
1. What is data mining? Explain data mining functionalities.
2. Explain major issues in data mining.
3. Data mining task primitives.
4. Difference between a database and a data warehouse.
5. Explain the architecture of a data mining system.
UNIT – II
1. What is a data warehouse? Explain data warehouse architecture.
2. Explain data pre-processing techniques.
3. Explain the multi-dimensional data model or OLAP.
UNIT – III
1. What is association analysis? Explain frequent item set generation using the Apriori algorithm.
2. FP-growth algorithm.
3. Rule generation and compact representation of frequent item sets.
UNIT – IV
1. What is classification? Explain decision tree induction and its algorithm.
2. What is Bayes' theorem? Explain Naive Bayes classification.
3. Bayesian belief networks.
4. Methods for expressing attribute test conditions and measures for selecting the best split.
UNIT – V
1. What is cluster analysis? Explain types of clustering.
2. Explain the K-means algorithm and the bisecting K-means algorithm.
3. Explain the agglomerative hierarchical clustering algorithm and the DBSCAN algorithm.


UNIT – I
1) What is Datamining? Explain Data Mining Functionalities?
A)
• Data mining is the process of extracting information from huge data sets to identify patterns,
trends, and useful knowledge that allow a business to make data-driven decisions.
• Data mining functionalities are used to represent the type of patterns that must be discovered
in data mining tasks.

• In general, data mining tasks can be classified into two types


• Descriptive mining tasks characterize the general properties of the data in the database.
• Predictive mining tasks perform inference on the current data in order to make predictions.
• There are various data mining functionalities which are as follows −
Data characterization: -
• It is a summarization of the general characteristics or features of a target class of data.
• The data corresponding to the user-specified class is generally collected by a database query.
• The output of data characterization can be presented in multiple forms.
Data discrimination: -
• It is a comparison of the general characteristics of target class data objects with the general
characteristics of objects from one or a set of contrasting classes.
• The target and contrasting classes are specified by the user, and the corresponding data
objects are retrieved through database queries.
Association Analysis: -
• It analyses sets of items that frequently occur together in a transactional dataset. Two
parameters are used for determining association rules:
• Support, which identifies how frequently the item set appears in the database.
• Confidence, which is the conditional probability that an item occurs in a transaction given that
another item also occurs.

Classification: -
• Classification is the procedure of discovering a model that describes and distinguishes data
classes or concepts, so that the model can be used to predict the class of objects whose class
label is unknown.
• The derived model is based on the analysis of a set of training data (i.e., data objects
whose class label is known).
Prediction: -
• It predicts missing or unavailable data values as well as future trends. An object is
predicted based on its attribute values and the attribute values of the classes.
• It can be the prediction of missing numerical values or of increase/decrease trends in time-related
information.
Clustering: -
• It is similar to classification, but the classes are not predefined; they are derived from the data
attributes. It is a form of unsupervised learning.
• The objects are clustered or grouped based on the principle of maximizing the intra-class
similarity and minimizing the inter-class similarity.
Outlier analysis: -
• Outliers are data elements that cannot be grouped into any class or cluster. They are the data
objects whose behaviour deviates from the general behaviour of the other data objects.
• The analysis of this type of data can be essential for mining knowledge.
Evolution analysis: -
• It describes the trends for objects whose behaviour changes over time.

2) Explain Major issues in Datamining?


A)
• Data mining is not an easy task, as the algorithms used can become very complex and the data is not
always available in one place.
• It needs to be integrated from various heterogeneous data sources.
• The major issues fall into three groups: mining methodology and user interaction issues,
performance issues, and diverse data type issues.


Mining Methodology & User Interaction Issues: -


1. Mining different kinds of knowledge in databases
• Different users-different knowledge-different way (with same database)
2. Interactive Mining of knowledge at multiple levels of abstraction.
• Focus the search patterns.
• Different angles.
3. Incorporation of background knowledge.
• Background & Domain knowledge.
4. Data mining query languages and ad hoc data mining
• High level data mining query language
• Conditions and constraints.
5. Presentation and visualization of data mining results.
• Use visual representations.
• Expressive forms like graph, chart, matrices, curves, tables, etc...
6. Handling noisy or incomplete data.
• Confuse the process
• Over fit the data (apply any outlier analysis, data cleaning methods)
7. Pattern evaluation - the interestingness problem.
• Pattern may be uninteresting to the user.
• Solve by user specified constraints.
Performance Issues
1. Efficiency and scalability of data mining algorithms.
• Running time.
• Algorithms must remain efficient and scalable for huge amounts of data.
2. Parallel, Distributed, and incremental mining algorithms.
• Huge size of database
• Wide distribution of data
• High cost
• Computational complexity
• Data mining methods
• Solved by designing efficient parallel, distributed, and incremental mining algorithms.
Diversity of data Types Issues
1. Handling of relational and complex types of data.
• One system -> to mine all kinds of data
• Specific data mining system should be constructed.
2. Mining information from heterogeneous databases and global information systems.
• Web mining -> uncover knowledge about web contents, web structure, web usage and web
dynamics


3) Datamining task primitives


A)
Data Mining Task Primitives: -
• A data mining task can be specified in the form of a data mining query, which is input to the
data mining system.
• A data mining query is defined in terms of data mining task primitives.
• These primitives allow the user to interactively communicate with the data mining system
during discovery to direct the mining process or examine the findings from different angles
or depths.
• The data mining primitives specify the following,
1) Set of task-relevant data to be mined.
2) Kind of knowledge to be mined.
3) Background knowledge to be used in the discovery process.
4) Interestingness measures and thresholds for pattern evaluation.
5) Representation for visualizing the discovered patterns.

List of Data Mining Task Primitives


• A data mining query is defined in terms of the following primitives, such as:
1. The set of task-relevant data to be mined
• This specifies the portions of the database or the set of data in which the user is interested
• The data collection process results in a new data relation called the initial data relation.
• The initial data relation can be ordered or grouped according to the conditions specified in
the query.
• This data retrieval can be thought of as a subtask of the data mining task.
2. The kind of knowledge to be mined
• This specifies the data mining functions to be performed, such as characterization,
discrimination, association or correlation analysis, classification, prediction, clustering,
outlier analysis, or evolution analysis.
3. The background knowledge to be used in the discovery process
• This knowledge about the domain to be mined is useful for guiding the knowledge discovery
process and evaluating the patterns found.
• Concept hierarchy defines a sequence of mappings from low-level concepts to higher-level,
more general concepts.
• Rolling Up - Generalization of data: Allows data to be viewed at more meaningful, higher-level
abstractions and makes it easier to understand.
• It also compresses the data, so that fewer input/output operations are required.
• Drilling Down - Specialization of data: Concept values replaced by lower-level concepts.
Based on different user viewpoints, there may be more than one concept hierarchy for a
given attribute or dimension.

4. The interestingness measures and thresholds for pattern evaluation


• Different kinds of knowledge may have different interesting measures.
• Simplicity: A factor contributing to the interestingness of a pattern is the pattern's overall
simplicity for human comprehension.
• Certainty (Confidence): A certainty measure for association rules of the form "A =>B"
where A and B are sets of items is confidence.
• Utility (Support): The potential usefulness of a pattern is a factor defining its
interestingness. It can be estimated by a utility function, such as support
• Support(A => B) = (# tuples containing both A and B) / (total # of tuples)
5. The expected representation for visualizing the discovered patterns
• This refers to the form in which discovered patterns are to be displayed, which may include
rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other visual representations.
4) Difference between Database and Data Warehouse
A)
Parameter: Database vs. Data Warehouse
• Purpose: A database is designed to record; a data warehouse is designed to analyze.
• Processing method: A database uses Online Transaction Processing (OLTP); a data warehouse uses
Online Analytical Processing (OLAP).
• Usage: A database helps to perform the fundamental operations of your business; a data warehouse
allows you to analyze your business.
• Tables and joins: Tables and joins of a database are complex, as they are normalized; tables and
joins in a data warehouse are simple, because they are denormalized.
• Orientation: A database is an application-oriented collection of data; a data warehouse is a
subject-oriented collection of data.
• Storage limit: A database is generally limited to a single application; a data warehouse stores
data from any number of applications.
• Availability: Database data is available in real time; warehouse data is refreshed from the source
systems as and when needed.
• Modelling: ER modelling techniques are used for designing a database; data modelling techniques
are used for designing a data warehouse.
• Technique: A database captures data; a data warehouse analyzes data.
• Data type: Data stored in a database is up to date; a data warehouse stores current and historical
data, which may not be up to date.
• Storage of data: A database uses a flat relational approach for data storage; a data warehouse uses
a dimensional approach for the data structure (e.g., star and snowflake schemas).
• Query type: Simple transaction queries are used in a database; complex queries are used in a data
warehouse for analysis purposes.
• Data summary: Detailed data is stored in a database; a data warehouse stores highly summarized data.


5) Explain about Architecture of Data Mining System


A)
The architecture of a data mining system may have the following components.
1. Data sources or repositories: -
• This component represents multiple data sources such as databases, data warehouses, or
any other information repositories.
• Data cleaning and data integration techniques may be performed on the data.
2. Database server or data warehouse server: -
• The database or data warehouse server is responsible for fetching the relevant data,
based on the user's data mining request.
3. Knowledge base: -
• This is the domain knowledge that is used to guide the search, or to evaluate the
interestingness of the resulting patterns.
4. Data mining engine: -
• This is the core component of the data mining system and consists of a set of functional
modules for tasks such as characterization, association analysis, classification,
evolution analysis, and deviation analysis.
5. Pattern evaluation module: -
• This component typically employs interestingness measures and interacts with the
data mining modules so as to focus the search towards interesting patterns.
6. Graphical user interface: -
• This module communicates between the user and the data mining system, allowing the
user to interact with the system by specifying a data mining query or task.
• This component also allows the user to browse database and data warehouse schemas or
data structures, evaluate mined patterns, and visualize the patterns in different forms.


UNIT – 2
1) What is Data Ware house? Explain Data ware house architecture?
A)
Data Ware House: -
• A data warehouse is a type of data management system that facilitates and supports business
intelligence (BI) activities, specifically analysis
Three-tier data warehouse Architecture: -
Tier-1: -
• The bottom tier is a warehouse database server that is almost always a relational database system.
• Back-end tools and utilities are used to feed data into the bottom tier from operational databases
or other external sources.
• These tools and utilities perform data extraction, cleaning, and transformation, as well as load
and refresh functions, to update the data warehouse.
• Data are extracted using application program interfaces known as gateways. A gateway is supported
by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.
Tier-2: -
• The middle tier is an OLAP server that is typically implemented using either a relational OLAP
(ROLAP) model or a multidimensional OLAP (MOLAP) model.
• A ROLAP model is an extended relational DBMS that maps operations on multidimensional data to
standard relational operations.
• A MOLAP model is a special-purpose server that directly implements multidimensional data and
operations.
Tier-3: -
• The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools
• Online Analytical Processing Server (OLAP) is based on the multidimensional data model.
• It allows managers and analysts to get an insight into the information through fast, consistent,
and interactive access to the information.


2) Explain Data pre-processing techniques?


A)
• Data pre-processing is an important step in the data mining process.
• It refers to the cleaning, transforming, and integrating of data in order to make it ready for
analysis.
• The goal of data pre-processing is to improve the quality of the data and to make it more
suitable for the specific data mining task.

Pre-processing in Data Mining:


• Data pre-processing is a data mining technique which is used to transform the raw data in a
useful and efficient format.
Steps Involved in Data Pre-processing:
1. Data Cleaning:
• The data can have many irrelevant and missing parts.
• To handle this part, data cleaning is done.
• It involves handling of missing data, noisy data etc.
Missing Data:
• This situation arises when some data is missing in the data. It can be handled in various
ways.
• Some of them are:
Ignore the tuples: This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
Fill the Missing values: There are various ways to do this task. You can choose to fill the
missing values manually, by attribute mean or the most probable value.
Noisy Data:
• Noisy data is meaningless data that cannot be interpreted by machines.
• It can be generated due to faulty data collection, data entry errors etc.
• It can be handled in following ways:
Binning Method: This method works on sorted data in order to smooth it.
• The whole data is divided into segments of equal size and then various methods are
performed to complete the task.
• Each segment is handled separately.
• One can replace all data in a segment by its mean or boundary values can be used to
complete the task.
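
• As a rough illustration, the following Python sketch smooths a sorted list of values by
equal-size bins, replacing every value with its bin mean (the function name and the sample
values are invented for illustration):

    # Minimal sketch of smoothing noisy data by bin means (equal-size bins).
    def smooth_by_bin_means(values, bin_size):
        values = sorted(values)
        smoothed = []
        for start in range(0, len(values), bin_size):
            bin_vals = values[start:start + bin_size]
            mean = sum(bin_vals) / len(bin_vals)
            smoothed.extend([mean] * len(bin_vals))  # every value in the bin becomes the bin mean
        return smoothed

    print(smooth_by_bin_means([4, 8, 9, 15, 21, 21, 24, 25, 26], 3))
    # -> [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]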


Regression: Here data can be made smooth by fitting it to a regression function.


• The regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
Clustering: This approach groups the similar data in a cluster.
• The outliers may go undetected, or they will fall outside the clusters.
2. Data Transformation:
• This step is taken in order to transform the data in appropriate forms suitable for mining
process.
• This involves following ways:
Normalization:
• It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)
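
• For instance, a minimal min-max normalization sketch in Python (the value list is invented)
that rescales values into the range 0.0 to 1.0 could look like this:

    # Minimal min-max normalization sketch: rescale values to the range [0.0, 1.0].
    def min_max_normalize(values, new_min=0.0, new_max=1.0):
        old_min, old_max = min(values), max(values)
        return [new_min + (v - old_min) * (new_max - new_min) / (old_max - old_min)
                for v in values]

    print(min_max_normalize([200, 400, 800, 1000]))
    # -> [0.0, 0.25, 0.75, 1.0]
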
Attribute Selection:
• In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.
Discretization:
• This is done to replace the raw values of numeric attribute by interval levels or conceptual
levels.
Concept Hierarchy Generation:
• Here attributes are converted from lower level to higher level in hierarchy. For Example-
The attribute “city” can be converted to “country.”
3. Data Reduction:
• Data mining is used to handle huge amounts of data, and analysis becomes harder as the volume
of data grows.
• To deal with this, we use data reduction techniques.
• It aims to increase the storage efficiency and reduce data storage and analysis costs.
• The various steps to data reduction are:
Data Cube Aggregation:
• Aggregation operation is applied to data for the construction of the data cube.
Attribute Subset Selection:
• The highly relevant attributes should be used, rest all can be discarded.
• For performing attribute selection, one can use level of significance and p- value of
the attribute. The attribute having p-value greater than significance level can be
discarded.
Numerosity Reduction:
• This enables to store the model of data instead of whole data, for example: Regression
Models.
Dimensionality Reduction:
• This reduces the size of data by encoding mechanisms.
• It can be lossy or lossless.
• If the original data can be retrieved after reconstruction from the compressed data, the
reduction is called lossless; otherwise it is called lossy.
• The two effective methods of dimensionality reduction are: Wavelet transforms and
PCA (Principal Component Analysis).
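
• As a hedged sketch (assuming NumPy and scikit-learn are available; the toy matrix is
invented), dimensionality reduction with PCA might look like this:

    # Sketch: reduce a toy 4-feature dataset to 2 principal components with scikit-learn.
    import numpy as np
    from sklearn.decomposition import PCA

    X = np.array([[2.5, 2.4, 0.5, 1.0],
                  [0.5, 0.7, 2.2, 2.9],
                  [2.2, 2.9, 0.9, 1.1],
                  [1.9, 2.2, 2.0, 2.6],
                  [3.1, 3.0, 0.3, 0.8]])

    pca = PCA(n_components=2)             # keep only the 2 strongest components
    X_reduced = pca.fit_transform(X)      # lossy reduction: 4 features -> 2 features
    print(X_reduced.shape)                # (5, 2)
    print(pca.explained_variance_ratio_)  # fraction of variance retained per component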


3) Explain Multi-dimensional data Model or OLAP?


A)
• The multi-dimensional data model is a method used for organizing data in the database, with good
arrangement and assembly of the contents of the database.
• The multi-dimensional data model allows customers to ask analytical questions associated with
market or business trends, unlike relational databases, which allow customers to access data in
the form of queries.
• OLAP (online analytical processing) and data warehousing uses multi-dimensional
databases.
• It represents data in the form of data cubes.
• Data cubes allow the data to be modelled and viewed from many dimensions and perspectives.

Features of multidimensional data models:


Dimensions:
• Dimensions are attributes that describe the measures, such as time, location, or product.
• They are typically stored in dimension tables in a multidimensional data model.
Cubes:
• Cubes are structures that represent the multidimensional relationships between measures and
dimensions in a data model.
• They provide a fast and efficient way to retrieve and analyse data.
Aggregation:
• Aggregation is the process of summarizing data across dimensions and levels of detail.
• This is a key feature of multidimensional data models, as it enables users to quickly analyse
data at different levels of granularity.
OLAP (Online Analytical Processing):
• OLAP is a type of multidimensional data model that supports fast and efficient querying of
large datasets.
• OLAP systems are designed to handle complex queries and provide fast response times.
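
• As a rough illustration of aggregation across dimensions (assuming pandas is available; the
column names and values are invented), a 2-D slice of a sales cube can be summarized as follows:

    # Sketch: summarizing a measure (sales amount) across two dimensions (quarter, city).
    import pandas as pd

    sales = pd.DataFrame({
        "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
        "city":    ["Delhi", "Chennai", "Delhi", "Chennai", "Delhi", "Delhi"],
        "amount":  [400, 300, 500, 350, 250, 150],
    })

    # Aggregate the measure over the dimensions, similar to a 2-D slice of a data cube.
    cube = sales.pivot_table(index="quarter", columns="city",
                             values="amount", aggfunc="sum", fill_value=0)
    print(cube)

    # Rolling up the city dimension gives totals per quarter.
    print(cube.sum(axis=1))
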
Advantages of Multi-Dimensional Data Model
• The following are the advantages of a multi-dimensional data model:
1. A multi-dimensional data model is easy to handle.
2. It is easy to maintain.
3. Its performance is better than that of normal databases (e.g., relational databases).
4. The representation of data is better than traditional databases.


UNIT – 3
1) What is an association analysis? Explain frequent item set generation by using Apriori
algorithm?
A)
• Association analysis is the task of finding interesting relationships in large datasets.
• These interesting relationships can take two forms:
1. Frequent item sets
2. Association rules
• Frequent item sets are a collection of items that frequently occur together.
• Association rules suggest that a strong relationship exists between two items
Apriori algorithm: -
• The Apriori algorithm uses frequent item sets to generate association rules, and it is designed
to work on databases that contain transactions.
• With the help of these association rules, it determines how strongly or how weakly two
objects are connected.
Frequent Itemset: -
• Frequent item sets are those items whose support is greater than the threshold value or user-
specified minimum support.
• It means if A & B are the frequent item sets together, then individually A and B should also
be the frequent itemset.
• Suppose there are the two transactions: A= {1,2,3,4,5}, and B= {2,3,7}, in these two
transactions, 2 and 3 are the frequent item sets.
Steps for Apriori Algorithm
• Below are the steps for the apriori algorithm:
• Step-1: Determine the support of the item sets in the transactional database, and select the
minimum support and confidence.
• Step-2: Take all the item sets in the transactions having a support value higher than the
minimum (selected) support value.
• Step-3: Find all the rules from these subsets that have a confidence value higher than the
threshold (minimum confidence).
• Step-4: Sort the rules in decreasing order of lift.
Apriori Algorithm Working
Example:
• Suppose we have the following dataset that has various transactions, and from this dataset,
• we need to find the frequent item sets and generate the association rules using the Apriori
algorithm:


Solution:
Step-1: Calculating C1 and L1:
• In the first step, we will create a table that contains support count (The frequency of each
itemset individually in the dataset) of each itemset in the given dataset.
• This table is called the Candidate set or C1.

• Now, we will take out all the item sets that have a support count greater than or equal to the
Minimum Support (2). This gives us the table for the frequent itemset L1.
• All the item sets have a support count greater than or equal to the minimum support except E,
so the itemset E will be removed.

Step-2: Candidate Generation C2, and L2:


• In this step, we will generate C2 with the help of L1. In C2, we will create the pair of the
itemsets of L1 in the form of subsets.
• we will get the below table for C2:

• Again, we need to compare the C2 Support count with the minimum support count, and after
comparing, the itemset with less support count will be eliminated from the table C2.
• It will give us the below table for L2


Step-3: Candidate generation C3, and L3:


• For C3, we will repeat the same two processes, but now we will form the C3 table with
subsets of three itemsets together, and will calculate the support count from the dataset.
• It will give the below table:

• Now we will create the L3 table.


• As we can see from the above C3 table, there is only one combination of itemset that has
support count equal to the minimum support count. So, the L3 will have only one
combination, i.e., {A, B, C}.
Step-4: Finding the association rules for the subsets:
• For each rule A → B, we will calculate the confidence using the formula sup(A ^ B)/sup(A).
• After calculating the confidence value for all rules, we will exclude the rules that have a
confidence lower than the minimum threshold (50%).
• Consider the below table:
Rules Support Confidence

A ^B → C 2 Sup{(A ^B) ^C}/sup(A ^B)= 2/4=0.5=50%

B^C → A 2 Sup{(B^C) ^A}/sup(B ^C)= 2/4=0.5=50%

A^C → B 2 Sup{(A ^C) ^B}/sup(A ^C)= 2/4=0.5=50%

C→ A ^B 2 Sup{(C^( A ^B)}/sup(C)= 2/5=0.4=40%

A→ B^C 2 Sup{(A^( B ^C)}/sup(A)= 2/6=0.33=33.33%

B→ A^C 2 Sup{(B^( A ^C)}/sup(B)= 2/7=0.28=28%


• As the given threshold or minimum confidence is 50%, so the first three rules A ^B → C,
B^C → A, and A^C → B can be considered as the strong association rules for the given
problem.
Advantages of Apriori Algorithm
1. It is an easy-to-understand algorithm.
2. The join and prune steps of the algorithm can be easily implemented on large datasets.
Disadvantages of Apriori Algorithm
1. The Apriori algorithm is slow compared to other algorithms, since it scans the database
repeatedly and may generate a large number of candidate item sets.
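
• A compact Python sketch of the frequent-item-set generation described above is shown below
(the transaction list and the minimum support of 2 are invented toy values, and the
candidate-generation step is simplified compared with the textbook join-and-prune procedure):

    # Sketch of Apriori frequent-item-set generation (toy data, minimum support = 2).
    transactions = [{"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"},
                    {"A", "D", "E"}, {"A", "B", "C"}]
    min_support = 2

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # L1: frequent 1-item sets
    items = sorted({i for t in transactions for i in t})
    frequent = [{frozenset([i]) for i in items if support_count(frozenset([i])) >= min_support}]

    k = 2
    while frequent[-1]:
        prev = frequent[-1]
        # Candidate generation: join the frequent (k-1)-item sets, then prune by support.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        frequent.append({c for c in candidates if support_count(c) >= min_support})
        k += 1

    for level in frequent:
        for itemset in sorted(level, key=sorted):
            print(set(itemset), "support =", support_count(itemset))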


2) FP-growth Algorithm
A)
FP - Growth Algorithm: -
• FP-Growth is an alternative algorithm used to discover frequent item sets.
• It encodes the dataset using a compact data structure called an FP-tree and extracts frequent
item sets directly from this structure.
FP-tree Representation: -
• The FP-tree is a compressed representation of the input data.
• It is constructed by reading the transactions one at a time and mapping each transaction onto
a path in the FP-tree.
• As different transactions can have several items in common, their paths may overlap.
T id Items
1 A, B
2 B, C, D
3 A, C, D, E
4 A, D, E
5 A, B, C
6 A, B, C, D
7 A
8 A, B, C
9 A, B, D
10 B, C, E

• The table above shows a dataset that contains 10 transactions and 5 items.
• The structure of the FP-tree grows as each transaction is read in.
• Each node in the tree contains the label of an item along with a counter that shows the number
of transactions mapped onto the given path.
• Initially the FP-tree contains only the root node represented by Null Symbol
• The FP-tree is extended in the following way:
1. The data is scanned once to determine the support count for each item (A → B → C → D → E).
2. After reading the first transaction {A, B}, the path 'null → A → B' is formed; every
node along the path has a frequency count of 1.
3. After reading the second transaction {B, C, D}, the path 'null → B → C → D' is formed;
every node along the path has a frequency count of 1.
4. The third transaction {A, C, D, E} shares a common prefix item 'A' with the first
transaction {A, B}. As a result, the path 'null → A → C → D → E' overlaps with
'null → A → B', and because of this overlap the frequency count of node A becomes 2.
5. This process continues until every transaction has been mapped onto one of the paths
of the FP-tree.
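
• The following is a minimal, illustrative FP-tree insertion sketch in Python (the class and
variable names are invented, and the header table and mining step are omitted); it only shows
how shared prefixes cause paths to overlap and counts to accumulate:

    # Minimal FP-tree construction sketch: shared prefixes share nodes, counts accumulate.
    class FPNode:
        def __init__(self, item):
            self.item = item
            self.count = 0
            self.children = {}                 # item -> FPNode

    def build_fp_tree(transactions, order):
        root = FPNode(None)                    # the null root
        for t in transactions:
            node = root
            # items are inserted in a fixed order (e.g., decreasing support)
            for item in sorted(t, key=order.index):
                child = node.children.setdefault(item, FPNode(item))
                child.count += 1
                node = child
        return root

    def show(node, depth=0):
        for child in node.children.values():
            print("  " * depth + f"{child.item}:{child.count}")
            show(child, depth + 1)

    transactions = [{"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}, {"A", "D", "E"}]
    show(build_fp_tree(transactions, order=["A", "B", "C", "D", "E"]))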


3) Rule generation and compact representation of frequent item sets


A)
Rule Generation: -
• This describes how to extract association rules efficiently from a given frequent item set.
• Each frequent k-item set Y can produce up to 2^k - 2 association rules, ignoring rules
that have empty antecedents or consequents (∅ -> Y or Y -> ∅).
• E.g.: let X = {1,2,3} be a frequent item set; six candidate association rules
can be generated from X:
• {1} -> {2,3}   {2,3} -> {1}
• {2} -> {1,3}   {1,3} -> {2}
• {3} -> {1,2}   {1,2} -> {3}
• Consider the rule {1,2} -> {3}; the confidence for this rule is σ(X ∪ Y)/σ(X) = σ({1,2,3})/σ({1,2}).
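
• A small sketch of enumerating the 2^k - 2 candidate rules from one frequent item set and keeping
those above a confidence threshold (the support counts below are invented toy values):

    # Sketch: generate candidate rules from one frequent item set and filter by confidence.
    from itertools import combinations

    # Toy support counts (invented); sigma(X) = number of transactions containing X.
    sigma = {frozenset({1, 2, 3}): 2, frozenset({1, 2}): 4, frozenset({1, 3}): 3,
             frozenset({2, 3}): 3, frozenset({1}): 6, frozenset({2}): 7, frozenset({3}): 5}

    def rules_from(itemset, min_conf=0.5):
        itemset = frozenset(itemset)
        for r in range(1, len(itemset)):                 # non-empty, proper antecedents only
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                conf = sigma[itemset] / sigma[antecedent]    # sigma(X U Y) / sigma(X)
                if conf >= min_conf:
                    yield set(antecedent), set(consequent), conf

    for a, c, conf in rules_from({1, 2, 3}):
        print(f"{a} -> {c}  confidence = {conf:.2f}")
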
Rule Generation in Apriori Algorithm: -

Compact Representation of Frequent Itemset


• It has two types they are
1) Maximal Frequent Itemset
2) Closed Itemset
Maximal Frequent Itemset
• An itemset is maximal frequent if none of its immediate supersets is frequent.


Closed Frequent Itemset


• An itemset is closed if none of its immediate supersets has the same support as the
itemset.
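
• The two definitions can be checked directly, as in this small sketch (the toy transactions and
minimum support are invented):

    # Sketch: mark each frequent item set as closed and/or maximal (toy data, min support = 2).
    from itertools import combinations

    transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"A"}, {"B", "C"}]
    min_support = 2

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset(c) for r in range(1, len(items) + 1)
                for c in combinations(items, r) if support(frozenset(c)) >= min_support]

    for f in frequent:
        supersets = [g for g in frequent if f < g and len(g) == len(f) + 1]  # immediate frequent supersets
        closed  = all(support(g) != support(f) for g in supersets)
        maximal = len(supersets) == 0
        print(set(f), "support =", support(f), "closed =", closed, "maximal =", maximal)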

UNIT – IV
1) What is classification? Explain decision Tree induction &algorithm.
A)
• Classification is the process of arranging things in groups or classes according to their
resemblances and affinities
• A decision tree is a structure that includes a root node, branches, and leaf nodes.
• Each internal node denotes a test on an attribute, each branch denotes the outcome of a
test, and each leaf node holds a class label.
• The topmost node in the tree is the root node.
• Each internal node represents a test on an attribute. Each leaf node represents a class.

Hunt's Algorithm:
• In Hunt's algorithm, a decision tree is constructed in a recursive fashion by partitioning the
training records into successively purer subsets.


How to build a decision tree?


Tid Homeowner Marital Status Annual Income Defaulted Borrower
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K No
Fig: - Training set for predicting borrowers who will default on loan payments
• The initial tree for the classification problem contains a single node with class label
"Defaulted = No".
• This means that most of the borrowers successfully repaid their loans.
Defaulted=No

• The records are subsequently divided into smaller subsets based on the outcomes of
the Home Owner test condition.
• If "Home Owner = Yes", all of the corresponding records belong to the same class.

• If the test condition outcome is No (i.e., "Home Owner = No"), Hunt's algorithm is applied
recursively, because those records belong to different classes.

• In the tree, the left child of the root node ("Home Owner = Yes") is labelled "Defaulted = No"
and is not extended further.
• The right child of the root node is expanded by applying the recursive step of Hunt's
algorithm until all the records in each leaf belong to the same class.
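
• A hedged sketch of decision tree induction on this training set with scikit-learn (assuming
pandas and scikit-learn are installed; the one-hot encoding and parameter choices are mine, not
part of the original notes):

    # Sketch: decision tree induction on the loan-default training set using scikit-learn.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = pd.DataFrame({
        "home_owner":     [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
        "marital_status": ["Single", "Married", "Single", "Married", "Divorced",
                           "Married", "Divorced", "Single", "Married", "Single"],
        "annual_income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in thousands
        "defaulted":      ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No"],
    })

    # One-hot encode the nominal attribute so the tree can express test conditions on it.
    X = pd.get_dummies(data[["home_owner", "marital_status", "annual_income"]])
    y = data["defaulted"]

    tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))   # the learned test conditions

    # Predict the class for a new applicant (not a home owner, married, income 80K).
    new_applicant = pd.DataFrame([{"home_owner": 0, "annual_income": 80,
                                   "marital_status_Divorced": 0,
                                   "marital_status_Married": 1,
                                   "marital_status_Single": 0}])[X.columns]
    print(tree.predict(new_applicant))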


2) What is Bayes Theorem? Explain Naive Bayes’ classification.


A)
Bayes Theorem: -
• Bayes theorem gives the probability of an event with the given information on test.
P(H/X) = [ P(X/H) · P(H) ] / P(X)
Where,
1. X is data tuple (training data set)
2. H is hypothesis
3. P(X/H) is probability of X given H
4. P(H/X) is probability of H given X
Naive Bayes’ classification: -
• Naive Bayes classifiers are a collection of classification algorithms based on Bayes’
Theorem.
• It is not a single algorithm but a family of algorithms which all share a common principle:
• every pair of features being classified is independent of each other.
• To start with, let us consider a dataset.
• Here is a tabular representation of our dataset.
age Income Student Credit rating Buys Computer
Youth High No Fair No
Youth High No Excellent No
Middle aged High No Fair Yes
Senior Medium No Fair Yes
Senior Low Yes Fair Yes
Senior Low Yes Excellent No
Middle aged Low Yes Excellent Yes
Youth Medium No Fair No
Youth Low Yes Fair Yes
Senior Medium Yes Fair Yes
Youth Medium Yes Excellent Yes
Middle aged Medium No Excellent Yes
Middle aged High Yes Fair Yes
Senior Medium No Excellent No


• Data tuples are age, income, student, credit rating


• Class label attribute -> buys computer
1. C1 = class label attribute -> Yes (9)
2. C2 = class label attribute -> No (5)
P(c1) = 9/14=0.643
P(c2) = 5/14=0.357
• Let us consider a data tuple
• X = {age = "youth", income = "medium", student = "yes", credit_rating = "fair"}
• To compute P(X/Ci) for i=1,2
• we compute the following probabilities
P(X1/C1) =2/9 p(X2/C1) =4/9
P(X3/C1) =6/9 p(X4/C1) =6/9
P(X1/C2) =3/5 p(X2/C2) =2/5
P(X3/C2) =1/5 P(X4/C2) =2/5
• Using above probabilities, we obtain
P(X/c1) =2/9 * 4/9 * 6/9 * 6/9=0.044
P(x/c2) =3/5 * 2/5 * 1/5 * 2/5 = 0.019
• To find the class Ci that maximizes P(X/Ci)·P(Ci), we compute:
P(X/c1) P(c1) =0.028
P(X/c2).P(c2)=0.007
• Naive Bayesian classifier predicts buys_comp = Yes (for tuple X)
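
• The same arithmetic can be sketched in a few lines of Python (this just reproduces the hand
computation above; it is not a library implementation):

    # Sketch: reproduce the Naive Bayes computation above for the tuple X.
    prior = {"yes": 9 / 14, "no": 5 / 14}

    # Conditional probabilities P(attribute value | class), taken from the counts above.
    likelihood = {
        "yes": {"age=youth": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
        "no":  {"age=youth": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
    }

    x = ["age=youth", "income=medium", "student=yes", "credit=fair"]

    score = {}
    for c in prior:
        p = prior[c]
        for value in x:
            p *= likelihood[c][value]       # independence assumption: multiply the conditionals
        score[c] = p

    print(score)                            # {'yes': ~0.028, 'no': ~0.007}
    print("predicted class:", max(score, key=score.get))   # -> yes (buys_computer = yes)
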
3) Bayesian belief network
A)
• A Bayesian network is a probabilistic graphical model which represents a set of
variables and their conditional dependencies using a directed acyclic graph.
• It is also called a Bayes network, belief network, decision network, or Bayesian
model.
• Bayesian networks are probabilistic, because these networks are built from a
probability distribution, and also use probability theory for prediction and anomaly
detection.
• Bayesian Network can be used for building models from data and experts’ opinions,
and it consists of two parts:
1. Directed Acyclic Graph
2. Table of conditional probabilities.
• The generalized form of a Bayesian network that represents and solves decision problems
under uncertain knowledge is known as an influence diagram.
• A Bayesian network graph is made up of nodes and Arcs (directed links), where:


• Each node corresponds to the random variables, and a variable can be continuous or
discrete.
• Arc or directed arrows represent the causal relationship or conditional probabilities
between random variables.
• These directed links or arrows connect the pair of nodes in the graph.
• Each node in the Bayesian network has a conditional probability distribution
P(Xi | Parents(Xi)), which determines the effect of the parents on that node.
Joint probability distribution:
• If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations
of x1, x2, x3, ..., xn are known as the joint probability distribution.
• P[x1, x2, x3, ..., xn] can be written as follows in terms of the joint probability distribution:
= P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]
= P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn]
• In general, for each variable Xi we can write the equation as:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
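
• As a tiny numeric illustration (the network structure and all probability values below are
invented), the joint probability of one assignment factorizes into a product of per-node
conditional probabilities:

    # Sketch: joint probability of one assignment in a toy Bayesian network A -> B -> C.
    p_a = 0.3                                  # P(A = true)
    p_b_given_a = {True: 0.8, False: 0.1}      # P(B = true | A)
    p_c_given_b = {True: 0.7, False: 0.2}      # P(C = true | B)

    # P(A=true, B=true, C=false) = P(A) * P(B | A) * P(not C | B)
    joint = p_a * p_b_given_a[True] * (1 - p_c_given_b[True])
    print(joint)                               # 0.3 * 0.8 * 0.3 = 0.072
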
4) Methods for expressing attribute test conditions, and measures for selecting the best split
A)
Methods for expressing an attribute test condition:
• There are several methods for expressing an attribute test condition when building
decision trees or other machine learning models.
• The choice of method depends on the type and nature of the attribute being tested.
1. Binary or nominal attributes: For binary or nominal attributes, the test condition is
simply a binary decision based on the presence or absence of a specific attribute value.
For example, if the attribute is "gender" and has two possible values "male" and
"female", the test condition would be "is the gender equal to male?".
2. Ordinal attributes: For ordinal attributes, the test condition can be expressed as a
comparison between the attribute value and a threshold value. For example, if the
attribute is "age" and has values ranging from 1 to 100, the test condition might be "is
the age greater than or equal to 18?".
3. Continuous attributes: For continuous attributes, the test condition can be expressed
as a range or interval of values. This can be done by specifying a threshold value and
comparing the attribute value to it, or by dividing the range of values into multiple
intervals. For example, if the attribute is "income" and has a range of values from 0 to
100,000, the test condition might be "is the income between 50,000 and 100,000?".


4. Text or categorical attributes: For text or categorical attributes, the test condition can
be expressed as a comparison between the attribute value and a set of predefined
categories or keywords. For example, if the attribute is "product type" and has values
such as "electronics", "clothing", and "food", the test condition might be "does the
product type belong to the electronics category?".
5. Other methods: Other methods for expressing attribute test conditions include using
decision rules or decision trees, fuzzy logic, and neural networks.
Measures for selecting the best split: -
• When building decision trees or other machine learning models, selecting the best split
at each node is crucial for achieving accurate and interpretable results.
• There are several measures or metrics that can be used to evaluate the quality of a split
and select the best attribute to split on.
• Some common measures are:
1) Information Gain (IG):
• Information gain is a measure of the reduction in entropy achieved by splitting the
data on an attribute.
• Entropy measures the impurity or randomness of the data.
• A split that produces two subsets with low entropy and high purity will have a high
information gain.
• Information gain is given by:
IG(S, A) = H(S) - Σ|Sj|/|S| H(Sj)
• where H(S) is the entropy of the parent node S, and H(Sj) is the entropy of the jth
child node of S produced by the split on attribute A.
2) Gain Ratio (GR):
• The gain ratio is a modification of information gain that adjusts for the number of
categories or values of the attribute being split on.
• It avoids bias towards attributes with many categories.
• The gain ratio is given by:
GR(S, A) = IG(S, A) / SplitInfo(S, A)
• where SplitInfo(S, A) is a measure of the potential information generated by splitting
on attribute A, given by:
SplitInfo(S, A) = -Σ|Sj|/|S| log2(|Sj|/|S|)
3) Gini Index:
• The Gini index measures the impurity or randomness of the data by calculating the
probability of misclassifying a randomly chosen sample from the set.
• A split that produces two subsets with low Gini index will have a high quality. Gini
index is given by:
Gini(S) = 1 - Σ(p(i))^2
• where p(i) is the proportion of samples in S that belong to class i. The Gini index of a
split on attribute A is given by:
Gini(S, A) = Σ|Sj|/|S| Gini(Sj)


4) Chi-squared Test:
• The chi-squared test is a statistical test used to evaluate the independence of
categorical variables.
• It measures the difference between the observed and expected frequencies of each
category in the attribute being split on.
• A significant difference between the observed and expected frequencies indicates that
the attribute is informative for the classification task.
These measures can be used to evaluate the quality of a split on an attribute and select the
attribute with the highest score as the best attribute to split on at each node of the decision
tree. The choice of measure depends on the nature of the data and the specific problem being
addressed.
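
A hedged Python sketch of the entropy, information gain, and Gini index calculations described
above (the parent/child class counts are invented for illustration):

    # Sketch: entropy, information gain, and Gini index for one candidate split.
    from math import log2

    def entropy(labels):
        n = len(labels)
        probs = [labels.count(c) / n for c in set(labels)]
        return -sum(p * log2(p) for p in probs if p > 0)

    def gini(labels):
        n = len(labels)
        return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    def information_gain(parent, children):
        n = len(parent)
        weighted = sum(len(ch) / n * entropy(ch) for ch in children)
        return entropy(parent) - weighted

    # Invented example: parent has 6 "yes" and 4 "no"; a split produces two child nodes.
    parent   = ["yes"] * 6 + ["no"] * 4
    children = [["yes"] * 5 + ["no"] * 1, ["yes"] * 1 + ["no"] * 3]

    print("parent entropy  :", round(entropy(parent), 3))            # ~0.971
    print("information gain:", round(information_gain(parent, children), 3))
    print("parent Gini     :", round(gini(parent), 3))               # 1 - (0.6^2 + 0.4^2) = 0.48
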
UNIT – V
1) What is cluster analysis? explain types of clustering
A)
Clustering: -
• Clustering is the process of making a group of abstract objects into classes of similar
objects.
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on
data similarity and then assign the labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes
and helps single out useful features that distinguish different groups.
• There are several types of clustering algorithms, some of which are:
K-means clustering:
1. This is one of the most popular clustering algorithms.
2. It partitions the data into a fixed number of clusters (K) based on their similarity, with
each cluster represented by its centroid.
Hierarchical clustering:
1. This clustering algorithm creates a tree-like structure of clusters by recursively
merging or dividing them.
2. There are two types of hierarchical clustering
• agglomerative, which starts with individual data points and combines them into
larger clusters
• divisive, which starts with all data points in one cluster and divides them into
smaller clusters.
Density-based clustering:
1. This clustering algorithm is based on the idea that clusters are dense regions of data
points separated by regions of lower density.
2. Density-based clustering algorithms, such as DBSCAN, group together data points
that are within a certain density threshold of each other.


Model-based clustering:
1. This type of clustering algorithm assumes that the data is generated by a probabilistic
model.
2. It estimates the parameters of the model to identify the most likely clusters in the data.
3. Examples of model-based clustering algorithms include Bayesian clustering.
Fuzzy clustering:
1. In this type of clustering, data points are assigned to multiple clusters with different
degrees of membership.
2. Fuzzy clustering algorithms, such as Fuzzy C-Means, allow for more flexible
clustering where data points can belong to more than one cluster with different degrees
of similarity.
2) Explain k-Means algorithm & Bisecting k- Means algorithm
A)
• Clustering produces groups of objects such that the objects in a group (cluster) are similar to
one another and different from (or unrelated to) the objects in other groups.
• Cluster is a collection of data objects that are "similar" to one another and thus can be
treated collectively as one group.
• Cluster analysis is a function of data mining that may be used to form groups of data objects
(clusters) based only on information found in the data that describes the objects and their
relationships.
K-means:
• K-means clustering is a method of vector quantization originally from signal
processing that is popular for cluster analysis in data mining.
• K-means clustering aims to partition observations into k clusters in which each observation
belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
• This results in a partitioning of the data space into Voronoi cells.
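
• A compact sketch of the basic K-means loop in Python with NumPy is shown below (the toy 2-D
points are invented; in practice a library such as scikit-learn would normally be used):

    # Sketch of the basic K-means loop: assign points to the nearest centroid, then recompute centroids.
    import numpy as np

    def kmeans(points, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), k, replace=False)]   # initial centroids
        for _ in range(n_iter):
            # assignment step: index of the nearest centroid for every point
            labels = np.argmin(np.linalg.norm(points[:, None] - centroids, axis=2), axis=1)
            # update step: each centroid becomes the mean of its assigned points
            new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
            if np.allclose(new_centroids, centroids):
                break                                                   # converged
            centroids = new_centroids
        return centroids, labels

    points = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
                       [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])
    centroids, labels = kmeans(points, k=2)
    print(centroids)   # one centroid near (1.2, 1.3), the other near (8.1, 8.4)
    print(labels)
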
Bisecting K-means:
• The Bisecting K-means algorithm is extension of the basic K-means algorithm.
• The idea is to obtain K clusters by splitting the set of all points into two clusters, selecting
one of these clusters to split, and so on, until K clusters have been produced.
• The details of bisecting K-means are given by the algorithm below.
Algorithm for Bisecting K-means
1. Initialize the list of clusters to contain the cluster containing all points.
2. Repeat
3. Select a cluster from the list of clusters
4. for i=1 to number of iterations do
5. Bisect the selected cluster using basic K-means
6. end for
7. Add the two clusters from the bisection with the lowest SSE to the list of clusters.
8. Until the list of clusters contains K clusters
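
• A rough Python sketch of the bisecting K-means steps above (it uses scikit-learn's KMeans for
the basic bisection step; the data, the number of trials, and the "largest SSE" splitting choice
are assumptions made for illustration):

    # Sketch of bisecting K-means: repeatedly bisect the cluster with the largest SSE.
    import numpy as np
    from sklearn.cluster import KMeans

    def sse(cluster):
        return ((cluster - cluster.mean(axis=0)) ** 2).sum()

    def bisecting_kmeans(points, k, n_trials=5):
        clusters = [points]                        # start with one cluster containing all points
        while len(clusters) < k:
            # choose the cluster to split (here: the one with the largest SSE)
            idx = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
            to_split = clusters.pop(idx)
            best = None
            for trial in range(n_trials):          # try several bisections, keep the lowest-SSE one
                km = KMeans(n_clusters=2, n_init=1, random_state=trial).fit(to_split)
                halves = [to_split[km.labels_ == 0], to_split[km.labels_ == 1]]
                total = sse(halves[0]) + sse(halves[1])
                if best is None or total < best[0]:
                    best = (total, halves)
            clusters.extend(best[1])
        return clusters

    points = np.random.default_rng(0).normal(size=(60, 2)) + np.repeat([[0, 0], [6, 6], [0, 6]], 20, axis=0)
    for c in bisecting_kmeans(points, k=3):
        print(len(c), "points, centroid =", c.mean(axis=0).round(2))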


3) Explain Agglomerative Hierarchical Clustering Algorithm & DBSCAN algorithm


A)
• Agglomerative hierarchical clustering is a bottom-up clustering approach that starts
with individual data points and recursively merges them into larger clusters based on
their similarity.
• The algorithm begins by treating each data point as a separate cluster and then merges
the closest clusters into larger ones until a stopping criterion is met.
• The distance between two clusters is calculated based on a linkage criterion that
measures the distance between their members.
• There are several types of linkage criteria, including:
1. Single linkage: The distance between two clusters is defined as the shortest
distance between any two points in the two clusters.
2. Complete linkage: The distance between two clusters is defined as the longest
distance between any two points in the two clusters.
3. Average linkage: The distance between two clusters is defined as the average
distance between all pairs of points in the two clusters.
4. Ward’s linkage: The distance between two clusters is defined as the increase in the
sum of squared distances when the two clusters are merged.
• The algorithm continues merging clusters until there is only one cluster containing all
the data points or until a predetermined number of clusters is reached
• Agglomerative hierarchical clustering has the advantage of being able to handle any
type of data and is particularly useful when the number of clusters is not known
beforehand
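
• A brief sketch using SciPy's hierarchical clustering (assuming SciPy is installed; the points
are invented) that builds the merge tree with one of the linkage criteria listed above and then
cuts it into a chosen number of clusters:

    # Sketch: agglomerative hierarchical clustering with SciPy, using average linkage.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                       [5.0, 5.0], [5.2, 5.1], [4.9, 4.8]])

    # Each row of Z records one merge: the two clusters joined, their distance, and the new size.
    Z = linkage(points, method="average")      # other options: "single", "complete", "ward"
    print(Z)

    # Cut the dendrogram so that exactly 2 clusters remain.
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)                              # e.g. [1 1 1 2 2 2]
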
DBSCAN:
• Density-based clustering locates regions of high density that are separated from one
another by regions of low density.
• DBSCAN is a simple and effective density-based clustering technique.
Traditional Density for Center-based Approach:
• In the center-based approach, density is estimated for a particular point in the data set
by counting the number of points within a specified radius, Eps of that point.
• This includes the point itself.
• As an illustration, if the number of points within a radius Eps of a point A is 7 (including A
itself), then the density of A is 7.
• This method is simple to implement, but the density of any point will depend on the
specified radius.
For example,
• if the radius is large enough, then all points will have a density of m, the number of
points in the data set.
• Similarly, if the radius is too small, then all points will have a density of 1.


Classification of points According to Center-Based Density:


• The center-based approach to density allows us to classify a point as being
1) In the interior of a dense region (a core point).
2) On the edge of a dense region (a border point).
3) In a sparsely occupied region (a noise or background point).
These concepts of core, border, and noise points are usually illustrated with a collection of
two-dimensional points.
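
A short sketch of running DBSCAN with scikit-learn (the Eps and MinPts values and the toy points
are invented); each point receives a cluster label, and noise points are marked with -1:

    # Sketch: DBSCAN with scikit-learn; the label -1 marks noise points.
    import numpy as np
    from sklearn.cluster import DBSCAN

    points = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 0.8], [1.2, 1.1],
                       [8.0, 8.0], [8.1, 8.2], [7.9, 7.8], [8.2, 8.1],
                       [4.5, 4.5]])                        # an isolated point -> expected noise

    db = DBSCAN(eps=0.5, min_samples=3).fit(points)        # eps = Eps radius, min_samples = MinPts
    print(db.labels_)                                      # cluster id per point; -1 means noise
    print("core point indices:", db.core_sample_indices_)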
