0% found this document useful (0 votes)

17 views56 pages

Data Analytics Unit 4

Unit-4 of the Data Analytics course focuses on frequent itemsets and clustering, covering concepts such as the Apriori algorithm for mining frequent itemsets, market-based modeling, and various clustering techniques. It emphasizes efficient data handling methods for large datasets, including point-wise frequent itemset mining and limited pass algorithms. The unit aims to equip students with practical skills in data analytics, including the application of R programming for big data analysis.

Uploaded by

publiccode.g

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

17 views56 pages

Data Analytics Unit 4

Uploaded by

publiccode.g

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 56

Data Analytics (KIT-601)

Unit-4: Frequent Itemsets and Clustering

S
.R
Dr. Radhey Shyam

Dr
Professor

Department of Information Technology

4/
SRMCEM Lucknow

(Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow)

2
20

Unit-4 has been prepared and compiled by Dr. Radhey Shyam, with grateful acknowledgment to those who

made their course contents freely available or (Contributed directly or indirectly). Feel free to use this
,
12

study material for your own academic purposes. For any query, communication can be made through this

email : [email protected].
ril
Ap

April 12, 2024

Data Analytics (KIT 601)
Course Outcome ( CO) Bloom’s Knowledge Level (KL)

At the end of course , the student will be able to

CO 1 Discuss various concepts of data analytics pipeline K1, K2

CO 2 Apply classification and regression techniques K3
CO 3 Explain and apply mining techniques on streaming data K2, K3

CO 4 Compare different clustering and frequent pattern mining algorithms K4

CO 5 Describe the concept of R programming and implement analytics on Big data using R. K2,K3

DETAILED SYLLABUS 3-0-0

S
Unit Topic Proposed
Lecture

.R
Introduction to Data Analytics: Sources and nature of data, classification of data
(structured, semi-structured, unstructured), characteristics of data, introduction to Big Data
platform, need of data analytics, evolution of analytic scalability, analytic process and
I tools, analysis vs reporting, modern data analytic tools, applications of data analytics. 08
Dr
Data Analytics Lifecycle: Need, key roles for successful analytic projects, various phases
of data analytics lifecycle – discovery, data preparation, model planning, model building,
communicating results, operationalization.
4/
Data Analysis: Regression modeling, multivariate analysis, Bayesian modeling, inference
and Bayesian networks, support vector and kernel methods, analysis of time series: linear
2

II systems analysis & nonlinear dynamics, rule induction, neural networks: learning and
20

08
generalisation, competitive learning, principal component analysis and neural networks,
fuzzy logic: extracting fuzzy models from data, fuzzy decision trees, stochastic search
methods.
,

Mining Data Streams: Introduction to streams concepts, stream data model and
12

architecture, stream computing, sampling data in a stream, filtering streams, counting

III distinct elements in a stream, estimating moments, counting oneness in a window, 08
decaying window, Real-time Analytics Platform ( RTAP) applications, Case studies – real
ril

time sentiment analysis, stock market predictions.

Frequent Itemsets and Clustering: Mining frequent itemsets, market based modelling,
Ap

Apriori algorithm, handling large data sets in main memory, limited pass algorithm,
IV counting frequent itemsets in a stream, clustering techniques: hierarchical, K-means, 08
clustering high dimensional data, CLIQUE and ProCLUS, frequent pattern based clustering
methods, clustering in non-euclidean space, clustering for streams and parallelism.
Frame Works and Visualization: MapReduce, Hadoop, Pig, Hive, HBase, MapR,
Sharding, NoSQL Databases, S3, Hadoop Distributed File Systems, Visualization: visual
V data analysis techniques, interaction techniques, systems and applications.
Introduction to R - R graphical user interfaces, data import and export, attribute and data 08
types, descriptive statistics, exploratory data analysis, visualization before analysis,
analytics for unstructured data.
Text books and References:
1. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer
2. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press.
3. John Garrett,Data Analytics for IT Networks : Developing Innovative Use Cases, Pearson Education
Curriculum & Evaluation Scheme IT & CSI (V & VI semester) 23
Unit-IV: Frequent Itemsets and
Clustering
1 Mining Frequent Itemsets

Frequent itemset mining is a popular data mining task that involves identifying sets of items that frequently

co-occur in a given dataset. In other words, it involves finding the items that occur together frequently and

then grouping them into sets of items. One way to approach this problem is by using the Apriori algorithm,

which is one of the most widely used algorithms for frequent itemset mining.

S
The Apriori algorithm works by iteratively generating candidate itemsets and then checking their fre-

.R
quency against a minimum support threshold. The algorithm starts by generating all possible itemsets of

size 1 and counting their frequencies in the dataset. The itemsets that meet the minimum support threshold

are then selected as frequent itemsets. The algorithm then proceeds to generate candidate itemsets of size
Dr
2 from the frequent itemsets of size 1 and counts their frequencies. This process is repeated until no more

frequent itemsets can be generated.

4/
However, when dealing with large datasets, this approach can become computationally expensive due
2

to the potentially large number of candidate itemsets that need to be generated and counted. Point-wise
20

frequent itemset mining is a more efficient alternative that can reduce the computational complexity of the

Apriori algorithm by exploiting the sparsity of the dataset.

Point-wise frequent itemset mining works by iterating over the transactions in the dataset and identifying
12

the itemsets that occur in each transaction. For each transaction, the algorithm generates a bitmap vector

where each bit corresponds to an item in the dataset, and its value is set to 1 if the item occurs in the
ril

transaction and 0 otherwise. The algorithm then performs a bitwise AND operation between the bitmap
Ap

vectors of each transaction to identify the itemsets that occur in all the transactions. The itemsets that meet

the minimum support threshold are then selected as frequent itemsets.

The advantage of point-wise frequent itemset mining is that it avoids generating candidate itemsets that

are not present in the dataset, thereby reducing the number of itemsets that need to be generated and

counted. Additionally, point-wise frequent itemset mining can be parallelized, making it suitable for mining

large datasets on distributed systems.

In summary, point-wise frequent itemset mining is an efficient alternative to the Apriori algorithm for

3
frequent itemset mining. It works by iterating over the transactions in the dataset and identifying the

itemsets that occur in each transaction, thereby avoiding the generation of candidate itemsets that are not

present in the dataset.

2 Market Based Modelling

Market-based modeling is a technique used in economics and business to analyze and simulate the behavior of

markets, particularly in relation to the supply and demand of goods and services. This modeling technique

involves creating mathematical models that can simulate how different market participants (consumers,

S
producers, and intermediaries) interact with each other in a market setting.

One of the most common market-based models is the supply and demand model, which assumes that the

.R
price of a good or service is determined by the balance between its supply and demand. In this model, the

price of a good or service will rise if the demand for it exceeds its supply, and will fall if the supply exceeds

the demand. Dr
Another popular market-based model is the game theory model, which is used to analyze how different
4/
participants in a market interact with each other. Game theory models assume that market participants are

rational and act in their own self-interest, and seek to identify the strategies that each participant is likely
02

to adopt in a given situation.

Market-based models can be used to analyze a wide range of economic phenomena, from the pricing

of individual goods and services to the behavior of entire industries and markets. They can also be used
12

to test the potential impact of various policies and interventions on the behavior of markets and market

participants.
ril

Overall, market-based modeling is a powerful tool for understanding and predicting the behavior of

markets and the economy as a whole. By creating mathematical models that simulate the behavior of
Ap

market participants and the interactions between them, economists and business analysts can gain valuable

insights into the workings of markets, and develop strategies for managing and optimizing their performance.

3 Apriori Algorithm

The Apriori algorithm is a popular algorithm used in data mining and machine learning to discover frequent

itemsets in large transactional datasets. It was proposed by Agrawal and Srikant in 1994 and is widely used

4
in association rule mining, market basket analysis, and other data mining applications.

The Apriori algorithm uses a bottom-up approach to generate all frequent itemsets by first identifying

frequent individual items and then using those items to generate larger itemsets. The algorithm works by

performing the following steps:

First, the algorithm scans the entire dataset to identify all individual items and their frequency of

occurrence. This information is used to generate the initial set of frequent itemsets.

Next, the algorithm uses a level-wise search strategy to generate larger itemsets by combining fre-

quent itemsets from the previous level. The algorithm starts with two-itemsets and then progressively

S
generates larger itemsets until no more frequent itemsets can be found.

At each level, the algorithm prunes the search space by eliminating itemsets that cannot be frequent

.R
based on the minimum support threshold. This is done using the Apriori principle, which states that

any subset of a frequent itemset must also be frequent.

Dr
The algorithm terminates when no more frequent itemsets can be generated or when the maximum
4/
itemset size is reached.

Once all frequent itemsets have been identified, the Apriori algorithm can be used to generate association
2

rules that describe the relationships between different items in the dataset. An association rule is a statement
20

of the form X − > Y, where X and Y are itemsets and X is a subset of Y. The rule indicates that there is a

strong relationship between items in X and items in Y.

,
12

The strength of an association rule is measured using two metrics: support and confidence. Support is

the percentage of transactions in the dataset that contain both X and Y, while confidence is the percentage

of transactions that contain Y given that they also contain X.

ril

Overall, the Apriori algorithm is a powerful tool for discovering frequent itemsets and association rules
Ap

in large datasets. By identifying patterns and relationships between different items in the dataset, it can

be used to gain valuable insights into consumer behavior, market trends, and other important business and

economic phenomena.

4 Handling Large Datasets in Main Memory

Handling large datasets in main memory can be a challenging task, as the amount of memory available on

most computer systems is often limited. However, there are several techniques and strategies that can be

5
used to effectively manage and analyze large datasets in main memory:

Use data compression: Data compression techniques can be used to reduce the amount of memory

required to store a dataset. Techniques such as gzip or bzip2 can compress text data, while binary

data can be compressed using libraries like LZ4 or Snappy.

Use data partitioning: Large datasets can be partitioned into smaller, more manageable subsets,

which can be processed and analyzed in main memory. This can be done using techniques such as

horizontal partitioning, vertical partitioning, or hybrid partitioning.

Use data sampling: Data sampling can be used to select a representative subset of data for analysis,

S
without requiring the entire dataset to be loaded into memory. Random sampling, stratified sampling,

.R
and cluster sampling are some of the commonly used sampling techniques.

Use in-memory databases: In-memory databases can be used to store large datasets in main

Dr
memory for faster querying and analysis. Examples of in-memory databases include Apache Ignite,

SAP HANA, and VoltDB.

4/
Use parallel processing: Parallel processing techniques can be used to distribute the processing of
2

large datasets across multiple processors or cores. This can be done using libraries like Apache Spark,
20

which provides distributed data processing capabilities.

Use data streaming: Data streaming techniques can be used to process large datasets in real-time
,
12

by processing data as it is generated, rather than storing it in memory. Apache Kafka, Apache Flink,

and Apache Storm are some of the popular data streaming platforms.
ril

Overall, effective management of large datasets in main memory requires a combination of data compres-
Ap

sion, partitioning, sampling, in-memory databases, parallel processing, and data streaming techniques. By

leveraging these techniques, it is possible to effectively analyze and process large datasets in main memory,

without requiring expensive hardware upgrades or specialized software tools.

5 Limited Pass Algorithm

A limited pass algorithm is a technique used in data processing and analysis to efficiently process large

datasets with limited memory resources.

6
In a limited pass algorithm, the dataset is processed in a fixed number of passes or iterations, where each

pass involves processing a subset of the data. The algorithm ensures that each pass is designed to capture

the relevant information needed for the analysis, while minimizing the memory required to store the data.

For example, a limited pass algorithm for processing a large text file could involve reading the file in chunks

or sections, processing each section in memory, and then discarding the processed data before moving onto

the next section. This approach enables the algorithm to handle large datasets that cannot be loaded entirely

into memory.

Limited pass algorithms are often used in situations where the data cannot be stored in main memory,

or when the processing of the data requires significant computational resources. Examples of applications

S
that use limited pass algorithms include text processing, machine learning, and data mining.

While limited pass algorithms can be useful for processing large datasets with limited memory resources,

.R
they can also be less efficient than algorithms that can process the entire dataset in a single pass. Therefore,

it is important to carefully design the algorithm to ensure that it can capture the relevant information needed
Dr
for the analysis, while minimizing the number of passes required to process the data.
4/
6 Counting Frequent Itemsets in a Stream
2

Counting frequent itemsets in a stream is a problem of finding the most frequent itemsets in a continuous
20

stream of transactions. This problem is commonly known as the Frequent Itemset Mining problem. Here

are the steps involved in counting frequent itemsets in a stream:

,
12

1. Initialize a hash table to store the counts of each itemset. The size of the hash table should be limited

to prevent it from becoming too large.

ril

2. Read each transaction in the stream one at a time.

3. Generate all the possible itemsets from the transaction. This can be done using the Apriori algorithm,

which generates candidate itemsets by combining smaller frequent itemsets.

4. Increment the count of each itemset in the hash table.

5. Prune infrequent itemsets from the hash table. An itemset is infrequent if its count is less than a

predefined threshold.

6. Repeat steps 2-5 for each transaction in the stream.

7
7. Output the frequent itemsets that remain in the hash table after processing all the transactions.

The main challenge in counting frequent itemsets in a stream is to keep track of the changing frequencies

of the itemsets as new transactions arrive. This can be done efficiently using the hash table to store the

counts of the itemsets. However, the hash table can become too large if the number of distinct itemsets is

too large. To prevent this, the hash table can be limited in size by using a hash function that maps each

itemset to a fixed number of hash buckets. The size of the hash table can be adjusted dynamically based on

the number of items and transactions in the stream.

Another challenge in counting frequent itemsets in a stream is to choose the threshold for the minimum

count of an itemset to be considered frequent. The threshold should be set high enough to exclude infrequent

S
itemsets, but low enough to include all the important frequent itemsets. The threshold can be determined

.R
using heuristics or by using machine learning techniques to learn the optimal threshold from the data.

7 Clustering Techniques Dr
Clustering techniques are used to group similar data points together in a dataset based on their similarity
4/
or distance measures. Here are some popular clustering techniques:
2
20

7.1 K-Means Clustering:

This is a popular clustering algorithm that partitions a dataset into K clusters based on the mean dis-
,
12

tance of the data points to their assigned cluster centers. It involves an iterative process of assigning data

points to clusters and updating the cluster centers until convergence. K-Means is commonly used in image

segmentation, marketing, and customer segmentation.

ril
Ap

7.1.1 K-means Clustering algorithm

K-Means clustering is a popular unsupervised machine learning algorithm that partitions a dataset into k

clusters, where k is a pre-defined number of clusters. The algorithm works as follows:

Initialize the k cluster centroids randomly.

Assign each data point to the nearest cluster centroid based on its distance.

Calculate the new cluster centroids based on the mean of all data points assigned to that cluster.

8
Repeat steps 2-3 until the cluster centroids no longer change significantly, or a maximum number of

iterations is reached.

The distance metric used for step 2 is typically the Euclidean distance, but other distance metrics can

be used as well.

The K-Means algorithm aims to minimize the sum of squared distances between each data point and

its assigned cluster centroid. This objective function is known as the within-cluster sum of squares

(WCSS) or the sum of squared errors (SSE).

To determine the optimal number of clusters, a common approach is to use the elbow method. This

S
involves plotting the WCSS or SSE against the number of clusters and selecting the number of clusters

.R
at the ”elbow” point, where the rate of decrease in WCSS or SSE begins to level off.

K-Means is a computationally efficient algorithm that can scale to large datasets. It is particularly useful

Dr
when the data is high-dimensional and traditional clustering algorithms may be too slow. However, K-Means

requires the number of clusters to be pre-defined and may converge to a suboptimal solution if the initial
4/
cluster centroids are not well chosen. It is also sensitive to non-linear data and may not work well with such

data. Here are some of its advantages and disadvantages:

2
, 20
12
ril
Ap

9
Advantages: Disadvantages:

Simple and easy to understand: K-Means is Requires pre-defined number of clusters: K-

easy to understand and implement, making it Means requires the number of clusters to be

a popular choice for clustering tasks. pre-defined, which can be a challenge when the

number of clusters is unknown or difficult to

Fast and scalable: K-Means is a computation-
determine.
ally efficient algorithm that can scale to large

datasets. It is particularly useful when the data Sensitive to initial cluster centers: K-Means is

is high-dimensional and traditional clustering sensitive to the initial placement of cluster cen-

S
algorithms may be too slow. ters and can converge to a suboptimal solution

if the initial centers are not well chosen.

Works well with circular or spherical clusters:

.R
K-Means works well with circular or spherical Can converge to a local minimum: K-Means

clusters, making it suitable for datasets that ex- can converge to a local minimum rather than

hibit these types of shapes.

Dr the global minimum, resulting in a suboptimal

clustering solution.
4/
Provides a clear and interpretable result: K-

Means provides a clear and interpretable clus- Not suitable for non-linear data: K-Means as-
2

tering result, where each data point is assigned sumes that the data is linearly separable and
20

to one of the k clusters. may not work well with non-linear data.
,
12

In summary, K-Means is a simple and fast clustering algorithm that works well with circular or spherical

clusters. However, it requires the number of clusters to be pre-defined and may converge to a suboptimal
ril

solution if the initial cluster centers are not well chosen. It is also sensitive to non-linear data and may not

work well with such data.

7.2 Hierarchical Clustering:

This technique builds a hierarchy of clusters by recursively dividing or merging clusters based on their

similarity. It can be agglomerative (bottom-up) or divisive (top-down). In agglomerative clustering, each

data point starts in its own cluster, and then pairs of clusters are successively merged until all data points

belong to a single cluster. Divisive clustering starts with all data points in a single cluster and recursively

divides them into smaller clusters. Hierarchical clustering is useful in gene expression analysis, social network

10
analysis, and image analysis.

7.3 Density-based Clustering:

This technique identifies clusters based on the density of data points. It assumes that clusters are areas of

higher density separated by areas of lower density. Density-based clustering algorithms, such as DBSCAN

(Density-Based Spatial Clustering of Applications with Noise), group together data points that are closely

packed together and separate outliers. Density-based clustering is commonly used in image processing,

geospatial data analysis, and anomaly detection.

S
7.4 Gaussian Mixture Models:

.R
This technique models the distribution of data points using a mixture of Gaussian probability distributions.

Each component of the mixture represents a cluster, and the algorithm estimates the parameters of the

mixture using the Expectation-Maximization algorithm. Gaussian Mixture Models are commonly used in
Dr
image segmentation, handwriting recognition, and speech recognition.
4/
7.5 Spectral Clustering:
2

This technique converts the data points into a graph and then partitions the graph into clusters based
20

on the eigenvalues and eigenvectors of the graph Laplacian matrix. Spectral clustering is useful in image

segmentation, community detection in social networks, and document clustering.

,
12

Each clustering technique has its own strengths and weaknesses, and the choice of clustering algorithm

depends on the nature of the data, the clustering objective, and the computational resources available.
ril

8 Clustering high-dimensional data

Clustering high-dimensional data is a challenging task because the distance or similarity measures used in

most clustering algorithms become less meaningful in high-dimensional space. Here are some techniques for

clustering high-dimensional data:

11
8.1 Dimensionality Reduction:

High-dimensional data can be transformed into a lower-dimensional space using dimensionality reduction

techniques, such as Principal Component Analysis (PCA) or t-SNE (t-distributed Stochastic Neighbor Em-

bedding). Dimensionality reduction can help to reduce the curse of dimensionality and make the clustering

algorithms more effective.

8.2 Feature Selection:

Not all features in high-dimensional data are equally informative. Feature selection techniques can be used

to identify the most relevant features for clustering and discard the redundant or noisy features. This can

S
help to improve the clustering accuracy and reduce the computational cost.

.R
8.3 Subspace Clustering:

Dr
Subspace clustering is a clustering technique that identifies clusters in subspaces of the high-dimensional

space. This technique assumes that the data points lie in a union of subspaces, each of which represents
4/
a cluster. Subspace clustering algorithms, such as CLIQUE (CLustering In QUEst), identify the subspaces

and clusters simultaneously.

2
20

8.4 Density-Based Clustering:

Density-based clustering algorithms, such as DBSCAN, can be used for clustering high-dimensional data by
,
12

defining the density of data points in each dimension. The clustering algorithm identifies regions of high

density in the multidimensional space, which correspond to clusters.

ril

8.5 Ensemble Clustering:

Ensemble clustering combines multiple clustering algorithms or different parameter settings of the same

algorithm to improve the clustering performance. Ensemble clustering can help to reduce the sensitivity of

the clustering results to the choice of algorithm or parameter settings.

8.6 Deep Learning-Based Clustering:

Deep learning-based clustering techniques, such as Deep Embedded Clustering (DEC) and Autoencoder-

based Clustering (AE-Clustering), use neural networks to learn a low-dimensional representation of high-

12
dimensional data and cluster the data in the reduced space. These techniques have shown promising results in

clustering high-dimensional data in various domains, including image analysis and gene expression analysis.

Clustering high-dimensional data requires careful consideration of the choice of clustering algorithm,

feature selection or dimensionality reduction technique, and parameter settings. A combination of different

techniques may be required to achieve the best clustering performance.

8.7 CLIQUE and ProCLUS

CLIQUE (CLustering In QUEst) and ProCLUS are two popular subspace clustering algorithms for high-

dimensional data.

S
CLIQUE is a density-based algorithm that works by identifying dense subspaces in the data. It assumes

.R
that clusters exist in subspaces of the data that are dense in at least k dimensions, where k is a user-defined

parameter. The algorithm identifies all possible dense subspaces by enumerating all combinations of k

dimensions and checking if the corresponding subspaces are dense. It then merges the overlapping subspaces
Dr
to form clusters. CLIQUE is efficient for high-dimensional data because it only considers a small number of

dimensions at a time.
4/
ProCLUS (PROjective CLUSters) is a subspace clustering algorithm that works by identifying clusters
2

in a low-dimensional projection of the data. It first selects a random projection matrix and projects the data
20

onto a lower-dimensional space. It then uses K-Means clustering to cluster the projected data. The algorithm

iteratively refines the projection matrix and re-clusters the data until convergence. The final clusters are
,

projected back to the original high-dimensional space. ProCLUS is effective for high-dimensional data
12

because it reduces the dimensionality of the data while preserving the clustering structure.

Both CLIQUE and ProCLUS are designed to handle high-dimensional data by identifying clusters in
ril

subspaces of the data. They are effective for clustering data that have a natural subspace structure. However,
Ap

they may not work well for data that do not have a clear subspace structure or when the data points are

widely spread out in the high-dimensional space. It is important to carefully choose the appropriate algorithm

based on the characteristics of the data and the clustering objectives.

9 Frequent pattern-based clustering methods

Frequent pattern-based clustering methods combine frequent pattern mining with clustering techniques to

identify clusters based on frequent patterns in the data. Here are some examples of frequent pattern-based

13
clustering methods:

1. Frequent Pattern-based Clustering: is a clustering algorithm that uses frequent pattern mining to

identify clusters in transactional data. The algorithm first identifies frequent itemsets in the data

using Apriori or FP-Growth algorithms. It then constructs a graph where each frequent itemset is a

node, and the edges represent the overlap between the itemsets. The graph is partitioned into clusters

using a graph clustering algorithm. The resulting clusters are then used to assign objects to clusters

based on their membership in the frequent itemsets.

2. Frequent Pattern-based Clustering Method: is a clustering algorithm that uses frequent pattern mining

S
to identify clusters in high-dimensional data. The algorithm first discretizes the continuous data into

categorical data. It then uses Apriori or FP-Growth algorithms to identify frequent itemsets in the

.R
categorical data. The frequent itemsets are used to construct a binary matrix that represents the

membership of objects in the frequent itemsets. The binary matrix is clustered using a standard

Dr
clustering algorithm, such as K-Means or Hierarchical clustering. The resulting clusters are then used

to assign objects to clusters based on their membership in the frequent itemsets.

4/
3. Clustering based on Frequent Pattern Combination: is a clustering algorithm that combines frequent
2

pattern mining with pattern combination techniques to identify clusters in transactional data. The
20

algorithm first identifies frequent itemsets in the data using Apriori or FP-Growth algorithms. It

then uses pattern combination techniques, such as Minimum Description Length (MDL) or Bayesian
,

Information Criterion (BIC), to generate composite patterns from the frequent itemsets. The composite
12

patterns are then used to construct a graph, which is partitioned into clusters using a graph clustering

algorithm.
ril

Frequent pattern-based clustering methods are effective for identifying clusters based on frequent patterns
Ap

in the data. They can be applied to a wide range of data types, including transactional data and high-

dimensional data. However, these methods may suffer from the curse of dimensionality when applied to

high-dimensional data. It is important to carefully select the appropriate frequent pattern mining and

clustering techniques based on the characteristics of the data and the clustering objectives.

14
10 Clustering in non-Euclidean space

Clustering in non-Euclidean space refers to the clustering of data points that are not represented in the

Euclidean space, such as graphs, time series, or text data. Traditional clustering algorithms, such as K-

Means and Hierarchical clustering, assume that the data points are represented in the Euclidean space and

use distance metrics, such as Euclidean distance or cosine similarity, to measure the similarity between data

points. However, in non-Euclidean spaces, the notion of distance is different, and distance-based clustering

methods may not be suitable.

Here are some approaches for clustering in non-Euclidean spaces:

S
1. Spectral clustering: Spectral clustering is a popular clustering algorithm that can be applied to data

represented in non-Euclidean spaces, such as graphs or time series. It uses the eigenvalues and eigen-

.R
vectors of the Laplacian matrix of the data to identify clusters. Spectral clustering converts the data

points into a graph representation and then computes the Laplacian matrix of the graph. The eigen-
Dr
vectors of the Laplacian matrix are used to embed the data points into a lower-dimensional space,

where clustering is performed using a standard clustering algorithm, such as K-Means or Hierarchical
4/
clustering.
2

2. Density-Based Spatial Clustering of Applications with Noise: is a density-based clustering algorithm

that can be applied to data represented in non-Euclidean spaces. It does not rely on a distance

metric and can cluster data points based on their density. DBSCAN identifies clusters by defining two
,
12

parameters: the minimum number of points required to form a cluster and a radius that determines

the neighborhood of a point. DBSCAN labels each point as either a core point, a border point, or a

noise point, based on its neighborhood. The core points are used to form clusters.
ril

3. Topic modeling: Topic modeling is a clustering method that can be applied to text data, which is
Ap

typically represented in a non-Euclidean space. Topic modeling identifies latent topics in the text data

by analyzing the co-occurrence of words. It represents each document as a distribution over topics,

and each topic as a distribution over words. The resulting topic distribution of each document can be

used to cluster the documents based on their similarity.

Clustering in non-Euclidean spaces requires careful consideration of the appropriate algorithms and tech-

niques that are suitable for the specific data type. Spectral clustering and DBSCAN are effective for clustering

15
data represented as graphs or time series, while topic modeling is suitable for text data. Other approaches,

such as manifold learning and kernel methods, can also be used for clustering in non-Euclidean spaces.

11 Clustering for streams and parallelism

Clustering for streams and parallelism are two important considerations for clustering large datasets. Stream

data refers to data that arrives continuously and in real-time, while parallelism refers to the ability to

distribute the clustering task across multiple computing resources.

Here are some approaches for clustering streams and parallelism:

S
1. Online clustering: Online clustering is a technique that can be applied to streaming data. It updates

.R
the clustering model continuously as new data arrives. Online clustering algorithms, such as BIRCH

and CluStream, are designed to handle data streams and can scale to large datasets. These algo-

rithms incrementally update the cluster model as new data arrives and discard outdated data points
Dr
to maintain the cluster model’s accuracy and efficiency.
4/
2. Parallel clustering: Parallel clustering refers to the use of multiple computing resources, such as multiple

processors or computing clusters, to speed up the clustering process. Parallel clustering algorithms,
2

such as K-Means Parallel, Hierarchical Parallel, and DBSCAN Parallel, distribute the clustering task
20

across multiple computing resources. These algorithms partition the data into smaller subsets and

assign each subset to a separate computing resource. The resulting clusters are then merged to produce
,
12

the final clustering result.

3. Distributed clustering: Distributed clustering refers to the use of multiple computing resources that
ril

are distributed across different physical locations, such as different data centers or cloud resources.
Ap

Distributed clustering algorithms, such as MapReduce and Hadoop, distribute the clustering task

across multiple computing resources and handle data that is too large to fit into a single computing

resource’s memory. These algorithms partition the data into smaller subsets and assign each subset to

a separate computing resource. The resulting clusters are then merged to produce the final clustering

result.

Clustering for streams and parallelism requires careful consideration of the appropriate algorithms and

techniques that are suitable for the specific clustering objectives and data types. Online clustering is effective

16
for clustering streaming data, while parallel clustering and distributed clustering can speed up the clustering

process for large datasets.

Q1: Write R function to check whether the given number is prime or not.

# Program to check if the input number is prime or not

# take input from the user

num = as.integer(readline(prompt=”Enter a number: ”))

flag = 0

# prime numbers are greater than 1 if(num ¿ 1)

S
# check for factors flag = 1

for(i in 2:(num-1)) {

.R
if ((num %% i) == 0)

flag = 0

break

}
Dr
4/
}
2

}
20

if(num == 2) flag = 1

if(flag == 1)
,

print(paste(num,”is a prime number”))

else

print(paste(num,”is not a prime number”))

ril
Ap

17
Apriori algorithm:—The apriori algorithm solves the frequent item sets problem. The algorithm ana-

lyzes a data set to determine which combinations of items occur together frequently. The Apriori algorithm

is at the core of various algorithms for data mining problems. The best known problem is finding the asso-

ciation rules that hold in a basket -item relation.

Numerical:

Given:

Support = 60% = 60/100 ∗ 5 = 3

Conf idence = 70%

S
ITERATION:1
STEP 1: (C1)
ITERATION 2:

.R
Itemsets Counts STEP 3: (C2)
A 1 Itemsets Counts
C 2 STEP 2: (L2) STEP 4: (L2)
E, K 4
D 1 ITERATION 3:
E

K
I
4
1
5
Itemsets Counts
E
K
4
5
E, M
E, O
E, Y
Dr
2
3
2
Itemsets Counts
E, K
E, O
4
3
STEP 5: (C3)
Itemsets Counts
E, K, O 3
M 3 K, M 3 K, M 3
4/
M 3 K, M, O 1
O 3 K, O 3 K, O 3
N 2 K, M, Y 2
Y 3 K, Y 3 K, Y 3
O 3
2

M, O 1
U 1
M, Y 2
20

Y 3
O, Y 2
STEP 6: (L3)
Itemsets Counts
,

E, K, O 3
12

Now, stop since no more combinations can be made in L3.

ril

ASSOCIATION RULE:
Ap

1. [E, K] → O = 3/4 = 75%

2. [K, O] → E = 3/3 = 100%

3. [E, O] → K = 3/3 = 100%

4. E → [K, O] = 3/4 = 75%

18
5. K → [E, O] = 3/5 = 60%

6. O → [E, K] = 3/3 = 100%

Therefore, Rule no. 5 is discarded because confidence ≥ 70%

So, Rule 1,2,3,4,6 are selected.

S
.R
Dr
2 4/
, 20
12
ril
Ap

19
Printed Page: 1 of 2
Subject Code: KIT601
0Roll No: 0 0 0 0 0 0 0 0 0 0 0 0 0

BTECH
(SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS

Time: 3 Hours Total Marks: 100

Note: Attempt all Sections. If you require any missing data, then choose suitably.

SECTION A

1. Attempt all questions in brief. 2*10 = 20

Qno Questions CO
(a) Discuss the need of data analytics. 1
(b) Give the classification of data. 1
(c) Define neural network. 2

S
(d) What is multivariate analysis? 2
(e) Give the full form of RTAP and discuss its application. 3

.R
(f) What is the role of sampling data in a stream? 3
(g) Discuss the use of limited pass algorithm. 4
(h) What is the principle behind hierarchical clustering technique? 4
(i)
(j)
Dr
List five R functions used in descriptive statistics.
List the names of any 2 visualization tools.
5
5
4/
SECTION B
2

2. Attempt any three of the following: 10*3 = 30

Qno Questions CO
(a) Explain the process model and computation model for Big data 1
,

platform.
12

(b) Explain the use and advantages of decision trees. 2

(c) Explain the architecture of data stream model. 3
(d) Illustrate the K-means algorithm in detail with its advantages. 4
ril

(e) Differentiate between NoSQL and RDBMS databases. 5

SECTION C
3. Attempt any one part of the following: 10*1 = 10
Qno Questions CO
(a) Explain the various phases of data analytics life cycle. 1
(b) Explain modern data analytics tools in detail. 1

4. Attempt any one part of the following: 10 *1 = 10

Qno Questions CO
(a) Compare various types of support vector and kernel methods of data 2
analysis.
(b) Given data= {2,3,4,5,6,7;1,5,3,6,7,8}. Compute the principal 2
component using PCA algorithm.
Printed Page: 2 of 2
Subject Code: KIT601
0Roll No: 0 0 0 0 0 0 0 0 0 0 0 0 0

BTECH
(SEM VI) THEORY EXAMINATION 2021-22
DATA ANALYTICS

5. Attempt any one part of the following: 10*1 = 10

Qno Questions CO
(a) Explain any one algorithm to count number of distinct elements in a 3
data stream.
(b) Discuss the case study of stock market predictions in detail. 3

6. Attempt any one part of the following: 10*1 = 10

Qno Questions CO
(a) Differentiate between CLIQUE and ProCLUS clustering. 4

S
(b) A database has 5 transactions. Let min_sup=60% and min_conf=80%. 4

.R
TID Items_Bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 Dr {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
2 4/

i) Find all frequent itemsets using Apriori algorithm.

ii) List all the strong association rules (with support s and confidence
c).
,
12

7. Attempt any one part of the following: 10*1 = 10

Qno Questions CO
ril

(a) Explain the HIVE architecture with its features in detail. 5

(b) Write R function to check whether the given number is prime or not. 5
Ap
Ap
ril
12
, 20
2 4/
Dr
.R
S
Ap
ril
12
, 20
2 4/
Dr
.R
S
Appendix [17]: Additional Study Material For
Numerical Perspectives

S
.R
Dr
2 4/
, 20
12
ril
Ap

24
DATABASE
SYSTEMS
What is Frequent Itemset Mining?
GROUP

Frequent Itemset Mining:

Finding frequent patterns, associations, correlations, or causal structures
among sets of items or objects in transaction databases, relational
databases, and other information repositories.
• Given:
– A set of items 𝐼 = {𝑖1 , 𝑖2 , … , 𝑖𝑚 }
– A database of transactions 𝐷, where a transaction 𝑇 ⊆ 𝐼 is a set of items
• Task 1: find all subsets of items that occur together in many
transactions.
– E.g.: 85% of transactions contain the itemset {milk, bread, butter}

S
• Task 2: find all rules that correlate the presence of one set of items with
that of another set of items in the transaction database.

.R
– E.g.: 98% of people buying tires and auto accessories also get automotive service
done
• Applications: Basket data analysis, cross-marketing, catalog design,
Dr
loss-leader analysis, clustering, classification, recommendation systems,
etc.
4/
Frequent Itemset Mining  Introduction 3
2
20

DATABASE
SYSTEMS
Example: Basket Data Analysis
,

GROUP
12

• Transaction database
D= {{butter, bread, milk, sugar};
ril

{butter, flour, milk, sugar};

{butter, eggs, milk, salt};
Ap

{eggs};
{butter, flour, milk, salt, sugar}}

• Question of interest: items frequency

{butter} 4
– Which items are bought together frequently? {milk} 4
{butter, milk} 4
{sugar} 3
• Applications {butter, sugar} 3
– Improved store layout {milk, sugar} 3
{butter, milk, sugar} 3
– Cross marketing
{eggs} 2
– Focused attached mailings / add-on sales …
– *  Maintenance Agreement
(What the store should do to boost Maintenance Agreement sales)
– Home Electronics  * (What other products should the store stock up?)

Frequent Itemset Mining  Introduction 4

DATABASE
SYSTEMS
Chapter 3: Frequent Itemset Mining
GROUP

1) Introduction
– Transaction databases, market basket data analysis
2) Mining Frequent Itemsets
– Apriori algorithm, hash trees, FP-tree
3) Simple Association Rules
– Basic notions, rule generation, interestingness measures
4) Further Topics
– Hierarchical Association Rules
• Motivation, notions, algorithms, interestingness

S
– Quantitative Association Rules
• Motivation, basic idea, partitioning numerical attributes, adaptation of

.R
apriori algorithm, interestingness
5) Extensions and Summary
Dr
4/
Outline 5
2

Mining Frequent Itemsets: Basic

Notions
DATABASE
SYSTEMS
,

GROUP
12

 Items 𝐼 = {𝑖1 , 𝑖2 , … , 𝑖𝑚 } : a set of literals (denoting items)

• Itemset 𝑋: Set of items 𝑋 ⊆ 𝐼
ril

• Database 𝐷: Set of transactions 𝑇, each transaction is a set of items T ⊆

𝐼
• Transaction 𝑇 contains an itemset 𝑋: 𝑋 ⊆ 𝑇
• The items in transactions and itemsets are sorted lexicographically:
– itemset 𝑋 = (𝑥1, 𝑥2, … , 𝑥𝑘 ), where 𝑥1  𝑥2  …  𝑥𝑘
• Length of an itemset: number of elements in the itemset
• k-itemset: itemset of length k
• The support of an itemset X is defined as: 𝑠𝑢𝑝𝑝𝑜𝑟𝑡 𝑋 = 𝑇 ∈ 𝐷|𝑋 ⊆ 𝑇
• Frequent itemset: an itemset X is called frequent for database 𝐷 iff it is
contained in more than 𝑚𝑖𝑛𝑆𝑢𝑝 many transactions: 𝑠𝑢𝑝𝑝𝑜𝑟𝑡(𝑋) ≥
𝑚𝑖𝑛𝑆𝑢𝑝

• Goal 1: Given a database 𝐷and a threshold 𝑚𝑖𝑛𝑆𝑢𝑝 , find all frequent

itemsets X ∈ 𝑃𝑜𝑡(𝐼).

Frequent Itemset Mining  Algorithms 6

DATABASE
SYSTEMS
Mining Frequent Itemsets: Basic Idea
GROUP

• Naïve Algorithm
– count the frequency of all possible subsets of 𝐼 in the database
 too expensive since there are 2m such itemsets for |𝐼| = 𝑚 items
cardinality of power set
• The Apriori principle (anti-monotonicity):
Any non-empty subset of a frequent itemset is frequent, too!
A ⊆ I with support A ≥ minSup ⇒ ∀A′ ⊂ A ∧ A′ ≠ ∅: support A′ ≥ minSup
Any superset of a non-frequent itemset is non-frequent, too!
A ⊆ I with support A < minSup ⇒ ∀A′ ⊃ A: support A′ < minSup
ABCD not frequent

S
• Method based on the apriori principle ABC ABD ACD BCD
– First count the 1-itemsets, then the 2-itemsets,

.R
AB AC AD BC BD CD
then the 3-itemsets, and so on
A B C D
– When counting (k+1)-itemsets, only consider those
(k+1)-itemsets where all subsets of length k have been
Dr Ø
determined as frequent in the previous step
4/
Frequent Itemset Mining  Algorithms  Apriori Algorithm 7
2
20

DATABASE
SYSTEMS
The Apriori Algorithm
,

GROUP
12

variable Ck: candidate itemsets of size k

ril

variable Lk: frequent itemsets of size k

L1 = {frequent items}
for (k = 1; Lk !=; k++) do begin
// JOIN STEP: join Lk with itself to produce Ck+1
produce // PRUNE STEP: discard (k+1)-itemsets from Ck+1 that
candidates contain non-frequent k-itemsets as subsets
Ck+1 = candidates generated from Lk

for each transaction t in database do

prove Increment the count of all candidates in Ck+1
candidates that are contained in t
Lk+1 = candidates in Ck+1 with min_support
return k Lk

Frequent Itemset Mining  Algorithms  Apriori Algorithm 8

DATABASE
SYSTEMS
Generating Candidates (Join Step)
GROUP

• Requirements for set of all candidate 𝑘 + 1 -itemsets 𝐶𝑘+1

– Completeness:
Must contain all frequent 𝑘 + 1 -itemsets (superset property 𝐶𝑘+1  𝐿𝑘+1 )
– Selectiveness:
Significantly smaller than the set of all 𝑘 + 1 -subsets
– Suppose the items are sorted by any order (e.g., lexicograph.)
• Step 1: Joining (𝐶𝑘+1 = 𝐿𝑘 ⋈ 𝐿𝑘 )
– Consider frequent 𝑘-itemsets 𝑝 and 𝑞
– 𝑝 and 𝑞 are joined if they share the same first 𝑘 − 1 items

S
p  Lk=3 (A, C, F)
insert into Ck+1

.R
select p.i1, p.i2, …, p.ik–1, p.ik, q.ik (A, C, F, G)  Ck+1=4
from Lk : p, Lk : q Dr q  Lk=3 (A, C, G)
where p.i1=q.i1, …, p.ik –1 =q.ik–1, p.ik < q.ik
4/
Frequent Itemset Mining  Algorithms  Apriori Algorithm 9
2
20

DATABASE
SYSTEMS
Generating Candidates (Prune Step)
,

GROUP
12

• Step 2: Pruning (𝐿𝑘+1 = {X ∈ 𝐶𝑘+1 |𝑠𝑢𝑝𝑝𝑜𝑟𝑡 𝑋 ≥ 𝑚𝑖𝑛𝑆𝑢𝑝} )

ril

– Naïve: Check support of every itemset in 𝐶𝑘+1  inefficient for huge 𝐶𝑘+1
– Instead, apply Apriori principle first: Remove candidate (k+1) -itemsets
Ap

which contain a non-frequent k -subset s, i.e., s  Lk

forall itemsets c in Ck+1 do
forall k-subsets s of c do
if (s is not in Lk) then delete c from Ck+1
• Example 1
– L3 = {(ACF), (ACG), (AFG), (AFH), (CFG)}
– Candidates after the join step: {(ACFG), (AFGH)}
– In the pruning step: delete (AFGH) because (FGH)  L3, i.e., (FGH) is not a
frequent 3-itemset; also (AGH)  L3
 C4 = {(ACFG)}  check the support to generate L4

Frequent Itemset Mining  Algorithms  Apriori Algorithm 10

DATABASE
SYSTEMS
Apriori Algorithm – Full Example
GROUP

minSup=0.5 C1 itemset count L1 itemset count

database D {1} 3 {1} 3
TID items scan D {2} 2 {2} 2
100 1346 {3} 3 {3} 3
200 235 {4} 1 {5} 3 𝐿1 ⋈ 𝐿1
300 1235 {5} 3 {6} 2
400 156 {6} 2
C2 itemset C2 itemset C2 itemset count L2 itemset count
{1 2} {1 2} {1 2} 1 {1 3} 2
{1 3} prune C1 {1 3} scan D {1 3} 2 {1 5} 2
{1 5} {1 5} {1 5} 2 {1 6} 2
{1 6} {1 6} {1 6} 2 {2 3} 2
{2 3} {2 3} {2 3} 2 {2 5} 2
{2 5} {2 5} {2 5} 2 {3 5} 2
{2 6} {2 6} {2 6} 0

S
{3 5} {3 5} {3 5} 2
{3 6} {3 6} {3 6} 1 𝐿2 ⋈ 𝐿2

.R
{5 6} {5 6} {5 6} 1
C3 itemset C3 itemsetC3 itemset count L3 itemset count
{1 3 5} {1 3 5} {1 3 5} 1 {2 3 5} 2
{1 3 6} prune C2 {1 3 6} ✗ scan D {2 3 5} 2
{1 5 6} {1 5 6} ✗
Dr 𝐿3 ⋈ 𝐿3
{2 3 5} {2 3 5}
4/
C4 is empty
Frequent Itemset Mining  Algorithms  Apriori Algorithm 11
2

How to Count Supports of

Candidates?
DATABASE
SYSTEMS
,

GROUP
12

• Why is counting supports of candidates a problem?

ril

– The total number of candidates can be very huge

– One transaction may contain many candidates
Ap

• Method: Hash-Tree
– Candidate itemsets are stored in a hash-tree
– Leaf nodes of hash-tree contain lists of itemsets and their support (i.e.,
counts)
– Interior nodes contain hash tables
– Subset function finds all the candidates contained in a transaction
012
012 012 012
e.g. for 3-Itemsets
h(K) = K mod 3 (3 6 7) 0 1 2 (3 5 7) (7 9 12) (1 4 11) (7 8 9) (2 3 8) 0 1 2 (2 5 6)
(3 5 11) (1 6 11) (1 7 9) (1 11 12) (5 6 7) (2 5 7)
(5 8 11)
(3 4 15) (3 7 11) (2 4 6) (2 4 7)
(3 4 11) (2 7 9) (5 7 10)
(3 4 8)

Frequent Itemset Mining  Algorithms  Apriori Algorithm 12

DATABASE
SYSTEMS
Hash-Tree – Construction
GROUP

• Searching for an itemset

– Start at the root (level 1)
– At level d: apply the hash function h to the d-th item in the itemset
• Insertion of an itemset
– search for the corresponding leaf node, and insert the itemset into that leaf
– if an overflow occurs:
• Transform the leaf node into an internal node
• Distribute the entries to the new leaf nodes according to the hash
function
012

S
012 012 012

.R
for 3-Itemsets
(3 6 7) 0 1 2 (3 5 7) (7 9 12) (1 4 11) (7 8 9) (2 3 8) 0 1 2 (2 5 6)
h(K) = K mod 3 (3 5 11) (1 6 11) (1 7 9) (1 11 12) (5 6 7) (2 5 7)
(5 8 11)
(3 4 15) (3 7 11)
(3 4 11)
(3 4 8)
Dr (2 4 6) (2 4 7)
(2 7 9) (5 7 10)
4/
Frequent Itemset Mining  Algorithms  Apriori Algorithm 13
2
20

DATABASE
SYSTEMS
Hash-Tree – Counting
,

GROUP
12

• Search all candidate itemsets contained in a transaction T = (t1 t2 ... tn) for a
current itemset length of k
ril

• At the root
Ap

– Determine the hash values for each item t1 t2 ... tn-k+1 in T

– Continue the search in the resulting child nodes
• At an internal node at level d (reached after hashing of item 𝑡𝑖 )
– Determine the hash values and continue the search for each item 𝑡𝑗 with 𝑖 < 𝑗 ≤ 𝑛 −
𝑘+𝑑
• At a leaf node
– Check whether the itemsets in the leaf node are contained in transaction T
012
in our example n=5 and k=3 3 1,7
h(K) = K mod 3 012 012 012
9 7 3,9 7
Transaction (1, 3, 7, 9, 12) (3 6 7) 0 1 2 (3 5 7) (7 9 12) (1 4 11) (7 8 9) (2 3 8) 0 1 2 (2 5 6)
(3 5 11) (1 6 11) (1 7 9) (1 11 12) (5 6 7) (2 5 7)
(5 8 11)
9,12
Tested leaf nodes (3 4 15) (3 7 11) (2 4 6) (2 4 7)
(3 4 11) (2 7 9) (5 7 10)
Pruned subtrees (3 4 8)

Frequent Itemset Mining  Algorithms  Apriori Algorithm 14

Is Apriori Fast Enough? —
Performance Bottlenecks
DATABASE
SYSTEMS
GROUP

• The core of the Apriori algorithm:

– Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets
– Use database scan and pattern matching to collect counts for the candidate
itemsets
• The bottleneck of Apriori: candidate generation
– Huge candidate sets:
• 104 frequent 1-itemsets will generate 107 candidate 2-itemsets
• To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one
needs to generate 2100  1030 candidates.
– Multiple scans of database:

S
• Needs n or n+1 scans, n is the length of the longest pattern

.R
 Is it possible to mine the complete set of frequent itemsets without
candidate generation?
Dr
4/
Frequent Itemset Mining  Algorithms  Apriori Algorithm 15
2

Mining Frequent Patterns Without

Candidate Generation
DATABASE
SYSTEMS
,

GROUP
12

• Compress a large database into a compact, Frequent-Pattern tree (FP-

tree) structure
ril

– highly condensed, but complete for frequent pattern mining

– avoid costly database scans

• Develop an efficient, FP-tree-based frequent pattern mining method

– A divide-and-conquer methodology: decompose mining tasks into smaller
ones
– Avoid candidate generation: sub-database test only!

• Idea:
– Compress database into FP-tree, retaining the itemset association
information
– Divide the compressed database into conditional databases, each associated
with one frequent item and mine each such database separately.

Frequent Itemset Mining  Algorithms  FP-Tree 16

Construct FP-tree from a Transaction
DB
DATABASE
SYSTEMS
GROUP

Steps for compressing the database into a FP-tree:

1. Scan DB once, find frequent 1-itemsets (single items)
2. Order frequent items in frequency descending order

TID items bought

100 {f, a, c, d, g, i, m, p}
200 {a, b, c, f, l, m, o}
300 {b, f, h, j, o}
400 {b, c, k, s, p}
500 {a, f, c, e, l, p, m, n}

S
header table:
1&2 item frequency

.R
f 4
c 4 sort items in the order
a 3 of descending support
minSup=0.5 b 3 Dr
m 3
p 3
4/
Frequent Itemset Mining  Algorithms  FP-Tree 17
2

Construct FP-tree from a Transaction

DB
DATABASE
SYSTEMS
,

GROUP
12

Steps for compressing the database into a FP-tree:

ril

1. Scan DB once, find frequent 1-itemsets (single items)

2. Order frequent items in frequency descending order
Ap

3. Scan DB again, construct FP-tree starting with most frequent item per transaction
TID items bought (ordered) frequent
items for each transaction only
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p} keep its frequent items
200 {a, b, c, f, l, m, o} {f, c, a, b, m} sorted in descending
300 {b, f, h, j, o} {f, b} order of their frequencies
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
header table:
item frequency 3a
1&2 f 4
for each transaction build a path in the FP-tree:
c 4
a 3 - If a path with common prefix exists:
b 3 increment frequency of nodes on this path
m 3 and append suffix
p 3 - Otherwise: create a new branch
Frequent Itemset Mining  Algorithms  FP-Tree 18
Construct FP-tree from a Transaction
DB
DATABASE
SYSTEMS
GROUP

Steps for compressing the database into a FP-tree:

1. Scan DB once, find frequent 1-itemsets (single items)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree starting with most frequent item per transaction
TID items bought (ordered) frequent 3b
items {}
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o} {f, b} f:4 c:1
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p} c:3 b:1 b:1

S
header table: 3a
1&2
item frequency head

.R
a:3 p:1
f 4
header table c 4
references the a 3 m:2 b:1
occurrences of the
frequent items in the
b
m
3
3
Dr p:2 m:1
FP-tree
4/
p 3
Frequent Itemset Mining  Algorithms  FP-Tree 19
2
20

DATABASE
SYSTEMS
Benefits of the FP-tree Structure
,

GROUP
12

• Completeness:
ril

– never breaks a long pattern of any transaction

– preserves complete information for frequent pattern mining
Ap

• Compactness
– reduce irrelevant information—infrequent items are gone
– frequency descending ordering: more frequent items are more likely to be
shared
– never be larger than the original database (if not count node-links and
counts)
– Experiments demonstrate compression ratios over 100

Frequent Itemset Mining  Algorithms  FP-Tree 20

Mining Frequent Patterns Using
FP-tree
DATABASE
SYSTEMS
GROUP

• General idea (divide-and-conquer)

– Recursively grow frequent pattern path using the FP-tree
• Method
– For each item, construct its conditional pattern-base (prefix paths), and then
its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree …
– …until the resulting FP-tree is empty, or it contains only one path (single
path will generate all the combinations of its sub-paths, each of which is a
frequent pattern)

S
.R
Dr
4/
Frequent Itemset Mining  Algorithms  FP-Tree 21
2
20

DATABASE
SYSTEMS
Major Steps to Mine FP-tree
,

GROUP
12

1) Construct conditional pattern base for each node in the FP-tree

ril

2) Construct conditional FP-tree from each conditional pattern-base

3) Recursively mine conditional FP-trees and grow frequent patterns
Ap

obtained so far
– If the conditional FP-tree contains a single path, simply enumerate all the
patterns

Frequent Itemset Mining  Algorithms  FP-Tree 22

Major Steps to Mine FP-tree:
Conditional Pattern Base
DATABASE
SYSTEMS
GROUP

1) Construct conditional pattern base for each node in the FP-tree

– Starting at the frequent header table in the FP-tree
– Traverse FP-tree by following the link of each frequent item (dashed lines)
– Accumulate all of transformed prefix paths of that item to form a conditional
pattern base
• For each item its prefixes are regarded as condition for it being a suffix. These
prefixes form the conditional pattern base. The frequency of the prefixes can be
read in the node of the item.
{}
header table:
item frequency head f:4 c:1
conditional pattern base:

S
f 4 item cond. pattern base
c 4 c:3 b:1 b:1 f {}

.R
a 3 c f:3, {}
b 3 a:3 p:1 a fc:3
m 3 b fca:1, f:1, c:1
p 3
m:2 b:1 Dr m
p
fca:2, fcab:1
fcam:2, cb:1
p:2 m:1
4/
Frequent Itemset Mining  Algorithms  FP-Tree 23
2

Properties of FP-tree for Conditional

Pattern Bases
DATABASE
SYSTEMS
,

GROUP
12

• Node-link property
ril

– For any frequent item ai, all the possible frequent patterns that contain ai
can be obtained by following ai's node-links, starting from ai's head in the
Ap

FP-tree header
• Prefix path property
– To calculate the frequent patterns for a node ai in a path P, only the prefix
sub-path of ai in P needs to be accumulated, and its frequency count should
carry the same count as node ai.

Frequent Itemset Mining  Algorithms  FP-Tree 24

Major Steps to Mine FP-tree:
Conditional FP-tree
DATABASE
SYSTEMS
GROUP

1) Construct conditional pattern base for each node in the FP-tree ✔

2) Construct conditional FP-tree from each conditional pattern-base
– The prefix paths of a suffix represent the conditional basis.
They can be regarded as transactions of a database.
– Those prefix paths whose support ≥ minSup, induce a conditional FP-tree
– For each pattern-base
• Accumulate the count for each item in the base
• Construct the FP-tree for the frequent items of the pattern base

conditional pattern base: m-conditional FP-tree

S
item cond. pattern base item frequency {}|m
f {} f 3 ..

.R
c f:3 c 3 ..
a fc:3 f:3
a 3 ..
b fca:1, f:1, c:1 b 1✗
m
p
fca:2, fcab:1
fcam:2, cb:1
Dr c:3

a:3
4/
Frequent Itemset Mining  Algorithms  FP-Tree 25
2

Major Steps to Mine FP-tree:

Conditional FP-tree
DATABASE
SYSTEMS
,

GROUP
12

1) Construct conditional pattern base for each node in the FP-tree ✔

ril

2) Construct conditional FP-tree from each conditional pattern-base

conditional pattern base:

item cond. pattern base
f {}
c f:3
a fc:3
b fca:1, f:1, c:1
m fca:2, fcab:1
p fcam:2, cb:1

{}|f = {} {}|c {}|a {}|b = {} {}|m {}|p

f:3 f:3 f:3 c:3

c:3 c:3

a:3
Frequent Itemset Mining  Algorithms  FP-Tree 26
DATABASE
SYSTEMS
Major Steps to Mine FP-tree
GROUP

1) Construct conditional pattern base for each node in the FP-tree ✔

2) Construct conditional FP-tree from each conditional pattern-base ✔
3) Recursively mine conditional FP-trees and grow frequent patterns
obtained so far
– If the conditional FP-tree contains a single path, simply enumerate all the
patterns (enumerate all combinations of sub-paths)

example:
m-conditional FP-tree All frequent patterns
{}|m concerning m

S
m,
f:3 just a single path
fm, cm, am,

.R
c:3 fcm, fam, cam,
fcam
a:3 Dr
4/
Frequent Itemset Mining  Algorithms  FP-Tree 27
2
20

DATABASE
SYSTEMS
FP-tree: Full Example
,

GROUP
12

database:
TID items bought (ordered) frequent items
ril

100 {b, c, f} {f, b, c}

200 {a, b, c} {b, c}
Ap

300 {d, f} {f}

400 {b, c, e, f} {f, b, c}
500 {f, g} {f} {}

minSup=0.4 header table:

item frequency head f:4 b:1
f 4
b 3
c 3 b:2 c:1

c:2
conditional pattern base:
item cond. pattern base
f {}
b f:2, {}
c fb:2, b:1

Frequent Itemset Mining  Algorithms  FP-Tree 28

DATABASE
SYSTEMS
FP-tree: Full Example
GROUP

{} conditional pattern base 1:

item cond. pattern base
f {}
f:4 b:1
b f:2
c fb:2, b:1
b:2 c:1

c:2

{}|c conditional pattern base 2:

item cond. pattern base
{}|f = {} {}|b b f:2
f:2 b:1 f {}

S
f:2 b:2

.R
{{f}}
{{b},{fb}}

Dr {}|fc = {}

{{fc}}
{}|bc
{{bc},{fbc}}
f:2
4/
Frequent Itemset Mining  Algorithms  FP-Tree 29
2

Principles of Frequent Pattern

Growth
DATABASE
SYSTEMS
,

GROUP
12

• Pattern growth property

– Let  be a frequent itemset in DB, B be 's conditional pattern base, and 
ril

be an itemset in B. Then    is a frequent itemset in DB iff  is frequent

in B.

• “abcdef ” is a frequent pattern, if and only if

– “abcde ” is a frequent pattern, and
– “f ” is frequent in the set of transactions containing “abcde ”

Frequent Itemset Mining  Algorithms  FP-Tree 30

Why Is Frequent Pattern Growth
Fast?
DATABASE
SYSTEMS
GROUP

• Performance study in [Han, Pei&Yin ’00] shows

Data set T25I20D10K:
– FP-growth is an order of 100
T 25 avg. length of transactions
magnitude faster than Apriori, 90 I 20 avg. length of frequent itemsets
D 10K database size (#transactions)
and is also faster than 80

70
tree-projection

Run time(sec.)
60
50 D1 FP-grow th runtime
40 D1 Apriori runtime

30
20
10
0

• Reasoning
0 0,5 1 1,5 2 2,5 3
Support threshold(%)

S
– No candidate generation, no candidate test

.R
• Apriori algorithm has to proceed breadth-first
– Use compact data structure
– Eliminate repeated database scan
– Basic operation is counting and FP-tree building
Dr
4/
Frequent Itemset Mining  Algorithms  FP-Tree 31
2
20

DATABASE
SYSTEMS
Maximal or Closed Frequent Itemsets
,

GROUP
12

• Big challenge: database contains potentially a huge number of frequent

itemsets (especially if minSup is set too low).
ril

– A frequent itemset of length 100 contains 2100-1 many frequent subsets

• Closed frequent itemset:

An itemset X is closed in a data set D if there exists no proper super-
itemset Y such that 𝑠𝑢𝑝𝑝𝑜𝑟𝑡(𝑋) = 𝑠𝑢𝑝𝑝𝑜𝑟𝑡(𝑌) in D.
– The set of closed frequent itemsets contains complete information regarding
its corresponding frequent itemsets.
• Maximal frequent itemset:
An itemset X is maximal in a data set D if there exists no proper super-
itemset Y such that 𝑠𝑢𝑝𝑝𝑜𝑟𝑡 𝑌 ≥ 𝑚𝑖𝑛𝑆𝑢𝑝 in D.
– The set of maximal itemsets does not contain the complete support
information
– More compact representation

Frequent Itemset Mining  Algorithms  Maximal or Closed Frequent Itemsets 32

DATABASE
SYSTEMS
Chapter 3: Frequent Itemset Mining
GROUP

S
– Quantitative Association Rules
• Motivation, basic idea, partitioning numerical attributes, adaptation of

.R
apriori algorithm, interestingness
5) Extensions and Summary
Dr
4/
Outline 33
2

Simple Association Rules:

Introduction
DATABASE
SYSTEMS
,

GROUP
12

• Transaction database:
ril

D= {{butter, bread, milk, sugar};

{butter, flour, milk, sugar};
Ap

{butter, eggs, milk, salt};

{eggs};
{butter, flour, milk, salt, sugar}}
• Frequent itemsets: items support
{butter} 4
{milk} 4
{butter, milk} 4
{sugar} 3
{butter, sugar} 3
{milk, sugar} 3
{butter, milk, sugar} 3
• Question of interest:
– If milk and sugar are bought, will the customer always buy butter as well?
𝑚𝑖𝑙𝑘, 𝑠𝑢𝑔𝑎𝑟 ⇒ 𝑏𝑢𝑡𝑡𝑒𝑟 ?
– In this case, what would be the probability of buying butter?

Frequent Itemset Mining  Simple Association Rules 34

Simple Association Rules: Basic
Notions
DATABASE
SYSTEMS
GROUP

 Items 𝐼 = {𝑖1 , 𝑖2 , … , 𝑖𝑚 } : a set of literals (denoting items)

• Itemset 𝑋: Set of items 𝑋 ⊆ 𝐼
• Database 𝐷: Set of transactions 𝑇, each transaction is a set of items T ⊆ 𝐼
• Transaction 𝑇 contains an itemset 𝑋: 𝑋 ⊆ 𝑇
• The items in transactions and itemsets are sorted lexicographically:
– itemset 𝑋 = (𝑥1, 𝑥2, … , 𝑥𝑘 ), where 𝑥1  𝑥2  …  𝑥𝑘
• Length of an itemset: cardinality of the itemset (k-itemset: itemset of length
k)
• The support of an itemset X is defined as: 𝑠𝑢𝑝𝑝𝑜𝑟𝑡 𝑋 = 𝑇 ∈ 𝐷|𝑋 ⊆ 𝑇
• Frequent itemset: an itemset X is called frequent iff 𝑠𝑢𝑝𝑝𝑜𝑟𝑡(𝑋) ≥ 𝑚𝑖𝑛𝑆𝑢𝑝

S
.R
• Association rule: An association rule is an implication of the form 𝑋 ⇒ 𝑌
where 𝑋, 𝑌 ⊆ 𝐼 are two itemsets with 𝑋 ∩ 𝑌 = ∅.

Dr
• Note: simply enumerating all possible association rules is not reasonable!
 What are the interesting association rules w.r.t. 𝐷?
4/
Frequent Itemset Mining  Simple Association Rules 35
2
20

DATABASE
SYSTEMS
Interestingness of Association Rules
,

GROUP
12

• Interestingness of an association rule:

Quantify the interestingness of an association rule with respect to a
ril

transaction database D:
Ap

– Support: frequency (probability) of the entire rule with respect to D

{𝑇 ∈ 𝐷|𝑋 ∪ 𝑌 ⊆ 𝑇}
𝑠𝑢𝑝𝑝𝑜𝑟𝑡 𝑋 ⇒ 𝑌 = 𝑃 𝑋 ∪ 𝑌 = = 𝑠𝑢𝑝𝑝𝑜𝑟𝑡(𝑋 ∪ 𝑌)
𝐷
“probability that a transaction in 𝐷 contains the itemset 𝑋 ∪ 𝑌”
– Confidence: indicates the strength of implication in the rule
{𝑇 ∈ 𝐷|𝑋 ∪ 𝑌 ⊆ 𝑇} 𝑠𝑢𝑝𝑝𝑜𝑟𝑡(𝑋 ∪ 𝑌)
𝑐𝑜𝑛𝑓𝑖𝑑𝑒𝑛𝑐𝑒 𝑋 ⇒ 𝑌 = 𝑃 𝑌|𝑋 = =
{𝑇 ∈ 𝐷|𝑋 ⊆ 𝑇} 𝑠𝑢𝑝𝑝𝑜𝑟𝑡(𝑋)
“conditional probability that a transaction in 𝐷 containing the itemset 𝑋 also
contains itemset 𝑌” buys diapers
– Rule form: “𝐵𝑜𝑑𝑦 ⇒ 𝐻𝑒𝑎𝑑 [𝑠𝑢𝑝𝑝𝑜𝑟𝑡, 𝑐𝑜𝑛𝑓𝑖𝑑𝑒𝑛𝑐𝑒]” buys both
• Association rule examples:
– buys diapers  buys beers [0.5%, 60%]
– major in CS ∧ takes DB  avg. grade A [1%, 75%]
buys beer
Frequent Itemset Mining  Simple Association Rules 36
DATABASE
SYSTEMS
Mining of Association Rules
GROUP

• Task of mining association rules:

Given a database 𝐷, determine all association rules having a 𝑠𝑢𝑝𝑝𝑜𝑟𝑡 ≥
𝑚𝑖𝑛𝑆𝑢𝑝 and a 𝑐𝑜𝑛𝑓𝑖𝑑𝑒𝑛𝑐𝑒 ≥ 𝑚𝑖𝑛𝐶𝑜𝑛𝑓 (so-called strong association
rules).
• Key steps of mining association rules:
1) Find frequent itemsets, i.e., itemsets that have at least support = 𝑚𝑖𝑛𝑆𝑢𝑝
2) Use the frequent itemsets to generate association rules
• For each itemset 𝑋 and every nonempty subset Y ⊂ 𝑋 generate rule Y ⇒ (𝑋 −
𝑌) if 𝑚𝑖𝑛𝑆𝑢𝑝 and 𝑚𝑖𝑛𝐶𝑜𝑛𝑓 are fulfilled
• we have 2|𝑋| − 2 many association rule candidates for each itemset 𝑋

S
• Example
frequent itemsets

.R
1-itemset count 2-itemset count 3-itemset count
{A} 3 {A, B} 3 {A, B, C} 2
{B} 4 {A, C} 2
{C} 5 {B, C} 4
Dr
rule candidates: A ⇒ 𝐵; 𝐵 ⇒ 𝐴; A ⇒ 𝐶; 𝐶 ⇒ A; 𝐵 ⇒ 𝐶; C ⇒ 𝐵;
𝐴, 𝐵 ⇒ 𝐶; 𝐴, 𝐶 ⇒ 𝐵; 𝐶, 𝐵 ⇒ 𝐴; 𝐴 ⇒ 𝐵, 𝐶; 𝐵 ⇒ 𝐴, 𝐶; 𝐶 ⇒ 𝐴, 𝐵
4/
Frequent Itemset Mining  Simple Association Rules 37
2

Generating Rules from Frequent

Itemsets
DATABASE
SYSTEMS
,

GROUP
12

• For each frequent itemset X

– For each nonempty subset Y of X, form a rule Y ⇒ (𝑋 − 𝑌)
ril

– Delete those rules that do not have minimum confidence

Note: 1) support always exceeds 𝑚𝑖𝑛𝑆𝑢𝑝

2) the support values of the frequent itemsets suffice to calculate the
confidence
• Example: 𝑋 = {𝐴, 𝐵, 𝐶}, 𝑚𝑖𝑛𝐶𝑜𝑛𝑓 = 60% itemset count
– conf (A  B) = 3/3; ✔ {A} 3
– conf (B  A) = 3/4; ✔ {B} 4
– conf (A  C) = 2/3; ✔ {C} 5
– conf (C  A) = 2/5; ✗
{A, B} 3
– conf (B  C) = 4/4; ✔
{A, C} 2
– conf (C  B) = 4/5; ✔
{B, C} 4
– conf (A  B, C) = 2/3; ✔ conf (B, C  A) = ½ ✗
{A, B, C} 2
– conf (B  A, C) = 2/4; ✗ conf (A, C  B) = 1 ✔
– conf (C  A, B) = 2/5; ✗ conf (A, B  C) = 2/3 ✔
• Exploit anti-monotonicity for generating candidates for strong
association rules!

Frequent Itemset Mining  Simple Association Rules 38

DATABASE
SYSTEMS
Interestingness Measurements
GROUP

• Objective measures
– Two popular measurements:
– support and
– confidence

• Subjective measures [Silberschatz & Tuzhilin, KDD95]

– A rule (pattern) is interesting if it is
– unexpected (surprising to the user) and/or
– actionable (the user can do something with it)

S
.R
Dr
4/
Frequent Itemset Mining  Simple Association Rules 39
2
20

DATABASE
SYSTEMS
Criticism to Support and Confidence
,

GROUP
12

Example 1 [Aggarwal & Yu, PODS98]

• Among 5000 students
ril

– 3000 play basketball (=60%)

– 3750 eat cereal (=75%)

– 2000 both play basket ball and eat cereal (=40%)
• Rule play basketball  eat cereal [40%, 66.7%] is misleading because
the overall percentage of students eating cereal is 75% which is higher
than 66.7%
• Rule play basketball  not eat cereal [20%, 33.3%] is far more
accurate, although with lower support and confidence
• Observation: play basketball and eat cereal are negatively correlated
 Not all strong association rules are interesting and some can be
misleading.
 augment the support and confidence values with interestingness
measures such as the correlation 𝐴 ⇒ 𝐵 [𝑠𝑢𝑝𝑝, 𝑐𝑜𝑛𝑓, 𝑐𝑜𝑟𝑟]

Frequent Itemset Mining  Simple Association Rules 40

Other Interestingness Measures:
Correlation
DATABASE
SYSTEMS
GROUP

• Lift is a simple correlation measure between two items A and B:

𝑃(𝐴 ‫)𝐵 ڂ‬ 𝑃 𝐵 𝐴) 𝑐𝑜𝑛𝑓(𝐴⇒𝐵)

𝑐𝑜𝑟𝑟𝐴,𝐵 = = =
𝑃 𝐴 𝑃(𝐵) 𝑃 𝐵 𝑠𝑢𝑝𝑝(𝐵)

! The two rules 𝐴 ⇒ 𝐵 and 𝐵 ⇒ 𝐴 have the same correlation coefficient.

• take both P(A) and P(B) in consideration

• 𝑐𝑜𝑟𝑟𝐴,𝐵 > 1 the two items A and B are positively correlated

S
• 𝑐𝑜𝑟𝑟𝐴,𝐵 = 1 there is no correlation between the two items A and B

.R
• 𝑐𝑜𝑟𝑟𝐴,𝐵 < 1 the two items A and B are negatively correlated

Dr
4/
Frequent Itemset Mining  Simple Association Rules 41
2

Other Interestingness Measures:

Correlation
DATABASE
SYSTEMS
,

GROUP
12

• Example 2: X 1 1 1 1 0 0 0 0
ril

Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1
Ap

• X and Y: positively correlated

• X and Z: negatively related
• support and confidence of X=>Z dominates
• but items X and Z are negatively correlated
• Items X and Y are positively correlated

rule support confidence correlation

𝑋⇒𝑌 25% 50% 2
𝑋⇒𝑍 37.5% 75% 0.86
𝑌⇒𝑍 12.5% 50% 0.57

Frequent Itemset Mining  Simple Association Rules 42

DATABASE
SYSTEMS
Chapter 3: Frequent Itemset Mining
GROUP

S
– Quantitative Association Rules
• Motivation, basic idea, partitioning numerical attributes, adaptation of

.R
apriori algorithm, interestingness
5) Extensions and Summary
Dr
4/
Outline 43
2

Hierarchical Association Rules:

Motivation
DATABASE
SYSTEMS
,

GROUP
12

• Problem of association rules in plain itemsets

ril

– High minsup: apriori finds only few rules

– Low minsup: apriori finds unmanagably many rules
Ap

• Exploit item taxonomies (generalizations, is-a hierarchies) which exist

in many applications
clothes shoes

outerwear shirts sports shoes boots

jackets jeans

• New task: find all generalized association rules between generalized

items  Body and Head of a rule may have items of any level of the
hierarchy
• Generalized association rule: 𝑋 ⇒ 𝑌
with 𝑋, 𝑌 ⊂ 𝐼, 𝑋 ∩ 𝑌 = ∅ and no item in 𝑌 is an ancestor of any item in 𝑋
i.e., 𝑗𝑎𝑐𝑘𝑒𝑡𝑠 ⇒ 𝑐𝑙𝑜𝑡ℎ𝑒𝑠 is essentially true
Frequent Itemset Mining  Further Topics  Hierarchical Association Rules 44
Hierarchical Association Rules:
Motivating Example
DATABASE
SYSTEMS
GROUP

• Examples
Jeans  boots
jackets  boots Support < minSup

Outerwear  boots Support > minsup

• Characteristics
– Support(“outerwear  boots”) is not necessarily equal to the sum
support(“jackets  boots”) + support( “jeans  boots”)
e.g. if a transaction with jackets, jeans and boots exists

S
– Support for sets of generalizations (e.g., product groups) is higher
than support for sets of individual items

.R
If the support of rule “outerwear  boots” exceeds minsup, then the
support of rule “clothes  boots” does, too
Dr
4/
Frequent Itemset Mining  Further Topics  Hierarchical Association Rules 45
2
20

DATABASE
SYSTEMS
Mining Multi-Level Associations
,

GROUP
12

• A top_down, progressive deepening approach:

Food
ril

– First find high-level strong rules:

• milk  bread [20%, 60%].
Ap

milk bread
– Then find their lower-level “weaker” rules:
• 1.5% milk  wheat bread [6%, 50%]. 3.5% 1.5% wheat white

Fraser Sunset Wonder

• Different min_support threshold across multi-levels lead to different

algorithms:
– adopting the same min_support across multi-levels
– adopting reduced min_support at lower levels

Frequent Itemset Mining  Further Topics  Hierarchical Association Rules 46

DATABASE
SYSTEMS
Minimum Support for Multiple Levels
GROUP

• Uniform Support milk minsup = 5 %

support = 10 %

3.5% 1.5% minsup = 5 %

support = 6 % support = 4 %

+ the search procedure is simplified (monotonicity)

+ the user is required to specify only one support threshold
• Reduced Support milk
(Variable Support) minsup = 5 %
support = 10 %

S
.R
3.5% 1.5% minsup = 3 %
support = 6 % support = 4 %
+ takes the lower frequency of items in lower levels into consideration
Dr
4/
Frequent Itemset Mining  Further Topics  Hierarchical Association Rules 47
2

Multilevel Association Mining using

Reduced Support
DATABASE
SYSTEMS
,

GROUP
12

• A top_down, progressive deepening approach:

Food
ril

– First find high-level strong rules:

• milk  bread [20%, 60%].
Ap

milk bread
– Then find their lower-level “weaker” rules:
• 1.5% milk  wheat bread [6%, 50%]. 3.5% 1.5% wheat white
level-wise processing (breadth first)
3 approaches using reduced Support: Fraser Sunset Wonder

• Level-by-level independent method:

– Examine each node in the hierarchy, regardless of whether or not its parent
node is found to be frequent
• Level-cross-filtering by single item:
– Examine a node only if its parent node at the preceding level is frequent
• Level-cross- filtering by k-itemset:
– Examine a k-itemset at a given level only if its parent k-itemset at the
preceding level is frequent

Frequent Itemset Mining  Further Topics  Hierarchical Association Rules 48

DATABASE
SYSTEMS
Multilevel Associations: Variants
GROUP

• A top_down, progressive deepening approach:

Food
– First find high-level strong rules:
• milk  bread [20%, 60%].
milk bread
– Then find their lower-level “weaker” rules:
• 1.5% milk  wheat bread [6%, 50%]. 3.5% 1.5% wheat white
level-wise processing (breadth first)
Fraser Sunset Wonder

• Variations at mining multiple-level association rules.

S
– Level-crossed association rules:
• 1.5 % milk  Wonder wheat bread

.R
– Association rules with multiple, alternative hierarchies:
• 1.5 % milk  Wonder bread
Dr
4/
Frequent Itemset Mining  Further Topics  Hierarchical Association Rules 49
2

Multi-level Association: Redundancy

Filtering
DATABASE
SYSTEMS
,

GROUP
12

• Some rules may be redundant due to “ancestor” relationships between

items.
ril

• Example
Ap

– 𝑅1 : milk  wheat bread [support = 8%, confidence = 70%]

– 𝑅2 : 1.5% milk  wheat bread [support = 2%, confidence = 72%]
• We say that rule 1 is an ancestor of rule 2.
• Redundancy:
A rule is redundant if its support is close to the “expected” value, based
on the rule’s ancestor.

Frequent Itemset Mining  Further Topics  Hierarchical Association Rules 50

Interestingness of Hierarchical
Association Rules: Notions
DATABASE
SYSTEMS
GROUP

Let 𝑋, 𝑋 ′ , 𝑌, 𝑌 ′ ⊆ 𝐼 be itemsets.
• An itemset 𝑋′ is an ancestor of 𝑋 iff there exist ancestors 𝑥1′ , … , 𝑥𝑘′ of
𝑥1 , … , 𝑥𝑘 ∈ 𝑋 and 𝑥𝑘+1 , … , 𝑥𝑛 with 𝑛 = 𝑋 such that
𝑋 ′ = {𝑥1′ , … , 𝑥𝑘′ , 𝑥𝑘+1 , … , 𝑥𝑛 }.
• Let 𝑋 ′ and 𝑌′ be ancestors of 𝑋 and 𝑌. Then we call the rules 𝑋′  𝑌′,
𝑋𝑌′, and 𝑋′𝑌 ancestors of the rule X  Y .
• The rule X´  Y´ is a direct ancestor of rule X  Y in a set of rules if:
– Rule X´  Y‘ is an ancestor of rule X  Y, and
– There is no rule X“  Y“ such that X“  Y“ is an ancestor of

S
X  Y and X´  Y´ is an ancestor of X“  Y“

.R
• A hierarchical association rule X  Y is called R-interesting if:
– There are no direct ancestors of X  Y or
Dr
– The actual support is larger than R times the expected support or
– The actual confidence is larger than R times the expected confidence
4/
Frequent Itemset Mining  Further Topics  Hierarchical Association Rules 51
2

Expected Support and Expected

Confidence
DATABASE
SYSTEMS
,

GROUP
12

• How to compute the expected support?

Given the rule for X  Y and its ancestor rule X´  Y´ the expected
ril

support of X  Y is defined as:

P(𝑧1 ) P 𝑧𝑗 ′
𝐸𝑍 ′ P 𝑍 = ′ × ⋯ × ′ ×P 𝑍
P(𝑧1 ) P(𝑧𝑗 )
where 𝑍 = 𝑋 ∪ 𝑌 = {𝑧1 , … , 𝑧𝑛 }, 𝑍 ′ = 𝑋 ′ ∪ 𝑌 ′ = {𝑧1′ , … , 𝑧𝑗′ , 𝑧𝑗+1 , … , 𝑧𝑛 } and
each 𝑧𝑖′ ∈ 𝑍 ′ is an ancestor of 𝑧𝑖 ∈ 𝑍

[SA’95] R. Srikant, R. Agrawal: Mining Generalized Association Rules. In VLDB, 1995.

Frequent Itemset Mining  Further Topics  Hierarchical Association Rules 52
Expected Support and Expected
Confidence
DATABASE
SYSTEMS
GROUP

• How to compute the expected confidence?

Given the rule for X  Y and its ancestor rule X´  Y´, then the
expected confidence of X  Y is defined as:
P(𝑦1 ) P 𝑦𝑗
𝐸𝑋′⇒𝑌′ P 𝑌|𝑋 = ′ × ⋯× ′
× P 𝑌 ′ |𝑋′
P(𝑦1 ) P 𝑦𝑗
where 𝑌 = {𝑦1 , … , 𝑦𝑛 } and 𝑌′ = 𝑦1′ , … , 𝑦𝑗′ , 𝑦𝑗+1 , … , 𝑦𝑛 and each 𝑦𝑖′ ∈ 𝑌′ is
an ancestor of 𝑦𝑖 ∈ 𝑌

S
.R
Dr
[SA’95] R. Srikant, R. Agrawal: Mining Generalized Association Rules. In VLDB, 1995.
4/
Frequent Itemset Mining  Further Topics  Hierarchical Association Rules 53
2

Interestingness of Hierarchical
20

Association Rules:Example
DATABASE
SYSTEMS
,

GROUP
12

• Example Item Support

ril

– Let R = 1.6 clothes 20

outerwear 10
Ap

jackets 4

•
No rule support R-interesting?
1 clothes  shoes 10 yes: no ancestors
2 outerwear  shoes 9 yes:
Support > R *exp. support (wrt. rule 1) =
10
(1.6 ⋅ (20 ⋅ 10)) = 8
3 jackets  shoes 4 Not wrt. support:
Support > R * exp. support (wrt. rule 1) = 3.2
Support < R * exp. support (wrt. rule 2) = 5.75
 still need to check the confidence!

Frequent Itemset Mining  Further Topics  Hierarchical Association Rules 54

DATABASE
SYSTEMS
Chapter 3: Frequent Itemset Mining
GROUP

1) Introduction
– Transaction databases, market basket data analysis
2) Simple Association Rules
– Basic notions, rule generation, interestingness measures
3) Mining Frequent Itemsets
– Apriori algorithm, hash trees, FP-tree
4) Further Topics
– Hierarchical Association Rules
• Motivation, notions, algorithms, interestingness

S
– Multidimensional and Quantitative Association Rules
• Motivation, basic idea, partitioning numerical attributes, adaptation of

.R
apriori algorithm, interestingness
5) Summary
Dr
4/
Outline 55
2

Multi-Dimensional Association:
20

Concepts
DATABASE
SYSTEMS
,

GROUP
12

• Single-dimensional rules:
ril

– buys milk  buys bread

• Multi-dimensional rules:  2 dimensions

– Inter-dimension association rules (no repeated dimensions)
• age between 19-25  status is student  buys coke
– hybrid-dimension association rules (repeated dimensions)
• age between 19-25  buys popcorn  buys coke

Frequent Itemset Mining  Extensions & Summary 56

Techniques for Mining Multi-
Dimensional Associations
DATABASE
SYSTEMS
GROUP

• Search for frequent k-predicate set:

– Example: {age, occupation, buys} is a 3-predicate set.
– Techniques can be categorized by how age is treated.
1. Using static discretization of quantitative attributes
– Quantitative attributes are statically discretized by using predefined concept
hierarchies.
2. Quantitative association rules
– Quantitative attributes are dynamically discretized into “bins”based on the
distribution of the data.
3. Distance-based association rules

S
– This is a dynamic discretization process that considers the distance between

.R
data points.

Dr
4/
Frequent Itemset Mining  Extensions & Summary 57
2
20

DATABASE
SYSTEMS
Quantitative Association Rules
,

GROUP
12

• Up to now: associations of boolean attributes only

ril

• Now: numerical attributes, too

• Example:
– Original database

ID age marital status # cars

1 23 single 0
2 38 married 2
– Boolean database

ID age: 20..29 age: 30..39 m-status: single m-status: married ...

1 1 0 1 0 ...
2 0 1 0 1 ...

Frequent Itemset Mining  Further Topics  Quantitative Association Rules 58

DATABASE
SYSTEMS
Quantitative Association Rules: Ideas
GROUP

• Static discretization
– Discretization of all attributes before mining the association rules
– E.g. by using a generalization hierarchy for each attribute
– Substitute numerical attribute values by ranges or intervals

• Dynamic discretization
– Discretization of the attributes during association rule mining
– Goal (e.g.): maximization of confidence
– Unification of neighboring association rules to a generalized rule

S
.R
Dr
4/
Frequent Itemset Mining  Further Topics  Quantitative Association Rules 59
2
20

DATABASE
SYSTEMS
Partitioning of Numerical Attributes
,

GROUP
12

• Problem: Minimum support

– Too many intervals too small support for each individual interval
ril

– Too few intervals  too small confidence of the rules

• Solution
– First, partition the domain into many intervals
– Afterwards, create new intervals by merging adjacent interval

• Numeric attributes are dynamically discretized such that the confidence

or compactness of the rules mined is maximized.

Frequent Itemset Mining  Further Topics  Quantitative Association Rules 60

DATABASE
SYSTEMS
Quantitative Association Rules
GROUP

• 2-D quantitative association rules: Aquan1  Aquan2  Acat

• Cluster “adjacent” association
rules to form general rules
using a 2-D grid.

• Example:

S
.R
age(X,”30-34”)  income(X,”24K - 48K”)
 buys(X,”high resolution TV”)
Dr
4/
Frequent Itemset Mining  Further Topics  Quantitative Association Rules 61
2
20

DATABASE
SYSTEMS
Chapter 3: Frequent Itemset Mining
,

GROUP
12

1) Introduction
ril

– Transaction databases, market basket data analysis

2) Mining Frequent Itemsets
Ap

– Apriori algorithm, hash trees, FP-tree

3) Simple Association Rules
– Basic notions, rule generation, interestingness measures
4) Further Topics
– Hierarchical Association Rules
• Motivation, notions, algorithms, interestingness
– Quantitative Association Rules
• Motivation, basic idea, partitioning numerical attributes, adaptation of
apriori algorithm, interestingness
5) Summary

Outline 62
12 Reference

[1] https://www.jigsawacademy.com/blogs/hr-analytics/data-analytics-lifecycle/

[2] https://statacumen.com/teach/ADA1/ADA1\_notes\_F14.pdf

[3] https://www.youtube.com/watch?v=fDRa82lxzaU

[4] https://www.investopedia.com/terms/d/data-analytics.asp

[5] http://egyankosh.ac.in/bitstream/123456789/10935/1/Unit-2.pdf

[6] http://epgp.inflibnet.ac.in/epgpdata/uploads/epgp\_content/computer\_science/16.\_d

S
ata\_analytics/03.\_evolution\_of\_analytical\_scalability/et/9280\_et\_3\_et.pdf

.R
[7] https://bhavanakhivsara.files.wordpress.com/2018/06/data-science-and-big-data-analy

-nieizv\_book.pdf

Dr
[8] https://www.researchgate.net/publication/317214679\_Sentiment\_Analysis\_for\_Effect
4/
ive\_Stock\_Market\_Prediction

[9] https://snscourseware.org/snscenew/files/1569681518.pdf
2
20

[10] http://csis.pace.edu/ctappert/cs816-19fall/books/2015DataScience\&BigDataAnalytics.

pdf
,
12

[11] https://www.youtube.com/watch?v=mccsmoh2\_3c

[12] https://mentalmodels4life.net/2015/11/18/agile-data-science-applying-kanban-in-the-a
ril

nalytics-life-cycle/
Ap

[13] https://www.sas.com/en_in/insights/big-data/what-is-big-data.html#:~:text=Big%20dat

a%20refers%20to%20data,around%20for%20a%20long%20time.

[14] https://www.javatpoint.com/big-data-characteristics

[15] Liu, S., Wang, M., Zhan, Y., & Shi, J. (2009). Daily work stress and alcohol use: Testing the cross-

level moderation effects of neuroticism and job involvement. Personnel Psychology,62(3), 575–597.

http://dx.doi.org/10.1111/j.1744-6570.2009.01149.x

55
[16] https://www.google.com/search?q=architecture+of+data+stream+model&sxsrf=APwXEdf9LJ8N

XMypRU-Sg28SH8m_pwiUDA:1679823244352&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjGgY-epfn9

AhX5xTgGHRWjDmMQ_AUoAXoECAEQAw&biw=1366&bih=622#imgrc=wnFWJQ01p-w_jM

[17] Prof. Dr. Thomas Seidl, Frequent Itemset Mining, Knowledge Discovery in Databases, SS 2016.

********************

S
.R
Dr
2 4/
, 20
12
ril
Ap

Efficient Frequent Itemset Mining Techniques
No ratings yet
Efficient Frequent Itemset Mining Techniques
47 pages
Unit 4 Data Analytics
No ratings yet
Unit 4 Data Analytics
11 pages
Data 07 00011
No ratings yet
Data 07 00011
22 pages
Big Data
No ratings yet
Big Data
8 pages
DA Unit 4
100% (1)
DA Unit 4
125 pages
Data Mining for Computer Science Students
No ratings yet
Data Mining for Computer Science Students
36 pages
DWM Unit 5 Mining Frequent Patterns and Cluster Analysis
100% (1)
DWM Unit 5 Mining Frequent Patterns and Cluster Analysis
15 pages
Dmbi Ia2 Ans
No ratings yet
Dmbi Ia2 Ans
17 pages
Association Rule Mining
No ratings yet
Association Rule Mining
61 pages
Unit-1: 1. Define Data Mining and Explain Its Importance in Modern Data Analysis
No ratings yet
Unit-1: 1. Define Data Mining and Explain Its Importance in Modern Data Analysis
42 pages
Data Warehousing and Data Mining Assignment 3
No ratings yet
Data Warehousing and Data Mining Assignment 3
2 pages
Unit 4 - DA - Frequent Itemsets and Clustering-1 (Unit-5)
No ratings yet
Unit 4 - DA - Frequent Itemsets and Clustering-1 (Unit-5)
86 pages
Data Analytics Sys
No ratings yet
Data Analytics Sys
1 page
Unit 5
No ratings yet
Unit 5
9 pages
Data Analytics Lesson Plan
No ratings yet
Data Analytics Lesson Plan
11 pages
FDS Unit - 3
No ratings yet
FDS Unit - 3
10 pages
Data Mining
No ratings yet
Data Mining
44 pages
CS-DM Module - 1
No ratings yet
CS-DM Module - 1
27 pages
Explain Architecture of Data Mining
No ratings yet
Explain Architecture of Data Mining
12 pages
CSC 452 DM Week06 Association Rules 26102020 111149am
No ratings yet
CSC 452 DM Week06 Association Rules 26102020 111149am
52 pages
CE0716-Data Warehouse and Mining - Compulsory
No ratings yet
CE0716-Data Warehouse and Mining - Compulsory
5 pages
PROFICIENCY Data Mining
No ratings yet
PROFICIENCY Data Mining
6 pages
Chapter 5 Data Mining: Dr. Huma Lone
No ratings yet
Chapter 5 Data Mining: Dr. Huma Lone
56 pages
Data Mining: Patterns & Clustering
No ratings yet
Data Mining: Patterns & Clustering
9 pages
Apriori Algorithm in Frequent Itemset Mining
No ratings yet
Apriori Algorithm in Frequent Itemset Mining
15 pages
DMDW Chapter 4
No ratings yet
DMDW Chapter 4
28 pages
Ai Pass
No ratings yet
Ai Pass
12 pages
Unit - III
No ratings yet
Unit - III
38 pages
Unit 5 Notes DWM
No ratings yet
Unit 5 Notes DWM
11 pages
Data Mining: Frequent Itemsets & Clustering
No ratings yet
Data Mining: Frequent Itemsets & Clustering
152 pages
Slides03 - Items and Association
No ratings yet
Slides03 - Items and Association
17 pages
Chap5 Frequent Itemset
No ratings yet
Chap5 Frequent Itemset
70 pages
Frequent Itemsets & Market-Basket Analysis
No ratings yet
Frequent Itemsets & Market-Basket Analysis
31 pages
1.3 What Kind of Data Can Be Mined?
No ratings yet
1.3 What Kind of Data Can Be Mined?
5 pages
DWDM Mod-1
No ratings yet
DWDM Mod-1
13 pages
Introduction To Data Mining - Lecture03
No ratings yet
Introduction To Data Mining - Lecture03
23 pages
Data Analytics Chapteer 4
No ratings yet
Data Analytics Chapteer 4
9 pages
DMDW Chapter 4
No ratings yet
DMDW Chapter 4
29 pages
Frequent Pattern Mining Concepts
No ratings yet
Frequent Pattern Mining Concepts
56 pages
DM Unit - 2
No ratings yet
DM Unit - 2
14 pages
Data Mining UNIT 3 LECTURE NOTES
No ratings yet
Data Mining UNIT 3 LECTURE NOTES
13 pages
Unit 3 Data Science
No ratings yet
Unit 3 Data Science
15 pages
Unit 5 Mining Frequent Patterns and Cluster Analysis
No ratings yet
Unit 5 Mining Frequent Patterns and Cluster Analysis
63 pages
Frequent Pattern Mining in Retail Analysis
No ratings yet
Frequent Pattern Mining in Retail Analysis
18 pages
U3 FDS 1
No ratings yet
U3 FDS 1
17 pages
Apriori
No ratings yet
Apriori
33 pages
Hot Keys
No ratings yet
Hot Keys
4 pages
Data Analytics - Unit - 4
No ratings yet
Data Analytics - Unit - 4
14 pages
MCA 301 Data Mining Notes
No ratings yet
MCA 301 Data Mining Notes
6 pages
06 FPBasic
No ratings yet
06 FPBasic
69 pages
DWDM (Data Warehousing and Data Mining) Summarizer (For B.Tech MAKAUT)
No ratings yet
DWDM (Data Warehousing and Data Mining) Summarizer (For B.Tech MAKAUT)
27 pages
33 GM - ASAP-Association Rule Mining
No ratings yet
33 GM - ASAP-Association Rule Mining
64 pages
Unit 2-2
No ratings yet
Unit 2-2
53 pages
Data Cube Computation and Data Generation
No ratings yet
Data Cube Computation and Data Generation
54 pages
Apply R Tool For Developing and Evaluating Real Time Applications
No ratings yet
Apply R Tool For Developing and Evaluating Real Time Applications
1 page
VIPDMTheoryChapter 5
No ratings yet
VIPDMTheoryChapter 5
96 pages
1 Unit 4
No ratings yet
1 Unit 4
46 pages
06 FPBasic
No ratings yet
06 FPBasic
59 pages
Bca Be Enabppac.
No ratings yet
Bca Be Enabppac.
2 pages
SSC Testing for Engineers
No ratings yet
SSC Testing for Engineers
19 pages
Bullseye: Target's Cheap Chic Strategy: Achieving The Right Kind of Differentiation
No ratings yet
Bullseye: Target's Cheap Chic Strategy: Achieving The Right Kind of Differentiation
3 pages
The Making of The Last Prophet A Reconst
No ratings yet
The Making of The Last Prophet A Reconst
4 pages
OB-Blanchard-Fields, F. (2007) - Everyday Problem Solving and Emotion. An Adult Developmental Perspective
100% (1)
OB-Blanchard-Fields, F. (2007) - Everyday Problem Solving and Emotion. An Adult Developmental Perspective
6 pages
Going Pro 3 Exam 2
No ratings yet
Going Pro 3 Exam 2
2 pages
Ivivc Jurnal
No ratings yet
Ivivc Jurnal
7 pages
S-101 Annex A DCEG Ed 2.0.0
No ratings yet
S-101 Annex A DCEG Ed 2.0.0
802 pages
Ad - Pda
No ratings yet
Ad - Pda
5 pages
Organising: According To Theo Haimann, "Organising Is The Process of Defining and Grouping The
No ratings yet
Organising: According To Theo Haimann, "Organising Is The Process of Defining and Grouping The
13 pages
Hooke's Law and Young's Modulus Exam
No ratings yet
Hooke's Law and Young's Modulus Exam
6 pages
Normalization in Databases
No ratings yet
Normalization in Databases
40 pages
Akhil Hall Ticket 5sem
No ratings yet
Akhil Hall Ticket 5sem
2 pages
Conference Presentation Template Guide
No ratings yet
Conference Presentation Template Guide
40 pages
Hangul Doodly Chalkboard Pack: For Foreign Language Teachers
No ratings yet
Hangul Doodly Chalkboard Pack: For Foreign Language Teachers
63 pages
How to Use Alt Code for "í" Character
No ratings yet
How to Use Alt Code for "í" Character
3 pages
Plastic Granules (16.06.25)
No ratings yet
Plastic Granules (16.06.25)
5 pages
Indian Coal Allocation Scam
No ratings yet
Indian Coal Allocation Scam
8 pages
JEE Weightage Analysis Handout
No ratings yet
JEE Weightage Analysis Handout
30 pages
The Good Girl Raven Souls MC Candice Wright Download Full Chapters
No ratings yet
The Good Girl Raven Souls MC Candice Wright Download Full Chapters
120 pages
Godrej Company
No ratings yet
Godrej Company
24 pages
Power Electronics Exam Paper
No ratings yet
Power Electronics Exam Paper
2 pages
Logistics Syllabus
No ratings yet
Logistics Syllabus
3 pages
8-Channel Relay Controller Guide
100% (1)
8-Channel Relay Controller Guide
4 pages
Nursing Process: Steps & Guidelines
100% (1)
Nursing Process: Steps & Guidelines
5 pages
Vertex SSD Firmware Update Guide
No ratings yet
Vertex SSD Firmware Update Guide
9 pages
Turbidite: January 2014
No ratings yet
Turbidite: January 2014
8 pages
Module-3 - NR - STM (24mba205)
No ratings yet
Module-3 - NR - STM (24mba205)
11 pages
Philcare LOA for Annual Physical Exam
No ratings yet
Philcare LOA for Annual Physical Exam
1 page

Data Analytics Unit 4

Uploaded by

Data Analytics Unit 4

Uploaded by

Data Analytics (KIT-601)

Unit-4: Frequent Itemsets and Clustering

Department of Information Technology

(Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow)

April 12, 2024

At the end of course , the student will be able to

CO 1 Discuss various concepts of data analytics pipeline K1, K2

CO 4 Compare different clustering and frequent pattern mining algorithms K4

DETAILED SYLLABUS 3-0-0

architecture, stream computing, sampling data in a stream, filtering streams, counting

time sentiment analysis, stock market predictions.

frequent itemsets can be generated.

Apriori algorithm by exploiting the sparsity of the dataset.

the minimum support threshold are then selected as frequent itemsets.

large datasets on distributed systems.

present in the dataset.

2 Market Based Modelling

to adopt in a given situation.

performing the following steps:

any subset of a frequent itemset must also be frequent.

strong relationship between items in X and items in Y.

of transactions that contain Y given that they also contain X.

4 Handling Large Datasets in Main Memory

data can be compressed using libraries like LZ4 or Snappy.

horizontal partitioning, vertical partitioning, or hybrid partitioning.

SAP HANA, and VoltDB.

which provides distributed data processing capabilities.

without requiring expensive hardware upgrades or specialized software tools.

5 Limited Pass Algorithm

datasets with limited memory resources.

are the steps involved in counting frequent itemsets in a stream:

to prevent it from becoming too large.

2. Read each transaction in the stream one at a time.

which generates candidate itemsets by combining smaller frequent itemsets.

4. Increment the count of each itemset in the hash table.

6. Repeat steps 2-5 for each transaction in the stream.

the number of items and transactions in the stream.

7.1 K-Means Clustering:

segmentation, marketing, and customer segmentation.

7.1.1 K-means Clustering algorithm

clusters, where k is a pre-defined number of clusters. The algorithm works as follows:

 Initialize the k cluster centroids randomly.

(WCSS) or the sum of squared errors (SSE).

data. Here are some of its advantages and disadvantages:

 Simple and easy to understand: K-Means is  Requires pre-defined number of clusters: K-

number of clusters is unknown or difficult to

if the initial centers are not well chosen.

hibit these types of shapes.

work well with such data.

7.2 Hierarchical Clustering:

similarity. It can be agglomerative (bottom-up) or divisive (top-down). In agglomerative clustering, each

7.3 Density-based Clustering:

geospatial data analysis, and anomaly detection.

segmentation, community detection in social networks, and document clustering.

8 Clustering high-dimensional data

clustering high-dimensional data:

algorithms more effective.

8.2 Feature Selection:

and clusters simultaneously.

8.4 Density-Based Clustering:

density in the multidimensional space, which correspond to clusters.

8.5 Ensemble Clustering:

the clustering results to the choice of algorithm or parameter settings.

8.6 Deep Learning-Based Clustering:

techniques may be required to achieve the best clustering performance.

8.7 CLIQUE and ProCLUS

based on the characteristics of the data and the clustering objectives.

9 Frequent pattern-based clustering methods

based on their membership in the frequent itemsets.

to assign objects to clusters based on their membership in the frequent itemsets.

methods may not be suitable.

Here are some approaches for clustering in non-Euclidean spaces:

2. Density-Based Spatial Clustering of Applications with Noise: is a density-based clustering algorithm

used to cluster the documents based on their similarity.

11 Clustering for streams and parallelism

distribute the clustering task across multiple computing resources.

Initialize the k cluster centroids randomly.

Simple and easy to understand: K-Means is Requires pre-defined number of clusters: K-