Example: PCY Algorithm Step-by-Step
Let's go through a step-by-step example of the PCY (Park-Chen-Yu) Algorithm, an
improvement over the Apriori algorithm for finding frequent itemsets. PCY puts the memory that
is otherwise idle during the first pass to work: it hashes item pairs into a table of bucket counts,
and uses that table to prune candidate pairs in the second pass.
Dataset:
Let's assume we have the following transaction database:

Transaction ID | Items Purchased
1              | A, B, C, D
2              | A, C, D
3              | B, C, E
4              | A, B
5              | A, B, E
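
To make the steps concrete, the snippets in this walkthrough use Python; here is the same database encoded as a list of sets (the name transactions is our own choice):

    # The five example transactions, one set of items per basket.
    transactions = [
        {"A", "B", "C", "D"},  # transaction 1
        {"A", "C", "D"},       # transaction 2
        {"B", "C", "E"},       # transaction 3
        {"A", "B"},            # transaction 4
        {"A", "B", "E"},       # transaction 5
    ]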
Step 1: Count Frequent Items (Pass 1)
We count the occurrences of each item in the dataset:
Item | Count
A    | 4
B    | 4
C    | 3
D    | 2
E    | 2
Assuming a support threshold of 3, the frequent items are:
{A, B, C} (since D and E have counts less than 3, they are discarded).
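
A minimal sketch of this first-pass item count, reusing the transactions list from above (SUPPORT and frequent_items are our own names):

    from collections import Counter

    SUPPORT = 3  # the support threshold assumed in this example

    # Pass 1: count every individual item across all transactions.
    item_counts = Counter(item for basket in transactions for item in basket)
    # item_counts == Counter({'A': 4, 'B': 4, 'C': 3, 'D': 2, 'E': 2})

    frequent_items = {item for item, c in item_counts.items() if c >= SUPPORT}
    # frequent_items == {'A', 'B', 'C'}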
Step 2: Hashing Item Pairs into Buckets (Pass 1)
Now, for each transaction we generate the pairs of items that appear together and hash each pair into a hash table.
Hash Function Example:
Let’s use a simple hash function:
H(X, Y) = (ASCII(X) + ASCII(Y)) mod 5
Item Pair | Hash Function Output | Bucket
(A, B)    | (65 + 66) % 5 = 1    | 1
(A, C)    | (65 + 67) % 5 = 2    | 2
(A, D)    | (65 + 68) % 5 = 3    | 3
(A, E)    | (65 + 69) % 5 = 4    | 4
(B, C)    | (66 + 67) % 5 = 3    | 3
(B, D)    | (66 + 68) % 5 = 4    | 4
(B, E)    | (66 + 69) % 5 = 0    | 0
(C, D)    | (67 + 68) % 5 = 0    | 0
(C, E)    | (67 + 69) % 5 = 1    | 1
(D, E)    | (68 + 69) % 5 = 2    | 2
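
These bucket assignments are easy to verify in code; a short version of the hash function (bucket is our name for it):

    def bucket(x, y):
        # H(X, Y) = (ASCII(X) + ASCII(Y)) mod 5, for single-character items.
        return (ord(x) + ord(y)) % 5

    bucket("A", "B")  # (65 + 66) % 5 == 1
    bucket("B", "E")  # (66 + 69) % 5 == 0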
Counting only the pairs that actually co-occur in some transaction, the hash bucket counts are:

Bucket | Pairs Hashed Here | Count | Transactions Contributing
0      | (B, E), (C, D)    | 4     | 1, 2, 3, 5
1      | (A, B), (C, E)    | 4     | 1, 3, 4, 5
2      | (A, C), (D, E)    | 2     | 1, 2
3      | (A, D), (B, C)    | 4     | 1, 1, 2, 3
4      | (A, E), (B, D)    | 2     | 1, 5

(Transaction 1 appears twice for bucket 3 because it contains both (A, D) and (B, C).)
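
A sketch of how these bucket counts are accumulated during Pass 1; note that only pairs that actually co-occur in a basket are hashed, which is why (D, E) contributes nothing to bucket 2:

    from collections import Counter
    from itertools import combinations

    NUM_BUCKETS = 5
    bucket_counts = Counter()

    # Still Pass 1: hash every co-occurring pair and bump its bucket's count.
    for basket in transactions:
        for x, y in combinations(sorted(basket), 2):
            bucket_counts[bucket(x, y)] += 1

    # bucket_counts == Counter({0: 4, 1: 4, 3: 4, 2: 2, 4: 2})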
Step 3: Identify Frequent Buckets
We set a bucket threshold of 3 (for simplicity, equal to the support threshold): only buckets
with counts ≥ 3 are considered frequent.
Buckets {0, 1, 3} are frequent because they have counts ≥ 3; buckets {2, 4} are not.
Bitmap Representation
We create a bitmap where:
● 1 indicates a frequent bucket (count ≥ 3).
● 0 indicates an infrequent bucket (count < 3).
Bucket | Frequent? | Bitmap Value
0      | Yes       | 1
1      | Yes       | 1
2      | No        | 0
3      | Yes       | 1
4      | No        | 0

So, reading buckets 0 through 4 in order, the final bitmap representation is:
11010
This bitmap helps eliminate candidate pairs that hash to an infrequent bucket, making the
second pass more efficient.
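
Building the bitmap from the bucket counts is a one-liner (BUCKET_THRESHOLD is our name for the bucket threshold of 3):

    BUCKET_THRESHOLD = 3

    # One bit per bucket: 1 if the bucket's count reached the threshold.
    bitmap = [1 if bucket_counts[b] >= BUCKET_THRESHOLD else 0
              for b in range(NUM_BUCKETS)]
    # bitmap == [1, 1, 0, 1, 0], i.e. "11010"

After this point the per-bucket counts can be discarded; only the five bits are carried into the second pass, which is where PCY's memory saving comes from.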
Step 4: Generate Candidate Pairs (Pass 2)
Instead of considering all item pairs, we only keep pairs where:
1. Both items are frequent (from Step 1).
2. The pair’s hash bucket was frequent (from Step 3).
Candidate pairs:
● (A, B) → both items are frequent; bucket 1 is frequent → kept
● (A, C) → both items are frequent, but bucket 2 is infrequent → discarded
● (A, D) → D is infrequent → discarded
● (A, E) → E is infrequent → discarded
● (B, C) → both items are frequent; bucket 3 is frequent → kept
● (B, D) → D is infrequent → discarded
● (B, E) → E is infrequent → discarded
● (C, D) → D is infrequent → discarded
● (C, E) → E is infrequent → discarded
● (D, E) → both D and E are infrequent → discarded
So, our final candidate pairs are:
(A, B) and (B, C)
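
Expressed in code (reusing frequent_items, bitmap, bucket, and the imports from the earlier snippets), the two filters combine into a single comprehension:

    # Keep a pair only if both items are frequent AND its bucket's bit is set.
    candidates = [
        (x, y)
        for x, y in combinations(sorted(frequent_items), 2)
        if bitmap[bucket(x, y)] == 1
    ]
    # candidates == [('A', 'B'), ('B', 'C')]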
Step 5: Count Frequent Pairs (Pass 2)
Now, we scan the dataset again and count occurrences of the remaining candidate pairs.
Pair   | Count
(A, B) | 3
(B, C) | 2

Since (A, B) has a count ≥ 3, it is the only frequent pair. Notice that (B, C) reached this
second pass only because it shares bucket 3 with (A, D); its exact count of 2 now eliminates it.
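
A minimal sketch of this second pass, again reusing the names defined in the earlier snippets:

    # Pass 2: rescan the data, counting only the surviving candidate pairs.
    pair_counts = Counter()
    for basket in transactions:
        for pair in candidates:
            if set(pair) <= basket:  # both items present in this basket
                pair_counts[pair] += 1

    frequent_pairs = [p for p, c in pair_counts.items() if c >= SUPPORT]
    # pair_counts == {('A', 'B'): 3, ('B', 'C'): 2}; frequent_pairs == [('A', 'B')]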
Final Result:
The frequent item pairs found using the PCY algorithm are:
● (A, B)
Summary of PCY Optimization
● Uses a hash table to reduce candidate pairs, unlike Apriori, which counts every pair
of frequent items in the second pass.
● Avoids counting too many non-frequent pairs in the second pass.
● Efficient for large datasets with a limited number of frequent items.
Improvement over the Apriori Algorithm
The PCY (Park-Chen-Yu) Algorithm improves the Apriori Algorithm by reducing the number
of candidate pairs considered in the second pass. Here’s how:
1. Efficient Use of Memory:
○ Apriori keeps a count for every candidate pair in memory, leading to high space complexity.
○ PCY hashes pairs into a fixed number of buckets during the first pass and carries
only a compact bitmap of frequent buckets into the second pass, reducing memory usage.
2. Optimized Candidate Pair Generation:
○ Apriori generates all possible pairs of frequent items.
○ PCY eliminates unlikely pairs early by using the hash buckets to filter out those
that cannot be frequent.
3. Faster Processing:
○ In its second pass, Apriori counts every pair of frequent items.
○ PCY reduces the number of pairs to be counted in the second pass, making it
more computationally efficient.
Key Improvement:
Instead of counting every pair of frequent items in the second pass (as Apriori does), PCY
prunes many non-promising pairs using hash-based filtering, making it much faster for large
datasets.
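
Putting the pieces together, here is a minimal end-to-end sketch of PCY for pairs. The function name pcy_pairs and its parameters are our own choices, and the bucket threshold is taken equal to the support threshold, as in the worked example above:

    from collections import Counter
    from itertools import combinations

    def pcy_pairs(transactions, support, num_buckets, hash_fn):
        # Pass 1: item counts plus hashed pair (bucket) counts.
        item_counts, bucket_counts = Counter(), Counter()
        for basket in transactions:
            item_counts.update(basket)
            for x, y in combinations(sorted(basket), 2):
                bucket_counts[hash_fn(x, y) % num_buckets] += 1

        frequent_items = {i for i, c in item_counts.items() if c >= support}
        # Compress bucket counts to one bit per bucket.
        bitmap = [int(bucket_counts[b] >= support) for b in range(num_buckets)]

        # Pass 2: count only pairs of frequent items in frequent buckets.
        pair_counts = Counter()
        for basket in transactions:
            items = sorted(i for i in basket if i in frequent_items)
            for x, y in combinations(items, 2):
                if bitmap[hash_fn(x, y) % num_buckets]:
                    pair_counts[(x, y)] += 1

        return {p: c for p, c in pair_counts.items() if c >= support}

    # pcy_pairs(transactions, 3, 5, lambda x, y: ord(x) + ord(y))
    # -> {('A', 'B'): 3}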
Advantages and Disadvantages of the PCY Algorithm
Advantages of the PCY Algorithm
1. Memory Efficiency: Uses a hash table of bucket counts rather than a count for every
candidate pair, reducing memory usage compared to Apriori.
2. Faster Execution: Eliminates many non-frequent pairs early, reducing the number of
candidate pairs in the second pass.
3. Improved Scalability: Works well for large datasets by minimizing the number of
computations.
4. Better Disk Utilization: Uses a bitmap structure to store frequent hash buckets,
reducing the need for multiple disk accesses.
Disadvantages of the PCY Algorithm
1. Hash Collisions: Frequent and infrequent pairs may hash to the same bucket, leading
to false positives.
2. Extra Computation in Pass 1: Additional time is spent hashing item pairs, which may
not always lead to significant gains.
3. Fixed Hash Table Size: If the hash table is too small, too many pairs fall into the same
bucket, reducing efficiency.
4. Limited to Pair Counting: Doesn't extend well to higher-order itemsets (e.g., triples,
quadruples) without additional modifications.
The PCY (Park-Chen-Yu) algorithm is designed to efficiently find frequent itemsets in
large datasets, but its hash-based filter can let incorrect candidates through in the form of
false positives. However, it does not produce false negatives: a truly frequent pair always
hashes to a bucket whose count is at least as large as the pair's own count, so that bucket
is necessarily frequent and the pair is never filtered out.
False Positives
The PCY algorithm may keep some itemsets as candidates as if they were frequent when
they are actually not, due to the following reasons:
1. Hash collisions: Multiple item pairs can hash to the same bucket, so an infrequent
pair can ride on the counts of others and be mistakenly kept as a candidate. In the
worked example above, (B, C) has a true count of only 2 but survives to the second
pass because it shares bucket 3 with (A, D).
2. Bitmap limitations: The bitmap records only whether a bucket is frequent, so any
infrequent pair that happens to hash to a frequent bucket is kept as a candidate.
Accuracy Improvements
To mitigate the risk of false positives, several variations of the PCY algorithm have been
developed:
1. Multistage Algorithm: Uses additional hash tables in additional passes to further
reduce the number of candidate pairs, decreasing false positives.
2. Multihash Algorithm: Uses multiple hash functions in a single pass to achieve
similar benefits to the Multistage Algorithm while avoiding extra passes (see the
sketch after this list).
3. Threshold adjustment: Setting an appropriate threshold for bucket frequency can
help reduce false positives and improve the algorithm's accuracy.
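
To illustrate the Multihash idea, here is a sketch that maintains two hash tables in a single pass; the second hash function is an arbitrary choice of ours, not part of the original example:

    from collections import Counter
    from itertools import combinations

    hash_fns = [
        lambda x, y: (ord(x) + ord(y)) % 5,       # the hash from the worked example
        lambda x, y: (ord(x) * 31 + ord(y)) % 7,  # an arbitrary second hash
    ]
    tables = [Counter() for _ in hash_fns]

    # Single pass: every co-occurring pair updates one bucket in each table.
    for basket in transactions:
        for x, y in combinations(sorted(basket), 2):
            for h, table in zip(hash_fns, tables):
                table[h(x, y)] += 1

    def is_candidate(x, y, threshold=3):
        # A pair survives only if ALL of its buckets are frequent,
        # so a lucky collision in one table is no longer enough.
        return all(t[h(x, y)] >= threshold for h, t in zip(hash_fns, tables))

A pair must now land in a frequent bucket under every hash function simultaneously to slip through, which sharply lowers the chance of a false-positive candidate.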
It's important to note that while these improvements can reduce false positives, they
cannot eliminate them entirely. The PCY algorithm and its variations are designed to be
memory-efficient approximations, trading some accuracy for improved performance on
large datasets.