Example: PCY Algorithm Step-by-Step
Let's go through a step-by-step example of the PCY (Park-Chen-Yu) Algorithm, an
improvement over the Apriori algorithm for finding frequent itemsets. PCY puts the memory that
is otherwise idle during the first pass to work: it hashes item pairs into a table of bucket counts,
and uses that table to prune candidate pairs in the second pass.
Dataset:
Let's assume we have the following transaction database:

Transaction ID | Items Purchased
1              | A, B, C, D
2              | A, C, D
3              | B, C, E
4              | A, B
5              | A, B, E
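
To make the steps concrete, the snippets in this walkthrough use Python; here is the same database encoded as a list of sets (the name transactions is our own choice):

    # The five example transactions, one set of items per basket.
    transactions = [
        {"A", "B", "C", "D"},  # transaction 1
        {"A", "C", "D"},       # transaction 2
        {"B", "C", "E"},       # transaction 3
        {"A", "B"},            # transaction 4
        {"A", "B", "E"},       # transaction 5
    ]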
Step 1: Count Frequent Items (Pass 1)
We count the occurrences of each item in the dataset:
Item | Count
A    | 4
B    | 4
C    | 3
D    | 2
E    | 2
Assuming a support threshold of 3, the frequent items are:
{A, B, C} (since D and E have counts less than 3, they are discarded).
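
A minimal sketch of this first-pass item count, reusing the transactions list from above (SUPPORT and frequent_items are our own names):

    from collections import Counter

    SUPPORT = 3  # the support threshold assumed in this example

    # Pass 1: count every individual item across all transactions.
    item_counts = Counter(item for basket in transactions for item in basket)
    # item_counts == Counter({'A': 4, 'B': 4, 'C': 3, 'D': 2, 'E': 2})

    frequent_items = {item for item, c in item_counts.items() if c >= SUPPORT}
    # frequent_items == {'A', 'B', 'C'}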
Step 2: Hashing Item Pairs into Buckets (Pass 1)
Now, for each transaction we generate the pairs of items that appear together and hash each pair into a hash table.
Hash Function Example:
Let’s use a simple hash function:
H(X, Y) = (ASCII(X) + ASCII(Y)) mod 5
Item Pair | Hash Function Output | Bucket
(A, B)    | (65 + 66) % 5 = 1    | 1
(A, C)    | (65 + 67) % 5 = 2    | 2
(A, D)    | (65 + 68) % 5 = 3    | 3
(A, E)    | (65 + 69) % 5 = 4    | 4
(B, C)    | (66 + 67) % 5 = 3    | 3
(B, D)    | (66 + 68) % 5 = 4    | 4
(B, E)    | (66 + 69) % 5 = 0    | 0
(C, D)    | (67 + 68) % 5 = 0    | 0
(C, E)    | (67 + 69) % 5 = 1    | 1
(D, E)    | (68 + 69) % 5 = 2    | 2
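
These bucket assignments are easy to verify in code; a short version of the hash function (bucket is our name for it):

    def bucket(x, y):
        # H(X, Y) = (ASCII(X) + ASCII(Y)) mod 5, for single-character items.
        return (ord(x) + ord(y)) % 5

    bucket("A", "B")  # (65 + 66) % 5 == 1
    bucket("B", "E")  # (66 + 69) % 5 == 0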
Counting only the pairs that actually co-occur in some transaction, the hash bucket counts are:

Bucket | Pairs Hashed Here | Count | Transactions Contributing
0      | (B, E), (C, D)    | 4     | 1, 2, 3, 5
1      | (A, B), (C, E)    | 4     | 1, 3, 4, 5
2      | (A, C), (D, E)    | 2     | 1, 2
3      | (A, D), (B, C)    | 4     | 1, 1, 2, 3
4      | (A, E), (B, D)    | 2     | 1, 5

(Transaction 1 appears twice for bucket 3 because it contains both (A, D) and (B, C).)
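
A sketch of how these bucket counts are accumulated during Pass 1; note that only pairs that actually co-occur in a basket are hashed, which is why (D, E) contributes nothing to bucket 2:

    from collections import Counter
    from itertools import combinations

    NUM_BUCKETS = 5
    bucket_counts = Counter()

    # Still Pass 1: hash every co-occurring pair and bump its bucket's count.
    for basket in transactions:
        for x, y in combinations(sorted(basket), 2):
            bucket_counts[bucket(x, y)] += 1

    # bucket_counts == Counter({0: 4, 1: 4, 3: 4, 2: 2, 4: 2})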
Step 3: Identify Frequent Buckets
We set a bucket threshold of 3 (for simplicity, equal to the support threshold): only buckets
with counts ≥ 3 are considered frequent.
Buckets {0, 1, 3} are frequent because they have counts ≥ 3; buckets {2, 4} are not.
Bitmap Representation
We create a bitmap where:
● 1 indicates a frequent bucket (count ≥ 3).
● 0 indicates an infrequent bucket (count < 3).
Bucket | Frequent? | Bitmap Value
0      | Yes       | 1
1      | Yes       | 1
2      | No        | 0
3      | Yes       | 1
4      | No        | 0

So, reading buckets 0 through 4 in order, the final bitmap representation is:
11010
This bitmap helps eliminate candidate pairs that hash to an infrequent bucket, making the
second pass more efficient.
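
Building the bitmap from the bucket counts is a one-liner (BUCKET_THRESHOLD is our name for the bucket threshold of 3):

    BUCKET_THRESHOLD = 3

    # One bit per bucket: 1 if the bucket's count reached the threshold.
    bitmap = [1 if bucket_counts[b] >= BUCKET_THRESHOLD else 0
              for b in range(NUM_BUCKETS)]
    # bitmap == [1, 1, 0, 1, 0], i.e. "11010"

After this point the per-bucket counts can be discarded; only the five bits are carried into the second pass, which is where PCY's memory saving comes from.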
Step 4: Generate Candidate Pairs (Pass 2)
Instead of considering all item pairs, we only keep pairs where:
1. Both items are frequent (from Step 1).
2. The pair’s hash bucket was frequent (from Step 3).
Candidate pairs:
● (A, B) → both items are frequent; bucket 1 is frequent → kept
● (A, C) → both items are frequent, but bucket 2 is infrequent → discarded
● (A, D) → D is infrequent → discarded
● (A, E) → E is infrequent → discarded
● (B, C) → both items are frequent; bucket 3 is frequent → kept
● (B, D) → D is infrequent → discarded
● (B, E) → E is infrequent → discarded
● (C, D) → D is infrequent → discarded
● (C, E) → E is infrequent → discarded
● (D, E) → both D and E are infrequent → discarded
So, our final candidate pairs are:
(A, B) and (B, C)
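
Expressed in code (reusing frequent_items, bitmap, bucket, and the imports from the earlier snippets), the two filters combine into a single comprehension:

    # Keep a pair only if both items are frequent AND its bucket's bit is set.
    candidates = [
        (x, y)
        for x, y in combinations(sorted(frequent_items), 2)
        if bitmap[bucket(x, y)] == 1
    ]
    # candidates == [('A', 'B'), ('B', 'C')]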
Step 5: Count Frequent Pairs (Pass 2)
Now, we scan the dataset again and count occurrences of the remaining candidate pairs.
Pair   | Count
(A, B) | 3
(B, C) | 2

Since (A, B) has a count ≥ 3, it is the only frequent pair. Notice that (B, C) reached this
second pass only because it shares bucket 3 with (A, D); its exact count of 2 now eliminates it.
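
A minimal sketch of this second pass, again reusing the names defined in the earlier snippets:

    # Pass 2: rescan the data, counting only the surviving candidate pairs.
    pair_counts = Counter()
    for basket in transactions:
        for pair in candidates:
            if set(pair) <= basket:  # both items present in this basket
                pair_counts[pair] += 1

    frequent_pairs = [p for p, c in pair_counts.items() if c >= SUPPORT]
    # pair_counts == {('A', 'B'): 3, ('B', 'C'): 2}; frequent_pairs == [('A', 'B')]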
Final Result:
The frequent item pairs found using the PCY algorithm are:
● (A, B)
Summary of PCY Optimization
● Uses a hash table to reduce candidate pairs, unlike Apriori, which counts every pair
of frequent items in the second pass.
● Avoids counting too many non-frequent pairs in the second pass.
● Efficient for large datasets with a limited number of frequent items.
Improvement over the Apriori Algorithm
The PCY (Park-Chen-Yu) Algorithm improves the Apriori Algorithm by reducing the number
of candidate pairs considered in the second pass. Here’s how:
1. Efficient Use of Memory:
○ Apriori keeps a count for every candidate pair in memory, leading to high space complexity.
○ PCY hashes pairs into a fixed number of buckets during the first pass and carries
only a compact bitmap of frequent buckets into the second pass, reducing memory usage.
2. Optimized Candidate Pair Generation:
○ Apriori generates all possible pairs of frequent items.
○ PCY eliminates unlikely pairs early by using the hash buckets to filter out those
that cannot be frequent.
3. Faster Processing:
○ In its second pass, Apriori counts every pair of frequent items.
○ PCY reduces the number of pairs to be counted in the second pass, making it
more computationally efficient.
Key Improvement:
Instead of counting every pair of frequent items in the second pass (as Apriori does), PCY
prunes many non-promising pairs using hash-based filtering, making it much faster for large
datasets.
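
Putting the pieces together, here is a minimal end-to-end sketch of PCY for pairs. The function name pcy_pairs and its parameters are our own choices, and the bucket threshold is taken equal to the support threshold, as in the worked example above:

    from collections import Counter
    from itertools import combinations

    def pcy_pairs(transactions, support, num_buckets, hash_fn):
        # Pass 1: item counts plus hashed pair (bucket) counts.
        item_counts, bucket_counts = Counter(), Counter()
        for basket in transactions:
            item_counts.update(basket)
            for x, y in combinations(sorted(basket), 2):
                bucket_counts[hash_fn(x, y) % num_buckets] += 1

        frequent_items = {i for i, c in item_counts.items() if c >= support}
        # Compress bucket counts to one bit per bucket.
        bitmap = [int(bucket_counts[b] >= support) for b in range(num_buckets)]

        # Pass 2: count only pairs of frequent items in frequent buckets.
        pair_counts = Counter()
        for basket in transactions:
            items = sorted(i for i in basket if i in frequent_items)
            for x, y in combinations(items, 2):
                if bitmap[hash_fn(x, y) % num_buckets]:
                    pair_counts[(x, y)] += 1

        return {p: c for p, c in pair_counts.items() if c >= support}

    # pcy_pairs(transactions, 3, 5, lambda x, y: ord(x) + ord(y))
    # -> {('A', 'B'): 3}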
Advantages and Disadvantages of the PCY Algorithm
Advantages of the PCY Algorithm
1. Memory Efficiency: Uses a hash table of bucket counts rather than a count for every
candidate pair, reducing memory usage compared to Apriori.
2. Faster Execution: Eliminates many non-frequent pairs early, reducing the number of
candidate pairs in the second pass.
3. Improved Scalability: Works well for large datasets by minimizing the number of
computations.
4. Better Disk Utilization: Uses a bitmap structure to store frequent hash buckets,
reducing the need for multiple disk accesses.
Disadvantages of the PCY Algorithm
1. Hash Collisions: Frequent and infrequent pairs may hash to the same bucket, leading
to false positives.
2. Extra Computation in Pass 1: Additional time is spent hashing item pairs, which may
not always lead to significant gains.
3. Fixed Hash Table Size: If the hash table is too small, too many pairs fall into the same
bucket, reducing efficiency.
4. Limited to Pair Counting: Doesn't extend well to higher-order itemsets (e.g., triples,
quadruples) without additional modifications.
The PCY (Park-Chen-Yu) algorithm is designed to efficiently find frequent itemsets in
large datasets, but its hash-based filter can let incorrect candidates through in the form of
false positives. However, it does not produce false negatives: a truly frequent pair always
hashes to a bucket whose count is at least as large as the pair's own count, so that bucket
is necessarily frequent and the pair is never filtered out.
False Positives
The PCY algorithm may keep some itemsets as candidates as if they were frequent when
they are actually not, due to the following reasons:
1. Hash collisions: Multiple item pairs can hash to the same bucket, so an infrequent
pair can ride on the counts of others and be mistakenly kept as a candidate. In the
worked example above, (B, C) has a true count of only 2 but survives to the second
pass because it shares bucket 3 with (A, D).
2. Bitmap limitations: The bitmap records only whether a bucket is frequent, so any
infrequent pair that happens to hash to a frequent bucket is kept as a candidate.
Accuracy Improvements
To mitigate the risk of false positives, several variations of the PCY algorithm have been
developed:
1. Multistage Algorithm: Uses additional hash tables in additional passes to further
reduce the number of candidate pairs, decreasing false positives.
2. Multihash Algorithm: Uses multiple hash functions in a single pass to achieve
similar benefits to the Multistage Algorithm while avoiding extra passes (see the
sketch after this list).
3. Threshold adjustment: Setting an appropriate threshold for bucket frequency can
help reduce false positives and improve the algorithm's accuracy.
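
To illustrate the Multihash idea, here is a sketch that maintains two hash tables in a single pass; the second hash function is an arbitrary choice of ours, not part of the original example:

    from collections import Counter
    from itertools import combinations

    hash_fns = [
        lambda x, y: (ord(x) + ord(y)) % 5,       # the hash from the worked example
        lambda x, y: (ord(x) * 31 + ord(y)) % 7,  # an arbitrary second hash
    ]
    tables = [Counter() for _ in hash_fns]

    # Single pass: every co-occurring pair updates one bucket in each table.
    for basket in transactions:
        for x, y in combinations(sorted(basket), 2):
            for h, table in zip(hash_fns, tables):
                table[h(x, y)] += 1

    def is_candidate(x, y, threshold=3):
        # A pair survives only if ALL of its buckets are frequent,
        # so a lucky collision in one table is no longer enough.
        return all(t[h(x, y)] >= threshold for h, t in zip(hash_fns, tables))

A pair must now land in a frequent bucket under every hash function simultaneously to slip through, which sharply lowers the chance of a false-positive candidate.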
It's important to note that while these improvements can reduce false positives, they
cannot eliminate them entirely. The PCY algorithm and its variations are designed to be
memory-efficient approximations, trading some accuracy for improved performance on
large datasets.