DMKD_MODULE_PART-B
Created @June 11, 2025 4:34 PM
MODULE 4 - PART -B
🔶 GROUP 1: Frequent Itemsets, Association Rules, and Core Concepts
Q1. Define the terms frequent item sets, closed item sets, and
association rules.
Frequent Itemsets:
An itemset is frequent if it occurs in a transactional database with a frequency no
less than a user-specified minimum support threshold. These patterns are the
foundation for mining meaningful associations in market basket analysis.
Closed Itemsets:
A frequent itemset is said to be closed if none of its immediate supersets has the
same support count as the itemset. Closed itemsets eliminate redundancy in
frequent itemset mining by retaining only the most representative patterns.
Association Rules:
Association rules are implications in the form A ⇒ B, where A and B are itemsets,
and the rule suggests that if A occurs, B is likely to occur as well. The strength of
a rule is measured using support and confidence. Support shows how often the
rule appears in the database, while confidence measures how often items in B
appear in transactions that contain A.
Q2. Which algorithm is influential for mining frequent item sets
for Boolean association rules? Explain with an example.
Apriori Algorithm is a fundamental algorithm for mining frequent itemsets from
Boolean association rules. It uses a level-wise, breadth-first search strategy
where itemsets of length k are used to generate itemsets of length k+1. It employs
the anti-monotone property: if an itemset is not frequent, then none of its
supersets can be frequent.
Example:
Let the database contain 5 transactions:
T1: A, B, C
T2: A, C
T3: B, C
T4: A, B
T5: A, B, C
With minimum support = 3 (i.e., 60%):
Frequent 1-itemsets: A(4), B(4), C(4)
Generate 2-itemsets: AB(3), AC(3), BC(3)
All three are frequent.
Generate 3-itemset: ABC(2), which is infrequent and thus discarded.
The Apriori algorithm effectively eliminates unnecessary candidates using
pruning.
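A brute-force check of these counts can be written in a few lines of Python. This is only a verification sketch for the example above, not the level-wise Apriori procedure itself:

```python
from itertools import combinations

# Transactions from the example above
transactions = [
    {"A", "B", "C"},  # T1
    {"A", "C"},       # T2
    {"B", "C"},       # T3
    {"A", "B"},       # T4
    {"A", "B", "C"},  # T5
]
min_support = 3

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
for k in (1, 2, 3):
    for cand in combinations(items, k):
        s = support(set(cand))
        print(cand, s, "frequent" if s >= min_support else "pruned")
```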
Q10. Explain the Apriori algorithm with example.
Apriori Algorithm Steps:
1. Scan database to find frequent 1-itemsets.
2. Use frequent (k-1)-itemsets to generate candidate k-itemsets.
3. Prune candidate itemsets if any subset is not frequent.
4. Count support for each candidate and retain those meeting minimum support.
5. Repeat until no new itemsets are found.
Example:
Dataset:
T1: Milk, Bread, Butter
T2: Milk, Bread
T3: Bread, Butter
T4: Milk, Butter
Min Support: 2
1. L1 = {Milk(3), Bread(3), Butter(3)}
2. Generate L2 = {Milk, Bread}(2), {Milk, Butter}(2), {Bread, Butter}(2) — all meet min support
3. L3 = {Milk, Bread, Butter} appears in 1 transaction → discarded
The Apriori algorithm continues this process to discover all frequent itemsets.
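A compact level-wise sketch of these steps, applied to the same dataset, might look as follows (all function and variable names are illustrative, not taken from any library):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining (simplified sketch)."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count support of each candidate in one pass over the data
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: combine frequent k-itemsets into (k+1)-candidates and
        # prune any candidate with an infrequent k-subset (Apriori property)
        k += 1
        candidates = set()
        level_sets = list(level)
        for i in range(len(level_sets)):
            for j in range(i + 1, len(level_sets)):
                union = level_sets[i] | level_sets[j]
                if len(union) == k and all(
                    frozenset(s) in level for s in combinations(union, k - 1)
                ):
                    candidates.add(union)
        current = candidates
    return frequent

itemsets = apriori(
    [{"Milk", "Bread", "Butter"}, {"Milk", "Bread"},
     {"Bread", "Butter"}, {"Milk", "Butter"}],
    min_support=2,
)
for s, n in sorted(itemsets.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(s), n)
```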
Q3. Describe the different techniques to improve the efficiency
of Apriori.
1. Hash-based Itemset Counting:
Uses hash structures to count itemsets efficiently by reducing the number of
candidate sets through hashing techniques.
2. Transaction Reduction:
A transaction that contains no frequent k-itemsets cannot contain any frequent (k+1)-itemsets, so it can be removed or skipped in subsequent scans (a short sketch follows this list).
3. Partitioning:
The database is divided into partitions and frequent itemsets are found locally
in each partition. Only these are considered for global frequent itemset
checking.
4. Dynamic Itemset Counting:
New candidate itemsets can be added dynamically as the database is
scanned.
5. Sampling:
A random subset of the database is used to find frequent itemsets. These are
then verified against the entire database.
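As one illustration, transaction reduction (technique 2 above) can be sketched as a simple filter applied between passes; the helper below is hypothetical:

```python
def reduce_transactions(transactions, frequent_k_itemsets):
    """Drop transactions containing none of the frequent k-itemsets,
    since they cannot contribute to any frequent (k+1)-itemset."""
    return [
        t for t in transactions
        if any(itemset <= t for itemset in frequent_k_itemsets)
    ]

# Example: after finding frequent 2-itemsets, shrink the database
db = [{"A", "B", "C"}, {"A"}, {"B", "C"}, {"D"}]
frequent_2 = [frozenset({"B", "C"})]
print(reduce_transactions(db, frequent_2))  # keeps only T1 and T3
```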
Q17. Describe mining closed frequent itemsets.
Mining closed frequent itemsets is an approach to reduce redundancy in frequent
pattern mining. A closed itemset is one where none of its supersets has the same
support.
Benefits:
It reduces the number of patterns.
Closed itemsets, together with their supports, can regenerate all frequent itemsets and their exact supports.
They provide a compact yet complete representation of frequent patterns.
Algorithms Used:
CHARM: Exploits a vertical format and itemset closure properties.
CLOSET: Uses a depth-first approach on an FP-tree structure.
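Given all frequent itemsets with their supports (e.g., the output of an Apriori run), the closed ones can be extracted with a simple post-processing filter. The hypothetical helper below illustrates the definition rather than the CHARM/CLOSET algorithms themselves:

```python
def closed_itemsets(frequent):
    """Keep itemsets with no proper superset of identical support.
    `frequent` maps frozenset -> support count."""
    closed = {}
    for itemset, sup in frequent.items():
        has_equal_superset = any(
            itemset < other and sup == other_sup
            for other, other_sup in frequent.items()
        )
        if not has_equal_superset:
            closed[itemset] = sup
    return closed

frequent = {
    frozenset({"A"}): 4,
    frozenset({"B"}): 4,
    frozenset({"A", "B"}): 4,   # same support as {A} and {B}, so those are not closed
    frozenset({"C"}): 3,
}
print(closed_itemsets(frequent))   # {A, B}: 4 and {C}: 3 remain
```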
🔶 GROUP 2: FP-Growth and Vertical Mining
Q4. Discuss the FP-growth algorithm. Explain with an example.
FP-Growth (Frequent Pattern Growth) is an efficient alternative to Apriori, avoiding
candidate generation.
Steps:
1. Build FP-Tree:
Scan DB for frequent 1-items.
Sort items by frequency and insert into a compact prefix tree.
2. Mine Tree Recursively:
Start from lowest frequency item, construct conditional pattern base.
Build conditional FP-tree.
Recursively mine frequent patterns.
Example:
Transactions:
T1: a, b
T2: b, c, d
T3: a, c, d, e
T4: a, d, e
T5: a, b, c
Assuming min support = 2, the frequency-sorted item order is a(4), b(3), c(3), d(3), e(2). Each transaction is reordered accordingly and inserted into the FP-tree; frequent patterns are then mined bottom-up (starting from e) using conditional pattern bases and conditional FP-trees.
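A minimal sketch of the FP-tree construction step for this example, assuming min support = 2; the node layout is illustrative, and the header table/node-links needed for the recursive mining step are omitted for brevity:

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item          # item label ('a', 'b', ...) or None for the root
        self.count = 0            # number of transactions sharing this prefix
        self.parent = parent
        self.children = {}        # item -> FPNode

def build_fp_tree(transactions, min_support):
    # Pass 1: count item frequencies and keep only frequent items
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root = FPNode(None)
    # Pass 2: insert each transaction, items sorted by descending frequency
    for t in transactions:
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, freq

def show(node, depth=0):
    if node.item is not None:
        print("  " * depth + f"{node.item}:{node.count}")
        depth += 1
    for child in node.children.values():
        show(child, depth)

tree, freq = build_fp_tree(
    [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"},
     {"a", "d", "e"}, {"a", "b", "c"}],
    min_support=2,
)
show(tree)
```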
Q5. Explain how to mine frequent item sets using vertical data
format.
In the vertical format, each item is represented with a list of transaction IDs (TID-
list) instead of individual transactions.
Steps:
1. Create TID-lists for all items.
2. To compute support of a candidate itemset, intersect TID-lists.
3. Retain those with TID-list count ≥ minimum support.
Example:
A: {1,2,3}, B: {2,3,4}
A ∩ B = {2,3} → support = 2
This reduces I/O operations and accelerates computation by minimizing database
scans.
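A minimal sketch of the TID-list idea for this example (names are illustrative):

```python
from collections import defaultdict

transactions = {1: {"A"}, 2: {"A", "B"}, 3: {"A", "B"}, 4: {"B"}}

# Build the vertical representation: item -> set of transaction IDs
tid_lists = defaultdict(set)
for tid, items in transactions.items():
    for item in items:
        tid_lists[item].add(tid)

# Support of {A, B} is the size of the intersection of the two TID-lists
support_AB = len(tid_lists["A"] & tid_lists["B"])
print(tid_lists["A"], tid_lists["B"], support_AB)   # {1,2,3} {2,3,4} 2
```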
🔶 GROUP 3: Multilevel, Multidimensional, and Constraint-Based Mining
Q6. Discuss mining multilevel association rules from transaction
databases.
Multilevel association rules involve discovering patterns at multiple abstraction
levels using concept hierarchies (e.g., “milk” under “dairy”).
Steps:
1. Encode items with hierarchy levels (e.g., 1** for dairy, 11* for milk).
2. Apply mining algorithm at higher level.
3. Gradually drill down to lower levels.
Approaches:
Use reduced support at deeper levels.
Prune irrelevant items using hierarchy pruning techniques.
Example:
Level 1: Dairy ⇒ Snacks
Level 2: Milk ⇒ Chips
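One simple way to mine at a higher level is to replace each item by its ancestor in the concept hierarchy before running a standard algorithm; the hierarchy dictionary below is made up for illustration:

```python
# Illustrative concept hierarchy: item -> level-1 category
hierarchy = {"milk": "dairy", "cheese": "dairy",
             "chips": "snacks", "cookies": "snacks"}

def lift_to_level1(transactions):
    """Replace each item by its level-1 ancestor, then mine as usual."""
    return [{hierarchy.get(item, item) for item in t} for t in transactions]

raw = [{"milk", "chips"}, {"cheese", "cookies"}, {"milk", "cookies"}]
print(lift_to_level1(raw))
# All three transactions now contain {dairy, snacks}, so Dairy => Snacks is strong
# even when individual lower-level rules (e.g., Milk => Chips) are weak.
```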
Q7. Explain how to mine multidimensional association rules from
relational databases.
Multidimensional association rules consider data from different attributes or
dimensions.
Types:
Intra-dimensional (single-dimensional): A single predicate appears repeatedly (e.g., buys=milk ⇒ buys=bread).
Inter-dimensional: Two or more distinct predicates/dimensions are involved (e.g., age=20, income=high ⇒ buys=TV).
Steps:
1. Transform relational data into transactions.
2. Use Apriori/FP-Growth on transformed data.
3. Generate rules across multiple dimensions.
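Step 1, turning relational tuples into transactions of attribute=value items, can be sketched like this (column names are illustrative):

```python
rows = [
    {"age": "20-29", "income": "high", "buys": "TV"},
    {"age": "20-29", "income": "low",  "buys": "laptop"},
]

# Each tuple becomes a transaction of "attribute=value" items, so standard
# Apriori/FP-Growth can then mine inter-dimensional rules over them.
transactions = [{f"{attr}={val}" for attr, val in row.items()} for row in rows]
print(transactions)   # e.g. {'age=20-29', 'income=high', 'buys=TV'}
```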
Q13. Describe multidimensional association rules using static
discretization.
Static discretization involves dividing continuous attributes (e.g., age, salary) into
fixed intervals.
Steps:
1. Apply binning techniques (equal-width or equal-frequency).
2. Use discretized values in mining.
3. Generate multidimensional rules.
Benefits:
Simplifies rule mining.
Converts numeric data into symbolic for compatibility with association
algorithms.
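A small pandas sketch of equal-width and equal-frequency binning (column name, bin count, and labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 22, 25, 31, 38, 44, 52, 60]})

# Equal-width discretization into 3 fixed intervals, labelled symbolically
df["age_bin"] = pd.cut(df["age"], bins=3, labels=["young", "middle", "senior"])

# Equal-frequency (quantile) binning is the usual alternative
df["age_qbin"] = pd.qcut(df["age"], q=3, labels=["low", "mid", "high"])

print(df)
```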
Q14. Explain additional rule constraints to guide mining.
Types of Constraints:
Syntax Constraints: Rule structure (e.g., body must include 'milk').
Aggregate Constraints: Sum, avg of attributes (e.g., total price > 100).
Anti-monotone Constraints: If itemset A violates constraint, supersets of A
are discarded.
Monotone Constraints: If A satisfies the constraint, every superset of A also satisfies it, so the check need not be repeated.
Purpose:
To reduce search space.
To focus on relevant and actionable patterns.
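As an illustration of constraint pushing, an anti-monotone constraint such as "total price of the itemset ≤ budget" can prune a candidate (and all its supersets) as soon as it is violated; the prices and budget below are made up:

```python
prices = {"milk": 40, "bread": 30, "butter": 60, "cheese": 120}
BUDGET = 100

def satisfies_anti_monotone(itemset):
    """sum(price) <= BUDGET is anti-monotone: once an itemset violates it,
    every superset also violates it, so that branch can be pruned."""
    return sum(prices[i] for i in itemset) <= BUDGET

candidates = [{"milk"}, {"milk", "bread"}, {"milk", "butter"}, {"cheese"}]
kept = [c for c in candidates if satisfies_anti_monotone(c)]
print(kept)   # {'cheese'} exceeds the budget and is pruned with all its supersets
```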
Q9. Discuss constraint-based association mining.
Constraint-based mining integrates user-defined constraints into the mining
process to generate more relevant patterns.
Examples of Constraints:
Include specific items
Limit rule length
Minimum profit margin
Integration Techniques:
Constraint pushing
Post-processing
Advantage: Reduces mining time and improves interpretability.
🔶 GROUP 4: Correlation, Interestingness, and Evaluation Measures
Q8. Describe different correlation measures in association
analysis.
Correlation measures help determine whether an association rule is statistically
significant or just due to chance.
1. Lift:
Measures how much more often A and B occur together than if they were
independent.
$\text{Lift}(A \Rightarrow B) = \frac{P(A \cap B)}{P(A) \cdot P(B)}$
If lift > 1, A and B are positively correlated.
2. Chi-Square Test:
Tests whether two itemsets are statistically independent. A higher χ² value
indicates stronger dependency.
3. Conviction:
Measures the degree of implication.
$\text{Conviction}(A \Rightarrow B) = \frac{P(A) \cdot P(\neg B)}{P(A \cap \neg B)}$
4. Leverage and Cosine:
Leverage indicates the difference between observed and expected
support.
Cosine measures vector-based similarity between A and B.
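These measures can be computed directly from the rule's probabilities; a small sketch with made-up input values:

```python
def lift(p_ab, p_a, p_b):
    return p_ab / (p_a * p_b)

def leverage(p_ab, p_a, p_b):
    # Difference between observed and expected joint support
    return p_ab - p_a * p_b

def conviction(p_a, p_b, p_ab):
    # P(A) * P(not B) / P(A and not B); infinite when the rule never fails
    p_a_not_b = p_a - p_ab
    return float("inf") if p_a_not_b == 0 else p_a * (1 - p_b) / p_a_not_b

# Example: P(A) = 0.5, P(B) = 0.6, P(A and B) = 0.4
print(lift(0.4, 0.5, 0.6))        # ~1.33 -> positive correlation
print(leverage(0.4, 0.5, 0.6))    # 0.10
print(conviction(0.5, 0.6, 0.4))  # 0.5 * 0.4 / 0.1 = 2.0
```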
Q15. How can we tell which strong association rules are really
interesting? Explain with an example.
Not all strong rules (high support/confidence) are useful. Interestingness
measures such as Lift, Conviction, and Leverage are applied.
Example:
Rule: {milk} ⇒ {bread}, support = 70%, confidence = 80%
If bread occurs in 90% of transactions, confidence is not informative.
Use Lift:
Lift = confidence / P(bread) = 0.80 / 0.90 ≈ 0.89 < 1 → milk and bread are negatively correlated
Only when lift > 1 is the rule genuinely interesting
Conclusion: A rule is interesting if it is unexpected and actionable with meaningful
correlation.
Q16. Describe correlation analysis using Chi-square.
Chi-square helps evaluate the independence of two itemsets.
Steps:
1. Construct a 2×2 contingency table with observed frequencies:
A and B
A and not B
Not A and B
Not A and not B
2. Compute expected frequencies.
3. Use:
$\chi^2 = \sum \frac{(O - E)^2}{E}$
4. Compare with the critical χ² value.
A high value suggests strong correlation, validating the rule's significance.
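With SciPy available, the 2×2 test can be run directly; the observed counts below are made up for illustration:

```python
from scipy.stats import chi2_contingency

# Observed 2x2 contingency table:
#            B     not B
# A        [ 400,   350]
# not A    [ 200,  1050]
observed = [[400, 350], [200, 1050]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)   # large chi2 / small p-value => A and B are not independent
```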
Q18. Show that items in a strong association rule may be
negatively correlated.
Example:
Rule: {milk} ⇒ {bread}, confidence = 80%, support = 60%
But:
P(bread) = 90%
Lift = confidence / P(bread) = 0.80 / 0.90 ≈ 0.89 < 1 → implies negative correlation
This shows that although the rule has high support and confidence, the actual co-
occurrence is less than expected under independence.
Q19. Discuss methods to reduce the number of rules generated in
association rule mining.
1. Use of Minimum Thresholds:
Increase support/confidence thresholds.
2. Redundancy Removal:
Discard rules that add no information over a more general rule, e.g., a rule whose antecedent is a superset of another rule's antecedent, with the same consequent and no higher confidence.
3. Constraint-based Mining:
Add user constraints to focus results.
4. Use Closed or Maximal Itemsets:
Avoid generating all subsets.
5. Template-Based Pruning:
Predefine rule structures of interest.
Q20. Name pruning strategies in mining closed frequent item
sets.
1. Support Pruning:
Eliminate itemsets below min support.
2. Closure Pruning:
Discard itemsets whose closures already exist.
3. Subsumption Pruning:
Avoid itemsets that are subsets of already known frequent itemsets with
same support.
4. Lattice Traversal Pruning:
Skip redundant branches in search space.
🔶 GROUP 5: Classification Basics and Decision Trees
Q21. Explain classification and prediction with an example.
Classification: Predicts categorical class labels.
Prediction: Estimates continuous values.
Example:
Classification: Determine whether an email is "spam" or "not spam"
Prediction: Predict housing price based on features
Both are supervised learning tasks trained using labeled datasets.
Q22. Explain basic decision tree induction algorithm.
Steps:
1. Choose the best attribute using a selection measure (e.g., Information Gain).
2. Split the dataset based on this attribute.
3. Repeat recursively for each subset.
4. Stop when node is pure or cannot be split further.
Output: A tree where each internal node represents a test, each branch
represents outcome, and leaves represent classes.
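A minimal scikit-learn sketch of these steps on a made-up toy dataset (criterion="entropy" corresponds to information gain):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy dataset: [age, income] -> buys_computer (1 = yes, 0 = no)
X = [[25, 30], [30, 45], [45, 80], [35, 60], [22, 20], [50, 90]]
y = [0, 0, 1, 1, 0, 1]

# Information gain corresponds to criterion="entropy"; CART's default is "gini"
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(clf, feature_names=["age", "income"]))
print(clf.predict([[40, 70]]))
```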
Q23. Measures associated with attribute selection.
1. Information Gain: Measures entropy reduction. Higher IG = better attribute.
2. Gain Ratio: Adjusts IG by the intrinsic value of the split.
3. Gini Index: Measures impurity. Used in CART.
4. Chi-square: Tests statistical independence.
These measures guide how the decision tree is constructed.
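The entropy-based and Gini measures can be written out in a few lines; a sketch assuming class labels stored in plain Python lists:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(D) = -sum p_i * log2(p_i) over class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(D) = 1 - sum p_i^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, partitions):
    """IG = H(parent) - weighted average entropy of the partitions."""
    n = len(parent_labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent_labels) - remainder

parent = ["yes", "yes", "no", "no", "yes", "no"]
split = [["yes", "yes", "yes"], ["no", "no", "no"]]   # a perfect split
print(entropy(parent), gini(parent))                  # 1.0, 0.5
print(information_gain(parent, split))                # 1.0 (maximum gain here)
```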
Q24. How does tree pruning work? Enhancements to decision
tree induction?
Pruning: Reduces overfitting by removing branches with little predictive power.
Types:
Pre-pruning: Stops early.
Post-pruning: Grows full tree and prunes later.
Enhancements:
Ensemble methods (bagging, boosting)
Random Forests
Using surrogate splits for missing values
Q25. Explain scalability of decision tree induction.
Decision trees scale well because:
Training cost grows roughly linearly with the number of training tuples
Large, disk-resident datasets can be handled via data partitioning
Incremental and parallel construction are supported
Algorithms like SLIQ and SPRINT are designed specifically for scalability
Q35. Tree pruning in decision tree induction
Why prune?
To improve generalization
Reduce model complexity
Criteria:
Error rate minimization
Cross-validation performance
Approaches:
Reduced Error Pruning
Cost-complexity Pruning
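Cost-complexity pruning is exposed in scikit-learn through the ccp_alpha parameter; a hedged sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Grow the full tree, then examine candidate alphas along the pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

for alpha in path.ccp_alphas[::5]:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test_acc={pruned.score(X_te, y_te):.3f}")
```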
Q36. Tree to rules vs. prune then rule conversion
Approach A (Tree → Rules → Prune):
Converts tree into if-then rules
Prunes rules individually
Approach B (Prune Tree → Rules):
Prune first, then convert
Comparison:
A gives more precise, tailored rules
B is simpler, but less flexible
🔶 GROUP 6: Bayesian and Lazy Classifiers
Q26. Working of simple Bayesian classifier.
Based on Bayes Theorem:
$P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}$
Naive Bayes Assumption: Attributes are independent.
Steps:
Calculate prior P(C)
Compute P(X|C) for each attribute
Predict class with highest posterior
Advantages: Fast, handles missing data
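A minimal hand-rolled sketch for categorical attributes (the tiny training table is made up; a real implementation would add Laplace smoothing to avoid zero probabilities):

```python
from collections import Counter, defaultdict

# Toy training data: (outlook, windy) -> play
data = [
    (("sunny", "false"), "yes"), (("sunny", "true"), "no"),
    (("rainy", "false"), "yes"), (("rainy", "true"), "no"),
    (("sunny", "false"), "yes"),
]

def train(data):
    class_counts = Counter(label for _, label in data)
    # cond[class][attribute_index][value] = count
    cond = defaultdict(lambda: defaultdict(Counter))
    for features, label in data:
        for i, value in enumerate(features):
            cond[label][i][value] += 1
    return class_counts, cond

def predict(x, class_counts, cond):
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / n                       # prior P(C)
        for i, value in enumerate(x):
            # Naive assumption: P(X|C) = product of per-attribute likelihoods
            score *= cond[label][i][value] / count
        if score > best_score:
            best, best_score = label, score
    return best

class_counts, cond = train(data)
print(predict(("sunny", "false"), class_counts, cond))   # 'yes'
```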
Q27. Bayesian Belief Networks
Graph-based model
Nodes: variables; Edges: conditional dependencies
Uses:
Modeling uncertain knowledge
Inference with incomplete data
Learning:
Structure: Expert input or algorithms
Parameters: EM algorithm or Maximum Likelihood
Q34. Training of Bayesian belief networks
Structure Learning:
Constraint-based (independence tests)
Score-based (maximize scoring function)
Parameter Learning:
Estimate conditional probabilities
Use MLE or Bayesian Estimation
Tools: WEKA, BNlearn
🔶 GROUP 7: kNN, Accuracy, Eager vs Lazy, and Comparison
Q28. k-Nearest Neighbor (kNN) classifier and case-based
reasoning
kNN:
Lazy learning
Predicts class based on majority of k closest data points
Case-Based Reasoning:
Solves new problems using past cases
Steps: Retrieve → Reuse → Revise → Retain
Similarity Measure: Euclidean Distance, Cosine similarity
Q29. Classifier accuracy and how to measure it
Measures:
Accuracy = (TP + TN) / Total
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = harmonic mean of precision & recall
Confusion Matrix: tabulates actual vs predicted
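These metrics follow directly from the confusion-matrix counts; a small sketch with made-up counts:

```python
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example confusion matrix: TP=70, TN=20, FP=5, FN=5
print(classification_metrics(70, 20, 5, 5))
# accuracy 0.90, precision ~0.933, recall ~0.933, F1 ~0.933
```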
Q30. Applying ideas to association rule mining
Concepts like:
Rule pruning
Use of interestingness measures
Constraints
Ensemble approaches
can enhance association mining by improving performance and relevance.
Q31. Major issues in classification and prediction
1. Missing values
2. Noisy data
3. Overfitting
4. Imbalanced data
5. Model interpretability
6. Scalability
Solutions: Data cleaning, cross-validation, SMOTE, ensemble models
Q32. Compare classification and prediction methods
| Aspect     | Classification              | Prediction                  |
| ---------- | --------------------------- | --------------------------- |
| Output     | Categorical                 | Continuous                  |
| Example    | Spam detection              | Sales forecast              |
| Models     | Decision Trees, Naive Bayes | Regression, Neural Networks |
| Evaluation | Accuracy, F1                | RMSE, MAE                   |
Q37. Eager vs Lazy classification
| Type           | Eager             | Lazy            |
| -------------- | ----------------- | --------------- |
| Examples       | Decision Tree     | kNN             |
| Model Building | Before prediction | At prediction   |
| Speed          | Fast prediction   | Slow prediction |
| Memory         | Low               | High            |
Q38. Algorithm for kNN
1. Store all training data
2. For new point, compute distance to all training points
3. Select k closest
4. Return majority class
Distance: Euclidean, Manhattan, etc.
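A minimal sketch of this algorithm using Euclidean distance (the training points are made up):

```python
from collections import Counter
from math import dist   # Euclidean distance, Python 3.8+

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label); returns majority label of k nearest."""
    neighbours = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    labels = [label for _, label in neighbours]
    return Counter(labels).most_common(1)[0][0]

train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.8), "B"),
]
print(knn_predict(train, (1.1, 1.0), k=3))   # 'A'
```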
🔶 GROUP 8: Clustering Reference (For Comparison)
Q39. Clustering algorithms (k-means, k-medoids)
| Feature     | k-Means         | k-Medoids                |
| ----------- | --------------- | ------------------------ |
| Type        | Centroid        | Representative object    |
| Sensitivity | High (outliers) | Robust                   |
| Complexity  | Low             | Higher                   |
| Use Case    | Large datasets  | Noisy/sensitive datasets |
CLARA and CLARANS improve scalability for large datasets.
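A quick scikit-learn sketch of k-means on synthetic data; k-medoids (PAM) is not part of core scikit-learn (it is available in add-on packages such as scikit-learn-extra), so only k-means is run here:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)     # centroids are means, so outliers can drag them
print(km.labels_[:5], km.labels_[-5:])
```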
🔶 GROUP 9: Classification as Supervised Learning
Q40. Classification is supervised learning. Justify and explain
techniques.
Justification:
In supervised learning, the model is trained on labeled data. Classification fits as it
predicts discrete class labels using past data.
Techniques:
Decision Trees: interpretable and fast
Naive Bayes: simple and effective
kNN: intuitive, lazy learner
SVM: works well in high dimensions
Neural Networks: powerful for complex data