
DMKD_MODULE_PART-B

Created @June 11, 2025 4:34 PM

MODULE 4 - PART B

🔶 GROUP 1: Frequent Itemsets, Association Rules, and Core Concepts

Q1. Define the terms frequent item sets, closed item sets, and
association rules.
Frequent Itemsets:
An itemset is frequent if it occurs in a transactional database with a frequency no
less than a user-specified minimum support threshold. These patterns are the
foundation for mining meaningful associations in market basket analysis.
Closed Itemsets:
A frequent itemset is said to be closed if none of its immediate supersets has the
same support count as the itemset. Closed itemsets eliminate redundancy in
frequent itemset mining by retaining only the most representative patterns.
Association Rules:

Association rules are implications in the form A ⇒ B, where A and B are itemsets,
and the rule suggests that if A occurs, B is likely to occur as well. The strength of
a rule is measured using support and confidence. Support shows how often the
rule appears in the database, while confidence measures how often items in B
appear in transactions that contain A.

Q2. Which algorithm is influential for mining frequent itemsets for Boolean association rules? Explain with an example.

The Apriori algorithm is a fundamental algorithm for mining frequent itemsets for
Boolean association rules. It uses a level-wise, breadth-first search strategy
where itemsets of length k are used to generate itemsets of length k+1. It employs

the anti-monotone property: if an itemset is not frequent, then none of its
supersets can be frequent.
Example:
Let the database contain 5 transactions:

T1: A, B, C

T2: A, C

T3: B, C

T4: A, B

T5: A, B, C

With minimum support = 3 (i.e., 60%):

Frequent 1-itemsets: A(4), B(4), C(4)

Generate 2-itemsets: AB(3), AC(3), BC(3)

All three are frequent.

Generate 3-itemset: ABC(2), which is infrequent and thus discarded.

The Apriori algorithm effectively eliminates unnecessary candidates using pruning.

Q10. Explain the Apriori algorithm with example.


Apriori Algorithm Steps:

1. Scan database to find frequent 1-itemsets.

2. Use frequent (k-1)-itemsets to generate candidate k-itemsets.

3. Prune candidate itemsets if any subset is not frequent.

4. Count support for each candidate and retain those meeting minimum support.

5. Repeat until no new itemsets are found.

Example:

Dataset:

T1: Milk, Bread, Butter

T2: Milk, Bread

T3: Bread, Butter

T4: Milk, Butter

Min Support: 2

1. L1 = {Milk(3), Bread(3), Butter(3)}

2. Generate L2 = {Milk, Bread}, {Milk, Butter}, {Bread, Butter}

3. L3 = {Milk, Bread, Butter} appears in 1 transaction → discarded

The Apriori algorithm continues this process to discover all frequent itemsets.
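Below is a minimal Python sketch of this level-wise loop, run on the same four-transaction dataset; the function name `apriori` and the set-based representation are illustrative rather than a specific library's API.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining (minimal sketch)."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}        # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count support of each candidate with one pass over the data
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: build (k+1)-candidates from frequent k-itemsets,
        # then prune any candidate that has an infrequent k-subset
        keys = list(level)
        candidates = set()
        for a, b in combinations(keys, 2):
            union = a | b
            if len(union) == k + 1 and all(frozenset(s) in level for s in combinations(union, k)):
                candidates.add(union)
        current = candidates
        k += 1
    return frequent

db = [{"Milk", "Bread", "Butter"}, {"Milk", "Bread"}, {"Bread", "Butter"}, {"Milk", "Butter"}]
print(apriori(db, min_support=2))
# e.g. frozenset({'Milk', 'Bread'}) -> 2; {Milk, Bread, Butter} (support 1) is never kept
```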

Q3. Describe the different techniques to improve the efficiency of Apriori.
1. Hash-based Itemset Counting:

Uses hash structures to count itemsets efficiently by reducing the number of candidate sets through hashing.

2. Transaction Reduction:

After identifying infrequent itemsets, transactions that do not contain any frequent itemsets can be skipped in subsequent iterations.

3. Partitioning:
The database is divided into partitions and frequent itemsets are found locally
in each partition. Only these are considered for global frequent itemset
checking.

4. Dynamic Itemset Counting:


New candidate itemsets can be added dynamically as the database is
scanned.

5. Sampling:

A random subset of the database is used to find frequent itemsets. These are
then verified against the entire database.
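A small sketch of the sampling technique (item 5): a random subset is mined with a lowered threshold and the resulting candidates are verified against the full database in one scan. The brute-force helper, sampling fraction, and thresholds are illustrative, and in practice a border check guards against patterns the sample misses.

```python
import random
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force miner for illustration only (fine for tiny data)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    return {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sum(1 for t in transactions if set(c) <= t) >= min_support}

def sampling_apriori(db, min_support, sample_frac=0.5, seed=1):
    # Mine a random sample with a slightly lowered threshold to reduce missed patterns
    sample = random.Random(seed).sample(db, max(1, int(len(db) * sample_frac)))
    lowered = max(1, int(min_support * sample_frac) - 1)
    candidates = frequent_itemsets(sample, lowered)
    # Verify the candidates against the entire database with one full scan
    full = [frozenset(t) for t in db]
    return {c for c in candidates if sum(1 for t in full if c <= t) >= min_support}

db = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}, {"A"}, {"A", "B"}]
print(sampling_apriori(db, min_support=3))
# Sampling may still miss rare patterns; a second pass over border itemsets fixes this.
```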

Q17. Describe mining closed frequent itemsets.

Mining closed frequent itemsets is an approach to reduce redundancy in frequent
pattern mining. A closed itemset is one where none of its supersets has the same
support.
Benefits:

It reduces the number of patterns.

Closed itemsets can generate all frequent itemsets.

They provide a compact yet complete representation of frequent patterns.

Algorithms Used:

CHARM: Exploits a vertical format and itemset closure properties.

CLOSET: Uses a depth-first approach on an FP-tree structure.

🔶 GROUP 2: FP-Growth and Vertical Mining


Q4. Discuss the FP-growth algorithm. Explain with an example.
FP-Growth (Frequent Pattern Growth) is an efficient alternative to Apriori, avoiding
candidate generation.
Steps:

1. Build FP-Tree:

Scan DB for frequent 1-items.

Sort items by frequency and insert into a compact prefix tree.

2. Mine Tree Recursively:

Start from lowest frequency item, construct conditional pattern base.

Build conditional FP-tree.

Recursively mine frequent patterns.

Example:

Transactions:

T1: a, b

T2: b, c, d

T3: a, c, d, e

T4: a, d, e

T5: a, b, c

Build FP-tree with sorted order and mine frequent patterns from bottom-up using
conditional FP-trees.
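A minimal sketch of step 1 (FP-tree construction) for the five transactions above: items are counted, infrequent ones dropped, each transaction is re-sorted by descending frequency, and shared prefixes are merged into (item, count) nodes. The recursive conditional-tree mining of step 2 is omitted for brevity, and the class and function names are illustrative.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        # parent links would be used later to read off conditional pattern bases
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fp_tree(transactions, min_support):
    # Pass 1: frequency of each item; drop infrequent items
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root = FPNode(None)
    # Pass 2: insert each transaction with its items sorted by descending frequency
    for t in transactions:
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, freq

def show(node, depth=0):
    if node.item is not None:
        print("  " * depth + f"{node.item}:{node.count}")
        depth += 1
    for child in node.children.values():
        show(child, depth)

db = [["a", "b"], ["b", "c", "d"], ["a", "c", "d", "e"], ["a", "d", "e"], ["a", "b", "c"]]
root, freq = build_fp_tree(db, min_support=2)
show(root)   # shared prefixes appear once, with counts
```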

Q5. Explain how to mine frequent item sets using vertical data
format.
In the vertical format, each item is represented with a list of transaction IDs (TID-
list) instead of individual transactions.
Steps:

1. Create TID-lists for all items.

2. To compute support of a candidate itemset, intersect TID-lists.

3. Retain those with TID-list count ≥ minimum support.

Example:

A: {1,2,3}, B: {2,3,4}

A ∩ B = {2,3} → support = 2

This reduces I/O operations and accelerates computation by minimizing database scans.
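A short sketch of the TID-list idea, assuming a third illustrative item C; the support of a candidate pair is simply the size of the intersection of its TID-lists.

```python
# Minimal sketch of vertical (TID-list) mining: support comes from set intersection.
from itertools import combinations

tid_lists = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {1, 3, 4}}   # item -> transaction IDs
min_support = 2

# 2-itemset supports via TID-list intersection, with no re-scan of the transactions
for x, y in combinations(tid_lists, 2):
    common = tid_lists[x] & tid_lists[y]
    if len(common) >= min_support:
        print({x, y}, "support =", len(common), "TIDs =", common)
```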

🔶 GROUP 3: Multilevel, Multidimensional, and Constraint-Based Mining

Q6. Discuss mining multilevel association rules from transaction databases.
Multilevel association rules involve discovering patterns at multiple abstraction
levels using concept hierarchies (e.g., “milk” under “dairy”).

Steps:

1. Encode items with hierarchy levels (e.g., 1** for dairy, 11* for milk).

2. Apply mining algorithm at higher level.

3. Gradually drill down to lower levels.

Approaches:

Use reduced support at deeper levels.

Prune irrelevant items using hierarchy pruning techniques.

Example:

Level 1: Dairy ⇒ Snacks

Level 2: Milk ⇒ Chips
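A simplified sketch (pairs only) of the top-down idea under an assumed toy hierarchy: transactions are first generalized to level-1 categories and mined with the full threshold, then the drill-down to level 2 uses a reduced threshold and keeps only itemsets whose generalization was frequent. All items, thresholds, and the hierarchy dictionary are illustrative.

```python
from collections import Counter

# Illustrative concept hierarchy: leaf item -> level-1 category
hierarchy = {"milk": "dairy", "cheese": "dairy", "chips": "snacks", "cookies": "snacks"}

db = [{"milk", "chips"}, {"cheese", "cookies"}, {"milk", "cookies"}, {"milk", "chips"}]
level1_support, level2_support = 3, 2     # reduced support at the deeper level

# Level 1: generalize each transaction and count category pairs
generalized = [{hierarchy[i] for i in t} for t in db]
pair_counts = Counter(frozenset(t) for t in generalized if len(t) == 2)
frequent_level1 = {p for p, c in pair_counts.items() if c >= level1_support}
print("Level 1:", frequent_level1)        # e.g. {dairy, snacks}

# Level 2: drill down only within frequent level-1 combinations
leaf_counts = Counter(frozenset(t) for t in db if len(t) == 2)
frequent_level2 = {
    p for p, c in leaf_counts.items()
    if c >= level2_support and frozenset(hierarchy[i] for i in p) in frequent_level1
}
print("Level 2:", frequent_level2)        # e.g. {milk, chips}
```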

Q7. Explain how to mine multidimensional association rules from relational databases.
Multidimensional association rules consider data from different attributes or
dimensions.

Types:

Intra-dimensional: Within one dimension (e.g., age=20 ⇒ buys=laptop).

Inter-dimensional: Across dimensions (e.g., age=20, income=high ⇒ buys=TV).

Steps:

1. Transform relational data into transactions.

2. Use Apriori/FP-Growth on transformed data.

3. Generate rules across multiple dimensions.
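A minimal sketch of step 1, flattening relational tuples into attribute=value items so a standard frequent itemset miner can be applied; the column names and rows are made up for illustration.

```python
# Hypothetical relational rows (attribute -> value)
rows = [
    {"age": "20-29", "income": "high", "buys": "TV"},
    {"age": "20-29", "income": "high", "buys": "TV"},
    {"age": "30-39", "income": "low",  "buys": "laptop"},
]

# Step 1: encode each tuple as a transaction of attribute=value items
transactions = [{f"{attr}={val}" for attr, val in row.items()} for row in rows]
print(transactions[0])
# {'age=20-29', 'income=high', 'buys=TV'}  -> ready for Apriori / FP-Growth.
# A frequent itemset such as {age=20-29, income=high, buys=TV} then yields the
# inter-dimensional rule: age=20-29, income=high => buys=TV
```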

Q13. Describe multidimensional association rules using static discretization.
Static discretization involves dividing continuous attributes (e.g., age, salary) into
fixed intervals.
Steps:

1. Apply binning techniques (equal-width or equal-frequency).

2. Use discretized values in mining.

3. Generate multidimensional rules.

Benefits:

Simplifies rule mining.

Converts numeric data into symbolic form for compatibility with association algorithms.
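A brief sketch of static discretization with pandas, showing both equal-width (`pd.cut`) and equal-frequency (`pd.qcut`) binning; the attribute values and bin labels are illustrative.

```python
import pandas as pd

# Illustrative numeric attribute
df = pd.DataFrame({"age": [22, 25, 31, 38, 45, 52, 60]})

# Equal-width static discretization into 3 fixed intervals
df["age_bin"] = pd.cut(df["age"], bins=3, labels=["young", "middle", "senior"])

# Equal-frequency (quantile) alternative
df["age_qbin"] = pd.qcut(df["age"], q=3, labels=["low", "mid", "high"])

print(df)
# The symbolic bins (e.g. age=young) can now be used as items in rule mining.
```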

Q14. Explain additional rule constraints to guide mining.


Types of Constraints:

Syntax Constraints: Rule structure (e.g., body must include 'milk').

Aggregate Constraints: Sum, avg of attributes (e.g., total price > 100).

Anti-monotone Constraints: If itemset A violates the constraint, all supersets of A are discarded.

Monotone Constraints: If A satisfies, all supersets are valid.

Purpose:

To reduce search space.

To focus on relevant and actionable patterns.

Q9. Discuss constraint-based association mining.


Constraint-based mining integrates user-defined constraints into the mining
process to generate more relevant patterns.
Examples of Constraints:

Include specific items

Limit rule length

Minimum profit margin

Integration Techniques:

Constraint pushing

Post-processing

Advantage: Reduces mining time and improves interpretability.
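A small sketch of constraint pushing with an anti-monotone aggregate constraint (total price ≤ budget): candidates that violate it are never generated, shrinking the search space. The price table, budget, and function names are illustrative.

```python
from itertools import combinations

prices = {"milk": 2, "bread": 1, "butter": 3, "caviar": 50}   # illustrative prices
budget = 10                                # sum(price) <= budget is anti-monotone

def satisfies(itemset):
    return sum(prices[i] for i in itemset) <= budget

def constrained_candidates(frequent_k, k):
    """Join step with the constraint pushed in: violating supersets are never counted."""
    out = set()
    for a, b in combinations(frequent_k, 2):
        union = a | b
        if len(union) == k + 1 and satisfies(union):
            out.add(union)
    return out

level1 = {frozenset([i]) for i in prices if satisfies(frozenset([i]))}
print(constrained_candidates(level1, 1))
# {caviar} already violates the budget, so no superset containing it is ever generated.
```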


🔶 GROUP 4: Correlation, Interestingness, and Evaluation Measures

Q8. Describe different correlation measures in association analysis.
Correlation measures help determine whether an association rule is statistically
significant or just due to chance.

1. Lift:
Measures how much more often A and B occur together than if they were
independent.
$$\text{Lift}(A \Rightarrow B) = \frac{P(A \cap B)}{P(A) \cdot P(B)}$$

If lift > 1, A and B are positively correlated.

2. Chi-Square Test:

Tests whether two itemsets are statistically independent. A higher χ² value indicates stronger dependency.

3. Conviction:
Measures the degree of implication.

$$\text{Conviction}(A \Rightarrow B) = \frac{P(A) \cdot P(\neg B)}{P(A \cap \neg B)}$$

4. Leverage and Cosine:

Leverage indicates the difference between observed and expected
support.

Cosine measures vector-based similarity between A and B.
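A short sketch that computes lift, leverage, conviction, and cosine directly from P(A), P(B), and P(A ∩ B); the example probabilities are illustrative.

```python
import math

def correlation_measures(p_a, p_b, p_ab):
    """Lift, leverage, conviction and cosine from P(A), P(B), P(A∩B) -- minimal sketch."""
    confidence = p_ab / p_a
    lift = p_ab / (p_a * p_b)
    leverage = p_ab - p_a * p_b
    conviction = (p_a * (1 - p_b)) / (p_a - p_ab) if p_a > p_ab else math.inf
    cosine = p_ab / math.sqrt(p_a * p_b)
    return {"confidence": confidence, "lift": lift,
            "leverage": leverage, "conviction": conviction, "cosine": cosine}

# Illustrative values: P(milk)=0.875, P(bread)=0.9, P(milk and bread)=0.7
print(correlation_measures(0.875, 0.9, 0.7))
# lift ≈ 0.89 (< 1) and leverage < 0 -> milk and bread are negatively correlated
```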

Q15. How can we tell which strong association rules are really
interesting? Explain with an example.
Not all strong rules (high support/confidence) are useful. Interestingness
measures such as Lift, Conviction, and Leverage are applied.
Example:

Rule: {milk} ⇒ {bread}, support = 70%, confidence = 80%

If bread occurs in 90% of transactions, confidence is not informative.

Use Lift:

Lift = confidence / P(bread) = 0.80 / 0.90 ≈ 0.89 < 1 → milk and bread are negatively correlated, so the rule is misleading despite its high support and confidence.

Only when lift > 1 is the rule genuinely interesting.

Conclusion: A rule is interesting if it is unexpected and actionable, with a meaningful correlation.

Q16. Describe correlation analysis using Chi-square.


Chi-square helps evaluate the independence of two itemsets.
Steps:

1. Construct a 2×2 contingency table with observed frequencies:

A and B

A and not B

Not A and B

Not A and not B

2. Compute expected frequencies.

3. Use:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

4. Compare with the critical χ² value.

A high value suggests strong correlation, validating the rule's significance.
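A brief sketch of the test on an assumed 2×2 table of observed counts, using SciPy's `chi2_contingency` to obtain the χ² statistic, p-value, and expected frequencies.

```python
from scipy.stats import chi2_contingency

# Illustrative 2x2 contingency table of observed counts:
#              bread   not bread
# milk       [[ 400,      50 ],
# not milk    [ 350,     200 ]]
observed = [[400, 50],
            [350, 200]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
print("expected counts:", expected.round(1))
# A chi2 value above the critical value (p below 0.05) rejects independence,
# i.e. the occurrence of milk and bread is correlated, not coincidental.
```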

Q18. Show that items in a strong association rule may be negatively correlated.
Example:

Rule: {milk} ⇒ {bread}, confidence = 80%, support = 60%

But:

P(bread) = 90%

Lift = 0.80 / 0.90 ≈ 0.89 < 1 → implies negative correlation

This shows that although the rule has high support and confidence, the actual co-
occurrence is less than expected under independence.

Q19. Discuss methods to reduce the number of rules generated in association rule mining.
1. Use of Minimum Thresholds:

Increase support/confidence thresholds.

2. Redundancy Removal:

Discard rules that are subsumed by a more general rule with equal or higher confidence.

3. Constraint-based Mining:

Add user constraints to focus results.

4. Use Closed or Maximal Itemsets:

Avoid generating all subsets.

5. Template-Based Pruning:

Predefine rule structures of interest.

Q20. Name pruning strategies in mining closed frequent itemsets.

1. Support Pruning:

Eliminate itemsets below min support.

2. Closure Pruning:

Discard itemsets whose closures already exist.

3. Subsumption Pruning:

Avoid itemsets that are subsets of already known frequent itemsets with
same support.

4. Lattice Traversal Pruning:

Skip redundant branches in search space.

🔶 GROUP 5: Classification Basics and Decision Trees


Q21. Explain classification and prediction with an example.
Classification: Predicts categorical class labels.

Prediction: Estimates continuous values.

Example:

Classification: Determine whether an email is "spam" or "not spam"

Prediction: Predict housing price based on features

Both are supervised learning tasks trained using labeled datasets.

Q22. Explain basic decision tree induction algorithm.


Steps:

1. Choose the best attribute using a selection measure (e.g., Information Gain).

2. Split the dataset based on this attribute.

3. Repeat recursively for each subset.

4. Stop when node is pure or cannot be split further.

Output: A tree where each internal node represents an attribute test, each branch represents an outcome of the test, and each leaf represents a class label.

Q23. Measures associated with attribute selection.


1. Information Gain: Measures entropy reduction. Higher IG = better attribute.

2. Gain Ratio: Adjusts IG by the intrinsic value of the split.

3. Gini Index: Measures impurity. Used in CART.

4. Chi-square: Tests statistical independence.

These measures guide how the decision tree is constructed.
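A minimal sketch of information gain and the Gini index computed from class-label lists for one candidate split; the labels and the split itself are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gini(labels):
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, subsets):
    """IG = entropy(parent) - weighted entropy of the child subsets."""
    total = len(parent_labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

# Illustrative split of 10 'yes'/'no' labels on a two-valued attribute
parent = ["yes"] * 6 + ["no"] * 4
left, right = ["yes"] * 5 + ["no"] * 1, ["yes"] * 1 + ["no"] * 3

print("Information gain:", round(information_gain(parent, [left, right]), 3))
print("Gini(parent):", round(gini(parent), 3))
print("Weighted Gini after split:",
      round(sum(len(s) / len(parent) * gini(s) for s in [left, right]), 3))
```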

Q24. How does tree pruning work? Enhancements to decision tree induction?
Pruning: Reduces overfitting by removing branches with little predictive power.
Types:

Pre-pruning: Stops early.

Post-pruning: Grows full tree and prunes later.

Enhancements:

Ensemble methods (bagging, boosting)

Random Forests

Using surrogate splits for missing values

Q25. Explain scalability of decision tree induction.


Decision trees scale well due to:

Linear time complexity

Can handle large datasets via data partitioning

Support for incremental and parallel construction

Algorithms like SLIQ and SPRINT are designed for scalability

Q35. Tree pruning in decision tree induction
Why prune?

To improve generalization

Reduce model complexity

Criteria:

Error rate minimization

Cross-validation performance

Approaches:

Reduced Error Pruning

Cost-complexity Pruning

Q36. Tree to rules vs. prune then rule conversion


Approach A (Tree → Rules → Prune):

Converts tree into if-then rules

Prunes rules individually

Approach B (Prune Tree → Rules):

Prune first, then convert

Comparison:

A gives more precise, tailored rules

B is simpler, but less flexible

🔶 GROUP 6: Bayesian and Lazy Classifiers


Q26. Working of simple Bayesian classifier.
Based on Bayes Theorem:
$$P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)}$$

Naive Bayes Assumption: Attributes are conditionally independent given the class.

Steps:

Calculate prior P(C)

Compute P(X|C) for each attribute

Predict class with highest posterior

Advantages: Fast, handles missing data
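A compact sketch of these steps for categorical attributes, with simple add-one smoothing so unseen attribute values do not zero out the product; the toy dataset and function names are illustrative.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(C) and P(attribute=value | C) from categorical data."""
    priors = Counter(labels)
    cond = defaultdict(Counter)                 # (class, attribute) -> Counter of values
    for row, c in zip(rows, labels):
        for attr, val in row.items():
            cond[(c, attr)][val] += 1
    return priors, cond, len(labels)

def predict(row, priors, cond, n):
    scores = {}
    for c, prior_count in priors.items():
        score = prior_count / n                 # P(C)
        for attr, val in row.items():           # product of smoothed P(x_i | C)
            counts = cond[(c, attr)]
            score *= (counts[val] + 1) / (sum(counts.values()) + len(counts) + 1)
        scores[c] = score                       # proportional to the posterior; P(X) cancels
    return max(scores, key=scores.get)

rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain",  "windy": "yes"}, {"outlook": "rain",  "windy": "no"}]
labels = ["play", "no-play", "no-play", "play"]
priors, cond, n = train_naive_bayes(rows, labels)
print(predict({"outlook": "sunny", "windy": "no"}, priors, cond, n))   # -> 'play'
```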

Q27. Bayesian Belief Networks


Graph-based model

Nodes: variables; Edges: conditional dependencies

Uses:

Modeling uncertain knowledge

Inference with incomplete data

Learning:

Structure: Expert input or algorithms

Parameters: EM algorithm or Maximum Likelihood

Q34. Training of Bayesian belief networks


Structure Learning:

Constraint-based (independence tests)

Score-based (maximize scoring function)

Parameter Learning:

Estimate conditional probabilities

Use MLE or Bayesian Estimation

Tools: WEKA, BNlearn

🔶 GROUP 7: kNN, Accuracy, Eager vs Lazy, and Comparison

Q28. k-Nearest Neighbor (kNN) classifier and case-based
reasoning
kNN:

Lazy learning

Predicts class based on majority of k closest data points

Case-Based Reasoning:

Solves new problems using past cases

Steps: Retrieve → Reuse → Revise → Retain

Similarity Measure: Euclidean Distance, Cosine similarity

Q29. Classifier accuracy and how to measure it


Measures:

Accuracy = (TP + TN) / Total

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-score = harmonic mean of precision & recall

Confusion Matrix: tabulates actual vs predicted
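A small sketch computing these measures from the four confusion-matrix counts; the counts themselves are illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard measures from confusion-matrix counts -- minimal sketch."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative confusion matrix: 50 TP, 35 TN, 5 FP, 10 FN
acc, prec, rec, f1 = classification_metrics(tp=50, tn=35, fp=5, fn=10)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
# accuracy=0.85 precision=0.91 recall=0.83 f1=0.87
```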

Q30. Applying ideas to association rule mining


Concepts like:

Rule pruning

Use of interestingness measures

Constraints

Ensemble approaches

These can enhance association rule mining by improving performance and relevance.

Q31. Major issues in classification and prediction

1. Missing values

2. Noisy data

3. Overfitting

4. Imbalanced data

5. Model interpretability

6. Scalability

Solutions: Data cleaning, cross-validation, SMOTE, ensemble models

Q32. Compare classification and prediction methods


Aspect     | Classification              | Prediction
Output     | Categorical                 | Continuous
Example    | Spam detection              | Sales forecast
Models     | Decision Trees, Naive Bayes | Regression, Neural Networks
Evaluation | Accuracy, F1                | RMSE, MAE

Q37. Eager vs Lazy classification


Type           | Eager             | Lazy
Examples       | Decision Tree     | kNN
Model Building | Before prediction | At prediction
Speed          | Fast prediction   | Slow prediction
Memory         | Low               | High

Q38. Algorithm for kNN


1. Store all training data

2. For new point, compute distance to all training points

3. Select k closest

4. Return majority class

Distance: Euclidean, Manhattan, etc.
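A minimal sketch of the algorithm above using Euclidean distance (`math.dist`) and a majority vote; the training points are illustrative.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """k-nearest-neighbour majority vote using Euclidean distance -- minimal sketch."""
    # train: list of (feature_vector, class_label) pairs
    distances = sorted((math.dist(x, query), label) for x, label in train)
    top_k = [label for _, label in distances[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Illustrative 2-D training points
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"),
         ((4.5, 4.0), "B"), ((0.9, 1.3), "A")]
print(knn_predict(train, query=(1.1, 1.0), k=3))   # -> 'A'
```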

🔶 GROUP 8: Clustering Reference (For Comparison)


Q39. Clustering algorithms (k-means, k-medoids)
Feature     | k-Means         | k-Medoids
Type        | Centroid        | Representative object
Sensitivity | High (outliers) | Robust
Complexity  | Low             | Higher
Use Case    | Large datasets  | Noisy/sensitive datasets

CLARA and CLARANS improve scalability for large datasets.
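For comparison, a bare-bones Lloyd's-iteration k-means in NumPy (k-medoids would instead restrict cluster centres to actual data points); the data, k, and iteration count are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's iteration: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the closest centroid (squared Euclidean distance)
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Recompute each centroid as the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
centroids, labels = kmeans(X, k=2)
print(centroids)   # roughly one centre near (1, 1) and one near (5, 5)
print(labels)
```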

🔶 GROUP 9: Classification as Supervised Learning


Q40. Classification is supervised learning. Justify and explain
techniques.
Justification:
In supervised learning, the model is trained on labeled data. Classification fits as it
predicts discrete class labels using past data.
Techniques:

Decision Trees: interpretable and fast

Naive Bayes: simple and effective

kNN: intuitive, lazy learner

SVM: works well in high dimensions

Neural Networks: powerful for complex data
