1. Decision Tree Learning
1.1 Decision Tree Representation
• Decision Trees: A decision tree is a flowchart-like structure used for decision making.
Each internal node represents a "test" or "decision" based on an attribute, each branch
represents the outcome of the test, and each leaf node represents a class label (in
classification) or a value (in regression).
• Structure:
o Root Node: The starting point of the tree, where the first decision is made.
o Internal Nodes: Nodes representing decisions or tests.
o Edges: Branches linking nodes, each representing the outcome of a decision.
o Leaf Nodes: Final nodes representing the predicted output (class label or
continuous value).
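To make this structure concrete, here is a minimal Python sketch of a decision tree as linked nodes; the names (Node, predict) and the tiny hand-built tree are illustrative only, not taken from any particular library.

```python
# Minimal sketch of a decision tree as nested nodes (illustrative only).
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # index of the attribute tested at this internal node
        self.threshold = threshold  # split point for the test "x[feature] <= threshold"
        self.left = left            # subtree for the "yes" branch
        self.right = right          # subtree for the "no" branch
        self.value = value          # class label (or number) stored at a leaf

    def is_leaf(self):
        return self.value is not None

def predict(node, x):
    """Follow the tests from the root down to a leaf and return its value."""
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

# Example: a tiny hand-built tree "if x[0] <= 2.5 predict 'A', else 'B'".
root = Node(feature=0, threshold=2.5,
            left=Node(value="A"), right=Node(value="B"))
print(predict(root, [1.7]))  # -> 'A'
```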
1.2 Appropriate Problems for Decision Tree Learning
Decision trees can be used for:
• Classification: Predicting a categorical class label.
• Regression: Predicting a continuous output value.
• Decision trees are best suited to problems where:
o The data is structured (tabular format).
o The features are easily interpretable.
o The relationships between the features and the target are non-linear.
1.3 Univariate Trees (Classification and Regression)
• Univariate Trees: Each decision node in the tree checks a single attribute (or feature)
at a time.
o Classification: A classification tree predicts a discrete class label; the single feature tested at each node may be discrete or continuous.
o Regression: A regression decision tree predicts a continuous output, with the
leaves containing the predicted value.
1.4 Multivariate Trees
• Multivariate Trees: A decision node can test multiple features simultaneously (for
example, using linear combinations of features).
o Advantage: Can produce smaller trees than univariate trees because each node makes a more
expressive (e.g., oblique) decision.
o Disadvantage: More computationally expensive and harder to interpret.
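As a quick illustration of the difference, a univariate node tests one feature against a threshold, while a multivariate node can test a linear combination of features; the weights and thresholds below are made-up values.

```python
# Univariate test: one feature against a threshold (axis-parallel split).
def univariate_test(x):
    return x[0] <= 2.5

# Multivariate test: a linear combination of features (oblique split).
# The weights 0.4, 0.6 and the threshold 3.0 are illustrative only.
def multivariate_test(x):
    return 0.4 * x[0] + 0.6 * x[1] <= 3.0

x = [2.0, 4.0]
print(univariate_test(x), multivariate_test(x))  # -> True False
```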
1.5 Basic Decision Tree Learning Algorithms
• ID3 (Iterative Dichotomiser 3):
o ID3 uses information gain to select the feature that best splits the data at each
node.
o Information Gain is computed using entropy (a measure of uncertainty); a sketch computing both appears after this list.
• CART (Classification and Regression Trees):
o CART constructs binary trees using Gini Impurity (for classification) or Mean
Squared Error (for regression) to determine the best split.
• C4.5:
o C4.5 extends ID3 by using gain ratio instead of information gain, which reduces the bias
towards features with many distinct values.
• CHAID (Chi-squared Automatic Interaction Detection):
o Uses Chi-square statistics to decide the best split at each node.
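The sketch below shows, for a single categorical feature and a toy dataset (both made up for illustration), how the split criteria named above can be computed: entropy and information gain (ID3) and Gini impurity (CART).

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_c p_c * log2(p_c), the ID3 measure of uncertainty."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity = 1 - sum_c p_c^2, the CART classification criterion."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(labels)
    children = {}
    for value, label in zip(feature_values, labels):
        children.setdefault(value, []).append(label)
    weighted = sum(len(child) / n * entropy(child) for child in children.values())
    return entropy(labels) - weighted

# Toy example: does "Outlook" help predict "Play"?
play    = ["no", "no", "yes", "yes", "yes", "no"]
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "rain"]
print(round(entropy(play), 3), round(gini(play), 3))     # 1.0, 0.5
print(round(information_gain(play, outlook), 3))         # about 0.541
```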
1.6 Hypothesis Space Search in Decision Tree Learning
• Hypothesis Space: The set of all possible trees that can be formed using the features.
• Search Process: The algorithm searches through this hypothesis space, trying to find
the tree that best classifies the data.
• Greedy Approach: Decision tree algorithms often use greedy techniques (e.g.,
choosing the best feature at each step) to explore this space.
• Problem: The greedy approach might lead to suboptimal trees because it does not
consider future splits.
1.7 Inductive Bias in Decision Tree Learning
• Inductive Bias: The assumptions made by the algorithm to generalize from the training
data to unseen data.
• Inductive Bias in Decision Trees:
o Preference for smaller (shorter) trees.
o Preference for trees that place highly informative features near the root.
o The assumption that the classes are separable by axis-parallel splits (in univariate trees).
1.8 Issues in Decision Tree Learning
• Overfitting: Trees that are too complex may fit noise in the data.
o Solution: Pruning, cross-validation, or using simpler models (see the sketch after this list).
• Handling Missing Data: Decision trees might struggle when some values are missing.
This can be handled by imputation or surrogate splits.
• Bias and Variance: Deep, unpruned trees tend to have low bias but high variance (overfitting),
while heavily pruned or shallow trees have low variance but high bias (underfitting).
• Computational Complexity: The number of nodes can grow exponentially with tree depth, so
building and storing very deep trees becomes expensive, especially on large datasets.
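Assuming scikit-learn is available, the sketch below illustrates two common ways of controlling overfitting mentioned above: limiting tree depth and cost-complexity pruning. The dataset choice and the parameter values (max_depth=3, ccp_alpha=0.01) are arbitrary.

```python
# Sketch: controlling overfitting in a decision tree (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize the training set (high variance).
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Limiting depth and applying cost-complexity pruning trade a little
# training accuracy for better generalization (values are illustrative).
pruned = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01,
                                random_state=0).fit(X_train, y_train)

print("full  :", full.score(X_train, y_train), full.score(X_test, y_test))
print("pruned:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))
```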
2. Bayesian Learning
2.1 Bayes' Theorem and Concept Learning
• Bayes’ Theorem:
P(H|D) = \frac{P(D|H) \cdot P(H)}{P(D)}
Where:
o P(H|D) is the posterior probability of hypothesis H given the data D.
o P(D|H) is the likelihood of the data D given hypothesis H.
o P(H) is the prior probability of hypothesis H.
o P(D) is the probability of the data (the evidence).
• Concept Learning: Concept learning involves learning a concept (e.g., classification
rule) based on examples. Bayesian learning can be applied to derive the probability of
a hypothesis based on observed data.
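A small worked example of the formula, with made-up numbers: suppose 20% of emails are spam, the word 'free' appears in 60% of spam and 5% of non-spam emails, and we want the posterior probability of spam given that 'free' appears.

```python
# Worked Bayes' theorem example with made-up numbers.
p_spam = 0.20                 # prior P(H): probability an email is spam
p_free_given_spam = 0.60      # likelihood P(D|H)
p_free_given_ham = 0.05       # P(D | not H)

# Evidence P(D) via the law of total probability.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Posterior P(H|D) = P(D|H) * P(H) / P(D).
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # 0.12 / 0.16 = 0.75
```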
2.2 Bayes Optimal Classifier
• The Bayes Optimal Classifier is the classifier that minimizes the expected
classification error.
• It is defined as: C^*(x) = \arg\max_c P(c|x), where c is a class label and x is the input feature vector.
• This optimal classifier uses the true posterior probabilities but is impractical for real-
world problems because it requires knowledge of the entire probability distribution.
2.3 Gibbs Sampling Algorithm
• Gibbs Sampling is a Markov Chain Monte Carlo (MCMC) method used to
approximate the distribution of parameters in Bayesian networks when exact sampling
is difficult.
• It iteratively samples from the conditional distribution of each variable given the others.
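As a minimal illustration of the idea (not a general Bayesian-network implementation), the sketch below Gibbs-samples a bivariate standard normal with correlation rho by repeatedly drawing each coordinate from its conditional distribution given the other; the value of rho and the number of iterations are arbitrary choices.

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_samples=5000, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.
    Each step draws one variable from its conditional given the other:
    x | y ~ N(rho * y, 1 - rho**2), and symmetrically for y | x."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    for i in range(n_samples):
        x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
        y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
        samples[i] = (x, y)
    return samples

s = gibbs_bivariate_normal()
print(np.corrcoef(s[1000:].T)[0, 1])  # close to 0.8 after discarding burn-in
```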
2.4 Naive Bayes Classifier
• Naive Bayes is a classification algorithm based on applying Bayes' theorem with the
naive assumption that all features are conditionally independent given the class.
• The classifier calculates:
P(c|x) = \frac{P(c) \cdot \prod_{i=1}^{n} P(x_i|c)}{P(x)}
Where:
o c is the class label.
o x_i is the i-th feature.
o The conditional independence assumption makes this factored form computationally efficient.
• Assumptions:
o All features are conditionally independent given the class.
o Features are discrete (though the model can be adapted for continuous features, e.g., Gaussian Naive Bayes).
• Advantages:
o Simple and efficient.
o Works well with small datasets and high-dimensional data.
• Limitations:
o The independence assumption is often unrealistic.
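A minimal sketch of the calculation above for binary features, with Laplace smoothing added to avoid zero probabilities (the smoothing constant and the toy spam/ham data are illustrative additions, not part of the notes above).

```python
def train_naive_bayes(X, y, alpha=1.0):
    """Estimate P(c) and P(x_i=1|c) for binary features with Laplace smoothing."""
    classes = set(y)
    n = len(y)
    priors = {c: sum(1 for label in y if label == c) / n for c in classes}
    likelihoods = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        likelihoods[c] = [(sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                          for i in range(len(X[0]))]
    return priors, likelihoods

def predict(x, priors, likelihoods):
    """Choose the class with the highest (unnormalized) posterior."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        for i, xi in enumerate(x):
            p *= likelihoods[c][i] if xi else 1 - likelihoods[c][i]
        scores[c] = p
    return max(scores, key=scores.get)

# Toy data: features = [contains 'free', contains 'offer'], label = spam or ham.
X = [[1, 1], [1, 0], [0, 1], [0, 0], [0, 0]]
y = ["spam", "spam", "spam", "ham", "ham"]
priors, likelihoods = train_naive_bayes(X, y)
print(predict([1, 0], priors, likelihoods))  # -> 'spam'
```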
2.5 Bayesian Belief Networks
• Bayesian Belief Networks (BBNs) are graphical models that represent probabilistic
relationships among variables.
• Nodes represent random variables, and edges represent probabilistic dependencies.
• These networks can be used for reasoning under uncertainty and making predictions
based on known evidence.
2.6 Expectation Maximization (EM) Algorithm
• The EM algorithm is an iterative method for finding maximum likelihood estimates
of parameters in models with latent (hidden) variables.
• Steps:
1. E-step: Estimate the expected value of the latent variables given the observed
data and current parameter estimates.
2. M-step: Re-estimate the parameters by maximizing the expected complete-data
log-likelihood computed in the E-step.
• Applications: Clustering (e.g., Gaussian Mixture Models), handling missing data, and
other problems with hidden variables.
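Assuming numpy is available, the sketch below runs EM on a 1-D mixture of two Gaussians: the E-step computes responsibilities (the expected values of the hidden component assignments), and the M-step re-estimates the means, variances, and mixing weights. The synthetic data, initial values, and iteration count are arbitrary.

```python
import numpy as np

def em_two_gaussians(x, n_iter=50):
    """EM for a 1-D mixture of two Gaussians with hidden component labels."""
    # Arbitrary initial parameters: means, variances, mixing weights.
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, var, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
print(em_two_gaussians(x))  # means close to -2 and 3
```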
2 Marks Questions:
1. What is a Decision Tree in machine learning?
o A Decision Tree is a tree-like model used for decision-making and classification
tasks. Each internal node represents a decision based on a feature, and the
branches represent the outcomes of those decisions. The leaf nodes represent
the class label or predicted value.
2. What is the purpose of pruning in decision trees?
o Pruning is the process of removing branches from a decision tree that have little
significance or contribute to overfitting. It helps in improving the tree's
generalization by reducing complexity and avoiding overfitting.
3. What is Naive Bayes Classifier?
o Naive Bayes Classifier is a probabilistic classifier based on Bayes' Theorem,
assuming that features are conditionally independent given the class label. It is
particularly efficient for large datasets and is used for both binary and multi-
class classification tasks.
4. Define Gini Impurity.
o Gini Impurity is a metric used in decision trees to measure the impurity of a
node. It calculates the probability of incorrectly classifying a randomly chosen
element from the dataset. Lower Gini Impurity values indicate better splits.
5. What is the role of Bayes' Theorem in Bayesian Learning?
o Bayes' Theorem provides a method for updating the probability of a hypothesis
given new evidence. It is fundamental to Bayesian Learning, allowing the
model to learn and update beliefs about data as more evidence becomes
available.
4 Marks Questions:
6. Explain the difference between Univariate and Multivariate Decision Trees.
o Univariate Decision Trees test only one feature at each decision node, using
that feature's value to split the data into subsets. In contrast, Multivariate
Decision Trees can use multiple features to create more complex decision
boundaries at each node, often leading to better classification performance at
the cost of increased complexity.
7. What is Information Gain in Decision Tree Learning?
o Information Gain is used to measure the effectiveness of a feature in
classifying the data. It is calculated as the difference between the entropy of the
parent node and the weighted sum of the entropies of the child nodes. A higher
information gain indicates a better feature for splitting the data.
8. What are the advantages and disadvantages of the Naive Bayes classifier?
o Advantages:
▪ Simple and fast to train.
▪ Works well with high-dimensional data and small datasets.
▪ Handles missing data well.
o Disadvantages:
▪ Assumes all features are conditionally independent, which is often
unrealistic.
▪ Doesn't perform well with highly correlated features.
9. How does the Expectation-Maximization (EM) algorithm work?
o The EM algorithm works by iterating between two steps:
1. E-step: Estimate the expected value of the latent (hidden) variables, given the observed
data and current parameter estimates.
2. M-step: Re-estimate the parameters by maximizing the expected complete-data log-likelihood
from the E-step.
o The algorithm continues to iterate until convergence.
10. What are Bayesian Belief Networks and their applications?
o Bayesian Belief Networks (BBNs) are graphical models that represent
probabilistic dependencies among variables using a directed acyclic graph
(DAG). They are used for reasoning under uncertainty, making predictions, and
modeling complex probabilistic relationships. Applications include diagnostic
systems, risk analysis, and decision support systems.
6 Marks Questions:
11. Describe the basic algorithm for Decision Tree Learning.
o The basic steps of the Decision Tree Learning algorithm are:
1. Select the Best Feature: Choose the feature that best splits the data
(e.g., using Information Gain, Gini Index, or Chi-squared test).
2. Split the Data: Partition the dataset into subsets based on the selected
feature.
3. Create a Node: Create a decision node for the chosen feature.
4. Repeat for Each Subset: Recursively apply the decision tree algorithm
to each subset of the data, until all data is classified.
5. Stop when:
▪ All data in a node belongs to a single class.
▪ There are no more features to split on.
▪ The tree reaches a predefined maximum depth.
6. Prune the Tree: Optional step to remove branches that do not improve
the model's performance.
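Under the simplifying assumptions of categorical features, information-gain splitting, and no pruning (ID3-style), the steps above can be sketched as a small recursive function; the helper names and the toy dataset are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, features):
    """Recursive ID3-style construction for categorical features (illustrative)."""
    # Stop: all labels identical, or no features left -> leaf with the majority label.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # 1. Select the best feature by information gain.
    def gain(f):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[f], []).append(label)
        weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - weighted
    best = max(features, key=gain)
    # 2-4. Split the data on the best feature and recurse on each subset.
    tree = {"feature": best, "branches": {}}
    for value in set(row[best] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        remaining = [f for f in features if f != best]
        tree["branches"][value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return tree

# Toy dataset: feature 0 = Outlook, feature 1 = Windy.
rows   = [["sunny", "no"], ["sunny", "yes"], ["rain", "no"], ["rain", "yes"]]
labels = ["yes", "yes", "yes", "no"]
print(build_tree(rows, labels, [0, 1]))
```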
12. Explain the concept of Bayesian Learning with an example.
o Bayesian Learning is a method of learning where we update our beliefs about
a model’s parameters based on observed data using Bayes' Theorem.
o Example: Suppose we have a binary classification problem where we need to
predict whether an email is spam or not. Initially, we have a prior belief about
the probabilities of spam and non-spam emails. As we observe new emails
(data), we update these probabilities (posterior) to improve the classification of
future emails. The update uses Bayes' Theorem, which incorporates both the
prior knowledge and the likelihood of the observed data.
13. What is the difference between Classification and Regression in Decision Trees?
o Classification Trees: These trees predict a categorical class label (e.g., yes/no,
spam/ham). The decision at each node is made based on the best split of
categorical or numerical features that maximizes the homogeneity of the class
labels in the child nodes.
o Regression Trees: These trees predict a continuous value. Instead of class
labels, each leaf node contains a numeric value representing the predicted
output. The splitting criteria focus on minimizing the error in predicting the
target value (e.g., minimizing squared error).
14. Discuss the advantages and disadvantages of Decision Trees.
o Advantages:
▪ Easy to understand and interpret.
▪ Can handle both numerical and categorical data.
▪ Non-linear relationships between the features and the target do not require feature transformation.
▪ Can handle missing data through surrogate splits.
o Disadvantages:
▪ Prone to overfitting, especially with deep trees.
▪ Sensitive to noisy data and outliers.
▪ Can be computationally expensive for large datasets.
▪ Tends to create biased trees if some classes dominate the dataset.
15. Explain the working of a Naive Bayes Classifier with an example.
o A Naive Bayes Classifier works by calculating the posterior probability of each
class given the features of the data, using Bayes' Theorem. It assumes that the
features are independent, simplifying the computation.
o Example: For a spam email classifier, let’s assume we have two features:
"contains 'free'" and "contains 'offer'". The algorithm calculates the probability
of an email being spam or not, given these features:
P(\text{spam}|X) = \frac{P(X|\text{spam}) \cdot P(\text{spam})}{P(X)}
Here, X is the feature vector (contains 'free', contains 'offer'). The classifier computes the
likelihood of each feature given the class, multiplies by the prior probability of each class,
and chooses the class with the highest posterior probability.
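The calculation described above, carried out with made-up counts: suppose the training set contains 4 spam and 6 ham emails, 'free' appears in 3 of the spam and 1 of the ham emails, and 'offer' appears in 2 of the spam and 1 of the ham emails.

```python
# Worked Naive Bayes spam example with made-up counts (no smoothing).
p_spam, p_ham = 4 / 10, 6 / 10                 # priors from 4 spam, 6 ham emails
p_free_spam, p_free_ham = 3 / 4, 1 / 6         # P(contains 'free' | class)
p_offer_spam, p_offer_ham = 2 / 4, 1 / 6       # P(contains 'offer' | class)

# Email containing both 'free' and 'offer': compare unnormalized posteriors.
score_spam = p_spam * p_free_spam * p_offer_spam
score_ham = p_ham * p_free_ham * p_offer_ham
print(score_spam, score_ham)                    # 0.15 vs about 0.017
print("spam" if score_spam > score_ham else "ham")
```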