Dimensions of a Supervised ML Algorithm
Supervised machine learning algorithms are defined by several key dimensions: problem
type (classification or regression), input data (features), output (labels), model complexity, and
evaluation metrics. These dimensions determine how the algorithm learns and makes
predictions based on labeled training data.
1. Problem Type:
Supervised learning algorithms are categorized by the type of problem they solve.
Classification: Algorithms predict a category or class label for each input (e.g.,
spam or not spam, cat or dog).
1. Binary Classification: The simplest form, where there are only two
possible output classes (e.g., classifying an email as 'spam' or 'not spam',
predicting if a customer will 'churn' or 'not churn', or diagnosing a disease as
'present' or 'absent'). Many algorithms inherently support binary classification
and can be extended for multi-class problems.
2. Multi-class Classification: Involves more than two mutually exclusive
output classes (e.g., identifying handwritten digits from 0-9, categorizing
animal species from images, or classifying news articles into 'sports', 'politics',
'entertainment', etc.). Here, an instance belongs to exactly one of the many
available classes.
3. Multi-label Classification: A more complex scenario where an instance
can simultaneously belong to multiple classes (e.g., an image might contain
both a 'cat' and a 'dog', or a movie might be categorized as both 'action' and
'comedy'). The algorithm predicts a set of labels rather than a single one.
Examples: Support Vector Machines (SVMs), Decision Trees, Random
Forests, K-Nearest Neighbors (KNN), Naive Bayes, Neural Networks.
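To illustrate how the three classification settings differ at the level of the target labels, here is a small Python sketch; all labels and values are made up for demonstration.
```python
# Illustrative target (label) formats for the three classification settings.
# All labels below are invented for demonstration.

# Binary classification: one of exactly two classes per instance.
y_binary = ["spam", "not spam", "spam"]

# Multi-class classification: exactly one of several classes per instance.
y_multiclass = [3, 7, 0, 9]              # e.g., handwritten digits 0-9

# Multi-label classification: a set of labels per instance, often encoded
# as a binary indicator vector with one position per class.
#                [action, comedy, drama]
y_multilabel = [
    [1, 1, 0],   # movie tagged as both 'action' and 'comedy'
    [0, 0, 1],   # movie tagged only as 'drama'
]
```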
Data Type:
The nature of the input features (e.g., numerical, categorical, text, or images) determines which algorithms are suitable and what preprocessing is needed.
Evaluation Metrics:
Accuracy:
Formula: (TP + TN) / (TP + TN + FP + FN)
Description: Out of all predictions made, how many were correct? It is a useful summary when the classes are roughly balanced.
Precision:
Formula: TP / (TP + FP)
Description: Out of all instances predicted as positive, how many were actually positive?
It focuses on the quality of positive predictions.
Recall:
Formula: TP / (TP + FN)
Description: Out of all actual positive instances, how many did the model correctly
identify? It focuses on the model's ability to "find" all positive instances.
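As a minimal sketch, the snippet below computes these metrics from hypothetical confusion-matrix counts (the TP, FP, FN, TN values are assumed for illustration).
```python
# Hypothetical confusion-matrix counts for a binary classifier.
TP, FP, FN, TN = 40, 10, 5, 45

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # fraction of all predictions that are correct
precision = TP / (TP + FP)                    # quality of the positive predictions
recall    = TP / (TP + FN)                    # coverage of the actual positives

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.85 precision=0.80 recall=0.89
```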
Bayesian Decision Theory
Bayesian decision theory offers a probabilistic framework for making optimal decisions in
machine learning, particularly in classification tasks. It leverages Bayes' theorem to calculate
posterior probabilities, guiding the selection of the most likely class or outcome based on
available evidence and prior knowledge.
Key Concepts:
Prior Probability:
Represents the initial belief about the likelihood of different outcomes before observing any
data. For example, in a medical diagnosis scenario, this could be the prevalence of a disease in
the population.
Likelihood:
Indicates how likely it is to observe the given data if a particular outcome is true. In the medical
diagnosis example, this would be the probability of observing specific test results given the
presence or absence of the disease.
Posterior Probability:
Represents the updated belief about the outcome after considering the observed data. It's
calculated using Bayes' theorem, combining prior probabilities and likelihoods.
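Written out explicitly, Bayes' theorem combines these quantities as P(outcome∣data) = P(data∣outcome) × P(outcome) / P(data), where the denominator P(data) (the evidence) is obtained by summing P(data∣outcome) × P(outcome) over all possible outcomes.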
Decision Rule:
Based on the posterior probabilities, a decision rule is established to classify the data or make a
choice. For instance, in a binary classification problem, the rule might be to assign the data to the
class with the higher posterior probability.
Loss Function:
Quantifies the cost or penalty associated with making incorrect decisions. This helps in defining
what constitutes an "optimal" decision, as it considers the potential consequences of errors.
How it works:
1. Define the problem:
Determine the possible outcomes (classes) and the relevant features or attributes of the data.
2. Estimate prior probabilities:
Determine the prior probabilities for each outcome, based on available knowledge or historical
data.
3. Estimate likelihoods:
Calculate the likelihood of observing the data given each outcome, often using probability
density functions (PDFs) or probability mass functions (PMFs).
4. Apply Bayes' theorem:
Combine prior probabilities and likelihoods to calculate the posterior probabilities for each
outcome.
5. Make a decision:
Select the outcome with the highest posterior probability or, if a loss function is defined, choose
the decision that minimizes the expected loss.
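To make the five steps concrete, here is a small Python sketch for a toy two-class diagnosis problem; every number (priors, likelihoods, losses) is an invented illustration, not a real clinical value.
```python
# Toy Bayesian decision: a patient tests positive -- treat or not?
# All numbers below are illustrative assumptions.

# Step 2: prior probabilities of each state of nature.
prior = {"disease": 0.01, "healthy": 0.99}

# Step 3: likelihood of observing a positive test under each state.
likelihood = {"disease": 0.95, "healthy": 0.05}   # P(positive test | state)

# Step 4: Bayes' theorem -> posterior probabilities.
evidence = sum(likelihood[w] * prior[w] for w in prior)
posterior = {w: likelihood[w] * prior[w] / evidence for w in prior}

# Step 5a: MAP decision (all mistakes cost the same).
map_decision = max(posterior, key=posterior.get)

# Step 5b: minimum expected loss, with loss[action][true_state].
loss = {
    "treat":      {"disease": 0,  "healthy": 1},   # unnecessary treatment: mild cost
    "dont_treat": {"disease": 20, "healthy": 0},   # missed disease: severe cost
}
risk = {a: sum(loss[a][w] * posterior[w] for w in posterior) for a in loss}
min_risk_decision = min(risk, key=risk.get)

print(posterior)          # {'disease': ~0.16, 'healthy': ~0.84}
print(map_decision)       # 'healthy'
print(min_risk_decision)  # 'treat' -- the high cost of a miss flips the decision
```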
Example 2:
You want to decide: Should I take an umbrella today?
1. Your "Best Guess" Before You Look:
o "It's July, so it probably won't rain." (Your starting idea)
2. You Get a "Clue":
o You look outside and see dark clouds. (This is your new information!)
3. Your Idea "Changes":
o Now, because of the dark clouds, you think: "Hmm, it's more likely to rain now
than I first thought." (Your updated idea)
4. You Think About "What Happens If I'm Wrong":
o Mistake 1: You bring the umbrella, but it doesn't rain.
Result: A little annoying to carry, but not terrible. (Small bad thing)
o Mistake 2: You don't bring the umbrella, and it does rain.
Result: You get totally soaked, maybe ruin your phone! (BIG bad thing)
5. The "Smart Decision":
o Because getting totally soaked is much worse than just carrying an umbrella, and
because the dark clouds make rain seem more likely, you decide: "I should take
the umbrella."
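With made-up numbers, the same reasoning can be written out explicitly. Suppose the dark clouds raise your estimate to P(rain) = 0.4. If getting soaked costs 10 "annoyance points" and carrying an unneeded umbrella costs 1, then the expected loss of leaving the umbrella behind is 0.4 × 10 = 4.0, while the expected loss of taking it is 0.6 × 1 = 0.6. Taking the umbrella minimizes the expected loss, which is exactly the decision reached informally above.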
Classification
Classification in Bayesian Decision Theory is a principled and powerful way to assign an
observed data point (e.g., an email, a patient's symptoms) to one of several predefined categories
or classes.
It's not just about guessing the most likely category, but about making the best decision by
considering:
1. How likely each category is, given the data.
2. How "bad" (or costly) it is to be wrong for each possible mistake.
Here's a breakdown:
The Goal: Optimal Class Assignment
Imagine you have a new email (x) and you want to classify it as either "Spam" (ω1) or "Not
Spam" (ω2).
Bayesian Decision Theory tells you to make the decision that minimizes the total expected
"cost" or "loss" associated with your classification.
The Key Steps in Classification using BDT:
1. Observe the Data (x):
o You get a new piece of data. For our email example, this is the content of the
email, its sender, subject line, etc. (x).
2. Calculate Posterior Probabilities (P(ω∣x)):
o This is the core Bayesian step. For each possible class (Spam or Not Spam), you
calculate:
P(Spam∣email x): What is the probability that this specific email x is
Spam, given its content?
P(Not Spam∣email x): What is the probability that this specific email x is
Not Spam, given its content?
o These probabilities are derived using Bayes' Theorem, which combines:
Your prior belief about how often spam usually occurs (P(Spam)).
The likelihood of seeing this email's features if it were truly spam
(P(x∣Spam)).
3. Define a Loss Function (Costs of Mistakes):
o This is where the "decision theory" part comes in. You decide how "bad" different
types of errors are:
Cost of "False Positive" (FP): Classifying a Not Spam email as Spam.
Example: A legitimate work email goes to your spam folder. This
is annoying, maybe you miss something important. (Moderate
Cost).
Cost of "False Negative" (FN): Classifying a Spam email as Not Spam.
Example: A spam email gets into your inbox. This is a nuisance,
but generally less critical than missing a work email. (Low Cost).
Cost of "True Positive/Negative": Correctly classifying. (Cost = 0).
o Crucial Point: These costs don't have to be equal! In a medical context, a False
Negative (missing a disease) is far more costly than a False Positive (a healthy
person gets an unnecessary test).
4. Make the Decision (Minimize Expected Loss/Risk):
o For each possible action (classify as "Spam" or "Not Spam"), you calculate its
expected loss (or risk) by multiplying the posterior probability of each true class
by the cost of making that decision for that true class, and summing them up.
o You then choose the class (the action) that has the lowest expected loss.
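As a compact illustration of step 4, the sketch below computes the expected loss of each action for a single email; the posterior probabilities and costs are illustrative assumptions, not values from any real filter.
```python
# Expected loss (risk) of each action for a single email.
# Posterior probabilities and costs below are illustrative assumptions.
posterior = {"spam": 0.6, "not_spam": 0.4}

# cost[action][true_class]: 0 for correct decisions, asymmetric for errors.
cost = {
    "classify_as_spam":     {"spam": 0, "not_spam": 5},  # FP: legit mail lost in the spam folder
    "classify_as_not_spam": {"spam": 1, "not_spam": 0},  # FN: spam reaches the inbox
}

risk = {action: sum(cost[action][c] * posterior[c] for c in posterior)
        for action in cost}
decision = min(risk, key=risk.get)

print(risk)      # {'classify_as_spam': 2.0, 'classify_as_not_spam': 0.6}
print(decision)  # 'classify_as_not_spam' -- the costly FP outweighs the higher spam probability
```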
Two Common Rules Derived from BDT:
1. Maximum A Posteriori (MAP) Rule (Simple Case):
o If all misclassification costs are considered equal (i.e., a False Positive is just as
bad as a False Negative), then the decision rule simplifies to: "Choose the class
with the highest posterior probability."
o Example: If P(Spam∣x)=0.6 and P(Not Spam∣x)=0.4, you'd classify it as Spam.
This is the most common default in many ML classifiers.
2. Minimum Risk Rule (General Case):
o When misclassification costs are unequal, you explicitly use the loss function to
calculate the expected loss for each decision and choose the one with the
minimum expected loss.
o Example: Even if P(Spam∣x)=0.4 and P(Not Spam∣x)=0.6 (making "Not Spam"
more likely), if the cost of a False Negative (missing spam, so spam goes to
inbox) is very, very high compared to the cost of a False Positive (legit email goes
to spam), you might still choose to classify it as "Spam" to avoid that very high
cost.
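With illustrative numbers: keep P(Spam∣x) = 0.4 and P(Not Spam∣x) = 0.6, and suppose a missed spam costs 10 while a legitimate email sent to the spam folder costs 1. The expected loss of deciding "Not Spam" is 10 × 0.4 = 4.0, while the expected loss of deciding "Spam" is 1 × 0.6 = 0.6, so the minimum-risk rule chooses "Spam" even though the MAP rule would choose "Not Spam".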
Why it Matters for Classification:
Optimal: Bayesian Decision Theory provides the theoretical framework for making
optimal classification decisions, given your probabilistic models and defined costs.
Handles Uncertainty: It naturally incorporates the uncertainty inherent in predictions by
using probabilities.
Cost-Sensitive: It allows you to build classifiers that are aware of and account for the
different real-world consequences of various types of errors, which is critical in many
applications (e.g., medical diagnosis, fraud detection, safety systems).
Disadvantages:
Requires Probability Distributions:
It can be challenging to accurately estimate the necessary probability distributions, especially for
complex problems.
Computational Complexity:
Calculating posterior probabilities can be computationally expensive, especially with many
classes or complex features.
Losses and Risks
In Bayesian decision theory, loss functions quantify the cost associated with making incorrect
decisions, while risk is the expected loss, calculated by averaging the loss over all possible
outcomes, weighted by their probabilities. The goal of Bayesian decision theory is to choose the
action that minimizes this expected loss, or risk.
Loss Function:
A loss function, denoted as L(α, θ), assigns a numerical cost to each possible outcome
based on the chosen action (α) and the true state of the world (θ).
For example, in a binary classification problem, a loss function might penalize
misclassifying a cat as a dog differently than misclassifying a dog as a cat.
The loss function is crucial for defining what constitutes a "good" or "bad" decision in a
specific context.
A loss function quantifies the penalty or cost incurred when a specific action (α) is taken,
given that a particular state of nature (θ) is the true, underlying condition.
What it represents: It's a numerical value representing "how bad" it is to make a particular
decision when the reality is something else. A higher loss value means a worse outcome.
Context:
θ: The true state of nature (e.g., "patient has cancer," "email is spam," "the stock price will
go up").
α: The action taken by the decision-maker (e.g., "diagnose as cancer," "classify as spam,"
"buy the stock").
Key Characteristics: The loss is non-negative, and a correct decision typically incurs zero loss. The simplest choice is the zero-one loss, which charges 1 for every misclassification and 0 for a correct one; under it, minimizing risk reduces to the MAP rule described earlier.
Risk:
Risk, denoted as R(α|x), represents the average loss associated with a particular decision
(α) given the observed data (x).
It's calculated by averaging the loss over all possible true states (θ), weighted by their
posterior probabilities given the observed data: R(α|x) = ∫ L(α, θ) p(θ|x) dθ, where p(θ|x)
is the posterior probability of θ given x.
In essence, risk quantifies the expected penalty for making a specific decision.
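In the common discrete case with a finite set of states θ1, …, θK, the integral becomes a sum: R(α|x) = Σk L(α, θk) p(θk|x), which is the form used in the worked examples above.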
Bayesian Decision Rule:
Bayesian decision theory aims to find the decision rule (α(x)) that minimizes the overall
risk, which is the expected risk across all possible data instances.
The decision rule maps each possible observation (x) to an action (α).
The optimal decision rule selects the action that minimizes the conditional risk for each
observation.
Example:
Imagine a spam filter. A loss function might penalize misclassifying an important email as spam
more heavily than misclassifying a spam email as important. The risk would then be the expected
cost of using a particular filtering strategy, considering the probability of different types of
emails and the associated costs of misclassification.