Dimensions of Supervised ML Algorithm

Uploaded by

Barvin

DIMENSIONS OF SUPERVISED MACHINE LEARNING ALGORITHM

Supervised machine learning algorithms are defined by several key dimensions: problem
type (classification or regression), input data (features), output (labels), model complexity, and
evaluation metrics. These dimensions determine how the algorithm learns and makes
predictions based on labeled training data.
1. Problem Type:
Supervised learning algorithms are categorized by the type of problem they solve.
 Classification: Algorithms predict a category or class label for each input (e.g.,
spam or not spam, cat or dog).

1. Binary Classification: The simplest form, where there are only two
possible output classes (e.g., classifying an email as 'spam' or 'not spam',
predicting whether a customer will 'churn' or 'not churn', or diagnosing a disease as
'present' or 'absent'). Many algorithms inherently support binary classification
and can be extended to multi-class problems.
2. Multi-class Classification: Involves more than two mutually exclusive
output classes (e.g., identifying handwritten digits from 0-9, categorizing
animal species from images, or classifying news articles into 'sports', 'politics',
'entertainment', etc.). Here, an instance belongs to exactly one of the
available classes.
3. Multi-label Classification: A more complex scenario where an instance
can simultaneously belong to multiple classes (e.g., an image might contain
both a 'cat' and a 'dog', or a movie might be categorized as both 'action' and
'comedy'). The algorithm predicts a set of labels rather than a single one.
• Examples: Support Vector Machines (SVMs), Decision Trees, Random
Forests, K-Nearest Neighbors (KNN), Naive Bayes, Neural Networks.

• Regression: Algorithms predict a continuous numerical value (e.g., the price of a
house, temperature). In regression tasks, the target variable is continuous (or
numerical), meaning it can take any real value within a given range. The
algorithm's goal is to predict a precise numerical quantity.
• Algorithmic Examples: Linear Regression, Polynomial Regression, Ridge
Regression, Lasso Regression, Support Vector Regression (SVR), Decision Tree
Regressors.
2. Input Data (Features):
• This dimension describes the features (independent variables): the attributes or
characteristics that describe each data point and that the algorithm uses to learn
and make predictions.
• The number of features determines the dimensionality of the input space.
• For example, in a weather prediction model, features might include temperature,
humidity, wind speed, etc.

Data Type:

1. Numerical: Features represented by numbers.
   1. Continuous: Can take any value within a range (e.g., age,
      temperature, income). Often requires scaling or normalization.
   2. Discrete: Can only take specific, distinct numerical values (e.g.,
      number of rooms, number of children).
2. Categorical: Features representing categories or labels, not numerical
   quantities.
   1. Nominal: Categories without any inherent order (e.g., color (red,
      blue, green), gender (male, female)). Often require one-hot encoding.
   2. Ordinal: Categories with a meaningful order (e.g., educational
      level (high school, bachelor's, master's, PhD), satisfaction
      rating (low, medium, high)). Can sometimes be mapped to integers.
3. Text: Unstructured textual data (e.g., customer reviews, social media posts).
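As an illustration of how these data types are commonly prepared, the following minimal Python sketch scales a continuous feature, one-hot encodes a nominal feature, and maps an ordinal feature to integers. The helper names and the sample values are hypothetical and chosen only for illustration; they are not part of the original notes.

```python
# Minimal sketch of common feature encodings.
# Helper names and sample data are hypothetical.

def min_max_scale(values):
    """Rescale continuous values to [0, 1]; assumes at least two distinct values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(value, categories):
    """Encode a nominal value as a 0/1 vector with one position per category."""
    return [1 if value == c else 0 for c in categories]

def ordinal_code(value, ordered_levels):
    """Map an ordinal value to its rank in an explicitly ordered list of levels."""
    return ordered_levels.index(value)

ages = [20, 30, 60]                                           # continuous
scaled_ages = min_max_scale(ages)                             # [0.0, 0.25, 1.0]
color_vec = one_hot("blue", ["red", "blue", "green"])         # [0, 1, 0]
rating = ordinal_code("medium", ["low", "medium", "high"])    # 1

print(scaled_ages, color_vec, rating)
```

One-hot encoding avoids implying an order among nominal categories, whereas the integer mapping deliberately keeps the order of ordinal levels.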
3. Output (Labels):
 The output is the target variable or label associated with each input data point.
 In classification, the labels are discrete categories, while in regression, they are
continuous numerical values.
 The number of labels and their distribution can influence the complexity of the
model and its ability to generalize.
4. Model Complexity:
 Supervised learning models can vary in complexity, from simple linear models to
complex neural networks.
 The choice of model complexity depends on the complexity of the data and the
desired trade-off between bias and variance.
 High-dimensional data, with many features, may require more complex models
to capture intricate relationships.
 The Bias-Variance Trade-off: This is a central concept here. Low-complexity
models have high bias and low variance. High-complexity models have low bias
and high variance. The goal is to find a model with an optimal balance that
generalizes well to new, unseen data. Regularization techniques are often
employed to manage complexity and prevent overfitting in high-capacity
models.
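The two extremes of model complexity can be contrasted with a small Python sketch. It assumes a hypothetical target y = 2x with Gaussian noise and compares a constant predictor (low complexity) with a 1-nearest-neighbour predictor (high complexity). It does not decompose the error into bias and variance terms; it only compares training and test error for the two models.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def make_data(n):
    """Noisy samples of the hypothetical target y = 2x at x = 0, 1, ..., n-1."""
    return [(float(x), 2.0 * x + random.gauss(0.0, 1.0)) for x in range(n)]

def mse(model, data):
    """Mean squared error of a model (a function of x) on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train = make_data(10)
test = make_data(10)  # same inputs, fresh noise

# Low complexity: always predict the training mean (high bias, low variance).
mean_y = sum(y for _, y in train) / len(train)
constant_model = lambda x: mean_y

# High complexity: 1-nearest-neighbour memorises every training point
# (low bias, high variance).
def nearest_neighbour_model(x):
    return min(train, key=lambda point: abs(point[0] - x))[1]

for name, model in [("constant", constant_model), ("1-NN", nearest_neighbour_model)]:
    print(name, "train MSE:", mse(model, train), "test MSE:", mse(model, test))
```

The 1-NN model reproduces the training labels exactly, so its training error is zero, but it also reproduces the training noise on new data. The constant model ignores the trend entirely, so it underfits on both sets.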
5. Evaluation Metrics:
• Evaluation metrics give a quantitative measure of how well a model performs.
Without them, we would not know whether the algorithm is learning effectively,
whether it is overfitting or underfitting, or whether it is suitable for the
real-world problem it is meant to solve.
• Metrics vary depending on the problem type (classification or regression).
• Examples include accuracy, precision, recall, and F1-score for classification, and mean
squared error and R-squared for regression.

Accuracy:

• Formula: (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP, and FN are the counts of
true positives, true negatives, false positives, and false negatives.
• Description: It measures the proportion of all predictions that were correct.

Precision:

• Formula: TP / (TP + FP)
• Description: Out of all instances predicted as positive, how many were actually positive?
It focuses on the quality of positive predictions.

Recall (Sensitivity / True Positive Rate - TPR):

• Formula: TP / (TP + FN)
• Description: Out of all actual positive instances, how many did the model correctly
identify? It focuses on the model's ability to "find" all positive instances.
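The three metrics above can be computed directly from the confusion counts. The following minimal Python sketch uses a small, made-up set of labels (1 = positive, 0 = negative); the function name is hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

# Made-up labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)   # 3, 5, 1, 1
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 8 / 10 = 0.8
precision = tp / (tp + fp)                          # 3 / 4 = 0.75
recall = tp / (tp + fn)                             # 3 / 4 = 0.75

print(accuracy, precision, recall)
```

Precision and recall are undefined when their denominators are zero (no predicted positives or no actual positives); this sketch does not handle that case.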

********************

Bayesian Decision Theory


Introduction
Bayesian Decision Theory is a framework for making the best possible decisions under
uncertainty by using probabilities and considering the costs of different choices.
It's a mathematical way to make optimal decisions by combining:
1. Prior beliefs: What you thought was likely before new information.
2. Observed evidence: New data you've collected.
3. Costs/benefits: The consequences of making correct or incorrect choices. The goal is to
choose the action that minimizes the average cost or maximizes the average benefit.

Bayesian decision theory offers a probabilistic framework for making optimal decisions in
machine learning, particularly in classification tasks. It leverages Bayes' theorem to calculate
posterior probabilities, guiding the selection of the most likely class or outcome based on
available evidence and prior knowledge.
Key Concepts:
 Prior Probability:
Represents the initial belief about the likelihood of different outcomes before observing any
data. For example, in a medical diagnosis scenario, this could be the prevalence of a disease in
the population.
 Likelihood:
Indicates how likely it is to observe the given data if a particular outcome is true. In the medical
diagnosis example, this would be the probability of observing specific test results given the
presence or absence of the disease.
 Posterior Probability:
Represents the updated belief about the outcome after considering the observed data. It's
calculated using Bayes' theorem, combining prior probabilities and likelihoods.
 Decision Rule:
Based on the posterior probabilities, a decision rule is established to classify the data or make a
choice. For instance, in a binary classification problem, the rule might be to assign the data to the
class with the higher posterior probability.
 Loss Function:
Quantifies the cost or penalty associated with making incorrect decisions. This helps in defining
what constitutes an "optimal" decision, as it considers the potential consequences of errors.

How it works:
1. Define the problem:
Determine the possible outcomes (classes) and the relevant features or attributes of the data.
2. Estimate prior probabilities:
Determine the prior probabilities for each outcome, based on available knowledge or historical
data.
3. Estimate likelihoods:
Calculate the likelihood of observing the data given each outcome, often using probability
density functions (PDFs) or probability mass functions (PMFs).
4. Apply Bayes' theorem:
Combine prior probabilities and likelihoods to calculate the posterior probabilities for each
outcome.
5. Make a decision:
Select the outcome with the highest posterior probability or, if a loss function is defined, choose
the decision that minimizes the expected loss.
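The five steps can be sketched in a few lines of Python for the medical diagnosis scenario mentioned earlier. The prevalence and test characteristics below are assumed numbers used only for illustration, and the function name is hypothetical.

```python
def posteriors(priors, likelihoods):
    """Bayes' theorem for discrete classes:
    P(class | x) = P(x | class) * P(class) / P(x)."""
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())  # P(x), the normalising constant
    return {c: joint[c] / evidence for c in joint}

# Steps 1-2: classes and assumed priors (1% prevalence).
priors = {"disease": 0.01, "healthy": 0.99}

# Step 3: assumed likelihood of a positive test result under each class.
likelihoods = {"disease": 0.95, "healthy": 0.05}

# Step 4: apply Bayes' theorem.
post = posteriors(priors, likelihoods)

# Step 5: with no loss function defined, pick the highest posterior.
decision = max(post, key=post.get)

print(post, decision)
```

With these assumed numbers the posterior probability of disease after a positive test is only about 0.16, because the low prior outweighs the accurate test. This is why the prior cannot be ignored.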

Advantages of Bayesian Decision Theory:


 Optimal decision-making:
Provides a framework for making decisions that are optimal in the sense of minimizing expected
loss.
 Incorporates prior knowledge:
Allows for the integration of prior beliefs or knowledge into the decision-making process, which
can be particularly useful when data is limited.
 Handles uncertainty:
Explicitly models and quantifies uncertainty through probabilities, allowing for more informed
decisions.
Example 1:
In a spam filter, the theory could be used to classify emails as spam or not spam. Prior
probabilities could be based on the overall proportion of spam emails in the user's
inbox. Likelihoods would be calculated based on the frequency of certain words or patterns in
the email content. The posterior probabilities would then indicate the likelihood of the email
being spam or not, guiding the filtering decision.

Example 2:
You want to decide: Should I take an umbrella today?
1. Your "Best Guess" Before You Look:
o "It's July, so it probably won't rain." (Your starting idea)
2. You Get a "Clue":
o You look outside and see dark clouds. (This is your new information!)
3. Your Idea "Changes":
o Now, because of the dark clouds, you think: "Hmm, it's more likely to rain now
than I first thought." (Your updated idea)
4. You Think About "What Happens If I'm Wrong":
o Mistake 1: You bring the umbrella, but it doesn't rain.
 Result: A little annoying to carry, but not terrible. (Small bad thing)
o Mistake 2: You don't bring the umbrella, and it does rain.
 Result: You get totally soaked, maybe ruin your phone! (BIG bad thing)
5. The "Smart Decision":
o Because getting totally soaked is much worse than just carrying an umbrella, and
because the dark clouds make rain seem more likely, you decide: "I should take
the umbrella."
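The umbrella reasoning can be turned into numbers with a small Python sketch. The costs (1 for carrying an umbrella needlessly, 10 for getting soaked) and the rain probabilities are assumptions made up for illustration.

```python
def should_take_umbrella(p_rain, cost_carry=1.0, cost_soaked=10.0):
    """Compare the expected loss of both actions and pick the smaller one.

    Taking the umbrella only costs something if it stays dry;
    leaving it only costs something if it rains.
    """
    loss_take = (1.0 - p_rain) * cost_carry
    loss_leave = p_rain * cost_soaked
    return loss_take < loss_leave

print(should_take_umbrella(0.05))  # clear July sky: leave it at home
print(should_take_umbrella(0.40))  # dark clouds: take it
```

Under these assumed costs the umbrella is worth taking whenever the probability of rain exceeds 1 / (1 + 10), roughly 0.09. Because getting soaked is so much worse, the decision flips well before rain becomes the more likely outcome.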

Classification
Classification in Bayesian Decision Theory is a principled and powerful way to assign an
observed data point (e.g., an email, a patient's symptoms) to one of several predefined categories
or classes.
It's not just about guessing the most likely category, but about making the best decision by
considering:
1. How likely each category is, given the data.
2. How "bad" (or costly) it is to be wrong for each possible mistake.
Here's a breakdown:
The Goal: Optimal Class Assignment
Imagine you have a new email (x) and you want to classify it as either "Spam" (ω1) or "Not
Spam" (ω2).
Bayesian Decision Theory tells you to make the decision that minimizes the total expected
"cost" or "loss" associated with your classification.
The Key Steps in Classification using BDT:
1. Observe the Data (x):
o You get a new piece of data. For our email example, this is the content of the
email, its sender, subject line, etc. (x).
2. Calculate Posterior Probabilities (P(ω∣x)):
o This is the core Bayesian step. For each possible class (Spam or Not Spam), you
calculate:
 P(Spam∣email x): What is the probability that this specific email x is
Spam, given its content?
 P(Not Spam∣email x): What is the probability that this specific email x is
Not Spam, given its content?
o These probabilities are derived using Bayes' Theorem, which combines:
 Your prior belief about how often spam usually occurs (P(Spam)).
 The likelihood of seeing this email's features if it were truly spam
(P(x∣Spam)).
3. Define a Loss Function (Costs of Mistakes):
o This is where the "decision theory" part comes in. You decide how "bad" different
types of errors are:
▪ Cost of "False Positive" (FP): Classifying a Not Spam email as Spam.
▪ Example: A legitimate work email goes to your spam folder. This
is annoying, and you may miss something important. (Moderate
Cost).
▪ Cost of "False Negative" (FN): Classifying a Spam email as Not Spam.
▪ Example: A spam email gets into your inbox. This is a nuisance,
but generally less critical than missing a work email. (Low Cost).
▪ Cost of "True Positive/Negative": Correctly classifying. (Cost = 0).
o Crucial Point: These costs don't have to be equal! In a medical context, a False
Negative (missing a disease) is far more costly than a False Positive (a healthy
person gets an unnecessary test).
4. Make the Decision (Minimize Expected Loss/Risk):
o For each possible action (classify as "Spam" or "Not Spam"), you calculate its
expected loss (or risk) by multiplying the posterior probability of each true class
by the cost of making that decision for that true class, and summing them up.
o You then choose the class (the action) that has the lowest expected loss.
Two Common Rules Derived from BDT:
1. Maximum A Posteriori (MAP) Rule (Simple Case):
o If all misclassification costs are considered equal (i.e., a False Positive is just as
bad as a False Negative), then the decision rule simplifies to: "Choose the class
with the highest posterior probability."
o Example: If P(Spam∣x)=0.6 and P(Not Spam∣x)=0.4, you'd classify it as Spam.
This is the most common default in many ML classifiers.
2. Minimum Risk Rule (General Case):
o When misclassification costs are unequal, you explicitly use the loss function to
calculate the expected loss for each decision and choose the one with the
minimum expected loss.
o Example: Even if P(Spam∣x)=0.4 and P(Not Spam∣x)=0.6 (making "Not Spam"
more likely), if the cost of a False Negative (missing spam, so spam goes to
inbox) is very, very high compared to the cost of a False Positive (legit email goes
to spam), you might still choose to classify it as "Spam" to avoid that very high
cost.
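The difference between the two rules can be checked with a short Python sketch that uses the posteriors from the example above (0.4 and 0.6). The loss values (1 for a false positive, 5 for a false negative) are assumptions chosen to make the false negative the costlier error.

```python
posterior = {"spam": 0.4, "not_spam": 0.6}

# loss[action][true_class]; the values are assumed for illustration.
loss = {
    "spam":     {"spam": 0.0, "not_spam": 1.0},  # false positive costs 1
    "not_spam": {"spam": 5.0, "not_spam": 0.0},  # false negative costs 5
}

def expected_loss(action):
    """Sum over true classes of loss(action, class) * P(class | x)."""
    return sum(loss[action][c] * posterior[c] for c in posterior)

# MAP rule: ignore costs, take the most probable class.
map_decision = max(posterior, key=posterior.get)

# Minimum risk rule: take the action with the lowest expected loss.
min_risk_decision = min(loss, key=expected_loss)

print(map_decision, min_risk_decision)
print(expected_loss("spam"), expected_loss("not_spam"))
```

Here the MAP rule chooses "not_spam", while the minimum risk rule chooses "spam" because its expected loss (0.6) is lower than that of "not_spam" (2.0).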
Why it Matters for Classification:
 Optimal: Bayesian Decision Theory provides the theoretical framework for making the
optimal classification decisions, given your probabilistic models and defined costs.
 Handles Uncertainty: It naturally incorporates the uncertainty inherent in predictions by
using probabilities.
 Cost-Sensitive: It allows you to build classifiers that are aware of and account for the
different real-world consequences of various types of errors, which is critical in many
applications (e.g., medical diagnosis, fraud detection, safety systems).
Disadvantages:
 Requires Probability Distributions:
It can be challenging to accurately estimate the necessary probability distributions, especially for
complex problems.
 Computational Complexity:
Calculating posterior probabilities can be computationally expensive, especially with many
classes or complex features.
Losses and Risks
In Bayesian decision theory, loss functions quantify the cost associated with making incorrect
decisions, while risk is the expected loss, calculated by averaging the loss over all possible
outcomes, weighted by their probabilities. The goal of Bayesian decision theory is to choose the
action that minimizes this expected loss, or risk.
Loss Function:
 A loss function, denoted as L(α, θ), assigns a numerical cost to each possible outcome
based on the chosen action (α) and the true state of the world (θ).
 For example, in a binary classification problem, a loss function might penalize
misclassifying a cat as a dog differently than misclassifying a dog as a cat.
 The loss function is crucial for defining what constitutes a "good" or "bad" decision in a
specific context.
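A loss function quantifies the penalty or cost incurred when a specific action (a) is taken,
given that a particular state of nature (ω) is the true, underlying condition. The notation
λ(a∣ω) used below denotes the same quantity as L(α, θ) above, with a in place of α and ω in
place of θ.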
What it represents: It's a numerical value representing "how bad" it is to make a particular
decision when the reality is something else. A higher loss value means a worse outcome.
Context:
ω: The true state of nature (e.g., "patient has cancer," "email is spam," "the stock price will
go up").
a: The action taken by the decision-maker (e.g., "diagnose as cancer," "classify as spam,"
"buy the stock").
Key Characteristics:

λ(a∣ω)≥0: Loss is typically non-negative. Zero loss means a perfect decision.


It allows for asymmetric costs: The cost of one type of error can be very different from
another. For example, in medical diagnosis, the loss of a false negative (missing a serious
disease) is usually much higher than the loss of a false positive (diagnosing a disease that
isn't there).
Common Types of Loss Functions:
0-1 Loss (or Zero-One Loss):
This is the simplest loss function for classification.
λ(a∣ω)=0 if a=ω (correct classification).
λ(a∣ω)=1 if a≠ω (incorrect classification).
Used when: All misclassifications are considered equally bad. The goal is to minimize the
total number of errors (or maximize accuracy).
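A short derivation shows why 0-1 loss leads to choosing the class with the highest posterior probability. It uses the λ(a∣ω) notation and writes a_i for the action "decide class ω_i", an indexing convention assumed here for illustration. The expected loss of a_i given an observation x is:

```latex
R(a_i \mid x) = \sum_{j} \lambda(a_i \mid \omega_j)\, P(\omega_j \mid x)
             = \sum_{j \neq i} P(\omega_j \mid x)
             = 1 - P(\omega_i \mid x)
```

Minimizing this expected loss is therefore the same as maximizing P(ω_i∣x), which is the Maximum A Posteriori rule.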
Squared Error Loss (L(θ,θ^)=(θ−θ^)²):
Commonly used in regression or estimation problems.
θ: The true value of a parameter.
θ^: Your estimated value.
Used when: Larger errors are penalized disproportionately more than smaller errors. It leads
to the posterior mean as the optimal estimator.

Absolute Error Loss (L(θ,θ^)=∣θ−θ^∣):
Also used in regression/estimation.
Used when: The penalty for errors increases linearly with the magnitude of the error. It is
more robust to outliers than squared error loss. It leads to the posterior median as the optimal
estimator.
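The link between these losses and the posterior mean and median can be checked numerically. The sketch below assumes a hypothetical posterior made of five equally likely values of θ and searches integer candidate estimates for the one with the lowest expected loss.

```python
# Hypothetical, equally weighted posterior values of theta (note the outlier 10).
samples = [1, 2, 3, 4, 10]
candidates = range(0, 11)  # integer candidate estimates 0..10

def expected_loss(estimate, loss):
    """Average loss of an estimate over the posterior values."""
    return sum(loss(theta, estimate) for theta in samples) / len(samples)

squared = lambda theta, est: (theta - est) ** 2
absolute = lambda theta, est: abs(theta - est)

best_squared = min(candidates, key=lambda e: expected_loss(e, squared))
best_absolute = min(candidates, key=lambda e: expected_loss(e, absolute))

mean = sum(samples) / len(samples)           # 4.0
median = sorted(samples)[len(samples) // 2]  # 3

print(best_squared, mean)     # squared error picks the mean
print(best_absolute, median)  # absolute error picks the median
```

The outlier 10 pulls the squared-error estimate up to 4, while the absolute-error estimate stays at 3, which illustrates the robustness claim above. The search is restricted to integers only to keep the sketch short.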

Risk:
• Risk, denoted as R(α|x), represents the average loss associated with a particular decision
(α) given the observed data (x).
• It is calculated by averaging the loss over all possible true states (θ), weighted by their
posterior probabilities given the observed data: R(α|x) = ∫ L(α, θ) p(θ|x) dθ, where
p(θ|x) is the posterior probability of θ given x.
• In essence, risk quantifies the expected penalty for making a specific decision.
Bayesian Decision Rule:
• Bayesian decision theory aims to find the decision rule (α(x)) that minimizes the overall
risk, that is, the conditional risk R(α(x)|x) averaged over all possible observations x.
• The decision rule maps each possible observation (x) to an action (α).
• The optimal decision rule selects the action that minimizes the conditional risk for each
observation.
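Written out in the L(α, θ) notation, and assuming a finite set of states θ_1, ..., θ_c (an indexing introduced here for illustration), the conditional risk, the optimal rule, and the overall risk take the following form:

```latex
R(\alpha \mid x) = \sum_{j=1}^{c} L(\alpha, \theta_j)\, P(\theta_j \mid x)

\alpha^{*}(x) = \arg\min_{\alpha} R(\alpha \mid x)

R = \int R(\alpha(x) \mid x)\, p(x)\, dx
```

Because p(x) is non-negative, choosing the action with the smallest conditional risk for every x also makes the overall risk R as small as possible.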
Example:
Imagine a spam filter. A loss function might penalize misclassifying an important email as spam
more heavily than misclassifying a spam email as important. The risk would then be the expected
cost of using a particular filtering strategy, considering the probability of different types of
emails and the associated costs of misclassification.
