Unit - II
Analysis of Machine Learning
Cross Validation and Resampling Methods
Dr. Chetan Jalendra
Department of Technology
Introduction to Cross-Validation
• Definition:
• Cross-validation is a statistical method used to estimate the performance of a
model by partitioning the data into training and validation sets multiple times.
• Goal:
• To obtain reliable error estimates and assess the performance of a model on
unseen data.
K-Fold Cross-Validation
• Process:
• Divide the dataset X into K equal parts (X_1, X_2, ..., X_K).
• Perform K iterations:
• Use one part as the validation set V_i.
• Combine the remaining K−1 parts to form the training set T_i.
• Example:
• V_1 = X_1, T_1 = X_2 ∪ X_3 ∪ ... ∪ X_K
• V_2 = X_2, T_2 = X_1 ∪ X_3 ∪ ... ∪ X_K
• ...
• V_K = X_K, T_K = X_1 ∪ X_2 ∪ ... ∪ X_(K−1)
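A minimal sketch of K-fold cross-validation, assuming scikit-learn is available; the iris dataset and logistic regression classifier are illustrative placeholders.

```python
# Minimal K-fold cross-validation sketch (illustrative dataset and classifier).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)   # K = 10 folds

# Each iteration trains on K-1 folds and validates on the held-out fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print("fold accuracies:", scores)
print("mean accuracy:", round(scores.mean(), 3))
```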
Benefits and Drawbacks of K-Fold Cross-Validation
• Benefits:
• Maximizes training set size for robust estimators.
• Balances between training and validation set sizes.
• Drawbacks:
• Smaller validation sets can lead to less reliable error estimates.
• Significant overlap in training sets (each pair shares K−2 parts).
Choosing the Optimal K
• Factors:
• Size of the dataset (N):
• Larger N: a smaller K suffices.
• Smaller N: a larger K is preferred, so that each training set stays large.
• Typical Values:
• Common choices are K = 10 or K = 30.
Leave-One-Out Cross-Validation (LOOCV)
• Definition:
• Special case of K-fold cross-validation where K = N (the number of instances in the dataset).
• Each instance is used once as the validation set, while the remaining N−1
instances form the training set.
• Use Case:
• Ideal for small datasets or applications where labeled data is hard to find
(e.g., medical diagnosis).
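A minimal LOOCV sketch, assuming scikit-learn; the dataset and classifier are illustrative placeholders.

```python
# Minimal leave-one-out cross-validation sketch (K = N).
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()   # each of the N instances is the validation set exactly once

scores = cross_val_score(KNeighborsClassifier(), X, y, cv=loo)
print("LOOCV accuracy:", scores.mean())   # average over N single-instance folds
```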
Benefits and Drawbacks of LOOCV
• Benefits:
• Maximizes the use of available data.
• Drawbacks:
• High computational cost (requires N training iterations).
• Does not allow for stratification.
Stratification in Cross-Validation
• Definition:
• Ensures class proportions are maintained in both training and validation sets.
• Prevents distortion of class prior probabilities.
• Importance:
• Especially crucial for imbalanced datasets to obtain reliable performance
estimates.
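A minimal sketch of stratified K-fold splitting with scikit-learn; the imbalanced toy labels below are illustrative.

```python
# Stratified K-fold keeps the class proportions in every fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)        # imbalanced toy labels: 90 negatives, 10 positives
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold keeps roughly the 9:1 class ratio of the full data.
    print("positives in fold:", y[val_idx].sum(), "of", len(val_idx))
```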
Multiple Runs of K-Fold Cross-Validation
• Recent Advancements:
• Increased computational power allows for multiple runs of K-fold cross-
validation (e.g., 10×10 fold).
• Process:
• Perform K-fold cross-validation multiple times.
• Average the results to obtain more reliable error estimates.
• Reference:
• Bouckaert (2003).
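A minimal sketch of repeated (10×10-fold) cross-validation using scikit-learn's RepeatedStratifiedKFold; the dataset and classifier are illustrative placeholders.

```python
# Repeated K-fold cross-validation: 10 repetitions of 10-fold CV, then average.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rkf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=rkf)
print("mean error over 10x10 folds:", round(1 - scores.mean(), 3))
```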
5×2 Cross-Validation
• Definition:
• Proposed by Dietterich (1998), uses training and validation sets of equal size.
• Process:
• Randomly divide the dataset into two equal halves, X_1^(1) and X_1^(2).
• First pair: T_1 = X_1^(1), V_1 = X_1^(2).
• Swap the roles for the second pair: T_2 = X_1^(2), V_2 = X_1^(1).
• Repeat the random split five times (replications i = 1, ..., 5) to obtain ten pairs: T_1, V_1, T_2, V_2, ..., T_10, V_10 (see the sketch below).
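A minimal sketch of 5×2 cross-validation, assuming scikit-learn; the dataset and classifier are illustrative placeholders.

```python
# 5x2 CV: 5 replications of a 50/50 split, training/validation roles swapped each time.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
errors = []
for i in range(5):                                              # 5 replications
    X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=i)
    for Xtr, ytr, Xval, yval in [(X1, y1, X2, y2), (X2, y2, X1, y1)]:  # swap roles
        clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
        errors.append(1 - accuracy_score(yval, clf.predict(Xval)))
print("10 error estimates:", np.round(errors, 3))
```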
Benefits and Drawbacks of 5×2 Cross-Validation
• Benefits:
• Equal-sized training and validation sets.
• Drawbacks:
• After five folds, the sets share many instances and overlap significantly.
• The statistics become dependent, reducing the new information obtained.
Bootstrapping
• Definition:
• An alternative to cross-validation for generating multiple samples from a
single sample.
• Process:
• Draw N instances from the original dataset of size N, with replacement, to form the bootstrap (training) sample.
• The original dataset serves as the validation set.
• Probability:
• Probability that a particular instance is never picked in N draws: (1 − 1/N)^N ≈ e^(−1) ≈ 0.368.
• The bootstrap sample therefore contains approximately 63.2% of the distinct instances.
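A minimal sketch of drawing one bootstrap sample with numpy; the sample size below is illustrative.

```python
# One bootstrap replicate: sample N instances with replacement.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
indices = rng.integers(0, N, size=N)          # draw N instance indices with replacement
distinct = np.unique(indices)
print("fraction of distinct instances in the bootstrap sample:",
      len(distinct) / N)                      # close to 1 - e^(-1) ≈ 0.632
```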
Benefits and Drawbacks of Bootstrapping
• Benefits:
• Suitable for very small datasets.
• Drawbacks:
• Bootstrap samples overlap with each other more than cross-validation samples do.
• Error estimate may be pessimistic.
• Solution: Replicate the process multiple times and average the results.
Summary and Best Practices
• Summary:
• Cross-validation is essential for model performance estimation.
• K-fold and LOOCV are common methods with specific use cases and
limitations.
• Stratification and multiple runs can improve reliability.
• 5×2 cross-validation offers balanced training and validation sets.
• Bootstrapping is ideal for small datasets.
• Best Practices:
• Choose K based on dataset size.
• Use stratification for imbalanced datasets.
• Consider computational cost and model complexity.
Measuring Classifier Performance
Metrics and Methods for Evaluating Model Accuracy
Introduction to Classifier Performance
• Definition:
• Evaluating the performance of a classifier is crucial for understanding how
well it predicts unseen data.
• Goal:
• To use various metrics and visual tools to assess and compare classifiers.
Confusion Matrix
• Definition:
• A confusion matrix is a table that is used to describe the performance of a
classification model.
• Components:
• True Positive (tp): Correctly predicted positive instances.
• True Negative (tn): Correctly predicted negative instances.
• False Positive (fp): Incorrectly predicted positive instances.
• False Negative (fn): Incorrectly predicted negative instances.
• Example:

                        Predicted Positive   Predicted Negative   Total
  True Class Positive          tp                   fn              p
  True Class Negative          fp                   tn              n
  Total                        p'                   n'              N
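A minimal sketch of building this table with scikit-learn; the label vectors are illustrative.

```python
# Confusion matrix for a two-class problem.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# With labels=[1, 0] the rows/columns follow the slide's layout: [[tp, fn], [fp, tn]].
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn = cm[0]
fp, tn = cm[1]
print("tp, fn, fp, tn =", tp, fn, fp, tn)
```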
Performance Measures
• Error Rate:
• Formula: error = (fp + fn) / N
• Measures the proportion of incorrect predictions.
• Accuracy:
• Formula: accuracy = (tp + tn) / N = 1 − error
• Measures the proportion of correct predictions.
Key Metrics for Two-class Problems
• True Positive Rate (TPR) / Sensitivity / Recall:
• Formula: tp/p
• Measures the proportion of actual positives correctly identified.
• False Positive Rate (FPR):
• Formula: fp/n
• Measures the proportion of actual negatives incorrectly identified as positive.
• Precision:
• Formula: tp/p’
• Measures the proportion of positive predictions that are actually correct.
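A minimal sketch computing these two-class metrics from tp, fn, fp, tn; the counts are illustrative.

```python
# TPR, FPR, and precision from confusion-matrix counts.
tp, fn, fp, tn = 40, 10, 5, 45
p, n = tp + fn, fp + tn          # actual positives / actual negatives
p_pred = tp + fp                 # predicted positives (p' on the slide)

tpr = tp / p                     # sensitivity / recall
fpr = fp / n                     # false positive rate
precision = tp / p_pred
print(f"TPR={tpr:.2f}  FPR={fpr:.2f}  Precision={precision:.2f}")
```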
ROC Curve
• Definition:
• The Receiver Operating Characteristics (ROC) curve plots TPR vs. FPR at
various threshold settings.
• Purpose:
• To visualize the trade-off between sensitivity and specificity for different
thresholds.
ROC Curve
• Example: Diagram of a typical ROC curve, with TPR on the y-axis and FPR on the x-axis.
Area Under the Curve (AUC)
• Definition:
• The Area Under the Curve (AUC) provides a single value summarizing the
overall performance of the classifier.
• Ideal AUC:
• AUC = 1 indicates a perfect classifier.
• Interpretation:
• Higher AUC values indicate better classifier performance.
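A minimal sketch of computing the ROC curve and AUC with scikit-learn; the dataset and classifier are illustrative placeholders.

```python
# ROC curve points and AUC for a held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=5000).fit(Xtr, ytr).predict_proba(Xte)[:, 1]
fpr, tpr, thresholds = roc_curve(yte, scores)   # TPR vs. FPR at each threshold
print("AUC:", round(roc_auc_score(yte, scores), 3))   # 1.0 would be a perfect classifier
```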
Precision-Recall Curve
• Definition:
• The Precision-Recall curve plots
precision vs. recall for different
threshold values.
• Use Case:
• Particularly useful for
imbalanced datasets where the
positive class is rare.
• Example:
• Diagram: Precision-Recall curve
illustrating the trade-off
between precision and recall.
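A minimal sketch of computing a precision-recall curve with scikit-learn; the dataset and classifier are illustrative placeholders.

```python
# Precision-recall curve and average precision for a held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
scores = LogisticRegression(max_iter=5000).fit(Xtr, ytr).predict_proba(Xte)[:, 1]

precision, recall, thresholds = precision_recall_curve(yte, scores)
print("average precision:", round(average_precision_score(yte, scores), 3))
```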
Sensitivity and Specificity
• Sensitivity (Recall):
• Formula: tp / p = tp-rate
• Measures how well the classifier identifies positive instances.
• Specificity:
• Formula: tn / n = 1 − fp-rate
• Measures how well the classifier identifies negative instances.
• Curve:
• Sensitivity vs. Specificity curve can also be plotted using different thresholds.
Class Confusion Matrix for Multi-Class Problems
• Definition:
• For K > 2 classes, the confusion matrix is a K × K matrix.
• Components:
• Entry (i, j) represents the number of instances of class C_i predicted as class C_j.
• Purpose:
• To identify which classes are frequently confused.
Example: Authentication Application
• Scenario:
• Users log on to their accounts by voice.
• Errors:
• False Positive: Impostor is wrongly logged on.
• False Negative: Valid user is refused.
• Metrics:
• TP-rate (Sensitivity): Proportion of valid users correctly authenticated.
• FP-rate: Proportion of impostors wrongly accepted.
Summary and Best Practices
• Summary:
• Various metrics are used to evaluate classifier performance, each with its own
strengths and weaknesses.
• ROC and Precision-Recall curves provide valuable visual insights.
• AUC is a useful summary statistic for comparing classifiers.
• Best Practices:
• Choose metrics based on the specific context and importance of different
types of errors.
• Use multiple metrics to get a comprehensive understanding of classifier
performance.
Understanding Interval Estimation and Hypothesis Testing
Key Techniques in Statistical Inference
Introduction
• Overview:
• Importance of statistical inference
• Key techniques: Interval Estimation and Hypothesis Testing.
• Objective:
• Understand confidence intervals
• Learn the steps of hypothesis testing.
Interval Estimation
• Definition:
• Interval estimation provides a range of values for an unknown parameter.
• Importance:
• Gives an estimate of the parameter with an associated confidence level.
Key Components of Confidence Intervals
• Point Estimate:
• Sample mean (x̄)
• Confidence Level:
• Typically 95% or 99%
• Margin of Error:
• Depends on the standard deviation and sample size
Formulas for Confidence Intervals
• When σ is known:
x̄ ± z_(α/2) · σ/√n
• When σ is unknown:
x̄ ± t_(α/2) · s/√n
• For Proportions:
p̂ ± z_(α/2) · √( p̂(1 − p̂) / n )
Example of Interval Estimation
• Scenario:
• Estimate the average weight loss of participants in a diet program.
• Given:
• Sample size (n): 30
• Sample mean (x̄): 5 kg
• Sample standard deviation (s): 1.5 kg
• Confidence Level: 95%
• Calculation:
x̄ ± t_(α/2) · s/√n
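A minimal sketch of this interval computed with scipy, using the numbers given on the slide (n = 30, x̄ = 5, s = 1.5, 95% confidence).

```python
# 95% confidence interval for the mean weight loss.
from scipy import stats

n, xbar, s = 30, 5.0, 1.5
t_crit = stats.t.ppf(0.975, df=n - 1)           # ≈ 2.045
margin = t_crit * s / n ** 0.5                  # ≈ 0.56 kg
print(f"95% CI: ({xbar - margin:.2f}, {xbar + margin:.2f})")   # ≈ (4.44, 5.56)
```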
Transition to Hypothesis Testing
• Connection:
• While interval estimation provides a range, hypothesis testing evaluates
specific claims about a parameter.
• Objective:
• Understand the process and application of hypothesis testing.
Hypothesis Testing Overview
• Definition:
• A method to test if there is enough evidence to reject a null hypothesis (H0)
in favor of an alternative hypothesis (Ha).
• Importance:
• Helps in making decisions based on data.
Steps in Hypothesis Testing
1. State the Hypotheses:
• Null Hypothesis (H0): e.g., μ = μ0
• Alternative Hypothesis (Ha): e.g., μ ≠ μ0
2. Choose Significance Level (α):
• Common values: 0.05, 0.01
3. Select Test and Calculate Test Statistic:
• e.g., t-test, z-test
4. Calculate p-value or Critical Value:
• Compare with α
5. Make Decision:
• Reject or fail to reject H0
6. Draw Conclusion:
• Interpret results in context
Types of Tests
• t-test:
• Used to compare means
• z-test:
• For large samples or known variance
• Chi-square test:
• For categorical data
• ANOVA:
• For comparing means among three or more groups
Example of Hypothesis Testing
• Scenario:
• Test if the mean weight loss is different from a hypothesized value of 4.5 kg.
• Given:
• Sample size (n): 30
• Sample mean (x̄): 5 kg
• Sample standard deviation (s): 1.5 kg
• Hypothesized mean (μ0): 4.5 kg
• Significance level (α): 0.05
Calculating the Test Statistic and Determining the Critical Value
• Formula:
t = (x̄ − μ0) / (s/√n)
• t = (5 − 4.5) / (1.5/√30) ≈ 1.83
• Degrees of Freedom: df = 29
• t-table: t(0.025, 29) ≈ 2.045
• Decision:
• Since |1.83| < 2.045, fail to reject H0.
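A minimal sketch of this one-sample t-test computed with scipy, using the summary statistics from the slide.

```python
# One-sample t-test: H0 mu = 4.5 vs. Ha mu != 4.5, from summary statistics.
from scipy import stats

n, xbar, s, mu0 = 30, 5.0, 1.5, 4.5
t_stat = (xbar - mu0) / (s / n ** 0.5)            # ≈ 1.83
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)      # ≈ 2.045
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
print(f"t = {t_stat:.2f}, critical = {t_crit:.3f}, p = {p_value:.3f}")
# |1.83| < 2.045 (p > 0.05), so we fail to reject H0.
```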
Conclusion of Hypothesis Testing
• Interpretation:
• There is not enough evidence to suggest that the mean weight loss is
significantly different from 4.5 kg.
Assessing a Classification Algorithm’s Performance
Understanding Error Rate Assessment and Comparison
Introduction
• Purpose: Evaluating the performance of classification algorithms by
assessing error rates.
• Scope:
• Primarily focuses on classification error rates.
• Methods are also applicable to:
• Squared error in regression: Measures the average squared difference between
observed and predicted values.
• Log likelihoods in unsupervised learning: Evaluates the fit of a model to the data using
likelihood functions.
• Expected reward in reinforcement learning: Assesses the long-term return an agent
expects to achieve from a policy.
Introduction: Key Concepts
• Hypothesis Testing: Determines if observed error rates are statistically
significant.
• Parametric Tests: Assume a specific distribution for the error rate
(e.g., binomial, normal).
• Nonparametric Tests: Do not assume any specific distribution for the
error rate, useful when the parametric form is unknown.
Introduction: Parametric Tests
• Binomial Test: Tests the hypothesis about the probability of misclassification errors.
• Assumes the number of errors follows a binomial distribution.
• Example: Testing if the error rate is less than or equal to a specified value p0.
• Approximate Normal Test: Uses the normal approximation of the binomial distribution for large sample sizes.
• Applies the Central Limit Theorem to approximate the binomial distribution with a normal distribution.
• Example: Testing if the observed error rate is significantly greater than a specified value p0.
• t-Test: Compares the average error rate across multiple runs to a specified value p0.
• Uses the t-distribution to account for variability in multiple runs.
• Example: Running the algorithm multiple times and testing if the average error rate is significantly different from p0.
Introduction: Nonparametric Tests
• Advantages: Do not rely on assumptions about the underlying
distribution of the error rate.
• Example Methods:
• Wilcoxon Signed-Rank Test: Compares paired samples to assess whether their
population mean ranks differ.
• Mann-Whitney U Test: Compares differences between two independent
samples.
• Applications: Useful when the data do not meet the assumptions
required for parametric tests or when dealing with small sample sizes.
Binomial Test - Setup
• Single training set T and single validation set V
• Train classifier on T and test on V
• p: probability of misclassification
• x_t: indicator of misclassification for instance t (0/1 Bernoulli variable)
• X: total number of errors, X = Σ_{t=1}^{N} x_t
Binomial Test - Example
• Example: 100 validation samples, observed 30 errors
• Null Hypothesis (H0): p ≤ 0.25
• Alternative Hypothesis (H1): p > 0.25
• Step 1: Define the Binomial Random Variable
• The number of errors X follows a binomial distribution: X ~ Binomial(n = 100, p = 0.25).
Binomial Test - Example
• Step 2: Calculate the Probability Mass Function
• The probability mass function of a binomial random variable is:
P(X = x) = C(n, x) · p^x · (1 − p)^(n − x)
where C(n, x) is the binomial coefficient, p is the probability of error, n is the number of trials, and x is the number of successes (errors in this case).
• Step 3: Calculate the Cumulative Probability P(X ≥ 30)
• To find P(X ≥ 30), sum the probabilities from X = 30 to X = 100:
P(X ≥ 30) = Σ_{x=30}^{100} C(100, x) · 0.25^x · (0.75)^(100 − x)
Binomial Test - Example
• Step 4: Compute Each Term
• Calculate each term in the sum. For example, for x = 30:
P(X = 30) = C(100, 30) · 0.25^30 · (0.75)^70, where C(100, 30) = 100! / (30! · 70!)
• Calculate this using a calculator or computational tool for precision.
• Step 5: Sum the Probabilities
• Sum the probabilities for x = 30 to x = 100. This is computationally intensive and is typically done using statistical software or a binomial calculator (see the sketch below).
• Step 6: Compare to the Significance Level
• Compare the cumulative probability to the significance level α = 0.05. If P(X ≥ 30) < 0.05, reject H0; otherwise, do not reject H0.
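A minimal sketch of this computation with scipy; it evaluates P(X ≥ 30) directly instead of summing terms by hand.

```python
# Exact binomial tail probability P(X >= 30) under H0: p = 0.25, n = 100.
from scipy import stats

n, x, p0 = 100, 30, 0.25
p_value = stats.binom.sf(x - 1, n, p0)     # P(X >= 30) = 1 - P(X <= 29)
print(f"P(X >= 30) = {p_value:.3f}")       # ≈ 0.15, which is > 0.05, so do not reject H0
```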
Approximate Normal Test
• The approximate normal test is used to test a hypothesis about the error rate of a classifier when
the number of validation samples is large.
• This test uses the Central Limit Theorem to approximate the binomial distribution with a normal
distribution.
• Steps:
1. Define the Problem:
• You have a single training set T and a single validation set V.
• Train the classifier on T and test it on V.
• Let p be the probability of misclassification (error rate) which we want to estimate or test.
2. Hypothesis:
• Null hypothesis H0: p ≤ p0
• Alternative hypothesis H1: p > p0
3. Binomial Distribution Approximation:
• Let X be the total number of errors.
• The point estimate of p is p̂ = X/N, where N is the total number of validation samples.
Approximate Normal Test
• Central Limit Theorem:
• For large N, the distribution of p̂ is approximately normal with mean p0 and
variance p0(1 − p0)/N.
• Test Statistic:
• Calculate the z-score:
z = (p̂ − p0) / √( p0(1 − p0) / N )
• Decision Rule:
• Compare the z-score to the critical value from the standard normal
distribution (e.g., z0.05 = 1.64 for a significance level α = 0.05).
• Reject H0 if z is greater than the critical value.
Approximate Normal Test - Example
• Problem Setup:
• You have a validation set with 100 samples.
• You observed 30 errors.
• You want to test H0: p ≤ 0.25 against H1: p > 0.25.
Step-by-Step Calculation:
1. Calculate the Point Estimate: p̂ = X/N = 30/100 = 0.30
2. Calculate the Standard Error:
Standard Error = √( p0(1 − p0) / N ) = √( 0.25 × 0.75 / 100 ) ≈ 0.0433
3. Calculate the z-score:
z = (p̂ − p0) / Standard Error = (0.30 − 0.25) / 0.0433 ≈ 1.15
Approximate Normal Test - Example
4. Decision Rule:
• For a significance level α = 0.05, the critical value z0.05 ≈ 1.64.
• Compare the calculated z-score (1.15) with the critical value (1.64).
• Since z = 1.15 is less than 1.64, we do not reject the null hypothesis
H0: p ≤ 0.25. This means there is not enough evidence to conclude that the
error rate is greater than 0.25.
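A minimal sketch of the approximate normal (z) test above, computed with scipy (N = 100, 30 errors, p0 = 0.25).

```python
# Approximate normal test for H0: p <= 0.25 vs. H1: p > 0.25.
from math import sqrt
from scipy import stats

N, errors, p0 = 100, 30, 0.25
p_hat = errors / N
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / N)     # ≈ 1.15
z_crit = stats.norm.ppf(0.95)                  # ≈ 1.64 for alpha = 0.05 (one-sided)
print(f"z = {z:.2f}, critical = {z_crit:.2f}")
# 1.15 < 1.64, so we do not reject H0: p <= 0.25.
```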
Interpretation
• The approximate normal test helps us determine whether the observed
error rate is significantly higher than a specified threshold.
• In this example, the observed error rate (30%) was not significantly higher
than the threshold (25%) at the 0.05 significance level, so we did not reject
the null hypothesis.
t-Test - Setup
• Multiple training/validation sets (K pairs)
• Error percentages p_i for each validation set i
• Average error rate:
m = (1/K) Σ_{i=1}^{K} p_i
t-Test - Example
• Example: 10 validation sets, error percentages p_i:
0.28, 0.32, 0.30, 0.27, 0.31, 0.29, 0.33, 0.26, 0.34, 0.25
• Calculate m:
m = (1/10) Σ_{i=1}^{10} p_i = 0.295
• Calculate the sample variance:
S² = (1/9) Σ_{i=1}^{10} (p_i − 0.295)² ≈ 0.00092
• Standard deviation: S ≈ 0.0303
t-Test – Decision Rule
• Null Hypothesis (H0): p ≤ 0.25
• Calculate the t-statistic:
t = (m − p0) / (S/√K) = (0.295 − 0.25) / (0.0303/√10) ≈ 4.70
• Compare to t(0.05, 9) = 1.833
• Since 4.70 > 1.833, reject H0
• Conclude p > 0.25
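A minimal sketch (assuming numpy and scipy) that reproduces the calculation above from the ten error rates on the slide.

```python
# One-sample t-test on error rates from K = 10 validation sets: H0: p <= 0.25.
import numpy as np
from scipy import stats

p = np.array([0.28, 0.32, 0.30, 0.27, 0.31, 0.29, 0.33, 0.26, 0.34, 0.25])
p0, K = 0.25, len(p)

m, s = p.mean(), p.std(ddof=1)                    # m = 0.295, s ≈ 0.0303
t_stat = (m - p0) / (s / np.sqrt(K))              # ≈ 4.70
t_crit = stats.t.ppf(0.95, df=K - 1)              # ≈ 1.833 (one-sided, alpha = 0.05)
print(f"t = {t_stat:.2f}, critical = {t_crit:.3f}")
# t exceeds the critical value, so reject H0: p <= 0.25.
```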
Key Points
• Binomial Test: Used for single validation set, evaluates if observed
error rate is higher than specified value.
• Approximate Normal Test: Normal approximation for large sample
sizes, used for single validation set.
• t Test: Used for multiple validation sets, assesses significance of
average error rate.
Comparing Two Classification Algorithms
Statistical Tests for Evaluating Error Rates
Introduction
• Objective: Compare two classifiers to test if they have the same
expected error rate.
• Approach: Use statistical tests to evaluate whether the error rates of
two classifiers are significantly different.
• Methods Covered:
• McNemar’s Test
• K-Fold Cross-Validated Paired t-Test
• 5 × 2 CV Paired t-Test
• 5 × 2 CV Paired F-Test
McNemar’s Test
• Purpose: Test if two classifiers have the same error rate.
• Contingency Table:
• e00: Misclassified by both classifiers
• e01: Misclassified by Classifier 1 but not 2
• e10: Misclassified by Classifier 2 but not 1
• e11: Correctly classified by both classifiers
• Chi-Square Statistic:
χ² = ( |e01 − e10| − 1 )² / (e01 + e10)
• Decision Rule: Reject the null hypothesis if χ² > χ²(α, 1), where χ²(0.05, 1) = 3.84
Example of McNemar’s Test
• Scenario: Compare Classifier A and Classifier B on a validation set.
• No. of examples misclassified by both – 50
• No. of examples misclassified by A but not B – 25
• No. of examples misclassified by B but not A – 15
• No. of examples correctly classified by both - 50
• Under the null hypothesis that the classification algorithms have the same error
rate, we expect e01 = e10 and these to be equal to (e01+e10)/2. We have the chi-
square statistic with one degree of freedom
• Calculate the Statistic:
χ² = ( |25 − 15| − 1 )² / (25 + 15) = (10 − 1)² / 40 = 81/40 = 2.025
• Interpretation: Since 2.025 < 3.84, fail to reject the null hypothesis.
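A minimal sketch of this McNemar calculation using the continuity-corrected chi-square statistic (e01 = 25, e10 = 15).

```python
# McNemar's test from the off-diagonal disagreement counts.
from scipy import stats

e01, e10 = 25, 15
chi2 = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)     # = 2.025
chi2_crit = stats.chi2.ppf(0.95, df=1)             # ≈ 3.84
print(f"chi2 = {chi2:.3f}, critical = {chi2_crit:.2f}")
# 2.025 < 3.84, so we fail to reject the hypothesis of equal error rates.
```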
K-Fold Cross-Validated Paired t-Test
• Purpose: Evaluate if the difference in error rates between two classifiers is statistically
significant.
• If the two classification algorithms have the same error rate, then we expect them to
have the same mean, or equivalently, that the difference of their means is 0.
• Procedure:
1. Perform K-Fold Cross-Validation.
2. Record error rates for both classifiers on each fold.
3. Calculate the difference p_i = p_i^1 − p_i^2 for each fold.
4. Test if the mean difference is zero.
• t-Statistic:
t = (√K · m) / S, which follows a t-distribution with K − 1 degrees of freedom
• Where: m = mean of the differences, S = standard deviation of the differences
• Decision Rule: Reject the null hypothesis if t is outside (−t(α/2, K−1), t(α/2, K−1)).
Example of K-Fold Cross-Validated Paired t-Test
• Scenario: Use 5-Fold Cross-Validation.
• Results: Differences in error rates: [0.02, −0.01, 0.03, −0.02, 0.01]
• Calculate:
• Mean difference m = 0.006
• Variance S² ≈ 0.00043, so S ≈ 0.021
• t-Statistic for 4 degrees of freedom:
t = (√5 × 0.006) / 0.021 ≈ 0.65
• Interpretation: Compare with t(0.025, 4) = 2.776. Since 0.65 < 2.776, fail to
reject the null hypothesis.
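A minimal sketch (assuming numpy and scipy) that reproduces the paired t-test above from the five fold-wise differences.

```python
# K-fold cross-validated paired t-test on fold-wise error-rate differences.
import numpy as np
from scipy import stats

d = np.array([0.02, -0.01, 0.03, -0.02, 0.01])    # p_i = p_i^1 - p_i^2
K = len(d)
m, s = d.mean(), d.std(ddof=1)                    # m = 0.006, s ≈ 0.021

t_stat = np.sqrt(K) * m / s                       # ≈ 0.65
t_crit = stats.t.ppf(1 - 0.05 / 2, df=K - 1)      # ≈ 2.776
print(f"t = {t_stat:.2f}, critical = {t_crit:.3f}")
# |t| < 2.776, so we fail to reject the hypothesis of equal error rates.
```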
5 × 2 CV Paired t-Test
• Purpose: Extended version of paired t-test using 5 × 2 cross-validation.
• Proposed by Dietterich (1998).
• Procedure:
1. Perform 5 × 2 Cross-Validation.
2. Compute the error rate difference for each fold and replication.
3. Calculate the average and variance of the differences.
• p_i^(j) is the difference between the error rates of the two classifiers on fold j =
1, 2 of replication i = 1, ..., 5.
• The average on replication i is p̄_i = (p_i^(1) + p_i^(2)) / 2, and the estimated
variance is s_i² = (p_i^(1) − p̄_i)² + (p_i^(2) − p̄_i)².
• Null hypothesis: the two classification algorithms have the same error rate.
5 × 2 CV Paired t-Test
• t-Statistic:
t = p_1^(1) / √( (1/5) Σ_{i=1}^{5} s_i² ), which follows a t-distribution with 5 degrees of freedom
• Where: s_i² = variance for replication i
• Decision Rule: Reject the null hypothesis if t is outside (−t(α/2, 5), t(α/2, 5)).
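A minimal sketch of the 5×2 CV paired t statistic; the 5×2 array of error-rate differences below is purely illustrative.

```python
# 5x2 CV paired t-test statistic from a 5x2 array of error-rate differences.
import numpy as np
from scipy import stats

# p[i, j]: difference in error rates on fold j of replication i (illustrative values)
p = np.array([[0.02, 0.01],
              [0.03, -0.01],
              [0.01, 0.02],
              [0.00, 0.01],
              [0.02, 0.00]])

p_bar = p.mean(axis=1)                              # per-replication averages
s2 = ((p - p_bar[:, None]) ** 2).sum(axis=1)        # per-replication variances

t_stat = p[0, 0] / np.sqrt(s2.mean())               # numerator uses p_1^(1)
t_crit = stats.t.ppf(1 - 0.05 / 2, df=5)            # ≈ 2.571
print(f"t = {t_stat:.2f}, critical = {t_crit:.3f}")
```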
5 × 2 CV Paired F-Test
• Purpose: Extend the 5 × 2 CV t-test using the F-distribution.
• Procedure:
1. Compute squared differences and variances.
2. Combine results into an F-statistic.
• F-Statistic:
F = Σ_{i=1}^{5} Σ_{j=1}^{2} (p_i^(j))² / ( 2 Σ_{i=1}^{5} s_i² ), which follows an F-distribution with 10 and 5 degrees of freedom
• Decision Rule: Reject the null hypothesis if F > F(α, 10, 5).
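A minimal sketch of the 5×2 CV F statistic, using the same illustrative difference array as in the t-test sketch above.

```python
# 5x2 CV paired F-test statistic from a 5x2 array of error-rate differences.
import numpy as np
from scipy import stats

p = np.array([[0.02, 0.01], [0.03, -0.01], [0.01, 0.02], [0.00, 0.01], [0.02, 0.00]])
s2 = ((p - p.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)   # per-replication variances

f_stat = (p ** 2).sum() / (2 * s2.sum())
f_crit = stats.f.ppf(0.95, dfn=10, dfd=5)            # F(0.05, 10, 5) ≈ 4.74
print(f"F = {f_stat:.2f}, critical = {f_crit:.2f}")
# Reject the hypothesis of equal error rates only if F exceeds the critical value.
```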
Summary
• McNemar’s Test: Good for a quick comparison based on misclassification
counts.
• K-Fold Cross-Validated Paired t-Test: Uses multiple folds for a robust
comparison of means.
• 5 × 2 CV Paired t-Test: Adds multiple replications for a more reliable
comparison.
• 5 × 2 CV Paired F-Test: Provides a detailed comparison using the F-
distribution.