
Unit - II

Analysis of Machine Learning


Cross Validation and Resampling Methods

Dr. Chetan Jalendra


Department of Technology
Introduction to Cross-Validation
• Definition:
• Cross-validation is a statistical method used to estimate the performance of a
model by partitioning the data into training and validation sets multiple times.

• Goal:
• To obtain reliable error estimates and assess the performance of a model on unseen data.
K-Fold Cross-Validation
• Process:
• Divide the dataset X into K equal parts (X1, X2, ..., XK).
• Perform K iterations:
• Use one part as the validation set Vi.
• Combine the remaining K−1 parts to form the training set Ti.
• Example:
• V1 = X1, T1 = X2 ∪ X3 ∪ ⋯ ∪ XK
• V2 = X2, T2 = X1 ∪ X3 ∪ ⋯ ∪ XK
• ...
• VK = XK, TK = X1 ∪ X2 ∪ ⋯ ∪ XK−1
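For concreteness, here is a minimal sketch of the K-fold split using scikit-learn's KFold (assuming scikit-learn and NumPy are available; X and y are small toy arrays, not data from these slides):

from sklearn.model_selection import KFold
import numpy as np

X = np.arange(20).reshape(10, 2)   # toy feature matrix (hypothetical)
y = np.array([0, 1] * 5)           # toy labels (hypothetical)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # T_i = training indices, V_i = validation indices for fold i
    print(f"Fold {i}: train={train_idx}, val={val_idx}")

Setting the number of splits equal to N (here n_splits=10) would give leave-one-out cross-validation on this toy set.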
K-Fold Cross-Validation
Benefits and Drawbacks of K-Fold Cross-
Validation
• Benefits:
• Maximizes training set size for robust estimators.
• Balances between training and validation set sizes.
• Drawbacks:
• Smaller validation sets can lead to less reliable error estimates.
• Significant overlap in training sets (each pair shares K−2 parts).
Choosing the Optimal K
• Factors:
• Size of the dataset (N):
• Larger N: Smaller K.
• Smaller N: Larger K.
• Typical Values:
• Common choices are K=10 or K=30.
Leave-One-Out Cross-Validation (LOOCV)
• Definition:
• Special case of K-fold where K=N (number of instances in the dataset).
• Each instance is used once as the validation set, while the remaining N−1
instances form the training set.
• Use Case:
• Ideal for small datasets or applications where labeled data is hard to find
(e.g., medical diagnosis).
Leave-One-Out Cross-Validation (LOOCV)
Benefits and Drawbacks of LOOCV
• Benefits:
• Maximizes the use of available data.
• Drawbacks:
• High computational cost (requires N training iterations).
• Does not allow for stratification.
Stratification in Cross-Validation
• Definition:
• Ensures class proportions are maintained in both training and validation sets.
• Prevents distortion of class prior probabilities.
• Importance:
• Especially crucial for imbalanced datasets to obtain reliable performance
estimates.
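A minimal sketch of stratified splitting with scikit-learn's StratifiedKFold (assumed available; the imbalanced toy labels are hypothetical):

from sklearn.model_selection import StratifiedKFold
import numpy as np

X = np.arange(40).reshape(20, 2)        # toy features (hypothetical)
y = np.array([0] * 16 + [1] * 4)        # imbalanced toy labels: 80% / 20%

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # every validation fold keeps the 80/20 class proportions
    print(np.bincount(y[val_idx]))      # prints [4 1] for each fold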
Multiple Runs of K-Fold Cross-Validation
• Recent Advancements:
• Increased computational power allows for multiple runs of K-fold cross-
validation (e.g., 10×10 fold).
• Process:
• Perform K-fold cross-validation multiple times.
• Average the results to obtain more reliable error estimates.
• Reference:
• Bouckaert (2003).
5×2 Cross-Validation
• Definition:
• Proposed by Dietterich (1998), uses training and validation sets of equal size.
• Process:
• Divide the dataset randomly into two equal halves, X1(1) and X1(2).
• First pair: T1 = X1(1), V1 = X1(2).
• Swap the roles for the second pair: T2 = X1(2), V2 = X1(1).
• Repeat the random split for five replications, giving ten pairs in total: (T1, V1), (T2, V2), ..., (T10, V10).
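One way to generate the ten training/validation pairs is to repeat a shuffled 2-fold split five times; a sketch assuming scikit-learn (the data are toy values):

from sklearn.model_selection import KFold
import numpy as np

X = np.arange(40).reshape(20, 2)    # toy data (hypothetical)
y = np.array([0, 1] * 10)

pairs = []
for rep in range(5):                                # 5 replications
    kf = KFold(n_splits=2, shuffle=True, random_state=rep)
    for train_idx, val_idx in kf.split(X):          # 2 folds per replication
        pairs.append((train_idx, val_idx))          # (T, V) with |T| == |V|

print(len(pairs))                                   # 10 equal-sized train/validation pairs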
Benefits and Drawbacks of 5×2 Cross-
Validation
• Benefits:
• Equal-sized training and validation sets.
• Drawbacks:
• After five folds, the sets share many instances and overlap significantly.
• The statistics become dependent, reducing the new information obtained.
Bootstrapping
• Definition:
• An alternative to cross-validation for generating multiple samples from a
single sample.
• Process:
• Draw instances from the original dataset with replacement.
• The original dataset serves as the validation set.
• Probability:
• Probability that a given instance is never picked in N draws: (1 − 1/N)^N ≈ e^(−1) ≈ 0.368.
• The bootstrap training sample therefore contains approximately 63.2% of the distinct instances.
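A short sketch of bootstrap resampling with NumPy (assumed available), which also illustrates the 63.2% figure empirically:

import numpy as np

rng = np.random.default_rng(0)
N = 1000
data = np.arange(N)                              # stands in for a dataset of N instances (hypothetical)

boot_idx = rng.integers(0, N, size=N)            # draw N instances with replacement
bootstrap_sample = data[boot_idx]                # training sample
distinct = np.unique(boot_idx).size / N          # fraction of original instances that were picked
print(distinct)                                  # close to 1 - 1/e ~= 0.632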
Bootstrapping
Benefits and Drawbacks of Bootstrapping
• Benefits:
• Suitable for very small datasets.
• Drawbacks:
• Bootstrap samples overlap more than cross-validation samples.
• Error estimate may be pessimistic.
• Solution: Replicate the process multiple times and average the results.
Summary and Best Practices
• Summary:
• Cross-validation is essential for model performance estimation.
• K-fold and LOOCV are common methods with specific use cases and
limitations.
• Stratification and multiple runs can improve reliability.
• 5×2 cross-validation offers balanced training and validation sets.
• Bootstrapping is ideal for small datasets.
• Best Practices:
• Choose K based on dataset size.
• Use stratification for imbalanced datasets.
• Consider computational cost and model complexity.
Measuring Classifier Performance

Metrics and Methods for Evaluating Model Accuracy


Introduction to Classifier Performance
• Definition:
• Evaluating the performance of a classifier is crucial for understanding how
well it predicts unseen data.
• Goal:
• To use various metrics and visual tools to assess and compare classifiers.
Confusion Matrix
• Definition:
• A confusion matrix is a table that is used to describe the performance of a
classification model.
• Components:
• True Positive (tp): Correctly predicted positive instances.
• True Negative (tn): Correctly predicted negative instances.
• False Positive (fp): Incorrectly predicted positive instances.
• False Negative (fn): Incorrectly predicted negative instances.
• Example:
                          Predicted Positive   Predicted Negative   Total
  True Class Positive     tp                   fn                   p
  True Class Negative     fp                   tn                   n
  Total                   p'                   n'                   N
Performance Measures
• Error Rate:
• Formula: (fp + fn)/N
• Measures the proportion of incorrect predictions.
• Accuracy:
• Formula: (tp + tn)/N = 1 − error rate
• Measures the proportion of correct predictions.
Key Metrics for Two-class Problems
• True Positive Rate (TPR) / Sensitivity / Recall:
• Formula: tp/p
• Measures the proportion of actual positives correctly identified.
• False Positive Rate (FPR):
• Formula: fp/n
• Measures the proportion of actual negatives incorrectly identified as positive.
• Precision:
• Formula: tp/p’
• Measures the proportion of positive predictions that are actually correct.
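A small sketch computing these measures directly from confusion-matrix counts (the counts below are hypothetical):

# tp, fn, fp, tn are hypothetical confusion-matrix counts
tp, fn, fp, tn = 40, 10, 5, 45
N = tp + fn + fp + tn
p, n = tp + fn, fp + tn            # actual positives / actual negatives
p_pred = tp + fp                   # predicted positives (p')

error     = (fp + fn) / N
accuracy  = (tp + tn) / N          # = 1 - error
tpr       = tp / p                 # sensitivity / recall
fpr       = fp / n
precision = tp / p_pred
print(error, accuracy, tpr, fpr, precision)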
ROC Curve
• Definition:
• The Receiver Operating Characteristics (ROC) curve plots TPR vs. FPR at
various threshold settings.
• Purpose:
• To visualize the trade-off between sensitivity and specificity for different
thresholds.
ROC Curve
Example: Typical ROC curve with TPR on the y-axis and FPR on the x-axis.
Area Under the Curve (AUC)
• Definition:
• The Area Under the Curve (AUC) provides a single value summarizing the
overall performance of the classifier.
• Ideal AUC:
• AUC = 1 indicates a perfect classifier.
• Interpretation:
• Higher AUC values indicate better classifier performance.
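A minimal sketch of an ROC curve and its AUC with scikit-learn (assumed available; labels and scores are hypothetical):

from sklearn.metrics import roc_curve, roc_auc_score
import numpy as np

y_true   = np.array([0, 0, 1, 1, 0, 1, 1, 0])                      # hypothetical labels
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])    # hypothetical classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_scores)
print(auc)                                           # 1.0 for a perfect ranking, 0.5 for a random one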
Precision-Recall Curve
• Definition:
• The Precision-Recall curve plots
precision vs. recall for different
threshold values.
• Use Case:
• Particularly useful for
imbalanced datasets where the
positive class is rare.
• Example:
• Diagram: Precision-Recall curve
illustrating the trade-off
between precision and recall.
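The corresponding curve can be obtained the same way with scikit-learn's precision_recall_curve (same hypothetical labels and scores as above):

from sklearn.metrics import precision_recall_curve
import numpy as np

y_true   = np.array([0, 0, 1, 1, 0, 1, 1, 0])                      # hypothetical labels
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])    # hypothetical scores

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
# each (recall[i], precision[i]) pair corresponds to one decision threshold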
Sensitivity and Specificity
• Sensitivity (Recall):
• Formula: tp /p = tp-rate
• Measures how well the classifier identifies positive instances.
• Specificity:
• Formula: tn /n = 1 - fp-rate
• Measures how well the classifier identifies negative instances.
• Curve:
• Sensitivity vs. Specificity curve can also be plotted using different thresholds.
Class Confusion Matrix for Multi-Class
Problems
• Definition:
• For K > 2 classes, the confusion matrix is a K × K matrix.
• Components:
• Entry (i, j) represents the number of instances of class Ci predicted as class Cj.
• Purpose:
• To identify which classes are frequently confused.
Example: Authentication Application
• Scenario:
• Users log on to their accounts by voice.
• Errors:
• False Positive: Impostor is wrongly logged on.
• False Negative: Valid user is refused.
• Metrics:
• TP-rate (Sensitivity): Proportion of valid users correctly authenticated.
• FP-rate: Proportion of impostors wrongly accepted.
Summary and Best Practices
• Summary:
• Various metrics are used to evaluate classifier performance, each with its own
strengths and weaknesses.
• ROC and Precision-Recall curves provide valuable visual insights.
• AUC is a useful summary statistic for comparing classifiers.
• Best Practices:
• Choose metrics based on the specific context and importance of different
types of errors.
• Use multiple metrics to get a comprehensive understanding of classifier
performance.
Understanding Interval Estimation and
Hypothesis Testing
Key Techniques in Statistical Inference
Introduction
• Overview:
• Importance of statistical inference
• Key techniques: Interval Estimation and Hypothesis Testing.
• Objective:
• Understand confidence intervals
• Learn the steps of hypothesis testing.
Interval Estimation
• Definition:
• Interval estimation provides a range of values for an unknown parameter.
• Importance:
• Gives an estimate of the parameter with an associated confidence level.
Key Components of Confidence Intervals
• Point Estimate:
• Sample mean (x̄)
• Confidence Level:
• Typically 95% or 99%
• Margin of Error:
• Depends on the standard deviation and sample size
Formulas for Confidence Intervals
• When σ is known:
  x̄ ± z_{α/2} · σ/√n
• When σ is unknown:
  x̄ ± t_{α/2} · s/√n
• For Proportions:
  p̂ ± z_{α/2} · √(p̂(1 − p̂)/n)
Example of Interval Estimation
• Scenario:
• Estimate the average weight loss of participants in a diet program.
• Given:
• Sample size (n): 30
• Sample mean (x̄): 5 kg
• Sample standard deviation (s): 1.5 kg
• Confidence Level: 95%
• Calculation:
• x̄ ± t_{α/2} · s/√n = 5 ± 2.045 · 1.5/√30 ≈ 5 ± 0.56, i.e. approximately (4.44, 5.56) kg
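The same interval can be computed with SciPy (assuming scipy is installed), using the example's summary values:

import numpy as np
from scipy import stats

n, x_bar, s = 30, 5.0, 1.5                    # values from the example
t_crit = stats.t.ppf(0.975, df=n - 1)         # two-sided 95% critical value, ~= 2.045
margin = t_crit * s / np.sqrt(n)
print(x_bar - margin, x_bar + margin)         # ~= (4.44, 5.56) kg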
Transition to Hypothesis Testing
• Connection:
• While interval estimation provides a range, hypothesis testing evaluates
specific claims about a parameter.
• Objective:
• Understand the process and application of hypothesis testing.
Hypothesis Testing Overview
• Definition:
• A method to test if there is enough evidence to reject a null hypothesis (H0)
in favor of an alternative hypothesis (Ha).
• Importance:
• Helps in making decisions based on data.
Steps in Hypothesis Testing
1. State the Hypotheses:
• Null Hypothesis (H0): e.g., μ = μ0
• Alternative Hypothesis (Ha): e.g., μ ≠ μ0
2. Choose Significance Level (α):
• Common values: 0.05, 0.01
3. Select Test and Calculate Test Statistic:
• e.g., t-test, z-test
4. Calculate p-value or Critical Value:
• Compare with α
5. Make Decision:
• Reject or fail to reject H0
6. Draw Conclusion:
• Interpret results in context
Types of Tests
• t-test:
• Used to compare means
• z-test:
• For large samples or known variance
• Chi-square test:
• For categorical data
• ANOVA:
• For comparing means among three or more groups
Example of Hypothesis Testing
• Scenario:
• Test if the mean weight loss is different from a hypothesized value of 4.5 kg.
• Given:
• Sample size (n): 30
• Sample mean (x̄): 5 kg
• Sample standard deviation (s): 1.5 kg
• Hypothesized mean (μ0): 4.5 kg
• Significance level (α): 0.05
Calculating the Test Statistic and Determining
the Critical Value
• Formula:
  t = (x̄ − μ0) / (s/√n)
• t = (5 − 4.5) / (1.5/√30) ≈ 1.83
• Degrees of Freedom: df = 29
• t-table: t_{0.025,29} ≈ 2.045
• Decision:
• Since |1.83| < 2.045, fail to reject H0.
Conclusion of Hypothesis Testing
• Interpretation:
• There is not enough evidence to suggest that the mean weight loss is
significantly different from 4.5 kg.
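A short check of the same test with SciPy (assumed available), working from the summary statistics rather than raw data:

import numpy as np
from scipy import stats

n, x_bar, s, mu0, alpha = 30, 5.0, 1.5, 4.5, 0.05   # values from the example
t_stat = (x_bar - mu0) / (s / np.sqrt(n))           # ~= 1.83
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)     # two-sided p-value, ~= 0.08
print(t_stat, p_value, p_value < alpha)             # p_value >= alpha -> fail to reject H0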
Assessing a Classification Algorithm’s
Performance
Understanding Error Rate Assessment and Comparison
Introduction
• Purpose: Evaluating the performance of classification algorithms by
assessing error rates.
• Scope:
• Primarily focuses on classification error rates.
• Methods are also applicable to:
• Squared error in regression: Measures the average squared difference between
observed and predicted values.
• Log likelihoods in unsupervised learning: Evaluates the fit of a model to the data using
likelihood functions.
• Expected reward in reinforcement learning: Assesses the long-term return an agent
expects to achieve from a policy.
Introduction: Key Concepts
• Hypothesis Testing: Determines if observed error rates are statistically
significant.
• Parametric Tests: Assume a specific distribution for the error rate
(e.g., binomial, normal).
• Nonparametric Tests: Do not assume any specific distribution for the
error rate, useful when the parametric form is unknown.
Introduction: Parametric Tests
• Binomial Test: Tests a hypothesis about the probability of misclassification errors.
• Assumes the number of errors follows a binomial distribution.
• Example: Testing if the error rate is less than or equal to a specified value p0.
• Approximate Normal Test: Uses the normal approximation of the binomial distribution for large sample sizes.
• Applies the Central Limit Theorem to approximate the binomial distribution with a normal distribution.
• Example: Testing if the observed error rate is significantly greater than a specified value p0.
• t-Test: Compares the average error rate across multiple runs to a specified value p0.
• Uses the t-distribution to account for variability in multiple runs.
• Example: Running the algorithm multiple times and testing if the average error rate is significantly different from p0.
Introduction: Nonparametric Tests
• Advantages: Do not rely on assumptions about the underlying
distribution of the error rate.
• Example Methods:
• Wilcoxon Signed-Rank Test: Compares paired samples to assess whether their
population mean ranks differ.
• Mann-Whitney U Test: Compares differences between two independent
samples.
• Applications: Useful when the data do not meet the assumptions
required for parametric tests or when dealing with small sample sizes.
Binomial Test - Setup
• Single training set T and single validation set V
• Train classifier on T and test on V
• p: probability of misclassification
• x^t: indicator of misclassification on instance t (0/1 Bernoulli variable)
• X: total number of errors (X = Σ_{t=1}^{N} x^t)
Binomial Test - Example
• Example: 100 validation samples, observed 30 errors
• Null Hypothesis (H0): p ≤ 0.25
• Alternative Hypothesis (H1): p > 0.25

• Step 1: Define the Binomial Random Variable
• The number of errors X follows a binomial distribution: X ~ Binomial(n = 100, p = 0.25).
Binomial Test - Example
• Step 2: Calculate the Probability Mass Function
• The probability mass function for a binomial random variable is:
  P(X = x) = C(n, x) · p^x · (1 − p)^(n−x)
  where C(n, x) is the binomial coefficient, p is the probability of error, n is the number of trials, and x is the number of successes (errors in this case).
• Step 3: Calculate the Cumulative Probability P(X ≥ 30)
  To find P(X ≥ 30), sum the probabilities from X = 30 to X = 100:
  P(X ≥ 30) = Σ_{x=30}^{100} C(100, x) · 0.25^x · 0.75^(100−x)
Binomial Test - Example
• Step 4: Compute Each Term
• Calculate each term in the sum. For example, for x = 30:
  P(X = 30) = C(100, 30) · 0.25^30 · 0.75^70,
  where C(100, 30) = 100! / (30! · (100 − 30)!).
• Calculate this using a calculator or computational tool for precision.
• Step 5: Sum the Probabilities
  Sum the probabilities for x = 30 to x = 100. This is computationally intensive and is typically done using statistical software or a binomial calculator.
• Step 6: Compare to the Significance Level
  Compare the cumulative probability to the significance level α = 0.05. If P(X ≥ 30) < 0.05, reject H0; otherwise, do not reject H0.
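The sum in Step 5 is easiest to obtain with statistical software; a sketch using SciPy (assumed available):

from scipy import stats

n, x, p0 = 100, 30, 0.25                     # values from the example
p_value = stats.binom.sf(x - 1, n, p0)       # P(X >= 30) under p = 0.25, ~= 0.15
print(p_value, p_value < 0.05)               # False -> do not reject H0
# The same one-sided exact test is wrapped as:
# stats.binomtest(x, n, p0, alternative='greater').pvalue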
Approximate Normal Test
• The approximate normal test is used to test a hypothesis about the error rate of a classifier when
the number of validation samples is large.
• This test uses the Central Limit Theorem to approximate the binomial distribution with a normal
distribution.
• Steps:
1. Define the Problem:
• You have a single training set T and a single validation set V.
• Train the classifier on T and test it on V.
• Let p be the probability of misclassification (error rate) which we want to estimate or test.
2. Hypotheses:
• Null hypothesis H0: p ≤ p0
• Alternative hypothesis H1: p > p0
3. Binomial Distribution Approximation:
• Let X be the total number of errors.
• The point estimate of p is p̂ = X/N, where N is the total number of validation samples.
Approximate Normal Test
• Central Limit Theorem:
• For large N, the distribution of p̂ is approximately normal with mean p0 and
variance p0(1 − p0)/N.
• Test Statistic:
• Calculate the z-score:
  z = (p̂ − p0) / √(p0(1 − p0)/N)
• Decision Rule:
• Compare the z-score to the critical value from the standard normal
distribution (e.g., z_{0.05} = 1.64 for a significance level α = 0.05).
• Reject H0 if z is greater than the critical value.
Approximate Normal Test - Example
• Problem Setup:
• You have a validation set with 100 samples.
• You observed 30 errors.
• You want to test H0: p ≤ 0.25 against H1: p > 0.25.
Step-by-Step Calculation:
1. Calculate the point estimate: p̂ = X/N = 30/100 = 0.30
2. Calculate the Standard Error:
   Standard Error = √(p0(1 − p0)/N) = √(0.25 × 0.75/100) ≈ 0.0433
3. Calculate the z-score:
   z = (p̂ − p0)/Standard Error = (0.30 − 0.25)/0.0433 ≈ 1.15
Approximate Normal Test - Example
4. Decision Rule:
• For a significance level α = 0.05, the critical value z_{0.05} ≈ 1.64.
• Compare the calculated z-score (1.15) with the critical value (1.64).
Since z = 1.15 is less than 1.64, we do not reject the null hypothesis
H0: p ≤ 0.25. This means there is not enough evidence to conclude that the
error rate is greater than 0.25.
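A compact version of the same calculation with NumPy/SciPy (assumed available):

import numpy as np
from scipy import stats

N, X, p0, alpha = 100, 30, 0.25, 0.05        # values from the example
p_hat = X / N                                # 0.30
se = np.sqrt(p0 * (1 - p0) / N)              # ~= 0.0433
z = (p_hat - p0) / se                        # ~= 1.15
z_crit = stats.norm.ppf(1 - alpha)           # ~= 1.64
print(z, z > z_crit)                         # False -> do not reject H0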
Interpretation
• The approximate normal test helps us determine whether the observed
error rate is significantly higher than a specified threshold.
• In this example, the observed error rate (30%) was not significantly higher
than the threshold (25%) at the 0.05 significance level, so we did not reject
the null hypothesis.
t-Test - Setup
• Multiple training/validation sets (K pairs)
• Error percentages p_i for each validation set i
• Average error rate:
  m = (1/K) Σ_{i=1}^{K} p_i
t-Test - Example
• Example: 10 validation sets, error percentages p_i:
  0.28, 0.32, 0.30, 0.27, 0.31, 0.29, 0.33, 0.26, 0.34, 0.25
• Calculate m:
  m = (1/10) Σ_{i=1}^{10} p_i = 0.295
• Calculate the sample variance:
  S² = (1/9) Σ_{i=1}^{10} (p_i − 0.295)² ≈ 0.00092
• Standard deviation: S ≈ 0.0303
t-Test – Decision Rule
• Null Hypothesis (H0): p ≤ 0.25
• Calculate the t-statistic:
  t = (m − p0) / (S/√K) = (0.295 − 0.25) / (0.0303/√10) ≈ 4.70
• Compare to t_{0.05,9} = 1.83
• Since 4.70 > 1.83, reject H0
• Conclude p > 0.25
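The whole calculation from the listed error percentages, as a sketch with NumPy/SciPy (assumed available):

import numpy as np
from scipy import stats

p = np.array([0.28, 0.32, 0.30, 0.27, 0.31, 0.29, 0.33, 0.26, 0.34, 0.25])
p0, K = 0.25, len(p)

m = p.mean()                                 # 0.295
s = p.std(ddof=1)                            # sample standard deviation
t_stat = (m - p0) / (s / np.sqrt(K))
t_crit = stats.t.ppf(0.95, df=K - 1)         # one-sided, alpha = 0.05, ~= 1.83
print(t_stat, t_stat > t_crit)               # True -> reject H0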
Key Points
• Binomial Test: Used for single validation set, evaluates if observed
error rate is higher than specified value.
• Approximate Normal Test: Normal approximation for large sample
sizes, used for single validation set.
• t Test: Used for multiple validation sets, assesses significance of
average error rate.
Comparing Two Classification Algorithms

Statistical Tests for Evaluating Error Rates


Introduction
• Objective: Compare two classifiers to test if they have the same
expected error rate.
• Approach: Use statistical tests to evaluate whether the error rates of
two classifiers are significantly different.
• Methods Covered:
• McNemar’s Test
• K-Fold Cross-Validated Paired t-Test
• 5 × 2 CV Paired t-Test
• 5 × 2 CV Paired F-Test
McNemar’s Test
• Purpose: Test if two classifiers have the same error rate.
• Contingency Table:
• e00: Misclassified by both classifiers
• e01: Misclassified by Classifier 1 but not 2
• e10: Misclassified by Classifier 2 but not 1
• e11: Correctly classified by both classifiers
• Chi-Square Statistic:
  χ² = (|e01 − e10| − 1)² / (e01 + e10)
• Decision Rule: Reject the null hypothesis if χ² > χ²_{α,1}, where χ²_{0.05,1} = 3.84
Example of McNemar’s Test
• Scenario: Compare Classifier A and Classifier B on a validation set.
• No. of examples misclassified by both – 50
• No. of examples misclassified by A but not B – 25
• No. of examples misclassified by B but not A – 15
• No. of examples correctly classified by both - 50
• Under the null hypothesis that the two classification algorithms have the same error rate, we expect e01 = e10, each equal to (e01 + e10)/2. The resulting chi-square statistic has one degree of freedom.
• Calculate Statistic:
  χ² = (|25 − 15| − 1)² / (25 + 15) = (10 − 1)²/40 = 81/40 = 2.025
• Interpretation: Since 2.025 < 3.84, fail to reject the null hypothesis.
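The statistic and decision, computed with SciPy (assumed available):

from scipy import stats

e01, e10 = 25, 15                                   # discordant counts from the example
chi2 = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)      # continuity-corrected statistic
chi2_crit = stats.chi2.ppf(0.95, df=1)              # ~= 3.84
print(chi2, chi2 > chi2_crit)                       # 2.025, False -> fail to reject H0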
K-Fold Cross-Validated Paired t-Test
• Purpose: Evaluate if the difference in error rates between two classifiers is statistically
significant.
• If the two classification algorithms have the same error rate, then we expect them to
have the same mean, or equivalently, that the difference of their means is 0.
• Procedure:
1. Perform K-Fold Cross-Validation.
2. Record error rates for both classifiers on each fold.
3. Calculate the difference p_i = p_i¹ − p_i² for each fold.
4. Test if the mean difference is zero.
• t-Statistic:
  t = (√K · m) / S ~ t_{K−1}
• Where: m = mean of the differences, S = standard deviation of the differences
• Decision Rule: Reject the null hypothesis if t falls outside (−t_{α/2,K−1}, t_{α/2,K−1}).
Example of K-Fold Cross-Validated Paired t-
Test
• Scenario: Use 5-Fold Cross-Validation.
• Results: Differences in error rates: [0.02, -0.01, 0.03, -0.02, 0.01]
• Calculate:
• Mean difference m = 0.006
• Sample variance S² ≈ 0.00043, so S ≈ 0.0207
• t-Statistic for 4 degrees of freedom:
  t = (√5 × 0.006) / 0.0207 ≈ 0.65
• Interpretation: Compare with t_{0.025,4} = 2.776. Since 0.65 < 2.776, fail to
reject the null hypothesis.
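The same paired test on the listed differences, sketched with NumPy/SciPy (assumed available):

import numpy as np
from scipy import stats

d = np.array([0.02, -0.01, 0.03, -0.02, 0.01])   # per-fold error-rate differences
K = len(d)
m, s = d.mean(), d.std(ddof=1)
t_stat = np.sqrt(K) * m / s
t_crit = stats.t.ppf(0.975, df=K - 1)            # two-sided, alpha = 0.05, ~= 2.776
print(t_stat, abs(t_stat) > t_crit)              # False -> fail to reject H0
# Equivalent shortcut: stats.ttest_1samp(d, popmean=0.0)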
5 × 2 CV Paired t-Test
• Purpose: Extended version of paired t-test using 5 × 2 cross-validation.
• Proposed by Dietterich (1998).
• Procedure:
1. Perform 5 × 2 Cross-Validation.
2. Compute the error rate difference for each fold and replication.
3. Calculate the average and variance of the differences.
• p_i^(j) is the difference between the error rates of the two classifiers on fold j = 1, 2 of replication i = 1, ..., 5.
• The average on replication i is p̄_i = (p_i^(1) + p_i^(2))/2, and the estimated variance is s_i² = (p_i^(1) − p̄_i)² + (p_i^(2) − p̄_i)².
• Null hypothesis: the two classification algorithms have the same error rate.
5 × 2 CV Paired t-Test
• t-Statistic:
  t = p_1^(1) / √((1/5) Σ_{i=1}^{5} s_i²) ~ t_5
• Where: s_i² = variance for replication i
• Decision Rule: Reject the null hypothesis if t falls outside (−t_{α/2,5}, t_{α/2,5}).
5 × 2 CV Paired F-Test
• Purpose: Extend the 5 × 2 CV t-test using the F-distribution.
• Procedure:
1. Compute squared differences and variances.
2. Combine results into an F-statistic.
• F-Statistic:
  F = (Σ_{i=1}^{5} Σ_{j=1}^{2} (p_i^(j))²) / (2 Σ_{i=1}^{5} s_i²) ~ F_{10,5}
• Decision Rule: Reject the null hypothesis if F > F_{α,10,5}.
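A sketch of both 5×2 CV statistics, assuming NumPy/SciPy and a hypothetical 5×2 array of error-rate differences:

import numpy as np
from scipy import stats

# p[i, j]: error-rate difference on fold j of replication i (hypothetical values)
p = np.array([[0.02, 0.01],
              [0.03, -0.01],
              [0.01, 0.02],
              [0.00, 0.03],
              [0.02, 0.00]])

p_bar = p.mean(axis=1)                                    # per-replication averages
s2 = (p[:, 0] - p_bar) ** 2 + (p[:, 1] - p_bar) ** 2      # per-replication variances

t_stat = p[0, 0] / np.sqrt(s2.mean())                     # 5x2 CV paired t statistic (~ t_5)
f_stat = (p ** 2).sum() / (2 * s2.sum())                  # 5x2 CV paired F statistic (~ F_{10,5})

t_crit = stats.t.ppf(0.975, df=5)
f_crit = stats.f.ppf(0.95, dfn=10, dfd=5)
print(abs(t_stat) > t_crit, f_stat > f_crit)              # reject H0 where True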
Summary
• McNemar’s Test: Good for a quick comparison based on misclassification
counts.
• K-Fold Cross-Validated Paired t-Test: Uses multiple folds for a robust
comparison of means.
• 5 × 2 CV Paired t-Test: Adds multiple replications for a more reliable
comparison.
• 5 × 2 CV Paired F-Test: Provides a detailed comparison using the F-
distribution.
