Unit - II
Analysis of Machine Learning
Cross Validation and Resampling Methods
Dr. Chetan Jalendra
Department of Technology
Introduction to Cross-Validation
• Definition:
• Cross-validation is a statistical method used to estimate the performance of a
model by partitioning the data into training and validation sets multiple times.
• Goal:
• To obtain reliable error estimates and assess the performance of a model on
unseen data.
K-Fold Cross-Validation
• Process:
• Divide the dataset X into K equal parts (X_1, X_2, ..., X_K).
• Perform K iterations:
• Use one part as the validation set V_i.
• Combine the remaining K−1 parts to form the training set T_i.
• Example:
• V_1 = X_1, T_1 = X_2 ∪ X_3 ∪ ... ∪ X_K
• V_2 = X_2, T_2 = X_1 ∪ X_3 ∪ ... ∪ X_K
• ...
• V_K = X_K, T_K = X_1 ∪ X_2 ∪ ... ∪ X_(K−1)
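A minimal sketch of K-fold cross-validation, assuming scikit-learn is available; the iris dataset and logistic regression classifier are illustrative placeholders.

```python
# Minimal K-fold cross-validation sketch (illustrative dataset and classifier).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)   # K = 10 folds

# Each iteration trains on K-1 folds and validates on the held-out fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print("fold accuracies:", scores)
print("mean accuracy:", round(scores.mean(), 3))
```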
Benefits and Drawbacks of K-Fold Cross-Validation
• Benefits:
• Maximizes training set size for robust estimators.
• Balances between training and validation set sizes.
• Drawbacks:
• Smaller validation sets can lead to less reliable error estimates.
• Significant overlap in training sets (each pair shares K−2 parts).
Choosing the Optimal K
• Factors:
• Size of the dataset (N):
• Larger N: a smaller K suffices.
• Smaller N: a larger K is preferred, so that each training set stays large.
• Typical Values:
• Common choices are K = 10 or K = 30.
Leave-One-Out Cross-Validation (LOOCV)
• Definition:
• Special case of K-fold cross-validation where K = N (the number of instances in the dataset).
• Each instance is used once as the validation set, while the remaining N−1
instances form the training set.
• Use Case:
• Ideal for small datasets or applications where labeled data is hard to find
(e.g., medical diagnosis).
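A minimal LOOCV sketch, assuming scikit-learn; the dataset and classifier are illustrative placeholders.

```python
# Minimal leave-one-out cross-validation sketch (K = N).
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()   # each of the N instances is the validation set exactly once

scores = cross_val_score(KNeighborsClassifier(), X, y, cv=loo)
print("LOOCV accuracy:", scores.mean())   # average over N single-instance folds
```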
Benefits and Drawbacks of LOOCV
• Benefits:
• Maximizes the use of available data.
• Drawbacks:
• High computational cost (requires N training iterations).
• Does not allow for stratification.
Stratification in Cross-Validation
• Definition:
• Ensures class proportions are maintained in both training and validation sets.
• Prevents distortion of class prior probabilities.
• Importance:
• Especially crucial for imbalanced datasets to obtain reliable performance
estimates.
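A minimal sketch of stratified K-fold splitting with scikit-learn; the imbalanced toy labels below are illustrative.

```python
# Stratified K-fold keeps the class proportions in every fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)        # imbalanced toy labels: 90 negatives, 10 positives
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold keeps roughly the 9:1 class ratio of the full data.
    print("positives in fold:", y[val_idx].sum(), "of", len(val_idx))
```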
Multiple Runs of K-Fold Cross-Validation
• Recent Advancements:
• Increased computational power allows for multiple runs of K-fold cross-
validation (e.g., 10×10 fold).
• Process:
• Perform K-fold cross-validation multiple times.
• Average the results to obtain more reliable error estimates.
• Reference:
• Bouckaert (2003).
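A minimal sketch of repeated (10×10-fold) cross-validation using scikit-learn's RepeatedStratifiedKFold; the dataset and classifier are illustrative placeholders.

```python
# Repeated K-fold cross-validation: 10 repetitions of 10-fold CV, then average.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rkf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=rkf)
print("mean error over 10x10 folds:", round(1 - scores.mean(), 3))
```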
5×2 Cross-Validation
• Definition:
• Proposed by Dietterich (1998), uses training and validation sets of equal size.
• Process:
• Randomly divide the dataset into two equal halves, X_1^(1) and X_1^(2).
• First pair: T_1 = X_1^(1), V_1 = X_1^(2).
• Swap the roles for the second pair: T_2 = X_1^(2), V_2 = X_1^(1).
• Repeat the random split five times (replications i = 1, ..., 5) to obtain ten pairs: T_1, V_1, T_2, V_2, ..., T_10, V_10 (see the sketch below).
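A minimal sketch of 5×2 cross-validation, assuming scikit-learn; the dataset and classifier are illustrative placeholders.

```python
# 5x2 CV: 5 replications of a 50/50 split, training/validation roles swapped each time.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
errors = []
for i in range(5):                                              # 5 replications
    X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=i)
    for Xtr, ytr, Xval, yval in [(X1, y1, X2, y2), (X2, y2, X1, y1)]:  # swap roles
        clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
        errors.append(1 - accuracy_score(yval, clf.predict(Xval)))
print("10 error estimates:", np.round(errors, 3))
```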
Benefits and Drawbacks of 5×2 Cross-Validation
• Benefits:
• Equal-sized training and validation sets.
• Drawbacks:
• After five folds, the sets share many instances and overlap significantly.
• The statistics become dependent, reducing the new information obtained.
Bootstrapping
• Definition:
• An alternative to cross-validation for generating multiple samples from a
single sample.
• Process:
• Draw N instances from the original dataset of size N, with replacement, to form the bootstrap (training) sample.
• The original dataset serves as the validation set.
• Probability:
• Probability that a particular instance is never picked in N draws: (1 − 1/N)^N ≈ e^(−1) ≈ 0.368.
• The bootstrap sample therefore contains approximately 63.2% of the distinct instances.
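A minimal sketch of drawing one bootstrap sample with numpy; the sample size below is illustrative.

```python
# One bootstrap replicate: sample N instances with replacement.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
indices = rng.integers(0, N, size=N)          # draw N instance indices with replacement
distinct = np.unique(indices)
print("fraction of distinct instances in the bootstrap sample:",
      len(distinct) / N)                      # close to 1 - e^(-1) ≈ 0.632
```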
Benefits and Drawbacks of Bootstrapping
• Benefits:
• Suitable for very small datasets.
• Drawbacks:
• Bootstrap samples overlap with each other more than cross-validation samples do.
• Error estimate may be pessimistic.
• Solution: Replicate the process multiple times and average the results.
Summary and Best Practices
• Summary:
• Cross-validation is essential for model performance estimation.
• K-fold and LOOCV are common methods with specific use cases and
limitations.
• Stratification and multiple runs can improve reliability.
• 5×2 cross-validation offers balanced training and validation sets.
• Bootstrapping is ideal for small datasets.
• Best Practices:
• Choose K based on dataset size.
• Use stratification for imbalanced datasets.
• Consider computational cost and model complexity.
Measuring Classifier Performance
Metrics and Methods for Evaluating Model Accuracy
Introduction to Classifier Performance
• Definition:
• Evaluating the performance of a classifier is crucial for understanding how
well it predicts unseen data.
• Goal:
• To use various metrics and visual tools to assess and compare classifiers.
Confusion Matrix
• Definition:
• A confusion matrix is a table that is used to describe the performance of a
classification model.
• Components:
• True Positive (tp): Correctly predicted positive instances.
• True Negative (tn): Correctly predicted negative instances.
• False Positive (fp): Incorrectly predicted positive instances.
• False Negative (fn): Incorrectly predicted negative instances.
• Example:

                        Predicted Positive   Predicted Negative   Total
  True Class Positive          tp                   fn              p
  True Class Negative          fp                   tn              n
  Total                        p'                   n'              N
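A minimal sketch of building this table with scikit-learn; the label vectors are illustrative.

```python
# Confusion matrix for a two-class problem.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# With labels=[1, 0] the rows/columns follow the slide's layout: [[tp, fn], [fp, tn]].
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn = cm[0]
fp, tn = cm[1]
print("tp, fn, fp, tn =", tp, fn, fp, tn)
```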
Performance Measures
• Error Rate:
• Formula: error = (fp + fn) / N
• Measures the proportion of incorrect predictions.
• Accuracy:
• Formula: accuracy = (tp + tn) / N = 1 − error
• Measures the proportion of correct predictions.
Key Metrics for Two-class Problems
• True Positive Rate (TPR) / Sensitivity / Recall:
• Formula: tp/p
• Measures the proportion of actual positives correctly identified.
• False Positive Rate (FPR):
• Formula: fp/n
• Measures the proportion of actual negatives incorrectly identified as positive.
• Precision:
• Formula: tp/p’
• Measures the proportion of positive predictions that are actually correct.
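A minimal sketch computing these two-class metrics from tp, fn, fp, tn; the counts are illustrative.

```python
# TPR, FPR, and precision from confusion-matrix counts.
tp, fn, fp, tn = 40, 10, 5, 45
p, n = tp + fn, fp + tn          # actual positives / actual negatives
p_pred = tp + fp                 # predicted positives (p' on the slide)

tpr = tp / p                     # sensitivity / recall
fpr = fp / n                     # false positive rate
precision = tp / p_pred
print(f"TPR={tpr:.2f}  FPR={fpr:.2f}  Precision={precision:.2f}")
```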
ROC Curve
• Definition:
• The Receiver Operating Characteristics (ROC) curve plots TPR vs. FPR at
various threshold settings.
• Purpose:
• To visualize the trade-off between sensitivity and specificity for different
thresholds.
ROC Curve
• Example: Diagram of a typical ROC curve, with TPR on the y-axis and FPR on the x-axis.
Area Under the Curve (AUC)
• Definition:
• The Area Under the Curve (AUC) provides a single value summarizing the
overall performance of the classifier.
• Ideal AUC:
• AUC = 1 indicates a perfect classifier.
• Interpretation:
• Higher AUC values indicate better classifier performance.
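A minimal sketch of computing the ROC curve and AUC with scikit-learn; the dataset and classifier are illustrative placeholders.

```python
# ROC curve points and AUC for a held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=5000).fit(Xtr, ytr).predict_proba(Xte)[:, 1]
fpr, tpr, thresholds = roc_curve(yte, scores)   # TPR vs. FPR at each threshold
print("AUC:", round(roc_auc_score(yte, scores), 3))   # 1.0 would be a perfect classifier
```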
Precision-Recall Curve
• Definition:
• The Precision-Recall curve plots
precision vs. recall for different
threshold values.
• Use Case:
• Particularly useful for
imbalanced datasets where the
positive class is rare.
• Example:
• Diagram: Precision-Recall curve
illustrating the trade-off
between precision and recall.
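A minimal sketch of computing a precision-recall curve with scikit-learn; the dataset and classifier are illustrative placeholders.

```python
# Precision-recall curve and average precision for a held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
scores = LogisticRegression(max_iter=5000).fit(Xtr, ytr).predict_proba(Xte)[:, 1]

precision, recall, thresholds = precision_recall_curve(yte, scores)
print("average precision:", round(average_precision_score(yte, scores), 3))
```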
Sensitivity and Specificity
• Sensitivity (Recall):
• Formula: tp / p = tp-rate
• Measures how well the classifier identifies positive instances.
• Specificity:
• Formula: tn / n = 1 − fp-rate
• Measures how well the classifier identifies negative instances.
• Curve:
• Sensitivity vs. Specificity curve can also be plotted using different thresholds.
Class Confusion Matrix for Multi-Class Problems
• Definition:
• For K > 2 classes, the confusion matrix is a K × K matrix.
• Components:
• Entry (i, j) represents the number of instances of class C_i predicted as class C_j.
• Purpose:
• To identify which classes are frequently confused.
Example: Authentication Application
• Scenario:
• Users log on to their accounts by voice.
• Errors:
• False Positive: Impostor is wrongly logged on.
• False Negative: Valid user is refused.
• Metrics:
• TP-rate (Sensitivity): Proportion of valid users correctly authenticated.
• FP-rate: Proportion of impostors wrongly accepted.
Summary and Best Practices
• Summary:
• Various metrics are used to evaluate classifier performance, each with its own
strengths and weaknesses.
• ROC and Precision-Recall curves provide valuable visual insights.
• AUC is a useful summary statistic for comparing classifiers.
• Best Practices:
• Choose metrics based on the specific context and importance of different
types of errors.
• Use multiple metrics to get a comprehensive understanding of classifier
performance.
Understanding Interval Estimation and Hypothesis Testing
Key Techniques in Statistical Inference
Introduction
• Overview:
• Importance of statistical inference
• Key techniques: Interval Estimation and Hypothesis Testing.
• Objective:
• Understand confidence intervals
• Learn the steps of hypothesis testing.
Interval Estimation
• Definition:
• Interval estimation provides a range of values for an unknown parameter.
• Importance:
• Gives an estimate of the parameter with an associated confidence level.
Key Components of Confidence Intervals
• Point Estimate:
• Sample mean (x̄)
• Confidence Level:
• Typically 95% or 99%
• Margin of Error:
• Depends on the standard deviation and sample size
Formulas for Confidence Intervals
• When σ is known:
x̄ ± z_(α/2) · σ/√n
• When σ is unknown:
x̄ ± t_(α/2) · s/√n
• For Proportions:
p̂ ± z_(α/2) · √( p̂(1 − p̂) / n )
Example of Interval Estimation
• Scenario:
• Estimate the average weight loss of participants in a diet program.
• Given:
• Sample size (n): 30
• Sample mean (x̄): 5 kg
• Sample standard deviation (s): 1.5 kg
• Confidence Level: 95%
• Calculation:
x̄ ± t_(α/2) · s/√n
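A minimal sketch of this interval computed with scipy, using the numbers given on the slide (n = 30, x̄ = 5, s = 1.5, 95% confidence).

```python
# 95% confidence interval for the mean weight loss.
from scipy import stats

n, xbar, s = 30, 5.0, 1.5
t_crit = stats.t.ppf(0.975, df=n - 1)           # ≈ 2.045
margin = t_crit * s / n ** 0.5                  # ≈ 0.56 kg
print(f"95% CI: ({xbar - margin:.2f}, {xbar + margin:.2f})")   # ≈ (4.44, 5.56)
```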
Transition to Hypothesis Testing
• Connection:
• While interval estimation provides a range, hypothesis testing evaluates
specific claims about a parameter.
• Objective:
• Understand the process and application of hypothesis testing.
Hypothesis Testing Overview
• Definition:
• A method to test if there is enough evidence to reject a null hypothesis (H0)
in favor of an alternative hypothesis (Ha).
• Importance:
• Helps in making decisions based on data.
Steps in Hypothesis Testing
1. State the Hypotheses:
• Null Hypothesis (H0): e.g., μ = μ0
• Alternative Hypothesis (Ha): e.g., μ ≠ μ0
2. Choose Significance Level (α):
• Common values: 0.05, 0.01
3. Select Test and Calculate Test Statistic:
• e.g., t-test, z-test
4. Calculate p-value or Critical Value:
• Compare with α
5. Make Decision:
• Reject or fail to reject H0
6. Draw Conclusion:
• Interpret results in context
Types of Tests
• t-test:
• Used to compare means
• z-test:
• For large samples or known variance
• Chi-square test:
• For categorical data
• ANOVA:
• For comparing means among three or more groups
Example of Hypothesis Testing
• Scenario:
• Test if the mean weight loss is different from a hypothesized value of 4.5 kg.
• Given:
• Sample size (n): 30
• Sample mean (x̄): 5 kg
• Sample standard deviation (s): 1.5 kg
• Hypothesized mean (μ0): 4.5 kg
• Significance level (α): 0.05
Calculating the Test Statistic and Determining the Critical Value
• Formula:
t = (x̄ − μ0) / (s/√n)
• t = (5 − 4.5) / (1.5/√30) ≈ 1.83
• Degrees of Freedom: df = 29
• t-table: t(0.025, 29) ≈ 2.045
• Decision:
• Since |1.83| < 2.045, fail to reject H0.
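A minimal sketch of this one-sample t-test computed with scipy, using the summary statistics from the slide.

```python
# One-sample t-test: H0 mu = 4.5 vs. Ha mu != 4.5, from summary statistics.
from scipy import stats

n, xbar, s, mu0 = 30, 5.0, 1.5, 4.5
t_stat = (xbar - mu0) / (s / n ** 0.5)            # ≈ 1.83
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)      # ≈ 2.045
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
print(f"t = {t_stat:.2f}, critical = {t_crit:.3f}, p = {p_value:.3f}")
# |1.83| < 2.045 (p > 0.05), so we fail to reject H0.
```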
Conclusion of Hypothesis Testing
• Interpretation:
• There is not enough evidence to suggest that the mean weight loss is
significantly different from 4.5 kg.
Assessing a Classification Algorithm’s Performance
Understanding Error Rate Assessment and Comparison
Introduction
• Purpose: Evaluating the performance of classification algorithms by
assessing error rates.
• Scope:
• Primarily focuses on classification error rates.
• Methods are also applicable to:
• Squared error in regression: Measures the average squared difference between
observed and predicted values.
• Log likelihoods in unsupervised learning: Evaluates the fit of a model to the data using
likelihood functions.
• Expected reward in reinforcement learning: Assesses the long-term return an agent
expects to achieve from a policy.
Introduction: Key Concepts
• Hypothesis Testing: Determines if observed error rates are statistically
significant.
• Parametric Tests: Assume a specific distribution for the error rate
(e.g., binomial, normal).
• Nonparametric Tests: Do not assume any specific distribution for the
error rate, useful when the parametric form is unknown.
Introduction: Parametric Tests
• Binomial Test: Tests the hypothesis about the probability of misclassification errors.
• Assumes the number of errors follows a binomial distribution.
• Example: Testing if the error rate is less than or equal to a specified value p0.
• Approximate Normal Test: Uses the normal approximation of the binomial distribution for large sample sizes.
• Applies the Central Limit Theorem to approximate the binomial distribution with a normal distribution.
• Example: Testing if the observed error rate is significantly greater than a specified value p0.
• t-Test: Compares the average error rate across multiple runs to a specified value p0.
• Uses the t-distribution to account for variability in multiple runs.
• Example: Running the algorithm multiple times and testing if the average error rate is significantly different from p0.
Introduction: Nonparametric Tests
• Advantages: Do not rely on assumptions about the underlying
distribution of the error rate.
• Example Methods:
• Wilcoxon Signed-Rank Test: Compares paired samples to assess whether their
population mean ranks differ.
• Mann-Whitney U Test: Compares differences between two independent
samples.
• Applications: Useful when the data do not meet the assumptions
required for parametric tests or when dealing with small sample sizes.
Binomial Test - Setup
• Single training set T and single validation set V
• Train classifier on T and test on V
• p: probability of misclassification
• x_t: indicator of misclassification for instance t (0/1 Bernoulli variable)
• X: total number of errors, X = Σ_{t=1}^{N} x_t
Binomial Test - Example
• Example: 100 validation samples, observed 30 errors
• Null Hypothesis (H0): p ≤ 0.25
• Alternative Hypothesis (H1): p > 0.25
• Step 1: Define the Binomial Random Variable
• The number of errors X follows a binomial distribution: X ~ Binomial(n = 100, p = 0.25).
Binomial Test - Example
• Step 2: Calculate the Probability Mass Function
• The probability mass function of a binomial random variable is:
P(X = x) = C(n, x) · p^x · (1 − p)^(n − x)
where C(n, x) is the binomial coefficient, p is the probability of error, n is the number of trials, and x is the number of successes (errors in this case).
• Step 3: Calculate the Cumulative Probability P(X ≥ 30)
• To find P(X ≥ 30), sum the probabilities from X = 30 to X = 100:
P(X ≥ 30) = Σ_{x=30}^{100} C(100, x) · 0.25^x · (0.75)^(100 − x)
Binomial Test - Example
• Step 4: Compute Each Term
• Calculate each term in the sum. For example, for x = 30:
P(X = 30) = C(100, 30) · 0.25^30 · (0.75)^70, where C(100, 30) = 100! / (30! · 70!)
• Calculate this using a calculator or computational tool for precision.
• Step 5: Sum the Probabilities
• Sum the probabilities for x = 30 to x = 100. This is computationally intensive and is typically done using statistical software or a binomial calculator (see the sketch below).
• Step 6: Compare to the Significance Level
• Compare the cumulative probability to the significance level α = 0.05. If P(X ≥ 30) < 0.05, reject H0; otherwise, do not reject H0.
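A minimal sketch of this computation with scipy; it evaluates P(X ≥ 30) directly instead of summing terms by hand.

```python
# Exact binomial tail probability P(X >= 30) under H0: p = 0.25, n = 100.
from scipy import stats

n, x, p0 = 100, 30, 0.25
p_value = stats.binom.sf(x - 1, n, p0)     # P(X >= 30) = 1 - P(X <= 29)
print(f"P(X >= 30) = {p_value:.3f}")       # ≈ 0.15, which is > 0.05, so do not reject H0
```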
Approximate Normal Test
• The approximate normal test is used to test a hypothesis about the error rate of a classifier when
the number of validation samples is large.
• This test uses the Central Limit Theorem to approximate the binomial distribution with a normal
distribution.
• Steps:
1. Define the Problem:
• You have a single training set T and a single validation set V.
• Train the classifier on T and test it on V.
• Let p be the probability of misclassification (error rate) which we want to estimate or test.
2. Hypothesis:
• Null hypothesis H0: p ≤ p0
• Alternative hypothesis H1: p > p0
3. Binomial Distribution Approximation:
• Let X be the total number of errors.
• The point estimate of p is p̂ = X/N, where N is the total number of validation samples.
Approximate Normal Test
• Central Limit Theorem:
• For large N, the distribution of p̂ is approximately normal with mean p0 and
variance p0(1 − p0)/N.
• Test Statistic:
• Calculate the z-score:
z = (p̂ − p0) / √( p0(1 − p0) / N )
• Decision Rule:
• Compare the z-score to the critical value from the standard normal
distribution (e.g., z0.05 = 1.64 for a significance level α = 0.05).
• Reject H0 if z is greater than the critical value.
Approximate Normal Test - Example
• Problem Setup:
• You have a validation set with 100 samples.
• You observed 30 errors.
• You want to test H0: p ≤ 0.25 against H1: p > 0.25.
Step-by-Step Calculation:
1. Calculate the Point Estimate: p̂ = X/N = 30/100 = 0.30
2. Calculate the Standard Error:
Standard Error = √( p0(1 − p0) / N ) = √( 0.25 × 0.75 / 100 ) ≈ 0.0433
3. Calculate the z-score:
z = (p̂ − p0) / Standard Error = (0.30 − 0.25) / 0.0433 ≈ 1.15
Approximate Normal Test - Example
4. Decision Rule:
• For a significance level α = 0.05, the critical value z0.05 ≈ 1.64.
• Compare the calculated z-score (1.15) with the critical value (1.64).
• Since z = 1.15 is less than 1.64, we do not reject the null hypothesis
H0: p ≤ 0.25. This means there is not enough evidence to conclude that the
error rate is greater than 0.25.
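A minimal sketch of the approximate normal (z) test above, computed with scipy (N = 100, 30 errors, p0 = 0.25).

```python
# Approximate normal test for H0: p <= 0.25 vs. H1: p > 0.25.
from math import sqrt
from scipy import stats

N, errors, p0 = 100, 30, 0.25
p_hat = errors / N
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / N)     # ≈ 1.15
z_crit = stats.norm.ppf(0.95)                  # ≈ 1.64 for alpha = 0.05 (one-sided)
print(f"z = {z:.2f}, critical = {z_crit:.2f}")
# 1.15 < 1.64, so we do not reject H0: p <= 0.25.
```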
Interpretation
• The approximate normal test helps us determine whether the observed
error rate is significantly higher than a specified threshold.
• In this example, the observed error rate (30%) was not significantly higher
than the threshold (25%) at the 0.05 significance level, so we did not reject
the null hypothesis.
t-Test - Setup
• Multiple training/validation sets (K pairs)
• Error percentages p_i for each validation set i
• Average error rate:
m = (1/K) Σ_{i=1}^{K} p_i
t-Test - Example
• Example: 10 validation sets, error percentages p_i:
0.28, 0.32, 0.30, 0.27, 0.31, 0.29, 0.33, 0.26, 0.34, 0.25
• Calculate m:
m = (1/10) Σ_{i=1}^{10} p_i = 0.295
• Calculate the sample variance:
S² = (1/9) Σ_{i=1}^{10} (p_i − 0.295)² ≈ 0.00092
• Standard deviation: S ≈ 0.0303
t-Test – Decision Rule
• Null Hypothesis (H0): p ≤ 0.25
• Calculate the t-statistic:
t = (m − p0) / (S/√K) = (0.295 − 0.25) / (0.0303/√10) ≈ 4.70
• Compare to t(0.05, 9) = 1.833
• Since 4.70 > 1.833, reject H0
• Conclude p > 0.25
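A minimal sketch (assuming numpy and scipy) that reproduces the calculation above from the ten error rates on the slide.

```python
# One-sample t-test on error rates from K = 10 validation sets: H0: p <= 0.25.
import numpy as np
from scipy import stats

p = np.array([0.28, 0.32, 0.30, 0.27, 0.31, 0.29, 0.33, 0.26, 0.34, 0.25])
p0, K = 0.25, len(p)

m, s = p.mean(), p.std(ddof=1)                    # m = 0.295, s ≈ 0.0303
t_stat = (m - p0) / (s / np.sqrt(K))              # ≈ 4.70
t_crit = stats.t.ppf(0.95, df=K - 1)              # ≈ 1.833 (one-sided, alpha = 0.05)
print(f"t = {t_stat:.2f}, critical = {t_crit:.3f}")
# t exceeds the critical value, so reject H0: p <= 0.25.
```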
Key Points
• Binomial Test: Used for single validation set, evaluates if observed
error rate is higher than specified value.
• Approximate Normal Test: Normal approximation for large sample
sizes, used for single validation set.
• t Test: Used for multiple validation sets, assesses significance of
average error rate.
Comparing Two Classification Algorithms
Statistical Tests for Evaluating Error Rates
Introduction
• Objective: Compare two classifiers to test if they have the same
expected error rate.
• Approach: Use statistical tests to evaluate whether the error rates of
two classifiers are significantly different.
• Methods Covered:
• McNemar’s Test
• K-Fold Cross-Validated Paired t-Test
• 5 × 2 CV Paired t-Test
• 5 × 2 CV Paired F-Test
McNemar’s Test
• Purpose: Test if two classifiers have the same error rate.
• Contingency Table:
• e00: Misclassified by both classifiers
• e01: Misclassified by Classifier 1 but not 2
• e10: Misclassified by Classifier 2 but not 1
• e11: Correctly classified by both classifiers
• Chi-Square Statistic:
χ² = ( |e01 − e10| − 1 )² / (e01 + e10)
• Decision Rule: Reject the null hypothesis if χ² > χ²(α, 1), where χ²(0.05, 1) = 3.84
Example of McNemar’s Test
• Scenario: Compare Classifier A and Classifier B on a validation set.
• No. of examples misclassified by both – 50
• No. of examples misclassified by A but not B – 25
• No. of examples misclassified by B but not A – 15
• No. of examples correctly classified by both - 50
• Under the null hypothesis that the classification algorithms have the same error
rate, we expect e01 = e10 and these to be equal to (e01+e10)/2. We have the chi-
square statistic with one degree of freedom
• Calculate the Statistic:
χ² = ( |25 − 15| − 1 )² / (25 + 15) = (10 − 1)² / 40 = 81/40 = 2.025
• Interpretation: Since 2.025 < 3.84, fail to reject the null hypothesis.
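A minimal sketch of this McNemar calculation using the continuity-corrected chi-square statistic (e01 = 25, e10 = 15).

```python
# McNemar's test from the off-diagonal disagreement counts.
from scipy import stats

e01, e10 = 25, 15
chi2 = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)     # = 2.025
chi2_crit = stats.chi2.ppf(0.95, df=1)             # ≈ 3.84
print(f"chi2 = {chi2:.3f}, critical = {chi2_crit:.2f}")
# 2.025 < 3.84, so we fail to reject the hypothesis of equal error rates.
```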
K-Fold Cross-Validated Paired t-Test
• Purpose: Evaluate if the difference in error rates between two classifiers is statistically
significant.
• If the two classification algorithms have the same error rate, then we expect them to
have the same mean, or equivalently, that the difference of their means is 0.
• Procedure:
1. Perform K-Fold Cross-Validation.
2. Record error rates for both classifiers on each fold.
3. Calculate the difference p_i = p_i^1 − p_i^2 for each fold.
4. Test if the mean difference is zero.
• t-Statistic:
t = (√K · m) / S, which follows a t-distribution with K − 1 degrees of freedom
• Where: m = mean of the differences, S = standard deviation of the differences
• Decision Rule: Reject the null hypothesis if t is outside (−t(α/2, K−1), t(α/2, K−1)).
Example of K-Fold Cross-Validated Paired t-Test
• Scenario: Use 5-Fold Cross-Validation.
• Results: Differences in error rates: [0.02, −0.01, 0.03, −0.02, 0.01]
• Calculate:
• Mean difference m = 0.006
• Variance S² ≈ 0.00043, so S ≈ 0.021
• t-Statistic for 4 degrees of freedom:
t = (√5 × 0.006) / 0.021 ≈ 0.65
• Interpretation: Compare with t(0.025, 4) = 2.776. Since 0.65 < 2.776, fail to
reject the null hypothesis.
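A minimal sketch (assuming numpy and scipy) that reproduces the paired t-test above from the five fold-wise differences.

```python
# K-fold cross-validated paired t-test on fold-wise error-rate differences.
import numpy as np
from scipy import stats

d = np.array([0.02, -0.01, 0.03, -0.02, 0.01])    # p_i = p_i^1 - p_i^2
K = len(d)
m, s = d.mean(), d.std(ddof=1)                    # m = 0.006, s ≈ 0.021

t_stat = np.sqrt(K) * m / s                       # ≈ 0.65
t_crit = stats.t.ppf(1 - 0.05 / 2, df=K - 1)      # ≈ 2.776
print(f"t = {t_stat:.2f}, critical = {t_crit:.3f}")
# |t| < 2.776, so we fail to reject the hypothesis of equal error rates.
```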
5 × 2 CV Paired t-Test
• Purpose: Extended version of paired t-test using 5 × 2 cross-validation.
• Proposed by Dietterich (1998).
• Procedure:
1. Perform 5 × 2 Cross-Validation.
2. Compute the error rate difference for each fold and replication.
3. Calculate the average and variance of the differences.
• p_i^(j) is the difference between the error rates of the two classifiers on fold j =
1, 2 of replication i = 1, ..., 5.
• The average on replication i is p̄_i = (p_i^(1) + p_i^(2)) / 2, and the estimated
variance is s_i² = (p_i^(1) − p̄_i)² + (p_i^(2) − p̄_i)².
• Null hypothesis: the two classification algorithms have the same error rate.
5 × 2 CV Paired t-Test
• t-Statistic:
t = p_1^(1) / √( (1/5) Σ_{i=1}^{5} s_i² ), which follows a t-distribution with 5 degrees of freedom
• Where: s_i² = variance for replication i
• Decision Rule: Reject the null hypothesis if t is outside (−t(α/2, 5), t(α/2, 5)).
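A minimal sketch of the 5×2 CV paired t statistic; the 5×2 array of error-rate differences below is purely illustrative.

```python
# 5x2 CV paired t-test statistic from a 5x2 array of error-rate differences.
import numpy as np
from scipy import stats

# p[i, j]: difference in error rates on fold j of replication i (illustrative values)
p = np.array([[0.02, 0.01],
              [0.03, -0.01],
              [0.01, 0.02],
              [0.00, 0.01],
              [0.02, 0.00]])

p_bar = p.mean(axis=1)                              # per-replication averages
s2 = ((p - p_bar[:, None]) ** 2).sum(axis=1)        # per-replication variances

t_stat = p[0, 0] / np.sqrt(s2.mean())               # numerator uses p_1^(1)
t_crit = stats.t.ppf(1 - 0.05 / 2, df=5)            # ≈ 2.571
print(f"t = {t_stat:.2f}, critical = {t_crit:.3f}")
```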
5 × 2 CV Paired F-Test
• Purpose: Extend the 5 × 2 CV t-test using the F-distribution.
• Procedure:
1. Compute squared differences and variances.
2. Combine results into an F-statistic.
• F-Statistic:
F = Σ_{i=1}^{5} Σ_{j=1}^{2} (p_i^(j))² / ( 2 Σ_{i=1}^{5} s_i² ), which follows an F-distribution with 10 and 5 degrees of freedom
• Decision Rule: Reject the null hypothesis if F > F(α, 10, 5).
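A minimal sketch of the 5×2 CV F statistic, using the same illustrative difference array as in the t-test sketch above.

```python
# 5x2 CV paired F-test statistic from a 5x2 array of error-rate differences.
import numpy as np
from scipy import stats

p = np.array([[0.02, 0.01], [0.03, -0.01], [0.01, 0.02], [0.00, 0.01], [0.02, 0.00]])
s2 = ((p - p.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)   # per-replication variances

f_stat = (p ** 2).sum() / (2 * s2.sum())
f_crit = stats.f.ppf(0.95, dfn=10, dfd=5)            # F(0.05, 10, 5) ≈ 4.74
print(f"F = {f_stat:.2f}, critical = {f_crit:.2f}")
# Reject the hypothesis of equal error rates only if F exceeds the critical value.
```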
Summary
• McNemar’s Test: Good for a quick comparison based on misclassification
counts.
• K-Fold Cross-Validated Paired t-Test: Uses multiple folds for a robust
comparison of means.
• 5 × 2 CV Paired t-Test: Adds multiple replications for a more reliable
comparison.
• 5 × 2 CV Paired F-Test: Provides a detailed comparison using the F-
distribution.