Lecture 4: Model Selection and Evaluation - MCQ Study Guide
Key Concepts Explained Simply
Model Selection
What is Model Selection? Model selection is the process of choosing the
best model from a set of candidate models. It’s like trying on different shoes to
find the pair that fits best.
Approaches to Model Selection
1. Hold-out Validation: Split data into training and validation sets
2. Cross-Validation: Split data into k folds, train on k-1 folds, test on the
remaining fold
3. Nested Cross-Validation: Cross-validation within cross-validation for
both model selection and evaluation
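For concreteness, here is a minimal hold-out validation sketch in scikit-learn. The toy dataset, the logistic-regression model, and the 80/20 split are illustrative assumptions, not part of the lecture:

```python
# Hold-out validation sketch: split once, train on one part, evaluate on the other.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)    # toy data for the demo
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)          # 80/20 hold-out split

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```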
Cross-Validation Techniques
• K-fold Cross-Validation: Split data into k equal parts
• Stratified K-fold: Maintains the same class distribution in each fold
• Leave-One-Out (LOOCV): Use n-1 samples for training and 1 for testing (where n is the total number of samples)
• Leave-P-Out: Use n-p samples for training and p for testing
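The sketch below contrasts plain k-fold with stratified k-fold in scikit-learn; the imbalanced toy dataset and the logistic-regression model are assumptions made for illustration:

```python
# K-fold vs. stratified k-fold: stratification keeps the class ratio in every fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)  # 90/10 imbalance
model = LogisticRegression(max_iter=1000)

kf_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
skf_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
print("k-fold mean accuracy:           ", kf_scores.mean())
print("stratified k-fold mean accuracy:", skf_scores.mean())
```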
Hyperparameter Tuning
• Grid Search: Try all combinations of a predefined set of hyperparameter
values
• Random Search: Randomly sample hyperparameter values from defined
distributions
• Bayesian Optimization: Use past evaluations to guide the search for
better hyperparameters
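A minimal sketch of grid search and random search with scikit-learn; the SVM model and the parameter ranges below are illustrative assumptions, not recommended settings:

```python
# Grid search tries every combination in the grid; random search samples from distributions.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)

rand = RandomizedSearchCV(SVC(),
                          {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e0)},
                          n_iter=10, cv=5, random_state=0)
rand.fit(X, y)

print("grid search best params:  ", grid.best_params_)
print("random search best params:", rand.best_params_)
```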
Model Evaluation
Classification Metrics
• Accuracy: Proportion of correct predictions
• Precision: Proportion of positive identifications that were actually correct
• Recall (Sensitivity): Proportion of actual positives that were identified
correctly
• F1 Score: Harmonic mean of precision and recall
• ROC Curve: Plot of True Positive Rate vs. False Positive Rate
• AUC (Area Under the Curve): Area under the ROC curve
• Confusion Matrix: Table showing correct and incorrect predictions
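These metrics are all available in scikit-learn; the labels and scores below are made-up values used only to show the calls:

```python
# Computing the classification metrics above for a tiny made-up example.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                     # actual labels
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                     # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]     # predicted probabilities (for AUC)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
```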
Regression Metrics
• Mean Absolute Error (MAE): Average of absolute differences between
predicted and actual values
• Mean Squared Error (MSE): Average of squared differences between
predicted and actual values
• Root Mean Squared Error (RMSE): Square root of MSE
• R-squared (Coefficient of Determination): Proportion of variance
explained by the model
• Adjusted R-squared: R-squared adjusted for the number of predictors
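A matching sketch for the regression metrics (the values are made up; RMSE is taken as the square root of MSE, as defined above):

```python
# Computing MAE, MSE, RMSE, and R-squared for a small made-up example.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]     # actual values
y_pred = [2.5, 0.0, 2.0, 8.0]      # predicted values

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                # RMSE = sqrt(MSE)
r2   = r2_score(y_true, y_pred)
print(f"MAE={mae}, MSE={mse}, RMSE={rmse:.3f}, R^2={r2:.3f}")
```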
Bias-Variance Tradeoff
Understanding Bias and Variance
• Bias: Error from overly simplistic assumptions (underfitting)
• Variance: Error from sensitivity to small fluctuations in training data
(overfitting)
• Tradeoff: Reducing bias typically increases variance and vice versa
Total Error Decomposition: Total Error = Bias² + Variance + Irreducible Error
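One way to make the decomposition concrete is a small simulation: refit the same model on many independent training sets drawn from a known function, then measure at a fixed test point how far the average prediction is from the truth (bias²) and how much the predictions spread (variance). The sketch below does this with a polynomial fit; the true function, noise level, and polynomial degree are all assumptions chosen for the demo:

```python
# Rough numerical estimate of bias^2 and variance at a single test point.
import numpy as np

rng = np.random.default_rng(0)
true_f = np.sin              # assumed "true" underlying function
noise_sd = 0.3               # irreducible noise level
x_test = 1.0                 # point at which we measure bias and variance
degree = 3                   # model complexity; increasing it tends to increase variance

preds = []
for _ in range(500):                                   # many independent training sets
    x_train = rng.uniform(0, np.pi, 20)
    y_train = true_f(x_train) + rng.normal(0, noise_sd, x_train.size)
    coefs = np.polyfit(x_train, y_train, degree)       # fit a polynomial of the chosen degree
    preds.append(np.polyval(coefs, x_test))            # predict at the test point

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x_test)) ** 2         # (average prediction - truth)^2
variance = preds.var()                                 # spread of predictions across datasets
print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}, irreducible ~ {noise_sd**2:.4f}")
```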
How to Balance Bias and Variance
• High Bias (Underfitting): Use more complex models, add features
• High Variance (Overfitting): Use simpler models, add regularization,
get more training data
Regularization
What is Regularization? Regularization is a technique to prevent overfitting by adding a penalty term to the loss function. It’s like adding weight to a seesaw to keep it balanced.
Types of Regularization
• L1 Regularization (Lasso): Adds the sum of absolute values of coefficients to the loss function
– Can lead to sparse models (feature selection)
– Formula: Loss + λ × Σ|w_i|
• L2 Regularization (Ridge): Adds the sum of squared values of coefficients to the loss function
– Shrinks coefficients towards zero but rarely to exactly zero
– Formula: Loss + λ × Σ(w_i)²
• Elastic Net: Combination of L1 and L2
– Formula: Loss + λ₁ × Σ|w_i| + λ₂ × Σ(w_i)²
Regularization Parameter (λ)
• Controls the strength of regularization
• Higher λ = stronger regularization = simpler model
• Lower λ = weaker regularization = more complex model
• Optimal λ is typically found through cross-validation
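The sketch below fits Lasso and Ridge on the same toy data so the difference in their coefficients is visible; note that scikit-learn calls the regularization strength alpha rather than λ, and the dataset and alpha value are illustrative assumptions:

```python
# L1 (Lasso) tends to zero out coefficients; L2 (Ridge) shrinks them smoothly.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # many coefficients end up exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)   # coefficients shrink but rarely reach zero
print("Lasso coefficients:", lasso.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))
```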
Ensemble Methods
What are Ensemble Methods? Ensemble methods combine multiple models to improve performance. It’s like asking multiple experts and taking their collective wisdom.
Types of Ensemble Methods
• Bagging (Bootstrap Aggregating):
– Train multiple models on random subsets of the data
– Combine by averaging (regression) or voting (classification)
– Example: Random Forest
• Boosting:
– Train models sequentially, each focusing on errors of previous models
– Combine by weighted voting
– Examples: AdaBoost, Gradient Boosting
• Stacking:
– Train multiple models and use their predictions as inputs to a meta-
model
– Meta-model learns how to best combine the predictions
Popular Ensemble Algorithms
• Random Forest: Ensemble of decision trees using bagging
• AdaBoost: Boosts weak learners by focusing on misclassified instances
• Gradient Boosting: Builds trees sequentially to correct errors
• XGBoost: Optimized implementation of gradient boosting
• Voting Classifier/Regressor: Combines different types of models
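A short scikit-learn sketch comparing a bagging-style ensemble, a boosting-style ensemble, and a stacked ensemble on the same toy data (the models and dataset are assumptions for illustration):

```python
# Cross-validated accuracy of three ensemble styles on one toy problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

models = {
    "random forest (bagging)": RandomForestClassifier(random_state=0),
    "gradient boosting":       GradientBoostingClassifier(random_state=0),
    "stacking (RF + GB -> logistic regression)": StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("gb", GradientBoostingClassifier(random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000)),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```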
Feature Selection
Why Feature Selection?
• Reduces overfitting
• Improves model performance
• Reduces training time
• Makes models more interpretable
Feature Selection Methods
• Filter Methods: Select features based on statistical measures
– Correlation
– Chi-square test
– Information gain
• Wrapper Methods: Use a model to evaluate feature subsets
– Recursive Feature Elimination (RFE)
– Forward/Backward selection
• Embedded Methods: Feature selection as part of model training
– Lasso regression
– Decision trees
– Random Forest feature importance
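The sketch below shows one filter method and one wrapper method from scikit-learn on a toy dataset; the choice of k=5 features and the logistic-regression base model are assumptions for the demo:

```python
# Filter method (SelectKBest) vs. wrapper method (RFE) for feature selection.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

filter_sel = SelectKBest(score_func=f_classif, k=5).fit(X, y)                        # statistical filter
rfe_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)   # model-based wrapper

print("filter keeps features:", filter_sel.get_support(indices=True))
print("RFE keeps features:   ", rfe_sel.get_support(indices=True))
```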
MCQ Practice Questions
Question 1
Which cross-validation technique is most appropriate when dealing with imbalanced classes?
A) K-fold Cross-Validation
B) Leave-One-Out Cross-Validation
C) Stratified K-fold Cross-Validation
D) Random Subsampling
Answer: C) Stratified K-fold Cross-Validation
Explanation: Stratified K-fold ensures that each fold has the same proportion
of classes as the original dataset, which is crucial for imbalanced datasets to
avoid bias in the validation process.
Question 2
Which regularization technique can reduce coefficients to exactly zero, effectively performing feature selection?
A) L1 Regularization (Lasso)
B) L2 Regularization (Ridge)
C) Dropout
D) Batch Normalization
Answer: A) L1 Regularization (Lasso)
Explanation: L1 regularization adds the sum of absolute values of coefficients
to the loss function, which can shrink some coefficients to exactly zero, effectively
removing those features from the model.
Question 3
What is the main difference between bagging and boosting ensemble methods?
A) Bagging uses decision trees while boosting uses neural networks
B) Bagging trains models in parallel while boosting trains them sequentially
C) Bagging is for classification while boosting is for regression
D) Bagging requires more data than boosting
Answer: B) Bagging trains models in parallel while boosting trains them sequentially
Explanation: In bagging, multiple models are trained independently on random subsets of the data. In boosting, models are trained sequentially, with each model focusing on the errors made by previous models.
Question 4
Which metric is most appropriate for evaluating a regression model
when outliers are a concern? - A) Mean Squared Error (MSE) - B) Root
Mean Squared Error (RMSE) - C) Mean Absolute Error (MAE) - D) R-squared
Answer: C) Mean Absolute Error (MAE)
Explanation: MAE uses absolute differences rather than squared differences,
making it less sensitive to outliers compared to MSE or RMSE.
Question 5
In the bias-variance tradeoff, what happens as model complexity increases?
A) Bias increases, variance decreases
B) Bias decreases, variance increases
C) Both bias and variance increase
D) Both bias and variance decrease
Answer: B) Bias decreases, variance increases
Explanation: As model complexity increases, the model can fit the training
data better (reducing bias), but becomes more sensitive to fluctuations in the
training data (increasing variance).
Question 6
Which of the following is NOT a method for hyperparameter tuning?
A) Grid Search
B) Random Search
C) Bayesian Optimization
D) Principal Component Analysis
Answer: D) Principal Component Analysis
Explanation: Principal Component Analysis (PCA) is a dimensionality reduction technique, not a method for hyperparameter tuning.
Question 7
What does the Area Under the ROC Curve (AUC) measure?
A) The accuracy of the model
B) The probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance
C) The precision of the model
D) The recall of the model
Answer: B) The probability that a randomly chosen positive instance is ranked
higher than a randomly chosen negative instance
Explanation: AUC represents the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance, making it a measure of the model’s ability to discriminate between classes.
Question 8
Which ensemble method is MOST likely to reduce bias in a model?
A) Bagging
B) Boosting
C) Stacking
D) Voting with identical models
Answer: B) Boosting
Explanation: Boosting trains models sequentially, with each model concentrating on the errors made by the previous ones, which is why it is particularly effective at reducing bias.
Calculation Problems
Problem 1: Cross-Validation
You have a dataset with 1000 instances and want to perform 5-fold
cross-validation. How many instances will be used for training and
testing in each fold?
Solution:
- Total instances: 1000
- Number of folds: 5
- Testing instances per fold: 1000 ÷ 5 = 200
- Training instances per fold: 1000 - 200 = 800
Therefore, each fold will use 800 instances for training and 200 for testing.
Problem 2: Confusion Matrix Metrics
A classification model produces the following confusion matrix for a binary classification problem:
- True Positives (TP): 120
- False Positives (FP): 30
- False Negatives (FN): 20
- True Negatives (TN): 130
Calculate the accuracy, precision, recall, F1 score, and specificity.
Solution:
- Accuracy = (TP + TN) / (TP + TN + FP + FN) = (120 + 130) / (120 + 130 + 30 + 20) = 250 / 300 = 0.833 or 83.3%
- Precision = TP / (TP + FP) = 120 / (120 + 30) = 120 / 150 = 0.8 or 80%
- Recall (Sensitivity) = TP / (TP + FN) = 120 / (120 + 20) = 120 / 140 = 0.857 or 85.7%
- F1 Score = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.8 × 0.857) / (0.8 + 0.857) = 1.372 / 1.657 = 0.828 or 82.8%
- Specificity = TN / (TN + FP) = 130 / (130 + 30) = 130 / 160 = 0.813 or 81.3%
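The numbers above can be double-checked with a few lines of plain Python arithmetic:

```python
# Re-deriving the Problem 2 metrics from the confusion-matrix counts.
TP, FP, FN, TN = 120, 30, 20, 130

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
f1          = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)
print(accuracy, precision, recall, f1, specificity)
# ~ 0.833, 0.800, 0.857, 0.828, 0.8125 (matching the worked solution, up to rounding)
```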
Problem 3: Regularization Effect
In a linear regression model with two features, the unregularized coefficients are w₁ = 5 and w₂ = -3. If L2 regularization with λ = 0.1 is applied, what will be the regularization penalty term added to the loss function?
Solution:
L2 regularization penalty = λ × (w₁² + w₂²)
= 0.1 × (5² + (-3)²)
= 0.1 × (25 + 9)
= 0.1 × 34
= 3.4
Problem 4: R-squared Calculation
A regression model produces the following predictions and actual values:
- Predicted: [12, 15, 18, 11, 20]
- Actual: [10, 14, 17, 13, 22]
Calculate the R-squared value.
Solution:
First, calculate the mean of the actual values:
Mean(y) = (10 + 14 + 17 + 13 + 22) / 5 = 76 / 5 = 15.2
Then, calculate the total sum of squares (TSS):
TSS = Σ(y_i - mean(y))² = (10 - 15.2)² + (14 - 15.2)² + (17 - 15.2)² + (13 - 15.2)² + (22 - 15.2)²
= 27.04 + 1.44 + 3.24 + 4.84 + 46.24 = 82.8
Next, calculate the residual sum of squares (RSS):
RSS = Σ(y_i - ŷ_i)² = (10 - 12)² + (14 - 15)² + (17 - 18)² + (13 - 11)² + (22 - 20)²
= 4 + 1 + 1 + 4 + 4 = 14
Finally, calculate R-squared:
R² = 1 - (RSS / TSS) = 1 - (14 / 82.8) = 1 - 0.169 = 0.831 or 83.1%
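The same result can be checked with scikit-learn’s r2_score:

```python
# Verifying the Problem 4 hand calculation.
from sklearn.metrics import r2_score

actual    = [10, 14, 17, 13, 22]
predicted = [12, 15, 18, 11, 20]
print(r2_score(actual, predicted))   # ~ 0.831, matching the calculation above
```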
Key Formulas to Remember
1. Accuracy: (TP + TN) / (TP + TN + FP + FN)
2. Precision: TP / (TP + FP)
3. Recall (Sensitivity): TP / (TP + FN)
4. Specificity: TN / (TN + FP)
5. F1 Score: 2 × (Precision × Recall) / (Precision + Recall)
6. Mean Absolute Error (MAE): (1/n) × Σ|y_i - ŷ_i|
7. Mean Squared Error (MSE): (1/n) × Σ(y_i - ŷ_i)²
8. Root Mean Squared Error (RMSE): √MSE
9. R-squared: 1 - (RSS / TSS)
• RSS: Residual Sum of Squares = Σ(y_i - ŷ_i)²
• TSS: Total Sum of Squares = Σ(y_i - mean(y))²
10. L1 Regularization: Loss + λ × Σ|w_i|
11. L2 Regularization: Loss + λ × Σ(w_i)²
Tips for MCQ Questions
1. Understand evaluation metrics: Know which metrics are appropriate
for different types of problems.
2. Know the tradeoffs: Understand the bias-variance tradeoff and how
different techniques affect it.
3. Remember regularization effects: Know how L1 and L2 regularization
affect model coefficients differently.
4. Understand ensemble methods: Know the differences between bagging, boosting, and stacking.
5. Practice calculations: Be comfortable calculating common metrics from
raw data or confusion matrices.