
Model Assessment Measures for Predictive and Classification Models

1. Model Scoring

What it is: Model scoring refers to the process of applying a trained machine learning model to new, unseen data to generate predictions or probabilities. It's the step where the model's learned patterns are used to forecast future outcomes.

How it works:

• For predictive (regression) models, scoring typically produces a continuous numerical value. For example, predicting a house price, a stock value, or a temperature.

• For classification models, scoring usually produces a probability for each class, or a direct class label. For example, the probability of a customer churning, the probability of an email being spam, or directly classifying an image as a cat or dog.

Applications:

• Real-time predictions: In financial trading, models score market data to predict price movements. In e-commerce, they score user behavior to recommend products.

• Batch processing: Scoring large datasets offline, such as predicting credit risk for all loan applicants or identifying fraudulent transactions in a nightly batch.

• Operationalizing models: Integrating trained models into business systems to automate decision-making.

2. Prediction Error Analysis

What it is: Prediction error analysis involves quantifying the discrepancy between a model's predicted outcomes and the actual observed outcomes. It helps in understanding the magnitude and nature of a model's mistakes.

Key Measures for Predictive (Regression) Models:

• Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values. It's less sensitive to outliers than MSE. $MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$ Applications: Forecasting sales, estimating project completion times.

• Mean Squared Error (MSE): The average of the squared differences between predicted and actual values. It penalizes larger errors more heavily. $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ Applications: Optimizing control systems, financial risk modeling where large errors are particularly costly.

• Root Mean Squared Error (RMSE): The square root of MSE. It's in the same units as the target variable, making it more interpretable than MSE. $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ Applications: Similar to MSE, but often preferred for presenting results due to interpretability.

• R-squared ($R^2$): Represents the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher $R^2$ indicates a better fit. $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$ Applications: Explaining the variability in an outcome, such as how well factors explain variations in crop yield.

Key Measures for Classification Models (often based on a Confusion Matrix): A Confusion Matrix is fundamental for classification error analysis. It's a table that summarizes the performance of a classification model on a set of test data for which the true values are known.

                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)

• Accuracy: The proportion of correctly classified instances: (TP + TN) / (TP + TN + FP + FN). Applications: General performance measure when classes are balanced.

• Precision: The proportion of true positive predictions among all positive predictions: TP / (TP + FP). It answers: "Of all instances predicted as positive, how many were actually positive?" Applications: Spam detection (minimize false positives, i.e., legitimate emails marked as spam), medical diagnosis (when a false positive leads to unnecessary invasive procedures).

• Recall (Sensitivity or True Positive Rate): The proportion of true positive predictions among all actual positive instances: TP / (TP + FN). It answers: "Of all actual positive instances, how many did the model correctly identify?" Applications: Fraud detection (minimize false negatives, i.e., undetected fraud), disease screening (identify as many sick people as possible).

• F1-Score: The harmonic mean of precision and recall. It balances both metrics, especially useful when there's an uneven class distribution. $F_1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$ Applications: Information retrieval, imbalanced classification problems where both false positives and false negatives are important.

• Specificity (True Negative Rate): The proportion of true negative predictions among all actual negative instances: TN / (TN + FP). Applications: Similar to recall, but for the negative class. Useful in medical testing (correctly identifying healthy individuals).

• False Positive Rate (FPR): The proportion of false positive predictions among all actual negative instances: FP / (TN + FP). Also known as 1 − Specificity.
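As a quick illustration, the following sketch computes the regression and confusion-matrix measures above by hand. It is a minimal sketch, assuming NumPy is installed; the y_true/y_pred and actual/predicted arrays are invented stand-ins for what model scoring would produce on held-out data.

```python
import numpy as np

# --- Regression error measures (MAE, MSE, RMSE, R^2) ---
y_true = np.array([3.0, 5.0, 2.5, 7.0])      # actual values (illustrative)
y_pred = np.array([2.8, 5.4, 2.9, 6.1])      # model predictions (illustrative)

mae  = np.mean(np.abs(y_true - y_pred))      # average absolute error
mse  = np.mean((y_true - y_pred) ** 2)       # penalizes large errors more heavily
rmse = np.sqrt(mse)                          # same units as the target variable
r2   = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# --- Classification measures derived from the confusion matrix ---
actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # true class labels
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # predicted class labels

tp = np.sum((predicted == 1) & (actual == 1))
tn = np.sum((predicted == 0) & (actual == 0))
fp = np.sum((predicted == 1) & (actual == 0))
fn = np.sum((predicted == 0) & (actual == 1))

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)                     # sensitivity / true positive rate
f1          = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)                     # true negative rate
fpr         = fp / (tn + fp)                     # 1 - specificity

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
print(f"acc={accuracy:.2f} prec={precision:.2f} rec={recall:.2f} "
      f"F1={f1:.2f} spec={specificity:.2f} FPR={fpr:.2f}")
```

In practice these quantities are usually obtained from library helpers (for example, scikit-learn's metrics module), but writing them out shows how each measure falls directly out of the definitions above.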
Applications of Prediction Error Analysis:

• Model tuning: Identifying where the model makes errors helps adjust parameters or choose different algorithms.

• Problem understanding: Uncovering patterns in errors can reveal underlying data issues or limitations in feature engineering.

• Business impact assessment: Translating prediction errors into business costs or missed opportunities.

3. ROC and Lift Curves

ROC (Receiver Operating Characteristic) Curve

What it is: The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. It plots the True Positive Rate (TPR, or Recall) against the False Positive Rate (FPR) at various threshold settings.

How it works:

• For each possible classification threshold (the probability cutoff above which a prediction is classified as positive), you calculate the TPR and FPR.

• Plotting these (FPR, TPR) pairs creates the ROC curve.

• A random classifier yields a diagonal line from (0,0) to (1,1).

• A perfect classifier would have a point at (0,1) (100% TPR, 0% FPR).

• Area Under the Curve (AUC-ROC): The area under the ROC curve. It represents the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance.

o AUC of 0.5 suggests a random classifier.

o AUC of 1.0 suggests a perfect classifier.

o Higher AUC values indicate better overall model performance across all possible thresholds.

Applications:

• Model comparison: Comparing the overall performance of different classification models, especially when the class distribution is imbalanced (as AUC is less sensitive to class imbalance than accuracy).

• Threshold selection: Identifying an optimal operating point (threshold) on the curve that balances TPR and FPR based on business requirements. For example, in fraud detection, you might tolerate a higher FPR to achieve a very high TPR.

• Medical diagnosis: Assessing the performance of diagnostic tests.

Lift Curve

What it is: A Lift curve (or Gains chart) is a visual tool used to evaluate the performance of a classification model, particularly in direct marketing and customer targeting scenarios. It shows how much better a model performs compared to a random selection.

How it works:

• Sort by probability: The data is sorted by the predicted probability of the positive class in descending order.

• Divide into deciles: The sorted data is typically divided into deciles (or other percentiles).

• Calculate lift: For each decile, you calculate the "lift" by comparing the proportion of actual positive cases in that decile to the proportion of actual positive cases in the entire population. $Lift = \frac{\%\text{ of actual positives in top X\% of predictions}}{\%\text{ of actual positives in entire population}}$

• The curve plots the cumulative percentage of the population (X-axis) against the cumulative percentage of true positives found (Y-axis).

• A diagonal line represents a random model (lift of 1).

• A good model will have a curve that rises steeply at the beginning, indicating that a small percentage of the targeted population contains a high percentage of the positive cases.

Applications:

• Targeted marketing: Identifying the most responsive customers for a marketing campaign to maximize ROI. For example, if a model predicts which customers are likely to respond to an offer, the lift curve shows how many more responses you'll get by targeting the top X% of customers according to the model, compared to targeting X% randomly.

• Fraud detection: Prioritizing investigations by focusing on transactions most likely to be fraudulent.

• Resource allocation: Directing limited resources to the most promising segments.
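Both calculations can be reproduced in a few lines of Python. This is a minimal sketch, assuming NumPy and scikit-learn are available: roc_curve and roc_auc_score are standard scikit-learn functions, the decile lift is computed by hand, and the labels and scores are synthetic stand-ins for a scored dataset.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)

# Made-up ground truth and predicted probabilities for a binary classifier.
y_true  = rng.integers(0, 2, size=1000)
y_score = np.clip(y_true * 0.3 + rng.random(1000) * 0.7, 0, 1)  # loosely correlated with truth

# --- ROC curve and AUC ---
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # (FPR, TPR) pairs across thresholds
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.3f}")                            # 0.5 ~ random, 1.0 ~ perfect

# --- Lift by decile ---
order = np.argsort(-y_score)                         # sort by predicted probability, descending
sorted_truth = y_true[order]
overall_rate = y_true.mean()                         # positive rate in the whole population

for d, decile in enumerate(np.array_split(sorted_truth, 10), start=1):
    lift = decile.mean() / overall_rate              # % positives in decile / % positives overall
    print(f"decile {d}: lift = {lift:.2f}")
```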
4. Profit Matrices for Classification

What it is: A profit matrix (or cost-benefit matrix) is a tool used in classification problems to assign monetary values (profits or costs) to each possible outcome of a classification. It allows you to evaluate a model's performance from a business perspective, rather than just statistical accuracy.

How it works: It extends the concept of a confusion matrix by assigning a specific profit or cost to each cell:

                    Predicted Positive      Predicted Negative
Actual Positive     Profit (TP)             Cost (FN)
Actual Negative     Cost (FP)               Profit (TN)

• True Positive (TP): Correctly predicting a positive event (e.g., identifying a loyal customer). This usually has a positive profit.

• False Negative (FN): Failing to predict a positive event (e.g., missing a fraudulent transaction). This typically incurs a cost or missed opportunity.

• False Positive (FP): Incorrectly predicting a positive event (e.g., marking a legitimate transaction as fraudulent). This can incur costs (e.g., investigation time, customer dissatisfaction).

• True Negative (TN): Correctly predicting a negative event (e.g., correctly identifying a non-fraudulent transaction). This might have a neutral or small positive profit (e.g., avoiding unnecessary action).

By multiplying the counts in each cell of the confusion matrix by their corresponding profit/cost values from the profit matrix, you can calculate the total expected profit (or loss) of a model at a given classification threshold (a short sketch of this calculation appears after the comparison criteria below).

Applications:

• Optimizing business decisions: Setting a classification threshold that maximizes overall profit, even if it means sacrificing some traditional accuracy metrics. For instance, in loan default prediction, the cost of a false negative (lending to a defaulter) is much higher than that of a false positive (denying a loan to a good borrower).

• Comparing models financially: Choosing the model that generates the highest expected profit for the business.

• Understanding real-world impact: Bridging the gap between statistical model performance and actual business value.

5. Various Model Comparison Criteria

Beyond the individual metrics, there are broader criteria and techniques for comparing different models:

• Statistical Significance Tests:

o T-tests, Chi-squared tests, ANOVA: Used to determine if the performance difference between two models is statistically significant or simply due to chance.

o Application: Deciding if a new model truly outperforms an existing one.

• Cross-Validation:

o Splitting the data into multiple folds and training/testing the model on different combinations of these folds. This provides a more robust estimate of model performance and reduces the impact of data randomness. Common types include K-Fold Cross-Validation and Stratified K-Fold.

o Application: Getting a reliable estimate of how a model will generalize to unseen data, and comparing models fairly on the same dataset.

• Bias-Variance Trade-off:

o Bias: Error due to overly simplistic assumptions in the learning algorithm. High bias can cause a model to underfit the data.

o Variance: Error due to too much complexity in the learning algorithm. High variance can cause a model to overfit the training data.

o When comparing models, you often aim for a balance. A model with low bias and low variance is ideal.

o Application: Guiding model selection; for example, a simple linear model might have high bias but low variance, while a complex neural network might have low bias but high variance.

• AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion):

o Information criteria used for model selection, particularly for statistical models. They penalize models with more parameters to avoid overfitting. Lower values generally indicate a better model.

o Application: Comparing different regression models or time series models, especially when considering model complexity.

• Time to Train/Predict:

o For practical applications, the computational resources and time required to train a model and make predictions can be a crucial comparison criterion, especially for large datasets or real-time systems.

o Application: Choosing between a highly accurate but slow model and a slightly less accurate but much faster model for deployment.

• Interpretability/Explainability:

o Some models (e.g., linear regression, decision trees) are inherently more interpretable than others (e.g., deep neural networks, complex ensembles). The ability to understand why a model makes a particular prediction can be vital for trust, debugging, and regulatory compliance.

o Application: In finance or healthcare, where explainable AI (XAI) is increasingly important for auditing and ethical considerations.

• Robustness to Outliers/Noise:

o How well a model performs when exposed to noisy or outlier data points.

o Application: In real-world datasets, which often contain anomalies, choosing a model that can handle such data gracefully.
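As promised above, here is a minimal sketch of the expected-profit calculation from Section 4, assuming NumPy; the profit/cost values, labels, and predicted probabilities are invented for illustration and would come from the business case and a scored validation set in practice.

```python
import numpy as np

# Illustrative profit matrix values (one per confusion-matrix cell).
PROFIT_TP = 100.0   # e.g., correctly flagged fraud case
COST_FN   = -500.0  # e.g., missed fraud case
COST_FP   = -20.0   # e.g., unnecessary investigation
PROFIT_TN = 0.0     # correctly ignored legitimate transaction

def expected_profit(y_true, y_prob, threshold):
    """Total profit of classifying at a given probability threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    # Multiply each confusion-matrix count by its profit/cost value and sum.
    return tp * PROFIT_TP + fn * COST_FN + fp * COST_FP + tn * PROFIT_TN

# Made-up labels and predicted probabilities standing in for a scored validation set.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=2000)
y_prob = np.clip(0.25 * y_true + rng.random(2000) * 0.75, 0, 1)

# Sweep thresholds and keep the one with the highest total profit.
thresholds = np.linspace(0.05, 0.95, 19)
profits = [expected_profit(y_true, y_prob, t) for t in thresholds]
best = int(np.argmax(profits))
print(f"best threshold = {thresholds[best]:.2f}, expected profit = {profits[best]:.0f}")
```

In a real project the four values would come from the business case (for example, average fraud loss and investigation cost), which is exactly how the profit matrix translates statistical performance into money.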
Ensemble Modeling

What it is: Ensemble modeling is a powerful machine learning technique where multiple individual models (often called "base learners" or "weak learners") are combined to produce a single, more robust, and typically more accurate predictive model than any single model could achieve alone. The idea is that the "wisdom of the crowd" often outperforms individual experts.

Why it works:

• Reduces bias: By combining models with different biases, the ensemble can converge on a more accurate overall prediction.

• Reduces variance: By averaging or combining predictions from multiple models, the impact of random fluctuations or errors in individual models is reduced, leading to more stable predictions.

• Improves robustness: Ensembles are less sensitive to the specific characteristics of the training data or the initialization of individual models.

Strategies for Ensemble Modeling

1. Bagging (Bootstrap Aggregating)

What it is: Bagging involves training multiple instances of the same base learning algorithm on different, randomly sampled subsets of the training data. The final prediction is typically an average (for regression) or a majority vote (for classification) of the individual model predictions.

How it works (a short sketch of these steps follows the applications list below):

1. Bootstrap Sampling: Create multiple (e.g., 100 or 500) bootstrap samples from the original training dataset. Each bootstrap sample is created by randomly drawing observations with replacement from the original dataset. This means some observations may appear multiple times in a sample, and some may not appear at all.

2. Base Model Training: Train a separate instance of the same base learning algorithm (e.g., decision tree, neural network) on each of these bootstrap samples.

3. Aggregation:

o For regression: Average the predictions of all individual models.

o For classification: Take a majority vote among the predicted classes of all individual models.

Key Characteristics:

• Parallel processing: Individual models can be trained in parallel as they are independent.

• Reduces variance: Primarily aims to reduce the variance of the base model, making it less prone to overfitting.

• Homogeneous learners: Typically uses the same type of base model (e.g., all decision trees).

• Example: Random Forest (an extension of bagging where decision trees are built on bootstrapped samples and also consider only a random subset of features at each split).

Applications:

• Random Forest: Widely used for both classification and regression in various domains due to its high accuracy and robustness.

o Healthcare: Predicting disease outcomes, identifying patient subgroups.

o Finance: Stock price prediction, credit scoring.

o Image classification: Object recognition.
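The three bagging steps translate almost directly into code. Below is a minimal sketch assuming scikit-learn and NumPy, using a synthetic dataset and decision trees as the base learner; scikit-learn's BaggingClassifier and RandomForestClassifier provide production-ready versions of the same idea.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_models = 100
models = []

# 1. Bootstrap sampling + 2. base model training.
for _ in range(n_models):
    idx = rng.integers(0, len(X_train), size=len(X_train))   # draw rows with replacement
    tree = DecisionTreeClassifier()                           # same base learner each time
    tree.fit(X_train[idx], y_train[idx])
    models.append(tree)

# 3. Aggregation: majority vote over the individual predictions.
all_preds = np.array([m.predict(X_test) for m in models])     # shape: (n_models, n_test)
majority_vote = (all_preds.mean(axis=0) >= 0.5).astype(int)

single_tree_acc = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)
bagged_acc = (majority_vote == y_test).mean()
print(f"single tree: {single_tree_acc:.3f}  bagged ensemble: {bagged_acc:.3f}")
```

Because the bootstrapped trees are independent of one another, the training loop could run in parallel, which is the "parallel processing" characteristic noted above.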

2. Boosting

What it is: Boosting is an ensemble technique that builds models sequentially. Each new model in the sequence focuses on correcting the errors made by the previous models. It iteratively adjusts the weights of misclassified instances, giving more emphasis to the "harder" examples.

How it works (a short sketch of this procedure appears at the end of this section):

1. Initial Model: Train an initial weak learner (e.g., a shallow decision tree) on the entire dataset.

2. Weight Adjustment: Evaluate the initial model's performance. Instances that were misclassified or had large errors are given higher weights.

3. Sequential Training: Train a new weak learner on the dataset, now with the adjusted instance weights. This new model focuses more on the previously difficult instances.

4. Iteration: Repeat steps 2 and 3 for a fixed number of iterations or until performance stops improving.

5. Weighted Combination: The final prediction is a weighted sum (for regression) or a weighted majority vote (for classification) of all the individual weak learners, where models that performed better on previous iterations might have higher weights.

Key Characteristics:

• Sequential processing: Models are built one after another, as each depends on the previous one's performance.

• Reduces bias: Primarily aims to reduce the bias of the base model, addressing systematic errors.

• Can lead to overfitting: If not properly tuned, boosting can sometimes overfit, especially with noisy data.

• Common Algorithms: AdaBoost, Gradient Boosting Machines (GBM), XGBoost, LightGBM, CatBoost.

Applications:

• Fraud detection: Highly effective in identifying rare fraud patterns.

• Customer churn prediction: Accurately predicting which customers are likely to leave.

• Image and speech recognition: Achieving state-of-the-art performance in various complex tasks.

• Ranking problems: In search engines and recommendation systems.

Other Ensemble Strategies (Briefly Mentioned)

• Stacking (Stacked Generalization): Trains multiple base models (often diverse types) and then trains a "meta-model" (or "meta-learner") on the predictions of the base models to make the final prediction. This allows the meta-model to learn how to best combine the strengths of different base learners.

• Voting: A simple ensemble method where multiple models (can be different types) are trained independently, and their predictions are combined through a simple voting mechanism (e.g., majority vote for classification, averaging for regression).

Applications of Ensemble Modeling (General)

Ensemble methods are widely applied across various industries and problems due to their superior performance and robustness:

• Healthcare: Disease diagnosis and prognosis (e.g., predicting cancer recurrence), drug discovery.

• Finance: Fraud detection, credit scoring, algorithmic trading, risk assessment.

• E-commerce: Recommendation systems, customer churn prediction, personalized marketing.

• Image and Speech Recognition: Object detection, facial recognition, natural language processing tasks.

• Manufacturing: Predictive maintenance, quality control.

• Environmental Science: Weather forecasting, climate modeling.

• Sports Analytics: Predicting game outcomes, player performance.
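To make the boosting procedure in steps 1–5 concrete, here is a minimal AdaBoost-style sketch, assuming scikit-learn and NumPy and using a synthetic dataset with an arbitrary number of rounds. Production work would normally rely on scikit-learn's AdaBoostClassifier or GradientBoostingClassifier, or on libraries such as XGBoost or LightGBM; similarly, scikit-learn's VotingClassifier and StackingClassifier implement the voting and stacking strategies described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data; labels encoded as -1/+1 for the weighted vote.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
y = np.where(y == 1, 1, -1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_rounds = 50
weights = np.full(len(X_train), 1 / len(X_train))   # start with uniform instance weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Sequential training: fit a weak learner (a decision stump) on the weighted data.
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X_train, y_train, sample_weight=weights)
    pred = stump.predict(X_train)

    # Weight adjustment: weighted error rate and the model's own weight (alpha).
    err = np.sum(weights[pred != y_train]) / np.sum(weights)
    err = np.clip(err, 1e-10, 1 - 1e-10)             # guard against division by zero
    alpha = 0.5 * np.log((1 - err) / err)

    # Misclassified instances get larger weights for the next round.
    weights *= np.exp(-alpha * y_train * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Weighted combination: sign of the alpha-weighted sum of weak-learner votes.
scores = sum(a * s.predict(X_test) for a, s in zip(alphas, stumps))
accuracy = np.mean(np.sign(scores) == y_test)
print(f"boosted accuracy over {n_rounds} stumps: {accuracy:.3f}")
```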
