Introduction to Predictive Analytics
Predictive Analytics Principles (Lecture 1)
Nicola Armstrong
EECMS, Curtin University
February 25, 2025
Outline
1. Course logistics
2. What is predictive analytics?
3. Applications of Predictive Analytics
4. Regression and Classification
5. Predictive Analytics workflow
6. Some important concepts
7. Data preprocessing and feature engineering
8. Overfitting and model tuning
9. Four datasets
10. R programming and reproducibility
11. Summary
Unit objectives
▶ Categorise the purpose of different predictive analytic techniques and the kind of data to which they can be applied.
▶ Employ exploratory data analysis, data cleaning, and data wrangling (transformation) in the context of a business problem.
▶ Analyse the data using several predictive models, evaluate them to select the best model, and communicate the results by appropriate visualization.
▶ Learn the powerful statistical programming language R for fitting predictive models to real-world datasets.
Program Calendar (First half)
Program Calendar (Second half)
Book Recommendations
Assessments
Predicting the future based on data
Predictive analytics uses past data to capture relationships between predictors and an outcome variable, and exploits them to predict future responses.
Figure 1: Machine (algorithm) used in predicting the future.
Learning versus Design
▶ In predictive modeling, we learn from data; we do not design an algorithm for predicting the outcome.
▶ Learning is from Data, whereas Design is from a Specification.
Figure 2: Design-based algorithm versus Data-based algorithm.
Classic problems in Predictive Modeling
▶ Medical Diagnosis: Based on a patient’s medical record and symptoms, a doctor wants to prescribe the most appropriate treatment.
▶ Handwritten digit recognition: Postal zip code recognition for sorting mail.
▶ Email, Spam vs Ham: A spam filter is used for predicting spam emails.
Figure 3: Classic problems in Predictive Modeling.
Connection to machine learning
Predictive analytics is similar to supervised learning, where data on predictors and on a target variable are used to train a machine learning model to predict the target variable for new data on predictor variables.
Figure 4: Supervised Learning (image classification).
Unsupervised learning
In unsupervised learning, the data carry no labels, that is, there is no observed target variable. Our main aim is to find groups/clusters of similar objects in the unlabelled data.
Figure 5: Unsupervised Learning (pattern in unlabeled data).
Examples of Predictive Analytics
Figure 6: Predictive analytics across industries.
Regression
In prediction, when the response variable is quantitative, it is a regression problem.
Figure 7: Training data on predictor and response (Left); Non-linear regression function (Right).
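As a rough illustration of a regression problem, the sketch below fits a straight line by least squares, the simplest case of a regression function (the figure shows a non-linear one). It is written in Python purely for illustration; the unit itself works in R. The data values and the function name `fit_line` are made up for this example.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for the line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up training data: one predictor and a quantitative response.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.1, 3.9, 6.2, 7.8]

intercept, slope = fit_line(train_x, train_y)
prediction = intercept + slope * 5.0  # predicted response for a new x = 5
print(intercept, slope, prediction)
```

The fitted line is learned from the training pairs only; the prediction for x = 5 is an extrapolation to a predictor value the model has not seen.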
Classification
In prediction, when the response variable is categorical, it is a classification problem.
Figure 8: Training data on predictors and a binary response (Left); Linear classification boundary (Right).
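One simple way to obtain a linear classification boundary is a nearest-centroid rule: a new point is assigned to the class whose training mean is closest. The sketch below is only an illustration in Python (the unit itself works in R), not the method used to draw the figure; the points, labels, and function names are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def fit_centroids(points, labels):
    """Group the training points by class and return one centroid per class."""
    by_class = {}
    for point, label in zip(points, labels):
        by_class.setdefault(label, []).append(point)
    return {label: centroid(ps) for label, ps in by_class.items()}


def predict(centroids, point):
    """Return the class whose centroid is nearest to the point."""
    return min(centroids, key=lambda lab: squared_distance(centroids[lab], point))


# Made-up training data: two predictors and a binary response.
train_points = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
train_labels = ["no", "no", "no", "yes", "yes", "yes"]

centroids = fit_centroids(train_points, train_labels)
pred_a = predict(centroids, (2, 2))
pred_b = predict(centroids, (5, 4))
print(pred_a, pred_b)
```

With two classes, the set of points equally distant from both centroids is a straight line, which is why this rule produces a linear boundary.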
Predictive analytics cycle across industries
Sample, Explore, Modify, Model, Assess
▶ Sample: Generate a representative sample of the data
▶ Explore: Visualize and summarize the data
▶ Modify: Variable transformation and selection
▶ Model: Fit prospective predictive models
▶ Assess: Evaluate models to find the best model
Figure 9: SEMMA: analytics paradigm by SAS Institute.
Typical project timeline
Inference versus Prediction
▶ When we build a model to understand the State of Nature, we are interested in Inference.
  ▶ Example: (i) Predicting disease outcome or (ii) effect of a new mobile application on the customers.
  ▶ Typically involves testing the significance of model parameters.
▶ In Prediction, the main goal is to make accurate predictions, irrespective of whether or not we understand the underlying state of nature.
  ▶ Example: (i) Recommending movies or (ii) predicting credit score.
  ▶ Typically involves estimation of the unknown model parameters and optimising them for better prediction.
Parsimony versus Accuracy
▶ While building a model, there will always be a trade-off between Parsimony and Accuracy.
▶ A Parsimonious model is more interpretable.
▶ However, Accuracy should not be seriously sacrificed for the sake of simplicity.
▶ Complex models can remedy poor accuracy, but often act as black boxes.
Correlation is not Causation
▶ “Correlation is not causation” means that just because two things correlate does not necessarily mean that one causes the other.
▶ Correlations between two things may be caused by a third factor that affects them both. This hidden factor is called a confounder.
Figure 10: Source: https://xkcd.com/552/
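The effect of a confounder can be sketched with a small simulation, written here in Python purely for illustration (the unit itself works in R). In this assumed setup, a hidden factor `z` drives both `x` and `y`, which have no direct effect on each other; all names, the seed, and the noise levels are arbitrary choices for the example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]       # hidden confounder
x = [zi + rng.gauss(0, 0.5) for zi in z]      # depends on z only
y = [zi + rng.gauss(0, 0.5) for zi in z]      # depends on z only

raw_corr = pearson(x, y)
# Remove the confounder's contribution and correlate what is left.
adjusted_corr = pearson(
    [xi - zi for xi, zi in zip(x, z)],
    [yi - zi for yi, zi in zip(y, z)],
)
print(raw_corr, adjusted_corr)
```

Under these assumptions `x` and `y` are strongly correlated, yet once the contribution of `z` is subtracted the remaining correlation is close to zero.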
Simpson’s Paradox (be wary of observational studies)
Simpson’s paradox occurs when the relationship between two variables changes or reverses in the presence of a third variable. It is often the result of a missing confounder.
Figure 11: Source: Doing Data Science (Ch 11) by Schutt and O’Neil.
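A reversal of this kind can be reproduced with a few counts. The sketch below uses Python purely for illustration (the unit itself works in R), and the numbers are hypothetical, chosen only so that treatment A has the higher success rate within each severity group while treatment B has the higher rate once the groups are pooled.

```python
# Hypothetical (successes, trials) for two treatments in two severity groups.
counts = {
    "mild": {"A": (9, 10), "B": (80, 100)},
    "severe": {"A": (30, 100), "B": (2, 10)},
}

# Success rate of each treatment within each group.
within = {
    group: {arm: s / n for arm, (s, n) in arms.items()}
    for group, arms in counts.items()
}

# Success rate of each treatment after pooling the groups.
overall = {}
for arm in ("A", "B"):
    successes = sum(counts[group][arm][0] for group in counts)
    trials = sum(counts[group][arm][1] for group in counts)
    overall[arm] = successes / trials

print(within)
print(overall)
```

The reversal arises because severity acts as a confounder here: treatment A is given mostly to severe cases, which have a low success rate under either treatment.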
Predictors and their representation
Predictors and how they are represented (Features) in the model are critical for building an accurate predictive model.
▶ Example 1: For predicting house prices, predictors could be (i) area of the house, (ii) number of bedrooms and (iii) number of bathrooms, whereas derived features could be (i) bedrooms per bathroom and (ii) log(area).
▶ Example 2: For a predictor X, features could be any transformation of X, including X, X^2, 1/X, cos(X), or log(X), that is included in the model for predicting the outcome.
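The derived features from Example 1 can be computed directly from the raw predictors. The following is a minimal sketch in Python, used purely for illustration (the unit itself works in R); the function name `derive_features` and the house values are made up.

```python
import math


def derive_features(area, bedrooms, bathrooms):
    """Return the raw predictors together with two derived features."""
    return {
        "area": area,
        "bedrooms": bedrooms,
        "bathrooms": bathrooms,
        "bedrooms_per_bathroom": bedrooms / bathrooms,
        "log_area": math.log(area),  # natural logarithm of the area
    }


# A made-up house: 200 square metres, 4 bedrooms, 2 bathrooms.
house = derive_features(area=200.0, bedrooms=4, bathrooms=2)
print(house)
```

Whether the model should see the raw predictors, the derived features, or both is itself a modelling decision to be assessed on predictive performance.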
Feature Engineering
‘The idea that there are different ways to represent predictors in a model, and that some of these representations are better than others, leads to the idea of Feature Engineering - the process of creating representations of data that increase the effectiveness of a model’
Max Kuhn and Kjell Johnson
Example: Feature engineering improves predictive performance
Figure 12: Inverse transformations of both predictors improve the performance by 7%; this example is obtained from Ch. 1.1 of the book Feature Engineering and Selection by Max Kuhn and Kjell Johnson.
Overfitting and model tuning
▶ Overfitting occurs when a model fits the training data very well, but predicts poorly on new samples.
▶ Our main aim is to avoid overfitting and to produce a model that generalizes well.
▶ The balance between over- and underfitting in a predictive model is controlled by one or more tuning (or hyper) parameters. Our aim is to select tuning parameters that optimize the predictive performance.
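The role of a tuning parameter can be sketched with a k-nearest-neighbour regression on toy data, written in Python purely for illustration (the unit itself works in R). Here k is the tuning parameter. The data are deliberately artificial: the true relationship is y = x, and an alternating +1/-1 disturbance stands in for noise, so the example is deterministic; all names are hypothetical.

```python
def knn_predict(train_x, train_y, x, k):
    """Average the responses of the k training points closest to x."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


def mean_squared_error(xs, ys, train_x, train_y, k):
    errors = [y - knn_predict(train_x, train_y, x, k) for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / len(errors)


# Toy training data: y = x plus an alternating +1 / -1 disturbance.
train_x = [float(i) for i in range(10)]
train_y = [x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)]

# New samples between the training points, with the undisturbed truth y = x.
test_x = [i + 0.25 for i in range(9)]
test_y = list(test_x)

results = {}
for k in (1, 2):
    train_mse = mean_squared_error(train_x, train_y, train_x, train_y, k)
    test_mse = mean_squared_error(test_x, test_y, train_x, train_y, k)
    results[k] = (train_mse, test_mse)
    print(k, train_mse, test_mse)
```

With k = 1 the model reproduces every training response exactly, disturbance included, and so predicts the new samples worse than k = 2, which averages neighbouring disturbances away. In practice k would be chosen by comparing performance on data held out from training.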
Assessing predictive performance (Train / Test Split)
Because our goal is to produce a model that generalizes well, we need to test the predictive performance on new, unseen data.
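A random train/test split can be sketched as follows, in Python purely for illustration (the unit itself works in R). The function name, the 30% test fraction, and the seed are assumptions made for this example, not values prescribed by the unit.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Ten made-up observations, identified here only by their row number.
data = list(range(10))
train_rows, test_rows = train_test_split(data, test_fraction=0.3)
print(train_rows, test_rows)
```

The model is fitted on `train_rows` only; `test_rows` are set aside so that performance is measured on observations the model has not seen.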
Assessment Metrics (Classification)
The Confusion Matrix is a key element we need to construct for assessing a Classification model.
\[
\text{Accuracy} = \frac{TP + TN}{N}, \qquad
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP},
\]
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, and N is the total number of cases.
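These three metrics follow directly from the confusion-matrix counts. The sketch below is in Python purely for illustration (the unit itself works in R); the labels are made up, with "spam" treated as the positive class, and the function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for a binary response."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p != positive:
            tn += 1
        elif a != positive and p == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Made-up test-set labels and predictions.
actual = ["spam"] * 4 + ["ham"] * 6
predicted = ["spam", "spam", "spam", "ham",
             "ham", "ham", "ham", "ham", "spam", "ham"]

tp, tn, fp, fn = confusion_counts(actual, predicted, positive="spam")
metrics = classification_metrics(tp, tn, fp, fn)
print((tp, tn, fp, fn), metrics)
```

In this example accuracy is 0.8, sensitivity is 0.75, and specificity is 5/6, which shows that the three metrics can differ on the same set of predictions.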
Assessment Metrics (Regression)
The two most common metrics for assessing a Regression model are Root mean squared error (RMSE) and Mean absolute error (MAE):
\[
\mathrm{RMSE} = \sqrt{\frac{1}{N_{\mathrm{test}}} \sum_{j=1}^{N_{\mathrm{test}}} \mathrm{Error}_j^{2}}, \qquad
\mathrm{MAE} = \frac{1}{N_{\mathrm{test}}} \sum_{j=1}^{N_{\mathrm{test}}} \left| \mathrm{Error}_j \right|.
\]
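Both metrics are short computations on the test-set errors. The sketch below is in Python purely for illustration (the unit itself works in R); it takes the error for observation j to be the observed response minus the predicted response, and the values are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error over the test observations."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(actual, predicted):
    """Mean absolute error over the test observations."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return sum(abs(e) for e in errors) / len(errors)


# Made-up test responses and model predictions.
y_test = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]

rmse_value = rmse(y_test, y_pred)
mae_value = mae(y_test, y_pred)
print(rmse_value, mae_value)
```

Here the errors are 1, 0, -2 and 0, so the RMSE is the square root of 1.25 and the MAE is 0.75. Squaring gives the single large error more weight in the RMSE than it has in the MAE.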
Data on USA gun deaths
Gapminder data
German Credit Risk data
Caravan Insurance data
Predictive analytics the R way
▶ R is a free, open-source, multi-platform statistical computing environment, and has become a standard platform for statistical analysis.
▶ Very easy to install and start producing results with a minimal amount of coding.
▶ Several thousand packages are available via CRAN, which are easy to install and use for data analytics projects.
▶ You can create your own R package and submit it to CRAN for everyone to use.
Figure 13: Created in 1991 by Ross Ihaka and Robert Gentleman from the University of Auckland, New Zealand; it uses the GNU General Public License.
Reproducible analytics
The reproducibility crisis is a big problem in scientific research. Analytics work should be published with its data and software code so that others can verify the findings and build upon them.
Figure 14: Published in 2005 by Prof. John P.A. Ioannidis, Stanford.
Summary
A. We use data on predictors and the target variable to learn a predictive model.
B. A predictive model for a quantitative target is called a Regression model, and one for a categorical target a Classification model.
C. Exploratory data visualisation, data cleaning, and feature engineering are key steps in a predictive analytics pipeline. Creation of good features from raw predictors often improves predictive performance.
D. Our main goal is to predict well for new data, and therefore it is important not to overfit the training data.
E. Splitting the data into Training and Testing sets is a key step for model tuning and model selection.
F. Reproducibility of the predictive analytics workflow is important for the credibility of the prescribed model.