Predictive Modeling:
From Data to Decisions
#The Basics
CHESTER ALLAN F. BAUTISTA, MIT
#How Does...?
Netflix/Spotify recommend content?
Email providers filter spam?
Universities identify students at risk of
dropping out?
Social networks suggest friends (“People You May
Know”)?
#Predictive Modeling
Using historical data + algorithms/ML =>
predictions about future/unknown outcomes
Finding patterns in the past to forecast the
future.
Descriptive and Diagnostic Analytics (What happened? Why did it happen?)
Predictive Analytics (What will happen?)
#Why Use?
Make Better Choices
Get What You Want (Personalization)
Save Time and Effort
Spot Problems Early
#Process
Goal -> Data -> Prep -> Model -> Train -> Test
#Two Main Flavors of Prediction
Predicting Categories (Classification)
Predicting Numbers (Regression)
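To make the two flavors concrete, here is a minimal sketch using scikit-learn decision trees; the feature values and targets are invented toy data, not from the slides. The same features can feed either a classifier (category out) or a regressor (number out).

```python
# Minimal sketch: same features, two different kinds of prediction.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical features, e.g., [humidity, cloudy_today]
X = [[85, 1], [40, 0], [78, 1], [30, 0]]

# Classification: the target is a category ("yes"/"no")
y_class = ["yes", "no", "yes", "no"]
clf = DecisionTreeClassifier().fit(X, y_class)
print(clf.predict([[70, 1]]))   # -> a category label

# Regression: the target is a number (e.g., millimetres of rain)
y_reg = [12.5, 0.0, 7.3, 1.1]
reg = DecisionTreeRegressor().fit(X, y_reg)
print(reg.predict([[70, 1]]))   # -> a number
```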
#Past Data (Features & Target)
Target: The thing you want to predict (e.g.,
'Will it rain tomorrow?').
Features: The pieces of past information used
to make the prediction (e.g., today's humidity and cloud cover).
Model learns: How do the Features relate to the
Target?
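A minimal sketch of how features and a target sit side by side in a table, using pandas. The column names and values are hypothetical, standing in for real historical records.

```python
import pandas as pd

# Hypothetical historical weather records (assumed column names).
past = pd.DataFrame({
    "humidity":      [85, 40, 78, 30],
    "cloudy":        [1, 0, 1, 0],
    "rain_tomorrow": ["yes", "no", "yes", "no"],
})

X = past[["humidity", "cloudy"]]   # Features: information known beforehand
y = past["rain_tomorrow"]          # Target: what we want to predict
```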
#Don't Cheat! Training vs. Testing
Training Data: The practice questions (with
answers) the model studies from.
Testing Data: A separate set of questions held
back for the final exam.
Why? To avoid overfitting: memorizing the practice
questions instead of learning the pattern.
We need to know if the model works on new
problems it hasn't seen!
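A small sketch of the split using scikit-learn's train_test_split; the built-in iris dataset stands in here for your own features and target.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# A built-in toy dataset stands in for your own X (features) and y (target).
X, y = load_iris(return_X_y=True)

# "Study material" vs. "exam": the model never sees the test rows while learning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(len(X_train), "training rows,", len(X_test), "testing rows")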
#How Good Was the Guess? (Evaluation)
We need a score! How do we grade the
model's test performance?
For Categories (Classification):
Accuracy: What percentage of predictions were correct?
(Simple, but can be tricky if one category is rare).
For Numbers (Regression):
Average Error (e.g., MAE, Mean Absolute Error): On average, how far off was the
prediction from the real number? (Easy to understand).
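A quick sketch of both grading schemes using scikit-learn's metrics; the "true" and "predicted" values below are made up purely for illustration.

```python
from sklearn.metrics import accuracy_score, mean_absolute_error

# Classification: what share of category guesses were correct?
y_true_cat = ["rain", "dry", "rain", "dry"]
y_pred_cat = ["rain", "dry", "dry",  "dry"]
print(accuracy_score(y_true_cat, y_pred_cat))       # 0.75 -> 75% correct

# Regression: on average, how far off were the numeric guesses?
y_true_num = [10.0, 0.0, 7.5]
y_pred_num = [12.0, 1.0, 7.0]
print(mean_absolute_error(y_true_num, y_pred_num))  # about 1.17
```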
#What about Python?
Python has great tools (libraries) to help.
Pandas: For handling and preparing your data
(the ingredients).
Scikit-learn (sklearn): The 'Swiss Army Knife'
for predictive modeling!
Has tools for: Data Splitting (Train/Test), Prepping Data, Many Model
Recipes (Classification/Regression), Evaluation Scores.
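To tie the pieces together, here is a minimal end-to-end sketch with pandas and scikit-learn. The student records (column names and values) are invented for illustration: prepare the table, split it, fit a model, and score it on the held-out rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical student records; real column names and values will differ.
data = pd.DataFrame({
    "attendance":  [95, 60, 80, 40, 88, 55, 70, 30],
    "avg_grade":   [90, 65, 78, 50, 85, 60, 72, 45],
    "dropped_out": [0, 1, 0, 1, 0, 1, 0, 1],
})

X = data[["attendance", "avg_grade"]]   # features
y = data["dropped_out"]                 # target (category: 0/1)

# Split, fit, predict, score: the whole recipe in a few lines.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```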
#Recap?
Predictive Modeling = Using the Past -> Predict the
Future.
Why? Better choices, personalization, efficiency.
Recipe: Goal -> Data -> Prep -> Model -> Train -> TEST!
Flavors: Categories (Classification) vs. Numbers
(Regression).
Testing on unseen data is crucial (avoid overfitting).
#Recap?
Evaluate how good the predictions are (metrics like
accuracy/average error).
Python (Pandas, Sklearn) helps make it happen.
#Gratitude?
Thank you!
#References
For Core Concepts & Process (Data Mining Perspective): Han, J., Kamber, M., & Pei, J. (2011). Data mining: Concepts and techniques (3rd ed.). Morgan Kaufmann Publishers.
For Core Concepts & Process (Alternative Data Mining Perspective): Tan, P. N., Steinbach, M., & Kumar, V. (2019). Introduction to data mining (2nd ed.). Pearson.
For Conceptual Understanding with Statistical Learning Focus: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning: With applications in R (2nd ed.). Springer. (Note: Although the examples are in R, the conceptual explanations of classification, regression, train/test splits, overfitting, and basic evaluation are excellent and widely applicable.)
For Concepts Linked Directly to Python/Scikit-learn: Müller, A. C., & Guido, S. (2016). Introduction to machine learning with Python: A guide for data scientists. O'Reilly Media.
For Python Libraries Mentioned:
Scikit-learn Documentation: Scikit-learn Developers. (n.d.). Scikit-learn: Machine learning in Python. Retrieved April 26, 2025, from https://scikit-learn.org/stable/
Pandas Documentation: The pandas development team. (2024). pandas documentation. https://pandas.pydata.org/docs/