Regression
Welcome to Numerical targets!
Big Data, Machine Learning, and their Real World Applications
Pre-College Program
Columbia University, SPS
Regression Concepts
• Intro to Regression
• Linear Regression
• Mean Squared Error (MSE)
• Other models for regression
• Metrics for Regression
• Time Series: a special case of regression
• Housing Data Lab
Which tasks are regression tasks?
• Predicting the highest price passengers are willing to pay for a taxi.
• Forecasting how many tickets will be sold for morning and evening
movie showings.
• Identifying the age group of a person: under 18, under 35, 35+.
• Determining how many days visa processing will take based on the
country, entry permit type, and the applicant's details.
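Which tasks are regression tasks? (Answers)
• Predicting the highest price passengers are willing to pay for a taxi. Regression: the target is a number (a price).
• Forecasting how many tickets will be sold for morning and evening movie showings. Regression: the target is a count.
• Identifying the age group of a person: under 18, under 35, 35+. Not regression: the target is one of three categories, so this is classification.
• Determining how many days visa processing will take based on the country, entry permit type, and the applicant's details. Regression: the target is a number of days.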
Linear Regression: The Basics
• Fits the best line to the observed data
• y = mx + b
• Finds the slope (m) and intercept (b) that minimize the sum of squared errors (least squares)
• Can be done in multiple dimensions (multiple features), where each feature gets its own coefficient
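A minimal sketch of fitting y = mx + b by least squares. The data points are made up for illustration, and NumPy's polyfit is used here only as one convenient way to get the least-squares line:

```python
import numpy as np

# Hypothetical toy data: y is roughly 2*x + 1 with a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# A degree-1 polynomial fit returns the least-squares slope and intercept.
m, b = np.polyfit(x, y, deg=1)

# Predictions from the fitted line.
y_pred = m * x + b

print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
```

The fitted slope and intercept land close to the 2 and 1 used to make up the data, which is what "best line" means here: no other line has a smaller sum of squared errors on these points.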
Mean Squared Error
• Cost function, also used for evaluation
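• $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $\hat{y}_i$ is the predicted value and $y_i$ is the expected (true) value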
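Mean squared error averages the squared differences between the expected and predicted values. A small worked example with made-up numbers:

```python
import numpy as np

expected = np.array([3.0, 5.0, 7.0, 9.0])   # true target values (hypothetical)
predicted = np.array([2.5, 5.0, 8.0, 8.0])  # model outputs (hypothetical)

# Error for each sample, then the mean of the squared errors.
errors = expected - predicted
mse = np.mean(errors ** 2)

print(mse)  # (0.25 + 0 + 1 + 1) / 4 = 0.5625
```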
Mean Absolute Error
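• $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$, the average absolute difference between the expected value $y_i$ and the predicted value $\hat{y}_i$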
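A Note on Target Distributions: When to Use MSE vs MAE?
• MSE is a good choice when your target variable follows a normal or otherwise symmetric distribution. Squaring makes large errors count much more, so MSE is sensitive to outliers.
• MAE is a good choice when your target variable follows a skewed distribution, because every error counts in proportion to its size and a few extreme values dominate less.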
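To see why the shape of the target distribution matters, here is a sketch comparing the two metrics before and after one extreme target value is introduced. The numbers and the helper names mse and mae are made up for illustration:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = np.array([11.0, 11.0, 12.0, 12.0, 13.0])  # every prediction is off by 1

print(mse(y_true, y_pred), mae(y_true, y_pred))

# Replace the last target with an extreme value (a long right tail).
y_skewed = y_true.copy()
y_skewed[-1] = 62.0

print(mse(y_skewed, y_pred), mae(y_skewed, y_pred))
```

With one extreme value, MSE grows from 1 to 481 while MAE grows from 1 to 10.6, so a single point dominates MSE far more than it dominates MAE.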
Coefficient of Determination – R² score (Goodness of Fit)
• Usually between 0 and 1
• Can be negative if the model fits worse than always predicting the mean of the target
• The closer to 1, the better the fit
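A sketch of the standard definition, R² = 1 − SS_res / SS_tot, showing the three cases from the bullets above. The data and the helper name r2 are illustrative:

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

good = np.array([1.1, 1.9, 3.2, 3.9, 5.1])       # close to the truth
mean_only = np.full_like(y_true, y_true.mean())  # always predicts the mean
bad = y_true[::-1].copy()                        # predicts the opposite trend

print(r2(y_true, good))       # close to 1
print(r2(y_true, mean_only))  # exactly 0
print(r2(y_true, bad))        # negative
```

Predicting the mean every time scores 0, so a negative score means the model is doing worse than that baseline.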
Random Forest Regressor
• Random forests are a collection of decision trees where the predictions of all trees are averaged.
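The averaging can be checked directly with scikit-learn. This is a sketch on synthetic data; the data-generating function, the number of trees, and the query points are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

X_new = np.array([[2.5], [7.0]])

# Ask every individual tree for its prediction, then average them by hand.
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(manual_average)
print(forest.predict(X_new))  # matches the manual average
```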
Which Model to Choose? Underfitting vs Overfitting
How to choose a model?
• Usually we try a few different models (algorithms), such as linear regression, a decision tree, and a random forest.
• We can tune the hyperparameters of each model to see which settings perform best, then compare the tuned models.
• We stick to one evaluation metric for all models, such as MSE or MAE. Using several metrics is fine as long as every model is scored on all of them.
• We also consider whether a simpler or a more complex model suits our application.
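One way the comparison could look in scikit-learn. This is a sketch on synthetic data: the data, the hyperparameter values, and the split size are all assumptions, and MSE is used for every model:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption): a linear signal in two features plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Score every model with the same metric on the same train/test split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[name] = (train_mse, test_mse)

for name, (train_mse, test_mse) in results.items():
    print(f"{name}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

A training error far below the test error is a sign of overfitting, while high error on both is a sign of underfitting.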
Housing Data Lab
• Load the housing data from Kaggle: https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
• Split the data into training and testing sets, then fit a linear regression model that uses at least one feature to predict the price of the house
• Use model.coef_ to see the coefficients of the fitted linear regression
• If you use categorical features, make sure you encode them first
• What are the first 5 residuals?
• Calculate the MSE of the first 5 predictions.
• What is the R^2 value of the model's overall fit? Use sklearn.metrics.r2_score()
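A possible outline of the lab steps, not a reference solution. So that it runs without downloading anything, it builds a synthetic stand-in table. The column names area, mainroad, and price and the file name in the comment are assumptions, so check them against the real CSV:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Kaggle data (assumed column names).
# With the real file you would load it instead, e.g. pd.read_csv("Housing.csv").
rng = np.random.default_rng(0)
n = 200
area = rng.uniform(1500, 10000, size=n)
mainroad = rng.choice(["yes", "no"], size=n)
price = 500 * area + 400_000 * (mainroad == "yes") + rng.normal(0, 300_000, size=n)
df = pd.DataFrame({"area": area, "mainroad": mainroad, "price": price})

# Encode the categorical feature as a 0/1 column.
X = pd.get_dummies(df[["area", "mainroad"]], drop_first=True, dtype=float)
y = df["price"]

# Split first, then fit on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# One coefficient per feature column, plus the intercept.
print(dict(zip(X.columns, model.coef_)), model.intercept_)

# Residual = actual value minus predicted value.
y_pred = model.predict(X_test)
residuals = y_test.to_numpy() - y_pred
print(residuals[:5])

# MSE of the first 5 predictions, and R^2 over the whole test set.
mse_first5 = mean_squared_error(y_test.iloc[:5], y_pred[:5])
r2 = r2_score(y_test, y_pred)
print(mse_first5, r2)
```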
Time Series: Autoregression
• Autoregression: when you use past values of a variable to predict the same variable at a later time (the future).
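A minimal sketch of autoregression with a single lag: the feature is the series one step in the past and the target is the series itself. The series is synthetic, and the 0.8 used to generate it is an arbitrary choice for illustration:

```python
import numpy as np

# Synthetic series (an assumption): each value is 0.8 times the previous one plus noise.
rng = np.random.default_rng(1)
n = 500
series = np.zeros(n)
for t in range(1, n):
    series[t] = 0.8 * series[t - 1] + rng.normal(0, 1.0)

# Feature: the value at time t-1 (the past). Target: the value at time t (the future).
past = series[:-1]
future = series[1:]

# Ordinary linear regression of the future on the past.
phi, c = np.polyfit(past, future, deg=1)
print(f"estimated lag coefficient = {phi:.3f}, intercept = {c:.3f}")

# One-step-ahead forecast from the last observed value.
next_value = phi * series[-1] + c
print(next_value)
```

The estimated lag coefficient should land near the 0.8 used to generate the series, which shows that the same regression tools apply once the past is treated as a feature.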
Final Project Specifications
• 15-20 minute presentations + 5 minutes for questions
• Use slides: Introduction, Analysis, Conclusions, Future work
• Show your code
• Data Analysis Project:
• At least 5 visualizations with conclusions.
• Make sure you tell a story with your data.
• Machine Learning Project:
• At least two models (except if using neural networks), compared using the same metric (MSE, precision, etc.)
• What challenges did you face? Talk about hyperparameter tuning, model
comparison, metrics, etc.