CSE 445: Machine Learning
Machine Learning Project – Data Exploration
Image taken from xkcd
Steps in an ML Project
1. Look at the big picture
2. Get the data
3. Visualize data for insights
4. Preprocess the data
5. Select a model and train it
6. Fine-tune your model
7. Present solution
8. Launch!
Prediction gone wrong?
Source: fivethirtyeight
Example Problem: California Housing
Objective: Use census data from California to build a model of housing prices
Source: StatLib repository
Frame the Problem
Task: Predict the median housing price in ANY district in CA, given all the other metrics
The dataset includes metrics such as population, median household income, and total rooms – as well as the median housing price – for each district
The model's output will be fed as one signal into a downstream system that decides whether to invest in a given area
Data Pipeline
(Diagram: the district pricing model is one component in a larger pipeline feeding an investment project)
What type of Problem is it?
Supervised, unsupervised, reinforcement?
We have the labeled training examples – the median housing price is
included in the dataset. We can use Supervised Learning!
Classification or Regression?
The desired output is a continuous variable – this is a regression task
Multiple features available - it’s a multiple regression problem
Only one value to be predicted – univariate regression problem
Batch Learning or Online Learning?
No incoming data other than dataset – batch learning is fine
Performance Measure
Root Mean Square Error (RMSE) or Mean Absolute Error (MAE):

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h\left(\mathbf{x}^{(i)}\right) - y^{(i)}\right)^{2}}$$

$$\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\left|h\left(\mathbf{x}^{(i)}\right) - y^{(i)}\right|$$

m – number of instances in the dataset
x^(i) – vector of all feature values for instance i; y^(i) – label value for instance i
X – matrix containing all feature values (excluding the labels) of all instances in the dataset
h – the system's hypothesis (prediction function)
RMSE(X, h) or MAE(X, h) – cost function measured on the set of examples using your hypothesis h
Performance Measure
Error – i.e. the difference between the true label value y^(i) and the predicted label value h(x^(i))
RMSE and MAE are two ways to measure the distance between two vectors – namely, the distance between your predicted labels and the true label values
RMSE – straight-line distance, or Euclidean norm, or L2 norm
MAE – city-block distance, or Manhattan norm, or L1 norm
The higher the norm index, the more sensitive the measure is to outliers
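A minimal sketch of both measures in NumPy (the arrays are made-up values, just for illustration):

```python
import numpy as np

y_true = np.array([250_000, 320_000, 180_000])  # true median house values y
y_pred = np.array([240_000, 350_000, 200_000])  # predictions h(x)

errors = y_pred - y_true
rmse = np.sqrt(np.mean(errors ** 2))  # L2-style: squaring punishes outliers more
mae = np.mean(np.abs(errors))         # L1-style: every error counts linearly
print(rmse, mae)
```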
Download Data
Automate the process of fetching data, because:
It's useful if the data changes regularly (the fetch can be scheduled)
The same script can be installed on multiple machines (see the sketch below)
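A sketch of such a fetch function, modeled on the common handson-ml setup – the URL and local paths are assumptions, so point them at wherever your copy of the data lives:

```python
import os
import tarfile
import urllib.request

# Assumed location of the dataset archive (replace with your own source)
HOUSING_URL = ("https://raw.githubusercontent.com/ageron/handson-ml2/"
               "master/datasets/housing/housing.tgz")
HOUSING_PATH = os.path.join("datasets", "housing")

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download and extract the housing archive; safe to re-run on a schedule."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
```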
Quick Look at Data Structure
Each row represents one district (i.e. one instance)
Ten attributes (features) per district
20,640 instances in total
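Loading and taking that quick look with pandas (the file path follows the fetch sketch above):

```python
import os
import pandas as pd

def load_housing_data(housing_path=os.path.join("datasets", "housing")):
    return pd.read_csv(os.path.join(housing_path, "housing.csv"))

housing = load_housing_data()
housing.head()    # first five districts
housing.info()    # attribute types and non-null counts
housing["ocean_proximity"].value_counts()  # the one categorical attribute
```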
Histogram
Plot a histogram for each
numerical attribute
Noticeable oddities:
1. Preprocessed features – median income isn't in raw USD; it has been scaled by a factor of 10,000 and capped at 15
2. The output (median house value) has also been capped – ergo, the model will not be able to predict beyond $500,000!
3. Many of the histograms are tail-heavy
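One call plots a histogram per numerical attribute (matplotlib assumed):

```python
import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(20, 15))  # one histogram per numerical attribute
plt.show()
```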
Test Set Generation – Random Sampling
Create and set aside the test set as early as possible! (avoids data snooping bias)
Use a fixed random seed to ensure the same shuffled indices on every run (if your dataset gets updated, this alone will not work!)
Instead, compute a hash of each instance's identifier and put the instance in the test set if its hash is less than or equal to 20% of the maximum hash value (this keeps the split consistent across multiple runs – see the sketch below)
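A sketch of the hash-based split, along the lines of the handson-ml recipe (using the row index as the identifier is an assumption – a stable ID column is better if you have one):

```python
from zlib import crc32

import numpy as np

def is_in_test_set(identifier, test_ratio):
    # The instance lands in the test set if its hash falls in the bottom test_ratio
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

housing_with_id = housing.reset_index()  # adds an `index` column to hash
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
```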
Stratified Sampling
Need a test set that is representative of the overall population
Strata – homogeneous subgroups
e.g. the male-to-female ratio in Bangladesh is 103 to 100
Stratified sampling – sampling from each stratum so the test set is representative of the overall population
The test set should then have 103 males for every 100 females
Stratified Sampling
Without stratified sampling, the sampling bias in the test set is generally much higher
We want the test set data to resemble the real-world data
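A sketch using Scikit-Learn's StratifiedShuffleSplit, stratifying on a binned income category (the bin edges are an assumption, chosen to give reasonably sized strata):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Bin median income into five strata (assumed bin edges)
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in splitter.split(housing, housing["income_cat"]):
    strat_train_set = housing.iloc[train_idx]
    strat_test_set = housing.iloc[test_idx]

# Drop the helper column once the split is done
strat_train_set = strat_train_set.drop("income_cat", axis=1)
strat_test_set = strat_test_set.drop("income_cat", axis=1)
```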
Data Visualization
Use latitude and longitude to visualize the district locations
(Figures: "Bad Visualization" vs. "Slightly Better Visualization" of the district locations)
Data Visualization
Best visualization – show median house value using color!
High values are visible around metro areas near the ocean (LA, SF, SD, etc.)
Note: only do this exploration on the training set!
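A sketch of the color-coded plot (column names as in the housing dataset):

```python
import matplotlib.pyplot as plt

housing = strat_train_set.copy()  # explore a copy of the training set only
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"] / 100,  # marker size ~ district population
             c="median_house_value", cmap="jet", colorbar=True,
             label="population", figsize=(10, 7))
plt.legend()
plt.show()
```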
Correlation between attributes
Correlation coefficient near 1: strong positive correlation – if x goes up, y goes up too
Correlation coefficient near -1: strong negative correlation – if x goes up, y goes down
Correlation coefficient near 0: no linear correlation
This doesn't mean there's no relationship between the two!
Note: The size of the slope has nothing to do with how correlated the two attributes are!
In the housing dataset, median house value is positively correlated with median income (as expected)
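Computing the correlations against the label with pandas:

```python
corr_matrix = housing.corr(numeric_only=True)  # Pearson's r for every attribute pair
corr_matrix["median_house_value"].sort_values(ascending=False)
```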
Scatter matrix visualization of Correlation
The main diagonal shows each attribute's histogram, generated by the pandas library (instead of a perfectly correlated straight line)
Look for patterns in each attribute pair
Shows each attribute's correlation with median house value at a glance
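A sketch with pandas' scatter_matrix, restricted to a few promising attributes:

```python
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income",
              "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
```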
Attribute Combinations
Upward trend visible – and the points are not too dispersed
Price cap visible at $500,000
Fainter horizontal lines at $450,000, $350,000, and $280,000 (why?)
The algorithm may pick up and reproduce these quirks
• Consider other attribute combinations – they may be more revealing (see the sketch below)
• Rooms per household, bedrooms per room, and population per household are more correlated with house value than total rooms, households, or population alone!
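A sketch of the derived attributes and how they rank:

```python
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = (housing["population"]
                                       / housing["households"])

corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)
```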
Missing Data - Numerical
Option 1: Remove the instances (rows) that contain missing data [dropna() in pandas]
Can work if there aren't too many instances with missing data
Option 2: Remove the entire attribute (column) with missing data [drop() in pandas]
Usually worse than option 1, especially if the attribute is useful when available!
Option 3: Set the missing values to some statistical measure (zero, mean, median, etc.) [fillna() in pandas]
Compute the statistic on the training set only – reuse the same value to fill the test set later, during evaluation
Use SimpleImputer in Scikit-Learn to streamline this process (see the sketch below)
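A sketch of option 3 with SimpleImputer (median strategy):

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)  # medians need numeric data
imputer.fit(housing_num)            # learns each column's median from training data
X = imputer.transform(housing_num)  # fills the gaps (e.g. missing total_bedrooms)
```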
Text and Categorical Attributes
Discrete, non-arbitrary text values form a category
Convert categorical values from text to numbers
Option 1: Map each category to a unique number (see the sketch below)
The ML algorithm may assume (incorrectly) that two "nearby" values are more similar than two distant values
<1H OCEAN and NEAR OCEAN are encoded as 0 and 4 – even though they're more similar to each other than to INLAND (encoded as 1)
Only use this if the categories are ordered
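Option 1 with Scikit-Learn's OrdinalEncoder:

```python
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat = housing[["ocean_proximity"]]
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
ordinal_encoder.categories_  # the text-to-integer mapping that was learned
```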
Text and Categorical Attributes
Option 2: One-hot encoding – one dummy binary attribute per category (see the sketch below)
A sparse matrix is generated, storing only the locations of the nonzero elements
May degrade performance if there are too many categories – consider replacing the category with useful numeric features related to it, if possible
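Option 2 with OneHotEncoder:

```python
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()  # returns a SciPy sparse matrix by default
housing_cat_1hot = cat_encoder.fit_transform(housing[["ocean_proximity"]])
housing_cat_1hot.toarray()     # densify only for inspection
```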
Feature Scaling
Usually, ML systems don’t perform well if
input attributes have very different scales
In housing data, total # of rooms 6 to 39,320,
whereas Median Income 0 to 15
Normalization shift and rescale values so
they range from 0 (minimum value) and 1
Normalization:
(maximum value)
Might be adversely affected by outliers
Standardization subtract the mean value,
and divide by standard deviation so that the
resulting distribution has zero mean, and
unit variance Standardization:
Not bound to a specific range (could be
problematic for some algorithms such as NNs)
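Both scalers exist in Scikit-Learn; a minimal sketch (fit on the training data only):

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler()  # normalization: rescales each column to [0, 1]
housing_num_minmax = min_max_scaler.fit_transform(housing_num)

std_scaler = StandardScaler()    # standardization: zero mean, unit variance
housing_num_std = std_scaler.fit_transform(housing_num)
```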
Custom Transformers
Scikit-Learn relies on duck typing – create a class and implement three methods: fit(), transform(), and fit_transform()
Just add TransformerMixin as a base class to get fit_transform() for free
Adding BaseEstimator gives you two extra methods: get_params() and set_params()
Consider the following: add the custom attribute bedrooms_per_room to the set of input attributes (see the sketch below)
One hyperparameter, in the form of add_bedrooms_per_room
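A sketch of such a transformer, in the spirit of the handson-ml example – the hard-coded column indices are assumptions about the column order of the numeric feature array:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Assumed positions of these columns in the numeric feature array
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # the hyperparameter
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]
```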
Pipeline
Ensures the data transformation steps are executed in the right order
The Pipeline constructor takes a list of name/estimator pairs, applied in sequence
A ColumnTransformer handles numerical and categorical data together
housing_prepared contains the data ready for training (see the sketch below)
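Putting the pieces together (housing_labels, split off from the training set here, is used for training below):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Separate predictors and labels from the stratified training set
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
housing_num = housing.drop("ocean_proximity", axis=1)

num_pipeline = Pipeline([  # numeric steps run in this exact order
    ("imputer", SimpleImputer(strategy="median")),
    ("attribs_adder", CombinedAttributesAdder()),
    ("std_scaler", StandardScaler()),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, list(housing_num)),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

housing_prepared = full_pipeline.fit_transform(housing)
```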
Training – Linear Regression
Import LinearRegression and fit it to the prepared training data and the corresponding labels (see the sketch below)
Predictions are not exactly accurate
RMSE is $68,628 – better than nothing, but not great!
The model is underfitting the data – the model isn't powerful enough, we need better features, or we should reduce constraints
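A minimal training-and-scoring sketch:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

housing_predictions = lin_reg.predict(housing_prepared)
lin_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
```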
Training – Decision Tree Regression
Decision Trees are more powerful than Linear Regression
A DecisionTreeRegressor can find complex, nonlinear relationships in the data – something linear regression is not capable of, by definition
No error at all between the housing labels and the housing predictions?
Tendency to overfit the data – we need model validation!
Note: Decision Trees often look great during the training phase and fail miserably in validation/testing. Don't jump to conclusions too quickly!
Validation
Option 1: Single validation set
Use train_test_split to divide the training set into a smaller training set plus a validation set
Train your models on the smaller training set, and evaluate them against the validation set
Option 2: K-fold cross-validation (see the sketch below)
Randomly split the training set into K subsets, or folds – then train and evaluate the model K times, each time picking one fold for evaluation and training on the rest
Better to do this, especially if you have a small dataset
The Linear Regression model actually does better than the Decision Tree model!
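A sketch of 10-fold cross-validation with Scikit-Learn (its scorers are "greater is better", hence the negated MSE):

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
print(tree_rmse_scores.mean(), tree_rmse_scores.std())
```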
Fine Tuning Models
Grid Search
Specify the hyperparameter values to experiment with
Use cross-validation to evaluate all possible combinations (see the sketch below)
Randomized Search
Useful when the hyperparameter search space is large
More control over the computing budget – just set the number of iterations
Ensemble Methods
A group of models often works better than the best single model
Analyze the best models (and features) and their errors – then predict on the test set
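A sketch of grid search over a RandomForestRegressor (the parameter grid is illustrative, not prescriptive):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [{"n_estimators": [10, 30, 100], "max_features": [4, 6, 8]}]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(housing_prepared, housing_labels)
grid_search.best_params_  # the winning hyperparameter combination
```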
Evaluation on Test Set
Get the predictors and labels from the test set, run the predictors through the (already fitted) pipeline, and evaluate! (see the sketch below)
Performance may be (and often is) worse than what was measured in cross-validation
The final performance of the system on the housing data is comparable to that of human experts (not better – but still worth it!)
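The final evaluation, using the best model from the search above:

```python
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)  # transform only - no refitting!
final_predictions = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
```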
Check your Bias
During WW2, Allied Forces ran
an analysis on airplanes that
took bullets during combat
They wanted to figure out
where planes were most
vulnerable and needed armor
The red dots are bullet holes
observed in planes that
returned from combat
Where would YOU increase
armor?