CSE 445: Machine Learning
Machine Learning Project – Data Exploration
Image taken from xkcd
Steps in an ML Project
1. Look at the big picture
2. Get the data
3. Visualize data for insights
4. Preprocess the data
5. Select a model and train it
6. Fine-tune your model
7. Present solution
8. Launch!
Prediction gone wrong?
Source: fivethirtyeight
Example Problem: California Housing
Objective: Use census data from California to build a model of housing prices
Source: StatLib repository
Frame the Problem
Task: Predict the median housing price in ANY district in CA, given all the other metrics
The dataset includes metrics such as population, median household income, and total rooms – as well as the median housing price – for each district
The model's output will be fed as one signal into a downstream system that decides whether to invest in a given area
Data Pipeline
(Diagram: the district pricing model is one component in a larger pipeline feeding an investment project)
What type of Problem is it?
Supervised, unsupervised, reinforcement?
We have the labeled training examples – the median housing price is
included in the dataset. We can use Supervised Learning!
Classification or Regression?
The desired output is a continuous variable – this is a regression task
Multiple features available - it’s a multiple regression problem
Only one value to be predicted – univariate regression problem
Batch Learning or Online Learning?
No incoming data other than dataset – batch learning is fine
Performance Measure
Root Mean Square Error (RMSE) or Mean Absolute Error (MAE):

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h\left(\mathbf{x}^{(i)}\right) - y^{(i)}\right)^{2}}$$

$$\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\left|h\left(\mathbf{x}^{(i)}\right) - y^{(i)}\right|$$

m – number of instances in the dataset
x^(i) – vector of all feature values for instance i; y^(i) – label value for instance i
X – matrix containing all feature values (excluding the labels) of all instances in the dataset
h – the system's hypothesis (prediction function)
RMSE(X, h) or MAE(X, h) – cost function measured on the set of examples using your hypothesis h
Performance Measure
Error – i.e. the difference between the true label value y^(i) and the predicted label value h(x^(i))
RMSE and MAE are two ways to measure the distance between two vectors – namely, the distance between your predicted labels and the true label values
RMSE – straight-line distance, or Euclidean norm, or L2 norm
MAE – city-block distance, or Manhattan norm, or L1 norm
The higher the norm index, the more sensitive the measure is to outliers
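A minimal sketch of both measures in NumPy (the arrays are made-up values, just for illustration):

```python
import numpy as np

y_true = np.array([250_000, 320_000, 180_000])  # true median house values y
y_pred = np.array([240_000, 350_000, 200_000])  # predictions h(x)

errors = y_pred - y_true
rmse = np.sqrt(np.mean(errors ** 2))  # L2-style: squaring punishes outliers more
mae = np.mean(np.abs(errors))         # L1-style: every error counts linearly
print(rmse, mae)
```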
Download Data
Automate the process of fetching data, because:
It's useful if the data changes regularly (the fetch can be scheduled)
The same script can be installed on multiple machines (see the sketch below)
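A sketch of such a fetch function, modeled on the common handson-ml setup – the URL and local paths are assumptions, so point them at wherever your copy of the data lives:

```python
import os
import tarfile
import urllib.request

# Assumed location of the dataset archive (replace with your own source)
HOUSING_URL = ("https://raw.githubusercontent.com/ageron/handson-ml2/"
               "master/datasets/housing/housing.tgz")
HOUSING_PATH = os.path.join("datasets", "housing")

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download and extract the housing archive; safe to re-run on a schedule."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
```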
Quick Look at Data Structure
Each row represents one district (i.e. one instance)
Ten attributes (features) per district
20,640 instances in total
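Loading and taking that quick look with pandas (the file path follows the fetch sketch above):

```python
import os
import pandas as pd

def load_housing_data(housing_path=os.path.join("datasets", "housing")):
    return pd.read_csv(os.path.join(housing_path, "housing.csv"))

housing = load_housing_data()
housing.head()    # first five districts
housing.info()    # attribute types and non-null counts
housing["ocean_proximity"].value_counts()  # the one categorical attribute
```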
Histogram
Plot a histogram for each
numerical attribute
Noticeable oddities:
1. Preprocessed features – median income isn't in raw USD; it has been scaled by a factor of 10,000 and capped at 15
2. The output (median house value) has also been capped – ergo, the model will not be able to predict beyond $500,000!
3. Many of the histograms are tail-heavy
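One call plots a histogram per numerical attribute (matplotlib assumed):

```python
import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(20, 15))  # one histogram per numerical attribute
plt.show()
```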
Test Set Generation – Random Sampling
Create and set aside the test set as early as possible! (avoids data snooping bias)
Use a fixed random seed to ensure the same shuffled indices on every run (if your dataset gets updated, this alone will not work!)
Instead, compute a hash of each instance's identifier and put the instance in the test set if its hash is less than or equal to 20% of the maximum hash value (this keeps the split consistent across multiple runs – see the sketch below)
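A sketch of the hash-based split, along the lines of the handson-ml recipe (using the row index as the identifier is an assumption – a stable ID column is better if you have one):

```python
from zlib import crc32

import numpy as np

def is_in_test_set(identifier, test_ratio):
    # The instance lands in the test set if its hash falls in the bottom test_ratio
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

housing_with_id = housing.reset_index()  # adds an `index` column to hash
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
```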
Stratified Sampling
Need a test set that is representative of the overall population
Strata – homogeneous subgroups
e.g. the male-to-female ratio in Bangladesh is 103 to 100
Stratified sampling – sampling from each stratum so the test set is representative of the overall population
The test set should then have 103 males for every 100 females
Stratified Sampling
Without stratified sampling, the sampling bias in the test set is generally much higher
We want the test set data to resemble the real-world data
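A sketch using Scikit-Learn's StratifiedShuffleSplit, stratifying on a binned income category (the bin edges are an assumption, chosen to give reasonably sized strata):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Bin median income into five strata (assumed bin edges)
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in splitter.split(housing, housing["income_cat"]):
    strat_train_set = housing.iloc[train_idx]
    strat_test_set = housing.iloc[test_idx]

# Drop the helper column once the split is done
strat_train_set = strat_train_set.drop("income_cat", axis=1)
strat_test_set = strat_test_set.drop("income_cat", axis=1)
```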
Data Visualization
Use latitude and longitude to visualize the district locations
(Figures: "Bad Visualization" vs. "Slightly Better Visualization" of the district locations)
Data Visualization
Best visualization – show median house value using color!
High values are visible around metro areas near the ocean (LA, SF, SD, etc.)
Note: only do this exploration on the training set!
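A sketch of the color-coded plot (column names as in the housing dataset):

```python
import matplotlib.pyplot as plt

housing = strat_train_set.copy()  # explore a copy of the training set only
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"] / 100,  # marker size ~ district population
             c="median_house_value", cmap="jet", colorbar=True,
             label="population", figsize=(10, 7))
plt.legend()
plt.show()
```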
Correlation between attributes
Correlation coefficient near 1: strong positive correlation – if x goes up, y goes up too
Correlation coefficient near -1: strong negative correlation – if x goes up, y goes down
Correlation coefficient near 0: no linear correlation
This doesn't mean there's no relationship between the two!
Note: The size of the slope has nothing to do with how correlated the two attributes are!
In the housing dataset, median house value is positively correlated with median income (as expected)
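Computing the correlations against the label with pandas:

```python
corr_matrix = housing.corr(numeric_only=True)  # Pearson's r for every attribute pair
corr_matrix["median_house_value"].sort_values(ascending=False)
```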
Scatter matrix visualization of Correlation
The main diagonal shows each attribute's histogram, generated by the pandas library (instead of a perfectly correlated straight line)
Look for patterns in each attribute pair
Shows each attribute's correlation with median house value at a glance
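A sketch with pandas' scatter_matrix, restricted to a few promising attributes:

```python
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income",
              "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
```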
Attribute Combinations
Upward trend visible – and the points are not too dispersed
Price cap visible at $500,000
Fainter horizontal lines at $450,000, $350,000, and $280,000 (why?)
The algorithm may pick up and reproduce these quirks
• Consider other attribute combinations – they may be more revealing (see the sketch below)
• Rooms per household, bedrooms per room, and population per household are more correlated with house value than total rooms, households, or population alone!
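A sketch of the derived attributes and how they rank:

```python
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = (housing["population"]
                                       / housing["households"])

corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)
```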
Missing Data - Numerical
Option 1: Remove the instances (rows) that contain missing data [dropna() in pandas]
Can work if there aren't too many instances with missing data
Option 2: Remove the entire attribute (column) with missing data [drop() in pandas]
Usually worse than option 1, especially if the attribute is useful when available!
Option 3: Set the missing values to some statistical measure (zero, mean, median, etc.) [fillna() in pandas]
Compute the statistic on the training set only – reuse the same value to fill the test set later, during evaluation
Use SimpleImputer in Scikit-Learn to streamline this process (see the sketch below)
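A sketch of option 3 with SimpleImputer (median strategy):

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)  # medians need numeric data
imputer.fit(housing_num)            # learns each column's median from training data
X = imputer.transform(housing_num)  # fills the gaps (e.g. missing total_bedrooms)
```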
Text and Categorical Attributes
Discrete, non-arbitrary text values form a category
Convert categorical values from text to numbers
Option 1: Map each category to a unique number (see the sketch below)
The ML algorithm may assume (incorrectly) that two "nearby" values are more similar than two distant values
<1H OCEAN and NEAR OCEAN are encoded as 0 and 4 – even though they're more similar to each other than to INLAND (encoded as 1)
Only use this if the categories are ordered
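Option 1 with Scikit-Learn's OrdinalEncoder:

```python
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat = housing[["ocean_proximity"]]
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
ordinal_encoder.categories_  # the text-to-integer mapping that was learned
```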
Text and Categorical Attributes
Option 2: One-hot encoding – one dummy binary attribute per category (see the sketch below)
A sparse matrix is generated, storing only the locations of the nonzero elements
May degrade performance if there are too many categories – consider replacing the category with useful numeric features related to it, if possible
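Option 2 with OneHotEncoder:

```python
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()  # returns a SciPy sparse matrix by default
housing_cat_1hot = cat_encoder.fit_transform(housing[["ocean_proximity"]])
housing_cat_1hot.toarray()     # densify only for inspection
```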
Feature Scaling
Usually, ML systems don’t perform well if
input attributes have very different scales
In housing data, total # of rooms 6 to 39,320,
whereas Median Income 0 to 15
Normalization shift and rescale values so
they range from 0 (minimum value) and 1
Normalization:
(maximum value)
Might be adversely affected by outliers
Standardization subtract the mean value,
and divide by standard deviation so that the
resulting distribution has zero mean, and
unit variance Standardization:
Not bound to a specific range (could be
problematic for some algorithms such as NNs)
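Both scalers exist in Scikit-Learn; a minimal sketch (fit on the training data only):

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler()  # normalization: rescales each column to [0, 1]
housing_num_minmax = min_max_scaler.fit_transform(housing_num)

std_scaler = StandardScaler()    # standardization: zero mean, unit variance
housing_num_std = std_scaler.fit_transform(housing_num)
```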
Custom Transformers
Scikit-Learn relies on duck typing – create a class and implement three methods: fit(), transform(), and fit_transform()
Just add TransformerMixin as a base class to get fit_transform() for free
Adding BaseEstimator gives you two extra methods: get_params() and set_params()
Consider the following: add the custom attribute bedrooms_per_room to the set of input attributes (see the sketch below)
One hyperparameter, in the form of add_bedrooms_per_room
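A sketch of such a transformer, in the spirit of the handson-ml example – the hard-coded column indices are assumptions about the column order of the numeric feature array:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Assumed positions of these columns in the numeric feature array
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # the hyperparameter
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]
```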
Pipeline
Ensures the data transformation steps are executed in the right order
The Pipeline constructor takes a list of name/estimator pairs, applied in sequence
A ColumnTransformer handles numerical and categorical data together
housing_prepared contains the data ready for training (see the sketch below)
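Putting the pieces together (housing_labels, split off from the training set here, is used for training below):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Separate predictors and labels from the stratified training set
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
housing_num = housing.drop("ocean_proximity", axis=1)

num_pipeline = Pipeline([  # numeric steps run in this exact order
    ("imputer", SimpleImputer(strategy="median")),
    ("attribs_adder", CombinedAttributesAdder()),
    ("std_scaler", StandardScaler()),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, list(housing_num)),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

housing_prepared = full_pipeline.fit_transform(housing)
```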
Training – Linear Regression
Import LinearRegression and fit it to the prepared training data and the corresponding labels (see the sketch below)
Predictions are not exactly accurate
RMSE is $68,628 – better than nothing, but not great!
The model is underfitting the data – the model isn't powerful enough, we need better features, or we should reduce constraints
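A minimal training-and-scoring sketch:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

housing_predictions = lin_reg.predict(housing_prepared)
lin_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
```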
Training – Decision Tree Regression
Decision Trees are more powerful than Linear Regression
A DecisionTreeRegressor can find complex, nonlinear relationships in the data – something linear regression is not capable of, by definition
No error at all between the housing labels and the housing predictions?
Tendency to overfit the data – we need model validation!
Note: Decision Trees often look great during the training phase and fail miserably in validation/testing. Don't jump to conclusions too quickly!
Validation
Option 1: Single validation set
Use train_test_split to divide the training set into a smaller training set plus a validation set
Train your models on the smaller training set, and evaluate them against the validation set
Option 2: K-fold cross-validation (see the sketch below)
Randomly split the training set into K subsets, or folds – then train and evaluate the model K times, each time picking one fold for evaluation and training on the rest
Better to do this, especially if you have a small dataset
The Linear Regression model actually does better than the Decision Tree model!
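A sketch of 10-fold cross-validation with Scikit-Learn (its scorers are "greater is better", hence the negated MSE):

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
print(tree_rmse_scores.mean(), tree_rmse_scores.std())
```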
Fine Tuning Models
Grid Search
Specify the hyperparameter values to experiment with
Use cross-validation to evaluate all possible combinations (see the sketch below)
Randomized Search
Useful when the hyperparameter search space is large
More control over the computing budget – just set the number of iterations
Ensemble Methods
A group of models often works better than the best single model
Analyze the best models (and features) and their errors – then predict on the test set
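A sketch of grid search over a RandomForestRegressor (the parameter grid is illustrative, not prescriptive):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [{"n_estimators": [10, 30, 100], "max_features": [4, 6, 8]}]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(housing_prepared, housing_labels)
grid_search.best_params_  # the winning hyperparameter combination
```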
Evaluation on Test Set
Get the predictors and labels from the test set, run the predictors through the (already fitted) pipeline, and evaluate! (see the sketch below)
Performance may be (and often is) worse than what was measured in cross-validation
The final performance of the system on the housing data is comparable to that of human experts (not better – but still worth it!)
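The final evaluation, using the best model from the search above:

```python
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)  # transform only - no refitting!
final_predictions = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
```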
Check your Bias
During WW2, Allied Forces ran
an analysis on airplanes that
took bullets during combat
They wanted to figure out
where planes were most
vulnerable and needed armor
The red dots are bullet holes
observed in planes that
returned from combat
Where would YOU increase
armor?