CSE 445 - Lecture 2 - Data Exploration - Regression

The document outlines the steps involved in a machine learning project, using California housing data as a case study to predict housing prices. It discusses the problem framing, data preparation, model selection, evaluation metrics, and the importance of visualizing data and ensuring representative test sets. The document also emphasizes the need for model validation and fine-tuning to achieve better predictions.


CSE 445: Machine Learning

Machine Learning Project – Data Exploration

Image taken from xkcd


Steps in an ML Project
1. Look at the big picture
2. Get the data
3. Visualize data for insights
4. Preprocess the data
5. Select a model and train it
6. Fine-tune your model
7. Present solution
8. Launch!
Prediction gone wrong?

Source: fivethirtyeight
Example Problem: California Housing
 Objective: Use census data from California to build a model of housing prices
 Source: StatLib repository
Frame the Problem
 Task: Predict median housing price in ANY district in CA, given all other metrics
 Dataset includes metrics such as population, median household income, total rooms, etc. – as well as the median housing price in each district
 Model output will go in as a signal to decide whether to invest in a given area

Data Pipeline for the District Pricing Project
What type of Problem is it?
 Supervised, unsupervised, reinforcement?
 We have the labeled training examples – the median housing price is
included in the dataset. We can use Supervised Learning!
 Classification or Regression?
 The desired output is a continuous variable – this is a regression task
 Multiple features available - it’s a multiple regression problem
 Only one value to be predicted – univariate regression problem
 Batch Learning or Online Learning?
 No incoming data other than dataset – batch learning is fine
Performance Measure
 Root Mean Square Error (RMSE) or Mean Absolute Error (MAE)

 m - number of instances in the dataset


 x(i) – vector of all feature values for instance i and y(i) – label value for
instance i
 X – matrix containing all feature values (excluding the labels) of all instances in the dataset
 h – system’s hypothesis/prediction function/weights
 RMSE(X,h) or MAE(X,h) – cost function measured on set of examples
using your hypothesis h
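
The formula images from this slide are not reproduced here; for reference, the standard definitions consistent with the notation above are:

RMSE(\mathbf{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(\mathbf{x}^{(i)}) - y^{(i)} \right)^{2}}

MAE(\mathbf{X}, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h(\mathbf{x}^{(i)}) - y^{(i)} \right|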
Performance Measure
 Error – i.e. the difference between true label value y and the
predicted label value computed as h(x(i))
 RMSE and MAE are two ways to measure the distance between
two vectors – namely, the distance between your predicted
labels, and the true label value
 RMSE – straight line distance, or Euclidean norm, or L2 norm
 MAE – city block distance, or Manhattan norm, or L1 norm
 Higher norm index → more sensitive to outliers
Download Data
 Automate the process of fetching data because:
 It's useful if the data changes regularly (the fetch can be scheduled)
 The same script can be reused on multiple machines
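
A minimal sketch of such an automated fetch in Python, assuming the housing data is mirrored as a tarball at the handson-ml2 GitHub URL (the URL and paths are assumptions, not part of the slides):

import os
import tarfile
import urllib.request

DOWNLOAD_URL = ("https://raw.githubusercontent.com/ageron/handson-ml2/"
                "master/datasets/housing/housing.tgz")
HOUSING_PATH = os.path.join("datasets", "housing")

def fetch_housing_data(housing_url=DOWNLOAD_URL, housing_path=HOUSING_PATH):
    # Download the tarball and extract housing.csv into datasets/housing
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

fetch_housing_data()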
Quick Look at Data Structure
 Each row represents 1 district
(i.e. 1 instance)
 Ten features (attributes)
 Total instances = 20,640 in the
dataset
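
Loading and inspecting the data with pandas (paths follow the fetch sketch above):

import os
import pandas as pd

housing = pd.read_csv(os.path.join("datasets", "housing", "housing.csv"))
housing.info()                                    # 20,640 rows, ten attributes
print(housing.head())                             # first five districts
print(housing["ocean_proximity"].value_counts())  # the only non-numerical attribute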
Histogram
 Plot a histogram for each
numerical attribute

 Noticeable oddities:
1. Preprocessed features – median income isn't in raw USD, but has been scaled by a factor of 10,000 and capped at 15
2. Output (i.e. median house value) has also been capped – ergo, the model will not be able to predict beyond $500,000!
3. Tail-heavy histograms
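
The histograms can be produced with a single pandas call (matplotlib assumed):

import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(20, 15))  # one histogram per numerical attribute
plt.show()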
Test Set Generation – Random Sampling
 Create and set aside the test set as early as possible! (avoids data snooping bias)
 Fix the random seed so the shuffled indices are the same on every run (if your dataset updates, this alone will not work!)
 Compute a hash of each instance's identifier and put the instance in the test set if the hash is less than or equal to 20% of the maximum hash value (ensures a consistent split across multiple runs)
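
A sketch of the hash-based split described above, using the row index as an (assumed stable) identifier:

from zlib import crc32
import numpy as np

def test_set_check(identifier, test_ratio):
    # Put the instance in the test set if its hash lands in the lowest test_ratio share
    return crc32(np.int64(identifier).tobytes()) < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

housing_with_id = housing.reset_index()  # adds an "index" column to use as the id
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")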
Stratified Sampling
 Need to have a test set that is representative of the overall population
 Strata – homogeneous subgroups
 e.g. male to female ratio in Bangladesh is 103 to 100
 Stratified sampling – ensuring test set is representative of overall population
 Test set should have 103 males for every 100 females
Stratified Sampling
 Without Stratified Sampling, the
sampling bias in the test set is
generally much higher
 We want the test set data to resemble
the real world data
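
One way to stratify the split in Scikit-Learn, using a derived income category (the bin edges are illustrative assumptions):

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Bucket median income into a categorical attribute to stratify on
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Drop the helper column once the split is done
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)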
Data Visualization
 Use latitude and longitude to visualize the district locations
[Figures: "Bad Visualization" vs. "Slightly Better Visualization" scatter plots of district locations]
Data Visualization
 Best visualization –
show median house
value using color!
 Can see high value
around metro areas
near the ocean (LA, SF,
SD etc)
 Note: Only do this for
training set!
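
A sketch of such a plot with pandas/matplotlib, working on a copy of the stratified training set only:

import matplotlib.pyplot as plt

housing = strat_train_set.copy()  # explore the training set only
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"] / 100, label="population",
             c="median_house_value", cmap="jet", colorbar=True, figsize=(10, 7))
plt.legend()
plt.show()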
Correlation between attributes
 Correlation coefficient near 1 → strong positive correlation
 If x goes up, y goes up too
 Correlation coefficient near -1 → strong negative correlation
 If x goes up, y goes down
 Correlation coefficient near 0 → no linear correlation
 Doesn't mean there's no relationship between the two!
 Note: The size of the slope has nothing to do with how correlated the two attributes are!
 In the housing dataset, median house value is positively correlated with median income (as expected)
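
The coefficients can be computed directly with pandas (numeric_only skips the text attribute on recent pandas versions):

corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))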
Scatter matrix visualization of Correlation
 Main diagonal shows each attribute's histogram, generated by the pandas library (instead of a perfectly correlated straight line)
 Look for patterns in each
attribute pair
[Figure: correlation coefficients with median house value]
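
A scatter matrix for a few promising attributes, using the pandas plotting helper:

from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

attributes = ["median_house_value", "median_income",
              "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()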
Attribute Combinations
 Upward trend visible – and the points are not too dispersed
 Price cap visible at $500,000
 Horizontal lines at $450,000, $350,000, and $280,000 (Why?)
 The algorithm may pick up and reproduce these quirks

• Consider other attribute combinations – they may be more revealing!
• Rooms per household, bedrooms per room, and population per household are more strongly correlated with house value than total rooms, households, or population alone!
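
The combined attributes can be added directly on the DataFrame and checked against the correlation matrix again:

housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))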
Missing Data - Numerical
 Option 1: Remove instances (rows) that contain missing data
[ dropna() in pandas]
 Can work if there aren’t too many instances with missing
data
 Option 2: Remove the entire attribute (column) with missing
data [ drop() in pandas]
 Worse than option 1, especially if it’s a useful attribute
when available!
 Option 3: Set missing values to some statistical measure (zero, mean, median, etc.) [ fillna() in pandas]
 Compute the statistic on the training set only – reuse it to fill the test set later during evaluation
 Use SimpleImputer in Scikit-Learn to streamline this
process
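
A sketch of option 3 with SimpleImputer, after first separating the predictors from the labels on the stratified training set (variable names are illustrative):

import pandas as pd
from sklearn.impute import SimpleImputer

housing = strat_train_set.drop("median_house_value", axis=1)   # predictors
housing_labels = strat_train_set["median_house_value"].copy()  # labels

imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)  # medians only exist for numbers

imputer.fit(housing_num)            # learns per-column medians (imputer.statistics_)
X = imputer.transform(housing_num)  # fills in the missing total_bedrooms values
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)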
Text and Categorical Attributes
 Discrete, non-arbitrary text values represent categories
 Convert categorical values from text to numbers
 Option 1: Map each category to a unique
number
 ML algorithm may assume (incorrectly) that two "nearby" values are more similar than two distant values
 <1H OCEAN and NEAR OCEAN encoded as 0 and 4 – even though they're more similar to each other than to INLAND (encoded as 1)
 Only use this encoding if the categories are ordered
Text and Categorical Attributes
 Option 2: One hot encoding – one dummy binary
attribute per category
 Output is a sparse matrix that stores only the locations of the nonzero elements
 May degrade performance if there are too many
categories – consider replacing with useful numeric
features related to categories if possible
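
Sketches of both encodings with Scikit-Learn (OrdinalEncoder for option 1, OneHotEncoder for option 2):

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

housing_cat = housing[["ocean_proximity"]]

# Option 1: ordinal encoding -- only appropriate if the categories are ordered
ordinal_encoder = OrdinalEncoder()
housing_cat_ordinal = ordinal_encoder.fit_transform(housing_cat)

# Option 2: one-hot encoding -- returns a SciPy sparse matrix by default
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
print(cat_encoder.categories_)         # the learned category list
print(housing_cat_1hot.toarray()[:3])  # densify a few rows just to inspect them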
Feature Scaling
 Usually, ML systems don't perform well if input attributes have very different scales
 In the housing data, total # of rooms ranges from 6 to 39,320, whereas median income ranges from 0 to 15
 Normalization (min-max scaling) → shift and rescale values so they range from 0 (minimum value) to 1 (maximum value)
 Might be adversely affected by outliers
 Standardization → subtract the mean value and divide by the standard deviation, so that the resulting distribution has zero mean and unit variance
 Not bound to a specific range (could be problematic for some algorithms, such as NNs)
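
The formulas behind these two options are the standard ones (Scikit-Learn implements them as MinMaxScaler and StandardScaler):

Normalization (min-max scaling):  x' = \frac{x - \min(x)}{\max(x) - \min(x)}

Standardization:  x' = \frac{x - \mu}{\sigma}   (\mu = mean, \sigma = standard deviation)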
Custom Transformers
 Scikit-Learn relies on duck typing → create a class and implement three methods: fit(), transform(), and fit_transform()
 Just add TransformerMixin as a base class to get fit_transform()
 Adding BaseEstimator gives you two extra methods: get_params() and set_params()
 Consider the following → add the custom attribute bedrooms_per_room to the set of input attributes
 One hyperparameter in the form of add_bedrooms_per_room (see the sketch below)
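
A minimal sketch of such a transformer, assuming the numeric columns are ordered so that total_rooms, total_bedrooms, population and households sit at indices 3-6 (the class name CombinedAttributesAdder and the indices are assumptions):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6  # assumed positions

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Appends rooms_per_household, population_per_household and,
    optionally, bedrooms_per_room to a NumPy feature array."""
    def __init__(self, add_bedrooms_per_room=True):  # the hyperparameter
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]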
Pipeline
 Ensures data transformation steps are executed in the right order
 Pipeline constructor → takes a list of name/estimator pairs in sequence
 ColumnTransformer → handles numerical and categorical data together
 housing_prepared contains data ready for training
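
A sketch of the full preprocessing pipeline, reusing the pieces above (housing, housing_labels and CombinedAttributesAdder are the illustrative names from the earlier sketches):

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Numerical columns: impute, add combined attributes, then scale
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("attribs_adder", CombinedAttributesAdder()),
    ("std_scaler", StandardScaler()),
])

num_attribs = list(housing.drop("ocean_proximity", axis=1))  # numeric column names
cat_attribs = ["ocean_proximity"]

# ColumnTransformer routes each column set to its own transformer
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)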
Training – Linear Regression
 Import LinearRegression and fit to prepared training data and corresponding labels
 Predictions are not exactly accurate
 RMSE is $68,628 – better than nothing, but not great!
 Model is underfitting the data – model isn't powerful enough / need better features / reduce constraints
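
A sketch of the training and training-set evaluation step (the ~$68,628 figure comes from the slide; exact numbers will vary):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

housing_predictions = lin_reg.predict(housing_prepared)
lin_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
print(lin_rmse)  # training-set RMSE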
Training – Decision Tree Regression
 Decision Trees → more powerful than Linear Regression
 A DecisionTreeRegressor can find complex, nonlinear relationships in the data – something linear regression is not capable of doing by definition
 No error between the housing labels and the housing predictions?
 Tendency to overfit the data – need model validation!
 Note: Decision Trees often look great during the training phase and fail miserably in validation/testing. Don't jump to conclusions too quickly!
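
The same fit-and-evaluate pattern with a decision tree; the zero training error is exactly the overfitting symptom described above:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

housing_predictions = tree_reg.predict(housing_prepared)
tree_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
print(tree_rmse)  # 0.0 on the training set -- suspiciously perfect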
Validation
 Option 1: Single Validation set
 Use train_test_split to divide training set
into training+validation
 Train your model with the smaller training set, and evaluate it against the validation set
 Option 2: K-fold cross-validation
 Randomly split the training set into K subsets, or folds – train and evaluate the model K times, each time picking one fold for evaluation and training on the rest
 Better to do this, especially if you have a small dataset
 The Lin Reg model actually does better than the Decision Tree model!
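
A sketch of option 2 with cross_val_score, evaluating the decision tree from the previous sketch with 10 folds:

import numpy as np
from sklearn.model_selection import cross_val_score

# Scikit-Learn's scoring convention is "greater is better", so it returns
# negative MSE; flip the sign before taking the square root.
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
print(tree_rmse_scores.mean(), tree_rmse_scores.std())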
Fine Tuning Models
 Grid Search
 Specify hyperparameters to experiment with
 Use cross validation to evaluate all possible combinations
 Randomized Search
 Useful when hyperparameter search space is large
 More control over computing budget – set number of iterations
 Ensemble Methods
 A group of models often works better than a single individual model
 Analyze the best models (and features) and their errors – then
predict on the test set
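
A sketch of grid search over the decision tree's hyperparameters (the grid values are illustrative, not from the slides):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {"max_depth": [4, 8, 16, None],
              "min_samples_leaf": [1, 10, 50]}

grid_search = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid,
                           cv=5, scoring="neg_mean_squared_error")
grid_search.fit(housing_prepared, housing_labels)

print(grid_search.best_params_)
print(np.sqrt(-grid_search.best_score_))  # cross-validated RMSE of the best combination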
Evaluation on Test Set
 Get the predictors and labels from the test set, run the predictors through the pipeline (transform only – don't re-fit it!), and evaluate
 Performance may be (and often is) worse than what was measured in cross-validation
 Final performance of the system on the housing data is comparable to human experts (but not better – still worth it!)
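
A sketch of the final evaluation, reusing strat_test_set, full_pipeline and grid_search from the earlier sketches:

import numpy as np
from sklearn.metrics import mean_squared_error

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

# transform() only -- the pipeline was already fitted on the training set
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print(final_rmse)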
Check your Bias
 During WW2, Allied Forces ran
an analysis on airplanes that
took bullets during combat
 They wanted to figure out
where planes were most
vulnerable and needed armor
 The red dots are bullet holes
observed in planes that
returned from combat
 Where would YOU increase
armor?
