Feature Engineering
What is Machine Learning?
Simple definition: How machines learn rules from examples.
Goal of any machine learning method:
• Learn patterns from examples
• Be able to generalize them to new examples
Supervised and unsupervised machine learning:
In both cases learning is achieved through examples!
What is Machine Learning?
Formal definition: A program is said to learn from experience E with regard to
task T and performance measure P, if its performance on task T, as measured by P,
improves with experience E.
#   Task                              Experience                  Performance Measure
1   Recognize handwritten digits      Set of digits with labels   Percent of correct recognitions
2   Predict length of hospital stay   Patient histories           Mean prediction error
3   Recommend Netflix shows           Viewing histories           # users viewing show
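The performance measures in the table can be made concrete with a small sketch. The data values and function names below are hypothetical, and "mean prediction error" is interpreted here as mean absolute error, which is only one possible reading.

```python
def percent_correct(actual, predicted):
    """Classification measure: share of predictions that match the labels."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * matches / len(actual)


def mean_absolute_error(actual, predicted):
    """Regression measure: average size of the prediction error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Task 1 (categorical target): hypothetical digit labels.
digits_actual = [3, 7, 1, 9]
digits_predicted = [3, 7, 4, 9]
digit_score = percent_correct(digits_actual, digits_predicted)
print(digit_score)  # 75.0

# Task 2 (numerical target): hypothetical lengths of hospital stay, in days.
stay_actual = [2.0, 5.0, 3.0, 10.0]
stay_predicted = [3.0, 5.0, 1.0, 8.0]
stay_error = mean_absolute_error(stay_actual, stay_predicted)
print(stay_error)  # 1.25
```

A higher percent correct is better, while a lower mean error is better, so the direction of "improvement" depends on the measure P that is chosen.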
Some motivating examples
• Early detection of disease outbreaks
• Identifying vulnerable buildings for retrofit
• Predicting transport demand
• Preventing violent crime
• Reducing CO2 emissions
• Targeting fire risk inspections
Acknowledgment: D. Neill, Machine Learning for Cities, CUSP NYU
An approach to model fitting
Five main steps:
1. Use Case & Data: Determine question of interest, get informative data.
2. Model training: Split data into training and test sets. Fit model to training set.
3. Tune (calibrate): Tune model parameters.
4. Predict: Use the tuned model to form predictions about your test set.
5. Evaluate: Compare the predictions with the actual values for the test set.
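The five steps can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the commute data is invented, the "model" is a single distance threshold, and fitting and tuning collapse into one search because the rule has only one parameter.

```python
# Step 1: use case and data (hypothetical). Predict whether a commuter
# cycles (1) or drives (0) from the commute distance in km.
data = [(1.0, 1), (2.0, 1), (2.5, 1), (3.0, 1), (4.0, 1), (5.0, 0),
        (6.0, 1), (7.0, 0), (8.0, 0), (9.0, 0), (10.0, 0), (12.0, 0)]

# Step 2: split into training and test sets (every third record is held out).
test = data[2::3]
train = [row for i, row in enumerate(data) if i % 3 != 2]


def predict(threshold, km):
    """Rule: commutes shorter than the threshold are predicted as cycling."""
    return 1 if km < threshold else 0


def accuracy(threshold, rows):
    """Share of rows whose predicted label matches the actual label."""
    hits = sum(predict(threshold, km) == label for km, label in rows)
    return hits / len(rows)


# Step 3: tune the threshold using the training set only.
candidates = [2.0, 4.0, 6.0, 8.0]
best_threshold = max(candidates, key=lambda t: accuracy(t, train))

# Step 4: predict for the test set.
predictions = [predict(best_threshold, km) for km, _ in test]

# Step 5: evaluate by comparing predictions with the actual test labels.
test_accuracy = accuracy(best_threshold, test)
print(best_threshold, test_accuracy)  # 6.0 0.75
```

The test records play no part in choosing the threshold, so the final accuracy is an estimate of how the rule behaves on unseen commuters.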
The variable of interest is either categorical, which calls for classification,
or numerical, which calls for regression.
Unsupervised Learning
• The only thing we have is input data.
• Labels are not provided by a supervisor.
What does an unsupervised algorithm do?
Extract patterns in the data.
Create clusters whose members are similar (based on some set of measurements).
Example: Take raw data on visitors to my website.
Segment them into groups that share the same characteristics, then target ads to each group.
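The website example can be sketched with k-means, one common clustering algorithm (the slides do not name a specific one). The visitor measurements, the choice of two clusters, and the starting centroids are all hypothetical.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then move each
    centroid to the mean of its assigned points, and repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # A centroid with no assigned points stays where it was.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical visitors: (pages viewed per visit, minutes on site). No labels.
visitors = [(2, 1.0), (3, 1.5), (2, 2.0), (20, 15.0), (22, 14.0), (21, 16.0)]

# Assumption: start from two of the data points as initial centroids.
centroids, labels = kmeans(visitors, [visitors[0], visitors[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The algorithm never sees a label. It only groups visitors whose measurements are close, and it is up to the analyst to interpret the groups (for example "brief visitors" and "engaged visitors").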
Supervised Learning
• The machine learns from examples that have already been labelled.
• Each example has input values (attributes) and an output value.
Example: A spam classifier learns rules from this training set of emails.
Goal:
Use known output values to learn the patterns of the input.
Predict the output value of new examples.
Image credit: Géron, Hands-On Machine Learning
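A minimal sketch of the spam example follows. It uses a Naive Bayes word-count classifier as a stand-in, since the slides do not say which algorithm the spam classifier uses. The four training emails and their labels are invented.

```python
import math
from collections import Counter

# Hypothetical labelled training set: (input text, output label).
training_emails = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]


def train(emails):
    """Count how often each word appears under each label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in emails:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Naive Bayes with add-one smoothing: return the label whose
    log-probability for the words in the text is highest."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    scores = {}
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for word in text.split():
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


word_counts, label_counts = train(training_emails)
print(classify("claim your free prize", word_counts, label_counts))   # spam
print(classify("team meeting on monday", word_counts, label_counts))  # ham
```

The known output values (the labels) are what let the model connect words to classes. The two emails being classified were not in the training set.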
Supervised Learning algorithms
Linear regression
• Models the output as a linear combination of the inputs.
• Fast to train, effective on high-dimensional data.
Support Vector Machines
• Learn a decision boundary (linear or non-linear).
• Suit complex, medium-size datasets.
Decision trees and Random Forest
• Build flow-chart style rules that maximize information gain.
• High predictive power, require less data preparation.
Neural networks
• Algorithms inspired by the structure and function of the brain.
• Scalable, highly accurate on tasks like image recognition.
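To illustrate the first entry in this list, here is a sketch of linear regression with a single input, fitted by ordinary least squares. The data points and function names are made up, and with several inputs the same idea extends to one weight per input.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training examples that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0

# Predict the output for an input that was not in the training data.
prediction = slope * 5.0 + intercept
print(prediction)  # 11.0
```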
Building a model
Use Case & Data | Model training | Tune | Predict | Evaluate
Build labeled dataset for question of interest
Split training and test data
When fitting ML algorithms, it is common to separate
data into training and test sets
Split the dataset (e.g. 70/30 ratio):
• Training set (70% of records): build the model on the training set.
• Test set (30%): evaluate the model on the test set.
Image credit: D. Ziganto “Standard Deviations” blog
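A 70/30 split can be sketched in a few lines. The helper name `split_records`, the seed, and the use of integers as stand-in records are assumptions for illustration. Libraries such as scikit-learn provide their own split utilities.

```python
import random


def split_records(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records, then hold out a share as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of 100 records, represented by their ids.
records = list(range(100))
train_set, test_set = split_records(records)
print(len(train_set), len(test_set))  # 70 30
```

Shuffling before splitting guards against any ordering in the original file (for example records sorted by date) leaking into the two sets.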
Complexity vs. accuracy
We can build models of lower or higher complexity by
changing their hyper-parameters.
Aim for the ‘sweet spot’ that maximizes performance but
avoids overfitting.
* Overfitting: a complex model that memorizes the training set (including noise in it)
but fails to generalize to new data.
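The effect can be seen with a k-nearest-neighbours classifier, where the hyper-parameter k controls complexity (k = 1 is the most flexible setting). This is a sketch on invented one-dimensional data in which two training labels are deliberately noisy. It is not taken from the slides.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbors = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if votes * 2 > k else 0


def accuracy(train, rows, k):
    """Share of rows classified correctly by a k-NN model built on train."""
    return sum(knn_predict(train, x, k) == y for x, y in rows) / len(rows)


# Hypothetical data: the true rule is "label 1 when x > 5".
# The labels at x = 3 and x = 8 are noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0),
         (6, 1), (7, 1), (8, 0), (9, 1), (10, 1)]
held_out = [(2.6, 0), (3.4, 0), (4.5, 0), (6.5, 1), (7.6, 1), (8.4, 1)]

for k in (1, 3):
    print(k, accuracy(train, train, k), accuracy(train, held_out, k))
```

With k = 1 the model reproduces every training label, noise included, and scores 100% on the training set but only 2 of 6 on the held-out points. With k = 3 it scores 80% on the training set and gets all six held-out points right, because the vote smooths over the two noisy labels.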
Make predictions
With the model tuned and fitted to the training data, we can
predict outcomes for the test set, check that the model's performance is
satisfactory, and deploy it.
Figure: Object detection in images
Image: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2016)
EXAMPLE: Predicting mode of transport and music taste
Decision tree model for transport planning
Scenario: The World Bank has hired a cohort of 100 new staff, who start this
summer. GSD needs to decide how many bike racks or parking spaces to build for
them.
Attributes (𝑿𝟏 … 𝑿𝑵) and target variable (y)
From this training set, construct a set of rules to predict the mode of
transport for unseen examples.
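One way a decision tree picks its first rule is to choose the attribute with the highest information gain. The sketch below computes that quantity on a tiny invented training set. The attribute names (`distance`, `has_car`) and all rows are hypothetical and are not the attributes from the slide's table.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus the weighted
    entropy of the target after splitting on the attribute."""
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after


# Hypothetical training set: attributes X and target y ("mode").
staff = [
    {"distance": "near", "has_car": "no",  "mode": "bike"},
    {"distance": "near", "has_car": "yes", "mode": "bike"},
    {"distance": "near", "has_car": "no",  "mode": "bike"},
    {"distance": "far",  "has_car": "yes", "mode": "car"},
    {"distance": "far",  "has_car": "no",  "mode": "car"},
    {"distance": "far",  "has_car": "yes", "mode": "car"},
]

gains = {a: information_gain(staff, a, "mode") for a in ("distance", "has_car")}
best_attribute = max(gains, key=gains.get)
print(best_attribute)  # distance
```

In this toy data, splitting on `distance` separates cyclists from drivers perfectly, so it becomes the first rule of the tree. A full tree would repeat the same choice inside each resulting subset.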
Decision tree model for transport planning
Scenario: The World Bank has hired a talented cohort of 100 new staff, who start
after Thanksgiving. GSD needs to decide how many bike racks or parking spaces to
build for them.
Mix of home states and ages
High enjoyment of Netflix