Introduction to Machine Learning
(CSCI-UA.0480-002)
David Sontag
New York University
Slides adapted from Luke Zettlemoyer, Pedro Domingos, and
Carlos Guestrin
Logistics
• Class webpage:
– [Link]
– Sign up for mailing list!
• Office hours:
– Tuesday 3:30-4:30pm and by appointment.
– 715 Broadway, 12th floor, Room 1204
• Grader: Jinglun Dong
– Email: jinglundong@[Link]
Evaluation
• About 7 homeworks (50%)
– Both theory and programming
– See collaboration policy on class webpage
• Midterm & final exam (45%)
• Course participation (5%)
Prerequisites
• Basic algorithms (CS 310)
– Dynamic programming, algorithmic analysis
• Linear algebra (Math 140)
– Matrices, vectors, systems of linear equations
– Eigenvectors, matrix rank
– Singular value decomposition
• Multivariable calculus (Math 123)
– Derivatives, integration, tangent planes
– Lagrange multipliers
• Probability (Math 233 or 235)
Source Materials
Optional textbooks:
• C. Bishop, Pattern Recognition and Machine
Learning, Springer, 2007
• K. Murphy, Machine Learning: a Probabilistic
Perspective, MIT Press, 2012
A Few Quotes
• A breakthrough in machine learning would be worth ten
Microsofts (Bill Gates, Chairman, Microsoft)
• Machine learning is the next Internet (Tony Tether, former
director, DARPA)
Machine learning is the hot new thing (John Hennessy, President,
Stanford)
• Web rankings today are mostly a matter of machine
learning (Prabhakar Raghavan, former Dir. Research, Yahoo)
• Machine learning is going to result in a real revolution (Greg
Papadopoulos, former CTO, Sun)
• Machine learning is today’s discontinuity (Jerry Yang, former
CEO, Yahoo)
What is Machine Learning?
(by examples)
Classification
from data to discrete classes
Spam filtering
data → prediction: Spam vs. Not Spam
Object detection
Example training images for each orientation
Weather prediction
Regression
predicting a numeric value
Stock market
Weather prediction revisited
Temperature: 72° F
Ranking
comparing items
Web search
Given image, find similar images
[Link]
Collaborative Filtering
Recommendation systems
Machine learning competition with a $1 million prize
Clustering
discovering structure in data
Clustering Data: Group similar things
Clustering images
Set of Images
[Goldberger et al.]
Clustering web search results
Embedding
visualizing data
Embedding images
• Images have thousands or millions of pixels.
• Can we give each image a coordinate, such that similar images are near each other?
[Saul & Roweis ‘03]
Embedding words
[Joseph Turian]
Embedding words (zoom in)
[Joseph Turian]
Structured prediction
from data to structured outputs
Speech recognition
Natural language processing
"I need to hide a body" → noun, verb, preposition, …
Growth of Machine Learning
• Machine learning is the preferred approach to
– Speech recognition, Natural language processing
– Computer vision
– Medical outcomes analysis
– Robot control
– Computational biology
– Sensor networks
– …
• This trend is accelerating
– Improved machine learning algorithms
– Improved data capture, networking, faster computers
– Software too complex to write by hand
– New sensors / IO devices
– Demand for self-customization to user, environment
Supervised Learning: find f
• Given: Training set {(x_i, y_i) | i = 1 … n}
• Find: A good approximation to f : X → Y
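To make the definition concrete, here is a minimal Python sketch: learn consumes (x, y) pairs and returns an approximation f_hat of f. The 1-nearest-neighbor rule used here is just one illustrative choice of hypothesis space, not one the slides prescribe.

def learn(train):
    # train is a list of (x, y) pairs; returns f_hat approximating f.
    def f_hat(x):
        # 1-nearest neighbor: copy the label of the closest training
        # input under squared Euclidean distance.
        def dist(p):
            return sum((a - b) ** 2 for a, b in zip(x, p))
        _, nearest_y = min(train, key=lambda pair: dist(pair[0]))
        return nearest_y
    return f_hat

f_hat = learn([((0.0, 0.0), "ham"), ((5.0, 5.0), "spam")])
print(f_hat((4.0, 6.0)))  # -> spam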
Examples: what are X and Y?
• Spam Detection
– Map email to {Spam, Not Spam}
• Digit recognition
– Map pixels to {0,1,2,3,4,5,6,7,8,9}
• Stock Prediction
– Map new, historic prices, etc. to ℝ (the real numbers)
Example: Spam Filter
• Input: email
• Output: spam/ham
• Setup:
– Get a large collection of example emails, each labeled spam or ham
– Note: someone has to hand label all this data!
– Want to learn to predict labels of new, future emails
• Features: The attributes used to make the ham / spam decision
– Words: FREE!
– Text Patterns: $dd, CAPS
– Non-text: SenderInContacts
– …
Example emails:
"Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …"
"TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99"
"Ok, Iknow this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened."
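A small Python sketch of how the three feature types above might be computed; the function and feature names are illustrative, not from the slides.

import re

def extract_features(text, sender, contacts):
    return {
        "word_FREE": "FREE" in text,                               # Words
        "pattern_dollars": bool(re.search(r"\$\d+", text)),        # $dd
        "pattern_caps": bool(re.search(r"\b[A-Z]{4,}\b", text)),   # CAPS
        "sender_in_contacts": sender in contacts,                  # Non-text
    }

print(extract_features("99 MILLION EMAIL ADDRESSES FOR ONLY $99",
                       sender="bulk@example.com", contacts=set()))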
Example: Digit Recognition
• Input: images / pixel grids
• Output: a digit 0-9
• Setup:
– Get a large collection of example images, each labeled with a digit
– Note: someone has to hand label all this data!
– Want to learn to predict labels of new, future digit images
• Features: The attributes used to make the digit decision
– Pixels: (6,8)=ON
– Shape Patterns: NumComponents, AspectRatio, NumLoops
– …
[Example images: handwritten digits labeled 0, 1, 2, 1, and a new image to label: ??]
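The pixel features can be read straight off the grid. A tiny Python sketch, with a toy 0/1 list-of-lists standing in for a real image:

def pixel_features(image):
    # One binary feature per grid cell: "pixel (r, c) is ON".
    return {(r, c): image[r][c] == 1
            for r in range(len(image))
            for c in range(len(image[0]))}

toy_image = [[0, 1], [1, 1]]              # 2x2 stand-in for a pixel grid
print(pixel_features(toy_image)[(0, 1)])  # True: pixel (0, 1) is ON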
Important Concepts
• Data: labeled instances, e.g. emails marked spam/ham
– Training set
– Held-out set (sometimes called validation set)
– Test set
• Features: attribute-value pairs which characterize each x
• Experimentation cycle
– Select a hypothesis f to best match training set
– (Tune hyperparameters on held-out or validation set)
– Compute accuracy on test set
– Very important: never peek at the test set!
• Evaluation
– Accuracy: fraction of instances predicted correctly
• Overfitting and generalization
– Want a classifier which does well on test data
– Overfitting: fitting the training data very closely, but not generalizing well
– We'll investigate overfitting and generalization formally in a few lectures
[Diagram: data split into Training Data, Held-Out Data, Test Data]
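The experimentation cycle in code form, as a minimal Python sketch; the 60/20/20 split fractions are an illustrative assumption.

import random

def split(data, seed=0):
    # Shuffle once, then carve off training / held-out / test sets.
    data = data[:]
    random.Random(seed).shuffle(data)
    n = len(data)
    return (data[:int(0.6 * n)],
            data[int(0.6 * n):int(0.8 * n)],
            data[int(0.8 * n):])

def accuracy(f, dataset):
    # Fraction of instances predicted correctly.
    return sum(f(x) == y for x, y in dataset) / len(dataset)

# Usage: choose f and its hyperparameters by accuracy on the held-out
# set; evaluate on the test set exactly once, at the very end.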
A Supervised Learning Problem
• Consider a simple, Boolean dataset:
– f : X → Y
– X = {0,1}^4
– Y = {0,1}
• Question 1: How should we pick the hypothesis space, the set of possible functions f?
• Question 2: How do we find the best f in the hypothesis space?
Most General Hypothesis Space
Consider all possible Boolean functions over four input features!
Dataset: [table of labeled examples]
• 2^16 possible hypotheses: each of the 2^4 = 16 possible inputs can be assigned either label
• 2^9 are consistent with our dataset: the training examples fix the output on the inputs they cover, and each remaining input can be labeled freely
• How do we choose the best one?
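Both counts can be verified by brute force. A Python sketch, with a hypothetical 7-example training set standing in for the slide's table; each hypothesis is encoded as a 16-bit integer, one bit per possible input.

from itertools import product

inputs = list(product([0, 1], repeat=4))   # all 2^4 = 16 possible inputs

# Hypothetical stand-in for the slide's dataset: 7 labeled examples.
train = {(0, 0, 1, 0): 1, (0, 1, 0, 1): 0, (1, 1, 0, 0): 1,
         (0, 0, 0, 1): 0, (1, 0, 1, 1): 1, (1, 1, 1, 0): 0,
         (0, 1, 1, 1): 1}

def consistent(h):
    # Bit i of h is the hypothesis's label for inputs[i].
    return all(((h >> inputs.index(x)) & 1) == y for x, y in train.items())

print(sum(consistent(h) for h in range(2 ** 16)))  # 2^(16-7) = 512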
A Restricted Hypothesis Space
Consider all conjunctive Boolean functions.
Dataset: [table of labeled examples]
• 16 possible hypotheses: each of the four features is either included in the conjunction or left out, giving 2^4 = 16
• None are consistent with our dataset
• How do we choose the best one?
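The same brute-force check for the restricted space, as a Python sketch: a conjunction predicts 1 iff every feature in some chosen subset is on, and the same hypothetical dataset stands in for the slide's table.

# Same hypothetical stand-in dataset as in the previous sketch.
train = {(0, 0, 1, 0): 1, (0, 1, 0, 1): 0, (1, 1, 0, 0): 1,
         (0, 0, 0, 1): 0, (1, 0, 1, 1): 1, (1, 1, 1, 0): 0,
         (0, 1, 1, 1): 1}

def conj(subset):
    # Predict 1 iff every feature index in `subset` is on.
    return lambda x: int(all(x[i] == 1 for i in subset))

subsets = [[i for i in range(4) if (mask >> i) & 1]
           for mask in range(2 ** 4)]      # 2^4 = 16 conjunctions

consistent = [s for s in subsets
              if all(conj(s)(x) == y for x, y in train.items())]
print(consistent)  # [] here: no conjunction fits, so we must settle
                   # for the best approximate fit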