CSCI218: Foundations
of Artificial Intelligence
Lectures on learning
§ Learning: a process for improving the performance of an agent
through experience
§ Learning I (today):
§ The general idea: generalization from experience
§ Supervised learning: classification and regression
§ Learning II: neural networks and deep learning
§ Reinforcement learning: learning complex V and Q functions
Supervised learning
§ To learn an unknown target function f
§ Input: a training set of labeled examples (xj,yj) where yj = f(xj)
§ E.g., xj is an image, f(xj) is the label “giraffe”
§ E.g., xj is a seismic signal, f(xj) is the label “explosion”
§ Output: hypothesis h that is “close” to f, i.e., predicts well on unseen
examples (“test set”)
§ Many possible hypothesis families for h
§ Linear models, logistic regression, neural networks, decision trees, nearest-neighbor (instance-based) methods, grammars, kernelized separators, etc.
§ Classification = learning f with discrete output value
§ Regression = learning f with real-valued output value
Inductive Learning (Science)
§ Simplest form: learn a function from examples
§ A target function: g
§ Examples: input-output pairs (x, g(x))
§ E.g. x is an email and g(x) is spam / ham
§ E.g. x is a house and g(x) is its selling price
§ Problem:
§ Given a hypothesis space H
§ Given a training set of examples (xi, g(xi))
§ Find a hypothesis h(x) such that h ≈ g
§ Includes:
§ Classification (outputs = class labels)
§ Regression (outputs = real numbers)
Classification example: Object recognition
[Figure: six training images x with labels f(x) = giraffe, giraffe, giraffe, llama, llama, llama; then a new image X with f(X) = ?]
Example: Spam Filter
§ Input: an email
§ Output: spam/ham
§ Setup:
§ Get a large collection of example emails, each labeled “spam” or “ham” (by hand)
§ Learn to predict labels of new incoming emails
§ Classifiers reject 200 billion spam emails per day
§ Features: The attributes used to make the ham / spam decision
§ Words: FREE!
§ Text Patterns: $dd, CAPS
§ Non-text: SenderInContacts, AnchorLinkMismatch
§ …
[Sidebar: an example spam email (“Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …”, “TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.”, “99 MILLION EMAIL ADDRESSES FOR ONLY $99”) and an example ham email (“Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use. I know it was working pre being stuck in the corner, but when I plugged it in, hit the power, nothing happened.”)]
Example: Digit Recognition
§ Input: images / pixel grids
§ Output: a digit 0-9
§ Setup:
§ MNIST data set: a collection of 60K hand-labeled images
§ Note: someone has to hand label all this data!
§ Want to learn to predict labels of new digit images
§ Features: The attributes used to make the digit decision
§ Pixels: (6,8)=ON
§ Shape Patterns: NumComponents, AspectRatio, NumLoops
§ …
[Figure: example handwritten digit images labeled 0, 1, 2, 1, and an ambiguous one labeled ??]
Other Classification Tasks
§ Medical diagnosis
§ input: symptoms
§ output: disease
§ Automatic essay grading
§ input: document
§ output: grades
§ Fraud detection
§ input: account activity
§ output: fraud / no fraud
§ Email routing
§ input: customer complaint email
§ output: which department needs to ignore this email
§ Fruit and vegetable inspection
§ input: image (or gas analysis)
§ output: moldy or OK
§ … many more
Regression example: Curve fitting
Basic questions
§ Which hypothesis space H to choose?
§ How to measure degree of fit?
§ How to trade off degree of fit vs. complexity?
§ “Ockham’s razor”
§ How do we find a good h?
§ How do we know if a good h will predict well?
Training and Testing
A few important points about learning
§ Data: labeled instances, e.g. emails marked spam/ham
§ Training set
§ Held out set
§ Test set
§ Features: attribute-value pairs which characterize each x
§ Experimentation cycle
§ Learn parameters (e.g. model probabilities) on training set
§ (Tune hyperparameters on held-out set)
§ Compute accuracy on the test set
§ Very important: never “peek” at the test set!
§ Evaluation
§ Accuracy: fraction of instances predicted correctly
§ Overfitting and generalization
§ Want a classifier which does well on test data
§ Overfitting: fitting the training data very closely, but not generalizing well
§ Underfitting: fits the training set poorly
[Sidebar: the data split into Training Data, Held-Out Data (validation set), and Test Data]
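The three-way split above can be sketched in code. This is a minimal sketch: the dataset, the threshold “model”, and the split proportions are all hypothetical placeholders, not part of the slides.

```python
import random

# Hypothetical labeled dataset: (feature, label) pairs, label 1 iff x > 50.
data = [(x, 1 if x > 50 else 0) for x in range(100)]
random.seed(0)
random.shuffle(data)

# 60% training, 20% held-out (validation), 20% test.
n = len(data)
train = data[: int(0.6 * n)]
held_out = data[int(0.6 * n): int(0.8 * n)]
test = data[int(0.8 * n):]

def accuracy(h, examples):
    """Fraction of instances predicted correctly."""
    return sum(h(x) == y for x, y in examples) / len(examples)

# Tune a hyperparameter (here, a decision threshold) on the held-out set...
best_t = max(range(0, 100, 10),
             key=lambda t: accuracy(lambda x: int(x > t), held_out))
# ...then report final accuracy once, on the untouched test set.
final = accuracy(lambda x: int(x > best_t), test)
```

The key discipline the slide insists on: the test set is consulted exactly once, after all tuning is finished.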
A few important points about learning
§ What should we learn where?
§ Learn parameters from training data
§ Tune hyperparameters on different data
§ Why?
§ For each value of the hyperparameters, train on the training data and evaluate on the held-out data
§ Choose the best value and do a final test on the test data
§ What are examples of hyperparameters?
Supervised Learning
§ Classification = learning f with discrete output value
§ Regression = learning f with real-valued output value
Linear Regression
Hypothesis family: Linear functions
Linear Regression
(x, y=f(x)), x: house size, y: house price
[Figure: scatter plot of house price in $1000 (300-1000) against house size in square feet (500-3500). Berkeley house prices, 2009]
Linear regression = fitting a straight line/hyperplane
Prediction: hw(x) = w0 + w1x
[Figure: the same scatter plot with the fitted line hw(x), alongside a small schematic of a line fit through points in the (x, hw(x)) plane. Berkeley house prices, 2009]
Prediction error
Error on one instance: y – hw(x)
[Figure: fitted line with one data point; the vertical gap between the observation y and the prediction hw(x) is the error, or “residual”]
Find w
§ Define loss function
§ Find w* to minimize loss function
Least squares: Minimizing squared error
§ L2 loss function: sum of squared errors over all examples
§ Loss = ____________________________
§ We want the weights w* that minimize loss
§ At w* the derivatives of loss w.r.t. each weight are zero:
§ ∂Loss/∂w0 = __________________________
§ ∂Loss/∂w1 = __________________________
§ Exact solutions for N examples:
§ w1 = [N Σj xjyj – (Σj xj)(Σj yj)] / [N Σj xj² – (Σj xj)²] and w0 = (1/N)[Σj yj – w1 Σj xj]
§ For the general case where x is an n-dimensional vector
§ X is the data matrix (all the data, one example per row); y is the column of labels
§ w* = (XᵀX)⁻¹Xᵀy
Least squares: Minimizing squared error
§ L2 loss function: sum of squared errors over all examples
§ Loss = Σj (yj – hw(xj))² = Σj (yj – (w0 + w1xj))²
§ We want the weights w* that minimize loss
§ At w* the derivatives of loss w.r.t. each weight are zero:
§ ∂Loss/∂w0 = –2 Σj (yj – (w0 + w1xj)) = 0
§ ∂Loss/∂w1 = –2 Σj (yj – (w0 + w1xj)) xj = 0
§ Exact solutions for N examples:
§ w1 = [N Σj xjyj – (Σj xj)(Σj yj)] / [N Σj xj² – (Σj xj)²] and w0 = (1/N)[Σj yj – w1 Σj xj]
§ For the general case where x is an n-dimensional vector
§ X is the data matrix (all the data, one example per row); y is the column of labels
§ w* = (XᵀX)⁻¹Xᵀy
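The closed-form solution can be checked numerically. A minimal sketch, assuming NumPy and made-up house-price data lying exactly on a line (the numbers are invented for illustration):

```python
import numpy as np

# Made-up (house size, price) pairs, chosen to lie exactly on y = 100 + 0.2 x.
x = np.array([1000., 1500., 2000., 2500., 3000.])
y = np.array([300., 400., 500., 600., 700.])

# Data matrix X: one example per row, with a leading column of 1s
# so that w0 acts as the intercept.
X = np.column_stack([np.ones_like(x), x])

# Closed form w* = (X^T X)^{-1} X^T y; np.linalg.solve is numerically
# preferable to forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)
# w ≈ [100., 0.2], i.e. w0 = 100, w1 = 0.2
```

Since the data are exactly linear, the recovered weights match the generating line.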
Regression vs Classification
§ Linear regression when output is binary, y ∈ {−1, 1}
§ hw(x) = w0 + w1x
[Figure: binary data y ∈ {−1, 1} plotted against x, with the regression line w0 + w1x]
§ Linear classification
§ Used with discrete output values
§ Threshold a linear function
§ hw(x) = 1 if w0 + w1x ≥ 0
§ hw(x) = −1 if w0 + w1x < 0
§ w: weight vector
§ Activation function g
[Figure: the same data with the step function g(w0 + w1x)]
Threshold perceptron as linear classifier
Binary Decision Rule
§ A threshold perceptron is a single unit that outputs
§ y = hw(x) = 1 when w·x ≥ 0
§           = −1 when w·x < 0
§ In the input vector space
§ Examples are points x
§ The equation w·x = 0 defines a hyperplane
§ One side corresponds to y = 1
§ The other corresponds to y = −1
[Figure: feature space with axes free (0-1) and money (0-2); the line w·x = 0 separates y=1 (SPAM) from y=−1 (HAM). Weights: w0 = −3, wfree = 4, wmoney = 2]
Example
Email: “Dear Stuart, I’m leaving Macrosoft to return to academia. The money is great here but I prefer to be free to do my own research; and I really love teaching undergrads! Do I need to finish my BA first before applying? Best wishes, Bill”
Features: x0 = 1, xfree = 1, xmoney = 1
Weights: w0 = −3, wfree = 4, wmoney = 2
w·x = −3×1 + 4×1 + 2×1 = 3 ≥ 0, so the email is classified y = 1 (SPAM)
[Figure: the feature space with the boundary w·x = 0; this email lands on the SPAM side]
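The decision rule in the example is a dot product followed by a threshold. A minimal sketch, using the weights and features from the slide:

```python
# Weights and features from the example slide (bias feature x0 is always 1).
w = {"bias": -3, "free": 4, "money": 2}
x = {"bias": 1, "free": 1, "money": 1}   # both words appear in the email

def classify(w, x):
    """Threshold perceptron: SPAM if w.x >= 0, else HAM."""
    score = sum(w[f] * x[f] for f in w)
    return "SPAM" if score >= 0 else "HAM"

score = sum(w[f] * x[f] for f in w)   # -3*1 + 4*1 + 2*1 = 3
label = classify(w, x)                # "SPAM"
```

With score 3 ≥ 0, the academic email is (wrongly, as the next slides show) flagged as spam.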
Weight Updates
Need a different solution than for least squares: the thresholded output is not differentiable, so we cannot find the weights by setting derivatives of the loss to zero.
Perceptron learning rule
§ If true label y* ≠ hw(x) (an error), adjust the weights
§ If w.x < 0 but the output should be y*=1
§ This is called a false negative
§ Should increase weights on positive inputs
§ Should decrease weights on negative inputs
§ If w.x > 0 but the output should be y*=-1
§ This is called a false positive
§ Should decrease weights on positive inputs
§ Should increase weights on negative inputs
Perceptron Learning Rule
§ Start with weights = 0
§ y = hw(x) = 1 when w·x ≥ 0
§           = −1 when w·x < 0
§ For each training instance:
§ If wrong: adjust the weight vector by adding or subtracting the feature vector, where y* is the true label:
§ w ← w + α y* x
Example
Email: “Dear Stuart, I wanted to let you know that I have decided to leave Macrosoft and return to academia. The money is great here but I prefer to be free to pursue more interesting research and I really love teaching undergraduates! Do I need to finish my BA first before applying? Best wishes, Bill”
Features: x0 = 1, xfree = 1, xmoney = 1
Current weights: w0 = −3, wfree = 4, wmoney = 2
w·x = −3×1 + 4×1 + 2×1 = 3 ≥ 0, so predicted SPAM; the true label is HAM (here encoded as y* = 0, with prediction hw(x) = 1), a false positive
Update with α = 0.5 and rule w ← w + α (y* − hw(x)) x:
w ← (−3, 4, 2) + 0.5 (0 − 1) (1, 1, 1) = (−3.5, 3.5, 1.5)
[Figure: the feature space with the boundary w·x = 0; this email lands on the SPAM side but is labeled y = 0 (HAM)]
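The single update from the slide can be reproduced directly. A minimal sketch, assuming 0/1 labels and the mistake-driven rule w ← w + α (y* − hw(x)) x, which matches the slide’s arithmetic:

```python
# One perceptron weight update, matching the slide's numbers.
# Labels are 0 (HAM) / 1 (SPAM); the update fires only when the
# prediction is wrong, since (y_true - y_pred) is then nonzero.
alpha = 0.5
w = [-3.0, 4.0, 2.0]          # [w0, w_free, w_money]
x = [1.0, 1.0, 1.0]           # [x0, x_free, x_money]
y_true = 0                    # the email is actually HAM
y_pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0  # predicts SPAM

w = [wi + alpha * (y_true - y_pred) * xi for wi, xi in zip(w, x)]
# w is now [-3.5, 3.5, 1.5], as on the slide
```

Every weight on an active feature is pulled down by 0.5, moving the boundary toward classifying this email as HAM.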
Perceptron convergence theorem
§ A learning problem is linearly separable iff there is some hyperplane exactly separating positive from negative examples
§ Convergence: if the training data are linearly separable, perceptron learning applied repeatedly to the training set will eventually converge to a perfect separator
[Figure: a linearly separable dataset and a non-separable dataset]
Example: Earthquakes vs nuclear explosions
[Figure: left, scatter of x2 vs x1 (earthquakes vs nuclear explosions) with the learned linear separator; right, proportion correct vs number of weight updates, rising from about 0.4 to 1.0. 63 examples, 657 updates required]
Perceptron convergence theorem
§ A learning problem is linearly separable iff there is some hyperplane exactly separating positive from negative examples
§ Convergence (separable case): if the training data are separable, perceptron learning applied repeatedly to the training set will eventually converge to a perfect separator
§ Convergence (non-separable case): if the training data are non-separable, perceptron learning will converge to a minimum-error solution provided the learning rate α is decayed appropriately (e.g., α = 1/t)
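The repeated application of the learning rule with a decaying rate α = 1/t can be sketched end to end. The four-point dataset below is invented for illustration and is linearly separable, so the loop should reach a perfect separator:

```python
# Perceptron learning with decaying learning rate alpha = 1/t on a tiny
# hand-made linearly separable dataset; labels are in {-1, +1} and each
# feature vector x carries a leading 1 for the bias weight.
data = [([1.0,  2.0,  1.0], +1),
        ([1.0,  0.0,  3.0], +1),
        ([1.0, -1.0, -1.0], -1),
        ([1.0, -2.0,  0.5], -1)]

w = [0.0, 0.0, 0.0]
t = 0
for epoch in range(100):
    errors = 0
    for x, y_true in data:
        t += 1
        alpha = 1.0 / t                       # decaying learning rate
        y_pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
        if y_pred != y_true:                  # update only on mistakes
            w = [wi + alpha * y_true * xi for wi, xi in zip(w, x)]
            errors += 1
    if errors == 0:                           # perfect separator found
        break
```

On non-separable data the same decay schedule is what makes the weights settle on a minimum-error solution rather than oscillating forever.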
Perceptron learning with fixed α
[Figure: left, scatter of x2 vs x1 with the separator; right, proportion correct vs number of weight updates, oscillating without settling. 71 examples, 100,000 updates, fixed α = 0.2, no convergence]
Perceptron learning with decaying α
[Figure: left, scatter of x2 vs x1 with the separator; right, proportion correct vs number of weight updates, approaching 1.0. 71 examples, 100,000 updates, decaying α = 1000/(1000 + t), near-convergence]
Non-Separable Case
Even the best linear boundary makes at least one mistake
Other Linear Classifiers
§ Perceptron is perfectly happy as long as it separates the training data
[Figure: the hard threshold activation g_threshold(w·x)]
§ Logistic Regression
§ g_sigmoid(z) = 1 / (1 + e^(−z))
[Figure: the smooth sigmoid activation g_sigmoid(w·x)]
§ Support Vector Machines (SVM)
§ Maximize margin between boundary and nearest points
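Logistic regression replaces the hard threshold with the sigmoid, turning the score w·x into a probability. A minimal sketch, reusing the spam-example weights as hypothetical inputs:

```python
import math

def sigmoid(z):
    """Logistic function: squashes a real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    """P(y = 1 | x) under a logistic-regression model: sigmoid(w.x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

# Hypothetical weights from the earlier spam example.
w = [-3.0, 4.0, 2.0]
p_spam = predict_proba(w, [1.0, 1.0, 1.0])   # w.x = 3, well on the SPAM side
p_edge = predict_proba(w, [1.0, 0.5, 0.5])   # w.x = 0, exactly on the boundary
```

Points on the boundary get probability 0.5, and confidence grows smoothly with distance from it, unlike the perceptron’s all-or-nothing output.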
Perceptrons hopeless for XOR function
[Figure: the Boolean functions plotted in the (x1, x2) plane: (a) x1 and x2, (b) x1 or x2, (c) x1 xor x2; only (c) has no linear separator]
Basic questions
§ Which hypothesis space H to choose?
§ How to measure degree of fit?
§ How to trade off degree of fit vs. complexity?
§ “Ockham’s razor”
§ How do we find a good h?
§ How do we know if a good h will predict well?
Classical stats/ML: Minimize loss function
§ Which hypothesis space H to choose?
§ E.g., linear combinations of features: hw(x) = wTx
§ How to measure degree of fit?
§ Loss function, e.g., squared error Σj (yj – wTxj)²
§ How to trade off degree of fit vs. complexity?
§ Regularization: complexity penalty, e.g., ||w||2
§ How do we find a good h?
§ Optimization (closed-form, numerical); discrete search
§ How do we know if a good h will predict well?
§ Try it and see (cross-validation, bootstrap, etc.)
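The “try it and see” answer can be made concrete with k-fold cross-validation. A minimal sketch; the 1-D dataset and the toy threshold learner are hypothetical stand-ins for a real model:

```python
import random

# Hypothetical 1-D dataset: label is 1 exactly when x > 0.
random.seed(0)
data = [(x, int(x > 0)) for x in [random.uniform(-1, 1) for _ in range(40)]]

def train(examples):
    """Fit a 1-D threshold classifier by picking the cutoff with the
    fewest training errors (a toy stand-in for a real learner)."""
    xs = sorted(x for x, _ in examples)
    best = min(xs, key=lambda t: sum(int(x > t) != y for x, y in examples))
    return lambda x, t=best: int(x > t)

def accuracy(h, examples):
    return sum(h(x) == y for x, y in examples) / len(examples)

def k_fold_cv(data, k=5):
    """Average held-out accuracy over k train/test splits."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_fold = folds[i]
        train_folds = [ex for j in range(k) if j != i for ex in folds[j]]
        scores.append(accuracy(train(train_folds), test_fold))
    return sum(scores) / k

cv_score = k_fold_cv(data)   # estimate of how well h will predict on unseen data
```

Each example serves once as held-out data, so the averaged score estimates generalization without ever touching a final test set.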
Probabilistic: Max. likelihood, max. a posteriori
§ Which hypothesis space H to choose?
§ Probability model P(y | x,h) , e.g., Y ~ N(wTx,σ2)
§ How to measure degree of fit?
§ Data likelihood Πj P(yj | xj,h)
§ How to trade off degree of fit vs. complexity?
§ Regularization or prior: argmaxh P(h) Πj P(yj | xj,h) (maximum a posteriori, MAP)
§ How do we find a good h?
§ Optimization (closed-form, numerical); discrete search
§ How do we know if a good h will predict well?
§ Empirical process theory (generalizes Chebyshev, CLT, PAC…);
§ Key assumption is i.i.d. (independent, identically distributed) data
Bayesian: Computing posterior over H
§ Which hypothesis space H to choose?
§ All hypotheses with nonzero a priori probability
§ How to measure degree of fit?
§ Data probability, as for MLE/MAP
§ How to trade off degree of fit vs. complexity?
§ Use prior, as for MAP
§ How do we find a good h?
§ Don’t! Bayes predictor P(y|x,D) = Σh P(y|x,h) P(h|D) ∝ Σh P(y|x,h) P(D|h) P(h)
§ How do we know if a good h will predict well?
§ Silly question! Bayesian prediction is optimal!!
Acknowledgement
The lecture slides are based on the materials from ai.berkeley.edu
Questions