CHAPTER 2:
Supervised Learning
Learning a Class from Examples
Class C of a “family car”
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a
family car?
Output:
Positive (+) and negative (–) examples
Input representation:
x1: price, x2: engine power
Training set X
X = {x^t, r^t}_{t=1}^{N}
(Figure: positive and negative training examples plotted in the (x1: price, x2: engine power) plane.)
Class C
(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)
Hypothesis class H
h(x) = 1 if h says x is positive
       0 if h says x is negative
Error of h on X
E(h | X) = (1/N) \sum_{t=1}^{N} 1(h(x^t) ≠ r^t)
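For concreteness, a minimal sketch of a rectangle hypothesis h and its empirical error E(h | X); the data points and the thresholds p1, p2, e1, e2 are made up for illustration, not taken from the slides.

import numpy as np

# Hypothetical family-car data: columns are (price, engine power),
# labels r^t are 1 (positive) or 0 (negative).
X = np.array([[15_000, 110], [22_000, 150], [45_000, 300], [9_000, 60]])
r = np.array([1, 1, 0, 0])

def h(x, p1=10_000, p2=30_000, e1=80, e2=200):
    """Axis-aligned rectangle hypothesis: (p1 <= price <= p2) AND (e1 <= power <= e2)."""
    price, power = x
    return int(p1 <= price <= p2 and e1 <= power <= e2)

# Empirical error E(h | X) = (1/N) * sum of 1(h(x^t) != r^t)
predictions = np.array([h(x) for x in X])
E = np.mean(predictions != r)
print(E)   # 0.0 on this toy sample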
S, G, and the Version Space
most specific hypothesis, S
most general hypothesis, G
Any h ∈ H between S and G is consistent with the training set; together these hypotheses make up the version space (Mitchell, 1997)
Margin
Choose the h with the largest margin (the distance between the boundary and the instances closest to it)
Candidate Elimination Algorithm
Initialize the sets S and G, respectively, to the sets of maximally
specific and maximally general generalizations that are consistent with the
first observed positive training example;
for each subsequent example ei
if ei is a negative example
- retain in S only those generalizations which do not match ei;
- specialize the members of G that match ei, only to the extent
required so that they no longer match ei, and only in such ways
that each remains more general than some generalization in S;
- remove from G any element that is more specific than some other element in G
else if ei is a positive example
- retain in G only those generalizations that match ei;
- generalize members of S that do not match ei, only to the
extent required to allow them to match ei, and only in such ways
that each remains more specific than some generalization in G;
- remove from S any element that is more general than some other element in S
end
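A minimal Python sketch of the algorithm above for conjunctive attribute-value hypotheses; the encoding ('?' for "any value", None for "no value") and the helper names are illustrative assumptions, not part of the slides.

def covers(h, x):
    """True if hypothesis h matches instance x."""
    return all(a == '?' or a == b for a, b in zip(h, x))

def more_general(a, b):
    """True if hypothesis a covers everything hypothesis b covers."""
    return all(ai == '?' or ai == bi or bi is None for ai, bi in zip(a, b))

def candidate_elimination(examples, domains):
    """examples: list of (instance tuple, is_positive); domains: possible values per attribute."""
    n = len(domains)
    S = [tuple([None] * n)]            # most specific boundary
    G = [tuple(['?'] * n)]             # most general boundary
    for x, positive in examples:
        if positive:
            # retain in G only the generalizations that match x
            G = [g for g in G if covers(g, x)]
            # minimally generalize the members of S so that they match x
            S = [tuple(xi if si is None else (si if si == xi else '?')
                       for si, xi in zip(s, x)) for s in S]
            S = [s for s in S if any(more_general(g, s) for g in G)]
        else:
            # retain in S only the generalizations that do not match x
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                # minimally specialize g so that it no longer matches x,
                # keeping only specializations still above some member of S
                for i, gi in enumerate(g):
                    if gi != '?':
                        continue
                    for v in domains[i]:
                        if v != x[i]:
                            spec = g[:i] + (v,) + g[i + 1:]
                            if any(more_general(spec, s) for s in S):
                                new_G.append(spec)
            new_G = list(dict.fromkeys(new_G))
            # remove from G any element more specific than another element of G
            G = [g for g in new_G
                 if not any(h != g and more_general(h, g) for h in new_G)]
    return S, G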
Example: Candidate Elimination
Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes
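Running the sketch given after the algorithm on this table (with attribute domains restricted to the values that actually appear, an assumption made for illustration) reproduces the familiar version space:

domains = [('Sunny', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong',), ('Warm', 'Cool'), ('Same', 'Change')]
examples = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
            (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), True),
            (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
            (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True)]
S, G = candidate_elimination(examples, domains)
# S == [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
# G == [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]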
Example: Candidate Elimination
Consider a concept described by three attributes with the following values:
  sky: Sunny, Rainy
  temperature: Warm, Cool
  humidity: Normal, Low
and the set of positive and negative training examples:
  1. (Sunny, Warm, Normal)  +
  2. (Rainy, Cool, Low)     -
  3. (Sunny, Cool, Normal)  +
  4. (Sunny, Warm, Low)     -
Example: Candidate Elimination
Size   Color  Shape     Class
Big    Red    Circle    -ve
Small  Red    Triangle  -ve
Small  Red    Circle    +ve
Big    Blue   Circle    +ve
Small  Blue   Circle    +ve
VC (Vapnik-Chervonenkis) Dimension
N points can be labeled in 2^N ways as +/-
H shatters the N points if, for every one of these labelings, there exists an h ∈ H consistent with it
VC(H) = the largest N that H can shatter
Axis-aligned rectangles can shatter (some sets of) 4 points but no set of 5, so their VC dimension is 4
What if the points lie along one axis? What about 5 points?
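A brute-force check of the rectangle claim (a sketch assuming the hypothesis class of axis-aligned rectangles in the plane): for any labeling, the tightest consistent rectangle is the bounding box of the positive points, so a point set is shattered iff no labeling puts a negative point inside that box. The point sets below are made-up examples.

from itertools import product
import numpy as np

def shattered(points):
    """True if axis-aligned rectangles shatter the given 2-D points."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    for labels in product([0, 1], repeat=n):
        mask = np.array(labels) == 1
        pos, neg = pts[mask], pts[~mask]
        if len(pos) == 0:
            continue                      # an empty rectangle handles the all-negative labeling
        lo, hi = pos.min(axis=0), pos.max(axis=0)
        # the bounding box of the positives is the tightest consistent rectangle;
        # if a negative point falls inside it, no rectangle realizes this labeling
        if any(np.all((p >= lo) & (p <= hi)) for p in neg):
            return False
    return True

print(shattered([(0, 1), (1, 0), (2, 1), (1, 2)]))          # True: a "diamond" of 4 points
print(shattered([(0, 0), (1, 0), (2, 0), (3, 0)]))          # False: 4 collinear points
print(shattered([(0, 1), (1, 0), (2, 1), (1, 2), (1, 1)]))  # False: no set of 5 points works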
Probably Approximately Correct (PAC) Learning
How many training examples N do we need so that, with probability at least 1 - δ, the hypothesis h has error (probability) at most ε? (Blumer et al., 1989)
The region where the tightest rectangle h errs can be covered by four strips, one along each side of C; it suffices that each strip has probability at most ε/4
Pr that a random instance misses one strip ≤ 1 - ε/4
Pr that N instances all miss one strip ≤ (1 - ε/4)^N
Pr that N instances miss any of the 4 strips ≤ 4(1 - ε/4)^N
Require 4(1 - ε/4)^N ≤ δ; using (1 - x) ≤ exp(-x):
4 exp(-εN/4) ≤ δ, hence N ≥ (4/ε) ln(4/δ)
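A quick numeric check of the bound (plain Python; the function name is illustrative):

import math

def pac_sample_size(eps, delta):
    """Smallest integer N with N >= (4/eps) * ln(4/delta), the bound derived above."""
    return math.ceil(4.0 / eps * math.log(4.0 / delta))

# e.g. error at most 0.1 with probability at least 0.95
print(pac_sample_size(eps=0.1, delta=0.05))   # 176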
Noise and Model Complexity
Prefer the simpler model because it is
  Simpler to use (lower computational complexity)
  Easier to train (lower space complexity)
  Easier to explain (more interpretable)
  Better at generalizing (lower variance; Occam's razor)
Occam's razor: simpler explanations are more plausible, and unnecessary complexity should be shaved off
Learning Multiple Classes
Data: X = {x^t, r^t}_{t=1}^{N}, where
  r_i^t = 1 if x^t ∈ C_i
  r_i^t = 0 if x^t ∈ C_j, j ≠ i
Model it as K two-class problems; this results in K hypotheses
Train hypotheses h_i(x), i = 1, ..., K:
  h_i(x^t) = 1 if x^t ∈ C_i
  h_i(x^t) = 0 if x^t ∈ C_j, j ≠ i
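A minimal one-vs-rest sketch of this idea (the data, the class count K, and the use of LogisticRegression as the two-class learner are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: N instances, d features, labels y in {0, ..., K-1}.
rng = np.random.default_rng(0)
N, d, K = 200, 2, 3
X = rng.normal(size=(N, d))
y = rng.integers(0, K, size=N)

# Train K two-class hypotheses h_i: class i vs. the rest.
hypotheses = []
for i in range(K):
    r_i = (y == i).astype(int)          # r_i^t = 1 if x^t in C_i, else 0
    hypotheses.append(LogisticRegression().fit(X, r_i))

# Predict by taking the class whose hypothesis is most confident.
scores = np.column_stack([h.predict_proba(X)[:, 1] for h in hypotheses])
y_hat = scores.argmax(axis=1)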
Regression
Data: X = {x^t, r^t}_{t=1}^{N}, r^t ∈ R
Actual function: r^t = f(x^t) + ε (noise)
Linear model: g(x) = w_1 x + w_0
Quadratic model: g(x) = w_2 x^2 + w_1 x + w_0
Loss function: E(g | X) = (1/N) \sum_{t=1}^{N} [r^t - g(x^t)]^2
Ex: E(w_1, w_0 | X) = (1/N) \sum_{t=1}^{N} [r^t - (w_1 x^t + w_0)]^2
Solution to the regression problem
Closed-form solution: W = (X^T X)^{-1} X^T y
  Worst-case time complexity: O(n^3) for inverting X^T X, where n is the number of features
Gradient descent: tweak the parameters iteratively to minimize a cost function
  Batch gradient descent
  Stochastic gradient descent
Programming note: SGDRegressor in sklearn.linear_model
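A small sketch of both routes on synthetic data (the data and the SGD hyperparameters are illustrative assumptions):

import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic 1-D regression data: r = 4 + 3x + noise.
rng = np.random.default_rng(0)
x = 2 * rng.random(100)
r = 4 + 3 * x + rng.normal(scale=0.5, size=100)

# Closed form (normal equation): W = (X^T X)^{-1} X^T r,
# with a column of ones so that w_0 is the intercept.
X = np.column_stack([np.ones_like(x), x])
W = np.linalg.inv(X.T @ X) @ X.T @ r
print(W)                                   # approximately [4, 3]

# Stochastic gradient descent via scikit-learn.
sgd = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.01)
sgd.fit(x.reshape(-1, 1), r)
print(sgd.intercept_, sgd.coef_)           # also roughly 4 and 3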
Gradient Descent
Tweak the parameters iteratively to minimize a cost function
If the learning rate is too small, convergence is slow; if it is too large, the steps overshoot and may diverge
Pitfalls?
Gradient Descent: Impact of Scaling
(Figure: gradient descent paths on the cost contours with and without feature scaling.)
Practice: ensure all features have a similar scale
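One common way to follow this practice in scikit-learn (a sketch with made-up features on very different scales):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# Two features on very different scales (illustrative data).
rng = np.random.default_rng(0)
X = np.column_stack([rng.random(200) * 1e4, rng.random(200)])
r = 2.0 * X[:, 0] / 1e4 + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Standardize each feature before running SGD so the steps are well conditioned.
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
model.fit(X, r)
print(model.score(X, r))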
Gradient Descent: Batch, Stochastic, and Mini-Batch
Batch: update all parameters at once with a matrix operation over the full training set
  ∇_W E(W) = ( ∂E(W)/∂w_0, ∂E(W)/∂w_1, ..., ∂E(W)/∂w_n )^T = (2/N) X^T (X W - r)
Gradient descent step: W^{next} = W^{current} - η ∇_W E(W)
  η: learning rate
Stochastic: pick a random instance and compute the gradient on that single instance
Mini-batch: pick a random set of instances and compute the gradient on that set
Batch vs. Stochastic vs. Mini-Batch
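A compact NumPy sketch comparing the three update schemes on the linear-regression loss above (the learning rate, epoch count, batch size, and synthetic data are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = 2 * rng.random(100)
r = 4 + 3 * x + rng.normal(scale=0.5, size=100)
X = np.column_stack([np.ones_like(x), x])      # add a bias column
N, eta, epochs = len(r), 0.05, 500

def gradient(W, Xb, rb):
    """Gradient of the mean-squared error on the (mini-)batch (Xb, rb)."""
    return (2 / len(rb)) * Xb.T @ (Xb @ W - rb)

# Batch: one step per epoch using the full data set.
W = np.zeros(2)
for _ in range(epochs):
    W -= eta * gradient(W, X, r)

# Stochastic: one step per randomly chosen instance.
W_sgd = np.zeros(2)
for _ in range(epochs):
    for i in rng.permutation(N):
        W_sgd -= eta * gradient(W_sgd, X[i:i+1], r[i:i+1])

# Mini-batch: one step per random batch of 16 instances.
W_mb = np.zeros(2)
for _ in range(epochs):
    idx = rng.permutation(N)
    for start in range(0, N, 16):
        batch = idx[start:start+16]
        W_mb -= eta * gradient(W_mb, X[batch], r[batch])

print(W, W_sgd, W_mb)    # each approximately [4, 3]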
Polynomial Regression
Add powers of each feature and then train a linear model on the expanded features
Programming note: use Scikit-Learn's PolynomialFeatures to add powers of the features, then use LinearRegression()
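A short sketch of that recipe (the degree-2 data and the pipeline below are illustrative):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Quadratic data: r = 0.5 x^2 + x + 2 + noise.
rng = np.random.default_rng(0)
x = 6 * rng.random((100, 1)) - 3
r = 0.5 * x[:, 0] ** 2 + x[:, 0] + 2 + rng.normal(size=100)

# Expand x into [x, x^2], then fit an ordinary linear model.
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(x, r)
print(model.named_steps['linearregression'].coef_)   # approximately [1, 0.5]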
Model Selection & Generalization
Data alone are not sufficient to find a unique solution
The need for inductive bias: assumptions about H
Generalization: how well a model performs on new data
Overfitting: H more complex than C or f
Underfitting: H less complex than C or f
Triple Trade-Off
There is a trade-off between three factors (Dietterich, 2003):
  Complexity of H, c(H)
  Training set size, N
  Generalization error, E, on new data
As N increases, E decreases
As c(H) increases, E first decreases and then increases
Cross-Validation: General Practice
To estimate generalization error, we need data unseen during training. We split the data as:
  Training set (50%)
  Validation set (25%): used to select the best model
  Test (publication) set (25%): test the selected model on this set
Fit candidate hypotheses on the training set, then select the one that is most accurate on the validation set.
Use resampling when there is little data.
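A sketch of this 50/25/25 split with scikit-learn (the data and the candidate models are placeholders):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge

# Placeholder data.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
r = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=400)

# 50% training, 25% validation, 25% test.
X_train, X_rest, r_train, r_rest = train_test_split(X, r, test_size=0.5, random_state=0)
X_val, X_test, r_val, r_test = train_test_split(X_rest, r_rest, test_size=0.5, random_state=0)

# Fit candidate models on the training set, select on the validation set...
candidates = [LinearRegression(), Ridge(alpha=1.0)]
best = max(candidates, key=lambda m: m.fit(X_train, r_train).score(X_val, r_val))

# ...and report the selected model's score once on the test set.
print(best, best.score(X_test, r_test))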