CHAPTER 2: Supervised Learning
Learning a Class from Examples
 Class C of a “family car”
 Prediction: Is car x a family car?
 Knowledge extraction: What do people expect from a family car?
 Output: positive (+) and negative (–) examples
 Input representation: x1: price, x2: engine power
Training set X

X = {x^t, r^t},  t = 1, ..., N

x = [x1, x2]^T   (the two input attributes as a column vector)
Class C

(p1 ≤ price ≤ p2)  AND  (e1 ≤ engine power ≤ e2)
Hypothesis class H

h(x) = 1 if h says x is positive
       0 if h says x is negative

Error of h ∈ H

E(h | X) = Σ_{t=1}^{N} 1( h(x^t) ≠ r^t )

S, G, and the Version Space
 Most specific hypothesis, S
 Most general hypothesis, G
 Any h ∈ H between S and G is consistent with the training set; together they make up the version space (Mitchell, 1997)
Margin
 Choose the h with the largest margin (the distance between the boundary and the closest instance)
Candidate Elimination Algorithm

Initialize the sets S and G, respectively, to the sets of maximally specific
and maximally general generalizations that are consistent with the first
observed positive training example;
for each subsequent example e_i:
    if e_i is a negative example:
        - retain in S only those generalizations which do not match e_i;
        - specialize the members of G that match e_i, only to the extent
          required so that they no longer match e_i, and only in such ways
          that each remains more general than some generalization in S;
        - remove from G any element that is more specific than some other
          element of G;
    else if e_i is a positive example:
        - retain in G only those generalizations that match e_i;
        - generalize the members of S that do not match e_i, only to the
          extent required to allow them to match e_i, and only in such ways
          that each remains more specific than some generalization in G;
        - remove from S any element that is more general than some other
          element of S;
end
Example: Candidate Elimination (EnjoySport)

Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
Sunny  Warm     Normal    Strong  Warm   Same      Yes
Sunny  Warm     High      Strong  Warm   Same      Yes
Rainy  Cold     High      Strong  Warm   Change    No
Sunny  Warm     High      Strong  Cool   Change    Yes
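Programming note (illustrative): the sketch below is a simplified candidate elimination for conjunctive attribute–value hypotheses, applied to the EnjoySport data above. It keeps S as a single most-specific conjunction, assumes the first example is positive and the data are consistent with one conjunction, and uses '?' for “any value”; it is a teaching aid, not a full implementation.

def matches(h, x):
    """True if hypothesis h covers instance x."""
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def more_general_or_equal(h1, h2):
    """True if h1 is at least as general as h2."""
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples):
    n = len(examples[0][0])
    S = None                                  # most specific hypothesis
    G = [tuple('?' for _ in range(n))]        # set of most general hypotheses
    for x, positive in examples:
        if positive:
            G = [g for g in G if matches(g, x)]       # keep only g that cover x
            if S is None:
                S = tuple(x)                          # initialize from 1st positive
            else:                                     # minimally generalize S
                S = tuple(sv if sv == xv else '?' for sv, xv in zip(S, x))
        else:
            new_G = []
            for g in G:
                if not matches(g, x):
                    new_G.append(g)
                    continue
                # minimally specialize g so it no longer covers x,
                # while staying more general than S
                for i in range(n):
                    if g[i] == '?' and S[i] != '?' and S[i] != x[i]:
                        new_G.append(g[:i] + (S[i],) + g[i + 1:])
            new_G = list(dict.fromkeys(new_G))        # drop duplicates
            # drop members that are more specific than another member of G
            G = [g for g in new_G
                 if not any(h != g and more_general_or_equal(h, g) for h in new_G)]
    return S, G

enjoy_sport = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   True),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   True),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True),
]
S, G = candidate_elimination(enjoy_sport)
print('S =', S)   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
print('G =', G)   # [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]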
Example: Candidate Elimination

 Consider a concept described by three attributes with the following predefined values:

    sky:          Sunny, Rainy
    temperature:  Warm, Cool
    humidity:     Normal, Low

and the following set of positive and negative training examples
(S = Sunny, R = Rainy, W = Warm, C = Cool, N = Normal, L = Low):

1. (S W N)  +
2. (R C L)  –
3. (S C N)  +
4. (S W L)  –
Example: Candidate Elimination

Size   Color  Shape     Class
Big    Red    Circle    -ve
Small  Red    Triangle  -ve
Small  Red    Circle    +ve
Big    Blue   Circle    +ve
Small  Blue   Circle    +ve
VC (Vapnik-Chervonenkis) Dimension

 N points can be labeled in 2^N ways as +/–
 H shatters the N points if, for every one of these labelings, there exists an h ∈ H consistent with it; VC(H) is the largest N that H can shatter
 An axis-aligned rectangle can shatter at most 4 points, so its VC dimension is 4
 What if the points lie along one axis?
 What about 5 points?
Probably Approximately Correct (PAC) Learning

 How many training examples N do we need so that, with probability at least 1 ‒ δ, the hypothesis h has error at most ε? (Blumer et al., 1989)
 Each strip is at most ε/4
 Pr that a random instance misses a strip: at most 1 ‒ ε/4
 Pr that N instances all miss a strip: (1 ‒ ε/4)^N
 Pr that N instances miss any of the 4 strips: at most 4(1 ‒ ε/4)^N
 Require 4(1 ‒ ε/4)^N ≤ δ; since (1 ‒ x) ≤ exp(‒x),
 it suffices that 4 exp(‒εN/4) ≤ δ, i.e. N ≥ (4/ε) ln(4/δ)
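Programming note (illustrative): plugging hypothetical values into the bound above, e.g. ε = 0.1 and δ = 0.05:

import math

# N >= (4/eps) * ln(4/delta) for the axis-aligned rectangle class
eps, delta = 0.1, 0.05
N = math.ceil((4 / eps) * math.log(4 / delta))
print(N)   # 176 -> at least 176 examples suffice for these eps, delta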

Noise and Model Complexity
Select the simpler model because it
 Is simpler to use (lower computational complexity)
 Is easier to train (lower space complexity)
 Is easier to explain (more interpretable)
 Generalizes better (lower variance – Occam’s razor)
 Occam’s razor: simpler explanations are more plausible, and unnecessary complexity should be shaved off
Learning Multiple Classes

Data: X = {x^t, r^t},  t = 1, ..., N,  with

r_i^t = 1 if x^t ∈ C_i
        0 if x^t ∈ C_j, j ≠ i

 Model the K-class problem as K two-class problems
 This results in K hypotheses
 Train the hypotheses h_i(x), i = 1, ..., K:

h_i(x^t) = 1 if x^t ∈ C_i
           0 if x^t ∈ C_j, j ≠ i
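Programming note (illustrative): a minimal one-vs-all sketch of the “K two-class problems” idea. The data, K = 3, and the choice of LogisticRegression as the base learner are hypothetical; any binary classifier could play the role of h_i.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
y_train = rng.integers(0, 3, size=300)                  # class labels 0, 1, 2
X_train = centers[y_train] + rng.normal(size=(300, 2))  # two input attributes

K = 3
hypotheses = []
for i in range(K):
    r_i = (y_train == i).astype(int)                    # r_i^t = 1 iff x^t is in C_i
    hypotheses.append(LogisticRegression().fit(X_train, r_i))

# Predict with the class whose hypothesis is most confident
scores = np.column_stack([h.predict_proba(X_train)[:, 1] for h in hypotheses])
y_pred = scores.argmax(axis=1)
print((y_pred == y_train).mean())   # training accuracy (high for these well-separated classes)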
Regression

Data: X = {x^t, r^t},  t = 1, ..., N,  with r^t ∈ ℝ

 Actual function: r^t = f(x^t) + ε
 Linear model: g(x) = w1 x + w0
 Quadratic model: g(x) = w2 x^2 + w1 x + w0
 Loss function (empirical error):

E(g | X) = (1/N) Σ_{t=1}^{N} [ r^t − g(x^t) ]^2

 Example, for the linear model:

E(w1, w0 | X) = (1/N) Σ_{t=1}^{N} [ r^t − (w1 x^t + w0) ]^2
Solution to the regression problem
 Closed-form solution: W = (X^T X)^{-1} X^T r
 Worst-case time complexity: O(n^3), where n is the number of features
 Gradient descent: tweak the parameters iteratively to minimize a cost function
 Batch gradient descent
 Stochastic gradient descent
 Programming note: SGDRegressor in sklearn.linear_model
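Programming note (illustrative sketch): both approaches on hypothetical 1-D data with r = 4 + 3x + noise; the data and hyperparameters are made up for illustration.

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=(100, 1))
r = 4 + 3 * x[:, 0] + rng.normal(scale=0.5, size=100)

# Closed-form solution (normal equation): W = (X^T X)^{-1} X^T r
X = np.c_[np.ones(len(x)), x]             # prepend a bias column for w0
W = np.linalg.inv(X.T @ X) @ X.T @ r
print(W)                                  # roughly [4, 3]

# The same fit with stochastic gradient descent via scikit-learn
sgd = SGDRegressor(max_iter=1000, eta0=0.01, random_state=0)
sgd.fit(x, r)
print(sgd.intercept_, sgd.coef_)          # roughly 4 and [3]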
Gradient Descent
 Tweak the parameters iteratively to optimize a cost function
 Learning rate too small: convergence is very slow
 Learning rate too large: the updates overshoot and may diverge
 Pitfalls?
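Programming note (illustrative): the effect of the learning rate on a toy 1-D cost E(w) = w^2 (gradient 2w); the rates below are hypothetical choices.

def run_gd(eta, steps=20, w=1.0):
    for _ in range(steps):
        w = w - eta * 2 * w          # gradient-descent update
    return w

print(run_gd(0.01))   # too small: after 20 steps w is still far from the minimum at 0
print(run_gd(0.4))    # reasonable: w is essentially 0
print(run_gd(1.5))    # too large: |w| blows up (the updates overshoot and diverge)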
Gradient Descent: Impact of Scaling

[Figure: cost-function contours with scaling vs. without scaling]

Practice: ensure all features have a similar scale
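Programming note (illustrative): standardizing hypothetical raw features with very different scales, e.g. price (in dollars) and engine power (in hp), before running gradient descent.

import numpy as np
from sklearn.preprocessing import StandardScaler

X_raw = np.array([[15_000.0,  90.0],
                  [32_000.0, 150.0],
                  [55_000.0, 300.0]])
X_scaled = StandardScaler().fit_transform(X_raw)   # zero mean, unit variance per column
print(X_scaled)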


Gradient Descent: Batch, Stochastic and Mini-batch

 Batch: update all parameters at once using the full training set in a single matrix operation

∇_W E(W) = [ ∂E(W)/∂w0, ∂E(W)/∂w1, ..., ∂E(W)/∂w_n ]^T = (2/N) X^T (X W − r)

 Gradient-descent step: W_next = W_current − η ∇_W E(W)
 η: learning rate
 Stochastic: pick a random instance and compute the gradient on that single instance
 Mini-batch: pick a random set of instances and compute the gradient on that set
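Programming note (illustrative sketch): batch gradient descent for linear regression, following the update rule above; the data and the learning rate are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=(100, 1))
r = 4 + 3 * x[:, 0] + rng.normal(scale=0.5, size=100)

X = np.c_[np.ones(len(x)), x]                 # bias column + feature
W = np.zeros(2)                               # [w0, w1]
eta = 0.1                                     # learning rate
for _ in range(1000):
    grad = (2 / len(X)) * X.T @ (X @ W - r)   # full-batch gradient of E(W)
    W = W - eta * grad                        # W_next = W_current - eta * grad
print(W)                                      # roughly [4, 3]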
Batch vs Stochastic vs Mini-batch Gradient Descent
Polynomial regression
 Add powers of each feature and then train a linear model
 Programming note: use scikit-learn’s PolynomialFeatures to add powers of the features, then use LinearRegression()
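Programming note (illustrative sketch): quadratic toy data fitted with the recipe above; the data and the degree are hypothetical choices.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
r = 0.5 * x[:, 0] ** 2 + x[:, 0] + 2 + rng.normal(scale=0.3, size=100)

# Expand the feature with its powers, then fit an ordinary linear model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, r)
print(model.predict([[1.0]]))    # close to 0.5 + 1 + 2 = 3.5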
Model Selection & Generalization
 The data alone are not sufficient to find a unique solution
 The need for inductive bias: assumptions about H
 Generalization: how well a model performs on new data
 Overfitting: H is more complex than C or f
 Underfitting: H is less complex than C or f
Triple Trade-Off

 There is a trade-off between three factors (Dietterich, 2003):
 Complexity of the hypothesis class, c(H)
 Training set size, N
 Generalization error, E, on new data
 As N increases, E decreases
 As c(H) increases, E first decreases and then increases
Cross-Validation: General Practice
 To estimate the generalization error, we need data unseen during training. We split the data into
 Training set (50%)
 Validation set (25%): used to select the best model
 Test (publication) set (25%): the best model is evaluated on this set
 Fit candidate hypotheses on the training set, then select the one that fits most accurately on the validation set
 Use resampling when there is little data
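Programming note (illustrative): one way to obtain the 50% / 25% / 25% split with scikit-learn; the data here are hypothetical placeholders.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(400, dtype=float).reshape(200, 2)
y = np.arange(200)

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit each candidate model on (X_train, y_train), pick the one with the best
# score on (X_val, y_val), and report its error once on (X_test, y_test).
print(len(X_train), len(X_val), len(X_test))   # 100 50 50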
