

Machine Learning Basics:


Supervised Learning Algorithms
Sargur N. Srihari
[email protected]

This is part of lecture slides on Deep Learning:


http://www.cedar.buffalo.edu/~srihari/CSE676

What is Supervised Learning?


• Refers to learning algorithms that learn to
associate some input with some output given a
training set of inputs x and outputs y
• Outputs may be collected automatically or
provided by a human supervisor


Probabilistic Supervised Learning


• Most supervised learning algorithms are based
on estimating a probability distribution p(y|x)
• We can do this by using MLE to find the best
parameter vector θ for a parametric family of
distributions p(y|x;θ)
• Linear regression corresponds to the family
p(y|x;θ)=N(y|θTx,I)
• We can generalize linear regression to
classification by using a different family of
probability distributions
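As a minimal sketch of this idea (not from the slides): under p(y|x;θ)=N(y|θTx,I), maximizing the log-likelihood in θ reduces to ordinary least squares, so the MLE has a closed form. The data and names below are illustrative.

```python
import numpy as np

# Minimal sketch: MLE for p(y|x; theta) = N(y | theta^T x, I) is least squares,
# so theta can be recovered with the normal equations. Data is synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # 100 examples, 3 features
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(size=100)  # Gaussian noise with identity covariance

theta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)  # solves min ||X theta - y||^2
print(theta_mle)                           # close to true_theta
```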
Probabilistic Supervised Classification
• If we only have two classes we only need to
specify the distribution for one of these classes
– The probability of the other class is known
– For linear regression, the normal distribution
p(y|x;θ)=N(y|θTx,I) was parameterized by its
mean θTx; any value we supply for this mean is valid
– For classification, a distribution over a binary
variable is more complex: its mean must lie
between 0 and 1
• Solved using the logistic sigmoid to squash the
output of a linear function into the interval (0,1):
p(y=1|x;θ)=σ(θTx)
• Known as logistic regression
– Linear regression has a closed-form solution
– But logistic regression has no closed-form solution
• Its negative log-likelihood is minimized with gradient descent
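A minimal sketch of that gradient-descent loop, with synthetic data and illustrative hyperparameters:

```python
import numpy as np

# Minimal sketch of logistic regression trained by gradient descent on the
# negative log-likelihood; dataset and learning rate are illustrative.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # synthetic binary labels

theta = np.zeros(2)
learning_rate = 0.1
for _ in range(1000):
    p = sigmoid(X @ theta)                  # p(y=1|x;theta) = sigma(theta^T x)
    grad = X.T @ (p - y) / len(y)           # gradient of the mean negative log-likelihood
    theta -= learning_rate * grad
print(theta)                                # separating direction roughly along [1, 1]
```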

Strategy for Supervised Learning


• Same strategy can be applied to any
supervised learning problem
– Write down a parametric family of conditional
probability distributions over the right kind of
input and output variables


Support Vector Machines


• An influential approach to supervised learning
• Model is similar to logistic regression in that it is
driven by a linear function wTx+b
– Unlike logistic regression, SVM does not provide
probabilities, but only outputs class identity
• SVM predicts positive class when wTx+b>0
• SVM predicts negative class when wTx+b<0
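A sketch of just this decision rule; w and b stand in for parameters learned by the optimization described later.

```python
import numpy as np

# Sketch of the SVM decision rule: class identity only, no probabilities.
# w and b are placeholders for learned parameters.
def svm_predict(X, w, b):
    return np.where(X @ w + b > 0, +1, -1)
```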


Kernel trick
• A key innovation associated with SVM
• Many ML algorithms can be written as dot
products between examples
– Ex: linear function used by SVM can be rewritten as
$$f(x) = w^T x + b = b + \sum_{i=1}^{m} \alpha_i\, x^T x^{(i)}$$

• where x^(i) is a training example and α is a vector of coefficients


– Replace x by a feature function ϕ(x) and the dot
product with a kernel function k(x,x^(i)) = ϕ(x)·ϕ(x^(i))
• The · operator represents the inner product ϕ(x)^T ϕ(x^(i))
• We may not literally use the inner product
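A quick numerical check of the rewrite above, assuming w can be expressed as a linear combination of training examples (all values below are synthetic):

```python
import numpy as np

# Check: if w = sum_i alpha_i x^(i), then w^T x + b equals
# b + sum_i alpha_i x^T x^(i). All values are synthetic.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 3))   # five training examples x^(i)
alpha = rng.normal(size=5)          # coefficients alpha_i
b = 0.3
w = X_train.T @ alpha               # w as a combination of training examples

x = rng.normal(size=3)
primal = w @ x + b
dual = b + alpha @ (X_train @ x)    # b + sum_i alpha_i <x, x^(i)>
assert np.isclose(primal, dual)
```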

SVM Optimization Problem


Equation of the plane: g(x) = wᵀx + w₀. Distance from x to the plane: r = g(x)/||w||.
Support vectors are the training vectors that lie closest to the separating plane.
(In the augmented notation below, a denotes the weight vector and y_k the training patterns.)

For each input vector, let z_k = ±1 depending on whether input k is in class C1 or C2.
Thus if g(y) = 0 is a separating hyperplane then z_k g(y_k) > 0, k = 1,…,n.

Since the distance of a point y to the hyperplane g(y) = 0 is g(y)/||a||, we can require
that all points be at least distance b from it:

$$\frac{z_k\, g(y_k)}{\lVert a \rVert} \ge b, \qquad k = 1,\ldots,n$$

The goal is to find the weight vector a that satisfies this while maximizing b. To ensure
uniqueness we impose the constraint b·||a|| = 1, or b = 1/||a||, which implies that ||a||²
is to be minimized. Support vectors are the training vectors for which equality holds.

This is a quadratic optimization problem: minimize a quadratic subject to linear
inequality constraints:

$$\arg\min_{a}\ \tfrac{1}{2}\lVert a \rVert^2 \quad \text{subject to} \quad z_k\, a^T y_k \ge 1,\ \ k = 1,\ldots,n$$

It can be cast as an unconstrained problem by introducing Lagrange multipliers, one
multiplier α_k per constraint. The Lagrange function is

$$L(a,\alpha) = \tfrac{1}{2}\lVert a \rVert^2 - \sum_{k=1}^{n} \alpha_k \left[ z_k\, a^T y_k - 1 \right]$$

Dual problem:

$$L(\alpha) = \sum_{k=1}^{n} \alpha_k - \tfrac{1}{2} \sum_{k=1}^{n} \sum_{j=1}^{n} \alpha_k \alpha_j z_k z_j\, k(y_j, y_k)$$

subject to the constraints

$$\sum_{k=1}^{n} \alpha_k z_k = 0, \qquad \alpha_k \ge 0,\ \ k = 1,\ldots,n$$

where the kernel is defined as k(y_j, y_k) = y_jᵀ y_k = φ(x_j)ᵀ φ(x_k).
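A small sketch solving this dual with a generic constrained optimizer (SciPy's SLSQP) on a toy separable dataset; a dedicated QP solver would be used in practice, and the data, kernel choice (linear), and tolerance are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: hard-margin SVM dual solved with a generic constrained optimizer.
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.0], [-1.5, -2.0]])
z = np.array([1.0, 1.0, -1.0, -1.0])   # labels z_k in {+1, -1}
K = X @ X.T                            # linear kernel matrix k(y_j, y_k)

def neg_dual(alpha):                   # negate L(alpha) so we can minimize
    return 0.5 * alpha @ ((np.outer(z, z) * K) @ alpha) - alpha.sum()

constraint = {"type": "eq", "fun": lambda a: a @ z}   # sum_k alpha_k z_k = 0
res = minimize(neg_dual, np.zeros(4), bounds=[(0, None)] * 4,
               constraints=constraint)

alpha = res.x
w = (alpha * z) @ X                    # recover w = sum_k alpha_k z_k y_k
sv = alpha > 1e-6                      # support vectors have alpha_k > 0
b = np.mean(z[sv] - X[sv] @ w)         # from z_k (w^T y_k + b) = 1 at the margin
print(alpha, w, b)
```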

Prediction using SVM


• After replacing dot products with kernel
evaluations, we can make predictions using
$$f(x) = b + \sum_i \alpha_i\, k\left(x, x^{(i)}\right)$$

• predicting the positive class when f(x) > 0

– The function is nonlinear in x but the relationship
between f (x) and ϕ(x) is linear
• Also the relationship between α and f (x) is linear
• The kernel-based function is exactly equivalent
to preprocessing the data by applying ϕ(x) to all
inputs, then learning a linear model in the new
transformed space
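A sketch of this prediction function; alpha, b, and the training set are assumed to come from an already-trained model, and the kernel used here is the Gaussian kernel discussed below.

```python
import numpy as np

# Sketch of kernel-based prediction f(x) = b + sum_i alpha_i k(x, x^(i)).
def gaussian_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def svm_decision(x, X_train, alpha, b, kernel=gaussian_kernel):
    # b plus the kernel evaluations against every training example
    f = b + sum(a * kernel(x, xi) for a, xi in zip(alpha, X_train))
    return +1 if f > 0 else -1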

Efficacy & Efficiency of kernel


1. Kernel trick allows us to learn models that are
nonlinear as a function of x using convex
optimization guaranteed to converge efficiently
– Possible because ϕ is fixed and we optimize only α
2. Kernel k can be implemented more efficiently than
constructing ϕ vectors and taking their dot product
– k(x,x’) tractable even when ϕ(x) is intractable


Gaussian Kernel
• Most commonly used kernel is k(u,v) = N(u−v; 0, σ²I)
– Called radial basis: decreases along lines radiating
from u
• To see that it is a valid kernel
– Consider k(u,v) = exp(−||u−v||²/2σ²)
• By expanding the square: ||u−v||² = uᵀu + vᵀv − 2uᵀv
• we get k(u,v) = exp(−uᵀu/2σ²)·exp(uᵀv/σ²)·exp(−vᵀv/2σ²)
– Validity follows from kernel construction rules
• If k1(x,x’) is a valid kernel, so are
• k(x,x’)=f(x)k1(x,x’)f(x’) and k(x,x’)=exp(k1(x,x’))
• together with validity of linear kernel k(u,v)=uTv
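A numerical check of this factorization with arbitrary vectors (values are illustrative):

```python
import numpy as np

# Check: exp(-||u-v||^2/2s^2) = exp(-u.u/2s^2) * exp(u.v/s^2) * exp(-v.v/2s^2)
rng = np.random.default_rng(0)
u, v = rng.normal(size=3), rng.normal(size=3)
sigma = 1.5

lhs = np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))
rhs = (np.exp(-u @ u / (2 * sigma ** 2))
       * np.exp(u @ v / sigma ** 2)
       * np.exp(-v @ v / (2 * sigma ** 2)))
assert np.isclose(lhs, rhs)
```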

Intuition of Gaussian kernel


• It performs a kind of template matching
• When a test point x’ is near a template x its
response is high, putting a large weight on the
associated training label y
• Overall, the prediction combines many such
training labels weighted by the similarity of the
corresponding training examples
• SVM is not the only algorithm enhanced by the
kernel trick
– Methods employing the kernel trick are known as
kernel methods

Disadvantages of Kernel Methods


• Cost of decision function evaluation: linear in m
– Because the ith example contributes a term
αik(x,x(i)) to the decision function
– Can mitigate this by learning an α with mostly zeros
• Classification requires evaluating the kernel function only
for training examples that have a nonzero αi
• These are known as support vectors
• Cost of training: high with large data sets
• Generic kernels struggle to generalize well
– Neural network outperformed RBF kernel SVM on
the MNIST benchmark

Other simple supervised learning


• K-nearest neighbor
• Decision trees


K-Nearest Neighbors
• A family of techniques that can be used for
regression or classification
• As a nonparametric learning algorithm:
– it is not restricted to a fixed number of parameters
• A simple function of the training data
– When we want to produce output y for input x, we
find the k nearest neighbors to x in the training data X.
We then return the average of the corresponding y
values in the training set
• Works for any kind of supervised learning where we can
define an average over y values
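A minimal sketch of this procedure for the regression case (a classification variant follows on the next slide); names are illustrative.

```python
import numpy as np

# k-NN regression: average the y values of the k nearest training examples.
def knn_predict(x, X_train, y_train, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every example
    nearest = np.argsort(dists)[:k]              # indices of the k nearest
    return y_train[nearest].mean()
```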

K-Nearest Neighbor Classification


• We can average over one-hot vectors c with
c_y = 1 and c_i = 0 for all other values of i
– Can interpret the average values as giving a
probability distribution over classes
• K-nn classification has high capacity
– High accuracy given a large training set
• 1-nearest-neighbor error rate approaches at most twice the Bayes
error rate as the training set size grows, assuming 0-1 loss
– Error in excess of the Bayes rate arises from choosing a neighbor by
breaking ties between equally distant neighbors randomly
– If we let all equally distant neighbors vote, we approach the Bayes rate

• A weakness: it cannot learn that one feature is more
discriminative than another
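A sketch of the one-hot-averaging classifier described above; it assumes integer class labels, and the names are illustrative.

```python
import numpy as np

# k-NN classification by averaging one-hot vectors: the average over the
# k nearest neighbors is read as a distribution over classes.
def knn_classify(x, X_train, y_train, n_classes, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    onehot = np.eye(n_classes)[y_train[nearest]]  # rows c with c_y = 1
    probs = onehot.mean(axis=0)                   # estimated class probabilities
    return probs.argmax(), probs
```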

Decision Tree
• A learning algorithm that breaks the input
space into regions and has separate
parameters for each region
• Each node is associated with a region
of input space
• Internal nodes break that region into one
subregion for each child


How a decision tree works

Each node of the tree chooses to send the input example to the
child node on the left (0) or the child node on the right (1).
Internal nodes are circles; leaf nodes are squares.

The tree divides space into regions. A 2-D plane shows how a
decision tree might divide R². The nodes are plotted in this plane,
with each internal node drawn along with the dividing line it uses
to categorize examples, and leaf nodes drawn in the center of the
region of examples they receive. The result is a piecewise-constant
function, with one piece per leaf.
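A minimal sketch of such a piecewise-constant partition, using scikit-learn's DecisionTreeClassifier on XOR-like data; the dataset and depth are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# A depth-2 tree carves R^2 into axis-aligned regions, predicting a
# constant per leaf.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)   # XOR-like regions

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[0.5, -0.5], [0.5, 0.5]]))    # constant within each region
```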
