ARIN7015/MATH6015: Topics in Artificial Intelligence
and Machine Learning
Multi-Class Classification and Its Optimization
Yunwen Lei
Department of Mathematics, The University of Hong Kong
March 20, 2025
Outline
1 Multiclass Predictors
2 Learning Theory for Multi-class Classification
3 Frank-Wolfe Algorithm
4 Optimization for Multi-class Classification
Multi-class Classification
(image, dog), (image, car), (image, airplane), . . .
Formally, z1 = (x1, y1), . . . , zn = (xn, yn) ∈ X × Y are drawn i.i.d. from a distribution ρ
▶ Y := {1, 2, . . . , c}
▶ c = number of classes
Reduction to Binary Classification
One-vs-One
Train c(c − 1)/2 binary classifiers, one for each pair of distinct classes
The final prediction is made by majority vote over these pairwise classifiers
Reduction to Binary Classification
One-vs-Rest
Train c binary classifiers, one for each class
Train j-th classifier to distinguish class j from the rest
Suppose h1, . . . , hc : X ↦ R are our binary classifiers
The final prediction is
h(x) = arg max_{j∈[c]} hj(x)
Tie breaking: If two distinct i, j ∈ [c] achieve the maximum (i.e., wi⊤ x = wj⊤ x), we break the tie using some consistent policy, e.g., predicting the smaller of i and j. A code sketch of this prediction rule follows below.
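A minimal Python/numpy sketch of the one-vs-rest prediction rule with the tie-breaking policy above; the weight matrix and input are hypothetical toy values.

import numpy as np

def ovr_predict(W, x):
    """One-vs-rest prediction: pick the class whose score w_j^T x is largest.

    W : array of shape (c, d), one weight vector per class (toy values here).
    x : array of shape (d,).
    np.argmax returns the smallest index among the maximizers, which matches
    the policy of breaking ties toward the smaller label.
    """
    scores = W @ x                 # vector of scores (w_1^T x, ..., w_c^T x)
    return int(np.argmax(scores))  # predicted label in {0, ..., c-1}

# Toy usage with c = 3 classes in d = 2 dimensions.
W = np.array([[-1.0, 1.0], [0.0, 1.0], [1.0, 1.0]]) / np.sqrt(2)
x = np.array([0.5, 1.0])
print(ovr_predict(W, x))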
Recap: Linear Binary Classifier
Y = {−1, +1}, X = R^d
Linear classifier score function h(x) = w⊤ x
Final prediction: sign(h(x))
h(x) = w⊤ x = ∥w∥2 ∥x∥2 cos(θ)
h(x) > 0 ⇐⇒ cos θ > 0 =⇒ θ ∈ (−90°, 90°)
h(x) < 0 ⇐⇒ cos θ < 0 =⇒ θ ∉ (−90°, 90°)
Three Class Example
Base hypothesis space
H = {x ↦ w⊤ x : w ∈ R^2}
Note: separating boundary always contains the origin
Three Class Example: One-vs-Rest
Class 1 vs Rest: h1 (x) = w1⊤ x
Three Class Example: One-vs-Rest
Class 2 vs Rest
▶ Predicts everything to be “Not 2”
▶ It misclassifies 6 examples (all examples in class “2”)
▶ If it predicted some “2”, then it would get many more “Not 2” incorrect
Score for class j is
hj (x) = wj⊤ x = ∥wj ∥2 ∥x∥ cos θj ,
where θj is the angle between x and wj .
One-vs-Rest: Class Boundaries
For simplicity, we assume ∥w1 ∥2 = ∥w2 ∥2 = ∥w3 ∥2
Then wj⊤ x = ∥wj ∥2 ∥x∥2 cos θj
▶ Only θj matters for comparison
▶ x is classified by whichever has largest cos θj (i.e., θj closest to 0)
One-vs-Rest: Class Boundaries
This approach doesn’t work well in this instance!
How can we fix this?
The Linear Multiclass Hypothesis Space
Base hypothesis space
H̃ = {x ↦ w⊤ x : w ∈ R^d}
Linear multiclass hypothesis space
H = {x ↦ arg max_{j∈[c]} hj(x) : h1, . . . , hc ∈ H̃}
A learning algorithm chooses a hypothesis from the hypothesis space
Is this a failure of the hypothesis space or of the learning algorithm?
A Solution with Linear Functions
This works... so the problem is not with the hypothesis space
How can we get a solution like this?
Multiclass Predictors
Multiclass Hypothesis Space
Idea: use one function h(x, y ) to give a compatibility score between input x and output y
Final prediction is the y ∈ Y that is “most compatible” with x:
x ← arg max_{y∈Y} h(x, y).
Given class-specific score functions h1, . . . , hc, we could define the compatibility function as
h(x, j) = hj(x), j ∈ [c].
Base Hypothesis Space: H = {h : X × Y ↦ R}
Multiclass Hypothesis Space
{x ↦ arg max_{y∈Y} h(x, y) : h ∈ H}
Want compatibility h(x, y ) to be large when x has label y , small otherwise!
Multi-class Classification: Example
Points x1 , x2 and x3 will be classified as label 1, 2 and 3, respectively.
What do the three rays stand for?
Linear Multiclass Prediction Function
A linear compatibility score function is given by
h(x, y) = ⟨w, Ψ(x, y)⟩,
where Ψ : X × Y ↦ R^{d′} is a compatibility feature map
Ψ(x, y ) extracts features relevant to how compatible y is with x
Final compatibility score is a linear function of Ψ(x, y )
Linear Multiclass Hypothesis Space
{x ↦ arg max_{y∈Y} ⟨w, Ψ(x, y)⟩ : w ∈ R^{d′}}
Example: X = R^2, Y = {1, 2, 3}
w1 = (−1/√2, 1/√2)⊤, w2 = (0, 1)⊤, w3 = (1/√2, 1/√2)⊤
Prediction function
x = (x1, x2)⊤ ← arg max_{j∈{1,2,3}} ⟨wj, x⟩
How can we get this into the form x ← arg max_{y∈Y} ⟨w, Ψ(x, y)⟩?
The Multivector Construction
What if we stack the wj's together:
w = (w1, w2, w3) = ((−1/√2, 1/√2)⊤, (0, 1)⊤, (1/√2, 1/√2)⊤)
We define the feature map Ψ : R^2 × {1, 2, 3} ↦ R^{2×3} as
Ψ(x, 1) = ((x1, x2)⊤, (0, 0)⊤, (0, 0)⊤),
Ψ(x, 2) = ((0, 0)⊤, (x1, x2)⊤, (0, 0)⊤),
Ψ(x, 3) = ((0, 0)⊤, (0, 0)⊤, (x1, x2)⊤)
Then we get the wanted equation ⟨w, Ψ(x, y)⟩ = ⟨wy, x⟩
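A short Python/numpy check of the multivector construction (a sketch; the columns of w are the w1, w2, w3 from the example above): placing x in the y-th column of Ψ(x, y) makes the matrix inner product ⟨w, Ψ(x, y)⟩ equal to ⟨wy, x⟩.

import numpy as np

d, c = 2, 3
# Columns are w1, w2, w3 from the example above.
w = np.array([[-1 / np.sqrt(2), 0.0, 1 / np.sqrt(2)],
              [ 1 / np.sqrt(2), 1.0, 1 / np.sqrt(2)]])   # shape (d, c)

def Psi(x, y):
    """Multivector feature map: put x in column y (0-indexed), zeros elsewhere."""
    out = np.zeros((d, c))
    out[:, y] = x
    return out

x = np.array([0.3, 0.8])
for y in range(c):
    lhs = np.sum(w * Psi(x, y))   # matrix inner product <w, Psi(x, y)>
    rhs = w[:, y] @ x             # <w_y, x>
    assert np.isclose(lhs, rhs)
print("<w, Psi(x, y)> = <w_y, x> for every y")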
Linear Multiclass Prediction Function: Simplification
In our lecture, we always consider the following compatibility feature map
Ψ(x, j) = (0, . . . , 0, x, 0, . . . , 0) ∈ R^{d×c}, where x sits in the j-th column
We stack all wj into a matrix
w = (w1, . . . , wc) ∈ R^{d×c}.
Then it is clear that
⟨w, Ψ(x, j)⟩ = wj⊤ x!
Linearly Separable Dataset
A training dataset S is linearly separable if there exist w1, . . . , wc that correctly classify all the points in S, i.e.,
for every point x ∈ S with label y, we have wy⊤ x > wj⊤ x for all j ≠ y
Recap: Binary Classification
Margin of a hyperplane for a dataset: the amount by which the boundary could be widened without hitting a data point
How to define the margin for multi-class classification?
Multi-class Margin
Recap: we choose the class label with the largest score function
ŷ := arg max_{j∈Y} h(x, j)
The prediction is correct if the true class label receives the largest score!
Multi-class Margin
Aim: we want h(xi, yi) to be larger than all other h(xi, j) for j ≠ yi, i.e., a large margin
margin(h, z) = h(x, y) − max_{j:j≠y} h(x, j),
i.e., the score of the true label minus the highest score among the incorrect labels
Multi-class Margin
Defined as the difference between the score of the correct label and the highest score among the incorrect labels
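A one-line Python/numpy version of this margin (a sketch; the score vector and label below are hypothetical).

import numpy as np

def multiclass_margin(scores, y):
    """margin(h, z) = h(x, y) - max_{j != y} h(x, j), given the score vector h(x, .)."""
    others = np.delete(scores, y)      # scores of the incorrect labels
    return scores[y] - np.max(others)

scores = np.array([2.0, 0.5, 1.2])     # hypothetical scores (w_j^T x)_{j in [c]}
print(multiclass_margin(scores, y=0))  # 0.8: the correct label wins by this amount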
Margin-based Loss
Ideal property of loss function
Loss should be a decreasing function of the margin
The model w predicts a vector in R^c: (w1⊤ x, . . . , wc⊤ x)
We need a map from R^c to R_+
We consider the margin-based loss of the form
f(w; z) = ℓy(w1⊤ x, w2⊤ x, . . . , wc⊤ x),
where ℓy : R^c ↦ R.
hinge loss: ℓy(t1, . . . , tc) = max{0, 1 − margin} = max{0, 1 + max_{j:j≠y}(tj − ty)}, where margin := ty − max_{j:j≠y} tj
f(w; z) = max{0, 1 + max_{j:j≠y} ⟨wj − wy, x⟩} = ℓy(w1⊤ x, . . . , wc⊤ x)    (1)
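A minimal Python/numpy sketch of the multiclass hinge loss (1); the score vector below is a hypothetical example.

import numpy as np

def multiclass_hinge(scores, y):
    """Hinge loss max{0, 1 + max_{j != y}(t_j - t_y)} for the score vector t."""
    others = np.delete(scores, y)
    return max(0.0, 1.0 + np.max(others) - scores[y])

scores = np.array([2.0, 0.5, 1.2])     # hypothetical (w_j^T x)_j
print(multiclass_hinge(scores, y=0))   # 0.2: the margin is 0.8 < 1, so a small loss remains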
Soft-Max Loss
Note that the margin of t is ty − max_{j≠y} tj
Max is not differentiable. Replace it with the soft-max:
max_{j:j≠y}(tj − ty) ≤ log Σ_{j:j≠y} exp(tj − ty) ≤ max_{j:j≠y}(tj − ty) + log c,
where max_{j:j≠y}(tj − ty) = −margin and the middle term is the soft-max
Want to maximize the margin: x ↦ log(1 + exp(−x)) is a decreasing function
log(1 + exp(−margin)) ≈ log(1 + exp(log Σ_{j:j≠y} exp(tj − ty))) = log(1 + Σ_{j:j≠y} exp(tj − ty)),
where the soft-max replaces −margin
With ℓy(t1, . . . , tc) = log(1 + Σ_{j:j≠y} exp(tj − ty)) we know
f(w; z) = log(1 + Σ_{j:j≠y} exp(wj⊤ x − wy⊤ x)) = ℓy(w1⊤ x, . . . , wc⊤ x).
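A Python/numpy sketch of this soft-max loss (the score vector is a hypothetical example); it equals the usual cross-entropy −log softmax(t)_y.

import numpy as np

def softmax_loss(scores, y):
    """Soft-max loss log(1 + sum_{j != y} exp(t_j - t_y)) for the score vector t."""
    diffs = np.delete(scores, y) - scores[y]
    return np.log1p(np.sum(np.exp(diffs)))

scores = np.array([2.0, 0.5, 1.2])   # hypothetical (w_j^T x)_j
print(softmax_loss(scores, y=0))     # a smooth surrogate of the hinge loss above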
Regularization Schemes for Multi-class Classification
We train a model by minimizing the following objective w.r.t. w = (w1, . . . , wc), wj ∈ R^d
min_{w∈R^{d×c}} (1/n) Σ_{i=1}^n ℓyi(w1⊤ xi, . . . , wc⊤ xi)  s.t.  w ∈ W = {w ∈ R^{d×c} : Ω(w) ≤ 1},
where ℓyi(w1⊤ xi, . . . , wc⊤ xi) = f(w; zi) and Ω : R^{d×c} ↦ R_+ is a regularizer.
Examples: Ω(w) = (λ/2) ∥w∥_{2,p}², where
∥w∥_{2,p} := (Σ_{j=1}^c ∥wj∥2^p)^{1/p},  p ≥ 1
p = 1: ∥w∥_{2,1} := Σ_{j=1}^c ∥wj∥2
p = 2: ∥w∥_{2,2} := (Σ_{j=1}^c ∥wj∥2²)^{1/2}
p = ∞: ∥w∥_{2,∞} := max_{j∈[c]} ∥wj∥2
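A small Python/numpy sketch of the group norm ∥w∥_{2,p} for p = 1, 2, ∞ (the model matrix below is random, purely for illustration).

import numpy as np

def group_norm(w, p):
    """Compute ||w||_{2,p}: the l_p norm of the vector of column norms ||w_j||_2.

    w : array of shape (d, c) whose columns are w_1, ..., w_c.
    p : 1, 2, or np.inf.
    """
    col_norms = np.linalg.norm(w, axis=0)    # (||w_1||_2, ..., ||w_c||_2)
    return np.linalg.norm(col_norms, ord=p)

w = np.random.randn(5, 3)                    # hypothetical model with d = 5, c = 3
print(group_norm(w, 1), group_norm(w, 2), group_norm(w, np.inf))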
Learning Theory for Multi-class Classification
Rademacher Complexity for Multi-class Classification
The loss function class becomes
F = {z ↦ ℓy(w1⊤ x, . . . , wc⊤ x) : Ω(w) ≤ 1}
How to estimate the Rademacher complexity
E_σ sup_{w∈W} (1/n) Σ_{i=1}^n σi ℓyi(w1⊤ xi, . . . , wc⊤ xi)?
The difficulty lies in the nonlinearity of ℓy, which takes a vector in R^c as its input!
Lipschitz Continuity of ℓy
We say ℓy is G-Lipschitz continuous w.r.t. a norm ∥ · ∥ on R^c if
|ℓy(a1, . . . , ac) − ℓy(a1′, . . . , ac′)| ≤ G ∥(a1, . . . , ac) − (a1′, . . . , ac′)∥.
This extends the usual Lipschitz continuity of functions from R to R
We are interested in ∥ · ∥2 and ∥ · ∥∞
Lemma. Let a1, . . . , ac and b1, . . . , bc ∈ R. Then
|max{a1, . . . , ac} − max{b1, . . . , bc}| ≤ max_{j∈[c]} |aj − bj|.
Proof. Without loss of generality, assume ak ≥ bk′, where
ak = max{a1, . . . , ac}, bk′ = max{b1, . . . , bc}.
Then
max{a1, . . . , ac} − max{b1, . . . , bc} = ak − bk′ ≤ ak − bk ≤ max_{j∈[c]} |aj − bj|.
Lipschitz Continuity of Hinge Loss
Example
The hinge loss ℓy(t1, . . . , tc) = max{0, 1 + max_{j:j≠y}(tj − ty)} is 2-Lipschitz continuous w.r.t. ∥ · ∥∞.
Proof. We just showed
|max{a1, . . . , ac} − max{b1, . . . , bc}| ≤ max_{j∈[c]} |aj − bj|.
Let t = (t1, . . . , tc) and t′ = (t1′, . . . , tc′). Then
|ℓy(t) − ℓy(t′)| = |max{0, 1 + max_{j:j≠y}(tj − ty)} − max{0, 1 + max_{j:j≠y}(tj′ − ty′)}|
≤ |(1 + max_{j:j≠y}(tj − ty)) − (1 + max_{j:j≠y}(tj′ − ty′))|
≤ max_{j:j≠y} |(tj − ty) − (tj′ − ty′)| ≤ 2∥t − t′∥∞.
Lipschitz Continuity of the Soft-max Loss
Example
The soft-max loss ℓy(t1, . . . , tc) = log(1 + Σ_{j:j≠y} exp(tj − ty)) is 2-Lipschitz continuous w.r.t. ∥ · ∥∞.
Proof. The function t ↦ g(t) := log Σ_{j∈[c]} exp(tj) is 1-Lipschitz continuous w.r.t. ∥ · ∥∞:
∇g(t) = (Σ_{j∈[c]} exp(tj))^{-1} (exp(t1), . . . , exp(tc))⊤ =⇒ ∥∇g(t)∥1 ≤ 1.
By a Taylor expansion, for some α ∈ [0, 1],
g(t) − g(t′) = ⟨t − t′, ∇g(αt + (1 − α)t′)⟩
=⇒ |g(t) − g(t′)| ≤ ∥t − t′∥∞ ∥∇g(αt + (1 − α)t′)∥1 ≤ ∥t − t′∥∞.
Since ℓy(t) = g(t1 − ty, . . . , tc − ty) and |(tj − ty) − (tj′ − ty′)| ≤ 2∥t − t′∥∞ for every j, the soft-max loss is 2-Lipschitz w.r.t. ∥ · ∥∞.
Vector contraction of Rademacher complexity
Vector contraction of Rademacher complexity
Let F be a class of functions from Z to R^c. For each i ∈ [n], let σi : R^c ↦ R be L-Lipschitz continuous w.r.t. ∥ · ∥2. Then
E sup_{f∈F} Σ_{i∈[n]} ϵi σi(f(zi)) ≤ √2 L E sup_{f∈F} Σ_{i∈[n]} Σ_{k∈[c]} ϵ_{i,k} fk(zi),
where ϵ_{i,k} are i.i.d. Rademacher variables, and fk(zi) is the k-th component of f(zi).
It extends Talagrand's contraction lemma to vector-valued functions
It removes the nonlinear σ, which is the loss function in multi-class classification!
Maurer, A., 2016. A vector-contraction inequality for Rademacher complexities. In Algorithmic Learning
Theory.
Rademacher Complexity Bound (Optional)
Let ℓy be either the hinge loss or the soft-max loss. Then
|ℓy(t) − ℓy(t′)| ≤ 2 max_{j∈[c]} |tj − tj′| = 2∥t − t′∥∞ ≤ 2 (Σ_{j=1}^c (tj − tj′)²)^{1/2} = 2∥t − t′∥2.
Then the vector contraction of Rademacher complexity implies
E sup_{w∈W} Σ_{i∈[n]} ϵi ℓyi(w1⊤ xi, . . . , wc⊤ xi) ≤ 2√2 E sup_{w∈W} Σ_{i∈[n]} Σ_{j∈[c]} ϵij wj⊤ xi.
Let W = {w ∈ R^{d×c} : ∥w∥_{2,2} ≤ 1}. Then
E sup_{w∈W} Σ_{i∈[n]} Σ_{j∈[c]} ϵij wj⊤ xi = E sup_{w∈W} Σ_{j∈[c]} wj⊤ (Σ_{i∈[n]} ϵij xi)
≤ E sup_{w∈W} Σ_{j∈[c]} ∥wj∥2 ∥Σ_{i∈[n]} ϵij xi∥2
≤ E sup_{w∈W} (Σ_{j∈[c]} ∥wj∥2²)^{1/2} (Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}
≤ E[(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}].
Rademacher Complexity Bound (Optional)
We just showed
E sup_{w∈W} Σ_{i∈[n]} ϵi ℓyi(w1⊤ xi, . . . , wc⊤ xi) ≤ √2 L E[(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}].
By the concavity of x ↦ √x, we know
E[(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}] ≤ (Σ_{j=1}^c E∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}.
Furthermore, there holds
E∥Σ_{i∈[n]} ϵij xi∥2² = E[⟨Σ_{i∈[n]} ϵij xi, Σ_{i′∈[n]} ϵi′j xi′⟩] = Σ_{i∈[n]} E[ϵij²] xi⊤ xi + Σ_{i≠i′} E[ϵij ϵi′j] xi⊤ xi′,
where the second sum is 0.
=⇒ E[(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}] ≤ (c · Σ_{i∈[n]} E∥xi∥2²)^{1/2}.
Rademacher Complexity Bound
The Rademacher complexity of F is bounded by
R_S({z ↦ ℓy(w1⊤ x, . . . , wc⊤ x) : ∥w∥_{2,2} ≤ 1}) ≤ (2√2/n) (c · Σ_{i∈[n]} E∥xi∥2²)^{1/2}.
Generalization bound
Let A be ERM. Then with probability at least 1 − δ
F(A(S)) − F(w∗) ≲ n^{−1/2} log^{1/2}(2/δ) + √c/√n.
This shows a square-root dependency of the generalization bound on the number of classes!
Frank-Wolfe Algorithm
Training Multiclass Model
Consider the following minimization problem (p ≥ 1)
min_w (1/n) Σ_{i=1}^n ℓyi((wj⊤ xi)_{j∈[c]})  s.t.  Σ_{j∈[c]} ∥wj∥2^p ≤ B^p.
It is a constrained optimization problem
Can be solved by projected gradient descent, which, however, requires a projection step per iteration
The projection can be expensive
Can we develop a projection-free method?
Yes, the Frank-Wolfe Algorithm!
Frank-Wolfe Method
Objective: min_{w∈W} FS(w), where W is convex.
Recap: projected gradient descent (PGD)
Recall that PGD wt+1 = Proj_W(wt − η∇FS(wt)) uses a quadratic approximation
wt+1 = arg min_{w∈W} {FS(wt) + ⟨w − wt, ∇FS(wt)⟩ + (1/(2η)) ∥w − wt∥2²}.
In some cases, the projection may be hard to compute (or even approximate)
For these problems, we can solve the simplified problem
arg min_{w∈W} {FS(wt) + ⟨w − wt, ∇FS(wt)⟩},
which optimizes a linear approximation to the function over the constraint set
This requires the set W to be bounded, otherwise there may be no solution
Frank-Wolfe Method
Frank-Wolfe Method
1: for t = 0, 1, . . . , T do
2: Set vt = arg min_{v∈W} ⟨∇FS(wt), v⟩
3: Set wt+1 = wt + γt (vt − wt ), where γt ∈ [0, 1]
wt+1 is feasible since it is a convex combination of two other feasible points
wt+1 = (1 − γt )wt + γt vt
A choice of γt is γt = 2/(t + 2)
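A generic Python/numpy sketch of the Frank-Wolfe loop above; the gradient and the linear minimization oracle are user-supplied callables here (hypothetical names, not from the slides).

import numpy as np

def frank_wolfe(grad, lmo, w0, T):
    """Projection-free Frank-Wolfe method.

    grad : function w -> gradient of F_S at w.
    lmo  : function g -> argmin_{v in W} <g, v> (linear minimization oracle over W).
    w0   : feasible starting point.
    T    : number of iterations.
    """
    w = np.array(w0, dtype=float)
    for t in range(T):
        v = lmo(grad(w))                  # step 2: solve the linear subproblem
        gamma = 2.0 / (t + 2.0)           # step size gamma_t = 2 / (t + 2)
        w = (1 - gamma) * w + gamma * v   # step 3: convex combination stays feasible
    return w

The worked example and the L1-ball oracle on the following slides can be plugged in as grad and lmo.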
Linear Minimization Oracle
Let W be a convex, closed and bounded set. The linear minimization oracle of W (LMO_W) returns a vector
LMO_W(g) = arg min_{v∈W} g⊤ v.
LMO_W returns an extreme point of W
LMO_W is projection-free and is arguably cheaper to compute than a projection
Example
Apply the FW algorithm to solve the following nonlinear problem with the initialization point (x1, y1) = (0, 0):
min f(x, y) = (x − 1)² + (y − 1)²
s.t. 2x + y ≤ 6
x + 2y ≤ 6
x, y ≥ 0.
Solution. The gradient of f is ∇f (x, y ) = (2x − 2, 2y − 2)⊤ .
We know ∇f (0, 0) = (−2, −2)⊤
We get v by solving the linear optimization problem
min_{(x,y)} −2x − 2y  s.t.  2x + y ≤ 6, x + 2y ≤ 6, x, y ≥ 0.
By the simplex method, one can show that v = (2, 2)⊤
If we choose γ = 1/2, then we get
(x2, y2)⊤ = (x1, y1)⊤ + γ(v − (x1, y1)⊤) = (0, 0)⊤ + (1/2)(2, 2)⊤ = (1, 1)⊤
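A Python sketch reproducing this first iteration, using scipy.optimize.linprog as the linear minimization oracle over the polytope (using an LP solver here is an illustrative choice, not part of the slides).

import numpy as np
from scipy.optimize import linprog

A_ub = np.array([[2.0, 1.0], [1.0, 2.0]])    # 2x + y <= 6, x + 2y <= 6
b_ub = np.array([6.0, 6.0])

def grad(w):                                 # gradient of (x - 1)^2 + (y - 1)^2
    return 2.0 * (w - 1.0)

def lmo(g):                                  # argmin over the polytope of <g, v>
    res = linprog(c=g, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
    return res.x

w = np.zeros(2)                              # initialization (x1, y1) = (0, 0)
v = lmo(grad(w))                             # the vertex (2, 2)
w = w + 0.5 * (v - w)                        # gamma = 1/2 gives (1, 1)
print(v, w)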
Example
Lasso Regression
min_w FS(w) = ∥Aw − b∥2²  s.t.  ∥w∥1 ≤ 1.
∇FS(w) = g := 2A⊤(Aw − b)
LMO_W(g) = −sign(g_{j∗}) e_{j∗} with j∗ := arg max_{j∈[d]} |gj|, which is simpler than projection onto the L1-ball. Here ej is the j-th unit vector.
Proof. For any w ∈ W, we have
w⊤ g = Σ_{j∈[d]} wj gj ≥ −Σ_{j∈[d]} |wj gj| ≥ −max_{j∈[d]} |gj| Σ_{j∈[d]} |wj| ≥ −|g_{j∗}| = −sign(g_{j∗}) e_{j∗}⊤ g,
where the last inequality uses ∥w∥1 ≤ 1.
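A Python/numpy sketch of this L1-ball oracle (radius 1, as in the problem above); the gradient vector below is a hypothetical example.

import numpy as np

def lmo_l1_ball(g):
    """Linear minimization oracle over {w : ||w||_1 <= 1}.

    Returns -sign(g_{j*}) e_{j*}, where j* is the coordinate of g with the largest
    magnitude; the minimizer of <g, w> over the L1 ball is a vertex +-e_j.
    """
    j_star = int(np.argmax(np.abs(g)))
    v = np.zeros_like(g, dtype=float)
    v[j_star] = -np.sign(g[j_star])
    return v

g = np.array([0.3, -2.0, 1.1])   # hypothetical gradient
print(lmo_l1_ball(g))            # [0., 1., 0.]: moves against the largest-magnitude entry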
Lasso comparison
Comparing projected gradient and conditional gradient (Frank-Wolfe) for the constrained lasso problem, with n = 100, d = 500
Convergence of Frank-Wolfe Method
Theorem. Let W be nonempty, convex and bounded. Let FS be convex and L-smooth.
Then FW with γt = 2/(t + 2) satisfies
FS(wt) − FS(w∗) ≤ 2LD²/(t + 1),  where D := max_{w,w′∈W} ∥w − w′∥2.
Proof. Since vt and w∗ are in W, the optimality of vt and the convexity of FS show
⟨vt − wt, ∇FS(wt)⟩ ≤ ⟨w∗ − wt, ∇FS(wt)⟩ ≤ FS(w∗) − FS(wt).
By the L-smoothness and wt+1 − wt = γt(vt − wt),
FS(wt+1) ≤ FS(wt) + ⟨wt+1 − wt, ∇FS(wt)⟩ + (L/2)∥wt+1 − wt∥2²
= FS(wt) + γt⟨vt − wt, ∇FS(wt)⟩ + (γt²L/2)∥vt − wt∥2²
≤ FS(wt) + γt(FS(w∗) − FS(wt)) + γt²LD²/2.
=⇒ FS(wt+1) − FS(w∗) ≤ (1 − γt)(FS(wt) − FS(w∗)) + γt²LD²/2.
Convergence of Frank-Wolfe Method
We derived FS(wt+1) − FS(w∗) ≤ (1 − γt)(FS(wt) − FS(w∗)) + γt²LD²/2.
Writing ∆t := FS(wt) − FS(w∗) and plugging in γt = 2/(t + 2),
=⇒ ∆_{t+1} ≤ (t/(t + 2)) ∆t + 4LD²/(2(t + 2)²)
We multiply both sides by (t + 1)(t + 2) and get
(t + 1)(t + 2)∆_{t+1} ≤ t(t + 1)∆t + 2LD²(t + 1)/(t + 2) ≤ t(t + 1)∆t + 2LD².
Telescoping shows
(t + 1)(t + 2)∆_{t+1} = Σ_{k=0}^t [(k + 1)(k + 2)∆_{k+1} − k(k + 1)∆_k] ≤ Σ_{k=0}^t 2LD² = 2LD²(t + 1).
Dividing by (t + 1)(t + 2) gives ∆_{t+1} ≤ 2LD²/(t + 2), i.e., FS(wt) − FS(w∗) ≤ 2LD²/(t + 1).
Stochastic Frank-Wolfe Method
Stochastic Frank-Wolfe Method
1: for t = 0, 1, . . . , T do
2: Set vt = arg min_{v∈W} ⟨∇̂t, v⟩, where ∇̂t is an unbiased estimator of ∇FS(wt)
3: Set wt+1 = wt + γt (vt − wt ), where γt ∈ [0, 1]
Convergence Rate
Let W be nonempty, convex and bounded. Let FS be convex and L-smooth. Assume the following variance condition
E∥∇̂t − ∇FS(wt)∥2² ≤ L²D²/(t + 1)².
Then stochastic FW with γt = 2/(t + 2) satisfies
E[FS(wt) − FS(w∗)] ≤ 4LD²/(t + 1).
Stochastic FW requires decreasing variance!
Stochastic Frank-Wolfe Method
Note
FS(w) = (1/n) Σ_{i=1}^n f(w; zi).
Assume St is a random sample (with replacement) from [n] and
∇̂t = (1/|St|) Σ_{i∈St} ∇f(wt; zi).
Then ∇̂t is an unbiased estimator of ∇FS(wt):
E_{St}[∇̂t] = (1/|St|) E_{St}[Σ_{i∈St} ∇f(wt; zi)] = ∇FS(wt).
Furthermore, the variance decreases by a factor of |St|:
E∥∇̂t − ∇FS(wt)∥2² = σ²/|St|,  where σ² := E[∥∇f(wt; zit) − ∇FS(wt)∥2²].
Choosing |St| = σ²(t + 1)²/(L²D²) meets the variance condition for stochastic FW.
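A Python/numpy sketch of the mini-batch gradient estimator (sampling with replacement; the per-example gradient function and the rng are assumptions of this illustration).

import numpy as np

def minibatch_grad(grad_i, n, w, batch_size, rng):
    """Unbiased estimator of the full gradient (1/n) sum_i grad f(w; z_i).

    grad_i     : function (w, i) -> gradient of f(w; z_i).
    n          : number of training examples.
    batch_size : |S_t|; the variance of the estimator scales like 1 / |S_t|.
    rng        : a numpy Generator, e.g. np.random.default_rng(0).
    """
    idx = rng.integers(0, n, size=batch_size)           # sample S_t with replacement
    return np.mean([grad_i(w, i) for i in idx], axis=0)

In stochastic FW one would grow batch_size like (t + 1)² across iterations to satisfy the variance condition above.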
Optimization for Multi-class Classification
Training Multi-class Model by Frank-Wolfe Method
Recall the problem
min_w (1/n) Σ_{i=1}^n ℓyi((wj⊤ xi)_{j∈[c]})  s.t.  Σ_{j∈[c]} ∥wj∥2² ≤ B².
Frank-Wolfe method for multi-class classification
1: for t = 0, 1, . . . , T do
2: Set vt = arg min_{w: Σ_{j∈[c]} ∥wj∥2² ≤ 1} ⟨w, ∇FS(wt)⟩
3: Set wt+1 = wt + γt(vt − wt), where γt ∈ [0, 1]
The Frank-Wolfe algorithm requires solving a linear optimization problem
arg min_w ⟨w, g⟩  s.t.  Σ_{j∈[c]} ∥wj∥2² ≤ 1,
which has a closed-form solution w∗ = (w1∗, . . . , wc∗) as follows
wj∗ = −gj / (Σ_{j̃=1}^c ∥g_{j̃}∥2²)^{1/2}.    (2)
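A Python/numpy sketch of the closed-form oracle (2); the columns of g play the role of the blocks gj, and the gradient below is a random placeholder.

import numpy as np

def lmo_l22_ball(g):
    """Linear minimization oracle over {w : sum_j ||w_j||_2^2 <= 1}.

    g : array of shape (d, c) with columns g_1, ..., g_c.
    Returns w* with columns w_j* = -g_j / ||g||_{2,2}, as in equation (2).
    """
    norm = np.sqrt(np.sum(g ** 2))   # ||g||_{2,2} = (sum_j ||g_j||_2^2)^{1/2}
    return -g / norm if norm > 0 else np.zeros_like(g)

g = np.random.randn(4, 3)            # hypothetical gradient with d = 4, c = 3
w_star = lmo_l22_ball(g)
print(np.sum(w_star ** 2))           # = 1: the oracle returns a point on the boundary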
Proof on Closed-form Solution (Optional)
w∗ = arg min_{w: Σ_{j∈[c]} ∥wj∥2² ≤ 1} ⟨w, g⟩  ⇐⇒  wj∗ = −gj / (Σ_{j̃=1}^c ∥g_{j̃}∥2²)^{1/2}.
It suffices to check ∥w∗∥_{2,2} ≤ 1 and ⟨w∗, g⟩ = −∥g∥_{2,2}, since the Cauchy-Schwarz inequality gives ⟨w, g⟩ ≥ −∥w∥_{2,2} ∥g∥_{2,2} ≥ −∥g∥_{2,2} for every feasible w.
It is clear that
∥w∗∥_{2,2} = (Σ_{j̃=1}^c ∥g_{j̃}∥2²)^{1/2} / (Σ_{j̃=1}^c ∥g_{j̃}∥2²)^{1/2} = 1
and
⟨w∗, g⟩ = Σ_{j=1}^c ⟨wj∗, gj⟩ = −(Σ_{j̃=1}^c ∥g_{j̃}∥2²)^{−1/2} Σ_{j=1}^c ∥gj∥2² = −(Σ_{j̃=1}^c ∥g_{j̃}∥2²)^{1/2} = −∥g∥_{2,2}.
Summary
Multiclass predictors: vector-valued functions
▶ margin-based formulation: we want to find model with large margin
Learning theory
▶ Rademacher complexity bound with a square-root dependency on c
Frank-Wolfe Algorithm
▶ projection-free algorithm for constrained optimization
▶ replace projection by a linear minimization problem
Optimization for multi-class classification
▶ A closed-form solution for the linear minimization problem