ARIN7015/MATH6015: Topics in Artificial Intelligence
and Machine Learning
Multi-Class Classification and Its Optimization
Yunwen Lei
Department of Mathematics, The University of Hong Kong
March 20, 2025
Outline
1 Multiclass Predictors
2 Learning Theory for Multi-class Classification
3 Frank-Wolfe Algorithm
4 Optimization for Multi-class Classification
Multi-class Classification
(image, dog), (image, car), (image, airplane), . . .
Formally, z1 = (x1, y1), . . . , zn = (xn, yn) ∈ X × Y are drawn i.i.d. from a distribution ρ
▶ Y := {1, 2, . . . , c}
▶ c = number of classes
Reduction to Binary Classification
One-vs-One
Train c(c − 1)/2 binary classifiers, one for each pair of distinct classes
The final prediction is made by majority vote over these pairwise classifiers
Reduction to Binary Classification
One-vs-Rest
Train c binary classifiers, one for each class
Train j-th classifier to distinguish class j from the rest
Suppose h1, . . . , hc : X ↦ R are our binary classifiers
The final prediction is
h(x) = arg max_{j∈[c]} hj(x)
Tie breaking: If two distinct i, j ∈ [c] achieve the maximum (i.e., wi⊤ x = wj⊤ x), we break the tie using some consistent policy, e.g., predicting the smaller of i and j. A code sketch of this prediction rule follows below.
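A minimal Python/numpy sketch of the one-vs-rest prediction rule with the tie-breaking policy above; the weight matrix and input are hypothetical toy values.

import numpy as np

def ovr_predict(W, x):
    """One-vs-rest prediction: pick the class whose score w_j^T x is largest.

    W : array of shape (c, d), one weight vector per class (toy values here).
    x : array of shape (d,).
    np.argmax returns the smallest index among the maximizers, which matches
    the policy of breaking ties toward the smaller label.
    """
    scores = W @ x                 # vector of scores (w_1^T x, ..., w_c^T x)
    return int(np.argmax(scores))  # predicted label in {0, ..., c-1}

# Toy usage with c = 3 classes in d = 2 dimensions.
W = np.array([[-1.0, 1.0], [0.0, 1.0], [1.0, 1.0]]) / np.sqrt(2)
x = np.array([0.5, 1.0])
print(ovr_predict(W, x))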
Recap: Linear Binary Classifier
Y = {−1, +1}, X = R^d
Linear classifier score function h(x) = w⊤ x
Final prediction: sign(h(x))
h(x) = w⊤ x = ∥w∥2 ∥x∥2 cos(θ)
h(x) > 0 ⇐⇒ cos θ > 0 =⇒ θ ∈ (−90°, 90°)
h(x) < 0 ⇐⇒ cos θ < 0 =⇒ θ ∉ (−90°, 90°)
Three Class Example
Base hypothesis space
H = {x ↦ w⊤ x : w ∈ R^2}
Note: separating boundary always contains the origin
Three Class Example: One-vs-Rest
Class 1 vs Rest: h1 (x) = w1⊤ x
Three Class Example: One-vs-Rest
Class 2 vs Rest
▶ Predicts everything to be “Not 2”
▶ It misclassifies 6 examples (all examples in class “2”)
▶ If it predicted some “2”, then it would get many more “Not 2” incorrect
Score for class j is
hj (x) = wj⊤ x = ∥wj ∥2 ∥x∥ cos θj ,
where θj is the angle between x and wj .
One-vs-Rest: Class Boundaries
For simplicity, we assume ∥w1 ∥2 = ∥w2 ∥2 = ∥w3 ∥2
Then wj⊤ x = ∥wj ∥2 ∥x∥2 cos θj
▶ Only θj matters for comparison
▶ x is classified by whichever has largest cos θj (i.e., θj closest to 0)
One-vs-Rest: Class Boundaries
This approach doesn’t work well in this instance!
How can we fix this?
The Linear Multiclass Hypothesis Space
Base hypothesis space
H̃ = {x ↦ w⊤ x : w ∈ R^d}
Linear multiclass hypothesis space
H = {x ↦ arg max_{j∈[c]} hj(x) : h1, . . . , hc ∈ H̃}
A learning algorithm chooses a hypothesis from the hypothesis space
Is this a failure of the hypothesis space or of the learning algorithm?
A Solution with Linear Functions
This works... so the problem is not with the hypothesis space
How can we get a solution like this?
Multiclass Predictors
Multiclass Hypothesis Space
Idea: use one function h(x, y ) to give a compatibility score between input x and output y
Final prediction is the y ∈ Y that is “most compatible” with x:
x ← arg max_{y∈Y} h(x, y).
Given class-specific score functions h1, . . . , hc, we could define the compatibility function as
h(x, j) = hj(x), j ∈ [c].
Base Hypothesis Space: H = {h : X × Y ↦ R}
Multiclass Hypothesis Space
{x ↦ arg max_{y∈Y} h(x, y) : h ∈ H}
Want compatibility h(x, y ) to be large when x has label y , small otherwise!
Multi-class Classification: Example
Points x1 , x2 and x3 will be classified as label 1, 2 and 3, respectively.
What do the three rays stand for?
Linear Multiclass Prediction Function
A linear compatibility score function is given by
h(x, y) = ⟨w, Ψ(x, y)⟩,
where Ψ : X × Y ↦ R^{d′} is a compatibility feature map
Ψ(x, y ) extracts features relevant to how compatible y is with x
Final compatibility score is a linear function of Ψ(x, y )
Linear Multiclass Hypothesis Space
{x ↦ arg max_{y∈Y} ⟨w, Ψ(x, y)⟩ : w ∈ R^{d′}}
Example: X = R^2, Y = {1, 2, 3}
w1 = (−1/√2, 1/√2)⊤, w2 = (0, 1)⊤, w3 = (1/√2, 1/√2)⊤
Prediction function
x = (x1, x2)⊤ ← arg max_{j∈{1,2,3}} ⟨wj, x⟩
How can we get this into the form x ← arg max_{y∈Y} ⟨w, Ψ(x, y)⟩?
The Multivector Construction
What if we stack the wj's together:
w = (w1, w2, w3) = ((−1/√2, 1/√2)⊤, (0, 1)⊤, (1/√2, 1/√2)⊤)
We define the feature map Ψ : R^2 × {1, 2, 3} ↦ R^{2×3} as
Ψ(x, 1) = ((x1, x2)⊤, (0, 0)⊤, (0, 0)⊤),
Ψ(x, 2) = ((0, 0)⊤, (x1, x2)⊤, (0, 0)⊤),
Ψ(x, 3) = ((0, 0)⊤, (0, 0)⊤, (x1, x2)⊤)
Then we get the wanted equation ⟨w, Ψ(x, y)⟩ = ⟨wy, x⟩
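A short Python/numpy check of the multivector construction (a sketch; the columns of w are the w1, w2, w3 from the example above): placing x in the y-th column of Ψ(x, y) makes the matrix inner product ⟨w, Ψ(x, y)⟩ equal to ⟨wy, x⟩.

import numpy as np

d, c = 2, 3
# Columns are w1, w2, w3 from the example above.
w = np.array([[-1 / np.sqrt(2), 0.0, 1 / np.sqrt(2)],
              [ 1 / np.sqrt(2), 1.0, 1 / np.sqrt(2)]])   # shape (d, c)

def Psi(x, y):
    """Multivector feature map: put x in column y (0-indexed), zeros elsewhere."""
    out = np.zeros((d, c))
    out[:, y] = x
    return out

x = np.array([0.3, 0.8])
for y in range(c):
    lhs = np.sum(w * Psi(x, y))   # matrix inner product <w, Psi(x, y)>
    rhs = w[:, y] @ x             # <w_y, x>
    assert np.isclose(lhs, rhs)
print("<w, Psi(x, y)> = <w_y, x> for every y")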
Linear Multiclass Prediction Function: Simplification
In our lecture, we always consider the following compatibility feature map
Ψ(x, j) = (0, . . . , 0, x, 0, . . . , 0) ∈ R^{d×c}, where x sits in the j-th column
We stack all wj into a matrix
w = (w1, . . . , wc) ∈ R^{d×c}.
Then it is clear that
⟨w, Ψ(x, j)⟩ = wj⊤ x!
Linearly Separable Dataset
A training dataset S is linearly separable if there exist w1, . . . , wc that correctly classify all the points in S, i.e.,
for every point x ∈ S with label y, we have wy⊤ x > wj⊤ x for all j ≠ y
Recap: Binary Classification
Margin of a hyperplane for a dataset: the amount by which the boundary could be widened without hitting a data point
How to define the margin for multi-class classification?
Multi-class Margin
Recap: we choose the class label with the largest score function
ŷ := arg max_{j∈Y} h(x, j)
The prediction is correct if the true class label receives the largest score!
Multi-class Margin
Aim: we want h(xi, yi) to be larger than all other h(xi, j) for j ≠ yi, i.e., a large margin
margin(h, z) = h(x, y) − max_{j:j≠y} h(x, j),
i.e., the score of the true label minus the highest score among the incorrect labels
Multi-class Margin
Defined as the difference between the score of the correct label and the highest score among the incorrect labels
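A one-line Python/numpy version of this margin (a sketch; the score vector and label below are hypothetical).

import numpy as np

def multiclass_margin(scores, y):
    """margin(h, z) = h(x, y) - max_{j != y} h(x, j), given the score vector h(x, .)."""
    others = np.delete(scores, y)      # scores of the incorrect labels
    return scores[y] - np.max(others)

scores = np.array([2.0, 0.5, 1.2])     # hypothetical scores (w_j^T x)_{j in [c]}
print(multiclass_margin(scores, y=0))  # 0.8: the correct label wins by this amount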
Margin-based Loss
Ideal property of loss function
Loss should be a decreasing function of the margin
The model w predicts a vector in R^c: (w1⊤ x, . . . , wc⊤ x)
We need a map from R^c to R_+
We consider the margin-based loss of the form
f(w; z) = ℓy(w1⊤ x, w2⊤ x, . . . , wc⊤ x),
where ℓy : R^c ↦ R.
hinge loss: ℓy(t1, . . . , tc) = max{0, 1 − margin} = max{0, 1 + max_{j:j≠y}(tj − ty)}, where margin := ty − max_{j:j≠y} tj
f(w; z) = max{0, 1 + max_{j:j≠y} ⟨wj − wy, x⟩} = ℓy(w1⊤ x, . . . , wc⊤ x)    (1)
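A minimal Python/numpy sketch of the multiclass hinge loss (1); the score vector below is a hypothetical example.

import numpy as np

def multiclass_hinge(scores, y):
    """Hinge loss max{0, 1 + max_{j != y}(t_j - t_y)} for the score vector t."""
    others = np.delete(scores, y)
    return max(0.0, 1.0 + np.max(others) - scores[y])

scores = np.array([2.0, 0.5, 1.2])     # hypothetical (w_j^T x)_j
print(multiclass_hinge(scores, y=0))   # 0.2: the margin is 0.8 < 1, so a small loss remains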
Soft-Max Loss
Note that the margin of t is ty − max_{j≠y} tj
Max is not differentiable. Replace it with the soft-max:
max_{j:j≠y}(tj − ty) ≤ log Σ_{j:j≠y} exp(tj − ty) ≤ max_{j:j≠y}(tj − ty) + log c,
where max_{j:j≠y}(tj − ty) = −margin and the middle term is the soft-max
Want to maximize the margin: x ↦ log(1 + exp(−x)) is a decreasing function
log(1 + exp(−margin)) ≈ log(1 + exp(log Σ_{j:j≠y} exp(tj − ty))) = log(1 + Σ_{j:j≠y} exp(tj − ty)),
where the soft-max replaces −margin
With ℓy(t1, . . . , tc) = log(1 + Σ_{j:j≠y} exp(tj − ty)) we know
f(w; z) = log(1 + Σ_{j:j≠y} exp(wj⊤ x − wy⊤ x)) = ℓy(w1⊤ x, . . . , wc⊤ x).
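A Python/numpy sketch of this soft-max loss (the score vector is a hypothetical example); it equals the usual cross-entropy −log softmax(t)_y.

import numpy as np

def softmax_loss(scores, y):
    """Soft-max loss log(1 + sum_{j != y} exp(t_j - t_y)) for the score vector t."""
    diffs = np.delete(scores, y) - scores[y]
    return np.log1p(np.sum(np.exp(diffs)))

scores = np.array([2.0, 0.5, 1.2])   # hypothetical (w_j^T x)_j
print(softmax_loss(scores, y=0))     # a smooth surrogate of the hinge loss above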
Regularization Schemes for Multi-class Classification
We train a model by minimizing the following objective w.r.t. w = (w1, . . . , wc), wj ∈ R^d
min_{w∈R^{d×c}} (1/n) Σ_{i=1}^n ℓyi(w1⊤ xi, . . . , wc⊤ xi)  s.t.  w ∈ W = {w ∈ R^{d×c} : Ω(w) ≤ 1},
where ℓyi(w1⊤ xi, . . . , wc⊤ xi) = f(w; zi) and Ω : R^{d×c} ↦ R_+ is a regularizer.
Examples: Ω(w) = (λ/2) ∥w∥_{2,p}², where
∥w∥_{2,p} := (Σ_{j=1}^c ∥wj∥2^p)^{1/p},  p ≥ 1
p = 1: ∥w∥_{2,1} := Σ_{j=1}^c ∥wj∥2
p = 2: ∥w∥_{2,2} := (Σ_{j=1}^c ∥wj∥2²)^{1/2}
p = ∞: ∥w∥_{2,∞} := max_{j∈[c]} ∥wj∥2
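A small Python/numpy sketch of the group norm ∥w∥_{2,p} for p = 1, 2, ∞ (the model matrix below is random, purely for illustration).

import numpy as np

def group_norm(w, p):
    """Compute ||w||_{2,p}: the l_p norm of the vector of column norms ||w_j||_2.

    w : array of shape (d, c) whose columns are w_1, ..., w_c.
    p : 1, 2, or np.inf.
    """
    col_norms = np.linalg.norm(w, axis=0)    # (||w_1||_2, ..., ||w_c||_2)
    return np.linalg.norm(col_norms, ord=p)

w = np.random.randn(5, 3)                    # hypothetical model with d = 5, c = 3
print(group_norm(w, 1), group_norm(w, 2), group_norm(w, np.inf))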
Learning Theory for Multi-class Classification
Rademacher Complexity for Multi-class Classification
The loss function class becomes
F = {z ↦ ℓy(w1⊤ x, . . . , wc⊤ x) : Ω(w) ≤ 1}
How to estimate the Rademacher complexity
E_σ sup_{w∈W} (1/n) Σ_{i=1}^n σi ℓyi(w1⊤ xi, . . . , wc⊤ xi)?
The difficulty lies in the nonlinearity of ℓy, which takes a vector in R^c as its input!
Lipschitz Continuity of ℓy
We say ℓy is G-Lipschitz continuous w.r.t. a norm ∥ · ∥ on R^c if
|ℓy(a1, . . . , ac) − ℓy(a1′, . . . , ac′)| ≤ G ∥(a1, . . . , ac) − (a1′, . . . , ac′)∥.
This extends the usual Lipschitz continuity of functions from R to R
We are interested in ∥ · ∥2 and ∥ · ∥∞
Lemma. Let a1, . . . , ac and b1, . . . , bc ∈ R. Then
|max{a1, . . . , ac} − max{b1, . . . , bc}| ≤ max_{j∈[c]} |aj − bj|.
Proof. Without loss of generality, assume ak ≥ bk′, where
ak = max{a1, . . . , ac}, bk′ = max{b1, . . . , bc}.
Then
max{a1, . . . , ac} − max{b1, . . . , bc} = ak − bk′ ≤ ak − bk ≤ max_{j∈[c]} |aj − bj|.
Lipschitz Continuity of Hinge Loss
Example
The hinge loss ℓy(t1, . . . , tc) = max{0, 1 + max_{j:j≠y}(tj − ty)} is 2-Lipschitz continuous w.r.t. ∥ · ∥∞.
Proof. We just showed
|max{a1, . . . , ac} − max{b1, . . . , bc}| ≤ max_{j∈[c]} |aj − bj|.
Let t = (t1, . . . , tc) and t′ = (t1′, . . . , tc′). Then
|ℓy(t) − ℓy(t′)| = |max{0, 1 + max_{j:j≠y}(tj − ty)} − max{0, 1 + max_{j:j≠y}(tj′ − ty′)}|
≤ |(1 + max_{j:j≠y}(tj − ty)) − (1 + max_{j:j≠y}(tj′ − ty′))|
≤ max_{j:j≠y} |(tj − ty) − (tj′ − ty′)| ≤ 2∥t − t′∥∞.
Lipschitz Continuity of the Soft-max Loss
Example
The soft-max loss ℓy(t1, . . . , tc) = log(1 + Σ_{j:j≠y} exp(tj − ty)) is 2-Lipschitz continuous w.r.t. ∥ · ∥∞.
Proof. The function t ↦ g(t) := log Σ_{j∈[c]} exp(tj) is 1-Lipschitz continuous w.r.t. ∥ · ∥∞:
∇g(t) = (Σ_{j∈[c]} exp(tj))^{-1} (exp(t1), . . . , exp(tc))⊤ =⇒ ∥∇g(t)∥1 ≤ 1.
By a Taylor expansion, for some α ∈ [0, 1],
g(t) − g(t′) = ⟨t − t′, ∇g(αt + (1 − α)t′)⟩
=⇒ |g(t) − g(t′)| ≤ ∥t − t′∥∞ ∥∇g(αt + (1 − α)t′)∥1 ≤ ∥t − t′∥∞.
Since ℓy(t) = g(t1 − ty, . . . , tc − ty) and |(tj − ty) − (tj′ − ty′)| ≤ 2∥t − t′∥∞ for every j, the soft-max loss is 2-Lipschitz w.r.t. ∥ · ∥∞.
Vector contraction of Rademacher complexity
Vector contraction of Rademacher complexity
Let F be a class of functions from Z to R^c. For each i ∈ [n], let σi : R^c ↦ R be L-Lipschitz continuous w.r.t. ∥ · ∥2. Then
E sup_{f∈F} Σ_{i∈[n]} ϵi σi(f(zi)) ≤ √2 L E sup_{f∈F} Σ_{i∈[n]} Σ_{k∈[c]} ϵ_{i,k} fk(zi),
where ϵ_{i,k} are i.i.d. Rademacher variables, and fk(zi) is the k-th component of f(zi).
It extends Talagrand's contraction lemma to vector-valued functions
It removes the nonlinear σ, which is the loss function in multi-class classification!
Maurer, A., 2016. A vector-contraction inequality for Rademacher complexities. In Algorithmic Learning
Theory.
Rademacher Complexity Bound (Optional)
Let ℓy be either the hinge loss or the soft-max loss. Then
|ℓy(t) − ℓy(t′)| ≤ 2 max_{j∈[c]} |tj − tj′| = 2∥t − t′∥∞ ≤ 2 (Σ_{j=1}^c (tj − tj′)²)^{1/2} = 2∥t − t′∥2.
Then the vector contraction of Rademacher complexity implies
E sup_{w∈W} Σ_{i∈[n]} ϵi ℓyi(w1⊤ xi, . . . , wc⊤ xi) ≤ 2√2 E sup_{w∈W} Σ_{i∈[n]} Σ_{j∈[c]} ϵij wj⊤ xi.
Let W = {w ∈ R^{d×c} : ∥w∥_{2,2} ≤ 1}. Then
E sup_{w∈W} Σ_{i∈[n]} Σ_{j∈[c]} ϵij wj⊤ xi = E sup_{w∈W} Σ_{j∈[c]} wj⊤ (Σ_{i∈[n]} ϵij xi)
≤ E sup_{w∈W} Σ_{j∈[c]} ∥wj∥2 ∥Σ_{i∈[n]} ϵij xi∥2
≤ E sup_{w∈W} (Σ_{j∈[c]} ∥wj∥2²)^{1/2} (Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}
≤ E[(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}].
Rademacher Complexity Bound (Optional)
We just showed
E sup_{w∈W} Σ_{i∈[n]} ϵi ℓyi(w1⊤ xi, . . . , wc⊤ xi) ≤ √2 L E[(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}].
By the concavity of x ↦ √x, we know
E[(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}] ≤ (Σ_{j=1}^c E∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}.
Furthermore, there holds
E∥Σ_{i∈[n]} ϵij xi∥2² = E[⟨Σ_{i∈[n]} ϵij xi, Σ_{i′∈[n]} ϵi′j xi′⟩] = Σ_{i∈[n]} E[ϵij²] xi⊤ xi + Σ_{i≠i′} E[ϵij ϵi′j] xi⊤ xi′,
where the second sum is 0.
=⇒ E[(Σ_{j=1}^c ∥Σ_{i∈[n]} ϵij xi∥2²)^{1/2}] ≤ (c · Σ_{i∈[n]} E∥xi∥2²)^{1/2}.
Rademacher Complexity Bound
The Rademacher complexity of F is bounded by
R_S({z ↦ ℓy(w1⊤ x, . . . , wc⊤ x) : ∥w∥_{2,2} ≤ 1}) ≤ (2√2/n) (c · Σ_{i∈[n]} E∥xi∥2²)^{1/2}.
Generalization bound
Let A be ERM. Then with probability at least 1 − δ
F(A(S)) − F(w∗) ≲ n^{−1/2} log^{1/2}(2/δ) + √c/√n.
This shows a square-root dependency of the generalization bound on the number of classes!
Frank-Wolfe Algorithm
Training Multiclass Model
Consider the following minimization problem (p ≥ 1)
min_w (1/n) Σ_{i=1}^n ℓyi((wj⊤ xi)_{j∈[c]})  s.t.  Σ_{j∈[c]} ∥wj∥2^p ≤ B^p.
It is a constrained optimization problem
Can be solved by projected gradient descent, which, however, requires a projection step per iteration
The projection can be expensive
Can we develop a projection-free method?
Yes, the Frank-Wolfe Algorithm!
Frank-Wolfe Method
Objective: min_{w∈W} FS(w), where W is convex.
Recap: projected gradient descent (PGD)
Recall that PGD wt+1 = Proj_W(wt − η∇FS(wt)) uses a quadratic approximation
wt+1 = arg min_{w∈W} {FS(wt) + ⟨w − wt, ∇FS(wt)⟩ + (1/(2η)) ∥w − wt∥2²}.
In some cases, the projection may be hard to compute (or even approximate)
For these problems, we can solve the simplified problem
arg min_{w∈W} {FS(wt) + ⟨w − wt, ∇FS(wt)⟩},
which optimizes a linear approximation to the function over the constraint set
This requires the set W to be bounded, otherwise there may be no solution
Frank-Wolfe Method
Frank-Wolfe Method
1: for t = 0, 1, . . . , T do
2: Set vt = arg min_{v∈W} ⟨∇FS(wt), v⟩
3: Set wt+1 = wt + γt (vt − wt ), where γt ∈ [0, 1]
wt+1 is feasible since it is a convex combination of two other feasible points
wt+1 = (1 − γt )wt + γt vt
A choice of γt is γt = 2/(t + 2)
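A generic Python/numpy sketch of the Frank-Wolfe loop above; the gradient and the linear minimization oracle are user-supplied callables here (hypothetical names, not from the slides).

import numpy as np

def frank_wolfe(grad, lmo, w0, T):
    """Projection-free Frank-Wolfe method.

    grad : function w -> gradient of F_S at w.
    lmo  : function g -> argmin_{v in W} <g, v> (linear minimization oracle over W).
    w0   : feasible starting point.
    T    : number of iterations.
    """
    w = np.array(w0, dtype=float)
    for t in range(T):
        v = lmo(grad(w))                  # step 2: solve the linear subproblem
        gamma = 2.0 / (t + 2.0)           # step size gamma_t = 2 / (t + 2)
        w = (1 - gamma) * w + gamma * v   # step 3: convex combination stays feasible
    return w

The worked example and the L1-ball oracle on the following slides can be plugged in as grad and lmo.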
Linear Minimization Oracle
Let W be a convex, closed and bounded set. The linear minimization oracle of W (LMO_W) returns a vector
LMO_W(g) = arg min_{v∈W} g⊤ v.
LMO_W returns an extreme point of W
LMO_W is projection-free and is arguably cheaper to compute than a projection
Example
Apply the FW algorithm to solve the following nonlinear problem with the initialization point (x1, y1) = (0, 0):
min f(x, y) = (x − 1)² + (y − 1)²
s.t. 2x + y ≤ 6
x + 2y ≤ 6
x, y ≥ 0.
Solution. The gradient of f is ∇f (x, y ) = (2x − 2, 2y − 2)⊤ .
We know ∇f (0, 0) = (−2, −2)⊤
We get v by solving the linear optimization problem
min_{(x,y)} −2x − 2y  s.t.  2x + y ≤ 6, x + 2y ≤ 6, x, y ≥ 0.
By the simplex method, one can show that v = (2, 2)⊤
If we choose γ = 1/2, then we get
(x2, y2)⊤ = (x1, y1)⊤ + γ(v − (x1, y1)⊤) = (0, 0)⊤ + (1/2)(2, 2)⊤ = (1, 1)⊤
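A Python sketch reproducing this first iteration, using scipy.optimize.linprog as the linear minimization oracle over the polytope (using an LP solver here is an illustrative choice, not part of the slides).

import numpy as np
from scipy.optimize import linprog

A_ub = np.array([[2.0, 1.0], [1.0, 2.0]])    # 2x + y <= 6, x + 2y <= 6
b_ub = np.array([6.0, 6.0])

def grad(w):                                 # gradient of (x - 1)^2 + (y - 1)^2
    return 2.0 * (w - 1.0)

def lmo(g):                                  # argmin over the polytope of <g, v>
    res = linprog(c=g, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
    return res.x

w = np.zeros(2)                              # initialization (x1, y1) = (0, 0)
v = lmo(grad(w))                             # the vertex (2, 2)
w = w + 0.5 * (v - w)                        # gamma = 1/2 gives (1, 1)
print(v, w)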
Example
Lasso Regression
min_w FS(w) = ∥Aw − b∥2²  s.t.  ∥w∥1 ≤ 1.
∇FS(w) = g := 2A⊤(Aw − b)
LMO_W(g) = −sign(g_{j∗}) e_{j∗} with j∗ := arg max_{j∈[d]} |gj|, which is simpler than projection onto the L1-ball. Here ej is the j-th unit vector.
Proof. For any w ∈ W, we have
w⊤ g = Σ_{j∈[d]} wj gj ≥ −Σ_{j∈[d]} |wj gj| ≥ −max_{j∈[d]} |gj| Σ_{j∈[d]} |wj| ≥ −|g_{j∗}| = −sign(g_{j∗}) e_{j∗}⊤ g,
where the last inequality uses ∥w∥1 ≤ 1.
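A Python/numpy sketch of this L1-ball oracle (radius 1, as in the problem above); the gradient vector below is a hypothetical example.

import numpy as np

def lmo_l1_ball(g):
    """Linear minimization oracle over {w : ||w||_1 <= 1}.

    Returns -sign(g_{j*}) e_{j*}, where j* is the coordinate of g with the largest
    magnitude; the minimizer of <g, w> over the L1 ball is a vertex +-e_j.
    """
    j_star = int(np.argmax(np.abs(g)))
    v = np.zeros_like(g, dtype=float)
    v[j_star] = -np.sign(g[j_star])
    return v

g = np.array([0.3, -2.0, 1.1])   # hypothetical gradient
print(lmo_l1_ball(g))            # [0., 1., 0.]: moves against the largest-magnitude entry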
Lasso comparison
Comparing projected gradient and conditional gradient (Frank-Wolfe) for the constrained lasso problem, with n = 100, d = 500
Convergence of Frank-Wolfe Method
Theorem. Let W be nonempty, convex and bounded. Let FS be convex and L-smooth.
Then FW with γt = 2/(t + 2) satisfies
FS(wt) − FS(w∗) ≤ 2LD²/(t + 1),  where D := max_{w,w′∈W} ∥w − w′∥2.
Proof. Since vt and w∗ are in W, the optimality of vt and the convexity of FS show
⟨vt − wt, ∇FS(wt)⟩ ≤ ⟨w∗ − wt, ∇FS(wt)⟩ ≤ FS(w∗) − FS(wt).
By the L-smoothness and wt+1 − wt = γt(vt − wt),
FS(wt+1) ≤ FS(wt) + ⟨wt+1 − wt, ∇FS(wt)⟩ + (L/2)∥wt+1 − wt∥2²
= FS(wt) + γt⟨vt − wt, ∇FS(wt)⟩ + (γt²L/2)∥vt − wt∥2²
≤ FS(wt) + γt(FS(w∗) − FS(wt)) + γt²LD²/2.
=⇒ FS(wt+1) − FS(w∗) ≤ (1 − γt)(FS(wt) − FS(w∗)) + γt²LD²/2.
Convergence of Frank-Wolfe Method
We derived FS(wt+1) − FS(w∗) ≤ (1 − γt)(FS(wt) − FS(w∗)) + γt²LD²/2.
Writing ∆t := FS(wt) − FS(w∗) and plugging in γt = 2/(t + 2),
=⇒ ∆_{t+1} ≤ (t/(t + 2)) ∆t + 4LD²/(2(t + 2)²)
We multiply both sides by (t + 1)(t + 2) and get
(t + 1)(t + 2)∆_{t+1} ≤ t(t + 1)∆t + 2LD²(t + 1)/(t + 2) ≤ t(t + 1)∆t + 2LD².
Telescoping shows
(t + 1)(t + 2)∆_{t+1} = Σ_{k=0}^t [(k + 1)(k + 2)∆_{k+1} − k(k + 1)∆_k] ≤ Σ_{k=0}^t 2LD² = 2LD²(t + 1).
Dividing by (t + 1)(t + 2) gives ∆_{t+1} ≤ 2LD²/(t + 2), i.e., FS(wt) − FS(w∗) ≤ 2LD²/(t + 1).
Stochastic Frank-Wolfe Method
Stochastic Frank-Wolfe Method
1: for t = 0, 1, . . . , T do
2: Set vt = arg min_{v∈W} ⟨∇̂t, v⟩, where ∇̂t is an unbiased estimator of ∇FS(wt)
3: Set wt+1 = wt + γt (vt − wt ), where γt ∈ [0, 1]
Convergence Rate
Let W be nonempty, convex and bounded. Let FS be convex and L-smooth. Assume the following variance condition
E∥∇̂t − ∇FS(wt)∥2² ≤ L²D²/(t + 1)².
Then stochastic FW with γt = 2/(t + 2) satisfies
E[FS(wt) − FS(w∗)] ≤ 4LD²/(t + 1).
Stochastic FW requires decreasing variance!
Stochastic Frank-Wolfe Method
Note
FS(w) = (1/n) Σ_{i=1}^n f(w; zi).
Assume St is a random sample (with replacement) from [n] and
∇̂t = (1/|St|) Σ_{i∈St} ∇f(wt; zi).
Then ∇̂t is an unbiased estimator of ∇FS(wt):
E_{St}[∇̂t] = (1/|St|) E_{St}[Σ_{i∈St} ∇f(wt; zi)] = ∇FS(wt).
Furthermore, the variance decreases by a factor of |St|:
E∥∇̂t − ∇FS(wt)∥2² = σ²/|St|,  where σ² := E[∥∇f(wt; zit) − ∇FS(wt)∥2²].
Choosing |St| = σ²(t + 1)²/(L²D²) meets the variance condition for stochastic FW.
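A Python/numpy sketch of the mini-batch gradient estimator (sampling with replacement; the per-example gradient function and the rng are assumptions of this illustration).

import numpy as np

def minibatch_grad(grad_i, n, w, batch_size, rng):
    """Unbiased estimator of the full gradient (1/n) sum_i grad f(w; z_i).

    grad_i     : function (w, i) -> gradient of f(w; z_i).
    n          : number of training examples.
    batch_size : |S_t|; the variance of the estimator scales like 1 / |S_t|.
    rng        : a numpy Generator, e.g. np.random.default_rng(0).
    """
    idx = rng.integers(0, n, size=batch_size)           # sample S_t with replacement
    return np.mean([grad_i(w, i) for i in idx], axis=0)

In stochastic FW one would grow batch_size like (t + 1)² across iterations to satisfy the variance condition above.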
Optimization for Multi-class Classification
Training Multi-class Model by Frank-Wolfe Method
Recall the problem
min_w (1/n) Σ_{i=1}^n ℓyi((wj⊤ xi)_{j∈[c]})  s.t.  Σ_{j∈[c]} ∥wj∥2² ≤ B².
Frank-Wolfe method for multi-class classification
1: for t = 0, 1, . . . , T do
2: Set vt = arg min_{w: Σ_{j∈[c]} ∥wj∥2² ≤ 1} ⟨w, ∇FS(wt)⟩
3: Set wt+1 = wt + γt(vt − wt), where γt ∈ [0, 1]
The Frank-Wolfe algorithm requires solving a linear optimization problem
arg min_w ⟨w, g⟩  s.t.  Σ_{j∈[c]} ∥wj∥2² ≤ 1,
which has a closed-form solution w∗ = (w1∗, . . . , wc∗) as follows
wj∗ = −gj / (Σ_{j̃=1}^c ∥g_{j̃}∥2²)^{1/2}.    (2)
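A Python/numpy sketch of the closed-form oracle (2); the columns of g play the role of the blocks gj, and the gradient below is a random placeholder.

import numpy as np

def lmo_l22_ball(g):
    """Linear minimization oracle over {w : sum_j ||w_j||_2^2 <= 1}.

    g : array of shape (d, c) with columns g_1, ..., g_c.
    Returns w* with columns w_j* = -g_j / ||g||_{2,2}, as in equation (2).
    """
    norm = np.sqrt(np.sum(g ** 2))   # ||g||_{2,2} = (sum_j ||g_j||_2^2)^{1/2}
    return -g / norm if norm > 0 else np.zeros_like(g)

g = np.random.randn(4, 3)            # hypothetical gradient with d = 4, c = 3
w_star = lmo_l22_ball(g)
print(np.sum(w_star ** 2))           # = 1: the oracle returns a point on the boundary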
Proof on Closed-form Solution (Optional)
w∗ = arg min_{w: Σ_{j∈[c]} ∥wj∥2² ≤ 1} ⟨w, g⟩  ⇐⇒  wj∗ = −gj / (Σ_{j̃=1}^c ∥g_{j̃}∥2²)^{1/2}.
It suffices to check ∥w∗∥_{2,2} ≤ 1 and ⟨w∗, g⟩ = −∥g∥_{2,2}, since the Cauchy-Schwarz inequality gives ⟨w, g⟩ ≥ −∥w∥_{2,2} ∥g∥_{2,2} ≥ −∥g∥_{2,2} for every feasible w.
It is clear that
∥w∗∥_{2,2} = (Σ_{j̃=1}^c ∥g_{j̃}∥2²)^{1/2} / (Σ_{j̃=1}^c ∥g_{j̃}∥2²)^{1/2} = 1
and
⟨w∗, g⟩ = Σ_{j=1}^c ⟨wj∗, gj⟩ = −(Σ_{j̃=1}^c ∥g_{j̃}∥2²)^{−1/2} Σ_{j=1}^c ∥gj∥2² = −(Σ_{j̃=1}^c ∥g_{j̃}∥2²)^{1/2} = −∥g∥_{2,2}.
Summary
Multiclass predictors: vector-valued functions
▶ margin-based formulation: we want to find model with large margin
Learning theory
▶ Rademacher complexity bound with a square-root dependency on c
Frank-Wolfe Algorithm
▶ projection-free algorithm for constrained optimization
▶ replace projection by a linear minimization problem
Optimization for multi-class classification
▶ A closed-form solution for the linear minimization problem