MA3K1 Mathematics of Machine Learning April 10, 2021
Solution (23) Given a vector $x \in \mathbb{R}^d$ and a sequence $g = (g_{-d+1}, \dots, g_{d-1})^T$, the convolution is defined as the vector $y = (y_1, \dots, y_d)^T$, where
\[
y_k = \sum_{i=1}^{d} x_i \cdot g_{k-i}, \qquad k \in \{1, \dots, d\}.
\]
In terms of matrices,
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_d \end{pmatrix}
=
\begin{pmatrix}
g_0 & g_{-1} & \cdots & g_{-d+1} \\
g_1 & g_0 & \cdots & g_{-d+2} \\
\vdots & \vdots & \ddots & \vdots \\
g_{d-1} & g_{d-2} & \cdots & g_0
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix}.
\]
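Note that the matrix above is a Toeplitz matrix: it is constant along each diagonal. As a minimal sketch (my own illustration, assuming NumPy; not part of the notes), the matrix form can be checked against the defining sum:

```python
import numpy as np

# Sketch: build the d x d convolution (Toeplitz) matrix for a filter
# g = (g_{-d+1}, ..., g_{d-1}) and check G @ x against the defining sum
# y_k = sum_i x_i * g_{k-i}.
d = 4
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
g = rng.standard_normal(2 * d - 1)  # g[j + d - 1] stores g_j, j = -(d-1), ..., d-1

def g_at(j):
    return g[j + d - 1]

# Entry (k, i) of the matrix is g_{k-i} (0-based indices here).
G = np.array([[g_at(k - i) for i in range(d)] for k in range(d)])

y_matrix = G @ x
y_direct = np.array([sum(x[i] * g_at(k - i) for i in range(d)) for k in range(d)])
assert np.allclose(y_matrix, y_direct)
```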
The sequence $g$ is called a filter. Convolutional Neural Networks (CNNs) operate on image data, and in the context of colour image data, a vector $x$ that enters a layer of the CNN will not be seen as a single vector in some $\mathbb{R}^d$, but as a tensor with format $N \times M \times 3$. Here, the image is considered to be an $N \times M$ matrix, with each entry corresponding to a pixel, and each pixel is represented by a 3-tuple consisting of the red, green and blue components. A convolution filter will therefore operate on a subset of this tensor, typically an $n \times m \times 3$ window. Typical parameters are $N = M = 32$ and $n = m = 5$.
[Figure: a $5 \times 5$ window sliding across the image; each position of the window is mapped to a single entry of the output matrix.]
Each $5 \times 5 \times 3$ block is mapped linearly to an entry of a $28 \times 28$ matrix in the same way (that is, applying the same filter); note that $28 = 32 - 5 + 1$ is the number of positions the window can occupy in each direction. The filter applied is called a kernel, and the resulting map is interpreted as representing a feature. Specifically, instead of writing it as a vector $g$, we can write it as a matrix $K$, and then have, if we denote the input image by $X$,
\[
Y = X \ast K, \qquad Y_{ij} = \sum_{k,\ell} X_{k\ell} \cdot K_{i-k,\, j-\ell}.
\]
In a convolutional layer, several different kernels, or convolutions, are applied to the image. This can be seen as extracting several features from an image. CNNs have the advantage of being scalable: they work on images of arbitrary size.
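As a minimal sketch (my own illustration, assuming NumPy and SciPy; not part of the notes), the two-dimensional convolution above can be evaluated directly from the defining sum, keeping only the outputs where the kernel fully overlaps the image:

```python
import numpy as np
from scipy.signal import convolve2d

# Sketch: evaluate Y_ij = sum_{k,l} X_{kl} * K_{i-k, j-l} directly, keeping
# only "valid" outputs (kernel fully inside the image), so a 32 x 32 input
# and a 5 x 5 kernel yield a 28 x 28 output.
def conv2d_valid(X, K):
    N, M = X.shape
    n, m = K.shape
    Y = np.zeros((N - n + 1, M - m + 1))
    for oi in range(N - n + 1):
        for oj in range(M - m + 1):
            i, j = oi + n - 1, oj + m - 1          # output index in the formula
            Y[oi, oj] = sum(X[k, l] * K[i - k, j - l]
                            for k in range(oi, oi + n)
                            for l in range(oj, oj + m))
    return Y

X = np.random.default_rng(1).standard_normal((32, 32))
K = np.random.default_rng(2).standard_normal((5, 5))
Y = conv2d_valid(X, K)
assert Y.shape == (28, 28)
assert np.allclose(Y, convolve2d(X, K, mode="valid"))  # cross-check
```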
Solution (24) Let $h \colon \mathbb{R}^d \to \mathcal{Y}$ be any classifier. Define the smallest perturbation that moves a data point into a different class as
\[
\Delta_h(x) = \inf_r \{\|r\| : h(x + r) \neq h(x)\}. \tag{3.5}
\]
Given a probability distribution on the input space, define the robustness as
\[
\rho(h) = \mathbb{E}\left[\frac{\Delta_h(X)}{\|X\|}\right].
\]
For a linear classifier with only two classes, where $h(x) = \operatorname{sign}(w^T x + b)$, we get the vector that moves the point $x$ to the boundary by solving the optimization problem (as for SVMs)
\[
r^* = \operatorname*{arg\,min}_r \ \frac{1}{2}\|r\|^2 \quad \text{subject to} \quad w^T(x + r) + b = 0.
\]
The solution is $r^* = -\frac{w^T x + b}{\|w\|^2}\, w$, with norm
\[
\|r^*\| = \frac{|w^T x + b|}{\|w\|}.
\]
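As a minimal sketch (my own illustration, assuming NumPy; not part of the notes), the closed form can be verified numerically, and the robustness $\rho(h)$ estimated by Monte Carlo sampling:

```python
import numpy as np

# Sketch: minimal perturbation r* = -(w^T x + b) / ||w||^2 * w for a linear
# classifier, plus a Monte Carlo estimate of the robustness rho(h).
rng = np.random.default_rng(0)
d = 5
w, b = rng.standard_normal(d), 0.3

x = rng.standard_normal(d)
r_star = -(w @ x + b) / (w @ w) * w
assert np.isclose(w @ (x + r_star) + b, 0.0)            # x + r* is on the boundary
assert np.isclose(np.linalg.norm(r_star),
                  abs(w @ x + b) / np.linalg.norm(w))   # norm matches the formula

# rho(h) = E[ Delta_h(X) / ||X|| ] for X ~ N(0, I), using
# Delta_h(x) = |w^T x + b| / ||w||.
X = rng.standard_normal((100_000, d))
delta = np.abs(X @ w + b) / np.linalg.norm(w)
rho = np.mean(delta / np.linalg.norm(X, axis=1))
print(f"estimated robustness: {rho:.4f}")
```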
Assume that we now have $k$ linear functions $f_1, \dots, f_k$, with $f_i(x) = w_i^T x + b_i$, and a classifier $h \colon \mathbb{R}^d \to \{1, \dots, k\}$ that assigns to each $x$ the index $j$ of the largest value $f_j(x)$ (this corresponds to the one-vs-rest setting for multiclass classification). Let $x$ be such that $\max_j f_j(x) = f_k(x)$ and define the linear functions
\[
g_i(x) := f_i(x) - f_k(x) = (w_i - w_k)^T x + (b_i - b_k), \qquad i \in \{1, \dots, k-1\}.
\]
Then
\[
x \in \bigcap_{1 \le i \le k-1} \{y : g_i(y) < 0\}.
\]
The intersection of half-spaces is a polyhedron, and $x$ is in the interior of the polyhedron $P$ delineated by the hyperplanes $H_i = \{y : g_i(y) = 0\}$ (see Figure 6).
[Figure: a point $x$ in the interior of a polyhedron bounded by hyperplanes $H_1, \dots, H_5$.]
Figure 6: The distance to misclassification is the radius of the largest ball centered at $x$ that is contained in the polyhedron $P$.
A perturbation $x + r$ ceases to be in class $k$ as soon as $g_j(x + r) > 0$ for some $j$, and the smallest length of such an $r$ equals the radius of the largest ball centered at $x$ that is contained in the polyhedron. Formally, noting that $w_j - w_k = \nabla g_j(x)$, we get
\[
\hat{\jmath} := \operatorname*{arg\,min}_j \frac{|g_j(x)|}{\|\nabla g_j(x)\|}, \qquad r^* = \frac{|g_{\hat{\jmath}}(x)|}{\|\nabla g_{\hat{\jmath}}(x)\|^2} \cdot \nabla g_{\hat{\jmath}}(x), \tag{3.6}
\]
where the absolute value in the numerator of $r^*$ ensures that the perturbation points in the direction of the gradient, since $g_{\hat{\jmath}}(x) < 0$ inside $P$.
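As a minimal sketch (my own illustration, assuming NumPy; not part of the notes), formula (3.6) can be checked on a random multiclass linear classifier:

```python
import numpy as np

# Sketch: for k linear scores f_i(x) = w_i^T x + b_i, find the closest class
# boundary via (3.6) and verify that x + r* reaches it.
rng = np.random.default_rng(3)
d, k = 4, 5
W = rng.standard_normal((k, d))
b = rng.standard_normal(k)

x = rng.standard_normal(d)
c = int(np.argmax(W @ x + b))                         # current class (k in the notes)

others = [i for i in range(k) if i != c]
g = (W[others] @ x + b[others]) - (W[c] @ x + b[c])   # g_i(x), all negative
grads = W[others] - W[c]                              # gradients w_i - w_c
norms = np.linalg.norm(grads, axis=1)

j_hat = int(np.argmin(np.abs(g) / norms))
r_star = np.abs(g[j_hat]) / norms[j_hat] ** 2 * grads[j_hat]

assert np.isclose(g[j_hat] + grads[j_hat] @ r_star, 0.0)  # on the boundary
assert np.argmax(W @ (x + 1.001 * r_star) + b) != c       # class has changed
```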
Solution (25) The discriminator $D$ would like to achieve, on average, a large value on $G_0(Z_0)$, and a small value on $G_1(Z_1)$. Using the logarithm, this can be expressed as the problem of maximizing
\[
\mathbb{E}_{Z_0}[\log(D(G_0(Z_0)))] + \mathbb{E}_{Z_1}[\log(1 - D(G_1(Z_1)))].
\]
As an integral, using the push-forward measures, we get the expression
\[
\int_{\mathcal{X}} \left[\log(D(x))\,\rho_{X_0}(x) + \log(1 - D(x))\,\rho_{X_1}(x)\right] \mathrm{d}x.
\]
We choose $D(x)$ so that the integrand becomes maximal. So considering the function
\[
f(y) = \log(y)\,\rho_{X_0}(x) + \log(1 - y)\,\rho_{X_1}(x)
\]
and computing the first and second derivatives,
\[
f'(y) = \frac{\rho_{X_0}(x)}{y} - \frac{\rho_{X_1}(x)}{1 - y}, \qquad f''(y) = -\frac{\rho_{X_0}(x)}{y^2} - \frac{\rho_{X_1}(x)}{(1 - y)^2} < 0,
\]
we see that we get a maximum at
\[
y = \frac{\rho_{X_0}(x)}{\rho_{X_0}(x) + \rho_{X_1}(x)},
\]
which is the value of $D^*(x)$ that we want.
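As a minimal sketch (my own illustration, assuming NumPy; not part of the notes), the closed-form maximizer can be checked against a grid search for fixed density values:

```python
import numpy as np

# Sketch: for fixed density values rho0 = rho_{X_0}(x) and rho1 = rho_{X_1}(x),
# maximize f(y) = log(y) * rho0 + log(1 - y) * rho1 over a grid of y in (0, 1)
# and compare with the closed form y* = rho0 / (rho0 + rho1).
rho0, rho1 = 0.7, 0.2
y = np.linspace(1e-6, 1 - 1e-6, 1_000_000)
f = np.log(y) * rho0 + np.log(1 - y) * rho1

y_grid = y[np.argmax(f)]
y_star = rho0 / (rho0 + rho1)
assert abs(y_grid - y_star) < 1e-4
print(y_grid, y_star)   # both approximately 0.7778
```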