MA3K1 Mathematics of Machine Learning April 10, 2021
Solutions
Solution (1) We assume that there is an input space $\mathcal{X}$, an output space $\mathcal{Y}$, and a
probability distribution on $\mathcal{X} \times \mathcal{Y}$. Given a map $h \colon \mathcal{X} \to \mathcal{Y}$ and a loss function $L$
defined on $\mathcal{Y} \times \mathcal{Y}$, the generalization risk of $h$ is the expected value
\[
R(h) = \mathbb{E}[L(h(X), Y)],
\]
where $(X, Y)$ is distributed on $\mathcal{X} \times \mathcal{Y}$ according to the given distribution. The generalization risk gives an indication of how $h$ “performs”, on average, on unseen data.
If we observe $n$ samples $(x_1, y_1), \ldots, (x_n, y_n)$ drawn from this distribution, we can
construct a classifier $\hat{h}$ by minimizing the empirical risk
\[
\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i)
\]
over a class of functions $\mathcal{H}$. When the class $\mathcal{H}$ is very large (for example, if it consists
of all possible functions from $\mathcal{X}$ to $\mathcal{Y}$), then we can find a function $\hat{h}$ for which the
empirical risk, $\hat{R}(\hat{h})$, is very small or even zero (if we simply “memorise” the observed
data). Such a function $\hat{h}$ can have a large generalization risk: it has been adapted too
closely to the observed data, at the expense of not generalizing well to unobserved data.
This is called overfitting.
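A small numerical illustration of this phenomenon (my own sketch, with synthetic data in which the labels are pure noise; the function and variable names below are my own choices, not from the question): a classifier that simply memorises the training set achieves zero empirical risk, while its generalization risk stays near $1/2$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Training data: X uniform on [0, 1], labels are fair coin flips (pure noise),
# so no classifier can have generalization risk below 1/2.
X_train = rng.random(n)
y_train = rng.integers(0, 2, size=n)

# "Memorising" classifier: look up the label of a training point if seen,
# otherwise predict 0.  It drives the empirical risk to zero.
lookup = dict(zip(X_train.tolist(), y_train.tolist()))
def h_memorise(x):
    return lookup.get(x, 0)

emp_risk = np.mean([h_memorise(x) != y for x, y in zip(X_train, y_train)])

# Fresh data from the same distribution: the memoriser has nothing to exploit.
X_test = rng.random(10_000)
y_test = rng.integers(0, 2, size=10_000)
gen_risk = np.mean([h_memorise(x) != y for x, y in zip(X_test, y_test)])

print(f"empirical risk: {emp_risk:.3f}")   # 0.000
print(f"test-set risk:  {gen_risk:.3f}")   # approximately 0.5
\end{verbatim}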
Solution (2) The Markov inequality states that, given $t > 0$ and a non-negative
random variable $X$, we have
\[
\mathbb{P}(X \geq t) \leq \frac{\mathbb{E}[X]}{t}.
\]
Given any $t \in \mathbb{R}$ and $\lambda \geq 0$, the map $x \mapsto e^{\lambda x}$ is non-decreasing and $e^{\lambda t} > 0$, and hence
\[
\mathbb{P}(X \geq t) \leq \mathbb{P}(e^{\lambda X} \geq e^{\lambda t}) \leq \frac{\mathbb{E}[e^{\lambda X}]}{e^{\lambda t}},
\]
where the last inequality follows by applying Markov's inequality to the non-negative random variable $e^{\lambda X}$.
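As a remark, this is the first step of the Chernoff bounding method: since the bound holds for every $\lambda \geq 0$, we may optimize over $\lambda$ to obtain
\[
\mathbb{P}(X \geq t) \leq \inf_{\lambda \geq 0} e^{-\lambda t}\, \mathbb{E}\big[e^{\lambda X}\big].
\]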
Solution (3) (a) The generalization risk is the expected value
\[
R(h) = \mathbb{E}[\mathbf{1}\{h(X) \neq Y\}] = \mathbb{P}(h(X) \neq Y).
\]
The empirical risk is
\[
\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{h(X_i) \neq Y_i\}.
\]
(b) Assume that we have $n$ random samples $(X_1, Y_1), \ldots, (X_n, Y_n)$ that are independent and identically distributed. For each fixed $h \in \mathcal{H}$, we have the bound
\[
|\hat{R}(h) - R(h)| \leq C(n, \delta),
\]
with probability at least $1 - \delta$, by assumption. Now if $\hat{h}$ is the minimizer of $\hat{R}(h)$, that is,
\[
\hat{h} = \operatorname*{arg\,min}_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{h(X_i) \neq Y_i\},
\]
then $\hat{h}$ is also a random variable: it depends on the random data $(X_i, Y_i)$. Hence, in the
expression $|\hat{R}(\hat{h}) - R(\hat{h})|$, the classifier $\hat{h}$ is not fixed and varies with the data $(X_i, Y_i)$.
The bound for a fixed $h$ therefore does not apply directly, and the deviation may be larger.
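The following simulation (my own construction, using a finite class of threshold classifiers and labels that are independent of the inputs, so that $R(h) = 1/2$ for every $h$) illustrates the point: the deviation $|\hat{R}(\hat{h}) - R(\hat{h})|$ for the empirical minimizer is systematically larger than the deviation for a fixed $h$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, K, trials = 100, 200, 500

# Finite class of threshold classifiers h_t(x) = 1{x > t} on [0, 1].
thresholds = np.linspace(0.0, 1.0, K)

gap_fixed, gap_erm = [], []
for _ in range(trials):
    X = rng.random(n)
    Y = rng.integers(0, 2, size=n)      # labels independent of X
    # Since Y is a fair coin independent of X, R(h) = 1/2 for every h in H.
    preds = (X[None, :] > thresholds[:, None]).astype(int)   # K x n predictions
    emp_risks = (preds != Y[None, :]).mean(axis=1)           # R_hat(h) for each h
    gap_fixed.append(abs(emp_risks[0] - 0.5))                # one fixed h
    gap_erm.append(abs(emp_risks.min() - 0.5))               # h chosen by ERM

print(f"average |R_hat(h) - R(h)| for a fixed h:   {np.mean(gap_fixed):.3f}")
print(f"average |R_hat(h_hat) - R(h_hat)| for ERM: {np.mean(gap_erm):.3f}")
\end{verbatim}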
Solution (4) (a) The regression function for binary classification is
\[
f(x) = \mathbb{E}[Y \mid X = x] = \mathbb{P}(Y = 1 \mid X = x).
\]
The Bayes classifier can be defined as
\[
h^*(x) = \begin{cases} 1 & \text{if } f(x) > 1/2, \\ 0 & \text{otherwise.} \end{cases}
\]
(b) To compute the Bayes classifier, we use Bayes' rule:
\[
\mathbb{P}(Y = 1 \mid X = x) = \frac{\rho_1(x)\,\mathbb{P}(Y = 1)}{\rho(x)} = \frac{p\,\rho_1(x)}{(1 - p)\rho_0(x) + p\,\rho_1(x)}.
\]
We can rearrange:
\[
\frac{p\,\rho_1(x)}{(1 - p)\rho_0(x) + p\,\rho_1(x)} > \frac{1}{2} \iff \frac{\rho_1(x)}{\rho_0(x)} > \frac{1 - p}{p}.
\]
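As a concrete instance of this rule (the densities and prior below are my own illustrative choices, not given in the question): if $\rho_0$ is the density of $N(0, 1)$ and $\rho_1$ that of $N(1, 1)$, the likelihood ratio is $\rho_1(x)/\rho_0(x) = e^{x - 1/2}$, so the Bayes classifier is a threshold rule. The sketch below checks this numerically.
\begin{verbatim}
import numpy as np

# Hypothetical class-conditional densities rho_0 = N(0, 1), rho_1 = N(1, 1),
# and prior p = P(Y = 1).  These numbers are illustrative, not from the question.
p = 0.3

def rho(x, mean):
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

def bayes_classifier(x):
    # h*(x) = 1  iff  rho_1(x) / rho_0(x) > (1 - p) / p
    return (rho(x, 1.0) / rho(x, 0.0) > (1 - p) / p).astype(int)

# For these Gaussians the likelihood ratio is exp(x - 1/2), so the rule is
# equivalent to x > 1/2 + log((1 - p) / p); we can check this on a grid.
xs = np.linspace(-3, 4, 1001)
threshold = 0.5 + np.log((1 - p) / p)
assert np.array_equal(bayes_classifier(xs), (xs > threshold).astype(int))
print(f"Bayes rule: predict 1 iff x > {threshold:.3f}")
\end{verbatim}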
Solution (5) A hypothesis class $\mathcal{H}$ is called PAC-learnable if there exists a classifier
$\hat{h} \in \mathcal{H}$ depending on $n$ random samples $(X_i, Y_i)$, $i \in \{1, \ldots, n\}$, and a polynomial
function $p(x, y, z, w)$, such that for any $\epsilon > 0$ and $\delta \in (0, 1)$, and for all distributions on $\mathcal{X} \times \mathcal{Y}$,
\[
\mathbb{P}\left( R(\hat{h}) \leq \inf_{h \in \mathcal{H}} R(h) + \epsilon \right) \geq 1 - \delta
\]
holds whenever $n \geq p(1/\epsilon, 1/\delta, \mathrm{size}(\mathcal{X}), \mathrm{size}(\mathcal{H}))$.
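For illustration (this is a standard fact and not required here): for the $0$-$1$ loss, any finite class $\mathcal{H}$ is PAC-learnable by empirical risk minimization. A Hoeffding inequality combined with a union bound over $\mathcal{H}$ shows that the condition above holds whenever
\[
n \geq \frac{2}{\epsilon^2}\left( \log|\mathcal{H}| + \log\frac{2}{\delta} \right),
\]
which is in particular polynomial in $1/\epsilon$, $1/\delta$ and $\log|\mathcal{H}|$.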
Solution (6) (a) A function $h \in \mathcal{H}$ that implements XOR would correspond to a line
in $\mathbb{R}^2$ that separates the points $\{(0, 0), (1, 1)\}$ from the points $\{(0, 1), (1, 0)\}$. Any line
$h(x) = w^T x + b$ that has $(0, 0)$ on one side, say $h(0, 0) < 0$, has to have $b < 0$. Assume
that $h$ separates $(0, 0)$ from $(0, 1)$ and $(1, 0)$. Then $h(1, 0) > 0$ implies $w_1 + b > 0$, and
$h(0, 1) > 0$ implies that $w_2 + b > 0$. In particular, if we sum these two expressions, we
get $w_1 + w_2 + 2b > 0$. But since $b < 0$, this implies that $h(1, 1) = w_1 + w_2 + b > 0$.
This shows that any line separating $(0, 0)$ from $(0, 1)$ and $(1, 0)$ also separates $(0, 0)$
from $(1, 1)$, and hence XOR is not possible via linear classifiers.
(b) The problem amounts to finding the ways in which the corners of a square can
be separated by lines. There are $16 = 2^4$ possible binary functions on the four corners:
x1 x2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
The task is to figure out which of these can be implemented by a linear classifier.
Clearly, for every such function $f$ that can be implemented by linear separation, the
inverted function $1 - f$ can also be implemented (if $f(x) = \mathbf{1}\{h(x) > 0\}$, then $1 - f(x) = \mathbf{1}\{-h(x) > 0\}$, after perturbing the line slightly if necessary so that no corner lies exactly on it). Cases 1 and 16 can clearly be implemented (take any line that has
all the points on one side). We also get all the cases in the above table with exactly one
1 or exactly one 0 by a line that has one point on one side and all the others on the other
side (like $x_2 - x_1 + 1/2$, for example). This makes another 8 cases. By considering
conditions of the form
\[
x_i - 1/2 > 0, \qquad x_i - 1/2 < 0, \qquad i \in \{1, 2\},
\]
we get four additional cases, taking the total number of functions that can be implemented using $\mathcal{H}$ to 14. The remaining two cases are XOR and $1 - \mathrm{XOR}$, which cannot be
implemented using $\mathcal{H}$, as seen in Part (a).
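The count of 14 can also be verified by brute force. The following sketch (my own check; the weight grid is an ad-hoc choice that happens to be fine enough to realize every achievable labelling of the four corners) enumerates the label patterns produced by lines on the corners and confirms that only XOR and its complement are missing.
\begin{verbatim}
import itertools
import numpy as np

corners = np.array([(0, 0), (0, 1), (1, 0), (1, 1)])

# Small grid of candidate lines h(x) = w1*x1 + w2*x2 + b.  The grid is an
# ad-hoc choice, but it is fine enough to realise every linearly separable
# labelling of the four corners.
weights = [-1, 0, 1]
biases = [-1.5, -0.5, 0.5, 1.5]

patterns = set()
for w1, w2, b in itertools.product(weights, weights, biases):
    labels = tuple(int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in corners)
    patterns.add(labels)

print(f"linearly separable labellings found: {len(patterns)}")   # 14
xor = (0, 1, 1, 0)
print(f"XOR realisable: {xor in patterns}")                               # False
print(f"1 - XOR realisable: {tuple(1 - v for v in xor) in patterns}")     # False
\end{verbatim}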
Solution (7) Let $\mathcal{H}$ be a set of classifiers $h \colon \mathcal{X} \to \mathcal{Y}$. The growth function $\Pi_{\mathcal{H}}$ is
defined as
\[
\Pi_{\mathcal{H}}(n) = \max_{x \in \mathcal{X}^n} |\{h(x) : h \in \mathcal{H}\}|,
\]
where $h(x) = (h(x_1), \ldots, h(x_n))$ if $x = (x_1, \ldots, x_n)$. The VC dimension of $\mathcal{H}$ is
\[
\mathrm{VC}(\mathcal{H}) = \max\{n \in \mathbb{N} : \Pi_{\mathcal{H}}(n) = 2^n\}.
\]
We can take the set $x = (e_1, \ldots, e_d)$, where $e_i$ is the $i$-th unit vector. Then
\[
|\{h(x) : h \in \mathcal{H}\}| = 2^d,
\]
since for any $0$-$1$ vector $\delta$ of length $d$, taking the function $h_I$, where $I$ consists of
those indices $i$ for which $\delta_i = 1$, gives the sign pattern $h_I(x) = \delta$. This shows that the
VC dimension is at least $d$.
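The growth function can be computed directly for small finite classes. The sketch below uses a toy class of threshold classifiers on the real line (a hypothetical example, not the class $\mathcal{H}$ of the question) to illustrate the definition.
\begin{verbatim}
import numpy as np

def num_labelings(classifiers, sample):
    """Size of {(h(x_1), ..., h(x_n)) : h in H} for a finite class H."""
    return len({tuple(h(x) for x in sample) for h in classifiers})

# Toy class: threshold classifiers h_t(x) = 1{x > t} on the real line.
thresholds = np.linspace(-2, 2, 401)
H = [lambda x, t=t: int(x > t) for t in thresholds]

# On n distinct points, threshold classifiers realise at most n + 1 labellings,
# so they shatter a single point (2 labellings) but never two points (3 < 4).
print(num_labelings(H, [0.0]))        # 2  -> one point is shattered
print(num_labelings(H, [-1.0, 1.0]))  # 3  -> two points are not shattered
\end{verbatim}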
Solution (8) Set $K = |\mathcal{H}|$. For $z = (z_1, \ldots, z_n)$, with $z_i = (x_i, y_i)$, we can define
the empirical Rademacher complexity as
\[
\hat{R}_z(L \circ \mathcal{H}) = \mathbb{E}_{\sigma}\left[ \sup_{g \in L \circ \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, g(z_i) \right],
\]
where $\sigma_1, \ldots, \sigma_n$ are independent Rademacher variables (uniform on $\{-1, +1\}$) and $g(z_i) = L(h(x_i), y_i)$, and the Rademacher complexity as
\[
R_n(L \circ \mathcal{H}) = \mathbb{E}_{Z}\big[ \hat{R}_Z(L \circ \mathcal{H}) \big].
\]
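The empirical Rademacher complexity can be estimated by Monte Carlo directly from this definition. The sketch below uses a small, hypothetical finite class of threshold classifiers with the $0$-$1$ loss and synthetic data; the sign vectors $\sigma$ are drawn uniformly from $\{-1, +1\}^n$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Synthetic sample z_i = (x_i, y_i) and a small finite class of threshold
# classifiers; g(z_i) = L(h(x_i), y_i) with the 0-1 loss.
x = rng.random(n)
y = rng.integers(0, 2, size=n)
thresholds = np.array([0.2, 0.4, 0.6, 0.8])
losses = ((x[None, :] > thresholds[:, None]).astype(int)
          != y[None, :]).astype(float)               # K x n matrix of g(z_i)

def empirical_rademacher(losses, num_draws=10_000):
    # Average over random sign vectors sigma of sup_g (1/n) sum_i sigma_i g(z_i).
    k, n = losses.shape
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))
    correlations = sigma @ losses.T / n               # num_draws x K
    return correlations.max(axis=1).mean()            # E_sigma of the supremum

print(f"estimated empirical Rademacher complexity: {empirical_rademacher(losses):.3f}")
\end{verbatim}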