MA3K1 Mathematics of Machine Learning April 10, 2021
Test Problems
These problems are indicative of the type of problem you may encounter in the
exam. The questions are a mixture of knowledge-based questions, questions that require
calculation, and some that may be a bit more challenging. This is, however, not the
exam format! The exam itself will consist of one compulsory question worth 40 points
and three further questions worth 30 points each, of which two should be attempted.
Even though the problems in this set are indicative of exam questions, they may
not cover all the material for the exam, nor will the exam cover all the topics touched
upon in this set. It should therefore be used in conjunction with the problem sets from
class and the lecture notes.
1 Statistical Learning Theory
(1) Explain the concept of generalization risk and how it relates to the problem of
overfitting.
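As a quick numerical illustration (not part of the question), the following Python sketch contrasts the empirical risk of polynomial regressors of increasing degree with an estimate of their generalization risk on fresh samples; the toy quadratic target, noise level, and degrees are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Toy distribution: noisy quadratic target on [-1, 1]
    x = rng.uniform(-1, 1, n)
    return x, x**2 + 0.1 * rng.standard_normal(n)

x_train, y_train = sample(20)
x_test, y_test = sample(10_000)   # large fresh sample: proxy for R(h)

for degree in (1, 2, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_risk = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_risk = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: empirical risk {train_risk:.4f}, "
          f"estimated generalization risk {test_risk:.4f}")
```

The degree-15 fit typically drives the empirical risk below the noise level while its estimated generalization risk grows; that widening gap is exactly the overfitting phenomenon the question asks about.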
(2) State the Markov inequality, and use it to show that for any $t \in \mathbb{R}$, $\lambda \geq 0$, and a random variable $X$,
\[ P(X \geq t) \leq e^{-\lambda t} \, E[e^{\lambda X}]. \]
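As a sanity check (not part of the question), the inequality can be tested by Monte Carlo; the sketch below uses a standard normal $X$ with $t = 1.5$ and $\lambda = 1$, all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

t, lam = 1.5, 1.0
samples = rng.standard_normal(1_000_000)  # X ~ N(0, 1), an arbitrary choice

lhs = np.mean(samples >= t)                               # P(X >= t)
rhs = np.exp(-lam * t) * np.mean(np.exp(lam * samples))   # e^{-lam t} E[e^{lam X}]
print(f"P(X >= t) ~ {lhs:.4f} <= {rhs:.4f}")
```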
(3) Let $\mathcal{H}$ be a set of binary classifiers on a space $\mathcal{X}$.
(a) Define the empirical risk $\hat{R}(h)$ with respect to the unit loss function based on $n$ random samples, and the generalization risk $R(h)$ with respect to a probability distribution on $\mathcal{X} \times \mathcal{Y}$.
(b) Find the error in the following argument:
Let $\delta \in [0, 1)$ and assume that for every $h \in \mathcal{H}$ we have a bound
\[ |\hat{R}(h) - R(h)| \leq C(n, \delta) \tag{1.1} \]
with probability at least $1 - \delta$. Let $\hat{h} \in \mathcal{H}$ be an empirical risk minimizer, i.e., $\hat{R}(\hat{h}) = \min_{h \in \mathcal{H}} \hat{R}(h)$. Since (1.1) holds for every $h$ with probability at least $1 - \delta$, it holds in particular for $\hat{h}$ with probability at least $1 - \delta$, and therefore
\[ |\hat{R}(\hat{h}) - R(\hat{h})| \leq C(n, \delta) \]
holds with probability at least $1 - \delta$.
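A small simulation makes the flaw concrete (a hint, not part of the question). Below, labels are fair coin flips and every classifier in a finite class predicts by independent coin flips, so each $h$ has true risk exactly $1/2$; the deviation $|\hat{R}(h) - R(h)|$ then turns out much larger for the data-dependent minimizer $\hat{h}$ than for any $h$ fixed in advance. The class size m and sample size n are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Labels and predictions are independent fair coins, so R(h) = 1/2 for all h
n, m, trials = 50, 1000, 200
gaps_fixed, gaps_erm = [], []
for _ in range(trials):
    y = rng.integers(0, 2, n)
    preds = rng.integers(0, 2, (m, n))
    emp_risk = np.mean(preds != y, axis=1)
    gaps_fixed.append(abs(emp_risk[0] - 0.5))    # one h fixed in advance
    gaps_erm.append(abs(emp_risk.min() - 0.5))   # h chosen by ERM on the data
print(f"typical gap, fixed h: {np.mean(gaps_fixed):.3f}")
print(f"typical gap, ERM h:   {np.mean(gaps_erm):.3f}")  # noticeably larger
```

The per-classifier bound (1.1) does not transfer to $\hat{h}$ precisely because $\hat{h}$ depends on the sample; a uniform bound over all of $\mathcal{H}$ is needed.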
(4) Given an input space $\mathcal{X}$ and output space $\mathcal{Y} = \{0, 1\}$, consider a random variable $(X, Y)$ distributed on $\mathcal{X} \times \mathcal{Y}$. Assume that $X$ has a continuous distribution with density $\rho$, that the conditional density of $X$ given $Y = 1$ is $\rho_1(x) := \rho_{Y=1}(x)$, and that the conditional density of $X$ given $Y = 0$ is $\rho_0(x) := \rho_{Y=0}(x)$.
(a) Define the regression function and the notion of a Bayes classifier $h^* : \mathcal{X} \to \mathcal{Y}$ for binary classification.
(b) Assuming that $P(Y = 1) = p$, show that the Bayes classifier for the problem above is given by
\[ h^*(x) = \begin{cases} 1 & \text{if } \frac{\rho_1(x)}{\rho_0(x)} > \frac{1-p}{p} \\ 0 & \text{else.} \end{cases} \]
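For intuition (not part of the question), here is the claimed classifier evaluated in a toy model with $\rho_1 = N(1, 1)$, $\rho_0 = N(-1, 1)$ and $p = 0.3$, all arbitrary choices; the predicted label flips where the density ratio crosses $(1-p)/p$.

```python
import numpy as np

p = 0.3  # P(Y = 1); the two-Gaussian model below is an arbitrary toy choice

def density(x, mu):
    # N(mu, 1) density
    return np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

def bayes(x):
    # h*(x) = 1 iff rho_1(x) / rho_0(x) > (1 - p) / p
    return (density(x, 1.0) / density(x, -1.0) > (1 - p) / p).astype(int)

xs = np.linspace(-1, 1, 9)
for x, label in zip(xs, bayes(xs)):
    print(f"x = {x:+.2f}  ->  h*(x) = {label}")
```

In this model the ratio is $e^{2x}$, so the label flips at $x = \tfrac{1}{2}\log\tfrac{1-p}{p} \approx 0.42$, which the printout confirms.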
(5) Explain the concept of Probably Approximately Correct (PAC) learning.
(6) Consider the following table, describing the function $\mathrm{XOR} : \{0, 1\}^2 \to \{0, 1\}$ (this is the “exclusive or” function).
$x_1$ $x_2$ $y$
0 0 0
0 1 1
1 0 1
1 1 0
Let $\mathcal{H}$ be the set of linear classifiers on $\mathbb{R}^2$, where each $h \in \mathcal{H}$ is of the form
\[ h(x) = \begin{cases} 1 & \text{if } w^\top x + b > 0 \\ 0 & \text{if } w^\top x + b \leq 0 \end{cases} \]
for some $w \in \mathbb{R}^2$ and $b \in \mathbb{R}$.
(a) Show that there is no $h \in \mathcal{H}$ such that $h(x) = \mathrm{XOR}(x)$ for all $x \in \{0, 1\}^2$.
(b) Determine the number of functions $f : \{0, 1\}^2 \to \{0, 1\}$ that can be represented by elements of $\mathcal{H}$.
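Both parts can be explored by brute force (an illustration, not a proof: a finite parameter sweep only certifies which label patterns are realizable, so it gives a lower bound for part (b)). The grid of weights and biases below is an arbitrary choice.

```python
import itertools
import numpy as np

X = np.array(list(itertools.product([0, 1], repeat=2)))  # the four inputs

def h(w, b):
    # Linear classifier h(x) = 1 iff w.x + b > 0, as a pattern on the 4 inputs
    return tuple((X @ w + b > 0).astype(int))

# Coarse sweep over (w, b); small integer-and-half values suffice on 4 points
grid = np.linspace(-2, 2, 17)
realizable = {h(np.array([w1, w2]), b)
              for w1 in grid for w2 in grid for b in grid}
print(len(realizable), "label patterns realizable (a lower bound for part (b))")
print("XOR realizable?", (0, 1, 1, 0) in realizable)
```

The sweep prints how many of the 16 possible label patterns it realizes, and the XOR pattern never appears, consistent with part (a).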
(7) Let $\mathcal{X} = \{0, 1\}^d \subset \mathbb{R}^d$. For every $I \subset [d] = \{1, \dots, d\}$, consider a function $h_I : \mathcal{X} \to \{0, 1\}$, where
\[ h_I(x) = \begin{cases} 1 & \text{if } \{i : x_i = 1\} \subset I \\ 0 & \text{else.} \end{cases} \]
Let $\mathcal{H}$ denote the set of all such functions. Define the notion of VC dimension, and show that the VC dimension of $\mathcal{H}$ is at least $d$.
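The standard witness for the lower bound is the set of unit vectors $\{e_1, \dots, e_d\}$: for any desired labelling, the index set $I = \{i : y_i = 1\}$ realizes it. A short script (not part of the question) verifies the shattering; the value $d = 4$ is arbitrary.

```python
import itertools
import numpy as np

d = 4  # arbitrary small dimension for the check
E = np.eye(d, dtype=int)  # rows are the standard basis vectors e_1, ..., e_d

def h_I(x, I):
    # h_I(x) = 1 iff {i : x_i = 1} is contained in I
    return int(set(np.flatnonzero(x)) <= set(I))

# For any labelling of {e_1, ..., e_d}, the choice I = {i : y_i = 1} realizes it
for labels in itertools.product([0, 1], repeat=d):
    I = {i for i, y in enumerate(labels) if y == 1}
    assert tuple(h_I(e, I) for e in E) == labels
print(f"all {2**d} labellings realized: VC dimension of H is at least {d}")
```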
(8) Define the empirical Rademacher complexity and the Rademacher complexity of a finite set of classifiers $\mathcal{H}$ with values in $\{-1, 1\}$ and with respect to an arbitrary loss function $L$ taking values in $[0, 1]$. State Massart's Lemma and use it to derive the Rademacher complexity bound
\[ \mathcal{R}_n(L \circ \mathcal{H}) \leq \sqrt{\frac{2 \log(|\mathcal{H}|)}{n}}. \]
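For a feel of the quantities involved (not part of the question), the empirical Rademacher complexity of a finite class can be estimated by direct Monte Carlo over sign vectors and compared with the bound above; here the loss values $L(h, z_i) \in [0, 1]$ are simply drawn at random, an arbitrary stand-in for a real class and sample.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in data: loss values L(h, z_i) in [0, 1] for |H| = m classifiers
# on n sample points (drawn at random here purely for illustration)
n, m = 100, 50
losses = rng.uniform(0, 1, (m, n))

# Empirical Rademacher complexity: E_sigma [ sup_h (1/n) sum_i sigma_i L(h, z_i) ]
sigmas = rng.choice([-1, 1], size=(10_000, n))
estimate = np.mean(np.max(sigmas @ losses.T / n, axis=1))
bound = np.sqrt(2 * np.log(m) / n)
print(f"Monte Carlo estimate {estimate:.4f} <= bound {bound:.4f}")
```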
(9) [Not relevant] Consider the unit $\ell_1$-ball
\[ B_1^d := B_1^d(0, 1) = \{x \in \mathbb{R}^d : \|x\|_1 \leq 1\}, \]
where $\|x\|_1 = \sum_{i=1}^d |x_i|$ is the $\ell_1$-norm. Let $d_1$ be the metric $d_1(x, y) = \|x - y\|_1$. Define the notion of an $\epsilon$-net and the covering number, and show that for $\epsilon \in (0, 1)$, the covering number $N(B_1^d, d_1, \epsilon)$ is bounded by
\[ N(B_1^d, d_1, \epsilon) \leq \left( \frac{c}{\epsilon} \right)^d \]
for some constant $c$ independent of $d$ and $\epsilon$. You may use that the volume of the $\ell_1$-ball of radius $r$ scales as $\operatorname{vol} B_1^d(0, r) = r^d \operatorname{vol} B_1^d$.
2 Optimization
(10) For each of the functions below, state (with justification) whether it is convex or
not.
(a) $f(x) = \log(1 - e^{-x})$ for $x > 0$;
(b) $f(x) = |x_1| + x_2^2$ for $x \in \mathbb{R}^2$;
(c) $f(x) = x_1^2 \sin(x_2)$ for $x \in \mathbb{R}^2$;
(d) $f(x) = \min\{\max\{x_1, x_2\}, 0\}$ for $x \in \mathbb{R}^3$;
(e) $f(x) = \mathbb{1}\{x \geq 1\}$ for $x \in \mathbb{R}$.
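A random midpoint test (not part of the question, and no substitute for the justification asked for) can expose non-convexity: finding $x, y$ with $f\!\left(\frac{x+y}{2}\right) > \frac{f(x)+f(y)}{2}$ disproves convexity, while failing to find such a pair is only numerical evidence. The sketch below runs the test for parts (b) and (d); the search box and trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

def worst_midpoint_gap(f, dim, trials=100_000, box=3.0):
    # Largest observed f((x+y)/2) - (f(x)+f(y))/2 over random pairs; a positive
    # value disproves convexity, a non-positive one is only numerical evidence
    X = rng.uniform(-box, box, (trials, dim))
    Y = rng.uniform(-box, box, (trials, dim))
    return np.max(f((X + Y) / 2) - (f(X) + f(Y)) / 2)

f_b = lambda x: np.abs(x[:, 0]) + x[:, 1] ** 2                # part (b)
f_d = lambda x: np.minimum(np.maximum(x[:, 0], x[:, 1]), 0)   # part (d)
print("(b) worst midpoint gap:", worst_midpoint_gap(f_b, 2))  # stays <= 0
print("(d) worst midpoint gap:", worst_midpoint_gap(f_d, 3))  # goes positive
```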
(11) For each of the sets below, determine whether they are convex or not.
(a) $C = \{x \in \mathbb{R}^2 : \|x\|_1 = 1\}$;
(b) $C = \{x \in \mathbb{R}^2 : x_1^2 - x_2^2 = 0\}$;
(c) $C = \{x \in \mathbb{R}^2 : x_2 \geq e^{x_1}\}$;
(d) $C = \{A \in \mathbb{R}^{n \times n} : \det(A) \neq 0\}$.
(12) Let $f \in C^1(\mathbb{R}^d)$ be a convex function and $C \subset \mathbb{R}^d$ a convex set. Show that $x^*$ is an optimal point of the optimization problem
\[ \text{minimize } f(x) \quad \text{subject to } x \in C \]
if and only if for all $y \in C$, $\langle \nabla f(x^*), y - x^* \rangle \geq 0$. Show, by counterexample, that the statement is not true if $C$ is not required to be convex.
(13) Let $f(x) = (x_1^2 + x_2)^2$ be a function on $\mathbb{R}^2$. Show that the direction $p = (-1, 1)^\top$ is a descent direction at $x_0 = (1, 0)^\top$, and determine a step length $\alpha$ that minimizes $f(x_0 + \alpha p)$.
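The calculation can be cross-checked numerically (not part of the question): the directional derivative $\langle \nabla f(x_0), p \rangle$ should come out negative, and a grid scan of $\phi(\alpha) = f(x_0 + \alpha p)$ should locate the minimizing step length. The grid range is an arbitrary choice.

```python
import numpy as np

f = lambda x: (x[0] ** 2 + x[1]) ** 2
grad = lambda x: np.array([4 * x[0] * (x[0] ** 2 + x[1]),
                           2 * (x[0] ** 2 + x[1])])

x0 = np.array([1.0, 0.0])
p = np.array([-1.0, 1.0])
print("directional derivative:", grad(x0) @ p)  # negative => descent direction

# Crude exact line search: scan phi(alpha) = f(x0 + alpha p) on a fine grid
alphas = np.linspace(0, 2, 2001)
phi = np.array([f(x0 + a * p) for a in alphas])
print("approximate minimizer alpha:", alphas[phi.argmin()])
```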