Exam CS4400: Deep Reinforcement Learning
19-04-2023 | 13:30–16:30
Student name:
Student number:
• This is a closed-book individual examination with 9 questions and a total of 100 points.
• Do not open the exam before the official start of the examination.
• If you feel sick or otherwise unable to take the examination, please indicate this before
the exam starts.
• The examination lasts 180 minutes after the official start.
• This gives you a bit under 2 minutes per point. Use your time carefully!
• You can hand in your exam solution any time until 15 minutes before the end of the
exam and leave the examination room quietly. In the last 15 minutes, no one may leave the
examination room, so that other students can concentrate on finishing their exam.
• Only one student can visit the bathroom at the same time. In the last 15 minutes, no
bathroom visits are possible.
• Use of course books, readers, notes, and slides is not permitted.
• Use of (graphical) calculators or mobile computing devices (including mobile phones) is
not permitted.
• Write down your name and student number above.
• Write your student number on each sheet of the exam after the exam has started.
• You can write your answer on the free space under each question.
• If you need more space, use the back of another exam-sheet and write where to find the
answer under the question. Ask for additional empty pages if you need them.
• Use pens with black or blue ink. Pencils and red ink are not allowed!
• Clearly cross out invalid answers. If two answers are given, we consider the one with
fewer points!
• Write clearly, use correct English, and avoid verbose explanations. Giving irrelevant
information may lead to a reduction in your score.
• This exam covers all information on the slides of the course, the tutorials and everything
discussed in lectures.
• This exam assumes a familiarity with the stated background knowledge of the course.
• The total number of pages of this exam is 11 (excluding this front page).
• Exam prepared by Wendelin Böhmer. ©2023 TU Delft.
Question 1 (multiple choice): (40 points)
Please mark only the correct answer with a cross in the corresponding square. If you wish to unmark a marked
answer, fill the entire square completely and draw an empty square next to it.
Only one answer per question is correct. You will receive 2 points per correct answer, except when
multiple squares are marked. Wrong answers yield no points, but are also not punished. Good luck!
1.1: Why is the following loss computation considered efficient for deep learning?
loss = ((f(x) - y) ** 2).mean()
because the mean() function automatically masks out non-existing values.
because the entire batch is computed in one forward pass.
because the mean-squared error minimizes the difference between f(x) and y.
because all operations are differentiable and can therefore be used in gradient descent.
1.2: Which of the following is a parameter of the [Link] class?
hidden_size
output_size
num_heads
transposed
1.3: What is the input to a Multi-headed Attention Layer (MHA)?
a tensor with one vector per sample
a tensor with an ordered set of vectors per sample
a tensor with an unordered set of vectors per sample
tensors representing one graph, with nodes annotated by vectors, per sample
1.4: Which of the following losses has not been discussed in the lecture?
−E[Q(s, a) | (s, a) ∼ D]
−E[ ln π(a|s) | (s, a) ∼ D]
−E[Q(s, a) ln π(a|s) | (s, a) ∼ D]
−E[Q(s, π(s, a)) | s ∼ D, a ∼ N (0, 1)]
1.5: Which of the following approaches updates the target network the slowest?
Semi-gradient TD-learning
Soft target updates
Deep Q-networks
Neural Fitted Q-iteration
1.6: Which property of RL does not violate classical ML assumptions?
regression targets are non-stationary
states have to fulfill the Markov assumption
neural networks forget values of unsampled states
transitions within episodes are not sampled i.i.d.
1.7: Changing which parameter does not make the DQN algorithm more sample efficient?
decrease the number of gradient updates per sampled trajectory
decrease the number of environmental steps between gradient updates
decrease the time between target network updates
decrease the time during which the exploration is decayed
1.8: Which class is part of the standard RL software architecture from the lectures?
Trainer
Executor
Explorer
Controller
1.9: Which of the following definitions is a TD(λ) value target?
(1 − λ) Σ_{k=0}^{∞} λ^k r_{t+k}
(1 − λ) Σ_{k=0}^{n−1} λ^k r_{t+k} + λ^n V^π(s_{t+n})
Σ_{k=0}^{n−1} (λγ)^k r_{t+k} + γ^n (1 − λ) V^π(s_{t+n})
Σ_{k=0}^{∞} (λγ)^k ( r_{t+k} + γ (1 − λ) V^π(s_{t+k+1}) )
1.10: Which term from the on-policy-gradient loss Lπ did we not ignore or approximate to derive the
off-policy actor-critic loss Lµ in the lecture?
L^π[θ] := − Σ_{t=0}^{n−1} ∬ ξ_t^π(s_t) γ^t ( Q^π(s_t, a_t) − V^π(s_t) ) ∇_θ π_θ(a_t|s_t) ds_t da_t

ξ_t^π(s_t) / ξ_t^µ(s_t)
π_θ(a_t|s_t) / µ(a_t|s_t)
V^π(s_t)
Q^π(s_t, a_t)
1.11: Which algorithm cannot work with discrete actions without the reparametrization trick?
DRQN
Reinforce
PPO
DDPG
1.12: How does SAC differ from TD3?
SAC can use an experience replay buffer, TD3 can not
SAC uses stochastic policies via the reparametrization trick, TD3 does not
SAC trains two Q-value functions to be pessimistic, TD3 does not
SAC regularizes actions with clipped noise, TD3 does not
1.13: Which of the following uncertainties is irreducible?
Aleatoric
Epistemic
Posterior variance
Upper confidence bounds
1.14: What is the value of the initial state of a Markov chain with 10 transitions in a row, where the
last state is terminal and each transition yields a reward of 1?
1 / (1 − γ)
(1 − γ^9) / (1 − γ)
(1 − γ^10) / (1 − γ)
(1 − γ^11) / (1 − γ)
Remark: The finite geometric series is Σ_{t=0}^{n} γ^t = (1 − γ^{n+1}) / (1 − γ). There are 10 transitions with a
reward of 1 each, summed up: Σ_{t=0}^{9} γ^t = (1 − γ^{10}) / (1 − γ).
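A quick numeric check of the remark (γ = 0.9 is an arbitrary illustrative choice):

gamma = 0.9
print(sum(gamma**t for t in range(10)))   # 6.5132...
print((1 - gamma**10) / (1 - gamma))      # same value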
1.15: Which of the following is not a main challenge of online deep RL?
non-stationarity
catastrophic forgetting
no exploration
learning stability
Remark: Many students gave the answers "learning stability" and "catastrophic forgetting", which are less
accurate, but one can interpret them as correct.
1.16: In offline RL, restricting available actions performs well when ...
a suitable distance measure can be found
a suitable epistemic uncertainty measure can be found
the sample distribution is close to random behavior
the sample distribution is close to optimal behavior
1.17: How many outputs does an LSTM network with h memory cells require to represent a stochastic
Gaussian policy in a partially observable environment with continuous actions a ∈ Rk ?
k+h
k + 2h
2k + h
2k + 2h
1.18: Which games cannot be formulated as a POSG?
games with sequential moves
games with continuous actions
games with non-stationary rules
games with stochastic moves
1.19: Which of the following MARL algorithms can overcome relative overgeneralization?
VDN
MADDPG
DCG
QMIX
1.20: In which of the following multi-task techniques does the policy have access to the task-id?
DQN
HER
DR
Sim2Real
Question 2: (6 points)
In 6 sentences or less, explain active domain randomization and name one approach to do it.
Solution
In domain randomization, random environmental parameters are drawn every episode. Active domain
randomization tries to choose the parameters that are most informative for the learning process. An
example is Bayesian optimization, which learns the Bayesian posterior over the task space and samples
parameters with high probability. Another example is adversarial interference, where another agent is
trained, like in robust RL, to choose the task/parameters that are hardest for the current agent.
Rubric:
• 2 points for mentioning randomly drawn environments/tasks/parameters. Only 1 point if just
noise is added to the state.
• 2 points for mentioning that one wants to choose the most informative parameters. Only 1 point if standard
domain randomization is described.
• 2 points for either Bayesian optimization or adversarial interference, even if they are called
slightly differently.
Question 3: (6 points)
(a) [3 points] Design a cooperative two-player normal-form game that exhibits relative overgeneralization
on average, by defining the collaborative reward:
r(a1, a2) :=          a21    a22    a23
              a11     ___    ___    ___
              a12     ___    ___    ___
              a13     ___    ___    ___
(b) [3 points] Name a training method that can solve games exhibiting relative overgeneralization and
explain, using your game as an example, why it works. You do not need to solve your game, just explain
how it could be done.
Solution
Many games are possible, but for simplicity, here is predator-prey:
r(a1, a2) :=          a21    a22    a23
              a11      +1     −2     −2
              a12      −2      0      0
              a13      −2      0      0
The same game with a punishment below -1 is an example where both players almost exhibit RO.
One can solve relative overgeneralization (RO) with optimistic return methods. Here agents remember
the best past reward for each action or ignore rewards that are lower than the current Q-value. As a
result, each agent always remembers the best action of the other agent and RO cannot occur anymore.
Another method is higher-order factorization, like Deep Coordination Graphs. Here the joint Q-value
function can be learned by pairwise functions, which do not exhibit RO. The agents can agree in a
decentralized way on the best joint action by using the max-sum algorithm with message passing.
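A quick numeric check that the example game above exhibits relative overgeneralization (assuming, for illustration, a uniformly random partner): the jointly optimal action a11 has the lowest average reward.

r = [[+1, -2, -2],
     [-2,  0,  0],
     [-2,  0,  0]]
avg = [sum(row) / len(row) for row in r]    # average reward of a11, a12, a13 under a uniform partner
print([round(v, 2) for v in avg])           # [-1.0, -0.67, -0.67]: a11 looks worst on average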
Rubric:
• 2 point for a game where at least one agent exhibits RO, or where both agents almost exhibit RO
• 1 point for a game where both agents exhibit RO, including on the boundary (-1 above)
• 1 point for mentioning either optimistic return or higher-order factorization, or any method that
works against RO in the example.
• 2 points for a good explanation of the mentioned concept
Question 4: (6 points)
Which deep reinforcement learning algorithm and neural network architecture from the lectures would
you use to train two robots, with cameras as eyes, playing ping-pong against each other? Give at least
one argument to justify each of your choices.
Hint: it is sufficient to name and justify the required module-types of the neural network, you do not
need to draw how they are connected. Linear layers are always present, so they do not have to be named.
Solution
The environment is partially observable (camera eyes), the observations are images (require CNNs), and
the actions are continuous (motors). The two robots play against each other (zero-sum MARL game)
and can/will therefore not communicate.
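For illustration only, a minimal sketch of a network along these lines, assuming (hypothetically) 84×84 RGB camera frames and 6 continuous motor actions; the exact sizes are arbitrary choices:

import torch.nn as nn

class PingPongPolicy(nn.Module):
    """ CNN for image observations, LSTM against partial observability,
        Gaussian policy head (mean and log-std) for continuous actions. """
    def __init__(self, action_dim=6, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
                                 nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
                                 nn.Flatten())
        self.lstm = nn.LSTM(input_size=32 * 9 * 9, hidden_size=hidden,  # 9x9 feature map for 84x84 frames
                            batch_first=True)
        self.head = nn.Linear(hidden, 2 * action_dim)                   # mean and log-std per action

    def forward(self, frames, hidden_state=None):
        # frames: [batch, time, 3, 84, 84]
        b, t = frames.shape[:2]
        z = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        z, hidden_state = self.lstm(z, hidden_state)
        mean, log_std = self.head(z).chunk(2, dim=-1)
        return mean, log_std, hidden_state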
Rubric:
• 3 points for a MARL algorithm for continuous actions: MADDPG or independent learning
with any single-agent algorithm for continuous actions (TRPO, PPO, DDPG, TD3, SAC). Only
2 points for MARL algorithms for discrete actions (IQL, COMA, VDN, QMIX, QTRAN) or for
not mentioning MARL. 1 point for DCG (no communication) or single-agent DQN.
• 3 points for CNN (for the images), 2 points for LSTM (against partial observability). No points
for Linear, GNN, MHA or others, unless justified (e.g. with slot attention).
• 1 point less if the choice is suboptimal or not justified (minimum 0 points per category).
• No points if the choice is totally wrong or the justification unconnected.
Question 5: (6 points)
The figure to the right plots the training return for Online Q-learning with a replay buffer of the last 100k
transitions (blue, dark) and the same algorithm with
soft-updated target networks (red, bright), learning the
CartPole-v1 task.
(a) [2 points] Explain why the blue agent becomes unstable in the second half of training.
(b) [2 points] Explain how the target network stabilizes
training.
(c) [2 points] Explain the difference between soft and
hard target updates.
Solution
There are of course multiple possible explanations, but the ones we expect here are:
(a) In the second half of training, the initial exploration transitions leave the replay buffer and
catastrophic forgetting leads to uncontrollable drifting of the values of unexecuted actions. Once the wrong
action is selected greedily, it takes a while to unlearn it again. Random deviations, for example due to
the ε-greedy policy, can also lead to states that have been affected by catastrophic forgetting.
(b) Soft-updated target networks slow the change in the network that is used for bootstrapping. The
above problems are therefore kept local and their propagation is delayed long enough that the sampling
policy is not affected (that much).
(c) Soft target updates change the target parameters slightly towards the current network parameters
at every gradient update, whereas hard target updates do not change the target network most of the
time but every now and then replace the target parameters completely with the current parameters.
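As a minimal sketch of the two update rules described in (c), assuming target and model are torch.nn.Module instances with identical architectures (τ and the update period C are illustrative choices):

def soft_update(target, model, tau=0.005):
    # after every gradient step: nudge the target parameters slightly towards the current ones
    for tp, p in zip(target.parameters(), model.parameters()):
        tp.data.mul_(1 - tau).add_(tau * p.data)

def hard_update(target, model):
    # only every C gradient steps: overwrite the target parameters completely
    target.load_state_dict(model.state_dict())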
Rubric:
• 2 points for a coherent explanation of (a).
• no points for explanations that would also apply to target networks.
• 2 points for a coherent explanation of (b).
• no points for explanations that would also apply without target networks.
• 2 points for a coherent explanation of (c).
• only 1 point in either category if the explanation is not coherent or the answer otherwise unclear.
Question 6: (6 points)
Consider a two-player zero-sum normal-form game G with the reward function of player A, rA (aA , aB ):
rA(aA, aB):        B: X    B: Y
        A: X         4      −4
        A: Y         0       2

and the policies of agents i: π_{θi}^i(a) = θi if a = X, and 1 − θi if a = Y, with θi ∈ [0, 1].
(a) [2 points] Give the joint actions of all Nash-equilibria of G or indicate that none exist. You do not
have to justify your answer.
(b) [4 points] Derive the mixed Nash equilibrium for G.
Hint: it suffices to derive the extreme point, you do not have to prove that it is a saddle-point.
Solution
(a) None exist.
Rubric:
• 2 points for the correct answer.
(b) The mixed Nash equilibrium of a zero-sum game is the tuple of stochastic policies (π^A, π^B) that
maximizes the expected return Q w.r.t. π^A and minimizes Q w.r.t. π^B. In the following, let π′ := π^A_{θA}
and π := π^B_{θB}:

  Q := E[ r(a, b) | a ∼ π′, b ∼ π ]
     = π′(X)π(X) r(X,X) + π′(X)π(Y) r(X,Y) + π′(Y)π(X) r(Y,X) + π′(Y)π(Y) r(Y,Y)
     = 4 θA θB − 4 θA (1 − θB) + 2 (1 − θA)(1 − θB)
     = 4 θA θB + 4 θA θB − 4 θA + 2 θA θB − 2 θA − 2 θB + 2
     = 10 θA θB − 6 θA − 2 θB + 2 .

Setting the derivatives to zero reveals the saddle point of Q in parameter space (θA, θB):

  ∂Q/∂θB = 10 θA − 2 = 0   ⇒   θA = 2/10 ,
  ∂Q/∂θA = 10 θB − 6 = 0   ⇒   θB = 6/10 .

The mixed Nash equilibrium is therefore θ = [2/10, 6/10], i.e. a pair of stochastic policies in which
agent A plays X 20% of the time and agent B plays X 60% of the time.
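A quick numeric sanity check of the derived equilibrium (a minimal sketch): at θ = (0.2, 0.6) the expected return is 0.8 and neither player can change it by deviating unilaterally.

def Q(tA, tB):
    # expected return of the game above under the parametrized policies
    return 4*tA*tB - 4*tA*(1 - tB) + 0*(1 - tA)*tB + 2*(1 - tA)*(1 - tB)

print([round(Q(tA, 0.6), 3) for tA in (0.0, 0.5, 1.0)])   # [0.8, 0.8, 0.8]
print([round(Q(0.2, tB), 3) for tB in (0.0, 0.5, 1.0)])   # [0.8, 0.8, 0.8]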
Rubric:
• 1 point for the correct ansatz of Q
• 1 point for the correct Q
• 1 point for setting the derivatives to zero
• 1 point for the correct policy parameters
Question 7: (6 points)
To implement pseudo-counts, you are given a generative model that would produce a given state s with
probability p(s) before and p′(s) after updating the generative model with an observed s.
(a) [2 points] Explain how p(s) and p′(s) are related to the count N(s) of past observations of s.
(b) [2 points] Derive the pseudo-count N(s) = p(s)(1 − p′(s)) / (p′(s) − p(s)).
(c) [2 points] Define a sensible intrinsic reward r_i(s, a, s′) for deep exploration based on the pseudo-count
N(s).
Solution
(a) For countable states, p(s) = N(s)/n and p′(s) = (N(s) + 1)/(n + 1), where n := Σ_s N(s) denotes the total
number of observations before observing s again.
Rubric:
• 1 point per correct definition of p(s) and p′(s), or an equivalent description (without math)
• 1 point if p(s) and p′(s) are not defined but the basic idea of the relationship is communicated
(b) From above we can deduce n = N(s)/p(s) and n = (N(s) + 1)/p′(s) − 1. Putting those equalities together
yields

  N(s)/p(s) = N(s)/p′(s) + 1/p′(s) − 1
  ⇔ N(s) ( 1/p(s) − 1/p′(s) ) = (1 − p′(s)) / p′(s)
  ⇔ N(s) (p′(s) − p(s)) / (p(s) p′(s)) = (1 − p′(s)) / p′(s)
  ⇔ N(s) = p(s) p′(s) (1 − p′(s)) / ( p′(s) (p′(s) − p(s)) ) = p(s) (1 − p′(s)) / (p′(s) − p(s)) .
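A quick numeric sanity check of this formula (the counts are arbitrary toy values):

N, n = 5, 20                           # state s was seen 5 times among 20 observations
p = N / n                              # model probability before observing s again
p_new = (N + 1) / (n + 1)              # model probability after the update
print(p * (1 - p_new) / (p_new - p))   # ≈ 5.0: recovers N(s)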
Rubric:
• 1 point for an ansatz that removes the unknown n
• 1 point for the correct derivation of pseudo counts N (s)
(c) The reward should create an upper confidence bound on the value, i.e., it should be an analogue to the
standard deviation, which for averages is proportional to 1/√N(s). As we only measure N(s), and not
N(s, a), we can only see how often we visited the next state s′. The intrinsic reward for the transition
s, a → s′ is therefore (for some factor c):

  r_i(s, a, s′) := c / √N(s′) .
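A minimal sketch of such an intrinsic reward as code (the scale c is a free hyperparameter):

def intrinsic_reward(N_next, c=0.1):
    # bonus for transitions into rarely visited next states s'
    return c / (N_next ** 0.5)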
Rubric:
• 1 point for using 1/√N, even if the square root is missing, but no point if the fraction is missing
• 1 point for using N(s′) or redefining the generative model to deduce N(s, a).
Question 8: (10 points)
Prove that the variance V[Q^π(s, a)] of the Q-value Q^π(s, a) of a stochastic policy π, in an MDP with
deterministic rewards and stochastic transitions, is bounded from above by

  V[Q^π(s, a)] ≤ γ² E[ V[Q^π(s′, a′)] | s′ ∼ P(·|s, a), a′ ∼ π(·|s′) ] ,   ∀s ∈ S, ∀a ∈ A .

Hint: you can use Jensen's inequality E[x]² ≤ E[x²]. You may also abbreviate
E′[·] := E[ · | s′ ∼ P(·|s, a), a′ ∼ π(·|s′) ].
Solution
Note that the reward function r(s, a) is deterministic, and thus E[r(s, a)] = r(s, a).

  V[Q^π(s, a)] = E[ ( (r(s, a) + γ E′[Q^π(s′, a′)]) − E[ r(s, a) + γ E′[Q^π(s′, a′)] ] )² ]      (Q^π(s, a) minus E[Q^π(s, a)])
  (det. reward)         = E[ ( r(s, a) − E[r(s, a)] + γ E′[Q^π(s′, a′)] − γ E[ E′[Q^π(s′, a′)] ] )² ]   with r(s, a) − E[r(s, a)] = 0
  (swap expectations)   = E[ ( γ E′[ Q^π(s′, a′) − E[Q^π(s′, a′)] ] )² ]
  (Jensen's inequality) ≤ γ² E[ E′[ ( Q^π(s′, a′) − E[Q^π(s′, a′)] )² ] ]
  (swap expectations)   = γ² E′[ E[ ( Q^π(s′, a′) − E[Q^π(s′, a′)] )² ] ] = γ² E′[ V[Q^π(s′, a′)] ] .
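A small Monte-Carlo illustration of the Jensen step used above (random toy data; the inner dimension plays the role of E′, the outer of E):

import torch
torch.manual_seed(0)
x = torch.randn(1000, 50)
lhs = (x.mean(dim=1) ** 2).mean()    # E[(E'[X])^2]
rhs = (x ** 2).mean(dim=1).mean()    # E[E'[X^2]]
print(bool(lhs <= rhs))              # True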
Rubric:
• 1 point for the correct definition of the Q-value
• 2 points for the correct definition of the variance
• 2 points for canceling out the reward functions
• 2 points for using Jensen's inequality correctly
• 2 points for re-substituting the variance of the next state-action
• 1 point for putting it all correctly together
Question 9: (programming) (14 points)
You only have to insert the missing code segment at the line(s) marked with #YOUR CODE HERE.
Please use correct Python/PyTorch code. Singleton dimensions of tensors can be ignored, i.e., you do
not need to (un)squeeze tensors. If you forget a specific command, you can define it first, both the
signature (input/output parameters) and a short description what it does. Using your own definitions of
existing PyTorch functions will not yield point deductions. If no similar PyTorch function exists, your
definition will be considered as wrong code and you will not receive the corresponding points.
Implement the following loss for DQN with intrinsic reward in the given MyLearner class efficiently:
  min_{θ1,…,θk} E[ 1/k Σ_{i=1}^{k} 1/n Σ_{t=1}^{n} ( r_t + σ(s_t, a_t) + γ max_{a′∈A} µ(s′_t, a′) − Q_{θi}(s_t, a_t) )²
                   | ⟨s_t, a_t, r_t, s′_t⟩ ∼ D ] ,

where σ(s, a) := √( 1/k Σ_{j=1}^{k} ( Q_{θ′_j}(s, a) − µ(s, a) )² )  and  µ(s, a) := 1/k Σ_{j=1}^{k} Q_{θ′_j}(s, a) .
The loss trains an ensemble of k DQN Q-value models m = make_model(env), with independently
initialized parameters θi, which each take a batch of n states of dimensionality d (as n × d
tensor) and return a batch of vectors of length |A| ∈ N with the predicted Q-values for all actions (as
n×|A| tensor). The target parameters θ′_i shall be identical to those of the current models, i.e. θ′_i = θi, ∀i,
but shall not be changed by gradient descent. Ensure that terminal next states s′_t are handled properly.
import torch
from torch import stack                      # to stack a list of tensors in one dim
from torch.nn.functional import mse_loss     # MSE loss

class MyLearner:
    def __init__(self, env, k=5, gamma=0.99):
        self.gamma = gamma
        self.models = [self.make_model(env) for _ in range(k)]
        # single optimizer over the parameters of all ensemble members (Adam assumed)
        self.optimizer = torch.optim.Adam([p for m in self.models
                                           for p in m.parameters()])

    def train(self, batch):
        """ Performs one gradient update step on the above loss.
            "batch" is a dictionary of equally sized tensors
            (except for the last dimension):
            - batch['states'][t, :]      = s_t  ∈ R^d
            - batch['actions'][t]        = a_t  ∈ N
            - batch['rewards'][t]        = r_t  ∈ R
            - batch['next_states'][t, :] = s'_t ∈ R^d
            - batch['terminals'][t]      = True, iff s'_t is terminal """
        loss = 0
        values = stack([m(batch['states']) for m in self.models], dim=0)
        # YOUR CODE HERE
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
Hint: you can use the functions Tensor.mean(dim) to compute µ and Tensor.std(dim) for σ. Be careful
about the order in which .mean() and .max() are applied!
Solution
Here are two variants of how to implement this loss:

# Variant A: list comprehensions
values = values.gather(dim=-1, index=batch['actions'].unsqueeze(dim=0))   # Q_i(s_t, a_t) for all models i
futures = stack([m(batch['next_states']) for m in self.models], dim=0)
futures = futures.mean(dim=0).max(dim=-1)[0]                              # max_a' µ(s'_t, a')
targets = batch['rewards'] + values.std(dim=0) \
          + self.gamma * ~batch['terminals'] * futures                    # r_t + σ(s_t, a_t) + γ max_a' µ(s'_t, a')
loss = mse_loss(values, targets.unsqueeze(dim=0).detach())

# Variant B: for loops over models
futures = []
for m in self.models:
    futures.append(m(batch['next_states']))
futures = stack(futures, dim=0).mean(dim=0).max(dim=-1)[0]                # max_a' µ(s'_t, a')
sigma = values.gather(dim=-1, index=batch['actions'].unsqueeze(dim=0)).std(dim=0)   # σ(s_t, a_t)
targets = batch['rewards'] + sigma \
          + self.gamma * ~batch['terminals'] * futures
for i in range(values.shape[0]):
    v = values[i].gather(dim=-1, index=batch['actions'])                  # Q_i(s_t, a_t) of model i
    loss = loss + mse_loss(v, targets.detach()) / values.shape[0]
Both variants are fine. The order of the mean and the max is important due to Jensen's inequality: the mean over models must be taken before the max over actions.
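For intuition, a standalone toy illustration of the ensemble statistics used above (arbitrary shapes: k=3 models, n=4 states, |A|=2 actions):

import torch
q = torch.randn(3, 4, 2)        # stacked Q-values of a toy ensemble
mu = q.mean(dim=0)              # µ(s, a) for every state-action in the batch
sigma = q.std(dim=0)            # σ(s, a), the ensemble standard deviation
bootstrap = mu.max(dim=-1)[0]   # max_a' µ(s', a'): mean over models first, then max over actions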
Rubric:
• 2 points for gathering the current values of the current actions (only 1 point for a for loop)
• 2 points for computing the future values for all models (none for treating self.models as one model)
• 2 points for taking the maximum over actions of future values (only 1 point for forgotten [0])
• 2 points for taking the mean over future values (only 1 point if the mean-max order is incorrect)
• 2 points for the correct use of terminals (only 1 point for forgotten not: ~)
• 2 points for detaching the targets
• 2 points for the correct TD loss, including value (only 1 point for the mean current future values)
• no point deduction for double-Q-learning or for missing loss normalization
• single point deduction for significant syntax errors that would change the behavior, but no deduction for
repetitions of that error
End of exam. Total 100 points.