Exam CS4400: Deep Reinforcement Learning
19-04-2023 | 13:30–16:30
Student name:
Student number:
• This is a closed-book individual examination with 9 questions and a total of 100 points.
• Do not open the exam before the official start of the examination.
• If you feel sick or otherwise unable to take the examination, please indicate this before
the exam starts.
• The examination lasts 180 minutes after the official start.
• This gives you a bit under 2 minutes per point. Use your time carefully!
• You can hand in your exam solution any time until 15 minutes before the end of the
exam and leave the examination room quietly. In the last 15 minutes, no one may leave the
examination room, so that other students can concentrate on finishing their exam.
• Only one student can visit the bathroom at the same time. In the last 15 minutes, no
bathroom visits are possible.
• Use of course books, readers, notes, and slides is not permitted.
• Use of (graphical) calculators or mobile computing devices (including mobile phones) is
not permitted.
• Write down your name and student number above.
• Write your student number on each sheet of the exam after the exam has started.
• You can write your answer on the free space under each question.
• If you need more space, use the back of another exam-sheet and write where to find the
answer under the question. Ask for additional empty pages if you need them.
• Use pens with black or blue ink. Pencils and red ink are not allowed!
• Clearly cross out invalid answers. If two answers are given, we consider the one with
fewer points!
• Write clearly, use correct English, and avoid verbose explanations. Giving irrelevant
information may lead to a reduction in your score.
• This exam covers all information on the slides of the course, the tutorials and everything
discussed in lectures.
• This exam assumes a familiarity with the stated background knowledge of the course.
• The total number of pages of this exam is 11 (excluding this front page).
• Exam prepared by Wendelin Böhmer. ©2023 TU Delft.
Question 1 (multiple choice): (40 points)
Please mark only the correct answer with a cross in the corresponding square. If you wish to unmark a marked
answer, fill the entire square completely and draw an empty square next to it.
Only one answer per question is correct. You will receive 2 points per correct answer, except when
multiple squares are marked. Wrong answers yield no points, but are also not punished. Good luck!
1.1: Why is the following loss computation considered efficient for deep learning?
loss = ((f(x) - y) ** 2).mean()
because the mean() function automatically masks out non-existing values.
because the entire batch is computed in one forward pass.
because the mean-squared error minimizes the difference between f(x) and y.
because all operations are differentiable and can therefore be used in gradient descent.
1.2: Which of the following is a parameter of the [Link] class?
hidden_size
output_size
num_heads
transposed
1.3: What is the input to a Multi-headed Attention Layer (MHA)?
a tensor with one vector per sample
a tensor with an ordered set of vectors per sample
a tensor with an unordered set of vectors per sample
tensors representing one graph, with nodes annotated by vectors, per sample
1.4: Which of the following losses has not been discussed in the lecture?
−E[Q(s, a) | (s, a) ∼ D]
−E[ ln π(a|s) | (s, a) ∼ D]
−E[Q(s, a) ln π(a|s) | (s, a) ∼ D]
−E[Q(s, π(s, a)) | s ∼ D, a ∼ N (0, 1)]
1.5: Which of the following approaches updates the target network the slowest?
Semi-gradient TD-learning
Soft target updates
Deep Q-networks
Neural Fitted Q-iteration
1.6: Which property of RL does not violate classical ML assumptions?
regression targets are non-stationary
states have to fulfill the Markov assumption
neural networks forget values of unsampled states
transitions within episodes are not sampled i.i.d.
1.7: Changing which parameter does not make the DQN algorithm more sample efficient?
decrease the number of gradient updates per sampled trajectory
decrease the number of environmental steps between gradient updates
decrease the time between target network updates
decrease the time during which the exploration is decayed
1.8: Which class is part of the standard RL software architecture from the lectures?
Trainer
Executor
Explorer
Controller
1.9: Which of the following definitions is a TD(λ) value target?
(1 − λ) Σ_{k=0}^{∞} λ^k r_{t+k}
(1 − λ) Σ_{k=0}^{n−1} λ^k r_{t+k} + λ^n V^π(s_{t+n})
Σ_{k=0}^{n−1} (λγ)^k r_{t+k} + γ^n (1 − λ) V^π(s_{t+n})
Σ_{k=0}^{∞} (λγ)^k ( r_{t+k} + γ (1 − λ) V^π(s_{t+k+1}) )
1.10: Which term from the on-policy-gradient loss Lπ did we not ignore or approximate to derive the
off-policy actor-critic loss Lµ in the lecture?
L^π[θ] := − Σ_{t=0}^{n−1} ∬ ξ_t^π(s_t) γ^t ( Q^π(s_t, a_t) − V^π(s_t) ) ∇_θ π_θ(a_t|s_t) ds_t da_t

ξ_t^π(s_t) / ξ_t^µ(s_t)
π_θ(a_t|s_t) / µ(a_t|s_t)
V^π(s_t)
Q^π(s_t, a_t)
1.11: Which algorithm cannot work with discrete actions without the reparametrization trick?
DRQN
Reinforce
PPO
DDPG
1.12: How does SAC differ from TD3?
SAC can use an experience replay buffer, TD3 can not
SAC uses stochastic policies via the reparametrization trick, TD3 does not
SAC trains two Q-value functions to be pessimistic, TD3 does not
SAC regularizes actions with clipped noise, TD3 does not
1.13: Which of the following uncertainties is irreducible?
Aleatoric
Epistemic
Posterior variance
Upper confidence bounds
1.14: What is the value of the initial state of a Markov chain with 10 transitions in a row, where the
last state is terminal and each transition yields a reward of 1?
1 / (1 − γ)
(1 − γ^9) / (1 − γ)
(1 − γ^10) / (1 − γ)
(1 − γ^11) / (1 − γ)
Remark: The finite geometric series is Σ_{t=0}^{n} γ^t = (1 − γ^{n+1}) / (1 − γ). There are 10 transitions with a
reward of 1 each, summed up: Σ_{t=0}^{9} γ^t = (1 − γ^{10}) / (1 − γ).
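A quick numeric check of the remark (γ = 0.9 is an arbitrary illustrative choice):

gamma = 0.9
print(sum(gamma**t for t in range(10)))   # 6.5132...
print((1 - gamma**10) / (1 - gamma))      # same value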
1.15: Which of the following is not a main challenge of online deep RL?
non-stationarity
catastrophic forgetting
no exploration
learning stability
Remark: Many students gave the answers "learning stability" and "catastrophic forgetting", which are less
accurate, but one can interpret them as correct.
1.16: In offline RL, restricting available actions performs well when ...
a suitable distance measure can be found
a suitable epistemic uncertainty measure can be found
the sample distribution is close to random behavior
the sample distribution is close to optimal behavior
1.17: How many outputs does an LSTM network with h memory cells require to represent a stochastic
Gaussian policy in a partially observable environment with continuous actions a ∈ Rk ?
k+h
k + 2h
2k + h
2k + 2h
1.18: Which games cannot be formulated as a POSG?
games with sequential moves
games with continuous actions
games with non-stationary rules
games with stochastic moves
1.19: Which of the following MARL algorithms can overcome relative overgeneralization?
VDN
MADDPG
DCG
QMIX
1.20: In which of the following multi-task techniques does the policy have access to the task-id?
DQN
HER
DR
Sim2Real
Question 2: (6 points)
In 6 sentences or less, explain active domain randomization and name one approach to do it.
Solution
In domain randomization, random environmental parameters are drawn every episode. Active domain
randomization tries to choose the parameters that are most informative for the learning process. An
example is Bayesian optimization, which learns the Bayesian posterior over the task space and samples
parameters with high probability. Another example is adversarial interference, where another agent is
trained, like in robust RL, to choose the task/parameters that are hardest for the current agent.
Rubric:
• 2 points for mentioning randomly drawn environments/tasks/parameters. Only 1 point if just
noise is added to the state.
• 2 points for mentioning that one wants to choose the most informative parameters. Only 1 point if standard
domain randomization is described.
• 2 points for either Bayesian optimization or adversarial interference, even if they are called
slightly differently.
Question 3: (6 points)
(a) [3 points] Design a cooperative two-player normal-form game that exhibits relative overgeneralization
on average, by defining the collaborative reward:
r(a1, a2) :=          a21    a22    a23
              a11     ___    ___    ___
              a12     ___    ___    ___
              a13     ___    ___    ___
(b) [3 points] Name a training method that can solve games exhibiting relative overgeneralization and
explain, using your game as an example, why it works. You do not need to solve your game, just explain
how it could be done.
Solution
Many games are possible, but for simplicity, here is predator-prey:
r(a1, a2) :=          a21    a22    a23
              a11      +1     −2     −2
              a12      −2      0      0
              a13      −2      0      0
The same game with a punishment below -1 is an example where both players almost exhibit RO.
One can solve relative overgeneralization (RO) with optimistic return methods. Here agents remember
the best past reward for each action or ignore rewards that are lower than the current Q-value. As a
result, each agent always remembers the best action of the other agent and RO cannot occur anymore.
Another method is higher-order factorization, like Deep Coordination Graphs. Here the joint Q-value
function can be learned by pairwise functions, which do not exhibit RO. The agents can agree in a
decentralized way on the best joint action by using the max-sum algorithm with message passing.
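A quick numeric check that the example game above exhibits relative overgeneralization (assuming, for illustration, a uniformly random partner): the jointly optimal action a11 has the lowest average reward.

r = [[+1, -2, -2],
     [-2,  0,  0],
     [-2,  0,  0]]
avg = [sum(row) / len(row) for row in r]    # average reward of a11, a12, a13 under a uniform partner
print([round(v, 2) for v in avg])           # [-1.0, -0.67, -0.67]: a11 looks worst on average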
Rubric:
• 2 point for a game where at least one agent exhibits RO, or where both agents almost exhibit RO
• 1 point for a game where both agents exhibit RO, including on the boundary (-1 above)
• 1 point for mentioning either optimistic return or higher-order factorization, or any method that
works against RO in the example.
• 2 points for a good explanation of the mentioned concept
Question 4: (6 points)
Which deep reinforcement learning algorithm and neural network architecture from the lectures would
you use to train two robots, with cameras as eyes, playing ping-pong against each other? Give at least
one argument to justify each of your choices.
Hint: it is sufficient to name and justify the required module-types of the neural network, you do not
need to draw how they are connected. Linear layers are always present, so they do not have to be named.
Solution
The environment is partially observable (camera eyes), the observations are images (require CNNs), and
the actions are continuous (motors). The two robots play against each other (zero-sum MARL game)
and can/will therefore not communicate.
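For illustration only, a minimal sketch of a network along these lines, assuming (hypothetically) 84×84 RGB camera frames and 6 continuous motor actions; the exact sizes are arbitrary choices:

import torch.nn as nn

class PingPongPolicy(nn.Module):
    """ CNN for image observations, LSTM against partial observability,
        Gaussian policy head (mean and log-std) for continuous actions. """
    def __init__(self, action_dim=6, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
                                 nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
                                 nn.Flatten())
        self.lstm = nn.LSTM(input_size=32 * 9 * 9, hidden_size=hidden,  # 9x9 feature map for 84x84 frames
                            batch_first=True)
        self.head = nn.Linear(hidden, 2 * action_dim)                   # mean and log-std per action

    def forward(self, frames, hidden_state=None):
        # frames: [batch, time, 3, 84, 84]
        b, t = frames.shape[:2]
        z = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        z, hidden_state = self.lstm(z, hidden_state)
        mean, log_std = self.head(z).chunk(2, dim=-1)
        return mean, log_std, hidden_state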
Rubric:
• 3 points for a MARL algorithm for continuous actions: MADDPG or independent learning
with any single-agent algorithm for continuous actions (TRPO, PPO, DDPG, TD3, SAC). Only
2 points for MARL algorithms for discrete actions (IQL, COMA, VDN, QMIX, QTRAN) or for
not mentioning MARL. 1 point for DCG (no communication) or single-agent DQN.
• 3 points for CNN (for the images), 2 points for LSTM (against partial observability). No points
for Linear, GNN, MHA or others, unless justified (e.g. with slot attention).
• 1 point less if the choice is suboptimal or not justified (minimum 0 points per category).
• No points if the choice is totally wrong or the justification unconnected.
Question 5: (6 points)
The figure to the right plots the training return for Online Q-learning with a replay buffer of the last 100k
transitions (blue, dark) and the same algorithm with
soft-updated target networks (red, bright), learning the
CartPole-v1 task.
(a) [2 points] Explain why the blue agent becomes unstable in the second half of training.
(b) [2 points] Explain how the target network stabilizes
training.
(c) [2 points] Explain the difference between soft and
hard target updates.
Solution
There are of course multiple possible explanations, but the ones we expect here are:
(a) In the second half of training, the initial exploration transitions leave the replay buffer and
catastrophic forgetting leads to uncontrollable drifting of the values of unexecuted actions. Once the wrong
action is selected greedily, it takes a while to unlearn it again. Random deviations, for example due to
the ε-greedy policy, can also lead to states that have been affected by catastrophic forgetting.
(b) Soft-updated target networks slow the change in the network that is used for bootstrapping. The
above problems are therefore kept local and their propagation is delayed long enough that the sampling
policy is not affected (that much).
(c) Soft target updates change the target parameters slightly towards the current network parameters
at every gradient update, whereas hard target updates do not change the target network most of the
time but every now and then replace the target parameters completely with the current parameters.
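As a minimal sketch of the two update rules described in (c), assuming target and model are torch.nn.Module instances with identical architectures (τ and the update period C are illustrative choices):

def soft_update(target, model, tau=0.005):
    # after every gradient step: nudge the target parameters slightly towards the current ones
    for tp, p in zip(target.parameters(), model.parameters()):
        tp.data.mul_(1 - tau).add_(tau * p.data)

def hard_update(target, model):
    # only every C gradient steps: overwrite the target parameters completely
    target.load_state_dict(model.state_dict())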
Rubric:
• 2 points for a coherent explanation of (a).
• no points for explanations that would also apply to target networks.
• 2 points for a coherent explanation of (b).
• no points for explanations that would also apply without target networks.
• 2 points for a coherent explanation of (c).
• only 1 point in either category if the explanation is not coherent or the answer otherwise unclear.
Question 6: (6 points)
Consider a two-player zero-sum normal-form game G with the reward function of player A, rA (aA , aB ):
rA(aA, aB):        B: X    B: Y
        A: X         4      −4
        A: Y         0       2

and the policies of agents i: π_{θi}^i(a) = θi if a = X, and 1 − θi if a = Y, with θi ∈ [0, 1].
(a) [2 points] Give the joint actions of all Nash-equilibria of G or indicate that none exist. You do not
have to justify your answer.
(b) [4 points] Derive the mixed Nash equilibrium for G.
Hint: it suffices to derive the extreme point, you do not have to prove that it is a saddle-point.
Solution
(a) None exist.
Rubric:
• 2 points for the correct answer.
(b) The mixed Nash equilibrium of a zero-sum game is the tuple of stochastic policies (π^A, π^B) that
maximizes the expected return Q w.r.t. π^A and minimizes Q w.r.t. π^B. In the following, let π′ := π^A_{θA}
and π := π^B_{θB}:

  Q := E[ r(a, b) | a ∼ π′, b ∼ π ]
     = π′(X)π(X) r(X,X) + π′(X)π(Y) r(X,Y) + π′(Y)π(X) r(Y,X) + π′(Y)π(Y) r(Y,Y)
     = 4 θA θB − 4 θA (1 − θB) + 2 (1 − θA)(1 − θB)
     = 4 θA θB + 4 θA θB − 4 θA + 2 θA θB − 2 θA − 2 θB + 2
     = 10 θA θB − 6 θA − 2 θB + 2 .

Setting the derivatives to zero reveals the saddle point of Q in parameter space (θA, θB):

  ∂Q/∂θB = 10 θA − 2 = 0   ⇒   θA = 2/10 ,
  ∂Q/∂θA = 10 θB − 6 = 0   ⇒   θB = 6/10 .

The mixed Nash equilibrium is therefore θ = [2/10, 6/10], i.e. a pair of stochastic policies in which
agent A plays X 20% of the time and agent B plays X 60% of the time.
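A quick numeric sanity check of the derived equilibrium (a minimal sketch): at θ = (0.2, 0.6) the expected return is 0.8 and neither player can change it by deviating unilaterally.

def Q(tA, tB):
    # expected return of the game above under the parametrized policies
    return 4*tA*tB - 4*tA*(1 - tB) + 0*(1 - tA)*tB + 2*(1 - tA)*(1 - tB)

print([round(Q(tA, 0.6), 3) for tA in (0.0, 0.5, 1.0)])   # [0.8, 0.8, 0.8]
print([round(Q(0.2, tB), 3) for tB in (0.0, 0.5, 1.0)])   # [0.8, 0.8, 0.8]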
Rubric:
• 1 point for the correct ansatz of Q
• 1 point for the correct Q
• 1 point for setting the derivatives to zero
• 1 point for the correct policy parameters
Question 7: (6 points)
To implement pseudo-counts, you are given a generative model that would produce a given state s with
probability p(s) before and p′(s) after updating the generative model with an observed s.
(a) [2 points] Explain how p(s) and p′(s) are related to the count N(s) of past observations of s.
(b) [2 points] Derive the pseudo-count N(s) = p(s)(1 − p′(s)) / (p′(s) − p(s)).
(c) [2 points] Define a sensible intrinsic reward r_i(s, a, s′) for deep exploration based on the pseudo-count
N(s).
Solution
(a) For countable states, p(s) = N(s)/n and p′(s) = (N(s) + 1)/(n + 1), where n := Σ_s N(s) denotes the total
number of observations before observing s again.
Rubric:
• 1 point per correct definition of p(s) and p′(s), or an equivalent description (without math)
• 1 point if p(s) and p′(s) are not defined but the basic idea of the relationship is communicated
(b) From above we can deduce n = N(s)/p(s) and n = (N(s) + 1)/p′(s) − 1. Putting those equalities together
yields

  N(s)/p(s) = N(s)/p′(s) + 1/p′(s) − 1
  ⇔ N(s) ( 1/p(s) − 1/p′(s) ) = (1 − p′(s)) / p′(s)
  ⇔ N(s) (p′(s) − p(s)) / (p(s) p′(s)) = (1 − p′(s)) / p′(s)
  ⇔ N(s) = p(s) p′(s) (1 − p′(s)) / ( p′(s) (p′(s) − p(s)) ) = p(s) (1 − p′(s)) / (p′(s) − p(s)) .
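A quick numeric sanity check of this formula (the counts are arbitrary toy values):

N, n = 5, 20                           # state s was seen 5 times among 20 observations
p = N / n                              # model probability before observing s again
p_new = (N + 1) / (n + 1)              # model probability after the update
print(p * (1 - p_new) / (p_new - p))   # ≈ 5.0: recovers N(s)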
Rubric:
• 1 point for an ansatz that removes the unknown n
• 1 point for the correct derivation of pseudo counts N (s)
(c) The reward should create an upper confidence bound on the value, i.e., it should be an analogue to the
standard deviation, which for averages is proportional to 1/√N(s). As we only measure N(s), and not
N(s, a), we can only see how often we visited the next state s′. The intrinsic reward for the transition
s, a → s′ is therefore (for some factor c):

  r_i(s, a, s′) := c / √N(s′) .
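A minimal sketch of such an intrinsic reward as code (the scale c is a free hyperparameter):

def intrinsic_reward(N_next, c=0.1):
    # bonus for transitions into rarely visited next states s'
    return c / (N_next ** 0.5)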
Rubric:
• 1 point for using 1/√N, even if the square root is missing, but no point if the fraction is missing
• 1 point for using N(s′) or redefining the generative model to deduce N(s, a).
Question 8: (10 points)
Prove that the variance V[Q^π(s, a)] of the Q-value Q^π(s, a) of a stochastic policy π, in an MDP with
deterministic rewards and stochastic transitions, is bounded from above by

  V[Q^π(s, a)] ≤ γ² E[ V[Q^π(s′, a′)] | s′ ∼ P(·|s, a), a′ ∼ π(·|s′) ] ,   ∀s ∈ S, ∀a ∈ A .

Hint: you can use Jensen's inequality E[x]² ≤ E[x²]. You may also abbreviate
E′[·] := E[ · | s′ ∼ P(·|s, a), a′ ∼ π(·|s′) ].
Solution
Note that the reward function r(s, a) is deterministic, and thus E[r(s, a)] = r(s, a).

  V[Q^π(s, a)] = E[ ( (r(s, a) + γ E′[Q^π(s′, a′)]) − E[ r(s, a) + γ E′[Q^π(s′, a′)] ] )² ]      (Q^π(s, a) minus E[Q^π(s, a)])
  (det. reward)         = E[ ( r(s, a) − E[r(s, a)] + γ E′[Q^π(s′, a′)] − γ E[ E′[Q^π(s′, a′)] ] )² ]   with r(s, a) − E[r(s, a)] = 0
  (swap expectations)   = E[ ( γ E′[ Q^π(s′, a′) − E[Q^π(s′, a′)] ] )² ]
  (Jensen's inequality) ≤ γ² E[ E′[ ( Q^π(s′, a′) − E[Q^π(s′, a′)] )² ] ]
  (swap expectations)   = γ² E′[ E[ ( Q^π(s′, a′) − E[Q^π(s′, a′)] )² ] ] = γ² E′[ V[Q^π(s′, a′)] ] .
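A small Monte-Carlo illustration of the Jensen step used above (random toy data; the inner dimension plays the role of E′, the outer of E):

import torch
torch.manual_seed(0)
x = torch.randn(1000, 50)
lhs = (x.mean(dim=1) ** 2).mean()    # E[(E'[X])^2]
rhs = (x ** 2).mean(dim=1).mean()    # E[E'[X^2]]
print(bool(lhs <= rhs))              # True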
Rubric:
• 1 point for the correct definition of the Q-value
• 2 points for the correct definition of the variance
• 2 points for canceling out the reward functions
• 2 points for using Jensen's inequality correctly
• 2 points for re-substituting the variance of the next state-action
• 1 point for putting it all correctly together
Question 9: (programming) (14 points)
You only have to insert the missing code segment at the line(s) marked with #YOUR CODE HERE.
Please use correct Python/PyTorch code. Singleton dimensions of tensors can be ignored, i.e., you do
not need to (un)squeeze tensors. If you forget a specific command, you can define it first, both the
signature (input/output parameters) and a short description what it does. Using your own definitions of
existing PyTorch functions will not yield point deductions. If no similar PyTorch function exists, your
definition will be considered as wrong code and you will not receive the corresponding points.
Implement the following loss for DQN with intrinsic reward in the given MyLearner class efficiently:
  min_{θ1,…,θk} E[ 1/k Σ_{i=1}^{k} 1/n Σ_{t=1}^{n} ( r_t + σ(s_t, a_t) + γ max_{a′∈A} µ(s′_t, a′) − Q_{θi}(s_t, a_t) )²
                   | ⟨s_t, a_t, r_t, s′_t⟩ ∼ D ] ,

where σ(s, a) := √( 1/k Σ_{j=1}^{k} ( Q_{θ′_j}(s, a) − µ(s, a) )² )  and  µ(s, a) := 1/k Σ_{j=1}^{k} Q_{θ′_j}(s, a) .
The loss trains an ensemble of k DQN Q-value models m = make_model(env), with independently
initialized parameters θi, which each take a batch of n states of dimensionality d (as n × d
tensor) and return a batch of vectors of length |A| ∈ N with the predicted Q-values for all actions (as
n×|A| tensor). The target parameters θ′_i shall be identical to those of the current models, i.e. θ′_i = θi, ∀i,
but shall not be changed by gradient descent. Ensure that terminal next states s′_t are handled properly.
import torch
from torch import stack                      # to stack a list of tensors in one dim
from torch.nn.functional import mse_loss     # MSE loss

class MyLearner:
    def __init__(self, env, k=5, gamma=0.99):
        self.gamma = gamma
        self.models = [self.make_model(env) for _ in range(k)]
        # single optimizer over the parameters of all ensemble members (Adam assumed)
        self.optimizer = torch.optim.Adam([p for m in self.models
                                           for p in m.parameters()])

    def train(self, batch):
        """ Performs one gradient update step on the above loss.
            "batch" is a dictionary of equally sized tensors
            (except for the last dimension):
            - batch['states'][t, :]      = s_t  ∈ R^d
            - batch['actions'][t]        = a_t  ∈ N
            - batch['rewards'][t]        = r_t  ∈ R
            - batch['next_states'][t, :] = s'_t ∈ R^d
            - batch['terminals'][t]      = True, iff s'_t is terminal """
        loss = 0
        values = stack([m(batch['states']) for m in self.models], dim=0)
        # YOUR CODE HERE
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
Hint: you can use the functions Tensor.mean(dim) to compute µ and Tensor.std(dim) for σ. Be careful
about the order in which .mean() and .max() are applied!
Solution
Here are two variants of how to implement this loss:

# Variant A: list comprehensions
values = values.gather(dim=-1, index=batch['actions'].unsqueeze(dim=0))   # Q_i(s_t, a_t) for all models i
futures = stack([m(batch['next_states']) for m in self.models], dim=0)
futures = futures.mean(dim=0).max(dim=-1)[0]                              # max_a' µ(s'_t, a')
targets = batch['rewards'] + values.std(dim=0) \
          + self.gamma * ~batch['terminals'] * futures                    # r_t + σ(s_t, a_t) + γ max_a' µ(s'_t, a')
loss = mse_loss(values, targets.unsqueeze(dim=0).detach())

# Variant B: for loops over models
futures = []
for m in self.models:
    futures.append(m(batch['next_states']))
futures = stack(futures, dim=0).mean(dim=0).max(dim=-1)[0]                # max_a' µ(s'_t, a')
sigma = values.gather(dim=-1, index=batch['actions'].unsqueeze(dim=0)).std(dim=0)   # σ(s_t, a_t)
targets = batch['rewards'] + sigma \
          + self.gamma * ~batch['terminals'] * futures
for i in range(values.shape[0]):
    v = values[i].gather(dim=-1, index=batch['actions'])                  # Q_i(s_t, a_t) of model i
    loss = loss + mse_loss(v, targets.detach()) / values.shape[0]
Both variants are fine. The order of the mean and the max is important due to Jensen's inequality: the mean over models must be taken before the max over actions.
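For intuition, a standalone toy illustration of the ensemble statistics used above (arbitrary shapes: k=3 models, n=4 states, |A|=2 actions):

import torch
q = torch.randn(3, 4, 2)        # stacked Q-values of a toy ensemble
mu = q.mean(dim=0)              # µ(s, a) for every state-action in the batch
sigma = q.std(dim=0)            # σ(s, a), the ensemble standard deviation
bootstrap = mu.max(dim=-1)[0]   # max_a' µ(s', a'): mean over models first, then max over actions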
Rubric:
• 2 points for gathering the current values of the current actions (only 1 point for a for loop)
• 2 points for computing the future values for all models (none for treating self.models as one model)
• 2 points for taking the maximum over actions of future values (only 1 point for forgotten [0])
• 2 points for taking the mean over future values (only 1 point if the mean-max order is incorrect)
• 2 points for the correct use of terminals (only 1 point for forgotten not: ~)
• 2 points for detaching the targets
• 2 points for the correct TD loss, including value (only 1 point for the mean current future values)
• no point deduction for double-Q-learning or for missing loss normalization
• single point deduction for significant syntax errors that would change the behavior, but no deduction for
repetitions of that error
End of exam. Total 100 points.