Count-Based Exploration with Neural Density Models

Georg Ostrovski 1 Marc G. Bellemare 1 Aäron van den Oord 1 Rémi Munos 1

arXiv:1703.01310v2 [cs.AI] 14 Jun 2017

Abstract

Bellemare et al. (2016) introduced the notion of a pseudo-count, derived from a density model, to generalize count-based exploration to non-tabular reinforcement learning. This pseudo-count was used to generate an exploration bonus for a DQN agent and combined with a mixed Monte Carlo update was sufficient to achieve state of the art on the Atari 2600 game Montezuma's Revenge. We consider two questions left open by their work: First, how important is the quality of the density model for exploration? Second, what role does the Monte Carlo update play in exploration? We answer the first question by demonstrating the use of PixelCNN, an advanced neural density model for images, to supply a pseudo-count. In particular, we examine the intrinsic difficulties in adapting Bellemare et al.'s approach when assumptions about the model are violated. The result is a more practical and general algorithm requiring no special apparatus. We combine PixelCNN pseudo-counts with different agent architectures to dramatically improve the state of the art on several hard Atari games. One surprising finding is that the mixed Monte Carlo update is a powerful facilitator of exploration in the sparsest of settings, including Montezuma's Revenge.

1. Introduction

Exploration is the process by which an agent learns about its environment. In the reinforcement learning framework, this involves reducing the agent's uncertainty about the environment's transition dynamics and attainable rewards. From a theoretical perspective, exploration is now well-understood (e.g. Strehl & Littman, 2008; Jaksch et al., 2010; Osband et al., 2016), and Bayesian methods have been successfully demonstrated in a number of settings (Deisenroth & Rasmussen, 2011; Guez et al., 2012). On the other hand, practical algorithms for the general case remain scarce; fully Bayesian approaches are usually intractable in large state spaces, and the count-based method typical of theoretical results is not applicable in the presence of value function approximation.

Recently, Bellemare et al. (2016) proposed the notion of pseudo-count as a reasonable generalization of the tabular setting considered in the theory literature. The pseudo-count is defined in terms of a density model ρ trained on the sequence of states experienced by an agent:

    N̂(x) = ρ(x) n̂(x),

where n̂(x) can be thought of as a total pseudo-count computed from the model's recoding probability ρ′(x), the probability of x computed immediately after training on x. As a practical application the authors used the pseudo-counts derived from the simple CTS density model (Bellemare et al., 2014) to incentivize exploration in Atari 2600 agents. One of the main outcomes of their work was substantial empirical progress on the infamously hard game MONTEZUMA'S REVENGE.

Their method critically hinged on several assumptions regarding the density model: 1) the model should be learning-positive, i.e. the probability assigned to a state x should increase with training; 2) it should be trained online, using each sample exactly once; and 3) the effective model step-size should decay at a rate of n^-1. Part of their empirical success also relied on a mixed Monte Carlo/Q-Learning update rule, which permitted fast propagation of the exploration bonuses.

In this paper, we set out to answer several research questions related to these modelling choices and assumptions:

1. To what extent does a better density model give rise to better exploration?

2. Can the above modelling assumptions be relaxed without sacrificing exploration performance?

3. What role does the mixed Monte Carlo update play in successfully incentivizing exploration?

DeepMind, London, UK. Correspondence to: Georg Ostrovski <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

In particular, we explore the use of PixelCNN (van den Oord et al., 2016b;a), a state-of-the-art neural density model. We examine the challenges posed by this approach:

Model choice. Performing two evaluations and one model update at each agent step (to compute ρ(x) and ρ′(x)) can be prohibitively expensive. This requires the design of a simplified – yet sufficiently expressive and accurate – PixelCNN architecture.

Model training. A CTS model can naturally be trained from sequentially presented, correlated data samples. Training a neural model in this online fashion requires more careful attention to the optimization procedure to prevent overfitting and catastrophic forgetting (French, 1999).

Model use. The theory of pseudo-counts requires the density model's rate of learning to decay over time. Optimization of a neural model, however, imposes constraints on the step-size regime which cannot be violated without deteriorating effectiveness and stability of training.

The concept of intrinsic motivation has made a recent resurgence in reinforcement learning research, in great part due to a dissatisfaction with ε-greedy and Boltzmann policies. Of note, Tang et al. (2016) maintain an approximate count by means of hash tables over features, which in the pseudo-count framework corresponds to a hash-based density model. Houthooft et al. (2016) used a second-order Taylor approximation of the prediction gain to drive exploration in continuous control. As research moves towards ever more complex environments, we expect the trend towards more intrinsically motivated solutions to continue.

2. Background

2.1. Pseudo-Count and Prediction Gain

Here we briefly introduce notation and results, referring the reader to (Bellemare et al., 2016) for technical details.

Let ρ be a density model on a finite space X, and ρ_n(x) the probability assigned by the model to x after being trained on a sequence of states x_1, ..., x_n. Assume ρ_n(x) > 0 for all x, n. The recoding probability ρ′_n(x) is then the probability the model would assign to x if it were trained on that same x one more time. We call ρ learning-positive if ρ′_n(x) ≥ ρ_n(x) for all x_1, ..., x_n, x ∈ X. The prediction gain (PG) of ρ is

    PG_n(x) = log ρ′_n(x) − log ρ_n(x).    (1)

A learning-positive ρ implies PG_n(x) ≥ 0 for all x ∈ X. For learning-positive ρ, we define the pseudo-count as

    N̂_n(x) = ρ_n(x)(1 − ρ′_n(x)) / (ρ′_n(x) − ρ_n(x)),

derived from postulating that a single observation of x ∈ X should lead to a unit increase in pseudo-count:

    ρ_n(x) = N̂_n(x) / n̂,    ρ′_n(x) = (N̂_n(x) + 1) / (n̂ + 1),

where n̂ is the pseudo-count total. The pseudo-count generalizes the usual state visitation count function N_n(x). Under certain assumptions on ρ_n, pseudo-counts grow approximately linearly with real counts. Crucially, the pseudo-count can be approximated using the prediction gain of the density model:

    N̂_n(x) ≈ (exp(PG_n(x)) − 1)^(-1).

Its main use is to define an exploration bonus. We consider a reinforcement learning (RL) agent interacting with an environment that provides observations and extrinsic rewards (see Sutton & Barto, 1998, for a thorough exposition of the RL framework). To the reward at step n we add the bonus

    r⁺(x) := (N̂_n(x))^(-1/2),

which incentivizes the agent to try to re-experience surprising situations. Quantities related to prediction gain have been used for similar purposes in the intrinsic motivation literature (Lopes et al., 2012), where they measure an agent's learning progress (Oudeyer et al., 2007). Although the pseudo-count bonus is close to the prediction gain, it is asymptotically more conservative and supported by stronger theoretical guarantees.
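To make these quantities concrete, the following is a minimal Python sketch of the pseudo-count machinery, assuming the density model exposes ρ_n(x) before and the recoding probability ρ′_n(x) after a single training step on x; the function names are illustrative rather than taken from the authors' implementation, and degenerate cases (e.g. ρ′_n(x) = ρ_n(x)) are not handled.

    import math

    def prediction_gain(rho, rho_prime):
        # PG_n(x) = log rho'_n(x) - log rho_n(x), Eq. (1).
        return math.log(rho_prime) - math.log(rho)

    def pseudo_count(rho, rho_prime):
        # Exact pseudo-count for a learning-positive model.
        return rho * (1.0 - rho_prime) / (rho_prime - rho)

    def pseudo_count_from_pg(pg):
        # Approximation N_hat_n(x) ~ (exp(PG_n(x)) - 1)^-1.
        return 1.0 / math.expm1(pg)

    def exploration_bonus(rho, rho_prime):
        # Bonus r+(x) = (N_hat_n(x))^(-1/2) added to the extrinsic reward.
        return pseudo_count(rho, rho_prime) ** -0.5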
2.2. Density Models for Images

The CTS density model (Bellemare et al., 2014) is based on the namesake algorithm, Context Tree Switching (Veness et al., 2012), a Bayesian variable-order Markov model. In its simplest form, the model takes as input a 2D image and assigns to it a probability according to the product of location-dependent L-shaped filters, where the prediction of each filter is given by a CTS algorithm trained on past images. In Bellemare et al. (2016), this model was applied to 3-bit greyscale, 42 × 42 downsampled Atari 2600 frames (Fig. 1). The CTS model presents advantages in terms of simplicity and performance but is limited in expressiveness, scalability, and data efficiency.

In recent years, neural generative models for images have achieved impressive successes in their ability to generate diverse images in various domains (Kingma & Welling, 2013; Rezende et al., 2014; Gregor et al., 2015; Goodfellow et al., 2014). In particular, van den Oord et al. (2016b;a) introduced PixelCNN, a fully convolutional neural network composed of residual blocks with multiplicative gating units, which models pixel probabilities conditional on previous pixels (in the usual top-left to bottom-right raster-scan order) by using masked convolution filters. This model achieved state-of-the-art modelling performance on standard datasets, paired with the computational efficiency of a convolutional feed-forward network.

[Figure 1. Atari frame preprocessing (Bellemare et al., 2016): Original Frame (160x210) and 3-bit Greyscale (42x42).]

2.3. Multi-Step RL Methods

A distinguishing feature of reinforcement learning is that the agent "learns on the basis of interim estimates" (Sutton, 1996). For example, the Q-Learning update rule is

    Q(x, a) ← Q(x, a) + α [r(x, a) + γ max_a′ Q(x′, a′) − Q(x, a)],

linking the reward r and next-state value function Q(x′, a′) to the current state value function Q(x, a). This particular form is the stochastic update rule with step-size α and involves the TD-error δ(x, a), the term in brackets. In the approximate reinforcement learning setting, such as when Q(x, a) is represented by a neural network, this update is converted into a loss to be minimized, most commonly the squared loss δ²(x, a).

It is well known that better performance, both in terms of learning efficiency and approximation error, is attained by multi-step methods (Sutton, 1996; Tsitsiklis & van Roy, 1997). These methods interpolate between one-step methods (Q-Learning) and the Monte-Carlo update

    Q(x, a) ← Q(x, a) + α [ Σ_{t≥0} γ^t r(x_t, a_t) − Q(x, a) ],

where the term in brackets is the Monte-Carlo error δ_MC(x, a) and x_0, a_0, x_1, a_1, ... is a sample path through the environment beginning in (x, a). To achieve their success on the hardest Atari 2600 games, Bellemare et al. (2016) used the mixed Monte-Carlo update (MMC)

    Q(x, a) ← Q(x, a) + α [(1 − β) δ(x, a) + β δ_MC(x, a)],

with β ∈ [0, 1]. This choice was made for "computational and implementational simplicity", and is a particularly coarse multi-step method. A better multi-step method is the recent Retrace(λ) algorithm (Munos et al., 2016). Retrace(λ) uses a product of truncated importance sampling ratios c_1, c_2, ... to replace δ with the error term

    δ_RETRACE(x, a) := Σ_{t≥0} γ^t (Π_{s=1..t} c_s) δ(x_t, a_t),

effectively mixing in TD-errors from all future time steps. Munos et al. showed that Retrace(λ) is safe (does not diverge when trained on data from an arbitrary behaviour policy), and efficient (makes the most of multi-step returns).
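As a concrete illustration of the mixed Monte-Carlo update, the sketch below computes the MMC target that a DQN-style squared loss would regress Q(x, a) towards; it assumes the discounted Monte-Carlo return of the (completed) episode is available, and the names are ours, not the authors' code.

    def discounted_returns(rewards, gamma):
        # Backward pass over a completed episode, giving the Monte-Carlo
        # return observed from each time step.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        return list(reversed(returns))

    def mmc_target(reward, gamma, max_q_next, mc_return, q_value, beta):
        # Q is moved towards the mixture of the one-step target and the
        # Monte-Carlo return: equivalently Q + (1 - beta)*delta + beta*delta_MC.
        td_error = reward + gamma * max_q_next - q_value
        mc_error = mc_return - q_value
        return q_value + (1.0 - beta) * td_error + beta * mc_error

With beta = 0 this recovers one-step Q-Learning, and with beta = 1 the pure Monte-Carlo update.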


3. Using PixelCNN for Exploration

As mentioned in the Introduction, the theory of using density models for exploration makes several assumptions that translate into concrete requirements for an implementation:

(a) The density model should be trained completely online, i.e. exactly once on each state experienced by the agent, in the given sequential order.

(b) The prediction gain (PG) should decay at a rate n^-1 to ensure that pseudo-counts grow approximately linearly with real counts.

(c) The density model should be learning-positive.

Simultaneously, a partly competing set of requirements are posed by the practicalities of training a neural density model and using it as part of an RL agent:

(d) For stability, efficiency, and to avoid catastrophic forgetting in the context of a drifting data distribution, it is advantageous to train a neural model in mini-batches, drawn randomly from a diverse dataset.

(e) For effective training, a certain optimization regime (e.g. a fixed learning rate schedule) has to be followed.

(f) The density model must be computationally lightweight, to allow computing the PG (two model evaluations and one update) as part of every training step of an RL agent.

We investigate how to best resolve these tensions in the context of the Arcade Learning Environment (Bellemare et al., 2013), a suite of benchmark Atari 2600 games.

3.1. Designing a Suitable Density Model

Driven by (f) and aiming for an agent with computational performance comparable to DQN, we design a slim variant of the PixelCNN network. Its core is a stack of 2 gated residual blocks with 16 feature maps (compared to 15 residual blocks with 128 feature maps in vanilla PixelCNN). As was done with the CTS model, images are downsampled to 42 × 42 and quantized to 3-bit greyscale. See Appendix A for technical details.

[Figure 2. Left: PixelCNN log loss on FREEWAY, when trained online, on a random permutation (single use of each frame) or on randomly drawn samples (with replacement, potentially using same frame multiple times) from the state sequence. To simulate the effect of non-stationarity, the agent's policy changes every 4K updates. All training methods show qualitatively similar learning progress and stability. Middle: PixelCNN log loss over first 6K training frames on PONG. Vertical dashed lines indicate episode ends. The coinciding loss spikes are the density model's 'surprise' upon observing the distinctive green frame that sometimes occurs at the episode start. Right: DQN-PixelCNN training performance on MONTEZUMA'S REVENGE as we vary learning rate and PG decay schedules.]

[Figure 3. Model loss on FREEWAY and MONTEZUMA'S REVENGE, averaged over 10K frames, after 1M training frames, for constant, n^-1, and n^-1/2 learning rate schedules. The smallest loss is achieved by a constant learning rate of 10^-3.]

3.2. Training the Density Model

Instead of using randomized mini-batches, we train the density model completely online on the sequence of experienced states. Empirically we found that with minor tuning of optimization hyper-parameters we could train the model as robustly on a temporally correlated sequence of states as on a sequence with randomized order (Fig. 2(left)).

Besides satisfying the theoretical requirement (a), completely online training of the density model has the advantage that ρ′_n = ρ_{n+1}, so that the model update performed for computing the PG need not be reverted (the CTS model, by contrast, allows querying the PG cheaply, without incurring an actual update of model parameters).

Another more subtle reason for avoiding mini-batch updates of the density model (despite (d)) is a practical optimization issue. The (necessarily online) computation of the PG involves a model update and hence the use of an optimizer. Advanced optimizers used with deep neural networks, like the RMSProp optimizer (Tieleman & Hinton, 2012) used in this work, are stateful, tracking running averages of e.g. mean and variance of the model parameters. If the model is additionally trained from mini-batches, the two streams of updates may show different statistical characteristics (e.g. different gradient magnitudes), invalidating the assumptions underlying the optimization algorithm and leading to slower or unstable training.

To determine a suitable online learning rate schedule, we train the model on a sequence of 1M frames of experience of a random-policy agent. We compare the loss achieved by training procedures following constant or decaying learning rate schedules, see Fig. 3. The lowest final training loss is achieved by a constant learning rate of 0.001 or a decaying learning rate of 0.1 · n^-1/2. We settled our choice on the constant learning rate schedule as it showed greater robustness with respect to the choice of initial learning rate.
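The per-step interaction with the density model (requirement (f): two evaluations and one update) can be summarized as follows. This is a hedged PyTorch sketch under the assumption of a model returning per-pixel logits over 8 intensity bins; frame_log_prob and online_pg_step are illustrative names, not the authors' code, and the commented RMSProp settings follow Appendix A.

    import torch
    import torch.nn.functional as F

    def frame_log_prob(model, frame):
        # log rho(x): sum of per-pixel log-probabilities of the observed
        # 3-bit intensity bins under the model's softmax outputs.
        logits = model(frame)                    # [1, 8, 42, 42]
        targets = (frame * 7).round().long()     # [1, 1, 42, 42] bin indices
        return F.log_softmax(logits, dim=1).gather(1, targets).sum()

    def online_pg_step(model, optimizer, frame):
        # One online density-model step: evaluate, update once on this
        # single frame, re-evaluate. Returns PG_n(x) = log rho' - log rho.
        optimizer.zero_grad()
        log_rho = frame_log_prob(model, frame)        # evaluation 1
        (-log_rho).backward()                         # negative log-likelihood
        optimizer.step()                              # single online update
        with torch.no_grad():
            log_rho_prime = frame_log_prob(model, frame)  # evaluation 2
        return (log_rho_prime - log_rho).item()

    # Uncentered RMSProp as in Appendix A:
    # optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3,
    #                                 momentum=0.9, alpha=0.95, eps=1e-4)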
PixelCNN rapidly learns a sensible distribution over state space. Fig. 2(left) shows the model's loss decaying as it learns to exploit image regularities. Spikes in its loss function quickly start to correspond to visually meaningful events, such as the starts of episodes (Fig. 2(middle)). A video of early density model training is provided at http://youtu.be/T6iaa8Z4eyE.

[Figure 4. Samples after 25K steps. Left: CTS, right: PixelCNN.]

3.3. Computing the Pseudo-Count

From the previous section we obtain a particular learning rate schedule that cannot be arbitrarily modified without deteriorating the model's training performance or stability. To achieve the required PG decay (b), we instead replace PG_n by c_n · PG_n with a suitably decaying sequence c_n.

In experiments comparing actual agent performance we empirically determined that in fact the constant learning rate 0.001, paired with a PG decay c_n = c · n^-1/2, obtains the best exploration results on hard exploration games like MONTEZUMA'S REVENGE, see Fig. 2(right). We find the model to be robust across 1-2 orders of magnitude for the value of c, and informally determine c = 0.1 to be a sensible configuration for achieving good results on a broad range of Atari 2600 games (see also Section 7).

Regarding (c), it is hard to ensure learning-positiveness for a deep neural model, and a negative PG can occur whenever the optimizer 'overshoots' a local loss minimum. As a workaround, we threshold the PG value at 0. To summarize, the computed pseudo-count is

    N̂_n(x) = ( exp( c · n^-1/2 · (PG_n(x))_+ ) − 1 )^(-1).
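Putting the pieces together, the sketch below turns the thresholded, decayed prediction gain into the exploration bonus and combines it with the clipped environment reward. The constant c = 0.1, the n^-1/2 decay and the clipping to [−1, 1] are taken from the text (n ≥ 1 is the number of density-model updates); the function names are illustrative, not from the authors' implementation.

    import math

    def pixelcnn_bonus(pg, num_updates, c=0.1):
        # N_hat_n(x) = (exp(c * n^-1/2 * max(PG, 0)) - 1)^-1, so the bonus
        # (N_hat_n(x))^(-1/2) equals sqrt(exp(decayed_pg) - 1).
        decayed_pg = c * num_updates ** -0.5 * max(pg, 0.0)
        return math.sqrt(math.expm1(decayed_pg))

    def combined_reward(extrinsic_reward, pg, num_updates, c=0.1):
        # Environment reward plus exploration bonus, clipped to [-1, 1].
        total = extrinsic_reward + pixelcnn_bonus(pg, num_updates, c)
        return max(-1.0, min(1.0, total))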

4. Exploration in Atari 2600 Games

Having described our pseudo-count friendly adaptation of PixelCNN, we now study its performance on Atari games. To this end we augment the environment reward with a pseudo-count exploration bonus, yielding the combined reward r(x, a) + (N̂_n(x))^(-1/2). As usual for neural network-based agents, we ensure the total reward lies in [−1, 1] by clipping larger values.

4.1. DQN with PixelCNN Exploration Bonus

Our first set of experiments provides the PixelCNN exploration bonus to a DQN agent (Mnih et al., 2015). (Unlike Bellemare et al., we use regular Q-Learning instead of Double Q-Learning (van Hasselt et al., 2016), as our early experiments showed no significant advantage of DoubleDQN with the PixelCNN-based exploration reward.) At each agent step, the density model receives a single frame, with which it simultaneously updates its parameters and outputs the PG. We refer to this agent as DQN-PixelCNN.

The DQN-CTS agent we compare against is derived from the one in (Bellemare et al., 2016). For better comparability, it is trained in the same online fashion as DQN-PixelCNN, i.e. the PG is computed whenever we train the density model. By contrast, the original DQN-CTS queried the PG at the end of each episode.

Unless stated otherwise, we always use the mixed Monte Carlo update (MMC) for the intrinsically motivated agents, but regular Q-Learning for the baseline DQN. (The use of MMC in a replay-based agent poses a minor complication, as the MC return is not available for replay until the end of an episode. For simplicity, in our implementation we disregard this detail and set the MC return to 0 for transitions from the most recent episode.)

[Figure 5. DQN, DQN-CTS and DQN-PixelCNN on hard exploration games (top: MONTEZUMA'S REVENGE, PRIVATE EYE) and easier ones (bottom: ASTEROIDS, BERZERK).]

Fig. 5 shows training curves of DQN compared to DQN-CTS and DQN-PixelCNN. On the famous MONTEZUMA'S REVENGE, both intrinsically motivated agents vastly outperform the baseline DQN. On other hard exploration games (PRIVATE EYE; or VENTURE, appendix Fig. 15), DQN-PixelCNN achieves state of the art results, substantially outperforming DQN and DQN-CTS. The other two games shown (ASTEROIDS, BERZERK) pose easier exploration problems, where the reward bonus should not provide large improvements and may have a negative effect by skewing the reward landscape. Here, DQN-PixelCNN behaves more gracefully and still outperforms DQN-CTS. We hypothesize this is due to a qualitative difference between the models, see Section 5.

[Figure 6. Improvements (in % of AUC) of DQN-PixelCNN and DQN-CTS over DQN in 57 Atari games. Annotations indicate the number of hard exploration games with positive (right) and negative (left) improvement, respectively.]

Overall PixelCNN provides the DQN agent with a larger advantage than CTS, and often accelerates or stabilizes training even when not affecting peak performance.

Out of 57 Atari games, DQN-PixelCNN outperforms DQN-CTS in 52 games by maximum achieved score, and 51 by AUC (methodology in Appendix B). See Fig. 6 for a high level comparison (appendix Fig. 15 for full training graphs). The greatest gains from using either exploration bonus are observed in games categorized as hard exploration games in the 'taxonomy of exploration' in (Bellemare et al., 2016, reproduced in Appendix D), specifically in the most challenging sparse reward games (e.g. MONTEZUMA'S REVENGE, PRIVATE EYE, VENTURE).

4.2. A Multi-Step RL Agent with PixelCNN

Empirical practitioners know that techniques beneficial for one agent architecture can often be detrimental for a different algorithm. To demonstrate the wide applicability of the PixelCNN exploration bonus, we also evaluate it with the more recent Reactor agent (Gruslys et al., 2017); the exact agent variant is referred to as 'β-LOO' with β = 1. This replay-based actor-critic agent represents its policy and value function by a recurrent neural network and, crucially, uses the multi-step Retrace(λ) algorithm for policy evaluation, replacing the MMC we use in DQN-PixelCNN. To reduce the impact on computational efficiency of this agent, we sub-sample intrinsic rewards: we perform updates of the PixelCNN model and compute the reward bonus on (randomly chosen) 25% of all steps, leaving the agent's reward unchanged on other steps. We use the same PG decay schedule of 0.1 · n^-1/2, with n the number of model updates.

[Figure 7. Reactor/Reactor-PixelCNN and DQN/DQN-PixelCNN training performance on GRAVITAR, H.E.R.O., ROAD RUNNER and TIME PILOT (averaged over 3 seeds).]

Training curves for the Reactor/Reactor-PixelCNN agent compared to DQN/DQN-PixelCNN are shown in Fig. 7. The baseline Reactor agent is superior to the DQN agent, obtaining higher scores and learning faster in about 50 out of 57 games. It is further improved on a large fraction of games by the PixelCNN exploration reward, see Fig. 8 (full training graphs in appendix Fig. 16).

The effect of the exploration bonus is rather uniform, yielding improvements on a broad range of games. In particular, Reactor-PixelCNN enjoys better sample efficiency (in terms of area under the curve, AUC) than vanilla Reactor. We hypothesize that, like other policy gradient algorithms, Reactor generally suffers from weaker exploration than its value-based counterpart DQN. This aspect is much helped by the exploration bonus, boosting the agent's sample efficiency in many environments.

[Figure 8. Improvements (in % of AUC) of Reactor-PixelCNN over Reactor in 57 Atari games, grouped into easy exploration (40), hard exploration with dense reward (10), and hard exploration with sparse reward (7).]

However, on hard exploration games with sparse rewards, Reactor seems unable to make full use of the exploration bonus. We believe this is because, in very sparse settings, the propagation of reward information across long horizons becomes crucial. The MMC takes one extreme of this view, directly learning from the observed returns. The Retrace(λ) algorithm, on the other hand, has an effective horizon which depends on λ and, critically, the truncated importance sampling ratio. This ratio results in the discarding of trajectories which are off-policy, i.e. unlikely under the current policy. We hypothesize that the very goal of the Retrace(λ) algorithm to learn cautiously is what prevents it from taking full advantage of the exploration bonus!
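To make the truncation argument concrete, the sketch below accumulates the Retrace(λ) error of Section 2.3 with trace coefficients c_s = λ · min(1, π(a_s|x_s)/μ(a_s|x_s)) (Munos et al., 2016): as soon as an action is unlikely under the current policy π relative to the behaviour policy μ, the cumulative product collapses and later TD-errors stop contributing. This is an illustrative sketch, not part of the Reactor implementation.

    def retrace_error(td_errors, pi_probs, mu_probs, gamma, lam):
        # delta_RETRACE = sum_t gamma^t * (prod_{s=1..t} c_s) * delta_t,
        # with truncated importance ratios c_s = lam * min(1, pi_s / mu_s).
        total, discount, trace = 0.0, 1.0, 1.0
        for t, delta in enumerate(td_errors):
            if t > 0:
                # The trace shrinks whenever the logged action is off-policy.
                trace *= lam * min(1.0, pi_probs[t] / mu_probs[t])
            total += discount * trace * delta
            discount *= gamma
        return total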

5. Quality of the Density Model

PixelCNN can be expected to be more expressive and accurate than the less advanced CTS model, and indeed, samples generated after training are somewhat higher quality (Fig. 4). However, we are not using the generative function of the models when computing an exploration bonus, and a better generative model does not necessarily give rise to better probability estimates (Theis et al., 2016).

[Figure 9. PG of CTS and PixelCNN on MONTEZUMA'S REVENGE (log scale).]

In Fig. 9 we compare the PG produced by the two models throughout 5K training steps. PixelCNN consistently produces PGs lower than CTS. More importantly, its PGs are smoother, exhibiting less variance between successive states, while showing more pronounced peaks at certain infrequent events. This yields a reward bonus that is less harmful in easy exploration games, while providing a strong signal in the case of novel or rare events.

Another distinguishing feature of PixelCNN is its non-decaying step-size. The per-step PG never completely vanishes, as the model tracks the most recent data. This provides an unexpected benefit: the agent remains mildly surprised by significant state changes, e.g. switching rooms in MONTEZUMA'S REVENGE. These persistent rewards act as milestones that the agent learns to return to. This is illustrated in Fig. 10, depicting the intrinsic reward over the course of an episode. The agent routinely revisits the right-hand side of the torch room, not because it leads to reward but just to "take in the sights". A video of the episode is provided at http://youtu.be/232tOUPKPoQ; another agent video on the game PRIVATE EYE can be found at http://youtu.be/kNyFygeUa2E.

[Figure 10. Intrinsic reward over the time steps of an episode in MONTEZUMA'S REVENGE.]

Lastly, PixelCNN's convolutional nature is expected to be beneficial for its sample efficiency. In Appendix C we compare to a convolutional CTS and confirm that this explains part, but not all, of PixelCNN's advantage over vanilla CTS.

6. Importance of the Monte Carlo Return

Like for DQN-CTS, the success of DQN-PixelCNN hinges on the use of the mixed Monte Carlo update. The transient and vanishing nature of the exploration rewards requires the learning algorithm to latch on to these rapidly. The MMC serves this end as a simple multi-step method, helping to propagate reward information faster. An additional benefit lies in the fact that the Monte Carlo return helps bridge long horizons in environments where rewards are far apart and encountered rarely. On the other hand, it is important to note that the Monte Carlo return's on-policy nature increases variance in the learning algorithm, and can prevent the algorithm's convergence to the optimal policy when training off-policy. It can therefore be expected to adversely affect training performance in some games.

[Figure 11. DQN and DQN-PixelCNN, each with and without MMC, on BANK HEIST, MS. PAC-MAN, H.E.R.O. and PRIVATE EYE. Top: games where MMC completely explains the improved/decreased performance of DQN-PixelCNN compared to DQN. Bottom-left: MMC and PixelCNN show additive benefits. Bottom-right: hard exploration, sparse reward game – only combining MMC and PixelCNN bonus achieves training progress.]

To distill the effect of the MMC on performance, we compare all four combinations of DQN with/without PixelCNN exploration bonus and with/without MMC. Fig. 11 shows the performance of these four agent variants (graphs for all games are shown in Fig. 17). These games were picked to illustrate several commonly occurring cases:

• MMC speeds up training and improves final performance significantly (examples: BANK HEIST, TIME PILOT). In these games, MMC alone explains most or all of the improvement of DQN-PixelCNN over DQN.

• MMC hurts performance (examples: MS. PAC-MAN, BREAKOUT). Here too, MMC alone explains most of the difference between DQN-PixelCNN and DQN.

• MMC and PixelCNN reward bonus have a compounding effect (example: H.E.R.O.).

Most importantly, the situation is rather different when we restrict our attention to the hardest exploration games with sparse rewards. Here the baseline DQN agent fails to make any training progress, and neither Monte Carlo return nor the exploration bonus alone provide any significant benefit. Their combination however grants the agent rapid training progress and allows it to achieve high performance.

One effect of the exploration bonus in these games is to provide a denser reward landscape, enabling the agent to learn meaningful policies. Due to the transient nature of the exploration bonus, the agent needs to be able to learn from this reward signal faster than regular one-step methods allow, and MMC proves to be an effective solution.

[Figure 12. DQN-PixelCNN on the hard exploration games MONTEZUMA'S REVENGE, PRIVATE EYE and VENTURE, with different PG scales c · n^-1/2 · PG_n (c = 0.1, 1, 10) (5 seeds each).]

7. Pushing the Limits of Intrinsic Motivation

In this section we explore the idea of a 'maximally curious' agent, whose reward function is dominated by the exploration bonus. For that we increase the PG scale, previously chosen conservatively to avoid adverse effects on easy exploration games.

Fig. 12 shows DQN-PixelCNN performance on the hardest exploration games when the PG scale is increased by 1-2 orders of magnitude. The algorithm seems fairly robust across a wide range of scales: the main effect of increasing this parameter is to trade off exploration (seeking maximal reward) with exploitation (optimizing the current policy). As expected, a higher PG scale translates to stronger exploration: several runs obtain record peak scores (900 in GRAVITAR, 6,600 in MONTEZUMA'S REVENGE, 39,000 in PRIVATE EYE, 1,500 in VENTURE) surpassing the state of the art by a substantial margin (for previously published results, see Appendix D). Aggressive scaling speeds up the agent's exploration and achieves peak performance rapidly, but can also deteriorate its stability and long-term performance. Note that in practice, because of the non-decaying step-size, the PG does not vanish. After reward clipping, an overly inflated exploration bonus can therefore become essentially constant, no longer providing a useful intrinsic motivation signal to the agent.

[Figure 13. DQN-PixelCNN trained from intrinsic reward only on MONTEZUMA'S REVENGE and PRIVATE EYE, for PG scales c = 1, 5, 10, 50 (3 seeds for each configuration).]

Another way of creating an entirely curiosity-driven agent is to ignore the environment reward altogether and train based on the exploration reward only, see Fig. 13. Remarkably, the curiosity signal alone is sufficient to train a high-performing agent (measured by environment reward!).

It is worth noting that agents with exploration bonus seem to 'never stop exploring': for different seeds, the agents make learning progress at very different times during training, a qualitative difference to vanilla DQN.

8. Conclusion

We demonstrated the use of PixelCNN for exploration and showed that its greater accuracy and expressiveness translate into a more useful exploration bonus than that obtained from previous models. While the current theory of pseudo-counts puts stringent requirements on the density model, we have shown that PixelCNN can be used in a simpler and more general setup, and can be trained completely online. It also proves to be widely compatible with both value-function and policy-based RL algorithms.

In addition to pushing the state of the art on the hardest exploration problems among the Atari 2600 games, PixelCNN improves speed of learning and stability of baseline RL agents across a wide range of games. The quality of its reward bonus is evidenced by the fact that on sparse reward games, this signal alone suffices to learn to achieve significant scores, creating a truly intrinsically motivated agent.

Our analysis also reveals the importance of the Monte Carlo return for effective exploration. The comparison with more sophisticated but fixed-horizon multi-step methods shows that its significance lies both in faster learning in the context of a useful but transient reward function, as well as in bridging reward gaps in environments where extrinsic and intrinsic rewards are, or quickly become, extremely sparse.

Acknowledgements

The authors thank Tom Schaul, Olivier Pietquin, Ian Osband, Sriram Srinivasan, Tejas Kulkarni, Alex Graves, Charles Blundell, and Shimon Whiteson for invaluable feedback on the ideas presented here, and Audrunas Gruslys especially for providing the Reactor agent.

References

Bellemare, Marc, Veness, Joel, and Talvitie, Erik. Skip context tree switching. In Proceedings of the International Conference on Machine Learning, 2014.

Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, Marc G., Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Rémi. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 2016.

Deisenroth, Marc P. and Rasmussen, Carl E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the International Conference on Machine Learning, 2011.

French, Robert M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

Gregor, Karol, Danihelka, Ivo, Graves, Alex, Rezende, Danilo, and Wierstra, Daan. DRAW: A recurrent neural network for image generation. In Proceedings of the International Conference on Machine Learning, 2015.

Gruslys, Audrunas, Azar, Mohammad Gheshlaghi, Bellemare, Marc G., and Munos, Rémi. The Reactor: A sample-efficient actor-critic architecture. arXiv preprint arXiv:1704.04651, 2017.

Guez, Arthur, Silver, David, and Dayan, Peter. Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems, 2012.

Houthooft, Rein, Chen, Xi, Duan, Yan, Schulman, John, De Turck, Filip, and Abbeel, Pieter. Variational information maximizing exploration. In Advances in Neural Information Processing Systems, 2016.

Jaksch, Thomas, Ortner, Ronald, and Auer, Peter. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2013.

Lopes, Manuel, Lang, Tobias, Toussaint, Marc, and Oudeyer, Pierre-Yves. Exploration in model-based reinforcement learning by empirically estimating learning progress. In Advances in Neural Information Processing Systems, 2012.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Munos, Rémi, Stepleton, Tom, Harutyunyan, Anna, and Bellemare, Marc G. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, 2016.

Osband, Ian, van Roy, Benjamin, and Wen, Zheng. Generalization and exploration via randomized value functions. In Proceedings of the International Conference on Machine Learning, 2016.

Oudeyer, Pierre-Yves, Kaplan, Frédéric, and Hafner, Verena V. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning, 2014.

Strehl, Alexander L. and Littman, Michael L. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

Sutton, Richard S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, 1996.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, 1998.

Tang, Haoran, Houthooft, Rein, Foote, Davis, Stooke, Adam, Chen, Xi, Duan, Yan, Schulman, John, De Turck, Filip, and Abbeel, Pieter. #Exploration: A study of count-based exploration for deep reinforcement learning. arXiv preprint arXiv:1611.04717, 2016.

Theis, Lucas, van den Oord, Aäron, and Bethge, Matthias. A note on the evaluation of generative models. In Proceedings of the International Conference on Learning Representations, 2016.

Tieleman, Tijmen and Hinton, Geoffrey. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Lecture 6.5 of Neural Networks for Machine Learning, 2012.

Tsitsiklis, John N. and van Roy, Benjamin. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.

van den Oord, Aaron, Kalchbrenner, Nal, Espeholt, Lasse, Vinyals, Oriol, Graves, Alex, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, 2016a.

van den Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning, 2016b.

van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with Double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.

Veness, Joel, Ng, Kee Siong, Hutter, Marcus, and Bowling, Michael H. Context tree switching. In Proceedings of the Data Compression Conference, 2012.

Wang, Ziyu, Schaul, Tom, Hessel, Matteo, van Hasselt, Hado, Lanctot, Marc, and de Freitas, Nando. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1995–2003, 2016.

A. PixelCNN Hyper-parameters

The PixelCNN model used in this paper is a lightweight variant of the Gated PixelCNN introduced in (van den Oord et al., 2016a). It consists of a 7 × 7 masked convolution, followed by two residual blocks with 1 × 1 masked convolutions with 16 feature planes, and another 1 × 1 masked convolution producing 64 feature planes, which are mapped by a final masked convolution to the output logits. Inputs are 42 × 42 greyscale images, with pixel values quantized to 8 bins.

The model is trained completely online, from the stream of Atari frames experienced by an agent. Optimization is performed with the (uncentered) RMSProp optimizer (Tieleman & Hinton, 2012) with momentum 0.9, decay 0.95 and epsilon 10^-4.
B. Methodology

Unless otherwise stated, all agent performance graphs in this paper show the agent's training performance, measured as the undiscounted per-episode return, averaged over 1M environment frames per data point.

The algorithm-comparison graphs Fig. 6 and Fig. 8 show the relative improvement of one algorithm over another in terms of area-under-the-curve (AUC). A comparison by maximum achieved score would yield similar overall results, but underestimate the advantage in terms of learning speed (sample efficiency) and stability that the intrinsically motivated and MMC-based agents show over the baselines.
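The exact normalization behind these AUC comparisons is not spelled out here, so the following is only a plausible sketch, assuming curves sampled at regular intervals and a trapezoidal area; the function names are ours.

    def auc(scores):
        # Area under a regularly sampled training curve
        # (trapezoidal rule with unit spacing).
        return sum(scores) - 0.5 * (scores[0] + scores[-1])

    def auc_improvement(scores_new, scores_baseline):
        # Relative improvement (in %) of one training curve over another.
        base = auc(scores_baseline)
        return 100.0 * (auc(scores_new) - base) / abs(base)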
C. Convolutional CTS

In Section 4 we have seen that DQN-PixelCNN outperforms DQN-CTS in most of the 57 Atari games, by providing a more impactful exploration bonus in hard exploration games, as well as a more graceful (less harmful) one in games where the learning algorithm does not benefit from the additional curiosity signal. One may wonder whether this improvement is due to the generally more expressive and accurate density model PixelCNN, or simply its convolutional nature, which gives it an advantage in generalization and sample efficiency over a model that represents pixel probabilities in a completely location-dependent way.

To answer this question, we developed a convolutional variant of the CTS model. This model has a single set of parameters conditioning a pixel's value on its predecessors, shared across all pixel locations, instead of the location-dependent parameters in the regular CTS. In Fig. 14 we contrast the performance of DQN, DQN-MC, DQN-CTS, DQN-ConvCTS and DQN-PixelCNN on 6 example games.

We first consider dense reward games like Q*BERT and ZAXXON, where most improvement comes from the use of the MMC, and the exploration bonus hurts performance. We find that in fact convolutional CTS behaves fairly similarly to PixelCNN, leaving agent performance unaffected, whereas regular CTS causes the agent to train more slowly or reach an earlier performance plateau. On the sparse reward games (GRAVITAR, PRIVATE EYE, VENTURE), however, convolutional CTS proves to be as inferior to PixelCNN as the vanilla CTS variant, failing to achieve the significant improvements over the baseline agents presented in this paper.

We conclude that while the convolutional aspect plays a role in the 'softer' nature of the PixelCNN model compared to its CTS counterpart, it alone is insufficient to explain the massive exploration boost that the PixelCNN-derived reward provides to the DQN agent. The more advanced model's accuracy advantage translates into a more targeted and useful curiosity signal for the agent, which distinguishes novel from well-explored states more clearly and allows for more effective exploration.

D. The Hardest Exploration Games

Table 1 reproduces Bellemare et al. (2016)'s taxonomy of games available through the ALE according to their exploration difficulty. "Human-Optimal" refers to games where DQN-like agents achieve human-level or higher performance; "Score Exploit" refers to games where agents find ways to achieve superhuman scores, without necessarily playing the game as a human would. "Sparse" and "Dense" rewards are qualitative descriptors of the game's reward structure. See the original source for additional details.

Table 2 compares previously published results on the 7 hard exploration, sparse reward Atari 2600 games with results obtained by DQN-CTS and DQN-PixelCNN.

[Figure 14 (panels: GRAVITAR, MONTEZUMA'S REVENGE, PRIVATE EYE, Q*BERT, VENTURE, ZAXXON). Comparison of DQN, DQN-CTS, DQN-ConvCTS and DQN-PixelCNN training performance.]

Easy Exploration – Human-Optimal: ASSAULT, ASTEROIDS, BATTLE ZONE, BOWLING, BREAKOUT, CHOPPER CMD, DEFENDER, DOUBLE DUNK, FISHING DERBY, ICE HOCKEY, NAME THIS GAME, PONG, ROBOTANK, SPACE INVADERS, ASTERIX, ATLANTIS, BERZERK, BOXING, CENTIPEDE, CRAZY CLIMBER, DEMON ATTACK, ENDURO, GOPHER, JAMES BOND, PHOENIX, RIVER RAID, SKIING, STARGUNNER

Easy Exploration – Score Exploit: BEAM RIDER, KANGAROO, KRULL, KUNG-FU MASTER, ROAD RUNNER, SEAQUEST, UP N DOWN, TUTANKHAM

Hard Exploration – Dense Reward: ALIEN, AMIDAR, BANK HEIST, FROSTBITE, H.E.R.O., MS. PAC-MAN, Q*BERT, SURROUND, WIZARD OF WOR, ZAXXON

Hard Exploration – Sparse Reward: FREEWAY, GRAVITAR, MONTEZUMA'S REVENGE, PITFALL!, PRIVATE EYE, SOLARIS, VENTURE

Table 1. A rough taxonomy of Atari 2600 games according to their exploration difficulty.

Game                     DQN       A3C-CTS    Prior. Duel   DQN-CTS    DQN-PixelCNN
FREEWAY                  30.8      30.48      33.0          31.7       31.7
GRAVITAR                 473.0     238.68     238.0         498.3      859.1
MONTEZUMA'S REVENGE      0.0       273.70     0.0           3705.5     2514.3
PITFALL!                 -286.1    -259.09    0.0           0.0        0.0
PRIVATE EYE              146.7     99.32      206.0         8358.7     15806.5
SOLARIS                  3482.8    2270.15    133.4         2863.6     5501.5
VENTURE                  163.0     0.00       48.0          82.2       1356.25

Table 2. Comparison with previously published results on hard exploration, sparse reward games. The compared agents are DQN (Mnih
et al., 2015), A3C-CTS (“A3C+” in (Bellemare et al., 2016)), Prioritized Dueling DQN (Wang et al., 2016), and the basic versions of
DQN-CTS and DQN-PixelCNN from Section 4. For our agents we report the maximum scores achieved over 150M frames of training,
averaged over 3 seeds.
[Figure 15. Training curves of DQN, DQN-CTS and DQN-PixelCNN across all 57 Atari games.]
[Figure 16. Training curves of DQN, DQN-PixelCNN, Reactor and Reactor-PixelCNN across all 57 Atari games.]
[Figure 17. Training curves of DQN and DQN-PixelCNN, each with and without MMC, across all 57 Atari games.]
