Count-Based Exploration With Neural Density Models
Georg Ostrovski 1 Marc G. Bellemare 1 Aäron van den Oord 1 Rémi Munos 1
Abstract

Bellemare et al. (2016) introduced the notion of a pseudo-count, derived from a density model, to generalize count-based exploration to non-tabular reinforcement learning. This pseudo-count was used to generate an exploration bonus for a DQN agent and, combined with a mixed Monte Carlo update, was sufficient to achieve state of the art on the Atari 2600 game MONTEZUMA'S REVENGE. We consider two questions left open by their work: First, how important is the quality of the density model for exploration? Second, what role does the Monte Carlo update play in exploration? We answer the first question by demonstrating the use of PixelCNN, an advanced neural density model for images, to supply a pseudo-count. In particular, we examine the intrinsic difficulties in adapting Bellemare et al.'s approach when assumptions about the model are violated. The result is a more practical and general algorithm requiring no special apparatus. We combine PixelCNN pseudo-counts with different agent architectures to dramatically improve the state of the art on several hard Atari games. One surprising finding is that the mixed Monte Carlo update is a powerful facilitator of exploration in the sparsest of settings, including MONTEZUMA'S REVENGE.

1. Introduction

Exploration is the process by which an agent learns about its environment. In the reinforcement learning framework, this involves reducing the agent's uncertainty about the environment's transition dynamics and attainable rewards. From a theoretical perspective, exploration is now well-understood (e.g. Strehl & Littman, 2008; Jaksch et al., 2010; Osband et al., 2016), and Bayesian methods have received considerable attention. In practice, however, exploration remains challenging in large state spaces, and the count-based method typical of theoretical results is not applicable in the presence of value function approximation.

Recently, Bellemare et al. (2016) proposed the notion of pseudo-count as a reasonable generalization of the tabular setting considered in the theory literature. The pseudo-count is defined in terms of a density model ρ trained on the sequence of states experienced by an agent:

    N̂(x) = ρ(x) n̂(x),

where n̂(x) can be thought of as a total pseudo-count computed from the model's recoding probability ρ′(x), the probability of x computed immediately after training on x. As a practical application, the authors used the pseudo-counts derived from the simple CTS density model (Bellemare et al., 2014) to incentivize exploration in Atari 2600 agents. One of the main outcomes of their work was substantial empirical progress on the infamously hard game MONTEZUMA'S REVENGE.

Their method critically hinged on several assumptions regarding the density model: 1) the model should be learning-positive, i.e. the probability assigned to a state x should increase with training; 2) it should be trained online, using each sample exactly once; and 3) the effective model step-size should decay at a rate of n^(−1). Part of their empirical success also relied on a mixed Monte Carlo/Q-Learning update rule, which permitted fast propagation of the exploration bonuses.

In this paper, we set out to answer several research questions related to these modelling choices and assumptions:

1. To what extent does a better density model give rise to better exploration?

2. Can the above modelling assumptions be relaxed without sacrificing exploration performance?

3. What role does the mixed Monte Carlo update play in successfully incentivizing exploration?

¹ DeepMind, London, UK. Correspondence to: Georg Ostrovski <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).
In particular, we explore the use of PixelCNN (van den Oord et al., 2016b;a), a state-of-the-art neural density model. We examine the challenges posed by this approach:

Model choice. Performing two evaluations and one model update at each agent step (to compute ρ(x) and ρ′(x)) can be prohibitively expensive. This requires the design of a simplified – yet sufficiently expressive and accurate – PixelCNN architecture.

Model training. A CTS model can naturally be trained from sequentially presented, correlated data samples. Training a neural model in this online fashion requires more careful attention to the optimization procedure to prevent overfitting and catastrophic forgetting (French, 1999).

Model use. The theory of pseudo-counts requires the density model's rate of learning to decay over time. Optimization of a neural model, however, imposes constraints on the step-size regime which cannot be violated without deteriorating effectiveness and stability of training.

The concept of intrinsic motivation has made a recent resurgence in reinforcement learning research, in great part due to a dissatisfaction with ε-greedy and Boltzmann policies. Of note, Tang et al. (2016) maintain an approximate count by means of hash tables over features, which in the pseudo-count framework corresponds to a hash-based density model. Houthooft et al. (2016) used a second-order Taylor approximation of the prediction gain to drive exploration in continuous control. As research moves towards ever more complex environments, we expect the trend towards more intrinsically motivated solutions to continue.

2. Background

2.1. Pseudo-Count and Prediction Gain

Here we briefly introduce notation and results, referring the reader to (Bellemare et al., 2016) for technical details.

Let ρ be a density model on a finite space X, and ρn(x) the probability assigned by the model to x after being trained on a sequence of states x1, ..., xn. Assume ρn(x) > 0 for all x, n. The recoding probability ρ′n(x) is then the probability the model would assign to x if it were trained on that same x one more time. We call ρ learning-positive if ρ′n(x) ≥ ρn(x) for all x1, ..., xn, x ∈ X. The prediction gain (PG) of ρ is

    PGn(x) = log ρ′n(x) − log ρn(x).    (1)

A learning-positive ρ implies PGn(x) ≥ 0 for all x ∈ X. For learning-positive ρ, we define the pseudo-count as

    N̂n(x) = ρn(x)(1 − ρ′n(x)) / (ρ′n(x) − ρn(x)),

derived from postulating that a single observation of x ∈ X should lead to a unit increase in pseudo-count:

    ρn(x) = N̂n(x) / n̂,    ρ′n(x) = (N̂n(x) + 1) / (n̂ + 1),

where n̂ is the pseudo-count total. The pseudo-count generalizes the usual state visitation count function Nn(x). Under certain assumptions on ρn, pseudo-counts grow approximately linearly with real counts. Crucially, the pseudo-count can be approximated using the prediction gain of the density model:

    N̂n(x) ≈ (e^(PGn(x)) − 1)^(−1).

Its main use is to define an exploration bonus. We consider a reinforcement learning (RL) agent interacting with an environment that provides observations and extrinsic rewards (see Sutton & Barto, 1998, for a thorough exposition of the RL framework). To the reward at step n we add the bonus

    r+(x) := (N̂n(x))^(−1/2),

which incentivizes the agent to try to re-experience surprising situations. Quantities related to prediction gain have been used for similar purposes in the intrinsic motivation literature (Lopes et al., 2012), where they measure an agent's learning progress (Oudeyer et al., 2007). Although the pseudo-count bonus is close to the prediction gain, it is asymptotically more conservative and supported by stronger theoretical guarantees.
2.2. Density Models for Images

The CTS density model (Bellemare et al., 2014) is based on the namesake algorithm, Context Tree Switching (Veness et al., 2012), a Bayesian variable-order Markov model. In its simplest form, the model takes as input a 2D image and assigns to it a probability according to the product of location-dependent L-shaped filters, where the prediction of each filter is given by a CTS algorithm trained on past images. In Bellemare et al. (2016), this model was applied to 3-bit greyscale, 42 × 42 downsampled Atari 2600 frames (Fig. 1). The CTS model presents advantages in terms of simplicity and performance but is limited in expressiveness, scalability, and data efficiency.

In recent years, neural generative models for images have achieved impressive successes in their ability to generate diverse images in various domains (Kingma & Welling, 2013; Rezende et al., 2014; Gregor et al., 2015; Goodfellow et al., 2014). In particular, van den Oord et al. (2016b;a) introduced PixelCNN, a fully convolutional neural network composed of residual blocks with multiplicative gating units, which models pixel probabilities conditional on previous pixels (in the usual top-left to bottom-right raster-scan order) by using masked convolution filters.
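To illustrate the masked convolutions underlying this raster-scan factorization, the short sketch below (our own illustration, not the paper's implementation) builds the binary mask applied to a k × k convolution kernel so that each output pixel depends only on pixels above it and to its left:

```python
import numpy as np

def causal_mask(kernel_size, include_centre=False):
    """Binary mask for a masked convolution in raster-scan order.

    Type 'A' masks (include_centre=False) also hide the current pixel and are
    used in the first layer; type 'B' masks (include_centre=True) keep it and
    are used in subsequent layers."""
    k = kernel_size
    mask = np.zeros((k, k), dtype=np.float32)
    centre = k // 2
    mask[:centre, :] = 1.0            # all rows strictly above the centre
    mask[centre, :centre] = 1.0       # pixels to the left in the centre row
    if include_centre:
        mask[centre, centre] = 1.0
    return mask

print(causal_mask(5))
# [[1. 1. 1. 1. 1.]
#  [1. 1. 1. 1. 1.]
#  [1. 1. 0. 0. 0.]
#  [0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0.]]
```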
The one-step Q-Learning update links the reward r and next-state value function Q(x′, a′) to the current state value function Q(x, a). This particular form is the stochastic update rule with step-size α and involves the TD-error δ. In the approximate reinforcement learning setting, such as when Q(x, a) is represented by a neural network, this update is converted into a loss to be minimized, most commonly the squared loss δ²(x, a).

It is well known that better performance, both in terms of learning efficiency and approximation error, is attained by multi-step methods (Sutton, 1996; Tsitsiklis & van Roy, 1997). These methods interpolate between one-step methods (Q-Learning) and the Monte-Carlo update

    Q(x, a) ← Q(x, a) + α δMC(x, a),    δMC(x, a) := Σ_{t≥0} γ^t r(xt, at) − Q(x, a),

where x0, a0, x1, a1, ... is a sample path through the environment beginning in (x, a). To achieve their success on the hardest Atari 2600 games, Bellemare et al. (2016) used the mixed Monte-Carlo update (MMC)

    Q(x, a) ← Q(x, a) + α [(1 − β) δ(x, a) + β δMC(x, a)],

with β ∈ [0, 1]. This choice was made for "computational and implementational simplicity", and is a particularly coarse multi-step method.
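As a concrete illustration of the mixed Monte-Carlo update, the following sketch (our own, with hypothetical names; not the paper's code) forms the MMC target for a single replayed transition by blending the one-step Q-Learning target with the observed Monte Carlo return:

```python
import numpy as np

def mmc_target(reward, next_q_values, mc_return, beta, gamma=0.99, terminal=False):
    """Mixed Monte-Carlo target for a single transition.

    reward        : (extrinsic + intrinsic) reward r received at this step
    next_q_values : Q(x', .) estimates for the next state
    mc_return     : discounted return observed from this state until episode end
    beta          : mixing coefficient in [0, 1]; beta = 0 recovers one-step Q-Learning
    """
    one_step = reward + (0.0 if terminal else gamma * float(np.max(next_q_values)))
    return (1.0 - beta) * one_step + beta * mc_return

# Regressing Q(x, a) towards this target with a squared loss corresponds to the
# update Q(x,a) <- Q(x,a) + alpha [(1 - beta) delta(x,a) + beta delta_MC(x,a)].
target = mmc_target(reward=0.1, next_q_values=np.array([0.5, 0.2]),
                    mc_return=1.3, beta=0.25)
```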
A better multi-step method is the recent Retrace(λ) algorithm (Munos et al., 2016). Retrace(λ) uses a product of truncated importance sampling ratios.

Simultaneously, a partly competing set of requirements is posed by the practicalities of training a neural density model and using it as part of an RL agent:

(d) For stability, efficiency, and to avoid catastrophic forgetting in the context of a drifting data distribution, it is advantageous to train a neural model in mini-batches, drawn randomly from a diverse dataset.

(e) For effective training, a certain optimization regime (e.g. a fixed learning rate schedule) has to be followed.

(f) The density model must be computationally lightweight, to allow computing the PG (two model evaluations and one update) as part of every training step of an RL agent.

We investigate how to best resolve these tensions in the context of the Arcade Learning Environment (Bellemare et al., 2013), a suite of benchmark Atari 2600 games.

3.1. Designing a Suitable Density Model

Driven by (f) and aiming for an agent with computational performance comparable to DQN, we design a slim variant of the PixelCNN network. Its core is a stack of 2 gated residual blocks with 16 feature maps (compared to 15 residual blocks with 128 feature maps in vanilla PixelCNN). As was done with the CTS model, images are downsampled to 42 × 42 and quantized to 3-bit greyscale. See Appendix A for technical details.
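A minimal PyTorch sketch of this kind of slim gated PixelCNN is given below. It is our own reconstruction under stated assumptions: the framework, padding scheme, head nonlinearity and input scaling are not specified in the paper; layer sizes follow this section and Appendix A.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Convolution masked so that each pixel only depends on pixels above it
    and to its left; type 'A' additionally hides the centre pixel."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.zeros(kh, kw)
        mask[:kh // 2, :] = 1.0
        mask[kh // 2, :kw // 2 + (mask_type == 'B')] = 1.0
        self.register_buffer('mask', mask[None, None])

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding)

class GatedResidualBlock(nn.Module):
    """Residual block with the multiplicative gating unit tanh(a) * sigmoid(b)."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 masked ('B') convolutions, as described in Appendix A.
        self.conv = MaskedConv2d('B', channels, 2 * channels, kernel_size=1)

    def forward(self, x):
        a, b = self.conv(x).chunk(2, dim=1)
        return x + torch.tanh(a) * torch.sigmoid(b)

class SlimPixelCNN(nn.Module):
    """Slim density model in the spirit of Section 3.1 / Appendix A: a 7x7
    type-'A' masked convolution, two gated residual blocks with 16 feature
    maps, and 1x1 convolutions producing per-pixel logits over 8 grey levels."""
    def __init__(self, channels=16, num_bins=8):
        super().__init__()
        self.stem = MaskedConv2d('A', 1, channels, kernel_size=7, padding=3)
        self.blocks = nn.Sequential(GatedResidualBlock(channels),
                                    GatedResidualBlock(channels))
        self.head = nn.Sequential(MaskedConv2d('B', channels, 64, kernel_size=1),
                                  nn.ReLU(),       # nonlinearity assumed, not stated
                                  MaskedConv2d('B', 64, num_bins, kernel_size=1))

    def forward(self, x):   # x: (batch, 1, 42, 42), grey levels scaled to [0, 1]
        return self.head(self.blocks(self.stem(x)))

logits = SlimPixelCNN()(torch.rand(1, 1, 42, 42))   # -> (1, 8, 42, 42)
```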
Figure 2. Left: PixelCNN log loss on FREEWAY, when trained online, on a random permutation (single use of each frame) or on randomly drawn samples (with replacement, potentially using the same frame multiple times) from the state sequence. To simulate the effect of non-stationarity, the agent's policy changes every 4K updates. All training methods show qualitatively similar learning progress and stability. Middle: PixelCNN log loss over the first 6K training frames on PONG. Vertical dashed lines indicate episode ends. The coinciding loss spikes are the density model's 'surprise' upon observing the distinctive green frame that sometimes occurs at the episode start. Right: DQN-PixelCNN training performance on MONTEZUMA'S REVENGE as we vary learning rate and PG decay schedules.
In experiments comparing actual agent performance we empirically determined that in fact the constant learning rate 0.001, paired with a PG decay cn = c · n^(−1/2), obtains the best exploration results on hard exploration games like MONTEZUMA'S REVENGE, see Fig. 2 (right). We find the model to be robust across 1-2 orders of magnitude for the value of c, and informally determine c = 0.1 to be a sensible configuration for achieving good results on a broad range of Atari 2600 games (see also Section 7).

Regarding (c), it is hard to ensure learning-positiveness for a deep neural model, and a negative PG can occur whenever the optimizer 'overshoots' a local loss minimum. As a workaround, we threshold the PG value at 0. To summarize, the computed pseudo-count is

    N̂n(x) = ( exp( c · n^(−1/2) · (PGn(x))+ ) − 1 )^(−1).
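The summary formula translates directly into the intrinsic reward used below; a minimal sketch (our own transcription of the formula, with names and the n ≥ 1 assumption being ours):

```python
import math

PG_SCALE = 0.1   # the constant c; reported to be robust over 1-2 orders of magnitude

def intrinsic_reward(prediction_gain, num_model_updates):
    """Exploration bonus r+(x) = (N_hat_n(x))^(-1/2) using the summary formula,
    i.e. r+(x) = sqrt(exp(c * n^(-1/2) * (PG_n(x))_+) - 1).

    prediction_gain   : PG_n(x) = log rho'_n(x) - log rho_n(x) for the current frame
    num_model_updates : n >= 1, the number of density-model updates so far
    """
    pg = max(prediction_gain, 0.0)                         # threshold the PG at 0
    decayed = PG_SCALE * num_model_updates ** -0.5 * pg    # c * n^(-1/2) * (PG)_+
    return math.sqrt(math.expm1(decayed))                  # expm1(z) = exp(z) - 1
```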
4.1. DQN with PixelCNN Exploration Bonus

Our first set of experiments provides the PixelCNN exploration bonus to a DQN agent (Mnih et al., 2015)². At each agent step, the density model receives a single frame, with which it simultaneously updates its parameters and outputs the PG. We refer to this agent as DQN-PixelCNN.

The DQN-CTS agent we compare against is derived from the one in (Bellemare et al., 2016). For better comparability, it is trained in the same online fashion as DQN-PixelCNN, i.e. the PG is computed whenever we train the density model. By contrast, the original DQN-CTS queried the PG at the end of each episode.

Unless stated otherwise, we always use the mixed Monte Carlo update (MMC) for the intrinsically motivated agents³, but regular Q-Learning for the baseline DQN. Fig. 5 shows training curves of DQN compared to DQN-CTS and DQN-PixelCNN. On the famous MONTEZUMA'S REVENGE, both intrinsically motivated agents vastly outperform the baseline DQN. On other hard exploration games (PRIVATE EYE; or VENTURE, appendix Fig. 15), DQN-PixelCNN achieves state of the art results, substantially outperforming DQN and DQN-CTS. The other two games shown (ASTEROIDS, BERZERK) pose easier exploration problems, where the reward bonus should not provide large improvements and may have a negative effect by skewing the reward landscape. Here, DQN-PixelCNN behaves more gracefully and still outperforms DQN-CTS. We hypothesize this is due to a qualitative difference between the models, see Section 5.

Figure 5. DQN, DQN-CTS and DQN-PixelCNN on hard exploration games (top) and easier ones (bottom).

Figure 6. Improvements (in % of AUC) of DQN-PixelCNN and DQN-CTS over DQN in 57 Atari games. Annotations indicate the number of hard exploration games with positive (right) and negative (left) improvement, respectively.

² Unlike Bellemare et al. we use regular Q-Learning instead of Double Q-Learning (van Hasselt et al., 2016), as our early experiments showed no significant advantage of DoubleDQN with the PixelCNN-based exploration reward.

³ The use of MMC in a replay-based agent poses a minor complication, as the MC return is not available for replay until the end of an episode. For simplicity, in our implementation we disregard this detail and set the MC return to 0 for transitions from the most recent episode.
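The bookkeeping implied by footnote 3 amounts to computing discounted returns only for finished episodes; a self-contained sketch (our own illustration, not the paper's code):

```python
def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted returns G_t = sum_k gamma^k r_{t+k} for a finished episode."""
    returns, g = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Transitions of completed episodes carry their true MC return for the MMC;
# transitions of the episode still in progress are replayed with MC return 0,
# the simplification described in footnote 3.
finished_episode_returns = monte_carlo_returns([0.0, 0.0, 1.0, 0.0, 5.0])
current_episode_returns = [0.0, 0.0, 0.0]   # placeholder until the episode ends
```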
Overall PixelCNN provides the DQN agent with a larger advantage than CTS, and often accelerates or stabilizes training even when not affecting peak performance. Out of 57 Atari games, DQN-PixelCNN outperforms DQN-CTS in 52 games by maximum achieved score, and 51 by AUC (methodology in Appendix B). See Fig. 6 for a high level comparison (appendix Fig. 15 for full training graphs). The greatest gains from using either exploration bonus are observed in games categorized as hard exploration games in the 'taxonomy of exploration' in (Bellemare et al., 2016, reproduced in Appendix D), specifically in the most challenging sparse reward games (e.g. MONTEZUMA'S REVENGE, PRIVATE EYE, VENTURE).

4.2. A Multi-Step RL Agent with PixelCNN

Empirical practitioners know that techniques beneficial for one agent architecture can often be detrimental for a different algorithm. To demonstrate the wide applicability of the PixelCNN exploration bonus, we also evaluate it with the more recent Reactor agent⁴ (Gruslys et al., 2017). This replay-based actor-critic agent represents its policy and value function by a recurrent neural network and, crucially, uses the multi-step Retrace(λ) algorithm for policy evaluation, replacing the MMC we use in DQN-PixelCNN. To reduce the impact on this agent's computational efficiency, we sub-sample intrinsic rewards: we perform updates of the PixelCNN model and compute the reward bonus on (randomly chosen) 25% of all steps, leaving the agent's reward unchanged on other steps. We use the same PG decay schedule of 0.1 n^(−1/2), with n the number of model updates.

⁴ The exact agent variant is referred to as 'β-LOO' with β = 1.
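A minimal sketch of this intrinsic-reward sub-sampling (our own illustration; the 25% rate and the decay schedule come from the text, while the callable interface is an assumption):

```python
import math
import random

PG_SCALE = 0.1
SUBSAMPLE_FRACTION = 0.25      # bonus computed on ~25% of agent steps
num_model_updates = 0

def subsampled_bonus(update_model_and_get_pg):
    """On a random quarter of steps, train the density model and return the
    bonus; otherwise leave the agent's reward unchanged (bonus 0).

    update_model_and_get_pg: callable that trains the density model on the
    current frame and returns its prediction gain (assumed interface)."""
    global num_model_updates
    if random.random() >= SUBSAMPLE_FRACTION:
        return 0.0
    num_model_updates += 1
    pg = max(update_model_and_get_pg(), 0.0)
    return math.sqrt(math.expm1(PG_SCALE * num_model_updates ** -0.5 * pg))
```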
Figure 7. Reactor/Reactor-PixelCNN and DQN/DQN-PixelCNN training performance (averaged over 3 seeds).

Training curves for the Reactor/Reactor-PixelCNN agent compared to DQN/DQN-PixelCNN are shown in Fig. 7. The baseline Reactor agent is superior to the DQN agent, obtaining higher scores and learning faster in about 50 out of 57 games. It is further improved on a large fraction of games by the PixelCNN exploration reward, see Fig. 8 (full training graphs in appendix Fig. 16).

The effect of the exploration bonus is rather uniform, yielding improvements on a broad range of games. In particular, Reactor-PixelCNN enjoys better sample efficiency (in terms of area under the curve, AUC) than vanilla Reactor. We hypothesize that, like other policy gradient algorithms, Reactor generally suffers from weaker exploration than its value-based counterpart DQN. This aspect is much helped by the exploration bonus, boosting the agent's sample efficiency in many environments.

Figure 8. Improvements (in % of AUC) of Reactor-PixelCNN over Reactor in 57 Atari games.

However, on hard exploration games with sparse rewards, Reactor seems unable to make full use of the exploration bonus. We believe this is because, in very sparse settings, the propagation of reward information across long horizons becomes crucial. The MMC takes one extreme of this view, directly learning from the observed returns. The Retrace(λ) algorithm, on the other hand, has an effective horizon which depends on λ and, critically, the truncated importance sampling ratio. This ratio results in the discarding of trajectories which are off-policy, i.e. unlikely under the current policy. We hypothesize that the very goal of the Retrace(λ) algorithm to learn cautiously is what prevents it from taking full advantage of the exploration bonus!

5. Quality of the Density Model

PixelCNN can be expected to be more expressive and accurate than the less advanced CTS model, and indeed, samples generated after training are somewhat higher quality (Fig. 4). However, we are not using the generative function of the models when computing an exploration bonus, and a better generative model does not necessarily give rise to better probability estimates (Theis et al., 2016).

Figure 9. PG on MONTEZUMA'S REVENGE (log scale).
In Fig. 9 we compare the PG produced by the two models throughout 5K training steps. PixelCNN consistently produces PGs lower than CTS. More importantly, its PGs are smoother, exhibiting less variance between successive states, while showing more pronounced peaks at certain infrequent events. This yields a reward bonus that is less harmful in easy exploration games, while providing a

[Figure: DQN and DQN-PixelCNN, each with and without MC, on BANK HEIST, MS. PACMAN, H.E.R.O. and PRIVATE EYE.]

Figure 12. DQN-PixelCNN, hard exploration games, different PG scales c · n^(−1/2) · PGn (c = 0.1, 1, 10) (5 seeds each).

agent, whose reward function is dominated by the exploration bonus. For that we increase the PG scale, previously chosen conservatively to avoid adverse effects on easy exploration games.
A. PixelCNN Hyper-parameters

The PixelCNN model used in this paper is a lightweight variant of the Gated PixelCNN introduced in (van den Oord et al., 2016a). It consists of a 7 × 7 masked convolution, followed by two residual blocks with 1×1 masked convolutions with 16 feature planes, and another 1×1 masked convolution producing 64 feature planes, which are mapped by a final masked convolution to the output logits. Inputs are 42 × 42 greyscale images, with pixel values quantized to 8 bins.

The model is trained completely online, from the stream of Atari frames experienced by an agent. Optimization is performed with the (uncentered) RMSProp optimizer (Tieleman & Hinton, 2012) with momentum 0.9, decay 0.95 and epsilon 10^(−4).
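Assuming a PyTorch implementation (the paper does not name a framework), this optimizer configuration would roughly correspond to the sketch below; the constant learning rate of 0.001 is taken from Section 3 / Figure 2, and mapping the paper's "decay" to PyTorch's alpha parameter is our assumption:

```python
import torch

# Placeholder network standing in for the slim PixelCNN density model.
model = torch.nn.Conv2d(1, 8, kernel_size=7, padding=3)

optimizer = torch.optim.RMSprop(model.parameters(),
                                lr=1e-3,         # constant learning rate
                                alpha=0.95,      # 'decay' in the paper's terminology
                                eps=1e-4,
                                momentum=0.9,
                                centered=False)  # the uncentered variant
```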
B. Methodology

Unless otherwise stated, all agent performance graphs in this paper show the agent's training performance, measured as the undiscounted per-episode return, averaged over 1M environment frames per data point.

The algorithm-comparison graphs Fig. 6 and Fig. 8 show the relative improvement of one algorithm over another in terms of area-under-the-curve (AUC). A comparison by maximum achieved score would yield similar overall results, but underestimate the advantage in terms of learning speed (sample efficiency) and stability that the intrinsically motivated and MMC-based agents show over the baselines.

C. Convolutional CTS

In Section 4 we have seen that DQN-PixelCNN outperforms DQN-CTS in most of the 57 Atari games, by providing a more impactful exploration bonus in hard exploration games, as well as a more graceful (less harmful) one in games where the learning algorithm does not benefit from the additional curiosity signal. One may wonder whether this improvement is due to the generally more expressive and accurate density model PixelCNN, or simply its convolutional nature, which gives it an advantage in generalization and sample efficiency over a model that represents pixel probabilities in a completely location-dependent way.

To answer this question, we developed a convolutional variant of the CTS model. This model has a single set of parameters conditioning a pixel's value on its predecessors, shared across all pixel locations, instead of the location-dependent parameters in the regular CTS. In Fig. 14 we contrast the performance of DQN, DQN-MC, DQN-CTS, DQN-ConvCTS and DQN-PixelCNN on 6 example games.

We first consider dense reward games like Q*BERT and ZAXXON, where most improvement comes from the use of the MMC, and the exploration bonus hurts performance. We find that in fact convolutional CTS behaves fairly similarly to PixelCNN, leaving agent performance unaffected, whereas regular CTS causes the agent to train more slowly or reach an earlier performance plateau. On the sparse reward games (GRAVITAR, PRIVATE EYE, VENTURE), however, convolutional CTS shows itself to be as inferior to PixelCNN as the vanilla CTS variant, failing to achieve the significant improvements over the baseline agents presented in this paper.

We conclude that while the convolutional aspect plays a role in the 'softer' nature of the PixelCNN model compared to its CTS counterpart, it alone is insufficient to explain the massive exploration boost that the PixelCNN-derived reward provides to the DQN agent. The more advanced model's accuracy advantage translates into a more targeted and useful curiosity signal for the agent, which distinguishes novel from well-explored states more clearly and allows for more effective exploration.

D. The Hardest Exploration Games

Table 1 reproduces Bellemare et al. (2016)'s taxonomy of games available through the ALE according to their exploration difficulty. "Human-Optimal" refers to games where DQN-like agents achieve human-level or higher performance; "Score Exploit" refers to games where agents find ways to achieve superhuman scores, without necessarily playing the game as a human would. "Sparse" and "Dense" rewards are qualitative descriptors of the game's reward structure. See the original source for additional details.

Table 2 compares previously published results on the 7 hard exploration, sparse reward Atari 2600 games with results obtained by DQN-CTS and DQN-PixelCNN.
Figure 14. Comparison of DQN, DQN-CTS, DQN-ConvCTS and DQN-PixelCNN training performance.
Table 1. A rough taxonomy of Atari 2600 games according to their exploration difficulty.
Table 2. Comparison with previously published results on hard exploration, sparse reward games. The compared agents are DQN (Mnih
et al., 2015), A3C-CTS (“A3C+” in (Bellemare et al., 2016)), Prioritized Dueling DQN (Wang et al., 2016), and the basic versions of
DQN-CTS and DQN-PixelCNN from Section 4. For our agents we report the maximum scores achieved over 150M frames of training,
averaged over 3 seeds.
Figure 15. Training curves of DQN, DQN-CTS and DQN-PixelCNN across all 57 Atari games.
Figure 16. Training curves of DQN, DQN-PixelCNN, Reactor and Reactor-PixelCNN across all 57 Atari games.
Figure 17. Training curves of DQN and DQN-PixelCNN, each with and without MMC, across all 57 Atari games.