0% found this document useful (0 votes)

189 views406 pages

Overview of Deep Reinforcement Learning

Uploaded by

khawla

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

189 views406 pages

Overview of Deep Reinforcement Learning

Uploaded by

khawla

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 406

Aske Plaat

Deep Reinforcement Learning

arXiv:2201.02135v1 [cs.AI] 4 Jan 2022

January 7, 2022

Springer Nature
v

This is a preprint of the following work:

Aske Plaat,
Deep Reinforcement Learning,
2022,
Springer Nature,
reproduced with permission of Springer Nature Singapore Pte Ltd.
The final authenticated version is available online at: http://dx.doi.org/
Preface

Deep reinforcement learning has gathered much attention recently. Impressive

results were achieved in activities as diverse as autonomous driving, game playing,
molecular recombination, and robotics. In all these fields, computer programs have
taught themselves to solve difficult problems. They have learned to fly model
helicopters and perform aerobatic manoeuvers such as loops and rolls. In some
applications they have even become better than the best humans, such as in Atari,
Go, poker and StarCraft.
The way in which deep reinforcement learning explores complex environments
reminds us of how children learn, by playfully trying out things, getting feedback,
and trying again. The computer seems to truly possess aspects of human learning;
this goes to the heart of the dream of artificial intelligence.
The successes in research have not gone unnoticed by educators, and universities
have started to offer courses on the subject. The aim of this book is to provide a
comprehensive overview of the field of deep reinforcement learning. The book is
written for graduate students of artificial intelligence, and for researchers and prac-
titioners who wish to better understand deep reinforcement learning methods and
their challenges. We assume an undergraduate-level of understanding of computer
science and artificial intelligence; the programming language of this book is Python.
We describe the foundations, the algorithms and the applications of deep rein-
forcement learning. We cover the established model-free and model-based methods
that form the basis of the field. Developments go quickly, and we also cover advanced
topics: deep multi-agent reinforcement learning, deep hierarchical reinforcement
learning, and deep meta learning.
We hope that learning about deep reinforcement learning will give you as much
joy as the many researchers experienced when they developed their algorithms,
finally got them to work, and saw them learn!

vii
viii

Acknowledgments

This book benefited from the help of many friends. First of all, I thank everyone at
the Leiden Institute of Advanced Computer Science, for creating such a fun and
vibrant environment to work in.
Many people contributed to this book. Some material is based on the book
that we used in our previous reinforcement learning course and on lecture notes
on policy-based methods written by Thomas Moerland. Thomas also provided
invaluable critique on an earlier draft of the book. Furthermore, as this book was
being prepared, we worked on survey articles on deep model-based reinforcement
learning, deep meta-learning, and deep multi-agent reinforcement learning. I thank
Mike Preuss, Walter Kosters, Mike Huisman, Jan van Rijn, Annie Wong, Anna
Kononova, and Thomas Bäck, the co-authors on these articles.
I thank all members of the Leiden reinforcement learning community for their
input and enthusiasm. I thank especially Thomas Moerland, Mike Preuss, Matthias
Müller-Brockhausen, Mike Huisman, Hui Wang, and Zhao Yang, for their help
with the course for which this book is written. I thank Wojtek Kowalczyk for
insightful discussions on deep supervised learning, and Walter Kosters for his views
on combinatorial search, as well as for his neverending sense of humor.
A very special thank you goes to Thomas Bäck, for our many discussions on
science, the universe, and everything (including, especially, evolution). Without
you, this effort would not have been possible.
This book is a result of the graduate course on reinforcement learning that we
teach in Leiden. I thank all students of this course, past, present, and future, for
their wonderful enthusiasm, sharp questions, and many suggestions. This book was
written for you and by you!
Finally, I thank Saskia, Isabel, Rosalin, Lily, and Dahlia, for being who they are,
for giving feedback and letting me learn, and for their boundless love.

Leiden,
December 2021 Aske Plaat
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 What is Deep Reinforcement Learning? . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Three Machine Learning Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Overview of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2 Tabular Value-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . 23

2.1 Sequential Decision Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Tabular Value-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Classic Gym Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3 Deep Value-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . 63

3.1 Large, High-Dimensional, Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2 Deep Value-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.3 Atari 2600 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

4 Policy-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

4.1 Continuous Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2 Policy-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.3 Locomotion and Visuo-Motor Environments . . . . . . . . . . . . . . . . . . . . 111
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

5 Model-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5.1 Dynamics Models of High-Dimensional Problems . . . . . . . . . . . . . . . 122
5.2 Learning and Planning Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.3 High-Dimensional Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

ix
x CONTENTS

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

6 Two-Agent Self-Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

6.1 Two-Agent Zero-Sum Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.2 Tabula Rasa Self-Play Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.3 Self-Play Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

7 Multi-Agent Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

7.1 Multi-Agent Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7.2 Multi-Agent Reinforcement Learning Agents . . . . . . . . . . . . . . . . . . . . 202
7.3 Multi-Agent Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

8 Hierarchical Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

8.1 Granularity of the Structure of Problems . . . . . . . . . . . . . . . . . . . . . . . 227
8.2 Divide and Conquer for Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
8.3 Hierarchical Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

10 Further Developments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271

10.1 Development of Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . 271
10.2 Main Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
10.3 The Future of Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

A Mathematical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

A.1 Sets and Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
A.2 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
A.3 Derivative of an Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
A.4 Bellman Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295

B Deep Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297

B.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
B.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
B.3 Datasets and Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
CONTENTS xi

C Deep Reinforcement Learning Suites . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331

C.1 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
C.2 Agent Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
C.3 Deep Learning Suites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335

Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379

Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
xii CONTENTS
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 What is Deep Reinforcement Learning? . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.5 Four Related Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.5.1 Psychology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.5.2 Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.5.3 Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.5.4 Biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Three Machine Learning Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3 Overview of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3.1 Prerequisite Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3.2 Structure of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2 Tabular Value-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . 23

2.1 Sequential Decision Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Tabular Value-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.1 Agent and Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.2 Markov Decision Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.2.1 State 𝑆 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.2.2 Action 𝐴 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.2.3 Transition 𝑇𝑎 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.2.4 Reward 𝑅 𝑎 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.2.5 Discount Factor 𝛾 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.2.6 Policy 𝜋 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.3 MDP Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

xiii
xiv Contents

2.2.3.1 Trace 𝜏 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.2.3.2 State Value 𝑉 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2.3.3 State-Action Value 𝑄 . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.2.3.4 Reinforcement Learning Objective . . . . . . . . . . . . . . 38
2.2.3.5 Bellman Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.2.4 MDP Solution Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.2.4.1 Hands On: Value Iteration in Gym . . . . . . . . . . . . . . . 41
2.2.4.2 Model-Free Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.2.4.3 Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.2.4.4 Off-Policy Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.2.4.5 Hands On: Q-learning on Taxi . . . . . . . . . . . . . . . . . . 52
2.3 Classic Gym Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.3.1 Mountain Car and Cartpole . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.3.2 Path Planning and Board Games . . . . . . . . . . . . . . . . . . . . . . . . 56
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3 Deep Value-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . 63

3.1 Large, High-Dimensional, Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.1.1 Atari Arcade Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.1.2 Real-Time Strategy and Video Games . . . . . . . . . . . . . . . . . . . . 68
3.2 Deep Value-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.2.1 Generalization of Large Problems with Deep Learning . . . . 69
3.2.1.1 Minimizing Supervised Target Loss . . . . . . . . . . . . . 69
3.2.1.2 Bootstrapping Q-Values . . . . . . . . . . . . . . . . . . . . . . . 70
3.2.1.3 Deep Reinforcement Learning Target-Error . . . . . 71
3.2.2 Three Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.2.2.1 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.2.2.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.2.2.3 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2.3 Stable Deep Value-Based Learning . . . . . . . . . . . . . . . . . . . . . . 74
3.2.3.1 Decorrelating States . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.2.3.2 Infrequent Updates of Target Weights . . . . . . . . . . . 76
3.2.3.3 Hands On: DQN and Breakout Gym Example . . . . . 76
3.2.4 Improving Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.2.4.1 Overestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.2.4.2 Distributional Methods . . . . . . . . . . . . . . . . . . . . . . . . 83
3.3 Atari 2600 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.3.1 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.3.2 Benchmarking Atari . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Contents xv

4 Policy-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

4.1 Continuous Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1.1 Continuous Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1.2 Stochastic Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1.3 Environments: Gym and MuJoCo . . . . . . . . . . . . . . . . . . . . . . . 92
4.1.3.1 Robotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.1.3.2 Physics Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.1.3.3 Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.2 Policy-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.2.1 Policy-Based Algorithm: REINFORCE . . . . . . . . . . . . . . . . . . . 95
4.2.2 Bias-Variance trade-off in Policy-Based Methods . . . . . . . . . 98
4.2.3 Actor Critic Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.2.4 Baseline Subtraction with Advantage Function . . . . . . . . . . . 101
4.2.5 Trust Region Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.2.6 Entropy and Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.2.7 Deterministic Policy Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.2.8 Hands On: PPO and DDPG MuJoCo Examples . . . . . . . . . . . . . 110
4.3 Locomotion and Visuo-Motor Environments . . . . . . . . . . . . . . . . . . . . 111
4.3.1 Locomotion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.3.2 Visuo-Motor Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.3.3 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

5 Model-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5.1 Dynamics Models of High-Dimensional Problems . . . . . . . . . . . . . . . 122
5.2 Learning and Planning Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.2.1 Learning the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.2.1.1 Modeling Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.2.1.2 Latent Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.2.2 Planning with the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.2.2.1 Trajectory Rollouts and Model-Predictive Control 132
5.2.2.2 End-to-end Learning and Planning-by-Network . 133
5.3 High-Dimensional Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.3.1 Overview of Model-Based Experiments . . . . . . . . . . . . . . . . . . 137
5.3.2 Small Navigation Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.3.3 Robotic Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.3.4 Games Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.3.5 Hands On: PlaNet Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
xvi Contents

6 Two-Agent Self-Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

6.1 Two-Agent Zero-Sum Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.1.1 The Difficulty of Playing Go . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.1.2 AlphaGo Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.2 Tabula Rasa Self-Play Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.2.1 Move-Level Self-Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.2.1.1 Minimax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.2.1.2 Monte Carlo Tree Search . . . . . . . . . . . . . . . . . . . . . . 164
6.2.2 Example-Level Self-Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.2.2.1 Policy and Value Network . . . . . . . . . . . . . . . . . . . . . 172
6.2.2.2 Stability and Exploration . . . . . . . . . . . . . . . . . . . . . . 173
6.2.3 Tournament-Level Self-Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.2.3.1 Self-Play Curriculum Learning . . . . . . . . . . . . . . . . . 175
6.2.3.2 Supervised Curriculum Learning . . . . . . . . . . . . . . . 176
6.3 Self-Play Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.3.1 How to Design a World Class Go Program? . . . . . . . . . . . . . . 179
6.3.2 AlphaGo Zero Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.3.3 AlphaZero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.3.4 Open Self-Play Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.3.5 Hands On: Hex in Polygames Example . . . . . . . . . . . . . . . . . . . . 184
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

7 Multi-Agent Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

7.1 Multi-Agent Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7.1.1 Competitive Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7.1.2 Cooperative Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.1.3 Mixed Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.1.4 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
7.1.4.1 Partial Observability . . . . . . . . . . . . . . . . . . . . . . . . . . 201
7.1.4.2 Nonstationary Environments . . . . . . . . . . . . . . . . . . 201
7.1.4.3 Large State Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
7.2 Multi-Agent Reinforcement Learning Agents . . . . . . . . . . . . . . . . . . . . 202
7.2.1 Competitive Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
7.2.1.1 Counterfactual Regret Minimization . . . . . . . . . . . . 203
7.2.1.2 Deep Counterfactual Regret Minimization . . . . . . . 204
7.2.2 Cooperative Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
7.2.2.1 Centralized Training/Decentralized Execution . . . 206
7.2.2.2 Opponent Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
7.2.2.3 Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
7.2.2.4 Psychology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
7.2.3 Mixed Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
7.2.3.1 Evolutionary Algorithms . . . . . . . . . . . . . . . . . . . . . . 209
7.2.3.2 Swarm Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
7.2.3.3 Population-Based Training . . . . . . . . . . . . . . . . . . . . . 212
Contents xvii

7.2.3.4 Self-Play Leagues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

7.3 Multi-Agent Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
7.3.1 Competitive Behavior: Poker . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
7.3.2 Cooperative Behavior: Hide and Seek . . . . . . . . . . . . . . . . . . . . 216
7.3.3 Mixed Behavior: Capture the Flag and StarCraft . . . . . . . . . . 218
7.3.4 Hands On: Hide and Seek in the Gym Example . . . . . . . . . . . . 220
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

8 Hierarchical Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

8.1 Granularity of the Structure of Problems . . . . . . . . . . . . . . . . . . . . . . . 227
8.1.1 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
8.1.2 Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
8.2 Divide and Conquer for Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
8.2.1 The Options Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
8.2.2 Finding Subgoals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
8.2.3 Overview of Hierarchical Algorithms . . . . . . . . . . . . . . . . . . . . 231
8.2.3.1 Tabular Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
8.2.3.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
8.3 Hierarchical Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
8.3.1 Four Rooms and Robot Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
8.3.2 Montezuma’s Revenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
8.3.3 Multi-Agent Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
8.3.4 Hands On: Hierarchical Actor Citic Example . . . . . . . . . . . . . . 238
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

9 Meta-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
9.1 Learning to Learn Related Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
9.2 Transfer Learning and Meta-Learning Agents . . . . . . . . . . . . . . . . . . . 247
9.2.1 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
9.2.1.1 Task Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
9.2.1.2 Pretraining and Finetuning . . . . . . . . . . . . . . . . . . . . 249
9.2.1.3 Hands-on: Pretraining Example . . . . . . . . . . . . . . . . . 249
9.2.1.4 Multi-task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
9.2.1.5 Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
9.2.2 Meta-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
9.2.2.1 Evaluating Few-Shot Learning Problems . . . . . . . . 253
9.2.2.2 Deep Meta-Learning Algorithms . . . . . . . . . . . . . . . 254
9.2.2.3 Recurrent Meta-Learning . . . . . . . . . . . . . . . . . . . . . . 256
9.2.2.4 Model-Agnostic Meta-Learning . . . . . . . . . . . . . . . . 257
9.2.2.5 Hyperparameter Optimization . . . . . . . . . . . . . . . . . 259
9.2.2.6 Meta-Learning and Curriculum Learning . . . . . . . . 260
9.2.2.7 From Few-Shot to Zero-Shot Learning . . . . . . . . . . 260
9.3 Meta-Learning Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
xviii Contents

9.3.1 Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262

9.3.2 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
9.3.3 Meta-Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
9.3.4 Meta-World . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
9.3.5 Alchemy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
9.3.6 Hands-on: Meta-World Example . . . . . . . . . . . . . . . . . . . . . . . . . 266
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

10 Further Developments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271

10.1 Development of Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . 271
10.1.1 Tabular Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
10.1.2 Model-free Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
10.1.3 Multi-Agent Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
10.1.4 Evolution of Reinforcement Learning . . . . . . . . . . . . . . . . . . . . 273
10.2 Main Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
10.2.1 Latent Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
10.2.2 Self-Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
10.2.3 Hierarchical Reinforcement Learning . . . . . . . . . . . . . . . . . . . . 275
10.2.4 Transfer Learning and Meta-Learning . . . . . . . . . . . . . . . . . . . 276
10.2.5 Population-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
10.2.6 Exploration and Intrinsic Motivation . . . . . . . . . . . . . . . . . . . . 277
10.2.7 Explainable AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
10.2.8 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
10.3 The Future of Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

A Mathematical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

A.1 Sets and Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
A.1.1 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
A.1.2 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
A.2 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
A.2.1 Discrete Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . 286
A.2.2 Continuous Probability Distributions . . . . . . . . . . . . . . . . . . . . 287
A.2.3 Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
A.2.4 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
A.2.4.1 Expectation of a Random Variable . . . . . . . . . . . . . . 290
A.2.4.2 Expectation of a Function of a Random Variable . 291
A.2.5 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
A.2.5.1 Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
A.2.5.2 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
A.2.5.3 Cross-entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
A.2.5.4 Kullback-Leibler Divergence . . . . . . . . . . . . . . . . . . . 293
A.3 Derivative of an Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
A.4 Bellman Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
Contents xix

B Deep Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297

B.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
B.1.1 Training Set and Test Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
B.1.2 Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
B.1.3 Overfitting and the Bias-Variance Trade-Off . . . . . . . . . . . . . . 300
B.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
B.2.1 Weights, Neurons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
B.2.2 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
B.2.3 End-to-end Feature Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
B.2.4 Convolutional Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
B.2.5 Recurrent Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
B.2.6 More Network Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
B.2.7 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
B.3 Datasets and Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
B.3.1 Keras, TensorFlow, PyTorch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
B.3.2 MNIST and ImageNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
B.3.3 GPU Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
B.3.4 Hands On: Classification Example . . . . . . . . . . . . . . . . . . . . . . . . 325
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329

C Deep Reinforcement Learning Suites . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335

Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379

Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
Chapter 1
Introduction

Deep reinforcement learning studies how we learn to solve complex problems,

problems that require us to find a solution to a sequence of decisions in high
dimensional states. To make bread, we must use the right flour, add some salt, yeast
and sugar, prepare the dough (not too dry and not too wet), pre-heat the oven to
the right temperature, and bake the bread (but not too long); to win a ballroom
dancing contest we must find the right partner, learn to dance, practice, and beat
the competition; to win in chess we must study, practice, and make all the right
moves.

1.1 What is Deep Reinforcement Learning?

Deep reinforcement learning is the combination of deep learning and reinforcement

learning.
The goal of deep reinforcement learning is to learn optimal actions that maximize
our reward for all states that our environment can be in (the bakery, the dance
hall, the chess board). We do this by interacting with complex, high-dimensional
environments, trying out actions, and learning from the feedback.
The field of deep learning is about approximating functions in high-dimensional
problems; problems that are so complex that tabular methods cannot find exact
solutions anymore. Deep learning uses deep neural networks to find approximations
for large, complex, high-dimensional environments, such as in image and speech
recognition. The field has made impressive progress; computers can now recog-
nize pedestrians in a sequence of images (to avoid running over them), and can
understand sentences such as: “What is the weather going to be like tomorrow?”
The field of reinforcement learning is about learning from feedback; it learns by
trial and error. Reinforcement learning does not need a pre-existing dataset to train
on; it chooses its own actions, and learns from the feedback that an environment
provides. It stands to reason that in this process of trial and error, our agent will
make mistakes (the fire extinguisher is essential to survive the process of learning

1
2 1 Introduction

Low-Dimensional States High-Dimensional States

Static Dataset classic supervised learning deep supervised learning
Agent/Environment Interaction tabular reinforcement learning deep reinforcement learning
Table 1.1 The Constituents of Deep Reinforcement Learning

to bake bread). The field of reinforcement learning is all about learning from success
as well as from mistakes.
In recent years the two fields of deep and reinforcement learning have come
together, and have yielded new algorithms, that are able to approximate high-
dimensional problems by feedback on their actions. Deep learning has brought new
methods and new successes, with advances in policy-based methods, in model-
based approaches, in transfer learning, in hierarchical reinforcement learning, and
in multi-agent learning.
The fields also exist separately, as deep supervised learning and as tabular re-
inforcement learning (see Table 1.1). The aim of deep supervised learning is to
generalize and approximate complex, high-dimensional, functions from pre-existing
datasets, without interaction; Appendix B discusses deep supervised learning. The
aim of tabular reinforcement learning is to learn by interaction in simpler, low-
dimensional, environments such as Grid worlds; Chap. 2 discusses tabular reinforce-
ment learning.
Let us have a closer look at the two fields.

1.1.1 Deep Learning

Classic machine learning algorithms learn a predictive model on data, using methods
such as linear regression, decision trees, random forests, support vector machines,
and artificial neural networks. The models aim to generalize, to make predictions.
Mathematically speaking, machine learning aims to approximate a function from
data.
Traditionally, when computers were slow, the neural networks that were used
consisted of a few layers of fully connected neurons, and did not perform excep-
tionally well. This changed with the advent of deep learning and faster computers.
Deep neural networks now consist of many layers of neurons and use different
types of connections.1 Deep networks and deep learning have taken the accuracy of
certain important machine learning tasks to a new level, and have allowed machine
learning to be applied to complex, high-dimensional, problems, such as recognizing
cats and dogs in high-resolution (mega-pixel) images.
Deep learning has allowed machine learning to be applied to day-to-day tasks
such as the face-recognition and speech-recognition that we use in our smartphones.
Deep learning allows high-dimensional problems to be solved in real-time.
1 Where many means an input layer, an output layer, and more than one hidden layer in between.
1.1 What is Deep Reinforcement Learning? 3

1.1.2 Reinforcement Learning

Let us look more deeply at reinforcement learning, to see what it means to learn
from our own actions.
Reinforcement learning is a field in which an agent learns by interacting with
an environment. In supervised learning we need pre-existing datasets of labeled
examples to approximate a function; reinforcement learning only needs an environ-
ment that provides feedback signals for actions that the agent is trying out. This
requirement is easier to fulfill, allowing reinforcement learning to be applicable to
more situations than supervised learning.
Reinforcement learning agents generate, by their actions, their own on-the-fly
data, through the environment’s rewards. Agents can choose which actions to
learn from; reinforcement learning is a form of active learning. In this sense, our
agents are like children, that, through playing and exploring, teach themselves a
certain task. This level of autonomy is one of the aspects that attracts researchers
to the field. The reinforcement learning agent chooses which action to perform—
which hypothesis to test, and adjusts its knowledge of what works, building up a
policy of actions that are to be performed in the different states of the world that it
has encountered. (This freedom is also what makes reinforcement learning hard,
because when you are allowed to choose your own examples, it is all too easy to
stay in your comfort zone, stuck in a positive reinforcement bubble, believing you
are doing great, but learning very little of the world in reality.)

1.1.3 Deep Reinforcement Learning

Deep reinforcement learning combines methods for learning high-dimensional prob-

lems with reinforcement learning, allowing high-dimensional, interactive learning.
A major reason for the interest in deep reinforcement learning is that it works well
on current computers, and does so in seemingly different applications. For example,
in Chap. 3 we will see how deep reinforcement learning can learn eye-hand coordi-
nation tasks to play 1980s video games, in Chap. 4 how it teaches a simulated robot
cheetah to jump, and in Chap. 6 how it can teach itself to play complex games of
strategy to the extent that world champions are beaten.
Let us have a closer look at the kinds of applications on which deep reinforcement
learning does so well.

1.1.4 Applications

In its most basic form, reinforcement learning is a way to teach an agent to operate in
the world. As a child learns to walk from actions and feedback, so do reinforcement
learning agents learn from actions and feedback. Deep reinforcement learning can
4 1 Introduction

learn to solve large and complex decision problems—problems whose solution is

not yet known, but for which an approximating trial-and-error mechanism exists
that can learn a solution out of repeated interactions with the problem. This may
sound a bit cryptical and convoluted, but approximation and trial and error are
something that we do in real life all the time. Generalization and approximation
allow us to infer patterns or rules from examples. Trial and error is a method by
which humans learn how to deal with things that are unfamiliar to them (“What
happens if I press this button? Oh. Oops.” Or: “What happens if I do not put my leg
before my other leg while moving forward? Oh. Ouch.”).

Sequential Decision Problems

Learning to operate in the world is a high level goal; we can be more specific.
Reinforcement learning is about the agent’s behavior. Reinforcement learning can
find solutions for sequential decision problems, or optimal control problems, as
they are known in engineering. There are many situations in the real world where,
in order to reach a goal, a sequence of decisions must be made. Whether it is baking
a cake, building a house, or playing a card game; a sequence of decisions has to
be made. Reinforcement learning is an efficient way to learn to solve sequential
decision problems.
Many real world problems can be modeled as a sequence of decisions [543]. For
example, in autonomous driving, an agent is faced with questions of speed control,
finding drivable areas, and, most importantly, avoiding collisions. In healthcare,
treatment plans contain many sequential decisions, and factoring the effects of
delayed treatment can be studied. In customer centers, natural language process-
ing can help improve chatbot dialogue, question answering, and even machine
translation. In marketing and communication, recommender systems recommend
news, personalize suggestions, deliver notifications to user, or otherwise optimize
the product experience. In trading and finance, systems decide to hold, buy or
sell financial titles, in order to optimize future reward. In politics and governance,
the effects of policies can be simulated as a sequence of decisions before they are
implemented. In mathematics and entertainment, playing board games, card games,
and strategy games consists of a sequence of decisions. In computational creativity,
making a painting requires a sequence of esthetic decisions. In industrial robotics
and engineering, the grasping of items and the manipulation of materials consists of
a sequence of decisions. In chemical manufacturing, the optimization of production
processes consists of many decision steps, that influence the yield and quality of
the product. Finally, in energy grids, the efficient and safe distribution of energy
can be modeled as a sequential decision problem.
In all these situations, we must make a sequence of decisions. In all these situa-
tions, taking the wrong decision can be very costly.
The algorithmic research on sequential decision making has focused on two
types of applications: (1) robotic problems and (2) games. Let us have a closer look
at these two domains, starting with robotics.
1.1 What is Deep Reinforcement Learning? 5

Fig. 1.1 Robot Flipping Pancakes [422]

Fig. 1.2 Aerobatic Model Helicopter [3]

Robotics

In principle, all actions that a robot should take can be pre-programmed step-by-step
by a programmer in meticulous detail. In highly controlled environments, such as
a welding robot in a car factory, this can conceivably work, although any small
change or any new task requires reprogramming the robot.
It is surprisingly hard to manually program a robot to perform a complex task.
Humans are not aware of their own operational knowledge, such as what “voltages”
we put on which muscles when we pick up a cup. It is much easier to define a desired
goal state, and let the system find the complicated solution by itself. Furthermore,
in environments that are only slightly challenging, when the robot must be able to
respond more flexibly to different conditions, an adaptive program is needed.
It will be no surprise that the application area of robotics is an important driver
for machine learning research, and robotics researchers turned early on to finding
methods by which the robots could teach themselves certain behavior.
The literature on robotics experiments is varied and rich. A robot can teach itself
how to navigate a maze, how to perform manipulation tasks, and how to learn
6 1 Introduction

Fig. 1.3 Chess

Fig. 1.4 Go

locomotion tasks. Research into adaptive robotics has made quite some progress.
For example, one of the recent achievements involves flipping pancakes [422] and
flying an aerobatic model helicopter [2, 3]; see Figs. 1.1 and 1.2. Frequently, learning
tasks are combined with computer vision, where a robot has to learn by visually
interpreting the consequences of its own actions.

Games

Let us now turn to games. Puzzles and games have been used from the earliest days
to study aspects of intelligent behavior. Indeed, before computers were powerful
enough to execute chess programs, in the days of Shannon and Turing, paper
designs were made, in the hope that understanding chess would teach us something
about the nature of intelligence [693, 787].
Games allow researchers to limit the scope of their studies, to focus on intelli-
gent decision making in a limited environment, without having to master the full
1.1 What is Deep Reinforcement Learning? 7

Fig. 1.5 Pac-Man [71]

Fig. 1.6 StarCraft [812]

complexity of the real world. In addition to board games such as chess and Go,
video games are being used extensively to test intelligent methods in computers.
Examples are Arcade-style games such as Pac-Man [522] and multi-player strategy
games such as StarCraft [812]. See Figs. 1.3–1.6.

1.1.5 Four Related Fields

Deep reinforcement learning is a rich field, that has existed long before the artificial
intelligence endeavour had started, as a part of biology, psychology, and educa-
tion [86, 388, 742]. In artificial intelligence it has become one of the three main
categories of machine learning, the other two being supervised and unsupervised
learning [93]. This book is a book of algorithms that are inspired by topics from
the social sciences. Although the rest of the book will be about these algorithms, it
is interesting to briefly discuss the links of deep reinforcement learning to human
8 1 Introduction

Fig. 1.7 Classical Conditioning

and animal learning. We will introduce the four scientific disciplines that have a
profound influence on deep reinforcement learning.

1.1.5.1 Psychology

In psychology, reinforcement learning is also known as learning by conditioning or

as operant conditioning. Figure 1.7 illustrates the folk psychological idea of how a
dog can be conditioned. A natural reaction to food is that a dog salivates. By ringing
a bell whenever the dog is given food, the dog learns to associate the sound with
food, and after enough trials, the dog starts salivating as soon as it hears the bell,
presumably in anticipation of the food, whether it is there or not.
The behavioral scientists Pavlov (1849–1936) and Skinner (1904–1990) are well-
known for their work on conditioning. Phrases such as Pavlov-reaction have entered
our everyday language, and various jokes about conditioning exist (see, for example,
Fig. 1.8). Psychological research into learning is one of the main influences on
reinforcement learning as we know it in artificial intelligence.

1.1.5.2 Mathematics

Mathematical logic is another foundation of deep reinforcement learning. Discrete

optimization and graph theory are of great importance for the formalization of
reinforcement learning, as we will see in Sect. 2.2.2 on Markov decision processes.
1.1 What is Deep Reinforcement Learning? 9

Fig. 1.8 Who is Conditioning Whom?

Mathematical formalizations have enabled the development of efficient planning

and optimization algorithms, that are at the core of current progress.
Planning and optimization are an important part of deep reinforcement learning.
They are also related to the field of operations research, although there the emphasis
is on (non-sequential) combinatorial optimization problems. In AI, planning and
optimization are used as building blocks for creating learning systems for sequential,
high-dimensional, problems that can include visual, textual or auditory input.
The field of symbolic reasoning is based on logic, it is one of the earliest success
stories in artificial intelligence. Out of work in symbolic reasoning came heuristic
search [592], expert systems, and theorem proving systems. Well-known systems
are the STRIPS planner [241], the Mathematica computer algebra system [119], the
logic programming language PROLOG [157], and also systems such as SPARQL for
semantic (web) reasoning [24, 81].
Symbolic AI focuses on reasoning in discrete domains, such as decision trees,
planning, and games of strategy, such as chess and checkers. Symbolic AI has driven
success in methods to search the web, to power online social networks, and to power
online commerce. These highly successful technologies are the basis of much of
our modern society and economy. In 2011 the highest recognition in computer
science, the Turing award, was awarded to Judea Pearl for work in causal reasoning
(Fig. 1.9).2 Pearl later published an influential book to popularize the field [593].
Another area of mathematics that has played a large role in deep reinforcement
learning is the field of continuous (numerical) optimization. Continuous methods are
important, for example, in efficent gradient descent and backpropagation methods
that are at the heart of current deep learning algorithms.
2Joining a long list of AI researchers that have been honored earlier with a Turing award: Minsky,
McCarthy, Newell, Simon, Feigenbaum and Reddy.
10 1 Introduction

Fig. 1.9 Turing-award winner Judea Pearl

Fig. 1.10 Optimal Control of Dynamical Systems at Work

1.1.5.3 Engineering

In engineering, the field of reinforcement learning is better known as optimal control.

The theory of optimal control of dynamical systems was developed by Richard
Bellman and Lev Pontryagin [85]. Optimal control theory originally focused on
dynamical systems, and the technology and methods relate to continuous optimiza-
tion methods such as used in robotics (see Fig. 1.10 for an illustration of optimal
control at work in docking two space vehicles). Optimal control theory is of central
importance to many problems in engineering.
To this day reinforcement learning and optimal control use a different termi-
nology and notation. States and actions are denoted as 𝑠 and 𝑎 in state-oriented
reinforcement learning, where the engineering world of optimal control uses 𝑥 and
𝑢. In this book the former notation is used.
1.1 What is Deep Reinforcement Learning? 11

Fig. 1.11 Turing-award winners Geoffrey Hinton, Yann LeCun, and Yoshua Bengio

1.1.5.4 Biology

Biology has a profound influence on computer science. Many nature-inspired opti-

mization algorithms have been developed in artificial intelligence. An important
nature-inspired school of thought is connectionist AI.
Mathematical logic and engineering approach intelligence as a top-down de-
ductive process; observable effects in the real world follow from the application of
theories and the laws of nature, and intelligence follows deductively from theory. In
contrast, connectionism approaches intelligence in a bottom-up fashion. Intelligence
emerges out of many low level interactions. Intelligence follows inductively from
practice. Intelligence is embodied: the bees in bee hives, the ants in ant colonies, and
the neurons in the brain all interact, and out of the connections and interactions
arises behavior that we recognize as intelligent [97].
Examples of the connectionist approach to intelligence are nature-inspired al-
gorithms such as Ant colony optimization [209], swarm intelligence [405, 97],
evolutionary algorithms [43, 251, 346], robotic intelligence [109], and, last but not
least, artificial neural networks and deep learning [317, 458, 279].
It should be noted that both the symbolic and the connectionist school of AI have
been very successful. After the enormous economic impact of search and symbolic
AI (Google, Facebook, Amazon, Netflix), much of the interest in AI in the last two
decades has been inspired by the success of the connectionist approach in computer
speech and vision. In 2018 the Turing award was awarded to three key researchers
in deep learning: Bengio, Hinton, and LeCun (Fig. 1.11). Their most famous paper
on deep learning may well be [458].
12 1 Introduction

1.2 Three Machine Learning Paradigms

Now that we have introduced the general context and origins of deep reinforcement
learning, let us switch gears, and talk about machine learning. Let us see how deep
reinforcement learning fits in the general picture of the field. At the same time, we
will take the opportunity to introduce some notation and basic concepts.
In the next section we will then provide an outline of the book. But first it is
time for machine learning. We start at the beginning, with function approximation.

Representing a Function

Functions are a central part in artificial intelligence. A function 𝑓 transforms input 𝑥

to output 𝑦 according to some method, and we write 𝑓 (𝑥) → 𝑦. In order to perform
calculations with function 𝑓 it must be represented as a computer program in some
form in memory. We also write function

𝑓 : 𝑋 → 𝑌,

where the domain 𝑋 and range 𝑌 can be discrete or continuous, and the dimension-
ality (number of attributes in 𝑋) can be arbitrary.
Often, in the real world, the same input may yield a range of different outputs,
and we would like our function to provide a conditional probability distribution, a
function that maps
𝑓 : 𝑋 → 𝑝(𝑌 ).
Here the function maps the domain to a probability distribution 𝑝 over the range.
Representing a conditional probability allows us to model functions for which the
input does not always give the same output.

Given versus Learned Function

Sometimes the function that we are interested in is given, and we can represent
the function by a specific algorithm that computes an analytical expression that is
known exactly. This is, for example, the case for the laws of physics, or when we
make explicit assumptions for a particular system.

Example: Newton’s second Law of Motion states that for objects with
constant mass
𝐹 = 𝑚 · 𝑎,
where 𝐹 denotes the net force on the object, 𝑚 denotes its mass, and 𝑎
denotes its acceleration. In this case, the analytical expression defines the
entire function, for every possible combination of the inputs.
1.2 Three Machine Learning Paradigms 13

However, for many functions in the real world, we do not have an analytical
expression. Here, we enter the realm of machine learning, in particular of supervised
learning. When we do not know an analytical expression for a function, our best
approach is to collect data—examples of (𝑥, 𝑦) pairs—and reverse engineer or learn
the function from this data. See Fig. 1.12.

Fig. 1.12 Example of learning a function; data points are in blue, a possible learned linear function
is the red line, which allows us to make predictions 𝑦ˆ for any new input 𝑥

Example: A company wants to predict the chance that you buy a shampoo
to color your hair, based on your age. They collect many data points of
𝑥 ∈ N, your age (a natural number), that map to 𝑦 ∈ {0, 1}, a binary indicator
whether you bought their shampoo. They then want to learn the mapping

𝑦ˆ = 𝑓 (𝑥)

where 𝑓 is the desired function that tells the company who will buy the
product and 𝑦ˆ is the predicted 𝑦 (admittedly overly simplistic in this exam-
ple).

Let us see which methods exist in machine learning to find function approxima-
tions.

Three Approaches

There are three main approaches for how the observations can be provided in
machine learning: (1) supervised learning, (2) reinforcement learning, and (3) unsu-
pervised learning.
14 1 Introduction

“Cat” “Cat” “Dog” “Cat” “Dog” “Dog”

Table 1.2 (Input/output)-Pairs for a Supervised Classification Problem

1.2.1 Supervised Learning

The first and most basic method for machine learning is supervised learning. In
supervised learning, the data to learn the function 𝑓 (𝑥) is provided to the learning
algorithm in (𝑥, 𝑦) example-pairs. Here 𝑥 is the input, and 𝑦 the observed output
value to be learned for that particular input value 𝑥. The 𝑦 values can be thought
of as supervising the learning process, they teach the learning process the right
answers for each input value 𝑥, hence the name supervised learning.
The data pairs to be learned from are organized in a dataset, which must be
present in its entirety before the algorithm can start. During the learning process,
an estimate of the real function that generated the data is created, 𝑓ˆ. The 𝑥 values
of the pair are also called the input, and the 𝑦 values are the label to be learned.
Two well-known problems in supervised learning are regression and classifi-
cation. Regression predicts a continuous number, classification a dicrete category.
The best known regression relation is the linear relation: the familiar straight line
through a cloud of observation points that we all know from our introductory
statistics course. Figure 1.12 shows such a linear relationship 𝑦ˆ = 𝑎 · 𝑥 + 𝑏. The
linear function can be characterized with two parameters 𝑎 and 𝑏. Of course, more
complex functions are possible, such as quadratic regression, non-linear regression,
or regression with higher-order polynomials [210].
The supervisory signal is computed for each data item 𝑖 as the difference between
the current estimate and the given label, for example by ( 𝑓ˆ(𝑥𝑖 ) − 𝑦 𝑖 ) 2 . Such an error
function ( 𝑓ˆ(𝑥) − 𝑦) 2 is also known as a loss function; it measures the quality of our
prediction. The closer our prediction is to the true label, the lower the loss. There
are many Í ways to compute this closeness, such as the mean squared error loss
L = 𝑁1 1𝑁 ( 𝑓ˆ(𝑥𝑖 ) − 𝑦 𝑖 ) 2 , which is used often for regression over 𝑁 observations.
This loss function can be used by any supervised learning algorithm to adjust model
parameters 𝑎 and 𝑏 to fit the function 𝑓ˆ to the data. Some of the many possible
learning algorithms are linear regression and support vector machines [93, 646].
In classification, a relation between an input value and a class label is fitted. A
well-studied classification problem is image recognition, where two-dimensional
images are to be categorized. Table 1.2 shows a tiny dataset of labeled images
of the proverbial cats and Í dogs. A popular loss function for classification is the
cross-entropy loss L = − 1𝑁 𝑦 𝑖 log( 𝑓ˆ(𝑥𝑖 )), see also Sect. A.2.5.3. Again, such a
loss function can be used to adjust the model parameters to fit the function to the
data. The model can be small and linear, or it can be large, such as a neural network,
which is often used for image classification.
1.2 Three Machine Learning Paradigms 15

In supervised learning a large dataset exists where all input items have an
associated training label. Reinforcement learning is different, it does not assume
the pre-existence of a large labeled training set. Unsupervised learning does require
a large dataset, but no user-supplied output labels; all it needs are the inputs.
Deep learning function approximation was first developed in a supervised set-
ting. Although this book is about deep reinforcement learning, we will encounter
supervised learning concepts frequently, whenever we discuss the deep learning
aspect of deep reinforcement learning.

1.2.2 Unsupervised Learning

When there are no labels in the dataset, different learning algorithms must be
used. Learning without labels is called unsupervised learning. In unsupervised
learning an inherent metric of the data items is used, such as distance. A typical
problem in unsupervised learning is to find patterns in the data, such as clusters or
subgroups [819, 800].
Popular unsupervised learning algorithms are 𝑘-means algorithms, and prin-
cipal component analysis [676, 379]. Other popular unsupervised methods are
dimensionality reduction techniques from visualization, such as t-SNE [492], mini-
mum description length [293] and data compression [55]. A popular application of
unsupervised learning are autencoders, see Sect. B.2.6 [410, 411].
The relation between supervised and unsupervised learning is sometimes char-
acterized as follows: supervised learning aims to learn the conditional probability
distribution 𝑝(𝑥|𝑦) of input data conditioned on a label 𝑦, whereas unsupervised
learning aims to learn the a priori probability distribution 𝑝(𝑥) [342].
We will encounter unsupervised methods in this book in a few places, specif-
ically, when autoencoders and dimension reduction are discussed, for example,
in Chap. 5. At the end of this book explainable artificial intelligence is discussed,
where interpretable models play an important role, in Chap. 10.

1.2.3 Reinforcement Learning

The third machine learning paradigm is, indeed, reinforcement learning. In contrast
to supervised and unsupervised learning, in reinforcement learning data items
come one by one. The dataset is produced dynamically, as it were. The objective in
reinforcement learning is to find the policy: a function that gives us the best action
in each state that the world can be in.
The approach of reinforcement learning is to learn the policy for the world by
interacting with it. In reinforcement learning we recognize an agent, that does the
learning of the policy, and an environment, that provides feedback to the agent’s
actions (and that performs state changes, see Fig. 1.13). In reinforcement learning,
16 1 Introduction

Fig. 1.13 Agent and Environment

Concept Supervised Learning Reinforcement Learning

Inputs 𝑥 Full dataset of states Partial (One state at a time)
Labels 𝑦 Full (correct action) Partial (Numeric action reward)
Table 1.3 Supervised vs. Reinforcement Learning

the agent stands for the human, and the environment for the world. The goal of
reinforcement learning is to find the actions for each state that maximize the long
term accumulated expected reward. This optimal function of states to actions is
called the optimal policy.
In reinforcement learning there is no teacher or supervisor, and there is no static
dataset. There is, however, the environment, that will tell us how good the state is
in which we find ourselves. Reinforcement learning gives us partial information:
a number indicating the quality of the action that brought us to our state, where
supervised learning gives full information: the correct answer or action in that state
(Table 1.3). In this sense, reinforcement learning is in between supervised learning,
in which all data items have a label, and unsupervised learning, where no data has
a label.
Reinforcement learning provides the data to the learning algorithm step by
step, action by action; whereas in supervised learning the data is provided all at
once in one large dataset. The step-by-step approach is well suited to sequential
decision problems. On the other hand, many deep learning methods were developed
for supervised learning and may work differently when data items are generated
one-by-one. Furthermore, since actions are selected using the policy function, and
action rewards are used to update this same policy function, there is a possibility
of circular feedback and local minima. Care must be taken to ensure convergence
to global optima in our methods. Human learning also suffers from this problem,
when a stubborn child refuses to explore outside of its comfort zone. This topic is is
discussed in Sect. 2.2.4.3.
Another difference is that in supervised learning the pupil learns from a finite-
sized teacher (the dataset), and at some point may have learned all there is to learn.
The reinforcement learning paradigm allows a learning setup where the agent
1.3 Overview of the Book 17

can continue to sample the environment indefinitely, and will continue to become
smarter as long as the environment remains challenging (which can be a long time,
for example in games such as chess and Go).3
For these reasons there is great interest in reinforcement learning, although
getting the methods to work is often harder than for supervised learning.
Most classical reinforcement learning use tabular methods that work for low-
dimensional problems with small state spaces. Many real world problems are com-
plex and high-dimensional, with large state spaces. Due to steady improvements
in learning algorithms, datasets, and compute power, deep learning methods have
become quite powerful. Deep reinforcement learning methods have emerged that
successfully combine step-by-step sampling in high-dimensional problems with
large state spaces. We will discuss these methods in the subsequent chapters of this
book.

1.3 Overview of the Book

The aim of this book is to present the latest insights in deep reinforcement learning in
a single comprehensive volume, suitable for teaching a graduate level one-semester
course.
In addition to covering state of the art algorithms, we cover necessary background
in classic reinforcement learning and in deep learning. We also cover advanced,
forward looking developments in self-play, and in multi-agent, hierarchical, and
meta-learning.

1.3.1 Prerequisite Knowledge

In an effort to be comprehensive, we make modest assumptions about previous

knowledge. We assume a bachelor level of computer science or artificial intelligence,
and an interest in artificial intelligence and machine learning. A good introductory
textbook on artificial intelligence is Russel and Norvig: Artificial Intelligence, A
Modern Approach [646].
Figure 1.14 shows an overview of the structure of the book. Deep reinforcement
learning combines deep supervised learning and classical (tabular) reinforcement
learning. The figure shows how the chapters are built on this dual foundation.
For deep reinforcement learning, the field of deep supervised learning is of great
importance. It is a large field; deep, and rich. Many students may have followed
a course on deep learning, if not, Appendix B provides you with the necessary
background (dashed). Tabular reinforcement learning, on the other hand, may be
new to you, and we start our story with this topic in Chap. 2.
3In fact, some argue that reward is enough for artificial general intelligence, see Silver, Singh,
Precup, and Sutton [706].
18 1 Introduction

10. Further Developments

9. Meta-Learning

8. Hierarchical Reinforcement Learning

7. Multi-Agent Reinforcement Learning

6. Two-Agent Self-Play

5. Model-Based Reinforcement Learning

4. Policy-Based Reinforcement Learning

3. Deep Value-Based Reinforcement Learning

2. Tabular Value-Based Reinforcement Learning B. Deep Supervised Learning

Fig. 1.14 Deep Reinforcement Learning is built on Deep Supervised Learning and Tabular Rein-
forcement Learning

The figure shows how deep reinforcement learning is built on tabular reinforce-
ment learning and on deep (supervised) learning. Please check the material in
Appendix B to make sure you have enough background knowledge to follow the
rest of this book.
We also assume undergraduate level familiarity with the Python programming
language. Python that has become the programming language of choice for machine
learning research, and the host-language of most machine learning packages. All
example code in this book is in Python, and major machine learning environments
such as scikit-learn, TensorFlow, Keras and PyTorch work best from Python. See
https://www.python.org for pointers on how to get started in Python. Use the
latest stable version, unless the text mentions otherwise.
We assume an undergraduate level of familiarity with mathematics—a basic
understanding of set theory, graph theory, probability theory and information
theory is necessary, although this is not a book of mathematics. Appendix A contains
a summary to refresh your mathematical knowledge, and to provide an introduction
to the notation that is used in the book.
1.3 Overview of the Book 19

Course

There is a lot of material in the chapters, both basic and advanced, with many
pointers to the literature. One option is to teach a single course about all topics in
the book. Another option is to go slower and deeper, to spend sufficient time on the
basics, and create a course about Chaps. 2–5 to cover the basic topics (value-based,
policy-based, and model-based learning), and to create a separate course about
Chaps. 6–9 to cover the more advanced topics of multi-agent, hierarchical, and
meta-learning.

Blogs and GitHub

The field of deep reinforcement learning is a highly active field, in which theory and
practice go hand in hand. The culture of the field is open, and you will easily find
many blog posts about interesting topics, some quite good. Theory drives experi-
mentation, and experimental results drive theoretical insights. Many researchers
publish their papers on arXiv and their algorithms, hyperparameter settings and
environments on GitHub.
In this book we aim for the same atmosphere. In the chapters we first discuss
problems that are to be solved, then the algorithmic approaches to solve them, and
finally the environments to test these approaches.
Throughout the text we provide links to code, and we challenge you with hands-
on sections to get your hands dirty to perform your own experiments. All links to
web pages that we use have been stable for some time.

Website: The companion website for this book https://deep-

reinforcement-learning.net contains updates, slides, and other course
material that you are welcome to explore and use.

1.3.2 Structure of the Book

The field of deep reinforcement learning consists of two main areas: model-free
reinforcement learning and model-based reinforcement learning. Both areas have
two subareas. The chapters of this book are organized according to this structure.
• Model-free methods
– Value-based methods: Chap. 2 (tabular) and 3 (deep)
– Policy-based methods: Chap. 4
• Model-based methods
– Learned model: Chap. 5
20 1 Introduction

– Given model: Chap. 6

Then, we have three chapters on more specialized topics.
• Multi-agent reinforcement learning: Chap. 7
• Hierarchical reinforcement learning: Chap. 8
• Transfer and Meta-learning: Chap. 9
Appendix B provides a necessary review of deep supervised learning.
The style of each chapter is to first provide the main idea of the chapter in
an intuitive example, to then explain the kind of problem to be solved, and then
to discuss algorithmic concepts that agents use, and the environments that have
been solved in practice with these algorithms. The sections of the chapters are
named accordingly: their names end in problem-agent-environment. At the end
of each chapter we provide questions for quizzes to check your understanding of
the concepts, and we provide exercises for larger programming assignments (some
doable, some quite challenging). We also end each chapter with a summary and
references to further reading.
Let us now look in more detail at what topics the chapters cover.

Chapters

After this introductory chapter, we continue with Chap. 2, in which we discuss in

detail the basic concepts of tabular (non-deep) reinforcement learning. We start with
Markov decision processes and discuss them at length. We will introduce tabular
planning and learning, and important concepts such as state, action, reward, value,
and policy. We will encounter the first, tabular, value-based model-free learning
algorithms (for an overview, see Table 2.1). Chapter 2 is the only non-deep chapter
of the book. All other chapters cover deep methods.
Chapter 3 explains deep value-based reinforcement learning. The chapter covers
the first deep algorithms that have been devised to find the optimal policy. We will
still be working in the value-based, model-free, paradigm. At the end of the chapter
we will analyze a player that teaches itself how to play 1980s Atari video games.
Table 3.1 lists some of the many stable deep value-based model-free algorithms.
Value-based reinforcement learning works well with applications such as games,
with discrete action spaces. The next chapter, Chap. 4, discusses a different approach:
deep policy-based reinforcement learning (Table 4.1). In addition to discrete spaces,
this approach is also suited for continuous actions spaces, such as robot arm move-
ment, and simulated articulated locomotion. We see how a simulated half-cheetah
teaches itself to run.
The next chapter, Chap. 5, introduces deep model-based reinforcement learning
with a learned model, a method that first builds up a transition model of the envi-
ronment before it builds the policy. Model-based reinforcement learning holds the
promise of higher sample efficiency, and thus faster learning. New developments,
such as latent models, are discussed. Applications are both in robotics and in games
(Table 5.2).
1.3 Overview of the Book 21

The next chapter, Chap. 6, studies how a self-play system can be created for
applications where the transition model is given by the problem description. This is
the case in two-agent games, where the rules for moving in the game determine the
transition function. We study how TD-Gammon and AlphaZero achieve tabula rasa
learning: teaching themselves from zero knowledge to world champion level play
through playing against a copy of itself (Table 6.2). In this chapter deep residual
networks and Monte Carlo Tree Search result in curriculum learning.
Chapter 7 introduces recent developments in deep multi-agent and team learning.
The chapter covers competition and collaboration, population-based methods, and
playing in teams. Applications of these methods are found in games such as StarCraft
and Capture the Flag (Table 7.2).
Chapter 8 covers deep hierarchical reinforcement learning. Many tasks exhibit
an inherent hierarchical structure, in which clear subgoals can be identified. The
options framework is discussed, and methods that can identify subgoals, subpolicies,
and meta policies. Different approaches for tabular and deep hierarchical methods
are discussed (Table 8.1).
The final technical chapter, Chap. 9, covers deep meta-learning, or learning to
learn. One of the major hurdles in machine learning is the long time it takes to learn
to solve a new task. Meta-learning and transfer learning aim to speed up learning of
new tasks by using information that has been learned previously for related tasks;
algorithms are listed in Table 9.2. At the end of the chapter we will experiment with
few-shot learning, where a task has to be learned without having seen more than a
few training examples.
Chapter 10 concludes the book by reviewing what we have learned, and by
looking ahead into what the future may bring.
Appendix A provides mathematical background information and notation. Ap-
pendix B provides a chapter-length overview of machine learning and deep super-
vised learning. If you wish to refresh your knowledge of deep learning, please go to
this appendix before you read Chap. 3.
Chapter 2
Tabular Value-Based Reinforcement Learning

This chapter will introduce the classic, tabular, field of reinforcement learning, to
build a foundation for the next chapters. First, we will introduce the concepts of
agent and environment. Next come Markov decision processes, the formalism that
is used to reason mathematically about reinforcement learning. We discuss at some
length the elements of reinforcement learning: states, actions, values, policies.
We learn about transition functions, and solution methods that are based on
dynamic programming using the transition model. There are many situations where
agents do not have access to the transition model, and state and reward information
must be acquired from the environment. Fortunately, methods exist to find the
optimal policy without a model, by querying the environment. These methods,
appropriately named model-free methods, will be introduced in this chapter. Value-
based model-free methods are the most basic learning approach of reinforcement
learning. They work well in problems with deterministic environments and discrete
action spaces, such as mazes and games. Model-free learning makes few demands
on the environment, building up the policy function 𝜋(𝑠) → 𝑎 by sampling the
environment.
After we have discussed these concepts, it is time to apply them, and to under-
stand the kinds of sequential decision problems that we can solve. We will look
at Gym, a collection of reinforcement learning environments. We will also look at
simple Grid world puzzles, and see how to navigate those.
This is a non-deep chapter: in this chapter functions are exact, states are stored in
tables, an approach that works as long as problems fit in memory. The next chapter
shows how function approximation with neural networks allows us to approximate
problems with more states than fit in memory.
The chapter is concluded with exercises, a summary, and pointers to further
reading.

23
24 2 Tabular Value-Based Reinforcement Learning

Core Concepts

• Agent, environment
• MDP: state, action, reward, value, policy
• Planning and learning
• Exploration and exploitation
• Gym, baselines

Core Problem

• Learn a policy from interaction with the environment

Core Algorithms

• Value iteration (Listing 2.1)

• Temporal difference learning (Sect. 2.2.4.2)
• Q-learning (Listing 2.5)

Finding a Supermarket

Imagine that you have just moved to a new city, you are hungry, and you want to
buy some groceries. There is a somewhat unrealistic catch: you do not have a map
of the city and you forgot to charge your smartphone. It is a sunny day, you put
on your hiking shoes, and after some random exploration you have found a way
to a supermarket and have bought your groceries. You have carefully noted your
route in a notebook, and you retrace your steps, finding your way back to your new
home.
What will you do the next time that you need groceries? One option is to
follow exactly the same route, exploiting your current knowledge. This option is
guaranteed to bring you to the store, at no additional cost for exploring possible
alternative routes. Or you could be adventurous, and explore, trying to find a new
route that may actually be quicker than the old route. Clearly, there is a trade-off:
you should not spend so much time exploring that you can not recoup the gains of
a potential shorter route before you move elsewhere.
Reinforcement learning is a natural way of learning the optimal route as we go,
by trial and error, from the effects of the actions that we take in our environment.
This little story contained many of the elements of a reinforcement learning
problem, and how to solve it. There is an agent (you), an environment (the city), there
2.1 Sequential Decision Problems 25

Fig. 2.1 Grid World with a goal, an “un-goal,” and a wall

are states (your location at different points in time), actions (assuming a Manhattan-
style grid, moving a block left, right, forward, or back), there are trajectories (the
routes to the supermarket that you tried), there is a policy (which action you will
take at a particular location), there is a concept of cost/reward (the length of your
current path), we see exploration of new routes, exploitation of old routes, a trade-off
between them, and your notebook in which you have been sketching a map of the
city (a local transition model).
By the end of this chapter you will have learned which role all these topics play
in reinforcement learning.

2.1 Sequential Decision Problems

Reinforcement learning is used to solve sequential decision problems [27, 254].

Before we dive into the algorithms, let us have a closer look at these problems, to
better understand the challenges that the agents must solve.
In a sequential decision problem the agent has to make a sequence of decisions
in order to solve a problem. In reinforcement learning, solving implies to find the
sequence with the highest (expected cumulative future) reward. In reinforcement
learning terminology, the solver is called the agent, and the problem is called
environment (or sometimes the world).
We will now discuss basic examples of sequential decision problems.

Grid Worlds

Some of the first environments that we encounter in reinforcement learning are

Grid worlds (Fig. 2.1). These environments consist of a rectangular grid of squares,
with a start square, and a goal square. The aim is for the agent to find the sequence
of actions that it must take (up, down, left, right) to arrive at the goal square. In
fancy versions a “loss” square is added, that scores minus points, or a “wall” square,
that is impenetrable for the agent.
26 2 Tabular Value-Based Reinforcement Learning

Fig. 2.2 The Longleat Hedge Maze in Wiltshire, England

Fig. 2.3 Sokoban Puzzle [136]

By exploring the grid, taking different actions, and recording the reward (whether
it reached the goal square), the agent can find a route—and when it has a route, it
can try to improve that route, to find an optimal, shortest, route to the goal.
Grid world is a simple environment that is well-suited for manually playing
around with reinforcement learning algorithms, to build up intuition of what the
algorithms do to learn by reinforcement. In this chapter we will model reinforcement
learning problems formally, and encounter algorithms that find optimal routes in
Grid world.
2.2 Tabular Value-Based Agents 27

Agent

State 𝑠𝑡+1 Reward 𝑟𝑡+1 Action 𝑎𝑡

Environment

Fig. 2.4 Agent and environment [742]

Mazes and Box Puzzles

After Grid world problems, there are more complicated problems, with extensive
wall structures to make navigation more difficult (see Fig. 2.2). Trajectory planning
algorithms play a central role in robotics [455, 264]; there is a long tradition of
using 2D and 3D mazes for path-finding problems in reinforcement learning. The
Taxi domain was introduced by Dietterich [196], and box-pushing problems such
as Sokoban have also been used frequently [385, 204, 542, 877], see Fig. 2.3. The
challenge in Sokoban is that boxes can only be pushed, not pulled. Actions can have
the effect of creating an inadvertent dead-end for into the future, making Sokoban
a difficult game to play. The action space of these puzzles and mazes is discrete.
Small versions of the mazes can be solved exactly by planning, larger instances are
only suitable for approximate planning or learning methods. Solving these planning
problems exactly is NP-hard or PSPACE-hard [169, 321], as a consequence the
computational time required to solve problem instances exactly grows exponentially
with the problem size, and becomes quickly infeasible for all but the smallest
problems.
Let us see how we can model agents to act in these types of environments.

2.2 Tabular Value-Based Agents

Reinforcement learning finds the best policy to operate in the environment by

interacting with it. The reinforcement learning paradigm consists of an agent (you,
the learner) and an environment (the world, which is in a certain state, and gives
you feedback on your actions).

2.2.1 Agent and Environment

In Fig. 2.4 the agent and environment are shown, together with action 𝑎 𝑡 , next state
𝑠𝑡+1 , and its reward 𝑟 𝑡+1 . Let us have a closer look at the figure. The environment
28 2 Tabular Value-Based Reinforcement Learning

is in a certain state 𝑠𝑡 at time 𝑡. Then, the agent performs action 𝑎 𝑡 , resulting in

a transition in the environment from state 𝑠𝑡 to 𝑠𝑡+1 at the next time step, also
denoted as 𝑠 → 𝑠 0. Along with this new state comes a reward value 𝑟 𝑡+1 (which
may be a positive or negative number). The goal of reinforcement learning is to
find the sequence of actions that gives the best reward. Or, more formally, to find
the optimal policy function 𝜋★ that gives in each state the best action to take in
that state. By trying different actions, and accumulating the rewards, the agent can
find the best action for each state. In this way, with the reinforcing reward values,
the optimal policy is learned from repeated interaction with the environment, and
the problem is “solved.”
Where in supervised learning all data is provided in one large labeled dataset,
reinforcement learning works step by step. Furthermore, where in supervised
learning we are given the ground truth for each example 𝑥 with a label 𝑦, in
reinforcement learning the environment gives us only a number as an indication
of the quality of an action that we performed, and we are left to derive the correct
action policy from that, as we can see in Fig. 2.4. On the other hand, reinforcement
learning allows us to generate as many action-reward pairs as we need, without a
large hand-labeled dataset, and we can choose ourselves which actions to try.

2.2.2 Markov Decision Process

Sequential decision problems can be modelled as Markov decision processes

(MDPs) [482]. Markov decision problems are so called because they have the Markov
property: the next state depends only on the current state and the actions avail-
able in it (no historical memory of previous states or information from elsewhere
influences the next state) [351]. The no-memory property is important because it
makes reasoning about future states possible using only the information present in
the current state. If previous histories would influence the current state, and these
would all have to be taken into account, then reasoning about the current state
would be much harder or even infeasible.
Markov processes are named after Russian mathematician Andrey Markov (1856–
1922) who is best known for his work on these stochastic processes. See [27, 254]
for an introduction into MDPs. The MDP formalism is the mathematical basis under
reinforcement learning, and we will introduce the relevant elements in this chapter.
We follow Moerland [523] and François-Lavet et al. [254] for some of the notation
and examples in this section.

Formalism

We define a Markov decision process for reinforcement learning as a 5-tuple

(𝑆, 𝐴, 𝑇𝑎 , 𝑅 𝑎 , 𝛾):
• 𝑆 is a finite set of legal states of the environment; the initial state is denoted as 𝑠0
2.2 Tabular Value-Based Agents 29

• 𝐴 is a finite set of actions (if the set of actions differs per state, then 𝐴𝑠 is the
finite set of actions in state 𝑠)
• 𝑇𝑎 (𝑠, 𝑠 0) = Pr(𝑠𝑡+1 = 𝑠 0 |𝑠𝑡 = 𝑠, 𝑎 𝑡 = 𝑎) is the probability that action 𝑎 in state
𝑠 at time 𝑡 will transition to state 𝑠 0 at time 𝑡 + 1 in the environment (do not
confuse the 𝑇 in transition function 𝑇𝑎 (·) with time index 𝑡)
• 𝑅 𝑎 (𝑠, 𝑠 0) is the reward received after action 𝑎 transitions state 𝑠 to state 𝑠 0
• 𝛾 ∈ [0, 1] is the discount factor representing the difference between future and
present rewards.

2.2.2.1 State 𝑺

Let us have a deeper look at the Markov-tuple 𝑆, 𝐴, 𝑇𝑎 , 𝑅 𝑎 , 𝛾, to see their role in the
reinforcement learning paradigm, and how, together, they can model and describe
reward-based learning processes.
At the basis of every Markov decision process is a description of the state 𝑠𝑡 of
the system at a certain time 𝑡.

State Representation

The 𝑆 𝑠 contains the information to uniquely represent the configuration of the

environment.
Often there is a straightforward way to uniquely represent the state in a computer
memory. For the supermarket example, each identifying location is a state (such
as: I am at the corner of 8th Av and 27nd St). For chess, this can be the location of
all pieces on the board (plus information for the 50 move repetition rule, castling
rights, and en-passant state). For robotics this can be the orientation of all joints of
the robot, and the location of the limbs of the robot. For Atari, the state comprises
the values of all screen pixels, the orientation of the joystick and the on/off state of
the fire-button.
Using its current behavior policy, the agent chooses an action 𝑎, which is per-
formed in the environment. How the environment reacts to the action is defined by
the transition model 𝑇𝑎 (𝑠, 𝑠 0) that is internal to the environment, which the agent
does not know. The environment returns the new state 𝑠 0, as well as a reward value
𝑟 0 for the new state, to the agent.

Deterministic and Stochastic Environment

In a deterministic environment actions from a certain state always lead predictably

to one new state; a stochastic environment may return different new states at
different occasions.
In discrete deterministic environments the transition function defines a one-step
transition, as each action (from a certain old state) deterministically leads to a single
30 2 Tabular Value-Based Reinforcement Learning

new state. This is the case in Grid worlds, Sokoban, and in games such as chess and
checkers, where a move action deterministically leads to one new board position.
An example of a non-deterministic situation is a robot movement in an envi-
ronment. In a certain state, a robot arm is holding a bottle. An agent-action can be
turning the bottle in a certain orientation (presumably to pour a drink in a cup). The
next state may be a full cup, or it may be a mess, if the bottle was not poured in the
correct orientation, or location, or if something happened in the environment such
as someone bumping the table. The outcome of the action is unknown beforehand
by the agent, and depends on elements in the environment, that are not known to
the agent.

2.2.2.2 Action 𝑨

Now that we have looked at the state, it is time to look at the second item that
defines an MDP, the action.

Irreversible Environment Action

When the agent is in state 𝑠, it chooses an 𝐴 𝑎 to perform, based on its current

behavior policy 𝜋(𝑎|𝑠) (policies are explained soon). The agent communicates the
selected action 𝑎 to the environment (Fig. 2.4). For the supermarket example, an
example of an action could be walking along a block in a certain direction (such
as: East). For Sokoban, an action can be pushing a box to a new location in the
warehouse. Note that in different states the possible actions may differ. For the
supermarket example, walking East may not be possible at each street corner, and
in Sokoban pushing a box in a certain direction will only be possible in states where
this direction is not blocked by a wall.
An action changes the state of the environment irreversibly. In the reinforcement
learning paradigm, there is no undo operator for the environment (nor is there in
the real world). When the environment has performed a state transition, it is final.
The new state is communicated to the agent, together with a reward value. The
actions that the agent perfoms in the environment are also known as its behavior,
just as the actions of a human in the world constitute the human’s behavior.

Discrete or Continuous Action Space

The actions are discrete in some applications, continuous in others. For example,
the actions in board games, and choosing a direction in a navigation task in a grid,
are discrete.
In contrast, arm and joint movements of robots, and bet sizes in certain games, are
continuous (or span a very large range of values). Applying algorithms to continuous
or very large action spaces either requires discretization of the continuous space
2.2 Tabular Value-Based Agents 31

𝑠 𝑠

𝜋 𝜋 𝑠
𝑎 𝑎

𝑡𝑎 , 𝑟𝑎 𝑡𝑎 , 𝑟𝑎 𝜋 𝑡𝑎 , 𝑟𝑎
𝑠0 𝑠0 𝑎, 𝑠0

Fig. 2.5 Backup Diagrams for MDP Transitions [742]

(into buckets) or the development of a different kind of algorithm. As we will see in

Chaps. 3 and 4, value-based methods (Chap. 3) work well for discrete action spaces,
and policy-based methods (Chap. 4) work well for both action spaces.
For the supermarket example we can actually choose between modeling our
actions discreet or continuous. From every state, we can move any number of steps,
small or large, integer or fractional, in any direction. We can even walk a curvy
path. So, strictly speaking, the action space is continuous. However, if, as in some
cities, the streets are organized in a rectangular Manhattan-pattern, then it makes
sense to discretize the continuous space, and to only consider discrete actions that
take us to the next street corner. Then, our action space has become discrete, by
using extra knowledge of the problem structure.1

2.2.2.3 Transition 𝑻𝒂

After having discussed state and action, it is time to look at the transition function
𝑇𝑎 (𝑠, 𝑠 0). The 𝑇 function 𝑇𝑎 determines how the state changes after an action
has been selected. In model-free reinforcement learning the transition function is
implicit to the solution algorithm: the environment has access to the transition
function, and uses it to compute the next state 𝑠 0, but the agent has not. (In Chap. 5
we will discuss model-based reinforcement learning. There the agent has its own
transition function, an approximation of the environment’s transition function,
which is learned from the environment feedback.)

Graph View of the State Space

We have discussed states, actions and transitions. The dynamics of the MDP are
modelled by transition function 𝑇𝑎 (·) and reward function 𝑅 𝑎 (·). The imaginary
1 If we assume that supermarkets are large, block-sized, items that typically can be found on street
corners, then we can discretize the action space. Note that we may miss small sub-block-sized
supermarkets, because of this simplification. Another, better, simplification, would be to discretize
the action space into walking distances of the size of the smallest supermarket that we expect to
ever encounter.
32 2 Tabular Value-Based Reinforcement Learning

space of all possible states is called the state space. The state space is typically
large. The two functions define a two-step transition from state 𝑠 to 𝑠 0, via action 𝑎:
𝑠 → 𝑎 → 𝑠 0. To help our understanding of the transitions between states we can
use a graphical depiction, as in Fig. 2.5.
In the figure, states and actions are depicted as nodes (vertices), and transitions
are links (edges) between the nodes. States are drawn as open circles, and actions
as smaller black circles. In a certain state 𝑠, the agent can choose which action 𝑎 to
perform, that is then acted out in the environment. The environment returns the
new state 𝑠 0 and the reward 𝑟 0.
Figure 2.5 shows a transition graph of the elements of the MDP tuple 𝑠, 𝑎, 𝑡 𝑎 , 𝑟 𝑎
as well as 𝑠 0, and policy 𝜋, and how the value can be calculated. The root node at
the top is state 𝑠, where policy 𝜋 allows the agent to choose between three actions
𝑎, that, following distribution Pr, each can transition to two possible states 𝑠 0, with
their reward 𝑟 0. In the figure, a single transition is shown. Use your imagination to
picture the other transitions as the graph extends down.
In the left panel of the figure the environment can choose which new state it
returns in response to the action (stochastic environment), in the middle panel there
is only one state for each action (deterministic environment).
To calculate the value of the root of the tree a backup procedure can be followed.
A backup procedure calculates the value of a parent from the values of the children,
recursively, in a bottom-up fashion, summing or maxing their values from the
leaves to the root of the tree. Such a calculation uses discrete time steps, indicated
by subscripts to the state and action, as in 𝑠𝑡 , 𝑠𝑡+1 , 𝑠𝑡+2 , . . .. For brevity, 𝑠𝑡+1 is
sometimes written as 𝑠 0. The figure shows a single transition step; an episode in
reinforcement learning typically consists of a sequence of many time steps.

Trial and Error, Down and Up

A graph such as the one in the center and right panel of Fig. 2.5, where child nodes
have only one parent node and without cycles, is known as a tree. In computer
science the root of the tree is at the top, and trees grow downward to the leaves.
As actions are performed and states and rewards are returned, a learning process
is taking place in the agent. We can use Fig. 2.5 to better understand the learning
process that is unfolding.
The rewards of actions are learned by the agent by interacting with the envi-
ronment, performing the actions. In the tree of Fig. 2.5 an action selection moves
downward, towards the leaves. At the leaves, we find the rewards, which we propa-
gate to the parent states upwards. Reward learning is learning by backpropagation:
in Fig. 2.5 the reward information flows upward in the diagram from the leaves to
the root. Action selection moves down, reward learning flows upward.
Reinforcement learning is learning by trial and error. Trial is selecting an action
down (using the behavior policy) to perform in the environment. Error is moving
up the tree, receiving a feedback reward from the environment, and reporting that
back up the tree to the state to update the current behavior policy. The downward
2.2 Tabular Value-Based Agents 33

selection policy chooses which actions to explore, and the upward propagation of
the error signal performs the learning of the policy.
Figures such as the one in Fig. 2.5 are useful for seeing how values are calculated.
The basic notions are trial, and error, or down, and up.

2.2.2.4 Reward 𝑹𝒂

The 𝑅 function is of central importance in reinforcement learning. It indicates the

measure of quality of that state, such solved, or distance. Rewards are associated
with single states, indicating their quality. However, we are most often interested
in the quality of a full decision making sequence from root to leaves. The reward
of a full sequence is called the return, sometimes denoted confusingly as 𝑅. The
expected cumulative discounted future reward of a state is called the value function
𝑉 𝜋 (𝑠). The value function 𝑉 𝜋 (𝑠) is the expected cumulative reward of 𝑠 where
actions are chosen according to policy 𝜋. The value function plays a central role
in reinforcement learning algorithms, such as Q-learning and DQN, that we will
discuss in detail later.

2.2.2.5 Discount Factor 𝜸

We distinguish between two types of tasks: (1) continuous time and long running
tasks, and (2) episodic tasks—tasks that end. In continuous and long running tasks it
makes sense to discount rewards from far into the future in order to more strongly
value current information. To achieve this a discount factor 𝛾 is used that reduces
the impact of far away rewards. Many continuous tasks use discounting, 𝛾 ≠ 1.
In this book we will mostly discuss episodic problems, where 𝛾 is irrelevant.
Both the supermarket example and the game of chess are episodic, and discounting
does not make sense in these problems.

2.2.2.6 Policy 𝝅

Of central importance in reinforcement learning is the 𝜋 function. The policy

function 𝜋 answers the question how the different actions 𝑎 at state 𝑠 should be
chosen.
Actions are anchored in states. The central question of MDP optimization is how
to choose our actions. The policy 𝜋 is a conditional probability distribution that for
each possible state specifies the probability of each possible action. The function 𝜋
is a mapping from the state space to a probability distribution over the action space:

𝜋 : 𝑆 → 𝑝( 𝐴)
34 2 Tabular Value-Based Reinforcement Learning

where 𝑝( 𝐴) can be a discrete or continuous probability distribution. For a particular

probability (density) from this distribution we write

𝜋(𝑎|𝑠)

Example: For a discrete state space and discrete action space, we may store
an explicit policy as a table, e.g.:
𝑠 𝜋(𝑎=up|𝑠) 𝜋(𝑎=down|𝑠) 𝜋(𝑎=left|𝑠) 𝜋(𝑎=right|𝑠)
1 0.2 0.8 0.0 0.0
2 0.0 0.0 0.0 1.0
3 0.7 0.0 0.3 0.0
etc. . . . .

A special case of a policy is a deterministic policy, denoted by

𝜋(𝑠)

where

𝜋:𝑆→𝐴
A deterministic policy selects a single action in every state. Of course the deter-
ministic action may differ between states, as in the example below:

Example: An example of a deterministic discrete policy is

𝑠 𝜋(𝑎=up|𝑠) 𝜋(𝑎=down|𝑠) 𝜋(𝑎=left|𝑠) 𝜋(𝑎=right|𝑠)
1 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 1.0
3 1.0 0.0 0.0 0.0
etc. . . . .
We would write 𝜋(𝑠 = 1) = down, 𝜋(𝑠 = 2) = right, etc.

2.2.3 MDP Objective

Finding the optimal policy function is the goal of the reinforcement learning prob-
lem, and the remainder of this book will discuss many different algorithms to
achieve this goal under different circumstances. Let us have a closer look at the
objective of reinforcement learning.
Before we can do so, we will look at traces, their return, and value functions.
2.2 Tabular Value-Based Agents 35

𝑎
𝑠

𝑇 𝑇

Fig. 2.6 Single Transition Step versus Full 3-Step Trace/Episode/Trajectory

2.2.3.1 Trace 𝝉

As we start interacting with the MDP, at each timestep 𝑡, we observe 𝑠𝑡 , take an

action 𝑎 𝑡 and then observe the next state 𝑠𝑡+1 ∼ 𝑇𝑎𝑡 (𝑠) and reward 𝑟 𝑡 = 𝑅 𝑎𝑡 (𝑠𝑡 , 𝑠𝑡+1 ).
Repeating this process leads to a sequence or trace in the environment, which we
denote by 𝜏𝑡𝑛 :
𝜏𝑡𝑛 = {𝑠𝑡 , 𝑎 𝑡 , 𝑟 𝑡 , 𝑠𝑡+1 , .., 𝑎 𝑡+𝑛 , 𝑟 𝑡+𝑛 , 𝑠𝑡+𝑛+1 }
Here, 𝑛 denotes the length of the 𝜏. In practice, we often assume 𝑛 = ∞, which means
that we run the trace until the domain terminates. In those cases, we will simply
write 𝜏𝑡 = 𝜏𝑡∞ . Traces are one of the basic building blocks of reinforcement learning
algorithms. They are a single full rollout of a sequence from the sequential decision
problem. They are also called trajectory, episode, or simply sequence (Fig. 2.6).

Example: A short trace with three actions could look like:

𝜏02 = {𝑠0 =1, 𝑎 0 =up, 𝑟 0 =−1, 𝑠1 =2, 𝑎 1 =up, 𝑟 1 =−1, 𝑠2 =3, 𝑎 2 =left, 𝑟 2 =20, 𝑠3 =5}

Since both the policy and the transition dynamics can be stochastic, we will not
always get the same trace from the start state. Instead, we will get a distribution
over traces. The distribution of traces from the start state (distribution) is denoted
by 𝑝(𝜏0 ). The probability of each possible trace from the start is actually given by
the product of the probability of each specific transition in the trace:

𝑝(𝜏0 ) = 𝑝 0 (𝑠0 ) · 𝜋(𝑎 0 |𝑠0 ) · 𝑇𝑎0 (𝑠0 , 𝑠1 ) · 𝜋(𝑎 1 |𝑠1 )...

Ö∞
= 𝑝 0 (𝑠0 ) · 𝜋(𝑎 𝑡 |𝑠𝑡 ) · 𝑇𝑎𝑡 (𝑠𝑡 , 𝑠𝑡+1 ) (2.1)
𝑡=0

Policy-based reinforcement learning depends heavily on traces, and we will

discuss traces more deeply in Chap. 4. Value-based reinforcement learning uses
single transition steps.
36 2 Tabular Value-Based Reinforcement Learning

Return 𝑹

We have not yet formally defined what we actually want to achieve in the sequential
decision-making task—which is, informally, the best policy. The sum of the future
reward of a trace is known as the return. The return of trace 𝜏𝑡 is:

𝑅(𝜏𝑡 ) = 𝑟 𝑡 + 𝛾 · 𝑟 𝑡+1 + 𝛾 2 · 𝑟 𝑡+2 + ...

∞
∑︁
= 𝑟𝑡 + 𝛾 𝑖 𝑟 𝑡+𝑖 (2.2)
𝑖=1

where 𝛾 ∈ [0, 1] is the discount factor. Two extreme cases are:

• 𝛾 = 0: A myopic agent, which only considers the immediate reward, 𝑅(𝜏𝑡 ) = 𝑟 𝑡
• 𝛾 = 1: A far-sighted agent, which treats all future rewards as equal, 𝑅(𝜏𝑡 ) =
𝑟 𝑡 + 𝑟 𝑡+1 + 𝑟 𝑡+2 + . . .
Note that if we would use an infinite-horizon return (Eq. 2.2) and 𝛾 = 1.0, then the
cumulative reward may become unbounded. Therefore, in continuous problems,
we use a discount factor close to 1.0, such as 𝛾 = 0.99.

Example: For the previous trace example we assume 𝛾 = 0.9. The return
(cumulative reward) is equal to:

𝑅(𝜏02 ) = −1 + 0.9 · −1 + 0.92 · 20 = 16.2 − 1.9 = 14.3

2.2.3.2 State Value 𝑽

The real measure of optimality that we are interested in is not the return of just one
trace. The environment can be stochastic, and so can our policy, and for a given
policy we do not always get the same trace. Therefore, we are actually interested
in the expected cumulative reward that a certain policy achieves. The expected
cumulative discounted future reward of a state is better known as the value of that
state.
We define the state 𝑉 𝑉 𝜋 (𝑠) as the return we expect to achieve when an agent
starts in state 𝑠 and then follows policy 𝜋, as:
∞
∑︁
𝑉 𝜋 (𝑠) = E 𝜏𝑡 ∼ 𝑝 ( 𝜏𝑡 ) 𝛾 𝑖 · 𝑟 𝑡+𝑖 |𝑠𝑡 = 𝑠 (2.3)
𝑖=0
2.2 Tabular Value-Based Agents 37

Example: Imagine that we have a policy 𝜋, which from state 𝑠 can result
in two traces. The first trace has a cumulative reward of 20, and occurs in
60% of the times. The other trace has a cumulative reward of 10, and occurs
40% of the times. What is the value of state 𝑠?

𝑉 𝜋 (𝑠) = 0.6 · 20 + 0.4 · 10 = 16.

The average return (cumulative reward) that we expect to get from state 𝑠
under this policy is 16.

Every policy 𝜋 has one unique associated value function 𝑉 𝜋 (𝑠). We often omit
𝜋 to simplify notation, simply writing 𝑉 (𝑠), knowing a state value is always condi-
tioned on a certain policy.
The state value is defined for every possible state 𝑠 ∈ 𝑆. 𝑉 (𝑠) maps every state
to a real number (the expected return):

𝑉 :𝑆→R

Example: In a discrete state space, the value function can be represented

as a table of size |𝑆|.
𝑠 𝑉 𝜋 (𝑠)
1 2.0
2 4.0
3 1.0
etc. .

Finally, the state value of a terminal state is by definition zero:

𝑠 = terminal ⇒ 𝑉 (𝑠) := 0.

2.2.3.3 State-Action Value 𝑸

In addition to state values 𝑉 𝜋 (𝑠), we also define 𝑄 values 𝑄 𝜋 (𝑠, 𝑎).2 The only
difference is that we now condition on a state and action. We estimate the average
return we expect to achieve when taking action 𝑎 in state 𝑠, and then following
policy 𝜋 afterwards:
∞
𝜋
∑︁
𝑄 (𝑠, 𝑎) = E 𝜏𝑡 ∼ 𝑝 ( 𝜏𝑡 ) 𝛾 𝑖 · 𝑟 𝑡+𝑖 |𝑠𝑡 = 𝑠, 𝑎 𝑡 = 𝑎 (2.4)
𝑖=0

2The reason for the choice for letter Q is lost in the mists of time. Perhaps it is meant to indicate
quality.
38 2 Tabular Value-Based Reinforcement Learning

Every policy 𝜋 has only one unique associated state-action value function 𝑄 𝜋 (𝑠, 𝑎).
We often omit 𝜋 to simplify notation. Again, the state-action value is a function

𝑄:𝑆×𝐴→R

which maps every state-action pair to a real number.

Example: For a discrete state and action space, 𝑄(𝑠, 𝑎) can be represented
as a table of size |𝑆| × | 𝐴|. Each table entry stores a 𝑄(𝑠, 𝑎) estimate for the
specific 𝑠, 𝑎 combination:
𝑎=up 𝑎=down 𝑎=left 𝑎=right
𝑠=1 4.0 3.0 7.0 1.0
𝑠=2 2.0 -4.0 0.3 1.0
𝑠=3 3.5 0.8 3.6 6.2
etc. . . . .

The state-action value of a terminal state is by definition zero:

𝑠 = terminal ⇒ 𝑄(𝑠, 𝑎) := 0, ∀𝑎

2.2.3.4 Reinforcement Learning Objective

We now have the ingredients to state the objective 𝐽 (·) of reinforcement learning.
The objective is to achieve the highest possible average return from the start state:
h i
𝐽 (𝜋) = 𝑉 𝜋 (𝑠0 ) = E 𝜏0 ∼ 𝑝 ( 𝜏0 | 𝜋) 𝑅(𝜏0 ) . (2.5)

for 𝑝(𝜏0 ) given in Eq. 2.1. There is one optimal value function, which achieves
higher or equal value than all other value functions. We search for a policy that
achieves this optimal value function, which we call the optimal policy 𝜋★:

𝜋★ (𝑎|𝑠) = arg max 𝑉 𝜋 (𝑠0 ) (2.6)

𝜋

This function 𝜋★ is the optimal policy, it uses the arg max function to select the
policy with the optimal value. The goal in reinforcement learning is to find this
optimal policy for start state 𝑠0 .
A potential benefit of state-action values 𝑄 over state values 𝑉 is that state-action
values directly tell what every action is worth. This is may be useful for action
selection, since, for discrete action spaces,

𝑎★ = arg max 𝑄★ (𝑠, 𝑎)

𝑎∈𝐴

the Q function directly identifies the best action. Equivalently, the optimal policy
can be obtained directly from the optimal Q function:
2.2 Tabular Value-Based Agents 39

𝜋★ (𝑠) = arg max 𝑄★ (𝑠, 𝑎).

𝑎∈𝐴

We will now turn to construct algorithms to compute the value function and the
policy function.

2.2.3.5 Bellman Equation

To calculate the value function, let us look again at the tree in Fig. 2.5 on page 31,
and imagine that it is many times larger, with subtrees that extend to fully cover
the state space. Our task is to compute the value of the root, based on the reward
values at the real leaves, using the transition function 𝑇𝑎 . One way to calculate the
value 𝑉 (𝑠) is to traverse this full state space tree, computing the value of a parent
node by taking the reward value and the sum of the children, discounting this value
by 𝛾.
This intuitive approach was first formalized by Richard Bellman in 1957. He
showed that discrete optimization problems can be described as a recursive back-
ward induction problem [72]. He introduced the term dynamic programming to
recursively traverse the states and actions. The so-called Bellman equation shows
the relationship between the value function in state 𝑠 and the future child state 𝑠 0,
when we follow the transition function.
The discrete Bellman equation of the value of state 𝑠 after following policy 𝜋 is:3
∑︁ h ∑︁ i
𝑇𝑎 (𝑠, 𝑠 0) 𝑅 𝑎 (𝑠, 𝑠 0) + 𝛾 · 𝑉 𝜋 (𝑠 0)

𝑉 𝜋 (𝑠) = 𝜋(𝑎|𝑠) (2.7)
𝑎∈𝐴 𝑠0 ∈𝑆

where 𝜋 is the probability of action 𝑎 in state 𝑠, 𝑇 is the stochastic transition function,

𝑅 is the reward function and 𝛾 is the discount rate. Note the recursion on the value
function, and that for the Bellman equation the transition and reward functions
must be known for all states by the agent.
Together, the transition and reward model are referred to as the dynamics model
of the environment. The dynamics model is often not known by the agent, and
model-free methods have been developed to compute the value function and policy
function without them.
The recursive Bellman equation is the basis of algorithms to compute the value
function, and other relevant functions to solve reinforcement learning problems. In
the next section we will study these solution methods.

2.2.4 MDP Solution Methods

The Bellman equation is a recursive equation: it shows how to calculate the value
of a state, out of the values of applying the function specification again on the
3 State-action value and continuous Bellman equations can be found in Appendix A.4.
40 2 Tabular Value-Based Reinforcement Learning

Fig. 2.7 Recursion: Droste effect

successor states. Figure 2.7 shows a recursive picture, of a picture in a picture, in

a picture, etc. In algorithmic form, dynamic programming calls its own code on
states that are closer and closer to the leaves, until the leaves are reached, and the
recursion can not go further.
Dynamic programming uses the principle of divide and conquer: it begins with
a start state whose value is to be determined by searching a large subtree, which
it does by going down into the recursion, finding the value of sub-states that are
closer to terminals. At terminals the reward values are known, and these are then
used in the construction of the parent values, as it goes up, back out of the recursion,
and ultimately arrives at the root value itself.
A simple dynamic programming method to iteratively traverse the state space
to calculate Bellman’s equation is value iteration. Pseudocode for a basic version of
VI is shown in Listing 2.1, based on [15]. Value iteration converges to the optimal
value function by iteratively improving the estimate of 𝑉 (𝑠). The value function
𝑉 (𝑠) is first initialized to random values. Value iteration repeatedly updates 𝑄(𝑠, 𝑎)
and 𝑉 (𝑠) values, looping over the states and their actions, until convergence occurs
(when the values of 𝑉 (𝑠) stop changing much).
Value iteration works with a finite set of actions. Value iteration has been proven
to converge to the optimal values, but, as we can see in the pseudocode in Listing 2.1,
it does so quite inefficiently by essentially repeatedly enumerating the entire state
space in a triply nested loop, traversing the state space many times. Soon we will
see more efficient methods.
2.2 Tabular Value-Based Agents 41

def v al ue _ it er a ti o n () :
initialize ( V )
while not convergence ( V ) :
for s in range ( S ) :
for a in range ( A ) :
Í 0 0 0
Q [s , a ] = 𝑠0 ∈𝑆 𝑇𝑎 (𝑠, 𝑠 ) (𝑅 𝑎 (𝑠, 𝑠 ) + 𝛾𝑉 [𝑠 ])
V [ s ] = max_a ( Q [s , a ])
return V

Listing 2.1 Value iteration pseudocode

2.2.4.1 Hands On: Value Iteration in Gym

We have discussed in detail how to model a reinforcement learning problem with

an MDP. We have talked in depth and at length about states, actions, and policies. It
is now time for some hands-on work, to experiment with the theoretical concepts.
Let us get some first hands on experience with reinforcement learning agent and
environment algorithms. We will start with the environment.

OpenAI Gym

OpenAI has created the Gym suite of environments for Python, which has become
the de facto standard in the field of research [108]. The Gym suite can be found at
OpenAI4 and on GitHub.5 Gym works on Linux, macOS and Windows. An active
community exists and new environments are created continuously and uploaded to
the Gym website. Many interesting environments are available for experimentation,
to create your own agent algorithm for, and test it.
If you browse Gym on GitHub, you will see different sets of environments,
from easy to advanced. There are the classics, such as Cartpole and Mountain car.
There are also small text environments. Taxi is there, and the Arcade Learning
Environment [71], which was used in the paper that introduced DQN [521], as we
will discuss at length in the next chapter. MuJoCo6 is also available, an environment
for experimentation with simulated robotics [779] (student licences are free, or you
can use pybullet).7
We can now install Gym. Go to the Gym page on https://gym.openai.com
and read the documentation. Make sure Python is installed on your system (does
typing python at the command prompt work?), and that your Python version is up
to date (version 3.10 at the time of this writing). Then type
4 https://gym.openai.com
5 https://github.com/openai/gym
6 http://www.mujoco.org
7 https://pybullet.org/wordpress/
42 2 Tabular Value-Based Reinforcement Learning

import gym

env = gym . make ( ’ CartPole - v0 ’)

env . reset ()
for _ in range (1000) :
env . render ()
env . step ( env . action_space . sample () ) # take a random action
env . close ()

Listing 2.2 Running the Gym CartPole Environment from Gym

pip install gym

to install Gym with the Python package manager. Soon, you will also be needing
deep learning suites, such as TensorFlow or PyTorch. It is recommended to install
Gym in the same virtual environment as your upcoming PyTorch and TensorFlow
installation, so that you can use both at the same time (see Sect. B.3.4). You may
have to install or update other packages, such as numpy, scipy and pyglet, to get
Gym to work, depending on your system installation.
You can check if the installation works by seeing if the CartPole environment
works, see Listing 2.2. A window should appear on your screen in which a Cartpole
is making random movements (your window system should support OpenGL, and
you may need a version of pyglet newer than 1.5.11 on some operating systems).

Taxi Example with Value Iteration

The Taxi example (Fig. 2.8) is an environment where taxis move up, down, left,
and right, and pickup and drop off passengers. Let us see how we can use value
iteration to solve the Taxi problem. The Gym documentation describes the Taxi
world as follows. There are four designated locations in the Grid world indicated by
R(ed), B(lue), G(reen), and Y(ellow). When the episode starts, the taxi starts off at a
random square and the passenger is at a random location. The taxi drives to the
passenger’s location, picks up the passenger, drives to the passenger’s destination
(another one of the four specified locations), and then drops off the passenger. Once
the passenger is dropped off, the episode ends.
The Taxi problem has 500 discrete states: there are 25 taxi positions, five possible
locations of the passenger (including the case when the passenger is in the taxi),
and 4 destination locations (25 × 5 × 4).
The environment returns a new result tuple at each step. There are six discrete
deterministic actions for the Taxi driver:
0: Move south
1: Move north
2: Move east
2.2 Tabular Value-Based Agents 43

import gym
import numpy as np

def i t e r a t e _ v a l u e _ f u n c t i o n ( v_inp , gamma , env ) :

ret = np . zeros ( env . nS )
for sid in range ( env . nS ) :
temp_v = np . zeros ( env . nA )
for action in range ( env . nA ) :
for ( prob , dst_state , reward , is_final ) in env . P [ sid
][ action ]:
temp_v [ action ] += prob *( reward + gamma * v_inp [
dst_state ]*( not is_final ) )
ret [ sid ] = max ( temp_v )
return ret

def b u i l d _ g r e e d y _ p o l i c y ( v_inp , gamma , env ) :

new_policy = np . zeros ( env . nS )
for state_id in range ( env . nS ) :
profits = np . zeros ( env . nA )
for action in range ( env . nA ) :
for ( prob , dst_state , reward , is_final ) in env . P [
state_id ][ action ]:
profits [ action ] += prob *( reward + gamma * v [
dst_state ])
new_policy [ state_id ] = np . argmax ( profits )
return new_policy

env = gym . make ( ’ Taxi - v3 ’)

gamma = 0.9
cum_reward = 0
n_rounds = 500
env . reset ()
for t_rounds in range ( n_rounds ) :
# init env and value function
observation = env . reset ()
v = np . zeros ( env . nS )

# solve MDP
for _ in range (100) :
v_old = v . copy ()
v = i t e r a t e _ v a l u e _ f u n c t i o n (v , gamma , env )
if np . all ( v == v_old ) :
break
policy = b u i l d _ g r e e d y _ p o l i c y (v , gamma , env ) . astype ( np . int )

# apply policy
for t in range (1000) :
action = policy [ observation ]
observation , reward , done , info = env . step ( action )
cum_reward += reward
if done :
break
if t_rounds % 50 == 0 and t_rounds > 0:
print ( cum_reward * 1.0 / ( t_rounds + 1) )
env . close ()

Listing 2.3 Value Iteration for Gym Taxi

44 2 Tabular Value-Based Reinforcement Learning

Fig. 2.8 Taxi world [394]

3: Move west
4: Pick up passenger
5: Drop off passenger
There is a reward of −1 for each action and an additional reward of +20 for
delivering the passenger. There is a reward of −10 for executing actions pickup and
dropoff illegally.
The Taxi environment has a simple transition function, which is used by the
agent in the value iteration code.8 Listing 2.3 shows an implementation of value
iteration that uses the Taxi environment to find a solution. This code is written by
Mikhail Trofimov, and illustrates clearly how value iteration first creates the value
function for the states, and then that a policy is formed by finding the best action
in each state, in the build-greedy-policy function.9
Please use the value iteration code with the Gym Taxi environment, refer to
Listing 2.3. Run the code, and play around with some of the hyperparameters to
familiarize yourself a bit with Gym and with planning by value iteration. Try to
visualize for yourself what the algorithm is doing. This will prepare you for the
more complex algorithms that we will look into next.

2.2.4.2 Model-Free Learning

The value iteration algorithm can compute the policy function. It uses the transition
model in its computation. Frequently, we are in a situation when the exact transition
probabilities are not known to the agent, and we need other methods to compute the
8Note that the code uses the environment to compute the next state, so that we do not have to
implement a version of the transition function for the agent.
9 https://gist.github.com/geffy/b2d16d01cbca1ae9e13f11f678fa96fd#file-taxi-vi-
py
2.2 Tabular Value-Based Agents 45

Name Approach Ref

Value Iteration Model-based enumeration [72, 15]
SARSA On-policy temporal difference model-free [644]
Q-learning Off-policy temporal difference model-free [830]
Table 2.1 Tabular Value-Based Approaches

policy function. For this situation, model-free algorithms have been developed, for
the agent to compute the policy without knowing any of the transition probabilities
itself.
The development of these model-free methods is a major milestone of reinforce-
ment learning, and we will spend some time to understand how they work. We will
start with value-based model-free algorithms. We will see how, when the agent does
not know the transition function, an optimal policy can be learned by sampling
rewards from the environment. Table 2.1 lists value iteration in conjunction with
the model-free algorithms that we cover in this chapter.
First we will discuss how the principle of temporal difference uses sampling and
bootstrapping to construct the value function from irreversible actions. We will
see how the value function can be use to find the best actions, to form the policy.
Second, we will discuss which mechanisms for action selection exist, where we will
encounter the exploration/exploitation trade-off. Third, we will discuss how to learn
from the rewards of the selected actions. We will encounter on-policy learning and
off-policy learning. We wil discuss two simple algorithms: SARSA and Q-learning.
Let us now start by having a closer look at sampling actions with temporal difference
learning.

Temporal Difference Learning

In the previous section the value function was calculated recursively by using the
value function of successor states, following Bellman’s equation (Eq. 2.7).
Bootstrapping is a process of subsequent refinement by which old estimates of
a value are refined with new updates. It means literally: pull yourself up by your
boot straps. Bootstrapping solves the problem of computing a final value when we
only know how to compute step-by-step intermediate values. Bellman’s recursive
computation is a form of bootstrapping. In model-free learning, the role of the
transition function is replaced by an iterative sequence of environment samples.
A bootstrapping method that can be used to process the samples, and to refine
them to approximate the final value, is temporal difference learning. Temporal
difference learning, TD for short, was introduced by Sutton [739] in 1988. The
temporal difference in the name refers to the difference in values between two time
steps, which it uses to calculate the value at the new time step.
Temporal difference learning works by updating the current estimate of the state
value 𝑉 (𝑠) (the bootstrap-value) with an error value based on the estimate of the
46 2 Tabular Value-Based Reinforcement Learning

next state that it has gotten through sampling the environment:

𝑉 (𝑠) ← 𝑉 (𝑠) + 𝛼[𝑟 0 + 𝛾𝑉 (𝑠 0) − 𝑉 (𝑠)] (2.8)

Here 𝑠 is the current state, 𝑠 0 the new state, and 𝑟 0 the reward of the new state.
Note the introduction of 𝛼, the learning rate, which controls how fast the algorithm
learns (bootstraps). It is an important parameter; setting the value too high can
be deterimental since the last value then dominates the bootstrap process too
much. Finding the optimal value will require experimentation. The 𝛾 parameter is
the discount rate. The last term −𝑉 (𝑠) subtracts the value of the current state, to
compute the temporal difference. Another way to write the update rule is

𝑉 (𝑠) ← 𝛼[𝑟 0 + 𝛾𝑉 (𝑠 0)] + (1 − 𝛼)𝑉 (𝑠)

as the difference between the new temporal difference target and the old value.
Note the absence of transition model 𝑇 in the formula; temporal difference is a
model-free update formula.
The introduction of the temporal difference method has allowed model-free
methods to be used successfully in various reinforcement learning settings. Most
notably, it was the basis of the program TD-Gammon, that beat human world-
champions in the game of Backgammon in the early 1990s [762].
Now that we know how to calculate the value function (the up-motion in the
tree diagram), let us see how we can select the action in our model-free algorithm
(the down-motion in the tree diagram).

Find Policy by Value-based Learning

The goal of reinforcement learning is to construct the policy with the highest
cumulative reward. Thus, we must find the best action 𝑎 in each state 𝑠. In the
value-based approach we know the value functions 𝑉 (𝑠) or 𝑄(𝑠, 𝑎). How can that
help us to find action 𝑎? In a discrete action space, there is at least one discrete
action with the highest value. Thus, if we have the optimal state-value 𝑉 ★, then the
optimal policy can be found by finding the action with that value. This relationship
is given by
𝜋★ = max 𝑉 𝜋 (𝑠) = max 𝑄 𝜋 (𝑠, 𝑎)
𝜋 𝑎, 𝜋

and the arg max function finds the best action for us

𝑎★ = arg max 𝑄★ (𝑠, 𝑎).

𝑎∈𝐴

In this way the optimal policy sequence of best actions 𝜋★ (𝑠) can be recovered from
the value functions, hence the name value-based method [845].
2.2 Tabular Value-Based Agents 47

2.2.4.3 Exploration

Since there is no local transition function, model-free methods perform their state
changes directly in the environment. This may be an expensive operation, for
example, when a real-world robot arm has to perform a movement. The sampling
policy should choose promising actions to reduce the number of samples as much
as possible, and not waste any actions. What behavior policy should we use? It is
tempting to favor at each state the actions with the highest Q-value, since then we
would be following what is currently thought to be the best policy.
This approach is called the greedy approach. It appears attractive, but is short-
sighted and risks settling for local maxima. Following the trodden path based
on only a few early samples risks missing a potential better path. Indeed, the
greedy approach is high variance, using values based on few samples, resulting
in a high uncertainty. We run the risk of circular reinforcement, if we update
the same behavior policy that we use to choose our samples from. In addition to
exploiting known good actions, a certain amount of exploration of unknown actions
is necessary. Smart sampling strategies use a mix of the current behavior policy
(exploitation) and randomness (exploration) to select which action to perform in the
environment.

Bandit Theory

The exploration/exploitation trade-off, the question of how to get the most reliable
information at the least cost, has been studied extensively in the literature [345, 844].
The field has the colorful name of multi-armed bandit theory [30, 442, 278, 631]. A
bandit in this context refers to a casino slot machine, with not one arm, but many
arms, each with a different and unknown payout probability. Each trial costs a coin.
The multi-armed bandit problem is then to find a strategy that finds the arm with
the highest payout at the least cost.
A multi-armed bandit is a single-state reinforcement learning problem, a one-step
non-sequential decision making problem, with the arms representing the possible
actions. This simplified model of stochastic decision making allows the in-depth
study of exploration/exploitation strategies.
Single-step exploration/exploitation questions arise for example in clinical trials,
where new drugs are tested on test-subjects (real people). The bandit is the trial, and
the arms are the choice how many of the test subjects are given the real experimental
drug, and how many are given the placebo. This is a serious setting, since the cost
may be measured in the quality of human lives.
In a conventional fixed randomized controlled trial (supervised setup) the sizes
of the groups that get the experimental drugs and the control group would be fixed,
and the confidence interval and the duration of the test would also be fixed. In an
adaptive trial (bandit setup) the sizes would adapt during the trial depending on
the outcomes, with more people getting the drug if it appears to work, and fewer if
it does not.
48 2 Tabular Value-Based Reinforcement Learning

Fig. 2.9 Adaptive Trial [5]

Let us have a look at Fig. 2.9. Assume that the learning process is a clinical trial
in which three new compounds are tested for their medical effect on test subjects.
In the fixed trial (left panel) all test subjects receive the medicine of their group to
the end of the test period, after which the data set is complete and we can determine
which of the compounds has the best effect. At that point we know which group has
had the best medicine, and which two thirds of the subjects did not, with possibly
harmful effect. Clearly, this is not a satisfactory situation. It would be better if we
could gradually adjust the proportion of the subjects that receive the medicine
that currently looks best, as our confidence in our test results increases as the trial
progresses. Indeed, this is what reinforcement learning does (Fig. 2.9, right panel).
It uses a mix of exploration and exploitation, adapting the treatment, giving more
subjects the promising medicine, while achieving the same confidence as the static
trial at the end [442, 441].

𝝐-greedy Exploration

A popular pragmatic approach is to use a fixed ratio of exploration versus exploita-

tion. This approach is known as the 𝜖-greedy approach, which is to mostly try the
(greedy) action that currently has the highest policy value except to explore an 𝜖
fraction of times a randomly selected other action. If 𝜖 = 0.1 then 90% of the times
the currently-best action is taken, and 10% of the times a random other action.
The algorithmic choice between greedily exploiting known information and
exploring unknown actions to gain new information is called the exploration/ex-
ploitation trade-off. It is a central concept in reinforcement learning; it determines
how much confidence we have in our outcome, and how quickly the confidence can
be increased and the variance reduced. A second approach is to use an adapative
𝜖-ratio, that changes over time, or over other statistics of the learning process.
Other popular approaches to add exploration are to add Dirichlet-noise [424] or
to use Thompson sampling [769, 647].
2.2 Tabular Value-Based Agents 49

2.2.4.4 Off-Policy Learning

In addition to the exploration/exploitation selection question, another main theme

in the design of reinforcement learning algorithms is how the behavior policy
learns. Should it learn on-policy—only from its recent actions—or off-policy, from
all available information?
Reinforcement learning is concerned with learning a policy from actions and
rewards. The agent selects an action, and learns from the reward that it gets back
from the environment. We will have a closer look at this reward learning: should
downward selection and upward learning always be tightly coupled, or can they be
loosely coupled?
Normally, the learning takes place by consistently backing up the value of the
selected action back to the same behavior policy function that was used to select the
action: the learning is on-policy. There is, however, an alternative. If the learning
takes place by backing up values of another action, not the one selected by the
behavior policy, then this is known as off-policy learning. This makes sense when
the behavior policy explores by selecting a non-optimal action (it does not perform
the greedy exploitation action; this usually results in an inferior reward value). On-
policy learning would then have to backup the value of the non-optimal exploration
action (since otherwise it would not learn on-policy). Off-policy learning, however,
is free to backup the value of the best action instead, and not the inferior one
selected by the exploration policy, not polluting the policy with a known inferior
choice. Thus, in the case of exploration, off-policy learning can be more efficient, by
not stubbornly backing up the value of the action selected by the behavior policy,
but the value of an older, better, action instead. (In the case of an exploiting step by
the policy, on-policy learning and off-policy learning are the same.)
Learning methods face a trade-off: they try to learn the best target action from
current behavior that is known so far (learning-exploitation), or they choose new
(but sometimes non-optimal) behavior. Do we penalize a node when exploration
found a worse value (as expected) or do we keep the optimistic value? Thus, off-
policy learning uses two policies: first, the behavior policy that is used for actual
action selection behavior (sometimes exploring), and second, the target policy that
is updated by backing up values. The first policy performs downward selection to
expand states, the second policy performs upward value propagation to update the
target policy. In on-policy learning these policies are the same, in off-polic learning
the behavior policy and the target policy are separate.
A well-known tabular on-policy algorithm is SARSA.10 An even more well-
known tabular off-policy algorithm is Q-learning. Both SARSA and Q-learning
converge to the optimal action-value function, although convergence in off-policy
can be slower, since older, non-current, values are used.
10 The name of the SARSA algorithm is a play on the MDP symbols as they occur in the action

value update formula: 𝑠, 𝑎, 𝑟 , 𝑠, 𝑎.

50 2 Tabular Value-Based Reinforcement Learning

On-Policy SARSA

In on-policy learning a single policy function is used for (downward) action selection
and (upward) value backup towards the learning target. SARSA is an on-policy
algorithm [644]. On-policy learning updates values directly on the single policy.
The same policy function is used for exploration behavior and for the target policy.
The SARSA update formula is

𝑄(𝑠𝑡 , 𝑎 𝑡 ) ← 𝑄(𝑠𝑡 , 𝑎 𝑡 ) + 𝛼[𝑟 𝑡+1 + 𝛾𝑄(𝑠𝑡+1 , 𝑎 𝑡+1 ) − 𝑄(𝑠𝑡 , 𝑎 𝑡 )] (2.9)

Going back to temporal difference (Eq. 2.8), we see that the SARSA formula looks
very much like TD, although now we deal with state-action values, and temporal
difference dealt with state values.
On-policy learning selects an action, evaluates it in the environment, and moves
on to better actions, guided by the behavior policy (which is not specified in the
formula, but might be 𝜖-greedy). On-policy learning begins with a behavior policy,
samples the state space with this policy, and improves the policy by backing up
values of the selected actions. Note that the term 𝑄(𝑠𝑡+1 , 𝑎 𝑡+1 ) can also be written
as 𝑄(𝑠𝑡+1 , 𝜋(𝑠𝑡+1 )), highlighting the difference with off-policy learning. SARSA
updates its Q-values using the Q-value of the next state 𝑠 and the current policy’s
action.
The primary advantage of on-policy learning is that it directly optimizes the
target of interest, and converges quickly by learning with the direct behavior values.
The biggest drawback is sample inefficiency, since the target policy is updated with
sub-optimal explorative rewards.

Off-Policy Q-Learning

Off-policy learning is more complicated; it uses separate behavior and target policies:
one for exploratory downward selection behavior, and one to update as the current
target backup policy. Learning (backing up) is from data off the downward behavior
policy, and the whole method is therefore called off-policy learning.
The most well-known off-policy algorithm is Q-learning [830]. It gathers infor-
mation from explored moves, it evaluates states as if a greedy policy was used, even
when the actual behavior performed an exploration step.
The Q-learning update formula is

𝑄(𝑠𝑡 , 𝑎 𝑡 ) ← 𝑄(𝑠𝑡 , 𝑎 𝑡 ) + 𝛼[𝑟 𝑡+1 + 𝛾 max 𝑄(𝑠𝑡+1 , 𝑎) − 𝑄(𝑠𝑡 , 𝑎 𝑡 )] (2.10)

𝑎

The difference from on-policy learning is that the 𝛾𝑄(𝑠𝑡+1 , 𝑎 𝑡+1 ) term from Eq. 2.9
has been replaced by 𝛾 max𝑎 𝑄(𝑠𝑡+1 , 𝑎). The learning is from backup values of
the best action, not the one that was actually evaluated. Listing 2.4 shows the full
pseudocode for Q-learning.
2.2 Tabular Value-Based Agents 51

def qlearn ( environment , alpha =0.001 , gamma =0.9 , epsilon =0.05) :

Q [ TERMINAL , _ ] = 0 # policy
for episode in range ( max_episodes ) :
s = s0
while s not TERMINAL : # perform steps of one full episode
a = epsilongreedy ( Q [ s ] , epsilon )
(r , sp ) = environment (s , a )
Q [s , a ] = Q [s , a ] + alpha *( r + gamma * max ( Q [ sp ]) -Q [s , a ])
s = sp
return Q

Listing 2.4 Q-learning pseudocode [830, 742]

The reason that Q-learning is off-policy is that it updates its Q-values using the Q-
value of the next state 𝑠, and the greedy action (not necessarily the behavior policy’s
action—it is learning off the behavior policy). In this sense, off-policy learning
collects all available information and uses it simultaneously to construct the best
target policy.
On-policy targets follow the behavioral policy and convergence is typically more
stable (low variance). Off-policy targets can learn the optimal policy/value (low
bias), but they can be unstable due to the max operation, especially in combination
with function approximation, as we will see in the next chapter.

Sparse Rewards and Reward Shaping

Before we conclude this section, we should discuss sparsity. For some environments
a reward exists for each state. For the supermarket example a reward can be calcu-
lated for each state that the agent has walked to. The reward is actually the opposite
of the cost expended in walking. Environments in which a reward exists in each
state are said to have a dense reward structure.
For other environments rewards may exist for only some of the states. For
example, in chess, rewards only exist at terminal board positons where there is a
win or a draw. In all other states the return depends on the future states and must be
calculated by the agent by propagating reward values from future states up towards
the root state 𝑠0 . Such an environment is said to have a sparse reward structure.
Finding a good policy is more complicated when the reward structure is sparse. A
graph of the landscape of such a sparse reward function would show a flat landscape
with a few sharp mountain peaks. Many of the algorithms that we will see in future
chapters use the reward-gradient to find good returns. Finding the optimum in a
flat landscape where the gradient is zero, is hard. In some applications it is possible
to change the reward function to have a shape more amenable to gradient-based
optimization algorithms such as we use in deep learning. Reward shaping can make
all the difference when no solution can be found with a naive reward function. It
is a way of incorporating heuristic knowledge into the MDP. A large literature on
52 2 Tabular Value-Based Reinforcement Learning

reward shaping and heuristic information exists [559]. The use of heuristics on
board games such as chess and checkers can also be regarded as reward shaping.

2.2.4.5 Hands On: Q-learning on Taxi

To get a feeling for how these algorithms work in practice, let us see how Q-learning
solves the Taxi problem.
In Sect. 2.2.4.1 we saw how value iteration solved the Taxi problem, making use
of the transition model. We will now see how we solve this problem if we do not
have the transition model, but have to use a model-free sample method. Q-learning
samples actions, and records the reward values in a Q-table, converging to the
state-action value function. When in all states the best values of the best actions
are known, these can be used to sequence the optimal policy.
Let us see how a value-based model-free algorithm would solve a simple 5 × 5
Taxi problem. Refer to Fig. 2.8 on page 44 for an illustration of Taxi world.
Please recall that in Taxi world, the taxi can be in one of 25 locations and there
are 25 × (4 + 1) × 4 = 500 different states that the environment can be in.
We follow the reward model as it is used in the Gym Taxi environment. Recall
that our goal is to find a policy (actions in each state) that leads to the highest
cumulative reward. Q-learning learns the best policy through guided sampling.
It records the rewards it gets from actions it performs in the environment. The
Q-values are the expected rewards of the actions in the states. It uses the Q-values
to guide which actions it will sample. Q-values 𝑄(𝑠, 𝑎) are stored in an array that
is indexed by state and action. The Q-values guide the exploration, higher values
indicate better actions.
Listing 2.5 shows the full Q-learning algorithm, in Python, after [394]. It uses
an 𝜖-greedy behavior policy: mostly the best action is followed, but in a certain
fraction a random action is chosen, for exploration. Recall that the Q-values are
updated according to the Q-learning formula:

𝑄(𝑠𝑡 , 𝑎 𝑡 ) ← 𝑄(𝑠𝑡 , 𝑎 𝑡 ) + 𝛼[𝑟 𝑡+1 + 𝛾 max 𝑄(𝑠𝑡+1 , 𝑎) − 𝑄(𝑠𝑡 , 𝑎 𝑡 )]

𝑎

where 0 ≤ 𝛾 ≤ 1 is the discount factor and 0 < 𝛼 ≤ 1 the learning rate. Note that
Q-learning uses bootstrapping, and the initial Q-values are set to a random value
(their value will disappear slowly due to the learning rate).
Q-learning is learning the best action to take in the current state by looking at
the reward for the current state-action combination, plus the maximum rewards
for the next state. Eventually the best policy is found in this way, and the taxi will
consider the route consisting of a sequence of the best rewards.
To summarize informally:
1. Initialize the Q-table to random values
2. Select a state 𝑠
3. For all possible actions from 𝑠 select the one with the highest Q-value and travel
to this state, which becomes the new 𝑠, or, with 𝜖 greedy, explore
2.2 Tabular Value-Based Agents 53

# Q learning for OpenAI Gym Taxi environment

import gym
import numpy as np
import random
# Environment Setup
env = gym . make ( " Taxi - v2 " )
env . reset ()
env . render ()
# Q [ state , action ] table im plemen tation
Q = np . zeros ([ env . o b s e r v a t i o n _ s p a c e .n , env . action_space . n ])
gamma = 0.7 # discount factor
alpha = 0.2 # learning rate
epsilon = 0.1 # epsilon greedy
for episode in range (1000) :
done = False
total_reward = 0
state = env . reset ()
while not done :
if random . uniform (0 , 1) < epsilon :
action = env . action_space . sample () # Explore state
space
else :
action = np . argmax ( Q [ state ]) # Exploit learned values
next_state , reward , done , info = env . step ( action ) #
invoke Gym
next_max = np . max ( Q [ next_state ])
old_value = Q [ state , action ]

new_value = old_value + alpha * ( reward + gamma *

next_max - old_value )

Q [ state , action ] = new_value

total_reward += reward
state = next_state
if episode % 100 == 0:
print ( " Episode ␣ {} ␣ Total ␣ Reward : ␣ {} " . format ( episode ,
total_reward ) )

Listing 2.5 Q-learning Taxi example, after [394]

4. Update the values in the Q-array using the equation

5. Repeat until the goal is reached; when the goal state is reached, repeat the process
until the Q-values stop changing (much), then stop.
Listing 2.5 shows Q-learning code for finding the policy in Taxi world.
The optimal policy can be found by sequencing together the actions with the
highest Q-value in each state. Listing 2.6 shows the code for this. The number of
illegal pickups/drop-offs is shown as penalty.
This example shows how the optimal policy can be found by the introduction of
a Q-table that records the quality of irreversible actions in each state, and uses that
54 2 Tabular Value-Based Reinforcement Learning

total_epochs , t ot a l_ pe n al ti e s = 0 , 0
ep = 100
for _ in range ( ep ) :
state = env . reset ()
epochs , penalties , reward = 0 , 0 , 0
done = False
while not done :
action = np . argmax ( Q [ state ])
state , reward , done , info = env . step ( action )
if reward == -10:
penalties += 1
epochs += 1
t ot al _ pe n al ti e s += penalties
total_epochs += epochs
print ( f " Results ␣ after ␣ { ep } ␣ episodes : " )
print ( f " Average ␣ timesteps ␣ per ␣ episode : ␣ { total_epochs ␣ / ␣ ep } " )
print ( f " Average ␣ penalties ␣ per ␣ episode : ␣ { t ot al _ pe n al ti e s ␣ / ␣ ep } " )

Listing 2.6 Evaluate the optimal Taxi result, after [394]

table to converge the rewards to the value function. In this way the optimal policy
can be found model-free.

Tuning your Learning Rate

Go ahead, implement and run this code, and play around to become familiar with
the algorithm. Q-learning is an excellent algorithm to learn the essence of how
reinforcement learning works. Try out different values for hyperparameters, such
as the exploration parameter 𝜖, the discount factor 𝛾 and the learning rate 𝛼. To
be successful in this field, it helps to have a feeling for these hyperparameters. A
choice close to 1 for the discount parameter is usually a good start, and a choice
close to 0 for the learning rate is a good start. You may feel a tendency to do the
opposite, to choose the learning rate as high as possible (close to 1) to learn as
quickly as possible. Please go ahead and see which works best in Q-learning (you
can have a look at [230]). In many deep learning environments a high learning rate
is a recipe for disaster, your algorithm may not converge at all, and Q-values can
become unbounded. Play around with tabular Q-learning, and approach your deep
learning slowly, with gentle steps!
The Taxi example is small, and you will get results quickly. It is well suited to build
up useful intuition. In later chapters, we will do experiments with deep learning,
that take longer to converge, and acquiring intuition for tuning hyperparameter
values will be more expensive.
2.3 Classic Gym Environments 55

Conclusion

We have now seen how a value function can be learned by an agent without having
the transition function, by sampling the environment. Model-free methods use
actions that are irreversible for the agent. It samples states and rewards from the
environment, using a behavior policy with the current best action, and following
an exploration/exploitation trade-off. The backup rule for learning is based on
bootstrapping, and can follow the rewards of the actions on-policy, including the
value of the occasional explorative action, or off-policy, always using the value
of the best action. We have seen two model-free tabular algorithms, SARSA and
Q-learning, where the value function is assumed to be stored in an exact table data
structure.
In the next chapter we will move to network-based algorithms for high-
dimensional state spaces, based on function approximation with a deep neural
network.

2.3 Classic Gym Environments

Now that we have discussed at length the agent algorithms, it is time to have a look
at the environments, the other part of the reinforcement learning model. Without
them, progress cannot be measured, and results cannot be compared in a meaningful
way. In a real sense, environments define the kind of intelligence that our artificial
methods can perform.
In this chapter we will start with a few smaller environments, that are suited
for the tabular algorithms that we have discussed. Two environments that have
been around since the early days of reinforcement learning are Mountain car and
Cartpole (see Fig. 2.10).

2.3.1 Mountain Car and Cartpole

Mountain car is a physics puzzle in which a car on a one-dimensional road is in

a valley between two mountains. The goal for the car is to drive up the mountain
and reach the flag on the right. The car’s engine can go forward and backward. The
problem is that the car’s engine is not strong enough to climb the mountain by itself
in a single pass [531], but it can do so with the help of gravity: by repeatedly going
back and forth the car can build up momentum. The challenge for the reinforcement
learning agent is to apply alternating backward and forward forces at the right
moment.
Cartpole is a pole-balancing problem. A pole is attached by a joint to a movable
cart, which can be pushed forward or backward. The pendulum starts upright, and
must be kept upright by applying either a force of +1 or −1 to the cart. The puzzle
56 2 Tabular Value-Based Reinforcement Learning

Fig. 2.10 Cartpole and Mountain Car

ends when the pole falls over, or when the cart runs too far left or right [57]. Again
the challenge is to apply the right force at the right moment, solely by feedback of
the pole being upright or too far down.

2.3.2 Path Planning and Board Games

Navigation tasks and board games provide environments for reinforcement learning
that are simple to understand. They are well suited to reason about new agent
algorithms. Navigation problems, and the heuristic search trees built for board
games, can be of moderate size, and are then suited for determining the best action
by dynamic programming methods, such as tabular Q-learning, A*, branch and
bound, and alpha-beta [646]. These are straightforward search methods that do not
attempt to generalize to new, unseen, states. They find the best action in a (largish)
space of states, all of which are present at training time—the optimization methods
do not perform generalization from training to test time.

Path Planning

Path planning (Fig 2.1) is a classic problem that is related to robotics [455, 264].
Popular versions are mazes, as we have seen earlier (Fig. 2.2). The Taxi domain was
originally introduced (Fig. 2.8) in the context of hierarchical problem solving [196].
Box-pushing problems such as Sokoban are frequently used as well [385, 204, 542,
877], see Fig. 2.3. The action space of these puzzles and mazes is discrete. Basic path
and motion planning can enumerate possible solutions [169, 321].
Small versions of mazes can be solved exactly by enumeration, larger instances
are only suitable for approximation methods. Mazes can be used to test algorithms
for path finding problems and are frequently used to do so. Navigation tasks and
box-pushing games such as Sokoban can feature rooms or subgoals, that may then
2.3 Classic Gym Environments 57

be used to test algorithms for hierarchically structured problems [235, 297, 614, 238]
(Chap. 8).
The problems can be made more difficult by enlarging the grid and by insert-
ing more obstacles. Mazes and Sokoban grids are sometimes procedurally gen-
erated [691, 328, 780]. The goal for the algorithms is typically to find a solution
for a grid of a certain difficulty class, to find a shortest path solution, or, in trans-
fer learning, to learn to solve a class of grids by training on a different class of
grids [858].

Board Games

Board games are a classic group of benchmarks for planning and learning since
the earliest days of artificial intelligence. Two-person zero-sum perfect information
board games such as tic tac toe, chess, checkers, Go, and shogi have been used to
test algorithms since the 1950s. The action space of these games is discrete. Notable
achievements were in checkers, chess, and Go, where human world champions
were defeated in 1994, 1997, and 2016, respectively [662, 124, 702].
The board games are typically used “as is” and are not changed for different
experiments (in contrast to mazes, that are often adapted in size or complexity for
specific purposes of the experiment). Board games are often used for the difficulty of
the challenge. The ultimate goals is to beat human grandmasters or even the world
champion. Board games have been traditional mainstays of artificial intelligence,
mostly associated with the search-based symbolic reasoning approach to artificial
intelligence [646]. In contrast, the benchmarks in the next chapter are associated
with connectionist artificial intelligence.

Summary and Further Reading

This has been a long chapter. We will summarize the chapter, and provide references
for further reading.

Summary

Reinforcement learning can learn behavior that achieves high rewards, using feed-
back from the environment. Reinforcement learning has no supervisory labels,
it can learn beyond a teacher, as long as there is an environment that provides
feedback.
Reinforcement learning problems are modeled as a Markov decision problem,
consisting of a 5-tuple (𝑆, 𝐴, 𝑇𝑎 , 𝑅 𝑎 , 𝛾) for state, action, transition, reward, and
58 2 Tabular Value-Based Reinforcement Learning

discount factor. The agent performs an action, and the environment returns the
new state and the reward value to be associated with the new state.
Games and robotics are two important fields of application. Fields of application
can be episodic (they end—such as a game of chess) or continuous (they do not
end—a robot remains in the world). In continuous problems it often makes sense to
discount behavior that is far from the present, episodic problems typically do not
bother with a discount factor—a win is a win.
Environments can be deterministic (many board games are deterministic—boards
don’t move) or stochastic (many robotic worlds are stochastic—the world around a
robot moves). The action space can be discrete (many games have a discrete action
space—a piece either moves to a square or it does not) or continuous (typical robot
joints move continuously—a car can move any distance, an arm can rotate over a
continuous angle).
The goal in reinforcement learning is to find for all states the best actions (the
policy, 𝜋) that maximizes the cumulative future reward. The policy function is
used in two different ways. In a discrete environment 𝑎 = 𝜋(𝑠) the policy function
typically returns for each state the best action in that sate. Alternatively the policy
returns the value of each action in each state, out of which the argmax function
can find the action with the highest value.
The optimal policy can be found by finding the maximal value of a state. The
value function 𝑉 (𝑠) returns the expected reward for a state. When the transition
function 𝑇𝑎 (𝑠, 𝑠 0) is present, the agent can use Bellman’s equation, or a dynamic
programming method to recursively traverse the behavior space. Value iteration is
one such dynamic programming method, it traverses all actions of all states, backing
up reward values, until the value function stops changing. Planning methods follow
the principle of dynamic programming. The state-action value 𝑄(𝑠, 𝑎) determines
the value of an action of a state.
Bellman’s equation calculates the value of a state by calculating the value of
successor states. Accessing successor states (by following the action and transition)
is also called expanding a successor state. In a tree diagram successor states are
called child nodes, and expanding is a downard action. Backpropagating the reward
values to the parent node is a movement upward in the tree.
Methods where the agent makes use of the transition model are called model-
based methods. When the agent does not use the transition model, they are model-
free methods.
In many situations the learning agent does not have access to the transition model
of the environment, and planning methods cannot be used by the agent. Value-based
model-free methods can find an optimal policy by using only irreversible actions,
sampling the environment to find the value of the actions.
A major determinant in model-free reinforcement learning is the exploration/ex-
ploitation trade-off, or how much of the information that has been learned from the
environment is used in choosing actions to sample. We discussed the advantages
of exploiting the latest knowledge in settings where environment actions are very
costly, such as clinial trials.
2.3 Classic Gym Environments 59

A well-known exploration/exploitation method is 𝜖-greedy, where the greedy

(best) action is followed from the behavior policy, except in 𝜖 times, when random
exploration is performed. Always following the policy’s best action runs the risk
of getting stuck in a cycle. Exploring random nodes allows breaking free of such
cycles.
So far we have discussed the action selection operation. How should we process
the rewards that are found at nodes? Here we introduced another fundamental
element of reinforcement learning: bootstrapping, or finding a value by refining a
previous value. Temporal difference learning uses the principle of bootstrapping to
find the value of a state by adding appropriately discounted future reward values
to the state value function.
We have now discussed both up and down motions, and can construct full model-
free algorithms. The best-known algorithm may well be Q-learning, which learns
the action-value function of each action in each state through off-policy temporal
difference learning. Off-policy algorithms improve the best policy function always
with the value of the best action, even if the behavior action was less. Off-policy
learning thus has a behavior policy that is different form the policy that is optimized.
In the next chapters we will look at value-based and policy-based model-free
methods for large, complex problems, that make use of function approximation
(deep learning).

There is a rich literature on tabular reinforcement learning. A standard work for

tabular value-based reinforcement learning is Sutton and Barto [742]. Two con-
densed introductions to reinforcement learning are [27, 254]. Another major work
on reinforcement learning is Bertsekas and Tsitsiklis [86]. Kaelbling has written an
important survey article on the field [388]. The early works of Richard Bellman on
dynamic programming, and planning algorithms are [72, 73]. For a recent treatment
of games and reinforcement learning, with a focus on heuristic search methods and
the methods behind AlphaZero, see [599].
The methods of this chapter are based on bootstrapping [72] and temporal
difference learning [739]. The on-policy algorithm SARSA [644] and the off-policy
algorithm Q-Learning [830] are among the best known exact, tabular, value-based
model-free algorithms.
For general reference, one of the major textbooks on artificial intelligence is
written by Russell and Norvig [646]. A more specific textbook on machine learning
is by Bishop [93].
60 2 Tabular Value-Based Reinforcement Learning

Exercises

We will end with questions on key concepts, with programming exercises to build
up more experience,

Questions

The questions below are meant to refresh your memory, and should be answered
with yes, no, or short answers of one or two sentences.
1. In reinforcement learning the agent can determine which training examples are
generated next through its action. Why is this beneficial? What is a potential
problem?
2. What is Grid world?
3. Which five elements does an MDP have to model reinforcement learning prob-
lems?
4. In a tree diagram, is successor selection of behavior up or down?
5. In a tree diagram, is learning values through backpropagation up or down?
6. What is 𝜏?
7. What is 𝜋(𝑠)?
8. What is 𝑉 (𝑠)?
9. What is 𝑄(𝑠, 𝑎)?
10. What is dynamic programming?
11. What is recursion?
12. Do you know a dynamic programming method to determine the value of a state?
13. Is an action in an environment reversible for the agent?
14. Mentions two tyical application areas of reinforcement learning.
15. Is the action space of games typically discrete or continuous?
16. Is the action space of robots typically discrete or continuous?
17. Is the environment of games typically deterministic or stochastic?
18. Is the environment of robots typically deterministic or stochastic?
19. What is the goal of reinforcement learning?
20. Which of the five MDP elements is not used in episodic problems?
21. Which model or function is meant when we say “model-free” or “model-based”?
22. What type of action space and what type of environment are suited for value-
based methods?
23. Why are value-based methods used for games and not for robotics?
24. Name two basic Gym environments.

Exercises

There is an even better way to learn about deep reinforcement learning then reading
about it, and that is to perform experiments yourself, to see the learning processes
2.3 Classic Gym Environments 61

# SARSA for OpenAI Gym Taxi environment

import gym
import numpy as np
import random
# Environment Setup
env = gym . make ( " Taxi - v2 " )
env . reset ()
env . render ()
# Q [ state , action ] table im plemen tation
Q = np . zeros ([ env . o b s e r v a t i o n _ s p a c e .n , env . action_space . n ])
gamma = 0.7 # discount factor
alpha = 0.2 # learning rate
epsilon = 0.1 # epsilon greedy
for episode in range (1000) :
done = False
total_reward = 0
current_state = env . reset ()
if random . uniform (0 , 1) < epsilon :
curr ent_ac tion = env . action_space . sample () # Explore
state space
else :
curr ent_ac tion = np . argmax ( Q [ current_state ]) # Exploit
learned values
while not done :
next_state , reward , done , info = env . step ( curren t_acti on )
# invoke Gym
if random . uniform (0 , 1) < epsilon :
next_action = env . action_space . sample () # Explore
state space
else :
next_action = np . argmax ( Q [ next_state ]) # Exploit
learned values
sarsa_value = Q [ next_state , next_action ]
old_value = Q [ current_state , curren t_acti on ]

new_value = old_value + alpha * ( reward + gamma *

sarsa_value - old_value )

Q [ current_state , cur rent_a ction ] = new_value

total_reward += reward
current_state = next_state
curr ent_ac tion = next_action
if episode % 100 == 0:
print ( " Episode ␣ {} ␣ Total ␣ Reward : ␣ {} " . format ( episode ,
total_reward ) )

Listing 2.7 SARSA Taxi example, after [394]

62 2 Tabular Value-Based Reinforcement Learning

unfold before your own eyes. The following exercises are meant as starting points
for your own discoveries in the world of deep reinforcement learning.
Consider using Gym to implement these exercises. Section 2.2.4.1 explains how
to install Gym.
1. Implement Q-learning for Taxi, including the procedure to derive the best policy
for the Q-table. Go to Sect. 2.2.4.5 and implement it. Print the Q-table, to see
the values on the squares. You could print a live policy as the search progresses.
Try different values for 𝜖, the exploration rate. Does it learn faster? Does it keep
finding the optimal solution? Try different values for 𝛼, the learning rate. Is it
faster?
2. Implement SARSA, the code is in Listing 2.7. Compare your results to Q-learning,
can you see how SARSA chooses different paths? Try different 𝜖 and 𝛼.
3. How large can problems be before converging starts taking too long?
4. Run Cartpole with the greedy policy computed by value iteration. Can you make
it work? Is value iteration a suitable algorithm for Cartpole? If not, why do you
think it is not?
Chapter 3
Deep Value-Based Reinforcement Learning

The previous chapter introduced the field of classic reinforcement learning. We

learned about agents and environments, and about states, actions, values, and policy
functions. We also saw our first planning and learning algorithms: value iteration,
SARSA and Q-learning. The methods in the previous chapter were exact, tabular,
methods, that work for problems of moderate size that fit in memory. Convergence
was relatively straightforward with access to the entire state space.
In this chapter we move to high-dimensional problems with large state spaces. We
will go beyond tabular methods and use methods to approximate the value function
and to generalize beyond trained behavior. We will do so with deep learning.
The methods in this chapter are deep model-free, value-based methods, related
to Q-learning. We will start by having a closer look at the new, larger, environments
that our agents must now be able to solve. Next, we will look at deep reinforcement
learning algorithms. In reinforcement learning the (greedy) current behavior policy
determines which action is selected next, a process that can be self-reinforcing.
There is no ground truth, as in supervised learning. The targets of loss functions are
no longer static, or even stable. In deep reinforcement learning convergence is based
on a bootstrapping process, and the first challenge is to find training methods that
converge to a stable value function. Furthermore, since with neural networks the
states are no longer represented individually, but by approximation, convergence
proofs can no longer count on identifying all states individually. By combining a
number of approaches (such as the replay buffer and increased exploration) Deep
Q-Networks (DQN) was able to achieve stable learning in a high-dimensional
environment. The success of DQN spawned a large research effort to improve
training further. We will discuss some of these new methods.
The chapter is concluded with exercises, a summary, and pointers to further
reading.

63
64 3 Deep Value-Based Reinforcement Learning

Deep Learning: Deep reinforcement learning builds on deep supervised

learning, and this chapter and the rest of the book assume a basic under-
standing of deep learning. When your knowledge of parameterized neural
networks and function approximation is rusty, this is the time to go to
Appendix B and take an in-depth refresher. The appendix also reviews
essential concepts such as training, testing, accuracy, overfitting and the
bias-variance trade-off. When in doubt, try to answer the questions on
page 329.

Core Concepts

• Diversity and stability

• Replay buffer

Core Problem

• Achieve stable deep reinforcement learning in large problems

Core Algorithm

• Deep Q-network (Listing 3.6)

End-to-end Learning

Before the advent of deep learning, traditional reinforcement learning has been
used mostly on small problems and puzzles, such as the supermarket example. Their
state space fits in the small memories of yesterday’s computers. Reward shaping, in
the form of domain-specific heuristics, was used to shoehorn the problem into a
computer, for example, in chess and checkers [124, 353, 661]. Deep learning changed
this situation, and reinforcement learning is now used on large and high-dimensional
problems.
In the field of supervised learning, a yearly competition had created years of
steady progress in which the accuracy of image classification had steadily improved.
Progress was driven by the availability of ImageNet, a large database of labeled
images [236, 191], by increases in computation power through GPUs, and by steady
3 Deep Value-Based Reinforcement Learning 65

Fig. 3.1 Atari Experiments on the Cover of Nature

Fig. 3.2 Example Game from the Arcade Learning Environment [71]

improvement of machine learning algorithms, especially in deep convolutional

neural networks. In 2012, a paper by Krizhevsky, Sutskever and Hinton presented a
method that out-performed other algorithms by a large margin, and approached the
performance of human image recognition [430]. The paper introduced the AlexNet
architecture (after the first name of the first author) and 2012 is often regarded as
the year of the breakthrough of deep learning. (See Appendix B for details.) This
breakthrough in deep supervised learning raised the question whether something
similar in reinforcement learning could be achieved.
We did not have to wait long, only a year later, in 2013, at the deep learning
workshop of one of the main AI conferences, a paper was presented on a deep
66 3 Deep Value-Based Reinforcement Learning

reinforcement learning algorithm that could play 1980s Atari video games just
by training on the pixel input of the video screen (Fig. 3.2). The algorithm used a
combination of deep learning and Q-learning, and was named Deep Q-Network, or
DQN [521, 522]. An illuminating video of how it learned to play the game Breakout
is here.1 This was a breakthrough for reinforcement learning. Many researchers
at the workshop could relate to this achievement, perhaps because they had spent
hours playing Space Invaders, Pac-Man and Pong themselves when they were
younger. Two years after the presentation at the deep learning workshop a longer
article appeared in the journal Nature in which a refined and expanded version of
DQN was presented (see Fig. 3.1 for the journal cover).
Why was this such a momentous achievement? Besides the fact that the problem
that was solved was easily understood, true eye-hand coordination of this com-
plexity had not been achieved by a computer before; furthermore, the end-to-end
learning from pixel to joystick implied artificial behavior that was close to how
humans play games. DQN essentially launched the field of deep reinforcement learn-
ing. For the first time the power of deep learning had been successfully combined
with behavior learning, for an imaginative problem.
A major technical challenge that was overcome by DQN is the instability of
the deep reinforcement learning process. In fact, there were convincing theoretical
analyses at the time that this instability was fundamental, and it was generally
assumed that it would be next to impossible to overcome [48, 282, 786, 742], since
the target of the loss-function depended on the convergence of the reinforcement
learning process itself. By the end of this chapter we will have covered the problems
of convergence and stability in reinforcement learning. We will have seen how
DQN addresses these problems, and we will also have discussed some of the many
further solutions that were devised after DQN.
But let us first have a look at the kind of new, high-dimensional, environments
that were the cause of these developments.

3.1 Large, High-Dimensional, Problems

In the previous chapter, Grid worlds and mazes were introduced as basic sequential
decision making problems in which exact, tabular, reinforcement learning methods
work well. The complexity of a problem is related to the number of unique states
that a problem has, or how large the state space is. Tabular methods work for small
problems, where the entire state space fits in memory. This is for example the case
with linear regression, which has only one variable 𝑥 and two parameters 𝑎 and
𝑏, or the Taxi problem, which has a state space of size 500. In this chapter we will
be more ambitious and introduce various games, most notably Atari arcade games.
The state space of Atari video input is 210 × 160 pixels of 256 RGB color values
= 25633600 , see Sect. B.1.2, where we discuss the curse of dimensionality.
1 https://www.youtube.com/watch?v=TmPfTpjtdgg
3.1 Large, High-Dimensional, Problems 67

Fig. 3.3 Atari 2600 console

There is a qualitative difference between small (500) and large (25633600 ) prob-
lems. For small problems the policy can be learned by loading all states of a problem
in memory. States are identified individually, and each has its own best action, that
we can try to find. Large problems, in contrast, do not fit in memory, the policy
cannot be memorized, and states are grouped together based on their features (see
Sect. B.1.3). A parameterized network maps states to actions and values; states are
no longer individually identifiable in a lookup table.
When deep learning methods were introduced in reinforcement learning, larger
problems than before could be solved. Let us have a look at those problems.

3.1.1 Atari Arcade Games

Learning actions directly from high-dimensional sound and vision inputs is one of
the long-standing challenges of artificial intelligence. To stimulate this research, in
2012 a test-bed was created designed to provide challenging reinforcement learning
tasks. It was called the Arcade Learning Environment, or ALE [71], and it was based
on a simulator for 1980s Atari 2600 video games. Figure 3.3 shows a picture of a
distinctly retro Atari 2600 gaming console.
Among other things ALE contains an emulator of the Atari 2600 console. ALE
presents agents with a high-dimensional2 visual input (210 × 160 RGB video at
60 Hz, or 60 images per second) of tasks that were designed to be interesting and
challenging for human players (Fig. 3.2 showed an example of such a game and
Fig. 3.4 shows a few more). The game cartridge ROM holds 2-4 kB of game code,
while the console random-access memory is small, just 128 bytes (really, just 128
bytes, although the video memory is larger, of course). The actions can be selected
2 That is, high dimensional for machine learning. 210 × 160 pixels is not exactly high-definition

video quality.
68 3 Deep Value-Based Reinforcement Learning

Fig. 3.4 Screenshots of 4 Atari Games (Breakout, Pong, Montezuma’s Revenge, and Private Eye)

via a joystick (9 directions), which has a fire button (fire on/off), giving 18 actions
in total.
The Atari games provide challenging eye-hand coordination and reasoning tasks,
that are both familiar and challenging to humans, providing a good test-bed for
learning sequential decision making.
Atari games, with high-resolution video input at high frame rates, are an entirely
different kind of challenge than Grid worlds or board games. Atari is a step closer to
a human environment in which visual inputs should quickly be followed by correct
actions. Indeed, the Atari benchmark called for very different agent algorithms,
prompting the move from tabular algorithms to algorithms based on function
approximation and deep learning. The ALE has become a standard benchmark in
deep reinforcement learning research. ALE is included in the Gym environment.
ALE has fulfilled its goal of stimulating research into deep end-to-end reinforcement
learning well.

3.1.2 Real-Time Strategy and Video Games

Real-time strategy games provide an even greater challenge than simulated 1980s
Atari consoles. Games such as StarCraft (Fig. 1.6) [572], and Capture the Flag [372]
have very large state spaces. These are games with large maps, many players, many
pieces, and many types of actions. The state space of StarCraft is estimated at
101685 [572], more than 1500 orders of magnitude larger than Go (10170 ) [539, 785]
3.2 Deep Value-Based Agents 69

and more than 1635 orders of magnitude large than chess (1047 ) [354], which
is played on a relatively small 8 × 8 board. Most real time strategy games are
multi-player, non-zero-sum, imperfect information games that also feature high-
dimensional pixel input, reasoning, and team collaboration. The action space is
stochastic and is a mix of discrete and continuous actions.
Despite the challenging nature, impressive achievements have been reported
recently in three games where human performance was matched or even ex-
ceeded [812, 80, 372], see also Chap. 7.

3.2 Deep Value-Based Agents

We will now turn to agent algorithms for solving large sequential decision problems.
The main challenge of this section is to create an agent algorithm that can learn a
good policy by interacting with the world—with a large problem, not a toy problem.
From now on, our agents will be deep learning agents.
Let us look at deep reinforcement learning algorithms. How can we use deep
learning for high-dimensional and large sequential decision making environments?
How can tabular value and policy functions 𝑉, 𝑄, and 𝜋 be transformed into 𝜃
parameterized functions 𝑉 𝜃 , 𝑄 𝜃 , and 𝜋 𝜃 ?

3.2.1 Generalization of Large Problems with Deep Learning

Deep supervised learning uses a static dataset to approximate a function, and loss-
function targets (the labels) are therefore stable. However, convergence in deep
reinforcement learning is based on the Q-learning bootstrapping process, and we
lack a static dataset with ground truths; our data items are generated dynamically,
and our bootstrapped loss-function targets move. The movement is influenced by
the same policy function that the convergence process is trying to learn.
It has taken quite some effort to find deep learning algorithms that converge
on these moving targets. Let us try to understand in more detail how the super-
vised methods have to be adapted in order to work in reinforcement learning, by
comparing three algorithmic structures.

3.2.1.1 Minimizing Supervised Target Loss

Listing 3.1 shows pseudocode for a typical supervised deep learning training algo-
rithm, consisting of an input dataset, a forward pass that calculates the network
output, a loss computation, and a backward pass. See Appendix B or [279] for more
details. We see that the code consists of a double loop: the outer loop controls
the training epochs. Epochs consist of forward approximation of the target value
70 3 Deep Value-Based Reinforcement Learning

def train_sl ( data , net , alpha =0.001) : # train classifier

for epoch in range ( max_epochs ) : # an epoch is one pass
sum_sq = 0 # reset to zero for each pass
for ( image , label ) in data :
output = net . forward_pass ( image ) # predict
sum_sq += ( output - label ) **2 # compute error
grad = net . gradient ( sum_sq ) # derivative of error
net . backward_pass ( grad , alpha ) # adjust weights
return net

Listing 3.1 Network training pseudocode for supervised learning

def qlearn ( environment , alpha =0.001 , gamma =0.9 , epsilon =0.05) :

Listing 3.2 Q-learning pseudocode [830, 742]

using the parameters, computation of the gradient, and backward adjusting of the
parameters with the gradient. In each epoch the inner loop serves all examples
of the static dataset to the forward computation of the output value, the loss and
the gradient computation, so that the parameters can be adjusted in the backward
pass. The dataset is static, and all that the inner loop does is deliver the samples
to the backpropagation algorithm. Note that each sample is independent of the
other, samples are chosen with equal probability. After an image of a white horse is
sampled, the probability that the next image is of a black grouse or a blue moon is
equally (un)likely.

3.2.1.2 Bootstrapping Q-Values

Let us now look at Q-learning. Reinforcement learning chooses the training exam-
ples differently. For convergence of algorithms such as Q-learning, the selection rule
must guarantee that eventually all states will be sampled by the environment [830].
For large problems, this is not the case; this condition for convergence to the value
function does not hold.
Listing 3.2 shows the short version of the bootstrapping tabular Q-learning
pseudocode from the previous chapter. As in the previous deep learning algorithm,
the algorithm consists of a double loop. The outer loop controls the Q-value conver-
3.2 Deep Value-Based Agents 71

def train_qlearn ( environment , Qnet , alpha =0.001 , gamma =0.0 ,

epsilon =0.05
s = s0 # initialize start state
for epoch in range ( max_epochs ) : # an epoch is one pass
sum_sq = 0 # reset to zero for each pass
while s not TERMINAL : # perform steps of one full episode
a = epsilongreedy ( Qnet (s , a ) ) # net : Q [s , a ] - values
(r , sp ) = environment ( a )
output = Qnet . forward_pass (s , a )
target = r + gamma * max ( Qnet ( sp ) )
sum_sq += ( target - output ) **2
s = sp
grad = Qnet . gradient ( sum_sq )
Qnet . backward_pass ( grad , alpha )
return Qnet # Q - values

Listing 3.3 Network training pseudocode for reinforcement learning

gence episodes, and each episode consists of a single trace of (time) steps from the
start state to a terminal state. The Q-values are stored in a Python-array indexed
by 𝑠 and 𝑎, since Q is the state-action value. Convergence of the Q-values is as-
sumed to have occurred when enough episodes have been sampled. The Q-formula
shows how the Q-values are built up by bootstrapping on previous values, and how
Q-learning is learning off-policy, taking the max value of an action.
A difference with the supervised learning is that in Q-learning subsequent
samples are not independent. The next action is determined by the current policy,
and will most likely be the best action of the state (𝜖-greedy). Furthermore, the next
state will be correlated to the previous state in the trajectory. After a state of the
ball in the upper left corner of the field has been sampled, the next sample will
with very high probability also be of a state where the ball is close to the upper left
corner of the field. Training can be stuck in local minima.

3.2.1.3 Deep Reinforcement Learning Target-Error

The two algorithms—deep learning and Q-learning—look similar in structure. Both

consist of a double loop in which a target is optimized, and we can wonder if
bootstrapping can be combined with loss-function minimization. This is indeed
the case, as Mnih et al. [521] showed in 2013. Our third listing, Listing 3.3, shows
a naive deep learning version of Q-learning [521, 523], based on the double loop
that now bootstraps Q-values by minimizing a loss function through adjusting the
𝜃 parameters. Indeed, a Q-network can be trained with a gradient by minimizing a
sequence of loss functions that change at each iteration. The loss function for this
bootstrap process can be the squared difference between the Q-value 𝑄 𝜃𝑡 (𝑠, 𝑎) and
the old update target 𝑟 + 𝛾 max𝑎0 𝑄 𝜃𝑡−1 (𝑠 0, 𝑎 0).3
3 Deep Q-learning is a fixed-point iteration [506]. The gradient of this loss function is
72 3 Deep Value-Based Reinforcement Learning

An important observation is that the update targets depend on the previous

network weights 𝜃; this is in contrast with the targets used in a supervised learning
process, that are fixed before learning begins [521]. In other words, the loss function
of deep Q-learning minimizes a moving target, a target that depends on the network
being optimized.

3.2.2 Three Challenges

To summarize, there are three problems with our naive deep Q-learner. First, con-
vergence to the optimal Q-function depends on full coverage of the state space.
Second, there is a strong correlation between subsequent training samples. Third,
the loss function of gradient descent literally has a moving target. Let us have a
closer look at these three problems.

3.2.2.1 Coverage

Proofs that algorithms such as Q-learning converge to the optimal policy depend
on the assumption that all state-action pairs are sampled. Otherwise, the algorithms
will not converge to an optimal action value for each state. Clearly, in large state
spaces where not all states are sampled, this situation does not hold.

3.2.2.2 Correlation

The coverage of the training set is also problematic when states are not uniformly
distributed and training samples are distributed differently between training and
test time. This happens, for example, when a chess program has been trained on a
particular opening, and the opponent plays a different one. When test examples are
different from training examples, then generalization will be bad. This problem is
related to out-of-distribution training, see for example [484].
In supervised learning, data samples (states) are independent and are assumed
to be distributed evenly over the state space and over the training and test set.
In the static dataset of images, there is no relation between subsequent images,
and examples are independently sampled. Each image is separate and unrelated to
another image.
In reinforcement learning a sequence of states is generated in an agent/environ-
ment loop. The states differ only by a single action, one move or one stone, all other
features of the states remaining unchanged, and thus, the value of subsequent sam-
ples are correlated, which may result in a biased training. The training may cover
h i
∇ 𝜃𝑖 L𝑖 ( 𝜃𝑖 ) = E𝑠,𝑎∼𝜌(·) ;𝑠0 ∼E 𝑟 + 𝛾 max
0
𝑄 𝜃𝑖−1 (𝑠0 , 𝑎0 ) − 𝑄 𝜃𝑖 (𝑠, 𝑎) ∇ 𝜃𝑖 𝑄 𝜃𝑖 (𝑠, 𝑎)
𝑎

where 𝜌 is the behavior distribution and E the Atari emulator. Further details are in [521].
3.2 Deep Value-Based Agents 73

only a part of the state space, especially when greedy action selection increases
the tendency to select a small set of actions and states. The bias can result in the
so-called specialization trap (when there is too much exploitation, and too little
exploration). Correlation between subsequent states contributes to the low coverage
that we discussed before, reducing convergence towards the optimal Q-function,
increasing the probability of local optima and feedback loops.

3.2.2.3 Convergence

When we naively apply our deep supervised methods to reinforcement learning, we

encounter a problem. Deep supervised learning uses a static dataset to approximate
a function, and loss-function targets are therefore stable (see Appendix B). However,
we lack such a static dataset in reinforcement learning; our data items are generated
dynamically, and our loss-function targets (the rewards) move. The loss is the
squared difference between the Q-value 𝑄 𝜃𝑡 (𝑠, 𝑎) and the old update target 𝑟 +
𝛾 max𝑎0 𝑄 𝜃𝑡−1 (𝑠 0, 𝑎 0). Thus both depend on parameters 𝜃 that are optimized, and
such naive optimization process can easily become unstable. It has taken quite some
effort to find algorithms that can tolerate these moving targets.

Deadly Triad

Multiple works [48, 282, 786] showed that a combination of off-policy reinforcement
learning with nonlinear function approximation (such as deep neural networks)
could cause Q-values to diverge. Sutton and Barto [742] further analyze three
elements for divergent training: function approximation, bootstrapping, and off-
policy learning, focusing on the effect state identification. Together, they are called
deadly triad.
Function approximation may attribute values to states inaccurately. In contrast
to exact tabular methods, that are designed to identify individual states exactly,
neural networks are designed to individual features of states. These features can be
shared by different states, and values attributed to those features are shared also by
other states. Function approximation may thus cause misidentification of states,
and reward values and Q-values that are not assigned correctly. In a reinforcement
learning process where new states are generated on the fly, divergent Q-values
may cause loops or other forms of instability, as we just discussed. If the accuracy
of the approximation of the true function values is good enough, then states may
be identified well enough to reduce or prevent divergent training processes and
loops [522].
Bootstrapping of values builds up new values on the basis of older values. This
occurs in Q-learning and temporal-difference learning where the current value
depends on the previous value. Bootstrapping increases the efficiency of the training
because values do not have to be calculated from the start. However, errors or
biases in initial values may persist, and even spill over to other states as values are
74 3 Deep Value-Based Reinforcement Learning

propagated incorrectly due to function approximation, and error values become

correlated between states. Bootstrapping and function approximation can thus
increase divergence.
Off-policy learning uses a behavior policy that is different from the target policy
that we are optimizing for (Sect. 2.2.4.4). When the behavior policy is improved,
the off-policy values may not improve. Off-policy learning converges generally less
well than on-policy learning as it converges independently from the behavior policy.
With function approximation convergence may be even slower, due to values being
assigned to incorrect states.

3.2.3 Stable Deep Value-Based Learning

These considerations discouraged further research in deep reinforcement learn-

ing for many years. Instead, research focused on linear function approximators,
which have better convergence guarantees. Nevertheless, work on convergent deep
reinforcement learning continued [653, 323, 88, 494], and algorithms such as neu-
ral fitted Q-learning were developed [630, 452, 481]. After the further results of
DQN [521] showed convincingly that convergence and stable learning could be
achieved in a non-trivial problem, even more experimental studies were performed
to find out under which circumstances convergence can be achieved. Further con-
vergence and diversity-enhancing techniques were developed, some of which we
will cover in Sect. 3.2.4.
Although the theory provides reasons why function approximation may preclude
stable reinforcement learning, there were, in fact, indications that stable training
is possible. Starting at the end of the 1980s, Tesauro had written a program that
played very strong Backgammon based on a neural network. The program was
called Neurogammon, and used supervised learning from grand-master games [761].
In order to improve the strength of the program, he switched to temporal difference
reinforcement learning from self-play games [763]. TD-Gammon [762] learned by
playing against itself, achieving stable learning in a shallow network. TD-Gammon’s
training used a temporal difference algorithm similar to Q-learning, approximating
the value function with a network with one hidden layer, using raw board input
enhanced with hand-crafted heuristic features [762]. Perhaps some form of stable
reinforcement learning was possible, at least in a shallow network?
TD-Gammon’s success prompted attempts with TD learning in checkers [139]
and Go [737, 154]. Unfortunately the success could not be replicated in these games,
and it was believed for some time that Backgammon was a special case, well suited
for reinforcement learning and self-play [603, 677].
However, as there came further reports of successful applications of deep neu-
ral networks in a reinforcement learning setting [323, 653], more work followed.
The results in Atari [522] and later in Go [705] as well as further work [798] have
now provided clear evidence that both stable training and generalizing reinforce-
3.2 Deep Value-Based Agents 75

ment learning are indeed possible, and have improved our understanding of the
circumstances that influence stability and convergence.
Let us have a closer look at the methods that are used to achieve stable deep
reinforcement learning.

3.2.3.1 Decorrelating States

As mentioned in the introduction of this chapter, in 2013 Mnih et al. [521, 522]
published their work on end-to-end reinforcement learning in Atari games.
Let us look in more detail at how stable learning was achieved in their Deep
Q-Network algorithm (DQN). The original focus of DQN is on breaking correlations
between subsequent states, and also on slowing down changes to parameters in
the training process to improve stability. The DQN algorithm has two methods to
achieve this: (1) experience replay and (2) infrequent weight updates. We will first
look at experience replay.

Experience Replay

In reinforcement learning training samples are created in a sequence of interactions

with the environment, and subsequent training states are strongly correlated to
preceding states. There is a tendency to train the network on too many samples
of a certain kind or in a certain area, and other parts of the state space remain
under-explored. Furthermore, through function approximation and bootstrapping,
some behavior may be forgotten. When an agent reaches a new level in a game that
is different from previous levels, the agent may forget how to play the other level.
We can reduce correlation—and the local minima they cause—by adding a small
amount of supervised learning. By using a buffer, a dynamic dataset from which
recent training examples are sampled, we train states from a more diverse set,
instead of only from the most recent one. To break correlations and to create a
more diverse set of training examples, DQN uses experience replay. Experience
replay introduces a replay buffer [480], a cache of previously explored states, from
which it samples training states at random.4 Experience replay stores the last 𝑁
examples in the replay memory, and samples uniformly when performing updates.
A typical number for 𝑁 is 106 [874]. The goal of experience replay is to increase
the independence of subsequent training examples. The next state to be trained on
is no longer a direct successor of the current state, but one somewhere in a long
history of previous states. In this way the replay buffer spreads out the learning
over all seen states, breaking temporal correlations between samples. DQN’s replay
buffer (1) improves coverage, and (2) reduces correlation.
4Originally experience replay is, as so much in artificial intelligence, a biologically inspired
mechanism [505, 571, 481].
76 3 Deep Value-Based Reinforcement Learning

DQN treats all examples equal, old and recent alike. A form of importance
sampling might differentiate between important transitions, as we will see in the
next section.
Note that, curiously, training by experience replay is a form of off-policy learning,
since the target parameters are different from those used to generate the sample.
Off-policy learning is one of the three elements of the deadly triad, and we find that
stable learning can actually be improved by a special form of one of its causes.
Experience replay works well in Atari [522]. However, further analysis of replay
buffers has pointed to possible problems. Zhang et al. [874] study the deadly triad
with experience replay, and find that larger networks resulted in more instabilities,
but also that longer multi-step returns yielded fewer unrealistically high reward
values. In Sect. 3.2.4 we will see many further enhancements to DQN-like algorithms.

3.2.3.2 Infrequent Updates of Target Weights

The second improvement in DQN is infrequent weight updates, introduced in the

2015 paper on DQN [522]. The aim of this improvement is to reduce divergence
that is caused by frequent updates of weights of the target 𝑄-value. Again, the aim
is to improve the stability of the network optimization by improving the stability
of the 𝑄-target in the loss function.
Every 𝑛 updates, the network 𝑄 is cloned to obtain target network 𝑄, ˆ which is
used for generating the targets for the following 𝑛 updates to 𝑄. In the original
DQN implementation a single set of network weights 𝜃 are used, and the network
is trained on a moving loss-target. Now, with infrequent updates the weights of the
target network change much slower than those of the behavior policy, improving
the stability of the Q-target.
The second network improves the stability of Q-learning, where normally an
update to 𝑄 𝜃 (𝑠𝑡 , 𝑎 𝑡 ) also increases the target at each time step, quite possibly
leading to oscillations and divergence of the policy. Generating the targets using an
older set of parameters adds a delay between the time an update to 𝑄 𝜃 is made and
the time the update changes the targets, making oscillations less likely.

3.2.3.3 Hands On: DQN and Breakout Gym Example

To get some hands on experience with DQN, we will now have a look at how DQN
can be used to play the Atari game Breakout.
The field of deep reinforcement learning is an open field where most codes of
algorithms are freely shared on GitHub and where test environments are available.
The most widely used environment is Gym, in which benchmarks such as ALE
and MuJoCo can be found, see also Appendix C. The open availability of the
software allows for easy replication, and, importantly, for further improvement of
the methods. Let us have a closer look at the code of DQN, to experience how it
works.
3.2 Deep Value-Based Agents 77

import gym

from s t a b l e _ b a s e l i n e s . common . policies import MlpPolicy

from s t a b l e _ b a s e l i n e s . common . vec_env import DummyVecEnv
from s t a b l e _ b a s e l i n e s import PPO2

env = gym . make ( ’ CartPole - v1 ’)

model = PPO2 ( MlpPolicy , env , verbose =1)

model . learn ( t ot a l_ ti m es te p s =10000)

obs = env . reset ()

for i in range (1000) :
action , _states = model . predict ( obs )
obs , rewards , dones , info = env . step ( action )
env . render ()

Listing 3.4 Running Stable Baseline PPO on the Gym Cartpole Environment

ALE and Gym are designed to get you started quickly with Atari problems. The
DQN papers come with source code. The original DQN code from [522] is available
at Atari DQN.5 This code is the original code, in the programming language Lua,
which may be interesting to study, if you are familiar with this language. A modern
reference implementation of DQN, with further improvements, is in the (stable)
baselines.6 The RL Baselines Zoo even provides a collection of pretrained agents, at
Zoo [602, 269].7 The Network Zoo is especially useful if your desired application
happens to be in the Zoo, since training often takes a long time.

Install Stable Baselines

An environment is only half of a reinforcement learning experiment, we also need

an agent algorithm that learns the policy. OpenAI also provides implementatinos of
agent algorithms, called the Baselines, at the Gym GitHub repository OpenAI Base-
lines.8 Most algorithms that are covered in this book are present. You can download
them, study the code, and experiment to gain an insight into their behavior.
In addition to OpenAi’s Baselines, there is Stable Baselines, a fork of the OpenAI
algorithms; it has more documentation and other features. It can be found at Stable
Baselines,9 and the documentation is at Docs.10
The stable release from the Stable Baselines is installed by typing
5 https://github.com/kuz/DeepMind-Atari-Deep-Q-Learner
6 https://stable-baselines.readthedocs.io/en/master/index.html
7 https://github.com/araffin/rl-baselines-zoo
8 https://github.com/openai/baselines
9 https://github.com/hill-a/stable-baselines
10 https://stable-baselines.readthedocs.io/en/master/
78 3 Deep Value-Based Reinforcement Learning

from s t a b l e _ b a s e l i n e s . common . a tari_w rapper s import make_atari

from s t a b l e _ b a s e l i n e s . deepq . policies import MlpPolicy , CnnPolicy
from s t a b l e _ b a s e l i n e s import DQN

env = make_atari ( ’ BreakoutNoFrameskip - v4 ’)

model = DQN ( CnnPolicy , env , verbose =1)

model . learn ( t ot a l_ ti m es te p s =25000)

obs = env . reset ()

while True :
action , _states = model . predict ( obs )
obs , rewards , dones , info = env . step ( action )
env . render ()

Listing 3.5 Deep Q-Network Atari Breakout example with Stable Baselines

pip install stable-baselines

pip install stable-baselines[mpi]

if support for OpenMPI is desired (a parallel message passing implementation for

cluster computers). A very quick check to see if everything works is to run the PPO
trainer from Listing 3.4. PPO is a policy-based algorithm that will be discussed in
Sect. 4.2.5. The Cartpole should appear again, but should now learn to stabilize for
a brief moment.

The DQN Code

After having studied tabular Q-learning on Taxi in Sect. 2.2.4.5, let us now see
how the network-based DQN works in practice. Listing 3.5 illustrates how easy
it is to use the Stable Baselines implementation of DQN on the Atari Breakout
environment. (See Sect. 2.2.4.1 for installation instructions of Gym.)
After you have run the DQN code and seen that it works, it is worthwhile to study
how the code is implemented. Before you dive into the Python implementation of
Stable Baselines, let’s look at the pseudocode to refresh how the elements of DQN
work together. See Listing 3.6. In this pseudocode we follow the 2015 version of
DQN [522]. (The 2013 version of DQN did not use the target network [521].)
DQN is based on Q-learning, with as extra a replay buffer and a target network
to improve stability and convergence. The core of Q-learning is the Q-table, the
core of DQN is the Q-network, both are implementations of the Q-function, which
holds the current action-values. First, at the start of the code, the replay buffer is
3.2 Deep Value-Based Agents 79

def dqn :
initialize replay_buffer empty
initialize Q network with random weights
initialize Qt target network with random weights
set s = s0
while not convergence :
# DQN in Atari uses preprocessing ; not shown
epsilon - greedy select action a in argmax ( Q (s , a ) ) # action
selection depends on Q ( moving target )
sx , reward = execute action in environment
append (s ,a ,r , sx ) to buffer
sample minibatch from buffer # break temporal correlation
take target batch R ( when terminal ) or Qt
do gradient descent step on Q # loss function uses target
Qt network

Listing 3.6 Pseudocode for DQN, after [522]

initialized to empty, and the weights of the Q network and the separate Q target
network are initialized. The state 𝑠 is set to the start state.
Next is the optimization loop, that runs until convergence. At the start of each
iteration an action is selected at the state 𝑠, following an 𝜖-greedy approach. The
action is executed in the environment, and the new state and the reward are stored
in a tuple in the replay buffer. Then, we train the Q-network. A minibatch is sampled
randomly from the replay buffer, and one gradient descent step is performed. For
this step the loss function is calculated with the separate Q-target network 𝑄ˆ 𝜃 , that
is updated less frequently than the primary Q-network 𝑄 𝜃 . In this way the loss
function
h i
L𝑡 (𝜃 𝑡 ) = E𝑠,𝑎∼𝜌( ·) E𝑠0 ∼E (𝑟 + 𝛾 max ˆ 𝜃 (𝑠 0, 𝑎 0)|𝑠, 𝑎) − 𝑄 𝜃 (𝑠, 𝑎) 2
𝑄
0 𝑡−1 𝑡
𝑎

is more stable, causing better convergence; 𝜌(𝑠, 𝑎) is behavior distribution over

𝑠 and 𝑎, and E is the Atari emulator [521]. Sampling the minibatch reduces the
correlation that is inherent in reinforcement learning between subsequent states.

Conclusion

In summary, DQN was able to successfully learn end-to-end behavior policies for
many different games (although similar and from the same benchmark set). Minimal
prior knowledge was used to guide the system, and the agent only got to see the
pixels and the game score. The same network architecture and procedure was used
on each game; however, a network trained for one game could not be used to play
another game.
80 3 Deep Value-Based Reinforcement Learning

Name Principle Applicability Effectiveness

DQN [521] replay buffer Atari stable Q learning
Double DQN [799] de-overestimate values DQN convergence
Prioritized experience [665] decorrelation replay buffer convergence
Distributional [70] probability distr stable gradients generalization
Random noise [253] parametric noise stable gradients more exploration
Table 3.1 Deep Value-Based Approaches

The DQN achievement was an important milestone in the history of deep rein-
forcement learning. The main problems that were overcome by Mnih et al. [521]
were training divergence and learning instability.
The nature of most Atari 2600 games is that they require eye-hand reflexes. The
games have few strategic elements, credit assignment is mostly over a short term,
and can be learned with a surprisingly simple neural network. Most Atari games
are more about immediate reflexes than about longer term reasoning. In this sense,
the problem of playing Atari well is not unlike an image categorization problem:
both problems are to find the right response that matches an input consisting of a
set of pixels. Mapping pixels to categories is not that different from mapping pixels
to joystick actions (see also the observations in [399]).
The Atari results have stimulated much subsequent research. Many blogs have
been written on reproducing the result, which is not a straightforward task, requir-
ing the fine-tuning of many hyperparameters [58].

3.2.4 Improving Exploration

The DQN results have spawned much activity among reinforcement learning re-
searchers to improve training stability and convergence further, and many refine-
ments have been devised, some of which we will review in this section.
Many of the topics that are covered by the enhancements are older ideas that
work well in deep reinforcement learning. DQN applies random sampling of its
replay buffer, and one of the first enhancements was prioritized sampling [665].
It was found that DQN, being an off-policy algorithm, typically overestimates
action values (due to the max operation, Sect. 2.2.4.4). Double DQN addresses
overestimation [799], and dueling DDQN introduces the advantage function to
standardize action values [829]. Other approaches look at variance in addition to
expected value, the effect of random noise on exploration was tested [253], and
distributional DQN showed that networks that use probability distributions work
better than networks that only use single point expected values [70].
In 2017 Hessel et al. [334] performed a large experiment that combined seven
important enhancements. They found that the enhancement worked well together.
The paper has become known as the Rainbow paper, since the major graph showing
the cumulative performance over 57 Atari games of the seven enhancements is
3.2 Deep Value-Based Agents 81

Fig. 3.5 Rainbow graph: performance over 57 Atari games [334]

multi-colored (Fig. 3.5). Table 3.1 summarizes the enhancements, and this section
provides an overview of the main ideas. The enhancements were tested on the same
benchmarks (ALE, Gym), and most algorithm implementations can be found on the
OpenAI Gym GitHub site in the baselines.11

3.2.4.1 Overestimation

Van Hasselt et al. introduce double deep Q learning (DDQN) [799]. DDQN is based
on the observation that Q-learning may overestimate action values. On the Atari
2600 games DQN suffers from substantial over-estimations. Remember that DQN
uses Q-learning. Because of the max operation in Q-learning this results in an
overestimation of the Q-value. To resolve this issue, DDQN uses the Q-Network to
choose the action but uses the separate target Q-Network to evaluate the action.
Let us compare the training target for DQN

𝑦 = 𝑟 𝑡+1 + 𝛾𝑄 𝜃𝑡 (𝑠𝑡+1 , arg max 𝑄 𝜃𝑡 (𝑠𝑡+1 , 𝑎)

𝑎

with the training target for DDQN (the difference is a single 𝜙)

𝑦 = 𝑟 𝑡+1 + 𝛾𝑄 𝜙𝑡 (𝑠𝑡+1 , arg max 𝑄 𝜃𝑡 (𝑠𝑡+1 , 𝑎).

𝑎

11 https://github.com/openai/baselines
82 3 Deep Value-Based Reinforcement Learning

The DQN target uses the same set of weights 𝜃 𝑡 twice, for selection and evaluation;
the DDQN target use a separate set of weights 𝜙𝑡 for evaluation, preventing overes-
timation due to the max operator. Updates are assigned randomly to either set of
weights.
Earlier Van Hasselt et al. [313] introduced the double Q learning algorithm in
a tabular setting. The later paper shows that this idea also works with a large
deep network. They report that the DDQN algorithm not only reduces the over-
estimations but also leads to better performance on several games. DDQN was
tested on 49 Atari games and achieved about twice the average score of DQN with
the same hyperparameters, and four times the average DQN score with tuned
hyperparameters [799].

Prioritized Experience Replay

DQN samples uniformly over the entire history in the replay buffer, where Q-
learning uses only the most recent (and important) state. It stands to reason to see
if a solution in between these two extremes performs well.
Prioritized experience replay, or PEX, is such an attempt. It was introduced by
Schaul et al. [665]. In the Rainbow paper PEX is combined with DDQN, and, as we
can see, the blue line (with PEX) indeed outperforms the purple line.
In DQN experience replay lets agents reuse examples from the past, although
experience transitions are uniformly sampled, and actions are simply replayed
at the same frequency that they were originally experienced, regardless of their
significance. The PEX approach provides a framework for prioritizing experience.
Important actions are replayed more frequently, and therefore learning efficiency
is improved. As measure for importance, standard proportional prioritized replay
is used, with the absolute TD error to prioritize actions. Prioritized replay is used
widely in value-based deep reinforcement learning. The measure can be computed
in the distributional setting using the mean action values. In the Rainbow paper all
distributional variants prioritize actions by the Kullback-Leibler loss [334].

Advantage Function

The original DQN uses a single neural network as function approximator; DDQN
(double deep Q-network) uses a separate target Q-Network to evaluate an action.
Dueling DDQN [829], also known as DDDQN, improves on this architecture by
using two separate estimators: a value function and an advantage function

𝐴(𝑠, 𝑎) = 𝑄(𝑠, 𝑎) − 𝑉 (𝑠).

Advantage functions are related to the actor-critic approach (see Chap. 4). An
advantage function computes the difference between the value of an action and the
value of the state. The function standardizes values on a baseline for the actions
3.3 Atari 2600 Environments 83

of a state [292]. Advantage functions provide better policy evaluation when many
actions have similar values.

3.2.4.2 Distributional Methods

The original DQN learns a single value, which is the estimated mean of the state
value. This approach does not take uncertainty into account. To remedy this, distri-
butional Q-learning [70] learns a categorical probability distribution of discounted
returns instead, increasing exploration. Bellemare et al. design a new distributional
algorithm which applies Bellman’s equation to the learning of distributions, a
method called distributional DQN. Also Moerland et al. [525, 526] look into the
distributional perspective.
Interestingly, a link between the distributional approach and biology has been
reported. Dabney et al. [174] showed correspondence between distributional rein-
forcement learning algorithms and the dopamine levels in mice, suggesting that
the brain represents possible future rewards as a probability distribution.

Noisy DQN

Another distributional method is noisy DQN [253]. Noisy DQN uses stochastic net-
work layers that add parametric noise to the weights. The noise induces randomness
in the agent’s policy, which increases exploration. The parameters that govern the
noise are learned by gradient descent together with the remaining network weights.
In their experiments the standard exploration heuristics for A3C (Sect. 4.2.4), DQN,
and dueling agents (entropy reward and 𝜖-greedy) were replaced with NoisyNet.
The increased exploration yields substantially higher scores for Atari (dark red
line).

3.3 Atari 2600 Environments

In their original 2013 workshop paper Mnih et al. [521] achieved human-level play
for some of the games. Training was performed on 50 million frames in total on
seven Atari games. In their 2013 work [521] the neural network performed better
than an expert human player on Breakout, Enduro, and Pong. On Seaqest, Q*Bert,
and Space Invaders performance was far below that of a human. In these games a
strategy must be found that extends over longer time periods. In their follow-up
journal article two years later were able to achieve human level for 49 of the 57
games that were in ALE [522], and performed better than human-level play in 29
of the 49 games.
Some of the games proved difficult, notably games that required longer-range
planning, where long stretches of the game do not give rewards, such as in Mon-
84 3 Deep Value-Based Reinforcement Learning

Fig. 3.6 DQN architecture [360]

tezuma’s Revenge, where the agent has to walk long distances, and pick up a key
to reach new rooms or new levels. In reinforcement learning terms, delayed credit
assignment over long periods is hard.

3.3.1 Network Architecture

End-to-end learning of challenging problems is computationally intensive. In ad-

dition to the two algorithmic innovations, the success of DQN is also due to the
creation of a specialized efficient training architecture [522].
Playing the Atari games is a computationally intensive task for a deep neural
network: the network trains a behavior policy directly from pixel frame input.
Therefore, the training architecture contains reduction steps. To start with, the net-
work consists of only three hidden layers (one fully connected, two convolutional),
which is simpler than what is used in most supervised learning tasks.
The pixel-images are high-resolution data. Since working with the full resolution
of 210 × 160 pixels of 128 color-values at 60 frames per second would be computa-
tionally too intensive, the images are reduced in resolution. The 210 × 160 with
a 128 color palette is reduced to gray scale and 110 × 84 pixels, which is further
cropped to 84 × 84. The first hidden layer convolves 16 8 × 8 filters with stride 4 and
ReLU neurons. The second hidden layer convolves 32 4 × 4 filters with stride 2 and
ReLU neurons. The third hidden layer is fully connected and consists of 256 ReLU
neurons. The output layer is also fully connected with one output per action (18
joystick actions). The outputs correspond to the Q-values of the individual action.
Figure 3.6 shows the architecture of DQN. The network receives the change in
game score as a number from the emulator, and derivative updates are mapped to
{−1, 0, +1} to indicate decrease, no change, or improvement of the score (the Huber
loss [58]).
To reduce computational demands further, frame skipping is employed. Only
one in every 3–4 frames was used, depending on the game. To take game history
into account, the net takes as input the last four resulting frames. This allows
3.3 Atari 2600 Environments 85

movement to be seen by the net. As optimizer RMSprop is used [641]. A variant

of 𝜖-greedy is used, that starts with an 𝜖 of 1.0 (fully exploring) going down to 0.1
(90% exploiting).

3.3.2 Benchmarking Atari

To close the Atari story, we discuss to final algorithms. Of the many value-based
model-free deep reinforcement learning algorithms that have been developed, the
final algorithm that we discuss is R2D2 [396], because of its performance. R2D2
is not part of the Rainbow experiments, but is a significant further improvement
of the algorithms. R2D2 stands for Recurrent Replay Distributed DQN. It is built
upon prioritized distributed replay and 5-step double Q-learning. Furthermore, it
uses a dueling network architecture and an LSTM layer after the convolutional
stack. Details about the architecture can be found in [829, 294]. The LSTM uses
the recurrent state to exploit long-term temporal dependencies, which improve
performance. The authors also report that the LSTM allows for better representation
learning. R2D2 achieved good results on all 57 Atari games [396].
A more recent benchmark achievement has been published as Agent57. Agent57
is the first program that achieves a score higher than the human baseline on all 57
Atari 2600 games from ALE. It uses a controller that adapts the long and short-term
behavior of the agent, training for a range of policies, from very exploitative to very
explorative, depending on the game [46].

Conclusion

Progress has come a long way since the replay buffer of DQN. Performance has
been improved greatly in value-based model-free deep reinforcement learning and
now super-human performance in all 57 Atari games of ALE has been achieved.
Many enhancements that improve coverage, correlation, and convergence have
been developed. The presence of a clear benchmark was instrumental for progress
so that researchers could clearly see which ideas worked and why. The earlier
mazes and navigation games, OpenAI’s Gym [108], and especially the ALE [71],
have enabled this progress.
In the next chapter we will look at the other main branch of model-free reinforce-
ment learning: policy-based algorithms. We will see how they work, and that they
are well suited for a different kind of application, with continuous action spaces.
86 3 Deep Value-Based Reinforcement Learning

Summary and Further Reading

This has been the first chapter in which we have seen deep reinforcement learning
algorithms learn complex, high-dimensional, tasks. We end with a summary and
pointers to the literature.

Summary

The methods that have been discussed in the previous chapter were exact, tabular
methods. Most interesting problems have large state spaces that do not fit into
memory. Feature learning identifies states by their common features. Function
values are not calculated exactly, but are approximated, with deep learning.
Much of the recent success of reinforcement learning is due to deep learning
methods. For reinforcement learning a problem arises when states are approximated.
Since in reinforcement learning the next state is determined by the previous state,
algorithms may get stuck in local minima or run in circles when values are shared
with different states.
Another problem is training convergence. Supervised learning has a static dataset
and training targets are also static. In reinforcement learning the loss function
targets depend on the parameters that are being optimized. This causes further
instability. DQN caused a breakthrough by showing that with a replay buffer and a
separate, more stable, target network, enough stability could be found for DQN to
learn how to play Atari arcade games just from looking at the video screen.
Many further improvements to increase stability through diversity have been
found. The Rainbow paper implements some of these improvements, and finds that
they are complementary, and together achieve very strong play.
The availability of compute power (GPU) software suites (TensorFlow/Keras,
PyTorch) has been a major force in deep reinforcement learning. Also the availability
of labeled data sets (MNIST, ImageNet) and environments (Gym) played a crucial
role in the progress that deep reinforcement learning has made in a relatively short
amount of time.

Deep learning revolutionized reinforcement learning. A comprehensive overview of

the field is provided by Dong et al. [201]. For more on deep learning, see Goodfellow
et al. [279], a book with much detail on deep learning; a major journal article is [458].
A brief survey is [27]. Also see Appendix B.
In 2013 the Arcade Learning Environment was presented [71, 493]. Experiment-
ing with reinforcement learning was made even more accessible with OpenAI’s
Gym [108], with clear and easy to use Python bindings.
3.3 Atari 2600 Environments 87

Deep learning versions of value-based tabular algorithms suffer from conver-

gence and stability problems [786], yet the idea that stable deep reinforcement
learning might be practical took hold with [323, 653]. Zhang et al. [874] study
the deadly triad with experience replay. Deep gradient TD methods were proven
to converge for evaluating a fixed policy [88]. Riedmiller et al. relaxed the fixed
control policy in neural fitted Q learning algorithm (NFQ) [630]. NFQ builds on
work on stable function approximation [281, 228] and experience replay [480], and
more recently on least-squares policy iteration [440]. In 2013 the first DQN paper
appeared, showing results on a small number of Atari games [521] with the replay
buffer to reduce temporal correlations. In 2015 the followup Nature paper reported
results in more games [522], with a separate target network to improve training
convergence. A well-known overview paper is the Rainbow paper [334, 386].
The use of benchmarks is of great importance for reproducible reinforcement
learning experiments [327, 370, 409, 364]. For TensorFlow and Keras, see [146, 269].

Exercises

We will end this chapter with some questions to review the concepts that we have
covered. Next are programming exercises to get some more exposure on how to
use the deep reinforcement learning algorithms in practice.

Questions

Below are some questions to check your understanding of this chapter. Each question
is a closed question where a simple, single sentence answer is expected.
1. What is Gym?
2. What are the Stable Baselines?
3. The loss function of DQN uses the Q-function as target. What is a consequence?
4. Why is the exploration/exploitation trade-off central in reinforcement learning?
5. Name one simple exploration/exploitation method.
6. What is bootstrapping?
7. Describe the architecture of the neural network in DQN.
8. Why is deep reinforcement learning more susceptible to unstable learning than
deep supervised learning?
9. What is the deadly triad?
10. How does function approximation reduce stability of Q-learning?
11. What is the role of the replay buffer?
12. How can correlation between states lead to local minima?
13. Why should the coverage of the state space be sufficient?
14. What happens when deep reinforcement learning algorithms do not converge?
15. How large is the state space of chess estimated to be? 1047 , 10170 or 101685 ?
88 3 Deep Value-Based Reinforcement Learning

16. How large is the state space of Go estimated to be? 1047 , 10170 or 101685 ?
17. How large is the state space of StarCraft estimated to be? 1047 , 10170 or 101685 ?
18. Why is the Rainbow paper so named, and what is the main message?
19. Mention three Rainbow improvements that are added to DQN.

Exercises

Let us now start with some exercises. If you have not done so already, install
Gym, PyTorch12 or TensorFlow and Keras (see Sect. 2.2.4.1 or go to the TensorFlow
page).13 Be sure to check the right versions of Python, Gym, TensorFlow, and the
Stable Baselines to make sure that they work well together. The exercises below
are designed to be done with Keras.
1. DQN Implement DQN from the Stable Baselines on Breakout from Gym. Turn
off Dueling and Priorities. Find out what the values are for 𝛼, the training rate,
for 𝜖, the exploration rate, what kind of neural network architecture is used,
what the replay buffer size is, and how frequently the target network is updated.
2. Hyperparameters Change all those hyperparameters, up, and down, and note the
effect on training speed, and the training outcome: how good is the result? How
sensitive is performance to hyperparameter optimization?
3. Cloud Use different computers, experiment with GPU versions to speed up
training, consider Colab, AWS, or another cloud provider with fast GPU (or TPU)
machines.
4. Gym Go to Gym and try different problems. For what kind of problems does
DQN work, what are characteristics of problems for which it works less good?
5. Stable Baselines Go to the Stable baselines and implement different agent algo-
rithms. Try Dueling algorithms, Prioritized experience replay, but also other
algorithm, such as Actor critic or policy-based. (These algorithms will be ex-
plained in the next chapter.) Note their performance.
6. Tensorboard With Tensorboard you can follow the training process as it pro-
gresses. Tensorboard works on log files. Try TensorBoard on a Keras exercise and
follow different training indicators. Also try TensorBoard on the Stable Baselines
and see which indicators you can follow.
7. Checkpointing Long training runs in Keras need checkpointing, to save valuable
computations in case of a hardware or software failure. Create a large training
job, and setup checkpointing. Test everything by interrupting the training, and
try to re-load the pre-trained checkpoint to restart the training where it left off.

12 https://pytorch.org
13 https://www.tensorflow.org
Chapter 4
Policy-Based Reinforcement Learning

Some of the most successful applications of deep reinforcement learning have a

continuous action space, such as applications in robotics, self-driving cars, and
real-time strategy games.
The previous chapters introduced value-based deep reinforcement learning.
Value-based methods find the policy in a two-step process. First they find the best
action-value of a state, for which then the accompanying actions are found (by
means of arg max). This works in environments with discrete actions, where the
highest value is clearly separate from the next-best value. Examples of continuous
action spaces are robot arms that can move over arbitrary angles, or poker bets
that can be any monetary value. In these action spaces value-based methods be-
come unstable and arg max is not appropriate. Another approach works better:
policy-based methods. Policy-based methods do not use a separate value function,
but find the policy directly. They start with a policy function, which they then
improve, episode by episode, with policy gradient methods. Policy-based methods
are applicable to more domains than value-based methods. They work well with
deep neural networks and gradient learning; they are some of the most popular
methods of deep reinforcement learning, and this chapter introduces you to them.
We start by looking at applications with continuous action spaces. Next, we look
at policy-based agent algorithms. We will introduce basic policy search algorithms,
and the policy gradient theorem. We will also discuss algorithms that combine
value-based and policy-based approaches: the so-called Actor critic algorithms. At
the end of the chapter we discuss larger environments for policy-based methods in
more depth, where we will discuss progress in visuo-motor robotics and locomotion
environments.
The chapter concludes with exercises, a summary, and pointers to further reading.

Core Concepts

• Policy gradient

89
90 4 Policy-Based Reinforcement Learning

• Actor critic

Core Problem

• Find a continuous action policy directly

Core Algorithms

• REINFORCE (Alg. 4.2)

• Asynchronous advantage actor critic (Alg. 4.4)
• Proximal policy optimization (Sect. 4.2.5)

Jumping Robots

One of the most intricate problems in robotics is learning to walk, or more generally,
how to perform locomotion. Much work has been put into making robots walk, run
and jump. A video of a simulated robot that taught itself to jump over an obstacle
course can be found at YouTube1 [324].
Learning to walk is a challenge that takes human infants months to master. (Cats
and dogs are quicker.) Teaching robots to walk is a challenging problem that is
studied extensively in artificial intelligence and engineering. Movies abound on the
internet of robots that try to open doors, and fall over, or just try to stand upright,
and still fall over.2
Locomotion of legged robots is a difficult sequential decision problem. For each
leg, many different joints are involved. They must be actuated in the right order,
turned with the right force, over the right duration, to the right angle. Most of
these angles, forces, and durations are continuous. The algorithm has to decide
how many degrees, Newtons, and seconds, constitute the optimal policy. All these
actions are continuous quantities. Robot locomotion is a difficult problem, that is
used frequently to study policy-based deep reinforcement learning.
1 https://www.youtube.com/watch?v=hx_bgoTF7bs
2 See, for example, https://www.youtube.com/watch?v=g0TaYhjpOfo.
4.1 Continuous Problems 91

4.1 Continuous Problems

In this chapter, our action are continuous and stochastic. We will discuss both
aspects, some of the challenges they pose. We will start with continuous action
policies.

4.1.1 Continuous Policies

The problems that we discussed in the previous chapters were discrete Grid worlds,
mazes, and high-dimensional Atari games, whose action spaces were small and
discrete—we could walk north, east, west, south, or we could choose from 9 joystick
movements. In board games such as chess the action space is larger, but still discrete.
When you move your pawn to e4, you do not move it to e4½.
In this chapter the problems are different. Steering a self driving car requires
turning the steering wheel a certain angle, duration, and angular velocity, to prevent
jerky movements. Throttle movements should also be smooth and continuous.
Actuation of robot joints is continuous, as we mentioned in the introduction of
this chapter. An arm joint can move 1 degree, 2 degrees, or 90 or 180 degrees or
anything in between.
An action in a continuous space is not one of a set of discrete choices, such as
{𝑁, 𝐸, 𝑊, 𝑆}, but rather a value over a continuous range, such as [0, 2𝜋] or R+ ; the
number of possible values is infinite. How can we find the optimum value in an
infinite space in a finite amount of time? Trying out all possible combinations of
setting joint 1 to 𝑥 degrees and applying force 𝑦 in motor 2 will take infinitely long.
A solution could be to discretize the actions, although that introduces potential
quantization errors.
When actions are not discrete, the arg max operation can not be used to identify
“the” best action. Policy-based methods find suitable continuous or stochastic policies
directly, without the intermediate step of a value function and the need for the
arg max operation.

4.1.2 Stochastic Policies

When a robot moves its hand to open a door, it must judge the distance correctly. A
small error, and it may fail (as many movie clips show).3 Stochastic environments
cause stability problems for value-based methods [479]. Small perturbations in
Q-values may lead to large changes in the policy of value-based methods. Con-
vergence can typically only be achieved at slow learning rates, to smooth out the
3 Even worse, when a robot thinks it stands still, it may actually be in the process of falling over
(and, of course, robots can not think, they only wished they could).
92 4 Policy-Based Reinforcement Learning

randomness. A stochastic policy (a target dsitribution) does not suffer from this
problem. Stochastic policies have another advantage. By their nature they perform
exploration, without the need to separately code 𝜖-greediness or other exploration
methods, since a stochastic policy returns a distribution over actions.
Policy-based methods find suitable stochastic policies directly. A potential dis-
advantage of purely episodic policy-based methods is that they are high-variance;
they may find local optima instead of global optima, and converge slower than
value-based methods. Newer (actor critic) methods, such as A3C, TRPO, and PPO,
were designed to overcome these problems. We will discuss these algorithms later
in this chapter.
Before we will explain policy-based methods, we will have a closer look at some
of the applications for which they are needed.

4.1.3 Environments: Gym and MuJoCo

Robotic experiments play an important role in policy-based methods. However, be-

cause of the cost associated with real-world robotics experiments, in reinforcement
learning often simulated robotics systems are used. This is especially important in
model-free methods, that tend to have a high sample complexity (real robots wear
down when trials run in the millions). These software simulators model behavior
of the robot and the effects on the environment, using physics models. This pre-
vents the expense of real experiments with real robots, although some precision
is lost to modeling error. Two well-known physics models are MuJoCo [779] and
PyBullet [167]. They can be used easily via the Gym environment.

4.1.3.1 Robotics

Most robotic applications are more complicated than the classics such as mazes,
Mountain car and Cart pole. Robotic control decisions involve more joints, directions
of travel, and degrees of freedom, than a single cart that moves in one dimension.
Typical problems involve learning of visuo-motor skills (eye-hand coordination,
grasping), or learning of different locomotion gaits of multi-legged “animals.” Some
examples of grasping and walking are illustrated in Fig. 4.1.
The environments for these actions are unpredictable to a certain degree: they
require reactions to disturbances such as bumps in the road, or the moving of objects
in a scene.

4.1.3.2 Physics Models

Simulating robot motion involves modeling forces, acceleration, velocity, and move-
ment. It also includes modeling mass and elasticity for bouncing balls, tactile/grasp-
4.1 Continuous Problems 93

Fig. 4.1 Robot Grasping and Gait [500]

Fig. 4.2 Gym MuJoCo Ant and Half-Cheetah [108]

ing mechanics, and the effect of different materials. A physics mechanics model
needs to simulate the result of actions in the real world. Among the goals of such a
simulation is to model grasping, locomotion, gaits, and walking and running (see
also Sect. 4.3.1).
The simulations should be accurate. Furthermore, since model-free learning
algorithms often involve millions of actions, it is important that the physics sim-
ulations are fast. Many different physics environments for model-based robotics
have been created, among them Bullet, Havok, ODE and PhysX, see [227] for a
comparison. Of the models, MuJoCo [779], and PyBullet [167] are the most popular
in reinforcement learning, especially MuJoCo is used in many experiments.
Although MuJoCo calculations are deterministic, the initial state of environments
is typically randomized, resulting in an overall non-deterministic environment.
Despite many code optimizations in MuJoCo, simulating physics is still an expensive
proposition. Most MuJoCo experiments in the literature therefore are based on
stick-like entities, that simulate limited motions, in order to limit the computational
demands.
Figures 4.2 and 4.3 illustrate a few examples of some of the common Gym/MuJoCo
problems that are often used in reinforcement learning: Ant, Half-cheetah, and
Humanoid.
94 4 Policy-Based Reinforcement Learning

Fig. 4.3 Gym MuJoCo Humanoid

Name Approach Ref

REINFORCE Policy-gradient optimization [843]
A3C Distributed Actor Critic [520]
DDPG Derivative of continuous action function [479]
TRPO Dynamically sized step size [680]
PPO Improved TRPO, first order [682]
SAC Variance-based Actor Critic for robustness [305]
Table 4.1 Policy-Based Algorithms: REINFORCE, Asynchronous Advantage Actor Critic, Deep
Deterministic Policy Gradient, Trust Region Policy Optimization, Proximal Policy Optimization,
Soft Actor Critic

4.1.3.3 Games

In real time video games and certain card games the decisions are also continuous.
For example, in some variants of poker, the size of monetary bets can be any amount,
which makes the action space quite large (although strictly speaking still discrete).
In games such as StarCraft and Capture the Flag, aspects of the physical world are
modeled, and movement of agents can vary in duration and speed. The environment
for these games is also stochastic: some information is hidden for the agent. This
increases the size of the state space greatly. We will discuss these games in Chap. 7
on multi-agent methods.

4.2 Policy-Based Agents

Now that we have discussed the problems and environments that are used with
policy-based methods, it is time to see how policy-based algorithms work. Policy-
based methods are a popular approach in model-free deep reinforcement learning.
Many algorithms have been developed that perform well. Table 4.1 lists some of
the better known algorithms that will be covered in this chapter.
4.2 Policy-Based Agents 95

We will first provide an intuitive explanation of the idea behind the policy-based
approach. Then we will provide references to the theory behind it, and discuss
advantages and disadvantages. Most of these disadvantages are alleviated by the
actor critic method, that is discussed next.
Let us start with the basic idea behind policy-based methods.

4.2.1 Policy-Based Algorithm: REINFORCE

Policy-based approaches learn a parameterized policy, that selects actions without

consulting a value function.4 In policy-based methods the policy function is repre-
sented directly, allowing policies to select a continuous action, something that is
difficult to do in value-based methods.

The Supermarket: To build some intuition on the nature of policy-based

methods, let us think back again at the supermarket navigation task, that
we used to illustrate the Q-function in Chap. 2. In this navigation problem
we can try to assess our current distance to the supermarket with the Q-
value-function, as we have done before. The Q-value assesses the distance
of each direction to take; it tells us how far each action is from the goal. We
can then used this distance function to choose the right direction.
In contrast, the policy-based alternative would be to ask a local the way,
who tells us, for example, to go straight and then left and then right at the
Opera House and straight until we reach the supermarket on our left. The
local just gave us a full path to follow, without having to infer which action
was the closest and then use that information to determine the way to go.
We can subsequently try to improve this full trajectory.

Let us see how we can optimize such a direct policy directly, without the intermedi-
ate step of the Q-function. We will develop a first, generic, policy-based algorithm
to see how the pieces fit together. We provide an intuitive explanation, based on
three papers [220, 192, 254].
The basic framework for policy-based algorithms is straightforward: (1) initialize
the parameters of a policy, (2) sample a trajectory, (3) if it is a good trajectory,
increase the parameters, otherwise decrease them, and (4) keep going until conver-
gence. Algorithm 4.1 provides a framework in pseudocode. Please note the similarity
with the codes in the previous chapter (Listing 3.1–3.3), and especially the deep
learning algorithms, where we also optimized function parameters in a loop.
The policy is represented by a set of parameters 𝜃. Together, the parameters 𝜃
map the states 𝑠 to action probability 𝑎. When we are given a set of parameters,
how should we adjust them to improve the policy? The basic idea is to randomly
sample a new policy, and if it is better, adjust the parameters a bit in the direction
4 Policy-based methods may use a value function to learn the policy parameters 𝜃, but do not use
it for action selection.
96 4 Policy-Based Reinforcement Learning

Algorithm 4.1 Gradient ascent optimization

Input: a differentiable objective 𝐽 ( 𝜃), learning rate 𝛼 ∈ R+ , threshold 𝜖 ∈ R+
Initialization: randomly initialize 𝜃 in R𝑑
repeat
Sample trajectory and compute gradient ∇ 𝜃
𝜃 ← 𝜃 + 𝛼 · ∇ 𝜃 𝐽 ( 𝜃)
until ∇ 𝜃 𝐽 ( 𝜃) converges below 𝜖
return parameters 𝜃

of this new policy (and away if it is worse). Let us see in more detail how this idea
works.
To know which policy is best, we need some kind of measure of its quality. We
denote the quality of the policy that is defined by the parameters as 𝐽 (𝜃). It is
natural to use the value function of the start state as our measure of quality

𝐽 (𝜃) = 𝑉 𝜋 (𝑠0 ).

When the parameters are differentiable, then all we need to do is to find a way to
improve the gradient
∇ 𝜃 𝐽 (𝜃) = ∇ 𝜃 𝑉 𝜋 (𝑠0 )
of this expression to maximize our objective function 𝐽 (·).
Policy-based methods apply gradient-based optimization, using the derivative
of the objective to find the optimum. Since we are maximizing, we apply gradient
ascent. In each time step 𝑡 of the algorithm we perform the following update:

𝜃 𝑡+1 = 𝜃 𝑡 + 𝛼 · ∇ 𝜃 𝐽 (𝜃)

for learning rate 𝛼 ∈ R+ and performance objective 𝐽, see the gradient ascent
algorithm in Alg. 4.1.
Remember that 𝜋 𝜃 (𝑎|𝑠) is the probability of taking action 𝑎 in state 𝑠. This
function 𝜋 is represented by a neural network, mapping states 𝑠 at the input side
of the network to action probabilities on the output side of the network. The
parameters 𝜃 determine the value of our function 𝜋. Our goal is to update the
parameters so that 𝜋 𝜃 becomes the optimal policy. The better t