Aske Plaat

arXiv:2201.02135v5 [cs.AI] 23 Apr 2023

Deep Reinforcement Learning

April 25, 2023

Springer Nature

This is a preprint of the following work:

Aske Plaat,
Deep Reinforcement Learning,
2022,
Springer Nature,
reproduced with permission of Springer Nature Singapore Pte Ltd.
The final authenticated version is available online at: https://doi.org/10.1007/978-981-19-0638-1

Preface

Deep reinforcement learning has attracted much attention recently. Impressive
results have been achieved in activities as diverse as autonomous driving, game playing,
molecular recombination, and robotics. In all these fields, computer programs have
learned to solve difficult problems. They have learned to fly model helicopters and
perform aerobatic manoeuvres such as loops and rolls. In some applications, such as
Atari, Go, poker, and StarCraft, they have even become better than the best humans.
The way in which deep reinforcement learning explores complex environments
reminds us of how children learn: by playfully trying out things, getting feedback,
and trying again. The computer seems to truly possess aspects of human learning;
deep reinforcement learning touches the dream of artificial intelligence.
The successes in research have not gone unnoticed by educators, and universities
have started to offer courses on the subject. The aim of this book is to provide a
comprehensive overview of the field of deep reinforcement learning. The book is
written for graduate students of artificial intelligence, and for researchers and
practitioners who wish to better understand deep reinforcement learning methods and
their challenges. We assume an undergraduate-level understanding of computer
science and artificial intelligence; the programming language of this book is Python.
We describe the foundations, the algorithms, and the applications of deep
reinforcement learning. We cover the established model-free and model-based methods
that form the basis of the field. The field develops quickly, and we also cover more
advanced topics: deep multi-agent reinforcement learning, deep hierarchical
reinforcement learning, and deep meta-learning.
We hope that learning about deep reinforcement learning will give you as much
joy as the many researchers experienced when they developed their algorithms,
finally got them to work, and saw them learn!


Acknowledgments

This book benefited from the help of many friends. First of all, I thank everyone at
the Leiden Institute of Advanced Computer Science, for creating such a fun and
vibrant environment to work in.
Many people contributed to this book. Some material is based on the book
that we used in our previous reinforcement learning course and on lecture notes
on policy-based methods written by Thomas Moerland. Thomas also provided
invaluable critique on an earlier draft of the book. Furthermore, as this book was
being prepared, we worked on survey articles on deep model-based reinforcement
learning, deep meta-learning, and deep multi-agent reinforcement learning. I thank
Mike Preuss, Walter Kosters, Mike Huisman, Jan van Rijn, Annie Wong, Anna
Kononova, and Thomas Bäck, the co-authors on these articles.
Thanks to reader feedback, the 2023 version of this book has been updated to
include Monte Carlo sampling and n-step methods, and to provide a better
explanation of on-policy and off-policy learning.
I thank all members of the Leiden reinforcement learning community for their
input and enthusiasm. I thank especially Thomas Moerland, Mike Preuss, Matthias
Müller-Brockhausen, Mike Huisman, Hui Wang, and Zhao Yang, for their help
with the course for which this book is written. I thank Wojtek Kowalczyk for
insightful discussions on deep supervised learning, and Walter Kosters for his views
on combinatorial search, as well as for his never-ending sense of humor.
A very special thank you goes to Thomas Bäck, for our many discussions on
science, the universe, and everything (including, especially, evolution). Without
you, this effort would not have been possible.
This book is a result of the graduate course on reinforcement learning that we
teach in Leiden. I thank all students of this course, past, present, and future, for
their wonderful enthusiasm, sharp questions, and many suggestions. This book was
written for you and by you!
Finally, I thank Saskia, Isabel, Rosalin, Lily, and Dahlia, for being who they are,
for giving feedback and letting me learn, and for their boundless love.

Leiden, December 2021
Aske Plaat

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 What is Deep Reinforcement Learning? . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Three Machine Learning Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Overview of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2 Tabular Value-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . 23

2.1 Sequential Decision Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Tabular Value-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Classic Gym Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3 Deep Value-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . 67

3.1 Large, High-Dimensional, Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.2 Deep Value-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.3 Atari 2600 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4 Policy-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

4.1 Continuous Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.2 Policy-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.3 Locomotion and Visuo-Motor Environments . . . . . . . . . . . . . . . . . . . . 115
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

5 Model-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

5.1 Dynamics Models of High-Dimensional Problems . . . . . . . . . . . . . . . 126
5.2 Learning and Planning Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.3 High-Dimensional Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

6 Two-Agent Self-Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

6.1 Two-Agent Zero-Sum Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.2 Tabula Rasa Self-Play Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.3 Self-Play Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

7 Multi-Agent Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

7.1 Multi-Agent Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.2 Multi-Agent Reinforcement Learning Agents . . . . . . . . . . . . . . . . . . . . 206
7.3 Multi-Agent Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

8 Hierarchical Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

8.1 Granularity of the Structure of Problems . . . . . . . . . . . . . . . . . . . . . . . 231
8.2 Divide and Conquer for Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
8.3 Hierarchical Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245

9 Meta-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247

9.1 Learning to Learn Related Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
9.2 Transfer Learning and Meta-Learning Agents . . . . . . . . . . . . . . . . . . . 251
9.3 Meta-Learning Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272

10 Further Developments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

10.1 Development of Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . 275
10.2 Main Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
10.3 The Future of Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

A Mathematical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287

A.1 Sets and Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
A.2 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
A.3 Derivative of an Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
A.4 Bellman Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299

B Deep Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301

B.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
B.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
B.3 Datasets and Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332

C Deep Reinforcement Learning Suites . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335

C.1 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
C.2 Agent Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
C.3 Deep Learning Suites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339

Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383

Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 What is Deep Reinforcement Learning? . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.5 Four Related Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.5.1 Psychology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.5.2 Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.5.3 Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.5.4 Biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Three Machine Learning Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 Overview of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.1 Prerequisite Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.2 Structure of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2 Tabular Value-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . 23

2.1 Sequential Decision Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Tabular Value-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.1 Agent and Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.2 Markov Decision Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.2.1 State 𝑆 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.2.2 Action 𝐴 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.2.3 Transition 𝑇_𝑎 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.2.4 Reward 𝑅_𝑎 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.2.5 Discount Factor 𝛾 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.2.6 Policy 𝜋 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.3 MDP Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.3.1 Trace 𝜏 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.3.2 State Value 𝑉 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2.3.3 State-Action Value 𝑄 . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.2.3.4 Reinforcement Learning Objective . . . . . . . . . . . . . . 38
2.2.3.5 Bellman Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.2.4 MDP Solution Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.2.4.1 Hands On: Value Iteration in Gym . . . . . . . . . . . . . . . 41
2.2.4.2 Model-Free Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.2.4.3 Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.2.4.4 Off-Policy Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.2.4.5 Hands On: Q-learning on Taxi . . . . . . . . . . . . . . . . . . 55
2.3 Classic Gym Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.3.1 Mountain Car and Cartpole . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.3.2 Path Planning and Board Games . . . . . . . . . . . . . . . . . . . . . . . . 60
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3 Deep Value-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . 67

3.1 Large, High-Dimensional, Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.1.1 Atari Arcade Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.1.2 Real-Time Strategy and Video Games . . . . . . . . . . . . . . . . . . . . 72
3.2 Deep Value-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2.1 Generalization of Large Problems with Deep Learning . . . . 73
3.2.1.1 Minimizing Supervised Target Loss . . . . . . . . . . . . . 74
3.2.1.2 Bootstrapping Q-Values . . . . . . . . . . . . . . . . . . . . . . . 75
3.2.1.3 Deep Reinforcement Learning Target-Error . . . . . 76
3.2.2 Three Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.2.2.1 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.2.2.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.2.2.3 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.2.3 Stable Deep Value-Based Learning . . . . . . . . . . . . . . . . . . . . . . 78
3.2.3.1 Decorrelating States . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.2.3.2 Infrequent Updates of Target Weights . . . . . . . . . . . 80
3.2.3.3 Hands On: DQN and Breakout Gym Example . . . . . 80
3.2.4 Improving Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.2.4.1 Overestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.2.4.2 Distributional Methods . . . . . . . . . . . . . . . . . . . . . . . . 87
3.3 Atari 2600 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.3.1 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.3.2 Benchmarking Atari . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4 Policy-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

4.1 Continuous Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.1.1 Continuous Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.1.2 Stochastic Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.1.3 Environments: Gym and MuJoCo . . . . . . . . . . . . . . . . . . . . . . . 96
4.1.3.1 Robotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.1.3.2 Physics Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.1.3.3 Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.2 Policy-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.2.1 Policy-Based Algorithm: REINFORCE . . . . . . . . . . . . . . . . . . . 99
4.2.2 Bias-Variance Trade-Off in Policy-Based Methods . . . . . . . . 102
4.2.3 Actor Critic Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.2.4 Baseline Subtraction with Advantage Function . . . . . . . . . . . 105
4.2.5 Trust Region Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.2.6 Entropy and Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.2.7 Deterministic Policy Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.2.8 Hands On: PPO and DDPG MuJoCo Examples . . . . . . . . . . . . . 114
4.3 Locomotion and Visuo-Motor Environments . . . . . . . . . . . . . . . . . . . . 115
4.3.1 Locomotion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.3.2 Visuo-Motor Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.3.3 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

5 Model-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

5.1 Dynamics Models of High-Dimensional Problems . . . . . . . . . . . . . . . 126
5.2 Learning and Planning Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.2.1 Learning the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.2.1.1 Modeling Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.2.1.2 Latent Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.2.2 Planning with the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.2.2.1 Trajectory Rollouts and Model-Predictive Control 136
5.2.2.2 End-to-end Learning and Planning-by-Network . 138
5.3 High-Dimensional Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.3.1 Overview of Model-Based Experiments . . . . . . . . . . . . . . . . . . 141
5.3.2 Small Navigation Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.3.3 Robotic Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.3.4 Atari Games Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.3.5 Hands On: PlaNet Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

6 Two-Agent Self-Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

6.1 Two-Agent Zero-Sum Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.1.1 The Difficulty of Playing Go . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.1.2 AlphaGo Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.2 Tabula Rasa Self-Play Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.2.1 Move-Level Self-Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.2.1.1 Minimax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
6.2.1.2 Monte Carlo Tree Search . . . . . . . . . . . . . . . . . . . . . . 168
6.2.2 Example-Level Self-Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
6.2.2.1 Policy and Value Network . . . . . . . . . . . . . . . . . . . . . 176
6.2.2.2 Stability and Exploration . . . . . . . . . . . . . . . . . . . . . . 176
6.2.3 Tournament-Level Self-Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.2.3.1 Self-Play Curriculum Learning . . . . . . . . . . . . . . . . . 179
6.2.3.2 Supervised and Reinforcement Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.3 Self-Play Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.3.1 How to Design a World Class Go Program? . . . . . . . . . . . . . . 182
6.3.2 AlphaGo Zero Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
6.3.3 AlphaZero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.3.4 Open Self-Play Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
6.3.5 Hands On: Hex in Polygames Example . . . . . . . . . . . . . . . . . . . . 188
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

7 Multi-Agent Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

7.1 Multi-Agent Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.1.1 Competitive Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
7.1.2 Cooperative Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
7.1.3 Mixed Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
7.1.4 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
7.1.4.1 Partial Observability . . . . . . . . . . . . . . . . . . . . . . . . . . 205
7.1.4.2 Nonstationary Environments . . . . . . . . . . . . . . . . . . 205
7.1.4.3 Large State Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
7.2 Multi-Agent Reinforcement Learning Agents . . . . . . . . . . . . . . . . . . . . 206
7.2.1 Competitive Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
7.2.1.1 Counterfactual Regret Minimization . . . . . . . . . . . . 207
7.2.1.2 Deep Counterfactual Regret Minimization . . . . . . . 208
7.2.2 Cooperative Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
7.2.2.1 Centralized Training/Decentralized Execution . . . 210
7.2.2.2 Opponent Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
7.2.2.3 Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
7.2.2.4 Psychology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
7.2.3 Mixed Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
7.2.3.1 Evolutionary Algorithms . . . . . . . . . . . . . . . . . . . . . . 213
7.2.3.2 Swarm Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
7.2.3.3 Population-Based Training . . . . . . . . . . . . . . . . . . . . . 216
7.2.3.4 Self-Play Leagues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
7.3 Multi-Agent Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
7.3.1 Competitive Behavior: Poker . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
7.3.2 Cooperative Behavior: Hide and Seek . . . . . . . . . . . . . . . . . . . . 220
7.3.3 Mixed Behavior: Capture the Flag and StarCraft . . . . . . . . . . 222
7.3.4 Hands On: Hide and Seek in the Gym Example . . . . . . . . . . . . 224
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

8 Hierarchical Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

8.1 Granularity of the Structure of Problems . . . . . . . . . . . . . . . . . . . . . . . 231
8.1.1 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
8.1.2 Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
8.2 Divide and Conquer for Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
8.2.1 The Options Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
8.2.2 Finding Subgoals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
8.2.3 Overview of Hierarchical Algorithms . . . . . . . . . . . . . . . . . . . . 235
8.2.3.1 Tabular Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
8.2.3.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
8.3 Hierarchical Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
8.3.1 Four Rooms and Robot Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
8.3.2 Montezuma’s Revenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
8.3.3 Multi-Agent Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
8.3.4 Hands On: Hierarchical Actor Critic Example . . . . . . . . . . . . . . 242
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245

9 Meta-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
9.1 Learning to Learn Related Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
9.2 Transfer Learning and Meta-Learning Agents . . . . . . . . . . . . . . . . . . . 251
9.2.1 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
9.2.1.1 Task Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
9.2.1.2 Pretraining and Finetuning . . . . . . . . . . . . . . . . . . . . 253
9.2.1.3 Hands-on: Pretraining Example . . . . . . . . . . . . . . . . . 253
9.2.1.4 Multi-task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
9.2.1.5 Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
9.2.2 Meta-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
9.2.2.1 Evaluating Few-Shot Learning Problems . . . . . . . . 257
9.2.2.2 Deep Meta-Learning Algorithms . . . . . . . . . . . . . . . 258
9.2.2.3 Recurrent Meta-Learning . . . . . . . . . . . . . . . . . . . . . . 260
9.2.2.4 Model-Agnostic Meta-Learning . . . . . . . . . . . . . . . . 261
9.2.2.5 Hyperparameter Optimization . . . . . . . . . . . . . . . . . 263
9.2.2.6 Meta-Learning and Curriculum Learning . . . . . . . . 264
9.2.2.7 From Few-Shot to Zero-Shot Learning . . . . . . . . . . 264
9.3 Meta-Learning Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
9.3.1 Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
9.3.2 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
9.3.3 Meta-Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
9.3.4 Meta-World . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
9.3.5 Alchemy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
9.3.6 Hands-on: Meta-World Example . . . . . . . . . . . . . . . . . . . . . . . . . 270
Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272

10 Further Developments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

10.1 Development of Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . 275
10.1.1 Tabular Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
10.1.2 Model-free Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
10.1.3 Multi-Agent Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
10.1.4 Evolution of Reinforcement Learning . . . . . . . . . . . . . . . . . . . . 277
10.2 Main Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
10.2.1 Latent Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
10.2.2 Self-Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
10.2.3 Hierarchical Reinforcement Learning . . . . . . . . . . . . . . . . . . . . 279
10.2.4 Transfer Learning and Meta-Learning . . . . . . . . . . . . . . . . . . . 280
10.2.5 Population-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
10.2.6 Exploration and Intrinsic Motivation . . . . . . . . . . . . . . . . . . . . 281
10.2.7 Explainable AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
10.2.8 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
10.3 The Future of Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

A Mathematical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287

A.1 Sets and Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
A.1.1 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
A.1.2 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
A.2 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
A.2.1 Discrete Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . 290
A.2.2 Continuous Probability Distributions . . . . . . . . . . . . . . . . . . . . 291
A.2.3 Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
A.2.4 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
A.2.4.1 Expectation of a Random Variable . . . . . . . . . . . . . . 294
A.2.4.2 Expectation of a Function of a Random Variable . 295
A.2.5 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
A.2.5.1 Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
A.2.5.2 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
A.2.5.3 Cross-entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
A.2.5.4 Kullback-Leibler Divergence . . . . . . . . . . . . . . . . . . . 297
A.3 Derivative of an Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
A.4 Bellman Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299

B Deep Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301

B.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
B.1.1 Training Set and Test Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
B.1.2 Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
B.1.3 Overfitting and the Bias-Variance Trade-Off . . . . . . . . . . . . . . 304
B.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
B.2.1 Weights, Neurons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
B.2.2 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
B.2.3 End-to-end Feature Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
B.2.4 Convolutional Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
B.2.5 Recurrent Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
B.2.6 More Network Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
B.2.7 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
B.3 Datasets and Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
B.3.1 MNIST and ImageNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
B.3.2 GPU Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
B.3.3 Hands On: Classification Example . . . . . . . . . . . . . . . . . . . . . . . . 328
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332

C Deep Reinforcement Learning Suites . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339

Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383

Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
Chapter 1
Introduction

Deep reinforcement learning studies how we learn to solve complex problems,
problems that require us to find a solution to a sequence of decisions in
high-dimensional states. To make bread, we must use the right flour, add some salt, yeast
and sugar, prepare the dough (not too dry and not too wet), pre-heat the oven to
the right temperature, and bake the bread (but not too long); to win a ballroom
dancing contest we must find the right partner, learn to dance, practice, and beat
the competition; to win in chess we must study, practice, and make all the right
moves.

1.1 What is Deep Reinforcement Learning?

Deep reinforcement learning is the combination of deep learning and reinforcement
learning.
The goal of deep reinforcement learning is to learn optimal actions that maximize
our reward for all states that our environment can be in (the bakery, the dance
hall, the chess board). We do this by interacting with complex, high-dimensional
environments, trying out actions, and learning from the feedback.
The field of deep learning is about approximating functions in high-dimensional
problems; problems that are so complex that tabular methods cannot find exact
solutions anymore. Deep learning uses deep neural networks to find approximations
for large, complex, high-dimensional environments, such as in image and speech
recognition. The field has made impressive progress; computers can now recog-
nize pedestrians in a sequence of images (to avoid running over them), and can
understand sentences such as: “What is the weather going to be like tomorrow?”
The field of reinforcement learning is about learning from feedback; it learns by
trial and error. Reinforcement learning does not need a pre-existing dataset to train
on; it chooses its own actions, and learns from the feedback that an environment
provides. It stands to reason that in this process of trial and error, our agent will
make mistakes (the fire extinguisher is essential to survive the process of learning
to bake bread). The field of reinforcement learning is all about learning from success
as well as from mistakes.

                                Low-Dimensional States          High-Dimensional States
Static Dataset                  classic supervised learning     deep supervised learning
Agent/Environment Interaction   tabular reinforcement learning  deep reinforcement learning

Table 1.1 The Constituents of Deep Reinforcement Learning

In recent years the two fields of deep and reinforcement learning have come
together, and have yielded new algorithms that learn approximate solutions to
high-dimensional problems from feedback on their actions. Deep learning has brought new
methods and new successes, with advances in policy-based methods, in model-
based approaches, in transfer learning, in hierarchical reinforcement learning, and
in multi-agent learning.
The fields also exist separately, as deep supervised learning and as tabular re-
inforcement learning (see Table 1.1). The aim of deep supervised learning is to
generalize and approximate complex, high-dimensional, functions from pre-existing
datasets, without interaction; Appendix B discusses deep supervised learning. The
aim of tabular reinforcement learning is to learn by interaction in simpler, low-
dimensional, environments such as Grid worlds; Chap. 2 discusses tabular reinforce-
ment learning.
Let us have a closer look at the two fields.

1.1.1 Deep Learning

Classic machine learning algorithms learn a predictive model on data, using methods
such as linear regression, decision trees, random forests, support vector machines,
and artificial neural networks. The models aim to generalize, to make predictions.
Mathematically speaking, machine learning aims to approximate a function from
data.
In the past, when computers were slow, the neural networks that were used
consisted of a few layers of fully connected neurons, and did not perform exception-
ally well on difficult problems. This changed with the advent of deep learning and
faster computers. Deep neural networks now consist of many layers of neurons and
use different types of connections.1 Deep networks and deep learning have taken
the accuracy of certain important machine learning tasks to a new level, and have
allowed machine learning to be applied to complex, high-dimensional, problems,
such as recognizing cats and dogs in high-resolution (mega-pixel) images.
Deep learning allows high-dimensional problems to be solved in real-time; it
has allowed machine learning to be applied to day-to-day tasks such as the face-
recognition and speech-recognition that we use in our smartphones.
1 Where many means more than one hidden layer in between the input and output layer.

1.1.2 Reinforcement Learning

Let us look more deeply at reinforcement learning, to see what it means to learn
from our own actions.
Reinforcement learning is a field in which an agent learns by interacting with
an environment. In supervised learning we need pre-existing datasets of labeled
examples to approximate a function; reinforcement learning only needs an environ-
ment that provides feedback signals for actions that the agent is trying out. This
requirement is easier to fulfill, allowing reinforcement learning to be applicable to
more situations than supervised learning.
Reinforcement learning agents generate, by their actions, their own on-the-fly
data, through the environment’s rewards. Agents can choose which actions to
learn from; reinforcement learning is a form of active learning. In this sense, our
agents are like children, that, through playing and exploring, teach themselves a
certain task. This level of autonomy is one of the aspects that attracts researchers
to the field. The reinforcement learning agent chooses which action to perform—
which hypothesis to test—and adjusts its knowledge of what works, building up a
policy of actions that are to be performed in the different states of the world that it
has encountered. (This freedom is also what makes reinforcement learning hard,
because when you are allowed to choose your own examples, it is all too easy to
stay in your comfort zone, stuck in a positive reinforcement bubble, believing you
are doing great, but learning very little of the world around you.)
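The loop of choosing an action, receiving feedback, and adjusting knowledge can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the environment is a hypothetical two-action "bandit" with hidden reward probabilities, and the agent uses an epsilon-greedy rule (a standard exploration technique that is not introduced in the text above) to decide when to try something new. The names `run_bandit`, `reward_probs`, and `epsilon` are chosen for this example.

```python
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Trial-and-error learning on a hypothetical multi-armed bandit.

    reward_probs[a] is the hidden chance that action a pays reward 1.
    Returns the agent's value estimates and how often it tried each action.
    """
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    estimates = [0.0] * n_actions   # the agent's knowledge of what works
    counts = [0] * n_actions

    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])

        # The environment answers the chosen action with a feedback signal.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0

        # Incremental average: move the estimate toward the observed reward.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates, counts

estimates, counts = run_bandit([0.2, 0.8])
print(estimates, counts)
```

Setting `epsilon=0.0` in this sketch reproduces the comfort-zone problem: the agent keeps repeating the first action it tried and never discovers that the other action pays better.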

1.1.3 Deep Reinforcement Learning

Deep reinforcement learning combines methods for learning high-dimensional
problems with reinforcement learning, allowing high-dimensional, interactive learning.
A major reason for the interest in deep reinforcement learning is that it works well
on current computers, and does so in seemingly different applications. For exam-
ple, in Chap. 3 we will see how deep reinforcement learning can learn eye-hand
coordination tasks to play 1980s video games, in Chap. 4 we see how a simulated
robot cheetah learns to jump, and in Chap. 6 we see how it can teach itself to play
complex games of strategy to the extent that world champions are beaten.
Let us have a closer look at the kinds of applications on which deep reinforcement
learning does so well.

1.1.4 Applications

In its most basic form, reinforcement learning is a way to teach an agent to operate in
the world. As a child learns to walk from actions and feedback, so do reinforcement
learning agents learn from actions and feedback. Deep reinforcement learning can
learn to solve large and complex decision problems: problems whose solution is
not yet known, but for which an approximating trial-and-error mechanism exists
that can learn a solution out of repeated interactions with the problem. This may
sound a bit cryptic and convoluted, but approximation and trial and error are
something that we do in real life all the time. Generalization and approximation
allow us to infer patterns or rules from examples. Trial and error is a method by
which humans learn how to deal with things that are unfamiliar to them (“What
happens if I press this button? Oh. Oops.” Or: “What happens if I do not put my leg
before my other leg while moving forward? Oh. Ouch.”).

Sequential Decision Problems

Learning to operate in the world is a high level goal; we can be more specific.
Reinforcement learning is about the agent’s behavior. Reinforcement learning can
find solutions for sequential decision problems, or optimal control problems, as they
are known in engineering. There are many situations in the real world where, in
order to reach a goal, a sequence of decisions must be made. Whether it is baking
a cake, building a house, or playing a card game, a sequence of decisions has
to be made. Reinforcement learning provides efficient ways to learn solutions to
sequential decision problems.
Many real world problems can be modeled as a sequence of decisions [544]. For
example, in autonomous driving, an agent is faced with questions of speed control,
finding drivable areas, and, most importantly, avoiding collisions. In healthcare,
treatment plans contain many sequential decisions, and factoring the effects of
delayed treatment can be studied. In customer centers, natural language process-
ing can help improve chatbot dialogue, question answering, and even machine
translation. In marketing and communication, recommender systems recommend
news, personalize suggestions, deliver notifications to users, or otherwise optimize
the product experience. In trading and finance, systems decide to hold, buy or
sell financial titles, in order to optimize future reward. In politics and governance,
the effects of policies can be simulated as a sequence of decisions before they are
implemented. In mathematics and entertainment, playing board games, card games,
and strategy games consists of a sequence of decisions. In computational creativity,
making a painting requires a sequence of esthetic decisions. In industrial robotics
and engineering, the grasping of items and the manipulation of materials consists of
a sequence of decisions. In chemical manufacturing, the optimization of production
processes consists of many decision steps, that influence the yield and quality of
the product. Finally, in energy grids, the efficient and safe distribution of energy
can be modeled as a sequential decision problem.
In all these situations, we must make a sequence of decisions. In all these situa-
tions, taking the wrong decision can be very costly.
The algorithmic research on sequential decision making has focused on two
types of applications: (1) robotic problems and (2) games. Let us have a closer look
at these two domains, starting with robotics.

Fig. 1.1 Robot Flipping Pancakes [423]

Fig. 1.2 Aerobatic Model Helicopter [3]

Robotics

In principle, all actions that a robot should take can be pre-programmed step-by-step
by a programmer in meticulous detail. In highly controlled environments, such as
a welding robot in a car factory, this can conceivably work, although any small
change or any new task requires reprogramming the robot.
It is surprisingly hard to manually program a robot to perform a complex task.
Humans are not aware of their own operational knowledge, such as what “voltages”
we put on which muscles when we pick up a cup. It is much easier to define a desired
goal state, and let the system find the complicated solution by itself. Furthermore,
in environments that are only slightly challenging, when the robot must be able to
respond more flexibly to different conditions, an adaptive program is needed.
It will be no surprise that the application area of robotics is an important driver
for machine learning research, and robotics researchers turned early on to finding
methods by which the robots could teach themselves certain behavior.

Fig. 1.3 Chess

Fig. 1.4 Go

The literature on robotics experiments is varied and rich. A robot can teach itself
how to navigate a maze, how to perform manipulation tasks, and how to learn
locomotion tasks.
Research into adaptive robotics has made quite some progress. For example,
recent achievements include flipping pancakes [423] and flying an
aerobatic model helicopter [2, 3]; see Figs. 1.1 and 1.2. Frequently, learning tasks are
combined with computer vision, where a robot has to learn by visually interpreting
the consequences of its own actions.

Games

Let us now turn to games. Puzzles and games have been used from the earliest days
to study aspects of intelligent behavior. Indeed, before computers were powerful
enough to execute chess programs, in the days of Shannon and Turing, paper
designs were made, in the hope that understanding chess would teach us something
about the nature of intelligence [694, 788].

Fig. 1.5 Pac-Man [71]

Fig. 1.6 StarCraft [813]

Games allow researchers to limit the scope of their studies, to focus on intelli-
gent decision making in a limited environment, without having to master the full
complexity of the real world. In addition to board games such as chess and Go,
video games are being used extensively to test intelligent methods in computers.
Examples are Arcade-style games such as Pac-Man [523] and multi-player strategy
games such as StarCraft [813]. See Figs. 1.3–1.6.

1.1.5 Four Related Fields

Reinforcement learning is a rich field that existed in some form long before the
artificial intelligence endeavour started, as a part of biology, psychology, and
education [86, 389, 743]. In artificial intelligence it has become one of the three main
categories of machine learning, the other two being supervised and unsupervised
learning [93]. This book is a book of algorithms that are inspired by topics from
the natural and social sciences. Although the rest of the book will be about these
algorithms, it is interesting to briefly discuss the links of deep reinforcement learning
to human and animal learning. We will introduce the four scientific disciplines that
have a profound influence on deep reinforcement learning.

Fig. 1.7 Classical Conditioning: (1) a dog salivates when seeing food, (2) but initially not when
hearing a bell, (3) when the bell rings often enough while food is served, the dog starts
to associate the bell with food, and (4) also salivates when only the bell rings

1.1.5.1 Psychology

In psychology, reinforcement learning is also known as learning by conditioning or
as operant conditioning. Figure 1.7 illustrates the folk psychological idea of how a
dog can be conditioned. A natural reaction to food is that a dog salivates. By ringing
a bell whenever the dog is given food, the dog learns to associate the sound with
food, and after enough trials, the dog starts salivating as soon as it hears the bell,
presumably in anticipation of the food, whether it is there or not.
The behavioral scientists Pavlov (1849–1936) and Skinner (1904–1990) are well-
known for their work on conditioning. Phrases such as Pavlov-reaction have entered
our everyday language, and various jokes about conditioning exist (see, for example,
Fig. 1.8). Psychological research into learning is one of the main influences on
reinforcement learning as we know it in artificial intelligence.

Fig. 1.8 Who is Conditioning Whom?

1.1.5.2 Mathematics

Mathematical logic is another foundation of deep reinforcement learning. Discrete
optimization and graph theory are of great importance for the formalization of
reinforcement learning, as we will see in Sect. 2.2.2 on Markov decision processes.
Mathematical formalizations have enabled the development of efficient planning
and optimization algorithms, that are at the core of current progress.
Planning and optimization are an important part of deep reinforcement learning.
They are also related to the field of operations research, although there the emphasis
is on (non-sequential) combinatorial optimization problems. In AI, planning and
optimization are used as building blocks for creating learning systems for sequential,
high-dimensional, problems that can include visual, textual or auditory input.
The field of symbolic reasoning is based on logic; it is one of the earliest success
stories in artificial intelligence. Out of work in symbolic reasoning came heuristic
search [593], expert systems, and theorem proving systems. Well-known systems
are the STRIPS planner [242], the Mathematica computer algebra system [119], the
logic programming language PROLOG [157], and also systems such as SPARQL for
semantic (web) reasoning [24, 81].
Symbolic AI focuses on reasoning in discrete domains, such as decision trees,
planning, and games of strategy, such as chess and checkers. Symbolic AI has driven
success in methods to search the web, to power online social networks, and to power
online commerce. These highly successful technologies are the basis of much of
our modern society and economy. In 2011 the highest recognition in computer
science, the Turing award, was awarded to Judea Pearl for work in causal reasoning
(Fig. 1.9).2 Pearl later published an influential book to popularize the field [594].
Another area of mathematics that has played a large role in deep reinforcement
learning is the field of continuous (numerical) optimization. Continuous methods are
important, for example, in efficient gradient descent and backpropagation methods
that are at the heart of current deep learning algorithms.
2 Joining a long list of AI researchers that have been honored earlier with a Turing award: Minsky,
McCarthy, Newell, Simon, Feigenbaum and Reddy.

Fig. 1.9 Turing-award winner Judea Pearl

Fig. 1.10 Optimal Control of Dynamical Systems at Work

1.1.5.3 Engineering

In engineering, the field of reinforcement learning is better known as optimal control.
The theory of optimal control of dynamical systems was developed by Richard
Bellman and Lev Pontryagin [85]. Optimal control theory originally focused on
dynamical systems, and the technology and methods relate to continuous optimiza-
tion methods such as used in robotics (see Fig. 1.10 for an illustration of optimal
control at work in docking two space vehicles). Optimal control theory is of central
importance to many problems in engineering.
To this day reinforcement learning and optimal control use different
terminology and notation. States and actions are denoted as 𝑠 and 𝑎 in state-oriented
reinforcement learning, where the engineering world of optimal control uses 𝑥 and
𝑢. In this book the former notation is used.

Fig. 1.11 Turing-award winners Geoffrey Hinton, Yann LeCun, and Yoshua Bengio

1.1.5.4 Biology

Biology has a profound influence on computer science. Many nature-inspired
optimization algorithms have been developed in artificial intelligence. An important
nature-inspired school of thought is connectionist AI.
Mathematical logic and engineering approach intelligence as a top-down de-
ductive process; observable effects in the real world follow from the application of
theories and the laws of nature, and intelligence follows deductively from theory.
In contrast, connectionism approaches intelligence in a bottom-up fashion. Connec-
tionist intelligence emerges out of many low level interactions. Intelligence follows
inductively from practice. Intelligence is embodied: the bees in bee hives, the ants
in ant colonies, and the neurons in the brain all interact, and out of the connections
and interactions arises behavior that we recognize as intelligent [97].
Examples of the connectionist approach to intelligence are nature-inspired al-
gorithms such as Ant colony optimization [209], swarm intelligence [406, 97],
evolutionary algorithms [43, 252, 347], robotic intelligence [109], and, last but not
least, artificial neural networks and deep learning [318, 459, 280].
It should be noted that both the symbolic and the connectionist school of AI
have been very successful. After the enormous economic impact of search and
symbolic AI (Google, Facebook, Amazon, Netflix), much of the interest in AI in the
last two decades has been inspired by the success of the connectionist approach
in computer language and vision. In 2018 the Turing award was awarded to three
key researchers in deep learning: Bengio, Hinton, and LeCun (Fig. 1.11). Their most
famous paper on deep learning may well be [459].

1.2 Three Machine Learning Paradigms

Now that we have introduced the general context and origins of deep reinforcement
learning, let us switch gears, and talk about machine learning. Let us see how deep
reinforcement learning fits in the general picture of the field. At the same time, we
will take the opportunity to introduce some notation and basic concepts.
In the next section we will then provide an outline of the book. But first it is
time for machine learning. We start at the beginning, with function approximation.

Representing a Function

Functions are a central part of artificial intelligence. A function 𝑓 transforms input 𝑥
to output 𝑦 according to some method, and we write 𝑓 (𝑥) → 𝑦. In order to perform
calculations with function 𝑓 , the function must be represented as a computer
program in some form in memory. We also write function

𝑓 : 𝑋 → 𝑌,

where the domain 𝑋 and range 𝑌 can be discrete or continuous; the dimensionality
(number of attributes in 𝑋) can be arbitrary.
Often, in the real world, the same input may yield a range of different outputs,
and we would like our function to provide a conditional probability distribution, a
function that maps

𝑓 : 𝑋 → 𝑝(𝑌 ).

Here the function maps the domain to a probability distribution 𝑝 over the range.
Representing a conditional probability allows us to model functions for which the in-
put does not always give the same output. (Appendix A provides more mathematical
background.)
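The difference between the two kinds of mapping can be made concrete with a small Python sketch. It is an illustration only: the function `f`, the distribution returned by `p_y_given_x`, and the helper `sample` are hypothetical examples chosen for this sketch, not definitions used elsewhere in the book.

```python
import random

def f(x):
    """A deterministic function f: X -> Y (hypothetical example)."""
    return 2 * x + 1

def p_y_given_x(x):
    """A conditional distribution f: X -> p(Y), returned as {y: probability}.

    Hypothetical model: the output is usually 2x + 1, but sometimes off by one.
    """
    return {2 * x: 0.25, 2 * x + 1: 0.5, 2 * x + 2: 0.25}

def sample(distribution, rng):
    """Draw one output y according to the probabilities in the distribution."""
    outcomes = list(distribution)
    weights = [distribution[y] for y in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

rng = random.Random(0)
print(f(3))                                              # same output every time
print([sample(p_y_given_x(3), rng) for _ in range(5)])   # outputs may differ
```

The first mapping always returns the same 𝑦 for a given 𝑥; the second returns a distribution over possible 𝑦 values, from which different outputs can be drawn for the same input.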

Given versus Learned Function

Sometimes the function that we are interested in is given, and we can represent
the function by a specific algorithm that computes an analytical expression that is
known exactly. This is, for example, the case for laws of physics, or when we make
explicit assumptions for a particular system.

Fig. 1.12 Example of learning a function; data points are in blue, a possible learned linear function
is the red line, which allows us to make predictions $\hat{y}$ for any new input 𝑥

Example: Newton’s second Law of Motion states that for objects with
constant mass
𝐹 = 𝑚 · 𝑎,
where 𝐹 denotes the net force on the object, 𝑚 denotes its mass, and 𝑎
denotes its acceleration. In this case, the analytical expression defines the
entire function, for every possible combination of the inputs.

However, for many functions in the real world, we do not have an analytical
expression. Here, we enter the realm of machine learning, in particular of supervised
learning. When we do not know an analytical expression for a function, our best
approach is to collect data—examples of (𝑥, 𝑦) pairs—and reverse engineer or learn
the function from this data. See Fig. 1.12.

Example: A company wants to predict the chance that you buy a shampoo
to color your hair, based on your age. They collect many data points of
𝑥 ∈ N, your age (a natural number), that map to 𝑦 ∈ {0, 1}, a binary indicator
whether you bought their shampoo. They then want to learn the mapping

$\hat{y} = f(x)$

where 𝑓 is the desired function that tells the company who will buy the
product and $\hat{y}$ is the predicted 𝑦 (admittedly overly simplistic in this example).

Let us see which methods exist in machine learning to find function approxima-
tions.

Three Paradigms

There are three main paradigms for how the observations can be provided in
machine learning: (1) supervised learning, (2) reinforcement learning, and (3) unsu-
pervised learning.

1.2.1 Supervised Learning

The first and most basic paradigm for machine learning is supervised learning. In
supervised learning, the data to learn the function 𝑓 (𝑥) is provided to the learning
algorithm in (𝑥, 𝑦) example-pairs. Here 𝑥 is the input, and 𝑦 the observed output
value to be learned for that particular input value 𝑥. The 𝑦 values can be thought
of as supervising the learning process; they teach the learning process the right
answers for each input value 𝑥, hence the name supervised learning.
The data pairs to be learned from are organized in a dataset, which must be
present in its entirety before the algorithm can start. During the learning process,
an estimate $\hat{f}$ of the real function that generated the data is created. The 𝑥 values
of the pair are also called the input, and the 𝑦 values are the label to be learned.
Two well-known problems in supervised learning are regression and
classification. Regression predicts a continuous number, classification a discrete category.
The best known regression relation is the linear relation: the familiar straight line
through a cloud of observation points that we all know from our introductory
statistics course. Figure 1.12 shows such a linear relationship $\hat{y} = a \cdot x + b$. The
linear function can be characterized with two parameters 𝑎 and 𝑏. Of course, more
complex functions are possible, such as quadratic regression, non-linear regression,
or regression with higher-order polynomials [210].
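As a minimal sketch of the linear case, the two parameters 𝑎 and 𝑏 can be estimated directly from (𝑥, 𝑦) pairs with the standard least-squares formulas. The data values and the helper names `fit_line` and `predict` below are hypothetical and serve only as an illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b in y_hat = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

def predict(a, b, x):
    """Prediction y_hat for a new input x."""
    return a * x + b

# Hypothetical observations scattered around the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
a, b = fit_line(xs, ys)
print(a, b, predict(a, b, 5.0))
```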
The supervisory signal is computed for each data item 𝑖 as the difference between
the current estimate and the given label, for example by $(\hat{f}(x_i) - y_i)^2$. Such an error
function $(\hat{f}(x) - y)^2$ is also known as a loss function; it measures the quality of our
prediction. The closer our prediction is to the true label, the lower the loss. There
are many ways to compute this closeness, such as the mean squared error loss
$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} (\hat{f}(x_i) - y_i)^2$, which is used often for regression over 𝑁 observations.
This loss function can be used by a supervised learning algorithm to adjust model
parameters 𝑎 and 𝑏 to fit the function $\hat{f}$ to the data. Some of the many possible
learning algorithms are linear regression and support vector machines [93, 647].
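One common way in which a loss function is used to adjust parameters is gradient descent, sketched below for the mean squared error of a linear model. This is an illustration under assumptions: the dataset, the learning rate `lr`, the number of steps, and the function names `mse` and `gradient_step` are chosen for this example, and other learning algorithms use the loss in different ways.

```python
def mse(a, b, xs, ys):
    """Mean squared error of the line a * x + b on the dataset."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_step(a, b, xs, ys, lr):
    """One gradient-descent update of a and b on the MSE loss."""
    n = len(xs)
    errors = [a * x + b - y for x, y in zip(xs, ys)]
    grad_a = 2.0 / n * sum(e * x for e, x in zip(errors, xs))  # dL/da
    grad_b = 2.0 / n * sum(errors)                             # dL/db
    return a - lr * grad_a, b - lr * grad_b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # hypothetical data generated by y = 2x + 1
a, b = 0.0, 0.0                  # initial guess for the parameters
initial_loss = mse(a, b, xs, ys)
for _ in range(2000):
    a, b = gradient_step(a, b, xs, ys, lr=0.05)
print(initial_loss, mse(a, b, xs, ys), a, b)
```

Each step moves 𝑎 and 𝑏 a little in the direction that lowers the loss, so that the fitted line approaches the line that generated the data.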
In classification, a relation between an input value and a class label is learned. A
well-studied classification problem is image recognition, where two-dimensional
images are to be categorized. Table 1.2 shows a tiny dataset of labeled images
of the proverbial cats and dogs. A popular loss function for classification is the
cross-entropy loss $\mathcal{L} = -\sum_{i=1}^{N} y_i \log(\hat{f}(x_i))$, see also Sect. A.2.5.3. Again, such a
loss function can be used to adjust the model parameters to fit the function to the
data. The model can be small and linear, with few parameters, or it can be large,
with many parameters, such as a neural network, which is often used for image
classification.

“Cat” “Cat” “Dog” “Cat” “Dog” “Dog”

Table 1.2 (Input/output)-Pairs for a Supervised Classification Problem

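To make the cross-entropy loss concrete, here is a small sketch that computes it for two hypothetical classifiers on a two-image dataset. The predicted probabilities are made up for illustration, and labels are treated as one-hot, so only the probability assigned to the true class contributes to the sum.

```python
import math


def cross_entropy(probs, labels):
    """Summed cross-entropy loss over a dataset.

    probs[i] maps each class name to the predicted probability for item i;
    labels[i] is the true class name of item i (a one-hot label).
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels))


labels = ["cat", "dog"]
# Made-up predictions of a confident and of an undecided classifier.
confident = [{"cat": 0.9, "dog": 0.1}, {"cat": 0.2, "dog": 0.8}]
unsure = [{"cat": 0.5, "dog": 0.5}, {"cat": 0.5, "dog": 0.5}]

print(cross_entropy(confident, labels))
print(cross_entropy(unsure, labels))
```

The classifier that puts more probability on the correct labels receives the lower loss, which is the behavior a learning algorithm exploits when it adjusts the model parameters.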
In supervised learning a large dataset exists where all input items have an
associated training label. Reinforcement learning is different: it does not assume
the pre-existence of a large labeled training set. Unsupervised learning does require
a large dataset, but no user-supplied output labels; all it needs are the inputs.
Deep learning function approximation was first developed in a supervised set-
ting. Although this book is about deep reinforcement learning, we will encounter
supervised learning concepts frequently, whenever we discuss the deep learning
aspect of deep reinforcement learning.

1.2.2 Unsupervised Learning

When there are no labels in the dataset, different learning algorithms must be
used. Learning without labels is called unsupervised learning. In unsupervised
learning an inherent metric of the data items is used, such as distance. A typical
problem in unsupervised learning is to find patterns in the data, such as clusters or
subgroups [820, 801].
Popular unsupervised learning algorithms are 𝑘-means algorithms, and prin-
cipal component analysis [677, 380]. Other popular unsupervised methods are
dimensionality reduction techniques from visualization, such as t-SNE [493], minimum
description length [294] and data compression [55]. A popular application of
unsupervised learning is the autoencoder, see Sect. B.2.6 [411, 412].
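As a sketch of how clusters can be found from distance alone, without labels, the following code runs the standard 𝑘-means iteration (assign each point to the nearest center, then move each center to the mean of its points) on one-dimensional data. The data points and initial centers are assumptions chosen for illustration.

```python
def k_means_1d(points, centers, iterations=10):
    """k-means in one dimension, using absolute difference as the distance."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Unlabeled toy data with two visible groups (an assumption for this sketch).
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers, clusters = k_means_1d(points, centers=[0.0, 5.0])
print(centers, clusters)
```

No labels are supplied anywhere; the grouping follows from the distances between the data items.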
The relation between supervised and unsupervised learning is sometimes char-
acterized as follows: supervised learning aims to learn the conditional probability
distribution 𝑝(𝑥|𝑦) of input data conditioned on a label 𝑦, whereas unsupervised
learning aims to learn the a priori probability distribution 𝑝(𝑥) [343].
We will encounter unsupervised methods in this book in a few places, specif-
ically, when autoencoders and dimension reduction are discussed, for example,
in Chap. 5. At the end of this book explainable artificial intelligence is discussed,
where interpretable models play an important role, in Chap. 10.

Fig. 1.13 Agent and Environment

1.2.3 Reinforcement Learning

The last machine learning paradigm is, indeed, reinforcement learning. There are
three differences between reinforcement learning and the previous paradigms.
First, reinforcement learning learns by interaction; in contrast to supervised and
unsupervised learning, in reinforcement learning data items come one by one. The
dataset is produced dynamically, as it were. The objective in reinforcement learning
is to find the policy: a function that gives us the best action in each state that the
world can be in.
The approach of reinforcement learning is to learn the policy for the world by
interacting with it. In reinforcement learning we recognize an agent, that does the
learning of the policy, and an environment, that provides feedback to the agent’s
actions (and that performs state changes, see Fig. 1.13). In reinforcement learning,
the agent stands for the human, and the environment for the world. The goal of
reinforcement learning is to find the actions for each state that maximize the long-term
accumulated expected reward. This optimal function of states to actions is
called the optimal policy.
In reinforcement learning there is no teacher or supervisor, and there is no static
dataset. There is, however, the environment, that will tell us how good the state
is in which we find ourselves. This brings us to the second difference: the reward
value. Reinforcement learning gives us partial information, a number indicating the
quality of the action that brought us to our state, where supervised learning gives
full information: a label that provides the correct answer in that state (Table 1.3). In
this sense, reinforcement learning is in between supervised learning, in which all
data items have a label, and unsupervised learning, where no data has a label.
The third difference is that reinforcement learning is used to solve sequential
decision problems. Supervised and unsupervised learning learn single-step relations
between items; reinforcement learning learns a policy, which is the answer to
a multi-step problem. Supervised learning can classify a set of images for you;
unsupervised learning can tell you which items belong together; reinforcement
learning can tell you the winning sequence of moves in a game of chess, or the
action-sequence that robot-legs need to take in order to walk.

Concept     Supervised Learning      Reinforcement Learning
Inputs 𝑥    Full dataset of states   Partial (One state at a time)
Labels 𝑦    Full (correct action)    Partial (Numeric action reward)

Table 1.3 Supervised vs. Reinforcement Learning

These three differences have consequences. Reinforcement learning provides the
data to the learning algorithm step by step, action by action; whereas in supervised
learning the data is provided all at once in one large dataset. The step-by-step
approach is well suited to sequential decision problems. On the other hand, many
deep learning methods were developed for supervised learning and may work
differently when data items are generated one-by-one. Furthermore, since actions
are selected using the policy function, and action rewards are used to update this
same policy function, there is a possibility of circular feedback and local minima.
Care must be taken to ensure convergence to global optima in our methods. Human
learning also suffers from this problem, when a stubborn child refuses to explore
outside of its comfort zone. This topic is discussed in Sect. 2.2.4.3.
Another difference is that in supervised learning the pupil learns from a finite-
sized teacher (the dataset), and at some point may have learned all there is to learn.
The reinforcement learning paradigm allows a learning setup where the agent
can continue to sample the environment indefinitely, and will continue to become
smarter as long as the environment remains challenging (which can be a long time,
for example in games such as chess and Go).3
For these reasons there is great interest in reinforcement learning, although
getting the methods to work is often harder than for supervised learning.
Most classical reinforcement learning methods are tabular methods that work for
low-dimensional problems with small state spaces. Many real-world problems are
complex and high-dimensional, with large state spaces. Due to steady improvements
in learning algorithms, datasets, and compute power, deep learning methods have
become quite powerful. Deep reinforcement learning methods have emerged that
successfully combine step-by-step sampling in high-dimensional problems with
large state spaces. We will discuss these methods in the subsequent chapters of this
book.
3 In fact, some argue that reward is enough for artificial general intelligence, see Silver, Singh,
Precup, and Sutton [707].

Fig. 1.14 Deep Reinforcement Learning is built on Deep Supervised Learning and Tabular
Reinforcement Learning. (Chapter stack in the figure, from foundation to top: 2. Tabular
Value-Based Reinforcement Learning and B. Deep Supervised Learning; 3. Deep Value-Based
Reinforcement Learning; 4. Policy-Based Reinforcement Learning; 5. Model-Based
Reinforcement Learning; 6. Two-Agent Self-Play; 7. Multi-Agent Reinforcement Learning;
8. Hierarchical Reinforcement Learning; 9. Meta-Learning; 10. Further Developments.)

1.3 Overview of the Book

The aim of this book is to present the latest insights in deep reinforcement learning in
a single comprehensive volume, suitable for teaching a graduate-level one-semester
course.
In addition to covering state-of-the-art algorithms, we cover necessary background
in classic reinforcement learning and in deep learning. We also cover advanced,
forward-looking developments in self-play, and in multi-agent, hierarchical, and
meta-learning.

1.3.1 Prerequisite Knowledge

In an effort to be comprehensive, we make modest assumptions about previous
knowledge. We assume a bachelor level of computer science or artificial intelligence,
and an interest in artificial intelligence and machine learning. A good introductory
textbook on artificial intelligence is Russell and Norvig: Artificial Intelligence, A
Modern Approach [647].

Figure 1.14 shows an overview of the structure of the book. Deep reinforcement
learning combines deep supervised learning and classical (tabular) reinforcement
learning. The figure shows how the chapters are built on this dual foundation.
For deep reinforcement learning, the field of deep supervised learning is of great
importance. It is a large field; deep, and rich. Many students may have followed
a course on deep learning; if not, Appendix B provides you with the necessary
background (dashed in the figure). Tabular reinforcement learning, on the other hand, may be
new to you, and we start our story with this topic in Chap. 2.
We also assume undergraduate level familiarity with the Python programming
language. Python has become the programming language of choice for machine
learning research, and the host-language of most machine learning packages. All
example code in this book is in Python, and major machine learning environments
such as scikit-learn, TensorFlow, Keras and PyTorch work best from Python. See
https://www.python.org for pointers on how to get started in Python. Use the
latest stable version, unless the text mentions otherwise.
We assume an undergraduate level of familiarity with mathematics—a basic
understanding of set theory, graph theory, probability theory and information
theory is necessary, although this is not a book of mathematics. Appendix A contains
a summary to refresh your mathematical knowledge, and to provide an introduction
to the notation that is used in the book.

Course

There is a lot of material in the chapters, both basic and advanced, with many
pointers to the literature. One option is to teach a single course about all topics in
the book. Another option is to go slower and deeper, to spend sufficient time on the
basics, and create a course about Chaps. 2–5 to cover the basic topics (value-based,
policy-based, and model-based learning), and to create a separate course about
Chaps. 6–9 to cover the more advanced topics of multi-agent, hierarchical, and
meta-learning.

Blogs and GitHub

The field of deep reinforcement learning is a highly active field, in which theory and
practice go hand in hand. The culture of the field is open, and you will easily find
many blog posts about interesting topics, some quite good. Theory drives experi-
mentation, and experimental results drive theoretical insights. Many researchers
publish their papers on arXiv and their algorithms, hyperparameter settings and
environments on GitHub.
In this book we aim for the same atmosphere. Throughout the text we provide
links to code, and we challenge you with hands-on sections to get your hands dirty
to perform your own experiments. All links to web pages that we use have been
stable for some time.

Website: https://deep-reinforcement-learning.net is the companion website for this
book. It contains updates, slides, and other course material that you are welcome
to explore and use.

1.3.2 Structure of the Book

The field of deep reinforcement learning consists of two main areas: model-free
reinforcement learning and model-based reinforcement learning. Both areas have
two subareas. The chapters of this book are organized according to this structure.
• Model-free methods
– Value-based methods: Chap. 2 (tabular) and 3 (deep)
– Policy-based methods: Chap. 4
• Model-based methods
– Learned model: Chap. 5
– Given model: Chap. 6
Then, we have three chapters on more specialized topics.
• Multi-agent reinforcement learning: Chap. 7
• Hierarchical reinforcement learning: Chap. 8
• Transfer and Meta-learning: Chap. 9
Appendix B provides a necessary review of deep supervised learning.
The style of each chapter is to first provide the main idea of the chapter in
an intuitive example, to then explain the kind of problem to be solved, and then
to discuss algorithmic concepts that agents use, and the environments that have
been solved in practice with these algorithms. The sections of the chapters are
named accordingly: their names end in problem-agent-environment. At the end
of each chapter we provide questions for quizzes to check your understanding of
the concepts, and we provide exercises for larger programming assignments (some
doable, some quite challenging). We also end each chapter with a summary and
references to further reading.
Let us now look in more detail at what topics the chapters cover.

Chapters

After this introductory chapter, we continue with Chap. 2, in which we discuss in
detail the basic concepts of tabular (non-deep) reinforcement learning. We start with
Markov decision processes and discuss them at length. We will introduce tabular
planning and learning, and important concepts such as state, action, reward, value,
and policy. We will encounter the first, tabular, value-based model-free learning
algorithms (for an overview, see Table 2.1). Chapter 2 is the only non-deep chapter
of the book. All other chapters cover deep methods.
Chapter 3 explains deep value-based reinforcement learning. The chapter covers
the first deep algorithms that have been devised to find the optimal policy. We will
still be working in the value-based, model-free, paradigm. At the end of the chapter
we will analyze a player that teaches itself how to play 1980s Atari video games.
Table 3.1 lists some of the many stable deep value-based model-free algorithms.
Value-based reinforcement learning works well with applications such as games,
with discrete action spaces. The next chapter, Chap. 4, discusses a different approach:
deep policy-based reinforcement learning (Table 4.1). In addition to discrete spaces,
this approach is also suited for continuous action spaces, such as robot arm movement,
and simulated articulated locomotion. We see how a simulated half-cheetah
teaches itself to run.
The next chapter, Chap. 5, introduces deep model-based reinforcement learning
with a learned model, a method that first builds up a transition model of the envi-
ronment before it builds the policy. Model-based reinforcement learning holds the
promise of higher sample efficiency, and thus faster learning. New developments,
such as latent models, are discussed. Applications are both in robotics and in games
(Table 5.2).
The next chapter, Chap. 6, studies how a self-play system can be created for
applications where the transition model is given by the problem description. This is
the case in two-agent games, where the rules for moving in the game determine the
transition function. We study how TD-Gammon and AlphaZero achieve tabula rasa
learning: teaching themselves from zero knowledge to world champion level play
through playing against copies of themselves (Table 6.2). In this chapter deep residual
networks and Monte Carlo Tree Search result in curriculum learning.
Chapter 7 introduces recent developments in deep multi-agent and team learning.
The chapter covers competition and collaboration, population-based methods, and
playing in teams. Applications of these methods are found in games such as poker
and StarCraft (Table 7.2).
Chapter 8 covers deep hierarchical reinforcement learning. Many tasks exhibit
an inherent hierarchical structure, in which clear subgoals can be identified. The
options framework is discussed, as are methods that can identify subgoals, subpolicies,
and meta policies. Different approaches for tabular and deep hierarchical methods
are discussed (Table 8.1).
The final technical chapter, Chap. 9, covers deep meta-learning, or learning to
learn. One of the major hurdles in machine learning is the long time it takes to learn
to solve a new task. Meta-learning and transfer learning aim to speed up learning of
new tasks by using information that has been learned previously for related tasks;
algorithms are listed in Table 9.2. At the end of the chapter we will experiment with
few-shot learning, where a task has to be learned without having seen more than a
few training examples.
Chapter 10 concludes the book by reviewing what we have learned, and by
looking ahead into what the future may bring.

Appendix A provides mathematical background information and notation. Appendix B
provides a chapter-length overview of machine learning and deep supervised
learning. If you wish to refresh your knowledge of deep learning, please go to
this appendix before you read Chap. 3. Appendix C provides lists of useful software
environments and software packages for deep reinforcement learning.
Chapter 2
Tabular Value-Based Reinforcement Learning

This chapter will introduce the classic, tabular, field of reinforcement learning, to
build a foundation for the next chapters. First, we will introduce the concepts of
agent and environment. Next come Markov decision processes, the formalism that
is used to reason mathematically about reinforcement learning. We discuss at some
length the elements of reinforcement learning: states, actions, values, policies.
We learn about transition functions, and solution methods that are based on
dynamic programming using the transition model. There are many situations where
agents do not have access to the transition model, and state and reward information
must be acquired from the environment. Fortunately, methods exist to find the
optimal policy without a model, by querying the environment. These methods,
appropriately named model-free methods, will be introduced in this chapter. Value-
based model-free methods are the most basic learning approach of reinforcement
learning. They work well in problems with deterministic environments and discrete
action spaces, such as mazes and games. Model-free learning makes few demands
on the environment, building up the policy function 𝜋(𝑠) → 𝑎 by sampling the
environment.
After we have discussed these concepts, it is time to apply them, and to under-
stand the kinds of sequential decision problems that we can solve. We will look
at Gym, a collection of reinforcement learning environments. We will also look at
simple Grid world puzzles, and see how to navigate those.
This is a non-deep chapter: in this chapter functions are exact, states are stored
in tables, an approach that works as long as problems are small enough to fit in
memory. The next chapter shows how function approximation with neural networks
works when there are more states than fit in memory.
The chapter is concluded with exercises, a summary, and pointers to further
reading.


Core Concepts

• Agent, environment
• MDP: state, action, reward, value, policy
• Planning and learning
• Exploration and exploitation
• Gym, baselines

Core Problem

• Learn a policy from interaction with the environment

Core Algorithms

• Value iteration (Listing 2.1)
• Temporal difference learning (Sect. 2.2.4.2)
• Q-learning (Listing 2.6)

Finding a Supermarket

Imagine that you have just moved to a new city, you are hungry, and you want to
buy some groceries. There is a somewhat unrealistic catch: you do not have a map
of the city and you forgot to charge your smartphone. It is a sunny day, you put
on your hiking shoes, and after some random exploration you have found a way
to a supermarket and have bought your groceries. You have carefully noted your
route in a notebook, and you retrace your steps, finding your way back to your new
home.
What will you do the next time that you need groceries? One option is to
follow exactly the same route, exploiting your current knowledge. This option is
guaranteed to bring you to the store, at no additional cost for exploring possible
alternative routes. Or you could be adventurous, and explore, trying to find a new
route that may actually be quicker than the old route. Clearly, there is a trade-off:
you should not spend so much time exploring that you cannot recoup the gains of
a potential shorter route before you move elsewhere.
Reinforcement learning is a natural way of learning the optimal route as we go,
by trial and error, from the effects of the actions that we take in our environment.
This little story contained many of the elements of a reinforcement learning
problem, and how to solve it. There is an agent (you), an environment (the city), there
are states (your location at different points in time), actions (assuming a Manhattan-
style grid, moving a block left, right, forward, or back), there are trajectories (the
routes to the supermarket that you tried), there is a policy (that tells which action
you will take at a particular location), there is a concept of cost/reward (the length
of your current path), we see exploration of new routes, exploitation of old routes, a
trade-off between them, and your notebook in which you have been sketching a
map of the city (your local transition model).

Fig. 2.1 Grid World with a goal, an “un-goal,” and a wall

By the end of this chapter you will have learned which role all these topics play
in reinforcement learning.

2.1 Sequential Decision Problems

Reinforcement learning is used to solve sequential decision problems [27, 255].

Before we dive into the algorithms, let us have a closer look at these problems, to
better understand the challenges that the agents must solve.
In a sequential decision problem the agent has to make a sequence of decisions
in order to solve a problem. Solving means finding the sequence with the highest
(expected cumulative future) reward. The solver is called the agent, and the problem
is called the environment (or sometimes the world).
We will now discuss basic examples of sequential decision problems.

Grid Worlds

Some of the first environments that we encounter in reinforcement learning are
Grid worlds (Fig. 2.1). These environments consist of a rectangular grid of squares,
with a start square, and a goal square. The aim is for the agent to find the sequence
of actions that it must take (up, down, left, right) to arrive at the goal square. In
fancy versions a “loss” square is added, that scores minus points, or a “wall” square,
that is impenetrable for the agent. By exploring the grid, taking different actions,
and recording the reward (whether it reached the goal square), the agent can find a
route—and when it has a route, it can try to improve that route, to find a shorter
route to the goal.
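A Grid world of this kind can be sketched in a few lines of Python. The layout below, the reward values (+1 for the goal, −1 for the loss square, 0 elsewhere), and the helper names `find` and `step` are assumptions made for this illustration; they are not taken from a particular library or from the listings later in the book.

```python
# Hypothetical 3x4 layout: S = start, G = goal (+1), L = loss square (-1), # = wall.
GRID = ["...G",
        ".#.L",
        "S..."]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def find(symbol):
    """Return the (row, col) position of the first square showing symbol."""
    for r, row in enumerate(GRID):
        if symbol in row:
            return r, row.index(symbol)
    raise ValueError(symbol)


def step(state, action):
    """Apply one action; return (next_state, reward, done)."""
    dr, dc = MOVES[action]
    r, c = state[0] + dr, state[1] + dc
    # Moving off the grid or into a wall leaves the agent where it was.
    if not (0 <= r < len(GRID) and 0 <= c < len(GRID[0])) or GRID[r][c] == "#":
        r, c = state
    square = GRID[r][c]
    reward = {"G": 1, "L": -1}.get(square, 0)
    return (r, c), reward, square in "GL"


# Follow one hand-picked route from the start square and record the reward.
route = ["up", "up", "right", "right", "right"]
state, total, done = find("S"), 0, False
for action in route:
    state, reward, done = step(state, action)
    total += reward
print(state, total, done)
```

An agent that does not know the layout would have to discover such a route by trying actions and observing the rewards, which is the topic of the rest of this chapter.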

Fig. 2.2 The Longleat Hedge Maze in Wiltshire, England

Fig. 2.3 Sokoban Puzzle [136]

Grid world is a simple environment that is well-suited for manually playing
around with reinforcement learning algorithms, to build up intuition of what the
algorithms do. In this chapter we will model reinforcement learning problems
formally, and encounter algorithms that find optimal routes in Grid world.

Mazes and Box Puzzles

After Grid world problems, there are more complicated problems, with extensive
wall structures to make navigation more difficult (see Fig. 2.2). Trajectory planning
algorithms play a central role in robotics [456, 265]; there is a long tradition of
using 2D and 3D mazes for path-finding problems in reinforcement learning. The
Taxi domain was introduced by Dietterich [196], and box-pushing problems such
as Sokoban have also been used frequently [386, 204, 543, 878], see Fig. 2.3. The
challenge in Sokoban is that boxes can only be pushed, not pulled. Actions can have
the effect of creating an inadvertent dead end far into the future, making Sokoban
a difficult puzzle to play. The action space of these puzzles and mazes is discrete.
Small versions of the mazes can be solved exactly by planning; larger instances are
only suitable for approximate planning or learning methods. Solving these planning
problems exactly is NP-hard or PSPACE-hard [169, 322]; as a consequence, the
computational time required to solve problem instances exactly grows exponentially
with the problem size, and quickly becomes infeasible for all but the smallest
problems.

Fig. 2.4 Agent and environment [743]. The agent sends action $a_t$ to the environment; the
environment returns state $s_{t+1}$ and reward $r_{t+1}$.

Let us see how we can model agents to act in these types of environments.

2.2 Tabular Value-Based Agents

Reinforcement learning finds the best policy to operate in the environment by
interacting with it. The reinforcement learning paradigm consists of an agent (you,
the learner) and an environment (the world, which is in a certain state, and gives
you feedback on your actions).

2.2.1 Agent and Environment

In Fig. 2.4 the agent and environment are shown, together with action $a_t$, next state
$s_{t+1}$, and its reward $r_{t+1}$. Let us have a closer look at the figure.
The environment is in a certain state $s_t$ at time $t$. Then, the agent performs action
$a_t$, resulting in a transition in the environment from state $s_t$ to $s_{t+1}$ at the next time
step, also denoted as $s \to s'$. Along with this new state comes a reward value $r_{t+1}$
(which may be a positive or a negative value). The goal of reinforcement learning
is to find the sequence of actions that gives the best reward. More formally, the
goal is to find the optimal policy function $\pi^\star$ that gives in each state the best action
to take in that state. By trying different actions, and accumulating the rewards,
the agent can find the best action for each state. In this way, with the reinforcing
reward values, the optimal policy is learned from repeated interaction with the
environment, and the problem is “solved.”
In reinforcement learning the environment gives us only a number as an in-
dication of the quality of an action that we performed, and we are left to derive
the correct action policy from that, as we can see in Fig. 2.4. On the other hand,
reinforcement learning allows us to generate as many action-reward pairs as we
need, without a large hand-labeled dataset, and we can choose ourselves which
actions to try.
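The interaction loop of Fig. 2.4 can be sketched as follows. The corridor environment, its reward scheme (−1 per step, +1 on reaching the goal), and the names `CorridorEnv` and `run_episode` are hypothetical and serve only to show the flow of action, next state, and reward; real environment libraries differ in their details.

```python
import random


class CorridorEnv:
    """Hypothetical environment: a corridor of 5 squares, goal at the right end."""

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """action is -1 (left) or +1 (right); returns (s_next, r_next, done)."""
        self.state = min(4, max(0, self.state + action))
        done = self.state == 4
        reward = 1 if done else -1  # every step costs 1 until the goal is reached
        return self.state, reward, done


def run_episode(env, policy, max_steps=1000):
    """Run the agent-environment loop once; return the accumulated reward."""
    s = env.reset()
    total = 0
    for _ in range(max_steps):
        a = policy(s)             # the agent chooses a_t in state s_t
        s, r, done = env.step(a)  # the environment returns s_{t+1} and r_{t+1}
        total += r
        if done:
            break
    return total


random.seed(0)
random_return = run_episode(CorridorEnv(), lambda s: random.choice([-1, 1]))
always_right_return = run_episode(CorridorEnv(), lambda s: 1)
print(random_return, always_right_return)
```

The agent never sees the rule inside `step`; it only observes states and rewards. In this toy corridor the policy that always moves right collects the highest possible accumulated reward, and a randomly acting agent can do no better.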

2.2.2 Markov Decision Process

Sequential decision problems can be modelled as Markov decision processes
(MDPs) [483]. Markov decision problems have the Markov property: the next state
depends only on the current state and the action taken in it (no historical memory
of previous states or information from elsewhere influences the next state) [352].
The no-memory property is important because it makes reasoning about future
states possible using only the information present in the current state. If previous
histories influenced the current state, and these all had to be taken
into account, then reasoning about the current state would be much harder or even
infeasible.
Markov processes are named after Russian mathematician Andrey Markov (1856–1922),
who is best known for his work on these stochastic processes. See [27, 255]
for an introduction to MDPs. The MDP formalism is the mathematical basis of
reinforcement learning, and we will introduce the relevant elements in this chapter.
We follow Moerland [524] and François-Lavet et al. [255] for some of the notation
and examples in this section.

Formalism

We define a Markov decision process for reinforcement learning as a 5-tuple
$(S, A, T_a, R_a, \gamma)$:
• $S$ is a finite set of legal states of the environment; the initial state is denoted as $s_0$
• $A$ is a finite set of actions (if the set of actions differs per state, then $A_s$ is the
finite set of actions in state $s$)
• $T_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability that action $a$ in state $s$
at time $t$ will transition to state $s'$ at time $t + 1$ in the environment
• $R_a(s, s')$ is the reward received after action $a$ transitions state $s$ to state $s'$
• $\gamma \in [0, 1]$ is the discount factor representing the difference between future and
present rewards.
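For a small problem the five elements of the tuple can be written down directly as Python data. The two-state example below, with its state names, probabilities, and reward numbers, is made up for illustration; the helper functions are likewise assumptions of this sketch and not a standard API.

```python
# A hypothetical two-state MDP written out as the 5-tuple (S, A, T_a, R_a, gamma).
S = ["cool", "hot"]
A = ["slow", "fast"]

# T[a][s][s_next] = Pr(s_{t+1} = s_next | s_t = s, a_t = a)
T = {
    "slow": {"cool": {"cool": 1.0, "hot": 0.0},
             "hot":  {"cool": 0.5, "hot": 0.5}},
    "fast": {"cool": {"cool": 0.5, "hot": 0.5},
             "hot":  {"cool": 0.0, "hot": 1.0}},
}


def R(a, s, s_next):
    """Reward for the transition s -> s_next under action a (made-up numbers)."""
    progress = 2.0 if a == "fast" else 1.0
    penalty = 10.0 if s_next == "hot" else 0.0
    return progress - penalty


gamma = 0.9  # discount factor in [0, 1]


def is_well_formed(S, A, T, gamma):
    """Check that each T_a(s, .) is a probability distribution over S."""
    rows_ok = all(abs(sum(T[a][s][s2] for s2 in S) - 1.0) < 1e-12
                  for a in A for s in S)
    return rows_ok and 0.0 <= gamma <= 1.0


def expected_reward(s, a):
    """Expected immediate reward of action a in state s, weighted by T_a(s, s')."""
    return sum(T[a][s][s2] * R(a, s, s2) for s2 in S)


print(is_well_formed(S, A, T, gamma))
print(expected_reward("cool", "slow"), expected_reward("cool", "fast"))
```

Storing the transition probabilities and rewards in tables like this is only feasible when the state and action sets are small, which is the setting of this chapter.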

2.2.2.1 State 𝑺

Let us have a deeper look at the Markov-tuple $S, A, T_a, R_a, \gamma$, to see their role in the
reinforcement learning paradigm, and how, together, they can model and describe
reward-based learning processes.
At the basis of every Markov decision process is a description of the state $s_t$ of
the system at a certain time $t$.

State Representation

The state 𝑠 contains the information to uniquely represent the configuration of the
environment.
Often there is a straightforward way to uniquely represent the state in a computer
memory. For the supermarket example, each identifying location is a state (such
as: I am at the corner of 8th Av and 27th St). For chess, this can be the location of
all pieces on the board (plus information for the 50-move and repetition rules, castling
rights, and en-passant state). For robotics this can be the orientation of all joints of
the robot, and the location of the limbs of the robot. For Atari, the state comprises
the values of all screen pixels.
Using its current behavior policy, the agent chooses an action $a$, which is performed
in the environment. How the environment reacts to the action is defined by
the transition model $T_a(s, s')$ that is internal to the environment, which the agent
does not know. The environment returns the new state $s'$, as well as a reward value
$r'$ for the new state.

Deterministic and Stochastic Environment

In discrete deterministic environments the transition function defines a one-step
transition, as each action (from a certain old state) deterministically leads to a single
new state. This is the case in Grid worlds, Sokoban, and in games such as chess and
checkers, where a move action deterministically leads to one new board position.
An example of a non-deterministic situation is a robot movement in an envi-
ronment. In a certain state, a robot arm is holding a bottle. An agent-action can be
turning the bottle in a certain orientation (presumably to pour a drink in a cup). The
next state may be a full cup, or it may be a mess, if the bottle was not poured in the
correct orientation, or location, or if something happened in the environment such
as someone bumping the table. The outcome of the action is unknown beforehand
by the agent, and depends on elements in the environment, that are not known to
the agent.
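The difference between the two kinds of environment can be sketched by representing the outcome of one state-action pair as a probability table and sampling from it. The state names and the probabilities 0.8 and 0.2 are invented for this illustration, and `sample_next_state` is a hypothetical helper.

```python
import random


def sample_next_state(transition_row, rng):
    """Draw a next state from a dict that maps next_state to its probability."""
    states = list(transition_row)
    weights = [transition_row[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


# Deterministic: a chess move always leads to one new board position.
deterministic = {"board_after_move": 1.0}
# Stochastic: pouring may succeed or fail (made-up probabilities).
stochastic = {"full cup": 0.8, "mess": 0.2}

rng = random.Random(42)
chess_outcomes = {sample_next_state(deterministic, rng) for _ in range(100)}
pour_outcomes = [sample_next_state(stochastic, rng) for _ in range(1000)]
print(chess_outcomes)
print(pour_outcomes.count("full cup"), pour_outcomes.count("mess"))
```

In the deterministic case every sample is the same state. In the stochastic case the agent can only learn the likely outcomes of an action by trying it repeatedly.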

2.2.2.2 Action 𝑨

Now that we have looked at the state, it is time to look at the second item that
defines an MDP, the action.

Irreversible Environment Action

When the agent is in state 𝑠, it chooses an action 𝑎 to perform, based on its current
behavior policy 𝜋(𝑎|𝑠) (policies are explained soon). The agent communicates the
selected action 𝑎 to the environment (Fig. 2.4). For the supermarket example, an
example of an action could be walking along a block in a certain direction (such
as: East). For Sokoban, an action can be pushing a box to a new location in the
warehouse. Note that in different states the possible actions may differ. For the
supermarket example, walking East may not be possible at each street corner, and
in Sokoban pushing a box in a certain direction will only be possible in states where
this direction is not blocked by a wall.
An action changes the state of the environment irreversibly. In the reinforcement
learning paradigm, there is no undo operator for the environment (nor is there in
the real world). When the environment has performed a state transition, it is final.
The new state is communicated to the agent, together with a reward value. The
actions that the agent performs in the environment are also known as its behavior,
just as the actions of a human in the world constitute the human’s behavior.

Discrete or Continuous Action Space

The actions are discrete in some applications, continuous in others. For example,
the actions in board games, and choosing a direction in a navigation task in a grid,
are discrete.
In contrast, arm and joint movements of robots, and bet sizes in certain games, are
continuous (or span a very large range of values). Applying algorithms to continuous
or very large action spaces either requires discretization of the continuous space
(into buckets) or the development of a different kind of algorithm. As we will see
in Chaps. 3 and 4, value-based methods work well for discrete action spaces, and
policy-based methods work well for both action spaces.
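
Discretization into buckets can be sketched in a few lines. This is a minimal illustration and not code from the book: the function names, the action range [-1, 1], and the choice of four buckets are assumptions made for the example.

```python
import numpy as np

def make_buckets(low, high, n_buckets):
    """Return the center value of each of n_buckets equal-width buckets."""
    edges = np.linspace(low, high, n_buckets + 1)
    return (edges[:-1] + edges[1:]) / 2.0

def to_bucket(x, low, high, n_buckets):
    """Map a continuous action x in [low, high] to a discrete bucket index."""
    frac = (x - low) / (high - low)
    # Clamp so that x == high falls into the last bucket
    return int(min(n_buckets - 1, max(0, np.floor(frac * n_buckets))))

# Hypothetical one-dimensional action range, e.g. a normalized joint torque
centers = make_buckets(-1.0, 1.0, 4)
idx = to_bucket(0.3, -1.0, 1.0, 4)
```

A value-based agent would then choose among the bucket indices, and the environment would receive the corresponding bucket center as the actual action. Finer buckets give more precise control at the cost of a larger action space.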
For the supermarket example we can actually choose between modeling our
actions as discrete or continuous. From every state, we can move any number of steps,
small or large, integer or fractional, in any direction. We can even walk a curvy
path. So, strictly speaking, the action space is continuous. However, if, as in some
cities, the streets are organized in a rectangular Manhattan-pattern, then it makes
sense to discretize the continuous space, and to only consider discrete actions that
take us to the next street corner. Then, our action space has become discrete, by
using extra knowledge of the problem structure.1

Fig. 2.5 Backup Diagrams for MDP Transitions: Stochastic (left) and Deterministic (middle and
right) [743]

2.2.2.3 Transition 𝑻𝒂

After having discussed state and action, it is time to look at the transition
function 𝑇𝑎(𝑠, 𝑠′). The transition function 𝑇𝑎 determines how the state changes after
an action has been selected. In model-free reinforcement learning the transition
function is implicit to the solution algorithm: the environment has access to the
transition function, and uses it to compute the next state 𝑠′, but the agent does not.
(In Chap. 5 we will discuss model-based reinforcement learning. There the agent
has its own transition function, an approximation of the environment’s transition
function, which is learned from the environment feedback.)

Graph View of the State Space

We have discussed states, actions and transitions. The dynamics of the MDP are
modelled by transition function 𝑇𝑎(·) and reward function 𝑅𝑎(·). The imaginary
space of all possible states is called the state space. The state space is typically
large. The two functions define a two-step transition from state 𝑠 to 𝑠′, via action 𝑎:
𝑠 → 𝑎 → 𝑠′.
To help our understanding of the transitions between states we can use a graph-
ical depiction, as in Fig. 2.5.
In the figure, states and actions are depicted as nodes (vertices), and transitions
are links (edges) between the nodes. States are drawn as open circles, and actions
as smaller black circles. In a certain state 𝑠, the agent can choose which action 𝑎 to
perform, that is then acted out in the environment. The environment returns the
new state 𝑠′ and the reward 𝑟′.

1 If we assume that supermarkets are large, block-sized, items that typically can be found on street
corners, then we can discretize the action space. Note that we may miss small sub-block-sized
supermarkets, because of this simplification. Another, better, simplification, would be to discretize
the action space into walking distances of the size of the smallest supermarket that we expect to
ever encounter.

Figure 2.5 shows a transition graph of the elements of the MDP tuple 𝑠, 𝑎, 𝑡𝑎, 𝑟𝑎
as well as 𝑠′, and policy 𝜋, and how the value can be calculated. The root node
at the top is state 𝑠, where policy 𝜋 allows the agent to choose between three
actions 𝑎, that, following distribution Pr, each can transition to two possible states
𝑠′, with their reward 𝑟′. In the figure, a single transition is shown. Please use your
imagination to picture the other transitions as the graph extends down.
In the left panel of the figure the environment can choose which new state it
returns in response to the action (stochastic environment), in the middle panel there
is only one state for each action (deterministic environment); the tree can then be
simplified, showing only the states, as in the right panel.
To calculate the value of the root of the tree a backup procedure can be followed.
Such a procedure calculates the value of a parent from the values of the children,
recursively, in a bottom-up fashion, summing or maxing their values from the
leaves to the root of the tree. This calculation uses discrete time steps, indicated
by subscripts to the state and action, as in $s_t, s_{t+1}, s_{t+2}, \ldots$. For brevity, $s_{t+1}$ is
sometimes written as 𝑠′. The figure shows a single transition step; an episode in
reinforcement learning typically consists of a sequence of many time steps.

Trial and Error, Down and Up

A graph such as the one in the center and right panel of Fig. 2.5, where child nodes
have only one parent node and without cycles, is known as a tree. In computer
science the root of a tree is at the top, and branches grow downward to the leaves.
As actions are performed and states and rewards are returned back up the tree, a
learning process is taking place in the agent. We can use Fig. 2.5 to better understand
the learning process that is unfolding.
The rewards of actions are learned by the agent by interacting with the envi-
ronment, performing the actions. In the tree of Fig. 2.5 an action selection moves
downward, towards the leaves. At the deeper states, we find the rewards, which we
propagate to the parent states upwards. Reward learning is learning by backpropa-
gation: in Fig. 2.5 the reward information flows upward in the diagram from the
leaves to the root. Action selection moves down, reward learning flows up.
Reinforcement learning is learning by trial and error. Trial is selecting an action
down (using the behavior policy) to perform in the environment. Error is moving
up the tree, receiving a feedback reward from the environment, and reporting that
back up the tree to the state to update the current behavior policy. The downward
selection policy chooses which actions to explore, and the upward propagation of
the error signal performs the learning of the policy.
Figures such as the one in Fig. 2.5 are useful for seeing how values are calculated.
The basic notions are trial, and error, or down, and up.

2.2.2.4 Reward 𝑹𝒂

The reward function 𝑅𝑎 is of central importance in reinforcement learning. It
indicates the measure of quality of a state, such as solved, or distance. Rewards are
associated with single states, indicating their quality. However, we are most often
interested in the quality of a full decision making sequence from root to leaves
(this sequence of decisions would be one possible answer to our sequential decision
problem).
The reward of such a full sequence is called the return, sometimes denoted
confusingly as 𝑅, just as the reward. The expected cumulative discounted future
reward of a state is called the value function 𝑉 𝜋 (𝑠). The value function 𝑉 𝜋 (𝑠) is
the expected cumulative reward of 𝑠 where actions are chosen according to policy
𝜋. The value function plays a central role in reinforcement learning algorithms; in
a few moments we will look deeper into return and value.

2.2.2.5 Discount Factor 𝜸

We distinguish between two types of tasks: (1) continuous time, long running, tasks,
and (2) episodic tasks—tasks that end. In continuous and long running tasks it
makes sense to discount rewards from far in the future in order to more strongly
value current information at the present time. To achieve this a discount factor 𝛾 is
used in our MDP that reduces the impact of far away rewards. Many continuous
tasks use discounting, 𝛾 ≠ 1.
However, in this book we will often discuss episodic problems, where 𝛾 is
irrelevant. Both the supermarket example and the game of chess are episodic, and
discounting does not make sense in these problems, 𝛾 = 1.

2.2.2.6 Policy 𝝅

Of central importance in reinforcement learning is the policy function 𝜋. The policy
function 𝜋 answers the question how the different actions 𝑎 at state 𝑠 should be
chosen. Actions are anchored in states. The central question of MDP optimization
is how to choose our actions. The policy 𝜋 is a conditional probability distribution
that for each possible state specifies the probability of each possible action. The
function 𝜋 is a mapping from the state space to a probability distribution over the
action space:
𝜋 : 𝑆 → 𝑝(𝐴)
where 𝑝(𝐴) can be a discrete or continuous probability distribution. For a particular
probability (density) from this distribution we write

𝜋(𝑎|𝑠)

Example: For a discrete state space and discrete action space, we may store
an explicit policy as a table, e.g.:

𝑠      𝜋(𝑎=up|𝑠)   𝜋(𝑎=down|𝑠)   𝜋(𝑎=left|𝑠)   𝜋(𝑎=right|𝑠)
1      0.2         0.8           0.0           0.0
2      0.0         0.0           0.0           1.0
3      0.7         0.0           0.3           0.0
etc.   .           .             .             .

A special case of a policy is a deterministic policy, denoted by

𝜋(𝑠)

where

𝜋:𝑆→𝐴
A deterministic policy selects a single action in every state. Of course the deter-
ministic action may differ between states, as in the example below:

Example: An example of a deterministic discrete policy is

𝑠      𝜋(𝑎=up|𝑠)   𝜋(𝑎=down|𝑠)   𝜋(𝑎=left|𝑠)   𝜋(𝑎=right|𝑠)
1      0.0         1.0           0.0           0.0
2      0.0         0.0           0.0           1.0
3      1.0         0.0           0.0           0.0
etc.   .           .             .             .

We would write 𝜋(𝑠 = 1) = down, 𝜋(𝑠 = 2) = right, etc.
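
A tabular policy can be stored as an array with one row per state, from which actions are sampled. The sketch below uses the values of the stochastic example table above; the names `ACTIONS`, `pi_table`, `sample_action` and `is_deterministic` are hypothetical and only serve the illustration.

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right"]

# Rows are states 1..3 of the stochastic example table; columns follow ACTIONS
pi_table = np.array([
    [0.2, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.7, 0.0, 0.3, 0.0],
])

def sample_action(pi_table, s, rng):
    """Draw an action for state s (1-based, as in the table) from pi(a|s)."""
    return ACTIONS[rng.choice(len(ACTIONS), p=pi_table[s - 1])]

def is_deterministic(pi_table):
    """A policy is deterministic if every row puts probability 1 on one action."""
    return bool(np.all(pi_table.max(axis=1) == 1.0))

rng = np.random.default_rng(0)
a = sample_action(pi_table, 2, rng)  # state 2 has all probability mass on "right"
```

A deterministic policy is the special case in which each row contains a single 1.0, so that sampling always returns the same action for a given state.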

2.2.3 MDP Objective

Finding the optimal policy function is the goal of the reinforcement learning prob-
lem, and the remainder of this book will discuss many different algorithms to
achieve this goal under different circumstances. Let us have a closer look at the
objective of reinforcement learning. Before we can do so, we will look at traces,
their return, and value functions.

2.2.3.1 Trace 𝝉

As we start interacting with the MDP, at each timestep 𝑡, we observe $s_t$, take an
action $a_t$ and then observe the next state $s_{t+1} \sim T_{a_t}(s_t)$ and reward $r_t = R_{a_t}(s_t, s_{t+1})$.

Fig. 2.6 Single Transition Step versus Full 3-Step Trace/Episode/Trajectory

Repeating this process leads to a sequence or trace in the environment, which we
denote by $\tau_t^n$:

$$\tau_t^n = \{s_t, a_t, r_t, s_{t+1}, \ldots, a_{t+n}, r_{t+n}, s_{t+n+1}\}$$

Here, 𝑛 denotes the length of the trace 𝜏. In practice, we often assume 𝑛 = ∞, which means
that we run the trace until the domain terminates. In those cases, we will simply
write $\tau_t = \tau_t^\infty$. Traces are one of the basic building blocks of reinforcement learning
algorithms. They are a single full rollout of a sequence from the sequential decision
problem. They are also called trajectory, episode, or simply sequence (Fig. 2.6 shows
a single transition step, and an example of a three-step trace).

Example: A short trace with three actions could look like:

$$\tau_0^2 = \{s_0 = 1,\ a_0 = \text{up},\ r_0 = -1,\ s_1 = 2,\ a_1 = \text{up},\ r_1 = -1,\ s_2 = 3,\ a_2 = \text{left},\ r_2 = 20,\ s_3 = 5\}$$

Since both the policy and the transition dynamics can be stochastic, we will not
always get the same trace from the start state. Instead, we will get a distribution
over traces. The distribution of traces from the start state (distribution) is denoted
by $p(\tau_0)$. The probability of each possible trace from the start is actually given by
the product of the probability of each specific transition in the trace:

$$p(\tau_0) = p_0(s_0) \cdot \pi(a_0|s_0) \cdot T_{a_0}(s_0, s_1) \cdot \pi(a_1|s_1) \cdots = p_0(s_0) \cdot \prod_{t=0}^{\infty} \pi(a_t|s_t) \cdot T_{a_t}(s_t, s_{t+1}) \qquad (2.1)$$
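
Equation 2.1 can be evaluated directly for a finite trace. The following is a small sketch on a made-up two-state MDP; the dictionaries `p0`, `pi` and `T`, the action names, and all probabilities are assumptions chosen for the illustration.

```python
def trace_probability(trace, p0, pi, T):
    """Probability of a trace [(s0, a0, s1), (s1, a1, s2), ...] following Eq. 2.1.

    p0[s]        probability of starting in state s
    pi[s][a]     probability of choosing action a in state s
    T[a][s][s2]  probability of moving from s to s2 under action a
    """
    prob = p0[trace[0][0]]
    for s, a, s_next in trace:
        prob *= pi[s][a] * T[a][s][s_next]
    return prob

# Hypothetical MDP: two states (0 and 1), two actions ("go" and "stay")
p0 = {0: 1.0, 1: 0.0}
pi = {0: {"go": 0.5, "stay": 0.5}, 1: {"go": 0.5, "stay": 0.5}}
T = {
    "go":   {0: {0: 0.2, 1: 0.8}, 1: {0: 0.0, 1: 1.0}},
    "stay": {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
}

trace = [(0, "go", 1), (1, "stay", 1)]
prob = trace_probability(trace, p0, pi, T)  # 1.0 * (0.5 * 0.8) * (0.5 * 1.0)
```

Rewards do not appear in the product: in this formulation the reward is a function of the transition, so it adds no randomness of its own.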

Policy-based reinforcement learning depends heavily on traces, and we will
discuss traces more deeply in Chap. 4. Value-based reinforcement learning (this
chapter) uses single transition steps.

Return 𝑹

We have not yet formally defined what we actually want to achieve in the sequential
decision-making task—which is, informally, the best policy. The sum of the future
reward of a trace is known as the return. The return of trace $\tau_t$ is:

$$R(\tau_t) = r_t + \gamma \cdot r_{t+1} + \gamma^2 \cdot r_{t+2} + \ldots = r_t + \sum_{i=1}^{\infty} \gamma^i r_{t+i} \qquad (2.2)$$

where 𝛾 ∈ [0, 1] is the discount factor. Two extreme cases are:

• 𝛾 = 0: A myopic agent, which only considers the immediate reward, $R(\tau_t) = r_t$
• 𝛾 = 1: A far-sighted agent, which treats all future rewards as equal, $R(\tau_t) = r_t + r_{t+1} + r_{t+2} + \ldots$
Note that if we use an infinite-horizon return (Eq. 2.2) with 𝛾 = 1.0, then the
cumulative reward may become unbounded. Therefore, in continuous problems,
we use a discount factor close to 1.0, such as 𝛾 = 0.99.

Example: For the previous trace example we assume 𝛾 = 0.9. The return
(cumulative reward) is equal to:

$$R(\tau_0^2) = -1 + 0.9 \cdot (-1) + 0.9^2 \cdot 20 = 16.2 - 1.9 = 14.3$$
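
The return of a finite trace can be computed by working backwards through the rewards, which avoids computing powers of 𝛾 explicitly. The sketch below reproduces the example; the function name `discounted_return` is a hypothetical choice.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite reward list."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

ret = discounted_return([-1, -1, 20], gamma=0.9)     # rewards of the example trace
myopic = discounted_return([-1, -1, 20], gamma=0.0)  # only the immediate reward counts
far_sighted = discounted_return([-1, -1, 20], gamma=1.0)  # plain sum of rewards
```

The backward recursion `g = r + gamma * g` is the same one that appears later in the Monte Carlo code, where the return of every visited state is needed.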

2.2.3.2 State Value 𝑽

The real measure of optimality that we are interested in is not the return of just one
trace. The environment can be stochastic, and so can our policy, and for a given
policy we do not always get the same trace. Therefore, we are actually interested
in the expected cumulative reward that a certain policy achieves. The expected
cumulative discounted future reward of a state is better known as the value of that
state.
We define the state value $V^\pi(s)$ as the return we expect to achieve when an
agent starts in state 𝑠 and then follows policy 𝜋, as:

$$V^\pi(s) = \mathbb{E}_{\tau_t \sim p(\tau_t)} \Big[ \sum_{i=0}^{\infty} \gamma^i \cdot r_{t+i} \,\Big|\, s_t = s \Big] \qquad (2.3)$$

Example: Imagine that we have a policy 𝜋, which from state 𝑠 can result
in two traces. The first trace has a cumulative reward of 20, and occurs
60% of the time. The other trace has a cumulative reward of 10, and occurs
40% of the time. What is the value of state 𝑠?

𝑉 𝜋 (𝑠) = 0.6 · 20 + 0.4 · 10 = 16.

The average return (cumulative reward) that we expect to get from state 𝑠
under this policy is 16.
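
The expectation in this example can also be approximated by sampling, which is the idea that model-free methods build on. The following sketch computes the exact value and a sample-based estimate; the number of samples and the random seed are arbitrary assumptions.

```python
import numpy as np

returns = np.array([20.0, 10.0])  # cumulative reward of each possible trace
probs = np.array([0.6, 0.4])      # probability of each trace under the policy

# Exact expectation: sum over traces of probability times return
v_exact = float(np.dot(probs, returns))

# Sample-based estimate: average the returns of many sampled traces
rng = np.random.default_rng(0)
samples = rng.choice(returns, size=10_000, p=probs)
v_estimate = float(samples.mean())
```

The estimate approaches the exact value as the number of sampled traces grows.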

Every policy 𝜋 has one unique associated value function 𝑉 𝜋 (𝑠). We often omit
𝜋 to simplify notation, simply writing 𝑉 (𝑠), knowing a state value is always condi-
tioned on a certain policy.
The state value is defined for every possible state 𝑠 ∈ 𝑆. 𝑉 (𝑠) maps every state
to a real number (the expected return):

𝑉 :𝑆→R

Example: In a discrete state space, the value function can be represented
as a table of size |𝑆|.

𝑠      𝑉𝜋(𝑠)
1      2.0
2      4.0
3      1.0
etc.   .

Finally, the state value of a terminal state is by definition zero:

𝑠 = terminal ⇒ 𝑉 (𝑠) := 0.

2.2.3.3 State-Action Value 𝑸

In addition to state values $V^\pi(s)$, we also define state-action value $Q^\pi(s, a)$.2 The
only difference is that we now condition on a state and action. We estimate the
average return we expect to achieve when taking action 𝑎 in state 𝑠, and following
policy 𝜋 afterwards:

$$Q^\pi(s, a) = \mathbb{E}_{\tau_t \sim p(\tau_t)} \Big[ \sum_{i=0}^{\infty} \gamma^i \cdot r_{t+i} \,\Big|\, s_t = s, a_t = a \Big] \qquad (2.4)$$

2 The reason for the choice for letter Q is lost in the mists of time. Perhaps it is meant to indicate
quality.

Every policy 𝜋 has only one unique associated state-action value function 𝑄 𝜋 (𝑠, 𝑎).
We often omit 𝜋 to simplify notation. Again, the state-action value is a function

𝑄:𝑆×𝐴→R

which maps every state-action pair to a real number.

Example: For a discrete state and action space, 𝑄(𝑠, 𝑎) can be represented
as a table of size |𝑆| × |𝐴|. Each table entry stores a 𝑄(𝑠, 𝑎) estimate for the
specific 𝑠, 𝑎 combination:

       𝑎=up   𝑎=down   𝑎=left   𝑎=right
𝑠=1    4.0    3.0      7.0      1.0
𝑠=2    2.0    -4.0     0.3      1.0
𝑠=3    3.5    0.8      3.6      6.2
etc.   .      .        .        .

The state-action value of a terminal state is by definition zero:

𝑠 = terminal ⇒ 𝑄(𝑠, 𝑎) := 0, ∀𝑎

2.2.3.4 Reinforcement Learning Objective

We now have the ingredients to formally state the objective 𝐽(·) of reinforcement
learning. The objective is to achieve the highest possible average return from the
start state:

$$J(\pi) = V^\pi(s_0) = \mathbb{E}_{\tau_0 \sim p(\tau_0|\pi)} \big[ R(\tau_0) \big] \qquad (2.5)$$

for $p(\tau_0)$ given in Eq. 2.1. There is one optimal value function, which achieves
higher or equal value than all other value functions. We search for a policy that
achieves this optimal value function, which we call the optimal policy $\pi^\star$:

$$\pi^\star(a|s) = \arg\max_{\pi} V^\pi(s_0) \qquad (2.6)$$

The optimal policy $\pi^\star$ is obtained by using the arg max function to select the
policy with the optimal value. The goal in reinforcement learning is to find this
optimal policy for start state $s_0$.
A potential benefit of state-action values 𝑄 over state values 𝑉 is that state-
action values directly tell what every action is worth. This may be useful for action
selection, since, for discrete action spaces,

$$a^\star = \arg\max_{a \in A} Q^\star(s, a)$$

the Q function directly identifies the best action. Equivalently, the optimal policy
can be obtained directly from the optimal Q function:

$$\pi^\star(s) = \arg\max_{a \in A} Q^\star(s, a).$$

We will now turn to construct algorithms to compute the value function and the
policy function.
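
Reading a greedy policy from a Q-table takes a single arg max per state. The sketch below reuses the numbers of the example Q-table from Sect. 2.2.3.3, treating them, purely for illustration, as if they were the optimal values 𝑄★; the variable names are assumptions.

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right"]

# Rows are states 1..3, columns follow ACTIONS (values of the example Q-table)
Q = np.array([
    [4.0,  3.0, 7.0, 1.0],
    [2.0, -4.0, 0.3, 1.0],
    [3.5,  0.8, 3.6, 6.2],
])

# Greedy (deterministic) policy: the best action in every state
greedy_policy = [ACTIONS[i] for i in np.argmax(Q, axis=1)]

# State values under the greedy policy: V(s) = max_a Q(s, a)
V = Q.max(axis=1)
```

With state values alone, the same choice would require a one-step lookahead through the transition function, which is why Q-values are convenient when the transition function is not available to the agent.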

2.2.3.5 Bellman Equation

To calculate the value function, let us look again at the tree in Fig. 2.5 on page 31,
and imagine that it is many times larger, with subtrees that extend to fully cover
the state space. Our task is to compute the value of the root, based on the reward
values at the real leaves, using the transition function 𝑇𝑎 . One way to calculate the
value 𝑉 (𝑠) is to traverse this full state space tree, computing the value of a parent
node by taking the reward value and the sum of the children, discounting this value
by 𝛾.
This intuitive approach was first formalized by Richard Bellman in 1957. Bell-
man showed that discrete optimization problems can be described as a recursive
backward induction problem [72]. He introduced the term dynamic programming
to recursively traverse the states and actions. The so-called Bellman equation shows
the relationship between the value function in state 𝑠 and the future child state 𝑠′,
when we follow the transition function.
The discrete Bellman equation of the value of state 𝑠 after following policy 𝜋 is:3

$$V^\pi(s) = \sum_{a \in A} \pi(a|s) \Big[ \sum_{s' \in S} T_a(s, s') \big[ R_a(s, s') + \gamma \cdot V^\pi(s') \big] \Big] \qquad (2.7)$$

where 𝜋(𝑎|𝑠) is the probability of action 𝑎 in state 𝑠, 𝑇 is the stochastic transition function,
𝑅 is the reward function and 𝛾 is the discount rate. Note the recursion on the value
function, and that for the Bellman equation the transition and reward functions
must be known for all states by the agent.
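
One way to make Eq. 2.7 concrete is to apply it repeatedly as an update rule until the values stop changing (often called iterative policy evaluation). The sketch below does this for a made-up two-state MDP; the array layout `T[a, s, s2]`, `R[a, s, s2]`, `pi[s, a]`, the tolerance, and all numbers are assumptions chosen for the illustration.

```python
import numpy as np

def policy_evaluation(pi, T, R, gamma, tol=1e-10, max_sweeps=10_000):
    """Iterate the Bellman equation (Eq. 2.7) until V stops changing.

    pi[s, a]     probability of action a in state s
    T[a, s, s2]  probability of moving from s to s2 under action a
    R[a, s, s2]  reward for that transition
    """
    V = np.zeros(pi.shape[0])
    for _ in range(max_sweeps):
        # Inner sum of Eq. 2.7 for every (a, s): sum_s2 T * (R + gamma * V[s2])
        backup = np.einsum("ast,ast->as", T, R + gamma * V)
        # Outer sum of Eq. 2.7: weigh the actions by the policy
        V_new = np.einsum("sa,as->s", pi, backup)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Hypothetical MDP: state 0 is the start, state 1 is absorbing with zero reward
T = np.zeros((2, 2, 2))
R = np.zeros((2, 2, 2))
T[0, 0, 1] = 1.0; R[0, 0, 1] = 10.0  # action 0: move to state 1, reward 10
T[1, 0, 0] = 1.0; R[1, 0, 0] = 1.0   # action 1: stay in state 0, reward 1
T[0, 1, 1] = 1.0                     # state 1 loops onto itself
T[1, 1, 1] = 1.0

pi = np.full((2, 2), 0.5)            # uniformly random policy
V = policy_evaluation(pi, T, R, gamma=0.9)
```

For this small example the fixed point can be checked by hand: 𝑉(0) = 0.5 · 10 + 0.5 · (1 + 0.9 · 𝑉(0)), which gives 𝑉(0) = 10, and 𝑉(1) = 0.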
Together, the transition and reward model are referred to as the dynamics model
of the environment. The dynamics model is often not known by the agent, and
model-free methods have been developed to compute the value function and policy
function without them.
The recursive Bellman equation is the basis of algorithms to compute the value
function, and other relevant functions to solve reinforcement learning problems. In
the next section we will study these solution methods.
3 State-action value and continuous Bellman equations can be found in Appendix A.4.

Fig. 2.7 Recursion: Droste effect

2.2.4 MDP Solution Methods

The Bellman equation is a recursive equation: it shows how to calculate the value
of a state, out of the values of applying the function specification again on the
successor states. Figure 2.7 shows a recursive picture, of a picture in a picture, in
a picture, etc. In algorithmic form, dynamic programming calls its own code on
states that are closer and closer to the leaves, until the leaves are reached, and the
recursion cannot go further.
Dynamic programming uses the principle of divide and conquer: it begins with
a start state whose value is to be determined by searching a large subtree, which
it does by going down into the recursion, finding the value of sub-states that are
closer to terminals. At terminals the reward values are known, and these are then
used in the construction of the parent values, as it goes up, back out of the recursion,
and ultimately arrives at the root value itself.
A simple dynamic programming method to iteratively traverse the state space to
calculate Bellman’s equation is value iteration (VI). Pseudocode for a basic version
of VI is shown in Listing 2.1, based on [15]. Value iteration converges to the optimal
value function by iteratively improving the estimate of 𝑉 (𝑠). The value function
𝑉 (𝑠) is first initialized to random values. Value iteration repeatedly updates 𝑄(𝑠, 𝑎)
and 𝑉 (𝑠) values, looping over the states and their actions, until convergence occurs
(when the values of 𝑉 (𝑠) stop changing much).
Value iteration works with a finite set of actions. It has been proven to converge
to the optimal values, but, as we can see in the pseudocode in Listing 2.1, it does so
quite inefficiently by essentially repeatedly enumerating the entire state space in a
triply nested loop, traversing the state space many times. Soon we will see more
efficient methods.

def value_iteration():
    initialize(V)
    while not convergence(V):
        for s in range(S):
            for a in range(A):
                # Bellman backup: expected reward plus discounted successor value
                Q[s, a] = sum(T[a][s][s2] * (R[a][s][s2] + gamma * V[s2])
                              for s2 in range(S))
            V[s] = max_a(Q[s, a])
    return V

Listing 2.1 Value Iteration pseudocode
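
The pseudocode of Listing 2.1 can be turned into a small runnable sketch when the transition and reward functions are available as arrays. The version below is one possible realization, not the book's implementation: the array layout `T[a, s, s2]` and `R[a, s, s2]`, the stopping tolerance, the vectorized backup, and the two-state MDP are all assumptions made for the example.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Value iteration for a small tabular MDP with gamma < 1.

    T[a, s, s2]  probability of moving from s to s2 under action a
    R[a, s, s2]  reward for that transition
    Returns the converged state values V and the Q-table.
    """
    n_actions, n_states, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_s2 T_a(s, s2) * (R_a(s, s2) + gamma * V[s2])
        Q = np.einsum("ast,ast->sa", T, R + gamma * V)
        V_new = Q.max(axis=1)
        # Convergence test: stop when no state value changes by more than tol
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

# Hypothetical MDP: state 0 is the start, state 1 is absorbing with zero reward
T = np.zeros((2, 2, 2))
R = np.zeros((2, 2, 2))
T[0, 0, 1] = 1.0; R[0, 0, 1] = 10.0  # action 0: move to state 1, reward 10
T[1, 0, 0] = 1.0; R[1, 0, 0] = 0.5   # action 1: stay in state 0, reward 0.5
T[0, 1, 1] = 1.0                     # state 1 loops onto itself
T[1, 1, 1] = 1.0

V_star, Q_star = value_iteration(T, R, gamma=0.9)
best_action_in_start = int(np.argmax(Q_star[0]))
```

The two inner loops over actions and successor states are expressed here as a single array operation, but the amount of work per sweep is unchanged: every sweep still touches every state, action, and successor state.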

2.2.4.1 Hands On: Value Iteration in Gym

We have discussed in detail how to model a reinforcement learning problem with
an MDP. We have talked in depth and at length about states, actions, and policies. It
is now time for some hands-on work, to experiment with the theoretical concepts.
We will start with the environment.

OpenAI Gym

OpenAI has created the Gym suite of environments for Python, which has become
the de facto standard in the field [108]. The Gym suite can be found at OpenAI4
and on GitHub.5 Gym works on Linux, macOS and Windows. An active community
exists and new environments are created continuously and uploaded to the Gym
website. Many interesting environments are available for experimentation, for which
you can create and test your own agent algorithm.
If you browse Gym on GitHub, you will see different sets of environments,
from easy to advanced. There are the classics, such as Cartpole and Mountain car.
There are also small text environments. Taxi is there, and the Arcade Learning
Environment [71], which was used in the paper that introduced DQN [522], as we
will discuss at length in the next chapter. MuJoCo6 is also available, an environment
for experimentation with simulated robotics [780], or you can use pybullet.7
You should now install Gym. Go to the Gym page on https://gym.openai.com
and read the documentation. Make sure Python is installed on your system (does
typing python at the command prompt work?), and that your Python version is up
to date (version 3.10 at the time of this writing). Then type

    pip install gym

to install Gym with the Python package manager. Soon, you will also be needing
deep learning suites, such as TensorFlow or PyTorch. It is recommended to install
Gym in the same virtual environment as your upcoming PyTorch and TensorFlow
installation, so that you can use both at the same time (see Sect. B.3.3). You may
have to install or update other packages, such as numpy, scipy and pyglet, to get
Gym to work, depending on your system installation.
You can check if the installation works by checking whether the CartPole environment
works, see Listing 2.2. A window should appear on your screen in which a Cartpole
is making random movements (your window system should support OpenGL, and
you may need a version of pyglet newer than version 1.5.11 on some operating
systems).

    import gym

    env = gym.make('CartPole-v0')
    env.reset()
    for _ in range(1000):
        env.render()                           # draw the current frame
        env.step(env.action_space.sample())    # take a random action
    env.close()

Listing 2.2 Running the CartPole Environment from Gym

Fig. 2.8 Taxi world [395]

4 https://gym.openai.com
5 https://github.com/openai/gym
6 http://www.mujoco.org
7 https://pybullet.org/wordpress/

Taxi Example with Value Iteration

The Taxi example (Fig. 2.8) is an environment where taxis move up, down, left,
and right, and pick up and drop off passengers. Let us see how we can use value
iteration to solve the Taxi problem. The Gym documentation describes the Taxi
world as follows. There are four designated locations in the Grid world indicated by
R(ed), B(lue), G(reen), and Y(ellow). When the episode starts, the taxi starts off at a
random square and the passenger is at a random location. The taxi drives to the
passenger’s location, picks up the passenger, drives to the passenger’s destination
(another one of the four specified locations), and then drops off the passenger. Once
the passenger is dropped off, the episode ends.

    import gym
    import numpy as np

    # Note: env.nS, env.nA and env.P are attributes of the tabular Taxi
    # environment in the classic Gym API that this listing was written for.

    def iterate_value_function(v_inp, gamma, env):
        # One sweep of Bellman backups over all states
        ret = np.zeros(env.nS)
        for sid in range(env.nS):
            temp_v = np.zeros(env.nA)
            for action in range(env.nA):
                for (prob, dst_state, reward, is_final) in env.P[sid][action]:
                    temp_v[action] += prob * (reward + gamma * v_inp[dst_state] * (not is_final))
            ret[sid] = max(temp_v)
        return ret

    def build_greedy_policy(v_inp, gamma, env):
        # In every state, pick the action with the best one-step lookahead value
        new_policy = np.zeros(env.nS)
        for state_id in range(env.nS):
            profits = np.zeros(env.nA)
            for action in range(env.nA):
                for (prob, dst_state, reward, is_final) in env.P[state_id][action]:
                    profits[action] += prob * (reward + gamma * v_inp[dst_state])
            new_policy[state_id] = np.argmax(profits)
        return new_policy


    env = gym.make('Taxi-v3')
    gamma = 0.9
    cum_reward = 0
    n_rounds = 500
    env.reset()
    for t_rounds in range(n_rounds):
        # init env and value function
        observation = env.reset()
        v = np.zeros(env.nS)

        # solve MDP
        for _ in range(100):
            v_old = v.copy()
            v = iterate_value_function(v, gamma, env)
            if np.all(v == v_old):
                break
        policy = build_greedy_policy(v, gamma, env).astype(int)

        # apply policy
        for t in range(1000):
            action = policy[observation]
            observation, reward, done, info = env.step(action)
            cum_reward += reward
            if done:
                break
        if t_rounds % 50 == 0 and t_rounds > 0:
            print(cum_reward * 1.0 / (t_rounds + 1))
    env.close()

Listing 2.3 Value Iteration for Gym Taxi

The Taxi problem has 500 discrete states: there are 25 taxi positions, five possible
locations of the passenger (including the case when the passenger is in the taxi),
and 4 destination locations (25 × 5 × 4).
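
The count of 500 states follows from packing the three components into a single integer. The sketch below shows one way to do this and to undo it; the function names and the ordering of the factors are assumptions for illustration and need not match the ordering used inside Gym's own Taxi implementation.

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack (row, col, passenger location, destination) into one state index."""
    # 5 rows x 5 columns x 5 passenger locations x 4 destinations = 500 states
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination

def decode(state):
    """Inverse of encode: recover the four components from a state index."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination

all_states = {
    encode(r, c, p, d)
    for r in range(5) for c in range(5) for p in range(5) for d in range(4)
}
```

Such a flat index is what makes a tabular representation possible: the value function for Taxi is simply an array of 500 numbers.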
The environment returns a new result tuple at each step. There are six discrete
deterministic actions for the Taxi driver:
0: Move south
1: Move north
2: Move east
3: Move west
4: Pick up passenger
5: Drop off passenger
There is a reward of −1 for each action and an additional reward of +20 for
delivering the passenger, and a reward of −10 for executing actions pickup and
dropoff illegally.
The Taxi environment has a simple transition function, which is used by the
agent in the value iteration code.8 Listing 2.3 shows an implementation of value
iteration that uses the Taxi environment to find a solution. This code is written by
Mikhail Trofimov, and illustrates clearly how value iteration first creates the value
function for the states, and then forms a policy by finding the best action
in each state, in the build-greedy-policy function.9
To get a feeling for how the algorithms work, please use the value iteration code
with the Gym Taxi environment, see Listing 2.3. Run the code, and play around
with some of the hyperparameters to familiarize yourself a bit with Gym and with
planning by value iteration. Try to visualize for yourself what the algorithm is
doing. This will prepare you for the more complex algorithms that we will look
into next.

2.2.4.2 Model-Free Learning

The value iteration algorithm can compute the policy function. It uses the transition
model in its computation. Frequently, we are in a situation when the transition
probabilities are not known to the agent, and we need other methods to compute
the policy function. For this situation, model-free algorithms have been developed.
8 Note that the code uses the environment to compute the next state, so that we do not have to
implement a version of the transition function for the agent.
9 https://gist.github.com/geffy/b2d16d01cbca1ae9e13f11f678fa96fd#file-taxi-vi-py

Name              Approach                                    Ref

Value Iteration   Model-based enumeration                     [72, 15]
SARSA             On-policy temporal difference model-free    [645]
Q-learning        Off-policy temporal difference model-free   [831]
Table 2.1 Tabular Value-Based Approaches

The development of these model-free methods is a major milestone of reinforcement
learning, and we will spend some time to understand how they work. We will
start with value-based model-free algorithms. We will see how, when the agent does
not know the transition function, an optimal policy can be learned by sampling
rewards from the environment. Table 2.1 lists value iteration in conjunction with
the value-based model-free algorithms that we cover in this chapter. (Policy-based
model-free algorithms will be covered in Chap. 4.)
These algorithms are based on a few principles. First we will discuss how the
principle of sampling can be used to construct a value function. We discuss both
full-episode Monte Carlo sampling and single-step temporal difference learning;
we encounter the principle of bootstrapping and the bias-variance trade-off; and
we will see how the value function can be used to find the best actions, to form the
policy.
Second, we will discuss which mechanisms for action selection exist, where we
will encounter the exploration/exploitation trade-off. Third, we will discuss how to
learn from the rewards of the selected actions. We will encounter on-policy learning
and off-policy learning. Finally, we will discuss two full algorithms in which all
these concepts come together: SARSA and Q-learning. Let us now start by having a
closer look at sampling actions with Monte Carlo sampling and temporal difference
learning.

Monte Carlo Sampling

A straightforward way to sample rewards is to generate a random episode, and use
its return to update the value function at the visited states. This approach consists
of two loops: a simple loop over the time steps of the episode, embedded in a loop
to sample long enough for the value function to converge. This approach, of
randomly sampling full episodes, has become known as the Monte Carlo approach
(after the famous casino, because of the random action selection).
Listing 2.4 shows code for the Monte Carlo approach. We see three elements in
the code. First, the main variables are initialized. Then a loop for the desired number
of total samples performs the unrolling of the episodes. For each episode the state,
action and reward lists are initialized, and then filled with the samples from the
environment until we hit the terminal state of the episode.10 Then, at the end of
the episode, the return is calcuated in variable 𝑔 (the return of a state is the sum of
10 With epsilon-greedy action selection, see next subsection.
46 2 Tabular Value-Based Reinforcement Learning

import gym
import numpy as np

def monte_carlo(n_samples, ep_length, alpha, gamma):
    # 0: initialize the Q-table (one row per state, one column per action)
    total_t = 0
    Qsa = np.zeros((env.observation_space.n, env.action_space.n))

    # sample until n_samples environment steps have been taken
    while total_t < n_samples:

        # 1: generate a full episode
        s = env.reset()  # classic Gym API (gym < 0.26): reset() returns the state
        s_ep = []
        a_ep = []
        r_ep = []
        for t in range(ep_length):
            a = select_action(s, Qsa)
            s_next, r, done, _ = env.step(a)  # classic Gym API: 4-tuple
            s_ep.append(s)
            a_ep.append(a)
            r_ep.append(r)

            total_t += 1
            if done or total_t >= n_samples:
                break
            s = s_next

        # 2: update Q function with a full episode (incremental implementation)
        g = 0.0
        for t in reversed(range(len(a_ep))):
            s = s_ep[t]; a = a_ep[t]
            g = r_ep[t] + gamma * g  # discounted return from time step t
            Qsa[s, a] = Qsa[s, a] + alpha * (g - Qsa[s, a])

    return Qsa

def select_action(s, Qsa):
    # policy is epsilon-greedy
    epsilon = 0.1
    if np.random.rand() < epsilon:
        a = np.random.randint(low=0, high=env.action_space.n)
    else:
        a = np.argmax(Qsa[s])
    return a

env = gym.make('Taxi-v3')
monte_carlo(n_samples=10000, ep_length=100, alpha=0.1, gamma=0.99)

Listing 2.4 Monte Carlo Sampling code

its discounted future rewards). The learning rate 𝛼 is then used to update the 𝑄
function in an incremental implementation.11 The main purpose of the code is to
illustrate full episode learning. Since it is a complete working algorithm, the code
also uses on-policy learning with 𝜖-greedy selection, topics that we will discuss in
the next subsection.
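To make the backward computation of the return concrete, here is a minimal sketch of that single step in isolation. The helper name `discounted_returns` and the three-step reward list are made up for illustration; they are not part of Listing 2.4.

```python
def discounted_returns(rewards, gamma):
    """Return g_t = r_t + gamma * g_{t+1} for every time step t."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backward so that each return reuses the return of the next step.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Made-up three-step episode: two step penalties, then a final reward.
example = discounted_returns([-1.0, -1.0, 20.0], gamma=0.5)
print(example)  # [3.5, 9.0, 20.0]
```

Walking backward means that each state's return is obtained with one multiplication and one addition, instead of re-summing the rest of the episode for every state.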
The Monte Carlo approach is a basic building block of value-based reinforcement
learning. An advantage of the approach is its simplicity. A disadvantage is that a full
episode has to be sampled before the reward values are used, and sample efficiency
may be low. For this reason (and others, as we will soon see) another approach was
developed, inspired by the way the Bellman equation bootstraps on intermediate
values.

Temporal Difference Learning

Recall that in value iteration the value function was calculated recursively using
the values of successor states, following Bellman’s equation (Eq. 2.7).
Bootstrapping is the process of subsequent refinement by which old estimates
of a value are refined with new updates. It means literally: pull yourself up (out of
the swamp) by your boot straps. Bootstrapping solves the problem of computing a
final value when we only know how to compute step-by-step intermediate values.
Bellman’s recursive computation is a form of bootstrapping. In model-free learning,
we can use a similar approach, when the role of the transition function is replaced
by a sequence of environment samples.
A bootstrapping method that can be used to process the samples, and to refine
them to approximate the final state values, is temporal difference learning. Temporal
difference learning, TD for short, was introduced by Sutton [740] in 1988. The
temporal difference in the name refers to the difference in values between two time
steps, which are used to calculate the value at the new time step.
Temporal difference learning works by updating the current estimate of the state
value 𝑉 (𝑠) (the bootstrap-value) with an error value (new minus current) based on
the estimate of the next state that it has gotten through sampling the environment:

$$V(s) \leftarrow V(s) + \alpha\,[r' + \gamma V(s') - V(s)] \qquad (2.8)$$

Here $s$ is the current state, $s'$ the new state, and $r'$ the reward of the new state.
Note the introduction of 𝛼, the learning rate, which controls how fast the algorithm
learns (bootstraps). It is an important parameter; setting the value too high can be
detrimental since the last value then dominates the bootstrap process too much.
Finding the optimal value will require experimentation. The 𝛾 parameter is the
discount rate. The last term −𝑉 (𝑠) subtracts the value of the current state, to
compute the temporal difference. Another way to write this update rule is
11 The incremental implementation works for nonstationary situations, where the transition
probabilities may change, hence the previous 𝑄 values are subtracted.

import gym
import numpy as np

def temporal_difference(n_samples, alpha, gamma):
    # 0: initialize the Q-table (states x actions) and the start state
    Qsa = np.zeros((env.observation_space.n, env.action_space.n))
    s = env.reset()  # classic Gym API (gym < 0.26): reset() returns the state

    for t in range(n_samples):
        a = select_action(s, Qsa)
        s_next, r, done, _ = env.step(a)  # classic Gym API: 4-tuple

        # update Q function each time step with max of action values
        Qsa[s, a] = Qsa[s, a] + alpha * (r + gamma * np.max(Qsa[s_next]) - Qsa[s, a])

        if done:
            s = env.reset()
        else:
            s = s_next

    return Qsa

def select_action(s, Qsa):
    # policy is epsilon-greedy
    epsilon = 0.1
    if np.random.rand() < epsilon:
        a = np.random.randint(low=0, high=env.action_space.n)
    else:
        a = np.argmax(Qsa[s])
    return a

env = gym.make('Taxi-v3')
temporal_difference(n_samples=10000, alpha=0.1, gamma=0.99)

Listing 2.5 Temporal Difference Q-learning code

$$V(s) \leftarrow \alpha\,[r' + \gamma V(s')] + (1 - \alpha)\,V(s)$$

as a weighted average of the new temporal difference target and the old value.
Note the absence of transition model 𝑇 in the formula; temporal difference is a
model-free update formula. Listing 2.5 shows code for the TD approach, for the
state-action value function. (This code is off-policy, and uses the same 𝜖-greedy
selection function as Monte Carlo sampling.)
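The two ways of writing the temporal difference update for the state value are algebraically the same, and a few lines of code can confirm this on numbers. The sketch below is illustrative only: the function names `td_update` and `td_update_blend` and the input values are assumptions, chosen so that the arithmetic is easy to follow by hand.

```python
def td_update(v_s, reward, v_next, alpha, gamma):
    """Error form: V(s) <- V(s) + alpha * (r' + gamma * V(s') - V(s))."""
    td_target = reward + gamma * v_next
    return v_s + alpha * (td_target - v_s)

def td_update_blend(v_s, reward, v_next, alpha, gamma):
    """Blend form: V(s) <- alpha * (r' + gamma * V(s')) + (1 - alpha) * V(s)."""
    td_target = reward + gamma * v_next
    return alpha * td_target + (1 - alpha) * v_s

# Made-up numbers: V(s) = 2, r' = 1, V(s') = 4, alpha = 0.5, gamma = 0.5.
a = td_update(2.0, 1.0, 4.0, alpha=0.5, gamma=0.5)
b = td_update_blend(2.0, 1.0, 4.0, alpha=0.5, gamma=0.5)
print(a, b)  # 2.5 2.5
```

The target here is 1 + 0.5 · 4 = 3, and with 𝛼 = 0.5 the new estimate lands halfway between the old value 2 and the target 3.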
The introduction of the temporal difference method has allowed model-free
methods to be used successfully in various reinforcement learning settings. Most
notably, it was the basis of the program TD-Gammon, that beat human world-
champions in the game of Backgammon in the early 1990s [763].

Fig. 2.9 High and Low Bias, and High and Low Variance

Bias-Variance Trade-off

A crucial difference between the Monte Carlo method and the temporal difference
method is the use of bootstrapping to calculate the value function. The use of
bootstrapping has an important consequence: it trades off bias and variance (see
Fig. 2.9). Monte Carlo does not use bootstrapping. It performs a full episode with
many random action choices before it uses the reward. As such, its action choices
are unbiased (they are fully random), they are not influenced by previous reward
values. However, the fully random choices also cause a high variance of returns
between episodes. We say that Monte Carlo is a low-bias/high-variance algorithm.
In contrast, temporal difference bootstraps the 𝑄-function with the values of
the previous steps, refining the function values with the rewards after each single
step. It learns more quickly, at each step, but once a step has been taken, old reward
values linger around in the bootstrapped function value for a long time, biasing the
function value. On the other hand, because these old values are part of the new
bootstrapped value, the variance is lower. Thus, because of bootstrapping, TD is
a high-bias/low-variance method. Figure 2.9 illustrates the concepts of bias and
variance with pictures of dart boards.
Both approaches have their uses in different circumstances. In fact, we can think
of situations where a middle ground (of medium bias/medium variance) might be
useful. This is the idea behind the so-called n-step approach: do not sample a full
episode, and also not a single step, but sample a few steps at a time before using
the reward values. The n-step algorithm has medium bias and medium variance.

Fig. 2.10 Single Step Temporal Difference Learning, N-Step, and Monte Carlo Sampling [743]

Figure 2.10 from [743] illustrates the relation between Monte Carlo sampling, n-step,
and temporal difference learning.
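The middle ground of the n-step approach can be sketched as a single target computation: sum n sampled rewards, then bootstrap on the value estimate of the state that is reached. The function name `n_step_target`, the indexing convention and the numbers below are assumptions made for this illustration.

```python
def n_step_target(rewards, values, t, n, gamma):
    """n-step target from time t: n discounted rewards plus a bootstrapped value.

    rewards[k] is the reward received after step k; values[k] is the current
    estimate V(s_k), with values[len(rewards)] the value of the final state
    (0 if that state is terminal).
    """
    end = min(t + n, len(rewards))  # do not run past the end of the episode
    target = 0.0
    for k in range(t, end):
        target += gamma ** (k - t) * rewards[k]
    # Bootstrap on the estimate of the state reached after the sampled steps.
    target += gamma ** (end - t) * values[end]
    return target

rewards = [1.0, 1.0, 1.0, 1.0]
values = [5.0, 5.0, 5.0, 5.0, 0.0]  # last entry: terminal state
print(n_step_target(rewards, values, t=0, n=1, gamma=0.5))  # 3.5   (one-step TD)
print(n_step_target(rewards, values, t=0, n=2, gamma=0.5))  # 2.75
print(n_step_target(rewards, values, t=0, n=4, gamma=0.5))  # 1.875 (full episode)
```

With n = 1 the target leans heavily on the (possibly wrong) estimate 5; with n equal to the episode length no estimate is used at all, which is the Monte Carlo case.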

Find Policy by Value-based Learning

The goal of reinforcement learning is to construct the policy with the highest
cumulative reward. Thus, we must find the best action 𝑎 in each state 𝑠. In the
value-based approach we know the value functions 𝑉 (𝑠) or 𝑄(𝑠, 𝑎). How can that
help us to find action 𝑎? In a discrete action space, there is at least one discrete
action with the highest value. Thus, if we have the optimal state-value 𝑉 ★, then the
optimal policy can be found by finding the action with that value. This relationship
is given by
$$\pi^\star = \max_{\pi} V^{\pi}(s) = \max_{a,\pi} Q^{\pi}(s, a)$$

and the arg max function finds the best action for us

$$a^\star = \arg\max_{a \in A} Q^\star(s, a).$$

In this way the optimal policy sequence of best actions 𝜋★ (𝑠) can be recovered from
the values, hence the name value-based method [846].
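Recovering the policy from the values is a one-line arg max per state. The following sketch shows this on a tiny hand-made Q-table; the state names, the values and the helper `greedy_policy` are hypothetical.

```python
def greedy_policy(q_table):
    """Map each state to the index of its highest-valued action."""
    policy = {}
    for state, action_values in q_table.items():
        best_action = max(range(len(action_values)), key=lambda a: action_values[a])
        policy[state] = best_action
    return policy

# Made-up Q-values for three states and two actions (0 and 1).
q_table = {"s0": [0.1, 0.7], "s1": [0.4, 0.2], "s2": [0.0, 0.0]}
print(greedy_policy(q_table))  # {'s0': 1, 's1': 0, 's2': 0}
```

Note that ties (state `s2`) are broken in favor of the first action here; this is a design choice of the sketch, and breaking ties randomly is equally valid.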
A full reinforcement learning algorithm consists of a rule for the selection part
(downward) and a rule for the learning part (upward). Now that we know how to
calculate the value function (the up-motion in the tree diagram), let us see how
we can select the action in our model-free algorithm (the down-motion in the tree
diagram).

2.2.4.3 Exploration

Since there is no local transition function, model-free methods perform their state
changes directly in the environment. This may be an expensive operation, for
example, when a real-world robot arm has to perform a movement. The sampling
policy should choose promising actions to reduce the number of samples as much
as possible, and not waste any actions. What behavior policy should we use? It is
tempting to favor at each state the actions with the highest Q-value, since then we
would be following what is currently thought to be the best policy.
This approach is called the greedy approach. It appears attractive, but is short-
sighted and risks settling for local maxima. Following the trodden path based on
only a few early samples risks missing a potential better path. Indeed, the greedy
approach is high bias, using values based on few samples. We run the risk of circular
reinforcement, if we update the same behavior policy that we use to choose our
samples from. In addition to exploiting known good actions, a certain amount of
exploration of unknown actions is necessary. Smart sampling strategies use a mix
of the current behavior policy (exploitation) and randomness (exploration) to select
which action to perform in the environment.

Bandit Theory

The exploration/exploitation trade-off, the question of how to get the most reliable
information at the least cost, has been studied extensively in the literature for single
step decision problems [346, 845]. The field has the colorful name of multi-armed
bandit theory [30, 443, 279, 632]. A bandit in this context refers to a casino slot
machine, with not one arm, but many arms, each with a different and unknown
payout probability. Each trial costs a coin. The multi-armed bandit problem is then
to find a strategy that finds the arm with the highest payout at the least cost.
A multi-armed bandit is a single-state single-decision reinforcement learning
problem, a one-step non-sequential decision making problem, with the arms repre-
senting the possible actions. This simplified model of stochastic decision making
allows the in-depth study of exploration/exploitation strategies.
Single-step exploration/exploitation questions arise for example in clinical trials,
where new drugs are tested on test-subjects (real people). The bandit is the trial, and
the arms are the choice how many of the test subjects are given the real experimental
drug, and how many are given the placebo. This is a serious setting, since the cost
may be measured in the quality of human lives.
In a conventional fixed randomized controlled trial (supervised setup) the sizes
of the groups that get the experimental drugs and the control group would be fixed,

Fig. 2.11 Adaptive Trial [5]

and the confidence interval and the duration of the test would also be fixed. In an
adaptive trial (bandit setup) the sizes would adapt during the trial depending on
the outcomes, with more people getting the drug if it appears to work, and fewer if
it does not.
Let us have a look at Fig. 2.11. Assume that the learning process is a clinical trial
in which three new compounds are tested for their medical effect on test subjects.
In the fixed trial (left panel) all test subjects receive the medicine of their group to
the end of the test period, after which the data set is complete and we can determine
which of the compounds has the best effect. At that point we know which group has
had the best medicine, and which two thirds of the subjects did not, with possibly
harmful effect. Clearly, this is not a satisfactory situation. It would be better if we
could gradually adjust the proportion of the subjects that receive the medicine
that currently looks best, as our confidence in our test results increases as the trial
progresses. Indeed, this is what reinforcement learning does (Fig. 2.11, right panel).
It uses a mix of exploration and exploitation, adapting the treatment, giving more
subjects the promising medicine, while achieving the same confidence as the static
trial at the end [443, 442].

𝝐-greedy Exploration

A popular pragmatic exploration/exploitation approach is to use a fixed ratio of

exploration versus exploitation. This approach is known as the 𝜖-greedy approach,
which is to mostly try the (greedy) action that currently has the highest policy value
except to explore an 𝜖 fraction of times a randomly selected other action. If 𝜖 = 0.1
then 90% of the times the currently-best action is taken, and 10% of the times a
random other action.
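The 𝜖-greedy rule can be tried out on the multi-armed bandit of the previous paragraphs. The sketch below is a minimal simulation under stated assumptions: the payout probabilities are invented, the names `epsilon_greedy` and `run_bandit` are hypothetical, and the exploration step draws uniformly from all arms (including the currently best one), as the `select_action` function in Listing 2.4 does.

```python
import random

def epsilon_greedy(estimates, epsilon, rng):
    """With probability epsilon pick a random arm, otherwise the best-looking arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda a: estimates[a])

def run_bandit(payout_probs, epsilon, n_pulls, seed=0):
    """Pull arms n_pulls times; return pull counts and estimated payouts per arm."""
    rng = random.Random(seed)
    counts = [0] * len(payout_probs)
    estimates = [0.0] * len(payout_probs)  # running average payout per arm
    for _ in range(n_pulls):
        arm = epsilon_greedy(estimates, epsilon, rng)
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean: move the estimate toward the new reward.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = run_bandit([0.2, 0.5, 0.8], epsilon=0.1, n_pulls=5000)
print(counts, [round(e, 2) for e in estimates])
```

Running the simulation with 𝜖 = 0 and with larger values of 𝜖 gives a feeling for the trade-off: without exploration the agent may never discover the better arms, while with much exploration it keeps spending coins on arms that it already knows to be poor.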
The algorithmic choice between greedily exploiting known information and
exploring unknown actions to gain new information is called the exploration/ex-
ploitation trade-off. It is a central concept in reinforcement learning; it determines
how much confidence we have in our outcome, and how quickly the confidence can
be increased and the variance reduced. A second approach is to use an adaptive
𝜖-ratio, that changes over time, or over other statistics of the learning process.
Other popular approaches to add exploration are to add Dirichlet-noise [425] or
to use Thompson sampling [770, 648].
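One simple way to realize an adaptive 𝜖 is a schedule that starts with much exploration and reduces it as learning progresses. The linear schedule below is only one possible choice; the function name `decayed_epsilon` and its default parameters are assumptions made for this sketch.

```python
def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps steps."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

for step in (0, 500, 1000, 5000):
    print(step, decayed_epsilon(step))
```

After `decay_steps` steps the value stays at `eps_end`. Setting `eps_end` to 0 gives a schedule in which exploration eventually stops altogether.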

2.2.4.4 Off-Policy Learning

In addition to the selection question, another main theme in the design of full
reinforcement learning algorithms is which learning method to use. Reinforcement
learning is concerned with learning an action-policy from the rewards. The agent
selects an action to perform, and learns from the reward that it gets back from the
environment. The question is whether the agent should perform updates strictly
on-policy—only learning from its most recent action—or allow off-policy updates,
learning from all available information.
In on-policy learning, the learning takes place by using the value of the action
that was selected by the policy. The policy determines the action to take, and the
value of that action is used to update the value of the policy function: the learning
is on-policy.
There is, however, an alternative to this straightforward method. In off-policy
methods, the learning takes place by backing up values of another action, not
necessarily the one selected by the behavior policy. This method makes sense when
the agent explores. When the behavior policy explores, it selects a non-optimal
action. The policy does not perform the greedy exploitation action; of course,
this usually results in an inferior reward value. On-policy learning would then
blindly backup the value of the non-optimal exploration action. Off-policy learning,
however, is free to backup another value instead. It makes sense to choose the value
of the best action, and not the inferior one selected by the exploration policy. The
advantage of this off-policy approach is that it does not pollute the behavior policy
with a value that is most likely inferior.
The difference between on-policy and off-policy is only in how they act when
exploring the non-greedy action. In the case of exploration, off-policy learning can
be more efficient, by not stubbornly backing up the value of the action selected by
the behavior policy, but the value of an older, better, action.
An important point is that the convergence behavior of on-policy and off-policy
learning is different. In general, tabular reinforcement learning methods have been proven to
converge when the policy is greedy in the limit with infinite exploration (GLIE) [743].
This means that (1) if a state is visited infinitely often, that each action is also chosen
infinitely often, and that (2) in the limit the policy is greedy with respect to the
learned 𝑄 function. Off-policy methods learn from the greedy rewards and thus
converge to the optimal policy, after having sampled enough states. However, on-
policy methods with a fixed 𝜖 do not converge to the optimal policy, since they keep
selecting explorative actions. When we use a variable-𝜖-policy in which the value
of 𝜖 goes to zero, then on-policy methods do converge, since then they choose, in
the limit, the greedy action.12
A well-known tabular on-policy algorithm is SARSA.13 An even more well-
known off-policy algorithm is Q-learning.

On-Policy SARSA

SARSA is an on-policy algorithm [645]. On-policy learning updates the policy with
the action values of the policy. The SARSA update formula is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]. \qquad (2.9)$$

Going back to temporal difference (Eq. 2.8), we see that the SARSA formula looks
very much like TD, although now we deal with state-action values.
On-policy learning selects an action, evaluates it in the environment, and follows
the actions, guided by the behavior policy. The behavior policy is not specified by
the formula, but might be 𝜖-greedy, or another policy that trades off exploration
and exploitation. On-policy learning samples the state space following the behavior
policy, and improves the policy by backing up values of the selected actions. Note
that the term $Q(s_{t+1}, a_{t+1})$ can also be written as $Q(s_{t+1}, \pi(s_{t+1}))$, to highlight the
difference with off-policy learning. SARSA updates its Q-values using the Q-value of
the next state $s_{t+1}$ and the current policy’s action. The primary advantage of on-policy
learning is its predictive behavior.

Off-Policy Q-Learning

Off-policy learning is more complicated; it may learn its policy from actions that
are different from the one just taken.
The best-known off-policy algorithm is Q-learning [831]. It performs exploiting
and exploring selection actions as before, but it evaluates states as if a greedy policy
is used always, even when the actual behavior performed an exploration step.
The Q-learning update formula is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)]. \qquad (2.10)$$

12 However, in the next chapter, deep learning is introduced, and a complication arises. In deep
learning states and values are no longer exact but are approximated. Now, off-policy methods
become less stable than on-policy methods. In a neural network, states are “connected” via joint
features. The max-operator in off-policy methods pushes up training targets of these connected
states. As a consequence deep off-policy methods may not converge. A so-called deadly triad
of function approximation, bootstrapping and off-policy learning occurs that causes unstable
convergence. Because of this, with function approximation, on-policy methods are sometimes
favored.
13 The name of the SARSA algorithm is a play on the MDP symbols as they occur in the action
value update formula: $s, a, r, s', a'$.

The only difference from on-policy learning is that the $\gamma Q(s_{t+1}, a_{t+1})$ term from
Eq. 2.9 has been replaced by $\gamma \max_a Q(s_{t+1}, a)$. We now learn from backup values
of the best action, not the one that was actually evaluated. Listing 2.5 showed
the pseudocode for Q-learning. Indeed, the term temporal difference learning is
sometimes used for the Q-learning algorithm.
The reason that Q-learning is called off-policy is that it updates its Q-values
using the Q-value of the next state $s_{t+1}$, and the greedy action (not necessarily the
behavior policy’s action—it is learning off the behavior policy). Off-policy learning
collects all available information and uses it to construct the best target policy.
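The difference between the two update rules only shows when the behavior policy explores. The sketch below puts the SARSA target (Eq. 2.9) and the Q-learning target (Eq. 2.10) side by side for one such exploration step. All numbers and function names are made up for illustration.

```python
def sarsa_target(reward, q_next, next_action, gamma):
    """On-policy target: uses the action the behavior policy actually chose."""
    return reward + gamma * q_next[next_action]

def q_learning_target(reward, q_next, gamma):
    """Off-policy target: uses the greedy (maximum) action value."""
    return reward + gamma * max(q_next)

def update(q_sa, target, alpha):
    """Shared incremental update toward the chosen target."""
    return q_sa + alpha * (target - q_sa)

# Made-up situation: in the next state, action 0 is worth 4 and action 1 is worth 0.
q_next = [4.0, 0.0]
explored_action = 1  # the behavior policy explored the worse action
reward, gamma, alpha, q_sa = 1.0, 0.5, 0.5, 2.0

print(update(q_sa, sarsa_target(reward, q_next, explored_action, gamma), alpha))  # 1.5
print(update(q_sa, q_learning_target(reward, q_next, gamma), alpha))              # 2.5
```

SARSA backs up the value of the explored action and lowers its estimate from 2 to 1.5, whereas Q-learning backs up the value of the best action and raises it to 2.5. When the behavior policy selects the greedy action, both targets coincide.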

Sparse Rewards and Reward Shaping

Before we conclude this section, we should discuss sparsity. For some environments
a reward exists for each state. For the supermarket example a reward can be calcu-
lated for each state that the agent has walked to. (The reward is the opposite of the
cost expended in walking.) Environments in which a reward exists in each state are
said to have a dense reward structure.
For other environments rewards may exist for only some of the states. For
example, in chess, rewards only exist at terminal board positions where there is a
win or a draw. In all other states the return depends on the future states and must be
calculated by the agent by propagating reward values from future states up towards
the root state $s_0$. Such an environment is said to have a sparse reward structure.
Finding a good policy is more complicated when the reward structure is sparse.
A graph of the landscape of such a sparse reward function would show a flat
landscape with a few sharp mountain peaks. Reinforcement learning algorithms use
the reward-gradient to find good returns. Finding the optimum in a flat landscape
where the gradient is zero, is hard. In some applications it is possible to change the
reward function to have a shape more amenable to gradient-based optimization
algorithms such as we use in deep learning. Reward shaping can make all the
difference when no solution can be found with a naive reward function. It is a way
of incorporating heuristic knowledge into the MDP. A large literature on reward
shaping and heuristic information exists [560]. The use of heuristics on board games
such as chess and checkers can also be regarded as reward shaping.
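As an illustration of how heuristic knowledge can enter the reward, the sketch below uses one common form of reward shaping, potential-based shaping, in which a bonus $\gamma \Phi(s') - \Phi(s)$ is added to the environment reward. This particular form is not described in the text above; the grid world, the Manhattan-distance potential and all names are assumptions of this example.

```python
def manhattan_potential(state, goal):
    """Hypothetical heuristic: states closer to the goal get a higher potential."""
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))

def shaped_reward(reward, state, next_state, goal, gamma):
    """Add the potential difference gamma * phi(s') - phi(s) to the sparse reward."""
    bonus = gamma * manhattan_potential(next_state, goal) - manhattan_potential(state, goal)
    return reward + bonus

goal = (4, 4)
# The sparse environment reward is 0 for both moves; shaping tells them apart.
print(shaped_reward(0.0, (0, 0), (0, 1), goal, gamma=1.0))  # 1.0  (step toward the goal)
print(shaped_reward(0.0, (0, 1), (0, 0), goal, gamma=1.0))  # -1.0 (step away from it)
```

The flat landscape now has a slope toward the goal, so that the agent receives a learning signal long before it reaches a rewarding state.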

2.2.4.5 Hands On: Q-learning on Taxi

To get a feeling for how these algorithms work in practice, let us see how Q-learning
solves the Taxi problem.
In Sect. 2.2.4.1 we discussed how value iteration can be used for the Taxi problem,
provided that the agent has access to the transition model. We will now see how
we solve this problem if we do not have the transition model. Q-learning samples
actions, and records the reward values in a Q-table, converging to the state-action
value function. When in all states the best values of the best actions are known,
then these can be used to sequence the optimal policy.
Let us see how a value-based model-free algorithm solves a simple 5 × 5 Taxi
problem. Refer to Fig. 2.8 on page 42 for an illustration of Taxi world.
Please recall that in Taxi world, the taxi can be in one of 25 locations and there
are 25 × (4 + 1) × 4 = 500 different states that the environment can be in.
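The count of 500 states can be checked by packing the four components of a Taxi state into a single table index. The ordering used by `encode` below is an assumption of this sketch and need not be the one used inside the Gym environment; both helper names are hypothetical.

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack (row, column, passenger location, destination) into one index in [0, 500)."""
    index = taxi_row
    index = index * 5 + taxi_col       # 5 columns
    index = index * 5 + passenger_loc  # 4 pick-up points + 1 for "in the taxi"
    index = index * 4 + destination    # 4 possible destinations
    return index

def decode(index):
    """Inverse of encode: recover the four components from the index."""
    destination = index % 4
    index //= 4
    passenger_loc = index % 5
    index //= 5
    taxi_col = index % 5
    taxi_row = index // 5
    return taxi_row, taxi_col, passenger_loc, destination

print(encode(0, 0, 0, 0))  # 0
print(encode(4, 4, 4, 3))  # 499
print(decode(499))         # (4, 4, 4, 3)
```

Such an index is what allows the Q-values to be stored in a plain two-dimensional array with 500 rows and one column per action.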
We follow the reward model as it is used in the Gym Taxi environment. Recall
that our goal is to find a policy (actions in each state) that leads to the highest
cumulative reward. Q-learning learns the best policy through guided sampling.
The agent records the rewards that it gets from actions that it performs in the
environment. The Q-values are the expected rewards of the actions in the states.
The agent uses the Q-values to guide which actions it will sample. Q-values 𝑄(𝑠, 𝑎)
are stored in an array that is indexed by state and action. The Q-values guide the
exploration: higher values indicate better actions.
Listing 2.6 shows the full Q-learning algorithm, in Python, after [395]. It uses
an 𝜖-greedy behavior policy: mostly the best action is followed, but in a certain
fraction a random action is chosen, for exploration. Recall that the Q-values are
updated according to the Q-learning formula:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)]$$

where 0 ≤ 𝛾 ≤ 1 is the discount factor and 0 < 𝛼 ≤ 1 the learning rate. Note that
Q-learning uses bootstrapping, and the initial Q-values are set to an arbitrary value, such as
zero or a random number (their influence will disappear slowly due to the learning rate).
Q-learning learns the best action in the current state by looking at the reward
for the current state-action combination, plus the maximum rewards for the next
state. Eventually the best policy is found in this way, and the taxi will consider the
route consisting of a sequence of the best rewards.
To summarize informally:
1. Initialize the Q-table to random values
2. Select a state 𝑠
3. For all possible actions from 𝑠 select the one with the highest Q-value and travel
to this state, which becomes the new 𝑠, or, with 𝜖 greedy, explore
4. Update the values in the Q-array using the equation
5. Repeat until the goal is reached; when the goal state is reached, repeat the process
until the Q-values stop changing (much), then stop.
Listing 2.6 shows Q-learning code for finding the policy in Taxi world.
The optimal policy can be found by sequencing together the actions with the
highest Q-value in each state. Listing 2.7 shows the code for this. The number of
illegal pickups/drop-offs is shown as penalty.
This example shows how the optimal policy can be found by the introduction of
a Q-table that records the quality of irreversible actions in each state, and uses that
table to converge the rewards to the value function. In this way the optimal policy
can be found model-free.

# Q learning for OpenAI Gym Taxi environment
import gym
import numpy as np
import random

# Environment setup (classic Gym API, gym < 0.26)
env = gym.make("Taxi-v3")  # same environment as Listings 2.4 and 2.5
env.reset()
env.render()

# Q[state, action] table implementation
Q = np.zeros([env.observation_space.n, env.action_space.n])
gamma = 0.7    # discount factor
alpha = 0.2    # learning rate
epsilon = 0.1  # epsilon greedy

for episode in range(1000):
    done = False
    total_reward = 0
    state = env.reset()
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore state space
        else:
            action = np.argmax(Q[state])        # Exploit learned values
        next_state, reward, done, info = env.step(action)  # invoke Gym
        next_max = np.max(Q[next_state])
        old_value = Q[state, action]

        # Q-learning update (Eq. 2.10)
        new_value = old_value + alpha * (reward + gamma * next_max - old_value)

        Q[state, action] = new_value
        total_reward += reward
        state = next_state
    if episode % 100 == 0:
        print("Episode {} Total Reward: {}".format(episode, total_reward))

Listing 2.6 Q-learning Taxi example, after [395]

Tuning your Learning Rate

Go ahead, implement and run this code, and play around to become familiar with
the algorithm. Q-learning is an excellent algorithm to learn the essence of how
reinforcement learning works. Try out different values for hyperparameters, such
as the exploration parameter 𝜖, the discount factor 𝛾 and the learning rate 𝛼. To
be successful in this field, it helps to have a feeling for these hyperparameters. A
choice close to 1 for the discount parameter is usually a good start, and a choice
close to 0 for the learning rate is a good start. You may feel a tendency to do the
opposite, to choose the learning rate as high as possible (close to 1) to learn as

total_epochs, total_penalties = 0, 0
ep = 100
for _ in range(ep):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False
    while not done:
        action = np.argmax(Q[state])  # follow the greedy policy from the Q-table
        state, reward, done, info = env.step(action)
        if reward == -10:             # illegal pickup or drop-off
            penalties += 1
        epochs += 1
    total_penalties += penalties
    total_epochs += epochs
print(f"Results after {ep} episodes:")
print(f"Average timesteps per episode: {total_epochs / ep}")
print(f"Average penalties per episode: {total_penalties / ep}")

Listing 2.7 Evaluate the optimal Taxi result, after [395]

quickly as possible. Please go ahead and see which works best in Q-learning (you
can have a look at [231]). In many deep learning environments a high learning rate
is a recipe for disaster: your algorithm may not converge at all, and Q-values can
become unbounded. Play around with tabular Q-learning, and approach your deep
learning slowly, with gentle steps!
The Taxi example is small, and you will get results quickly. It is well suited to build
up useful intuition. In later chapters, we will do experiments with deep learning
that take longer to converge, and acquiring intuition for tuning hyperparameter
values will be more expensive.
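For readers who want to experiment with the learning rate without installing Gym, the sketch below runs tabular Q-learning on a tiny hand-made corridor world and prints the learned state values for a few values of 𝛼. The environment, its rewards and all names are invented for this example; it is a toy stand-in for the Taxi experiment, not a replacement for it.

```python
import random

N_STATES, GOAL = 5, 4  # states 0..4 in a row, state 4 is terminal
LEFT, RIGHT = 0, 1

def step(state, action):
    """Deterministic corridor: -1 per move, +10 for reaching the goal."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    if next_state == GOAL:
        return next_state, 10.0, True
    return next_state, -1.0, False

def q_learning(alpha, gamma=0.9, epsilon=0.1, episodes=500, max_steps=100, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = max((LEFT, RIGHT), key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            # The goal is terminal, so its value contributes nothing to the target.
            next_max = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * next_max - q[state][action])
            state = next_state
            if done:
                break
    return q

for alpha in (0.01, 0.1, 0.5, 1.0):
    q = q_learning(alpha)
    print(alpha, [round(max(row), 2) for row in q])
```

Because this corridor is deterministic, a high learning rate does no harm here. Adding noise to the rewards in `step` is a simple way to see why a smaller learning rate can be the safer choice.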

Conclusion

We have now seen how a value function can be learned by an agent without having
the transition function, by sampling the environment. Model-free methods use
actions that are irreversible for the agent. The agent samples states and rewards
from the environment, using a behavior policy with the current best action, and
following an exploration/exploitation trade-off. The backup rule for learning is based
on bootstrapping, and can follow the rewards of the actions on-policy, including
the value of the occasional explorative action, or off-policy, always using the value
of the best action. We have seen two model-free tabular algorithms, SARSA and
Q-learning, where the value function is assumed to be stored in an exact table
structure.
In the next chapter we will move to network-based algorithms for high-
dimensional state spaces, based on function approximation with a deep neural
network.

Fig. 2.12 Cartpole and Mountain Car

2.3 Classic Gym Environments

Now that we have discussed at length the tabular agent algorithms, it is time to
have a look at the environments, the other part of the reinforcement learning model.
Without them, progress cannot be measured, and results cannot be compared in a
meaningful way. In a real sense, environments define the kind of intelligence that
our artificial methods can be trained to perform.
In this chapter we will start with a few smaller environments, that are suited
for the tabular algorithms that we have discussed. Two environments that have
been around since the early days of reinforcement learning are Mountain car and
Cartpole (see Fig. 2.12).

2.3.1 Mountain Car and Cartpole

Mountain car is a physics puzzle in which a car on a one-dimensional road is in

a valley between two mountains. The goal for the car is to drive up the mountain
and reach the flag on the right. The car’s engine can go forward and backward. The
problem is that the car’s engine is not strong enough to climb the mountain by itself
in a single pass [532], but it can do so with the help of gravity: by repeatedly going
back and forth the car can build up momentum. The challenge for the reinforcement
learning agent is to apply alternating backward and forward forces at the right
moment.
Cartpole is a pole-balancing problem. A pole is attached by a joint to a movable
cart, which can be pushed forward or backward. The pendulum starts upright, and
must be kept upright by applying either a force of +1 or −1 to the cart. The puzzle
ends when the pole falls over, or when the cart runs too far left or right [57]. Again
the challenge is to apply the right force at the right moment, solely by feedback of
the pole being upright or too far down.

2.3.2 Path Planning and Board Games

Navigation tasks and board games provide environments for reinforcement learning
that are simple to understand. They are well suited to reason about new agent
algorithms. Navigation problems, and the heuristic search trees built for board
games, can be of moderate size, and are then suited for determining the best action
by dynamic programming methods, such as tabular Q-learning, A*, branch and
bound, and alpha-beta [647]. These are straightforward search methods that do not
attempt to generalize to new, unseen, states. They find the best action in a space of
states, all of which are present at training time—the optimization methods do not
perform generalization from training to test time.

Path Planning

Path planning (Fig. 2.1) is a classic problem that is related to robotics [456, 265].
Popular versions are mazes, as we have seen earlier (Fig. 2.2). The Taxi domain (Fig. 2.8)
was originally introduced in the context of hierarchical problem solving [196]. Box-
pushing problems such as Sokoban are frequently used as well [386, 204, 543, 878],
see Fig. 2.3. The action space of these puzzles and mazes is discrete. Basic path and
motion planning can enumerate possible solutions [169, 322].
Small versions of mazes can be solved exactly by enumeration; larger instances
are only suitable for approximation methods. Mazes can be used to test algorithms
for path finding problems and are frequently used to do so. Navigation tasks and
box-pushing games such as Sokoban can feature rooms or subgoals, that may then
be used to test algorithms for hierarchically structured problems [236, 298, 615, 239]
(Chap. 8). The problems can be made more difficult by enlarging the grid and by
inserting more obstacles.
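As an illustration of solving a small maze exactly by enumeration, the following sketch runs a breadth-first search over all reachable squares. The maze layout and the function name are hypothetical; `S` marks the start, `G` the goal, and `#` an obstacle.

```python
from collections import deque

MAZE = ["S.#.",
        ".##.",
        "...G"]

def shortest_path_length(maze):
    """Breadth-first search: number of moves from S to G, or None if unreachable."""
    rows, cols = len(maze), len(maze[0])
    start = next((r, c) for r in range(rows) for c in range(cols)
                 if maze[r][c] == "S")
    frontier = deque([(start, 0)])
    visited = {start}
    while frontier:
        (r, c), dist = frontier.popleft()
        if maze[r][c] == "G":
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] != "#" and (nr, nc) not in visited):
                visited.add((nr, nc))
                frontier.append(((nr, nc), dist + 1))
    return None

print(shortest_path_length(MAZE))
```

Enlarging the grid or adding obstacles only changes the `MAZE` strings, which makes it easy to see how quickly the number of visited states grows.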

Board Games

Board games are a classic group of benchmarks for planning and learning since
the earliest days of artificial intelligence. Two-person zero-sum perfect information
board games such as tic tac toe, chess, checkers, Go, and shogi have been used to
test algorithms since the 1950s. The action space of these games is discrete. Notable
achievements were in checkers, chess, and Go, where human world champions
were defeated in 1994, 1997, and 2016, respectively [663, 124, 703].
The board games are typically used “as is” and are not changed for different
experiments (in contrast to mazes, that are often adapted in size or complexity for
specific purposes of the experiment). Board games are used for the difficulty of the
challenge. The ultimate goal is to beat human grandmasters or even the world
champion. Board games have been traditional mainstays of artificial intelligence,
mostly associated with the search-based symbolic reasoning approach to artificial
intelligence [647]. In contrast, the benchmarks in the next chapter are associated
with connectionist artificial intelligence.

Summary and Further Reading

This has been a long chapter, to provide a solid basis for the rest of the book. We
will summarize the chapter, and provide references for further reading.

Summary

Reinforcement learning can learn behavior that achieves high rewards, using
feedback from the environment. Reinforcement learning has no supervisory labels;
it can learn beyond a teacher, as long as there is an environment that provides
feedback.
Reinforcement learning problems are modeled as Markov decision problems,
consisting of a 5-tuple $(S, A, T_a, R_a, \gamma)$ for states, actions, transition, reward, and
discount factor. The agent performs an action, and the environment returns the
new state and the reward value to be associated with the new state.
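A minimal sketch of how the five MDP elements and the agent/environment interaction could be represented in code. The class, its field names, and the two-state example are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class MDP:
    states: list      # S
    actions: list     # A
    transition: dict  # T_a: (s, a) -> list of (probability, next_state)
    reward: dict      # R_a: (s, a, next_state) -> reward, 0 if absent
    gamma: float      # discount factor

    def step(self, s, a, rng=random):
        """The environment returns the new state and its reward."""
        probs, successors = zip(*self.transition[(s, a)])
        sp = rng.choices(successors, weights=probs)[0]
        return sp, self.reward.get((s, a, sp), 0.0)

mdp = MDP(
    states=["left", "right"],
    actions=["stay", "go"],
    transition={
        ("left", "stay"): [(1.0, "left")],
        ("left", "go"): [(0.9, "right"), (0.1, "left")],
        ("right", "stay"): [(1.0, "right")],
        ("right", "go"): [(1.0, "left")],
    },
    reward={("left", "go", "right"): 1.0},
    gamma=0.9,
)

print(mdp.step("left", "go"))
```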
Games and robotics are two important fields of application. Fields of application
can be episodic (they end—such as a game of chess) or continuous (they do not
end; a robot remains in the world). In continuous problems it often makes sense to
discount behavior that is far from the present; episodic problems typically do not
bother with a discount factor: a win is a win.
Environments can be deterministic (many board games are deterministic—boards
don’t move) or stochastic (many robotic worlds are stochastic—the world around a
robot moves). The action space can be discrete (a piece either moves to a square or
it does not) or continuous (typical robot joints move continuously over an angle).
The goal in reinforcement learning is to find the optimal policy that gives for
all states the best actions, maximizing the cumulative future reward. The policy
function is used in two different ways. In a discrete environment the policy function
𝑎 = 𝜋(𝑠) returns for each state the best action in that state. (Alternatively the value
function returns the value of each action in each state, out of which the argmax
function can be used to find the actions with the highest value.)
The optimal policy can be found by finding the maximal value of a state. The
value function 𝑉 (𝑠) returns the expected reward for a state. When the transition
function $T_a(s, s')$ is present, the agent can use Bellman’s equation, or a dynamic
programming method to recursively traverse the behavior space. Value iteration is
one such dynamic programming method. Value iteration traverses all actions of
all states, backing up reward values, until the value function stops changing. The
state-action value 𝑄(𝑠, 𝑎) determines the value of an action of a state.
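The value iteration loop described above can be sketched as follows, assuming the transition and reward functions are available as Python callables. The four-state chain used as an example is hypothetical: state 3 is an absorbing terminal state and entering it yields a reward of 1.

```python
def value_iteration(states, actions, T, R, gamma, tol=1e-8):
    """Back up values over all actions of all states until V stops changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (R(s, a, sp) + gamma * V[sp]) for p, sp in T(s, a))
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

def T(s, a):
    """Deterministic chain 0-1-2-3; returns a list of (probability, successor)."""
    if s == 3:
        return [(1.0, 3)]
    return [(1.0, min(3, max(0, s + a)))]

def R(s, a, sp):
    return 1.0 if s != 3 and sp == 3 else 0.0

V = value_iteration(states=[0, 1, 2, 3], actions=[-1, +1], T=T, R=R, gamma=0.9)
print(V)
```

With a discount factor of 0.9, the values fall off by a factor 0.9 for every step further away from the rewarding transition.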

Bellman’s equation calculates the value of a state by calculating the value of
successor states. Accessing successor states (by following the action and transition)
is also called expanding a successor state. In a tree diagram successor states are
called child nodes, and expanding is a downward action. Backpropagating the
reward values to the parent node is a movement upward in the tree.
Methods where the agent makes use of the transition model are called model-
based methods. When the agent does not use the transition model, they are model-
free methods. In many situations the learning agent does not have access to the
transition model of the environment, and planning methods cannot be used by the
agent. Value-based model-free methods can find an optimal policy by using only
irreversible actions, sampling the environment to find the value of the actions.
A major determinant in model-free reinforcement learning is the exploration/ex-
ploitation trade-off, or how much of the information that has been learned from the
environment is used in choosing actions to sample. We discussed the advantages
of exploiting the latest knowledge in settings where environment actions are very
costly, such as clinical trials. A well-known exploration/exploitation method is
𝜖-greedy, where the greedy (best) action is followed from the behavior policy, except
for a fraction 𝜖 of the time, when random exploration is performed. Always following the policy’s
best action runs the risk of getting stuck in a cycle. Exploring random nodes allows
breaking free of such cycles.
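A minimal sketch of 𝜖-greedy action selection, assuming the action values of the current state are given as a list. The function name and the example values are illustrative only.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

rng = random.Random(0)
picks = [epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.1, rng=rng)
         for _ in range(10_000)]
print("fraction of greedy picks:", picks.count(1) / len(picks))
```

Note that the random branch can also return the greedy action, so the greedy action is chosen slightly more often than a fraction 1 − 𝜖 of the time.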
So far we have discussed the action selection operation. How should we process
the rewards that are found at nodes? Here we introduced another fundamental
element of reinforcement learning: bootstrapping, or finding a value by refining a
previous value. Temporal difference learning uses the principle of bootstrapping to
find the value of a state by adding appropriately discounted future reward values
to the state value function.
We have now discussed both up and down motions, and can construct full model-
free algorithms. The best-known algorithm may well be Q-learning, which learns
the action-value function of each action in each state through off-policy temporal
difference learning. Off-policy algorithms improve the policy function with the
value of the best action, even if the (exploring) behavior action was different.
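The difference between off-policy and on-policy learning shows up in the update target. The sketch below compares the two targets for one transition; the Q-table values are made up, and the exploring next action is assumed to be the worse of the two.

```python
import numpy as np

def q_learning_target(Q, r, sp, gamma):
    """Off-policy: bootstrap on the best action in the next state."""
    return r + gamma * np.max(Q[sp])

def sarsa_target(Q, r, sp, ap, gamma):
    """On-policy: bootstrap on the action the behavior policy actually takes."""
    return r + gamma * Q[sp, ap]

Q = np.array([[0.0, 0.0],
              [1.0, 5.0]])      # Q[state, action], hypothetical values
r, sp, gamma = 1.0, 1, 0.9
ap = 0                           # exploring action chosen in state sp

print("Q-learning target:", q_learning_target(Q, r, sp, gamma))
print("SARSA target     :", sarsa_target(Q, r, sp, ap, gamma))
```

Q-learning ignores the exploratory choice and uses the value 5.0 of the best next action, while SARSA uses the value 1.0 of the action that was actually selected.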
In the next chapters we will look at value-based and policy-based model-free
methods for large, complex problems, that make use of function approximation
(deep learning).

Further Reading

There is a rich literature on tabular reinforcement learning. A standard work for
tabular value-based reinforcement learning is Sutton and Barto [743]. Two condensed
introductions to reinforcement learning are [27, 255]. Another major work
on reinforcement learning is Bertsekas and Tsitsiklis [86]. Kaelbling has written an
important survey article on the field [389]. The early works of Richard Bellman on
dynamic programming, and planning algorithms are [72, 73]. For a recent treatment
of games and reinforcement learning, with a focus on heuristic search methods and
the methods behind AlphaZero, see [600].
The methods of this chapter are based on bootstrapping [72] and temporal
difference learning [740]. The on-policy algorithm SARSA [645] and the off-policy
algorithm Q-Learning [831] are among the best known exact, tabular, value-based
model-free algorithms.
Mazes and Sokoban grids are sometimes procedurally generated [692, 329, 781].
The goal for the algorithms is typically to find a solution for a grid of a certain
difficulty class, to find a shortest path solution, or, in transfer learning, to learn to
solve a class of grids by training on a different class of grids [859].
For general reference, one of the major textbooks on artificial intelligence is
written by Russell and Norvig [647]. A more specific textbook on machine learning
is by Bishop [93].

Exercises

We will end with questions on key concepts, and with programming exercises to build
up more experience.

Questions

The questions below are meant to refresh your memory, and should be answered
with yes, no, or short answers of one or two sentences.
1. In reinforcement learning the agent can choose which training examples are
generated. Why is this beneficial? What is a potential problem?
2. What is Grid world?
3. Which five elements does an MDP have to model reinforcement learning prob-
lems?
4. In a tree diagram, is successor selection of behavior up or down?
5. In a tree diagram, is learning values through backpropagation up or down?
6. What is 𝜏?
7. What is 𝜋(𝑠)?
8. What is 𝑉 (𝑠)?
9. What is 𝑄(𝑠, 𝑎)?
10. What is dynamic programming?
11. What is recursion?
12. Do you know a dynamic programming method to determine the value of a state?
13. Is an action in an environment reversible for the agent?
14. Mention two typical application areas of reinforcement learning.
15. Is the action space of games typically discrete or continuous?
16. Is the action space of robots typically discrete or continuous?
17. Is the environment of games typically deterministic or stochastic?
18. Is the environment of robots typically deterministic or stochastic?

19. What is the goal of reinforcement learning?
20. Which of the five MDP elements is not used in episodic problems?
21. Which model or function is meant when we say “model-free” or “model-based”?
22. What type of action space and what type of environment are suited for value-
based methods?
23. Why are value-based methods used for games and not for robotics?
24. Name two basic Gym environments.

Exercises

There is an even better way to learn about deep reinforcement learning than reading
about it, and that is to perform experiments yourself, to see the learning processes
unfold before your own eyes. The following exercises are meant as starting points
for your own discoveries in the world of deep reinforcement learning.
Consider using Gym to implement these exercises. Section 2.2.4.1 explains how
to install Gym.
1. Q-learning Implement Q-learning for Taxi, including the procedure to derive
the best policy for the Q-table. Go to Sect. 2.2.4.5 and implement it. Print the
Q-table, to see the values on the squares. You could print a live policy as the
search progresses. Try different values for 𝜖, the exploration rate. Does it learn
faster? Does it keep finding the optimal solution? Try different values for 𝛼, the
learning rate. Is it faster?
2. SARSA Implement SARSA; the code is in Listing 2.8. Compare your results to
Q-learning: can you see how SARSA chooses different paths? Try different 𝜖 and
𝛼.
3. Problem size How large can problems be before converging starts taking too
long?
4. Cartpole Run Cartpole with the greedy policy computed by value iteration. Can
you make it work? Is value iteration a suitable algorithm for Cartpole? If not,
why do you think it is not?

# SARSA for OpenAI Gym Taxi environment
# (written against the classic Gym API, in which step() returns four values)
import gym
import numpy as np
import random

# Environment setup
env = gym.make("Taxi-v2")  # recent Gym releases register this environment as "Taxi-v3"
env.reset()
env.render()

# Q[state, action] table implementation
Q = np.zeros([env.observation_space.n, env.action_space.n])
gamma = 0.7    # discount factor
alpha = 0.2    # learning rate
epsilon = 0.1  # epsilon greedy

for episode in range(1000):
    done = False
    total_reward = 0
    current_state = env.reset()
    if random.uniform(0, 1) < epsilon:
        current_action = env.action_space.sample()    # Explore state space
    else:
        current_action = np.argmax(Q[current_state])  # Exploit learned values
    while not done:
        next_state, reward, done, info = env.step(current_action)  # invoke Gym
        if random.uniform(0, 1) < epsilon:
            next_action = env.action_space.sample()   # Explore state space
        else:
            next_action = np.argmax(Q[next_state])    # Exploit learned values
        sarsa_value = Q[next_state, next_action]
        old_value = Q[current_state, current_action]

        # On-policy update: bootstrap on the action that will actually be taken
        new_value = old_value + alpha * (reward + gamma * sarsa_value - old_value)

        Q[current_state, current_action] = new_value
        total_reward += reward
        current_state = next_state
        current_action = next_action
    if episode % 100 == 0:
        print("Episode {} Total Reward: {}".format(episode, total_reward))

Listing 2.8 SARSA Taxi example, after [395]

Chapter 3
Deep Value-Based Reinforcement Learning

The previous chapter introduced the field of classic reinforcement learning. We
learned about agents and environments, and about states, actions, values, and policy
functions. We also saw our first planning and learning algorithms: value iteration,
SARSA and Q-learning. The methods in the previous chapter were exact, tabular,
methods, that work for problems of moderate size that fit in memory.
In this chapter we move to high-dimensional problems with large state spaces
that no longer fit in memory. We will go beyond tabular methods and use methods
to approximate the value function and to generalize beyond trained behavior. We
will do so with deep learning.
The methods in this chapter are deep, model-free, value-based methods, related
to Q-learning. We will start by having a closer look at the new, larger, environments
that our agents must now be able to solve (or rather, approximate). Next, we will
look at deep reinforcement learning algorithms. In reinforcement learning the
current behavior policy determines which action is selected next, a process that
can be self-reinforcing. There is no ground truth, as in supervised learning. The
targets of loss functions are no longer static, or even stable. In deep reinforcement
learning convergence to 𝑉 and 𝑄 values is based on a bootstrapping process, and
the first challenge is to find training methods that converge to stable function values.
Furthermore, since with neural networks the states are approximated based on their
features, convergence proofs can no longer count on identifying states individually.
For many years it was assumed that deep reinforcement learning is inherently
unstable due to a so-called deadly triad of bootstrapping, function approximation
and off-policy learning.
However, surprisingly, solutions have been found for many of the challenges.
By combining a number of approaches (such as the replay buffer and increased
exploration) the Deep Q-Networks algorithm (DQN) was able to achieve stable
learning in a high-dimensional environment. The success of DQN spawned a large
research effort to improve training further. We will discuss some of these new
methods.
The chapter is concluded with exercises, a summary, and pointers to further
reading.


Deep Learning: Deep reinforcement learning builds on deep supervised
learning, and this chapter and the rest of the book assume a basic understanding
of deep learning. When your knowledge of parameterized neural
networks and function approximation is rusty, this is the time to go to
Appendix B and take an in-depth refresher. The appendix also reviews
essential concepts such as training, testing, accuracy, overfitting and the
bias-variance trade-off. When in doubt, try to answer the questions on
page 332.

Core Concepts

• Stable convergence
• Replay buffer

Core Problem

• Achieve stable deep reinforcement learning in large problems

Core Algorithm

• Deep Q-network (Listing 3.6)

End-to-end Learning

Before the advent of deep learning, traditional reinforcement learning had been
used mostly on smaller problems such as puzzles, or the supermarket example.
Their state space fits in the memories of our computers. Reward shaping, in the
form of domain-specific heuristics, can be used to shoehorn the problem into a
computer, for example, in chess and checkers [124, 354, 662]. Impressive results are
achieved, but at the cost of extensive problem-specific reward shaping and heuristic
engineering [600]. Deep learning changed this situation, and reinforcement learning
is now used on high-dimensional problems that are too large to fit into memory.

Fig. 3.1 Example Game from the Arcade Learning Environment [71]

Fig. 3.2 Atari Experiments on the Cover of Nature

In the field of supervised learning, a yearly competition had created years of
steady progress in which the accuracy of image classification had steadily improved.
Progress was driven by the availability of ImageNet, a large database of labeled
images [237, 191], by increases in computation power through GPUs, and by steady
improvement of machine learning algorithms, especially in deep convolutional
neural networks. In 2012, a paper by Krizhevsky, Sutskever and Hinton presented a
method that out-performed other algorithms by a large margin, and approached the
performance of human image recognition [431]. The paper introduced the AlexNet
architecture (after the first name of the first author) and 2012 is often regarded as
the year of the breakthrough of deep learning. (See Appendix B.3.1 for details.) This
breakthrough raised the question whether something similar in deep reinforcement
learning could be achieved.
We did not have to wait long. Only a year later, in 2013, at the deep learning
workshop of one of the main AI conferences, a paper was presented on an algorithm
that could play 1980s Atari video games just by training on the pixel input of the
video screen (Fig. 3.1). The algorithm used a combination of deep learning and
Q-learning, and was named Deep Q-Network, or DQN [522, 523]. An illuminating
video of how it learned to play the game Breakout is here.¹ This was a breakthrough
for reinforcement learning. Many researchers at the workshop could relate to this
achievement, perhaps because they had spent hours playing Space Invaders, Pac-
Man and Pong themselves when they were younger. Two years after the presentation
at the deep learning workshop a longer article appeared in the journal Nature in
which a refined and expanded version of DQN was presented (see Fig. 3.2 for the
journal cover).
Why was this such a momentous achievement? Besides the fact that the problem
that was solved was easily understood, true eye-hand coordination of this com-
plexity had not been achieved by a computer before; furthermore, the end-to-end
learning from pixel to joystick implied artificial behavior that was close to how
humans play games. DQN essentially launched the field of deep reinforcement learn-
ing. For the first time the power of deep learning had been successfully combined
with behavior learning, for an imaginative problem.
A major technical challenge that was overcome by DQN was the instability of
the deep reinforcement learning process. In fact, there were convincing theoretical
analyses at the time that this instability was fundamental, and it was generally
assumed that it would be next to impossible to overcome [48, 283, 787, 743], since
the target of the loss-function depended on the convergence of the reinforcement
learning process itself. By the end of this chapter we will have covered the problems
of convergence and stability in reinforcement learning. We will have seen how
DQN addresses these problems, and we will also have discussed some of the many
further solutions that were devised after DQN.
But let us first have a look at the kind of new, high-dimensional, environments
that were the cause of these developments.

3.1 Large, High-Dimensional, Problems

In the previous chapter, Grid worlds and mazes were introduced as basic sequential
decision making problems in which exact, tabular, reinforcement learning methods
work well. These are problems of moderate complexity. The complexity of a problem
is related to the number of unique states that a problem has, or how large the state
space is. Tabular methods work for small problems, where the entire state space
fits in memory. This is for example the case with linear regression, which has only
one variable 𝑥 and two parameters 𝑎 and 𝑏, or the Taxi problem, which has a state
space of size 500. In this chapter we will be more ambitious and introduce various
games, most notably Atari arcade games. The state space of a single frame of Atari
video input is 210 × 160 pixels of 256 RGB color values = $256^{33600}$.

¹ https://www.youtube.com/watch?v=TmPfTpjtdgg

Fig. 3.3 Atari 2600 console

There is a qualitative difference between small (500) and large ($256^{33600}$)
problems.² For small problems the policy can be learned by loading all states of a problem
in memory. States are identified individually, and each has its own best action, that
we can try to find. Large problems, in contrast, do not fit in memory, the policy
cannot be memorized, and states are grouped together based on their features (see
Sect. B.1.3, where we discuss feature learning). A parameterized network maps
states to actions and values; states are no longer individually identifiable in a lookup
table.
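To get a feeling for the size of this number, the following sketch computes how many decimal digits $256^{33600}$ has, using logarithms rather than building the integer itself. It is plain arithmetic on the figures given in the text.

```python
import math

pixels = 210 * 160                      # pixels in one Atari frame
values_per_pixel = 256                  # color values per pixel, as in the text

# Number of decimal digits of values_per_pixel ** pixels
digits = math.floor(pixels * math.log10(values_per_pixel)) + 1

print("pixels per frame       :", pixels)
print("digits of 256^33600    :", digits)
print("Taxi state space (size):", 500)
```

A table with one entry per state is easy to store for 500 states, but a number with tens of thousands of digits cannot be enumerated, let alone stored, which is why states must be grouped by their features.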
When deep learning methods were introduced in reinforcement learning, larger
problems than before could be solved. Let us have a look at those problems.

3.1.1 Atari Arcade Games

Learning actions directly from high-dimensional sound and vision inputs is one of
the long-standing challenges of artificial intelligence. To stimulate this research, in
2012 a test-bed was created designed to provide challenging reinforcement learning
tasks. It was called the Arcade learning environment, or ALE [71], and it was based
on a simulator for 1980s Atari 2600 video games. Figure 3.3 shows a picture of a
distinctly retro Atari 2600 gaming console.
Among other things ALE contains an emulator of the Atari 2600 console. ALE
presents agents with a high-dimensional³ visual input (210 × 160 RGB video at
60 Hz, or 60 images per second) of tasks that were designed to be interesting and
challenging for human players (Fig. 3.1 showed an example of such a game and
Fig. 3.4 shows a few more). The game cartridge ROM holds 2-4 kB of game code,
while the console random-access memory is small, just 128 bytes (really, just 128
bytes, although the video memory is larger, of course). The actions can be selected
via a joystick (9 directions), which has a fire button (fire on/off), giving 18 actions
in total.

² See Sect. B.1.2, where we discuss the curse of dimensionality.
³ That is, high dimensional for machine learning. 210 × 160 pixels is not exactly high-definition
video quality.

Fig. 3.4 Screenshots of 4 Atari Games (Breakout, Pong, Montezuma’s Revenge, and Private Eye)
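The count of 18 joystick actions is the product of 9 stick positions and 2 fire-button states. A small sketch that enumerates them is given below; the direction names are hypothetical labels, not the identifiers used by ALE.

```python
from itertools import product

# Nine joystick positions: centered plus eight compass directions
DIRECTIONS = ["none", "up", "down", "left", "right",
              "up-left", "up-right", "down-left", "down-right"]
FIRE = [False, True]

# Every combination of a stick position with a fire-button state
ACTIONS = list(product(DIRECTIONS, FIRE))

print(len(ACTIONS))
```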
The Atari games provide challenging eye-hand coordination and reasoning tasks,
that are both familiar and challenging to humans, providing a good test-bed for
learning sequential decision making.
Atari games, with high-resolution video input at high frame rates, are an entirely
different kind of challenge than Grid worlds or board games. Atari is a step closer to
a human environment in which visual inputs should quickly be followed by correct
actions. Indeed, the Atari benchmark called for very different agent algorithms,
prompting the move from tabular algorithms to algorithms based on function
approximation and deep learning. ALE has become a standard benchmark in deep
reinforcement learning research.

3.1.2 Real-Time Strategy and Video Games

Real-time strategy games provide an even greater challenge than simulated 1980s
Atari consoles. Games such as StarCraft (Fig. 1.6) [573], and Capture the Flag [373]
have very large state spaces. These are games with large maps, many players, many
pieces, and many types of actions. The state space of StarCraft is estimated at
$10^{1685}$ [573], more than 1500 orders of magnitude larger than Go ($10^{170}$) [540, 786]
and more than 1635 orders of magnitude larger than chess ($10^{47}$) [355]. Most
real-time strategy games are multi-player, non-zero-sum, imperfect information games
that also feature high-dimensional pixel input, reasoning, and team collaboration.
The action space is stochastic and is a mix of discrete and continuous actions.
Despite the challenging nature, impressive achievements have been reported
recently in three games where human performance was matched or even ex-
ceeded [813, 80, 373], see also Chap. 7.
Let us have a look at the methods that can solve these very different types of
problems.

3.2 Deep Value-Based Agents

We will now turn to agent algorithms for solving large sequential decision problems.
The main challenge of this section is to create an agent algorithm that can learn a
good policy by interacting with the world—with a large problem, not a toy problem.
From now on, our agents will be deep learning agents.
The questions that we are faced with, are the following. How can we use deep
learning for high-dimensional and large sequential decision making environments?
How can tabular value and policy functions 𝑉, 𝑄, and 𝜋 be transformed into
$\theta$-parameterized functions $V_\theta$, $Q_\theta$, and $\pi_\theta$?

3.2.1 Generalization of Large Problems with Deep Learning

Recall from Appendix B that deep supervised learning uses a static dataset to
approximate a function, and that the labels are static targets in an optimization
process where the loss-function is minimized.
Deep reinforcement learning is based on the observation that bootstrapping is
also a kind of minimization process in which an error (or difference) is minimized. In
reinforcement learning this bootstrapping process converges on the true state value
and state-action value functions. However, the Q-learning bootstrapping process
lacks static ground truths; our data items are generated dynamically, and our loss-
function targets move. The movement of the loss-function targets is influenced by
the same policy function that the convergence process is trying to learn.
It has taken quite some effort to find deep learning algorithms that converge to
stable functions on these moving targets. Let us try to understand in more detail
how the supervised methods have to be adapted in order to work in reinforce-
ment learning. We do this by comparing three algorithmic structures: supervised
minimization, tabular Q-learning, and deep Q-learning.

def train_sl(data, net, alpha=0.001):            # train classifier
    for epoch in range(max_epochs):              # an epoch is one pass
        sum_sq = 0                               # reset to zero for each pass
        for (image, label) in data:
            output = net.forward_pass(image)     # predict
            sum_sq += (output - label) ** 2      # compute error
        grad = net.gradient(sum_sq)              # derivative of error
        net.backward_pass(grad, alpha)           # adjust weights
    return net

Listing 3.1 Network training pseudocode for supervised learning

def qlearn(environment, alpha=0.001, gamma=0.9, epsilon=0.05):
    Q[TERMINAL, _] = 0                           # Q-table: terminal state has value 0
    for episode in range(max_episodes):
        s = s0
        while s != TERMINAL:                     # perform steps of one full episode
            a = epsilongreedy(Q[s], epsilon)
            (r, sp) = environment(s, a)
            Q[s, a] = Q[s, a] + alpha * (r + gamma * max(Q[sp]) - Q[s, a])
            s = sp
    return Q

Listing 3.2 Q-learning pseudocode [831, 743]
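To see the pseudocode of Listing 3.2 run, here is a self-contained sketch on a hypothetical five-state chain, where the agent starts in state 0 and receives a reward of 1 on reaching the terminal state 4. The environment, the random tie-breaking, and the hyperparameter values are assumptions made for this example.

```python
import numpy as np

N_STATES, TERMINAL = 5, 4        # chain 0-1-2-3-4, state 4 is terminal
LEFT, RIGHT = 0, 1

def environment(s, a):
    """Return (reward, next state) for taking action a in state s."""
    sp = max(0, s - 1) if a == LEFT else s + 1
    return (1.0 if sp == TERMINAL else 0.0), sp

def epsilongreedy(q_row, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))        # explore
    best = np.flatnonzero(q_row == q_row.max())     # break ties randomly
    return int(rng.choice(best))

def qlearn(max_episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, 2))                     # Q[TERMINAL] stays zero
    for _ in range(max_episodes):
        s = 0
        while s != TERMINAL:                        # one full episode
            a = epsilongreedy(Q[s], epsilon, rng)
            r, sp = environment(s, a)
            Q[s, a] += alpha * (r + gamma * Q[sp].max() - Q[s, a])
            s = sp
    return Q

Q = qlearn()
print(np.round(Q, 3))
print("greedy policy:", [int(np.argmax(Q[s])) for s in range(TERMINAL)])
```

Random tie-breaking is used so that the initial all-zero table does not lock the agent into always choosing the first action.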

3.2.1.1 Minimizing Supervised Target Loss

Listing 3.1 shows pseudocode for a typical supervised deep learning training algo-
rithm, consisting of an input dataset, a forward pass that calculates the network
output, a loss computation, and a backward pass. See Appendix B or [280] for more
details.
We see that the code consists of a double loop: the outer loop controls the training
epochs. Epochs consist of forward approximation of the target value using the
parameters, computation of the gradient, and backward adjusting of the parameters
with the gradient. In each epoch the inner loop serves all examples of the static
dataset to the forward computation of the output value, the loss and the gradient
computation, so that the parameters can be adjusted in the backward pass.
The dataset is static, and all that the inner loop does is deliver the samples
to the backpropagation algorithm. Note that each sample is independent of the
others, and samples are chosen with equal probability. After an image of a white horse is
sampled, the next image is equally (un)likely to be of a black grouse or a blue
moon.

def train_qlearn(environment, Qnet, alpha=0.001, gamma=0.9, epsilon=0.05):
    for epoch in range(max_epochs):              # an epoch is one pass
        s = s0                                   # initialize start state
        sum_sq = 0                               # reset to zero for each pass
        while s != TERMINAL:                     # perform steps of one full episode
            a = epsilongreedy(Qnet(s), epsilon)  # net: Q[s, a]-values
            (r, sp) = environment(s, a)
            output = Qnet.forward_pass(s, a)
            target = r + gamma * max(Qnet(sp))   # bootstrapped, moving target
            sum_sq += (target - output) ** 2
            s = sp
        grad = Qnet.gradient(sum_sq)
        Qnet.backward_pass(grad, alpha)
    return Qnet                                  # Q-values

Listing 3.3 Network training pseudocode for reinforcement learning

3.2.1.2 Bootstrapping Q-Values

Let us now look at Q-learning. Reinforcement learning chooses the training exam-
ples differently. For convergence of algorithms such as Q-learning, the selection rule
must guarantee that eventually all states will be sampled by the environment [831].
For large problems, this is not the case; this condition for convergence to the value
function does not hold.
Listing 3.2 shows the short version of the bootstrapping tabular Q-learning
pseudocode from the previous chapter. As in the previous deep learning algorithm,
the algorithm consists of a double loop. The outer loop controls the Q-value conver-
gence episodes, and each episode consists of a single trace of (time) steps from the
start state to a terminal state. The Q-values are stored in a Python-array indexed
by 𝑠 and 𝑎, since Q is the state-action value. Convergence of the Q-values is as-
sumed to have occurred when enough episodes have been sampled. The Q-formula
shows how the Q-values are built up by bootstrapping on previous values, and how
Q-learning is learning off-policy, taking the max value of an action.
A difference with the supervised learning is that in Q-learning subsequent
samples are not independent. The next action is determined by the current policy,
and will most likely be the best action of the state (𝜖-greedy). Furthermore, the next
state will be correlated to the previous state in the trajectory. After a state of the
ball in the upper left corner of the field has been sampled, the next sample will
with very high probability also be of a state where the ball is close to the upper
left corner of the field. Training can be stuck in local minima, hence the need for
exploration.
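The correlation between subsequent samples can be made visible with a toy experiment. The sketch below is an illustration, not an experiment from the book: it records the position of a random walker on a line of ten squares and compares the lag-1 correlation of the trajectory with that of the same samples in shuffled order, as a supervised learner would see them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Trajectory of a random walk on squares 0..9, standing in for a ball position
x = np.empty(10_000)
pos = 5
for t in range(x.size):
    pos = min(9, max(0, pos + rng.choice([-1, 1])))
    x[t] = pos

def lag1_correlation(samples):
    """Correlation between each sample and the sample that follows it."""
    return np.corrcoef(samples[:-1], samples[1:])[0, 1]

trajectory_corr = lag1_correlation(x)
shuffled_corr = lag1_correlation(rng.permutation(x))

print("consecutive states:", round(trajectory_corr, 3))
print("shuffled samples  :", round(shuffled_corr, 3))
```

Consecutive states are strongly correlated, while the shuffled samples are close to uncorrelated.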

3.2.1.3 Deep Reinforcement Learning Target-Error

The two algorithms, deep learning and Q-learning, look similar in structure. Both
consist of a double loop in which a target is optimized, and we can wonder if
bootstrapping can be combined with loss-function minimization. This is indeed
the case, as Mnih et al. [522] showed in 2013. Our third listing, Listing 3.3, shows
a naive deep learning version of Q-learning [522, 524], based on the double loop
that now bootstraps Q-values by minimizing a loss function through adjusting the
𝜃 parameters.
Indeed, a Q-network can be trained with a gradient by minimizing a sequence
of loss functions. The loss function for this bootstrap process is quite literally
based on the Q-learning update formula. The loss function is the squared difference
between the new Q-value $Q_{\theta_t}(s, a)$ from the forward pass and the old update target
$r + \gamma \max_{a'} Q_{\theta_{t-1}}(s', a')$ (see footnote 4).
An important observation is that the update targets depend on the previous
network weights 𝜃 𝑡−1 (the optimization targets move during optimization); this is
in contrast with the targets used in a supervised learning process, which are fixed
before learning begins [522]. In other words, the loss function of deep Q-learning
minimizes the distance to a moving target, a target that depends on the network being optimized.
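As a small worked example of this loss, the sketch below computes the squared difference for one transition, and then shows how the target itself shifts once the parameters change. The two lookup tables stand in for the networks with weights $\theta_{t-1}$ and $\theta_t$; the state names and all numbers are hypothetical.

```python
GAMMA = 0.99  # assumed discount factor

def td_target(q_old, r, s_next, done):
    """Bootstrap target r + gamma * max_a' Q_old(s', a')."""
    return r if done else r + GAMMA * max(q_old[s_next])

def td_loss(q_new, q_old, transition):
    """Squared difference between the old target and the new Q-value."""
    s, a, r, s_next, done = transition
    return (td_target(q_old, r, s_next, done) - q_new[s][a]) ** 2

q_prev = {"s0": [0.0, 0.5], "s1": [1.0, 2.0]}   # plays the role of Q with weights theta_{t-1}
q_curr = {"s0": [0.0, 0.7], "s1": [1.0, 2.4]}   # Q with weights theta_t, after some updates
transition = ("s0", 1, 1.0, "s1", False)          # (s, a, r, s', done)

loss = td_loss(q_curr, q_prev, transition)            # target from the old values: 2.98
moved_target = td_target(q_curr, 1.0, "s1", False)    # same transition, one update later: 3.376
```

The target for the same transition moves from 2.98 to 3.376 purely because the parameters changed, which is the moving-target effect described above.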

3.2.2 Three Challenges

Let us have a closer look at the challenges that deep reinforcement learning faces.
There are three problems with our naive deep Q-learner. First, convergence to the
optimal Q-function depends on full coverage of the state space, yet the state space is
too large to sample fully. Second, there is a strong correlation between subsequent
training samples, with a real risk of local optima. Third, the loss function of gradient
descent literally has a moving target, and bootstrapping may diverge. Let us have a
look at these three problems in more detail.

3.2.2.1 Coverage

Proofs that algorithms such as Q-learning converge to the optimal policy depend on
the assumption of full state space coverage; all state-action pairs must be sampled.
Otherwise, the algorithms will not converge to an optimal action value for each
state. Clearly, in large state spaces where not all states are sampled, this situation
does not hold, and there is no guarantee of convergence.
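4 Deep Q-learning is a fixed-point iteration [507]. The gradient of this loss function is
$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot);\, s' \sim \mathcal{E}} \left[ \left( r + \gamma \max_{a'} Q_{\theta_{i-1}}(s', a') - Q_{\theta_i}(s, a) \right) \nabla_{\theta_i} Q_{\theta_i}(s, a) \right]$,
where $\rho$ is the behavior distribution and $\mathcal{E}$ the Atari emulator. Further details are in [522].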

3.2.2.2 Correlation

In reinforcement learning a sequence of states is generated in an agent/environment
loop. Subsequent states differ only by a single action, one move or one stone; all other
features of the states remain unchanged. Thus, the values of subsequent samples
are correlated, which may result in biased training. The training may cover only
a part of the state space, especially when greedy action selection increases the
tendency to select a small set of actions and states. The bias can result in the
so-called specialization trap (when there is too much exploitation, and too little
exploration).
Correlation between subsequent states contributes to the low coverage that we
discussed before, reducing convergence towards the optimal Q-function, increasing
the probability of local optima and feedback loops. This happens, for example, when
a chess program has been trained on a particular opening, and the opponent plays
a different one. When test examples are different from training examples, then
generalization will be bad. This problem is related to out-of-distribution training,
see for example [485].

3.2.2.3 Convergence

When we naively apply our deep supervised methods to reinforcement learning,
we encounter the problem that in a bootstrap process, the optimization target is
part of the bootstrap process itself. Deep supervised learning uses a static dataset
to approximate a function, and loss-function targets are therefore stable. However,
deep reinforcement learning uses as bootstrap target the Q-update from the previous
time step, which changes during the optimization.
The loss is the squared difference between the Q-value $Q_{\theta_t}(s, a)$ and the old
update target $r + \gamma \max_{a'} Q_{\theta_{t-1}}(s', a')$. Since both depend on parameters $\theta$ that are
optimized, the risk of overshooting the target is real, and the optimization process
can easily become unstable. It has taken quite some effort to find algorithms that
can tolerate these moving targets.

Deadly Triad

Multiple works [48, 283, 787] showed that a combination of off-policy reinforcement
learning with nonlinear function approximation (such as deep neural networks)
could cause Q-values to diverge. Sutton and Barto [743] further analyze three
elements that together can cause divergent training: function approximation,
bootstrapping, and off-policy learning. Together, they are called the deadly triad.
Function approximation may attribute values to states inaccurately. In contrast
to exact tabular methods, which are designed to identify individual states exactly,
neural networks are designed to identify individual features of states. These features can be
shared by different states, and values attributed to those features are then also shared by
other states. Function approximation may thus cause mis-identification of states,
and reward values and Q-values that are not assigned correctly. If the accuracy
of the approximation of the true function values is good enough, then states may
be identified well enough to reduce or prevent divergent training processes and
loops [523].
Bootstrapping of values builds up new values on the basis of older values. This
occurs in Q-learning and temporal-difference learning where the current value
depends on the previous value. Bootstrapping increases the efficiency of the training
because values do not have to be calculated from the start. However, errors or
biases in initial values may persist, and spill over to other states as values are
propagated incorrectly due to function approximation. Bootstrapping and function
approximation can thus increase divergence.
Off-policy learning uses a behavior policy that is different from the target policy
that we are optimizing for (Sect. 2.2.4.4). When the behavior policy is improved,
the off-policy values may not improve. Off-policy learning generally converges less
well than on-policy learning, as it converges independently of the behavior policy.
With function approximation convergence may be even slower, due to values being
assigned to incorrect states.

3.2.3 Stable Deep Value-Based Learning

These considerations discouraged further research in deep reinforcement learning
for many years. Instead, research focused for some time on linear function
approximators, which have better convergence guarantees. Nevertheless, work
on convergent deep reinforcement learning continued [654, 324, 88, 495], and al-
gorithms such as neural fitted Q-learning were developed, which showed some
promise [631, 453, 482]. After the further results of DQN [522] showed convincingly
that convergence and stable learning could be achieved in a non-trivial problem,
even more experimental studies were performed to find out under which circum-
stances convergence can be achieved and the deadly triad can be overcome. Further
convergence and diversity-enhancing techniques were developed, some of which
we will cover in Sect. 3.2.4.
Although the theory provides reasons why function approximation may preclude
stable reinforcement learning, there were, in fact, indications that stable training
is possible. Starting at the end of the 1980s, Tesauro had written a program that
played very strong Backgammon based on a neural network. The program was
called Neurogammon, and used supervised learning from grand-master games [762].
In order to improve the strength of the program, he switched to temporal difference
reinforcement learning from self-play games [764]. TD-Gammon [763] learned by
playing against itself, achieving stable learning in a shallow network. TD-Gammon’s
training used a temporal difference algorithm similar to Q-learning, approximating
the value function with a network with one hidden layer, using raw board input
enhanced with hand-crafted heuristic features [763]. Perhaps some form of stable
reinforcement learning was possible, at least in a shallow network?
TD-Gammon’s success prompted attempts with TD learning in checkers [139]
and Go [738, 154]. Unfortunately the success could not be replicated in these games,
and it was believed for some time that Backgammon was a special case, well suited
for reinforcement learning and self-play [604, 678].
However, as there came further reports of successful applications of deep neural
networks in a reinforcement learning setting [324, 654], more work followed. The
results in Atari [523] and later in Go [706] as well as further work [799] have now
provided clear evidence that both stable training and generalizing deep reinforce-
ment learning are indeed possible, and have improved our understanding of the
circumstances that influence stability and convergence.
Let us have a closer look at the methods that are used to achieve stable deep
reinforcement learning.

3.2.3.1 Decorrelating States

As mentioned in the introduction of this chapter, in 2013 Mnih et al. [522, 523]
published their work on end-to-end reinforcement learning in Atari games.
The original focus of DQN is on breaking correlations between subsequent states,
and also on slowing down changes to parameters in the training process to improve
stability. The DQN algorithm has two methods to achieve this: (1) experience replay
and (2) infrequent weight updates. We will first look at experience replay.

Experience Replay

In reinforcement learning training samples are created in a sequence of interactions
with the environment, and subsequent training states are strongly correlated to
preceding states. There is a tendency to train the network on too many samples
of a certain kind or in a certain area, and other parts of the state space remain
under-explored. Furthermore, through function approximation and bootstrapping,
some behavior may be forgotten. When an agent reaches a new level in a game that
is different from previous levels, the agent may forget how to play the other level.
We can reduce correlation, and the local minima it causes, by adding a small
amount of supervised learning. To break correlations and to create a more diverse
set of training examples, DQN uses experience replay. Experience replay introduces
a replay buffer [481], a cache of previously explored states, from which it samples
training states at random.5 Experience replay stores the last 𝑁 examples in the
replay memory, and samples uniformly when performing updates. A typical number
for 𝑁 is $10^6$ [875]. By using a buffer, a dynamic dataset from which recent training
examples are sampled, we train on states from a more diverse set, instead of only on
the most recent one. The goal of experience replay is to increase the independence
of subsequent training examples. The next state to be trained on is no longer a direct
successor of the current state, but one somewhere in a long history of previous
states. In this way the replay buffer spreads out the learning over more previously
seen states, breaking temporal correlations between samples. DQN’s replay buffer
(1) improves coverage, and (2) reduces correlation.
5 Originally experience replay is, as so much in artificial intelligence, a biologically inspired
mechanism [506, 572, 482].
DQN treats all examples equally, old and recent alike. A form of importance
sampling might differentiate between important transitions, as we will see in the
next section.
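The replay buffer described above can be sketched in a few lines. This is a minimal illustration, not the implementation used by DQN or Stable Baselines; the class name `ReplayBuffer`, its methods, and the small capacity are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores the last `capacity` transitions and samples them uniformly."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform sampling without replacement breaks the temporal order
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=1000, seed=0)
for t in range(1500):                      # dummy transitions, the state is the time step
    buffer.add(t, 0, 0.0, t + 1, False)
batch = buffer.sample(32)
```

After 1500 insertions only the most recent 1000 transitions remain, and a sampled minibatch mixes transitions from across that history instead of consisting of direct successors.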
Note that, curiously, training by experience replay is a form of off-policy learning,
since the target parameters are different from those used to generate the sample.
Off-policy learning is one of the three elements of the deadly triad, and we find that
stable learning can actually be improved by a special form of one of its problems.
Experience replay works well in Atari [523]. However, further analysis of replay
buffers has pointed to possible problems. Zhang et al. [875] study the deadly triad
with experience replay, and find that larger networks resulted in more instabilities,
but also that longer multi-step returns yielded fewer unrealistically high reward
values. In Sect. 3.2.4 we will see many further enhancements to DQN-like algorithms.

3.2.3.2 Infrequent Updates of Target Weights

The second improvement in DQN is infrequent weight updates, introduced in the
2015 paper on DQN [523]. The aim of this improvement is to reduce divergence
that is caused by frequent updates of weights of the target 𝑄-value. Again, the aim
is to improve the stability of the network optimization by improving the stability
of the 𝑄-target in the loss function.
Every $n$ updates, the network $Q$ is cloned to obtain target network $\hat{Q}$, which is
used for generating the targets for the following $n$ updates to $Q$. In the original
DQN implementation a single set of network weights $\theta$ is used, and the network
is trained on a moving loss-target. Now, with infrequent updates the weights of the
target network change much more slowly than those of the behavior policy, improving
the stability of the Q-targets.
The second network improves the stability of Q-learning, where normally an
update to $Q_\theta(s_t, a_t)$ also changes the target at each time step, quite possibly leading
to oscillations and divergence of the policy. Generating the targets using an older
set of parameters adds a delay between the time an update to $Q_\theta$ is made and the
time the update changes the targets, making oscillations less likely.
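The cloning schedule can be illustrated without any neural network. In the sketch below a plain list stands in for the weights, and the helper class `TargetSync` (a hypothetical name) copies the online weights to the target weights every `period` updates.

```python
import copy

class TargetSync:
    """Keeps a frozen copy of the online parameters, refreshed every `period` updates."""

    def __init__(self, online_params, period):
        self.period = period
        self.updates = 0
        self.target_params = copy.deepcopy(online_params)

    def after_update(self, online_params):
        self.updates += 1
        if self.updates % self.period == 0:
            self.target_params = copy.deepcopy(online_params)

theta = [0.0, 0.0]                 # stand-in for the weights of Q
sync = TargetSync(theta, period=3)
history = []
for step in range(1, 7):
    theta[0] += 1.0                # stand-in for one gradient step on Q
    sync.after_update(theta)
    history.append(sync.target_params[0])
```

The online weight changes at every step, while the target weight only takes the values 0, 3, and 6: between two synchronizations the targets stay fixed, which is the delay discussed above.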

3.2.3.3 Hands On: DQN and Breakout Gym Example

To get some hands-on experience with DQN, we will now have a look at how DQN
can be used to play the Atari game Breakout.
The field of deep reinforcement learning is an open field where the code of most
algorithms is freely shared on GitHub and where test environments are available.
The most widely used environment is Gym, in which benchmarks such as ALE
and MuJoCo can be found, see also Appendix C. The open availability of the
software allows for easy replication, and, importantly, for further improvement of
the methods. Let us have a closer look at the code of DQN, to experience how it
works.
The DQN papers come with source code. The original DQN code from [523]
is available at Atari DQN.6 This code is the original code, in the programming
language Lua, which may be interesting to study, if you are familiar with this
language. A modern reference implementation of DQN, with further improvements,
is in the (stable) baselines.7 The RL Baselines Zoo even provides a collection of
pretrained agents, at Zoo [603, 270].8 The Network Zoo is especially useful if your
desired application happens to be in the Zoo, to prevent long training times.

Install Stable Baselines

The environment is only half of the reinforcement learning experiment; we also
need an agent algorithm to learn the policy. OpenAI also provides implementations
of agent algorithms, called the Baselines, at the Gym GitHub repository Baselines.9
Most algorithms that are covered in this book are present. You can download them,
study the code, and experiment to gain an insight into their behavior.
In addition to OpenAI’s Baselines, there is Stable Baselines, a fork of the OpenAI
algorithms; it has more documentation and other features. It can be found at Stable
Baselines,10 and the documentation is at docs.11
The stable release of Stable Baselines is installed by typing

pip install stable-baselines

or

pip install stable-baselines[mpi]

if support for OpenMPI is desired (a parallel message passing implementation for
cluster computers). A very quick check to see if everything works is to run the PPO
trainer from Listing 3.4. PPO is a policy-based algorithm that will be discussed in
the next chapter in Sect. 4.2.5. The Cartpole should appear again, but should now
learn to stabilize for a brief moment.
6 https://github.com/kuz/DeepMind-Atari-Deep-Q-Learner
7 https://stable-baselines.readthedocs.io/en/master/index.html
8 https://github.com/araffin/rl-baselines-zoo
9 https://github.com/openai/baselines
10 https://github.com/hill-a/stable-baselines
11 https://stable-baselines.readthedocs.io/en/master/

import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines import PPO2

env = gym.make('CartPole-v1')

# Train a PPO agent with a multilayer perceptron policy
model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)

# Run the trained policy and render the result
obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
    if dones:  # start a new episode when the current one has ended
        obs = env.reset()

Listing 3.4 Running Stable Baseline PPO on the Gym Cartpole Environment

from stable_baselines.common.atari_wrappers import make_atari
from stable_baselines.deepq.policies import CnnPolicy
from stable_baselines import DQN

env = make_atari('BreakoutNoFrameskip-v4')

# Train DQN with a convolutional policy on the game frames
model = DQN(CnnPolicy, env, verbose=1)
model.learn(total_timesteps=25000)

# Play with the trained agent
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
    if dones:  # start a new game when the episode has ended
        obs = env.reset()

Listing 3.5 Deep Q-Network Atari Breakout example with Stable Baselines

The DQN Code

After having studied tabular Q-learning on Taxi in Sect. 2.2.4.5, let us now see
how the network-based DQN works in practice. Listing 3.5 illustrates how easy
it is to use the Stable Baselines implementation of DQN on the Atari Breakout
environment. (See Sect. 2.2.4.1 for installation instructions of Gym.)
After you have run the DQN code and seen that it works, it is worthwhile to study
how the code is implemented. Before you dive into the Python implementation of
Stable Baselines, let us look at the pseudocode to refresh how the elements of DQN
work together. See Listing 3.6. In this pseudocode we follow the 2015 version of
DQN [523]. (The 2013 version of DQN did not use the target network [522].)

def dqn:
    initialize replay_buffer empty
    initialize Q network with random weights
    initialize Qt target network as a copy of Q
    set s = s0
    while not convergence:
        # DQN in Atari uses preprocessing; not shown
        # action selection depends on Q (moving target)
        epsilon-greedy select action a in argmax(Q(s, a))
        sx, r = execute action in environment
        append (s, a, r, sx) to buffer
        # break temporal correlation
        sample minibatch from buffer
        take target batch R (when terminal) or Qt
        # loss function uses target Qt network
        do gradient descent step on Q
        every n steps: copy weights of Q to Qt
        set s = sx

Listing 3.6 Pseudocode for DQN, after [523]

DQN is based on Q-learning, extended with a replay buffer and a target network
to improve stability and convergence. First, at the start of the code, the replay buffer
is initialized to empty, and the weights of the Q network and the separate Q target
network are initialized. The state 𝑠 is set to the start state.
Next is the optimization loop, that runs until convergence. At the start of each
iteration an action is selected at the state 𝑠, following an 𝜖-greedy approach. The
action is executed in the environment, and the new state and the reward are stored
in a tuple in the replay buffer. Then, we train the Q-network. A minibatch is sampled
randomly from the replay buffer, and one gradient descent step is performed. For
this step the loss function is calculated with the separate Q-target network $\hat{Q}_\theta$, which
is updated less frequently than the primary Q-network $Q_\theta$. In this way the loss
function

$$L_t(\theta_t) = \mathbb{E}_{s,a \sim \rho(\cdot)} \left[ \left( \mathbb{E}_{s' \sim \mathcal{E}} \left( r + \gamma \max_{a'} \hat{Q}_{\theta_{t-1}}(s', a') \mid s, a \right) - Q_{\theta_t}(s, a) \right)^2 \right]$$

is more stable, causing better convergence; $\rho(s, a)$ is the behavior distribution over
$s$ and $a$, and $\mathcal{E}$ is the Atari emulator [522]. Sampling the minibatch reduces the
correlation that is inherent in reinforcement learning between subsequent states.
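To see how the elements of Listing 3.6 fit together in running code, the following sketch replaces the neural network by a lookup table and the Atari emulator by a hypothetical five-state corridor. It is therefore not DQN proper, only an illustration of the control flow (replay buffer, minibatch update, target copy); the environment, function names, and hyperparameters are all assumptions.

```python
import random
from collections import deque

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9   # hypothetical corridor; 0 = left, 1 = right

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

def dqn_sketch(steps=5000, batch_size=16, sync_every=50, lr=0.1, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # stand-in for the Q-network
    q_target = [row[:] for row in q]                    # target network starts as a copy
    buffer = deque(maxlen=1000)                         # replay buffer
    s = 0
    for t in range(1, steps + 1):
        # epsilon-greedy action selection on the online Q (ties broken at random)
        best = [a for a in range(N_ACTIONS) if q[s][a] == max(q[s])]
        a = rng.randrange(N_ACTIONS) if rng.random() < eps else rng.choice(best)
        s_next, r, done = step(s, a)
        buffer.append((s, a, r, s_next, done))
        s = 0 if done else s_next
        if len(buffer) >= batch_size:
            for bs, ba, br, bs_next, bdone in rng.sample(list(buffer), batch_size):
                # target comes from the frozen copy, not from q itself
                y = br if bdone else br + GAMMA * max(q_target[bs_next])
                q[bs][ba] += lr * (y - q[bs][ba])       # gradient step on the squared error
        if t % sync_every == 0:
            q_target = [row[:] for row in q]            # infrequent target update
    return q

q = dqn_sketch()
greedy = [max(range(N_ACTIONS), key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

With a table the update `q[bs][ba] += lr * (y - q[bs][ba])` is exactly a gradient step on half the squared error with respect to that single entry; with a network the same error would be backpropagated through the weights $\theta$.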

Conclusion

In summary, DQN was able to successfully learn end-to-end behavior policies for
many different games (although similar and from the same benchmark set). Minimal
prior knowledge was used to guide the system, and the agent only got to see the
pixels and the game score. The same network architecture and procedure was used
on each game; however, a network trained for one game could not be used to play
another game.

Name                          Principle               Applicability     Effectiveness

DQN [522]                     replay buffer           Atari             stable Q learning
Double DQN [800]              de-overestimate values  DQN               convergence
Prioritized experience [666]  decorrelation           replay buffer     convergence
Distributional [70]           probability distr       stable gradients  generalization
Random noise [254]            parametric noise        stable gradients  more exploration
Table 3.1 Deep Value-Based Approaches

The DQN achievement was an important milestone in the history of deep rein-
forcement learning. The main problems that were overcome by Mnih et al. [522]
were training divergence and learning instability.
The nature of most Atari 2600 games is that they require eye-hand reflexes. Although the
games have some strategic elements, credit assignment is mostly over a short term,
and the games can be learned with a surprisingly simple neural network. Most Atari games
are more about immediate reflexes than about longer term reasoning. In this sense,
the problem of playing Atari well is not unlike an image categorization problem:
both problems are to find the right response that matches an input consisting of a
set of pixels. Mapping pixels to categories is not that different from mapping pixels
to joystick actions (see also the observations in [400]).
The Atari results have stimulated much subsequent research. Many blogs have
been written on reproducing the result, which is not a straightforward task, requir-
ing the fine-tuning of many hyperparameters [58].

3.2.4 Improving Exploration

The DQN results have spawned much activity among reinforcement learning re-
searchers to improve training stability and convergence further, and many refine-
ments have been devised, some of which we will review in this section.
Many of the topics that are covered by the enhancements are older ideas that
work well in deep reinforcement learning. DQN applies random sampling of its
replay buffer, and one of the first enhancements was prioritized sampling [666].
It was found that DQN, being an off-policy algorithm, typically overestimates
action values (due to the max operation, Sect. 2.2.4.4). Double DQN addresses
overestimation [800], and dueling DDQN introduces the advantage function to
standardize action values [830]. Other approaches look at variance in addition to
expected value: the effect of random noise on exploration was tested [254], and
distributional DQN showed that networks that use probability distributions work
better than networks that only use single point expected values [70].
In 2017 Hessel et al. [335] performed a large experiment that combined seven
important enhancements. They found that the enhancements worked well together.
The paper has become known as the Rainbow paper, since the major graph showing
the cumulative performance over 57 Atari games of the seven enhancements is
multi-colored (Fig. 3.5). Table 3.1 summarizes the enhancements, and this section
provides an overview of the main ideas. The enhancements were tested on the same
benchmarks (ALE, Gym), and most algorithm implementations can be found on the
OpenAI Gym GitHub site in the baselines.12

Fig. 3.5 Rainbow graph: performance over 57 Atari games [335]

3.2.4.1 Overestimation

Van Hasselt et al. introduce double deep Q learning (DDQN) [800]. DDQN is based
on the observation that Q-learning may overestimate action values. On the Atari
2600 games DQN suffers from substantial over-estimations. Remember that DQN
uses Q-learning. Because of the max operation in Q-learning this results in an
overestimation of the Q-value. To resolve this issue, DDQN uses the Q-Network to
choose the action but uses the separate target Q-Network to evaluate the action.
Let us compare the training target for DQN

$$y = r_{t+1} + \gamma\, Q_{\theta_t}\left(s_{t+1}, \arg\max_a Q_{\theta_t}(s_{t+1}, a)\right)$$

with the training target for DDQN (the difference is a single 𝜙)

$$y = r_{t+1} + \gamma\, Q_{\phi_t}\left(s_{t+1}, \arg\max_a Q_{\theta_t}(s_{t+1}, a)\right).$$

12 https://github.com/openai/baselines

The DQN target uses the same set of weights $\theta_t$ twice, for selection and evaluation;
the DDQN target uses a separate set of weights $\phi_t$ for evaluation, preventing
overestimation due to the max operator. Updates are assigned randomly to either set of
weights.
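The difference between the two targets is easy to see numerically. The sketch below evaluates both formulas for a single transition; the lists of next-state action values are hypothetical numbers in which the weights $\theta_t$ overestimate one action.

```python
GAMMA = 0.99  # assumed discount factor

def argmax(values):
    return max(range(len(values)), key=lambda a: values[a])

def dqn_target(r, q_theta_next):
    """Same weights select and evaluate the action: r + gamma * max_a Q_theta(s', a)."""
    return r + GAMMA * q_theta_next[argmax(q_theta_next)]

def ddqn_target(r, q_theta_next, q_phi_next):
    """Q_theta selects the action, the second set of weights phi evaluates it."""
    return r + GAMMA * q_phi_next[argmax(q_theta_next)]

q_theta_next = [1.0, 3.0, 2.0]   # hypothetical Q_theta(s_{t+1}, .), action 1 is overestimated
q_phi_next = [1.2, 2.0, 2.1]     # hypothetical Q_phi(s_{t+1}, .)

y_dqn = dqn_target(0.5, q_theta_next)                 # 0.5 + 0.99 * 3.0
y_ddqn = ddqn_target(0.5, q_theta_next, q_phi_next)   # 0.5 + 0.99 * 2.0
```

Both targets pick action 1, but the DDQN target scores it with the second set of weights, so an overestimate in $\theta_t$ is not copied into the target unless $\phi_t$ happens to share it.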
Earlier Van Hasselt et al. [314] introduced the double Q learning algorithm in
a tabular setting. The later paper shows that this idea also works with a large
deep network. They report that the DDQN algorithm not only reduces the over-
estimations but also leads to better performance on several games. DDQN was
tested on 49 Atari games and achieved about twice the average score of DQN with
the same hyperparameters, and four times the average DQN score with tuned
hyperparameters [800].

Prioritized Experience Replay

DQN samples uniformly over the entire history in the replay buffer, where Q-
learning uses only the most recent (and important) state. It is reasonable to investigate
whether a solution in between these two extremes performs well.
Prioritized experience replay, or PEX, is such an attempt. It was introduced by
Schaul et al. [666]. In the Rainbow paper PEX is combined with DDQN, and, as we
can see, the blue line (with PEX) indeed outperforms the purple line.
In DQN experience replay lets agents reuse examples from the past, although
experience transitions are uniformly sampled, and actions are simply replayed
at the same frequency that they were originally experienced, regardless of their
significance. The PEX approach provides a framework for prioritizing experience.
Important actions are replayed more frequently, and therefore learning efficiency
is improved. As a measure of importance, standard proportional prioritized replay
is used, with the absolute TD error to prioritize actions. Prioritized replay is used
widely in value-based deep reinforcement learning. The measure can be computed
in the distributional setting using the mean action values. In the Rainbow paper all
distributional variants prioritize actions by the Kullback-Leibler loss [335].
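Proportional prioritization can be sketched as follows. The form $p_i = (|\delta_i| + \epsilon)^\alpha$, normalized over the buffer, follows Schaul et al. [666]; the values of `alpha` and `eps`, the function names, and the TD errors are assumptions for illustration, and the importance-sampling correction of the original method is left out.

```python
import random

def priorities_to_probs(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritization: p_i = (|delta_i| + eps) ** alpha, normalized to sum to 1."""
    scaled = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(scaled)
    return [p / total for p in scaled]

def sample_indices(probs, batch_size, rng):
    """Draw buffer indices with probability proportional to their priority."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)

td_errors = [0.1, 2.0, 0.0, 0.5]      # hypothetical TD errors, one per stored transition
probs = priorities_to_probs(td_errors)
batch = sample_indices(probs, batch_size=8, rng=random.Random(0))
```

Setting `alpha=0` recovers the uniform sampling of DQN, since every priority then becomes 1; larger values of `alpha` concentrate replay on transitions with a large TD error.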

Advantage Function

The original DQN uses a single neural network as function approximator; DDQN
(double deep Q-network) uses a separate target Q-Network to evaluate an action.
Dueling DDQN [830], also known as DDDQN, improves on this architecture by
using two separate estimators: a value function and an advantage function

𝐴(𝑠, 𝑎) = 𝑄(𝑠, 𝑎) − 𝑉 (𝑠).

Advantage functions are related to the actor-critic approach (see Chap. 4). An
advantage function computes the difference between the value of an action and the
value of the state. The function standardizes values on a baseline for the actions
of a state [293]. Advantage functions provide better policy evaluation when many
actions have similar values.
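A sketch of how the two estimators can be combined into Q-values is given below. Subtracting the mean advantage is the aggregation used in the dueling architecture of [830]; the function name `dueling_q` and the numbers are hypothetical.

```python
def dueling_q(value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a), centring the advantages on their mean."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]

v = 2.0                          # hypothetical output of the value stream, V(s)
adv = [0.3, -0.1, -0.2]          # hypothetical output of the advantage stream, A(s, a)
q_values = dueling_q(v, adv)     # approximately [2.3, 1.9, 1.8]
```

Because the advantages are centred, the mean of the resulting Q-values equals $V(s)$, and the ranking of the actions is determined by the advantage stream alone.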

3.2.4.2 Distributional Methods

The original DQN learns a single value, which is the estimated mean of the state
value. This approach does not take uncertainty into account. To remedy this, distri-
butional Q-learning [70] learns a categorical probability distribution of discounted
returns instead, increasing exploration. Bellemare et al. design a new distributional
algorithm which applies Bellman’s equation to the learning of distributions, a
method called distributional DQN. Moerland et al. [526, 527] propose uncertain
value networks. Interestingly, a link between the distributional approach and bi-
ology has been reported. Dabney et al. [174] showed correspondence between
distributional reinforcement learning algorithms and the dopamine levels in mice,
suggesting that the brain represents possible future rewards as a probability distri-
bution.
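The following sketch illustrates what a categorical return distribution adds to a single expected value. The support of five atoms and the two example distributions are hypothetical; the distributional DQN of [70] uses a larger fixed support and learns the probabilities with a network.

```python
def make_support(v_min, v_max, n_atoms):
    """Evenly spaced return values (atoms) on which the categorical distribution lives."""
    delta = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * delta for i in range(n_atoms)]

def expected_value(probs, support):
    return sum(p * z for p, z in zip(probs, support))

def variance(probs, support):
    mean = expected_value(probs, support)
    return sum(p * (z - mean) ** 2 for p, z in zip(probs, support))

support = make_support(0.0, 4.0, 5)      # atoms 0, 1, 2, 3, 4
certain = [0.0, 0.0, 1.0, 0.0, 0.0]      # the return is always 2
risky = [0.5, 0.0, 0.0, 0.0, 0.5]        # the return is 0 or 4 with equal probability
```

Both distributions have an expected return of 2, so a network that only learns the mean cannot tell them apart, whereas their variances (0 and 4) differ.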

Noisy DQN

Another distributional method is noisy DQN [254]. Noisy DQN uses stochastic net-
work layers that add parametric noise to the weights. The noise induces randomness
in the agent’s policy, which increases exploration. The parameters that govern the
noise are learned by gradient descent together with the remaining network weights.
In their experiments the standard exploration heuristics for A3C (Sect. 4.2.4), DQN,
and dueling agents (entropy reward and 𝜖-greedy) were replaced with NoisyNet.
The increased exploration yields substantially higher scores for Atari (dark red
line).
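The idea of parametric noise on the weights can be sketched with a single linear layer. This is a simplified illustration under stated assumptions: independent Gaussian noise per weight, no bias terms, and no learning of `sigma`; the class name `NoisyLinear` and the initial values are hypothetical.

```python
import random

class NoisyLinear:
    """Linear layer with weights mu + sigma * eps, using fresh Gaussian eps per forward pass."""

    def __init__(self, n_in, n_out, sigma_init=0.5, seed=None):
        self.rng = random.Random(seed)
        self.mu = [[self.rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        # in the actual method sigma is learned by gradient descent together with mu
        self.sigma = [[sigma_init] * n_in for _ in range(n_out)]

    def forward(self, x, noisy=True):
        out = []
        for mu_row, sigma_row in zip(self.mu, self.sigma):
            total = 0.0
            for xi, m, s in zip(x, mu_row, sigma_row):
                eps = self.rng.gauss(0.0, 1.0) if noisy else 0.0
                total += (m + s * eps) * xi
            out.append(total)
        return out

layer = NoisyLinear(n_in=3, n_out=2, seed=0)
x = [1.0, 0.5, -1.0]
clean = layer.forward(x, noisy=False)         # deterministic output from mu only
a, b = layer.forward(x), layer.forward(x)     # two noisy passes give different outputs
```

If such a layer produces the Q-values, the greedy action can change from one forward pass to the next, which is how the noise induces exploration without a separate 𝜖-greedy rule.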

3.3 Atari 2600 Environments

In their original 2013 workshop paper Mnih et al. [522] achieved human-level play
for some of the games. Training was performed on 50 million frames in total on
seven Atari games. The neural network performed better than an expert human
player on Breakout, Enduro, and Pong. On Seaquest, Q*Bert, and Space Invaders
performance was far below that of a human. In these games a strategy must be
found that extends over longer time periods. In their follow-up journal article two
years later they were able to achieve human level play for 49 of the 57 games that
are in ALE [523], and performed better than human-level play in 29 of the 49 games.
Some of the games still proved difficult, notably games that require longer-range
planning, where long stretches of the game do not give rewards, such as in
Montezuma’s Revenge, where the agent has to walk long distances, and pick up
a key to reach new rooms to enter new levels. In reinforcement learning terms,
delayed credit assignment over long periods is hard. Towards the end of the book we
will see Montezuma’s Revenge again, when we discuss hierarchical reinforcement
learning methods, in Chap. 8. These methods are specifically developed to take large
steps in the state space. The Go-Explore algorithm was able to solve Montezuma’s
Revenge [221, 222].

Fig. 3.6 DQN architecture [361]

3.3.1 Network Architecture

End-to-end learning of challenging problems is computationally intensive. In addition
to the two algorithmic innovations, the success of DQN is also due to the
creation of a specialized efficient training architecture [523].
Playing the Atari games is a computationally intensive task for a deep neural
network: the network trains a behavior policy directly from pixel frame input.
Therefore, the training architecture contains reduction steps. To start with, the net-
work consists of only three hidden layers (one fully connected, two convolutional),
which is simpler than what is used in most supervised learning tasks.
The pixel-images are high-resolution data. Since working with the full resolution
of 210 × 160 pixels of 128 color-values at 60 frames per second would be computationally
too intensive, the images are reduced in resolution. The 210 × 160 image with
a 128-color palette is reduced to gray scale and 110 × 84 pixels, and is then
cropped to 84 × 84. The first hidden layer convolves 16 filters of 8 × 8 with stride 4 and
ReLU neurons. The second hidden layer convolves 32 filters of 4 × 4 with stride 2 and
ReLU neurons. The third hidden layer is fully connected and consists of 256 ReLU
neurons. The output layer is also fully connected with one output per action (18
joystick actions). The outputs correspond to the Q-values of the individual actions.
Figure 3.6 shows the architecture of DQN. The network receives the change in
game score as a number from the emulator, and these rewards are clipped to
{−1, 0, +1} to indicate decrease, no change, or improvement of the score. The
error term of the update is clipped as well (the Huber loss [58]).
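To make the layer sizes concrete, the following minimal sketch works out the arithmetic of the architecture described above. It assumes convolutions without padding and four stacked input frames (the frame stacking that this section describes); the function names `conv_out` and `dqn_shapes` are hypothetical helpers, not part of any library.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output width/height of a square convolution without padding (an assumption)."""
    return (size - kernel) // stride + 1


def dqn_shapes(input_size: int = 84, frames: int = 4, actions: int = 18):
    """Feature-map sizes and parameter counts for the network described in the text."""
    s1 = conv_out(input_size, kernel=8, stride=4)  # first layer: 16 filters of 8 x 8, stride 4
    s2 = conv_out(s1, kernel=4, stride=2)          # second layer: 32 filters of 4 x 4, stride 2
    flat = 32 * s2 * s2                            # inputs to the fully connected layer
    params = {
        "conv1": 16 * (8 * 8 * frames) + 16,       # weights + biases
        "conv2": 32 * (4 * 4 * 16) + 32,
        "fc": flat * 256 + 256,                    # 256 ReLU neurons
        "out": 256 * actions + actions,            # one Q-value per action
    }
    return s1, s2, flat, params


s1, s2, flat, params = dqn_shapes()
print(s1, s2, flat, params)
```

Under these assumptions the fully connected layer holds by far the most parameters, which illustrates why reducing the input resolution matters for the cost of training.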
To reduce computational demands further, frame skipping is employed. Only
one in every 3–4 frames was used, depending on the game. To take game history
into account, the net takes as input the last four resulting frames. This allows
movement to be seen by the net. RMSprop is used as optimizer [642]. Exploration
uses a variant of 𝜖-greedy that starts with an 𝜖 of 1.0 (fully exploring) and goes
down to 0.1 (90% exploiting).
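The frame skipping, frame stacking, and 𝜖 schedule can be sketched as follows. This is an illustration under assumptions: the text only gives the start and end values of 𝜖, so the linear shape of the schedule and the number of annealing steps are assumed here, and `epsilon_at` and `stacked_inputs` are hypothetical names.

```python
def epsilon_at(step: int, anneal_steps: int = 1_000_000,
               eps_start: float = 1.0, eps_end: float = 0.1) -> float:
    """Linearly anneal epsilon from eps_start to eps_end, then hold it there.

    The linear shape and anneal_steps are assumptions; the text only
    states the start value (1.0) and the end value (0.1).
    """
    fraction = min(step / anneal_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def stacked_inputs(frames, skip: int = 4, stack: int = 4):
    """Keep one in every `skip` frames and group the kept frames in windows of `stack`."""
    kept = frames[::skip]
    return [kept[i - stack + 1:i + 1] for i in range(stack - 1, len(kept))]


# Frames are represented by their index, to show which ones reach the network.
inputs = stacked_inputs(list(range(32)))
print(inputs[0], len(inputs))
print(epsilon_at(0), epsilon_at(500_000), epsilon_at(2_000_000))
```

Each network input then spans sixteen emulator frames while containing only four images, which is how movement stays visible at a quarter of the cost.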

3.3.2 Benchmarking Atari

To end the Atari story, we discuss two final algorithms. Of the many value-based
model-free deep reinforcement learning algorithms that have been developed, we
first discuss R2D2 [397], because of its performance. R2D2
is not part of the Rainbow experiments, but is a significant further improvement
of the algorithms. R2D2 stands for Recurrent Replay Distributed DQN. It is built
upon prioritized distributed replay and 5-step double Q-learning. Furthermore, it
uses a dueling network architecture and an LSTM layer after the convolutional
stack. Details about the architecture can be found in [830, 295]. The LSTM uses
the recurrent state to exploit long-term temporal dependencies, which improve
performance. The authors also report that the LSTM allows for better representation
learning. R2D2 achieved good results on all 57 Atari games [397].
A more recent benchmark achievement has been published as Agent57. Agent57
is the first program that achieves a score higher than the human baseline on all 57
Atari 2600 games from ALE. It uses a controller that adapts the long and short-term
behavior of the agent, training for a range of policies, from very exploitative to very
explorative, depending on the game [46].
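The 𝑛-step double Q-learning target that R2D2 builds upon can be illustrated with a small sketch. It shows the generic technique only (an 𝑛-step return, with the online network selecting the bootstrap action and the target network evaluating it) and leaves out all other details of the published agent; the function name and the example numbers are hypothetical.

```python
def n_step_double_q_target(rewards, q_online_next, q_target_next, gamma=0.99):
    """n-step double Q-learning target.

    rewards:        r_t, ..., r_{t+n-1} (n = 5 for 5-step learning)
    q_online_next:  online-network action values in state s_{t+n}
    q_target_next:  target-network action values in state s_{t+n}
    """
    n = len(rewards)
    n_step_return = sum(gamma ** k * r for k, r in enumerate(rewards))
    # Double Q-learning: select the action with the online values ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... but evaluate that action with the target values.
    return n_step_return + gamma ** n * q_target_next[best_action]


target = n_step_double_q_target(
    rewards=[1.0, 0.0, 0.0, 0.0, 1.0],
    q_online_next=[0.2, 0.9],
    q_target_next=[3.0, 2.0],
    gamma=0.5,
)
print(target)
```

Note that the bootstrap uses the target value of the action that the online values prefer, not the largest target value.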

Conclusion

Progress has come a long way since the replay buffer of DQN. Performance has
been improved greatly in value-based model-free deep reinforcement learning and
now super-human performance in all 57 Atari games of ALE has been achieved.
Many enhancements that improve coverage, correlation, and convergence have
been developed. The presence of a clear benchmark was instrumental for progress
so that researchers could clearly see which ideas worked and why. The earlier
mazes and navigation games, OpenAI’s Gym [108], and especially the ALE [71],
have enabled this progress.
In the next chapter we will look at the other main branch of model-free reinforce-
ment learning: policy-based algorithms. We will see how they work, and that they
are well suited for a different kind of application, with continuous action spaces.

Summary and Further Reading

This has been the first chapter in which we have seen deep reinforcement learning
algorithms learn complex, high-dimensional, tasks. We end with a summary and
pointers to the literature.

Summary

The methods that have been discussed in the previous chapter were exact, tabular
methods. Most interesting problems have large state spaces that do not fit into
memory. Feature learning identifies states by their common features. Function
values are not calculated exactly, but are approximated, with deep learning.
Much of the recent success of reinforcement learning is due to deep learning
methods. For reinforcement learning a problem arises when states are approximated.
Since in reinforcement learning the next state is determined by the previous state,
algorithms may get stuck in local minima or run in circles when values are shared
with different states.
Another problem is training convergence. Supervised learning has a static dataset
and training targets are also static. In reinforcement learning the loss function
targets depend on the parameters that are being optimized. This causes further
instability. DQN caused a breakthrough by showing that with a replay buffer and a
separate, more stable, target network, enough stability could be found for DQN to
converge and learn how to play Atari arcade games.
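The two stabilizing ingredients can be sketched in a few lines. This is a minimal illustration, not the DQN implementation: the class and function names are hypothetical, and the Q-values are plain lists instead of network outputs.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer; sampling at random breaks up temporal correlation."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # the oldest transitions drop out

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int, rng=random):
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_target, done, gamma=0.99):
    """Training target computed from the separate (slowly updated) target network."""
    return reward if done else reward + gamma * max(next_q_target)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(step, 0, 1.0, step + 1, False)
batch = buffer.sample(2, random.Random(0))
print(len(buffer), batch)
print(td_target(1.0, [0.0, 2.0], done=False, gamma=0.5))
```

Because the target values come from a network whose parameters change only occasionally, the targets stay fixed for a while, as they do in supervised learning.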
Many further improvements to increase stability have been found. The Rainbow
paper implements some of these improvements, and finds that they are complemen-
tary, and together achieve very strong play.

Further Reading

Deep learning revolutionized reinforcement learning. A comprehensive overview of
the field is provided by Dong et al. [201]. For more on deep learning, see Goodfellow
et al. [280], a book with much detail on deep learning; a major journal article is [459].
A brief survey is [27]. Also see Appendix B.
In 2013 the Arcade Learning Environment was presented [71, 494]. Experiment-
ing with reinforcement learning was made even more accessible with OpenAI’s
Gym [108], with clear and easy to use Python bindings.
Deep learning versions of value-based tabular algorithms suffer from conver-
gence and stability problems [787], yet the idea that stable deep reinforcement
learning might be practical took hold with [324, 654]. Zhang et al. [875] study
the deadly triad with experience replay. Deep gradient TD methods were proven
to converge for evaluating a fixed policy [88]. Riedmiller et al. relaxed the fixed
control policy in the neural fitted Q learning algorithm (NFQ) [631]. NFQ builds on
work on stable function approximation [282, 229] and experience replay [481], and
more recently on least-squares policy iteration [441]. In 2013 the first DQN paper
appeared, showing results on a small number of Atari games [522] with the replay
buffer to reduce temporal correlations. In 2015 the followup Nature paper reported
results in more games [523], with a separate target network to improve training
convergence. A well-known overview paper is the Rainbow paper [335, 387].
The use of benchmarks is of great importance for reproducible reinforcement
learning experiments [328, 371, 410, 365]. For TensorFlow and Keras, see [146, 270].

Exercises

We will end this chapter with some questions to review the concepts that we have
covered. Next are programming exercises to get some more exposure on how to
use the deep reinforcement learning algorithms in practice.

Questions

Below are some questions to check your understanding of this chapter. Each question
is a closed question where a simple, single sentence answer is expected.
1. What is Gym?
2. What are the Stable Baselines?
3. The loss function of DQN uses the Q-function as target. What is a consequence?
4. Why is the exploration/exploitation trade-off central in reinforcement learning?
5. Name one simple exploration/exploitation method.
6. What is bootstrapping?
7. Describe the architecture of the neural network in DQN.
8. Why is deep reinforcement learning more susceptible to unstable learning than
deep supervised learning?
9. What is the deadly triad?
10. How does function approximation reduce stability of Q-learning?
11. What is the role of the replay buffer?
12. How can correlation between states lead to local minima?
13. Why should the coverage of the state space be sufficient?
14. What happens when deep reinforcement learning algorithms do not converge?
15. How large is the state space of chess estimated to be? $10^{47}$, $10^{170}$ or $10^{1685}$?
16. How large is the state space of Go estimated to be? $10^{47}$, $10^{170}$ or $10^{1685}$?
17. How large is the state space of StarCraft estimated to be? $10^{47}$, $10^{170}$ or $10^{1685}$?
18. What does the rainbow in the Rainbow paper stand for, and what is the main
message?
19. Mention three Rainbow improvements that are added to DQN.

Exercises

Let us now start with some exercises. If you have not done so already, install
Gym, PyTorch13 or TensorFlow and Keras (see Sect. 2.2.4.1 and B.3.3 or go to the
TensorFlow page).14 Be sure to check the right versions of Python, Gym, TensorFlow,
and the Stable Baselines to make sure that they work well together. The exercises
below are designed to be done with Keras.
1. DQN Implement DQN from the Stable Baselines on Breakout from Gym. Turn
off Dueling and Priorities. Find out what the values are for 𝛼, the training rate,
for 𝜖, the exploration rate, what kind of neural network architecture is used,
what the replay buffer size is, and how frequently the target network is updated.
2. Hyperparameters Change all those hyperparameters, up, and down, and note the
effect on training speed, and the training outcome: how good is the result? How
sensitive is performance to hyperparameter optimization?
3. Cloud Use different computers, experiment with GPU versions to speed up
training, consider Colab, AWS, or another cloud provider with fast GPU (or TPU)
machines.
4. Gym Go to Gym and try different problems. For what kind of problems does
DQN work, what are characteristics of problems for which it works less well?
5. Stable Baselines Go to the Stable Baselines and implement different agent
algorithms. Try Dueling algorithms, Prioritized experience replay, but also other
algorithms, such as Actor critic or policy-based. (These algorithms will be
explained in the next chapter.) Note their performance.
6. Tensorboard With Tensorboard you can follow the training process as it pro-
gresses. Tensorboard works on log files. Try TensorBoard on a Keras exercise and
follow different training indicators. Also try TensorBoard on the Stable Baselines
and see which indicators you can follow.
7. Checkpointing Long training runs in Keras need checkpointing, to save valuable
computations in case of a hardware or software failure. Create a large training
job, and set up checkpointing. Test everything by interrupting the training, and
try to re-load the pre-trained checkpoint to restart the training where it left off.

13 https://pytorch.org
14 https://www.tensorflow.org
Chapter 4
Policy-Based Reinforcement Learning

Some of the most successful applications of deep reinforcement learning have a
continuous action space, such as applications in robotics, self-driving cars, and
real-time strategy games.
The previous chapters introduced value-based reinforcement learning. Value-
based methods find the policy in a two-step process. First they find the best action-
value of a state, for which then the accompanying actions are found (by means of
arg max). This works in environments with discrete actions, where the highest-
valued action is clearly separate from the next-best action. Examples of continuous
action spaces are robot arms that can move over arbitrary angles, or poker bets that
can be any monetary value. In these action spaces value-based methods become
unstable and arg max is not appropriate.
Another approach works better: policy-based methods. Policy-based methods
do not use a separate value function but find the policy directly. They start with a
policy function, which they then improve, episode by episode, with policy gradient
methods. Policy-based methods are applicable to more domains than value-based
methods. They work well with deep neural networks and gradient learning; they
are some of the most popular methods of deep reinforcement learning, and this
chapter introduces you to them.
We start by looking at applications with continuous action spaces. Next, we look
at policy-based agent algorithms. We will introduce basic policy search algorithms,
and the policy gradient theorem. We will also discuss algorithms that combine
value-based and policy-based approaches: the so-called Actor critic algorithms. At
the end of the chapter we discuss larger environments for policy-based methods in
more depth, where we will discuss progress in visuo-motor robotics and locomotion
environments.
The chapter concludes with exercises, a summary, and pointers to further reading.

Core Concepts

• Policy gradient
• Bias-variance trade-off; Actor critic

Core Problem

• Find a low variance continuous action policy directly

Core Algorithms

• REINFORCE (Alg. 4.2)
• Asynchronous advantage actor critic (Alg. 4.4)
• Proximal policy optimization (Sect. 4.2.5)

Jumping Robots

One of the most intricate problems in robotics is learning to walk, or more generally,
how to perform locomotion. Much work has been put into making robots walk, run
and jump. A video of a simulated robot that taught itself to jump over an obstacle
course can be found on YouTube1 [325].
Learning to walk is a challenge that takes human infants months to master. (Cats
and dogs are quicker.) Teaching robots to walk is a challenging problem that is
studied extensively in artificial intelligence and engineering. Movies abound on the
internet of robots that try to open doors, and fall over, or just try to stand upright,
and still fall over.2
Locomotion of legged robots is a difficult sequential decision problem. For each
leg, many different joints are involved. They must be actuated in the right order,
turned with the right force, over the right duration, to the right angle. Most of
these angles, forces, and durations are continuous. The algorithm has to decide
how many degrees, Newtons, and seconds, constitute the optimal policy. All these
actions are continuous quantities. Robot locomotion is a difficult problem, that is
studied frequently in policy-based deep reinforcement learning.
1 https://www.youtube.com/watch?v=hx_bgoTF7bs
2 See, for example, https://www.youtube.com/watch?v=g0TaYhjpOfo.

4.1 Continuous Problems

In this chapter, our actions are continuous, and stochastic. We will discuss both of
these aspects, and some of the challenges they pose. We will start with continuous
action policies.

4.1.1 Continuous Policies

In the previous chapter we discussed environments with large state spaces. We will
now move our attention to action spaces. The action spaces of the problems that
we have seen so far—Grid worlds, mazes, and high-dimensional Atari games—were
actually action spaces that were small and discrete—we could walk north, east, west,
south, or we could choose from 9 joystick movements. In board games such as chess
the action space is larger, but still discrete. When you move your pawn to e4, you
do not move it to e4½.
In this chapter the problems are different. Steering a self-driving car requires
turning the steering wheel over a certain angle, for a certain duration, and with a
certain angular velocity, to prevent jerky movements. Throttle movements should
also be smooth and continuous.
Actuation of robot joints is continuous, as we mentioned in the introduction of
this chapter. An arm joint can move 1 degree, 2 degrees, or 90 or 180 degrees or
anything in between.
An action in a continuous space is not one of a set of discrete choices, such as
{𝑁, 𝐸, 𝑊, 𝑆}, but rather a value over a continuous range, such as [0, 2𝜋] or R+ ; the
number of possible values is infinite. How can we find the optimum value in an
infinite space in a finite amount of time? Trying out all possible combinations of
setting joint 1 to 𝑥 degrees and applying force 𝑦 in motor 2 will take infinitely long.
A solution could be to discretize the actions, although that introduces potential
quantization errors.
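The quantization error of discretization is easy to see in a small sketch. The action range [0, 2𝜋] is taken from the text; the number of bins, the desired action, and the helper names are hypothetical.

```python
import math


def discretize(low: float, high: float, bins: int):
    """Return `bins` evenly spaced actions that cover the range [low, high]."""
    step = (high - low) / (bins - 1)
    return [low + i * step for i in range(bins)]


def nearest(actions, target: float) -> float:
    """The discrete action that lies closest to the desired continuous action."""
    return min(actions, key=lambda a: abs(a - target))


actions = discretize(0.0, 2 * math.pi, bins=9)  # steps of pi/4 radians
step = actions[1] - actions[0]
desired = 1.0                                   # the angle that we would like to set
chosen = nearest(actions, desired)
error = abs(chosen - desired)                   # quantization error
print(chosen, error, step / 2)
```

The error is bounded by half the step size. Using more bins reduces the error, but enlarges the discrete action space that a value-based method has to search.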
When actions are not discrete, the arg max operation cannot be used to identify
“the” best action, and value-based methods are no longer sufficient. Policy-based
methods find suitable continuous or stochastic policies directly, without the inter-
mediate step of a value function and the need for the arg max operation to construct
the final policy.

4.1.2 Stochastic Policies

We will now turn to the modeling of stochastic policies.

When a robot moves its hand to open a door, it must judge the distance correctly.
A small error, and it may fail (as many movie clips show).3 Stochastic environments
cause stability problems for value-based methods [480]. Small perturbations in
Q-values may lead to large changes in the policy of value-based methods. Con-
vergence can typically only be achieved at slow learning rates, to smooth out the
randomness. A stochastic policy (a target distribution) does not suffer from this
problem. Stochastic policies have another advantage. By their nature they perform
exploration, without the need to separately code 𝜖-greediness or other exploration
methods, since a stochastic policy returns a distribution over actions.
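The following sketch illustrates how sampling from a policy distribution explores by itself. It assumes two common parameterizations that the text does not prescribe: a softmax over logits for discrete actions, and a Gaussian for continuous actions. All names and numbers are illustrative.

```python
import math
import random


def softmax(logits):
    """Turn arbitrary real-valued logits into a probability distribution."""
    m = max(logits)                          # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_discrete(logits, rng):
    """Sample an action index from the softmax distribution over the logits."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_continuous(mean, std, rng):
    """Sample a continuous action from a Gaussian policy."""
    return rng.gauss(mean, std)


rng = random.Random(0)
draws = [sample_discrete([2.0, 0.0, 0.0], rng) for _ in range(1000)]
print([draws.count(a) for a in range(3)])
print(sample_continuous(mean=0.5, std=0.1, rng=rng))
```

The preferred action is chosen most of the time, but the other actions are still tried occasionally, without a separate 𝜖-greedy rule.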
Policy-based methods find suitable stochastic policies directly. A potential dis-
advantage of purely episodic policy-based methods is that they are high-variance;
they may find local optima instead of global optima, and converge slower than
value-based methods. Newer (actor critic) methods, such as A3C, TRPO, and PPO,
were designed to overcome these problems. We will discuss these algorithms later
in this chapter.
Before we explain these policy-based agent algorithms, we will have a closer
look at some of the applications for which they are needed.

4.1.3 Environments: Gym and MuJoCo

Robotic experiments play an important role in reinforcement learning. However,
because of the cost associated with real-world robotics experiments, simulated
robotics systems are often used in reinforcement learning. This is especially
important in model-free methods, which tend to have a high sample complexity
(real robots wear
down when trials run in the millions). These software simulators model behavior
of the robot and the effects on the environment, using physics models. This pre-
vents the expense of real experiments with real robots, although some precision
is lost to modeling error. Two well-known physics models are MuJoCo [780] and
PyBullet [167]. They can be used easily via the Gym environment.

4.1.3.1 Robotics

Most robotic applications are more complicated than the classics such as mazes,
Mountain car and Cart pole. Robotic control decisions involve more joints, directions
of travel, and degrees of freedom, than a single cart that moves in one dimension.
Typical problems involve learning of visuo-motor skills (eye-hand coordination,
grasping), or learning of different locomotion gaits of multi-legged “animals.” Some
examples of grasping and walking are illustrated in Fig. 4.1.
3 Even worse, when a robot thinks it stands still, it may actually be in the process of falling over
(and, of course, robots cannot think, they only wish they could).

Fig. 4.1 Robot Grasping and Gait [501]

The environments for these actions are unpredictable to a certain degree: they
require reactions to disturbances such as bumps in the road, or the moving of objects
in a scene.

4.1.3.2 Physics Models

Simulating robot motion involves modeling forces, acceleration, velocity, and move-
ment. It also includes modeling mass and elasticity for bouncing balls, tactile/grasp-
ing mechanics, and the effect of different materials. A physics mechanics model
needs to simulate the result of actions in the real world. Among the goals of such a
simulation is to model grasping, locomotion, gaits, and walking and running (see
also Sect. 4.3.1).
The simulations should be accurate. Furthermore, since model-free learning
algorithms often involve millions of actions, it is important that the physics
simulations are fast. Many different physics environments for model-based robotics
have been created, among them Bullet, Havok, ODE and PhysX; see [228] for a
comparison. Of the models, MuJoCo [780] and PyBullet [167] are the most popular
in reinforcement learning; MuJoCo in particular is used in many experiments.
Although MuJoCo calculations are deterministic, the initial state of environments
is typically randomized, resulting in an overall non-deterministic environment.
Despite many code optimizations in MuJoCo, simulating physics is still an expensive
proposition. Most MuJoCo experiments in the literature therefore are based on
stick-like entities, that simulate limited motions, in order to limit the computational
demands.
Figures 4.2 and 4.3 illustrate a few examples of some of the common Gym/MuJoCo
problems that are often used in reinforcement learning: Ant, Half-cheetah, and
Humanoid.

Fig. 4.2 Gym MuJoCo Ant and Half-Cheetah [108]

Fig. 4.3 Gym MuJoCo Humanoid

4.1.3.3 Games

In real-time video games and certain card games the decisions are also continuous.
For example, in some variants of poker, the size of monetary bets can be any amount,
which makes the action space quite large (although strictly speaking still discrete).
In games such as StarCraft and Capture the Flag, aspects of the physical world are
modeled, and movement of agents can vary in duration and speed. The environment
for these games is also stochastic: some information is hidden from the agent. This
increases the size of the state space greatly. We will discuss these games in Chap. 7
when we discuss multi-agent methods.

4.2 Policy-Based Agents

Now that we have discussed the problems and environments that are used with
policy-based methods, it is time to see how policy-based algorithms work. Policy-
based methods are a popular approach in model-free deep reinforcement learning.
Many algorithms have been developed that perform well. Table 4.1 lists some of
the better known algorithms that will be covered in this chapter.

Name        Approach                                     Ref
REINFORCE   Policy-gradient optimization                 [844]
A3C         Distributed Actor Critic                     [521]
DDPG        Derivative of continuous action function     [480]
TRPO        Dynamically sized step size                  [681]
PPO         Improved TRPO, first order                   [683]
SAC         Variance-based Actor Critic for robustness   [306]

Table 4.1 Policy-Based Algorithms: REINFORCE, Asynchronous Advantage Actor Critic, Deep
Deterministic Policy Gradient, Trust Region Policy Optimization, Proximal Policy Optimization,
Soft Actor Critic

We will first provide an intuitive explanation of the idea behind the basic policy-
based approach. Then we will discuss some of the theory behind it, as well as
advantages and disadvantages of the basic policy-based approach. Most of these
disadvantages are alleviated by the actor critic method, that is discussed next.
Let us start with the basic idea behind policy-based methods.

4.2.1 Policy-Based Algorithm: REINFORCE

Policy-based approaches learn a parameterized policy that selects actions without
consulting a value function.4 In policy-based methods the policy function is
represented directly, allowing policies to select a continuous action, something that is
difficult to do in value-based methods.

The Supermarket: To build some intuition on the nature of policy-based
methods, let us think back to the supermarket navigation task that
we used in Chap. 2. In this navigation problem we can try to assess our
current distance to the supermarket with the Q-value-function, as we have
done before. The Q-value assesses the distance of each direction to take; it
tells us how far each action is from the goal. We can then use this distance
function to find our path.
In contrast, the policy-based alternative would be to ask a local the way,
who tells us, for example, to go straight and then left and then right at the
Opera House and straight until we reach the supermarket on our left. The
local just gave us a full path to follow, without having to infer which action
was the closest and then use that information to determine the way to go.
We can subsequently try to improve this full trajectory.

4 Policy-based methods may use a value function to learn the policy parameters 𝜃, but do not use
it for action selection.

Let us see how we can optimize such a policy directly, without the intermediate
step of the Q-function. We will develop a first, generic, policy-based algorithm
to see how the pieces fit together. The explanation will be intuitive in nature.
The basic framework for policy-based algorithms is straightforward. We start
with a parameterized policy function 𝜋 𝜃 . We first (1) initialize the parameters 𝜃
of the policy function, (2) sample a new trajectory 𝜏, (3) if 𝜏 is a good trajectory,
increase the parameters 𝜃 towards 𝜏, otherwise decrease them, and (4) keep going
until convergence. Algorithm 4.1 provides a framework in pseudocode. Please note
the similarity with the codes in the previous chapter (Listing 3.1–3.3), and especially
the deep learning algorithms, where we also optimized function parameters in a
loop.
The policy is represented by a set of parameters 𝜃 (these can be the weights in a
neural network). Together, the parameters 𝜃 map the states 𝑆 to action probabilities
𝐴. When we are given a set of parameters, how should we adjust them to improve
the policy? The basic idea is to randomly sample a new policy, and if it is better,
adjust the parameters a bit in the direction of this new policy (and away if it is
worse). Let us see in more detail how this idea works.
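As a first illustration, this basic idea can be sketched on a toy problem. The sketch makes strong simplifying assumptions: the quality of a parameter vector is given directly by a hypothetical function `quality`, instead of being estimated from sampled trajectories, and the noise and step sizes are arbitrary choices.

```python
import random


def policy_search(quality, theta, iterations=500, noise=0.1, step=0.5, seed=0):
    """Sample a nearby candidate; move towards it if it is better, away if it is worse."""
    rng = random.Random(seed)
    for _ in range(iterations):
        candidate = [p + rng.gauss(0.0, noise) for p in theta]
        direction = 1.0 if quality(candidate) > quality(theta) else -1.0
        theta = [p + direction * step * (c - p) for p, c in zip(theta, candidate)]
    return theta


def quality(theta):
    """Toy measure of quality with its optimum at theta = (3, -1)."""
    return -((theta[0] - 3.0) ** 2) - (theta[1] + 1.0) ** 2


start = [0.0, 0.0]
found = policy_search(quality, start)
print(found, quality(found))
```

The search needs no gradient, only comparisons of quality. The gradient-based methods that follow use the same loop, but compute the direction of improvement instead of guessing it.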
To know which policy is best, we need some kind of measure of its quality. We
denote the quality of the policy that is defined by the parameters as 𝐽 (𝜃). It is
natural to use the value function of the start state as our measure of quality

$$J(\theta) = V^{\pi}(s_0).$$

We wish to maximize 𝐽 (·). When the objective is differentiable in the parameters,
all we need to do is to follow the gradient

$$\nabla_\theta J(\theta) = \nabla_\theta V^{\pi}(s_0)$$

of this expression to maximize our objective function 𝐽 (·).

Policy-based methods apply gradient-based optimization, using the derivative
of the objective to find the optimum. Since we are maximizing, we apply gradient
ascent. In each time step 𝑡 of the algorithm we perform the following update:

$$\theta_{t+1} = \theta_t + \alpha \cdot \nabla_\theta J(\theta)$$

for learning rate 𝛼 ∈ R+ and performance objective 𝐽, see the gradient ascent
algorithm in Alg. 4.1.
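The gradient ascent update can be sketched as follows. To keep the example self-contained, it assumes that the gradient of a toy objective is available in closed form; in reinforcement learning the gradient has to be estimated from sampled trajectories. The names and the objective are hypothetical.

```python
def gradient_ascent(grad, theta, alpha=0.1, epsilon=1e-6, max_iters=10_000):
    """Repeat theta <- theta + alpha * grad(theta) until the gradient is below epsilon."""
    for _ in range(max_iters):
        g = grad(theta)
        if max(abs(x) for x in g) < epsilon:
            break
        theta = [p + alpha * gi for p, gi in zip(theta, g)]
    return theta


def grad_j(theta):
    """Gradient of the toy objective J(theta) = -(theta_0 - 3)^2 - (theta_1 + 1)^2."""
    return [-2.0 * (theta[0] - 3.0), -2.0 * (theta[1] + 1.0)]


theta_star = gradient_ascent(grad_j, [0.0, 0.0])
print(theta_star)
```

Since we add the gradient instead of subtracting it, the parameters climb towards the maximum of the objective, here at (3, −1).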
Remember that 𝜋 𝜃 (𝑎|𝑠) is the probability of taking action 𝑎 in state 𝑠. This
function 𝜋 is represented by a neural network with parameters 𝜃, mapping states 𝑆
at the input side of the network to action probabilities on the output side of the
network. The parameters 𝜃 determine the mapping of our function 𝜋. Our goal is to
update the parameters so that 𝜋 𝜃 becomes the optimal policy. The better the action
𝑎 is, the more we want to push the parameters 𝜃 towards it.
If we somehow knew the optimal action 𝑎★, then we could use the gradient to push
the parameters 𝜃 𝑡 of the policy, at each step 𝑡 of the trajectory, in
the direction of the optimal action, as follows
$$\theta_{t+1} = \theta_t + \alpha \nabla \pi_{\theta_t}(a^\star|s).$$

Algorithm 4.1 Gradient ascent optimization

Input: a differentiable objective $J(\theta)$, learning rate $\alpha \in \mathbb{R}^+$, threshold $\epsilon \in \mathbb{R}^+$
Initialization: randomly initialize $\theta$ in $\mathbb{R}^d$
repeat
    Sample trajectory $\tau$ and compute gradient $\nabla_\theta J(\theta)$
    $\theta \leftarrow \theta + \alpha \cdot \nabla_\theta J(\theta)$
until $\nabla_\theta J(\theta)$ converges below $\epsilon$
return parameters $\theta$

Unfortunately, we do not know which action is best. We can, however, take a sample
trajectory and use estimates of the value of the actions of the sample. This estimate
can use the regular $\hat{Q}$ function from the previous chapter, or the discounted return
function, or an advantage function (to be introduced shortly). Then, by multiplying
the push of the parameters (the probability) with our estimate, we get

$$\theta_{t+1} = \theta_t + \alpha \hat{Q}(s,a) \nabla \pi_{\theta_t}(a|s).$$

A problem with this formula is that not only are we going to push harder on actions
with a high value, but also more often, because the policy 𝜋 𝜃𝑡 (𝑎|𝑠) is the probability
of action 𝑎 in state 𝑠. Good actions are thus doubly improved, which may cause
instability. We can correct for this by dividing by the probability of the action:

$$\theta_{t+1} = \theta_t + \alpha \hat{Q}(s,a) \frac{\nabla \pi_{\theta_t}(a|s)}{\pi_{\theta_t}(a|s)}.$$

In fact, we have now almost arrived at the classic policy-based algorithm, REIN-
FORCE, introduced by Williams in 1992 [844]. In this algorithm our formula is
expressed in a way that is reminiscent of a logarithmic cross-entropy loss function.
We can arrive at such a log-formulation by using the basic fact from calculus that

$$\nabla \log f(x) = \frac{\nabla f(x)}{f(x)}.$$

Substituting this formula into our equation, we arrive at

$$\theta_{t+1} = \theta_t + \alpha \hat{Q}(s,a) \nabla_\theta \log \pi_\theta(a|s).$$
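The log-derivative identity can be checked numerically. The sketch below assumes a softmax policy over three actions in a single state (a parameterization chosen for illustration only). For this policy the gradient of log 𝜋 has a simple closed form, which we compare against a finite-difference estimate of ∇𝜋 divided by 𝜋.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(theta, a):
    """Analytic gradient for a softmax policy: d/d theta_i log pi(a) = 1[i == a] - pi_i."""
    probs = softmax(theta)
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(probs)]


def grad_pi_over_pi(theta, a, h=1e-6):
    """Central finite-difference estimate of grad pi(a), divided by pi(a)."""
    base = softmax(theta)[a]
    result = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        result.append((softmax(up)[a] - softmax(down)[a]) / (2 * h) / base)
    return result


theta = [0.5, -1.0, 2.0]
analytic = grad_log_pi(theta, a=1)
numeric = grad_pi_over_pi(theta, a=1)
print(analytic)
print(numeric)
```

Both computations agree up to numerical error. In practice the log form is preferred, since automatic differentiation frameworks compute it directly and it avoids dividing by small probabilities.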

This formula is indeed the core of REINFORCE, the prototypical policy-based
algorithm, which is shown in full in Alg. 4.2, with discounted cumulative reward.
To summarize, the REINFORCE formula pushes the parameters of the policy
in the direction of the better actions, in proportion to the size of the
estimated action-value, which serves as our estimate of which action is best.
We have arrived at a method to improve a policy that can be used directly to
indicate the action to take. The method works whether the action is discrete,
continuous, or stochastic, without having to go through intermediate value or
arg max functions to find it. Algorithm 4.2 shows the full algorithm, which is called
Monte Carlo policy gradient. The algorithm is called Monte Carlo because it
samples a trajectory.

Algorithm 4.2 Monte Carlo policy gradient (REINFORCE) [844]

Input: A differentiable policy $\pi_\theta(a|s)$, learning rate $\alpha \in \mathbb{R}^+$, threshold $\epsilon \in \mathbb{R}^+$
Initialization: Initialize parameters $\theta$ in $\mathbb{R}^d$
repeat
    Generate full trace $\tau = \{s_0, a_0, r_0, s_1, \ldots, s_T\}$ following $\pi_\theta(a|s)$
    for $t \in 0, \ldots, T-1$ do    ⊲ Do for each step of the episode
        $R \leftarrow \sum_{k=t}^{T-1} \gamma^{k-t} \cdot r_k$    ⊲ Sum Return from trace
        $\theta \leftarrow \theta + \alpha \gamma^{t} R \, \nabla_\theta \log \pi_\theta(a_t|s_t)$    ⊲ Adjust parameters
    end for
until $\nabla_\theta J(\theta)$ converges below $\epsilon$
return Parameters $\theta$
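To see the pieces of REINFORCE work together, here is a minimal sketch on a hypothetical toy problem: a corridor with states 0 to 3, where the agent starts in state 0 and receives a reward of 1 on reaching state 3. The environment, the tabular softmax policy, the fixed number of episodes (instead of a convergence threshold), and all hyperparameter values are assumptions made for illustration.

```python
import math
import random

GOAL = 3  # hypothetical corridor: states 0..3, start in state 0, reward 1 at the goal


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_episode(theta, rng, max_steps=50):
    """Follow pi_theta and return the trace as a list of (state, action, reward)."""
    state, trace = 0, []
    for _ in range(max_steps):
        probs = softmax(theta[state])
        action = rng.choices([0, 1], weights=probs, k=1)[0]  # 0 = left, 1 = right
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == GOAL else 0.0
        trace.append((state, action, reward))
        state = next_state
        if state == GOAL:
            break
    return trace


def reinforce(episodes=2000, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(GOAL)]  # two logits per non-terminal state
    for _ in range(episodes):
        trace = sample_episode(theta, rng)
        for t, (s, a, _) in enumerate(trace):
            # Discounted return from step t to the end of the episode.
            ret = sum(gamma ** (k - t) * trace[k][2] for k in range(t, len(trace)))
            probs = softmax(theta[s])
            for i in range(2):
                grad_log = (1.0 if i == a else 0.0) - probs[i]  # grad of log pi(a|s)
                theta[s][i] += alpha * gamma ** t * ret * grad_log
    return theta


theta = reinforce()
p_right = [softmax(row)[1] for row in theta]
print(p_right)
```

Shorter episodes reach the goal with a higher discounted return, so the actions in those episodes are reinforced more strongly, and the probability of moving right grows in every state.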

Online and Batch

The versions of gradient ascent (Alg. 4.1) and REINFORCE (Alg. 4.2) that we show,
update the parameters inside the innermost loop. All updates are performed as the
time steps of the trajectory are traversed. This method is called the online approach.
When multiple processes work in parallel to update data, the online approach makes
sure that information is used as soon as it is known.
The policy gradient algorithm can also be formulated in batch-fashion: all gradi-
ents are summed over the states and actions, and the parameters are updated at the
end of the trajectory. Since parameter updates can be expensive, the batch approach
can be more efficient. An intermediate form that is frequently applied in practice
is to work with mini-batches, trading off computational efficiency for information
efficiency.
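The difference between the online and the batch approach can be shown with a one-parameter sketch. The gradient function and the samples are hypothetical; what matters is that the gradient depends on the current value of the parameter.

```python
def online_update(theta, grad_fn, samples, alpha):
    """Update after every sample; later gradients already see the new parameter."""
    for x in samples:
        theta = theta + alpha * grad_fn(theta, x)
    return theta


def batch_update(theta, grad_fn, samples, alpha):
    """Sum all gradients at the old parameter, then update once."""
    total = sum(grad_fn(theta, x) for x in samples)
    return theta + alpha * total


def grad_fn(theta, x):
    """Gradient of the toy objective -(theta - x)^2 / 2 for a single sample x."""
    return x - theta


samples = [1.0, 2.0, 3.0]
online = online_update(0.0, grad_fn, samples, alpha=0.1)
batch = batch_update(0.0, grad_fn, samples, alpha=0.1)
print(online, batch)
```

The two results differ slightly, because the online version uses each piece of information as soon as it is known, while the batch version performs a single, cheaper parameter update.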
Let us now take a step back and look at the algorithm and assess how well it
works.

4.2.2 Bias-Variance Trade-Off in Policy-Based Methods

Now that we have seen the principles behind a policy-based algorithm, let us
see how policy-based algorithms work in practice, and compare advantages and
disadvantages of the policy-based approach.
Let us start with the advantages. First of all, parameterization is at the core of
policy-based methods, making them a good match for deep learning. For value-
based methods deep learning had to be retrofitted, giving rise to complications
as we saw in Sect. 3.2.3. Second, policy-based methods can easily find stochastic
policies; value-based methods find deterministic policies. Due to their stochastic
nature, policy-based methods naturally explore, without the need for methods such
as 𝜖-greedy, or more involved methods, that may require tuning to work well. Third,
policy-based methods are effective in large or continuous action spaces. Small
changes in 𝜃 lead to small changes in 𝜋, and to small changes in state distributions
(they are smooth). Policy-based algorithms do not suffer (as much) from convergence
and stability issues that are seen in arg max-based algorithms in large or continuous
action spaces.
On the other hand, there are disadvantages to the episodic Monte Carlo version
of the REINFORCE algorithm. Remember that REINFORCE generates a full random
episode in each iteration, before it assesses the quality. (Value-based methods use a
reward to select the next action in each time step of the episode.) Because of this,
policy-based methods are low bias, since full random trajectories are generated.
However, they are also high variance, since the full trajectory is generated randomly
(whereas value-based methods use the value for guidance at each selection step).
What are the consequences?
First, policy evaluation of full trajectories has low sample efficiency and high
variance. As a consequence, policy improvement happens infrequently, leading to
slow convergence compared to value-based methods. Second, this approach often
finds a local optimum, since convergence to the global optimum takes too long.
Much research has been performed to address the high variance of the episode-
based vanilla policy gradient [57, 421, 420, 293]. The enhancements that have
been found have greatly improved performance, so much so that policy-based
approaches—such as A3C, PPO, SAC, DDPG—have become favorite model-free
reinforcement learning algorithms for many applications. The enhancements to
reduce high variance that we discuss are:
• Actor critic introduces within-episode value-based critics based on temporal
difference value bootstrapping;
• Baseline subtraction introduces an advantage function to lower variance;
• Trust regions reduce large policy parameter changes;
• Exploration is crucial to get out of local optima and for more robust results; high
entropy action distributions are often used.
Let us have a look at these enhancements.

4.2.3 Actor Critic Bootstrapping

The actor critic approach combines value-based elements with the policy-based
method. The actor stands for the action, or policy-based, approach; the critic stands
for the value-based approach [743].
Action selection in episodic REINFORCE is random, and hence low bias. However,
variance is high, since the full episode is sampled (the size and direction of the
update can strongly vary between different samples). The actor critic approach is
designed to combine the advantage of the value-based approach (low variance) with
the advantage of the policy-based approach (low bias). Actor critic methods are
popular because they work well. It is an active field where many different algorithms
have been developed.

Algorithm 4.3 Actor critic with bootstrapping

Input: A policy $\pi_\theta(a|s)$, a value function $V_\phi(s)$
       An estimation depth $n$, learning rate $\alpha$, number of episodes $M$
Initialization: Randomly initialize $\theta$ and $\phi$
repeat
    for $i \in 1, \ldots, M$ do
        Sample trace $\tau = \{s_0, a_0, r_0, s_1, \ldots, s_T\}$ following $\pi_\theta(a|s)$
        for $t \in 0, \ldots, T-1$ do
            $\hat{Q}_n(s_t, a_t) = \sum_{k=0}^{n-1} \gamma^k \cdot r_{t+k} + \gamma^n \cdot V_\phi(s_{t+n})$    ⊲ 𝑛-step target
        end for
    end for
    $\phi \leftarrow \phi - \alpha \cdot \nabla_\phi \sum_t \big( \hat{Q}_n(s_t, a_t) - V_\phi(s_t) \big)^2$    ⊲ Descent value loss
    $\theta \leftarrow \theta + \alpha \cdot \sum_t [ \hat{Q}_n(s_t, a_t) \cdot \nabla_\theta \log \pi_\theta(a_t|s_t) ]$    ⊲ Ascent policy gradient
until $\nabla_\theta J(\theta)$ converges below $\epsilon$
return Parameters $\theta$
The variance of policy methods can originate from two sources: (1) high variance
in the cumulative reward estimate, and (2) high variance in the gradient estimate.
For both problems a solution has been developed: bootstrapping for better reward
estimates, and baseline subtraction to lower the variance of gradient estimates. Both
of these methods use the learned value function, which we denote by 𝑉 𝜙 (𝑠). The
value function can use a separate neural network, with separate parameters 𝜙, or it
can use a value head on top of the actor parameters 𝜃. In this case the actor and
the critic share the lower layers of the network, and the network has two separate
top heads: a policy and a value head. We will use 𝜙 for the parameters of the value
function, to discriminate them from the policy parameters 𝜃.

Temporal Difference Bootstrapping

To reduce the variance of the policy gradient, we can increase the number of traces
𝑀 that we sample. However, the possible number of different traces is exponential
in the length of the trace for a given stochastic policy, and we cannot afford to
sample them all for one update. In practice the number of sampled traces 𝑀 is small,
sometimes even 𝑀 = 1, updating the policy parameters from a single trace. The
return of the trace depends on many random action choices; the update has high
variance. A solution is to use a principle that we know from temporal difference
learning, to bootstrap the value function step by step. Bootstrapping uses the value
function to compute intermediate 𝑛-step values per episode, trading off variance
for bias. The 𝑛-step values are in-between full-episode Monte Carlo and single step
temporal difference targets.
We can use bootstrapping to compute an 𝑛-step target (with discount factor 𝛾, as
in Alg. 4.3)

$$\hat{Q}_n(s_t, a_t) = \sum_{k=0}^{n-1} \gamma^k \cdot r_{t+k} + \gamma^n \cdot V_\phi(s_{t+n}),$$

and we can then update the value function, for example on a squared loss

$$\mathcal{L}(\phi \mid s_t, a_t) = \big( \hat{Q}_n(s_t, a_t) - V_\phi(s_t) \big)^2$$

and update the policy with the standard policy gradient but with that (improved)
value $\hat{Q}_n$

$$\nabla_\theta \mathcal{L}(\theta \mid s_t, a_t) = \hat{Q}_n(s_t, a_t) \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$

We are now using the value function prominently in the algorithm, which is param-
eterized by a separate set of parameters, denoted by 𝜙; the policy parameters are
still denoted by 𝜃. The use of both policy and value is what gives the actor critic
approach its name.
An example algorithm is shown in Alg. 4.3. When we compare this algorithm
with Alg. 4.2, we see how the policy gradient ascent update now uses the 𝑛-step 𝑄ˆ 𝑛
value estimate instead of the trace return 𝑅. We also see that this time the parameter
updates are in batch mode, with separate summations.
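
To make the 𝑛-step target concrete, the following minimal Python sketch computes it for every time step of a single trace. The function name `n_step_targets`, the list-based representation of the trace, and the choice to truncate the horizon at the end of the trace are assumptions made for illustration; Alg. 4.3 does not prescribe them.

```python
def n_step_targets(rewards, values, n, gamma):
    """Return the n-step target Q_n(s_t, a_t) for each step t of one trace.

    rewards[t] is r_t for t = 0..T-1.
    values[t] is V_phi(s_t) for t = 0..T (use 0.0 for a terminal s_T).
    """
    T = len(rewards)
    targets = []
    for t in range(T):
        # Assumption: near the end of the trace we bootstrap from s_T instead
        # of looking n steps ahead, since no later states exist.
        horizon = min(n, T - t)
        q = sum(gamma ** k * rewards[t + k] for k in range(horizon))
        q += gamma ** horizon * values[t + horizon]  # bootstrap from V_phi
        targets.append(q)
    return targets


rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5, 0.0]  # V(s_0) .. V(s_3); s_3 is terminal
one_step = n_step_targets(rewards, values, n=1, gamma=0.9)
full = n_step_targets(rewards, values, n=3, gamma=0.9)
print(one_step)  # single-step temporal difference targets
print(full)      # with n >= T the targets equal the Monte Carlo returns
```

Small values of 𝑛 lean on the (possibly inaccurate) value function, while large values of 𝑛 lean on the sampled rewards; this is the bias-variance dial described above.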

4.2.4 Baseline Subtraction with Advantage Function

Another method to reduce the variance of the policy gradient is by baseline
subtraction. Subtracting a baseline from the sampled returns reduces the variance
of the gradient estimate, but leaves its expectation unaffected. Assume, in a given
state with three available actions,
that we sample action returns of 65, 70, and 75, respectively. Policy gradient will
then try to push the probability of each action up, since the return for each action is
positive. The above method may lead to a problem, since we are pushing all actions
up (only somewhat harder on one of them). It might be better if we only push up
on actions that are higher than the average (action 75 is higher than the average of
70 in this example), and push down on actions that are below average (65 in this
example). We can do so through baseline subtraction.
The most common choice for the baseline is the value function. When we subtract
the value 𝑉 from a state-action value estimate 𝑄, the function is called the advantage
function:
𝐴(𝑠𝑡 , 𝑎 𝑡 ) = 𝑄(𝑠𝑡 , 𝑎 𝑡 ) − 𝑉 (𝑠𝑡 ).
The 𝐴 function subtracts the value of the state 𝑠 from the state-action value. It now
estimates how much better a particular action is compared to the expectation of a
particular state.
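
The three-action example can be worked through in a few lines of Python. This is only an illustrative sketch: the mean of the three sampled returns stands in for the value function 𝑉, and the mean squared weight is used as a rough proxy for how strongly the weights scale the noise in the gradient estimate.

```python
def mean(xs):
    return sum(xs) / len(xs)


returns = [65.0, 70.0, 75.0]   # sampled returns of the three actions
baseline = mean(returns)       # stand-in for V(s) in this toy example
advantages = [r - baseline for r in returns]

# Without a baseline every weight is positive, so every action is pushed up.
# With the baseline only the above-average action is pushed up.
raw_scale = mean([r * r for r in returns])
centered_scale = mean([a * a for a in advantages])

print(advantages)                 # signs now separate good from bad actions
print(raw_scale, centered_scale)  # the centered weights are far smaller
```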
We can combine baseline subtraction with any bootstrapping method to estimate
the cumulative reward $\hat{Q}(s_t, a_t)$. We compute

$$\hat{A}_n(s_t, a_t) = \hat{Q}_n(s_t, a_t) - V_\phi(s_t)$$

and update the policy with

$$\nabla_\theta \mathcal{L}(\theta \mid s_t, a_t) = \hat{A}_n(s_t, a_t) \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$

We have now seen the ingredients to construct a full actor critic algorithm. An
example algorithm is shown in Alg. 4.4.

Algorithm 4.4 Actor critic with bootstrapping and baseline subtraction

Input: A policy $\pi_\theta(a|s)$, a value function $V_\phi(s)$
       An estimation depth $n$, learning rate $\alpha$, number of episodes $M$
Initialization: Randomly initialize $\theta$ and $\phi$
while not converged do
    for $i = 1, \ldots, M$ do
        Sample trace $\tau = \{s_0, a_0, r_0, s_1, \ldots, s_T\}$ following $\pi_\theta(a|s)$
        for $t = 0, \ldots, T-1$ do
            $\hat{Q}_n(s_t, a_t) = \sum_{k=0}^{n-1} \gamma^k \cdot r_{t+k} + \gamma^n \cdot V_\phi(s_{t+n})$    ⊲ 𝑛-step target
            $\hat{A}_n(s_t, a_t) = \hat{Q}_n(s_t, a_t) - V_\phi(s_t)$    ⊲ Advantage
        end for
    end for
    $\phi \leftarrow \phi - \alpha \cdot \nabla_\phi \sum_t \big( \hat{A}_n(s_t, a_t) \big)^2$    ⊲ Descent Advantage loss
    $\theta \leftarrow \theta + \alpha \cdot \sum_t [ \hat{A}_n(s_t, a_t) \cdot \nabla_\theta \log \pi_\theta(a_t|s_t) ]$    ⊲ Ascent policy gradient
end while
return Parameters $\theta$
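
The claim that baseline subtraction leaves the expectation of the policy gradient unaffected can be checked with a short derivation. The sketch below assumes a discrete action space and a baseline $b(s)$ that depends on the state only (such as $V_\phi(s)$); the symbol $b$ is introduced here for illustration.

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}
  \big[ b(s)\, \nabla_\theta \log \pi_\theta(a|s) \big]
  &= b(s) \sum_a \pi_\theta(a|s)\,
     \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} \\
  &= b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s) \\
  &= b(s)\, \nabla_\theta 1 = 0 .
\end{aligned}
```

Since the baseline term contributes zero in expectation, replacing $\hat{Q}_n$ by $\hat{A}_n = \hat{Q}_n - V_\phi$ changes only the variance of the estimate, not its mean.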

Generic Policy Gradient Formulation

With these two ideas we can formulate an entire spectrum of policy gradient
methods, depending on the type of cumulative reward estimate that they use. In
general, the policy gradient estimator takes the following form, where we now
introduce a new target Ψ𝑡 that we sample from the trajectories 𝜏:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau_0 \sim p_\theta(\tau_0)} \Big[ \sum_{t=0}^{n} \Psi_t \nabla_\theta \log \pi_\theta(a_t|s_t) \Big]$$

There is a variety of potential choices for Ψ𝑡, based on the use of bootstrapping and
baseline subtraction (written with the same indices and the same value function
$V_\phi$ as in Alg. 4.3 and Alg. 4.4):

$$\begin{aligned}
\Psi_t &= \hat{Q}_{MC}(s_t, a_t) = \sum_{k=0}^{\infty} \gamma^k \cdot r_{t+k} && \text{Monte Carlo target} \\
\Psi_t &= \hat{Q}_n(s_t, a_t) = \sum_{k=0}^{n-1} \gamma^k \cdot r_{t+k} + \gamma^n V_\phi(s_{t+n}) && \text{bootstrap ($n$-step target)} \\
\Psi_t &= \hat{A}_{MC}(s_t, a_t) = \sum_{k=0}^{\infty} \gamma^k \cdot r_{t+k} - V_\phi(s_t) && \text{baseline subtraction} \\
\Psi_t &= \hat{A}_n(s_t, a_t) = \sum_{k=0}^{n-1} \gamma^k \cdot r_{t+k} + \gamma^n V_\phi(s_{t+n}) - V_\phi(s_t) && \text{baseline + bootstrap} \\
\Psi_t &= Q_\phi(s_t, a_t) && \text{Q-value approximation}
\end{aligned}$$

Actor critic algorithms are among the most popular model-free reinforcement
learning algorithms in practice, due to their good performance. After having dis-
cussed relevant theoretical background, it is time to look at how actor critic can be
implemented in a practical, high performance, algorithm. We will start with A3C.

Asynchronous Advantage Actor Critic

Many high performance implementations are based on the actor critic approach. For
large problems the algorithm is typically parallelized and implemented on a large
cluster computer. A well-known parallel algorithm is Asynchronous advantage actor
critic (A3C). A3C is a framework that uses asynchronous (parallel and distributed)
gradient descent for optimization of deep neural network controllers [521].
There is also a non-parallel version of A3C, the synchronous variant A2C [849].
Together they popularized this approach to actor critic methods. Figure 4.4 shows
the distributed architecture of A3C [382]; Alg. 4.5 shows the pseudocode, from
Mnih et al. [521]. The A3C network will estimate both a value function 𝑉 𝜙 (𝑠)
and an advantage function 𝐴 𝜙 (𝑠, 𝑎), as well as a policy function 𝜋 𝜃 (𝑎|𝑠). In the
experiments on Atari [521], the network consisted of joint convolutional layers
(blue in Fig. 4.4), followed by separate fully-connected policy and value heads at
the top (orange). This network architecture is replicated over the distributed workers.
Each of these workers is run on a separate processor thread and is synced with
global parameters from time to time.
A3C improves on classic REINFORCE in the following ways: it uses an advantage
actor critic design, it uses deep learning, and it makes efficient use of parallelism
in the training stage. The gradient accumulation step at the end of the code can
be considered as a parallelized reformulation of minibatch-based stochastic gra-
dient update: the values of 𝜙 or 𝜃 are adjusted in the direction of each training
thread independently. A major contribution of A3C comes from its parallelized
and asynchronous architecture: multiple actor-learners are dispatched to separate
instantiations of the environment; they all interact with the environment and collect
experience, and asynchronously push their gradient updates to a central target
network (just as DQN).
It was found that the parallel actor-learners have a stabilizing effect on training.
A3C surpassed the previous state-of-the-art on the Atari domain and succeeded on
a wide variety of continuous motor control problems as well as on a new task of
navigating random 3D mazes using high-resolution visual input [521].

Fig. 4.4 A3C network [382]
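
The inner loop of Alg. 4.5 walks backwards over the collected steps and accumulates the return with the recursion 𝑅 ← 𝑟𝑖 + 𝛾𝑅. The following Python sketch isolates that step for a single worker; the function name `backward_returns` and the plain lists are illustrative assumptions, and the gradient accumulation itself is left out.

```python
def backward_returns(rewards, values, bootstrap_value, gamma):
    """Compute returns R_i and advantages R_i - V(s_i) for one rollout.

    rewards[i] and values[i] belong to steps t_start .. t-1 of the rollout.
    bootstrap_value is 0.0 if the rollout ended in a terminal state,
    otherwise the critic's estimate V(s_t) of the last state.
    """
    R = bootstrap_value
    returns = [0.0] * len(rewards)
    for i in reversed(range(len(rewards))):
        R = rewards[i] + gamma * R   # same recursion as in Alg. 4.5
        returns[i] = R
    advantages = [ret - v for ret, v in zip(returns, values)]
    return returns, advantages


returns, advantages = backward_returns(
    rewards=[1.0, 0.0, 2.0], values=[0.5, 1.0, 1.5],
    bootstrap_value=0.0, gamma=0.9)
print(returns)
print(advantages)
```

In a full implementation each advantage would weight the gradient of log 𝜋 for the actor, and each squared difference would feed the critic's loss.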

4.2.5 Trust Region Optimization

Another important approach to further reduce the variance of policy methods is
the trust region approach. Trust region policy optimization (TRPO) aims to further
reduce the high variability in the policy parameters, by using a special loss function
with an additional constraint on the optimization problem [681].
A naive approach to speed up an algorithm is to increase the step size, for
example by raising the learning rate, so that the policy parameters change more in
each update. This approach will fail to uncover solutions that are hidden in finer
grained trajectories, and the optimization will converge to local optima. For this
reason the step size should not be too large. A less naive approach is to use an
adaptive step size that depends on the progress of the optimization.
Trust regions are used in general optimization problems to constrain the update
size [734]. The algorithms work by computing the quality of the approximation; if
it is still good, then the trust region is expanded. Alternatively, the region can be
shrunk if the divergence of the new and current policy is getting large.

Algorithm 4.5 Asynchronous advantage actor-critic pseudocode for each actor-learner thread [521]

Input: Assume global shared parameter vectors $\theta$ and $\phi$ and global shared counter $T = 0$
       Assume thread-specific parameter vectors $\theta'$ and $\phi'$
Initialize thread step counter $t \leftarrow 1$
repeat
    Reset gradients: $d\theta \leftarrow 0$ and $d\phi \leftarrow 0$
    Synchronize thread-specific parameters $\theta' = \theta$ and $\phi' = \phi$
    $t_{start} = t$
    Get state $s_t$
    repeat
        Perform $a_t$ according to policy $\pi(a_t|s_t; \theta')$
        Receive reward $r_t$ and new state $s_{t+1}$
        $t \leftarrow t + 1$
        $T \leftarrow T + 1$
    until terminal $s_t$ or $t - t_{start} == t_{max}$
    $R = \begin{cases} 0 & \text{for terminal } s_t \\ V(s_t; \phi') & \text{for non-terminal } s_t \end{cases}$    // Bootstrap from last state
    for $i \in \{t-1, \ldots, t_{start}\}$ do
        $R \leftarrow r_i + \gamma R$
        Accumulate gradients wrt $\theta'$: $d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi(a_i|s_i; \theta') (R - V(s_i; \phi'))$
        Accumulate gradients wrt $\phi'$: $d\phi \leftarrow d\phi + \partial (R - V(s_i; \phi'))^2 / \partial \phi'$
    end for
    Perform asynchronous update of $\theta$ using $d\theta$ and of $\phi$ using $d\phi$
until $T > T_{max}$

Schulman et al. [681] introduced trust region policy optimization (TRPO) based
on these ideas, trying to take the largest possible parameter improvement step on a
policy, without accidentally causing performance to collapse.
To this end, as it samples policies, TRPO compares the old and the new policy:

$$\mathcal{L}(\theta) = \mathbb{E}_t \Big[ \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \cdot A_t \Big].$$

In order to increase the learning step size, TRPO tries to maximize this loss function
$\mathcal{L}$, subject to the constraint that the old and the new policy are not too far
apart. In TRPO the Kullback-Leibler divergence⁵ is used for this purpose:

$$\mathbb{E}_t \big[ \mathrm{KL}(\pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)) \big] \le \delta.$$
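
The two quantities that TRPO works with, the probability-ratio objective and the KL constraint, are easy to evaluate for a discrete policy. The sketch below only evaluates them on given probabilities; it does not implement TRPO's second order optimization step. The function names and the example numbers are illustrative assumptions.

```python
import math


def surrogate_objective(new_probs, old_probs, advantages):
    """Sample mean of (pi_theta / pi_theta_old) * A_t.

    new_probs[i] and old_probs[i] are the probabilities that the new and
    the old policy assign to the action actually taken in sample i.
    """
    terms = [(p_new / p_old) * adv
             for p_new, p_old, adv in zip(new_probs, old_probs, advantages)]
    return sum(terms) / len(terms)


def kl_divergence(p, q):
    """KL(p, q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def within_trust_region(old_dist, new_dist, delta):
    """Check the constraint KL(pi_old, pi_new) <= delta for one state."""
    return kl_divergence(old_dist, new_dist) <= delta


old = [0.5, 0.5]
print(within_trust_region(old, [0.52, 0.48], delta=0.01))  # small change
print(within_trust_region(old, [0.25, 0.75], delta=0.01))  # large change
```

A candidate update that raises the surrogate objective but violates the constraint would be rejected or scaled back.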

TRPO scales to complex high-dimensional problems. Original experiments
demonstrated its robust performance on simulated robotic Swimming, Hopping,
Walking gaits, and Atari games. TRPO is commonly used in experiments and as
a baseline for developing new algorithms. A disadvantage of TRPO is that it is
a complicated algorithm that uses second order derivatives; we will not cover
the pseudocode here. Implementations can be found at Spinning Up⁶ and Stable
Baselines.⁷
5 The Kullback-Leibler divergence is a measure of distance between probability
distributions [436, 93].
Proximal policy optimization (PPO) [683] was developed as an improvement
of TRPO. PPO has some of the benefits of TRPO, but is simpler to implement,
is more general, has better empirical sample complexity and has better run time
complexity. It is motivated by the same question as TRPO, to take the largest possible
improvement step on a policy parameter without causing performance collapse.
There are two variants of PPO: PPO-Penalty and PPO-Clip. PPO-Penalty ap-
proximately solves a KL-constrained update (like TRPO), but merely penalizes the
KL-divergence in the objective function instead of making it a hard constraint.
PPO-Clip does not use a KL-divergence term in the objective and has no constraints
either. Instead it relies on clipping in the objective function to remove incentives
for the new policy to get far from the old policy; it clips the probability ratio
between the new and the old policy to a fixed range [1 − 𝜖, 1 + 𝜖] before the ratio is
multiplied by the advantage 𝐴𝑡.
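
The clipping idea fits in a few lines. The sketch below evaluates the per-sample clipped objective in the form given in the PPO paper [683], which takes the minimum of the unclipped and the clipped term; the function name and the value 𝜖 = 0.2 are illustrative assumptions, and a full PPO implementation involves considerably more machinery.

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """Per-sample clipped surrogate objective.

    ratio is pi_theta(a|s) / pi_theta_old(a|s) for the sampled action.
    The ratio is clipped to [1 - epsilon, 1 + epsilon]; taking the minimum
    makes the objective a pessimistic bound on the unclipped term.
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: no extra gain from pushing the ratio above 1 + epsilon.
print(ppo_clip_term(1.5, advantage=1.0))
# Negative advantage: no extra gain from pushing the ratio below 1 - epsilon.
print(ppo_clip_term(0.5, advantage=-1.0))
```

Once the ratio leaves the clipping range in the direction that would improve the objective, the gradient of this term with respect to the policy parameters vanishes, which removes the incentive to move further.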
While simpler than TRPO, PPO is still a complicated algorithm to implement,
and we omit the code here. The authors of PPO provide an implementation as a
baseline.⁸ Both TRPO and PPO are on-policy algorithms. Hsu et al. [353] reflect on
design choices of PPO.

4.2.6 Entropy and Exploration

A problem in many deep reinforcement learning experiments where only a fraction
of the state space is sampled, is brittleness: the algorithms get stuck in local
optima, and different choices for hyperparameters can cause large differences in
performance. Even a different choice for seed for the random number generator
can cause large differences in performance for many algorithms.
For large problems, exploration is important, in value-based and policy-based
approaches alike. We must provide the incentive to sometimes try an action which
currently seems suboptimal [524]. Too little exploration results in brittle, local,
optima.
When we learn a deterministic policy 𝜋 𝜃 (𝑠) → 𝑎, we can manually add ex-
ploration noise to the behavior policy. In a continuous action space we can use
Gaussian noise, while in a discrete action space we can use Dirichlet noise [425].
For example, in a 1D continuous action space we could use:

$$\pi_{\theta,\text{behavior}}(a|s) = \pi_\theta(s) + \mathcal{N}(0, \sigma),$$

where $\mathcal{N}(\mu, \sigma)$ is the Gaussian (normal) distribution with hyperparameters mean
𝜇 = 0 and standard deviation 𝜎; 𝜎 is our exploration hyperparameter.
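
As a minimal sketch of this behavior policy, the Python snippet below adds zero-mean Gaussian noise to the output of a deterministic policy. The toy policy `linear_policy` and the helper name `behavior_action` are hypothetical and only serve the illustration.

```python
import random


def linear_policy(state):
    """Toy deterministic policy pi_theta(s) with theta = 2 (illustrative)."""
    return 2.0 * state


def behavior_action(policy, state, sigma, rng=random):
    """Deterministic action plus Gaussian exploration noise N(0, sigma)."""
    return policy(state) + rng.gauss(0.0, sigma)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [behavior_action(linear_policy, 1.0, sigma=0.1, rng=rng)
           for _ in range(10000)]
print(sum(samples) / len(samples))  # close to the deterministic action 2.0
```

Larger values of 𝜎 spread the sampled actions further around the deterministic action, which means more exploration.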
6 https://spinningup.openai.com
7 https://stable-baselines.readthedocs.io
8 https://openai.com/blog/openai-baselines-ppo/#ppo
Soft Actor Critic

When we learn a stochastic policy 𝜋(𝑎|𝑠), then exploration is already partially
ensured due to the stochastic nature of our policy. For example, when we predict
a Gaussian distribution, then simply sampling from this distribution will already
induce variation in the chosen actions:

$$\pi_{\theta,\text{behavior}}(a|s) = \pi_\theta(a|s)$$

However, when there is not sufficient exploration, a potential problem is the collapse
of the policy distribution. The distribution then becomes too narrow, and we lose
the exploration pressure that is necessary for good performance.
Although we could simply add additional noise, another common approach
is to use entropy regularization (see Sect. A.2 for details). We then add an addi-
tional penalty to the loss function, that enforces the entropy 𝐻 of the distribution
to stay larger. Soft actor critic (SAC) is a well-known algorithm that focuses on
exploration [306, 307].⁹ SAC extends the policy gradient equation to

$$\theta_{t+1} = \theta_t + R \cdot \nabla_\theta \log \pi_\theta(a_t|s_t) + \eta \nabla_\theta H[\pi_\theta(\cdot|s_t)]$$

where $\eta \in \mathbb{R}^+$ is a constant that determines the amount of entropy regularization.
SAC ensures that we will move 𝜋𝜃(𝑎|𝑠) to the optimal policy, while also ensuring
that the policy stays as wide as possible (trading off the two against each other).
Entropy is computed as $H = -\sum_i p_i \log p_i$ where $p_i$ is the probability of being
in state 𝑖; in SAC the entropy of the policy is the expected negative log of the
stochastic policy function, $\mathbb{E}_{a \sim \pi_\theta}[-\log \pi_\theta(a|s)]$.
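
The entropy formula can be evaluated directly for a discrete action distribution. The sketch below is only meant to show that a wide (uniform) policy has high entropy and a nearly collapsed policy has low entropy; the example distributions are made up for illustration.

```python
import math


def entropy(probs):
    """H = -sum_i p_i log p_i, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


wide = [0.25, 0.25, 0.25, 0.25]      # explores all four actions equally
narrow = [0.97, 0.01, 0.01, 0.01]    # has almost collapsed onto one action

print(entropy(wide))    # equals log(4), the maximum for four actions
print(entropy(narrow))  # much smaller
```

An entropy bonus weighted by 𝜂 therefore rewards the first kind of distribution over the second, which counteracts the collapse of the policy distribution.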
High-entropy policies favor exploration. First, the policy is incentivized to ex-
plore more widely, while giving up on clearly unpromising avenues. Second, with
improved exploration comes improved learning speed.
Most policy-based algorithms (including A3C, TRPO, and PPO) only optimize
for expected value. By including entropy explicitly in the optimization goal, SAC
is able to increase the stability of outcome policies, achieving stable results for
different random seeds, and reducing the sensitivity to hyperparameter settings.
Including entropy into the optimization goal has been studied widely, see, for
example, [305, 396, 779, 880, 547].
A further element that SAC uses to improve stability and sample efficiency is a
replay buffer. Many policy-based algorithms are on-policy learners (including A3C,
TRPO, and PPO). In on-policy algorithms each policy improvement uses feedback
on actions according to the most recent version of the behavior policy. On-policy
methods converge well, but tend to require many samples to do so. In contrast, many
value-based algorithms are off-policy: each policy improvement can use feedback
collected at any earlier point during training, regardless of how the behavior policy
was acting to explore the environment at the time when the feedback was obtained.
The replay buffer is such a mechanism, breaking out of local maxima. Large replay
buffers cause off-policy behavior, improving sample efficiency by learning from
behavior of the past, but also potentially causing convergence problems. Like DQN,
SAC has overcome these problems, and achieves stable off-policy performance.
9 https://github.com/haarnoja/sac
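
A replay buffer can be sketched in a few lines of Python. This is a generic illustration of the mechanism, not the data structure used by any particular SAC implementation; the class name, the transition layout, and the capacity are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop out first
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Sampled transitions may stem from much older versions of the
        # behavior policy, which is what makes learning off-policy.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.store(state=step, action=0, reward=1.0, next_state=step + 1)
batch = buffer.sample(batch_size=2)
print(len(buffer), batch)
```

The capacity controls how far back in time the learner looks: a larger buffer reuses more old experience, at the cost of learning from behavior that differs more from the current policy.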

4.2.7 Deterministic Policy Gradient

Actor critic approaches improve the policy-based approach with various value-based
ideas, and with good results. Another method to join policy and value approaches
is to use a learned value function as a differentiable target to optimize the policy
against—we let the policy follow the value function [524]. An example is the deter-
ministic policy gradient [705]. Imagine we collect data 𝐷 and train a value network
𝑄 𝜙 (𝑠, 𝑎). We can then attempt to optimize the parameters 𝜃 of a deterministic
policy by optimizing the prediction of the value network:

$$J(\theta) = \mathbb{E}_{s \sim D} \Big[ \sum_{t=0}^{n} Q_\phi(s, \pi_\theta(s)) \Big],$$

which by the chain-rule gives the following gradient expression

$$\nabla_\theta J(\theta) = \sum_{t=0}^{n} \nabla_a Q_\phi(s, a) \cdot \nabla_\theta \pi_\theta(s).$$

In essence, we first train a state-action value network based on sampled data,
and then let the policy follow the value network, by simply chaining the gradients.
Thereby, we push the policy network in the direction of those actions 𝑎 that increase
the value network prediction, towards actions that perform better.
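
The chaining of gradients can be illustrated on a toy problem that is small enough to differentiate by hand. In the sketch below the critic is a fixed quadratic function that stands in for a trained value network, and the policy is a single-parameter linear function; both are assumptions chosen so that the optimum (𝜃 = 2) is known in advance.

```python
def dq_da(state, action):
    """Gradient of the toy critic Q(s, a) = -(a - 2s)^2 with respect to a."""
    return -2.0 * (action - 2.0 * state)


def policy(theta, state):
    """Deterministic toy policy pi_theta(s) = theta * s."""
    return theta * state


def dpi_dtheta(theta, state):
    """Gradient of the toy policy with respect to theta."""
    return state


def dpg_step(theta, states, learning_rate):
    """One gradient ascent step using grad_a Q * grad_theta pi."""
    grad = sum(dq_da(s, policy(theta, s)) * dpi_dtheta(theta, s)
               for s in states) / len(states)
    return theta + learning_rate * grad


theta = 0.0
states = [0.5, 1.0, 1.5]   # stand-in for states sampled from the data D
for _ in range(200):
    theta = dpg_step(theta, states, learning_rate=0.1)
print(theta)  # approaches 2.0, where the policy picks the action a = 2s
```

With neural networks the two partial derivatives are obtained by automatic differentiation rather than by hand, but the update has the same structure.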
Lillicrap et al. [480] present Deep deterministic policy gradient (DDPG). It is
based on DQN, with the purpose of applying it to continuous action spaces.
In DQN, if the optimal action-value function 𝑄★ (𝑠, 𝑎) is known, then the optimal
action 𝑎★ (𝑠) can be found via 𝑎★ (𝑠) = arg max𝑎 𝑄★ (𝑠, 𝑎). DDPG uses the derivative
of a continuous function 𝑄(𝑠, 𝑎) with respect to the action argument to efficiently
approximate max𝑎 𝑄(𝑠, 𝑎). DDPG is also based on the algorithms Deterministic
policy gradients (DPG) [705] and Neurally fitted Q-learning with continuous actions
(NFQCA) [311], two actor critic algorithms.
The pseudocode of DDPG is shown in Alg. 4.6. DDPG has been shown to work
well on simulated physics tasks, including classic problems such as Cartpole, Gripper,
Walker, and Car driving, being able to learn policies directly from raw pixel inputs.
DDPG is off-policy and uses a replay buffer and a separate target network to achieve
stable deep reinforcement learning (just as DQN).

Algorithm 4.6 DDPG algorithm [480]

Randomly initialize critic network $Q_\phi(s, a)$ and actor $\pi_\theta(s)$ with weights $\phi$ and $\theta$
Initialize target networks $Q'$ and $\pi'$ with weights $\phi' \leftarrow \phi$, $\theta' \leftarrow \theta$
Initialize replay buffer $R$
for episode = 1, M do
    Initialize a random process $\mathcal{N}$ for action exploration
    Receive initial observation state $s_1$
    for t = 1, T do
        Select action $a_t = \pi_\theta(s_t) + \mathcal{N}_t$ according to the current policy and exploration noise
        Execute action $a_t$ and observe reward $r_t$ and observe new state $s_{t+1}$
        Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
        Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
        Set $y_i = r_i + \gamma Q_{\phi'}(s_{i+1}, \pi_{\theta'}(s_{i+1}))$
        Update critic by minimizing the loss: $L = \frac{1}{N} \sum_i (y_i - Q_\phi(s_i, a_i))^2$
        Update the actor policy using the sampled policy gradient:
            $\nabla_\theta J \approx \frac{1}{N} \sum_i \nabla_a Q_\phi(s, a)|_{s=s_i, a=\pi_\theta(s_i)} \nabla_\theta \pi_\theta(s)|_{s_i}$
        Update the target networks:
            $\phi' \leftarrow \tau \phi + (1 - \tau) \phi'$
            $\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$
    end for
end for

DDPG is a popular actor critic algorithm. Annotated pseudocode and efficient
implementations can be found at Spinning Up¹⁰ and Stable Baselines¹¹ in addition
to the original paper [480].
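
The last step of Alg. 4.6, the slow tracking of the online weights by the target weights, is sketched below on plain Python lists. Real implementations apply the same rule to every tensor of the network; the function name and the example numbers are illustrative assumptions.

```python
def soft_update(target_weights, online_weights, tau):
    """Return new target weights: tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_target
            for w, w_target in zip(online_weights, target_weights)]


online = [1.0, 2.0]
target = [0.0, 0.0]
for _ in range(3):
    target = soft_update(target, online, tau=0.1)
print(target)  # moves only a fraction of the way towards the online weights
```

A small value of 𝜏 keeps the targets $y_i$ nearly stationary between updates, which is one of the ingredients that stabilizes learning.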

Conclusion

We have seen quite a few algorithms that combine the policy and value approach,
and we have discussed possible combinations of these building blocks to construct
working algorithms. Figure 4.5 provides a conceptual map of how the different
approaches are related, including two approaches that will be discussed in later
chapters (AlphaZero and Evolutionary approaches).
Researchers have constructed many algorithms and performed experiments to
see when they perform best. Quite a number of actor critic algorithms have been
developed. Working high-performance Python implementations can be found on
GitHub in the Stable Baselines.¹²
10 https://spinningup.openai.com
11 https://stable-baselines.readthedocs.io
12 https://stable-baselines.readthedocs.io/en/master/guide/quickstart.html
Fig. 4.5 Value-based, policy-based and Actor critic methods [524].

4.2.8 Hands On: PPO and DDPG MuJoCo Examples

OpenAI’s Spinning Up provides a tutorial on policy gradient algorithms,¹³ complete
with TensorFlow¹⁴ and PyTorch¹⁵ versions of REINFORCE to learn Gym’s Cartpole.
Now that we have discussed these algorithms, let us see how they work in
practice, to get a feeling for the algorithms and their hyperparameters. MuJoCo is the
most frequently used physics simulator in policy-based learning experiments. Gym,
the (Stable) Baselines and Spinning Up allow us to run any mix of learning algorithms
and experimental environments. You are encouraged to try these experiments
yourself.
Please be warned, however, that attempting to install all necessary pieces of
software may invite a minor version-hell. Different versions of your operating
system, of Python, of GCC, of Gym, of the Baselines, of TensorFlow or PyTorch, and
of MuJoCo all need to line up before you can see beautiful images of moving arms,
legs and jumping humanoids. Unfortunately not all of these versions are backwards-
compatible, specifically the switch from Python 2 to 3 and from TensorFlow 1 to 2
introduced incompatible language changes.
Getting everything to work may be an effort, and may require switching machines,
operating systems and languages, but you should really try. This is the
disadvantage of being part of one of the fastest moving fields in machine learning
research. If things do not work with your current operating system and Python
version, in general a combination of Linux Ubuntu (or macOS), Python 3.7,
TensorFlow 1 or PyTorch, Gym, and the Baselines may be a good idea to start with.
Search on the GitHub repositories or Stack Overflow when you get error messages.
Sometimes downgrading to the second-latest version will be necessary, or fiddling
with include or library paths.
13 Tutorial: https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#deriving-the-simplest-policy-gradient
14 TensorFlow: https://github.com/openai/spinningup/blob/master/spinup/examples/tf1/pg_math/1_simple_pg.py
15 PyTorch: https://github.com/openai/spinningup/blob/master/spinup/examples/pytorch/pg_math/1_simple_pg.py
If everything works, then both Spinning Up and the Baselines provide convenient
scripts that facilitate mixing and matching algorithms and environments from the
command line.
For example, to run Spinup’s PPO on MuJoCo’s Walker environment, with two
hidden layers of 32 units each, the following command line does the job:

python -m spinup.run ppo \
    --hid "[32,32]" --env Walker2d-v2 --exp_name mujocotest

To train DDPG from the Baselines on the Half-cheetah, the command is:

python -m baselines.run \
    --alg=ddpg --env=HalfCheetah-v2 --num_timesteps=1e6

All hyperparameters can be controlled via the command line, providing for a
flexible way to run experiments. A final example command line:

python scripts/all_plots.py \
    -a ddpg -e HalfCheetah Ant Hopper Walker2D -f logs/ \
    -o logs/ddpg_results

The Stable Baselines site explains what this command line does.

4.3 Locomotion and Visuo-Motor Environments

We have seen many different policy-based reinforcement learning algorithms that
can be used in agents with continuous action spaces. Let us have a closer look at
the environments that they have been used in, and how well they perform.
Policy-based methods, and especially the actor critic policy/value hybrid, work
well for many problems, both with discrete and with continuous action spaces.
Policy-based methods are often tested on complex high-dimensional robotics appli-
cations [417]. Let us have a look at the kind of environments that have been used
to develop PPO, A3C, and the other algorithms.
Two application categories are robot locomotion, and visuo-motor interaction.
These two problems have drawn many researchers, and many new algorithms have
been devised, some of which were able to learn impressive performance. For each
of the two problems, we will discuss a few results in more detail.

Fig. 4.6 Humanoid Standing Up [682]

4.3.1 Locomotion

One of the problems of locomotion of legged entities is the problem of learning gaits.
Humans, with two legs, can walk, run, and jump, amongst others. Dogs and horses,
with four legs, have other gaits, where their legs may move in even more interesting
patterns, such as the trot, canter, pace and gallop. The challenges that we pose
robots are often easier. Typical reinforcement learning tasks are for a one-legged
robot to learn to jump, for biped robots to walk and jump, and for a quadruped to
learn to use its multitude of legs in any coordinated fashion that results in
forward motion. Learning such policies can be quite computationally expensive,
and a curious simulated virtual animal has emerged that is cheaper to simulate:
the two-legged half-cheetah, whose task it is to run forward. We have already seen
some of these robotic creatures in Figs. 4.2–4.3.
The first approach that we will discuss is by Schulman et al. [682]. They report
experiments where human-like bipeds and quadrupeds must learn to stand up and
learn running gaits. These are challenging 3D locomotion tasks that were formerly
attempted with hand-crafted policies. Figure 4.6 shows a sequence of states.
The challenge in these situations is actually somewhat spectacular: the agent is
only provided with a positive reward for moving forward; based on nothing more
it has to learn to control all its limbs by itself, through trial and error; no hint is
given on how to control a leg or what its purpose is. These results are best watched
in the movies that have been made¹⁶ about the learning process.
16 Such as the movie from the start of this chapter: https://www.youtube.com/watch?v=hx_bgoTF7bs

Fig. 4.7 Walker Obstacle Course [325]

Fig. 4.8 Quadruped Obstacle Course [325]

The authors use an Advantage actor critic algorithm with trust regions. The
algorithm is fully model-free, and learning with simulated physics was reported
to take one to two weeks of real time. Learning to walk is quite a complicated
challenge, as the movies illustrate. They also show the robot learning to scale an
obstacle run all by itself.
In another study, Heess et al. [325] report on end-to-end learning of complex
robot locomotion from pixel input to (simulated) motor-actuation. Figure 4.7 shows
how a walker scales an obstacle course and Fig. 4.8 shows a time lapse of how a
quadruped traverses a course. Agents learned to run, jump, crouch and turn as
the environment required, without explicit reward shaping or other hand-crafted
features. For this experiment a distributed version of PPO was used. Interestingly,
the researchers stress that the use of a rich—varied, difficult—environment helps to
promote learning of complex behavior, that is also robust across a range of tasks.

4.3.2 Visuo-Motor Interaction

Fig. 4.9 DeepMind Control Suite. Top: Acrobot, Ball-in-cup, Cart-pole, Cheetah, Finger, Fish,
Hopper. Bottom: Humanoid, Manipulator, Pendulum, Point-mass, Reacher, Swimmer (6 and 15
links), Walker [755]

Most experiments in “end-to-end” learning of robotic locomotion are set up so that
the input is received directly from features that are derived from the states as
calculated by the simulation software. A step further towards real-world interaction
is to learn directly from camera pixels. We then model eye-hand coordination in
visuo-motor interaction tasks, and the state of the environment has to be inferred
from camera or other visual means, and then be translated into joint (muscle)
actuations.
Visuo-motor interaction is a difficult task, requiring many techniques to work
together. Different environments have been introduced to test algorithms. Tassa
et al. report on benchmarking efforts in robot locomotion with MuJoCo [755],
introducing the DeepMind control suite, a suite of environments consisting of
different MuJoCo control tasks (see Fig. 4.9). The authors also present baseline
implementations of learning agents that use A3C, DDPG and D4PG (distributional
distributed deep deterministic policy gradients—an algorithm that extends DDPG).
In addition to learning from state derived features, results are presented where
the agent learns from 84 × 84 pixel information, in a simulated form of visuo-motor
interaction. The DeepMind control suite is especially designed for further research
in the field [757, 511, 508, 510, 509]. Other environment suites are Meta-World [864],
Surreal [234], RLbench [375].
Visuo-motor interaction is a challenging problem that remains an active area of
research.

4.3.3 Benchmarking

Benchmarking efforts are of great importance in the field [212]. Henderson et
al. [328] published an influential study of the sensitivity of outcomes to different
hyperparameter settings, and the influence of non-determinism, by trying to
reproduce many published works in the field. They find large variations in outcomes,
and in general that reproducibility of results is problematic. They conclude that
without significance metrics and tighter standardization of experimental reporting,
it is difficult to determine whether improvements over the prior state-of-the-art are
meaningful [328]. Further studies confirmed these findings [7],17 and today more
works are being published with code, hyperparameters, and environments.
Taking inspiration from the success of the Arcade Learning Environment in
game playing, benchmark suites of continuous control tasks with high state and
action dimensionality have been introduced [212, 261].18 The tasks include 3D
humanoid locomotion, tasks with partial observations, and tasks with hierarchical
structure. The locomotion tasks are: Swimmer, Hopper, Walker, Half-cheetah, Ant,
simple Humanoid and full Humanoid, with the goal being to move forward as
fast as possible. These are difficult tasks because of the high degree of freedom of
movement. Partial observation is achieved by adding noise or leaving out certain
parts of the regular observations. The hierarchical tasks consist of low level tasks
such as learning to move, and a high level task such as finding the way out of a
maze.

Summary and Further Reading

This chapter is concerned with the second kind of model-free algorithms: policy-
based methods. We summarize what we have learned, and provide pointers to
further reading.

Summary

Policy-based model-free methods are some of the most popular methods of deep
reinforcement learning. For large, continuous action spaces, indirect value-based
methods are not well suited, because of the use of the arg max function to recover
the best action to go with the value. Where value-based methods work step-by-step,
vanilla policy-based methods roll out a full future trajectory or episode. Policy-based
methods work with a parameterized current policy, which is well suited for a neural
network as policy function approximator.
After the full trajectory has been rolled out, the reward and the value of the
trajectory are calculated and the policy parameters are updated, using gradient ascent.
Since the value is only known at the end of an episode, classic policy-based methods
have a higher variance than value-based methods, and may converge to a local
optimum. The best known classic policy method is called REINFORCE.
17 https://github.com/google-research/rliable
18 The suite in the paper is called RLlab. A newer version of the suite is named Garage. See also
Appendix C.
Actor critic methods add a value network to the policy network, to achieve
the benefits of both approaches. To reduce variance, 𝑛-step temporal difference
bootstrapping can be added, and a baseline value can be subtracted, so that we get
the so-called advantage function (which subtracts the value of the parent state from
the action values of the future states, bringing their expected value closer to zero).
Well known actor critic methods are A3C, DDPG, TRPO, PPO, and SAC.19 A3C
features an asynchronous (parallel, distributed) implementation, DDPG is an actor
critic version of DQN for continuous action spaces, TRPO and PPO use trust regions
to achieve adaptive step sizes in nonlinear spaces, and SAC optimizes for expected
value and entropy of the policy. Benchmark studies have shown that the performance
of these actor critic algorithms is as good as or better than that of value-based
methods [212, 328].
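As a small illustration of the advantage function with 𝑛-step bootstrapping, the
following sketch computes one advantage estimate from a short list of rewards and two
critic values. The function name and its arguments are hypothetical and chosen for this
example only; in an actual actor critic agent the two values would come from the value
network.

```python
def n_step_advantage(rewards, value_start, value_end, gamma=0.99):
    """n-step advantage estimate for the state where the rollout started.

    rewards     -- the n rewards observed after the start state
    value_start -- critic estimate V(s_0), the baseline that is subtracted
    value_end   -- critic estimate V(s_n), used to bootstrap the remaining return
    """
    n_step_return = 0.0
    for k, r in enumerate(rewards):
        n_step_return += (gamma ** k) * r          # discounted sampled rewards
    n_step_return += (gamma ** len(rewards)) * value_end   # bootstrapped tail
    return n_step_return - value_start              # subtract the baseline


advantage = n_step_advantage([1.0, 1.0], value_start=0.5, value_end=2.0, gamma=0.5)
```

A positive estimate indicates that the sampled actions did better than the critic
expected from the start state, a negative one that they did worse.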
Robot learning is among the most popular applications for policy-based methods.
Model-free methods have low sample efficiency, and to prevent the cost of wear after
millions of samples, most experiments use a physics simulation as environment,
such as MuJoCo. Two main application areas are locomotion (learning to walk,
learning to run) and visuo-motor interaction (learning directly from camera images
of one’s own actions).

Further Reading

Policy-based methods have been an active research area for some time. Their natural
suitability for deep function approximation for robotics applications and other ap-
plications with continuous action spaces has spurred a large interest in the research
community. The classic policy-based algorithm is Williams’ REINFORCE [844],
which is based on the policy gradient theorem, see [744]. Our explanation is based
on [220, 192, 255]. Joining policy and value-based methods as we do in actor critic is
discussed in Barto et al. [57]. Mnih et al. [521] introduce a modern efficient parallel
implementation named A3C. After the success of DQN a version for the continu-
ous action space of policy-based methods was introduced as DDPG by Lillicrap et
al. [480]. Schulman et al. have worked on trust regions, yielding efficient popular
algorithms TRPO [681] and PPO [683].
Important benchmark studies of policy-based methods are Duan et al. [212] and
Henderson et al. [328]. These papers have stimulated reproducibility in reinforce-
ment learning research.
Software environments that are used in testing policy-based methods are Mu-
JoCo [780] and PyBullet [167]. Gym [108] and the DeepMind control suite [755]
incorporate MuJoCo and provide an easy to use Python interface. An active research
community has emerged around the DeepMind control suite.

Exercises

We have come to the end of this chapter, and it is time to test our understanding
with questions, exercises, and a summary.
19Asynchronous advantage actor critic; Deep deterministic policy gradients; Trust region policy
optimization; Proximal policy optimization; Soft actor critic.

Questions

Below are some quick questions to check your understanding of this chapter. For
each question a simple, single sentence answer is sufficient.
1. Why are value-based methods difficult to use in continuous action spaces?
2. What is MuJoCo? Can you name a few example tasks?
3. What is an advantage of policy-based methods?
4. What is a disadvantage of full-trajectory policy-based methods?
5. What is the difference between actor critic and vanilla policy-based methods?
6. How many parameter sets are used by actor critic? How can they be represented
in a neural network?
7. Describe the relation between Monte Carlo REINFORCE, 𝑛-step methods, and
temporal difference bootstrapping.
8. What is the advantage function?
9. Describe a MuJoCo task that methods such as PPO can learn to perform well.
10. Give two actor critic approaches to further improve upon bootstrapping and
advantage functions, that are used in high-performing algorithms such as PPO
and SAC.
11. Why is learning robot actions from image input hard?

Exercises

Let us now look at programming exercises. If you have not already done so, install
MuJoCo or PyBullet, and install the DeepMind control suite.20 We will use agent
algorithms from the Stable baselines. Furthermore, browse the examples directory
of the DeepMind control suite on GitHub, and study the Colab notebook.
1. REINFORCE Go to the Medium blog21 and reimplement REINFORCE. You can
choose PyTorch, or TensorFlow/Keras, in which case you will have to improvise.
Run the algorithm on an environment with a discrete action space, and compare
with DQN. Which works better? Run in an environment with a continuous action
space. Note that Gym offers a discrete and a continuous version of Mountain
Car.
2. Algorithms Run REINFORCE on a Walker environment from the Baselines. Run
DDPG, A3C, and PPO. Run them for different time steps. Make plots. Com-
pare training speed, and outcome quality. Vary hyperparameters to develop an
intuition for their effect.
3. Suite Explore the DeepMind control suite. Look around and see what environments
have been provided, and how you can use them. Consider extending an
environment. What learning challenges would you like to introduce? First do
a survey of the literature that has been published about the DeepMind control
suite.
20 https://github.com/deepmind/dm_control
21 https://medium.com/@ts1829/policy-gradient-reinforcement-learning-in-pytorch-df1383ea0baf
Chapter 5
Model-Based Reinforcement Learning

The previous chapters discussed model-free methods, and we saw their success
in video games and simulated robotics. In model-free methods the agent updates
a policy directly from the feedback that the environment provides on its actions.
The environment performs the state transitions and calculates the reward. A dis-
advantage of deep model-free methods is that they can be slow to train; for stable
convergence or low variance often millions of environment samples are needed
before the policy function converges to a high quality optimum.
In contrast, with model-based methods the agent first builds its own internal
transition model from the environment feedback. The agent can then use this local
transition model to find out about the effect of actions on states and rewards. The
agent can use a planning algorithm to play what-if games, and generate policy
updates, all without causing any state changes in the environment. This approach
promises higher quality at lower sample complexity. Generating policy updates
from the internal model is called planning or imagination.
Model-based methods update the policy indirectly: the agent first learns a local
transition model from the environment, which the agent then uses to update the
policy. Indirectly learning the policy function has two consequences. On the positive
side, as soon as the agent has its own model of the state transitions of the world, it
can learn the best policy for free, without further incurring the cost of acting in
the environment. Model-based methods thus may have a lower sample complexity.
The downside is that the learned transition model may be inaccurate, and the
resulting policy may be of low quality. No matter how many samples can be taken
for free from the model, if the agent’s local transition model does not reflect the
environment’s real transition model, then the locally learned policy function will
not work in the environment. Thus, dealing with uncertainty and model bias are
important elements in model-based reinforcement learning.
The idea to first learn an internal representation of the environment’s transition
function was conceived many years ago, and transition models have been
implemented in many different ways. Models can be tabular, or they can be based
on various kinds of deep learning, as we will see.


This chapter will start with an example showing how model-based methods
work. Next, we describe in more detail different kinds of model-based approaches;
approaches that focus on learning an accurate model, and approaches for planning
with an imperfect model. Finally, we describe application environments for which
model-based methods have been used in practice, to see how well the approaches
perform.
The chapter is concluded with exercises, a summary, and pointers to further
reading.

Core Concepts

• Imagination
• Uncertainty models
• World models, Latent models
• Model-predictive control
• Deep end-to-end planning and learning

Core Problem

• Learn and use accurate transition models for high-dimensional problems

Core Algorithms

• Dyna-Q (Alg. 5.3)
• Ensembles and model-predictive control (Alg. 5.4, 5.6)
• Value prediction networks (Alg. 5.5)
• Value iteration networks (Sect. 5.2.2.2)

Building a Navigation Map

To illustrate basic concepts of model-based reinforcement learning, we return to
the supermarket example.
Let us compare how model-free and model-based methods find their way to the
supermarket in a new city.1 In this example we will use value-based Q-learning;
our policy 𝜋(𝑠, 𝑎) will be derived directly from the 𝑄(𝑠, 𝑎) values with arg max,
and writing 𝑄 is in this sense equivalent to writing 𝜋.
1 We use distance to the supermarket as negative reward, in order to formulate this as a distance
minimization problem, while still being able to reason in our familiar reward maximization setting.
Model-free Q-learning: the agent picks the start state 𝑠0, and uses (for example)
an 𝜖-greedy behavior policy on the action-value function 𝑄(𝑠, 𝑎) to select the next
action. The environment then executes the action, computes the next state 𝑠′ and
reward 𝑟, and returns these to the agent. The agent updates its action-value function
𝑄(𝑠, 𝑎) with the familiar update rule

$$Q(s,a) \leftarrow Q(s,a) + \alpha\,[\,r + \gamma \max_{a} Q(s',a) - Q(s,a)\,].$$

The agent repeats this procedure until the values in the 𝑄-function no longer change
greatly.
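The update rule can be written out as a few lines of code. The sketch below is a
minimal tabular version; the state names, the action list, and the values of 𝛼 and 𝛾
are assumptions made for illustration and do not come from the text.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update for the observed transition (s, a, r, s_next)."""
    best_next = max(Q[(s_next, b)] for b in actions)   # max over next actions
    td_target = r + gamma * best_next                  # bootstrapped target
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])       # move Q towards the target
    return Q[(s, a)]


Q = defaultdict(float)                       # all action values start at zero
actions = ["north", "east", "south", "west"]

# First the block next to the supermarket is rewarded, then the block before it.
q_update(Q, "corner", "east", 1.0, "supermarket", actions)
q_update(Q, "home", "east", 0.0, "corner", actions)
```

The second call shows how value information travels backwards: the step from home
earns no reward itself, but it inherits a discounted share of the value already stored
for the corner.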
Thus we pick our start location in the city, perform one walk along a block in an
𝜖-greedy direction, and record the reward and the new state at which we arrive.
We use the information to update the policy, and from our new location, we walk
again in an 𝜖-greedy direction using the policy. If we find the supermarket, we start
over again, trying to find a shorter path, until our policy values no longer change
(this may take many environment interactions). Then the best policy is the path
with the shortest distances.
Model-based planning and learning: the agent uses the 𝑄(𝑠, 𝑎) function as behavior
policy as before to sample the new state and reward from the environment, and
to update the policy (𝑄-function). In addition, however, the agent will record the
new state and reward in a local transition 𝑇𝑎(𝑠, 𝑠′) and reward function 𝑅𝑎(𝑠, 𝑠′).
Because the agent now has these local entries we can also sample from our local
functions to update the policy. We can choose: sample from the (expensive) envi-
ronment transition function, or from the (cheap) local transition function. There
is a caveat with sampling locally, however. The local functions may contain fewer
entries—or only high variance entries—especially in the early stages, when few
environment samples have been performed. The usefulness of the local functions
increases as more environment samples are performed.
Thus, we now have a local map on which to record the new states and rewards.
We will use this map to peek, as often as we like and at no cost, at a location on
that map, to update the 𝑄-function. As more environment samples come in, the
map will have more and more locations for which a distance to the supermarket is
recorded. When glances at the map do not improve the policy anymore, we have to
walk in the environment again, and, as before, update the map and the policy.
In conclusion, model-free finds all policy updates outside the agent, from the
environment feedback; model-based also2 uses policy updates from within the agent,
using information from its local map (see Fig. 5.1). In both methods all updates
to the policy are ultimately derived from the environment feedback; model-based
offers a different way to use the information to update the policy, a way that may
be more information-efficient, by keeping information from each sample within
the agent transition model and re-using that information.
2 One option is to only update the policy from the agent’s internal transition model, and not by the
environment samples anymore. However, another option is to keep using the environment samples
to also update the policy in the model-free way. Sutton’s Dyna [741] approach is a well-known
example of this last, hybrid, approach. Compare also Fig. 5.2 and Fig. 5.4.

[Figure: diagram with the nodes Value/Policy, Model and Environment, and arrows labelled
planning, acting, direct RL and model learning]
Fig. 5.1 Direct and Indirect Reinforcement Learning [743]

5.1 Dynamics Models of High-Dimensional Problems

The application environments for model-based reinforcement learning are the same
as for model-free; our goal, however, is to solve larger and more complex problems
in the same amount of time, by virtue of the lower sample complexity and, as it
were, a deeper understanding of the environment.

Transition Model and Knowledge Transfer

The principle of model-based learning is as follows. Where model-free methods
sample the environment to learn the state-to-action policy function 𝜋(𝑠, 𝑎) based
on action rewards, model-based methods sample the environment to learn the state-
to-state transition function 𝑇𝑎(𝑠, 𝑠′) based on action rewards. Once the accuracy of
this local transition function is good enough, the agent can sample from this local
function to improve the policy 𝜋(𝑠, 𝑎) as often as it likes, without incurring the
cost of actual environment samples. In the model-based approach, the agent builds
its own local state-to-state transition (and reward) model of the environment, so
that, in theory at least, it does not need the environment anymore.
This brings us to another reason for the interest in model-based methods. For
sequential decision problems, knowing the transition function is a natural way of
capturing the essence of how the environment works—𝜋 gives the next action, 𝑇
gives the next state.
This is useful, for example, when we switch to a related environment. When
the transition function of the environment is known by the agent, then the agent
can be adapted quickly, without having to learn a whole new policy by sampling
the environment. When a good local transition function of the domain is known
by the agent, then new, but related, problems might be solved efficiently. Hence,
model-based reinforcement learning may contribute to efficient transfer learning
(see Chap. 9).

Sample Efficiency

The sample efficiency of an agent algorithm tells us how many environment samples
it needs for the policy to reach a certain accuracy.
To achieve high sample efficiency, model-based methods learn a dynamics model.
Learning high-accuracy high-capacity models of high-dimensional problems
requires a high number of training examples, to prevent overfitting (see Sect. B.2.7).
Thus, the samples needed to prevent overfitting of the transition model may negate
(some of) the advantage of the low sample complexity that model-based learning of
the policy function achieves. Constructing accurate deep transition models can be difficult in
practice, and for many complex sequential decision problems the best results are
often achieved with model-free methods, although deep model-based methods are
becoming stronger (see, for example, Wang et al. [828]).

5.2 Learning and Planning Agents

The promise of model-based reinforcement learning is to find a high-accuracy
behavior policy at a low cost, by building a local model of the world. This will only
work if the learned transition model provides accurate predictions, and if the extra
cost of planning with the model is reasonable.
Let us see which solutions have been developed for deep model-based reinforce-
ment learning. In Sect. 5.3 we will have a closer look at the performance in different
environments. First, in this section, we will look at four different algorithmic ap-
proaches, and at a classic approach: Dyna’s tabular imagination.

Tabular Imagination

A classic approach is Dyna [741], which popularized the idea of model-based
reinforcement learning. In Dyna, environment samples are used in a hybrid
model-free/model-based manner: they train the transition model, which is then used
in planning to improve the policy, and they also train the policy function directly.
Why is Dyna a hybrid approach? Strict model-based methods update the policy
only by planning using the agent’s transition model, see Alg. 5.1. In Dyna, however,
environment samples are used to also update the policy directly (see Fig. 5.2 and
Alg. 5.2). Thus we get a hybrid approach combining model-based and model-free
learning. This hybrid model-based planning is called imagination because looking
ahead with the agent’s own dynamics model resembles imagining environment
samples outside the real environment inside the “mind” of the agent. In this approach
the imagined samples augment the real (environment) samples at no sample cost.3

Algorithm 5.1 Strict Learned Dynamics Model

repeat
    Sample environment 𝐸 to generate data 𝐷 = (𝑠, 𝑎, 𝑟′, 𝑠′)
    Use 𝐷 to learn 𝑀 = 𝑇𝑎(𝑠, 𝑠′), 𝑅𝑎(𝑠, 𝑠′)          ⊲ learning
    for 𝑛 = 1, . . . , 𝑁 do
        Use 𝑀 to update policy 𝜋(𝑠, 𝑎)              ⊲ planning
    end for
until 𝜋 converges

[Figure: diagram with the nodes Policy/Value, Dynamics Model and Environment, and arrows
labelled planning, acting and learning]
Fig. 5.2 Hybrid Model-Based Imagination

Algorithm 5.2 Hybrid Model-Based Imagination

repeat
    Sample env 𝐸 to generate data 𝐷 = (𝑠, 𝑎, 𝑟′, 𝑠′)
    Use 𝐷 to update policy 𝜋(𝑠, 𝑎)                  ⊲ learning
    Use 𝐷 to learn 𝑀 = 𝑇𝑎(𝑠, 𝑠′), 𝑅𝑎(𝑠, 𝑠′)          ⊲ learning
    for 𝑛 = 1, . . . , 𝑁 do
        Use 𝑀 to update policy 𝜋(𝑠, 𝑎)              ⊲ planning
    end for
until 𝜋 converges

Imagination is a mix of model-based and model-free reinforcement learning.
Imagination performs regular direct reinforcement learning, where the environment
is sampled with actions according to the behavior policy, and the feedback is used
to update the same behavior policy. Imagination also uses the environment sample
to update the dynamics model {𝑇𝑎 , 𝑅 𝑎 }. This extra model is also sampled, and
provides extra updates to the behavior policy, in between the model-free updates.
The diagram in Fig. 5.2 shows how sample feedback is used both for updating
the policy directly and for updating the model, which then updates the policy, by
planning “imagined” feedback. In Alg. 5.2 the general imagination approach is
shown as pseudocode.
3 The term imagination is used somewhat loosely in the field. In a strict sense imagination refers
only to updating the policy from the internal model by planning. In a wider sense imagination
refers to hybrid schemes where the policy is updated from both the internal model and the
environment. Sometimes the term dreaming is used for agents imagining environments.
Sutton’s Dyna-Q [741, 743], which is shown in more detail in Alg. 5.3, is a
concrete implementation of the imagination approach. Dyna-Q uses the Q-function
as behavior policy 𝜋(𝑠) to perform 𝜖-greedy sampling of the environment. It then
updates this policy with the reward, and an explicit model 𝑀. When the model 𝑀
has been updated, it is used 𝑁 times by planning with random actions to update
the Q-function. The pseudocode shows the learning steps (from environment 𝐸)
and 𝑁 planning steps (from model 𝑀). In both cases the Q-function state-action
values are updated. The best action is then derived from the Q-values as usual.

Algorithm 5.3 Dyna-Q [741]

Initialize 𝑄(𝑠, 𝑎) → R randomly
Initialize 𝑀(𝑠, 𝑎) → R × 𝑆 randomly                 ⊲ Model
repeat
    Select 𝑠 ∈ 𝑆 randomly
    𝑎 ← 𝜋(𝑠)                                        ⊲ 𝜋(𝑠) can be 𝜖-greedy(𝑠) based on 𝑄
    (𝑠′, 𝑟) ← 𝐸(𝑠, 𝑎)                               ⊲ Learn new state and reward from environment
    𝑄(𝑠, 𝑎) ← 𝑄(𝑠, 𝑎) + 𝛼 · [𝑟 + 𝛾 · max_𝑎′ 𝑄(𝑠′, 𝑎′) − 𝑄(𝑠, 𝑎)]
    𝑀(𝑠, 𝑎) ← (𝑠′, 𝑟)
    for 𝑛 = 1, . . . , 𝑁 do
        Select ŝ and â randomly
        (𝑠′, 𝑟) ← 𝑀(ŝ, â)                           ⊲ Plan imagined state and reward from model
        𝑄(ŝ, â) ← 𝑄(ŝ, â) + 𝛼 · [𝑟 + 𝛾 · max_𝑎′ 𝑄(𝑠′, 𝑎′) − 𝑄(ŝ, â)]
    end for
until 𝑄 converges

Thus, we see that the number of updates to the policy can be increased without
more environment samples. By choosing the value for 𝑁, we can tune how many
of the policy updates will be environment samples, and how many will be model
samples. In the larger problems that we will see later in this chapter, the ratio
of environment-to-model samples is often set at, for example, 1 : 1000, greatly
reducing sample complexity. The questions then become, of course: how good is the
model, and: how far is the resulting policy from a model-free baseline?
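To make the roles of the learning step and the 𝑁 planning steps concrete, here is a
minimal tabular sketch in the spirit of Alg. 5.3. It is not the reference implementation:
the corridor environment, the function name dyna_q and all parameter values are
assumptions for illustration. The sketch also deviates from the pseudocode in two
places, as is common in practice: planning samples only state-action pairs that have
already been observed (the model starts empty rather than random), and ties between
equal Q-values are broken randomly.

```python
import random


def dyna_q(n_states=6, n_planning=20, iterations=500,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Dyna-Q on a toy corridor; returns the learned Q-table."""
    rng = random.Random(seed)
    actions = (-1, +1)                 # step left, step right
    goal = n_states - 1                # reward +1 on entering the last cell

    def env_step(s, a):                # hypothetical environment E
        s_next = min(max(s + a, 0), goal)
        return s_next, (1.0 if s_next == goal else 0.0)

    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    model = {}                         # M(s, a) -> (s', r), filled from experience

    def q_update(s, a, r, s_next):
        # The goal is terminal, so it contributes no future value.
        future = 0.0 if s_next == goal else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * future - Q[(s, a)])

    for _ in range(iterations):
        s = rng.randrange(goal)        # random non-goal start state
        if rng.random() < epsilon:     # epsilon-greedy behavior policy
            a = rng.choice(actions)
        else:
            best = max(Q[(s, b)] for b in actions)
            a = rng.choice([b for b in actions if Q[(s, b)] == best])
        s_next, r = env_step(s, a)     # learning: one environment sample
        q_update(s, a, r, s_next)
        model[(s, a)] = (s_next, r)
        for _ in range(n_planning):    # planning: N samples from the model
            ps, pa = rng.choice(list(model))
            ps_next, pr = model[(ps, pa)]
            q_update(ps, pa, pr, ps_next)
    return Q


Q = dyna_q()
greedy = {s: max((-1, +1), key=lambda b: Q[(s, b)]) for s in range(5)}
```

Setting n_planning to 0 turns the sketch into plain Q-learning, so the same function
can be used to compare how many environment samples each variant needs.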

Hands On: Imagining Taxi Example

It is time to illustrate how Dyna-Q works with an example. For that, we turn to one
of our favorites, the Taxi world.
Let us see what the effect of imagining with a model can be. Please refer to
Fig. 5.3. We use our simple maze example, the Taxi maze, with zero imagination
(𝑁 = 0), and with large imagination (𝑁 = 50). Let us assume that the reward at all
states returned by the environment is 0, except for the goal, where the reward is +1.
In each state the usual actions are available (north, east, west, south), except
where borders or walls block them.
When 𝑁 = 0 Dyna-Q performs exactly Q-learning, randomly sampling action
rewards, building up the Q-function, and using the Q-values following the 𝜖-greedy
policy for action selection. The purpose of the Q-function is to act as a vessel
of information to find the goal. How does our vessel get filled with information?
Sampling starts off randomly, and the Q-values fill slowly, since the reward landscape
is flat, or sparse: only the goal state returns +1, all other states return 0. In order
to fill the Q-values with actionable information on where to find the goal, first the
algorithm must be lucky enough to choose a state next to the goal, including the
appropriate action to reach the goal. Only then is the first useful reward information
found, and can the first non-zero step towards finding the goal be entered into
the Q-function. We conclude that, with 𝑁 = 0, the Q-function is filled up slowly,
due to sparse rewards.

Fig. 5.3 Taxi world [395]

What happens when we turn on planning? When we set 𝑁 to a high value, such
as 50, we perform 50 planning steps for each learning step. As we can see in the
algorithm, the model is built alongside the 𝑄-function, from environment returns.
As long as the 𝑄-function is still fully zero, then planning with the model will also
be useless. But as soon as one goal entry is entered into 𝑄 and 𝑀, then planning
will start to shine: it will perform 50 planning samples on the M-model, probably
finding the goal information, and possibly building up an entire trajectory filling
states in the 𝑄-function with actions towards the goal.
In a way, the model-based planning amplifies any useful reward information
that the agent has learned from the environment, and plows it back quickly into
the policy function. The policy is learned much quicker, with fewer environment
samples.
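The amplification effect can be traced by hand on a tiny example. The sketch below
assumes a five-cell corridor in which the agent has walked east once, so that the model
holds four recorded transitions and only the last one carries a reward. All names and
values are hypothetical; a step size of 1 and a fixed sweep order are used instead of
random planning samples, purely so that the numbers are easy to follow.

```python
def planning_sweep(Q, model, actions, goal, alpha=1.0, gamma=0.9):
    """One pass over every recorded transition, using only the model."""
    for (s, a), (s_next, r) in model.items():
        # The goal is terminal, so it contributes no future value.
        future = 0.0 if s_next == goal else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * future - Q[(s, a)])


actions = ("west", "east")
goal = 4
# Transitions recorded during one walk east; only the last step is rewarded.
model = {(s, "east"): (s + 1, 1.0 if s + 1 == goal else 0.0) for s in range(goal)}
Q = {(s, a): 0.0 for s in range(goal + 1) for a in actions}

nonzero_after_sweep = []
for _ in range(4):
    planning_sweep(Q, model, actions, goal)
    nonzero_after_sweep.append(sum(1 for v in Q.values() if v > 0))
```

Each sweep pushes the single goal reward one cell further back along the corridor,
so after four sweeps every recorded step has a non-zero value, without a single new
environment sample.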

Reversible Planning and Irreversible Learning

Model-free methods sample the environment and learn the policy function 𝜋(𝑠, 𝑎)
directly, in one step. Model-based methods sample the environment to learn the
policy indirectly, using a dynamics model {𝑇𝑎 , 𝑅 𝑎 } (as we see in Fig. 5.1 and 5.4,
and in Alg. 5.1).

[Figure: left, diagram with the nodes Policy/Value, Dynamics Model and Environment, and
arrows labelled planning, acting and learning; right, diagram with the nodes Policy/Value
and Environment, and arrows labelled acting and learning]
Fig. 5.4 Model-based (left) and Model-free (right). Learning changes the environment state
irreversibly (single arrow); planning changes the agent state reversibly (undo, double arrow)

                        Planning                  Learning
Transition model in:    Agent                     Environment
Agent can undo:         Yes                       No
State is:               Reversible by agent       Irreversible by agent
Dynamics:               Backtrack                 Forward only
Data structure:         Tree                      Path
New state:              In agent                  Sample from environment
Reward:                 By agent                  Sample from environment
Synonyms:               Imagination, simulation   Sampling, rollout
Table 5.1 Difference between Planning and Learning

It is useful to step back for a moment to consider the place of learning and
planning algorithms in the reinforcement learning paradigm. Please refer to Table 5.1
for a summary of differences between planning and learning.
Planning with an internal transition model is reversible. When the agent uses
its own transition model to perform local actions on a local state, then the actions
can be undone, since the agent applied them to a copy in its own memory [528].4
Because of this local state memory, the agent can return to the old state, reversing
the local state change caused by the local action that it has just performed. The
agent can then try an alternative action (which it can also reverse). The agent can
use tree-traversal methods to traverse the state space, backtracking to try other
states.
In contrast to planning, learning is done when the agent does not have access
to its own transition function 𝑇𝑎(𝑠, 𝑠′). The agent can get reward information by
sampling real actions in the environment. These actions are not played out inside
the agent but executed in the actual environment; they are irreversible and cannot
be undone by the agent. Learning uses actions that irreversibly change the state of
the environment. Learning does not permit backtracking; learning algorithms learn
a policy by repeatedly sampling the environment.
Note the similarity between learning and planning: learning samples rewards
from the external environment, planning from the internal model; both use the
samples to update the policy function 𝜋(𝑠, 𝑎).
4 In our dreams we can undo our actions, play what-if, and imagine alternative realities.
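The reversibility of planning can be illustrated with a small look-ahead search that
applies actions to a local copy of the state and undoes them again. This is only a
sketch: the class ImaginedWalk, the function search and the little street map are
hypothetical names and data made up for this example.

```python
class ImaginedWalk:
    """Local state kept inside the agent; the real environment is never touched."""

    def __init__(self, model, start):
        self.model = model          # state -> {action: (next_state, reward)}
        self.path = [start]         # history of imagined states

    @property
    def state(self):
        return self.path[-1]

    def apply(self, action):
        next_state, reward = self.model[self.state][action]
        self.path.append(next_state)
        return reward

    def undo(self):
        self.path.pop()             # reverse the last imagined action


def search(walk, depth, gamma=0.9):
    """Depth-limited tree search with backtracking on the agent's own model."""
    options = walk.model.get(walk.state, {})
    if depth == 0 or not options:
        return 0.0
    best = float("-inf")
    for action in options:
        reward = walk.apply(action)                        # try an action
        best = max(best, reward + gamma * search(walk, depth - 1, gamma))
        walk.undo()                                        # backtrack
    return best


model = {
    "home": {"north": ("park", 0.0), "east": ("square", 0.0)},
    "park": {"east": ("dead end", 0.0)},
    "square": {"north": ("supermarket", 1.0)},
}
walk = ImaginedWalk(model, "home")
value = search(walk, depth=2)
```

After the search the imagined walk is back at its start state, which is exactly what
an agent acting in the real environment cannot do.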

Four Types of Model-Based Methods

In model-based reinforcement learning the challenge is to learn deep,
high-dimensional transition models from limited data. Our methods should be able to
account for model uncertainty, and plan over these models to achieve policy and value
functions that perform as well as or better than model-free methods. Let us look in
more detail at specific model-based reinforcement learning methods to see how this
can be achieved.
Over the years, many different approaches for high-accuracy high-dimensional
model-based reinforcement learning have been devised. Following [601], we group
the methods into four main approaches. We start with two approaches for learning
the model, and then two approaches for planning, using the model. For each we will
take a few representative papers from the literature that we describe in more depth.
After we have done so, we will look at their performance in different environments.
But let us start with the methods for learning a deep model first.

5.2.1 Learning the Model

In model-based approaches, the transition model is learned from sampling the
environment. If this model is not accurate, then planning will not improve the value
or policy function, and the method will perform worse than model-free methods.
When the learning/planning ratio is set to 1/1000, as it is in some experiments,
inaccuracy in the models will reveal itself quickly in a low accuracy policy function.
Much research has focused on achieving high accuracy dynamics models for
high-dimensional problems. Two methods to achieve better accuracy are uncertainty
modeling and latent models. We will start with uncertainty modeling.

5.2.1.1 Modeling Uncertainty

The variance of the transition model can be reduced by increasing the number
of environment samples, but there are also other approaches that we will discuss.
A popular approach for smaller problems is to use Gaussian processes, where
the dynamics model is learned by giving an estimate of the function and of the
uncertainty around the function with a covariance matrix on the entire dataset [93].
A Gaussian model can be learned from few data points, and the transition model can
be used to plan the policy function successfully. An example of this approach is the
PILCO system, which stands for Probabilistic Inference for Learning Control [188,
189]. This system was effective on Cartpole and Mountain car, but does not scale to
larger problems.
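The Gaussian process idea can be illustrated with a minimal sketch, assuming a zero-mean process with a squared-exponential kernel and a hypothetical one-dimensional toy system. This is not the PILCO implementation; the function names, kernel choice, and toy dynamics are assumptions made for illustration. The point is that the model returns both a prediction and an uncertainty that grows away from the data.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of row vectors."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_predict(x_train, y_train, x_query, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at the query points."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    mean = k_qt @ np.linalg.solve(k_tt, y_train)
    cov = rbf_kernel(x_query, x_query) - k_qt @ np.linalg.solve(k_tt, k_qt.T)
    return mean, np.clip(np.diag(cov), 0.0, None)

# Hypothetical toy system: inputs are (state, action) pairs, target is the next state.
rng = np.random.default_rng(0)
sa_train = rng.uniform(-1.0, 1.0, size=(15, 2))
next_state = np.sin(sa_train[:, 0]) + 0.5 * sa_train[:, 1]

near = sa_train[:1]              # a query the model has already seen
far = np.array([[5.0, 5.0]])     # a query far away from all data
mean_near, var_near = gp_predict(sa_train, next_state, near)
mean_far, var_far = gp_predict(sa_train, next_state, far)
```

In this sketch the predictive variance is close to zero at a training point and close to the prior variance far from the data, which is the signal a planner can use to avoid trusting the model where it has not been trained.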
We can also sample from a trajectory distribution optimized for cost, and use
that to train the policy, with a policy-based method [469]. Then we can optimize
policies with the aid of locally-linear models and a stochastic trajectory optimizer.

Algorithm 5.4 Planning with an Ensemble of Models [439]

Initialize policy $\pi_\theta$ and the models $\hat{m}_{\phi_1}, \hat{m}_{\phi_2}, \ldots, \hat{m}_{\phi_K}$    ⊲ ensemble
Initialize an empty dataset $D$
repeat
    $D \leftarrow$ sample with $\pi_\theta$ from environment $E$
    Learn models $\hat{m}_{\phi_1}, \hat{m}_{\phi_2}, \ldots, \hat{m}_{\phi_K}$ using $D$    ⊲ ensemble
    repeat
        $D' \leftarrow$ sample with $\pi_\theta$ from $\{\hat{m}_{\phi_i}\}_{i=1}^{K}$
        Update $\pi_\theta$ with TRPO using $D'$    ⊲ planning
        Estimate performance of trajectories $\hat{\eta}_\tau(\theta, \phi_i)$ for $i = 1, \ldots, K$
    until performance converges
until $\pi_\theta$ performs well in environment $E$

This is the approach that is used in Guided policy search (GPS), which has been shown
to train complex policies with thousands of parameters, learning tasks in MuJoCo
such as Swimming, Hopping and Walking.
Another popular method to reduce variance in machine learning is the ensemble
method. Ensemble methods combine multiple learning algorithms to achieve better
predictive performance; for example, a random forest of decision trees often has
better predictive performance than a single decision tree [93, 574]. In deep model-
based methods the ensemble methods are used to estimate the variance and account
for it during planning. A number of researchers have reported good results with
ensemble methods on larger problems [156, 376]. For example, Chua et al. use an
ensemble of probabilistic neural network models [149] in their approach named
Probabilistic ensembles with trajectory sampling (PETS). They report good results
on high-dimensional simulated robotic tasks (such as Half-cheetah and Reacher).
Kurutach et al. [439] combine an ensemble of models with TRPO, in ME-TRPO.5
In ME-TRPO an ensemble of deep neural networks is used to maintain model
uncertainty, while TRPO is used to optimize the policy parameters. In the planner,
each imagined step is sampled from the ensemble predictions (see Alg. 5.4).
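A minimal sketch of the ensemble idea follows, assuming small linear models fitted on bootstrap resamples of a hypothetical noisy one-dimensional system. The real methods use deep networks and TRPO; the names `true_step`, `fit_ensemble`, and `imagined_rollout` are illustrative assumptions. The sketch shows two things: the spread between members as a proxy for model uncertainty, and imagined rollouts in which each step is predicted by a randomly chosen member.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_step(s, a):
    """Hypothetical toy environment: noisy linear dynamics."""
    return 0.9 * s + 0.5 * a + rng.normal(0.0, 0.05)

# Collect transitions (s, a, s') with random actions.
states = rng.uniform(-1.0, 1.0, size=200)
actions = rng.uniform(-1.0, 1.0, size=200)
next_states = np.array([true_step(s, a) for s, a in zip(states, actions)])
inputs = np.stack([states, actions], axis=1)

def fit_ensemble(x, y, k=5):
    """Fit k least-squares models, each on its own bootstrap resample."""
    members = []
    for _ in range(k):
        idx = rng.integers(0, len(x), size=len(x))
        weights, *_ = np.linalg.lstsq(x[idx], y[idx], rcond=None)
        members.append(weights)
    return np.array(members)             # shape (k, 2)

def imagined_rollout(ensemble, s0, policy, horizon=10):
    """Roll out inside the learned models; a random member predicts each step."""
    s, trajectory = s0, [s0]
    for _ in range(horizon):
        member = ensemble[rng.integers(len(ensemble))]
        s = float(member @ np.array([s, policy(s)]))
        trajectory.append(s)
    return trajectory

ensemble = fit_ensemble(inputs, next_states)
disagreement = ensemble.std(axis=0)      # spread across members, per weight
trajectory = imagined_rollout(ensemble, s0=1.0, policy=lambda s: -s)
```

Sampling a different member at each step keeps a single over-confident model from dominating a whole imagined trajectory, which is the regularizing effect that ensemble methods aim for.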
Uncertainty modeling tries to improve the accuracy of high-dimensional mod-
els by probabilistic methods. A different approach, specifically designed for high-
dimensional deep models, is the latent model approach, which we will discuss
next.

5.2.1.2 Latent Models

Latent models focus on dimensionality reduction of high-dimensional problems.

The idea behind latent models is that in most high-dimensional environments some
elements are less important, such as buildings in the background that never move
and that have no relation with the reward. We can abstract these unimportant
elements away from the model, reducing the effective dimensionality of the space.
5 A video is available at https://sites.google.com/view/me-trpo. The code is at https://github.com/thanard/me-trpo. A blog post is at [362].


Latent models do so by learning to represent the elements of the input and the
reward. Since planning and learning are now possible in a lower-dimensional latent
space, the sampling complexity of learning from the latent models improves.
Even though latent model approaches are often complicated designs, many works
have been published that show good results [391, 309, 308, 688, 710, 310, 304]. Latent
models use multiple neural networks, as well as different learning and planning
algorithms.
To understand this approach, we will briefly discuss one such latent-model
approach: the Value prediction network (VPN) by Oh et al. [564].6 VPN uses four
differentiable functions that are trained to predict the value [290]; Fig. 5.5 shows
how the core functions work together. The core idea in VPN is not to learn directly
in the actual observation space, but first to transform the state representations to a
smaller latent representation, also known as an abstract model. The other functions,
such as value, reward, and next-state, then work on these smaller latent states,
instead of on the more complex high-dimensional states. In this way, planning and
learning occur in a space where states are encouraged only to contain the elements
that influence value changes. Latent space is lower-dimensional, and training and
planning become more efficient.
The four functions in VPN are: (1) an encoding function, (2) a reward function,
(3) a value function, and (4) a transition function. All functions are parameterized
with their own set of parameters. To distinguish these latent-based functions
from the conventional observation-based functions $R, V, T$ they are denoted as
$f^{enc}_{\theta_e}$, $f^{reward}_{\theta_r}$, $f^{value}_{\theta_v}$, $f^{trans}_{\theta_t}$.
• The encoding function $f^{enc}_{\theta_e} : s_{actual} \to s_{latent}$ maps the observation $s_{actual}$ to
the abstract state using neural network $\theta_e$, such as a CNN for visual observations.
This is the function that performs the dimensionality reduction.
• The latent-reward function $f^{reward}_{\theta_r} : (s_{latent}, o) \to r, \gamma$ maps the latent state $s_{latent}$
and option $o$ (a kind of action) to the reward and discount factor. If the option
takes $k$ primitive actions, the network should predict the discounted sum of
the $k$ immediate rewards as a scalar. (The role of options is explained in the
paper [564].) The network also predicts option-discount factor $\gamma$ for the number
of steps taken by the option.
• The latent-value function $f^{value}_{\theta_v} : s_{latent} \to V_{\theta_v}(s_{latent})$ maps the abstract
state to its value using a separate neural network $\theta_v$. This value is the value of
the latent state, not of the actual observation state $V(s_{actual})$.
• The latent-transition function $f^{trans}_{\theta_t} : (s_{latent}, o) \to s'_{latent}$ maps the latent
state to the next latent state, depending also on the option.
Figure 5.5 shows how the core functions work together in the smaller, latent, space;
with 𝑥 the observed actual state, and 𝑠 the encoded latent state [564].
The figure shows a single rollout step, planning one step ahead. However, a
model also allows looking further into the future, by performing multi-step rollouts.
6 See https://github.com/junhyukoh/value-prediction-network for the code.

Fig. 5.5 Architecture of latent model [564]

Algorithm 5.5 Multi-step planning [564]

function Q-Plan($s$, $o$, $d$)
    $r, \gamma, V(s'), s' \leftarrow f^{core}_\theta(s, o)$    ⊲ Perform the four latent functions
    if $d = 1$ then
        return $r + \gamma V(s')$
    end if
    $A \leftarrow$ $b$-best options based on $r + \gamma V_\theta(s')$    ⊲ See paper for other expansion strategies
    for $o' \in A$ do
        $q_{o'} \leftarrow$ Q-Plan($s'$, $o'$, $d - 1$)
    end for
    return $r + \gamma \left[ \frac{1}{d} V(s') + \frac{d-1}{d} \max_{o' \in A} q_{o'} \right]$
end function

Of course, this requires a highly accurate model, otherwise the accumulated inaccu-
racies diminish the accuracy of the far-into-the-future lookahead. Algorithm 5.5
shows the pseudocode for a 𝑑-step planner for the value prediction network.
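The recursion of Alg. 5.5 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the `core` callable stands in for the four latent functions, the toy chain environment `toy_core` is hypothetical, and ranking the candidate options by their one-step estimate from the next state is an assumption about how the $b$ best options are chosen.

```python
def q_plan(core, state, option, depth, options, branch=2):
    """d-step planning value of taking `option` in latent `state`."""
    reward, discount, value, next_state = core(state, option)
    if depth == 1:
        return reward + discount * value

    def one_step(o):
        # One-step estimate used to rank the options in the next state.
        r, g, v, _ = core(next_state, o)
        return r + g * v

    best = sorted(options, key=one_step, reverse=True)[:branch]
    deeper = max(q_plan(core, next_state, o, depth - 1, options, branch)
                 for o in best)
    # Average the direct value estimate with the deeper backed-up estimate.
    mixed = value / depth + (depth - 1) / depth * deeper
    return reward + discount * mixed

def toy_core(state, option):
    """Hypothetical latent model: a chain of states 0..4 with a reward at 4.

    Returns (reward, discount, value of next state, next state); the value
    function is deliberately left at zero, as if it were still untrained.
    """
    next_state = min(max(state + option, 0), 4)
    reward = 1.0 if next_state == 4 and state != 4 else 0.0
    return reward, 0.9, 0.0, next_state

shallow = q_plan(toy_core, 2, +1, depth=1, options=(-1, +1))   # 0.0
deep = q_plan(toy_core, 2, +1, depth=2, options=(-1, +1))      # 0.45
```

With the untrained value function a one-step lookahead sees nothing, while the two-step plan discovers the reward two moves away, weighted by the factor $(d-1)/d$ and the discount.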
The networks are trained with 𝑛-step Q-learning and TD search [709]. Trajec-
tories are generated with an 𝜖-greedy policy using the planning algorithm from
Alg. 5.5. VPN achieved good results on Atari games such as Pacman and Seaquest,
outperforming model-free DQN, and outperforming observation-based planning in
stochastic domains.
Another relevant approach is presented in a sequence of papers by Hafner et
al. [308, 309, 310]. Their PlaNet and Dreamer approaches use latent models based
on a Recurrent State Space Model (RSSM), that consists of a transition model, an ob-
servation model, a variational encoder and a reward model, to improve consistency
between one-step and multi-step predictions in latent space [398, 121, 199].
The latent-model approach reduces the dimensionality of the observation space.
Dimensionality reduction is related to unsupervised learning (Sect. 1.2.2), and
autoencoders (Sect. B.2.6). The latent-model approach is also related to world models,
a term used by Ha and Schmidhuber [303, 304]. World models are inspired by the
manner in which humans are thought to construct a mental model of the world in
which we live. Ha et al. implement world models using generative recurrent neural
networks that generate states for simulation using a variational autoencoder [411,
412] and a recurrent network. Their approach learns a compressed spatial and
temporal representation of the environment. By using features extracted from the

world model as inputs to the agent, a compact and simple policy can be trained to
solve a task, and planning occurs in the compressed world. The term world model
goes back to 1990, see Schmidhuber [670].
Latent models and world models achieve promising results and are, despite their
complexity, an active area of research, see, for example [873]. In the next section
we will further discuss the performance of latent models, but we will first look at
two methods for planning with deep transition models.

5.2.2 Planning with the Model

We have discussed in some depth methods to improve the accuracy of models. We
will now switch from how to create deep models, to how to use them. We will
describe two planning approaches that are designed to be forgiving for models that
contain inaccuracies. The planners try to reduce the impact of the inaccuracy of the
model, for example, by planning ahead with a limited horizon, and by re-learning
and re-planning at each step of the trajectory. We will start with planning with a
limited horizon.

5.2.2.1 Trajectory Rollouts and Model-Predictive Control

At each planning step, the local transition model $T_a(s) \to s'$ computes the new state,
using the local reward to update the policy. Due to the inaccuracies of the internal
model, planning algorithms that perform many steps will quickly accumulate model
errors [296]. Full rollouts of long, inaccurate, trajectories are therefore problematic.
We can reduce the impact of accumulated model errors by not planning too far
ahead. For example, Gu et al. [296] perform experiments with locally linear models
that roll out planning trajectories of length 5 to 10. This reportedly works well for
MuJoCo tasks Gripper and Reacher.
In another experiment, Feinberg et al. [238] allow imagination to a fixed look-
ahead depth, after which value estimates are split into a near-future model-based
component and a distant future model-free component (Model-based value expan-
sion, MVE). They experiment with horizons of 1, 2, and 10, and find that 10 generally
performs best on typical MuJoCo tasks such as Swimmer, Walker, and Half-cheetah.
The sample complexity in their experiments is better than model-free methods
such as DDPG [705]. Similarly good results are reported by others [376, 393], with
a model horizon that is much shorter than the task horizon.
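The split into a near-future model-based part and a distant-future model-free part can be written as a short function. The sketch below shows only this basic idea, not the full MVE algorithm of [238]; all argument names are hypothetical placeholders for a learned model, reward function, policy, and value function.

```python
def mve_target(model, reward_fn, policy, value_fn, state, horizon, gamma=0.99):
    """H-step model-based return plus a bootstrapped model-free value."""
    target, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        target += discount * reward_fn(state, action)   # imagined reward
        state = model(state, action)                    # imagined next state
        discount *= gamma
    # Beyond the horizon the learned value function takes over.
    return target + discount * value_fn(state)

# Toy check: constant reward 1, constant value 10, gamma 0.5, horizon 2
# gives 1 + 0.5 * 1 + 0.25 * 10 = 4.0.
example = mve_target(
    model=lambda s, a: s + a,
    reward_fn=lambda s, a: 1.0,
    policy=lambda s: 1.0,
    value_fn=lambda s: 10.0,
    state=0.0,
    horizon=2,
    gamma=0.5,
)
```

A short horizon limits how far model errors can compound, while the bootstrap term keeps the estimate informed about rewards beyond the horizon.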

Model-Predictive Control

Algorithm 5.6 Neural Network Dynamics for Model-Based Deep Reinforcement Learning (based on [549])

Initialize the model $\hat{m}_\phi$
Initialize an empty dataset $D$
for $i = 1, \ldots, I$ do
    $D \leftarrow E_a$    ⊲ sample action from environment
    Train $\hat{m}_\phi(s, a)$ on $D$ minimizing the error by gradient descent
    for $t = 1, \ldots, T$ Horizon do    ⊲ planning
        $A_t \leftarrow \hat{m}_\phi$    ⊲ estimate optimal action sequence with finite MPC horizon
        Execute first action $a_t$ from sequence $A_t$
        $D \leftarrow (s_t, a_t)$
    end for
end for

Taking the idea of shorter trajectories for planning than for learning further, we
arrive at decision-time planning [467], also known as Model-predictive control
(MPC) [440, 262]. Model-predictive control is a well-known approach in process
engineering, to control complex processes with frequent re-planning over a limited
time horizon. Model-predictive control uses the fact that many real-world processes
are approximately linear over a small operating range (even though they can be
highly non-linear over a longer range). In MPC the model is optimized for a limited
time into the future, and then it is re-learned after each environment step. In this
way small errors do not get a chance to accumulate and influence the outcome
greatly. Related to MPC are other local re-planning methods. All try to reduce the
impact of the use of an inaccurate model by not planning too far into the future
and by updating the model frequently. Applications are found in the automotive
industry and in aerospace, for example for terrain-following and obstacle-avoidance
algorithms [394].
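The re-planning loop can be sketched with a simple random-shooting planner, assuming a hypothetical one-dimensional system whose learned model is slightly wrong. Random shooting is used here only because it is the simplest optimizer to write down; it is not claimed to be the planner of any particular paper, and all names and constants are illustrative.

```python
import numpy as np

def mpc_action(model, cost, state, rng, horizon=5, candidates=500):
    """Return the first action of the cheapest sampled action sequence."""
    sequences = rng.uniform(-1.0, 1.0, size=(candidates, horizon))
    totals = np.zeros(candidates)
    for i, seq in enumerate(sequences):
        s = state
        for a in seq:
            s = model(s, a)
            totals[i] += cost(s, a)
    return sequences[np.argmin(totals), 0]

def real_step(s, a):
    """Hypothetical real environment."""
    return s + 0.5 * a

def model_step(s, a):
    """Hypothetical learned model, deliberately a little inaccurate."""
    return s + 0.45 * a

def cost(s, a):
    """Quadratic cost: drive the state to zero."""
    return s ** 2

rng = np.random.default_rng(0)
state = 2.0
for _ in range(30):
    action = mpc_action(model_step, cost, state, rng)
    state = real_step(state, action)     # execute one action, then re-plan
```

Because only the first action of each plan is executed and the next plan starts from the state that was actually observed, the model error in this sketch is corrected at every step instead of accumulating over the trajectory.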
MPC has been used in various deep model learning approaches. Both Finn et
al. and Ebert et al. [244, 218] use a form of MPC in the planning for their Visual
foresight robotic manipulation system. The MPC part uses a model that generates
the corresponding sequence of future frames based on an image to select the least-
cost sequence of actions. This approach is able to perform multi-object manipulation,
pushing, picking and placing, and cloth-folding tasks (which adds the difficulty of
material that changes shape as it is being manipulated).
Another approach is to use ensemble models for learning the transition model,
with MPC for planning. PETS [149] uses probabilistic ensembles [448] for learning,
and the cross-entropy method (CEM) [183, 99] to optimize action sequences. In MPC
fashion, only the first action from the CEM-optimized sequence is used, re-planning
at every environment step. Many model-based approaches combine MPC and the ensemble method, as we
will also see in the overview in Table 5.2 at the end of the next section. Algorithm 5.6
shows in pseudocode an example of Model-predictive control (based on [549], only
the model-based part is shown).7
MPC is a simple and effective planning method that is well-suited for use with
inaccurate models, by restricting the planning horizon and by re-planning. It has
also been used with success in combination with latent models [309, 391].
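For readers unfamiliar with the cross-entropy method, the following is a minimal sketch of CEM used as an action-sequence optimizer, under the assumption of a deterministic one-dimensional model and actions clipped to [-1, 1]. It is a generic illustration, not the PETS code; the population size, elite count, and toy dynamics are arbitrary choices.

```python
import numpy as np

def rollout_cost(model, cost, state, actions):
    """Total cost of applying an action sequence in the model."""
    total = 0.0
    for a in actions:
        state = model(state, a)
        total += cost(state, a)
    return total

def cem_plan(model, cost, state, rng, horizon=5,
             population=100, elites=10, iterations=5):
    """Refine a Gaussian over action sequences towards low-cost samples."""
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(iterations):
        samples = rng.normal(mean, std, size=(population, horizon))
        samples = np.clip(samples, -1.0, 1.0)
        totals = np.array([rollout_cost(model, cost, state, seq)
                           for seq in samples])
        elite = samples[np.argsort(totals)[:elites]]
        # Refit the sampling distribution to the best sequences.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

def model(s, a):
    """Hypothetical learned dynamics."""
    return s + 0.5 * a

def cost(s, a):
    """Quadratic cost: drive the state to zero."""
    return s ** 2

rng = np.random.default_rng(0)
plan = cem_plan(model, cost, 1.0, rng)
first_action = plan[0]                   # in MPC fashion, only this is executed
```

In an MPC loop this optimizer would be called again from the next observed state, so the later entries of `plan` serve only to evaluate the first action.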
7 The code is at https://github.com/anagabandi/nn_dynamics.

It is now time to look at the final method, which is a very different approach to
planning.

5.2.2.2 End-to-end Learning and Planning-by-Network

Up until now, the learning of the dynamics model and its use are performed by
separate algorithms. In the previous subsection differentiable transition models were
learned through backpropagation and then the models were used by a conventional
hand-crafted procedural planning algorithm, such as depth-limited search, with
hand-coded selection and backup rules.
A trend in machine learning is to replace all hand-crafted algorithms by differ-
entiable approaches, that are trained by example, end-to-end. These differentiable
approaches often are more general and perform better than their hand-crafted
versions.8 Could the planning phase be made differentiable as well? In other words,
can the planning rollouts be implemented
in a single computational model, the neural network?
At first sight, it may seem strange to think of a neural network as something
that can perform planning and backtracking, since we often think of a neural
network as a state-less mathematical function. Neural networks normally perform
transformation and filter activities to achieve selection or classification. Planning
consists of action selection and state unrolling. Note, however, that recurrent neural
networks and LSTMs contain implicit state, making them a candidate to be used for
planning (see Sect. B.2.5). Perhaps it is not so strange to try to implement planning
in a neural network. Let us have a look at attempts to perform planning with a
neural network.
Tamar et al. [748] introduced Value Iteration Networks (VIN), convolutional
networks for planning in Grid worlds. A VIN is a differentiable multi-layer convolu-
tional network that can execute the steps of a simple planning algorithm [562]. The
core idea is that in a Grid world, value iteration can be implemented by a multi-layer
convolutional network: each layer does a step of lookahead (refer back to Listing 2.1
for value iteration). The value iterations are rolled-out in the network layers 𝑆 with
𝐴 channels, and the CNN architecture is shaped specifically for each problem task.
Through backpropagation the model learns the value iteration parameters including
the transition function. The aim is to learn a general model, that can navigate in
unseen environments.
Let us look in more detail at the value iteration algorithm. It is a simple algorithm
that consists of a doubly nested loop over states and actions, calculating the sum of
rewards $\sum_{s' \in S} T_a(s, s') (R_a(s, s') + \gamma V[s'])$ and a subsequent maximization
operation $V[s] = \max_a (Q[s, a])$. This double loop is iterated to convergence. The insight
is that each iteration can be implemented by passing the previous value function
$V_n$ and reward function $R$ through a convolution layer and max-pooling layer. In
this way, each channel in the convolution layer corresponds to the Q-function for a
specific action (the innermost loop), and the convolution kernel weights correspond
to the transitions. Thus by recurrently applying a convolution layer $K$ times, $K$
iterations of value iteration are performed.
8 Note that here we use the term end-to-end to indicate the use of differentiable methods for
the learning and use of a deep dynamics model, replacing hand-crafted planning algorithms that
use the learned model. Elsewhere, in supervised learning, the term end-to-end is used differently,
to describe learning both features and their use from raw pixels for classification, replacing
hand-crafted feature recognizers that pre-process the raw pixels for use in a hand-crafted machine
learning algorithm.
The value iteration module is simply a neural network that has the capability of
approximating a value iteration computation. Representing value iteration in this
form makes learning the MDP parameters and functions natural—by backpropa-
gating through the network, as in a standard CNN. In this way, the classic value
iteration algorithm can be approximated by a neural network.
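The correspondence between value iteration and a convolution followed by a max over channels can be made concrete with a small NumPy sketch. It assumes a deterministic Grid world with four move actions, a reward for occupying a cell, and zero value outside the grid; these are simplifying assumptions. The kernels are fixed by hand here, whereas in a VIN they would be learned by backpropagation.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Plain 3x3 cross-correlation with zero padding (no external libraries)."""
    height, width = image.shape
    padded = np.pad(image, 1)
    out = np.zeros_like(image)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + height, dx:dx + width]
    return out

# One kernel per action: a single 1 selects the neighbour the action moves to.
kernels = np.zeros((4, 3, 3))
kernels[0, 0, 1] = 1.0   # up:    value of the cell above
kernels[1, 2, 1] = 1.0   # down:  value of the cell below
kernels[2, 1, 0] = 1.0   # left:  value of the cell to the left
kernels[3, 1, 2] = 1.0   # right: value of the cell to the right

def vi_layer(value, reward, gamma=0.9):
    """One value-iteration step: a convolution per action, then a channel max."""
    q = np.stack([reward + gamma * conv2d_same(value, k) for k in kernels])
    return q.max(axis=0)

reward = np.zeros((5, 5))
reward[4, 4] = 1.0                   # reward for being in the goal cell
value = np.zeros((5, 5))
for _ in range(20):                  # K stacked layers perform K iterations
    value = vi_layer(value, reward)
```

After the loop the value map increases towards the goal cell, so a greedy policy that moves to the highest-valued neighbour navigates to the goal; stochastic transitions would correspond to kernels with several non-zero weights.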
Why would we want to have a fully differentiable algorithm that can only give
an approximation, if we have a perfectly good classic procedural implementation
that can calculate the value function 𝑉 exactly?
The reason is generalization. The exact algorithm only works for known tran-
sition probabilities. The neural network can learn 𝑇 (·) when it is not given, from
the environment, and it learns the reward and value functions at the same time. By
learning all functions all at once in an end-to-end fashion, the dynamics and value
functions might be better integrated than when a separately hand-crafted planning
algorithm uses the results of a learned dynamics model. Indeed, reported results do
indicate good generalization to unseen problem instances [748].
The idea of planning by gradient descent has existed for some time—actually, the
idea of learning all functions by example has existed for some time—several authors
explored learning approximations of dynamics in neural networks [403, 671, 369].
The VINs can be used for discrete and continuous path planning, and have been
tried in Grid world problems and natural language tasks.
Later work has extended the approach to other applications of more irregular
shape, by adding abstraction networks [668, 724, 710]. The addition of latent models
increases the power and versatility of end-to-end learning of planning and tran-
sitions even further. Let us look briefly in more detail at one such extension of
VIN, to illustrate how latent models and planning go together. TreeQN by Farquhar
et al. [236] is a fully differentiable model learner and planner, using observation
abstraction so that the approach works on applications that are less regular than
mazes.
TreeQN consists of five differentiable functions, four of which we have seen in
the previous section in Value Prediction Networks [564], Fig. 5.5 on page 135.
• The encoding function consists of a series of convolutional layers that embed
the actual state in a lower dimensional state $s_{latent} \leftarrow f^{enc}_{\theta_e}(s_{actual})$.
• The transition function uses a fully connected layer per action to calculate the
next-state representation $s'_{latent} \leftarrow f^{trans}_{\theta_t}(s_{latent}, a_i)_{i=0}^{I}$.
• The reward function predicts the immediate reward for every action $a_i \in A$ in
state $s_{latent}$ using a ReLU layer $r \leftarrow f^{reward}_{\theta_r}(s'_{latent})$.
• The value function of a state is estimated with a vector of weights $V(s_{latent}) \leftarrow w^\top s_{latent} + b$.
• The backup function applies a softmax function (see footnote 9) recursively to calculate the tree
backup value $b(x) \leftarrow \sum_{i=0}^{I} x_i \, \mathrm{softmax}(x)_i$.
These functions together can learn a model, and can also execute 𝑛-step Q-learning,
to use the model to update a policy. Further details can be found in [236] and the
GitHub code.10 TreeQN has been applied on games such as box-pushing and some
Atari games, and outperformed model-free DQN.
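The backup function is small enough to write out in full. The sketch below implements the formula $b(x) = \sum_i x_i \, \mathrm{softmax}(x)_i$ as stated in the list above; the function names are illustrative, and the rest of TreeQN (encoder, transition, reward, and value networks) is not shown.

```python
import numpy as np

def softmax(x):
    """Normalize a vector of real numbers to a probability distribution."""
    shifted = np.exp(x - np.max(x))      # subtract the max for numerical stability
    return shifted / shifted.sum()

def soft_backup(x):
    """b(x) = sum_i x_i * softmax(x)_i, a differentiable stand-in for max(x)."""
    return float(np.dot(x, softmax(x)))

equal = soft_backup(np.array([1.0, 1.0, 1.0]))    # all children equal: returns 1.0
peaked = soft_backup(np.array([0.0, 10.0]))       # close to max(x) = 10, slightly below
```

The design choice is that the backup stays differentiable in every child value, so gradients flow through the whole tree, while it approaches the hard maximum of a classical backup when one child clearly dominates.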
A limitation of VIN is that the tight connection between problem domain, iteration
algorithm, and network architecture limits the applicability to other problems.
Another system that addresses this limitation is Predictron. Like TreeQN, the
Predictron [710] introduces an abstract model to reduce this limitation. As in VPN, the
latent model consists of four differentiable components: a representation model, a
next-state model, a reward model, and a discount model. The goal of the abstract
model in Predictron is to facilitate value prediction (not state prediction) or pre-
diction of pseudo-reward functions that can encode special events, such as staying
alive or reaching the next room. The planning part rolls forward its internal model
𝑘 steps. Unlike VPN, Predictron uses joint parameters. The Predictron has been
applied to procedurally generated mazes and a simulated pool domain. In both cases
it outperformed model-free algorithms.
End-to-end model-based learning-and-planning is an active area of research.
Challenges include understanding the relation between planning and learning [18,
289], achieving performance that is competitive with classical planning algorithms
and with model-free methods, and generalizing the class of applications. In Sect. 5.3
more methods will be shown.

Conclusion

In the previous sections we have discussed two methods to reduce the inaccuracy
of the model, and two methods to reduce the impact of the use of an inaccurate
model. We have seen a range of different approaches to model-based algorithms.
Many of the algorithms were developed recently. Deep model-based reinforcement
learning is an active area of research.
Ensembles and MPC have improved the performance of model-based reinforce-
ment learning. The goal of latent or world models is to learn the essence of the do-
main, reducing the dimensionality, and for end-to-end, to also include the planning
part in the learning. Their goal is generalization in a fundamental sense. Model-free
learns a policy of which action to take in each state. Model-based methods learn
the transition model, from state (via action) to state. Model-free teaches you how to
best respond to actions in your world, model-based helps you to understand your
world. By learning the transition model (and possibly even how to best plan with
it) it is hoped that new generalization methods can be learned.
9 The softmax function normalizes an input vector of real numbers to a probability distribution
in $[0, 1]$: $p_\theta(y \mid x) = \mathrm{softmax}(f_\theta(x)) = \frac{e^{f_\theta(x)}}{\sum_k e^{f_{\theta,k}(x)}}$
10 See https://github.com/oxwhirl/treeqn for the code of TreeQN.

The goal of model-based methods is to get to know the environment so intimately
that the sample complexity can be reduced while staying close to the solution
quality of model-free methods. A second goal is that the generalization power of
the methods improves so much, that new classes of problems can be solved. The
literature is rich and contains many experiments of these approaches on different
environments. Let us now look at the environments to see if we have succeeded.

5.3 High-Dimensional Environments

We have now looked in some detail at approaches for deep model-based reinforce-
ment learning. Let us now change our perspective from the agent to the environment,
and look at the kinds of environments that can be solved with these approaches.

5.3.1 Overview of Model-Based Experiments

The main goal of model-based reinforcement learning is to learn the transition
model accurately: not just the policy function that finds the best action, but the
function that finds the next state. By learning the full essence of the environment a
substantial reduction of sample complexity can be achieved. Also, the hope is that
the model allows us to solve new classes of problems. In this section we will try to
answer the question if these approaches have succeeded.
The answer to this question can be measured in training time and in run-time
performance. For performance, most benchmark domains provide easily measurable
quantities, such as the score in an Atari game. For model-based approaches, the
scores achieved by state-of-the-art model-free algorithms such as DQN, DDPG,
PPO, SAC and A3C are a useful baseline. For training time, the reduction in sample
complexity is an obvious choice. However, many model-based approaches use a
fixed hyperparameter to determine the relation between external environment
samples and internal model samples (such as 1 : 1000). Then the number of time
steps needed for high performance to be reached becomes an important measure,
and this is indeed published by most authors. For model-free methods, we often see
time steps in the millions per training run, and sometimes even billions. With so
many time steps it becomes quite important how much processing each time step
takes. For model-based methods, individual time steps may take longer than for
model-free, since more processing for learning and planning has to be performed.
In the end, wall-clock time is important, and this is also often published.
There are two additional questions. First we are interested in knowing whether
a model-based approach allows new types of problems to be solved, that could
not be solved by model-free methods. Second is the question of brittleness. In
many experiments the numerical results are quite sensitive to different settings of
hyperparameters (including the random seeds). This is the case in many model-free
and model-based results [328]. However, when the transition model is accurate, the
variance may diminish, and some model-based approaches might be more robust.

Name              Learning          Planning    Environment   Ref
PILCO             Uncertainty       Trajectory  Pendulum      [188]
iLQG              Uncertainty       MPC         Small         [756]
GPS               Uncertainty       Trajectory  Small         [468]
SVG               Uncertainty       Trajectory  Small         [326]
VIN               CNN               e2e         Mazes         [748]
VProp             CNN               e2e         Mazes         [551]
Planning          CNN/LSTM          e2e         Mazes         [298]
TreeQN            Latent            e2e         Mazes         [236]
I2A               Latent            e2e         Mazes         [615]
Predictron        Latent            e2e         Mazes         [710]
World Model       Latent            e2e         Car Racing    [304]
Local Model       Uncertainty       Trajectory  MuJoCo        [296]
Visual Foresight  Video Prediction  MPC         Manipulation  [244]
PETS              Ensemble          MPC         MuJoCo        [149]
MVE               Ensemble          Trajectory  MuJoCo        [238]
Meta Policy       Ensemble          Trajectory  MuJoCo        [156]
Policy Optim      Ensemble          Trajectory  MuJoCo        [376]
PlaNet            Latent            MPC         MuJoCo        [309]
Dreamer           Latent            Trajectory  MuJoCo        [308]
Plan2Explore      Latent            Trajectory  MuJoCo        [688]
L3P               Latent            Trajectory  MuJoCo        [873]
Video-prediction  Latent            Trajectory  Atari         [563]
VPN               Latent            Trajectory  Atari         [564]
SimPLe            Latent            Trajectory  Atari         [391]
Dreamer-v2        Latent            Trajectory  Atari         [310]
MuZero            Latent            e2e/MCTS    Atari/Go      [679]

Table 5.2 Model-Based Reinforcement Learning Approaches [601]

Table 5.2 lists 26 experiments with model-based methods [601]. In addition to
the name, the table provides an indication of the type of model learning that the
agent uses, of the type of planning, and of the application environment in which it
was used. The categories in the table are described in the previous section, where
e2e means end-to-end.
In the table, the approaches are grouped by environment. At the top are smaller
applications such as mazes and navigation tasks. In the middle are larger MuJoCo
tasks. At the bottom are high-dimensional Atari tasks. Let us look in more depth at
the three groups of environments: small navigation, robotics, and Atari games.

5.3.2 Small Navigation Tasks

We see that a few approaches use smaller 2D Grid world navigation tasks such
as mazes, or block puzzles, such as Sokoban, and Pacman. Grid world tasks are
some of the oldest problems in reinforcement learning, and they are used frequently
to test out new ideas. Tabular imagination approaches such as Dyna, and some
latent model and end-to-end learning and planning, have been evaluated with
these environments. They typically achieve good results, since the problems are of
moderate complexity.
Grid world navigation problems are quintessential sequential decision problems.
Navigation problems are typically low-dimensional, and no visual recognition is
involved; transition functions are easy to learn.
Navigation tasks are also used for latent model and end-to-end learning. Three
latent model approaches in Table 5.2 use navigation problems. I2A deals with model
imperfections by introducing a latent model, based on Chiappa et al. and Buesing et
al. [144, 121]. I2A is applied to Sokoban and Mini-Pacman by [615, 121]. Performance
compares favorably to model-free learning and to planning algorithms (MCTS).
Value iteration networks introduced the concept of end-to-end differentiable
learning and planning [748, 562], after [403, 671, 369]. Through backpropagation
the model learns to perform value iteration. The aim to learn a general model that
can navigate in unseen environments was achieved, although different extensions
were needed for more complex environments.

5.3.3 Robotic Applications

Next, we look at papers that use MuJoCo to model continuous robotic problems.
Robotic problems are high-dimensional problems with continuous action spaces.
MuJoCo is used by most experiments in this category to simulate the physical
behavior of robotic movement and the environment.
Uncertainty modeling with ensembles and MPC re-planning try to reduce or
contain inaccuracies. The combination of ensemble methods with MPC is well suited
for robotic problems, as we have seen in individual approaches such as PILCO and
PETS.
Robotic applications are more complex than Grid worlds; model-free methods
can take many time steps to find good policies. It is important to know if model-
based methods succeed in reducing sample complexity in these problems. When we
have a closer look at how well uncertainty modeling and MPC succeed at achieving
our first goal, we find a mixed picture.
A benchmark study by Wang et al. [828] looked into the performance of ensemble
methods and Model-predictive control on MuJoCo tasks. It finds that these methods
mostly find good policies, and do so in significantly fewer time steps than model-
free methods, typically in 200k time steps versus 1 million for model-free. So, it
would appear that the lower sample complexity is achieved. However, they also

note that per time step, the more complex model-based methods perform more
processing than the simpler model-free methods. Although the sample complexity
may be lower, the wall-clock time is not, and model-free methods such as PPO
and SAC are still much faster for many problems. Furthermore, the score that the
policy achieves varies greatly for different problems, and is sensitive to different
hyperparameter values.
Another finding is that in some experiments with a large number of time steps,
the performance of model-based methods plateaus well below model-free per-
formance, and the performance of the model-based methods themselves differs
substantially. There is a need for further research in deep model-based methods,
especially into robustness of results. More benchmarking studies are needed that
compare different methods.

5.3.4 Atari Games Applications

Some experiments in Table 5.2 use the Arcade Learning Environment (ALE). ALE
features high-dimensional inputs, and provides one of the most challenging
environments of the table. Latent models in particular use Atari games to showcase
their performance, and some do indeed achieve impressive results, in that they are able
to solve new problems, such as playing all 57 Atari games well (Dreamer-v2) [310]
and learning the rules of Atari and chess (MuZero) [679].
Hafner et al. published the papers Dream to control: learning behaviors by latent
imagination, and Dreamer v2 [308, 310]. Their work extends the work on VPN and
PlaNet with more advanced latent models and reinforcement learning methods [564,
309]. Dreamer uses an actor-critic approach to learn behaviors that consider rewards
beyond the horizon. Values are backpropagated through the value model, similar to
DDPG [480] and Soft actor critic [306].
An important advantage of model-based reinforcement learning is that it can
generalize to unseen environments with similar dynamics [688]. The Dreamer
experiments showed that latent models are indeed more robust to unseen envi-
ronments than model-free methods. Dreamer is tested with applications from the
DeepMind control suite (Sect. 4.3.2).
Value prediction networks are another latent approach. They outperform model-
free DQN on mazes and Atari games such as Seaquest, QBert, Krull, and Crazy
Climber. Taking the development of end-to-end learner/planners such as VPN
and Predictron further is the work on MuZero [679, 290, 359]. In MuZero a new
architecture is used to learn the transition functions for a range of different games,
from Atari to the board games chess, shogi and Go. MuZero learns the transition
model for all games from interaction with the environment.11 The MuZero model
includes different modules: a representation, dynamics, and prediction function. Like
AlphaZero, MuZero uses a refined version of MCTS for planning (see Sect. 6.2.1.2 in
the next chapter). This MCTS planner is used in a self-play training loop for policy
improvement. MuZero’s achievements are impressive: it is able to learn the rules
of Atari games as well as board games, learning to play the games from scratch,
in conjunction with learning the rules of the games. The MuZero achievements
have inspired follow-up work to provide more insight into the relationship between
actual and latent representations, and to reduce the computational demands [290,
186, 39, 334, 18, 680, 289, 860].
11 This is a somewhat unusual usage of board games. Most researchers use board games because
the transition function is given (see next chapter). MuZero instead does not know the rules of
chess, but starts from scratch learning the rules from interaction with the environment.
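The three modules named above (representation, dynamics, prediction) can be made concrete with a small sketch. This is not the MuZero architecture: the dimensions, the single-layer functions, and the random weights standing in for trained networks are all assumptions made for illustration. The point is only to show how an observation is encoded once, after which an imagined action sequence is unrolled entirely in latent space.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, N_ACTIONS = 16, 4, 3   # hypothetical sizes

# Random weights stand in for trained networks (assumption).
W_h = rng.normal(size=(LATENT_DIM, OBS_DIM))
W_g = rng.normal(size=(LATENT_DIM, LATENT_DIM + N_ACTIONS))
w_r = rng.normal(size=LATENT_DIM + N_ACTIONS)
W_p = rng.normal(size=(N_ACTIONS, LATENT_DIM))
w_v = rng.normal(size=LATENT_DIM)

def one_hot(action):
    v = np.zeros(N_ACTIONS)
    v[action] = 1.0
    return v

def representation(obs):
    """Encode a high-dimensional observation into a latent state."""
    return np.tanh(W_h @ obs)

def dynamics(latent, action):
    """Predict the next latent state and the reward for taking an action."""
    x = np.concatenate([latent, one_hot(action)])
    return np.tanh(W_g @ x), float(w_r @ x)

def prediction(latent):
    """Predict a policy (softmax over actions) and a value for a latent state."""
    logits = W_p @ latent
    policy = np.exp(logits - logits.max())
    return policy / policy.sum(), float(w_v @ latent)

obs = rng.normal(size=OBS_DIM)
latent = representation(obs)      # the observation is used only here
rewards = []
for action in [0, 2, 1]:          # imagined actions; the environment is not queried
    policy, value = prediction(latent)
    latent, r = dynamics(latent, action)
    rewards.append(r)
```

In a planner such as MCTS, the policy and value outputs would guide which imagined actions to expand; here a fixed action sequence is used to keep the sketch short.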
Latent models reduce observational dimensionality to a smaller model to perform
planning in latent space. End-to-end learning and planning is able to learn new
problems—the second of our two goals: it is able to learn to generalize navigation
tasks, and to learn the rules of chess and Atari. These are new problems, that are
out of reach for model-free methods (although the sample complexity of MuZero is
quite large).

Conclusion

In deep model-based reinforcement learning, benchmarks drive progress. We have
seen good results with ensembles and local re-planning in continuous problems,
and with latent models in discrete problems. In some applications, both the first
goal, of better sample complexity, and the other goals, of learning new applications
and reducing brittleness, are achieved.
The experiments used many different environments within the ALE and MuJoCo
suites, from hard to harder. In the next two chapters we will study multi-agent
problems, where we encounter a new set of benchmarks, with a state space of
many combinations, including hidden information and simultaneous actions. These
provide even more complex challenges for deep reinforcement learning methods.

5.3.5 Hands On: PlaNet Example

Before we go to the next chapter, let us take a closer look at how one of these
methods achieves efficient learning of a complex high-dimensional task. We will
look at PlaNet, a well-documented project by Hafner et al. [309]. Code is available,12
scripts are available, videos are available, and a blog is available13 inviting us to
take the experiments further. The name of the work is Learning latent dynamics
from pixels, which describes what the algorithm does: use high dimensional visual
input, convert it to latent space, and plan in latent space to learn robot locomotion
dynamics.
12 https://github.com/google-research/planet
13 https://planetrl.github.io

Fig. 5.6 Locomotion tasks of PlaNet [309]

PlaNet solves continuous control tasks that include contact dynamics, partial
observability, and sparse rewards. The applications used in the PlaNet experiments
are: (a) Cartpole (b) Reacher (c) Cheetah (d) Finger (e) Cup and (f) Walker (see
Fig. 5.6). The Cartpole task is a swing-up task, with a fixed viewpoint. The cart can
be out of sight, requiring the agent to remember information from previous frames.
The Finger spin task requires predicting the location of two separate objects and
their interactions. The Cheetah task involves learning to run. It includes contacts
of the feet with the ground that require a model to predict multiple futures. In the
Cup task the agent must catch a ball in a cup. It provides a sparse reward signal once the
ball is caught, requiring accurate predictions far into the future. The Walker task
involves a simulated robot that begins lying on the ground, and must learn to stand
up and then walk. PlaNet performs well on these tasks. On DeepMind control tasks
it achieves higher accuracy than an A3C or a D4PG agent. It reportedly does so
using about 50 times fewer interactions with the environment on average (reported
as 5000%).
It is instructive to experiment with PlaNet. The code can be found on GitHub.14
Scripts are available to run the experiments with simple one line commands:

python3 -m planet.scripts.train --logdir /path/to/logdir --params '{tasks: [cheetah_run]}'

As usual, this does require having the right versions of the right libraries installed,
which may be a challenge and may require some creativity on your part. The
required versions are listed on the GitHub page. The blog also contains videos and
pictures of what to expect, including comparisons to model-free baselines from the
DeepMind Control Suite (A3C, D4PG).
The experiments show the viability of the idea to use rewards and values to
compress actual states into lower dimensional latent states, and then plan with
these latent states. Value-based compression reduces details in the high-dimensional
actual states as noise that is not relevant to improve the value function [290]. To
help understand how the actual states map to the latent states, see, for example [186,
472, 401].
14 https://github.com/google-research/planet

Summary and Further Reading

This has been a diverse chapter. We will summarize the chapter, and provide refer-
ences for further reading.

Summary

Model-free methods sample the environment using the rewards to learn the policy
function, providing actions for all states for an environment. Model-based methods
use the rewards to learn the transition function, and then use planning methods
to sample the policy from this internal model. Metaphorically speaking: model-
free learns how to act in the environment, model-based learns how to be the
environment. The learned transition model acts as a multiplier on the amount of
information that is used from each environment sample. A consequence is that
model-based methods have a lower sample complexity, although, when the agent’s
transition model does not perfectly reflect the environment’s transition function,
the performance of the policy may be worse than a model-free policy (since that
always uses the environment to sample from).
Another, and perhaps more important aspect of the model-based approach, is
generalization. Model-based reinforcement learning builds a dynamics model of
the domain. This model can be used multiple times, for new problem instances,
but also for related problem classes. By learning the transition and reward model,
model-based reinforcement learning may be better at capturing the essence of a
domain than model-free methods, and thus be able to generalize to variations of
the problem.
Imagination showed how to learn a model and use it to fill in extra samples
based on the model (not the environment). For problems where tabular methods
work, imagination can be many times more efficient than model-free methods.
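A minimal sketch of tabular imagination in the style of Dyna-Q is given below. The six-state chain environment, the hyperparameter values, and the function names are assumptions chosen to keep the example small. After every real environment step the agent stores the observed transition in a model and then performs `n_planning` additional updates on transitions sampled from that model rather than from the environment.

```python
import random

N_STATES, GOAL = 6, 5
ACTIONS = (0, 1)  # 0 = left, 1 = right

def env_step(s, a):
    """Hypothetical chain environment: reward 1 on reaching the right end."""
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def dyna_q(episodes=30, n_planning=10, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rnd = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # (state, action) -> (reward, next state, done)

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rnd.random() < eps:
                a = rnd.choice(ACTIONS)
            else:  # greedy action, ties broken at random
                best = max(Q[(s, b)] for b in ACTIONS)
                a = rnd.choice([b for b in ACTIONS if Q[(s, b)] == best])
            s2, r, done = env_step(s, a)
            update(s, a, r, s2, done)        # learn from the real sample
            model[(s, a)] = (r, s2, done)    # learn the model
            for _ in range(n_planning):      # plan with imagined samples
                ps, pa = rnd.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return Q

Q = dyna_q()
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)]
print(greedy)
```

Setting `n_planning=0` reduces the sketch to plain Q-learning, which makes it easy to compare how many environment steps each variant needs before the greedy policy moves right in every state.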
When the agent has access to the transition model, it can apply reversible
planning algorithms, in addition to one-way learning with samples. There is a large
literature on backtracking and tree-traversal algorithms. Using a look-ahead of more
than one step can increase the quality of the reward even more. When the problem
size increases, or when we perform deep multi-step look-ahead, the accuracy of the
model becomes critical. For high-dimensional problems high capacity networks are
used that require many samples to prevent overfitting. Thus a trade-off exists, to
keep sample complexity low.
Methods such as PETS aim to take the uncertainty of the model into account
in order to increase modeling accuracy. Model-predictive control methods re-plan
at each environment step to prevent over-reliance on the accuracy of the model.
Classical tabular approaches and Gaussian Process approaches have been quite
successful in achieving low sample complexity for small problems [743, 190, 417].
Latent models observe that in many high-dimensional problems the factors that
influence changes in the value function are often lower-dimensional. For example,
the background scenery in an image may be irrelevant for the quality of play in
a game, and has no effect on the value. Latent models use an encoder to translate
the high-dimensional actual state space into a lower-dimensional latent state space.
Subsequent planning and value functions work on the (much smaller) latent space.
Finally, we considered end-to-end model-based algorithms. These fully differen-
tiable algorithms not only learn the dynamics model, but also learn the planning
algorithm that uses the model. The work on Value iteration networks [748] inspired
recent work on end-to-end learning, where both the transition model and the plan-
ning algorithm are learned, end-to-end. Combined with latent models (or World
models [304]) impressive results were achieved [710], and the model and planning
accuracy was improved to the extent that tabula rasa self-learning of game-rules
was achieved, in MuZero [679], for chess, shogi, Go, and Atari games.

Further Reading

Model-based reinforcement learning promises more sample efficient learning. The
field has a long history. For exact tabular methods Sutton’s Dyna-Q is a classical
approach that illustrates the basic concept of model-based learning [741, 742].
There is an extensive literature on the approaches that were discussed in this
chapter. For uncertainty modeling see [188, 189, 756], and for ensembles [574, 149,
156, 376, 439, 391, 469, 468, 326, 244]. For Model-predictive control see [549, 262,
505, 426, 296, 238, 38].
Latent models are an active field of research. Two of the earlier works are [563, 564],
although the ideas go back to World models [304, 403, 671, 369]. Later, a sequence
of PlaNet and Dreamer papers was influential [309, 308, 688, 310, 874].
The literature on end-to-end learning and planning is also extensive, starting
with VIN [748], see [22, 704, 706, 710, 551, 298, 236, 679, 239].
As applications became more challenging, notably in robotics, other methods
were developed, mostly based on uncertainty, see for surveys [190, 417]. Later, as
high-dimensional problems became prevalent, latent and end-to-end methods were
developed. The basis for the section on environments is an overview of recent model-
based approaches [601]. Another survey is [529], a comprehensive benchmark study
is [828].

Exercises

Let us go to the Exercises.

Questions

Below are first some quick questions to check your understanding of this chapter.
For each question a simple, single sentence answer is sufficient.

1. What is the advantage of model-based over model-free methods?
2. Why may the sample complexity of model-based methods suffer in
high-dimensional problems?
3. Which functions are part of the dynamics model?
4. Mention four deep model-based approaches.
5. Do model-based methods achieve better sample complexity than model-free?
6. Do model-based methods achieve better performance than model-free?
7. In Dyna-Q the policy is updated by two mechanisms: learning by sampling the
environment and what other mechanism?
8. Why is the variance of ensemble methods lower than of the individual machine
learning approaches that are used in the ensemble?
9. What does model-predictive control do and why is this approach suited for
models with lower accuracy?
10. What is the advantage of planning with latent models over planning with actual
models?
11. How are latent models trained?
12. Mention four typical modules that constitute the latent model.
13. What is the advantage of end-to-end planning and learning?
14. Mention two end-to-end planning and learning methods.

Exercises

It is now time to introduce a few programming exercises. The main purpose of the
exercises is to become more familiar with the methods that we have covered in this
chapter. By playing around with the algorithms and trying out different hyperpa-
rameter settings you will develop some intuition for the effect on performance and
run time of the different methods.
The experiments may become computationally expensive. You may want to
consider running them in the cloud, with Google Colab, Amazon AWS, or Microsoft
Azure. They may have student discounts, and they will have the latest GPUs or
TPUs for use with TensorFlow or PyTorch.
1. Dyna Implement tabular Dyna-Q for the Gym Taxi environment. Vary the amount
of planning 𝑁 and see how performance is influenced.
2. Keras Make a function approximation version of Dyna-Q and Taxi, with Keras.
Vary the capacity of the network and the amount of planning. Compare against
a pure model-free version, and note the difference in performance for different
tasks and in computational demands.
3. Planning In Dyna-Q, planning has so far been with single step model samples.
Implement a simple depth-limited multi-step look-ahead planner, and see how
performance is influenced for the different look-ahead depths.
4. MPC Read the paper by Nagabandi et al. [549] and download the code.15 Acquire
the right versions of the libraries, and run the code with the supplied scripts,
just for the MB (model-based) versions. Note that plotting is also supported by
the scripts. Run with different MPC horizons. Run with different ensemble sizes.
What are the effects on performance and run time for the different applications?
5. PlaNet Go to the PlaNet blog and read it (see previous section).16 Go to the PlaNet
GitHub site and download and install the code.17 Install the DeepMind control
suite,18 and all necessary versions of the support libraries.
Run Reacher and Walker in PlaNet, and compare against the model-free methods
D4PG and A3C. Vary the size of the encoding network and note the effect on
performance and run time. Now turn off the encoder, and run with planning on
actual states (you may have to change network sizes to achieve this). Vary the
capacity of the latent model, and of the value and reward functions. Also vary
the amount of planning, and note its effect.
6. End-to-end As you have seen, these experiments are computationally expen-
sive. We will now turn to end-to-end planning and learning (VIN and MuZero).
This exercise is also computationally expensive. Use small applications, such
as small mazes, and Cartpole. Find and download a MuZero implementation
from GitHub and explore using the experience that you have gained from the
previous exercises. Focus on gaining insight into the shape of the latent space.
Try MuZero-General [215],19 or a MuZero visualization [186] to get insight
into latent space.20 (This is a challenging exercise, suitable for a term project or
thesis.)

15 https://github.com/anagabandi/nn_dynamics
16 https://planetrl.github.io
17 https://github.com/google-research/planet
18 https://github.com/deepmind/dm_control
19 https://github.com/werner-duvaud/muzero-general
20 https://github.com/kaesve/muzero
Chapter 6
Two-Agent Self-Play

Previous chapters were concerned with how a single agent can learn optimal
behavior for its environment. This chapter is different. We turn to problems where
two agents operate whose behavior will both be modeled (and, in the next chapter,
more than two).
Two-agent problems are interesting for two reasons. First, the world around us is
full of active entities that interact, and modeling two agents and their interaction is
a step closer to understanding the real world than modeling a single agent. Second,
in two-agent problems exceptional results were achieved—reinforcement learning
agents teaching themselves to become stronger than human world champions—and
by studying these methods we may find a way to achieve similar results in other
problems.
The kind of interaction that we model in this chapter is zero-sum: my win is your
loss and vice versa. These two-agent zero-sum dynamics are fundamentally different
from single-agent dynamics. In single agent problems the environment lets you
probe it, lets you learn how it works, and lets you find good actions. Although the
environment may not be your friend, it is also not working against you. In two-agent
zero-sum problems the environment does try to beat you: it actively changes
its replies to minimize your reward, based on what it learns from your behavior.
When learning our optimal policy we should take all possible counter-actions into
account.
A popular way to do so is to implement the environment’s actions with self-play:
we replace the environment by a copy of ourselves. In this way we let ourselves
play against an opponent that has all the knowledge that we currently have, and
agents learn from each other.
We start with a short review of two-agent problems, after which we dive into self-
learning. We look at the situation when both agents know the transition function
perfectly, so that model accuracy is no longer a problem. This is the case, for example,
in games such as chess and Go, where the rules of the game determine how we can
go from one state to another.
In self-learning the environment is used to generate training examples for the
agent to train a better policy, after which the better agent policy is used in this
environment to train the agent, and again, and again, creating a virtuous cycle of
self-learning and mutual improvement. It is possible for an agent to teach itself
to play a game without any prior knowledge at all, so-called tabula rasa learning,
learning from a blank slate.
The self-play systems that we describe in this chapter use model-based methods,
and combine planning and learning approaches. There is a planning algorithm that
we have mentioned a few times, but have not yet explained in detail. In this chapter
we will discuss Monte Carlo Tree Search, or MCTS, a highly popular planning
algorithm. MCTS can be used in single agent and in two-agent situations, and is
the core of many successful applications, including MuZero and the self-learning
AlphaZero series of programs. We will explain how self-learning and self-play work
in AlphaGo Zero, and why they work so well. We will then discuss the concept of
curriculum learning, which is behind the success of self-learning.
The chapter is concluded with exercises, a summary, and pointers to further
reading.

Core Concepts

• Self-play
• Curriculum learning

Core Problem

• Use a given transition model for self-play, in order to become stronger than the
current best players

Core Algorithms

• Minimax (Listing 6.2)
• Monte Carlo Tree Search (Listing 6.3)
• AlphaZero tabula rasa learning (Listing 6.1)

Self-Play in Games

We have seen in Chap. 5 that when the agent has a transition model of the envi-
ronment, it can achieve greater performance, especially when the model has high
accuracy. What if the accuracy of our model were perfect, if the agent’s transition
function is the same as the environment’s, how far would that bring us? And what
if we could improve our environment as part of our learning process, can we then
transcend our teacher, can the sorcerer’s apprentice outsmart the wizard?
To set the scene for this chapter, let us describe the first game where this has
happened: backgammon.

Fig. 6.1 Backgammon and Tesauro

Learning to Play Backgammon

In Sect. 3.2.3 we briefly discussed research into backgammon. Already in the early
1990s, the program TD-Gammon achieved stable reinforcement learning with a
shallow network. This work was started at the end of the 1980s by Gerald Tesauro,
a researcher at IBM laboratories. Tesauro was faced with the problem of getting a
program to learn beyond the capabilities of any existing entity. (In Fig. 6.1 we see
Tesauro in front of his program; image by IBM Watson Media.)
In the 1980s computing was different. Computers were slow, datasets were small,
and neural networks were shallow. Against this background, the success of Tesauro
is quite remarkable.
His programs were based on neural networks that learned good patterns of play.
His first program, Neurogammon, was trained using supervised learning, based on
games of human experts. In supervised learning the model cannot become stronger
than the human games it is trained on. Neurogammon achieved an intermediate
level of play [762]. His second program, TD-Gammon, was based on reinforcement
learning, using temporal difference learning and self-play. Combined with hand-
crafted heuristics and some planning, in 1992 it played at human championship
level, becoming the first computer program to do so in a game of skill [765].
TD-Gammon is named after temporal difference learning because it updates
its neural net after each move, reducing the difference between the evaluation of
previous and current positions. The neural network used a single hidden layer with
up to 80 units. TD-Gammon initially learned from a state of zero knowledge, tabula
rasa. Tesauro describes TD-Gammon’s self-play as follows: The move that is selected
is the move with maximum expected outcome for the side making the move. In other
words, the neural network is learning from the results of playing against itself. This
self-play training paradigm is used even at the start of learning, when the network’s
weights are random, and hence its initial strategy is a random strategy [764].
TD-Gammon performed tabula rasa learning, its neural network weights initialized
to small random numbers. It reached world-champion level purely by playing
against itself, learning the game as it played along.
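The idea of reducing the difference between the evaluations of the previous and the current position can be sketched as a single update rule. The sketch below is a simplification and not TD-Gammon itself: it assumes a linear evaluation function instead of a neural network, a TD(0) update, and a made-up "game" of three positions that always ends in a win. The function name `td_update` and the feature vectors are hypothetical.

```python
import numpy as np

def td_update(w, x_prev, x_curr, alpha=0.1, outcome=None):
    """One temporal difference step for a linear evaluation V(x) = w . x.
    The previous position's value is moved toward the current position's
    value, or toward the actual outcome when the game has ended."""
    v_prev = w @ x_prev
    target = outcome if outcome is not None else w @ x_curr
    return w + alpha * (target - v_prev) * x_prev

# Three successive positions, each with its own one-hot feature (assumption).
positions = [np.array([1.0, 0.0, 0.0]),
             np.array([0.0, 1.0, 0.0]),
             np.array([0.0, 0.0, 1.0])]

w = np.zeros(3)
for game in range(200):
    for prev, curr in zip(positions, positions[1:]):
        w = td_update(w, prev, curr)                    # update after each move
    w = td_update(w, positions[-1], None, outcome=1.0)  # final position: win
print(np.round(w, 3))
```

The outcome is only observed at the end of each game, yet after repeated play the evaluation of the earlier positions also approaches the final result, because each position's value is pulled toward that of its successor.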
Such autonomous self-learning is one of the main goals of artificial intelligence.
TD-Gammon’s success inspired many researchers to try neural networks and self-
play approaches, culminating eventually, many years later, in high-profile results
in Atari [522] and AlphaGo [703, 706], which we will describe in this chapter.1
In Sect. 6.1 two-agent zero-sum environments will be described. Next, in Sect. 6.2
the tabula rasa self-play method is described in detail. In Sect. 6.3 we focus on the
achievements of the self-play methods. Let us now start with two-agent zero-sum
problems.

6.1 Two-Agent Zero-Sum Problems

Before we look into self-play algorithms, let us look for a moment at the two-agent
games that have fascinated artificial intelligence researchers for such a long time.
Games come in many shapes and sizes. Some are easy, some are hard. The
characteristics of games are described in a fairly standard taxonomy. Important
characteristics of games are: the number of players, whether the game is zero-sum or
non-zero-sum, whether it is perfect or imperfect information, what the complexity
of taking decisions is, and what the state space complexity is. We will look at these
characteristics in more detail.

• Number of Players One of the most important elements of a game is the number
of players. One-player games are normally called puzzles, and are modeled as
a standard MDP. The goal of a puzzle is to find a solution. Two-player games
are “real” games. Quite a number of two-player games exist that provide a nice
balance between being too easy and being too hard for players (and for computer
programmers) [172]. Examples of two-player games that are popular in AI are
chess, checkers, Go, Othello, and shogi.
Multi-player games are played by three or more players. Well-known examples
of multiplayer games are the card games bridge and poker, and strategy games
such as Risk, Diplomacy, and StarCraft.
• Zero Sum versus Non Zero Sum An important aspect of a game is whether it is
competitive or cooperative. Most two-player games are competitive: the win (+1)
of player A is the loss (−1) of player B. These games are called zero sum because
the sum of the wins for the players remains a constant zero. Competition is an
1 A modern reimplementation of TD-Gammon in TensorFlow is available on GitHub at
https://github.com/fomorians/td-gammon
important element in the real world, and these games provide a useful model for
the study of conflict and strategic behavior.
In contrast, in cooperative games the players win if they can find win/win
situations. Examples of cooperative games are Hanabi, bridge, Diplomacy [429,
185], poker and Risk. The next chapter will discuss multi-agent and cooperative
games.
• Perfect versus Imperfect Information In perfect information games all relevant
information is known to all players. This is the case in typical board games such
as chess and checkers. In imperfect information games some information may
be hidden from some players. This is the case in card games such as bridge and
poker, where not all cards are known to all players. Imperfect information games
can be modeled as partially observable Markov decision processes (POMDPs) [569, 693].
A special form of (im)perfect information games are games of chance, such as
backgammon and Monopoly, in which dice play an important role. There is no
hidden information in these games, and these games are sometimes considered
to be perfect information games, despite the uncertainty present at move time.
Stochasticity is not the same as imperfect information.
• Decision Complexity The difficulty of playing a game depends on the complexity
of the game. The decision complexity is the number of end positions that define
the value (win, draw, or loss) of the initial game position (also known as the
critical tree or proof tree [416]). The larger the number of actions in a position,
the larger the decision complexity. Games with small board sizes such as tic tac
toe (3 × 3) have a smaller complexity than games with larger boards, such as
gomoku (19 × 19). When the action space is very large, it can often be treated as
a continuous action space. In poker, for example, the monetary bets can be of
any size, defining an action size that is practically continuous.
• State Space Complexity The state space complexity of a game is the number of
legal positions reachable from the initial position of a game. State space and
decision complexity are normally positively correlated, since games with high
decision complexity typically have high state space complexity. Determining the
exact state space complexity of a game is a nontrivial task, since positions may
be illegal or unreachable.2 For many games approximations of the state space
have been calculated. In general, games with a larger state space complexity are
harder to play (“require more intelligence”) for humans and computers. Note
that the dimensionality of the states may not correlate with the size of the state
space, for example, the rules of some of the simpler Atari games limit the number
of reachable states, although the states themselves are high-dimensional (they
consist of many video pixels).
2 For example, the maximal state space of tic tac toe is 3^9 = 19683 positions (9 squares of ’X’, ’O’,
or blank), where only 765 positions remain if we remove symmetrical and illegal positions [661].
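The tic tac toe numbers above can be checked by enumeration. The sketch below rests on two assumptions that the text does not spell out: X always moves first, and play stops as soon as one side has three in a row or the board is full. Under these assumptions it counts the positions reachable from the empty board, and then the number that remains when positions that are rotations or reflections of each other are counted once.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(b):
    """Return 'X' or 'O' if that side has three in a row, else None."""
    for i, j, k in LINES:
        if b[i] != ' ' and b[i] == b[j] == b[k]:
            return b[i]
    return None

def reachable():
    """All positions reachable from the empty board (X moves first)."""
    start = ' ' * 9
    seen, frontier = {start}, [start]
    while frontier:
        b = frontier.pop()
        if winner(b) or ' ' not in b:
            continue  # game over: do not expand further
        player = 'X' if b.count('X') == b.count('O') else 'O'
        for i in range(9):
            if b[i] == ' ':
                nb = b[:i] + player + b[i + 1:]
                if nb not in seen:
                    seen.add(nb)
                    frontier.append(nb)
    return seen

def canonical(b):
    """Smallest of the 8 rotations and reflections of a board."""
    rot = lambda s: ''.join(s[i] for i in (6, 3, 0, 7, 4, 1, 8, 5, 2))
    mir = lambda s: ''.join(s[i] for i in (2, 1, 0, 5, 4, 3, 8, 7, 6))
    forms = []
    for _ in range(4):
        b = rot(b)
        forms += [b, mir(b)]
    return min(forms)

upper_bound = 3 ** 9
legal = reachable()
distinct = {canonical(b) for b in legal}
print(upper_bound, len(legal), len(distinct))
```

The last count should agree with the 765 positions mentioned above; the gap between the first and second number shows how many of the 3^9 board fillings are illegal or unreachable.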

Name         Board     State space   Zero-sum   Information
Chess        8 × 8     10^47         zero-sum   perfect
Checkers     8 × 8     10^18         zero-sum   perfect
Othello      8 × 8     10^28         zero-sum   perfect
Backgammon   24        10^20         zero-sum   chance
Go           19 × 19   10^170        zero-sum   perfect
Shogi        9 × 9     10^71         zero-sum   perfect
Poker        card      10^161        non-zero   imperfect
Table 6.1 Characteristics of games

Fig. 6.2 Deep Blue and Garry Kasparov in May 1997 in New York

Zero-Sum Perfect-Information Games

Two-person zero-sum games of perfect information, such as chess, checkers, and
Go, are among the oldest applications of artificial intelligence. Turing and Shannon
published the first ideas on how to write a program to play chess more than 70
years ago [788, 694]. To study strategic reasoning in artificial intelligence, these
games are frequently used. Strategies, or policies, determine the outcome. Table 6.1
summarizes some of the games that have played an important role in artificial
intelligence research.

6.1.1 The Difficulty of Playing Go

After the 1997 defeat of chess world champion Garry Kasparov by IBM’s Deep Blue
computer (Fig. 6.2; image by Chessbase), the game of Go (Fig. 1.4) became the next
benchmark game, the Drosophila3 of AI, and research activity in Go intensified
significantly.
The game of Go is more difficult than chess. It is played on a larger board
(19 × 19 vs. 8 × 8), the action space is larger (around 250 moves available in a
position versus some 25 in chess), the game takes longer (typically 300 moves
versus 70) and the state space complexity is much larger: 10^170 for Go, versus 10^47
for chess. Furthermore, rewards in Go are sparse. Only at the end of a long game,
after many moves have been played, is the outcome (win/loss) known. Captures are
not so frequent in Go, and no good efficiently computable heuristic has been found.
In chess, in contrast, the material balance can be calculated efficiently,
and gives a good indication of how far ahead we are. For the computer, much of
the playing in Go happens in the dark. In contrast, for humans, it can be argued
that the visual patterns of Go may be somewhat easier to interpret than the deep
combinatorial lines of chess.
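To get a feeling for what the larger action space and longer games mean for search, the numbers quoted above can be combined into a rough estimate of the size of a full game tree, b^d, with b the number of moves per position and d the game length. This estimate is an assumption added for illustration; it is a different quantity from the state space complexity given above, since the same position can occur many times in the tree.

```python
import math

def log10_tree_size(branching, depth):
    """Base-10 logarithm of branching**depth, a rough game-tree size."""
    return depth * math.log10(branching)

go = log10_tree_size(250, 300)    # about 250 moves per position, 300 moves per game
chess = log10_tree_size(25, 70)   # about 25 moves per position, 70 moves per game
print(f"Go: about 10^{go:.0f}, chess: about 10^{chess:.0f}")
```

With the figures from the text this gives roughly 10^719 for Go against roughly 10^98 for chess, which illustrates why exhaustive look-ahead is hopeless in both games, and far more so in Go.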
For reinforcement learning, credit assignment in Go is challenging. Rewards only
occur after a long sequence of moves, and it is unclear which moves contributed
the most to such an outcome, or whether all moves contributed equally. Many
games will have to be played to acquire enough outcomes. In conclusion, Go is
more difficult to master with a computer than chess.
Traditionally, computer Go programs followed the conventional chess design
of a minimax search with a heuristic evaluation function, that, in the case of Go,
was based on the influence of stones (see Sect. 6.2.1 and Fig. 6.4) [515]. This chess
approach, however, did not work for Go, or at least not well enough. The level of
play was stuck at mid-amateur level for many years.
The main problems were the large branching factor, and the absence of an
efficient and good evaluation function.
Subsequently, in 2006, Monte Carlo Tree Search was developed. MCTS is a
variable-depth adaptive search algorithm that does not need a heuristic function, but
instead uses random playouts to estimate board strength. MCTS programs caused
the level of play to improve from 10 kyu to 2-3 dan, and even stronger on the small
9 × 9 board.4 However, again, at that point, performance stagnated, and researchers
expected that world champion level play was still many years into the future. Neural
networks had been tried, but were slow, and did not improve performance much.

Playing Strength in Go

Let us compare the three programming paradigms of a few different Go programs
that have been written over the years (Fig. 6.3). The programs fall into three
categories. First are the programs that use heuristic planning, the minimax-style
programs. GNU Go is a well-known example of this group of programs. The heuristics
in these programs are hand-coded. The level of play of these programs was at
medium amateur level. Next come the MCTS-based programs. They reached strong
amateur level. Finally come the AlphaGo programs, in which MCTS is combined
with deep self-play. These reached super-human performance. The figure also shows
other programs that follow a related approach.

Fig. 6.3 Go Playing Strength of Top Programs over the Years [20]

3 Drosophila melanogaster is also known as the fruit fly, a favorite species of genetics researchers
to test their theories, because experiments produce quick and clear answers.
4 Absolute beginners in Go start at 30 kyu, progressing to 10 kyu, and advancing to 1 kyu (30k–1k).
Stronger amateur players then achieve 1 dan, progressing to 7 dan, the highest amateur rating for
Go (1d–7d). Professional Go players have a rating from 1 dan to 9 dan, written as 1p–9p.

Thus, Go offers a large state space with sparse rewards, and thereby a highly challenging
test of how far self-play with a perfect transition function can go. Let us
have a closer look at the achievements of AlphaGo.

6.1.2 AlphaGo Achievements

In 2016, after decades of research, the effort in Go paid off. In the years 2015–2017
the DeepMind AlphaGo team played three matches in which it beat all human
champions that it played, Fan Hui, Lee Sedol, and Ke Jie. The breakthrough perfor-
mance of AlphaGo came as a surprise. Experts in computer games had expected
grandmaster level play to be at least ten years away.
The techniques used in AlphaGo are the result of many years of research, and
cover a wide range of topics. The game of Go worked very well as Drosophila.
Important new algorithms were developed, most notably Monte Carlo Tree Search
(MCTS), and major progress was made in deep reinforcement learning. We
will provide a high-level overview of the research that culminated in AlphaGo (that
beat the champions), and its successor, AlphaGo Zero (that learns Go tabula rasa).
First we will describe the Go matches.

Fig. 6.4 Influence in the game of Go. Empty intersections are marked as being part of Black’s or
White’s Territory

Fig. 6.5 AlphaGo versus Lee Sedol in 2016 in Seoul

The games against Fan Hui were played in October 2015 in London as part of
the development effort of AlphaGo. Fan Hui is the 2013, 2014, and 2015 European
Go Champion, then rated at 2 dan professional (2p). The games against Lee Sedol were
played in March 2016 in Seoul, and were widely covered by the media (see Fig. 6.5; image by
DeepMind). Although there is no official worldwide ranking in international Go,
in 2016 Lee Sedol was widely considered one of the four best players in the world.
A year later another match was played, this time in China, against the Chinese
champion Ke Jie, who was ranked number one in the Korean, Japanese, and Chinese
ranking systems at the time of the match. All three matches were won convincingly
by AlphaGo. The victory over the best Go players appeared on the cover of the journal
Nature, see Fig. 6.6.

Fig. 6.6 AlphaGo on the Cover of Nature

The AlphaGo series of programs actually consists of three programs: AlphaGo,
AlphaGo Zero, and AlphaZero. AlphaGo is the program that beat the human Go
champions. It combines supervised learning from grandmaster games with
reinforcement learning from self-play games. The second program, AlphaGo Zero, is a full
re-design, based solely on self-play. It performs tabula rasa learning of Go, and
plays stronger than AlphaGo. AlphaZero is a generalization of this program that
also plays chess and shogi. Section 6.3 will describe the programs in more detail.
Let us now have an in-depth look at the self-play algorithms as featured in
AlphaGo Zero and AlphaZero.
Fig. 6.7 Agent-Agent World (diagram: Agent1 and Agent2 exchange actions 𝑎1 and 𝑎2; each agent receives its own next state 𝑠′ and reward 𝑟′)

6.2 Tabula Rasa Self-Play Agents

Model-based reinforcement learning showed us that by learning a local transition
model, good sample efficiency can be achieved when the accuracy of the model is
sufficient. When we have perfect knowledge of the transitions, as we have in this
chapter, then we can plan far into the future, without error.
In regular agent-environment reinforcement learning the complexity of the environment
does not change as the agent learns, and as a consequence, the intelligence
of the agent’s policy is limited by the complexity of the environment. However, in
self-play a cycle of mutual improvement occurs; the intelligence of the environment
improves because the agent is learning. With self-play, we can create a system
that can transcend the original environment, and keep growing, and growing, in
a virtuous cycle of mutual learning: intelligence emerging out of nothing. This is
the kind of system that is needed when we wish to beat the best known entity in a
certain domain, since copying from a teacher will not help us to transcend it.
Studying how such a high level of play is achieved is interesting, for three
reasons: (1) it is exciting to follow an AI success story, (2) it is interesting to see
which techniques were used and how it is possible to achieve beyond-human
intelligence, and (3) it is interesting to see if we can learn a few techniques that
can be used in other domains, beyond two-agent zero-sum games, to see if we can
achieve super-intelligence there as well.
Let us have a closer look at the self-learning agent architecture that is used by
AlphaGo Zero. We will see that two-agent self-play actually consists of three levels
of self-play: move-level self-play, example-level self-play, and tournament-level
self-play.
First, we will discuss the general architecture, and how it creates a cycle of
virtuous improvement. Next, we will describe the levels in detail.
Fig. 6.8 Playing with Known Transition Rules (diagram: Policy/Value, Transition Rules, and Opponent, connected by learning/planning, acting, and playing arrows)

Cycle of Virtuous Improvement

In contrast to the agent/environment model, we now have two agents (Fig. 6.7). In
comparison with the model-based world of Chap. 5 (Fig. 6.8) our learned model has
been replaced by perfect knowledge of the transition rules, and the environment is
now called opponent: the negative version of the same agent playing the role of
agent2.
The goal in this chapter is to reach the highest possible performance in terms of
level of play, without using any hand-coded domain knowledge. In applications such
as chess and Go a perfect transition model is present. Together with a learned reward
function and a learned policy function, we can create a self-learning system in which
a virtuous cycle of ever improving performance occurs. Figure 6.9 illustrates such a
system: (1) the searcher uses the evaluation network to estimate reward values and
policy actions, and the search results are used in games against the opponent in
self-play, (2) the game results are then collected in a buffer, which is used to train
the evaluation network in self-learning, and (3) by playing a tournament against
a copy of ourselves a virtuous cycle of ever-increasing function improvement is
created.

AlphaGo Zero Self-Play in Detail

Let us look in more detail at how self-learning works in AlphaGo Zero. AlphaGo
Zero uses a model-based actor critic approach with a planner that improves a single
value/policy network. For policy improvement it uses MCTS, for learning a single
deep residual network with a policy head and a value head (Sect. B.2.6), see Fig. 6.9.
MCTS improves the quality of the training examples in each iteration (left panel),
and the net is trained with these better examples, improving its quality (right panel).
The output of MCTS is used to train the evaluation network, whose output is
then used as evaluation function in that same MCTS. A loop is wrapped around the
search-eval functions to keep training the network with the game results, creating
a learning curriculum. Let us put these ideas into pseudocode.
Fig. 6.9 Self-play loop improving quality of net (left panel: search/eval plays games against the opponent, the rewards and game examples are used to train a new net; right panel: a sequence of MCTS players using net0, net1, net2, net3, ...)

1 for tourn in range(1, max_tourns):          # curriculum of tournaments
2     for game in range(1, max_games):        # play a tournament of games
3         trim(triples)                       # if buffer full: replace old entries
4         while not game_over():              # generate the states of one game
5             move = mcts(board, eval(net))   # move is (s, a) pair
6             game_pairs += move
7             make_move_and_switch_side(board, move)
8         triples += add(game_pairs, game_outcome(game_pairs))  # add to buffer
9     net = train(net, triples)               # retrain with (s, a, outcome) triples

Listing 6.1 Self-play pseudocode

The Cycle in Pseudocode

Conceptually self-play is as ingenious as it is elegant: a double training loop around
an MCTS player with a neural network as evaluation and policy function that helps
MCTS. Figure 6.10 and Listing 6.1 show the self-play loop in detail. The numbers in
the figure correspond to the line numbers in the pseudocode.
Let us perform an outside-in walk-through of this system. Line 1 is the main
self-play loop. It controls how long the execution of the curriculum of self-play
tournaments will continue. Line 2 executes the training episodes, the tournaments
of self-play games after which the network is retrained. Line 4 plays such a game
to create (state, action) pairs for each move, and the outcome of the game. Line 5
calls MCTS to generate an action in each state. MCTS performs the simulations
where it uses the policy head of the net in P-UCT selection, and the value head of
the net at the MCTS leaves. Line 6 appends the state/action pair to the list of game
moves. Line 7 performs the move on the board, and switches color to the other
player, for the next move in the while loop. At line 8 a full game has ended, and the
outcome is known. Line 8 adds the outcome of each game to the (state, action)-pairs,
to make the (state, action, outcome)-triples for the network to train on. Note that
since the network is a two-headed policy/value net, both an action and an outcome
are needed for network training. On the last line this triples-buffer is then used to
train the network. The newly trained network is used in the next self-play iteration
as the evaluation function by the searcher. With this net another tournament is
played, using the searcher’s look-ahead to generate a next batch of higher-quality
examples, resulting in a sequence of stronger and stronger networks (Fig. 6.9 right
panel).

Fig. 6.10 A diagram of self-play with line-numbers (diagram labels: 2/4 game_pairs; 5 game_pairs ← mcts and 5 pol/val ← eval(net(state)); 8 triples; 9 net ← train(net, triples); 1 tourn: iterate with new net)

In the pseudocode we see the three self-play loops where the principle of playing
against a copy of yourself is used:
1. Move-level: in the MCTS playouts, our opponent actually is a copy of ourselves
(line 5)—hence, self-play at the level of game moves
2. Example-level: the input for self-training the approximator for the policy and
the reward functions is generated by our own games (line 2)—hence, self-play at
the level of the value/policy network.
3. Tournament-level: the self-play loop creates a training curriculum that starts
tabula rasa and ends at world champion level. The system trains at the level of
the player against itself (line 1)—hence, self-play, of the third kind.
All three of these levels use their own kind of self-play, of which we will describe
the details in the following sections. We start with move-level self-play.
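The structure of Listing 6.1 can be tried out on a toy problem. The sketch below is not AlphaGo Zero: as labeled assumptions, it replaces Go by a small Nim-like game (take 1 to 3 stones, taking the last stone wins), replaces MCTS by a one-ply lookahead with some random exploration, and replaces the neural network by a table of average outcomes. All function names are hypothetical. What it keeps is the double loop: games generate (state, action, outcome) triples, the triples retrain the evaluator, and the new evaluator is used in the next tournament.

```python
import random
from collections import defaultdict, deque

ACTIONS = (1, 2, 3)  # take 1, 2 or 3 stones; taking the last stone wins

def legal(stones):
    return [a for a in ACTIONS if a <= stones]

def search(stones, value, epsilon, rng):
    """One-ply lookahead (a stand-in for MCTS): leave the opponent in the
    state with the lowest learned value; explore randomly with prob. epsilon."""
    moves = legal(stones)
    if rng.random() < epsilon:
        return rng.choice(moves)

    def opponent_value(a):
        rest = stones - a
        # value[s] is seen from the player to move in s; no stones left = loss
        return -1.0 if rest == 0 else value[rest]

    return min(moves, key=opponent_value)

def play_game(start, value, epsilon, rng):
    """Play one self-play game and return (state, action, outcome) triples."""
    stones, player, pairs = start, 0, []
    while stones > 0:
        a = search(stones, value, epsilon, rng)
        pairs.append((player, stones, a))
        stones -= a
        winner = player          # the last player to move took the last stone
        player = 1 - player      # switch sides
    return [(s, a, 1.0 if p == winner else -1.0) for p, s, a in pairs]

def train(triples):
    """'Retrain the net': average the outcomes observed from each state."""
    sums, counts = defaultdict(float), defaultdict(int)
    for s, _, z in triples:
        sums[s] += z
        counts[s] += 1
    value = defaultdict(float)
    for s in sums:
        value[s] = sums[s] / counts[s]
    return value

def self_play(start=10, tournaments=30, games=200, seed=0):
    rng = random.Random(seed)
    value = defaultdict(float)       # tabula rasa: every state is valued 0
    triples = deque(maxlen=5000)     # bounded buffer, old entries drop out
    for _ in range(tournaments):     # curriculum of tournaments
        for _ in range(games):       # one tournament of self-play games
            triples.extend(play_game(start, value, epsilon=0.2, rng=rng))
        value = train(triples)       # new evaluator for the next tournament
    return value

value = self_play()
print({s: round(value[s], 2) for s in range(1, 11)})
```

In this game a pile of four stones is lost for the player to move under good play, so the learned value of state 4 should come out negative, while states 1 to 3 (where the last stones can be taken at once) should come out positive.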

6.2.1 Move-Level Self-Play

At the innermost level, we use the agent to play against itself, as its own opponent.
Whenever it is my opponent’s turn to move, I play its move, trying to find the best
move for my opponent (which will be the worst possible move for me). This scheme
uses the same knowledge for player and opponent. This is different from the real
world, where the agents are different, with different brains, different reasoning
skills, and different experience. Our scheme is symmetrical: when we assume that
our agent plays a strong game, then the opponent is also assumed to play strongly,
and we can hope to learn from the strong counter play. (We thus assume that our
opponent plays with the same knowledge as we have; we are not trying to consciously
exploit opponent weaknesses.)5

Fig. 6.11 Search-Eval Architecture of Games (diagram: search passes an action/state to eval, which returns a value)

6.2.1.1 Minimax

This principle of generating the counter play by playing yourself while switching
perspectives has been used since the start of artificial intelligence. It is known as
minimax.
The games of chess, checkers and Go are challenging games. The architecture
that has been used to program chess and checkers players has been the same since
the earliest paper designs of Turing [788]: a search routine based on minimax which
searches to a certain depth, and an evaluation function to estimate the score of
board positions using heuristic rules of thumb when this depth is reached. In chess
and checkers, for example, the number of pieces on the board of a player is a crude
but effective approximation of the strength of a state for that player. Figure 6.11
shows a diagram of this classic search-eval architecture.6
Based on this principle many successful search algorithms have been developed,
of which alpha-beta is the best known [416, 593]. Since the size of the state space
is exponential in the depth of lookahead, however, many enhancements had to be
developed to manage the size of the state space and to allow deep lookahead to
occur [600].
The word minimax is a contraction of maximizing/minimizing (and then reversed
for easy pronunciation). It means that in zero-sum games the two players alternate
making moves, and that on even moves, when player A is to choose a move, the
best move is the one that maximizes the score for player A, while on odd moves
the best move for player B is the move that minimizes the score for player A.

5 There is also research into opponent modeling, where we try to exploit our opponent’s
weaknesses [320, 91, 259]. Here, we assume an identical opponent, which often works best in chess and
Go.
6 Because the agent knows the transition function 𝑇, it can calculate the new state 𝑠′ for each
action 𝑎. The reward 𝑟 is calculated at terminal states, where it is equal to the value 𝑣. Hence,
in this diagram the search function provides the state to the eval function. See [788, 600] for an
explanation of the search-eval architecture.

Fig. 6.12 Minimax tree (diagram: a max root over three min nodes with values 1, 2, 1; leaf values 6 1 3, 3 4 2, and 1 6 5)

Figure 6.12 depicts this situation in a tree. The score values in the nodes are
chosen to show how minimax works. At the top is the root of the tree, level 0, a
square node where player A is to move.
Since we assume that all players rationally choose the best move, the value of the
root node is determined by the value of the best move, the maximum of its children.
Each child, at level 1, is a circle node where player B chooses its best move, in order
to minimize the score for player A. The leaves of this tree, at level 2, are again max
squares (even though there is no child to choose from anymore). Note how for each
circle node the value is the minimum of its children, and for the square node, the
value is the maximum of the three circle nodes.
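As a worked example, the values of this small tree can be recomputed in a few lines. The grouping of the nine leaf scores into three groups of three is an assumption read off Fig. 6.12; the variable names are only illustrative.

```python
# Leaf scores per min (circle) node, as read from Fig. 6.12 (assumed grouping).
leaves = [[6, 1, 3], [3, 4, 2], [1, 6, 5]]

# Level 1: player B picks the minimum of each group of leaves.
min_values = [min(group) for group in leaves]

# Level 0: player A picks the maximum of the three min values.
root_value = max(min_values)

print(min_values, root_value)  # prints [1, 2, 1] 2
```

Player A therefore chooses the middle move: it guarantees a score of 2, whatever player B replies.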
Python pseudocode for a recursive minimax procedure is shown in Listing 6.2.
Note the extra hyperparameter d. This is the search depth counting upwards from
the leaves. At depth 0 are the leaves, where the heuristic evaluation function is
called to score the board.7 Also note that the code for making moves on the board—
transitioning actions into the new states—is not shown in the listing. It is assumed to
happen inside the children dictionary. We frivolously mix actions and states in these
sections, since an action fully determines which state will follow. (At the end of this
chapter, the exercises provide more detail about move making and unmaking.)
AlphaGo Zero uses MCTS, a more advanced search algorithm than minimax,
that we will discuss shortly.
7 The heuristic evaluation function is originally a linear combination of hand-crafted heuristic
rules, such as material balance (which side has more pieces) or center control. At first, the linear
combinations (coefficients) were not only hand-coded, but also hand-tuned. Later they were trained
by supervised learning [61, 613, 771, 253]. More recently, NNUE was introduced as a non-linear
neural network to use as evaluation function in an alpha-beta framework [557].

INF = 99999  # a value assumed to be larger than eval ever returns

# root (the tree to search) and heuristic_eval (the leaf evaluation
# function) are not shown here; they are assumed to be defined elsewhere
def minimax(n, d):
    if d <= 0:                        # depth exhausted: score the board
        return heuristic_eval(n)
    elif n['type'] == 'MAX':          # player A to move: maximize
        g = -INF
        for c in n['children']:
            g = max(g, minimax(c, d-1))
    elif n['type'] == 'MIN':          # player B to move: minimize
        g = INF
        for c in n['children']:
            g = min(g, minimax(c, d-1))
    return g

print("Minimax value: ", minimax(root, 2))

Listing 6.2 Minimax code [600]

Fig. 6.13 Three Lines of Play [65]

Beyond Heuristics

Minimax-based procedures traverse the state space by recursively following all
actions in each state that they visit [788]. Minimax works just like a standard
depth-first search procedure, such as we have been taught in our undergraduate
algorithms and data structures courses. It is a straightforward, rigid approach that
searches all branches of the node to the same search depth.
To focus the search effort on promising parts of the tree, researchers have subse-
quently introduced many algorithmic enhancements, such as alpha-beta cutoffs,
iterative deepening, transposition tables, null windows, and null moves [416, 602,
422, 714, 203].
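To give an impression of the first of these enhancements, here is a minimal sketch of alpha-beta cutoffs. It is the textbook form of the algorithm, not code from one of the cited programs. As an assumption for brevity, a tree is represented as nested Python lists with numbers as leaf scores, and the example reuses the leaf values of Fig. 6.12; the `visited` list is only there to show which leaves were actually evaluated.

```python
INF = float("inf")

def alphabeta(node, alpha=-INF, beta=INF, maximizing=True, visited=None):
    """Minimax value of node; subtrees that cannot change the result are skipped."""
    if not isinstance(node, list):       # leaf: return its score
        if visited is not None:
            visited.append(node)
        return node
    if maximizing:
        g = -INF
        for child in node:
            g = max(g, alphabeta(child, alpha, beta, False, visited))
            alpha = max(alpha, g)
            if alpha >= beta:            # the min player will avoid this node
                break
        return g
    g = INF
    for child in node:
        g = min(g, alphabeta(child, alpha, beta, True, visited))
        beta = min(beta, g)
        if alpha >= beta:                # the max player already has better
            break
    return g

tree = [[6, 1, 3], [3, 4, 2], [1, 6, 5]]
visited = []
value = alphabeta(tree, visited=visited)
print(value, visited)  # prints 2 [6, 1, 3, 3, 4, 2, 1]
```

In the third subtree the first leaf (1) is already worse for the max player than the 2 that the second subtree guarantees, so the remaining two leaves are never evaluated. The result is the same minimax value as a full search.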
In the early 1990s experiments with a different approach started, based on random
playouts of a single line of play [6, 102, 118] (Fig. 6.13 and 6.14). In Fig. 6.14
this different approach is illustrated. We see a search of a single line of play versus
a search of a full subtree. It turned out that averaging many such playouts
could also be used to approximate the value of the root, in addition to the classic
recursive tree search approach. In 2006, a tree version of this approach was introduced
that proved successful in Go. This algorithm was called Monte Carlo Tree
Search [164, 117]. Also in that year Kocsis and Szepesvári created a selection rule
for the exploration/exploitation trade-off that performed well and converged to the
minimax value [419]. Their rule is called UCT, for upper confidence bounds applied
to trees.

Fig. 6.14 Searching a Tree versus Searching a Path

6.2.1.2 Monte Carlo Tree Search

Monte Carlo Tree Search has two main advantages over minimax and alpha-beta.
First, MCTS is based on averaging single lines of play, instead of recursively travers-
ing subtrees. The computational complexity of a path from the root to a leaf is
polynomial in the search depth. The computational complexity of a tree is expo-
nential in the search depth. Especially in applications with many actions per state
it is much easier to manage the search time with an algorithm that expands one
path at a time.8
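The difference in growth can be made concrete with a small calculation. The branching factors (25 for chess, 250 for Go) and the depth of 5 are the typical numbers used in this chapter; the function names and the simplification that every playout path has exactly `depth` nodes are assumptions for illustration.

```python
def tree_leaves(branching, depth):
    """Number of leaves of a full tree: exponential in the depth."""
    return branching ** depth

def path_nodes(depth, playouts):
    """Nodes touched by single-line playouts: linear in depth and playouts."""
    return depth * playouts

for game, branching in (("chess", 25), ("Go", 250)):
    print(game, tree_leaves(branching, 5), path_nodes(5, 1000))
# prints:
# chess 9765625 5000
# Go 976562500000 5000
```

The full tree is 100,000 times larger for Go than for chess at this depth, while the work for 1000 playout paths is the same in both games.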
Second, MCTS does not need a heuristic evaluation function. It plays out a line
of play in the game from the root to an end position. In end-positions the score of
the game, a win or a loss, is known. By averaging many of these playouts the value
of the root is approximated. Minimax has to cope with an exponential search tree,
which it cuts off after a certain search depth, at which point it uses the heuristic
to estimate the scores at the leaves. There are, however, games where no efficient
heuristic evaluation function can be found. In this case MCTS has a clear advantage,
since it works without a heuristic score function.

8 Compare chess and Go: in chess the typical number of moves in a position is 25, for Go this
number is 250. A chess-tree of depth 5 has 25^5 = 9765625 leaves. A Go-tree of depth 5 has
250^5 = 976562500000 leaves. A depth-5 minimax search in Go would take prohibitively long; an
MCTS search of 1000 expansions expands the same number of paths from root to leaf in both
games.

Fig. 6.15 Monte Carlo Tree Search [117]

MCTS has proven to be successful in many different applications. Since its
introduction in 2006 MCTS has transformed the field of heuristic search. Let us see
in more detail how it works.
Monte Carlo Tree Search consists of four operations: select, expand, playout,
and backpropagate (Fig. 6.15). The third operation (playout) is also called rollout,
simulation, and sampling. Backpropagation is sometimes called backup. Select is
the downward policy action trial part, backup is the upward error/learning part of
the algorithm. We will discuss the operations in more detail in a short while.
MCTS is a successful planning-based reinforcement learning algorithm, with an
advanced exploration/exploitation selection rule. MCTS starts from the initial state
𝑠0 , using the transition function to generate successor states. In MCTS the state
space is traversed iteratively, and the tree data structure is built in a step by step
fashion, node by node, playout by playout. A typical size of an MCTS search is to
do 1000–10,000 iterations. In MCTS each iteration starts at the root 𝑠0 , traversing a
path in the tree down to the leaves using a selection rule, expanding a new node,
and performing a random playout. The result of the playout is then propagated
back to the root. During the backpropagation, statistics at all internal nodes are
updated. These statistics are then used in future iterations by the selection rule to
go to the currently most interesting part of the tree.
The statistics consist of two counters: the win count 𝑤 and the visit count 𝑣.
During backpropagation, the visit count 𝑣 at all nodes that are on the path back
from the leaf to the root are incremented. When the result of the playout was a win,
then the win count 𝑤 of those nodes is also incremented. If the result was a loss,
then the win count is left unchanged.
The selection rule uses the win rate 𝑤/𝑣 and the visit count 𝑣 to decide whether
to exploit high-win-rate parts of the tree or to explore low-visit-count parts. An
often-used selection rule is UCT (Sect. 6.2.1.2). It is this selection rule that governs
the exploration/exploitation trade-off in MCTS.

import numpy as np

# Helper functions such as resources_left, fully_expanded, expand,
# non_terminal, result, pick_random, update_stats and is_root are
# game-specific and are not shown in this pseudocode.

def monte_carlo_tree_search(root):
    while resources_left(time, computational_power):
        leaf = select(root)                     # leaf = unvisited node
        simulation_result = rollout(leaf)
        backpropagate(leaf, simulation_result)
    return best_child(root)   # or: child with highest visit count

def select(node):
    while fully_expanded(node):
        node = best_child(node)   # traverse down path of best UCT nodes
    return expand(node.children) or node   # no children / node is terminal

def rollout(node):
    while non_terminal(node):
        node = rollout_policy(node)
    return result(node)

def rollout_policy(node):
    return pick_random(node.children)

def backpropagate(node, result):
    node.stats = update_stats(node, result)   # update visit and win counts
    if is_root(node):
        return
    backpropagate(node.parent, result)        # continue upwards to the root

def best_child(node, c_param=1.0):
    # c.q is the win count and c.n the visit count of child c
    choices_weights = [
        (c.q / c.n) + c_param * np.sqrt(np.log(node.n) / c.n)   # UCT
        for c in node.children
    ]
    return node.children[np.argmax(choices_weights)]

Listing 6.3 MCTS pseudo-Python [117, 173]


The Four MCTS Operations

Let us look in more detail at the four operations. Please refer to Listing 6.3 and
Fig. 6.15 [117]. As we see in the figure and the listing, the main steps are repeated
as long as there is time left. Per step, the activities are as follows.
1. Select In the selection step the tree is traversed from the root node down until
a leaf of the MCTS search tree is reached where a new child is selected that is
not part of the tree yet. At each internal state the selection rule is followed to
determine which action to take and thus which state to go to next. The UCT rule
works well in many applications [419].
The selections at these states are part of the policy 𝜋(𝑠) of actions of the state.
2. Expand Then, in the expansion step, a child is added to the tree. In most cases
only one child is added. In some MCTS versions all successors of a leaf are added
to the tree [117].
3. Playout Subsequently, during the playout step random moves are played in a
form of self-play until the end of the game is reached. (These nodes are not added
to the MCTS tree, but their search result is, in the backpropagation step.) The
reward 𝑟 of this simulated game is +1 in case of a win for the first player, 0 in
case of a draw, and −1 in case of a win for the opponent.9
4. Backpropagation In the backpropagation step, reward 𝑟 is propagated back up-
wards in the tree, through the nodes that were traversed down previously. Two
counts are updated: the visit count, for all nodes, and the win count, depending
on the reward value. Note that in a two-agent game, nodes in the MCTS tree
alternate color. If white has won, then only white-to-play nodes are incremented;
if black has won, then only the black-to-play nodes.
MCTS is on-policy: the values that are backed up are those of the nodes that
were selected.
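The four operations can be put together in a small runnable program. The following is a minimal sketch, not the implementation of any particular Go program: as assumptions it uses a Nim-like toy game (take 1 to 3 stones, taking the last stone wins), counts wins only (there are no draws in this game), and uses hypothetical names throughout. Each node stores the win count for the player who moved into it, which is one common way to handle the alternation of colors.

```python
import math
import random

class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones      # stones left; the player to move takes 1-3
        self.parent = parent
        self.move = move          # move that led from the parent to this node
        self.children = []
        self.untried = [a for a in (1, 2, 3) if a <= stones]
        self.w = 0.0              # wins for the player who moved INTO this node
        self.n = 0                # visit count

def uct(child, parent_n, c_p):
    return child.w / child.n + c_p * math.sqrt(math.log(parent_n) / child.n)

def select(node, c_p):
    # 1. Select: descend while the node is fully expanded and not terminal
    while not node.untried and node.children:
        node = max(node.children, key=lambda c: uct(c, node.n, c_p))
    return node

def expand(node, rng):
    # 2. Expand: add one new child (a terminal node is returned unchanged)
    if not node.untried:
        return node
    move = node.untried.pop(rng.randrange(len(node.untried)))
    child = Node(node.stones - move, parent=node, move=move)
    node.children.append(child)
    return child

def playout(stones, rng):
    # 3. Playout: random moves until the end of the game.
    # Returns True if the player to move at the start of the playout wins.
    to_move_wins = False          # no stones left: the player to move has lost
    turn = 0
    while stones > 0:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            to_move_wins = (turn % 2 == 0)
        turn += 1
    return to_move_wins

def backpropagate(node, to_move_wins):
    # 4. Backpropagate: update counts on the path back to the root.
    # The point of view flips at every level of the tree.
    while node is not None:
        node.n += 1
        if not to_move_wins:      # then the player who moved into node won
            node.w += 1
        to_move_wins = not to_move_wins
        node = node.parent

def mcts(stones, iterations=2000, c_p=1.0, seed=0):
    rng = random.Random(seed)
    root = Node(stones)
    for _ in range(iterations):
        leaf = expand(select(root, c_p), rng)
        backpropagate(leaf, playout(leaf.stones, rng))
    best = max(root.children, key=lambda c: c.n)   # highest visit count
    return best.move, root

move, root = mcts(5)
print("chosen move:", move)
print({c.move: (c.w, c.n) for c in root.children})
```

From a pile of five stones the only winning move is to take one stone, leaving the opponent with four; with the settings above the search is expected to concentrate most of its visits on that child.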

Pseudocode

Many websites contain useful resources on MCTS, including example code (see
Listing 6.3).10 The pseudocode in the listing is from an example program for game
play. The MCTS algorithm can be coded in many different ways. For implementation
details, see [173] and the comprehensive survey [117].
MCTS is a popular algorithm. An easy way to use it in Python is by installing it
from a pip package (pip install mcts).

Policies

At the end of the search, after the predetermined iterations have been performed, or
when time is up, MCTS returns the value and the action with the highest visit count.
An alternative would be to return the action with the highest win rate. However,
the visit count takes into account the win rate (through UCT) and the number of
simulations on which it is based. A high win rate may be based on a low number of
simulations, and can thus be high variance. High visit counts will be low variance.
Due to the selection rule, high visit count implies high win-rate with high confidence,
while high win rate may be low confidence [117]. The action of this initial state 𝑠0
constitutes the deterministic policy 𝜋(𝑠0).

9 Originally, playouts were random (the Monte Carlo part in the name of MCTS) following
Brügmann’s [118] and Bouzy and Helmstetter’s [102] original approach. In practice, most Go
playing programs improve on the random playouts by using databases of small 3 × 3 patterns with
best replies and other fast heuristics [269, 165, 137, 702, 170]. Small amounts of domain knowledge
are used after all, albeit not in the form of a heuristic evaluation function.
10 https://int8.io/monte-carlo-tree-search-beginners-guide/

UCT Selection

The adaptive exploration/exploitation behavior of MCTS is governed by the selection
rule, for which often UCT is chosen. UCT is an adaptive exploration/exploitation
rule that achieves high performance in many different domains.
UCT was introduced in 2006 by Kocsis and Szepesvári [419]. The paper provides
a theoretical guarantee of eventual convergence to the minimax value. The selection
rule was named UCT, for upper confidence bounds for multi-armed bandits applied
to trees. (Bandit theory was also mentioned in Sect. 2.2.4.3).
The selection rule determines the way in which the current values of the children
influence which part of the tree will be explored more. The UCT formula is

$$\mathrm{UCT}(a) = \frac{w_a}{n_a} + C_p \sqrt{\frac{\ln n}{n_a}} \qquad (6.1)$$

where $w_a$ is the number of wins in child $a$, $n_a$ is the number of times child $a$ has
been visited, $n$ is the number of times the parent node has been visited, and $C_p \geq 0$
is a constant, the tunable exploration/exploitation hyperparameter. The first term in
the UCT equation, the win rate $\frac{w_a}{n_a}$, is the exploitation term. A child with a high win
rate receives through this term an exploitation bonus. The second term $\sqrt{\frac{\ln n}{n_a}}$ is for
exploration. A child that has been visited infrequently has a higher exploration term.
The level of exploration can be adjusted by the $C_p$ constant. A low $C_p$ does little
exploration; a high $C_p$ has more exploration. The selection rule then is to select
the child with the highest UCT sum (the familiar arg max function of value-based
methods).
The UCT formula balances win rate $\frac{w_a}{n_a}$ and “newness” $\sqrt{\frac{\ln n}{n_a}}$ in the selection of
nodes to expand.11
Alternative selection rules have been proposed, such as Auer’s
UCB1 [30, 31, 32] and P-UCT [638, 504].
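The effect of the exploration constant in Eq. 6.1 can be seen in a small numerical example. The win and visit counts below are made up for illustration, and the function names are hypothetical; the parent visit count is taken to be the sum of the children's visit counts.

```python
import math

def uct(w_a, n_a, n, c_p):
    """UCT value of a child with w_a wins in n_a visits; n = parent visits."""
    return w_a / n_a + c_p * math.sqrt(math.log(n) / n_a)

def select_child(children, c_p):
    """children maps a name to (wins, visits); returns the arg max of UCT."""
    n = sum(visits for _, visits in children.values())
    return max(children, key=lambda a: uct(*children[a], n, c_p))

# Hypothetical statistics: a is well explored, b has the best win rate,
# c has hardly been visited.
children = {"a": (60, 100), "b": (7, 10), "c": (1, 2)}

print(select_child(children, c_p=0.1))  # prints b
print(select_child(children, c_p=1.0))  # prints c
```

With a low exploration constant the child with the highest win rate (b, 0.7) is selected; with a higher constant the exploration term dominates and the rarely visited child c is tried first.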

P-UCT

We should note that the MCTS that is used in the AlphaGo Zero program is a
little different. MCTS is used inside the training loop, as an integral part of the
self-generation of training examples, to enhance the quality of the examples for
every self-play iteration, using both value and policy inputs to guide the search.

11 The square-root term is a measure of the variance (uncertainty) of the action value. The use of
the natural logarithm ensures that, since increases get smaller over time, old actions are selected
less frequently. However, since logarithm values are unbounded, eventually all actions will be
selected [743].

Also, in the AlphaGo Zero program MCTS backups rely fully on the value
function approximator; no playout is performed anymore. The MC part in the name
of MCTS, which stands for the Monte Carlo playouts, really has become a misnomer
for this neural network-guided tree searcher.
Furthermore, selection in self-play MCTS is different. UCT-based node selection
now also uses the input from the policy head of the trained function approximators,
in addition to the win rate and newness. What remains is that through the UCT
mechanism MCTS can focus its search effort greedily on the part with the highest
win rate, while at the same time balancing exploration of parts of the tree that are
underexplored.
The formula that is used to incorporate input from the policy head of the deep
network is a variant of P-UCT [706, 530, 638, 504] (for predictor-UCT). Let us
compare P-UCT with UCT. The P-UCT formula adds the policy head $\pi(a|s)$ to
Eq. 6.1:
\[
\text{P-UCT}(a) = \frac{w_a}{n_a} + C_p \, \pi(a|s) \, \frac{\sqrt{n}}{1 + n_a}.
\]
P-UCT adds the $\pi(a|s)$ term, specifying the probability of the action $a$, to the
exploration part of the UCT formula.¹²
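
The P-UCT rule can be sketched in the same style as UCT. This is a minimal illustration under stated assumptions: the names `p_uct_score` and `select_child_p_uct` are invented for this example, children are given as (wins, visits, prior) triples where the prior plays the role of $\pi(a|s)$, and the win rate of an unvisited child is taken to be 0, a choice that the formula itself does not prescribe.

```python
import math


def p_uct_score(wins, visits, parent_visits, prior, c_p=1.0):
    """P-UCT value: win rate plus a prior-weighted exploration term."""
    # Assumption for this sketch: an unvisited child has win rate 0.
    win_rate = wins / visits if visits > 0 else 0.0
    exploration = c_p * prior * math.sqrt(parent_visits) / (1 + visits)
    return win_rate + exploration


def select_child_p_uct(children, parent_visits, c_p=1.0):
    """Return the index of the child with the highest P-UCT score."""
    scores = [p_uct_score(w, n, parent_visits, p, c_p) for (w, n, p) in children]
    return max(range(len(children)), key=scores.__getitem__)


# Children as (wins, visits, prior) triples; the priors sum to 1.
children_p = [(3, 5, 0.2), (0, 0, 0.7), (1, 2, 0.1)]

print(select_child_p_uct(children_p, 7, c_p=1.0))  # index 1: high prior, unvisited
print(select_child_p_uct(children_p, 7, c_p=0.0))  # index 0: best win rate
```

Because of the 1 in the denominator, the score of the unvisited child is well defined, and its high prior is enough to have it selected before the better-explored alternatives.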

Exploration/Exploitation

The search process of MCTS is guided by the statistics stored in the tree. MCTS
discovers during the search where the promising parts of the tree are. The tree
expansion of MCTS is inherently variable-depth and variable-width (in contrast to
minimax-based algorithms such as alpha-beta, which are inherently fixed-depth
and fixed-width). In Fig. 6.16 we see a snapshot of the search tree of an actual MCTS
optimization. Some parts of the tree are searched more deeply than others [808].
An important element of MCTS is the exploration/exploitation trade-off, which
can be tuned with the $C_p$ hyperparameter. The effectiveness of MCTS in different
applications depends on the value of this hyperparameter. Typical initial choices
for Go programs are $C_p = 1$ or $C_p = 0.1$ [117], although in AlphaGo we see
highly explorative choices such as $C_p = 5$. In general, experience has shown that
when compute power is low, $C_p$ should be low, and when more compute power is
available, more exploration (higher $C_p$) is advisable [117, 434].
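
The effect of $C_p$ on how search effort is distributed can be illustrated with a small, deliberately simplified experiment. In this sketch three actions are repeatedly selected with the UCT rule; as a simplifying assumption the win rate of each action is held fixed at a known value (no random outcomes), so that only the exploration term changes over time. The function name `simulate_uct` and the chosen win rates are made up for this example.

```python
import math


def simulate_uct(means, c_p, iterations=200):
    """Repeatedly select actions with UCT and return the visit counts.

    Assumption: the win rate of each action is fixed at its entry in `means`.
    """
    visits = [1] * len(means)  # every action is tried once to start
    for _ in range(iterations):
        n = sum(visits)
        scores = [
            m + c_p * math.sqrt(math.log(n) / n_a)
            for m, n_a in zip(means, visits)
        ]
        best = max(range(len(means)), key=scores.__getitem__)
        visits[best] += 1
    return visits


win_rates = [0.7, 0.5, 0.4]
low = simulate_uct(win_rates, c_p=0.1)
high = simulate_uct(win_rates, c_p=5.0)
print("C_p = 0.1:", low)
print("C_p = 5.0:", high)
```

With the low setting almost all visits go to the action with the best win rate, whereas with the high setting the visits are spread much more evenly over the three actions.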
¹² Note further that the small differences in the exploration term (no logarithm, and the 1 in the
denominator) also change the UCT function profile somewhat, ensuring correct behavior at unvisited
actions [530].