An Introduction to
Deep Reinforcement
Learning
Ehsan Abbasnejad
Remember: Supervised Learning
We have a set of sample observations with labels; we learn to predict the labels, given a new sample.
[Figure: example cat and dog images. Learn the function that associates a picture of a dog/cat with its label.]
Remember: supervised learning
We need thousands of samples
Samples have to be provided by experts
There are applications where
• We can’t provide expert samples
• Expert examples are not what we want to mimic
• There is an interaction with the world
Deep Reinforcement Learning
AlphaGo
Scenario of Reinforcement Learning
[Figure: the agent observes the state of the environment, takes an action that changes the environment, and receives a reward (e.g. "Don't do that").]
Scenario of Reinforcement Learning
The agent learns to take actions maximizing expected reward.
[Figure: the same agent-environment loop, now with a positive reward (e.g. "Thank you.").]
Machine Learning ≈ Looking for a Function
Actor/Policy: Action = π( Observation )
[Figure: the observation is the function input, the action is the function output, and the reward from the environment is used to pick the best function.]
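As a minimal sketch of this "policy = function" view (the observation fields and action names below are made-up illustrations, not from the slides):

import random

ACTIONS = ["left", "right", "fire"]

def pi(observation):
    # Toy hand-written policy: fire when an alien is straight ahead, otherwise move randomly.
    if observation.get("alien_above", False):
        return "fire"
    return random.choice(["left", "right"])

action = pi({"alien_above": True})   # -> "fire"

Reinforcement learning replaces such a hand-written rule with a function chosen to maximize reward.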
Reinforcement Learning in a nutshell
RL is a general-purpose framework for decision-making
• RL is for an agent with the capacity to act
• Each action influences the agent’s future state
• Success is measured by a scalar reward signal
Goal: select actions to maximise future reward
Deep Learning in a nutshell
DL is a general-purpose framework for representation learning
• Given an objective
• Learn the representation that is required to achieve the objective
• Directly from raw inputs
• Using minimal domain knowledge
Goal: Learn the representation that achieves the
objective
Deep Reinforcement Learning in a nutshell
A single agent that solves human level tasks
• RL defines the objective
• DL gives the mechanism and representation
• RL+DL=Deep reinforcement learning
This can lead to general intelligence
Reinforcement Learning is multi-disciplinary
[Figure: RL sits at the intersection of machine learning, computer science, engineering, operations research, optimal control, economics, game theory, and mathematics.]
Agent and Environment
• At each step, the agent:
• Selects an action
• Observes the environment
• Receives a reward
• The environment:
• Receives the action
• Emits a new observation
• Emits a reward for the agent
Learning to play Go
[Figure: the observation is the board position, the action is the next move, and the environment is the opponent.]
Learning to play Go
The agent learns to take actions maximizing expected reward.
Reward: 0 in most cases; if the agent wins, reward = 1; if it loses, reward = -1.
Learning to play Go
• Supervised: learning from a teacher
(given a board position, the teacher tells the next move, e.g. "5-5" or "3-3")
• Reinforcement Learning: learning from experience
First move …… many moves …… Win!
(Two agents play with each other.)
AlphaGo is supervised learning + reinforcement learning.
Learning a chat-bot
• Machine obtains feedback from the user
[Figure: two example exchanges: "How are you?" followed by "Bye bye ☺" earns reward -10; "Hello" followed by "Hi ☺" earns reward 3.]
• Chat-bot learns to maximize the expected reward
Learning a chat-bot
• Let two agents talk to each other (sometimes they generate good dialogue, sometimes bad)
Dialogue A: "How old are you?" / "See you." / "See you." / "See you."
Dialogue B: "How old are you?" / "I am 16." / "I thought you were 12." / "What makes you think so?"
Learning a chat-bot
• By this approach, we can generate a lot of dialogues.
• Use some predefined rules to evaluate the goodness of a dialogue
[Figure: many generated dialogues (Dialogue 1 … Dialogue 8), each of which is evaluated.]
Machine learns from the evaluation
Deep Reinforcement Learning for Dialogue Generation: https://arxiv.org/pdf/1606.01541v3.pdf
Learning a chat-bot
• Supervised: given "Hello", say "Hi"; given "Bye bye", say "Good bye"
• Reinforcement: the agent holds a whole dialogue ("Hello" …… many turns ……) and only receives an overall judgement at the end (e.g. "Bad")
More applications
•Flying Helicopter
• https://www.youtube.com/watch?v=0JL04JJjocc
•Driving
• https://www.youtube.com/watch?v=0xo1Ldx3L5Q
•Robot
• https://www.youtube.com/watch?v=370cT-OAzzM
•Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI
• http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai
•Text generation
• https://www.youtube.com/watch?v=pbQ4qe8EwLo
Example: Playing Video Game
• Widely studied:
• Gym: https://gym.openai.com/
• Universe: https://openai.com/blog/universe/
Machine learns to play video games as human players do:
➢ What the machine observes is pixels
➢ The machine learns to take the proper action itself
Example: Playing Video Game
• Space invader
• Goal: kill the aliens; the score is the reward
• Termination: all the aliens are killed, or your spaceship is destroyed
[Figure: Space invader screen showing the score (reward), the aliens, the shields, and the fire action.]
Example: Playing Video Game
• Space invader
• Play yourself: http://www.2600online.com/spaceinvaders.html
• How about the machine: https://gym.openai.com/evaluations/eval_Eduozx4HRyqgTCVk9ltw
Example: Playing Video Game
[Figure: taking an action (e.g. "fire") kills an alien and yields a positive reward.]
Usually there is some randomness in the environment.
Example: Playing Video Game
After many turns the game is over (the spaceship is destroyed); this whole sequence is an episode.
Learn to maximize the expected cumulative reward per episode.
Paradigm
              Supervised Learning    Unsupervised Learning    Reinforcement Learning
Objective     Classification         Inference                Prediction
Applications  Regression             Generation               Control
SETTING
[Figure: the agent, using a policy, sends actions to the environment; the environment returns a state/observation and a reward.]
MARKOV DECISION PROCESSES (MDP)
An MDP consists of a state space, an action space, a transition function, and a reward function.
● State: the Markov property, i.e. the next state depends only on the current state and action
● Decision: the agent takes actions, and those decisions have consequences
● Process: there is a transition function (the dynamics of the system)
● Reward: depends on the state and action, often related to the state
Goal: maximise overall reward
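In standard textbook notation (a sketch; the symbols below are the conventional ones rather than the slide's own):
\[
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma), \qquad
P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a), \qquad
r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}
\]
\[
\text{Markov property:} \quad
\Pr(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots) = \Pr(S_{t+1} \mid S_t, A_t)
\]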
PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES (POMDP)
As in an MDP, there is a state space, an action space, a transition function, and a reward function; in addition, the agent receives observations rather than the state itself.
● State: the Markov property still holds for the underlying state, but the agent cannot directly observe that state
● Decision: the agent takes actions, and those decisions have consequences
● Process: there is a transition function (the dynamics of the system)
● Reward: depends on the state and action, often related to the state
Goal: maximise overall reward
MARKOV DECISION PROCESSES (MDP)
(State space, action space, transition function, reward function.)
Computing Rewards
Episodic vs continuing: “Game over” after N steps
Additive rewards (can be infinite for continuing tasks)
Discounted rewards ...
DISCOUNT FACTOR
→ We want to be greedy but not impulsive
→ Implicitly takes uncertainty in dynamics into account (we
don’t know the future)
→ Mathematically: γ<1 allows infinite horizon returns
Return: \( G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \)
SOLVING AN MDP
Objective: the expected return \( J(\pi) = \mathbb{E}_{\pi}[G_t] \)
Goal: find the policy \( \pi^{*} = \arg\max_{\pi} J(\pi) \) that maximises the expected return
SOLVING AN MDP
● If the states and actions are discrete:
○ We have a table of state-action probabilities
○ Learning is filling in this table (dynamic programming)
[Figure: a table with one row per state and one column per action, together with the next-state transitions it induces.]
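A hedged sketch of "filling in the table" by dynamic programming (value iteration); the tiny two-state, two-action MDP below is invented purely for illustration:

import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s'] transition probabilities
R = np.zeros((n_states, n_actions))             # R[s, a] expected reward
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.9, 0.1]; P[1, 1] = [0.2, 0.8]
R[:, 1] = 1.0                                   # action 1 gives reward 1 in every state

Q = np.zeros((n_states, n_actions))             # the state-action table we fill in
for _ in range(200):
    V = Q.max(axis=1)                           # best value achievable from each state
    Q = R + gamma * P @ V                       # Bellman optimality backup
policy = Q.argmax(axis=1)                       # greedy policy read off the table
print(Q, policy)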
SOLVING AN MDP
● If the states and actions are discrete:
● Let’s try different actions and see which ones succeed
Exploration-Exploitation dilemma
Do we want to stick to the action we think is good, or try something new?
Choosing Actions
• Take the action with the highest probability (Q-function): greedy
• Pick an action in proportion to its probability: sampling
• Greedy most of the time, random with some small probability (ε-greedy); see the sketch below
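A minimal sketch of these three rules for a tabular Q-function over one state (the function name, the ε value, and the softmax temperature are illustrative assumptions):

import numpy as np

def choose_action(Q, mode="epsilon_greedy", epsilon=0.1, temperature=1.0):
    if mode == "greedy":                      # always the best-looking action
        return int(np.argmax(Q))
    if mode == "sampling":                    # sample in proportion to a softmax of Q
        p = np.exp(Q / temperature); p /= p.sum()
        return int(np.random.choice(len(Q), p=p))
    # epsilon-greedy: mostly greedy, occasionally a uniformly random action
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(Q)))
    return int(np.argmax(Q))

a = choose_action(np.array([0.1, 0.5, 0.2]), mode="epsilon_greedy")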
VALUE FUNCTIONS
→ Value = expected gain of a state
→ Q function – action specific value function
→ Advantage function – how much more valuable is an action
→ Value depends on future rewards → depends on policy
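In the usual notation (a sketch of the standard definitions, not the slide's own equations):
\[
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right], \qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right], \qquad
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
\]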
VALUE FUNCTIONS
[Figure: a state, the action taken from it, and the resulting next state.]
Solving Reinforcement Learning
• Model-based approaches:
• We model the environment. Do we really need to model all the details of the world?
• Model-free approaches:
• We model the state-actions
AlphaGo: policy-based + value-based + model-based
[Figure: taxonomy of model-free approaches: policy-based methods (learning an actor) and value-based methods (learning a critic), overlapping in actor + critic; the model-based approach sits alongside.]
POLICY ITERATION
[Figure: alternate between policy evaluation and policy update.]
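As a compact sketch of the two alternating steps (standard formulation, assumed rather than taken from the slide):
\[
\text{Evaluation:} \;\; V^{\pi_k}(s) = \mathbb{E}_{\pi_k}\!\left[ r_t + \gamma V^{\pi_k}(S_{t+1}) \mid S_t = s \right]
\qquad
\text{Update:} \;\; \pi_{k+1}(s) = \arg\max_{a} \, \mathbb{E}\!\left[ r(s, a) + \gamma V^{\pi_k}(S_{t+1}) \mid S_t = s, A_t = a \right]
\]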
Q-LEARNING
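As a sketch in standard notation, the tabular Q-learning update (with learning rate α) is:
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
\]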
FUNCTION APPROXIMATION
Model: a parameterized Q-function Q(s, a; θ), e.g. a neural network
Training data: observed transitions (s, a, r, s′)
Loss function: the squared error between Q(s, a; θ) and the target y, where y = r + γ · max over a′ of Q(s′, a′; θ) (a sketch follows)
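A hedged PyTorch sketch of this setup (the network size, optimizer, and variable names are assumptions for illustration, not the lecture's code):

import torch
import torch.nn as nn

n_obs, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_loss(obs, action, reward, next_obs, done):
    # Bootstrapped target: r + gamma * max_a' Q(s', a'), with no gradient through the target
    with torch.no_grad():
        target = reward + gamma * (1 - done) * q_net(next_obs).max(dim=1).values
    q_sa = q_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)  # Q(s, a) for the taken actions
    return nn.functional.mse_loss(q_sa, target)

# One illustrative gradient step on a fake batch of 8 transitions
batch = (torch.randn(8, n_obs), torch.randint(0, n_actions, (8,)),
         torch.randn(8), torch.randn(8, n_obs), torch.zeros(8))
loss = td_loss(*batch)
optimizer.zero_grad(); loss.backward(); optimizer.step()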
IMPLEMENTATION
Action-in vs. action-out architectures; off-policy learning
→ The target depends in part on our model → old observations are still useful
→ Use a replay buffer of the most recent transitions as the dataset (sketched below)
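A minimal replay-buffer sketch (the capacity and field names are illustrative assumptions):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def push(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        batch = random.sample(list(self.buffer), batch_size)  # break temporal correlations
        return list(zip(*batch))                              # tuple of columns

buf = ReplayBuffer()
for t in range(100):
    buf.push(obs=t, action=0, reward=1.0, next_obs=t + 1, done=False)
obs_b, act_b, rew_b, next_b, done_b = buf.sample(8)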
Properties of Reinforcement Learning
• Reward delay
• In Space invader, only “fire” obtains a reward
• Although the moves made before “fire” are important
• In Go, it may be better to sacrifice immediate reward to gain more long-term reward
• The agent’s actions affect the subsequent data it receives
• E.g. exploration
DQN ISSUES
→ Convergence is not guaranteed – hope for deep magic!
→ Mitigations: replay buffer, reward scaling, using replicas
→ Double Q-learning – decouple action selection and value estimation
POLICY GRADIENTS
→ Parameterize the policy and update those parameters directly
→ Enables new kinds of policies: stochastic, continuous action spaces
→ On-policy learning → learn directly from your own actions
POLICY GRADIENTS
→ Approximate the expectation value from samples
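In standard notation, the sampled (REINFORCE-style) estimator looks like the following sketch:
\[
\nabla_{\theta} J(\theta)
= \mathbb{E}_{\pi_{\theta}}\!\left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, G_t \right]
\;\approx\; \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t^{i} \mid s_t^{i})\, G_t^{i}
\]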
VARIANCE REDUCTION
→ Constant offsets make it harder to identify the right update direction
→ Remove the offset → subtract the a priori value of each state
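Subtracting a state-dependent baseline gives the usual variance-reduced form (a sketch; the baseline b is typically an estimate of the state value):
\[
\nabla_{\theta} J(\theta)
\approx \mathbb{E}_{\pi_{\theta}}\!\left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \big( G_t - b(s_t) \big) \right],
\qquad b(s_t) \approx V^{\pi}(s_t)
\]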
ADVANCED POLICY GRADIENT METHODS
Rajeswaran et al. (2017); Heess et al. (2017)
ACTOR CRITIC
[Figure: the critic (trained with a Q-learning update) estimates the advantage; the actor (trained with a policy gradient update) proposes actions.]
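A compact sketch of the two coupled updates in a TD actor-critic (standard form, assumed rather than the slide's own notation):
\[
\delta_t = r_t + \gamma V_{w}(s_{t+1}) - V_{w}(s_t), \qquad
w \leftarrow w + \alpha_{w}\, \delta_t\, \nabla_{w} V_{w}(s_t), \qquad
\theta \leftarrow \theta + \alpha_{\theta}\, \delta_t\, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)
\]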
ASYNC ADVANTAGE ACTOR-CRITIC (A3C)
Mnih et al. (2016)
Deep Reinforcement Learning: Actor-Critic
[Figure: the actor interacts with the environment, the critic is learned from the collected experience, and the actor is updated based on the critic; the network maps the game screen to the actions left / right / fire.]
Demo of A3C
• Visual Doom AI Competition @ CIG 2016
• https://www.youtube.com/watch?v=94EPSjQH38Y
Why is it challenging?
• Exploration-exploitation dilemma
• How to reward the algorithm.
• How to learn when rewards are very sparse
• What representation do we need for states?
• How to update the policy
• How to incorporate prior (or logic-based) knowledge
• How to learn for multiple tasks: General Artificial Intelligence
References
• Textbook: Reinforcement Learning: An Introduction (Sutton & Barto)
• http://incompleteideas.net/sutton/book/the-book.html
• Lectures of David Silver
• http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html (10 lectures, around 1:30 each)
• http://videolectures.net/rldm2015_silver_reinforcement_learning/ (Deep Reinforcement Learning)
• Lectures of John Schulman
• https://youtu.be/aUrX-rP_ss4