REINFORCEMENT LEARNING
Navneet Goyal
Slides developed using material from:
Reinforcement Learning: An Introduction
Richard S. Sutton and Andrew G. Barto
MIT Press
&
Machine Learning by Stephen Marsland, CRC Press (Chapter 13)
Topics
Exploration vs. Exploitation
Learning through Interaction with environment
State & Action spaces
Examples from Game Playing & Robotics
Episodic & Continual Tasks
Reward & Delayed Reward
Policy
Markov Decision Process (MDP) & SMDP
Q-learning Algorithm
Introducing Reinforcement
Learning
Listen to the title track of the Web Series
“Money Heist” – My Life is Going on…
If I am losing now, but I am winning late, that’s
all I want!
I don’t care at all, I am lost!
Random Notes
RL is generally used to solve Markov decision problems (MDPs) and their variants
What is MDP?
Will explore in a while!
Learning types
Supervised learning:
a situation in which sample (input, output) pairs of
the function to be learned can be perceived or are
given
You can think of it as if there is a kind teacher
Training data (X,Y) (features, label)
Predict Y, minimizing some loss
Examples: Regression, Classification.
Unsupervised Learning
Only data points X given (features only)
Find Similar Xs
Clustering
Reference: MDP & RL, Machine Learning 10-601B (CMU)
Learning types
Reinforcement learning:
the agent acts on its environment and receives some evaluation of its action (reinforcement), but is not told which action is the correct one to achieve its goal
Training data: (S, A, R) (states, actions, rewards)
Develop an optimal policy (sequence of
decision rules) for the learner so as to
maximize its “Long-term Reward”
Robotics, Board-games etc.
Reference: MDP & RL, Machine Learning 10-601B (CMU)
Supervised vs. Reinforcement
Reference: MDP & RL, Machine Learning 10-601B (CMU)
Examples of RL
Reference: MDP & RL, Machine Learning 10-601B (CMU)
MDP
Goal – reach the smiling face as fast as
possible
Noisy movement
When you want to move forward, you don’t
necessarily move forward!
Actions - N, E, W, S (north, east, west, south)
Figure source: Towards Data Science
MDP
world
$(S, A, P^a_{ss'}, R^a_{ss'}, \gamma)$
S - states
A - actions
$P^a_{ss'}$ - transition probability of moving from state s to s' through action a
$R^a_{ss'}$ - reward for moving from state s to s' through action a
$\gamma$ - discount factor
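One way to hold this tuple in code, as a minimal sketch (the container layout and the indexing P[a][s][s'] are assumptions, not from the slides):

```python
from typing import NamedTuple, Mapping, Sequence

class MDP(NamedTuple):
    """Container for the (S, A, P, R, gamma) tuple."""
    states: Sequence[str]    # S
    actions: Sequence[str]   # A
    # P[a][s][s'] : probability of moving from s to s' under action a
    P: Mapping[str, Mapping[str, Mapping[str, float]]]
    # R[a][s][s'] : immediate reward for that transition
    R: Mapping[str, Mapping[str, Mapping[str, float]]]
    gamma: float             # discount factor, 0 <= gamma <= 1
```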
MDP
world
[Figure: 3x3 grid world (states 1-9) with transition probabilities 0.9, 0.05, 0.05]
MDP
world
[Figure: grid world with a cookie (reward) state and a skull (penalty) state]
A $ today is better than a $ tomorrow!!
MDP Assumptions
Fully observable domain
You wanted to go N, but ended up W (I would know this)
Markovian property
History is not important
How you ended up in a particular state is not important
Time of transition
We will assume that the time of transition is unity (1), which means it is the same for every transition
For an SMDP, this quantity is not fixed
Elements of an MDP
The framework of the MDP has the following
elements:
(1) state of the system,
(2) actions,
(3) transition probabilities,
(4) transition rewards,
(5) a policy, and
(6) a performance metric (value function)
Elements of an MDP
(1) state of the system/dynamic system
The “state” of a system is a parameter or a set
of parameters that can be used to describe a
system. For example:
Coordinates of a robot moving in a room
Queue in a supermarket (no. of people in the
queue)
Both are examples of dynamic systems!
Elements of an MDP
(2) actions
The robot can move N, E, W, or S (4 actions)
Open a new counter or do not open a new
counter (2 actions)
Elements of an MDP
(3) Transition probabilities
Assume that action a is selected in state S.
Let the next state be S’.
Let $P^a_{ss'}$ denote the probability of going from state S to state S' under the influence of action a in one step.
If an MDP has 3 states and 2 actions, there
are 9 transition probabilities per action.
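As a sanity check on that count, a sketch with made-up numbers: each action gets a 3x3 matrix, i.e., 9 transition probabilities, and each row must sum to 1:

```python
import numpy as np

# Hypothetical MDP with 3 states and 2 actions: P[a] is a 3x3 matrix,
# so each action contributes 3 * 3 = 9 transition probabilities.
P = {
    "a1": np.array([[0.8, 0.1, 0.1],
                    [0.2, 0.6, 0.2],
                    [0.0, 0.3, 0.7]]),
    "a2": np.array([[0.5, 0.5, 0.0],
                    [0.1, 0.8, 0.1],
                    [0.2, 0.2, 0.6]]),
}

for a, matrix in P.items():
    assert matrix.size == 9                      # 9 probabilities per action
    assert np.allclose(matrix.sum(axis=1), 1.0)  # each row is a distribution over next states
```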
Elements of an MDP
(4) Immediate rewards
Usually, the system receives an immediate reward
(which could be positive or negative) when it
transitions from one state to another
It maps each state (or state-action pair) to a real number, called the reward, denoted by $R^a_{ss'}$
(5) Policy
The policy defines the action to be chosen in every
state visited by the system
A map from state space to action space. May be
stochastic
Elements of an MDP
(6) Value function
Value of a state (or state-action pair) is the total expected reward starting from that state (or state-action pair)
MDP – The Q & V Functions
$Q^\pi(s, a) = E\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_0 = s, a_0 = a, \pi\right]$
- From state s, action a is taken
- On expectation, how many cookies will I get?
- After this, you follow policy $\pi$
$V^\pi(s) = Q^\pi(s, \pi(s))$
- when $a$ also comes from policy $\pi$
MDP problem - to find $\pi^*$
$\pi^* = \arg\max_\pi V^\pi(s), \ \forall s \in S$
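A toy tabular sketch (made-up Q-values) of how these definitions fit together: V^π(s) simply looks up Q^π at the policy's action, and acting greedily with respect to Q moves towards π*:

```python
# Tabular Q-values for a toy 2-state, 2-action MDP (made-up numbers).
Q = {
    ("s1", "left"): 1.0, ("s1", "right"): 2.5,
    ("s2", "left"): 0.5, ("s2", "right"): 0.2,
}
actions = ["left", "right"]
policy = {"s1": "left", "s2": "right"}          # some fixed policy pi

def V(state, pi, Q):
    """V^pi(s) = Q^pi(s, pi(s))."""
    return Q[(state, pi[state])]

def greedy_policy(Q, states, actions):
    """One step towards pi*: act greedily with respect to Q."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

print(V("s1", policy, Q))                        # 1.0
print(greedy_policy(Q, ["s1", "s2"], actions))   # {'s1': 'right', 's2': 'left'}
```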
MDP and Reinforcement
Learning
$(S, A, P^a_{ss'}, R^a_{ss'}, \gamma)$
P & R can also change…typically, P changes
Dots & Lines
Figure source: [Link]/utbildning/kandidatexjobb/datateknik/2010/
arvidsson_oskar_OCH_wallgren_linus_K10047.pdf
Dots & Lines
Simple Agent
It always makes the same move, independent of the opponent: it chooses the minimum-numbered line possible, according to figure 2.
Figure source: [Link]/utbildning/kandidatexjobb/datateknik/2010/
arvidsson_oskar_OCH_wallgren_linus_K10047.pdf
Dots & Lines
Simple Agent
The purpose of this agent is to give the Q-learning algorithm an easy and predictable opponent to play against.
Figure source: [Link]/utbildning/kandidatexjobb/datateknik/2010/
arvidsson_oskar_OCH_wallgren_linus_K10047.pdf
Dots & Lines
State-space size
Each line is unique and can be drawn or not drawn!
16 dots & 24 lines: number of states = 2^24
4 dots & 4 lines: 2^4 = 16 states (enumerate to check)
Figure source: [Link]/utbildning/kandidatexjobb/datateknik/2010/
arvidsson_oskar_OCH_wallgren_linus_K10047.pdf
Dots & Lines
State-action pairs
Mean number of lines possible to draw in each state = 12
Hence, number of state-action pairs = 2^24 * 12 ≈ 200 million
For 4 dots & 4 lines: 2^4 * 2 = 32
For 9 dots & 12 lines: 2^12 * 6 = 24,576
Figure source: [Link]/utbildning/kandidatexjobb/datateknik/2010/
arvidsson_oskar_OCH_wallgren_linus_K10047.pdf
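These counts can be checked directly; the mean of 12 drawable lines per state is taken from the slide, not recomputed:

```python
# 4x4 grid of dots: 24 lines, each drawn or not drawn.
n_states_large = 2 ** 24             # 16,777,216 states
n_pairs_large = n_states_large * 12  # ~201 million state-action pairs (mean 12 lines drawable)

# Smaller boards quoted on the slide:
n_pairs_tiny = 2 ** 4 * 2            # 4 dots, 4 lines -> 32
n_pairs_small = 2 ** 12 * 6          # 9 dots, 12 lines -> 24,576

print(n_states_large, n_pairs_large, n_pairs_tiny, n_pairs_small)
```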
Dots & Lines: Q-Learning
Reward
Option 1: +ve or -ve reward only when the game ends
Option 2: +ve or -ve reward when the game ends and also when a box is completed
The reward for winning the game must be far greater than that for completing a box
Likewise, the punishment for losing a game needs to be more severe than that for losing a box
Figure source: [Link]/utbildning/kandidatexjobb/datateknik/2010/
arvidsson_oskar_OCH_wallgren_linus_K10047.pdf
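One way to encode this reward structure in code, as a sketch with hypothetical values (the slide only requires the win/loss feedback to dominate the box feedback):

```python
# Hypothetical reward scheme for Dots and Boxes Q-learning.
WIN_REWARD   = +100   # game won  (must dominate the box reward)
LOSS_PENALTY = -100   # game lost (must dominate the box penalty)
BOX_REWARD   = +10    # completed a box this move
BOX_PENALTY  = -10    # gave a box to the opponent

def feedback(game_over: bool, won: bool, boxes_gained: int, boxes_lost: int) -> float:
    """Return the immediate feedback for the move just played."""
    r = boxes_gained * BOX_REWARD + boxes_lost * BOX_PENALTY
    if game_over:
        r += WIN_REWARD if won else LOSS_PENALTY
    return r
```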
Dots & Lines: Q-Learning
Figure source: [Link]/utbildning/kandidatexjobb/datateknik/2010/
arvidsson_oskar_OCH_wallgren_linus_K10047.pdf
Reinforcement Learning
Search is a fundamental part of any reinforcement learner
An RL algorithm searches over a state space of possible inputs and outputs in order to try to maximize a "reward"
RL is defined in terms of an agent and its
environment
Agent is learning and the environment is where it
is learning and what it is learning about
Environment provides feedback on how good a
strategy is in terms of some “reward”
Exploration vs. Exploitation
Online decision making involves a
fundamental choice:
Exploitation: make the best decision given current
information (greedy approach)
Exploration: gather more information
(ε-greedy or softmax)
The best long–term strategy may involve
short–term sacrifices
Gather enough information to make the best
overall decisions
Reference: Reinforcement Learning: Exploration vs Exploitation
Slides by Marcello Restelli
Exploration vs.
Exploitation: Examples
Restaurant Selection
Exploitation: Go to favorite restaurant
Exploration: Try a new restaurant
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
Clinical Trial
Exploitation: Choose the best treatment so far
Exploration: Try a new treatment
Reference: Reinforcement Learning: Exploration vs Exploitation
Slides by Marcello Restelli
Reinforcement Learning
A child learning to stand up and walk!
Tries out many different strategies for staying
upright and gets feedback about which work
by whether or not it ends up flat on its face
The methods that seem to work are tried over and over again until they are perfected or better solutions are found
Those that do not work are discarded
Reinforcement Learning
It is important to note that it is not the last action the child takes before falling that makes it fall over
Suppose the child tripped over something and fell, and just before falling it waved its hands around in desperation
It is the whole sequence of actions that matters, not just the last one
Reinforcement Learning
“Trial & Error” learning
Also known as “Law of Effect”
Repeat actions that are reinforced by a
feeling of satisfaction
Let’s now try to see how all this can be
applied to Machine Learning
Reinforcement Learning
RL maps states or situations to actions in order to maximize some numerical reward
Robot with multiple sensors
Sensors don’t tell its location, only what it can see
about it
Possible ways in which the robot can drive its motors
are actions
Reward will be how well it does its task without
crashing into obstacles
In RL, algorithm gets feedback in the form of
rewards
Figure Source:Machine Learning by Stephen Marsland, CRC
Press (Chapter 13)
Reinforcement learning
Task
Learn how to behave successfully to achieve a goal
while interacting with an external environment
Learn via experiences!
Examples
Game playing: the player knows whether it wins or loses, but not which move is best at each step
Control: a traffic system can measure the delay of cars, but does not know how to decrease it.
RL is learning from
interaction
RL model
Each percept (e) is enough to determine the state (the state is accessible)
The agent can decompose the reward component from a percept.
The agent's task: to find an optimal policy, mapping states to actions, that maximizes a long-run measure of the reinforcement
Think of reinforcement as reward
Can be modeled as an MDP!
Reinforcement Learning
In Supervised learning, the algorithm is
taught the correct answer
The reward function evaluates the current
solution, but does not suggest how to
improve it
Contrast this with Perceptron!!
Delayed Reward
Game playing (Chess, TicTacToe)
Example: Getting Lost
Arriving in a foreign city after many hours of
flying
Catch a train to the city and check in at a B&B in the old part of the city
Too tired to observe the surroundings
Late in the night, you are starving and you
venture out to find a place to eat and you are
completely lost
You don’t even remember the name of the
facility except that it is in the old part of the city
Example: Getting Lost
Only two things are in your favor!!
Guess what they are
You recognize the building
You know reinforcement learning!!
Since you have walked only through the old city, you need
not worry about any street that takes you out of the old
part
At a bus stop you see the map of the area and you note it
down
Also, you notice a 24x7 grocery store and you buy some
potato chips
Delayed reward (you decide not to eat them immediately and reward yourself only when you take actions that lead you to the B&B)
Figure Source:Machine Learning by Stephen Marsland, CRC
Press (Chapter 13)
Example: Getting Lost
You decide not to eat until you reach your
destination
Is it a good strategy?
Just hope that you don’t faint from hunger
first!!
Inspired by the “delayed reward”, you decide
that the B&B is definitely in square F as its
name sounds vaguely familiar
You decide on a reward structure so that you can follow an RL algorithm back to the B&B
Example: Getting Lost
Don’t stay still (reward of -5, which means you pinch
yourself whenever you are feeling sleepy)
F is called the “absorbing” state (reward yourself by eating
all the chips)
Moving between two squares could be good as it might be
taking you closer to F (but you wouldn’t know unless you
refer to the map)
So you decide to reward yourself only after reaching F
All other actions are neutral
When there is no direct road between two squares, there is no reward as it is not a viable action (short cuts not recommended!!)
Reward Matrix is now ready
Figure Source:Machine Learning by Stephen Marsland, CRC
Press (Chapter 13)
Example: Getting Lost
As a reinforcement learner, you don’t know
the reward matrix!!
This is exactly what you are trying to discover
Figure Source:Machine Learning by Stephen Marsland, CRC
Press (Chapter 13)
State & Action Space
No. of states
All possible actions
Reducing the size of state & action space is
always a good idea
Size of state space
5 inputs, each an integer between 0 & 100: size = 100^5
Divide the 0-100 range into 2 classes (< 50 and >= 50): size = 2^5
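A quick check of the two sizes, with a sketch of the binary discretization (the threshold of 50 comes from the slide; the function name is hypothetical):

```python
full_size = 100 ** 5      # 10,000,000,000 states with raw integer inputs
coarse_size = 2 ** 5      # 32 states after binarizing each input

def discretize(inputs):
    """Map five 0-100 readings to five binary features (< 50 vs >= 50)."""
    return tuple(1 if x >= 50 else 0 for x in inputs)

print(full_size, coarse_size, discretize([3, 72, 50, 49, 100]))  # (0, 1, 1, 0, 1)
```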
Carrots & Sticks – The
Reward Function
Reward function takes two inputs
the current state and
the chosen action
and outputs a numerical reward
If we are in state A and do nothing, the reward is
-5
The reward tells the learner about the goal and
not how the goal should be achieved (which
would be supervised learning)
Choice of a suitable reward function is absolutely
critical
Carrots & Sticks – The
Reward Function
Maze traversing robot
Two reward functions:
Receive a reward of 50 when you find the center of
the maze
Receive a reward of -1 for each move and a reward
of +50 when you find the center of the maze
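The two reward schemes side by side, as a sketch (the at_centre flag is an assumed input):

```python
def reward_v1(at_centre: bool) -> float:
    """Version 1: reward only for reaching the centre of the maze."""
    return 50.0 if at_centre else 0.0

def reward_v2(at_centre: bool) -> float:
    """Version 2: -1 per move, +50 for the centre (favours shorter routes)."""
    return 50.0 if at_centre else -1.0
```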
Carrots & Sticks – The
Reward Function
In the first version, the robot will learn to get to
the center of the maze
So it will in the second case also
In the second version, shorter routes are given
preference
Maze problem is “episodic”
Learning is split into episodes and have a definite
end point
Reward can be given at the end and then
propagated back through all the actions that
were performed to update the learner
Carrots & Sticks – The
Reward Function
Continual tasks (opposite of Episodic)
No cut-off when the task stops
Child learning to walk
A child can walk successfully when it doesn’t fall over at
all, not when it doesn’t fall over for 10 minutes!
Reward has two parts:
Intermediate
Final pay-off
The thing that is driving the learning is the final reward
This doesn't work for continual tasks: there is no terminal state, so we would have to predict the reward infinitely far into the future, which is not possible
Discounting
Solution to the problem!
Take into account how certain we can be about things
that happen in the future
We discount our predictions of rewards in the future
according to how much chance there is that they are
wrong
The rewards that we are going to get soon are more
accurate than those a long time in the future
Introduce a parameter, 0 <= γ <= 1
Weight a reward by γ^t, where t is the number of time steps into the future
Discounting
γ^k → 0 as k → ∞
The prediction of the total future reward can therefore be given by:
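(Standard discounted-return form from Sutton & Barto, with $r_{t+k+1}$ denoting the reward received $k+1$ steps ahead:)
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$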
Interpretation of γ:
A value closer to 0 means we look less far into the future
A value of 1 means no discounting
Can also be applied to episodic cases when the eventual reward is a long way off
Learning-to-walk example:
When you fall you give yourself a reward of -1, otherwise no reward
Discounting
Learning-to-walk example:
When you fall you give yourself a reward of -1, otherwise no reward
The -1 reward is discounted into the future
A fall k steps into the future contributes a discounted reward of -γ^k
The learner tries to make k as large as possible, resulting in proper walking
Action Selection Techniques
The RL algorithm looks at the actions that can be performed in a given state and computes the "value" of each action
Value = average reward that is expected for carrying out that action in the current state
Simplest way: compute the average reward that has been received each time the action was taken in the past, Q_{s,t}(a), where t is the number of times the action has been taken before in this state
Q_{s,t}(a) will eventually converge to the true prediction of the reward for that action
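A sketch of keeping that running average incrementally (table names are assumptions):

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)]: current average reward estimate
N = defaultdict(int)     # N[(s, a)]: times action a has been taken in state s

def update_average(s, a, reward):
    """Incrementally update the average reward received for (s, a)."""
    N[(s, a)] += 1
    Q[(s, a)] += (reward - Q[(s, a)]) / N[(s, a)]   # new mean after N samples
```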
Action Selection Techniques
Based on the current average reward predictions, there are three ways of choosing a (the first two are sketched in code after this list):
Greedy
Pick the action with the highest value of Q_{s,t}(a)
Always choose to exploit your current knowledge
ε-greedy
With some small probability ε, pick some other action at random
Nearly every time we take the greedy option
Some exploration is thrown into the mix
Finds better solutions over time than pure greedy, since it can explore
Soft-max
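A minimal sketch of the greedy and ε-greedy choices just listed (the Q table and action list are assumed inputs; softmax is developed on the next slides):

```python
import random

def greedy(Q, state, actions):
    """Always exploit: pick the action with the highest current estimate."""
    return max(actions, key=lambda a: Q[(state, a)])

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Mostly greedy, but explore a random action with small probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)
    return greedy(Q, state, actions)
```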
Softmax Function
The soft maximum approximates the hard maximum and is a convex function, just like the hard maximum. But the soft maximum is smooth: it has no sudden changes in direction and can be differentiated as many times as you like. These properties make it easy for convex optimization algorithms to work with the soft maximum. In fact, the function may have been invented for optimization.
Softmax Function
• Accuracy of the soft maximum approximation depends on scale
• Multiplying x and y by a large constant brings the soft maximum
closer to the hard maximum
• For example, g(1, 2) = 2.31, but g(10, 20) = 20.00004
• “hardness” of the soft maximum can be controlled by
generalizing the soft maximum to depend on a parameter k.
g(x, y; k) = log( exp(kx) + exp(ky) ) / k
• Soft maximum can be made as close to the hard maximum as
desired by making k large enough
• For every value of k the soft maximum is differentiable, but the size of the derivatives increases as k increases.
• In the limit the derivative becomes infinite as the soft maximum
converges to the hard maximum
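The quoted values can be reproduced directly (a small check, not part of the original slides):

```python
import math

def soft_max(x, y, k=1.0):
    """Smooth approximation to max(x, y); larger k makes it 'harder'."""
    return math.log(math.exp(k * x) + math.exp(k * y)) / k

print(soft_max(1, 2))         # ~2.3133, i.e. g(1, 2) = 2.31 as quoted
print(soft_max(10, 20))       # ~20.0000454, i.e. g(10, 20) = 20.00004 as quoted
print(soft_max(1, 2, k=10))   # ~2.0000045: larger k moves the soft maximum towards the hard maximum
```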
Action Selection Techniques
Soft-max
P(Q_{s,t}(a)) = exp(Q_{s,t}(a)/τ) / {∑_b exp(Q_{s,t}(b)/τ)}
where τ is the temperature (as in simulated annealing)
When τ is large, all actions have similar probabilities
When τ is small, differences in the estimated values matter more and selection becomes nearly greedy
Current best (greedy) action will be chosen most
of the time, but others will be chosen
proportional to their estimated reward, which is
updated every time they are used
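A sketch of this rule (τ and the Q table are assumed inputs):

```python
import math
import random

def softmax_action(Q, state, actions, tau=1.0):
    """Choose an action with probability proportional to exp(Q(s,a)/tau)."""
    prefs = [math.exp(Q[(state, a)] / tau) for a in actions]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(actions, weights=probs, k=1)[0]
```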
Reinforcement Learning
In interactive problems it is often impractical
to obtain examples of desired behavior that
are both correct and representative of all the
situations in which the agent has to act
In uncharted territory--where one would
expect learning to be most beneficial--an
agent must be able to learn from its own
experience
Reinforcement Learning
One of the challenges that arise in
reinforcement learning and not in other kinds
of learning is the trade-off between
exploration and exploitation
The agent has to exploit what it already
knows in order to obtain reward, but it also
has to explore in order to make better action
selections in the future
Policy
Action selection is a trade-off between exploitation &
exploration so as to maximize expected reward into the
future
Can we not make an explicit decision that we are going
to take the optimal choice at each stage and not do
exploration any more?
The choice of which action to take in each state in order
to get optimal results is called a policy, π
The hope is that we can learn a better policy that is specific to the current state s_t
Crux of the learning part of RL: learn a policy π mapping states to actions
Policy
A policy defines the learning agent's way of behaving at a
given time
A mapping from perceived states of the environment to
actions to be taken when in those states
In psychology, it is called a set of stimulus-response rules
or associations.
A simple function or lookup table, or may involve
extensive computation such as a search process
The policy is the core of a reinforcement learning agent
in the sense that it alone is sufficient to determine
behavior.
Markov Decision Process (MDP)
An RL problem that has the Markov property is known as an MDP
We can make decisions (predictions) about the likely
behavior of the learner, and its expected rewards, using
just the current state data
First order Markov!
Examples
A master chess player makes a move. The choice
is informed both by planning--anticipating
possible replies and counter replies--and by
immediate, intuitive judgments of the
desirability of particular positions and moves.
An adaptive controller adjusts parameters of a
petroleum refinery's operation in real time. The
controller optimizes the yield/cost/quality trade-
off on the basis of specified marginal costs
without sticking strictly to the set points
originally suggested by engineers
Examples
A gazelle calf struggles to its feet minutes after
being born. Half an hour later it is running at 20
miles per hour.
A mobile robot decides whether it should enter a
new room in search of more trash to collect or
start trying to find its way back to its battery
recharging station. It makes its decision based
on how quickly and easily it has been able to find
the recharger in the past
Reinforcement Learning
4 main sub-elements of a reinforcement learning
system:
a policy
a reward function
a value function, and, optionally,
a model of the environment
Reinforcement Learning:
Reward Function
A reward function defines the goal
it maps each perceived state (or state-action pair) of
the environment to a single number, a reward,
indicating the intrinsic desirability of that state
The RL agent's sole objective is to maximize the total reward it receives in the long run
The reward function defines what are the good and
bad events for the agent
In a biological system, it would not be inappropriate
to identify rewards with pleasure and pain.
Reinforcement Learning:
Reward Function
Unalterable by the agent, but it may serve as a basis for altering the policy
For example, if an action selected by the policy is
followed by low reward, then the policy may be
changed to select some other action in that
situation in the future
Reinforcement Learning:
Value Function
The reward function indicates what is good in an immediate sense, whereas a value function specifies what is good in the long run
The value of a state is the total amount of reward an
agent can expect to accumulate over the future, starting
from that state
For example, a state might always yield a low immediate
reward but still have a high value because it is regularly
followed by other states that yield high rewards.
Or the reverse could be true
Reinforcement Learning:
Value Function
To make a human analogy, rewards are like pleasure (if
high) and pain (if low), whereas values correspond to a
more refined and farsighted judgment of how pleased or
displeased we are that our environment is in a particular
state
Rewards are in a sense primary, whereas values, as
predictions of rewards, are secondary.
Without rewards there could be no values, and the only
purpose of estimating values is to achieve more reward.
Reinforcement Learning:
Value Function
A reinforcement learner is trying to decide on what
action to take in order to maximize the expected reward
into the future
2 ways to compute a value:
At a current state, average across all actions that can be taken
State-value function, V(s)
Consider the current state and each possible action that can be
taken separately
Action-value function, Q(s,a)
In either case, we are thinking about what the expected
reward would be if we started in state s
Reinforcement Learning:
Value Function
A word about Evolutionary
Methods
If the space of policies is sufficiently small, or can be
structured so that good policies are common or easy to
find, then evolutionary methods can be effective
In addition, evolutionary methods have advantages on
problems in which the learning agent cannot accurately
sense the state of its environment
Q Learning Algorithm
In Q-learning, an evaluation function over states
and actions is learned
It is simple and easy to implement
It keeps information about all state-action pairs
Memory intensive
Let's try to understand the Q-learning algorithm using the popular Dots and Boxes game
Reinforcement Learning
Reinforcement learning is a subset of the machine learning
techniques. These techniques learn through trial and error
in search of an optimal solution.
Q-learning
Q-learning is a simple reinforcement learning algorithm.
Agent
An agent is a player in the game implementation, either a
computer or an interface towards humans.
State
The state of the environment; the current game configuration.
Action
An action is a transition from one state to another, which is
what a player does or can do in a specific state.
Reinforcement Learning
Feedback
An action generates feedback from the environment. The feedback
can be either positive or negative. The feedback is used in the Q-
learning algorithm for estimating how good an action is. The term
reward is used for positive feedback and negative feedback is called
punishment.
The learning rate
A parameter in the Q-learning algorithm, describing the relationship
between old and new memories. A high value means that new
memories take precedence over old memories and vice versa. A
value of zero means that nothing will be learnt.
Discount factor
The discount factor is a parameter in the Q-learning algorithm,
affecting the importance of future feedback. A high discount factor
increases the importance of future feedback.
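Putting the learning rate α and discount factor γ together gives the standard Q-learning update; the sketch below shows the general rule, not the thesis authors' exact implementation:

```python
def q_update(Q, s, a, feedback, s_next, next_actions, alpha=0.1, gamma=0.9):
    """Standard Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (feedback + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in next_actions) if next_actions else 0.0
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (feedback + gamma * best_next - old)
```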
Q Learning Algorithm
Reward System
Choose the action which gets the maximum expected
reward
Reward function takes the current state and the chosen
action to produce a numerical reward
State A – do nothing: Reward of -5
Move!!
Sub-goals to be avoided
Choosing a suitable reward function is critical
Q Learning Algorithm
Reward System
Receive a reward of +50 when you find the centre of the
maze
Receive a reward of -1 for each move and a reward of +50
when you find the centre of the maze
Which one would you prefer?
Maze problem is EPISODIC!!
Learning is split into episodes.
Definite endpoint (robot reaching the centre of the maze)
Reward given at the end and propagated back through all the
actions that were performed to update the learner
Q Learning Algorithm
Reward System
Continual Tasks: no cut-off point at which the task ends
For example: Child learning to walk
A child can walk successfully when it does not fall over at all, not
when it does not fall over for 10 mins.
Reward needs to be broken into 2 parts:
Immediate part
Pay-off at the end
The factor driving the learning is the total reward, which is the expected reward from now until the task is completed, i.e., when the learner reaches the absorbing state (the backpacker reaching square F)
F has a huge pay-off
This does not work for continual tasks!!
Summary
Reinforcement learning is suitable for learning in
uncertain environments where rewards may be
delayed and subject to chance
The goal of a reinforcement learning program is to maximize the eventual reward
Q-learning is a form of reinforcement learning that
doesn’t require that the learner has prior knowledge
of how its actions affect the environment