REINFORCEMENT LEARNING
Navneet Goyal
Slides developed using material from:
Reinforcement Learning: An Introduction
Richard S. Sutton and Andrew G. Barto
MIT Press
&
Machine Learning by Stephen Marsland, CRC Press (Chapter 13)
Topics
Exploration vs. Exploitation
Learning through Interaction with environment
State & Action spaces
Examples from Game Playing & Robotics
Episodic & Continual Tasks
Reward & Delayed Reward
Policy
Markov Decision Process (MDP) & SMDP
Q-learning Algorithm
Introducing Reinforcement
Learning
Listen to the title track of the Web Series
“Money Heist” – My Life is Going on…
If I am losing now, but I am winning late, that’s
all I want!
I don’t care at all, I am lost!
Random Notes
RL is generally used to solve Markov decision problems (MDPs) and their variants
What is MDP?
Will explore in a while!
Learning types
Supervised learning:
a situation in which sample (input, output) pairs of
the function to be learned can be perceived or are
given
You can think of it as if there is a kind teacher
Training data (X,Y) (features, label)
Predict Y, minimizing some loss
Examples: Regression, Classification.
Unsupervised Learning
Only data points X given (features only)
Find Similar Xs
Clustering
Reference: MDP & RL, Machine Learning 10-601B (CMU)
Learning types
Reinforcement learning:
the agent acts on its environment and receives some evaluation of its action (reinforcement), but is not told which action is the correct one to achieve its goal
Training data: (S, A, R) (states, actions, rewards)
Develop an optimal policy (sequence of
decision rules) for the learner so as to
maximize its “Long-term Reward”
Robotics, Board-games etc.
Reference: MDP & RL, Machine Learning 10-601B (CMU)
Supervised vs. Reinforcement
Reference: MDP & RL, Machine Learning 10-601B (CMU)
Examples of RL
Reference: MDP & RL, Machine Learning 10-601B (CMU)
MDP
Goal – reach the smiling face as fast as
possible
Noisy movement
When you want to move forward, you don’t
necessarily move forward!
Actions - N, E, W, S (north, east, west, south)
Figure source: Towards Data Science
MDP
world
$(S, A, P^a_{ss'}, R^a_{ss'}, \gamma)$
S - states
A - actions
$P^a_{ss'}$ - transition probability of moving from state s to s' through action a
$R^a_{ss'}$ - reward for moving from state s to s' through action a
$\gamma$ - discount factor
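One way to hold this tuple in code, as a minimal sketch (the container layout and the indexing P[a][s][s'] are assumptions, not from the slides):

```python
from typing import NamedTuple, Mapping, Sequence

class MDP(NamedTuple):
    """Container for the (S, A, P, R, gamma) tuple."""
    states: Sequence[str]    # S
    actions: Sequence[str]   # A
    # P[a][s][s'] : probability of moving from s to s' under action a
    P: Mapping[str, Mapping[str, Mapping[str, float]]]
    # R[a][s][s'] : immediate reward for that transition
    R: Mapping[str, Mapping[str, Mapping[str, float]]]
    gamma: float             # discount factor, 0 <= gamma <= 1
```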
MDP
world
[Figure: 3x3 grid world (states 1-9) with transition probabilities 0.9, 0.05, 0.05]
MDP
world
[Figure: grid world with a cookie (reward) state and a skull (penalty) state]
A $ today is better than a $ tomorrow!!
MDP Assumptions
Fully observable domain
You wanted to go N, but ended up W (I would know this)
Markovian property
History is not important
How you ended up in a particular state is not important
Time of transition
We will assume that the time of transition is unity (1), which means it is the same for every transition
For an SMDP, this quantity is not fixed
Elements of an MDP
The framework of the MDP has the following
elements:
(1) state of the system,
(2) actions,
(3) transition probabilities,
(4) transition rewards,
(5) a policy, and
(6) a performance metric (value function)
Elements of an MDP
(1) state of the system/dynamic system
The “state” of a system is a parameter or a set
of parameters that can be used to describe a
system. For example:
Coordinates of a robot moving in a room
Queue in a supermarket (no. of people in the
queue)
Both are examples of dynamic systems!
Elements of an MDP
(2) actions
The robot can move N, E, W, or S (4 actions)
Open a new counter or do not open a new
counter (2 actions)
Elements of an MDP
(3) Transition probabilities
Assume that action a is selected in state S.
Let the next state be S’.
Let $P^a_{ss'}$ denote the probability of going from state S to state S' under the influence of action a in one step.
If an MDP has 3 states and 2 actions, there
are 9 transition probabilities per action.
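As a sanity check on that count, a sketch with made-up numbers: each action gets a 3x3 matrix, i.e., 9 transition probabilities, and each row must sum to 1:

```python
import numpy as np

# Hypothetical MDP with 3 states and 2 actions: P[a] is a 3x3 matrix,
# so each action contributes 3 * 3 = 9 transition probabilities.
P = {
    "a1": np.array([[0.8, 0.1, 0.1],
                    [0.2, 0.6, 0.2],
                    [0.0, 0.3, 0.7]]),
    "a2": np.array([[0.5, 0.5, 0.0],
                    [0.1, 0.8, 0.1],
                    [0.2, 0.2, 0.6]]),
}

for a, matrix in P.items():
    assert matrix.size == 9                      # 9 probabilities per action
    assert np.allclose(matrix.sum(axis=1), 1.0)  # each row is a distribution over next states
```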
Elements of an MDP
(4) Immediate rewards
Usually, the system receives an immediate reward
(which could be positive or negative) when it
transitions from one state to another
It maps each state (or state-action pair) to a real number, called the reward, denoted by $R^a_{ss'}$
(5) Policy
The policy defines the action to be chosen in every
state visited by the system
A map from state space to action space. May be
stochastic
Elements of an MDP
(6) Value function
Value of a state (or state-action pair) is the total expected reward starting from that state (or state-action pair)
MDP – The Q & V Functions
$Q^\pi(s, a) = E\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_0 = s, a_0 = a, \pi\right]$
- From state s, action a is taken
- On expectation, how many cookies will I get?
- After this, you follow policy $\pi$
$V^\pi(s) = Q^\pi(s, \pi(s))$
- when $a$ also comes from policy $\pi$
MDP problem - to find $\pi^*$
$\pi^* = \arg\max_\pi V^\pi(s), \ \forall s \in S$
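A toy tabular sketch (made-up Q-values) of how these definitions fit together: V^π(s) simply looks up Q^π at the policy's action, and acting greedily with respect to Q moves towards π*:

```python
# Tabular Q-values for a toy 2-state, 2-action MDP (made-up numbers).
Q = {
    ("s1", "left"): 1.0, ("s1", "right"): 2.5,
    ("s2", "left"): 0.5, ("s2", "right"): 0.2,
}
actions = ["left", "right"]
policy = {"s1": "left", "s2": "right"}          # some fixed policy pi

def V(state, pi, Q):
    """V^pi(s) = Q^pi(s, pi(s))."""
    return Q[(state, pi[state])]

def greedy_policy(Q, states, actions):
    """One step towards pi*: act greedily with respect to Q."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

print(V("s1", policy, Q))                        # 1.0
print(greedy_policy(Q, ["s1", "s2"], actions))   # {'s1': 'right', 's2': 'left'}
```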
MDP and Reinforcement
Learning
$(S, A, P^a_{ss'}, R^a_{ss'}, \gamma)$
P & R can also change…typically, P changes
Dots & Lines
Figure source: [Link]/utbildning/kandidatexjobb/datateknik/2010/
arvidsson_oskar_OCH_wallgren_linus_K10047.pdf
Dots & Lines
Simple Agent
It always makes the same move, independent of the opponent: it chooses the minimum-numbered line possible, according to figure 2.
Figure source: [Link]/utbildning/kandidatexjobb/datateknik/2010/
arvidsson_oskar_OCH_wallgren_linus_K10047.pdf
Dots & Lines
Simple Agent
The purpose of this agent is to give the Q-learning algorithm an easy and predictable opponent to play against.
Figure source: [Link]/utbildning/kandidatexjobb/datateknik/2010/
arvidsson_oskar_OCH_wallgren_linus_K10047.pdf
Dots & Lines
State-space size
Each line is unique and can be drawn or not drawn!
16 dots & 24 lines: number of states = 2^24
4 dots & 4 lines: 2^4 = 16 states (enumerate to check)
Figure source: [Link]/utbildning/kandidatexjobb/datateknik/2010/
arvidsson_oskar_OCH_wallgren_linus_K10047.pdf
Dots & Lines
State-action pairs
Mean number of lines possible to draw in each state = 12
Hence, number of state-action pairs = 2^24 * 12 ≈ 200 million
For 4 dots & 4 lines: 2^4 * 2 = 32
For 9 dots & 12 lines: 2^12 * 6 = 24,576
Figure source: [Link]/utbildning/kandidatexjobb/datateknik/2010/
arvidsson_oskar_OCH_wallgren_linus_K10047.pdf
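These counts can be checked directly; the mean of 12 drawable lines per state is taken from the slide, not recomputed:

```python
# 4x4 grid of dots: 24 lines, each drawn or not drawn.
n_states_large = 2 ** 24             # 16,777,216 states
n_pairs_large = n_states_large * 12  # ~201 million state-action pairs (mean 12 lines drawable)

# Smaller boards quoted on the slide:
n_pairs_tiny = 2 ** 4 * 2            # 4 dots, 4 lines -> 32
n_pairs_small = 2 ** 12 * 6          # 9 dots, 12 lines -> 24,576

print(n_states_large, n_pairs_large, n_pairs_tiny, n_pairs_small)
```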
Dots & Lines: Q-Learning
Reward
Option 1: +ve or -ve reward only when the game ends
Option 2: +ve or -ve reward when the game ends and also when a box is completed
The reward for winning the game must be far greater than that for completing a box
Likewise, the punishment for losing a game needs to be more severe than that for losing a box
Figure source: [Link]/utbildning/kandidatexjobb/datateknik/2010/
arvidsson_oskar_OCH_wallgren_linus_K10047.pdf
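One way to encode this reward structure in code, as a sketch with hypothetical values (the slide only requires the win/loss feedback to dominate the box feedback):

```python
# Hypothetical reward scheme for Dots and Boxes Q-learning.
WIN_REWARD   = +100   # game won  (must dominate the box reward)
LOSS_PENALTY = -100   # game lost (must dominate the box penalty)
BOX_REWARD   = +10    # completed a box this move
BOX_PENALTY  = -10    # gave a box to the opponent

def feedback(game_over: bool, won: bool, boxes_gained: int, boxes_lost: int) -> float:
    """Return the immediate feedback for the move just played."""
    r = boxes_gained * BOX_REWARD + boxes_lost * BOX_PENALTY
    if game_over:
        r += WIN_REWARD if won else LOSS_PENALTY
    return r
```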
Dots & Lines: Q-Learning
Figure source: [Link]/utbildning/kandidatexjobb/datateknik/2010/
arvidsson_oskar_OCH_wallgren_linus_K10047.pdf
Reinforcement Learning
Search is a fundamental part of any reinforcement learner
An RL algorithm searches over a state space of possible inputs and outputs in order to try to maximize a "reward"
RL is defined in terms of an agent and its
environment
Agent is learning and the environment is where it
is learning and what it is learning about
Environment provides feedback on how good a
strategy is in terms of some “reward”
Exploration vs. Exploitation
Online decision making involves a
fundamental choice:
Exploitation: make the best decision given current
information (greedy approach)
Exploration: gather more information
(ε-greedy or softmax)
The best long–term strategy may involve
short–term sacrifices
Gather enough information to make the best
overall decisions
Reference: Reinforcement Learning: Exploration vs Exploitation
Slides by Marcello Restelli
Exploration vs.
Exploitation: Examples
Restaurant Selection
Exploitation: Go to favorite restaurant
Exploration: Try a new restaurant
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
Clinical Trial
Exploitation: Choose the best treatment so far
Exploration: Try a new treatment
Reference: Reinforcement Learning: Exploration vs Exploitation
Slides by Marcello Restelli
Reinforcement Learning
A child learning to stand up and walk!
Tries out many different strategies for staying
upright and gets feedback about which work
by whether or not it ends up flat on its face
The methods that seem to work are tried over and over again until they are perfected or better solutions are found
Those that do not work are discarded
Reinforcement Learning
It is important to note that it is not the last action the child takes before falling that makes it fall over
Suppose the child tripped over something and fell, and just before falling it waved its hands around in desperation
It is the whole sequence of actions that matters, not just the last one
Reinforcement Learning
“Trial & Error” learning
Also known as “Law of Effect”
Repeat actions that are reinforced by a
feeling of satisfaction
Let’s now try to see how all this can be
applied to Machine Learning
Reinforcement Learning
RL maps states or situations to actions in order to maximize some numerical reward
Robot with multiple sensors
Sensors don’t tell its location, only what it can see
about it
Possible ways in which the robot can drive its motors
are actions
Reward will be how well it does its task without
crashing into obstacles
In RL, algorithm gets feedback in the form of
rewards
Figure Source:Machine Learning by Stephen Marsland, CRC
Press (Chapter 13)
Reinforcement learning
Task
Learn how to behave successfully to achieve a goal
while interacting with an external environment
Learn via experiences!
Examples
Game playing: the player knows whether it wins or loses, but not which move is best at each step
Control: a traffic system can measure the delay of cars, but does not know how to decrease it.
RL is learning from
interaction
RL model
Each percept (e) is enough to determine the state (the state is accessible)
The agent can decompose the reward component from a percept.
The agent's task: to find an optimal policy, mapping states to actions, that maximizes a long-run measure of the reinforcement
Think of reinforcement as reward
Can be modeled as an MDP!
Reinforcement Learning
In Supervised learning, the algorithm is
taught the correct answer
The reward function evaluates the current
solution, but does not suggest how to
improve it
Contrast this with Perceptron!!
Delayed Reward
Game playing (Chess, TicTacToe)
Example: Getting Lost
Arriving in a foreign city after many hours of
flying
Catch a train to the city and check in at a B&B in the old part of the city
Too tired to observe the surroundings
Late in the night, you are starving and you
venture out to find a place to eat and you are
completely lost
You don’t even remember the name of the
facility except that it is in the old part of the city
Example: Getting Lost
Only two things are in your favor!!
Guess what they are
You recognize the building
You know reinforcement learning!!
Since you have walked only through the old city, you need
not worry about any street that takes you out of the old
part
At a bus stop you see the map of the area and you note it
down
Also, you notice a 24x7 grocery store and you buy some
potato chips
Delayed reward (you decide not to eat them immediately and reward yourself only when you take actions that lead you to the B&B)
Figure Source:Machine Learning by Stephen Marsland, CRC
Press (Chapter 13)
Example: Getting Lost
You decide not to eat until you reach your
destination
Is it a good strategy?
Just hope that you don’t faint from hunger
first!!
Inspired by the “delayed reward”, you decide
that the B&B is definitely in square F as its
name sounds vaguely familiar
You decide on a reward structure so that you can follow an RL algorithm back to the B&B
Example: Getting Lost
Don’t stay still (reward of -5, which means you pinch
yourself whenever you are feeling sleepy)
F is called the “absorbing” state (reward yourself by eating
all the chips)
Moving between two squares could be good as it might be
taking you closer to F (but you wouldn’t know unless you
refer to the map)
So you decide to reward yourself only after reaching F
All other actions are neutral
When there is no direct road between two squares, there is no reward as it is not a viable action (short cuts not recommended!!)
Reward Matrix is now ready
Figure Source:Machine Learning by Stephen Marsland, CRC
Press (Chapter 13)
Example: Getting Lost
As a reinforcement learner, you don’t know
the reward matrix!!
This is exactly what you are trying to discover
Figure Source:Machine Learning by Stephen Marsland, CRC
Press (Chapter 13)
State & Action Space
No. of states
All possible actions
Reducing the size of state & action space is
always a good idea
Size of state space
5 inputs, each an integer between 0 & 100: size = 100^5
Divide the 0-100 range into 2 classes (< 50 and >= 50): size = 2^5
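A quick check of the two sizes, with a sketch of the binary discretization (the threshold of 50 comes from the slide; the function name is hypothetical):

```python
full_size = 100 ** 5      # 10,000,000,000 states with raw integer inputs
coarse_size = 2 ** 5      # 32 states after binarizing each input

def discretize(inputs):
    """Map five 0-100 readings to five binary features (< 50 vs >= 50)."""
    return tuple(1 if x >= 50 else 0 for x in inputs)

print(full_size, coarse_size, discretize([3, 72, 50, 49, 100]))  # (0, 1, 1, 0, 1)
```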
Carrots & Sticks – The
Reward Function
Reward function takes two inputs
the current state and
the chosen action
and outputs a numerical reward
If we are in state A and do nothing, the reward is
-5
The reward tells the learner about the goal and
not how the goal should be achieved (which
would be supervised learning)
Choice of a suitable reward function is absolutely
critical
Carrots & Sticks – The
Reward Function
Maze traversing robot
Two reward functions:
Receive a reward of 50 when you find the center of
the maze
Receive a reward of -1 for each move and a reward
of +50 when you find the center of the maze
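The two reward schemes side by side, as a sketch (the at_centre flag is an assumed input):

```python
def reward_v1(at_centre: bool) -> float:
    """Version 1: reward only for reaching the centre of the maze."""
    return 50.0 if at_centre else 0.0

def reward_v2(at_centre: bool) -> float:
    """Version 2: -1 per move, +50 for the centre (favours shorter routes)."""
    return 50.0 if at_centre else -1.0
```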
Carrots & Sticks – The
Reward Function
In the first version, the robot will learn to get to
the center of the maze
So it will in the second case also
In the second version, shorter routes are given
preference
Maze problem is “episodic”
Learning is split into episodes and have a definite
end point
Reward can be given at the end and then
propagated back through all the actions that
were performed to update the learner
Carrots & Sticks – The
Reward Function
Continual tasks (opposite of Episodic)
No cut-off when the task stops
Child learning to walk
A child can walk successfully when it doesn’t fall over at
all, not when it doesn’t fall over for 10 minutes!
Reward has two parts:
Intermediate
Final pay-off
The thing that is driving the learning is the final reward
This doesn't work for continual tasks: there is no terminal state, so we would have to predict the reward infinitely far into the future, which is not possible
Discounting
Solution to the problem!
Take into account how certain we can be about things
that happen in the future
We discount our predictions of rewards in the future
according to how much chance there is that they are
wrong
The rewards that we are going to get soon are more
accurate than those a long time in the future
Introduce a parameter, 0 <= γ <= 1
Weight a reward by γ^t, where t is the number of time steps into the future
Discounting
γ^k → 0 as k → ∞
The prediction of the total future reward can therefore be given by:
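(Standard discounted-return form from Sutton & Barto, with $r_{t+k+1}$ denoting the reward received $k+1$ steps ahead:)
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$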
Interpretation of γ:
A value closer to 0 means we look less far into the future
A value of 1 means no discounting
Can also be applied to episodic cases when the eventual reward is a long way off
Learning-to-walk example:
When you fall you give yourself a reward of -1, otherwise no reward
Discounting
Learning-to-walk example:
When you fall you give yourself a reward of -1, otherwise no reward
The -1 reward is discounted into the future
A fall k steps into the future contributes a discounted reward of -γ^k
The learner tries to make k as large as possible, resulting in proper walking
Action Selection Techniques
The RL algorithm looks at the actions that can be performed in a given state and computes the "value" of each action
Value = average reward that is expected for carrying out that action in the current state
Simplest way: compute the average reward that has been received each time the action was taken in the past, Q_{s,t}(a), where t is the number of times the action has been taken before in this state
Q_{s,t}(a) will eventually converge to the true prediction of the reward for that action
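A sketch of keeping that running average incrementally (table names are assumptions):

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)]: current average reward estimate
N = defaultdict(int)     # N[(s, a)]: times action a has been taken in state s

def update_average(s, a, reward):
    """Incrementally update the average reward received for (s, a)."""
    N[(s, a)] += 1
    Q[(s, a)] += (reward - Q[(s, a)]) / N[(s, a)]   # new mean after N samples
```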
Action Selection Techniques
Based on the current average reward predictions, there are three ways of choosing a (the first two are sketched in code after this list):
Greedy
Pick the action with the highest value of Q_{s,t}(a)
Always choose to exploit your current knowledge
ε-greedy
With some small probability ε, pick some other action at random
Nearly every time we take the greedy option
Some exploration is thrown into the mix
Finds better solutions over time than pure greedy, since it can explore
Soft-max
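A minimal sketch of the greedy and ε-greedy choices just listed (the Q table and action list are assumed inputs; softmax is developed on the next slides):

```python
import random

def greedy(Q, state, actions):
    """Always exploit: pick the action with the highest current estimate."""
    return max(actions, key=lambda a: Q[(state, a)])

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Mostly greedy, but explore a random action with small probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)
    return greedy(Q, state, actions)
```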
Softmax Function
The soft maximum approximates the hard maximum and is a convex function, just like the hard maximum. But the soft maximum is smooth: it has no sudden changes in direction and can be differentiated as many times as you like. These properties make it easy for convex optimization algorithms to work with the soft maximum. In fact, the function may have been invented for optimization.
Softmax Function
• Accuracy of the soft maximum approximation depends on scale
• Multiplying x and y by a large constant brings the soft maximum
closer to the hard maximum
• For example, g(1, 2) = 2.31, but g(10, 20) = 20.00004
• “hardness” of the soft maximum can be controlled by
generalizing the soft maximum to depend on a parameter k.
g(x, y; k) = log( exp(kx) + exp(ky) ) / k
• Soft maximum can be made as close to the hard maximum as
desired by making k large enough
• For every value of k the soft maximum is differentiable, but the size of the derivatives increases as k increases.
• In the limit the derivative becomes infinite as the soft maximum
converges to the hard maximum
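The quoted values can be reproduced directly (a small check, not part of the original slides):

```python
import math

def soft_max(x, y, k=1.0):
    """Smooth approximation to max(x, y); larger k makes it 'harder'."""
    return math.log(math.exp(k * x) + math.exp(k * y)) / k

print(soft_max(1, 2))         # ~2.3133, i.e. g(1, 2) = 2.31 as quoted
print(soft_max(10, 20))       # ~20.0000454, i.e. g(10, 20) = 20.00004 as quoted
print(soft_max(1, 2, k=10))   # ~2.0000045: larger k moves the soft maximum towards the hard maximum
```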
Action Selection Techniques
Soft-max
P(Q_{s,t}(a)) = exp(Q_{s,t}(a)/τ) / {∑_b exp(Q_{s,t}(b)/τ)}
where τ is the temperature (as in simulated annealing)
When τ is large, all actions have similar probabilities
When τ is small, differences in the estimated values matter more and selection becomes nearly greedy
Current best (greedy) action will be chosen most
of the time, but others will be chosen
proportional to their estimated reward, which is
updated every time they are used
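A sketch of this rule (τ and the Q table are assumed inputs):

```python
import math
import random

def softmax_action(Q, state, actions, tau=1.0):
    """Choose an action with probability proportional to exp(Q(s,a)/tau)."""
    prefs = [math.exp(Q[(state, a)] / tau) for a in actions]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(actions, weights=probs, k=1)[0]
```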
Reinforcement Learning
In interactive problems it is often impractical
to obtain examples of desired behavior that
are both correct and representative of all the
situations in which the agent has to act
In uncharted territory--where one would
expect learning to be most beneficial--an
agent must be able to learn from its own
experience
Reinforcement Learning
One of the challenges that arise in
reinforcement learning and not in other kinds
of learning is the trade-off between
exploration and exploitation
The agent has to exploit what it already
knows in order to obtain reward, but it also
has to explore in order to make better action
selections in the future
Policy
Action selection is a trade-off between exploitation &
exploration so as to maximize expected reward into the
future
Can we not make an explicit decision that we are going
to take the optimal choice at each stage and not do
exploration any more?
The choice of which action to take in each state in order
to get optimal results is called a policy, π
The hope is that we can learn a better policy that is specific to the current state s_t
Crux of the learning part of RL: learn a policy π mapping states to actions
Policy
A policy defines the learning agent's way of behaving at a
given time
A mapping from perceived states of the environment to
actions to be taken when in those states
In psychology, it is called a set of stimulus-response rules
or associations.
A simple function or lookup table, or may involve
extensive computation such as a search process
The policy is the core of a reinforcement learning agent
in the sense that it alone is sufficient to determine
behavior.
Markov Decision Process (MDP)
An RL problem that has the Markov property is known as an MDP
We can make decisions (predictions) about the likely
behavior of the learner, and its expected rewards, using
just the current state data
First order Markov!
Examples
A master chess player makes a move. The choice
is informed both by planning--anticipating
possible replies and counter replies--and by
immediate, intuitive judgments of the
desirability of particular positions and moves.
An adaptive controller adjusts parameters of a
petroleum refinery's operation in real time. The
controller optimizes the yield/cost/quality trade-
off on the basis of specified marginal costs
without sticking strictly to the set points
originally suggested by engineers
Examples
A gazelle calf struggles to its feet minutes after
being born. Half an hour later it is running at 20
miles per hour.
A mobile robot decides whether it should enter a
new room in search of more trash to collect or
start trying to find its way back to its battery
recharging station. It makes its decision based
on how quickly and easily it has been able to find
the recharger in the past
Reinforcement Learning
4 main sub-elements of a reinforcement learning
system:
a policy
a reward function
a value function, and, optionally,
a model of the environment
Reinforcement Learning:
Reward Function
A reward function defines the goal
it maps each perceived state (or state-action pair) of
the environment to a single number, a reward,
indicating the intrinsic desirability of that state
The RL agent's sole objective is to maximize the total reward it receives in the long run
The reward function defines what are the good and
bad events for the agent
In a biological system, it would not be inappropriate
to identify rewards with pleasure and pain.
Reinforcement Learning:
Reward Function
Unalterable by the agent, but it may serve as a basis for altering the policy
For example, if an action selected by the policy is
followed by low reward, then the policy may be
changed to select some other action in that
situation in the future
Reinforcement Learning:
Value Function
The reward function indicates what is good in an immediate sense, whereas a value function specifies what is good in the long run
The value of a state is the total amount of reward an
agent can expect to accumulate over the future, starting
from that state
For example, a state might always yield a low immediate
reward but still have a high value because it is regularly
followed by other states that yield high rewards.
Or the reverse could be true
Reinforcement Learning:
Value Function
To make a human analogy, rewards are like pleasure (if
high) and pain (if low), whereas values correspond to a
more refined and farsighted judgment of how pleased or
displeased we are that our environment is in a particular
state
Rewards are in a sense primary, whereas values, as
predictions of rewards, are secondary.
Without rewards there could be no values, and the only
purpose of estimating values is to achieve more reward.
Reinforcement Learning:
Value Function
A reinforcement learner is trying to decide on what
action to take in order to maximize the expected reward
into the future
2 ways to compute a value:
At a current state, average across all actions that can be taken
State-value function, V(s)
Consider the current state and each possible action that can be
taken separately
Action-value function, Q(s,a)
In either case, we are thinking about what the expected
reward would be if we started in state s
Reinforcement Learning:
Value Function
A word about Evolutionary
Methods
If the space of policies is sufficiently small, or can be
structured so that good policies are common or easy to
find, then evolutionary methods can be effective
In addition, evolutionary methods have advantages on
problems in which the learning agent cannot accurately
sense the state of its environment
Q Learning Algorithm
In Q-learning, an evaluation function over states
and actions is learned
It is simple and easy to implement
It keeps information about all state-action pairs
Memory intensive
Let's try to understand the Q-learning algorithm using the popular Dots and Boxes game
Reinforcement Learning
Reinforcement learning is a subset of the machine learning
techniques. These techniques learn through trial and error
in search of an optimal solution.
Q-learning
Q-learning is a simple reinforcement learning algorithm.
Agent
An agent is a player in the game implementation, either a
computer or an interface towards humans.
State
The state of the environment; the current game configuration.
Action
An action is a transition from one state to another, which is
what a player does or can do in a specific state.
Reinforcement Learning
Feedback
An action generates feedback from the environment. The feedback
can be either positive or negative. The feedback is used in the Q-
learning algorithm for estimating how good an action is. The term
reward is used for positive feedback and negative feedback is called
punishment.
The learning rate
A parameter in the Q-learning algorithm, describing the relationship
between old and new memories. A high value means that new
memories take precedence over old memories and vice versa. A
value of zero means that nothing will be learnt.
Discount factor
The discount factor is a parameter in the Q-learning algorithm,
affecting the importance of future feedback. A high discount factor
increases the importance of future feedback.
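Putting the learning rate α and discount factor γ together gives the standard Q-learning update; the sketch below shows the general rule, not the thesis authors' exact implementation:

```python
def q_update(Q, s, a, feedback, s_next, next_actions, alpha=0.1, gamma=0.9):
    """Standard Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (feedback + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in next_actions) if next_actions else 0.0
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (feedback + gamma * best_next - old)
```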
Q Learning Algorithm
Reward System
Choose the action which gets the maximum expected
reward
Reward function takes the current state and the chosen
action to produce a numerical reward
State A – do nothing: Reward of -5
Move!!
Sub-goals to be avoided
Choosing a suitable reward function is critical
Q Learning Algorithm
Reward System
Receive a reward of +50 when you find the centre of the
maze
Receive a reward of -1 for each move and a reward of +50
when you find the centre of the maze
Which one would you prefer?
Maze problem is EPISODIC!!
Learning is split into episodes.
Definite endpoint (robot reaching the centre of the maze)
Reward given at the end and propagated back through all the
actions that were performed to update the learner
Q Learning Algorithm
Reward System
Continual Tasks: no cut-off point at which the task ends
For example: Child learning to walk
A child can walk successfully when it does not fall over at all, not
when it does not fall over for 10 mins.
Reward needs to be broken into 2 parts:
Immediate part
Pay-off at the end
The factor driving the learning is the total reward, which is the expected reward from now until the task is completed, i.e., when the learner reaches the absorbing state (the backpacker reaching square F)
F has a huge pay-off
This does not work for continual tasks!!
Summary
Reinforcement learning is suitable for learning in
uncertain environments where rewards may be
delayed and subject to chance
The goal of a reinforcement learning program is to maximize the eventual reward
Q-learning is a form of reinforcement learning that
doesn’t require that the learner has prior knowledge
of how its actions affect the environment