Lecture 4 - Model-Free Prediction

Reinforcement Learning

Model-Free Prediction

Natnael Argaw

Reinforcement learning

Lecture by Natnael Argaw (PhD). Slides © Hado van Hasselt.


Background

Sutton & Barto 2018, Chapters 5 + 6 + 7 + 9 + 12

Don’t worry about reading all of this at once!


Most important chapters, for now: 5 + 6
You can also defer some reading, e.g., until the reading week

2
Recap

► Reinforcement learning is the science of learning to make decisions


► Agents can learn a policy, value function and/or a model
► The general problem involves taking into account time and consequences
► Decisions affect the reward, the agent state, and environment state

3
Lecture overview

► Last lectures (3+4):


► Planning by dynamic programming to solve a known MDP
► This and next lectures (5→8):
► Model-free prediction to estimate values in an unknown MDP
► Model-free control to optimise values in an unknown MDP
► Function approximation and (some) deep reinforcement learning (but more to follow later)
► Off-policy learning

4
Model-Free Prediction:
Monte Carlo Algorithms

5
Monte Carlo Algorithms

► We can use experience samples to learn without a model


► We call direct sampling of episodes Monte Carlo
► MC is model-free: no knowledge of MDP required, only samples

6
Monte Carlo: Bandits

► Simple example, multi-armed bandit:


► For each action, average reward samples

► Note: we changed notation Rt → Rt+1 for the reward after At
► In MDPs, the reward is said to arrive on the time step after the action
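
► A minimal Python sketch of the sample-average estimate, maintained incrementally (the action indices and reward model below are illustrative, not from the slides):

    import random

    num_actions = 3
    counts = [0] * num_actions      # N(a): number of times each action was tried
    values = [0.0] * num_actions    # q(a): running average of observed rewards

    def update(action, reward):
        # Incremental sample average: q(a) <- q(a) + (1 / N(a)) * (reward - q(a))
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]

    # Illustration: action 1 yields noisy rewards around 1.0
    for _ in range(100):
        update(1, 1.0 + random.gauss(0.0, 0.1))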

7
Monte Carlo: Bandits with States

► Consider bandits with different states


► episodes are still one step
► actions do not affect state transitions

=⇒ no long-term consequences

► Then, we want to estimate

q(s, a) = E [Rt+1 |St = s, At = a]

► These are called contextual bandits

8
Introduction to Function
Approximation

9
Value Function Approximation

► So far we mostly considered lookup tables


► Every state s has an entry v(s)
► Or every state-action pair s, a has an entry q(s, a)
► Problem with large MDPs:
► There are too many states and/or actions to store in memory
► It is too slow to learn the value of each state individually
► Individual states are often not fully observable

10
Value Function Approximation

Solution for large MDPs:


► Estimate value function with function approximation

vw(s) ≈ vπ(s) (or v∗(s))


qw(s, a) ≈ qπ(s, a) (or q∗(s, a))

► Update parameters w (e.g., using MC or TD learning; the approximator can be non-linear, such as a neural network)
► Generalise to unseen states

11
Agent state
update

Solution for large MDPs, if the environment state is not fully observable:
► Use the agent state:
St = uω (St−1, At−1, Ot )
with parameters ω (typically ω ∈ Rn)
► Henceforth, St denotes the agent state
► Think of this as either a vector inside the agent,
or, in the simplest case, just the current observation: St = Ot

► For now we are not going to talk about how to learn the agent state update
► Feel free to consider St as an observation
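
► A minimal sketch of one possible agent-state update uω; the recurrent form, parameter name W, and shapes below are assumptions for illustration, not the lecture's definition:

    import numpy as np

    def agent_state_update(prev_state, prev_action, observation, W):
        # One possible form of u_omega: a recurrent linear map squashed by tanh.
        # W plays the role of the parameters omega; in the simplest case one
        # could instead just return the observation (S_t = O_t).
        inputs = np.concatenate([prev_state, [float(prev_action)], observation])
        return np.tanh(W @ inputs)

    # Illustrative shapes only
    state_dim, obs_dim = 4, 3
    W = np.zeros((state_dim, state_dim + 1 + obs_dim))
    s = np.zeros(state_dim)
    s = agent_state_update(s, prev_action=1, observation=np.ones(obs_dim), W=W)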

12
Linear Function Approximation

13
Feature Vectors

► A useful special case: linear functions


► Represent state by a feature vector

► x : S → Rm is a fixed mapping from agent state (e.g., observation) to features


► Shorthand: xt = x(St )
► For example:
► Distance of robot from landmarks
► Trends in the stock market
► Piece and pawn configurations in chess

14
Linear Value Function Approximation

► Approximate the value function by a linear combination of features: vw(s) = wᵀx(s)

► The objective function ('loss') is quadratic in w:

L(w) = ES∼d [(vπ(S) − wᵀx(S))²]

► Stochastic gradient descent converges to the global optimum


► Update rule is simple

∇w vw(St ) = x(St ) = xt =⇒ ∆w = α(vπ(St ) − vw(St ))xt

Update = step-size × prediction error × feature vector
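
► A minimal sketch of this update, assuming some target estimate of vπ(St) is available (e.g., a sampled return); the feature values are illustrative:

    import numpy as np

    def linear_value_update(w, x_t, target, alpha):
        # v_w(S_t) = w^T x(S_t), so the gradient w.r.t. w is just x(S_t)
        prediction = w @ x_t
        # Delta w = alpha * (target - prediction) * x_t
        return w + alpha * (target - prediction) * x_t

    w = np.zeros(4)
    w = linear_value_update(w, x_t=np.array([1.0, 0.0, 0.5, 0.0]), target=2.0, alpha=0.1)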

15
Table Lookup Features

► Table lookup is a special case of linear value function approximation


► Let the n states be given by S = {s1, . . . , sn}.
► Use one-hot features: x(s) = (1(s = s1), . . . , 1(s = sn))ᵀ, where 1(·) is the indicator function
► Then vw(s) = wᵀx(s) picks out one component of w per state, exactly like a lookup table

16
Model-Free Prediction:
Monte Carlo Algorithms
(Continuing from before...)

17
Monte Carlo: Bandits with States

► q could be a parametric function, e.g., a neural network, and we could use the loss

L(w) = E[(Rt+1 − qw(St, At))²]

► We can sample this to get a stochastic gradient descent (SGD) update


► The tabular case is a special case (only updates the value in cell [St, At ])
► Also works for large (continuous) state spaces S — this is just regression

18
Monte Carlo: Bandits with States

► When using linear functions, q(s, a) = wᵀx(s, a) and

∇wt qwt (St, At) = x(St, At)

► Then the SGD update is

wt+1 = wt + α(Rt+1 − qwt (St, At))x(St, At) .

► Linear update = step-size × prediction error × feature vector


► Non-linear update = step-size × prediction error × gradient
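
► A minimal sketch of the linear update above, using illustrative one-hot state-action features (the feature construction is an assumption for the example, not the lecture's choice):

    import numpy as np

    def one_hot_sa(state, action, n_states, n_actions):
        # Simple state-action features: one-hot over (state, action) pairs
        x = np.zeros(n_states * n_actions)
        x[state * n_actions + action] = 1.0
        return x

    def linear_q_update(w, x_sa, reward, alpha):
        # Linear update = step-size x prediction error x feature vector
        return w + alpha * (reward - w @ x_sa) * x_sa

    w = np.zeros(2 * 3)
    w = linear_q_update(w, one_hot_sa(state=1, action=2, n_states=2, n_actions=3), reward=1.0, alpha=0.5)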

19
Monte-Carlo Policy Evaluation

► Now we consider sequential decision problems


► Goal: learn vπ from episodes of experience under policy π

S1, A1, R2, ..., Sk ∼ π

► The return is the total discounted reward (for an episode ending at time T > t):

Gt = Rt+1 + γRt+2 + ... + γ^(T−t−1) RT

► The value function is the expected return:

vπ(s) = E [Gt | St = s, π]

► We can just use sample average return instead of expected return


► We call this Monte Carlo policy evaluation
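
► A minimal every-visit Monte Carlo sketch (the episode format and state names are illustrative):

    from collections import defaultdict

    def mc_policy_evaluation(episodes, gamma):
        # episodes: list of trajectories, each a list of (state, action, reward) steps,
        # where reward is the reward received after taking that action
        values = defaultdict(float)
        counts = defaultdict(int)
        for episode in episodes:
            g = 0.0
            # Work backwards so each return G_t can be accumulated incrementally
            for state, _, reward in reversed(episode):
                g = reward + gamma * g
                counts[state] += 1
                values[state] += (g - values[state]) / counts[state]  # sample average
        return dict(values)

    # Illustration: one short episode
    print(mc_policy_evaluation([[("s0", "a", 0.0), ("s1", "a", 1.0)]], gamma=0.9))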

20
Disadvantages of Monte-Carlo Learning

► We have seen MC algorithms can be used to learn value predictions


► But when episodes are long, learning can be slow
► ...we have to wait until an episode ends before we can learn
► ...return can have high variance
► Are there alternatives?

21
Temporal-Difference Learning

22
Temporal Difference Learning by Sampling Bellman Equations
► Previous lecture: Bellman equations,

vπ(s) = E [Rt+1 + γvπ(St+1) | St = s, At ∼ π(St )]

► Previous lecture: Approximate by iterating,

vk+1(s) = E [Rt+1 + γvk (St+1) | St = s, At ∼ π(St )]

► We can sample this!


vt+1(St ) = Rt+1 + γvt (St+1)
► This is likely quite noisy; better to take a small step (with parameter α):

vt+1(St) = vt(St) + α(Rt+1 + γvt(St+1) − vt(St))

23
Temporal difference learning

► Prediction setting: learn vπ online from experience under policy π


► Monte-Carlo
► Update value vn(St ) towards sampled return Gt

vn+1(St ) = vn(St ) + α (Gt − vn(St ))

► Temporal-difference learning:
► Update value vt (St ) towards estimated return Rt+1 + γv(St+1)
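
► A minimal tabular sketch of this TD update (variable names and the dictionary representation are illustrative):

    def td0_update(values, state, reward, next_state, alpha, gamma):
        # TD target: R_{t+1} + gamma * v(S_{t+1}); TD error: target - v(S_t)
        td_error = reward + gamma * values[next_state] - values[state]
        values[state] += alpha * td_error
        return values

    v = {"s0": 0.0, "s1": 0.0}
    v = td0_update(v, state="s0", reward=1.0, next_state="s1", alpha=0.1, gamma=0.9)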

24
Dynamic Programming Backup

v(St ) ← E [Rt+1 + γv(St+1) | At ∼ π(St )]

[Backup diagram: from st, take the expectation over all actions and all successor states st+1 with reward rt+1]
25
Monte-Carlo Backup

v(St ) ← v(St ) + α (Gt − v(St ))

[Backup diagram: follow one sampled trajectory from st all the way to the terminal state]
26
Temporal-Difference Backup

v(St ) ← v(St ) + α (Rt+1 + γv(St+1) − v(St ))

[Backup diagram: one sampled transition from st to st+1 with reward rt+1, then bootstrap on v(st+1)]
27
Bootstrapping and Sampling

► Bootstrapping: update involves an estimate


► MC does not bootstrap
► DP bootstraps
► TD bootstraps
► Sampling: update samples an expectation
► MC samples
► DP does not sample
► TD samples

28
Temporal difference learning

► We can apply the same idea to action values


► Temporal-difference learning for action values:
► Update value qt (St, At ) towards estimated return Rt+1 + γq(St+1, At+1)
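
► A minimal tabular sketch of this action-value TD update (when used for control this update is known as SARSA; the dictionary representation is illustrative):

    from collections import defaultdict

    def action_value_td_update(q, s, a, r, s_next, a_next, alpha, gamma):
        # Update q(S_t, A_t) towards R_{t+1} + gamma * q(S_{t+1}, A_{t+1})
        target = r + gamma * q[(s_next, a_next)]
        q[(s, a)] += alpha * (target - q[(s, a)])
        return q

    q = defaultdict(float)
    q = action_value_td_update(q, s="s0", a=0, r=1.0, s_next="s1", a_next=1, alpha=0.1, gamma=0.9)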

29
Temporal-Difference Learning

► TD is model-free (no knowledge of MDP) and learns directly from experience


► TD can learn from incomplete episodes, by bootstrapping
► TD can learn during each episode

30
Comparing MC and TD

31
Advantages and Disadvantages of MC vs. TD

► TD can learn before knowing the final outcome


► TD can learn online after every step
► MC must wait until end of episode before return is known
► TD can learn without the final outcome
► TD can learn from incomplete sequences
► MC can only learn from complete sequences
► TD works in continuing (non-terminating) environments
► MC only works for episodic (terminating) environments
► TD is independent of the temporal span of the prediction
► TD can learn from single transitions
► MC must store all predictions (or states) to update at the end of an episode
► TD needs reasonable value estimates
32
Bias/Variance Trade-Off

► MC return Gt = Rt+1 + γRt+2 + . . . is an unbiased estimate of vπ(St )


► TD target Rt+1 + γvt (St+1) is a biased estimate of vπ(St ) (unless vt (St+1) = vπ(St+1))
► But the TD target has lower variance:
► Return depends on many random actions, transitions, rewards
► TD target depends on one random action, transition, reward

33
Bias/Variance Trade-Off

► In some cases, TD can have irreducible bias


► due to nonlinearity of approximation functions,
wrong initial estimates,...
► The world may be partially observable
► MC would implicitly account for all the latent variables
► The function to approximate the values may fit poorly
► In the tabular case, both MC and TD will converge: vt → vπ

34
Batch MC and TD

35
Batch MC and TD

► Tabular MC and TD converge: vt → vπ as experience → ∞ and αt → 0


► But what about finite experience?
► Consider a fixed batch of experience:

► Repeatedly sample each episode k ∈ [1, K] and apply MC or TD(0)
► This is equivalent to sampling from an empirical model

Batch methods are often used in scenarios where the data collection process
is separated from the learning process.
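
► A minimal sketch of batch TD(0) that repeatedly replays a fixed set of episodes (the transition format and termination convention are illustrative assumptions):

    def batch_td0(episodes, alpha, gamma, sweeps=100):
        # episodes: list of episodes, each a list of (state, reward, next_state)
        # transitions, with next_state = None at termination
        values = {}
        for episode in episodes:
            for s, _, s_next in episode:
                values.setdefault(s, 0.0)
                if s_next is not None:
                    values.setdefault(s_next, 0.0)
        # Replay the same fixed batch many times
        for _ in range(sweeps):
            for episode in episodes:
                for s, r, s_next in episode:
                    target = r + (gamma * values[s_next] if s_next is not None else 0.0)
                    values[s] += alpha * (target - values[s])
        return values

    batch = [[("A", 0.0, "B"), ("B", 1.0, None)], [("B", 1.0, None)]]
    print(batch_td0(batch, alpha=0.1, gamma=1.0))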
36
Differences in batch solutions

► MC converges to best mean-squared fit for the observed returns


○ This property is advantageous when the primary objective is to
accurately estimate the value function based on the available data.

► TD converges to solution of max likelihood Markov model, given the data


► This property is beneficial when the goal is to learn an efficient
policy by exploiting the Markov structure of the environment.

37
Advantages and Disadvantages of MC vs. TD

► TD exploits Markov property


► Can help in fully-observable environments
► MC does not exploit Markov property
► Can help in partially-observable environments
► With finite data, or with function approximation, the solutions may differ

38
Between MC and TD:
Multi-Step TD

39
Unified View of Reinforcement Learning

40
Multi-Step Updates

► TD uses value estimates which might be inaccurate


► In addition, information can propagate back quite slowly
► In MC information propagates faster, but the updates are noisier
► We can go in between TD and MC

41
Multi-Step Prediction

► Let TD target look n steps into the future

42
Multi-Step Returns
► Consider the following n-step returns for n = 1, 2, ∞:
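
► Following Sutton & Barto, the n-step return is Gt(n) = Rt+1 + γRt+2 + ... + γ^(n−1)Rt+n + γ^n v(St+n); n = 1 gives the TD target and n → ∞ gives the Monte Carlo return. A minimal sketch (the indexing conventions below are illustrative):

    def n_step_return(rewards, values, t, n, gamma):
        # rewards[k] holds R_{k+1}; values[k] holds the current estimate v(S_k)
        # G_t^(n) = R_{t+1} + gamma R_{t+2} + ... + gamma^(n-1) R_{t+n} + gamma^n v(S_{t+n})
        g = 0.0
        for k in range(n):
            g += (gamma ** k) * rewards[t + k]
        g += (gamma ** n) * values[t + n]   # bootstrap on the state n steps ahead
        return g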

43
Mixed Multi-Step Returns

44
Mixing multi-step returns
► Multi-step returns bootstrap on one state, v(St+n)
► We can instead mix returns that bootstrap on multiple states, possibly with different weights and different step sizes.
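
► One standard way to mix multi-step returns (Sutton & Barto) is the λ-return, Gt(λ) = (1 − λ) Σn≥1 λ^(n−1) Gt(n), which weights every n-step return geometrically. A minimal sketch, reusing the n_step_return helper above and truncating at the episode end T:

    def lambda_return(rewards, values, t, gamma, lam, T):
        # G_t^lambda = (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) G_t^(n) + lam^(T-t-1) * G_t
        g_lambda = 0.0
        for n in range(1, T - t):
            g_lambda += (1.0 - lam) * (lam ** (n - 1)) * n_step_return(rewards, values, t, n, gamma)
        # The remaining weight goes to the full (Monte Carlo) return G_t
        full_return = sum((gamma ** k) * rewards[t + k] for k in range(T - t))
        return g_lambda + (lam ** (T - t - 1)) * full_return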

45
Benefits of Multi-Step Learning

46
Benefits of multi-step returns

► Multi-step returns have benefits from both TD and MC
► Bootstrapping can have issues with bias
► Monte Carlo can have issues with variance
► Typically, intermediate values of n or λ are good (e.g., n = 10, λ = 0.9)

● Efficient Bootstrapping
● Variance Reduction
● Accelerated Learning
● Enhanced Exploration
● Improved Sample Efficiency
● Long-Term Dependency Handling
● Balanced Bias and Variance
● Versatile Applicability
● Integration with Function Approximation
● Adaptability

47
End of Lecture

48
