Reinforcement Learning
Model-Free Prediction
Natnael Argaw
Reinforcement learning
Lecture by Natnael Argaw (PhD); original slides copyright Hado van Hasselt
Background
Sutton & Barto 2018, Chapters 5 + 6 + 7 + 9 + 12
Don’t worry about reading all of this at once!
Most important chapters, for now: 5 + 6
You can also defer some reading, e.g., until the reading week
2
Recap
► Reinforcement learning is the science of learning to make decisions
► Agents can learn a policy, value function and/or a model
► The general problem involves taking into account time and consequences
► Decisions affect the reward, the agent state, and environment state
3
Lecture overview
► Last lectures (3+4):
► Planning by dynamic programming to solve a known MDP
► This and next lectures (5→8):
► Model-free prediction to estimate values in an unknown MDP
► Model-free control to optimise values in an unknown MDP
► Function approximation and (some) deep reinforcement learning (but more to follow later)
► Off-policy learning
4
Model-Free Prediction:
Monte Carlo Algorithms
5
Monte Carlo Algorithms
► We can use experience samples to learn without a model
► We call direct sampling of episodes Monte Carlo
► MC is model-free: no knowledge of MDP required, only samples
6
Monte Carlo: Bandits
► Simple example, multi-armed bandit:
► For each action, average reward samples
► Note: we changed notation Rt → Rt+1 for the reward after At
► In MDPs, the reward is said to arrive on the time step after the action
7
Monte Carlo: Bandits with States
► Consider bandits with different states
► episodes are still one step
► actions do not affect state transitions
► =⇒ no long-term consequences
► Then, we want to estimate
q(s, a) = E [Rt+1 |St = s, At = a]
► These are called contextual bandits
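As an illustration (not from the original slides), here is a minimal sketch of Monte Carlo estimation in a contextual bandit: average the sampled rewards per state-action pair to estimate q(s, a). All names are illustrative.

```python
from collections import defaultdict

class ContextualBanditMC:
    """Estimate q(s, a) = E[R_{t+1} | S_t = s, A_t = a] by averaging sampled rewards."""

    def __init__(self):
        self.counts = defaultdict(int)    # number of samples per (state, action)
        self.values = defaultdict(float)  # running average reward per (state, action)

    def update(self, state, action, reward):
        # Incremental sample mean: q_n = q_{n-1} + (1/n) * (r - q_{n-1})
        key = (state, action)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]

    def q(self, state, action):
        return self.values[(state, action)]
```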
8
Introduction to Function Approximation
9
Value Function Approximation
► So far we mostly considered lookup tables
► Every state s has an entry v(s)
► Or every state-action pair s, a has an entry q(s, a)
► Problem with large MDPs:
► There are too many states and/or actions to store in memory
► It is too slow to learn the value of each state individually
► Individual states are often not fully observable
10
Value Function Approximation
Solution for large MDPs:
► Estimate value function with function approximation
vw(s) ≈ vπ(s) (or v∗(s))
qw(s, a) ≈ qπ(s, a) (or q∗(s, a))
► Update parameters w (e.g., using MC or TD learning; this also works with non-linear functions such as neural networks)
► Generalise to unseen states
11
Agent state update
Solution for large MDPs, if the environment state is not fully observable
► Use the agent state:
St = uω (St−1, At−1, Ot )
with parameters ω (typically ω ∈ Rn)
► Henceforth, St denotes the agent state
► Think of this as either a vector inside the agent,
or, in the simplest case, just the current observation: St = Ot
► For now we are not going to talk about how to learn the agent state update
► Feel free to consider St as an observation
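As a purely illustrative sketch (the form of uω is not specified in the slides), the agent-state update could look as follows; with W=None it reduces to the simplest case St = Ot.

```python
import numpy as np

def agent_state_update(prev_state, prev_action, observation, W=None):
    """Hypothetical agent-state update S_t = u_w(S_{t-1}, A_{t-1}, O_t).

    With W=None this is the simplest case, S_t = O_t (the current observation);
    otherwise a linear-recurrent sketch with parameter matrix W (the omega of the slides).
    """
    if W is None:
        return np.asarray(observation, dtype=float)
    inputs = np.concatenate([np.asarray(prev_state, dtype=float),
                             [float(prev_action)],
                             np.asarray(observation, dtype=float)])
    # W must have shape (state_dim, len(inputs))
    return np.tanh(W @ inputs)
```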
12
Linear Function Approximation
13
Feature Vectors
► A useful special case: linear functions
► Represent state by a feature vector
► x : S → Rm is a fixed mapping from agent state (e.g., observation) to features
► Shorthand: xt = x(St )
► For example:
► Distance of robot from landmarks
► Trends in the stock market
► Piece and pawn configurations in chess
14
Linear Value Function Approximation
► Approximate the value function by a linear combination of features: vw(s) = wTx(s)
► The objective function (the 'loss') is then quadratic in w
L(w) = ES∼d [(vπ(S) − wTx(S))2]
► Stochastic gradient descent converges to the global optimum
► Update rule is simple
∇w vw(St ) = x(St ) = xt =⇒ ∆w = α(vπ(St ) − vw(St ))xt
Update = step-size × prediction error × feature vector
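A minimal sketch of this update for linear value prediction (function and variable names are illustrative; in practice the target is a sampled estimate, since vπ(St) is unknown):

```python
import numpy as np

def linear_value_update(w, features, target, alpha=0.1):
    """One SGD step for linear value prediction: w += alpha * (target - v_w(S_t)) * x(S_t)."""
    value = w @ features                      # v_w(S_t) = w^T x(S_t)
    return w + alpha * (target - value) * features
```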
15
Table Lookup Features
► Table lookup is a special case of linear value function approximation
► Let the n states be given by S = {s1, . . . , sn }.
► Use a one-hot feature vector: x(s) = (1(s = s1), . . . , 1(s = sn))T, so each weight wi stores the value of state si
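A small sketch of table lookup as linear function approximation with one-hot features (names are illustrative):

```python
import numpy as np

def one_hot(state_index, n_states):
    """One-hot feature vector x(s): all zeros except a 1 at the entry for state s."""
    x = np.zeros(n_states)
    x[state_index] = 1.0
    return x

# With one-hot features, v_w(s) = w^T x(s) simply reads out the table entry w[s]:
w = np.arange(5, dtype=float)       # pretend these are 5 stored state values
assert w @ one_hot(2, 5) == w[2]
```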
16
Model-Free Prediction:
Monte Carlo Algorithms
(Continuing from before...)
17
Monte Carlo: Bandits with States
► q could be a parametric function, e.g., a neural network, and we could use the squared loss L(w) = E[(Rt+1 − qw(St, At ))2]
► We can sample this loss to get a stochastic gradient descent (SGD) update
► The tabular case is a special case (only updates the value in cell [St, At ])
► Also works for large (continuous) state spaces S — this is just regression
18
Monte Carlo: Bandits with States
► When using linear functions, q(s, a) = wTx(s, a) and
∇w qw(St, At ) = x(St, At )
► Then the SGD update is
wt+1 = wt + α(Rt+1 − qwt (St, At ))x(St, At )
► Linear update = step-size × prediction error × feature vector
► Non-linear update = step-size × prediction error × gradient
19
Monte-Carlo Policy Evaluation
► Now we consider sequential decision problems
► Goal: learn vπ from episodes of experience under policy π
S1, A1, R2, ..., ST ∼ π
► The return is the total discounted reward (for an episode ending at time T > t):
Gt = Rt+1 + γRt+2 + ... + γ^(T−t−1) RT
► The value function is the expected return:
vπ(s) = E [Gt | St = s, π]
► We can just use sample average return instead of expected return
► We call this Monte Carlo policy evaluation
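A minimal every-visit Monte Carlo prediction sketch (the episode format is an assumption, not from the slides): compute returns backwards through each episode and average them per state.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0):
    """Every-visit MC: estimate v_pi(s) as the average of sampled returns from s.

    `episodes` is a list of trajectories, each a list of (S_t, R_{t+1}) pairs.
    """
    counts = defaultdict(int)
    values = defaultdict(float)
    for episode in episodes:
        g = 0.0
        for state, reward in reversed(episode):   # G_t = R_{t+1} + gamma * G_{t+1}
            g = reward + gamma * g
            counts[state] += 1
            values[state] += (g - values[state]) / counts[state]  # incremental mean
    return dict(values)
```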
20
Disadvantages of Monte-Carlo Learning
► We have seen MC algorithms can be used to learn value predictions
► But when episodes are long, learning can be slow
► ...we have to wait until an episode ends before we can learn
► ...return can have high variance
► Are there alternatives?
21
Temporal-Difference Learning
22
Temporal Difference Learning by Sampling Bellman Equations
► Previous lecture: Bellman equations,
vπ(s) = E [Rt+1 + γvπ(St+1) | St = s, At ∼ π(St )]
► Previous lecture: Approximate by iterating,
vk+1(s) = E [Rt+1 + γvk (St+1) | St = s, At ∼ π(St )]
► We can sample this!
vt+1(St ) = Rt+1 + γvt (St+1)
► This sample is likely quite noisy, so it is better to take a small step towards it (with step-size parameter α):
vt+1(St ) = vt (St ) + α (Rt+1 + γvt (St+1) − vt (St ))
23
Temporal difference learning
► Prediction setting: learn vπ online from experience under policy π
► Monte-Carlo
► Update value vn(St ) towards sampled return Gt
vn+1(St ) = vn(St ) + α (Gt − vn(St ))
► Temporal-difference learning:
► Update value vt (St ) towards estimated return Rt+1 + γvt (St+1):
vt+1(St ) = vt (St ) + α (Rt+1 + γvt (St+1) − vt (St ))
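A tabular TD(0) sketch of this update (the data structures are illustrative):

```python
def td0_update(v, state, reward, next_state, alpha=0.1, gamma=1.0, terminal=False):
    """Tabular TD(0): v(S_t) += alpha * (R_{t+1} + gamma * v(S_{t+1}) - v(S_t)).

    `v` is a dict from states to value estimates; terminal states bootstrap with 0.
    """
    bootstrap = 0.0 if terminal else v.get(next_state, 0.0)
    td_error = reward + gamma * bootstrap - v.get(state, 0.0)
    v[state] = v.get(state, 0.0) + alpha * td_error
    return v
```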
24
Dynamic Programming Backup
v(St ) ← E [Rt+1 + γv(St+1) | At ∼ π(St )]
[Backup diagram: expected one-step lookahead from st over rt+1 and all possible successor states st+1, terminal states marked T]
25
Monte-Carlo Backup
v(St ) ← v(St ) + α (Gt − v(St ))
[Backup diagram: one complete sampled trajectory from st to a terminal state T]
26
Temporal-Difference Backup
v(St ) ← v(St ) + α (Rt+1 + γv(St+1) − v(St ))
[Backup diagram: a single sampled transition from st via rt+1 to st+1]
27
Bootstrapping and Sampling
► Bootstrapping: update involves an estimate
► MC does not bootstrap
► DP bootstraps
► TD bootstraps
► Sampling: update samples an expectation
► MC samples
► DP does not sample
► TD samples
28
Temporal difference learning
► We can apply the same idea to action values
► Temporal-difference learning for action values:
► Update value qt (St, At ) towards estimated return Rt+1 + γqt (St+1, At+1):
qt+1(St, At ) = qt (St, At ) + α (Rt+1 + γqt (St+1, At+1) − qt (St, At ))
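A tabular sketch of the same update for action values (this form is often called SARSA-style prediction; names are illustrative):

```python
def td0_action_value_update(q, s, a, r, s_next, a_next,
                            alpha=0.1, gamma=1.0, terminal=False):
    """q(S_t, A_t) += alpha * (R_{t+1} + gamma * q(S_{t+1}, A_{t+1}) - q(S_t, A_t))."""
    bootstrap = 0.0 if terminal else q.get((s_next, a_next), 0.0)
    td_error = r + gamma * bootstrap - q.get((s, a), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * td_error
    return q
```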
29
Temporal-Difference Learning
► TD is model-free (no knowledge of the MDP required) and learns directly from experience
► TD can learn from incomplete episodes, by bootstrapping
► TD can learn during each episode
30
Comparing MC and TD
31
Advantages and Disadvantages of MC vs. TD
► TD can learn before knowing the final outcome
► TD can learn online after every step
► MC must wait until end of episode before return is known
► TD can learn without the final outcome
► TD can learn from incomplete sequences
► MC can only learn from complete sequences
► TD works in continuing (non-terminating) environments
► MC only works for episodic (terminating) environments
► TD is independent of the temporal span of the prediction
► TD can learn from single transitions
► MC must store all predictions (or states) to update at the end of an episode
► TD needs reasonable value estimates
32
Bias/Variance Trade-Off
► MC return Gt = Rt+1 + γRt+2 + . . . is an unbiased estimate of vπ(St )
► TD target Rt+1 + γvt (St+1) is a biased estimate of vπ(St ) (unless vt (St+1) = vπ(St+1))
► But the TD target has lower variance:
► Return depends on many random actions, transitions, rewards
► TD target depends on one random action, transition, reward
33
Bias/Variance Trade-Off
► In some cases, TD can have irreducible bias
► due to non-linearity of approximation functions, wrong initial estimates, ...
► The world may be partially observable
► MC would implicitly account for all the latent variables
► The function to approximate the values may fit poorly
► In the tabular case, both MC and TD will converge: vt → vπ
34
Batch MC and TD
35
Batch MC and TD
► Tabular MC and TD converge: vt → vπ as experience → ∞ and αt → 0
► But what about finite experience?
► Consider a fixed batch of experience:
► Repeatedly sample each episode k ∈ [1, K] and apply MC or TD(0)
► =⇒ this is equivalent to sampling from an empirical model
Batch methods are often used in scenarios where the data collection process
is separated from the learning process.
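A rough sketch of batch TD(0) on a fixed dataset (for simplicity it replays individual transitions rather than whole episodes; the data format is an assumption):

```python
import random

def batch_td0(transitions, n_sweeps=10000, alpha=0.01, gamma=1.0):
    """Repeatedly replay a fixed batch of (state, reward, next_state, terminal) transitions."""
    v = {}
    for _ in range(n_sweeps):
        s, r, s_next, terminal = random.choice(transitions)
        bootstrap = 0.0 if terminal else v.get(s_next, 0.0)
        v[s] = v.get(s, 0.0) + alpha * (r + gamma * bootstrap - v.get(s, 0.0))
    return v
```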
36
Differences in batch solutions
► MC converges to the best mean-squared fit of the observed returns
○ This is advantageous when the primary objective is to accurately estimate the value function from the available data
► TD converges to the solution of the maximum-likelihood Markov model, given the data
○ This is beneficial when the goal is to learn an efficient policy by exploiting the Markov structure of the environment
37
Advantages and Disadvantages of MC vs. TD
► TD exploits Markov property
► Can help in fully-observable environments
► MC does not exploit Markov property
► Can help in partially-observable environments
► With finite data, or with function approximation, the solutions may differ
38
Between MC and TD:
Multi-Step TD
39
Unified View of Reinforcement Learning
40
Multi-Step Updates
► TD uses value estimates which might be inaccurate
► In addition, information can propagate back quite slowly
► In MC information propagates faster, but the updates are noisier
► We can go in between TD and MC
41
Multi-Step Prediction
► Let TD target look n steps into the future
42
Multi-Step Returns
► Consider the following n-step returns for n = 1, 2, ..., ∞:
n = 1: Gt^(1) = Rt+1 + γv(St+1)   (this is TD)
n = 2: Gt^(2) = Rt+1 + γRt+2 + γ^2 v(St+2)
n = ∞: Gt^(∞) = Rt+1 + γRt+2 + ... + γ^(T−t−1) RT   (this is MC)
► In general: Gt^(n) = Rt+1 + γRt+2 + ... + γ^(n−1) Rt+n + γ^n v(St+n)
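A sketch of computing the n-step return for one time step (the episode layout, rewards[k] = R_{k+1} and values[k] = v(S_k), is an assumption, not from the slides):

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + gamma R_{t+2} + ... + gamma^(n-1) R_{t+n} + gamma^n v(S_{t+n})."""
    T = len(rewards)                      # the episode ends at time T
    g, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        g += discount * rewards[k]        # accumulate gamma^(k-t) * R_{k+1}
        discount *= gamma
    if t + n < T:                         # bootstrap only if S_{t+n} is non-terminal
        g += discount * values[t + n]
    return g
```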
43
Mixed Multi-Step Returns
44
Mixing multi-step returns
► Multi-step returns bootstrap on one state, v(St+n)
► We can also mix returns of different lengths n with different weights, e.g., the λ-return, which weights the n-step returns proportionally to λ^(n−1)
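A sketch of the λ-return computed backwards over a finished episode, using the recursion Gt^λ = Rt+1 + γ((1 − λ)v(St+1) + λ Gt+1^λ) (data layout as in the previous sketch; this is an assumption, not the slides' code):

```python
def lambda_returns(rewards, values, lam=0.9, gamma=1.0):
    """Compute G_t^lambda for all t of a finished episode (terminal value is 0)."""
    T = len(rewards)
    returns = [0.0] * T
    g = 0.0                                          # G_T^lambda at the terminal state
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        g = rewards[t] + gamma * ((1.0 - lam) * next_value + lam * g)
        returns[t] = g
    return returns
```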
45
Benefits of Multi-Step Learning
46
Benefits of multi-step returns
► Multi-step returns have benefits from both TD and MC
► Bootstrapping can have issues with bias
► Monte Carlo can have issues with variance
► Typically, intermediate values of n or λ are good (e.g., n = 10, λ = 0.9)
Often-cited benefits of multi-step returns:
● Efficient Bootstrapping
● Variance Reduction
● Accelerated Learning
● Enhanced Exploration
● Improved Sample Efficiency
● Long-Term Dependency Handling
● Balanced Bias and Variance
● Versatile Applicability
● Integration with Function Approximation
● Adaptability
47
End of Lecture
48