Reinforcement Learning
Model-Free Prediction
Natnael Argaw
Reinforcement learning
Lecture by Natnael Argaw (PhD); original slides copyright Hado van Hasselt
Background
Sutton & Barto 2018, Chapters 5 + 6 + 7 + 9 + 12
Don’t worry about reading all of this at once!
Most important chapters, for now: 5 + 6
You can also defer some reading, e.g., until the reading week
2
Recap
► Reinforcement learning is the science of learning to make decisions
► Agents can learn a policy, value function and/or a model
► The general problem involves taking into account time and consequences
► Decisions affect the reward, the agent state, and environment state
3
Lecture overview
► Last lectures (3+4):
► Planning by dynamic programming to solve a known MDP
► This and next lectures (5→8):
► Model-free prediction to estimate values in an unknown MDP
► Model-free control to optimise values in an unknown MDP
► Function approximation and (some) deep reinforcement learning (but more to follow later)
► Off-policy learning
4
Model-Free Prediction:
Monte Carlo Algorithms
5
Monte Carlo Algorithms
► We can use experience samples to learn without a model
► We call direct sampling of episodes Monte Carlo
► MC is model-free: no knowledge of MDP required, only samples
6
Monte Carlo: Bandits
► Simple example, multi-armed bandit:
► For each action, average reward samples
► Note: we changed notation Rt → Rt+1 for the reward after At
► In MDPs, the reward is said to arrive on the time step after the action
7
Monte Carlo: Bandits with States
► Consider bandits with different states
► episodes are still one step
► actions do not affect state transitions
► =⇒ no long-term consequences
► Then, we want to estimate
q(s, a) = E [Rt+1 |St = s, At = a]
► These are called contextual bandits
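As an illustration (not from the original slides), here is a minimal sketch of Monte Carlo estimation in a contextual bandit: average the sampled rewards per state-action pair to estimate q(s, a). All names are illustrative.

```python
from collections import defaultdict

class ContextualBanditMC:
    """Estimate q(s, a) = E[R_{t+1} | S_t = s, A_t = a] by averaging sampled rewards."""

    def __init__(self):
        self.counts = defaultdict(int)    # number of samples per (state, action)
        self.values = defaultdict(float)  # running average reward per (state, action)

    def update(self, state, action, reward):
        # Incremental sample mean: q_n = q_{n-1} + (1/n) * (r - q_{n-1})
        key = (state, action)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]

    def q(self, state, action):
        return self.values[(state, action)]
```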
8
Introduction to Function Approximation
9
Value Function Approximation
► So far we mostly considered lookup tables
► Every state s has an entry v(s)
► Or every state-action pair s, a has an entry q(s, a)
► Problem with large MDPs:
► There are too many states and/or actions to store in memory
► It is too slow to learn the value of each state individually
► Individual states are often not fully observable
10
Value Function Approximation
Solution for large MDPs:
► Estimate value function with function approximation
vw(s) ≈ vπ(s) (or v∗(s))
qw(s, a) ≈ qπ(s, a) (or q∗(s, a))
► Update parameters w (e.g., using MC or TD learning; this also works with non-linear functions such as neural networks)
► Generalise to unseen states
11
Agent state update
Solution for large MDPs, if the environment state is not fully observable
► Use the agent state:
St = uω (St−1, At−1, Ot )
with parameters ω (typically ω ∈ Rn)
► Henceforth, St denotes the agent state
► Think of this as either a vector inside the agent,
or, in the simplest case, just the current observation: St = Ot
► For now we are not going to talk about how to learn the agent state update
► Feel free to consider St as an observation
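As a purely illustrative sketch (the form of uω is not specified in the slides), the agent-state update could look as follows; with W=None it reduces to the simplest case St = Ot.

```python
import numpy as np

def agent_state_update(prev_state, prev_action, observation, W=None):
    """Hypothetical agent-state update S_t = u_w(S_{t-1}, A_{t-1}, O_t).

    With W=None this is the simplest case, S_t = O_t (the current observation);
    otherwise a linear-recurrent sketch with parameter matrix W (the omega of the slides).
    """
    if W is None:
        return np.asarray(observation, dtype=float)
    inputs = np.concatenate([np.asarray(prev_state, dtype=float),
                             [float(prev_action)],
                             np.asarray(observation, dtype=float)])
    # W must have shape (state_dim, len(inputs))
    return np.tanh(W @ inputs)
```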
12
Linear Function Approximation
13
Feature Vectors
► A useful special case: linear functions
► Represent state by a feature vector
► x : S → Rm is a fixed mapping from agent state (e.g., observation) to features
► Shorthand: xt = x(St )
► For example:
► Distance of robot from landmarks
► Trends in the stock market
► Piece and pawn configurations in chess
14
Linear Value Function Approximation
► Approximate the value function by a linear combination of features: vw(s) = wTx(s)
► The objective function (the 'loss') is then quadratic in w
L(w) = ES∼d [(vπ(S) − wTx(S))2]
► Stochastic gradient descent converges to the global optimum
► Update rule is simple
∇w vw(St ) = x(St ) = xt =⇒ ∆w = α(vπ(St ) − vw(St ))xt
Update = step-size × prediction error × feature vector
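A minimal sketch of this update for linear value prediction (function and variable names are illustrative; in practice the target is a sampled estimate, since vπ(St) is unknown):

```python
import numpy as np

def linear_value_update(w, features, target, alpha=0.1):
    """One SGD step for linear value prediction: w += alpha * (target - v_w(S_t)) * x(S_t)."""
    value = w @ features                      # v_w(S_t) = w^T x(S_t)
    return w + alpha * (target - value) * features
```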
15
Table Lookup Features
► Table lookup is a special case of linear value function approximation
► Let the n states be given by S = {s1, . . . , sn }.
► Use a one-hot feature vector: x(s) = (1(s = s1), . . . , 1(s = sn))T, so each weight wi stores the value of state si
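A small sketch of table lookup as linear function approximation with one-hot features (names are illustrative):

```python
import numpy as np

def one_hot(state_index, n_states):
    """One-hot feature vector x(s): all zeros except a 1 at the entry for state s."""
    x = np.zeros(n_states)
    x[state_index] = 1.0
    return x

# With one-hot features, v_w(s) = w^T x(s) simply reads out the table entry w[s]:
w = np.arange(5, dtype=float)       # pretend these are 5 stored state values
assert w @ one_hot(2, 5) == w[2]
```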
16
Model-Free Prediction:
Monte Carlo Algorithms
(Continuing from before...)
17
Monte Carlo: Bandits with States
► q could be a parametric function, e.g., a neural network, and we could use the squared loss L(w) = E[(Rt+1 − qw(St, At ))2]
► We can sample this loss to get a stochastic gradient descent (SGD) update
► The tabular case is a special case (only updates the value in cell [St, At ])
► Also works for large (continuous) state spaces S — this is just regression
18
Monte Carlo: Bandits with States
► When using linear functions, q(s, a) = wTx(s, a) and
∇w qw(St, At ) = x(St, At )
► Then the SGD update is
wt+1 = wt + α(Rt+1 − qwt (St, At ))x(St, At )
► Linear update = step-size × prediction error × feature vector
► Non-linear update = step-size × prediction error × gradient
19
Monte-Carlo Policy Evaluation
► Now we consider sequential decision problems
► Goal: learn vπ from episodes of experience under policy π
S1, A1, R2, ..., ST ∼ π
► The return is the total discounted reward (for an episode ending at time T > t):
Gt = Rt+1 + γRt+2 + ... + γ^(T−t−1) RT
► The value function is the expected return:
vπ(s) = E [Gt | St = s, π]
► We can just use sample average return instead of expected return
► We call this Monte Carlo policy evaluation
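A minimal every-visit Monte Carlo prediction sketch (the episode format is an assumption, not from the slides): compute returns backwards through each episode and average them per state.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0):
    """Every-visit MC: estimate v_pi(s) as the average of sampled returns from s.

    `episodes` is a list of trajectories, each a list of (S_t, R_{t+1}) pairs.
    """
    counts = defaultdict(int)
    values = defaultdict(float)
    for episode in episodes:
        g = 0.0
        for state, reward in reversed(episode):   # G_t = R_{t+1} + gamma * G_{t+1}
            g = reward + gamma * g
            counts[state] += 1
            values[state] += (g - values[state]) / counts[state]  # incremental mean
    return dict(values)
```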
20
Disadvantages of Monte-Carlo Learning
► We have seen MC algorithms can be used to learn value predictions
► But when episodes are long, learning can be slow
► ...we have to wait until an episode ends before we can learn
► ...return can have high variance
► Are there alternatives?
21
Temporal-Difference Learning
22
Temporal Difference Learning by Sampling Bellman Equations
► Previous lecture: Bellman equations,
vπ(s) = E [Rt+1 + γvπ(St+1) | St = s, At ∼ π(St )]
► Previous lecture: Approximate by iterating,
vk+1(s) = E [Rt+1 + γvk (St+1) | St = s, At ∼ π(St )]
► We can sample this!
vt+1(St ) = Rt+1 + γvt (St+1)
► This sample is likely quite noisy, so it is better to take a small step towards it (with step-size parameter α):
vt+1(St ) = vt (St ) + α (Rt+1 + γvt (St+1) − vt (St ))
23
Temporal difference learning
► Prediction setting: learn vπ online from experience under policy π
► Monte-Carlo
► Update value vn(St ) towards sampled return Gt
vn+1(St ) = vn(St ) + α (Gt − vn(St ))
► Temporal-difference learning:
► Update value vt (St ) towards estimated return Rt+1 + γvt (St+1):
vt+1(St ) = vt (St ) + α (Rt+1 + γvt (St+1) − vt (St ))
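A tabular TD(0) sketch of this update (the data structures are illustrative):

```python
def td0_update(v, state, reward, next_state, alpha=0.1, gamma=1.0, terminal=False):
    """Tabular TD(0): v(S_t) += alpha * (R_{t+1} + gamma * v(S_{t+1}) - v(S_t)).

    `v` is a dict from states to value estimates; terminal states bootstrap with 0.
    """
    bootstrap = 0.0 if terminal else v.get(next_state, 0.0)
    td_error = reward + gamma * bootstrap - v.get(state, 0.0)
    v[state] = v.get(state, 0.0) + alpha * td_error
    return v
```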
24
Dynamic Programming Backup
v(St ) ← E [Rt+1 + γv(St+1) | At ∼ π(St )]
[Backup diagram: expected one-step lookahead from st over rt+1 and all possible successor states st+1, terminal states marked T]
25
Monte-Carlo Backup
v(St ) ← v(St ) + α (Gt − v(St ))
[Backup diagram: one complete sampled trajectory from st to a terminal state T]
26
Temporal-Difference Backup
v(St ) ← v(St ) + α (Rt+1 + γv(St+1) − v(St ))
[Backup diagram: a single sampled transition from st via rt+1 to st+1]
27
Bootstrapping and Sampling
► Bootstrapping: update involves an estimate
► MC does not bootstrap
► DP bootstraps
► TD bootstraps
► Sampling: update samples an expectation
► MC samples
► DP does not sample
► TD samples
28
Temporal difference learning
► We can apply the same idea to action values
► Temporal-difference learning for action values:
► Update value qt (St, At ) towards estimated return Rt+1 + γqt (St+1, At+1):
qt+1(St, At ) = qt (St, At ) + α (Rt+1 + γqt (St+1, At+1) − qt (St, At ))
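A tabular sketch of the same update for action values (this form is often called SARSA-style prediction; names are illustrative):

```python
def td0_action_value_update(q, s, a, r, s_next, a_next,
                            alpha=0.1, gamma=1.0, terminal=False):
    """q(S_t, A_t) += alpha * (R_{t+1} + gamma * q(S_{t+1}, A_{t+1}) - q(S_t, A_t))."""
    bootstrap = 0.0 if terminal else q.get((s_next, a_next), 0.0)
    td_error = r + gamma * bootstrap - q.get((s, a), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * td_error
    return q
```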
29
Temporal-Difference Learning
► TD is model-free (no knowledge of the MDP required) and learns directly from experience
► TD can learn from incomplete episodes, by bootstrapping
► TD can learn during each episode
30
Comparing MC and TD
31
Advantages and Disadvantages of MC vs. TD
► TD can learn before knowing the final outcome
► TD can learn online after every step
► MC must wait until end of episode before return is known
► TD can learn without the final outcome
► TD can learn from incomplete sequences
► MC can only learn from complete sequences
► TD works in continuing (non-terminating) environments
► MC only works for episodic (terminating) environments
► TD is independent of the temporal span of the prediction
► TD can learn from single transitions
► MC must store all predictions (or states) to update at the end of an episode
► TD needs reasonable value estimates
32
Bias/Variance Trade-Off
► MC return Gt = Rt+1 + γRt+2 + . . . is an unbiased estimate of vπ(St )
► TD target Rt+1 + γvt (St+1) is a biased estimate of vπ(St ) (unless vt (St+1) = vπ(St+1))
► But the TD target has lower variance:
► Return depends on many random actions, transitions, rewards
► TD target depends on one random action, transition, reward
33
Bias/Variance Trade-Off
► In some cases, TD can have irreducible bias
► due to non-linearity of approximation functions, wrong initial estimates, ...
► The world may be partially observable
► MC would implicitly account for all the latent variables
► The function to approximate the values may fit poorly
► In the tabular case, both MC and TD will converge: vt → vπ
34
Batch MC and TD
35
Batch MC and TD
► Tabular MC and TD converge: vt → vπ as experience → ∞ and αt → 0
► But what about finite experience?
► Consider a fixed batch of experience:
► Repeatedly sample each episode k ∈ [1, K] and apply MC or TD(0)
► =⇒ this is equivalent to sampling from an empirical model
Batch methods are often used in scenarios where the data collection process
is separated from the learning process.
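A rough sketch of batch TD(0) on a fixed dataset (for simplicity it replays individual transitions rather than whole episodes; the data format is an assumption):

```python
import random

def batch_td0(transitions, n_sweeps=10000, alpha=0.01, gamma=1.0):
    """Repeatedly replay a fixed batch of (state, reward, next_state, terminal) transitions."""
    v = {}
    for _ in range(n_sweeps):
        s, r, s_next, terminal = random.choice(transitions)
        bootstrap = 0.0 if terminal else v.get(s_next, 0.0)
        v[s] = v.get(s, 0.0) + alpha * (r + gamma * bootstrap - v.get(s, 0.0))
    return v
```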
36
Differences in batch solutions
► MC converges to the best mean-squared fit of the observed returns
○ This is advantageous when the primary objective is to accurately estimate the value function from the available data
► TD converges to the solution of the maximum-likelihood Markov model, given the data
○ This is beneficial when the goal is to learn an efficient policy by exploiting the Markov structure of the environment
37
Advantages and Disadvantages of MC vs. TD
► TD exploits Markov property
► Can help in fully-observable environments
► MC does not exploit Markov property
► Can help in partially-observable environments
► With finite data, or with function approximation, the solutions may differ
38
Between MC and TD:
Multi-Step TD
39
Unified View of Reinforcement Learning
40
Multi-Step Updates
► TD uses value estimates which might be inaccurate
► In addition, information can propagate back quite slowly
► In MC information propagates faster, but the updates are noisier
► We can go in between TD and MC
41
Multi-Step Prediction
► Let TD target look n steps into the future
42
Multi-Step Returns
► Consider the following n-step returns for n = 1, 2, ..., ∞:
n = 1: Gt^(1) = Rt+1 + γv(St+1)   (this is TD)
n = 2: Gt^(2) = Rt+1 + γRt+2 + γ^2 v(St+2)
n = ∞: Gt^(∞) = Rt+1 + γRt+2 + ... + γ^(T−t−1) RT   (this is MC)
► In general: Gt^(n) = Rt+1 + γRt+2 + ... + γ^(n−1) Rt+n + γ^n v(St+n)
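A sketch of computing the n-step return for one time step (the episode layout, rewards[k] = R_{k+1} and values[k] = v(S_k), is an assumption, not from the slides):

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + gamma R_{t+2} + ... + gamma^(n-1) R_{t+n} + gamma^n v(S_{t+n})."""
    T = len(rewards)                      # the episode ends at time T
    g, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        g += discount * rewards[k]        # accumulate gamma^(k-t) * R_{k+1}
        discount *= gamma
    if t + n < T:                         # bootstrap only if S_{t+n} is non-terminal
        g += discount * values[t + n]
    return g
```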
43
Mixed Multi-Step Returns
44
Mixing multi-step returns
► Multi-step returns bootstrap on one state, v(St+n)
► We can also mix returns of different lengths n with different weights, e.g., the λ-return, which weights the n-step returns proportionally to λ^(n−1)
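A sketch of the λ-return computed backwards over a finished episode, using the recursion Gt^λ = Rt+1 + γ((1 − λ)v(St+1) + λ Gt+1^λ) (data layout as in the previous sketch; this is an assumption, not the slides' code):

```python
def lambda_returns(rewards, values, lam=0.9, gamma=1.0):
    """Compute G_t^lambda for all t of a finished episode (terminal value is 0)."""
    T = len(rewards)
    returns = [0.0] * T
    g = 0.0                                          # G_T^lambda at the terminal state
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        g = rewards[t] + gamma * ((1.0 - lam) * next_value + lam * g)
        returns[t] = g
    return returns
```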
45
Benefits of Multi-Step Learning
46
Benefits of multi-step returns
► Multi-step returns have benefits from both TD and MC
► Bootstrapping can have issues with bias
► Monte Carlo can have issues with variance
► Typically, intermediate values of n or λ are good (e.g., n = 10, λ = 0.9)
Often-cited benefits of multi-step returns:
● Efficient Bootstrapping
● Variance Reduction
● Accelerated Learning
● Enhanced Exploration
● Improved Sample Efficiency
● Long-Term Dependency Handling
● Balanced Bias and Variance
● Versatile Applicability
● Integration with Function Approximation
● Adaptability
47
End of Lecture
48