Computational Control
Reinforcement learning
Saverio Bolognani
Automatic Control Laboratory (IfA)
ETH Zurich
Can we find the optimal policy without model information?
Two different settings:
Based on collected data (typically in the form of repeated episodes)
(x0, u0, r0), (x1, u1, r1), …, (xT, uT, rT)
▶ repetition of the control task in a numerical simulator
▶ repetition of the control task in a controlled environment
Online during the control task, with no prior training
→ adaptive control
- extremely challenging dual control task (learn AND control)
- few guarantees
- an open problem for decades!
Policy iteration and value iteration methods require either
a full (and sufficiently rich) trajectory or
a model.
Is it possible to learn the system as data is collected along a single trajectory?
Remember the difficult step in policy iteration: for a given policy π, estimate
Q^π(x, u) = R_x^u + γ Σ_{x′} P^u_{xx′} V^π(x′)
where
V^π(x) = E[ Σ_{k=0}^{∞} γ^k rk | x0 = x ]
In Monte Carlo learning we did it by simply evaluating it based on the episode data:
Q^π(xk, uk) ≈ gk = Σ_{i=k}^{T} γ^{i−k} ri
Monte Carlo learning
[Diagram: a single episode with states x1, x2, x3, …, inputs u1, u2, u3, … and rewards r1, …, r9; each visited pair (xk, uk) receives its return gk.]
g1 = r1 + γ r2 + γ^2 r3 + …
g2 = r2 + γ r3 + γ^2 r4 + …
g3 = r3 + γ r4 + γ^2 r5 + …
Q(x1, u1) ≈ g1,   Q(x2, u2) ≈ g2,   Q(x3, u3) ≈ g3,   …
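As a concrete (non-slide) illustration, here is a minimal Python sketch of this Monte Carlo estimate: it walks recorded episodes backwards, accumulates the discounted returns gk, and averages them per state-action pair. The episode format is an assumption made for the example.

```python
import numpy as np
from collections import defaultdict

def monte_carlo_q(episodes, gamma=0.95):
    """Estimate Q^pi(x, u) by averaging the observed returns g_k.

    `episodes` is a list of trajectories, each a list of (x_k, u_k, r_k) tuples
    (states and inputs assumed hashable, e.g. integers of a finite MDP).
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards: g_k = r_k + gamma * g_{k+1}
        for (x, u, r) in reversed(episode):
            g = r + gamma * g
            returns[(x, u)].append(g)
    # Average the returns collected for every visited state-action pair
    return {xu: np.mean(gs) for xu, gs in returns.items()}
```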
We haven't used the Bellman equation for Q
Q(x, u) = R_x^u + γ Σ_{x′} Σ_{u′} P^u_{xx′} π(x′, u′) Q(x′, u′)
(although of course the estimate of Q will satisfy that, with infinite data)
Temporal Difference error
Given an individual data point
xk , uk , xk+1 , uk+1 , rk
define the TD error as
ek = rk + γQ(xk+1 , uk+1 ) − Q(xk , uk )
According to the Bellman equation, we have
E[ek ] = 0
Stochastic approximation
Suppose that we have a random variable
e(q)
and we want to find q that solves E[e(q)] = 0.
Stochastic approximation
The iteration
qk+1 ← qk − αk e(qk )
converges to the solution q∗ of E[e(q)] = 0 if
e(q) is bounded
E[e(q)] is non-decreasing in q (and increasing at q∗ )
the sequence αk satisfies
Σ_{k=0}^{∞} αk = ∞,   Σ_{k=0}^{∞} αk^2 < ∞.
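A hypothetical toy example (not from the slides): estimating the unknown mean of noisy measurements by solving E[e(q)] = 0 with e(q) = q − w, using the step size αk = 1/(k+1), which satisfies both conditions above.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 3.0     # unknown mean of the noisy measurements w_k
q = 0.0      # initial guess

for k in range(10_000):
    w = mu + rng.normal()     # noisy sample with E[w] = mu
    e = q - w                 # realization of e(q); its expectation is q - mu
    alpha = 1.0 / (k + 1)     # sum alpha_k = inf, sum alpha_k^2 < inf
    q = q - alpha * e         # stochastic approximation step

print(q)  # approaches the solution q* = mu of E[e(q)] = 0
```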
TD-learning (SARSA)
Key idea: Use the empirical evaluations of the TD error
ek = rk + γQ(xk+1 , uk+1 ) − Q(xk , uk )
to update the Q function via the stochastic approximation iteration
Q(xk, uk) ← Q(xk, uk) + αk ( rk + γQ(xk+1, uk+1) − Q(xk, uk) )
Usually with a constant αk ⇒ non-zero steady-state variance
[Diagram: along a trajectory with states x1, x2, x3, …, inputs u1, u2, u3 and rewards r1, …, r9, each transition (xk, uk, rk, xk+1, uk+1) updates the single entry Q(xk, uk).]
TD-learning (SARSA)
Moreover, instead of alternating between
complete policy evaluation (full episode)
policy improvement
we can interleave the two operations.
TD-learning (policy iteration)
At every time k
1 Select uk via an ϵ-greedy policy based on the latest estimate of Q(xk , uk )
π(x, u) ←  argmax_u Q(x, u)   with probability 1 − ϵ
           Uniform(U)         with probability ϵ
2 Perform an iterative update of the Q function
Q(xk, uk) ← Q(xk, uk) + α ( rk + γQ(xk+1, uk+1) − Q(xk, uk) )
Meanwhile, allow ϵ → 0 as k → ∞.
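A minimal tabular sketch of this interleaved scheme (my own illustration, not from the slides), assuming a finite MDP behind a hypothetical gym-like `env.reset()` / `env.step(u)` interface and treating rk as a reward to be maximized, matching the argmax above.

```python
import numpy as np

def sarsa(env, n_states, n_actions, gamma=0.95, alpha=0.1,
          eps=1.0, eps_decay=0.9995, steps=100_000, seed=0):
    """Tabular SARSA: eps-greedy action selection interleaved with TD updates."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(x, eps):
        if rng.random() < eps:
            return int(rng.integers(n_actions))   # explore: Uniform(U)
        return int(np.argmax(Q[x]))               # exploit: greedy w.r.t. current Q

    x = env.reset()
    u = eps_greedy(x, eps)
    for _ in range(steps):
        x_next, r, done = env.step(u)             # hypothetical interface
        u_next = eps_greedy(x_next, eps)
        # TD update: Q(xk,uk) += alpha * (rk + gamma Q(xk+1,uk+1) - Q(xk,uk))
        Q[x, u] += alpha * (r + gamma * Q[x_next, u_next] * (not done) - Q[x, u])
        if done:
            x = env.reset()
            u = eps_greedy(x, eps)
        else:
            x, u = x_next, u_next
        eps *= eps_decay                          # let eps -> 0 as k -> infinity
    return Q
```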
TD-learning (value iteration)
Instead of using the temporal difference error to perform policy evaluation (via
stochastic approximation), we can directly use it to do a stochastic
approximation of the Bellman optimality principle.
Bellman optimality principle (in Q)
Q*(x, u) = R_x^u + γ Σ_{x′} P^u_{xx′} min_{u′} Q*(x′, u′)
Individual realization
Q*(xk, uk) = rk + γ min_{u′} Q*(xk+1, u′)
Notice that the expectation of the individual realization is equal to the Bellman
optimality principle.
(Can we do the same with the Bellman optimality principle on the value function?)
Q-learning
The stochastic approximation update for the Bellman optimality principle is
Q(xk, uk) ← Q(xk, uk) + αk ( rk + γ min_u Q(xk+1, u) − Q(xk, uk) )
Important things to notice:
The next input uk+1 is irrelevant (off-policy algorithm): it does not need to be the
optimal one.
Convergence to the optimal Q is guaranteed under
▶ non-summability/square-summability assumption on αk
▶ full state-action space exploration
The policy is typically updated as ϵ-greedy
Same scalability issue, and same solution: function approximations
▶ linear approximants
▶ neural networks
Forward dynamic programming!
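For concreteness, a minimal tabular Q-learning sketch (my own illustration, not from the slides), written with the cost convention (min) of this slide and the same hypothetical environment interface as in the SARSA sketch above.

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.95, alpha=0.1,
               eps=0.1, steps=100_000, seed=0):
    """Tabular Q-learning: stochastic approximation of the Bellman optimality principle.

    Off-policy: the behavior is eps-greedy, but the update uses min_u Q(x_{k+1}, u)
    regardless of which input is applied next.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    x = env.reset()
    for _ in range(steps):
        # eps-greedy behavior (any policy with full state-action exploration would do)
        u = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmin(Q[x]))
        x_next, r, done = env.step(u)      # hypothetical interface, r = stage cost
        target = r + gamma * np.min(Q[x_next]) * (not done)
        Q[x, u] += alpha * (target - Q[x, u])
        x = env.reset() if done else x_next
    return Q
```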
Q-learning with linear approximant
Consider again the linear parametrization
Qθ(x, u) = Σ_{ℓ=1}^{d} ϕℓ(x, u) θℓ = ϕ^⊤(x, u) θ
How do we update the parameter θ in order to satisfy the Bellman optimality
principle
Q*(x, u) = R_x^u + γ Σ_{x′} P^u_{xx′} min_{u′} Q*(x′, u′)
that is
ϕ^⊤(x, u) θ* = R_x^u + γ Σ_{x′} P^u_{xx′} min_{u′} ϕ^⊤(x′, u′) θ*
iteratively as samples become available?
ϕ^⊤(x, u) θ* = R_x^u + γ Σ_{x′} P^u_{xx′} min_{u′} ϕ^⊤(x′, u′) θ*
(the right-hand side is denoted Q+)
First, notice that this equation (most likely) does not have a solution.
Loss function
min_θ L(θ),   where L(θ) = ½ ( ϕ^⊤(x, u) θ − Q+ )^2
A simple iteration that converges to this minimum is gradient descent:
θk+1 = θk − α ∇L(θk)
     = θk − α ( ϕ^⊤(x, u) θk − Q+ ) ϕ(x, u)
Q-learning as stochastic gradient descent
θk+1 = θk − α ( ϕ^⊤(x, u) θk − Q+ ) ϕ(x, u)
Similarly to stochastic approximation, we can
replace the real gradient with samples whose expectation is the gradient
use a non-summable/square-summable step size
Replace
Q+ = R_x^u + γ Σ_{x′} P^u_{xx′} min_{u′} ϕ^⊤(x′, u′) θ*
with
rk + γ min_u ϕ^⊤(xk+1, u) θk
and obtain the iterative update
θk+1 = θk − αk ( ϕ^⊤(xk, uk) θk − ( rk + γ min_u ϕ^⊤(xk+1, u) θk ) ) ϕ(xk, uk)
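A compact sketch of this update for a linear-in-features Q (my own illustration; the feature map `phi`, the finite set `u_candidates` over which the min is taken, and the data format are assumptions).

```python
import numpy as np

def linear_q_step(theta, phi, x, u, r, x_next, u_candidates, gamma=0.95, alpha=0.01):
    """One stochastic-gradient Q-learning step for Q_theta(x, u) = phi(x, u)^T theta."""
    # TD target r_k + gamma * min_u phi(x_{k+1}, u)^T theta_k
    # (theta inside the target is treated as a constant, as in the derivation above)
    q_next = min(phi(x_next, u_c).dot(theta) for u_c in u_candidates)
    target = r + gamma * q_next
    td_error = phi(x, u).dot(theta) - target
    return theta - alpha * td_error * phi(x, u)
```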
General approximators
Much more complex approximators can be used (neural networks) as long as:
they provide a parametrized approximation Qθ (x, u)
they allow Qθ(x, u) to be minimized over u
they allow a stochastic gradient step in θ that minimizes a loss function
(measuring the quality of the approximation)
www.incontrolpodcast.com
Listen to Anu telling us about the challenge of adaptation
Really model free?
Monte Carlo and Temporal Difference methods seem to do without a model of
the system.
Is that really true? What model information are we assuming?
It is possible to learn the optimal policy π(x, u) directly, without also learning the
value function or the Q function (which implies knowledge of a Markovian state x)
Policy gradient methods
Policy gradient
Consider a trajectory τ of the system
τ = (x0 , u0 , x1 , u1 , . . . , xT , uT )
Assume for simplicity that x0 is fixed over multiple episodes/experiments.
For a fixed x0 , let P(τ ) be the probability of trajectory τ happening.
Let the associated cost be
R(τ) = Σ_{t=0}^{T} γ^t rt
Parametrized policy
Let πθ (x, u) be a stochastic policy parametrized in θ
based on a linear combination of basis functions ϕ⊤ θ
based on some parametric form (Gaussian)
neural network with weights θ
Goal: minimize the expected cost
J(θ) := E_{τ∼πθ}[ R(τ) ] = Σ_τ Pθ(τ) R(τ)
where Pθ (τ ) is the probability that trajectory τ happens when the policy πθ is used.
Two sources of complexity
1 It’s an expectation, therefore a sum over all possible trajectories
2 Pθ (τ ) is unknown: it depends on both
▶ the transition probabilities of the system
▶ the policy πθ
Trajectory probability
Pθ(τ) = Π_{t=0}^{T} P(xt+1 | xt, ut) πθ(ut, xt)
where P(xt+1 | xt, ut) are the transition probabilities and πθ(ut, xt) is the policy.
As we are trying to minimize J(θ), let’s try to derive the gradient ∇J(θ).
Proposition
∇J(θ) = E_{τ∼πθ}[ ∇ log Pθ(τ) R(τ) ]
where Pθ (τ ) is the probability of τ when πθ is used.
Good news! We knew that J(θ) is an expectation, but ∇J(θ) is an expectation too.
Do you remember stochastic gradient descent?
Proof:
∇J(θ) = ∇ E_{τ∼πθ}[ R(τ) ] = ∇ Σ_τ Pθ(τ) R(τ)
      = Σ_τ ∇Pθ(τ) R(τ)
      = Σ_τ ( Pθ(τ) / Pθ(τ) ) ∇Pθ(τ) R(τ) = Σ_τ Pθ(τ) ∇ log Pθ(τ) R(τ)
      = E_{τ∼πθ}[ ∇ log Pθ(τ) R(τ) ]
∇J(θ) = E_{τ∼πθ}[ ∇ log Pθ(τ) R(τ) ]
What is ∇ log Pθ (τ )?
Proposition
∇ log Pθ(τ) = Σ_{t=0}^{T} ∇ log πθ(ut, xt)
Amazing result!
no more dependence on the transition probabilities
depending on the parametrization of πθ , ∇ log πθ is known
we need a trajectory τ to compute it
Proof:
∇ log Pθ(τ) = ∇ log Π_{t=0}^{T} P(xt+1 | xt, ut) πθ(ut, xt)
            = ∇ ( Σ_{t=0}^{T} log P(xt+1 | xt, ut) + Σ_{t=0}^{T} log πθ(ut, xt) )
            = Σ_{t=0}^{T} ∇ log πθ(ut, xt)
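As an illustration (not part of the slides), ∇ log πθ has a simple closed form for common parametrizations. The sketch below assumes a softmax policy over linear features, πθ(u | x) ∝ exp(ϕ(x, u)^⊤ θ), over a finite input set.

```python
import numpy as np

def softmax_policy(theta, phi, x, actions):
    """pi_theta(u | x) proportional to exp(phi(x, u)^T theta) over a finite input set."""
    logits = np.array([phi(x, u).dot(theta) for u in actions])
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, phi, x, u_idx, actions):
    """Score function: grad_theta log pi_theta(u | x) = phi(x, u) - E_pi[phi(x, .)]."""
    p = softmax_policy(theta, phi, x, actions)
    feats = np.array([phi(x, u) for u in actions])   # shape (|U|, d)
    return feats[u_idx] - p @ feats
```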
Putting things together
Instead of computing
∇J(θ) = E_{τ∼πθ}[ ∇ log Pθ(τ) R(τ) ]
we use a sample
∇ log Pθ (τ̂ ) R(τ̂ )
where τ̂ comes from the distribution τ ∼ πθ .
How do we sample this distribution?
Policy gradient algorithm
Iteratively, repeat
1 Generate τ via πθ
2 Compute R(τ )
3 Update θ ← θ − α R(τ) Σ_{t=0}^{T} ∇ log πθ(ut, xt)
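A minimal sketch of this loop (often called REINFORCE), my own illustration under the following assumptions: a Gaussian policy u ∼ N(θ^⊤ϕ(x), σ²) with a state feature map ϕ(x), a hypothetical episodic `env` interface as before, and rt interpreted as a stage cost so that the update is a descent step.

```python
import numpy as np

def policy_gradient(env, phi, d, gamma=0.99, alpha=1e-3, sigma=0.5,
                    episodes=2000, seed=0):
    """Policy gradient with single-trajectory estimates of grad J(theta).

    Gaussian policy: grad_theta log pi_theta(u | x) = (u - theta^T phi(x)) phi(x) / sigma^2.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    for _ in range(episodes):
        # 1) Generate a trajectory tau with the current policy pi_theta
        traj, x, done = [], env.reset(), False
        while not done:
            u = theta @ phi(x) + sigma * rng.normal()
            x_next, r, done = env.step(u)          # hypothetical interface, r = stage cost
            traj.append((x, u, r))
            x = x_next
        # 2) Compute R(tau) = sum_t gamma^t r_t
        R = sum(gamma**t * r for t, (_, _, r) in enumerate(traj))
        # 3) theta <- theta - alpha * R(tau) * sum_t grad log pi_theta(u_t, x_t)
        score = sum((u - theta @ phi(x)) * phi(x) / sigma**2 for x, u, _ in traj)
        theta -= alpha * R * score
    return theta
```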
Why not learn all the time?
Consider a simple system
xt+1 = [ 1  1 ] xt + [ 0 ] ut
       [ 0  1 ]      [ 1 ]
and an LQR-type cost with
Q = [ 1  0 ],   R = r
    [ 0  0 ]
Consider the alternatives
Q-learning (what is the Q function in an LQR problem?)
▶ prior information: Markovian state
▶ good parametrization of the Q function
policy gradient
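For reference, a sketch (my own, not from the slides) of this system in NumPy, together with a quadratic feature vector that would be a good parametrization for the linear Q-learning above: in an LQR problem the optimal Q function is a quadratic form in (x, u), so degree-two monomials form a natural basis. The value of r is an arbitrary assumption.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([0.0, 1.0])
Qc = np.array([[1.0, 0.0],      # state weight (the Q matrix of the LQR cost)
               [0.0, 0.0]])
r = 0.1                          # input weight R = r (value assumed for the example)

def step(x, u):
    """One step of the dynamics x_{t+1} = A x_t + B u_t (u scalar)."""
    return A @ x + B * u

def stage_cost(x, u):
    return float(x @ Qc @ x + r * u**2)

def phi(x, u):
    """Quadratic features: the LQR Q function is a quadratic form in (x, u)."""
    z = np.array([x[0], x[1], u])
    return np.outer(z, z)[np.triu_indices(3)]   # the 6 monomials z_i z_j, i <= j
```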
[Figure: cost versus number of samples (0 to 30000) for policy gradient and random search, compared with the optimal cost; costs range roughly from 5 to 10.]
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License
https://bsaver.io/COCO