Computational Control
Reinforcement learning
Saverio Bolognani
Automatic Control Laboratory (IfA)
ETH Zurich
Can we find the optimal policy without model information?
Two different settings:
Based on collected data (typically in the form of repeated episodes)
(x0, u0, r0), (x1, u1, r1), …, (xT, uT, rT)
▶ repetition of the control task in a numerical simulator
▶ repetition of the control task in a controlled environment
Online during the control task, with no prior training
→ adaptive control
- extremely challenging dual control task (learn AND control)
- few guarantees
- an open problem for decades!
Policy iteration and value iteration methods require either
a full (and sufficiently rich) trajectory or
a model.
Is it possible to learn the system as data is collected along a single trajectory?
Remember the difficult step in policy iteration: for a given policy π, estimate
Q^π(x, u) = R_x^u + γ Σ_{x′} P^u_{xx′} V^π(x′)
where
V^π(x) = E[ Σ_{k=0}^{∞} γ^k rk | x0 = x ]
In Monte Carlo learning we did it by simply evaluating it based on the episode data:
Q^π(xk, uk) ≈ gk = Σ_{i=k}^{T} γ^{i−k} ri
Monte Carlo learning
[Diagram: a single episode with states x1, x2, x3, …, inputs u1, u2, u3, … and rewards r1, …, r9; each visited pair (xk, uk) receives its return gk.]
g1 = r1 + γ r2 + γ^2 r3 + …
g2 = r2 + γ r3 + γ^2 r4 + …
g3 = r3 + γ r4 + γ^2 r5 + …
Q(x1, u1) ≈ g1,   Q(x2, u2) ≈ g2,   Q(x3, u3) ≈ g3,   …
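As a concrete (non-slide) illustration, here is a minimal Python sketch of this Monte Carlo estimate: it walks recorded episodes backwards, accumulates the discounted returns gk, and averages them per state-action pair. The episode format is an assumption made for the example.

```python
import numpy as np
from collections import defaultdict

def monte_carlo_q(episodes, gamma=0.95):
    """Estimate Q^pi(x, u) by averaging the observed returns g_k.

    `episodes` is a list of trajectories, each a list of (x_k, u_k, r_k) tuples
    (states and inputs assumed hashable, e.g. integers of a finite MDP).
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards: g_k = r_k + gamma * g_{k+1}
        for (x, u, r) in reversed(episode):
            g = r + gamma * g
            returns[(x, u)].append(g)
    # Average the returns collected for every visited state-action pair
    return {xu: np.mean(gs) for xu, gs in returns.items()}
```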
We haven't used the Bellman equation for Q
Q(x, u) = R_x^u + γ Σ_{x′} Σ_{u′} P^u_{xx′} π(x′, u′) Q(x′, u′)
(although of course the estimate of Q will satisfy that, with infinite data)
Temporal Difference error
Given an individual data point
xk , uk , xk+1 , uk+1 , rk
define the TD error as
ek = rk + γQ(xk+1 , uk+1 ) − Q(xk , uk )
According to the Bellman equation, we have
E[ek ] = 0
Stochastic approximation
Suppose that we have a random variable
e(q)
and we want to find q that solves E[e(q)] = 0.
Stochastic approximation
The iteration
qk+1 ← qk − αk e(qk )
converges to the solution q∗ of E[e(q)] = 0 if
e(q) is bounded
E[e(q)] is non-decreasing in q (and increasing at q∗ )
the sequence αk satisfies
Σ_{k=0}^{∞} αk = ∞,   Σ_{k=0}^{∞} αk^2 < ∞.
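A hypothetical toy example (not from the slides): estimating the unknown mean of noisy measurements by solving E[e(q)] = 0 with e(q) = q − w, using the step size αk = 1/(k+1), which satisfies both conditions above.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 3.0     # unknown mean of the noisy measurements w_k
q = 0.0      # initial guess

for k in range(10_000):
    w = mu + rng.normal()     # noisy sample with E[w] = mu
    e = q - w                 # realization of e(q); its expectation is q - mu
    alpha = 1.0 / (k + 1)     # sum alpha_k = inf, sum alpha_k^2 < inf
    q = q - alpha * e         # stochastic approximation step

print(q)  # approaches the solution q* = mu of E[e(q)] = 0
```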
TD-learning (SARSA)
Key idea: Use the empirical evaluations of the TD error
ek = rk + γQ(xk+1 , uk+1 ) − Q(xk , uk )
to update the Q function via the stochastic approximation iteration
Q(xk, uk) ← Q(xk, uk) + αk ( rk + γQ(xk+1, uk+1) − Q(xk, uk) )
Usually with a constant αk ⇒ non-zero steady-state variance
[Diagram: along a trajectory with states x1, x2, x3, …, inputs u1, u2, u3 and rewards r1, …, r9, each transition (xk, uk, rk, xk+1, uk+1) updates the single entry Q(xk, uk).]
TD-learning (SARSA)
Moreover, instead of alternating between
complete policy evaluation (full episode)
policy improvement
we can interleave the two operations.
TD-learning (policy iteration)
At every time k
1 Select uk via an ϵ-greedy policy based on the latest estimate of Q(xk , uk )
π(x, u) ←  argmax_u Q(x, u)   with probability 1 − ϵ
           Uniform(U)         with probability ϵ
2 Perform an iterative update of the Q function
Q(xk, uk) ← Q(xk, uk) + α ( rk + γQ(xk+1, uk+1) − Q(xk, uk) )
Meanwhile, allow ϵ → 0 as k → ∞.
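A minimal tabular sketch of this interleaved scheme (my own illustration, not from the slides), assuming a finite MDP behind a hypothetical gym-like `env.reset()` / `env.step(u)` interface and treating rk as a reward to be maximized, matching the argmax above.

```python
import numpy as np

def sarsa(env, n_states, n_actions, gamma=0.95, alpha=0.1,
          eps=1.0, eps_decay=0.9995, steps=100_000, seed=0):
    """Tabular SARSA: eps-greedy action selection interleaved with TD updates."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(x, eps):
        if rng.random() < eps:
            return int(rng.integers(n_actions))   # explore: Uniform(U)
        return int(np.argmax(Q[x]))               # exploit: greedy w.r.t. current Q

    x = env.reset()
    u = eps_greedy(x, eps)
    for _ in range(steps):
        x_next, r, done = env.step(u)             # hypothetical interface
        u_next = eps_greedy(x_next, eps)
        # TD update: Q(xk,uk) += alpha * (rk + gamma Q(xk+1,uk+1) - Q(xk,uk))
        Q[x, u] += alpha * (r + gamma * Q[x_next, u_next] * (not done) - Q[x, u])
        if done:
            x = env.reset()
            u = eps_greedy(x, eps)
        else:
            x, u = x_next, u_next
        eps *= eps_decay                          # let eps -> 0 as k -> infinity
    return Q
```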
TD-learning (value iteration)
Instead of using the temporal difference error to perform policy evaluation (via
stochastic approximation), we can directly use it to do a stochastic
approximation of the Bellman optimality principle.
Bellman optimality principle (in Q)
Q*(x, u) = R_x^u + γ Σ_{x′} P^u_{xx′} min_{u′} Q*(x′, u′)
Individual realization
Q*(xk, uk) = rk + γ min_{u′} Q*(xk+1, u′)
Notice that the expectation of the individual realization is equal to the Bellman
optimality principle.
(Can we do the same with the Bellman optimality principle on the value function?)
Q-learning
The stochastic approximation update for the Bellman optimality principle is
Q(xk, uk) ← Q(xk, uk) + αk ( rk + γ min_u Q(xk+1, u) − Q(xk, uk) )
Important things to notice:
The next input uk+1 is irrelevant (off-policy algorithm): it does not need to be the
optimal one.
Convergence to the optimal Q is guaranteed under
▶ non-summability/square-summability assumption on αk
▶ full state-action space exploration
The policy is typically updated as ϵ-greedy
Same scalability issue, and same solution: function approximations
▶ linear approximants
▶ neural networks
Forward dynamic programming!
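For concreteness, a minimal tabular Q-learning sketch (my own illustration, not from the slides), written with the cost convention (min) of this slide and the same hypothetical environment interface as in the SARSA sketch above.

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.95, alpha=0.1,
               eps=0.1, steps=100_000, seed=0):
    """Tabular Q-learning: stochastic approximation of the Bellman optimality principle.

    Off-policy: the behavior is eps-greedy, but the update uses min_u Q(x_{k+1}, u)
    regardless of which input is applied next.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    x = env.reset()
    for _ in range(steps):
        # eps-greedy behavior (any policy with full state-action exploration would do)
        u = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmin(Q[x]))
        x_next, r, done = env.step(u)      # hypothetical interface, r = stage cost
        target = r + gamma * np.min(Q[x_next]) * (not done)
        Q[x, u] += alpha * (target - Q[x, u])
        x = env.reset() if done else x_next
    return Q
```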
Q-learning with linear approximant
Consider again the linear parametrization
Qθ(x, u) = Σ_{ℓ=1}^{d} ϕℓ(x, u) θℓ = ϕ^⊤(x, u) θ
How do we update the parameter θ in order to satisfy the Bellman optimality
principle
Q*(x, u) = R_x^u + γ Σ_{x′} P^u_{xx′} min_{u′} Q*(x′, u′)
that is
ϕ^⊤(x, u) θ* = R_x^u + γ Σ_{x′} P^u_{xx′} min_{u′} ϕ^⊤(x′, u′) θ*
iteratively as samples become available?
ϕ^⊤(x, u) θ* = R_x^u + γ Σ_{x′} P^u_{xx′} min_{u′} ϕ^⊤(x′, u′) θ*
(the right-hand side is denoted Q+)
First, notice that this equation (most likely) does not have a solution.
Loss function
min_θ L(θ),   where L(θ) = ½ ( ϕ^⊤(x, u) θ − Q+ )^2
A simple iteration that converges to this minimum is gradient descent:
θk+1 = θk − α ∇L(θk)
     = θk − α ( ϕ^⊤(x, u) θk − Q+ ) ϕ(x, u)
Q-learning as stochastic gradient descent
θk+1 = θk − α ( ϕ^⊤(x, u) θk − Q+ ) ϕ(x, u)
Similarly to stochastic approximation, we can
replace the real gradient with samples whose expectation is the gradient
use a non-summable/square-summable step size
Replace
Q+ = R_x^u + γ Σ_{x′} P^u_{xx′} min_{u′} ϕ^⊤(x′, u′) θ*
with
rk + γ min_u ϕ^⊤(xk+1, u) θk
and obtain the iterative update
θk+1 = θk − αk ( ϕ^⊤(xk, uk) θk − ( rk + γ min_u ϕ^⊤(xk+1, u) θk ) ) ϕ(xk, uk)
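A compact sketch of this update for a linear-in-features Q (my own illustration; the feature map `phi`, the finite set `u_candidates` over which the min is taken, and the data format are assumptions).

```python
import numpy as np

def linear_q_step(theta, phi, x, u, r, x_next, u_candidates, gamma=0.95, alpha=0.01):
    """One stochastic-gradient Q-learning step for Q_theta(x, u) = phi(x, u)^T theta."""
    # TD target r_k + gamma * min_u phi(x_{k+1}, u)^T theta_k
    # (theta inside the target is treated as a constant, as in the derivation above)
    q_next = min(phi(x_next, u_c).dot(theta) for u_c in u_candidates)
    target = r + gamma * q_next
    td_error = phi(x, u).dot(theta) - target
    return theta - alpha * td_error * phi(x, u)
```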
General approximators
Much more complex approximators can be used (neural networks) as long as:
they provide a parametrized approximation Qθ (x, u)
they allow Qθ(x, u) to be minimized over u
they allow a stochastic gradient step in θ that minimizes a loss function
(measuring the quality of the approximation)
www.incontrolpodcast.com
Listen to Anu telling us about the challenge of adaptation
Really model free?
Monte Carlo and Temporal Difference methods seem to do without a model of
the system.
Is that really true? What model information are we assuming?
It is possible to learn the optimal policy π(x, u) directly, without also learning the
value function or the Q function (which implies knowledge of a Markovian state x)
Policy gradient methods
Policy gradient
Consider a trajectory τ of the system
τ = (x0 , u0 , x1 , u1 , . . . , xT , uT )
Assume for simplicity that x0 is fixed over multiple episodes/experiments.
For a fixed x0 , let P(τ ) be the probability of trajectory τ happening.
Let the associated cost be
R(τ) = Σ_{t=0}^{T} γ^t rt
Parametrized policy
Let πθ (x, u) be a stochastic policy parametrized in θ
based on a linear combination of basis functions ϕ⊤ θ
based on some parametric form (Gaussian)
neural network with weights θ
Goal: minimize the expected cost
J(θ) := E_{τ∼πθ}[ R(τ) ] = Σ_τ Pθ(τ) R(τ)
where Pθ (τ ) is the probability that trajectory τ happens when the policy πθ is used.
Two sources of complexity
1 It’s an expectation, therefore a sum over all possible trajectories
2 Pθ (τ ) is unknown: it depends on both
▶ the transition probabilities of the system
▶ the policy πθ
Trajectory probability
Pθ(τ) = Π_{t=0}^{T} P(xt+1 | xt, ut) πθ(ut, xt)
where P(xt+1 | xt, ut) are the transition probabilities and πθ(ut, xt) is the policy.
As we are trying to minimize J(θ), let’s try to derive the gradient ∇J(θ).
Proposition
∇J(θ) = E_{τ∼πθ}[ ∇ log Pθ(τ) R(τ) ]
where Pθ (τ ) is the probability of τ when πθ is used.
Good news! We knew that J(θ) is an expectation, but ∇J(θ) is an expectation too.
Do you remember stochastic gradient descent?
Proof:
∇J(θ) = ∇ E_{τ∼πθ}[ R(τ) ] = ∇ Σ_τ Pθ(τ) R(τ)
      = Σ_τ ∇Pθ(τ) R(τ)
      = Σ_τ ( Pθ(τ) / Pθ(τ) ) ∇Pθ(τ) R(τ) = Σ_τ Pθ(τ) ∇ log Pθ(τ) R(τ)
      = E_{τ∼πθ}[ ∇ log Pθ(τ) R(τ) ]
∇J(θ) = E_{τ∼πθ}[ ∇ log Pθ(τ) R(τ) ]
What is ∇ log Pθ (τ )?
Proposition
∇ log Pθ(τ) = Σ_{t=0}^{T} ∇ log πθ(ut, xt)
Amazing result!
no more dependence on the transition probabilities
depending on the parametrization of πθ , ∇ log πθ is known
we need a trajectory τ to compute it
Proof:
∇ log Pθ(τ) = ∇ log Π_{t=0}^{T} P(xt+1 | xt, ut) πθ(ut, xt)
            = ∇ ( Σ_{t=0}^{T} log P(xt+1 | xt, ut) + Σ_{t=0}^{T} log πθ(ut, xt) )
            = Σ_{t=0}^{T} ∇ log πθ(ut, xt)
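As an illustration (not part of the slides), ∇ log πθ has a simple closed form for common parametrizations. The sketch below assumes a softmax policy over linear features, πθ(u | x) ∝ exp(ϕ(x, u)^⊤ θ), over a finite input set.

```python
import numpy as np

def softmax_policy(theta, phi, x, actions):
    """pi_theta(u | x) proportional to exp(phi(x, u)^T theta) over a finite input set."""
    logits = np.array([phi(x, u).dot(theta) for u in actions])
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, phi, x, u_idx, actions):
    """Score function: grad_theta log pi_theta(u | x) = phi(x, u) - E_pi[phi(x, .)]."""
    p = softmax_policy(theta, phi, x, actions)
    feats = np.array([phi(x, u) for u in actions])   # shape (|U|, d)
    return feats[u_idx] - p @ feats
```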
Putting things together
Instead of computing
∇J(θ) = E_{τ∼πθ}[ ∇ log Pθ(τ) R(τ) ]
we use a sample
∇ log Pθ (τ̂ ) R(τ̂ )
where τ̂ comes from the distribution τ ∼ πθ .
How do we sample this distribution?
Policy gradient algorithm
Iteratively, repeat
1 Generate τ via πθ
2 Compute R(τ )
3 Update θ ← θ − α R(τ) Σ_{t=0}^{T} ∇ log πθ(ut, xt)
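A minimal sketch of this loop (often called REINFORCE), my own illustration under the following assumptions: a Gaussian policy u ∼ N(θ^⊤ϕ(x), σ²) with a state feature map ϕ(x), a hypothetical episodic `env` interface as before, and rt interpreted as a stage cost so that the update is a descent step.

```python
import numpy as np

def policy_gradient(env, phi, d, gamma=0.99, alpha=1e-3, sigma=0.5,
                    episodes=2000, seed=0):
    """Policy gradient with single-trajectory estimates of grad J(theta).

    Gaussian policy: grad_theta log pi_theta(u | x) = (u - theta^T phi(x)) phi(x) / sigma^2.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    for _ in range(episodes):
        # 1) Generate a trajectory tau with the current policy pi_theta
        traj, x, done = [], env.reset(), False
        while not done:
            u = theta @ phi(x) + sigma * rng.normal()
            x_next, r, done = env.step(u)          # hypothetical interface, r = stage cost
            traj.append((x, u, r))
            x = x_next
        # 2) Compute R(tau) = sum_t gamma^t r_t
        R = sum(gamma**t * r for t, (_, _, r) in enumerate(traj))
        # 3) theta <- theta - alpha * R(tau) * sum_t grad log pi_theta(u_t, x_t)
        score = sum((u - theta @ phi(x)) * phi(x) / sigma**2 for x, u, _ in traj)
        theta -= alpha * R * score
    return theta
```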
Why not learn all the time?
Consider a simple system
xt+1 = [ 1  1 ] xt + [ 0 ] ut
       [ 0  1 ]      [ 1 ]
and an LQR-type cost with
Q = [ 1  0 ],   R = r
    [ 0  0 ]
Consider the alternatives
Q-learning (what is the Q function in an LQR problem?)
▶ prior information: Markovian state
▶ good parametrization of the Q function
policy gradient
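For reference, a sketch (my own, not from the slides) of this system in NumPy, together with a quadratic feature vector that would be a good parametrization for the linear Q-learning above: in an LQR problem the optimal Q function is a quadratic form in (x, u), so degree-two monomials form a natural basis. The value of r is an arbitrary assumption.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([0.0, 1.0])
Qc = np.array([[1.0, 0.0],      # state weight (the Q matrix of the LQR cost)
               [0.0, 0.0]])
r = 0.1                          # input weight R = r (value assumed for the example)

def step(x, u):
    """One step of the dynamics x_{t+1} = A x_t + B u_t (u scalar)."""
    return A @ x + B * u

def stage_cost(x, u):
    return float(x @ Qc @ x + r * u**2)

def phi(x, u):
    """Quadratic features: the LQR Q function is a quadratic form in (x, u)."""
    z = np.array([x[0], x[1], u])
    return np.outer(z, z)[np.triu_indices(3)]   # the 6 monomials z_i z_j, i <= j
```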
[Figure: cost versus number of samples (0 to 30000) for policy gradient and random search, compared with the optimal cost; costs range roughly from 5 to 10.]
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License
https://bsaver.io/COCO