Continuous Control

Continuous Control
Setting:
• multi-dimensional continuous action space
• huge state space (discrete or continuous and multi-dimensional)

Possible Strategy:
• Recall π*(s) = argmax_a q*(s,a)
• This inspires the following approach:
1. Train a "Q-network" Qφ(s,a) such that Qφ(s,a) ≈ q*(s,a)
2. Use π(s) = argmax_a Qφ(s,a)

Challenge: Qφ(s,a) requires a neural network evaluation. How do we
evaluate argmax_a Qφ(s,a) when the action space is continuous?
Recall DQN Algorithm
Input: Finite replay buffer D
Initialize φ in Q-network
Put some data in D (use algorithm below without updating φ)
Repeat:
In state s, take action a = argmax_a' Qφ(s,a') (but with prob ε choose random a)
Observe r, s'. Add (s,a,r,s') to front of replay buffer D.
Grab minibatch B from D
φ ← φ − α ∇φ Σ_{(s,a,r,s')∈B} ( Qφ(s,a) − [ r + γ max_a' Qφtarg(s',a') ] )²
φtarg ← (1−β) φtarg + β φ
s ← s'

Here both the max_a' and argmax_a' are problematic due to the continuous multi-dimensional action space.
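
For concreteness, a minimal sketch of the update and Polyak step above for the discrete-action case (assuming PyTorch; Qphi, Qtarg and the hyperparameter values are illustrative names, not from the slides):

import torch

def dqn_update(Qphi, Qtarg, optimizer, batch, gamma=0.99, beta=0.005):
    s, a, r, s2 = batch                      # minibatch B of (s, a, r, s') tensors
    with torch.no_grad():
        # bootstrap target: r + gamma * max_a' Q_targ(s', a')
        target = r + gamma * Qtarg(s2).max(dim=1).values
    q = Qphi(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_phi(s, a) for the taken actions
    loss = ((q - target) ** 2).sum()                   # squared TD error over the minibatch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Polyak averaging: phi_targ <- (1 - beta) * phi_targ + beta * phi
    with torch.no_grad():
        for p_t, p in zip(Qtarg.parameters(), Qphi.parameters()):
            p_t.mul_(1 - beta).add_(beta * p)

The continuous-action difficulty is exactly that the max and argmax inside this update are no longer a cheap lookup over a finite action set.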
Could try directly maximizing over a
For current φ, could attempt to maximize Qφ(s',a) over a:
Repeat many times:
a ← a + α ∇_a Qφ(s',a)

Does not introduce new sampling, but it is computationally inefficient: it needs to be
done for every s' in the minibatch, and again at every time step during inference.

Most importantly, it doesn't work well, most likely because it is
overfitting an approximation.
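
A sketch of this inner-loop maximization (PyTorch; the step count and learning rate are arbitrary illustrative choices, and Qphi(s, a) is assumed to return a scalar):

import torch

def approx_argmax_a(Qphi, s, action_dim, steps=50, lr=0.01):
    # gradient ascent on a -> Q_phi(s, a), starting from an arbitrary initial action
    a = torch.zeros(action_dim, requires_grad=True)
    opt = torch.optim.Adam([a], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-Qphi(s, a)).backward()   # minimizing -Q is maximizing Q
        opt.step()
    return a.detach()

As noted above, this loop would have to run for every s' in every minibatch and again at every inference step, which is the main practical objection.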
Actor-Critic Algorithms

• DDPG
• TD3
• SAC
• REDQ
Actor-Critic Approach
• DQN only has a critic network Qφ(s,a)
• Train an actor network μθ(s) so that μθ(s) ≈ argmax_a Qφ(s,a)
• Then can replace max_a' Qφ(s',a') in DQN with Qφ(s', μθ(s'))

[Diagrams: critic (s,a) → φ → Qφ(s,a); actor s → θ → μθ(s)]

Evaluation (critic): want Qφ(s,a) ≈ q_μθ(s,a), so set target = r + γ Qφ(s', μθ(s'))
Improvement (actor): want μθ(s) ≈ argmax_a Qφ(s,a), so maximize Qφ(s, μθ(s)) over θ


Basic actor-critic algorithm for continuous control
Input: Finite replay buffer D
Initialize φ in Q-network and θ in actor network
Put some data in D (just use a random policy).
Repeat:
In state s, take action a = μθ(s)
Observe r, s'. Add (s,a,r,s') to front of replay buffer D.
Grab minibatch B from D
φ ← φ − α ∇φ Σ_{(s,a,r,s')∈B} ( Qφ(s,a) − [ r + γ Qφtarg(s', μθ(s')) ] )²
θ ← θ + α ∇θ Σ_{s∈B} Qφ(s, μθ(s))
φtarg ← (1−β) φtarg + β φ
s ← s'
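
A minimal sketch of the two gradient steps in the loop above (PyTorch; Qphi, Qtarg, mu and the optimizers are illustrative names):

import torch

def actor_critic_update(Qphi, Qtarg, mu, q_opt, pi_opt, batch, gamma=0.99, beta=0.005):
    s, a, r, s2 = batch                                   # minibatch B from D
    # critic (evaluation): fit Q_phi(s,a) to r + gamma * Q_targ(s', mu_theta(s'))
    with torch.no_grad():
        target = r + gamma * Qtarg(s2, mu(s2)).squeeze(-1)
    q_loss = ((Qphi(s, a).squeeze(-1) - target) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    # actor (improvement): ascend Q_phi(s, mu_theta(s)) in theta
    pi_loss = -Qphi(s, mu(s)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    # Polyak averaging of the target critic
    with torch.no_grad():
        for p_t, p in zip(Qtarg.parameters(), Qphi.parameters()):
            p_t.mul_(1 - beta).add_(beta * p)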
DDPG: 2016
DDPG algorithm

• Bad name: better would be "basic actor-critic" algorithm
• Very close to the algorithm on the previous slide.
• Can ignore done signal d. Set d=0
• Can ignore clipping
• Note that it also uses Polyak
averaging for θ
• Note that it includes exploration: a = μθ(s) + ε (see the sketch below)
• Here ε is a vector. Each component of the vector is drawn from N(0,σ²), where σ
is another hyperparameter.
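
The exploration step as a sketch (sigma and the action bounds are illustrative hyperparameters; the clipping line is the optional part mentioned above):

import torch

def ddpg_action(mu, s, sigma=0.1, a_low=-1.0, a_high=1.0):
    with torch.no_grad():
        a = mu(s)
        a = a + sigma * torch.randn_like(a)   # each noise component drawn from N(0, sigma^2)
    return a.clamp(a_low, a_high)             # optional clipping to the action range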
[Figure from the DDPG paper.]
[Figure: DDPG variants (all but light grey).]
DDPG specialized to continuous bandit
• No states, reward r(a).
• Replace Qφ(s,a) with Qφ(a), and μθ(s) with a vector b.

[Diagram: critic a → φ → Qφ(a)]

Evaluation (critic): want Qφ(a) ≈ r(a), so set target = r
Improvement: maximize Qφ(b) over b


DDPG for continuous bandit (in HW assignments)
Input: Finite replay buffer D
Initialize φ in Q-network and vector b
Put some data in D (e.g., random values around the initial action b).
Repeat:
Take action a = b + ε
Observe r. Add (a,r) to front of replay buffer D.
Grab minibatch B from D
φ ← φ − α ∇φ Σ_{(a,r)∈B} ( Qφ(a) − r )²   (full gradient now)
b ← b + α ∇_b Qφ(b)
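
A sketch of one loop iteration for the bandit case (PyTorch; this is only an illustration of the idea, not the HW starter code; Qphi is a small network over actions and b is a tensor created with requires_grad=True):

import torch

def bandit_ddpg_step(Qphi, q_opt, b, b_opt, batch, sigma=0.1):
    a, r = batch                                       # minibatch of (a, r) pairs from D
    # critic: fit Q_phi(a) to the observed reward; the target r does not depend on phi,
    # so this is the full gradient (no target network needed)
    q_loss = ((Qphi(a).squeeze(-1) - r) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    # "actor": gradient ascent on b -> Q_phi(b)
    b_loss = -Qphi(b.unsqueeze(0)).mean()
    b_opt.zero_grad(); b_loss.backward(); b_opt.step()
    # next action to take in the environment: b plus Gaussian exploration noise
    with torch.no_grad():
        return b + sigma * torch.randn_like(b)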
TD3: Twin Delayed DDPG
• Very similar to DDPG. Can ignore the “Delayed”
• Most important change: address maximization bias by using double Q-learning.
• Two critic networks: φ1 and φ2, and one policy network θ
• For double-Q learning, use "clipped double-Q learning":
• Replace the DDPG target with:

target = r + γ min[ Qφtarg,1(s', μθ(s')) , Qφtarg,2(s', μθ(s')) ]

• Use this same target for updating φ1 and φ2 (but φ1 and φ2 start with different
random values).

• If you use more than two critics, you run into underestimation bias!
TD3 algorithm for continuous control
Input: Finite replay buffer D
Initialize φ1, φ2 in Q-networks and θ in actor network
Put some data in D (just use a random policy).
Repeat:
In state s, take action a = μθ(s)
Observe r, s'. Add (s,a,r,s') to front of replay buffer D.
Grab minibatch B from D
φ1 ← φ1 − α ∇φ1 Σ_B ( Qφ1(s,a) − target )²
φ2 ← φ2 − α ∇φ2 Σ_B ( Qφ2(s,a) − target )²
θ ← θ + α ∇θ Σ_B Qφ1(s, μθ(s))
φtarg,i ← (1−β) φtarg,i + β φi ,  i = 1,2

s ← s'
TD3: 2018
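
A sketch of one TD3 update with the clipped double-Q target (PyTorch; as on the slides, the "delayed" policy updates are omitted; Q1, Q2 and their target copies Q1t, Q2t are illustrative names):

import torch

def td3_update(Q1, Q2, Q1t, Q2t, mu, q_opts, pi_opt, batch, gamma=0.99, beta=0.005):
    s, a, r, s2 = batch
    with torch.no_grad():
        a2 = mu(s2)
        # clipped double-Q: bootstrap with the minimum of the two target critics
        target = r + gamma * torch.min(Q1t(s2, a2), Q2t(s2, a2)).squeeze(-1)
    # the same target is used for both critics (they differ only via initialization)
    for Q, opt in ((Q1, q_opts[0]), (Q2, q_opts[1])):
        loss = ((Q(s, a).squeeze(-1) - target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    # actor ascends Q1(s, mu_theta(s))
    pi_loss = -Q1(s, mu(s)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    # Polyak averaging of both target critics
    with torch.no_grad():
        for Qt, Q in ((Q1t, Q1), (Q2t, Q2)):
            for p_t, p in zip(Qt.parameters(), Q.parameters()):
                p_t.mul_(1 - beta).add_(beta * p)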
SAC: Soft Actor Critic
• Similar to TD3 but does exploration differently
• Can perform better than TD3 when dimensions of state and action
spaces are large.
• Basic idea: μθ(s) is a deterministic policy. Instead of doing exploration as
a = μθ(s) + ε, generate a stochastic policy πθ(·|s) directly from an optimization
problem.

[Diagram: policy network s → θ → mean μθ(s) and std σθ(s)]

• Set πθ(·|s) to a normal distribution with mean μθ(s) and std σθ(s), with each dimension being
independent.
• Equivalently, set a = μθ(s) + σθ(s) ⊙ ξ where ξ ~ N(0, I): the "reparameterization trick"
SAC: New objective

entropy := H(πθ(·|s)) := E_{a~πθ(·|s)}[ −ln πθ(a|s) ]

         ≈ −ln πθ(a|s),  a ~ πθ(·|s)  (one sample)

Before, we sought a policy that maximizes the expected return. Now we seek
a policy that maximizes the expected return plus the policy entropy.
SAC: same as TD3, except now the entropy samples appear in the update equations.
SAC: 2019
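
A sketch of the reparameterized sampling and the one-sample entropy estimate (PyTorch; policy_net is an assumed network returning the mean and log-std; note that full SAC also squashes the action through tanh and corrects the log-probability, which is omitted here):

import torch

def sample_action_with_logp(policy_net, s):
    mean, log_std = policy_net(s)               # mu_theta(s) and ln sigma_theta(s)
    std = log_std.exp()
    xi = torch.randn_like(mean)                 # xi ~ N(0, I)
    a = mean + std * xi                         # reparameterization trick: a = mu(s) + sigma(s)*xi
    dist = torch.distributions.Normal(mean, std)
    log_p = dist.log_prob(a).sum(-1)            # ln pi_theta(a|s) for a diagonal Gaussian
    return a, log_p                             # -log_p is the one-sample entropy estimate

The −ln πθ(a'|s') term (weighted by an entropy coefficient) is then added inside the critic's bootstrap target and to the actor's objective, which is where the "entropy samples in the update equations" enter.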
REDQ
• Update-To-Data (UTD) ratio
• DDPG, TD3, SAC all do one update of parameters for each environment interaction: UTD ratio = 1.
• To improve sample efficiency, it is natural to attempt multiple updates for each
environment interaction: UTD ratio >> 1.
• Doesn't work: overfits to existing data in the buffer; Qφ(s,a) becomes inaccurate
for (s,a)'s not in the buffer.
• Natural to try regularization, such as ensembling many Qφ's.
• But then how do we handle the minimization term for double-Q learning?
• Minimizing in the target over all the Qφ's leads to underestimation bias.
• Solution: randomly select two Qφ's for each update (see the sketch below).
REDQ: 2021
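
A sketch of the in-target minimization over a randomly chosen pair of critics from the ensemble (PyTorch; this assumes a SAC-style stochastic policy as in the REDQ paper; Qt_list, policy_sample and alpha are illustrative names):

import random
import torch

def redq_target(Qt_list, policy_sample, r, s2, gamma=0.99, alpha=0.2):
    # randomly select two target critics out of the ensemble of N
    i, j = random.sample(range(len(Qt_list)), 2)
    with torch.no_grad():
        a2, log_p = policy_sample(s2)                          # a' ~ pi_theta(.|s'), with ln pi_theta(a'|s')
        q_min = torch.min(Qt_list[i](s2, a2), Qt_list[j](s2, a2)).squeeze(-1)
        return r + gamma * (q_min - alpha * log_p)             # entropy-regularized bootstrap target

With a UTD ratio G >> 1, this target is recomputed (with a fresh random pair) for each of the G critic updates performed per environment interaction.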
