Continuous Control

Continuous Control
Setting:
• multi-dimensional continuous action space
• huge state space (discrete or continuous and multi-dimensional)

Possible Strategy:
• Recall π*(s) = argmax_a q*(s,a)
• This inspires the following approach:
1. Train a "Q-network" Qφ(s,a) such that Qφ(s,a) ≈ q*(s,a)
2. Use π(s) = argmax_a Qφ(s,a)

Challenge: Qφ(s,a) requires a neural network evaluation. How do we
evaluate argmax_a Qφ(s,a) when the action space is continuous?
Recall DQN Algorithm
Input: Finite replay buffer D
Initialize φ in Q-network
Put some data in D (use algorithm below without updating φ)
Repeat:
In state s, take action a = argmax_a' Qφ(s,a') (but with prob ε choose random a)
Observe r, s'. Add (s,a,r,s') to front of replay buffer D.
Grab minibatch B from D
φ ← φ − α ∇φ Σ_{(s,a,r,s')∈B} ( Qφ(s,a) − [ r + γ max_a' Qφtarg(s',a') ] )²
φtarg ← (1−β) φtarg + β φ
s ← s'

Here both the max_a' and argmax_a' are problematic due to the continuous multi-dimensional action space.
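
For concreteness, a minimal sketch of the update and Polyak step above for the discrete-action case (assuming PyTorch; Qphi, Qtarg and the hyperparameter values are illustrative names, not from the slides):

import torch

def dqn_update(Qphi, Qtarg, optimizer, batch, gamma=0.99, beta=0.005):
    s, a, r, s2 = batch                      # minibatch B of (s, a, r, s') tensors
    with torch.no_grad():
        # bootstrap target: r + gamma * max_a' Q_targ(s', a')
        target = r + gamma * Qtarg(s2).max(dim=1).values
    q = Qphi(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_phi(s, a) for the taken actions
    loss = ((q - target) ** 2).sum()                   # squared TD error over the minibatch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Polyak averaging: phi_targ <- (1 - beta) * phi_targ + beta * phi
    with torch.no_grad():
        for p_t, p in zip(Qtarg.parameters(), Qphi.parameters()):
            p_t.mul_(1 - beta).add_(beta * p)

The continuous-action difficulty is exactly that the max and argmax inside this update are no longer a cheap lookup over a finite action set.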
Could try directly maximizing over a
For current φ, could attempt to maximize Qφ(s',a) over a:
Repeat many times:
a ← a + α ∇_a Qφ(s',a)

Does not introduce new sampling, but it is computationally inefficient: it needs to be
done for every s' in the minibatch, and again at every time step during inference.

Most importantly, it doesn't work well, most likely because it is
overfitting an approximation.
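
A sketch of this inner-loop maximization (PyTorch; the step count and learning rate are arbitrary illustrative choices, and Qphi(s, a) is assumed to return a scalar):

import torch

def approx_argmax_a(Qphi, s, action_dim, steps=50, lr=0.01):
    # gradient ascent on a -> Q_phi(s, a), starting from an arbitrary initial action
    a = torch.zeros(action_dim, requires_grad=True)
    opt = torch.optim.Adam([a], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-Qphi(s, a)).backward()   # minimizing -Q is maximizing Q
        opt.step()
    return a.detach()

As noted above, this loop would have to run for every s' in every minibatch and again at every inference step, which is the main practical objection.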
Actor-Critic Algorithms

• DDPG
• TD3
• SAC
• REDQ
Actor-Critic Approach
• DQN only has a critic network Qφ(s,a)
• Train an actor network μθ(s) so that μθ(s) ≈ argmax_a Qφ(s,a)
• Then can replace max_a' Qφ(s',a') in DQN with Qφ(s', μθ(s'))

[Diagrams: critic (s,a) → φ → Qφ(s,a); actor s → θ → μθ(s)]

Evaluation (critic): want Qφ(s,a) ≈ q_μθ(s,a), so set target = r + γ Qφ(s', μθ(s'))
Improvement (actor): want μθ(s) ≈ argmax_a Qφ(s,a), so maximize Qφ(s, μθ(s)) over θ


Basic actor-critic algorithm for continuous control
Input: Finite replay buffer D
Initialize φ in Q-network and θ in actor network
Put some data in D (just use a random policy).
Repeat:
In state s, take action a = μθ(s)
Observe r, s'. Add (s,a,r,s') to front of replay buffer D.
Grab minibatch B from D
φ ← φ − α ∇φ Σ_{(s,a,r,s')∈B} ( Qφ(s,a) − [ r + γ Qφtarg(s', μθ(s')) ] )²
θ ← θ + α ∇θ Σ_{s∈B} Qφ(s, μθ(s))
φtarg ← (1−β) φtarg + β φ
s ← s'
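
A minimal sketch of the two gradient steps in the loop above (PyTorch; Qphi, Qtarg, mu and the optimizers are illustrative names):

import torch

def actor_critic_update(Qphi, Qtarg, mu, q_opt, pi_opt, batch, gamma=0.99, beta=0.005):
    s, a, r, s2 = batch                                   # minibatch B from D
    # critic (evaluation): fit Q_phi(s,a) to r + gamma * Q_targ(s', mu_theta(s'))
    with torch.no_grad():
        target = r + gamma * Qtarg(s2, mu(s2)).squeeze(-1)
    q_loss = ((Qphi(s, a).squeeze(-1) - target) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    # actor (improvement): ascend Q_phi(s, mu_theta(s)) in theta
    pi_loss = -Qphi(s, mu(s)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    # Polyak averaging of the target critic
    with torch.no_grad():
        for p_t, p in zip(Qtarg.parameters(), Qphi.parameters()):
            p_t.mul_(1 - beta).add_(beta * p)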
DDPG: 2016
DDPG algorithm

• Bad name: better would be "basic actor-critic" algorithm
• Very close to the algorithm on the previous slide.
• Can ignore done signal d. Set d=0
• Can ignore clipping
• Note that it also uses Polyak
averaging for θ
• Note that it includes exploration: a = μθ(s) + ε (see the sketch below)
• Here ε is a vector. Each component of the vector is drawn from N(0,σ²), where σ
is another hyperparameter.
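
The exploration step as a sketch (sigma and the action bounds are illustrative hyperparameters; the clipping line is the optional part mentioned above):

import torch

def ddpg_action(mu, s, sigma=0.1, a_low=-1.0, a_high=1.0):
    with torch.no_grad():
        a = mu(s)
        a = a + sigma * torch.randn_like(a)   # each noise component drawn from N(0, sigma^2)
    return a.clamp(a_low, a_high)             # optional clipping to the action range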
[Figure from the DDPG paper.]
[Figure: DDPG variants (all but light grey).]
DDPG specialized to continuous bandit
• No states, reward r(a).
• Replace Qφ(s,a) with Qφ(a), and μθ(s) with a vector b.

[Diagram: critic a → φ → Qφ(a)]

Evaluation (critic): want Qφ(a) ≈ r(a), so set target = r
Improvement: maximize Qφ(b) over b


DDPG for continuous bandit (in HW assignments)
Input: Finite replay buffer D
Initialize φ in Q-network and vector b
Put some data in D (e.g., random values around the initial action b).
Repeat:
Take action a = b + ε
Observe r. Add (a,r) to front of replay buffer D.
Grab minibatch B from D
φ ← φ − α ∇φ Σ_{(a,r)∈B} ( Qφ(a) − r )²   (full gradient now)
b ← b + α ∇_b Qφ(b)
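
A sketch of one loop iteration for the bandit case (PyTorch; this is only an illustration of the idea, not the HW starter code; Qphi is a small network over actions and b is a tensor created with requires_grad=True):

import torch

def bandit_ddpg_step(Qphi, q_opt, b, b_opt, batch, sigma=0.1):
    a, r = batch                                       # minibatch of (a, r) pairs from D
    # critic: fit Q_phi(a) to the observed reward; the target r does not depend on phi,
    # so this is the full gradient (no target network needed)
    q_loss = ((Qphi(a).squeeze(-1) - r) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    # "actor": gradient ascent on b -> Q_phi(b)
    b_loss = -Qphi(b.unsqueeze(0)).mean()
    b_opt.zero_grad(); b_loss.backward(); b_opt.step()
    # next action to take in the environment: b plus Gaussian exploration noise
    with torch.no_grad():
        return b + sigma * torch.randn_like(b)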
TD3: Twin Delayed DDPG
• Very similar to DDPG. Can ignore the “Delayed”
• Most important change: address maximization bias by using double Q-learning.
• Two critic networks: φ1 and φ2, and one policy network θ
• For double-Q learning, use "clipped double-Q learning":
• Replace the DDPG target with:

target = r + γ min[ Qφtarg,1(s', μθ(s')) , Qφtarg,2(s', μθ(s')) ]

• Use this same target for updating φ1 and φ2 (but φ1 and φ2 start with different
random values).

• If you use more than two critics, you run into underestimation bias!
TD3 algorithm for continuous control
Input: Finite replay buffer D
Initialize φ1, φ2 in Q-networks and θ in actor network
Put some data in D (just use a random policy).
Repeat:
In state s, take action a = μθ(s)
Observe r, s'. Add (s,a,r,s') to front of replay buffer D.
Grab minibatch B from D
φ1 ← φ1 − α ∇φ1 Σ_B ( Qφ1(s,a) − target )²
φ2 ← φ2 − α ∇φ2 Σ_B ( Qφ2(s,a) − target )²
θ ← θ + α ∇θ Σ_B Qφ1(s, μθ(s))
φtarg,i ← (1−β) φtarg,i + β φi ,  i = 1,2

s ← s'
TD3: 2018
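
A sketch of one TD3 update with the clipped double-Q target (PyTorch; as on the slides, the "delayed" policy updates are omitted; Q1, Q2 and their target copies Q1t, Q2t are illustrative names):

import torch

def td3_update(Q1, Q2, Q1t, Q2t, mu, q_opts, pi_opt, batch, gamma=0.99, beta=0.005):
    s, a, r, s2 = batch
    with torch.no_grad():
        a2 = mu(s2)
        # clipped double-Q: bootstrap with the minimum of the two target critics
        target = r + gamma * torch.min(Q1t(s2, a2), Q2t(s2, a2)).squeeze(-1)
    # the same target is used for both critics (they differ only via initialization)
    for Q, opt in ((Q1, q_opts[0]), (Q2, q_opts[1])):
        loss = ((Q(s, a).squeeze(-1) - target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    # actor ascends Q1(s, mu_theta(s))
    pi_loss = -Q1(s, mu(s)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    # Polyak averaging of both target critics
    with torch.no_grad():
        for Qt, Q in ((Q1t, Q1), (Q2t, Q2)):
            for p_t, p in zip(Qt.parameters(), Q.parameters()):
                p_t.mul_(1 - beta).add_(beta * p)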
SAC: Soft Actor Critic
• Similar to TD3 but does exploration differently
• Can perform better than TD3 when dimensions of state and action
spaces are large.
• Basic idea: μθ(s) is a deterministic policy. Instead of doing exploration as
a = μθ(s) + ε, generate a stochastic policy πθ(·|s) directly from an optimization
problem.

[Diagram: policy network s → θ → mean μθ(s) and std σθ(s)]

• Set πθ(·|s) to a normal distribution with mean μθ(s) and std σθ(s), with each dimension being
independent.
• Equivalently, set a = μθ(s) + σθ(s) ⊙ ξ where ξ ~ N(0, I): the "reparameterization trick"
SAC: New objective

entropy := H(πθ(·|s)) := E_{a~πθ(·|s)}[ −ln πθ(a|s) ]

         ≈ −ln πθ(a|s),  a ~ πθ(·|s)  (one sample)

Before, we sought a policy that maximizes the expected return. Now we seek
a policy that maximizes the expected return plus the policy entropy.
SAC: same as TD3, except now the entropy samples appear in the update equations.
SAC: 2019
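
A sketch of the reparameterized sampling and the one-sample entropy estimate (PyTorch; policy_net is an assumed network returning the mean and log-std; note that full SAC also squashes the action through tanh and corrects the log-probability, which is omitted here):

import torch

def sample_action_with_logp(policy_net, s):
    mean, log_std = policy_net(s)               # mu_theta(s) and ln sigma_theta(s)
    std = log_std.exp()
    xi = torch.randn_like(mean)                 # xi ~ N(0, I)
    a = mean + std * xi                         # reparameterization trick: a = mu(s) + sigma(s)*xi
    dist = torch.distributions.Normal(mean, std)
    log_p = dist.log_prob(a).sum(-1)            # ln pi_theta(a|s) for a diagonal Gaussian
    return a, log_p                             # -log_p is the one-sample entropy estimate

The −ln πθ(a'|s') term (weighted by an entropy coefficient) is then added inside the critic's bootstrap target and to the actor's objective, which is where the "entropy samples in the update equations" enter.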
REDQ
• Update-To-Data (UTD) ratio
• DDPG, TD3, SAC all do one update of parameters for each environment interaction: UTD ratio = 1.
• To improve sample efficiency, it is natural to attempt multiple updates for each
environment interaction: UTD ratio >> 1.
• Doesn't work: overfits to existing data in the buffer; Qφ(s,a) becomes inaccurate
for (s,a)'s not in the buffer.
• Natural to try regularization, such as ensembling many Qφ's.
• But then how do we handle the minimization term for double-Q learning?
• Minimizing in the target over all the Qφ's leads to underestimation bias.
• Solution: randomly select two Qφ's for each update (see the sketch below).
REDQ: 2021
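
A sketch of the in-target minimization over a randomly chosen pair of critics from the ensemble (PyTorch; this assumes a SAC-style stochastic policy as in the REDQ paper; Qt_list, policy_sample and alpha are illustrative names):

import random
import torch

def redq_target(Qt_list, policy_sample, r, s2, gamma=0.99, alpha=0.2):
    # randomly select two target critics out of the ensemble of N
    i, j = random.sample(range(len(Qt_list)), 2)
    with torch.no_grad():
        a2, log_p = policy_sample(s2)                          # a' ~ pi_theta(.|s'), with ln pi_theta(a'|s')
        q_min = torch.min(Qt_list[i](s2, a2), Qt_list[j](s2, a2)).squeeze(-1)
        return r + gamma * (q_min - alpha * log_p)             # entropy-regularized bootstrap target

With a UTD ratio G >> 1, this target is recomputed (with a fresh random pair) for each of the G critic updates performed per environment interaction.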
