An Introduction to
Deep Reinforcement
Learning
Ehsan Abbasnejad
Remember: Supervised Learning
We have a set of sample observations with labels; we learn to predict the labels, given a new sample.
[Figure: example cat and dog images. Learn the function that associates a picture of a dog/cat with its label.]
Remember: supervised learning
We need thousands of samples
Samples have to be provided by experts
There are applications where
• We can’t provide expert samples
• Expert examples are not what we want to mimic
• There is an interaction with the world
Deep Reinforcement Learning
AlphaGo
Scenario of Reinforcement Learning
[Figure: the agent observes the state of the environment, takes an action that changes the environment, and receives a reward (e.g. "Don't do that").]
Scenario of Reinforcement Learning
The agent learns to take actions maximizing expected reward.
[Figure: the same agent-environment loop, now with a positive reward (e.g. "Thank you.").]
Machine Learning ≈ Looking for a Function
Actor/Policy: Action = π( Observation )
[Figure: the observation is the function input, the action is the function output, and the reward from the environment is used to pick the best function.]
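As a minimal sketch of this "policy = function" view (the observation fields and action names below are made-up illustrations, not from the slides):

import random

ACTIONS = ["left", "right", "fire"]

def pi(observation):
    # Toy hand-written policy: fire when an alien is straight ahead, otherwise move randomly.
    if observation.get("alien_above", False):
        return "fire"
    return random.choice(["left", "right"])

action = pi({"alien_above": True})   # -> "fire"

Reinforcement learning replaces such a hand-written rule with a function chosen to maximize reward.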
Reinforcement Learning in a nutshell
RL is a general-purpose framework for decision-making
• RL is for an agent with the capacity to act
• Each action influences the agent’s future state
• Success is measured by a scalar reward signal
Goal: select actions to maximise future reward
Deep Learning in a nutshell
DL is a general-purpose framework for representation learning
• Given an objective
• Learn the representation that is required to achieve the objective
• Directly from raw inputs
• Using minimal domain knowledge
Goal: Learn the representation that achieves the
objective
Deep Reinforcement Learning in a nutshell
A single agent that solves human level tasks
• RL defines the objective
• DL gives the mechanism and representation
• RL+DL=Deep reinforcement learning
This can lead to general intelligence
Reinforcement Learning is multi-disciplinary
[Figure: RL sits at the intersection of machine learning, computer science, engineering, operations research, optimal control, economics, game theory, and mathematics.]
Agent and Environment
• At each step, the agent:
• Selects an action
• Observes the environment
• Receives a reward
• The environment:
• Receives the action
• Emits a new observation
• Emits a reward for the agent
Learning to play Go
[Figure: the observation is the board position, the action is the next move, and the environment is the opponent.]
Learning to play Go
The agent learns to take actions maximizing expected reward.
Reward: 0 in most cases; if the agent wins, reward = 1; if it loses, reward = -1.
Learning to play Go
• Supervised: learning from a teacher
(given a board position, the teacher tells the next move, e.g. "5-5" or "3-3")
• Reinforcement Learning: learning from experience
First move …… many moves …… Win!
(Two agents play with each other.)
AlphaGo is supervised learning + reinforcement learning.
Learning a chat-bot
• Machine obtains feedback from the user
[Figure: two example exchanges: "How are you?" followed by "Bye bye ☺" earns reward -10; "Hello" followed by "Hi ☺" earns reward 3.]
• Chat-bot learns to maximize the expected reward
Learning a chat-bot
• Let two agents talk to each other (sometimes they generate good dialogue, sometimes bad)
Dialogue A: "How old are you?" / "See you." / "See you." / "See you."
Dialogue B: "How old are you?" / "I am 16." / "I thought you were 12." / "What makes you think so?"
Learning a chat-bot
• By this approach, we can generate a lot of dialogues.
• Use some predefined rules to evaluate the goodness of a dialogue
[Figure: many generated dialogues (Dialogue 1 … Dialogue 8), each of which is evaluated.]
Machine learns from the evaluation
Deep Reinforcement Learning for Dialogue Generation: https://arxiv.org/pdf/1606.01541v3.pdf
Learning a chat-bot
• Supervised: given "Hello", say "Hi"; given "Bye bye", say "Good bye"
• Reinforcement: the agent holds a whole dialogue ("Hello" …… many turns ……) and only receives an overall judgement at the end (e.g. "Bad")
More applications
•Flying Helicopter
• https://www.youtube.com/watch?v=0JL04JJjocc
•Driving
• https://www.youtube.com/watch?v=0xo1Ldx3L5Q
•Robot
• https://www.youtube.com/watch?v=370cT-OAzzM
•Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI
• http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai
•Text generation
• https://www.youtube.com/watch?v=pbQ4qe8EwLo
Example: Playing Video Game
• Widely studied:
• Gym: https://gym.openai.com/
• Universe: https://openai.com/blog/universe/
Machine learns to play video games as human players do:
➢ What the machine observes is pixels
➢ The machine learns to take the proper action itself
Example: Playing Video Game
• Space invader
• Goal: kill the aliens; the score is the reward
• Termination: all the aliens are killed, or your spaceship is destroyed
[Figure: Space invader screen showing the score (reward), the aliens, the shields, and the fire action.]
Example: Playing Video Game
• Space invader
• Play yourself: http://www.2600online.com/spaceinvaders.html
• How about the machine: https://gym.openai.com/evaluations/eval_Eduozx4HRyqgTCVk9ltw
Example: Playing Video Game
[Figure: taking an action (e.g. "fire") kills an alien and yields a positive reward.]
Usually there is some randomness in the environment.
Example: Playing Video Game
After many turns the game is over (the spaceship is destroyed); this whole sequence is an episode.
Learn to maximize the expected cumulative reward per episode.
Paradigm
              Supervised Learning    Unsupervised Learning    Reinforcement Learning
Objective     Classification         Inference                Prediction
Applications  Regression             Generation               Control
SETTING
[Figure: the agent, using a policy, sends actions to the environment; the environment returns a state/observation and a reward.]
MARKOV DECISION PROCESSES (MDP)
An MDP consists of a state space, an action space, a transition function, and a reward function.
● State: the Markov property, i.e. the next state depends only on the current state and action
● Decision: the agent takes actions, and those decisions have consequences
● Process: there is a transition function (the dynamics of the system)
● Reward: depends on the state and action, often related to the state
Goal: maximise overall reward
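In standard textbook notation (a sketch; the symbols below are the conventional ones rather than the slide's own):
\[
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma), \qquad
P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a), \qquad
r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}
\]
\[
\text{Markov property:} \quad
\Pr(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots) = \Pr(S_{t+1} \mid S_t, A_t)
\]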
PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES (POMDP)
As in an MDP, there is a state space, an action space, a transition function, and a reward function; in addition, the agent receives observations rather than the state itself.
● State: the Markov property still holds for the underlying state, but the agent cannot directly observe that state
● Decision: the agent takes actions, and those decisions have consequences
● Process: there is a transition function (the dynamics of the system)
● Reward: depends on the state and action, often related to the state
Goal: maximise overall reward
MARKOV DECISION PROCESSES (MDP)
(State space, action space, transition function, reward function.)
Computing Rewards
Episodic vs continuing: “Game over” after N steps
Additive rewards (can be infinite for continuing tasks)
Discounted rewards ...
DISCOUNT FACTOR
→ We want to be greedy but not impulsive
→ Implicitly takes uncertainty in dynamics into account (we
don’t know the future)
→ Mathematically: γ<1 allows infinite horizon returns
Return: \( G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \)
SOLVING AN MDP
Objective: the expected return \( J(\pi) = \mathbb{E}_{\pi}[G_t] \)
Goal: find the policy \( \pi^{*} = \arg\max_{\pi} J(\pi) \) that maximises the expected return
SOLVING AN MDP
● If the states and actions are discrete:
○ We have a table of state-action probabilities
○ Learning is filling in this table (dynamic programming)
[Figure: a table with one row per state and one column per action, together with the next-state transitions it induces.]
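A hedged sketch of "filling in the table" by dynamic programming (value iteration); the tiny two-state, two-action MDP below is invented purely for illustration:

import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s'] transition probabilities
R = np.zeros((n_states, n_actions))             # R[s, a] expected reward
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.9, 0.1]; P[1, 1] = [0.2, 0.8]
R[:, 1] = 1.0                                   # action 1 gives reward 1 in every state

Q = np.zeros((n_states, n_actions))             # the state-action table we fill in
for _ in range(200):
    V = Q.max(axis=1)                           # best value achievable from each state
    Q = R + gamma * P @ V                       # Bellman optimality backup
policy = Q.argmax(axis=1)                       # greedy policy read off the table
print(Q, policy)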
SOLVING AN MDP
● If the states and actions are discrete:
● Let’s try different actions and see which ones succeed
Exploration-Exploitation dilemma
Do we want to stick to the action we think is good, or try something new?
Choosing Actions
• Take the action with the highest probability (Q-function): greedy
• Pick an action in proportion to its probability: sampling
• Greedy most of the time, random with some small probability (ε-greedy); see the sketch below
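A minimal sketch of these three rules for a tabular Q-function over one state (the function name, the ε value, and the softmax temperature are illustrative assumptions):

import numpy as np

def choose_action(Q, mode="epsilon_greedy", epsilon=0.1, temperature=1.0):
    if mode == "greedy":                      # always the best-looking action
        return int(np.argmax(Q))
    if mode == "sampling":                    # sample in proportion to a softmax of Q
        p = np.exp(Q / temperature); p /= p.sum()
        return int(np.random.choice(len(Q), p=p))
    # epsilon-greedy: mostly greedy, occasionally a uniformly random action
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(Q)))
    return int(np.argmax(Q))

a = choose_action(np.array([0.1, 0.5, 0.2]), mode="epsilon_greedy")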
VALUE FUNCTIONS
→ Value = expected gain of a state
→ Q function – action specific value function
→ Advantage function – how much more valuable is an action
→ Value depends on future rewards → depends on policy
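In the usual notation (a sketch of the standard definitions, not the slide's own equations):
\[
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right], \qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right], \qquad
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
\]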
VALUE FUNCTIONS
[Figure: a state, the action taken from it, and the resulting next state.]
Solving Reinforcement Learning
• Model-based approaches:
• We model the environment. Do we really need to model all the details of the world?
• Model-free approaches:
• We model the state-actions
AlphaGo: policy-based + value-based + model-based
[Figure: taxonomy of model-free approaches: policy-based methods (learning an actor) and value-based methods (learning a critic), overlapping in actor + critic; the model-based approach sits alongside.]
POLICY ITERATION
[Figure: alternate between policy evaluation and policy update.]
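As a compact sketch of the two alternating steps (standard formulation, assumed rather than taken from the slide):
\[
\text{Evaluation:} \;\; V^{\pi_k}(s) = \mathbb{E}_{\pi_k}\!\left[ r_t + \gamma V^{\pi_k}(S_{t+1}) \mid S_t = s \right]
\qquad
\text{Update:} \;\; \pi_{k+1}(s) = \arg\max_{a} \, \mathbb{E}\!\left[ r(s, a) + \gamma V^{\pi_k}(S_{t+1}) \mid S_t = s, A_t = a \right]
\]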
Q-LEARNING
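As a sketch in standard notation, the tabular Q-learning update (with learning rate α) is:
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
\]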
FUNCTION APPROXIMATION
Model: a parameterized Q-function Q(s, a; θ), e.g. a neural network
Training data: observed transitions (s, a, r, s′)
Loss function: the squared error between Q(s, a; θ) and the target y, where y = r + γ · max over a′ of Q(s′, a′; θ) (a sketch follows)
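A hedged PyTorch sketch of this setup (the network size, optimizer, and variable names are assumptions for illustration, not the lecture's code):

import torch
import torch.nn as nn

n_obs, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_loss(obs, action, reward, next_obs, done):
    # Bootstrapped target: r + gamma * max_a' Q(s', a'), with no gradient through the target
    with torch.no_grad():
        target = reward + gamma * (1 - done) * q_net(next_obs).max(dim=1).values
    q_sa = q_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)  # Q(s, a) for the taken actions
    return nn.functional.mse_loss(q_sa, target)

# One illustrative gradient step on a fake batch of 8 transitions
batch = (torch.randn(8, n_obs), torch.randint(0, n_actions, (8,)),
         torch.randn(8), torch.randn(8, n_obs), torch.zeros(8))
loss = td_loss(*batch)
optimizer.zero_grad(); loss.backward(); optimizer.step()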
IMPLEMENTATION
Action-in vs. action-out architectures; off-policy learning
→ The target depends in part on our model → old observations are still useful
→ Use a replay buffer of the most recent transitions as the dataset (sketched below)
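A minimal replay-buffer sketch (the capacity and field names are illustrative assumptions):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def push(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        batch = random.sample(list(self.buffer), batch_size)  # break temporal correlations
        return list(zip(*batch))                              # tuple of columns

buf = ReplayBuffer()
for t in range(100):
    buf.push(obs=t, action=0, reward=1.0, next_obs=t + 1, done=False)
obs_b, act_b, rew_b, next_b, done_b = buf.sample(8)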
Properties of Reinforcement Learning
• Reward delay
• In Space invader, only “fire” obtains a reward
• Although the moves made before “fire” are important
• In Go, it may be better to sacrifice immediate reward to gain more long-term reward
• The agent’s actions affect the subsequent data it receives
• E.g. exploration
DQN ISSUES
→ Convergence is not guaranteed – hope for deep magic!
→ Mitigations: replay buffer, reward scaling, using replicas
→ Double Q-learning – decouple action selection and value estimation
POLICY GRADIENTS
→ Parameterize the policy and update those parameters directly
→ Enables new kinds of policies: stochastic, continuous action spaces
→ On-policy learning → learn directly from your own actions
POLICY GRADIENTS
→ Approximate the expectation value from samples
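In standard notation, the sampled (REINFORCE-style) estimator looks like the following sketch:
\[
\nabla_{\theta} J(\theta)
= \mathbb{E}_{\pi_{\theta}}\!\left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, G_t \right]
\;\approx\; \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t^{i} \mid s_t^{i})\, G_t^{i}
\]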
VARIANCE REDUCTION
→ Constant offsets make it harder to identify the right update direction
→ Remove the offset → subtract the a priori value of each state
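Subtracting a state-dependent baseline gives the usual variance-reduced form (a sketch; the baseline b is typically an estimate of the state value):
\[
\nabla_{\theta} J(\theta)
\approx \mathbb{E}_{\pi_{\theta}}\!\left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \big( G_t - b(s_t) \big) \right],
\qquad b(s_t) \approx V^{\pi}(s_t)
\]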
ADVANCED POLICY GRADIENT METHODS
Rajeswaran et al. (2017); Heess et al. (2017)
ACTOR CRITIC
[Figure: the critic (trained with a Q-learning update) estimates the advantage; the actor (trained with a policy gradient update) proposes actions.]
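A compact sketch of the two coupled updates in a TD actor-critic (standard form, assumed rather than the slide's own notation):
\[
\delta_t = r_t + \gamma V_{w}(s_{t+1}) - V_{w}(s_t), \qquad
w \leftarrow w + \alpha_{w}\, \delta_t\, \nabla_{w} V_{w}(s_t), \qquad
\theta \leftarrow \theta + \alpha_{\theta}\, \delta_t\, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)
\]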
ASYNC ADVANTAGE ACTOR-CRITIC (A3C)
Mnih et al. (2016)
Deep Reinforcement Learning: Actor-Critic
[Figure: the actor interacts with the environment, the critic is learned from the collected experience, and the actor is updated based on the critic; the network maps the game screen to the actions left / right / fire.]
Demo of A3C
• Visual Doom AI Competition @ CIG 2016
• https://www.youtube.com/watch?v=94EPSjQH38Y
Why is it challenging?
• Exploration-exploitation dilemma
• How to reward the algorithm.
• How to learn when rewards are very sparse
• What representation do we need for states?
• How to update the policy
• How to incorporate prior (or logic-based) knowledge
• How to learn for multiple tasks: General Artificial Intelligence
References
• Textbook: Reinforcement Learning: An Introduction (Sutton & Barto)
• http://incompleteideas.net/sutton/book/the-book.html
• Lectures of David Silver
• http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html (10 lectures, around 1:30 each)
• http://videolectures.net/rldm2015_silver_reinforcement_learning/ (Deep Reinforcement Learning)
• Lectures of John Schulman
• https://youtu.be/aUrX-rP_ss4