Reinforcement Learning
Definition
• A software agent learns to take actions in an environment
that lead it to maximum cumulative reward.
• Exploration and Exploitation
• Multiple Trials
Type of ML
Machine Learning
• Supervised
• Unsupervised
• Reinforcement
Reinforcement:
• Cause and Effect
• Agent learns to interact with environment for reward.
Intuitive example
• Imagine you are supposed to cross an unknown field in the middle of
a pitch-black night without a torch.
• There can be pits and stones in the field, and their positions are
unknown to you.
• There's a simple rule: if you fall into a pit or hit a stone, you must
start again from your initial point.
Block Diagram
Definitions
• Agent: Entity performing action in environment to gain reward.
• Action (a): One of the possible moves the agent can make.
• Environment (e): Scenario faced by agent.
• State (s): Current situation returned by the environment.
Definitions
• Reward (R): An immediate return sent from the environment to evaluate
the last action by the agent.
• Policy ($\pi$): Strategy that an agent employs to determine the next action
based on state $s$.
• Value (V): The expected long-term discounted return $V_\pi(s)$,
as opposed to the immediate reward R.
• Q value or action value (Q): $Q_\pi(s, a)$ is the long-term return from the current
state $s$ when taking action $a$ and then following policy $\pi$.
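The terms above fit together in one interaction loop, sketched below in Python. This is an illustration only: `FieldEnv`, its pit position, its two actions, and its reward values are hypothetical and loosely modelled on the dark-field example, not taken from any library.

```python
import random


class FieldEnv:
    """Hypothetical 1-D field: cells 0..size-1, goal at the last cell."""

    def __init__(self, size=6, pits=(2,)):
        self.size = size
        self.pits = set(pits)  # assumed pit positions, unknown to the agent
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 = step one cell forward, action 2 = jump two cells forward
        self.state = min(self.state + action, self.size - 1)
        if self.state in self.pits:
            # Fell into a pit: penalty, and back to the initial point.
            return self.reset(), -1.0, False
        if self.state == self.size - 1:
            return self.state, 1.0, True  # reached the far side
        return self.state, 0.0, False


def run_episode(env, policy, max_steps=100):
    """Agent-environment loop: observe state, act, receive reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # policy maps state to action
        state, reward, done = env.step(action)  # environment returns s and R
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state):
    return random.choice([1, 2])


if __name__ == "__main__":
    print(run_episode(FieldEnv(), random_policy))
```

A random policy falls into the pit often. A policy that has learned to jump from cell 1 crosses without any penalty.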
Types of Reinforcement Learning
• Value Based
• Policy Based
• Model Based
Value Based
• Try to maximize a value function $V(s)$:
$\max_\pi V_\pi(s)$
• The value is the total reward the agent expects to gain in the future
when starting from state $s$:
$V_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right]$
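To make the discounted sum concrete, here is a minimal Python sketch. The helper names `discounted_return` and `estimate_value` are made up for illustration. It assumes the rewards of each episode are already recorded as a list, and it estimates the expectation by averaging over sampled episodes that all start in the same state.

```python
def discounted_return(rewards, gamma):
    """Compute R_1 + gamma*R_2 + gamma^2*R_3 + ... for one episode."""
    g = 0.0
    # Work backwards so each step is just g = r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def estimate_value(episodes, gamma):
    """Average the discounted return over episodes from the same start state."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    print(discounted_return([1, 1, 1], 0.5))  # 1 + 0.5 + 0.25 = 1.75
    print(estimate_value([[1, 1, 1], [0, 0, 0]], 0.5))
```

A smaller discount factor makes the agent short-sighted: with `gamma = 0`, only the first reward counts.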
Policy Based
• Try to produce a policy such that the action performed at each state is
optimal to gain maximum reward in the future.
• Policy: $\pi(s, a)$
• Deterministic: $a = \pi(s)$
• At any state $s$, the policy $\pi$ always produces the same action $a$.
• Stochastic: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$
• Each action has a certain probability of being chosen.
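The difference between the two kinds of policy can be sketched with plain dictionaries. The state names, action names, and probabilities below are invented for illustration only.

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a probability for every action.
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}


def act_deterministic(state):
    """Always returns the same action for a given state."""
    return deterministic_policy[state]


def act_stochastic(state, rng=random):
    """Samples an action according to P(A_t = a | S_t = s)."""
    dist = stochastic_policy[state]
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(act_deterministic("s0"))
    print([act_stochastic("s0") for _ in range(5)])
```

Calling `act_deterministic("s0")` repeatedly always gives the same answer, while `act_stochastic("s0")` returns "left" about 80% of the time under the assumed probabilities.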
Model Based
• In this type of reinforcement learning, a virtual model is created for each
environment.
• The agent learns to perform in that specific environment.
• Since the model differs for each environment, there is no single
solution or algorithm for this type.
Multi-armed Bandit Problem
• Consider a casino section with 10 slot machines and a sign that reads “Play for
free! Max. payout is $10.”
• Each slot machine has a different average payout.
• Goal: Find which machine gives the highest average reward, so as to maximize
total reward in the shortest time.
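One common way to balance exploration and exploitation here is an epsilon-greedy strategy, sketched below. The slides do not prescribe a particular algorithm, so this is only one possible approach. It assumes each machine pays the full $10 with some fixed win probability and nothing otherwise. The win probabilities, `epsilon`, and the number of steps are made-up values.

```python
import random


def epsilon_greedy_bandit(win_probs, steps=5000, epsilon=0.1,
                          payout=10.0, seed=0):
    """Play a simulated multi-armed bandit with an epsilon-greedy strategy."""
    rng = random.Random(seed)
    n_arms = len(win_probs)
    counts = [0] * n_arms        # how often each machine was played
    estimates = [0.0] * n_arms   # running average payout per machine
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try a random machine
        else:
            # exploit: play the machine with the best estimate so far
            arm = max(range(n_arms), key=lambda i: estimates[i])
        # Assumed payout model: full payout with probability win_probs[arm].
        reward = payout if rng.random() < win_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for this machine.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, counts, total


if __name__ == "__main__":
    probs = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.9]
    estimates, counts, total = epsilon_greedy_bandit(probs)
    print("most played machine:", counts.index(max(counts)))
    print("total reward:", total)
```

With `epsilon = 0`, the agent never explores and can get stuck on the first machine that happens to pay out. A larger `epsilon` finds the best machine sooner but wastes more plays on poor machines afterwards.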