
Reinforcement Learning

By
Dr Ravi Prakash Verma
Professor
Department of CSAI
ABESIT
Reinforcement Learning
• Introduction
• Reinforcement Learning is a type of machine learning where an agent learns to make
decisions by interacting with an environment.
• The agent receives feedback in the form of rewards or penalties, and its goal is to maximize
the total reward over time.
• Example: train a robot to walk.
• Terminology in RL:
• Agent: The learner or decision maker.
• Environment: The world the agent interacts with.
• State (s): A representation of the current situation.
• Action (a): Choices the agent can make.
• Reward (r): Feedback from the environment after an action.
• Policy (π): A strategy that maps states to actions.
• Value Function (V): Expected return (total reward) from a state.
• Q-Value Function (Q): Expected return for taking an action in a state.
Reinforcement Learning
• Applications of RL:
• Robotics (e.g., walking, manipulation)
• Game Playing (e.g., AlphaGo, OpenAI Five)
• Recommendation Systems
• Autonomous Vehicles
• Finance (e.g., trading strategies)
Reinforcement Learning
• The RL Loop:
• The agent observes the current state.
• It selects an action based on its policy.
• The environment returns a new state and a reward.
• The agent updates its policy to improve future decisions (a short code sketch of this loop appears below).
• Types of Reinforcement Learning:
1. Model-Free vs. Model-Based
1. Model-Free: Learns by trial and error (e.g., Q-Learning, SARSA).
2. Model-Based: Tries to model the environment.
2. Value-Based vs. Policy-Based vs. Actor-Critic
1. Value-Based: Learns value functions (e.g., Q-Learning).
2. Policy-Based: Directly learns the policy (e.g., REINFORCE).
3. Actor-Critic: Combines both approaches.
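• A minimal Python sketch of the interaction loop above (illustrative only; the env and agent objects with reset/step/select_action/update methods are assumed here, not taken from any specific library):

# Sketch of the RL loop: observe state, act, receive reward, update the policy.
def run_episode(env, agent):
    state = env.reset()                                  # observe the initial state
    total_reward = 0.0
    done = False
    while not done:
        action = agent.select_action(state)              # pick an action from the policy
        next_state, reward, done = env.step(action)      # environment returns new state + reward
        agent.update(state, action, reward, next_state)  # improve future decisions
        total_reward += reward
        state = next_state
    return total_reward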
Reinforcement Learning
• Popular RL Algorithms:
• Q-Learning
• Deep Q-Network (DQN)
• SARSA
• Policy Gradient Methods
• Proximal Policy Optimization (PPO)
• A3C (Asynchronous Advantage Actor-Critic)
• Challenges in RL:
• Exploration vs. Exploitation
• Sparse or Delayed Rewards
• High-dimensional State Spaces
• Sample Inefficiency
Reinforcement Learning
• Learning Task
• A learning task defines:
• What the agent is supposed to learn, and how success is measured.
• In Reinforcement Learning, a learning task involves the agent learning how to act to
maximize rewards through interactions with the environment.

• Components of a Learning Task in RL:


• Objective – Maximize long-term rewards (a.k.a. the return).
• Environment – The world or simulator the agent interacts with.
• Agent – The learner that improves over time.
• Performance Metric – Usually cumulative reward.
• Feedback Type – Reward signals (positive/negative reinforcement).
Reinforcement Learning
• RL-Specific Learning Tasks
Task Type | Description | Example
Prediction | Estimate value functions (e.g., how good is a state?) | Estimate the value of being in a room
Control | Find the best policy to maximize reward | Learn how to win a game
Exploration vs. Exploitation | Balance trying new actions vs. using known good ones | Try new moves in chess or stick to the winning ones
Reinforcement Learning
• Example Learning Task
• A self-driving car (agent) learns to drive safely and quickly (goal) by receiving +10 for
reaching destination, −100 for accidents, and −1 per second of delay (reward
signals).
• Over time, it learns a policy to drive efficiently, and the learning task is complete.
• A self-driving car is learning to drive from point A to point B. It:
• Receives +10 for reaching the destination
• Receives −100 for a crash
• Receives −1 for every second it takes to reach the destination
• The goal is to learn the best policy (strategy) to maximize total rewards.
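• As a rough sketch, this reward signal could be encoded as follows (the event names are hypothetical, chosen only for this example):

# Per-step reward for the self-driving example: -1 each second of delay,
# +10 on reaching the destination, -100 on a crash (event names are made up).
def step_reward(event):
    if event == "reached_destination":
        return 10
    if event == "crash":
        return -100
    return -1  # one more second of delay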
Reinforcement Learning
• Solution with Q-Learning as an example
• The simple environment
• States (S)
• S0: Starting point
• S1: Turn Left
• S2: Turn Right
• S3: Obstacle (crash)
• S4: Destination
• Actions (A)
• A0: Move Forward
• A1: Turn Left
• A2: Turn Right
Reinforcement Learning
• Initialize Q-Table
• The Q-table is initialized with zeros:
State | Action 0 (FWD) | Action 1 (Left) | Action 2 (Right)
S0 | 0 | 0 | 0
S1 | 0 | 0 | 0
S2 | 0 | 0 | 0
S3 | 0 | 0 | 0
S4 | - | - | -
Reinforcement Learning
• Example Walkthrough
• One episode: the agent goes S0 → (Left, reward −1) → S1 → (Forward, reward +10) → S4 (destination).
Reinforcement Learning
• Updated Q-table (after one episode)
State | FWD (A0) | Left (A1) | Right (A2)
S0 | 0 | −0.5 | 0
S1 | 5 | 0 | 0

• Now the agent learns that:


• Going Left from S0 isn't too bad (−0.5),
• Going Forward from S1 to reach destination is very good (+5.0 expected return).
• Total Return (Reward):
• For this trajectory:
• Total Return = −1 (S0 → S1) + 10 (S1 → S4) = 9
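• These values can be reproduced with a few lines of Python (a sketch; α = 0.5 and γ = 0.9 are assumed here because they match the numbers in the table, the slides do not state them for this example):

# One episode of Q-updates for the toy self-driving environment.
alpha, gamma = 0.5, 0.9
states, actions = ["S0", "S1", "S2", "S3", "S4"], ["FWD", "Left", "Right"]
Q = {(s, a): 0.0 for s in states for a in actions}

def q_update(s, a, r, s_next):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

q_update("S0", "Left", -1, "S1")   # S0 -> S1, reward -1   => Q(S0, Left) = -0.5
q_update("S1", "FWD", 10, "S4")    # S1 -> S4, reward +10  => Q(S1, FWD)  = 5.0
print(Q[("S0", "Left")], Q[("S1", "FWD")])   # -0.5 5.0; total return = -1 + 10 = 9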
Reinforcement Learning
• Example: Personalized News Recommendation System
• Scenario:
• Imagine an online news platform (like Google News or Flipboard).
• It wants to recommend articles to users based on their interests, in real-time.
• The goal is to keep the user engaged.
• Reinforcement Learning in Action
Component | RL Equivalent
User’s current behavior (e.g., scroll, click, time spent) | State (S)
Recommended article options | Actions (A)
User clicks or ignores | Reward (R)
Agent (the recommender) | Learns a policy (π) to recommend better next time
Reinforcement Learning
• Walk through a scenario:
• User profile: Interested in sports and politics.
• Agent recommends:
• A1 = Political article
• A2 = Sports article
• A3 = Tech article
• Suppose the agent picks:
• A3: Tech article → user scrolls past it → Reward = 0
• Then it tries:
• A2: Sports article → user clicks and reads it → Reward = +1
• The RL system updates its policy to favor sports recommendations for this user.
Reinforcement Learning
• What’s the Agent Learning?
• The agent is learning a policy
• π(state) → best action (article) to recommend that maximizes user engagement over
time.
• It can use
• Q-learning
• Bandit algorithms (for simpler cases)
• Contextual bandits
• Deep RL for large-scale recommendation systems (like YouTube or TikTok)
• Long-Term Objective
• The agent is rewarded not just for one good click, but for maximizing total
engagement across a session (e.g., 10 minutes of reading).
• So it's not just “what’s best now?” but:
• “What action now leads to the most reward over time?”
Reinforcement Learning
• Real-World Usage
Company | Application
YouTube | Suggesting videos based on watch behavior
Netflix | Personalizing movie/TV show recommendations
Amazon | Product recommendations
Spotify | Song playlists and Discover Weekly

• Setup
• Our agent (news recommender) can show one article at a time from 3 categories:
• A1 = Politics
• A2 = Sports
• A3 = Tech
• The agent is trying to maximize user engagement, measured by the following rewards:
User Behavior | Reward
Scrolls past article | 0
Clicks article | 1
Reads full article | 2
Reinforcement Learning
• Learning Over Time (Q-learning)
• Each interaction updates the Q-value: Q(s, a) ← Q(s, a) + α [ r + γ · max_a′ Q(s′, a′) − Q(s, a) ]
Reinforcement Learning
• Step-by-Step Learning
• Step 1: Agent recommends a Tech article (A3)
• User scrolls past → Reward = 0
• Current state = S = SportsReader
• Next state = S' = NoEngagement
• Q(S, A3) = 0, and max Q(S′) = 0
• Q(S, A3) = 0 + 0.5 · [0 + 0.9 · 0 − 0] = 0
• No learning happens here (bad choice, no reward)
• Step 2: Agent recommends a Sports article (A2)
• User clicks → Reward = +1
• Next state = S' = Clicked
• Q(S, A2) = 0, and assume max Q(S′) = 0
• Q(S, A2) = 0 + 0.5 · [1 + 0.9 · 0 − 0] = 0.5
• Agent learns that sports articles might be good for this user!
Reinforcement Learning
• Step 3: Agent recommends a Politics article (A1)
• User reads full article → Reward = +2
• Q(S, A1) = 0, and again max Q(S′) = 0
• Q(S, A1) = 0 + 0.5 · [2 + 0.9 · 0 − 0] = 1.0
• Wow! The user really liked politics — stronger positive feedback.
• Q-table After Learning (State: SportsReader)
Action | Q-value
A1 (Politics) | 1.0
A2 (Sports) | 0.5
A3 (Tech) | 0.0
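• The three updates above can be checked with a short sketch (α = 0.5 and γ = 0.9 as in the steps; the next-state value is taken as 0, as assumed there):

# Reproduce the Q-updates for the state "SportsReader" (alpha = 0.5, gamma = 0.9).
alpha, gamma = 0.5, 0.9
Q = {"A1": 0.0, "A2": 0.0, "A3": 0.0}   # Q-values for this one state

def update(action, reward, max_q_next=0.0):
    Q[action] += alpha * (reward + gamma * max_q_next - Q[action])

update("A3", 0)   # Tech article, scrolled past  -> Q(A3) stays 0.0
update("A2", 1)   # Sports article, clicked      -> Q(A2) = 0.5
update("A1", 2)   # Politics article, read fully -> Q(A1) = 1.0
print(Q)          # {'A1': 1.0, 'A2': 0.5, 'A3': 0.0}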
Reinforcement Learning
• Agent’s Updated Strategy
• Based on the current Q-values, the agent will now prefer
• Politics > Sports > Tech
• It learned from trial and error what type of content this user prefers, just by
interacting and updating its Q-values.
Reinforcement Learning
• Markov Decision Process (MDP)
• Example: Robot Vacuum Cleaner in a Room
• Imagine a robot vacuum cleaner that moves around a small grid room to clean dirt
and avoid walls. It learns the best strategy to clean the entire room efficiently.
• Step 1: Markov Decision Process (MDP) Components
• An MDP is defined by five key elements:

Symbol | Meaning | Our Example
S | Set of states | Each cell in the grid (clean/dirty) + robot's location
A | Set of actions | Move Up, Down, Left, Right, or Clean
P(s′ ∣ s, a) | Transition probability | e.g., P(B ∣ A, Move Right) = 1 when moves are deterministic
R(s, a) | Reward function | +10 for cleaning dirt, -1 for bumping into wall
γ | Discount factor | How much future rewards are valued (e.g., 0.9)
Reinforcement Learning
• Step 2: Example Scenario
• Let’s assume a 2x2 room:
A B
C D
• Cell B is dirty
• Robot starts at A
• Actions and Rewards
• From A, robot can move:
• Right → to B
• Down → to C
• If robot cleans B, gets +10
• If robot tries to go left from A (into a wall), gets -1
Reinforcement Learning
• Step 3: MDP Transition Example
• Let’s define:
• s="Robot at A“
• a="Move Right"
• s′="Robot at B“
• If the transition is deterministic, then:
• P(s′∣s,a)=1and
• R(s,a)=0 (no reward for moving)
• Step 4: Now Add a Reward Step
• Let’s say:
• s="Robot at dirty B"
• a="Clean"
• s′="Robot at clean B"
• Reward R(s,a)=+10
• The agent uses this to learn a policy π(s) → the best action to take in state s.
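• A minimal sketch of this MDP in Python (one possible encoding; the state names and the dictionary layout are illustrative, not from the slides):

# 2x2 robot-vacuum MDP: deterministic transitions, gamma = 0.9.
gamma = 0.9
# (state, action) -> (next state, reward); anything not listed bumps a wall.
transitions = {
    ("A", "Right"):       ("B_dirty", 0),    # moving earns no reward
    ("A", "Down"):        ("C", 0),
    ("B_dirty", "Clean"): ("B_clean", 10),   # +10 for cleaning the dirt
}

def step(state, action):
    return transitions.get((state, action), (state, -1))  # wall bump: stay, -1

print(step("A", "Right"))        # ('B_dirty', 0)   i.e. P(s' given s, a) = 1
print(step("B_dirty", "Clean"))  # ('B_clean', 10)
print(step("A", "Left"))         # ('A', -1)        bumped into the wall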
Reinforcement Learning
• Summary
• In this robot vacuum cleaner example:
MDP Element | Example
States S | Position of robot + status of dirt (dirty/clean)
Actions A | Move Up, Down, Left, Right, Clean
Transitions P | Moving from one cell to another
Rewards R | +10 for cleaning, -1 for bumping into wall
Policy π | "If in B and dirty → Clean. If in A → go Right"
Reinforcement Learning
• Q-Learning – A Model-Free Reinforcement Learning Algorithm
• What is Q-Learning?
• Q-Learning is a model-free, off-policy reinforcement learning algorithm used to
learn the value of actions in states.
• It helps an agent learn the optimal policy — the best action to take in any given state
— by learning the expected rewards.
Reinforcement Learning
• Q-Learning Terminology
Term | Description
State (s) | The current situation or location of the agent
Action (a) | Choices the agent can make
Reward (r) | Feedback received after taking an action
Q(s, a) | Expected value (future reward) of taking action a in state s
α (alpha) | Learning rate: how much new information overrides old information
γ (gamma) | Discount factor: the importance of future rewards
Reinforcement Learning
• Q-Learning Update Formula (function)
• Q(s, a) ← Q(s, a) + α [ r + γ · max_a′ Q(s′, a′) − Q(s, a) ]

Term | Meaning
Q(s, a) | Current Q-value for state s and action a
α (learning rate) | How fast new knowledge replaces old (0 to 1)
r | Immediate reward received after taking action a in state s
γ (discount factor) | Importance of future rewards (0 to 1)
max_a′ Q(s′, a′) | Maximum predicted reward from the next state s′
Reinforcement Learning
• Q-Learning Update Formula (function)
• What Does It Do?
• This function updates the Q-value of a state-action pair based on
• The current Q-value
• The new reward received
• The estimated best future value (lookahead)
• It balances exploration (trying new paths) and exploitation (choosing the best-known path).
• "Update the value of this action by blending the current value with the newly observed
experience.“
• The closer α is to 1, the more you trust the new experience.
• The closer γ is to 1, the more you value long-term reward.
Reinforcement Learning
• Q-Learning Update Formula (function)
• Example
• Let’s say:
• State s = A
• Action a = Right
• Next state s′ = B
• Reward r = 10
• Learning rate α = 0.5
• Discount factor γ = 0.9
• Q(A, Right) = 0
• max_a Q(B, a) = 0
• Then, Q(A, Right) = 0 + 0.5 · (10 + 0.9 · 0 − 0) = 5.0
• If later max_a Q(B, a) = 6, the update becomes:
• Q(A, Right) = 5 + 0.5 · (10 + 0.9 · 6 − 5) = 10.2
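• The same arithmetic as a tiny Python check (a sketch of the update rule only, not a full agent):

# Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
def q_update(q_sa, reward, max_q_next, alpha=0.5, gamma=0.9):
    return q_sa + alpha * (reward + gamma * max_q_next - q_sa)

print(q_update(0.0, 10, 0))            # 5.0   (first visit to A, Right)
print(round(q_update(5.0, 10, 6), 2))  # 10.2  (later, when max Q(B, a) = 6)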
Reinforcement Learning
• Q-Learning Update Formula (function)
• Benefits of Q-Learning
• Learns optimal policies without needing a model of the environment.
• Can handle stochastic (random) environments.
• Works with exploration techniques like ε-greedy to balance trial and error.
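• For reference, a minimal ε-greedy selection rule might look like this (a sketch; the Q-table is assumed to be a dict keyed by (state, action)):

import random

# epsilon-greedy: explore with probability epsilon, otherwise exploit the best Q-value.
def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                            # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit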
Reinforcement Learning
• Understand With a Simple Example
• Imagine a robot in a 2x2 grid, trying to reach a goal at B and learn the best route.
A B
C D

• Start at A
• Goal is B (Reward +10)
• All other moves = 0 reward
• Invalid moves (like moving off-grid) = -1
Reinforcement Learning
• Actions: Up, Down, Left, Right
• Let’s initialize the Q-table with zeros:
Q = {
    ('A', 'Right'): 0, ('A', 'Down'): 0,
    ('B', 'Left'): 0,   # goal state; reaching B yields the reward
    ('C', 'Up'): 0, ('C', 'Right'): 0,
    ('D', 'Left'): 0, ('D', 'Up'): 0,
}
Reinforcement Learning
• Agent’s First Move
• From A, takes action “Right” to B
• s = A, a = Right
• Moves to s' = B, gets r = +10
• Q(A, Right) ← 0 + α [10 + γ · max_a Q(B, a) − 0]
• Assuming:
• α = 0.5
• γ = 0.9
• max_a Q(B, a) = 0 (initially)
• Q(A, Right) = 0 + 0.5 · [10 + 0.9 · 0 − 0] = 5.0
• Q[('A', 'Right')] = 5.0
• Next Move: From D → Up → B
• s = D, a = Up, s′ = B, r = +10
• Q(D, Up) = 0 + 0.5 · [10 + 0.9 · 0 − 0] = 5.0
Reinforcement Learning
• Q-Table Gets Better Over Time
• As the agent explores and updates the Q-values, it gets closer to learning the optimal policy, i.e., the best action in every state.
• Final Learned Behavior
• From A → Right → B
• From C → Right → D → Up → B
• It learns to maximize reward by choosing actions with highest Q-values.
Reinforcement Learning
• Q-Learning Pseudocode
Initialize Q(s, a) arbitrarily for all s ∈ S, a ∈ A
Repeat (for each episode):
    Initialize state s
    Repeat (for each step of the episode):
        Choose action a using a policy derived from Q (e.g., ε-greedy)
        Take action a, observe reward r and next state s'
        Update Q(s, a) using:
            Q(s, a) ← Q(s, a) + α [r + γ · max_a' Q(s', a') − Q(s, a)]
        Set s ← s'
    until s is terminal
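• A runnable Python version of this pseudocode for the 2x2 grid above (a sketch; the ε = 0.2, α = 0.5, γ = 0.9 and 200-episode settings are assumed values, not from the slides):

import random

# Q-learning on the 2x2 grid (A B / C D): reach B (+10), valid moves 0, off-grid -1.
alpha, gamma, epsilon = 0.5, 0.9, 0.2
actions = ["Up", "Down", "Left", "Right"]
moves = {  # valid moves only; anything else falls off the grid
    ("A", "Right"): "B", ("A", "Down"): "C",
    ("C", "Up"): "A",    ("C", "Right"): "D",
    ("D", "Left"): "C",  ("D", "Up"): "B",
}
Q = {(s, a): 0.0 for s in "ABCD" for a in actions}

def step(s, a):
    if (s, a) in moves:
        s_next = moves[(s, a)]
        return s_next, (10 if s_next == "B" else 0)
    return s, -1  # invalid move: stay in place, -1

for episode in range(200):
    s = random.choice(["A", "C", "D"])               # initialize state s
    while s != "B":                                  # B is the terminal (goal) state
        if random.random() < epsilon:                # epsilon-greedy action choice
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: Q[(s, x)])
        s_next, r = step(s, a)
        best_next = max(Q[(s_next, x)] for x in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next                                   # set s <- s'

print(max(actions, key=lambda a: Q[("A", a)]))  # expect 'Right' (A -> B)
print(max(actions, key=lambda a: Q[("C", a)]))  # expect 'Up' or 'Right' (two steps to B)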