RL Learning Journey - January 2025
Reinforcement Learning
Learning Through Trial, Error, and Rewards
The Big Idea: RL is how we teach agents to make sequential decisions
by learning from consequences. No labeled data needed - just rewards
and punishments! Think of it as training a dog, playing chess, or
learning to ride a bike.
1. What Makes RL Different?
Unlike supervised learning (learning from examples) or unsupervised
learning (finding patterns), RL learns from interaction with an environment.
Key Distinction: In RL, there's no supervisor, only a reward signal.
The agent must discover which actions yield the most reward through
trial and error!
The RL Loop:

                action (At)
    Agent  ------------------>  Environment
      ^                              |
      |   next state (St+1),         |
      |   reward (Rt)                |
      +------------------------------+
2. Core Components
Essential Elements:
Agent: The learner/decision maker (e.g., robot, game player)
Environment: Everything the agent interacts with
State (S): Current situation/observation
Action (A): What the agent can do
Reward (R): Immediate feedback signal
Policy (π): Agent's behavior strategy
Value Function (V): Expected future rewards
3. Markov Decision Process (MDP)
Foundation of RL: MDPs provide the mathematical framework. The
Markov property states that the future depends only on the current
state, not on the history of how we got there!
MDP = ⟨S, A, P, R, γ⟩
  S: State space
  A: Action space
  P: Transition probability P(s'|s,a)
  R: Reward function R(s,a,s')
  γ: Discount factor (0 ≤ γ ≤ 1)
Why Discount Factor (γ)?
γ = 0: Only care about immediate rewards (myopic)
γ = 1: Future rewards equally important (far-sighted)
0 < γ < 1: Balance between immediate and future (typical: 0.9-0.99)
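To see what γ does concretely, here is a tiny Python sketch (the reward sequence is purely illustrative) that computes the discounted return G = Σ_t γ^t r_t under different discount factors:

rewards = [1.0, 1.0, 1.0, 10.0]   # illustrative reward sequence

def discounted_return(rewards, gamma):
    # G = Σ_t γ^t * r_t
    return sum(gamma**t * r for t, r in enumerate(rewards))

for gamma in (0.0, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
# γ = 0 sees only the first reward; γ = 1 counts the final +10 at full weight.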
4. The Bellman Equations
The Bellman equations are the heart of RL - they express the relationship
between the value of a state and the values of its successor states!
State Value Function (for a policy π):
  V^π(s) = E[Rt+1 + γ V^π(St+1) | St = s]
Optimal Action Value Function (Q-function):
  Q*(s,a) = E[Rt+1 + γ max_a' Q*(St+1, a') | St = s, At = a]
Optimal Policy:
  π*(s) = argmax_a Q*(s,a)
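As a quick sanity check of how these quantities relate in the tabular case, here is a small NumPy sketch (the Q-table values are made up) that recovers V* and π* from a given Q*:

import numpy as np

Q = np.array([[0.0, 1.0],
              [0.5, 0.2]])    # illustrative 2-state, 2-action Q* table

V = Q.max(axis=1)             # V*(s) = max_a Q*(s,a)
pi = Q.argmax(axis=1)         # π*(s) = argmax_a Q*(s,a)
print(V, pi)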
5. Exploration vs Exploitation
The Central Dilemma: Should I stick with what I know works (exploit)
or try something new that might be better (explore)?
Exploration Strategies:
ε-greedy: Random action with probability ε
Boltzmann/Softmax: Probabilistic based on Q-values
UCB (Upper Confidence Bound): Optimism in face of uncertainty
Thompson Sampling: Bayesian approach
ε-greedy Implementation:
if random() < epsilon:
    action = random_action()      # Explore
else:
    action = argmax(Q[state])     # Exploit
Start with high ε (0.9) and decay over time!
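A runnable NumPy version of the rule above with an exponential decay schedule (the constants are just illustrative, and Q is assumed to be a |S| × |A| table):

import numpy as np

def epsilon_greedy(Q, state, epsilon, rng=np.random.default_rng()):
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: random action
    return int(np.argmax(Q[state]))            # exploit: best known action

# Decay schedule: start exploratory, become greedy over time
epsilon, eps_min, eps_decay = 0.9, 0.05, 0.995
for episode in range(1000):
    # ... run one episode, picking actions with epsilon_greedy(Q, s, epsilon) ...
    epsilon = max(eps_min, epsilon * eps_decay)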
6. Dynamic Programming Methods
When We Know the Model:
Policy Iteration:
1. Policy Evaluation: Compute V^π
2. Policy Improvement: Greedy w.r.t V^π
3. Repeat until convergence
Value Iteration:
V(s) ← max_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
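A minimal tabular value-iteration sketch, assuming the dynamics are known and stored as NumPy arrays P (shape S×A×S, transition probabilities) and R (shape S×A×S, rewards):

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q(s,a) = Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # value and greedy policy
        V = V_new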
7. Model-Free Methods
A. Monte Carlo Methods
Learn from complete episodes - wait until end to update!
First-Visit MC:
1. Generate episode following π
2. For each state s appearing in episode:
     G ← return following first occurrence of s
     N(s) ← N(s) + 1
     V(s) ← V(s) + (1/N(s)) [G - V(s)]
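A sketch of first-visit MC prediction in Python, assuming each episode is a list of (state, reward) pairs collected while following π:

from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    V, N = defaultdict(float), defaultdict(int)
    for episode in episodes:
        G, first_returns = 0.0, {}
        # Walk backwards so G accumulates the return from each timestep;
        # the last write for a state corresponds to its FIRST visit.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_returns[state] = G
        for state, G in first_returns.items():
            N[state] += 1
            V[state] += (G - V[state]) / N[state]
    return V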
B. Temporal Difference Learning
Learn from incomplete episodes - update after each step!
TD(0) Update:  V(s) ← V(s) + α [r + γV(s') - V(s)]
The bracketed term r + γV(s') - V(s) is the TD error; α is the learning rate.
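The same update as a one-line Python function, assuming a NumPy (or dict-like) value table V indexed by state:

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    td_error = r + gamma * V[s_next] - V[s]   # r + γV(s') - V(s)
    V[s] += alpha * td_error
    return td_error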
8. Q-Learning - The Classic
Off-Policy TD Control: Q-Learning learns the optimal action-value
function while following a different, exploratory policy! It's the
foundation of many modern RL algorithms.
Q-Learning Algorithm:
Initialize Q(s,a) arbitrarily
For each episode:
    Initialize s
    For each step:
        Choose a from s using ε-greedy
        Take action a, observe r, s'
        Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]
        s ← s'
9. SARSA - The Safe Alternative
On-Policy TD Control:
Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') - Q(s,a)]
where a' is the action actually taken in s' (not the max).
SARSA is more conservative - it learns the value of the policy it actually follows!
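To make the off-policy vs. on-policy difference concrete, here are the two tabular updates side by side (assuming a NumPy Q-table; a2 is the action the behaviour policy actually picks in s2):

import numpy as np

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the best next action, max_a' Q(s',a')
    Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action actually taken, Q(s',a')
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])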
10. Deep Reinforcement Learning
When state/action spaces are huge, we need function approximation.
Enter neural networks!
DQN (Deep Q-Networks) - Atari Breakthrough:
Experience Replay: Store transitions, sample randomly
Target Network: Separate network for stable targets
CNN for Vision: Process raw pixels
DQN Loss:  L(θ) = E[(r + γ max_a' Q(s',a'; θ⁻) - Q(s,a; θ))²]
  θ:  main network parameters
  θ⁻: target network parameters (updated slowly)
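A minimal PyTorch-style sketch of that loss, assuming hypothetical q_net and target_net modules that map a batch of states to per-action Q-values, plus a replay-buffer batch of tensors:

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    # Q(s,a; θ) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # max_a' Q(s',a'; θ⁻) from the slowly-updated target network
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1 - dones) * max_next_q
    return F.mse_loss(q_sa, target)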
11. Policy Gradient Methods
Direct Optimization: Instead of learning values, directly optimize the
policy parameters!
REINFORCE Algorithm:
1. Sample a trajectory τ ~ πθ
2. Calculate the return G(τ)
3. Update: θ ← θ + α Σ_t ∇θ log πθ(at|st) G(τ)
Intuition: Increase probability of good actions!
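A compact REINFORCE update sketch in PyTorch, assuming a hypothetical policy module that outputs action logits and a single sampled trajectory of states, actions, and rewards:

import torch

def reinforce_update(policy, optimizer, states, actions, rewards, gamma=0.99):
    # Discounted return G_t for every timestep (computed back to front)
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    log_probs = torch.log_softmax(policy(states), dim=1)       # log πθ(a|s)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Gradient ascent on Σ_t log πθ(a_t|s_t) G_t  (minimize the negative)
    loss = -(chosen * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()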
Actor-Critic Methods:
Actor: Policy network (selects actions)
Critic: Value network (evaluates actions)
Reduces variance compared to REINFORCE
12. Advanced Algorithms
Algorithm   Type             Key Innovation
A3C         Actor-Critic     Asynchronous parallel training
PPO         Policy Gradient  Clipped objective for stability
SAC         Actor-Critic     Maximum entropy for exploration
TD3         Actor-Critic     Twin critics, delayed updates
Rainbow     Value-based      Combines 7 DQN improvements
13. Multi-Armed Bandits
Simplified RL Problem:
No states, just actions and rewards - like slot machines!
Applications:
A/B testing
Ad selection
Clinical trials
Recommendation systems
UCB Formula:
A_t = argmax_a [Q_t(a) + c√(ln t / N_t(a))]
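A minimal UCB1 bandit simulation, assuming Bernoulli arms with made-up success probabilities; Q_t(a) and N_t(a) are tracked incrementally:

import numpy as np

def ucb_bandit(probs, steps=1000, c=2.0, rng=np.random.default_rng(0)):
    k = len(probs)
    Q, N = np.zeros(k), np.zeros(k)
    for t in range(1, steps + 1):
        if t <= k:
            a = t - 1                                     # play each arm once first
        else:
            a = int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
        r = float(rng.random() < probs[a])                # Bernoulli reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                         # incremental mean
    return Q, N

Q, N = ucb_bandit([0.2, 0.5, 0.7])   # the best arm should collect most pulls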
14. Practical Implementation Tips
Hard-Won Wisdom:
Start simple - tabular methods before deep RL
Reward shaping is crucial but dangerous
Normalize observations and rewards
Use multiple random seeds
Monitor exploration rate
Visualize learned behavior regularly
15. Common Challenges
RL is Hard Because:
Sample Efficiency: Needs lots of interactions
Stability: Training can diverge easily
Reward Design: Wrong rewards = wrong behavior
Partial Observability: Hidden state information
Non-stationarity: Environment changes
Credit Assignment: Which action caused reward?
16. OpenAI Gym Example
Basic Q-Learning with Gym:
import gym
import numpy as np

env = gym.make('FrozenLake-v1')
Q = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
alpha = 0.8      # learning rate
gamma = 0.95     # discount factor
epsilon = 0.1    # exploration rate

# Note: this uses the classic gym API (reset() -> obs, step() -> 4-tuple);
# gymnasium / gym >= 0.26 instead return (obs, info) and a 5-tuple.
for episode in range(2000):
    state = env.reset()
    done = False
    while not done:
        # ε-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        # Take action
        next_state, reward, done, _ = env.step(action)
        # Q-learning update
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
17. Real-World Applications
Domain          Application                    RL Approach
Gaming          AlphaGo, OpenAI Five           Self-play, tree search + NN
Robotics        Manipulation, walking          Sim-to-real, model-based
Finance         Trading, portfolio management  Risk-aware RL
Healthcare      Treatment planning             Safe RL, offline RL
Energy          Data center cooling            Model-free control
Transportation  Traffic control                Multi-agent RL
18. Inverse RL & Imitation Learning
Learning from Demonstrations:
Behavioral Cloning:
Supervised learning on expert trajectories (see the sketch after this list)
Inverse RL:
Infer reward function from expert behavior
GAIL (Generative Adversarial Imitation Learning):
GAN-style approach to imitation
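Behavioral cloning really is just supervised learning; here is a minimal PyTorch sketch (hypothetical network sizes, discrete actions, and pre-collected expert tensors):

import torch
import torch.nn as nn

def behavioral_cloning(expert_states, expert_actions, obs_dim, n_actions, epochs=50):
    policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                           nn.Linear(64, n_actions))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(epochs):
        logits = policy(expert_states)                    # predict the expert's action
        loss = nn.functional.cross_entropy(logits, expert_actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy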
19. Multi-Agent RL
When Multiple Agents Interact:
Competitive (zero-sum games)
Cooperative (shared rewards)
Mixed (general-sum games)
Challenge: Non-stationarity from other learning agents!
20. Study Resources & Next Steps
My Learning Roadmap:
1. Sutton & Barto book (the bible!)
2. David Silver's course (DeepMind)
3. Implement tabular methods from scratch
4. OpenAI Gym environments
5. Deep RL with Stable Baselines3
6. Read key papers (DQN, A3C, PPO)
7. Build custom environment
8. Participate in competitions
Key Papers to Read:
"Playing Atari with Deep RL" (DQN, 2013)
"Proximal Policy Optimization" (PPO, 2017)
"Soft Actor-Critic" (SAC, 2018)
"AlphaGo Zero" (Self-play, 2017)
Final Thoughts:
RL is challenging but incredibly powerful. It's the closest we have to
general intelligence - learning from interaction, just like humans do. The
key is patience, lots of experiments, and understanding the
fundamentals deeply. Remember: in RL, failure is just another data
point for learning!
"The only real mistake is the one from which we learn nothing" - In RL, every mistake teaches the
agent!