RL Learning Journey - January 2025

Reinforcement Learning

Learning Through Trial, Error, and Rewards

The Big Idea: RL is how we teach agents to make sequential decisions by learning from consequences. No labeled data needed - just rewards and punishments! Think of it as training a dog, playing chess, or learning to ride a bike.

1. What Makes RL Different?

Unlike supervised learning (learning from examples) or unsupervised learning (finding patterns), RL learns from interaction with an environment.

Key Distinction: In RL, there's no supervisor, only a reward signal. The agent must discover which actions yield the most reward through trial and error!

The RL Loop:

    Agent observes State (St)
    Agent --- Action (At) ---> Environment
    Environment --- Reward (Rt), Next State (St+1) ---> Agent
    (repeat, with St+1 becoming the new state)
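A minimal Python sketch of this loop, using a made-up one-state environment and a random placeholder policy just to show the shape of the interaction (both are illustrative, not a real Gym environment):

import random

class CoinFlipEnv:
    """Toy one-state environment, purely for illustration."""
    def reset(self):
        return 0                                  # the single state S0
    def step(self, action):
        reward = 1.0 if action == 1 else 0.0      # action 1 happens to be "good"
        done = random.random() < 0.1              # episode ends ~10% of the time
        return 0, reward, done                    # (next_state, reward, done)

env = CoinFlipEnv()
state = env.reset()
done = False
while not done:
    action = random.choice([0, 1])                # placeholder policy: act randomly
    next_state, reward, done = env.step(action)   # environment responds with Rt, St+1
    state = next_state                            # St+1 becomes the new St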
2. Core Components

Essential Elements:

Agent: The learner/decision maker (e.g., robot, game player)

Environment: Everything the agent interacts with

State (S): Current situation/observation

Action (A): What the agent can do

Reward (R): Immediate feedback signal

Policy (π): Agent's behavior strategy

Value Function (V): Expected future rewards

3. Markov Decision Process (MDP)

Foundation of RL: MDPs provide the mathematical framework. The Markov property states that the future depends only on the present, not the past!

MDP = ⟨S, A, P, R, γ⟩

    S: State space
    A: Action space
    P: Transition probability P(s'|s,a)
    R: Reward function R(s,a,s')
    γ: Discount factor (0 ≤ γ ≤ 1)

Why Discount Factor (γ)?

γ = 0: Only care about immediate rewards (myopic)

γ = 1: Future rewards equally important (far-sighted)

0 < γ < 1: Balance between immediate and future (typical: 0.9-0.99)
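A quick worked example (the numbers are my own, just to show how γ weights the return):

rewards = [1, 1, 1, 1, 1]   # the same reward at each of 5 future steps

def discounted_return(rewards, gamma):
    # G = r1 + γ*r2 + γ²*r3 + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))   # 1.00 -> only the immediate reward counts
print(discounted_return(rewards, 0.9))   # 4.10 -> future matters, but less
print(discounted_return(rewards, 1.0))   # 5.00 -> all rewards weighted equally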

4. The Bellman Equations

The Bellman equations are the heart of RL - they express the relationship between the value of a state and the values of its successor states!

State Value Function:
    V(s) = E[Rt+1 + γ V(St+1) | St = s]

Optimal Action Value Function (Q*):
    Q*(s,a) = E[Rt+1 + γ max_a' Q*(St+1, a') | St = s, At = a]

Optimal Policy:
    π*(s) = argmax_a Q*(s,a)
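A tiny numeric sanity check of a one-step backup (toy numbers of my own; the transition is deterministic, so the expectation drops out):

gamma = 0.9
reward = 2.0        # R(s, a, s')
V_next = 10.0       # current estimate of V(s')

V_s = reward + gamma * V_next   # V(s) = r + γ V(s') for a deterministic transition
print(V_s)                      # 11.0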

5. Exploration vs Exploitation

The Central Dilemma: Should I stick with what I know works (exploit)
or try something new that might be better (explore)?

Exploration Strategies:

ε-greedy: Random action with probability ε

Boltzmann/Softmax: Probabilistic based on Q-values

UCB (Upper Confidence Bound): Optimism in face of uncertainty

Thompson Sampling: Bayesian approach


ε-greedy Implementation:

# assumes Q (table), state, epsilon and n_actions are already defined
import numpy as np

if np.random.random() < epsilon:
    action = np.random.randint(n_actions)   # Explore: random action
else:
    action = int(np.argmax(Q[state]))       # Exploit: best known action

Start with high ε (0.9) and decay over time!
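One common way to do that decay - exponential with a floor - sketched with made-up hyperparameters:

epsilon = 0.9        # start by exploring a lot
epsilon_min = 0.05   # never stop exploring entirely
decay = 0.995        # multiplicative decay per episode

for episode in range(1000):
    # ... run one episode with the current epsilon ...
    epsilon = max(epsilon_min, epsilon * decay)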

6. Dynamic Programming Methods

When We Know the Model:

Policy Iteration:

1. Policy Evaluation: Compute V^π

2. Policy Improvement: Greedy w.r.t V^π

3. Repeat until convergence

Value Iteration:

    V(s) ← max_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
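A compact value-iteration sketch for a small tabular model; the P/R array layout and the dummy model are my assumptions, not a standard API:

import numpy as np

# Assumed model layout: P[s, a, s'] = transition probability, R[s, a, s'] = reward.
n_states, n_actions = 4, 2
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)   # dummy uniform transitions
R = np.random.rand(n_states, n_actions, n_states)              # dummy rewards
gamma, theta = 0.95, 1e-6

V = np.zeros(n_states)
while True:
    # Q[s, a] = Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
    Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < theta:   # stop when values have converged
        break
    V = V_new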

7. Model-Free Methods

A. Monte Carlo Methods

Learn from complete episodes - wait until end to update!


First-Visit MC:
1. Generate an episode following π
2. For each state s appearing in the episode:
       G ← return following the first occurrence of s
       N(s) ← N(s) + 1
       V(s) ← V(s) + 1/N(s) [G - V(s)]
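A rough Python version of the same procedure; the episode format (a list of (state, reward) pairs, where the reward is the one received after leaving that state) is my assumption:

from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    V = defaultdict(float)   # state -> value estimate
    N = defaultdict(int)     # state -> number of first visits
    for episode in episodes:
        # Compute returns backwards: G_t = r_{t+1} + γ G_{t+1}
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state in seen:                       # first-visit: only the first occurrence counts
                continue
            seen.add(state)
            N[state] += 1
            V[state] += (G - V[state]) / N[state]   # incremental mean of returns
    return V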

B. Temporal Difference Learning

Learn from incomplete episodes - update after each step!

TD(0) Update:

    V(s) ← V(s) + α [r + γ V(s') - V(s)]
                    └───── TD error ─────┘

    where α is the learning rate
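The same update as a couple of lines of Python (the dict-based V table and the example transition are my choices):

alpha, gamma = 0.1, 0.99
V = {}                           # state -> value estimate
s, r, s_next = 'A', 1.0, 'B'     # one observed transition (illustrative)

td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
V[s] = V.get(s, 0.0) + alpha * td_error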

8. Q-Learning - The Classic

Off-Policy TD Control: Q-Learning learns the optimal policy while following a different behavior policy (e.g., ε-greedy)! It's the foundation of many modern RL algorithms.

Q-Learning Algorithm:

    Initialize Q(s,a) arbitrarily
    For each episode:
        Initialize s
        For each step:
            Choose a from s using ε-greedy
            Take action a, observe r, s'
            Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]
            s ← s'

9. SARSA - The Safe Alternative


On-Policy TD Control:

    Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') - Q(s,a)]

    where a' is the action actually taken in s' (not the max)

SARSA is more conservative - it learns the value of the policy it actually follows!
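Put side by side with Q-learning, the only difference is which next-action value gets bootstrapped (a sketch with a dummy Q-table and an example transition of my own):

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99
s, a, r, s_next, a_next = 0, 1, 1.0, 2, 0   # (s, a, r, s', a') with a' chosen ε-greedily

q_learning_target = r + gamma * np.max(Q[s_next])    # off-policy: best next action
sarsa_target      = r + gamma * Q[s_next, a_next]    # on-policy: action actually taken
Q[s, a] += alpha * (sarsa_target - Q[s, a])          # SARSA update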

10. Deep Reinforcement Learning

When state/action spaces are huge, we need function approximation. Enter neural networks!

DQN (Deep Q-Networks) - Atari Breakthrough:

Experience Replay: Store transitions, sample randomly

Target Network: Separate network for stable targets

CNN for Vision: Process raw pixels

DQN Loss:

    L = E[(r + γ max_a' Q(s',a'; θ⁻) - Q(s,a; θ))²]

    θ:  main network parameters
    θ⁻: target network parameters (updated slowly)
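A hedged PyTorch sketch of just this loss computation; the tiny linear "networks" and the fake replay batch are placeholders to show the shapes, not a full DQN:

import torch
import torch.nn as nn

n_states, n_actions, batch = 4, 2, 32
q_net = nn.Linear(n_states, n_actions)            # stand-in for the real Q-network (θ)
target_net = nn.Linear(n_states, n_actions)       # target network (θ⁻)
target_net.load_state_dict(q_net.state_dict())    # θ⁻ starts as a copy of θ

# Fake replay-buffer batch (s, a, r, s', done), just to show the tensor shapes.
s    = torch.randn(batch, n_states)
a    = torch.randint(n_actions, (batch, 1))
r    = torch.randn(batch)
s2   = torch.randn(batch, n_states)
done = torch.zeros(batch)

gamma = 0.99
q_sa = q_net(s).gather(1, a).squeeze(1)           # Q(s, a; θ)
with torch.no_grad():                             # targets come from the frozen network
    target = r + gamma * (1 - done) * target_net(s2).max(1).values
loss = nn.functional.mse_loss(q_sa, target)       # (r + γ max_a' Q(s',a'; θ⁻) - Q(s,a; θ))²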

11. Policy Gradient Methods

Direct Optimization: Instead of learning values, directly optimize the policy parameters!

REINFORCE Algorithm:

1. Sample trajectory τ ~ πθ
2. Calculate return G(τ)
3. Update: θ ← θ + α Σ_t ∇θ log πθ(at|st) × G(τ)

Intuition: Increase probability of good actions!
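A minimal PyTorch-style sketch of that update for a single trajectory; the tiny policy network and the fake trajectory data are illustrative assumptions:

import torch
import torch.nn as nn
from torch.distributions import Categorical

n_states, n_actions, gamma = 4, 2, 0.99
policy = nn.Linear(n_states, n_actions)            # outputs logits for a categorical policy πθ
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# Fake trajectory: states visited, actions taken, rewards received (illustrative).
states  = torch.randn(5, n_states)
actions = torch.randint(n_actions, (5,))
rewards = [1.0, 0.0, 1.0, 0.0, 1.0]

# Returns G_t computed backwards.
returns, G = [], 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.insert(0, G)
returns = torch.tensor(returns)

log_probs = Categorical(logits=policy(states)).log_prob(actions)
loss = -(log_probs * returns).sum()   # gradient ascent on Σ_t log πθ(a_t|s_t) G_t
optimizer.zero_grad()
loss.backward()
optimizer.step()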

Actor-Critic Methods:

Actor: Policy network (selects actions)

Critic: Value network (evaluates actions)

Reduces variance compared to REINFORCE

12. Advanced Algorithms

Algorithm   Type              Key Innovation

A3C         Actor-Critic      Asynchronous parallel training
PPO         Policy Gradient   Clipped objective for stability
SAC         Actor-Critic      Maximum entropy for exploration
TD3         Actor-Critic      Twin critics, delayed updates
Rainbow     Value-based       Combines 7 DQN improvements

13. Multi-Armed Bandits


Simplified RL Problem:

No states, just actions and rewards - like slot machines!

Applications:

A/B testing

Ad selection

Clinical trials

Recommendation systems

UCB Formula:

    A_t = argmax_a [Q_t(a) + c √(ln t / N_t(a))]
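A small numpy sketch of UCB on a simulated Bernoulli bandit (the arm reward probabilities are made up):

import numpy as np

true_probs = [0.2, 0.5, 0.7]   # hidden reward probability of each arm (made up)
n_arms, c = len(true_probs), 2.0
Q = np.zeros(n_arms)           # estimated value per arm
N = np.zeros(n_arms)           # pull counts per arm

for t in range(1, 1001):
    if 0 in N:                                               # pull each arm once first
        a = int(np.argmin(N))
    else:
        a = int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))   # UCB action selection
    reward = float(np.random.random() < true_probs[a])
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]                           # incremental mean update

print(N)   # most pulls should end up on the best arm (index 2)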

14. Practical Implementation Tips

Hard-Won Wisdom:

Start simple - tabular methods before deep RL

Reward shaping is crucial but dangerous

Normalize observations and rewards

Use multiple random seeds

Monitor exploration rate

Visualize learned behavior regularly

15. Common Challenges


RL is Hard Because:

Sample Efficiency: Needs lots of interactions

Stability: Training can diverge easily

Reward Design: Wrong rewards = wrong behavior

Partial Observability: Hidden state information

Non-stationarity: Environment changes

Credit Assignment: Which action caused reward?

16. OpenAI Gym Example

Basic Q-Learning with Gym:

import gym
import numpy as np

# Note: this uses the classic gym API; in gym >= 0.26 / gymnasium,
# reset() returns (obs, info) and step() returns (obs, reward, terminated, truncated, info).
env = gym.make('FrozenLake-v1')
Q = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
α = 0.8   # learning rate
γ = 0.95  # discount factor
ε = 0.1   # exploration rate

for episode in range(2000):
    state = env.reset()
    done = False

    while not done:
        # ε-greedy action selection
        if np.random.random() < ε:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])

        # Take action
        next_state, reward, done, _ = env.step(action)

        # Q-learning update
        Q[state, action] = Q[state, action] + α * (
            reward + γ * np.max(Q[next_state]) - Q[state, action]
        )

        state = next_state

17. Real-World Applications

Domain           Application                      RL Approach

Gaming           AlphaGo, OpenAI Five             Self-play, tree search + NN
Robotics         Manipulation, walking            Sim-to-real, model-based
Finance          Trading, portfolio management    Risk-aware RL
Healthcare       Treatment planning               Safe RL, offline RL
Energy           Data center cooling              Model-free control
Transportation   Traffic control                  Multi-agent RL

18. Inverse RL & Imitation Learning


Learning from Demonstrations:

Behavioral Cloning:

Supervised learning on expert trajectories

Inverse RL:

Infer reward function from expert behavior

GAIL (Generative Adversarial Imitation Learning):

GAN-style approach to imitation

19. Multi-Agent RL

When Multiple Agents Interact:

Competitive (zero-sum games)

Cooperative (shared rewards)

Mixed (general-sum games)

Challenge: Non-stationarity from other learning agents!

20. Study Resources & Next Steps

My Learning Roadmap:

1. Sutton & Barto book (the bible!)

2. David Silver's course (DeepMind)

3. Implement tabular methods from scratch


4. OpenAI Gym environments

5. Deep RL with Stable Baselines3

6. Read key papers (DQN, A3C, PPO)

7. Build custom environment

8. Participate in competitions

Key Papers to Read:

"Playing Atari with Deep RL" (DQN, 2013)

"Proximal Policy Optimization" (PPO, 2017)

"Soft Actor-Critic" (SAC, 2018)

"AlphaGo Zero" (Self-play, 2017)

Final Thoughts:
RL is challenging but incredibly powerful. It's the closest we have to
general intelligence - learning from interaction, just like humans do. The
key is patience, lots of experiments, and understanding the
fundamentals deeply. Remember: in RL, failure is just another data
point for learning!

"The only real mistake is the one from which we learn nothing" - In RL, every mistake teaches the
agent!
