RL Learning Journey - January 2025
Reinforcement Learning
Learning Through Trial, Error, and Rewards
The Big Idea: RL is how we teach agents to make sequential decisions
by learning from consequences. No labeled data needed - just rewards
and punishments! Think of it as training a dog, playing chess, or
learning to ride a bike.
1. What Makes RL Different?
Unlike supervised learning (learning from examples) or unsupervised
learning (finding patterns), RL learns from interaction with an environment.
Key Distinction: In RL, there's no supervisor, only a reward signal.
The agent must discover which actions yield the most reward through
trial and error!
The RL Loop:

                action (At)
    Agent  ------------------>  Environment
      ^                              |
      |   next state (St+1),         |
      |   reward (Rt)                |
      +------------------------------+
2. Core Components
Essential Elements:
Agent: The learner/decision maker (e.g., robot, game player)
Environment: Everything the agent interacts with
State (S): Current situation/observation
Action (A): What the agent can do
Reward (R): Immediate feedback signal
Policy (π): Agent's behavior strategy
Value Function (V): Expected future rewards
3. Markov Decision Process (MDP)
Foundation of RL: MDPs provide the mathematical framework. The
Markov property states that the future depends only on the current
state, not on the history of how we got there!
MDP = ⟨S, A, P, R, γ⟩
  S: State space
  A: Action space
  P: Transition probability P(s'|s,a)
  R: Reward function R(s,a,s')
  γ: Discount factor (0 ≤ γ ≤ 1)
Why Discount Factor (γ)?
γ = 0: Only care about immediate rewards (myopic)
γ = 1: Future rewards equally important (far-sighted)
0 < γ < 1: Balance between immediate and future (typical: 0.9-0.99)
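To see what γ does concretely, here is a tiny Python sketch (the reward sequence is purely illustrative) that computes the discounted return G = Σ_t γ^t r_t under different discount factors:

rewards = [1.0, 1.0, 1.0, 10.0]   # illustrative reward sequence

def discounted_return(rewards, gamma):
    # G = Σ_t γ^t * r_t
    return sum(gamma**t * r for t, r in enumerate(rewards))

for gamma in (0.0, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
# γ = 0 sees only the first reward; γ = 1 counts the final +10 at full weight.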
4. The Bellman Equations
The Bellman equations are the heart of RL - they express the relationship
between the value of a state and the values of its successor states!
State Value Function (for a policy π):
  V^π(s) = E[Rt+1 + γ V^π(St+1) | St = s]
Optimal Action Value Function (Q-function):
  Q*(s,a) = E[Rt+1 + γ max_a' Q*(St+1, a') | St = s, At = a]
Optimal Policy:
  π*(s) = argmax_a Q*(s,a)
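As a quick sanity check of how these quantities relate in the tabular case, here is a small NumPy sketch (the Q-table values are made up) that recovers V* and π* from a given Q*:

import numpy as np

Q = np.array([[0.0, 1.0],
              [0.5, 0.2]])    # illustrative 2-state, 2-action Q* table

V = Q.max(axis=1)             # V*(s) = max_a Q*(s,a)
pi = Q.argmax(axis=1)         # π*(s) = argmax_a Q*(s,a)
print(V, pi)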
5. Exploration vs Exploitation
The Central Dilemma: Should I stick with what I know works (exploit)
or try something new that might be better (explore)?
Exploration Strategies:
ε-greedy: Random action with probability ε
Boltzmann/Softmax: Probabilistic based on Q-values
UCB (Upper Confidence Bound): Optimism in face of uncertainty
Thompson Sampling: Bayesian approach
ε-greedy Implementation:
if random() < epsilon:
    action = random_action()      # Explore
else:
    action = argmax(Q[state])     # Exploit
Start with high ε (0.9) and decay over time!
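A runnable NumPy version of the rule above with an exponential decay schedule (the constants are just illustrative, and Q is assumed to be a |S| × |A| table):

import numpy as np

def epsilon_greedy(Q, state, epsilon, rng=np.random.default_rng()):
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: random action
    return int(np.argmax(Q[state]))            # exploit: best known action

# Decay schedule: start exploratory, become greedy over time
epsilon, eps_min, eps_decay = 0.9, 0.05, 0.995
for episode in range(1000):
    # ... run one episode, picking actions with epsilon_greedy(Q, s, epsilon) ...
    epsilon = max(eps_min, epsilon * eps_decay)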
6. Dynamic Programming Methods
When We Know the Model:
Policy Iteration:
1. Policy Evaluation: Compute V^π
2. Policy Improvement: Greedy w.r.t V^π
3. Repeat until convergence
Value Iteration:
V(s) ← max_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
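A minimal tabular value-iteration sketch, assuming the dynamics are known and stored as NumPy arrays P (shape S×A×S, transition probabilities) and R (shape S×A×S, rewards):

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q(s,a) = Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # value and greedy policy
        V = V_new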
7. Model-Free Methods
A. Monte Carlo Methods
Learn from complete episodes - wait until end to update!
First-Visit MC:
1. Generate episode following π
2. For each state s appearing in episode:
     G ← return following first occurrence of s
     N(s) ← N(s) + 1
     V(s) ← V(s) + (1/N(s)) [G - V(s)]
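A sketch of first-visit MC prediction in Python, assuming each episode is a list of (state, reward) pairs collected while following π:

from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    V, N = defaultdict(float), defaultdict(int)
    for episode in episodes:
        G, first_returns = 0.0, {}
        # Walk backwards so G accumulates the return from each timestep;
        # the last write for a state corresponds to its FIRST visit.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_returns[state] = G
        for state, G in first_returns.items():
            N[state] += 1
            V[state] += (G - V[state]) / N[state]
    return V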
B. Temporal Difference Learning
Learn from incomplete episodes - update after each step!
TD(0) Update:  V(s) ← V(s) + α [r + γV(s') - V(s)]
The bracketed term r + γV(s') - V(s) is the TD error; α is the learning rate.
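The same update as a one-line Python function, assuming a NumPy (or dict-like) value table V indexed by state:

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    td_error = r + gamma * V[s_next] - V[s]   # r + γV(s') - V(s)
    V[s] += alpha * td_error
    return td_error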
8. Q-Learning - The Classic
Off-Policy TD Control: Q-Learning learns the optimal action-value
function while following a different, exploratory policy! It's the
foundation of many modern RL algorithms.
Q-Learning Algorithm:
Initialize Q(s,a) arbitrarily
For each episode:
    Initialize s
    For each step:
        Choose a from s using ε-greedy
        Take action a, observe r, s'
        Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]
        s ← s'
9. SARSA - The Safe Alternative
On-Policy TD Control:
Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') - Q(s,a)]
where a' is the action actually taken in s' (not the max).
SARSA is more conservative - it learns the value of the policy it actually follows!
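To make the off-policy vs. on-policy difference concrete, here are the two tabular updates side by side (assuming a NumPy Q-table; a2 is the action the behaviour policy actually picks in s2):

import numpy as np

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the best next action, max_a' Q(s',a')
    Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action actually taken, Q(s',a')
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])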
10. Deep Reinforcement Learning
When state/action spaces are huge, we need function approximation.
Enter neural networks!
DQN (Deep Q-Networks) - Atari Breakthrough:
Experience Replay: Store transitions, sample randomly
Target Network: Separate network for stable targets
CNN for Vision: Process raw pixels
DQN Loss:  L(θ) = E[(r + γ max_a' Q(s',a'; θ⁻) - Q(s,a; θ))²]
  θ:  main network parameters
  θ⁻: target network parameters (updated slowly)
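A minimal PyTorch-style sketch of that loss, assuming hypothetical q_net and target_net modules that map a batch of states to per-action Q-values, plus a replay-buffer batch of tensors:

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    # Q(s,a; θ) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # max_a' Q(s',a'; θ⁻) from the slowly-updated target network
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1 - dones) * max_next_q
    return F.mse_loss(q_sa, target)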
11. Policy Gradient Methods
Direct Optimization: Instead of learning values, directly optimize the
policy parameters!
REINFORCE Algorithm:
1. Sample a trajectory τ ~ πθ
2. Calculate the return G(τ)
3. Update: θ ← θ + α Σ_t ∇θ log πθ(at|st) G(τ)
Intuition: Increase probability of good actions!
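A compact REINFORCE update sketch in PyTorch, assuming a hypothetical policy module that outputs action logits and a single sampled trajectory of states, actions, and rewards:

import torch

def reinforce_update(policy, optimizer, states, actions, rewards, gamma=0.99):
    # Discounted return G_t for every timestep (computed back to front)
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    log_probs = torch.log_softmax(policy(states), dim=1)       # log πθ(a|s)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Gradient ascent on Σ_t log πθ(a_t|s_t) G_t  (minimize the negative)
    loss = -(chosen * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()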
Actor-Critic Methods:
Actor: Policy network (selects actions)
Critic: Value network (evaluates actions)
Reduces variance compared to REINFORCE
12. Advanced Algorithms
Algorithm   Type             Key Innovation
A3C         Actor-Critic     Asynchronous parallel training
PPO         Policy Gradient  Clipped objective for stability
SAC         Actor-Critic     Maximum entropy for exploration
TD3         Actor-Critic     Twin critics, delayed updates
Rainbow     Value-based      Combines 7 DQN improvements
13. Multi-Armed Bandits
Simplified RL Problem:
No states, just actions and rewards - like slot machines!
Applications:
A/B testing
Ad selection
Clinical trials
Recommendation systems
UCB Formula:
A_t = argmax_a [Q_t(a) + c√(ln t / N_t(a))]
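A minimal UCB1 bandit simulation, assuming Bernoulli arms with made-up success probabilities; Q_t(a) and N_t(a) are tracked incrementally:

import numpy as np

def ucb_bandit(probs, steps=1000, c=2.0, rng=np.random.default_rng(0)):
    k = len(probs)
    Q, N = np.zeros(k), np.zeros(k)
    for t in range(1, steps + 1):
        if t <= k:
            a = t - 1                                     # play each arm once first
        else:
            a = int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
        r = float(rng.random() < probs[a])                # Bernoulli reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                         # incremental mean
    return Q, N

Q, N = ucb_bandit([0.2, 0.5, 0.7])   # the best arm should collect most pulls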
14. Practical Implementation Tips
Hard-Won Wisdom:
Start simple - tabular methods before deep RL
Reward shaping is crucial but dangerous
Normalize observations and rewards
Use multiple random seeds
Monitor exploration rate
Visualize learned behavior regularly
15. Common Challenges
RL is Hard Because:
Sample Efficiency: Needs lots of interactions
Stability: Training can diverge easily
Reward Design: Wrong rewards = wrong behavior
Partial Observability: Hidden state information
Non-stationarity: Environment changes
Credit Assignment: Which action caused reward?
16. OpenAI Gym Example
Basic Q-Learning with Gym:
import gym
import numpy as np

env = gym.make('FrozenLake-v1')
Q = np.zeros([env.observation_space.n, env.action_space.n])

# Hyperparameters
alpha = 0.8      # learning rate
gamma = 0.95     # discount factor
epsilon = 0.1    # exploration rate

# Note: this uses the classic gym API (reset() -> obs, step() -> 4-tuple);
# gymnasium / gym >= 0.26 instead return (obs, info) and a 5-tuple.
for episode in range(2000):
    state = env.reset()
    done = False
    while not done:
        # ε-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        # Take action
        next_state, reward, done, _ = env.step(action)
        # Q-learning update
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
17. Real-World Applications
Domain          Application                    RL Approach
Gaming          AlphaGo, OpenAI Five           Self-play, tree search + NN
Robotics        Manipulation, walking          Sim-to-real, model-based
Finance         Trading, portfolio management  Risk-aware RL
Healthcare      Treatment planning             Safe RL, offline RL
Energy          Data center cooling            Model-free control
Transportation  Traffic control                Multi-agent RL
18. Inverse RL & Imitation Learning
Learning from Demonstrations:
Behavioral Cloning:
Supervised learning on expert trajectories (see the sketch after this list)
Inverse RL:
Infer reward function from expert behavior
GAIL (Generative Adversarial Imitation Learning):
GAN-style approach to imitation
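Behavioral cloning really is just supervised learning; here is a minimal PyTorch sketch (hypothetical network sizes, discrete actions, and pre-collected expert tensors):

import torch
import torch.nn as nn

def behavioral_cloning(expert_states, expert_actions, obs_dim, n_actions, epochs=50):
    policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                           nn.Linear(64, n_actions))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(epochs):
        logits = policy(expert_states)                    # predict the expert's action
        loss = nn.functional.cross_entropy(logits, expert_actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy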
19. Multi-Agent RL
When Multiple Agents Interact:
Competitive (zero-sum games)
Cooperative (shared rewards)
Mixed (general-sum games)
Challenge: Non-stationarity from other learning agents!
20. Study Resources & Next Steps
My Learning Roadmap:
1. Sutton & Barto book (the bible!)
2. David Silver's course (DeepMind)
3. Implement tabular methods from scratch
4. OpenAI Gym environments
5. Deep RL with Stable Baselines3
6. Read key papers (DQN, A3C, PPO)
7. Build custom environment
8. Participate in competitions
Key Papers to Read:
"Playing Atari with Deep RL" (DQN, 2013)
"Proximal Policy Optimization" (PPO, 2017)
"Soft Actor-Critic" (SAC, 2018)
"AlphaGo Zero" (Self-play, 2017)
Final Thoughts:
RL is challenging but incredibly powerful. It's the closest we have to
general intelligence - learning from interaction, just like humans do. The
key is patience, lots of experiments, and understanding the
fundamentals deeply. Remember: in RL, failure is just another data
point for learning!
"The only real mistake is the one from which we learn nothing" - In RL, every mistake teaches the
agent!