BAI701 Module 5 Notes
MODULE - 5: NOTES
Name of the Faculty : Dr. Pradeep N.R Designation : Associate Professor
Subject name : Deep Learning & Reinforcement Learning    Subject code : BAI701
Department : AI&DS Semester : VII
CIE Marks : 50 SEE Marks : 50
Teaching Hrs/Week (L:T:P:S) : (3:0:2:0) Total Marks : 100
Total Hours of Pedagogy : 40 Credits : 04
Syllabus:
Content (Theory / Mathematics / Numerical):
• Reinforcement Learning: Characteristics of reinforcement learning; Algorithms: Value-Based, Policy-Based, Model-Based; Positive vs. Negative Reinforcement Learning
1.1 DEEP REINFORCEMENT LEARNING – INTRODUCTION:
Agent: An entity that can perceive/explore the environment and act upon it.
Environment: The situation in which the agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
Action: Actions are the moves taken by the agent within the environment.
State: A state is the situation returned by the environment after each action taken by the agent.
Reward: Feedback returned to the agent from the environment to evaluate the agent's action.
Policy: A policy is the strategy applied by the agent to decide the next action based on the current state.
Value: The expected long-term return with the discount factor, as opposed to the short-term reward.
Q-value: Mostly similar to the value, but it takes one additional parameter, the current action (a).
In RL, the agent is not instructed about the environment or which actions to take.
It is based on a trial-and-error process.
The agent takes the next action and changes state according to the feedback from the previous action.
The agent may get a delayed reward.
The environment is stochastic, and the agent needs to explore it to obtain the maximum positive reward.
There are mainly three ways to implement reinforcement-learning in ML, which are:
Value-based: The value-based approach aims to find the optimal value function, which gives the maximum value at a state under any policy. The agent therefore expects the long-term return at any state s under policy π.
Policy-based: The policy-based approach finds the optimal policy for the maximum future reward without using a value function. In this approach, the agent tries to apply a policy such that the action performed at each step helps to maximize the future reward. The policy-based approach has two main types of policy:
Deterministic: The same action is produced by the policy (π) at any given state.
Stochastic: Probability determines the action produced.
Model-based: In the model-based approach, a virtual model is created for the environment, and the agent explores that environment to learn it. There is no particular solution or algorithm for this approach, because the model representation differs for each environment.
There are four main elements of Reinforcement Learning, which are given below: Policy, Reward
Signal, Value Function, Model of the environment
Policy: A policy defines the way an agent behaves at a given time. It maps the perceived states of the environment to the actions to be taken in those states. The policy is the core element of RL, as it alone can define the behavior of the agent. In some cases it may be a simple function or a lookup table, whereas in other cases it may involve general computation such as a search process. A policy can be deterministic or stochastic:
For a deterministic policy: a = π(s)
For a stochastic policy: π(a | s) = P[At = a | St = s]
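A small illustration of the two policy types (the states, actions, and probabilities below are invented for illustration and are not part of the maze example used later):

import random

# Hypothetical states and actions, for illustration only.
ACTIONS = ["up", "down", "left", "right"]

# Deterministic policy: a = pi(s), one fixed action per state.
deterministic_policy = {"S9": "up", "S5": "up", "S1": "right"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: pi(a | s), a probability for each action in a state.
stochastic_policy = {
    "S9": {"up": 0.7, "down": 0.1, "left": 0.1, "right": 0.1},
    "S5": {"up": 0.6, "down": 0.1, "left": 0.1, "right": 0.2},
}

def act_stochastic(state):
    probs = stochastic_policy[state]
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(act_deterministic("S9"))  # always "up"
print(act_stochastic("S9"))     # "up" about 70% of the time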
Reward Signal: The goal of reinforcement learning is defined by the reward signal. At each state, the environment sends an immediate signal to the learning agent, known as the reward signal. These rewards are given according to the good and bad actions taken by the agent. The agent's main objective is to maximize the total reward received for good actions. The reward signal can change the policy; for example, if an action selected by the agent leads to a low reward, the policy may change to select other actions in the future.
Value Function: The value function gives information about how good the situation and
action are and how much reward an agent can expect. A reward indicates the immediate
signal for each good and bad action, whereas a value function specifies the good state
and action for the future. The value function depends on the reward as, without reward,
there could be no value. The goal of estimating values is to achieve more rewards.
Model: The last element of reinforcement learning is the model, which mimics the behaviour
of the environment. With the help of the model, one can make inferences about how the
environment will behave. For example, if a state and an action are given, the model can predict the next state and reward.
The model is used for planning, which means it provides a way to take a course of action by
considering all future situations before actually experiencing those situations.
Approaches that solve RL problems with the help of a model are termed model-based approaches. In contrast, an approach that does not use a model is called a model-free approach.
6. How does Reinforcement Learning Work?
To understand the working process of the RL, we need to consider two main things:
Environment: It can be anything such as a room, maze, football ground, etc.
Agent: An intelligent agent, such as an AI robot.
Let's take an example of a maze environment that the agent needs to explore. Consider the below
image:
In the above image, the agent is at the very first block of the maze. The maze consists of an S6 block, which is a wall, an S8 block, which is a fire pit, and an S4 block, which is a diamond.
The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take four actions: move up, move down, move left, and move right.
The agent can take any path to reach the final point, but it needs to do so in as few steps as possible.
Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1 reward.
The agent will try to remember the preceding steps it has taken to reach the final step. To memorize the steps, it assigns a value of 1 to each previous step. Consider the step below:
Now, the agent has successfully stored the previous steps by assigning the value 1 to each previous block. But what will the agent do if it starts moving from a block that has blocks with value 1 on both sides? Consider the diagram below:
It will be difficult for the agent to decide whether to go up or down, since each block has the same value. So the above approach is not suitable for the agent to reach the destination. Hence, to solve the problem, we use the Bellman equation, which is the main concept behind reinforcement learning.
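For reference, a standard form of the Bellman equation for this kind of deterministic maze is:

V(s) = max_a [ R(s, a) + γ V(s') ]

where R(s, a) is the reward for taking action a in state s, s' is the resulting next state, and γ is the discount factor. Filling each block with the value given by this equation, instead of a flat value of 1, removes the tie described above, because blocks closer to the diamond receive higher values.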
Positive Reinforcement: Positive reinforcement means adding something to increase the tendency that the expected behavior will occur again. It impacts the behavior of the agent positively and increases the strength of the behavior. This type of reinforcement can sustain the changes for a long time, but too much positive reinforcement may lead to an overload of states, which can diminish the results.
Advantages:
Maximizes performance
Sustains change for a long period of time
Disadvantage: too much reinforcement can lead to an overload of states, which can diminish the results.
Negative Reinforcement: Negative reinforcement is the opposite of positive reinforcement: it increases the tendency that a specific behavior will occur again by avoiding a negative condition. It can be more effective than positive reinforcement, depending on the situation and behavior, but it provides reinforcement only to meet the minimum behavior. Advantages:
Increases the desired behavior
Enforces a minimum standard of performance
Disadvantage: it only provides enough to meet the minimum behavior.
We can represent the agent's state using a Markov state that contains all the required information from the history. The state St is a Markov state if it satisfies the following condition:
P[St+1 | St] = P[St+1 | S1, ..., St]
The Markov state follows the Markov property, which says that the future is independent of the past and can be defined using only the present. RL works on fully observable environments, where the agent can observe the environment and act to reach the new state. The complete process is known as the Markov Decision Process, which is explained below:
9. Explain Markov Decision Process
A Markov Decision Process, or MDP, is used to formalize reinforcement learning problems. If the environment is completely observable, then its dynamics can be modeled as a Markov process. In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.
MDP is used to describe the environment for RL, and almost all RL problems can be formalized using an MDP.
An MDP is a tuple of four elements (S, A, Pa, Ra):
A set of finite states S
A set of finite actions A
Ra: the reward received after transitioning from state S to state S' due to action a
Pa: the probability of transitioning from state S to state S' due to action a
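As a toy illustration (the states, transition probabilities, and rewards below are invented and are not part of the syllabus example), the four elements of an MDP can be written out explicitly in Python:

# A two-state toy MDP, for illustration only.
S = ["s1", "s2"]          # finite set of states
A = ["a1", "a2"]          # finite set of actions

# Pa: transition probabilities, P[(s, a)] -> {s': probability}
P = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 0.9, "s2": 0.1},
    ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a2"): {"s1": 0.0, "s2": 1.0},
}

# Ra: reward for the transition (s, a, s'); unlisted transitions give 0.
R = {
    ("s1", "a1", "s2"): 1.0,
    ("s1", "a2", "s1"): -0.1,
}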
MDP uses the Markov property, and to better understand MDP, we need to learn about it. Markov Property: It says that "if the agent is present in the current state s1, performs an action a1, and moves to the state s2, then the state transition from s1 to s2 depends only on the current state and action; it does not depend on past actions, rewards, or states." In other words, as per the Markov property, the current state transition does not depend on any past action or state. Hence, an MDP is an RL problem that satisfies the Markov property. For example, in a chess game, the players focus only on the current state and do not need to remember past actions or states.
Finite MDP: A finite MDP is one in which the sets of states, rewards, and actions are all finite. In RL, we consider only finite MDPs.
Markov Process:
A Markov process is a memoryless process with a sequence of random states S1, S2, ..., St that satisfies the Markov property. A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state space S and a transition function P. These two components (S and P) define the dynamics of the system.
Q-learning is an off-policy RL algorithm used for temporal difference learning. Temporal difference learning methods compare temporally successive predictions.
It learns the value function Q(s, a), which indicates how good it is to take action a in a particular state s.
The flowchart below explains the working of Q-learning:
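The central step captured in that flowchart is the temporal-difference update of the Q-table. A minimal tabular sketch is given below; the environment interface (env.reset() and env.step()) and the hyperparameter values are assumptions used only for illustration:

import random
from collections import defaultdict

def q_learning(env, actions=(0, 1, 2, 3), episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular, off-policy Q-learning with an epsilon-greedy behaviour policy.
    `env` is assumed to expose reset() -> state and step(a) -> (next_state, reward, done)."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (exploration vs. exploitation).
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Temporal-difference update:
            # Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

            state = next_state
    return Q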
11. Difference between Reinforcement Learning and Supervised Learning.
Reinforcement learning and supervised learning are both part of machine learning, but the two types of learning are quite different from each other. RL agents interact with the environment, explore it, take actions, and get rewarded, whereas supervised learning algorithms learn from a labeled dataset and, on the basis of the training, predict the output. The difference table between RL and supervised learning is given below:
Stateless algorithms in Multi-Armed Bandits (MAB) attempt to balance exploration (trying new arms)
and exploitation (choosing the best-known arm) without maintaining complex state or long-term
memory, aside from basic statistics like averages.
These algorithms are widely used in:
Online advertising
Recommendation systems
Clinical trials
Parameter tuning
Adaptive routing
2.1 Naive Algorithm (Pure Exploitation)
Concept
The Naive (or Greedy-0) algorithm always selects the arm with the highest estimated reward.
There is no exploration.
Steps
1. Initialize counts and rewards for all arms.
2. Play each arm once (optional).
3. At each step, select the arm with the current highest mean reward
4. Update that arm's mean reward after receiving the new reward.
Advantages
a. Simple and fast.
b. Good when environment is deterministic or rewards are stable.
Disadvantages
a. No exploration → stuck with initial bad choices.
b. Poor long-term performance.
c. Sensitive to noise.
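A minimal sketch of the naive algorithm described above is shown below; the Bernoulli reward probabilities are made-up values used only so that there is something to simulate:

import random

def naive_bandit(true_probs, steps=1000):
    """Pure exploitation: after one initial pull per arm, always play the arm
    with the highest estimated mean reward (no exploration)."""
    n_arms = len(true_probs)
    counts = [0] * n_arms    # number of pulls per arm
    means = [0.0] * n_arms   # running mean reward per arm

    def pull(arm):           # simulated Bernoulli reward for this arm
        return 1.0 if random.random() < true_probs[arm] else 0.0

    for arm in range(n_arms):                              # step 2: play each arm once
        counts[arm] = 1
        means[arm] = pull(arm)

    for _ in range(steps - n_arms):
        arm = max(range(n_arms), key=lambda a: means[a])   # step 3: greedy choice
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]       # step 4: update that arm's mean
    return means, counts

# A single unlucky first pull can lock the agent onto a bad arm, which is
# exactly the "no exploration" disadvantage listed above.
print(naive_bandit([0.3, 0.5, 0.7]))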
2.5 Summary
Naive Algorithm is simple but unreliable.
ε-Greedy introduces random exploration — practical and easy.
UCB1 provides a mathematically optimal balance — preferred for scalable, real-time systems.
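For reference, the UCB1 rule mentioned above is usually stated as follows: after n total plays, choose the arm i that maximizes

mean_i + sqrt( 2 ln n / n_i )

where mean_i is the current average reward of arm i and n_i is the number of times arm i has been played. The second term is an exploration bonus that shrinks as an arm is played more often, which is what gives UCB1 its principled balance between exploration and exploitation.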
Safety and Risk Constraints: Unsafe exploration can be costly or harmful in real-world applications.
Computational Complexity: Training advanced RL algorithms demands substantial GPU resources and time.
Generalization Issues: Agents often overfit to training environments and fail to perform well in unseen scenarios.
3.2.1 Role of Deep Learning in Reinforcement Learning
Deep Learning (DL) plays a crucial role in enhancing Reinforcement Learning (RL) by allowing agents to
handle complex, high-dimensional environments that traditional RL methods cannot manage efficiently.
a. Motivation
Traditional RL methods rely on tabular representations of state and action values, which become infeasible
when the state or action space is large or continuous. Deep Learning provides function approximation
capabilities to generalize across unseen states.
b. Integration of DL with RL
RL Component | DL Contribution
Value Function Approximation | Deep Neural Networks (DNNs) approximate value functions such as Q(s, a) (as in Deep Q-Networks).
Policy Representation | Policies are represented using deep networks that map states to actions (e.g., in Policy Gradient or Actor-Critic methods).
Feature Extraction | DL automatically extracts spatial or temporal features from raw input (e.g., images, video frames, or audio signals).
Scalability | DL enables RL to scale to complex tasks such as playing Atari games, robotics, and autonomous driving.
c. Example – Deep Q-Network (DQN)
Combines Q-learning with Convolutional Neural Networks (CNNs). Uses Experience Replay (to stabilize
learning) and Target Networks (to avoid divergence). Enabled RL agents to outperform humans in many
Atari 2600 games.
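A compact sketch of those two ideas (experience replay and a target network) is given below, using PyTorch and a small fully connected network in place of the CNN used for Atari; the layer sizes and hyperparameters are illustrative assumptions, not the published DQN settings:

import random
from collections import deque

import torch
import torch.nn as nn

def make_q_net(n_states=4, n_actions=2):
    # Small MLP stand-in for the CNN that DQN uses on raw Atari frames.
    return nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net = make_q_net()
target_net = make_q_net()
target_net.load_state_dict(q_net.state_dict())   # target network starts as a copy

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10000)                      # experience replay buffer
gamma = 0.99

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)     # random sampling breaks temporal correlations
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    s, s2, r, done = s.float(), s2.float(), r.float(), done.float()

    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                         # frozen target network stabilises the bootstrap target
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically copy the online network into the target network:
# target_net.load_state_dict(q_net.state_dict())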
d. Advantages
a. Handles raw, unstructured data (images, speech, text).
b. Learns non-linear relationships between states and actions.
c. Improves generalization to similar but unseen environments.
e. Challenges
a. High computational cost and training instability.
b. Lack of interpretability (black-box nature).
c. Data inefficiency and convergence issues in continuous or sparse environments.
3.2.2 Straw-Man Algorithm in Reinforcement Learning
A straw-man algorithm refers to a simple baseline or reference model used to highlight limitations or
compare the performance of more sophisticated RL algorithms.
a. Purpose
a. Serves as a benchmark to evaluate improvement over naïve methods.
b. Helps identify specific challenges or weaknesses in RL systems.
c. Provides an interpretable and low-complexity starting point for experimentation.
b. Typical Straw-Man Approaches
Algorithm | Description | Limitations (Challenge Representation)
Random Policy | Chooses actions randomly without learning. | Demonstrates the inefficiency of uninformed exploration.
Greedy Policy | Always selects the action with the highest immediate reward. | Fails to balance exploration and exploitation; may get stuck in local optima.
Naïve Q-Learning | Learns Q-values without stabilization techniques (no replay buffer or target network). | Unstable and prone to divergence, showing the need for Deep RL enhancements.
c. Why “Straw-Man”?
It acts as a conceptual baseline and is not intended to be optimal. It is used to stress-test RL frameworks under simplified assumptions and provides insight into how Deep Learning improvements overcome basic RL challenges (instability, sparse rewards, exploration inefficiency).
3.2.3 Challenges Highlighted by Straw-Man Algorithms
Challenge | Description
Exploration vs. Exploitation | Random and greedy agents illustrate the difficulty of balancing discovery and performance.
Sample Inefficiency | Simple algorithms require excessive episodes to learn effective policies.
Instability | Without DL-based stabilization (like target networks), learning can diverge.
Sparse Rewards | Straw-man agents struggle in environments with delayed or infrequent rewards.
Generalization | They fail to perform well in unseen environments, showing the need for deep function approximators.
4. Key Insight
Deep Learning gives RL the ability to generalize and scale beyond small state spaces, while Straw-Man
Algorithms expose fundamental RL weaknesses (instability, inefficiency, lack of exploration), motivating the
design of Deep RL architectures.
5. Conclusion
The integration of Deep Learning into Reinforcement Learning transforms simple, unstable algorithms into
powerful agents capable of handling complex, real-world environments. By comparing performance against
straw-man baselines, researchers can measure progress, identify bottlenecks, and design more robust, data-
efficient, and interpretable RL algorithms.
4.1 Self-Learning Robots: Deep Learning of Locomotion Skills
a. Introduction
Robotic locomotion — the ability of robots to move autonomously across varied terrains — is a fundamental
challenge in AI. Deep Reinforcement Learning (DRL) enables robots to learn locomotion behaviors
autonomously through experience, without explicit programming of motion dynamics.
b. Motivation
The goal is to create adaptive and robust movement strategies across complex environments. Challenges
include high-dimensional control, continuous action spaces, stability, and sim-to-real transfer.
c. Core Concepts and Framework
An RL setup involves an agent (robot), environment, state (sensory inputs), actions (torques/joint
movements), and reward (movement efficiency, stability).
d. Methodology
Algorithms such as DDPG, PPO, SAC, and TRPO are used. These methods train policy and value networks via
gradient updates to maximize rewards.
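As an example of the kind of gradient update these methods perform, PPO optimizes a clipped surrogate objective of the form

L_CLIP(θ) = E_t [ min( r_t(θ) A_t , clip( r_t(θ), 1 - ε, 1 + ε ) A_t ) ]

where r_t(θ) is the probability ratio between the new and old policies, A_t is the advantage estimate, and ε is a small clipping constant (e.g., 0.2). Clipping keeps each policy update close to the previous policy, which is one reason PPO is a popular choice for locomotion training.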
e. Example Case Study
OpenAI’s quadruped robot learned to walk using PPO. Training in simulation led to emergent gait patterns
similar to animals, guided by reward functions emphasizing distance and stability.
f. Sim-to-Real Transfer
Domain randomization and fine-tuning bridge simulation and real-world execution.
g. Key Achievements
Autonomous walking, running, and adaptation to uneven terrains; robust recovery after disturbances.
h. Challenges
High computational cost, reward shaping complexity, and real-world safety issues.
i. Impact
Showcased RL’s power in robotics, inspiring progress in exoskeletons, drones, and autonomous navigation.
4.2 Deep Learning of Visuomotor Skills
b. Motivation
End-to-end learning enables robots to interpret visual data and act accordingly, improving adaptability in
unstructured settings.
c. Core Framework
Input: camera images; CNNs extract features; Policy Network maps visuals to motor actions; Reward:
success of task (e.g., grasping).
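A minimal sketch of such a visuomotor policy network is shown below (PyTorch; the image size, number of motor outputs, and layer sizes are illustrative assumptions, not the architecture of any specific published system):

import torch
import torch.nn as nn

class VisuomotorPolicy(nn.Module):
    """Maps a raw camera image directly to continuous motor commands."""

    def __init__(self, n_motors=7):
        super().__init__()
        # Convolutional feature extractor for raw pixels.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
        )
        # Fully connected head producing one command per motor/joint.
        self.head = nn.Sequential(
            nn.Linear(32 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, n_motors), nn.Tanh(),  # commands bounded to [-1, 1]
        )

    def forward(self, image):
        return self.head(self.features(image))

# One fake 64x64 RGB camera frame -> 7 motor commands.
policy = VisuomotorPolicy()
print(policy(torch.rand(1, 3, 64, 64)).shape)     # torch.Size([1, 7])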
d. Example Case Study
Google’s robotic arm (Levine et al., 2016) learned grasping using 800,000 attempts with CNN-based policies,
achieving robust performance without explicit programming.
e. Techniques and Architectures
End-to-end learning, DQN/Actor-Critic for control, autoencoders for compression, and imitation+RL hybrid
learning.
f. Achievements
Robots learned to grasp unseen objects, self-calibrate, and adapt visually without 3D modeling.
g. Challenges
Data inefficiency, poor visual generalization, training instability, and sim-to-real transfer issues.
h. Impact
Enabled perception-driven robotics for manipulation, visual navigation, and industrial automation.
4.3 Comparative Summary
Aspect | Self-Learning Locomotion | Deep Visuomotor Skills
Input Type | Sensor readings | Camera images
Output | Motion commands | Motor actions from vision
Learning Algorithm | PPO, DDPG, TRPO | DQN, Actor-Critic, Imitation+RL
Goal | Efficient locomotion | Visual control actions
Environment | Physical terrain | Visual workspace
Challenges | Stability, sim-to-real | Data efficiency, generalization
Impact | Humanoid and legged robotics | Manipulation and visual autonomy
5.1 AlphaGo: Championship-Level Play at Go
a. Introduction
AlphaGo, developed by DeepMind (a subsidiary of Google), represents a landmark achievement in artificial
intelligence and reinforcement learning. It became the first computer program to defeat a professional
human Go player, later surpassing the world champion. Go is an ancient board game with more possible
positions than atoms in the universe — making traditional search-based approaches infeasible.
b. Background and Motivation
Game Complexity: Go has a state space of approximately 10¹⁷⁰ possible positions, vastly larger than chess (~10⁴⁷).
Traditional AI Limitations: Rule-based and brute-force search methods (like Alpha-Beta pruning) used in
chess programs failed to scale.
Goal: Combine Deep Learning (DL) and Reinforcement Learning (RL) to learn value functions and policies
that approximate human-level intuition.
c. Architecture and Methodology
Component | Description
Policy Network | A deep neural network trained to predict expert moves from human games. Used to narrow the search space.
Value Network | Estimates the probability of winning from any given board position.
Monte Carlo Tree Search (MCTS) | Combines the outputs of the policy and value networks to explore future game states effectively.
Reinforcement Learning (Self-Play) | AlphaGo played millions of games against itself, improving the policy iteratively via gradient updates (Policy Gradient Methods).
d. Training Phases
Supervised Learning Phase – Trained on 30 million human expert moves, achieving ~57% accuracy in move
prediction.
Reinforcement Learning Phase – Improved through self-play using policy gradient reinforcement learning to
optimize move selection.
Value Network Training – Using self-play games to estimate the expected outcome (win/loss).
Integration with MCTS – Balanced exploration and exploitation through tree-based rollouts guided by the
learned policy.
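For reference, in the published description of AlphaGo's search, each MCTS simulation selects the move that maximizes an action value plus an exploration bonus driven by the policy network's prior:

a_t = argmax_a [ Q(s_t, a) + u(s_t, a) ],   with   u(s, a) ∝ P(s, a) / (1 + N(s, a))

where Q(s, a) is the action value estimated from search, P(s, a) is the prior probability assigned by the policy network, and N(s, a) is the visit count of that move. This is how the learned networks guide the tree-based rollouts used to balance exploration and exploitation.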
e. Achievements
2015: Defeated European Champion Fan Hui (5–0).
2016: Defeated World Champion Lee Sedol (4–1).
2017: AlphaGo Master defeated world No. 1 Ke Jie in three consecutive matches.
f. Key Innovations
Combination of Deep Neural Networks with Tree Search.
End-to-end learning from raw Go board representations (19×19 grid).
Effective use of self-play to exceed human knowledge.
g. Challenges Faced
Enormous computational demands (initially trained using hundreds of GPUs).
High sensitivity to hyperparameters in RL.
Limited interpretability of neural decisions.
h. Impact
AlphaGo demonstrated that Deep Reinforcement Learning can master complex, intuitive, and creative tasks
beyond symbolic AI’s reach. It opened new research directions in generalized learning, strategic planning,
and policy optimization.
5.2 AlphaZero
a. Introduction
AlphaZero, also developed by DeepMind (2017), was the next evolution of AlphaGo. Unlike AlphaGo, which
relied on human data and domain-specific heuristics, AlphaZero started with zero human knowledge —
learning purely through self-play reinforcement learning.
b. Motivation
The goal was to generalize AlphaGo’s architecture to any two-player perfect-information game (like chess,
shogi, and Go) using a single unified algorithm, eliminating the need for human demonstrations or
handcrafted features.
d. Training Process
Self-Play – The agent plays against itself to generate experience data.
Monte Carlo Tree Search (MCTS) – Guided by the current policy and value estimates.
Network Update – Using collected data to improve policy and value prediction.
Iteration – The new network replaces the old one if it performs better.
e. Key Improvements over AlphaGo
Aspect | AlphaGo | AlphaZero
Human Data Usage | Used expert game datasets. | Learned purely through self-play.
Game Specificity | Designed for Go. | Applicable to Go, Chess, and Shogi.
Architecture | Separate Policy and Value Networks. | Unified Policy-Value Network.
Learning Paradigm | Supervised + RL. | Pure RL (self-play).
Efficiency | Extensive computation required. | More sample efficient.
f. Achievements
a. Defeated Stockfish (Chess) and Elmo (Shogi) within 24 hours of self-training.
b. Achieved superhuman performance in Go, surpassing AlphaGo Zero.
c. Unified algorithm proved domain-agnostic intelligence.
g. Challenges
High computational cost (requires TPUs/GPUs).
Lack of interpretability in neural strategies.
Sparse rewards and slow convergence.
Difficulty scaling to stochastic or imperfect-information environments.
h. Comparative Summary
Aspect | AlphaGo | AlphaZero
Learning Approach | Human data + RL self-play | Pure RL self-play
Architecture | Separate Policy & Value Networks | Unified Policy-Value Network
Human Involvement | High (supervised pre-training) | None (zero human knowledge)
Game Coverage | Go only | Go, Chess, Shogi
Algorithmic Innovation | Deep RL + MCTS | Generalized Deep RL + MCTS
Outcome | Beat World Champion | Beat top engines in multiple domains