BAI701 Module 5 Notes
MODULE - 5: NOTES
Name of the Faculty : Dr. Pradeep N.R Designation : Associate Professor
Subject name : Deep Learning & Reinforcement Learning    Subject code : BAI701
Department : AI&DS Semester : VII
CIE Marks : 50 SEE Marks : 50
Teaching Hrs/Week (L:T:P:S) : (3:0:2:0) Total Marks : 100
Total Hours of Pedagogy : 40 Credits : 04
Syllabus:
Content (Theory / Mathematics / Numerical):
• Reinforcement Learning: Characteristics of reinforcement learning; Algorithms: Value-Based, Policy-Based, Model-Based; Positive vs. Negative Reinforcement Learning
1.1 DEEP REINFORCEMENT LEARNING – INTRODUCTION:
Agent: An entity that can perceive/explore the environment and act upon it.
Environment: The situation in which the agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
Action: Actions are the moves taken by the agent within the environment.
State: A state is the situation returned by the environment after each action taken by the agent.
Reward: Feedback returned to the agent from the environment to evaluate the agent's action.
Policy: A policy is the strategy applied by the agent to decide the next action based on the current state.
Value: The expected long-term return with the discount factor, as opposed to the short-term reward.
Q-value: Mostly similar to the value, but it takes one additional parameter, the current action (a).
In RL, the agent is not instructed about the environment or which actions to take.
It is based on a trial-and-error process.
The agent takes the next action and changes state according to the feedback from the previous action.
The agent may get a delayed reward.
The environment is stochastic, and the agent needs to explore it to obtain the maximum positive reward.
There are mainly three ways to implement reinforcement-learning in ML, which are:
Value-based: The value-based approach aims to find the optimal value function, which gives the maximum value at a state under any policy. The agent therefore expects the long-term return at any state s under policy π.
Policy-based: The policy-based approach finds the optimal policy for the maximum future reward without using a value function. In this approach, the agent tries to apply a policy such that the action performed at each step helps to maximize the future reward. The policy-based approach has two main types of policy:
Deterministic: The same action is produced by the policy (π) at any given state.
Stochastic: Probability determines the action produced.
Model-based: In the model-based approach, a virtual model is created for the environment, and the agent explores that environment to learn it. There is no particular solution or algorithm for this approach, because the model representation differs for each environment.
There are four main elements of Reinforcement Learning, which are given below: Policy, Reward
Signal, Value Function, Model of the environment
Policy: A policy defines the way an agent behaves at a given time. It maps the perceived states of the environment to the actions to be taken in those states. The policy is the core element of RL, as it alone can define the behavior of the agent. In some cases it may be a simple function or a lookup table, whereas in other cases it may involve general computation such as a search process. A policy can be deterministic or stochastic:
For a deterministic policy: a = π(s)
For a stochastic policy: π(a | s) = P[At = a | St = s]
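A small illustration of the two policy types (the states, actions, and probabilities below are invented for illustration and are not part of the maze example used later):

import random

# Hypothetical states and actions, for illustration only.
ACTIONS = ["up", "down", "left", "right"]

# Deterministic policy: a = pi(s), one fixed action per state.
deterministic_policy = {"S9": "up", "S5": "up", "S1": "right"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: pi(a | s), a probability for each action in a state.
stochastic_policy = {
    "S9": {"up": 0.7, "down": 0.1, "left": 0.1, "right": 0.1},
    "S5": {"up": 0.6, "down": 0.1, "left": 0.1, "right": 0.2},
}

def act_stochastic(state):
    probs = stochastic_policy[state]
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(act_deterministic("S9"))  # always "up"
print(act_stochastic("S9"))     # "up" about 70% of the time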
Reward Signal: The goal of reinforcement learning is defined by the reward signal. At each state, the environment sends an immediate signal to the learning agent, known as the reward signal. These rewards are given according to the good and bad actions taken by the agent. The agent's main objective is to maximize the total reward received for good actions. The reward signal can change the policy; for example, if an action selected by the agent leads to a low reward, the policy may change to select other actions in the future.
Value Function: The value function gives information about how good the situation and
action are and how much reward an agent can expect. A reward indicates the immediate
signal for each good and bad action, whereas a value function specifies the good state
and action for the future. The value function depends on the reward as, without reward,
there could be no value. The goal of estimating values is to achieve more rewards.
Model: The last element of reinforcement learning is the model, which mimics the behaviour
of the environment. With the help of the model, one can make inferences about how the
environment will behave. For example, if a state and an action are given, the model can predict the next state and reward.
The model is used for planning, which means it provides a way to take a course of action by
considering all future situations before actually experiencing those situations.
Approaches that solve RL problems with the help of a model are termed model-based approaches. In contrast, an approach that does not use a model is called a model-free approach.
6. How does Reinforcement Learning Work?
To understand the working process of the RL, we need to consider two main things:
Environment: It can be anything such as a room, maze, football ground, etc.
Agent: An intelligent agent, such as an AI robot.
Let's take an example of a maze environment that the agent needs to explore. Consider the below
image:
In the above image, the agent is at the very first block of the maze. The maze consists of an S6 block, which is a wall, an S8 block, which is a fire pit, and an S4 block, which is a diamond.
The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take four actions: move up, move down, move left, and move right.
The agent can take any path to reach the final point, but it needs to do so in as few steps as possible.
Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1 reward.
The agent will try to remember the preceding steps it has taken to reach the final step. To memorize the steps, it assigns a value of 1 to each previous step. Consider the step below:
Now, the agent has successfully stored the previous steps by assigning the value 1 to each previous block. But what will the agent do if it starts moving from a block that has blocks with value 1 on both sides? Consider the diagram below:
It will be difficult for the agent to decide whether to go up or down, since each block has the same value. So the above approach is not suitable for the agent to reach the destination. Hence, to solve the problem, we use the Bellman equation, which is the main concept behind reinforcement learning.
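For reference, a standard form of the Bellman equation for this kind of deterministic maze is:

V(s) = max_a [ R(s, a) + γ V(s') ]

where R(s, a) is the reward for taking action a in state s, s' is the resulting next state, and γ is the discount factor. Filling each block with the value given by this equation, instead of a flat value of 1, removes the tie described above, because blocks closer to the diamond receive higher values.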
Positive Reinforcement: Positive reinforcement means adding something to increase the tendency that the expected behavior will occur again. It impacts the behavior of the agent positively and increases the strength of the behavior. This type of reinforcement can sustain the changes for a long time, but too much positive reinforcement may lead to an overload of states, which can diminish the results.
Advantages:
Maximizes performance
Sustains change for a long period of time
Disadvantage: too much reinforcement can lead to an overload of states, which can diminish the results.
Negative Reinforcement: Negative reinforcement is the opposite of positive reinforcement: it increases the tendency that a specific behavior will occur again by avoiding a negative condition. It can be more effective than positive reinforcement, depending on the situation and behavior, but it provides reinforcement only to meet the minimum behavior. Advantages:
Increases the desired behavior
Enforces a minimum standard of performance
Disadvantage: it only provides enough to meet the minimum behavior.
We can represent the agent's state using a Markov state that contains all the required information from the history. The state St is a Markov state if it satisfies the following condition:
P[St+1 | St] = P[St+1 | S1, ..., St]
The Markov state follows the Markov property, which says that the future is independent of the past and can be defined using only the present. RL works on fully observable environments, where the agent can observe the environment and act to reach the new state. The complete process is known as the Markov Decision Process, which is explained below:
9. Explain Markov Decision Process
A Markov Decision Process, or MDP, is used to formalize reinforcement learning problems. If the environment is completely observable, then its dynamics can be modeled as a Markov process. In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.
MDP is used to describe the environment for RL, and almost all RL problems can be formalized using an MDP.
An MDP is a tuple of four elements (S, A, Pa, Ra):
A set of finite states S
A set of finite actions A
Ra: the reward received after transitioning from state S to state S' due to action a
Pa: the probability of transitioning from state S to state S' due to action a
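As a toy illustration (the states, transition probabilities, and rewards below are invented and are not part of the syllabus example), the four elements of an MDP can be written out explicitly in Python:

# A two-state toy MDP, for illustration only.
S = ["s1", "s2"]          # finite set of states
A = ["a1", "a2"]          # finite set of actions

# Pa: transition probabilities, P[(s, a)] -> {s': probability}
P = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 0.9, "s2": 0.1},
    ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a2"): {"s1": 0.0, "s2": 1.0},
}

# Ra: reward for the transition (s, a, s'); unlisted transitions give 0.
R = {
    ("s1", "a1", "s2"): 1.0,
    ("s1", "a2", "s1"): -0.1,
}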
MDP uses the Markov property, and to better understand MDP, we need to learn about it. Markov Property: It says that "if the agent is present in the current state s1, performs an action a1, and moves to the state s2, then the state transition from s1 to s2 depends only on the current state and action; it does not depend on past actions, rewards, or states." In other words, as per the Markov property, the current state transition does not depend on any past action or state. Hence, an MDP is an RL problem that satisfies the Markov property. For example, in a chess game, the players focus only on the current state and do not need to remember past actions or states.
Finite MDP: A finite MDP is one in which the sets of states, rewards, and actions are all finite. In RL, we consider only finite MDPs.
Markov Process:
A Markov process is a memoryless process with a sequence of random states S1, S2, ..., St that satisfies the Markov property. A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state space S and a transition function P. These two components (S and P) define the dynamics of the system.
Q-learning is an off-policy RL algorithm used for temporal difference learning. Temporal difference learning methods compare temporally successive predictions.
It learns the value function Q(s, a), which indicates how good it is to take action a in a particular state s.
The flowchart below explains the working of Q-learning:
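The central step captured in that flowchart is the temporal-difference update of the Q-table. A minimal tabular sketch is given below; the environment interface (env.reset() and env.step()) and the hyperparameter values are assumptions used only for illustration:

import random
from collections import defaultdict

def q_learning(env, actions=(0, 1, 2, 3), episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular, off-policy Q-learning with an epsilon-greedy behaviour policy.
    `env` is assumed to expose reset() -> state and step(a) -> (next_state, reward, done)."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (exploration vs. exploitation).
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Temporal-difference update:
            # Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

            state = next_state
    return Q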
11. Difference between Reinforcement Learning and Supervised Learning.
Reinforcement learning and supervised learning are both part of machine learning, but the two types of learning are quite different from each other. RL agents interact with the environment, explore it, take actions, and get rewarded, whereas supervised learning algorithms learn from a labeled dataset and, on the basis of the training, predict the output. The difference table between RL and supervised learning is given below:
Stateless algorithms in Multi-Armed Bandits (MAB) attempt to balance exploration (trying new arms)
and exploitation (choosing the best-known arm) without maintaining complex state or long-term
memory, aside from basic statistics like averages.
These algorithms are widely used in:
Online advertising
Recommendation systems
Clinical trials
Parameter tuning
Adaptive routing
2.1 Naive Algorithm (Pure Exploitation)
Concept
The Naive (or Greedy-0) algorithm always selects the arm with the highest estimated reward.
There is no exploration.
Steps
1. Initialize counts and rewards for all arms.
2. Play each arm once (optional).
3. At each step, select the arm with the current highest mean reward
4. Update that arm's mean reward after receiving the new reward.
Advantages
a. Simple and fast.
b. Good when environment is deterministic or rewards are stable.
Disadvantages
a. No exploration → stuck with initial bad choices.
b. Poor long-term performance.
c. Sensitive to noise.
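A minimal sketch of the naive algorithm described above is shown below; the Bernoulli reward probabilities are made-up values used only so that there is something to simulate:

import random

def naive_bandit(true_probs, steps=1000):
    """Pure exploitation: after one initial pull per arm, always play the arm
    with the highest estimated mean reward (no exploration)."""
    n_arms = len(true_probs)
    counts = [0] * n_arms    # number of pulls per arm
    means = [0.0] * n_arms   # running mean reward per arm

    def pull(arm):           # simulated Bernoulli reward for this arm
        return 1.0 if random.random() < true_probs[arm] else 0.0

    for arm in range(n_arms):                              # step 2: play each arm once
        counts[arm] = 1
        means[arm] = pull(arm)

    for _ in range(steps - n_arms):
        arm = max(range(n_arms), key=lambda a: means[a])   # step 3: greedy choice
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]       # step 4: update that arm's mean
    return means, counts

# A single unlucky first pull can lock the agent onto a bad arm, which is
# exactly the "no exploration" disadvantage listed above.
print(naive_bandit([0.3, 0.5, 0.7]))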
2.5 Summary
Naive Algorithm is simple but unreliable.
ε-Greedy introduces random exploration — practical and easy.
UCB1 provides a mathematically optimal balance — preferred for scalable, real-time systems.
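For reference, the UCB1 rule mentioned above is usually stated as follows: after n total plays, choose the arm i that maximizes

mean_i + sqrt( 2 ln n / n_i )

where mean_i is the current average reward of arm i and n_i is the number of times arm i has been played. The second term is an exploration bonus that shrinks as an arm is played more often, which is what gives UCB1 its principled balance between exploration and exploitation.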
Safety and Risk Constraints: Unsafe exploration can be costly or harmful in real-world applications.
Computational Complexity: Training advanced RL algorithms demands substantial GPU resources and time.
Generalization Issues: Agents often overfit to training environments and fail to perform well in unseen scenarios.
3.2.1 Role of Deep Learning in Reinforcement Learning
Deep Learning (DL) plays a crucial role in enhancing Reinforcement Learning (RL) by allowing agents to
handle complex, high-dimensional environments that traditional RL methods cannot manage efficiently.
a. Motivation
Traditional RL methods rely on tabular representations of state and action values, which become infeasible
when the state or action space is large or continuous. Deep Learning provides function approximation
capabilities to generalize across unseen states.
b. Integration of DL with RL
RL Component | DL Contribution
Value Function Approximation | Deep Neural Networks (DNNs) approximate value functions such as Q(s, a) (as in Deep Q-Networks).
Policy Representation | Policies are represented using deep networks that map states to actions (e.g., in Policy Gradient or Actor-Critic methods).
Feature Extraction | DL automatically extracts spatial or temporal features from raw input (e.g., images, video frames, or audio signals).
Scalability | DL enables RL to scale to complex tasks such as playing Atari games, robotics, and autonomous driving.
c. Example – Deep Q-Network (DQN)
Combines Q-learning with Convolutional Neural Networks (CNNs). Uses Experience Replay (to stabilize
learning) and Target Networks (to avoid divergence). Enabled RL agents to outperform humans in many
Atari 2600 games.
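A compact sketch of those two ideas (experience replay and a target network) is given below, using PyTorch and a small fully connected network in place of the CNN used for Atari; the layer sizes and hyperparameters are illustrative assumptions, not the published DQN settings:

import random
from collections import deque

import torch
import torch.nn as nn

def make_q_net(n_states=4, n_actions=2):
    # Small MLP stand-in for the CNN that DQN uses on raw Atari frames.
    return nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net = make_q_net()
target_net = make_q_net()
target_net.load_state_dict(q_net.state_dict())   # target network starts as a copy

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10000)                      # experience replay buffer
gamma = 0.99

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)     # random sampling breaks temporal correlations
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    s, s2, r, done = s.float(), s2.float(), r.float(), done.float()

    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                         # frozen target network stabilises the bootstrap target
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically copy the online network into the target network:
# target_net.load_state_dict(q_net.state_dict())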
d. Advantages
a. Handles raw, unstructured data (images, speech, text).
b. Learns non-linear relationships between states and actions.
c. Improves generalization to similar but unseen environments.
e. Challenges
a. High computational cost and training instability.
b. Lack of interpretability (black-box nature).
c. Data inefficiency and convergence issues in continuous or sparse environments.
3.2.2 Straw-Man Algorithm in Reinforcement Learning
A straw-man algorithm refers to a simple baseline or reference model used to highlight limitations or
compare the performance of more sophisticated RL algorithms.
a. Purpose
a. Serves as a benchmark to evaluate improvement over naïve methods.
b. Helps identify specific challenges or weaknesses in RL systems.
c. Provides an interpretable and low-complexity starting point for experimentation.
b. Typical Straw-Man Approaches
Algorithm | Description | Limitations (Challenge Representation)
Random Policy | Chooses actions randomly without learning. | Demonstrates the inefficiency of uninformed exploration.
Greedy Policy | Always selects the action with the highest immediate reward. | Fails to balance exploration and exploitation; may get stuck in local optima.
Naïve Q-Learning | Learns Q-values without stabilization techniques (no replay buffer or target network). | Unstable and prone to divergence, showing the need for Deep RL enhancements.
c. Why “Straw-Man”?
It acts as a conceptual baseline and is not intended to be optimal. It is used to stress-test RL frameworks under simplified assumptions and provides insight into how Deep Learning improvements overcome basic RL challenges (instability, sparse rewards, exploration inefficiency).
3.2.3 Challenges Highlighted by Straw-Man Algorithms
Challenge | Description
Exploration vs. Exploitation | Random and greedy agents illustrate the difficulty of balancing discovery and performance.
Sample Inefficiency | Simple algorithms require excessive episodes to learn effective policies.
Instability | Without DL-based stabilization (like target networks), learning can diverge.
Sparse Rewards | Straw-man agents struggle in environments with delayed or infrequent rewards.
Generalization | They fail to perform well in unseen environments, showing the need for deep function approximators.
4. Key Insight
Deep Learning gives RL the ability to generalize and scale beyond small state spaces, while Straw-Man
Algorithms expose fundamental RL weaknesses (instability, inefficiency, lack of exploration), motivating the
design of Deep RL architectures.
5. Conclusion
The integration of Deep Learning into Reinforcement Learning transforms simple, unstable algorithms into
powerful agents capable of handling complex, real-world environments. By comparing performance against
straw-man baselines, researchers can measure progress, identify bottlenecks, and design more robust, data-
efficient, and interpretable RL algorithms.
4.1 Self-Learning Robots: Deep Learning of Locomotion Skills
a. Introduction
Robotic locomotion — the ability of robots to move autonomously across varied terrains — is a fundamental
challenge in AI. Deep Reinforcement Learning (DRL) enables robots to learn locomotion behaviors
autonomously through experience, without explicit programming of motion dynamics.
b. Motivation
The goal is to create adaptive and robust movement strategies across complex environments. Challenges
include high-dimensional control, continuous action spaces, stability, and sim-to-real transfer.
c. Core Concepts and Framework
An RL setup involves an agent (robot), environment, state (sensory inputs), actions (torques/joint
movements), and reward (movement efficiency, stability).
d. Methodology
Algorithms such as DDPG, PPO, SAC, and TRPO are used. These methods train policy and value networks via
gradient updates to maximize rewards.
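As an example of the kind of gradient update these methods perform, PPO optimizes a clipped surrogate objective of the form

L_CLIP(θ) = E_t [ min( r_t(θ) A_t , clip( r_t(θ), 1 - ε, 1 + ε ) A_t ) ]

where r_t(θ) is the probability ratio between the new and old policies, A_t is the advantage estimate, and ε is a small clipping constant (e.g., 0.2). Clipping keeps each policy update close to the previous policy, which is one reason PPO is a popular choice for locomotion training.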
e. Example Case Study
OpenAI’s quadruped robot learned to walk using PPO. Training in simulation led to emergent gait patterns
similar to animals, guided by reward functions emphasizing distance and stability.
f. Sim-to-Real Transfer
Domain randomization and fine-tuning bridge simulation and real-world execution.
g. Key Achievements
Autonomous walking, running, and adaptation to uneven terrains; robust recovery after disturbances.
h. Challenges
High computational cost, reward shaping complexity, and real-world safety issues.
i. Impact
Showcased RL’s power in robotics, inspiring progress in exoskeletons, drones, and autonomous navigation.
4.2 Deep Learning of Visuomotor Skills
b. Motivation
End-to-end learning enables robots to interpret visual data and act accordingly, improving adaptability in
unstructured settings.
c. Core Framework
Input: camera images; CNNs extract features; Policy Network maps visuals to motor actions; Reward:
success of task (e.g., grasping).
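A minimal sketch of such a visuomotor policy network is shown below (PyTorch; the image size, number of motor outputs, and layer sizes are illustrative assumptions, not the architecture of any specific published system):

import torch
import torch.nn as nn

class VisuomotorPolicy(nn.Module):
    """Maps a raw camera image directly to continuous motor commands."""

    def __init__(self, n_motors=7):
        super().__init__()
        # Convolutional feature extractor for raw pixels.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
        )
        # Fully connected head producing one command per motor/joint.
        self.head = nn.Sequential(
            nn.Linear(32 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, n_motors), nn.Tanh(),  # commands bounded to [-1, 1]
        )

    def forward(self, image):
        return self.head(self.features(image))

# One fake 64x64 RGB camera frame -> 7 motor commands.
policy = VisuomotorPolicy()
print(policy(torch.rand(1, 3, 64, 64)).shape)     # torch.Size([1, 7])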
d. Example Case Study
Google’s robotic arm (Levine et al., 2016) learned grasping using 800,000 attempts with CNN-based policies,
achieving robust performance without explicit programming.
e. Techniques and Architectures
End-to-end learning, DQN/Actor-Critic for control, autoencoders for compression, and imitation+RL hybrid
learning.
f. Achievements
Robots learned to grasp unseen objects, self-calibrate, and adapt visually without 3D modeling.
g. Challenges
Data inefficiency, poor visual generalization, training instability, and sim-to-real transfer issues.
h. Impact
Enabled perception-driven robotics for manipulation, visual navigation, and industrial automation.
4.3 Comparative Summary
Aspect | Self-Learning Locomotion | Deep Visuomotor Skills
Input Type | Sensor readings | Camera images
Output | Motion commands | Motor actions from vision
Learning Algorithm | PPO, DDPG, TRPO | DQN, Actor-Critic, Imitation+RL
Goal | Efficient locomotion | Visual control actions
Environment | Physical terrain | Visual workspace
Challenges | Stability, sim-to-real | Data efficiency, generalization
Impact | Humanoid and legged robotics | Manipulation and visual autonomy
5.1 AlphaGo: Championship-Level Play at Go
a. Introduction
AlphaGo, developed by DeepMind (a subsidiary of Google), represents a landmark achievement in artificial
intelligence and reinforcement learning. It became the first computer program to defeat a professional
human Go player, later surpassing the world champion. Go is an ancient board game with more possible
positions than atoms in the universe — making traditional search-based approaches infeasible.
b. Background and Motivation
Game Complexity: Go has a state space of approximately 10¹⁷⁰ possible positions, vastly larger than chess (~10⁴⁷).
Traditional AI Limitations: Rule-based and brute-force search methods (like Alpha-Beta pruning) used in
chess programs failed to scale.
Goal: Combine Deep Learning (DL) and Reinforcement Learning (RL) to learn value functions and policies
that approximate human-level intuition.
c. Architecture and Methodology
Component | Description
Policy Network | A deep neural network trained to predict expert moves from human games. Used to narrow the search space.
Value Network | Estimates the probability of winning from any given board position.
Monte Carlo Tree Search (MCTS) | Combines the outputs of the policy and value networks to explore future game states effectively.
Reinforcement Learning (Self-Play) | AlphaGo played millions of games against itself, improving the policy iteratively via gradient updates (Policy Gradient Methods).
d. Training Phases
Supervised Learning Phase – Trained on 30 million human expert moves, achieving ~57% accuracy in move
prediction.
Reinforcement Learning Phase – Improved through self-play using policy gradient reinforcement learning to
optimize move selection.
Value Network Training – Using self-play games to estimate the expected outcome (win/loss).
Integration with MCTS – Balanced exploration and exploitation through tree-based rollouts guided by the
learned policy.
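For reference, in the published description of AlphaGo's search, each MCTS simulation selects the move that maximizes an action value plus an exploration bonus driven by the policy network's prior:

a_t = argmax_a [ Q(s_t, a) + u(s_t, a) ],   with   u(s, a) ∝ P(s, a) / (1 + N(s, a))

where Q(s, a) is the action value estimated from search, P(s, a) is the prior probability assigned by the policy network, and N(s, a) is the visit count of that move. This is how the learned networks guide the tree-based rollouts used to balance exploration and exploitation.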
e. Achievements
2015: Defeated European Champion Fan Hui (5–0).
2016: Defeated World Champion Lee Sedol (4–1).
2017: AlphaGo Master defeated world No. 1 Ke Jie in three consecutive matches.
f. Key Innovations
Combination of Deep Neural Networks with Tree Search.
End-to-end learning from raw Go board representations (19×19 grid).
Effective use of self-play to exceed human knowledge.
g. Challenges Faced
Enormous computational demands (initially trained using hundreds of GPUs).
High sensitivity to hyperparameters in RL.
Limited interpretability of neural decisions.
h. Impact
AlphaGo demonstrated that Deep Reinforcement Learning can master complex, intuitive, and creative tasks
beyond symbolic AI’s reach. It opened new research directions in generalized learning, strategic planning,
and policy optimization.
5.2 AlphaZero
a. Introduction
AlphaZero, also developed by DeepMind (2017), was the next evolution of AlphaGo. Unlike AlphaGo, which
relied on human data and domain-specific heuristics, AlphaZero started with zero human knowledge —
learning purely through self-play reinforcement learning.
b. Motivation
The goal was to generalize AlphaGo’s architecture to any two-player perfect-information game (like chess,
shogi, and Go) using a single unified algorithm, eliminating the need for human demonstrations or
handcrafted features.
d. Training Process
Self-Play – The agent plays against itself to generate experience data.
Monte Carlo Tree Search (MCTS) – Guided by the current policy and value estimates.
Network Update – Using collected data to improve policy and value prediction.
Iteration – The new network replaces the old one if it performs better.
e. Key Improvements over AlphaGo
Aspect | AlphaGo | AlphaZero
Human Data Usage | Used expert game datasets. | Learned purely through self-play.
Game Specificity | Designed for Go. | Applicable to Go, Chess, and Shogi.
Architecture | Separate Policy and Value Networks. | Unified Policy-Value Network.
Learning Paradigm | Supervised + RL. | Pure RL (self-play).
Efficiency | Extensive computation required. | More sample efficient.
f. Achievements
a. Defeated Stockfish (Chess) and Elmo (Shogi) within 24 hours of self-training.
b. Achieved superhuman performance in Go, surpassing AlphaGo Zero.
c. Unified algorithm proved domain-agnostic intelligence.
g. Challenges
High computational cost (requires TPUs/GPUs).
Lack of interpretability in neural strategies.
Sparse rewards and slow convergence.
Difficulty scaling to stochastic or imperfect-information environments.
h. Comparative Summary
Aspect | AlphaGo | AlphaZero
Learning Approach | Human data + RL self-play | Pure RL self-play
Architecture | Separate Policy & Value Networks | Unified Policy-Value Network
Human Involvement | High (supervised pre-training) | None (zero human knowledge)
Game Coverage | Go only | Go, Chess, Shogi
Algorithmic Innovation | Deep RL + MCTS | Generalized Deep RL + MCTS
Outcome | Beat World Champion | Beat top engines in multiple domains