Reinforcement Learning and Deep Learning

The document discusses key concepts in reinforcement learning, including Markov decision processes, SARSA vs Q-learning, and actor-critic methods like A2C and A3C. It also covers the importance of the Bellman equation, the need for target networks in DQN, and the challenges of POMDPs. Additionally, it explains meta-learning, model-based techniques, and various properties of dynamic programming, alongside practical applications in fields like robotics and healthcare.

Question 1

1(a) What are the main components of a Markov decision process? (2 marks)
 States (S): All possible situations the agent can be in.
 Actions (A): All possible moves/choices the agent can take.
 Transition probabilities (P): Probability of moving from one
state to another after an action.
 Reward function (R): Immediate feedback (positive/negative)
after an action.
 Policy (π): Strategy that defines which action to take in a given
state.
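
A minimal sketch of how these components can be written down in Python for a toy two-state problem (the states, actions, probabilities, and rewards below are made-up illustrative values, not taken from the document):

# Hypothetical two-state MDP written as plain Python dictionaries.
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[(s, a)] -> {next_state: probability}  (illustrative numbers)
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R[(s, a)] -> immediate reward (illustrative numbers)
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "move"): 0.0}

# A simple deterministic policy: which action to take in each state.
policy = {"s0": "move", "s1": "stay"}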

1(b) When to use SARSA over Q-learning? (2 marks)


 SARSA (on-policy): Used when we want the agent to learn
based on the policy it actually follows (including exploration).
 Q-learning (off-policy): Learns optimal policy regardless of
current behavior.
👉 Use SARSA when the environment is risky/unstable and we
want safer learning (since it accounts for exploratory actions).
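
A minimal sketch of the two update rules in Python, which makes the on-policy/off-policy difference concrete (Q is assumed to be a dictionary of state-action values; alpha, gamma, and the variable names are illustrative):

# SARSA (on-policy): uses the action a_next actually chosen by the current policy.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Q-learning (off-policy): uses the greedy (max) action in the next state.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])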

1(c) What is the difference between A2C and A3C actor-critic? (2 marks)
 A2C (Advantage Actor-Critic): Synchronous version where
multiple agents run in parallel and update gradients together.
 A3C (Asynchronous Advantage Actor-Critic): Asynchronous
version where each agent updates independently at different
times.
👉 Difference = synchronous (A2C) vs asynchronous (A3C)
training.

1(d) How are multi-agent systems different from distributed systems? (2 marks)
 Multi-agent systems: Multiple intelligent agents interact,
cooperate, or compete to achieve goals (focus = decision-
making).
 Distributed systems: Multiple computers share resources and
coordinate tasks (focus = computation + fault tolerance).
👉 Multi-agent = intelligent decisions, Distributed = resource
distribution.

1(e) What is a real-world example of reinforcement learning? (2 marks)
 Example: Self-driving cars → Learn to drive by interacting with
environment (traffic, signals, pedestrians) and maximizing
safety & efficiency.
Other examples: Robotics, Game playing (AlphaGo),
Recommendation systems.

Question 2
2(a) Why is the Bellman equation important in reinforcement
learning? How to solve it? (5 marks)
 Importance:
o It provides a recursive way to calculate value of a state.
o Forms the foundation of Dynamic Programming, Q-
learning, and Value Iteration.
 Bellman equation (for value function):
V(s) = R(s) + γ ∑_{s'} P(s'|s,a) V(s')
 How to solve:
o Iterative methods: Value Iteration, Policy Iteration.
o Approximation: Monte Carlo, Temporal Difference
learning.
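
One way to solve the Bellman equation numerically is value iteration; a minimal sketch, assuming the dictionary-style P[(s, a)] and R[(s, a)] shown in the earlier MDP example (all values are placeholders):

def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    # Start with zero value for every state and sweep until values stop changing.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: best one-step return over all actions.
            best = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V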

2(b) Why do we need a target network in DQN? How can we improve the DQN model? (5 marks)
 Need of target network:
o Prevents unstable learning by keeping target Q-values
fixed for some steps instead of updating immediately.
o Reduces oscillations and divergence.
 Improvements to DQN:
o Double DQN (removes overestimation bias).
o Dueling DQN (separates value and advantage functions).
o Prioritized experience replay.
o Using larger neural networks and better optimizers.
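
A minimal PyTorch-flavoured sketch of how the target network is held fixed and refreshed only every few steps (the network sizes, update period, and variable names are assumptions for illustration, not a full DQN implementation):

import copy
import torch
import torch.nn as nn

# Online Q-network and a frozen copy used to compute targets.
policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(policy_net)
target_net.eval()

TARGET_UPDATE_EVERY = 1000  # steps between target refreshes (illustrative)

def td_targets(rewards, next_states, dones, gamma=0.99):
    # Targets come from the frozen network, so they stay stable between syncs.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * next_q * (1 - dones)  # dones: 0/1 float tensor

def maybe_sync(step):
    # Periodically copy the online weights into the target network.
    if step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(policy_net.state_dict())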

Question 3
3(a) What is stochastic policy? What is the formula for the policy of reinforcement learning? Explain. (5 marks)
A stochastic policy in reinforcement learning is a type of decision-making rule where the agent does not always choose the same action for a given state, but instead selects an action based on a probability distribution. This means that for the same state, different actions may be chosen at different times, which introduces randomness and allows better exploration of the environment. This is different from a deterministic policy, where the action is always fixed for each state.
The general formula for a stochastic policy is:
π(a|s) = P(A = a | S = s)
This represents the probability of choosing action a when the agent is in state s.
Stochastic policies are very important in reinforcement learning, especially in complex or continuous environments, because they prevent the agent from getting stuck in local optima and promote exploration. For example, in a game-playing agent, using a stochastic policy ensures that the agent sometimes tries out less common moves, which may eventually lead to discovering better strategies. Many modern algorithms such as Policy Gradient, REINFORCE, and Actor-Critic methods rely on stochastic policies for efficient learning.
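
A minimal sketch of a stochastic (softmax) policy over action preferences, using NumPy; the preference values are illustrative:

import numpy as np

def softmax_policy(preferences):
    # Convert unnormalized action preferences into a probability distribution.
    z = np.exp(preferences - np.max(preferences))  # subtract max for stability
    return z / z.sum()

# Illustrative preferences for three actions in some state s.
probs = softmax_policy(np.array([2.0, 1.0, 0.5]))   # pi(a|s)
action = np.random.choice(len(probs), p=probs)      # sample a ~ pi(.|s)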

3(b) What is meta-learning in reinforcement learning? What are the applications of meta-learning? (5 marks)
Meta-learning, also called “learning to learn,” is a process in reinforcement learning where the agent does not only learn to solve a single task but also learns how to quickly adapt its knowledge to new tasks with minimal data and training. Instead of starting from scratch for each task, the agent develops a general learning strategy that can transfer across different environments.
The goal of meta-learning is to build agents that are flexible and adaptable, much like humans who can apply previous experience to new problems. For instance, once a robot learns how to walk on flat ground, it should quickly adapt to walking on sand, stairs, or rocky terrain without learning everything again.
Applications of meta-learning include:
 Robotics: Robots adapting quickly to different terrains, tasks, or objects.
 Healthcare: Personalized treatment recommendations based on patient-specific data.
 Few-shot learning: Training models that can classify or act correctly with very few examples.
 Recommendation systems: Adapting quickly to changing user preferences.
Meta-learning is therefore extremely powerful because it makes reinforcement learning agents more generalizable, efficient, and closer to human-like adaptability.

Question 4
4(a) What are the challenges associated with using a POMDP? Explain the key components of the POMDP. (5 marks)
A POMDP (Partially Observable Markov Decision Process) is an extension of the MDP where the agent does not have complete knowledge of the state of the environment. Instead, it receives only partial observations that provide incomplete information about the true state.
Challenges associated with POMDP:
 High complexity: Solving POMDPs is computationally difficult, as the agent must reason about all possible hidden states.
 Uncertainty handling: Since the agent never knows the true state, it must maintain a belief (probability distribution over states), which is mathematically challenging (a small belief-update sketch is given after this answer).
 Memory requirement: The agent must often remember past actions and observations to make better decisions, unlike in standard MDPs where the current state is enough.
 Scalability issues: As the environment grows larger, maintaining beliefs and making optimal policies becomes nearly impossible in real time.
Key components of POMDP:
 States (S): True states of the environment (hidden from the agent).
 Actions (A): Choices available to the agent.
 Transition function (P): Probability of moving to a new state given an action.
 Rewards (R): Immediate feedback after actions.
 Observations (O): What the agent can perceive from the environment.
 Observation function: Probability of receiving an observation given the hidden state.
In short, POMDPs model real-world situations better than MDPs (since we rarely know the full state of the world), but their solution is much harder and often requires approximation methods.
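
A minimal NumPy sketch of the belief update mentioned above, assuming arrays T[a][s, s'] = P(s'|s,a) for transitions and Z[a][s', o] = P(o|s',a) for the observation function (the names and shapes are assumptions for illustration):

import numpy as np

def belief_update(b, a, o, T, Z):
    # b: belief over hidden states; a: action taken; o: observation received.
    # Predict: push the belief through the transition model for action a.
    predicted = T[a].T @ b                  # sum_s P(s'|s,a) * b(s)
    # Correct: weight by how likely observation o is in each candidate state.
    weighted = Z[a][:, o] * predicted
    return weighted / weighted.sum()        # renormalize to a distribution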

4(b) What are model-based techniques? Is the AlphaZero model based on RL? (5 marks)
In reinforcement learning, model-based techniques are methods where the agent builds or uses a model of the environment to plan and make decisions. The model usually includes the transition dynamics (probability of going from one state to another) and the reward function. By simulating possible future states using the model, the agent can evaluate different strategies before actually executing them in the environment.
This is in contrast to model-free techniques (like Q-learning or Policy Gradient methods), where the agent directly learns from trial and error without trying to predict future states explicitly. Model-based methods are often more data-efficient but computationally expensive.
Examples of model-based techniques:
 Dynamic programming
 Monte Carlo Tree Search (MCTS)
 Planning algorithms in robotics
AlphaZero: Yes, AlphaZero is based on reinforcement learning, and it combines model-based planning with deep learning. It uses:
 Monte Carlo Tree Search (MCTS): A planning algorithm that simulates future game moves (model-based).
 Neural networks: To approximate the value function and policy.
 Self-play reinforcement learning: The system plays against itself, improving iteratively without human data.
This combination makes AlphaZero a hybrid model that leverages the strengths of both model-based RL (planning) and deep learning (function approximation). It has been successfully used to master games like chess, shogi, and Go at superhuman levels.

Q1 (Attempt any five, 5 marks each)


(a) What is Reinforcement Learning (RL)?
Reinforcement Learning is a type of machine learning where an agent
learns by interacting with an environment. The agent takes actions,
receives rewards or penalties, and improves its strategy over time to
maximize total rewards. Unlike supervised learning, RL does not
require labeled data, but instead focuses on trial-and-error learning.
Examples include training robots, game-playing (like AlphaGo), and
self-driving cars.

(b) What do you mean by Metadata?


Metadata means “data about data.” It gives information about other
data, such as how it is created, stored, or used. For example, a photo
file may have metadata such as date taken, camera type, and
resolution. In machine learning and databases, metadata helps in
organizing, retrieving, and understanding data better.

(c) Explain the two required properties of Dynamic Programming.


Dynamic Programming (DP) is used when problems can be broken
into smaller subproblems. The two key properties are:
1. Optimal substructure: The solution to a big problem can be
built from solutions of smaller subproblems.
2. Overlapping subproblems: The same subproblems occur
multiple times, so storing and reusing results saves time.
For example, shortest path problems and Fibonacci calculation
use DP.
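
As a concrete illustration of both properties, a memoized Fibonacci in Python (overlapping subproblems are cached and reused; each value is built from smaller solutions):

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Overlapping subproblems: fib(n-1) and fib(n-2) are reused, not recomputed.
    if n < 2:
        return n
    # Optimal substructure: the answer is built directly from smaller answers.
    return fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040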

(d) List out the requirements for Monte Carlo method.


Monte Carlo methods are techniques that use random sampling to
estimate values. Requirements:
1. Environment model or simulator for running episodes.
2. Many random samples or trials.
3. Reward function to evaluate outcomes.
4. Sufficient episodes to average results for accuracy.
It is often used in reinforcement learning when the
environment’s dynamics are not fully known.
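
A minimal sketch of the Monte Carlo idea: run many sampled episodes in a simulator and average the returns. The fake simulator below is purely an illustrative assumption:

import random

def simulate_episode():
    # Stand-in for a real environment/simulator: returns one sampled return.
    return sum(random.gauss(1.0, 0.5) for _ in range(10))

def monte_carlo_estimate(num_episodes=10_000):
    # Average the observed returns over many random trials.
    returns = [simulate_episode() for _ in range(num_episodes)]
    return sum(returns) / len(returns)

print(monte_carlo_estimate())  # converges near the true expected return (about 10)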

(e) Differentiate between meta-learning and model-agent based learning used in RL.
 Meta-learning: “Learning to learn.” The agent learns a general
way of learning so it can quickly adapt to new tasks. Example: A
robot trained to walk on flat ground can easily adapt to sand or
stairs.
 Model-agent based learning: The agent builds or uses a model
of the environment (states, actions, transitions) to plan
decisions. Example: Using simulations to decide the next move
in chess.

(f) What is Deep Learning?


Deep Learning is a branch of machine learning that uses artificial
neural networks with many layers to learn complex patterns from
data. It can automatically extract features from raw data like images,
sound, or text, without requiring manual feature engineering.
Applications include speech recognition, image classification, and
natural language processing.

(g) List out the types of Neural Networks.


1. Feedforward Neural Networks (FNN)
2. Convolutional Neural Networks (CNN)
3. Recurrent Neural Networks (RNN)
4. Generative Adversarial Networks (GANs)
5. Autoencoders
6. Radial Basis Function Networks

(h) Define Vector Space Model.


Vector Space Model (VSM) is a way to represent text documents as
vectors of numbers. Each document is expressed as a vector of
terms, and similarity is measured using methods like cosine similarity.
It is widely used in Information Retrieval (e.g., search engines) to find
documents similar to a given query.
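
A minimal term-count sketch of the Vector Space Model with cosine similarity (the toy document and query are made up for illustration; real systems typically use TF-IDF weights):

import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Dot product over shared terms divided by the product of vector lengths.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc = Counter("reinforcement learning with reward signals".split())
query = Counter("reward learning".split())
print(cosine(doc, query))  # higher score = more similar document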

UNIT – I (12.5 marks)


Q2. Discuss the features and elements of Reinforcement Learning.
Reinforcement Learning has some key features:
 Trial and error learning – Agent learns by interacting.
 Feedback-driven – Rewards or penalties guide behavior.
 Exploration vs. Exploitation – Agent must balance trying new
actions vs using known best actions.
 Sequential decision making – Actions affect future outcomes.
Elements of RL framework:
1. Agent – learner/decision-maker.
2. Environment – everything the agent interacts with.
3. State (S) – current situation of the agent.
4. Action (A) – choices agent can make.
5. Reward (R) – feedback after action.
6. Policy (π) – strategy followed by the agent.
7. Value function (V) – long-term expected reward from a state.

OR Q3. Illustrate the Markov Decision Process and RL Framework.
A Markov Decision Process (MDP) is the mathematical framework
behind RL. It consists of:
 States (S), Actions (A), Transition probabilities (P), Reward
function (R), and Policy (π).
The Markov property means that the next state depends only
on the current state and action, not the past history.
The RL framework is built on top of MDP where the agent interacts
with the environment, updates its policy, and learns optimal
behavior. Example: In chess, the state is the board position, action is
a move, and reward is winning/losing.

UNIT – II (12.5 marks)


Q4. State and explain the various policy-based methods used in RL.
Policy-based methods directly learn a parameterized policy (πθ)
without using value functions. Examples:
1. Policy Gradient methods (REINFORCE): Adjusts policy
parameters in the direction that increases expected reward.
2. Actor-Critic methods: Combination of policy-based (actor) and
value-based (critic).
3. Trust Region Policy Optimization (TRPO): Improves policy while
preventing large harmful updates.
4. Proximal Policy Optimization (PPO): Simplified version of TRPO,
widely used in practice.
Advantages: Work well in continuous action spaces, provide
stochastic policies, and are stable for large problems.
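
A minimal REINFORCE-style sketch in PyTorch showing the core policy-gradient step (the network shape, states, actions, and returns are illustrative placeholders, not a complete training loop):

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_step(states, actions, returns):
    # log pi_theta(a|s) for the actions that were actually taken.
    log_probs = torch.log_softmax(policy(states), dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Gradient ascent on expected return == descent on the negative objective.
    loss = -(chosen * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()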

OR Q5. Explain with an example the working of the model-based RL approach.
In model-based RL, the agent builds a model of the environment,
which includes transition probabilities and reward functions. The
agent then uses this model to simulate different possible futures and
choose the best actions.
Example: AlphaZero in chess. It uses Monte Carlo Tree Search (MCTS)
as a planning method. The agent simulates many possible moves
ahead (like a human imagining future moves) and chooses the one
with the best expected outcome. This makes model-based RL more
sample-efficient compared to model-free methods.

UNIT – III (12.5 marks)


Q6. Discuss the working principle of deep learning with practical
examples.
Deep learning works on the principle of using multi-layer neural
networks where each layer extracts increasingly complex features
from the input data. The first layers capture simple features (edges in
images), while deeper layers combine them into complex patterns
(faces, objects).
The network is trained using backpropagation, where errors are
propagated backward, and weights are adjusted using optimization
algorithms like gradient descent.
Examples:
 Image recognition (CNN): Identifying cats, dogs, or humans in
photos.
 Speech recognition (RNN, LSTM): Converting spoken language
into text.
 Medical diagnosis: Detecting tumors from X-rays or MRI scans.
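
A minimal PyTorch sketch of the training principle described above: a forward pass, a loss, backpropagation of errors, and a gradient-descent weight update (the data and network size are illustrative assumptions):

import torch
import torch.nn as nn

# Tiny two-layer network on made-up data, just to show the training loop.
net = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

x = torch.randn(64, 10)          # illustrative inputs
y = torch.randn(64, 1)           # illustrative targets

for epoch in range(100):
    pred = net(x)                # forward pass: layer-by-layer features
    loss = loss_fn(pred, y)      # how wrong the prediction is
    optimizer.zero_grad()
    loss.backward()              # backpropagation: errors flow backward
    optimizer.step()             # gradient descent adjusts the weights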

OR Q7. Illustrate the Convolutional Neural Network and its real-time applications.
A Convolutional Neural Network (CNN) is a type of neural network
specialized for image and spatial data. It uses convolutional layers to
automatically extract features like edges, shapes, and textures.
Pooling layers reduce dimensions, and fully connected layers classify
the output.
Applications in real-time:
 Face recognition (unlocking phones).
 Self-driving cars (object detection like pedestrians, traffic
lights).
 Medical imaging (cancer detection).
 Security systems (CCTV image recognition).
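
A minimal PyTorch sketch of the convolution, pooling, and fully connected pipeline described above, for small grayscale images (the image size, channel counts, and class count are assumptions for illustration):

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # detect edges/textures
            nn.ReLU(),
            nn.MaxPool2d(2),                             # shrink spatial size
            nn.Conv2d(8, 16, kernel_size=3, padding=1),  # combine into shapes
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 7 * 7, num_classes)  # assumes 28x28 input

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))

logits = TinyCNN()(torch.randn(1, 1, 28, 28))  # one fake 28x28 grayscale image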

UNIT – IV (12.5 marks)


Q8. Explain with an example how deep learning can be utilized in Natural Language Processing (NLP).
Deep learning has transformed NLP by replacing hand-crafted
features with automatic learning from text data. Models like RNNs,
LSTMs, GRUs, and Transformers can capture sequential dependencies
and context.
Examples:
 Machine translation: Google Translate uses deep learning.
 Chatbots & assistants: Siri, Alexa.
 Sentiment analysis: Detecting emotions in tweets or reviews.
 Text summarization: Generating concise summaries of long
articles.

OR Q9. Draw and explain the deep learning architecture for Computer Vision.
Computer Vision architectures are mostly based on CNNs.
Architecture steps:
1. Input layer – image pixels.
2. Convolutional layers – detect features (edges, corners).
3. Pooling layers – reduce size and preserve important info.
4. Fully connected layers – combine extracted features.
5. Output layer – classification (e.g., cat, dog, car).
This layered architecture makes CNNs very effective in vision tasks
like face recognition, autonomous driving, and medical image
analysis.
