AML – Assignment 3
Developing an RL Agent to Solve the Classic Lunar Lander Control Problem
Nalla Janardhana Rao (MDS202426) Pranav Pothan (MDS202429)
Raja S (MDS202430)
1 Introduction
The objective of this assignment was to train a Reinforcement Learning (RL) agent to solve the Gymnasium
LunarLander-v3 environment. This environment, a classic rocket-trajectory optimization problem, requires the
agent to manage thrust and orientation to achieve a soft landing on a designated pad.
We were tasked with solving both:
• the original discrete action space version, and
• the more complex continuous action space version,
using Temporal Difference (TD) learning methods. The success criterion for both versions is achieving an
average reward of +200 over 100 consecutive episodes.
1.1 Environment Overview
• State Space: 8-dimensional continuous vector:
– Position: X, Y
– Velocities: Vx, Vy
– Angle and angular velocity
– Left/Right leg contact flags
• Discrete Action Space (Part I): 4 actions:
– Do nothing
– Fire Left
– Fire Main
– Fire Right
• Continuous Action Space (Part II): 2 continuous values in [−1, 1]:
– Main Engine Throttle
– Side Thruster Throttle
2 Part I: Discrete Control using Deep Q-Network (DQN)
The discrete action set makes the first problem well suited to a value-based approach. Because the state space
is continuous (8-dimensional), a tabular method is not applicable, so we chose Deep Q-Networks (DQN), an
extension of Q-Learning that uses a neural network to approximate the optimal action-value function Q∗(s, a).
2.1 DQN Architecture
The DQN agent employs three key components to stabilize training when using a neural network as a function
approximator:
• Q-Network: A fully-connected neural network (8-input, 64-hidden, 64-hidden, 4-output) that maps the
state vector to the expected Q-value for each of the four possible actions.
• Experience Replay Buffer: A memory buffer that stores past transitions (s_t, a_t, r_{t+1}, s_{t+1}, done)
and samples them randomly in minibatches. This breaks the temporal correlation in sequential data, helping
to satisfy the i.i.d. assumption necessary for stable deep learning.
• Target Network: A separate, delayed copy of the Q-Network used exclusively to compute the stable
target value Y_j, which reduces oscillation in the training target. A minimal sketch of these three components
follows this list.
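Below is a minimal PyTorch sketch of the three components described above. The layer sizes (8-64-64-4) follow the report; names such as QNetwork and ReplayBuffer, and the buffer capacity, are illustrative assumptions rather than the exact code we submitted.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps an 8-dim state to a Q-value for each of the 4 discrete actions."""
    def __init__(self, state_dim=8, action_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state):
        return self.net(state)

class ReplayBuffer:
    """Stores (s_t, a_t, r_{t+1}, s_{t+1}, done) tuples and samples random minibatches."""
    def __init__(self, capacity=100_000):  # capacity is an assumed hyperparameter
        self.memory = deque(maxlen=capacity)

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size=64):
        batch = random.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.tensor(states, dtype=torch.float32),
                torch.tensor(actions, dtype=torch.int64),
                torch.tensor(rewards, dtype=torch.float32),
                torch.tensor(next_states, dtype=torch.float32),
                torch.tensor(dones, dtype=torch.float32))

# Local and target networks: the target is a delayed copy used only to compute Y_j.
q_local = QNetwork()
q_target = QNetwork()
q_target.load_state_dict(q_local.state_dict())  # hard copy; refreshed periodically
```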
2.2 Epsilon-Greedy Exploration
During training, the agent used an ϵ-greedy policy:
• Initial exploration rate: ϵ = 1.0 (pure exploration)
• Decay schedule: multiplicative decay by 0.995
• Minimum exploration rate: ϵmin = 0.01
This schedule ensures the agent explores the action space thoroughly in the early stages before gradually
committing to the learned policy (exploitation).
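As an illustration, a small sketch of the selection rule and decay schedule (start at 1.0, multiply by 0.995, floor at 0.01) is shown below. It reuses q_local from the previous sketch, and the decay is applied once per episode, which is an assumption since the report does not fix the decay granularity.

```python
import random

import torch

eps, eps_decay, eps_min = 1.0, 0.995, 0.01

def select_action(state, eps):
    """Random action with probability eps, otherwise the greedy action."""
    if random.random() < eps:
        return random.randrange(4)              # explore: any of the 4 actions
    with torch.no_grad():
        q_values = q_local(torch.tensor(state, dtype=torch.float32))
        return int(q_values.argmax().item())    # exploit: best-known action

# At the end of each episode:
eps = max(eps_min, eps * eps_decay)
```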
3 Part II: Continuous Control using Deep Deterministic Policy Gradient (DDPG)
The continuous action space requires a different paradigm because enumerating all possible actions to find
max_a Q(s, a) is computationally intractable. We chose Deep Deterministic Policy Gradient (DDPG), an
off-policy Actor-Critic method designed specifically for continuous action spaces.
3.1 DDPG Actor-Critic Architecture
DDPG maintains two sets of networks, four in total:
Network Type | Role | Input | Output
Actor (µ) | Learns deterministic policy µ(s) | State (8-dim) | Action (2-dim) ∈ [−1, 1]
Critic (Q) | Learns action-value function Q(s, a) | State (8-dim) + Action (2-dim) | Single Q-value
Each primary (local) network has a corresponding target network: Target Actor µ′ and Target Critic Q′ .
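A sketch of the two primary networks is given below; the target copies µ′ and Q′ share the same architectures. The hidden-layer widths are illustrative assumptions, not values taken from the report.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy µ(s): 8-dim state -> 2-dim action in [-1, 1]."""
    def __init__(self, state_dim=8, action_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # tanh bounds the action to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Action-value function Q(s, a): concatenated state and action -> scalar Q-value."""
    def __init__(self, state_dim=8, action_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

actor, critic = Actor(), Critic()
actor_target, critic_target = Actor(), Critic()
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
```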
3.2 Stability Mechanisms for Continuous Control
DDPG introduces two mechanisms essential for training stability in continuous spaces.
Ornstein-Uhlenbeck (OU) Noise
Since the policy is deterministic, we add temporally correlated OU noise to the Actor’s output action during
training. This allows for smoother, more exploratory movements in the physical control domain than simple
i.i.d. Gaussian noise.
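A standard discretised OU process is sketched below; the parameters θ = 0.15 and σ = 0.2 are common defaults and are assumptions, not necessarily the values used in our runs.

```python
import numpy as np

class OUNoise:
    """Temporally correlated noise added to the Actor's output during training."""
    def __init__(self, action_dim=2, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(action_dim)
        self.theta, self.sigma = theta, sigma
        self.reset()

    def reset(self):
        """Restart the process at the mean (typically at the start of each episode)."""
        self.state = self.mu.copy()

    def sample(self):
        """dx = theta * (mu - x) + sigma * N(0, 1); successive samples are correlated."""
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(len(self.state))
        self.state = self.state + dx
        return self.state

noise = OUNoise()
# During training: action = np.clip(actor_output + noise.sample(), -1.0, 1.0)
```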
Soft Target Update (Polyak Averaging)
Instead of periodically copying the weights (hard update, as in DQN), DDPG slowly updates the target network
weights θ′ toward the local network weights θ at every step using a small rate τ (e.g. τ = 10−3 ):
θ′ ← τ θ + (1 − τ )θ′ . (1)
This slow blending ensures the target value remains stable, preventing catastrophic divergence during learning.
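In code, Eq. (1) amounts to the small helper below, applied to both target networks after every learning step (network names follow the earlier sketch).

```python
def soft_update(local_net, target_net, tau=1e-3):
    """Polyak averaging: θ' ← τθ + (1 − τ)θ' for every parameter pair."""
    for target_param, local_param in zip(target_net.parameters(), local_net.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)

# soft_update(actor, actor_target)
# soft_update(critic, critic_target)
```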
4 Comparative Analysis of Algorithms
The project highlights the fundamental differences in approach between value-based and policy gradient meth-
ods across the two action spaces.
Feature | Discrete DQN | Continuous DDPG | Advantage
Problem Type | Best for discrete action spaces | Necessary for continuous action spaces | DDPG
Efficiency (episodes to solve) | 1164 episodes | 171 episodes | DDPG (faster)
Policy Output | ϵ-greedy over 4 discrete actions | Deterministic 2-D throttle vector | DDPG (finer control)
Complexity | 2 networks (Q-local, Q-target) | 4 networks (Actor/Critic, local/target) | DQN (simpler)
5 Training Results and Discussion
The training sessions confirmed the relative complexity and efficiency of the two algorithms.
5.1 Part I: DQN Success
The DQN agent solved the environment at episode 1164, achieving a 100-episode moving-average score of
approximately 200.96. This validates deep Q-learning with its stabilization techniques (experience replay and
a target network) for environments with a continuous state space but a discrete action space.
5.2 Part II: DDPG Success and Efficiency
The DDPG agent demonstrated remarkable efficiency in solving the continuous challenge, achieving the success
criterion at episode 171 with an average score of approximately 200.55.
The speed difference is notable: DDPG’s direct policy optimization and smooth exploration strategy allowed
it to converge far more quickly than the DQN’s iterative estimation of Q-values for a discrete set of actions.
6 Result Visualization and Interpretation
6.1 DQN Score Curve
The visualization of the DQN scores showed a gradual but steady climb, indicating that the ϵ-greedy explo-
ration strategy was effective at discovering crucial, high-reward landing trajectories. As ϵ decayed, the agent
increasingly exploited the learned Q-values, leading to consistent high rewards and eventual satisfaction of the
success criterion.
6.2 DDPG Score and Loss Curves
The DDPG score curve exhibited extremely rapid learning, with scores rising from strongly negative values to
beyond the +200 threshold in relatively few episodes.
The corresponding loss curves validated the stability of the Actor-Critic system:
• Critic Loss: The Critic loss decreased rapidly and then stabilized, indicating that the value function was
accurately estimating the utility of the state-action pairs.
• Actor Loss: The Actor loss, which is the negative of the Critic's Q-value estimate for the Actor's chosen
actions, showed a consistent downward trajectory. This demonstrates that the policy network was continually
adjusting its parameters towards actions that maximize the Critic's evaluation, which translates directly into
the high episodic rewards (a sketch of how both losses are computed follows this list).
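For reference, the outline below shows how the two loss curves are produced: the Critic is regressed onto a bootstrapped TD target computed with the target networks, and the Actor's loss is the negative mean Q-value of its own actions. It reuses the network names from the earlier sketch, and γ = 0.99 is an assumed discount factor.

```python
import torch
import torch.nn.functional as F

gamma = 0.99  # assumed discount factor

def ddpg_losses(states, actions, rewards, next_states, dones):
    # Critic loss: mean-squared error against the TD target y, computed with the
    # *target* Actor and Critic so the regression target stays stable.
    with torch.no_grad():
        next_actions = actor_target(next_states)
        y = rewards + gamma * (1 - dones) * critic_target(next_states, next_actions).squeeze(-1)
    critic_loss = F.mse_loss(critic(states, actions).squeeze(-1), y)

    # Actor loss: negative mean Q-value of the Actor's current actions, so minimizing
    # it pushes the policy towards actions the Critic rates highly.
    actor_loss = -critic(states, actor(states)).mean()
    return critic_loss, actor_loss
```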
7 Conclusion
The project successfully implemented and demonstrated two foundational Deep Reinforcement Learning algo-
rithms across the two distinct control paradigms of the Lunar Lander problem. The results confirm:
• DQN is a viable, robust solution for continuous-state, discrete-action environments.
• DDPG is a more efficient and appropriate solution for continuous action control, requiring significantly
fewer interaction steps to achieve the solved criterion.
• The distinct stability measures—hard target updates in DQN versus soft updates and OU noise in DDPG—
highlight the architectural demands necessary to stabilize learning in different types of complex environ-
ments.