AML – Assignment 3
Developing an RL Agent to Solve the Classic Lunar Lander Control Problem
Nalla Janardhana Rao (MDS202426) Pranav Pothan (MDS202429)
Raja S (MDS202430)
1 Introduction
The objective of this assignment was to train a Reinforcement Learning (RL) agent to solve the Gymnasium
LunarLander-v3 environment. This environment, a classic rocket-trajectory optimization problem, requires the
agent to manage thrust and orientation to achieve a soft landing on a designated pad.
We were tasked with solving both:
• the original discrete action space version, and
• the more complex continuous action space version,
using Temporal Difference (TD) learning methods. The success criterion for both versions is achieving an
average reward of +200 over 100 consecutive episodes.
1.1 Environment Overview
• State Space: 8-dimensional continuous vector:
– Position: X, Y
– Velocities: Vx, Vy
– Angle and angular velocity
– Left/Right leg contact flags
• Discrete Action Space (Part I): 4 actions:
– Do nothing
– Fire Left
– Fire Main
– Fire Right
• Continuous Action Space (Part II): 2 continuous values in [−1, 1]:
– Main Engine Throttle
– Side Thruster Throttle
2 Part I: Discrete Control using Deep Q-Network (DQN)
The discrete action set makes the first problem well suited to a value-based approach. Because the state space
is continuous (8-dimensional), a tabular method is not applicable, so we chose Deep Q-Networks (DQN), an
extension of Q-Learning that uses a neural network to approximate the optimal action-value function Q∗(s, a).
2.1 DQN Architecture
The DQN agent employs three key components to stabilize training when using a neural network as a function
approximator:
• Q-Network: A fully-connected neural network (8-input, 64-hidden, 64-hidden, 4-output) that maps the
state vector to the expected Q-value for each of the four possible actions.
• Experience Replay Buffer: A memory buffer that stores past transitions (s_t, a_t, r_{t+1}, s_{t+1}, done)
and samples them randomly in minibatches. This breaks the temporal correlation in sequential data, helping
to satisfy the i.i.d. assumption necessary for stable deep learning.
• Target Network: A separate, delayed copy of the Q-Network used exclusively to compute the stable
target value Y_j, which reduces oscillation in the training target. A minimal sketch of these three components
follows this list.
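Below is a minimal PyTorch sketch of the three components described above. The layer sizes (8-64-64-4) follow the report; names such as QNetwork and ReplayBuffer, and the buffer capacity, are illustrative assumptions rather than the exact code we submitted.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps an 8-dim state to a Q-value for each of the 4 discrete actions."""
    def __init__(self, state_dim=8, action_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state):
        return self.net(state)

class ReplayBuffer:
    """Stores (s_t, a_t, r_{t+1}, s_{t+1}, done) tuples and samples random minibatches."""
    def __init__(self, capacity=100_000):  # capacity is an assumed hyperparameter
        self.memory = deque(maxlen=capacity)

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size=64):
        batch = random.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.tensor(states, dtype=torch.float32),
                torch.tensor(actions, dtype=torch.int64),
                torch.tensor(rewards, dtype=torch.float32),
                torch.tensor(next_states, dtype=torch.float32),
                torch.tensor(dones, dtype=torch.float32))

# Local and target networks: the target is a delayed copy used only to compute Y_j.
q_local = QNetwork()
q_target = QNetwork()
q_target.load_state_dict(q_local.state_dict())  # hard copy; refreshed periodically
```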
2.2 Epsilon-Greedy Exploration
During training, the agent used an ϵ-greedy policy:
• Initial exploration rate: ϵ = 1.0 (pure exploration)
• Decay schedule: multiplicative decay by 0.995
• Minimum exploration rate: ϵmin = 0.01
This schedule ensures the agent explores the action space thoroughly in the early stages before gradually
committing to the learned policy (exploitation).
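As an illustration, a small sketch of the selection rule and decay schedule (start at 1.0, multiply by 0.995, floor at 0.01) is shown below. It reuses q_local from the previous sketch, and the decay is applied once per episode, which is an assumption since the report does not fix the decay granularity.

```python
import random

import torch

eps, eps_decay, eps_min = 1.0, 0.995, 0.01

def select_action(state, eps):
    """Random action with probability eps, otherwise the greedy action."""
    if random.random() < eps:
        return random.randrange(4)              # explore: any of the 4 actions
    with torch.no_grad():
        q_values = q_local(torch.tensor(state, dtype=torch.float32))
        return int(q_values.argmax().item())    # exploit: best-known action

# At the end of each episode:
eps = max(eps_min, eps * eps_decay)
```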
3 Part II: Continuous Control using Deep Deterministic Policy Gradient (DDPG)
The continuous action space requires a different paradigm because enumerating all possible actions to find
max_a Q(s, a) is computationally intractable. We chose Deep Deterministic Policy Gradient (DDPG), an
off-policy Actor-Critic method designed specifically for continuous action spaces.
3.1 DDPG Actor-Critic Architecture
DDPG maintains two sets of networks, four in total:
Network Type | Role | Input | Output
Actor (µ) | Learns deterministic policy µ(s) | State (8-dim) | Action (2-dim) ∈ [−1, 1]
Critic (Q) | Learns action-value function Q(s, a) | State (8-dim) + Action (2-dim) | Single Q-value
Each primary (local) network has a corresponding target network: Target Actor µ′ and Target Critic Q′ .
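A sketch of the two primary networks is given below; the target copies µ′ and Q′ share the same architectures. The hidden-layer widths are illustrative assumptions, not values taken from the report.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy µ(s): 8-dim state -> 2-dim action in [-1, 1]."""
    def __init__(self, state_dim=8, action_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # tanh bounds the action to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Action-value function Q(s, a): concatenated state and action -> scalar Q-value."""
    def __init__(self, state_dim=8, action_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

actor, critic = Actor(), Critic()
actor_target, critic_target = Actor(), Critic()
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
```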
3.2 Stability Mechanisms for Continuous Control
DDPG introduces two mechanisms essential for training stability in continuous spaces.
Ornstein-Uhlenbeck (OU) Noise
Since the policy is deterministic, we add temporally correlated OU noise to the Actor’s output action during
training. This allows for smoother, more exploratory movements in the physical control domain than simple
i.i.d. Gaussian noise.
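A standard discretised OU process is sketched below; the parameters θ = 0.15 and σ = 0.2 are common defaults and are assumptions, not necessarily the values used in our runs.

```python
import numpy as np

class OUNoise:
    """Temporally correlated noise added to the Actor's output during training."""
    def __init__(self, action_dim=2, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(action_dim)
        self.theta, self.sigma = theta, sigma
        self.reset()

    def reset(self):
        """Restart the process at the mean (typically at the start of each episode)."""
        self.state = self.mu.copy()

    def sample(self):
        """dx = theta * (mu - x) + sigma * N(0, 1); successive samples are correlated."""
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(len(self.state))
        self.state = self.state + dx
        return self.state

noise = OUNoise()
# During training: action = np.clip(actor_output + noise.sample(), -1.0, 1.0)
```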
Soft Target Update (Polyak Averaging)
Instead of periodically copying the weights (hard update, as in DQN), DDPG slowly updates the target network
weights θ′ toward the local network weights θ at every step using a small rate τ (e.g. τ = 10−3 ):
θ′ ← τ θ + (1 − τ )θ′ . (1)
This slow blending ensures the target value remains stable, preventing catastrophic divergence during learning.
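In code, Eq. (1) amounts to the small helper below, applied to both target networks after every learning step (network names follow the earlier sketch).

```python
def soft_update(local_net, target_net, tau=1e-3):
    """Polyak averaging: θ' ← τθ + (1 − τ)θ' for every parameter pair."""
    for target_param, local_param in zip(target_net.parameters(), local_net.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)

# soft_update(actor, actor_target)
# soft_update(critic, critic_target)
```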
4 Comparative Analysis of Algorithms
The project highlights the fundamental differences in approach between value-based and policy gradient meth-
ods across the two action spaces.
Feature | Discrete DQN | Continuous DDPG | Advantage
Problem Type | Best for discrete action spaces | Necessary for continuous action spaces | DDPG
Efficiency (episodes to solve) | 1164 episodes | 171 episodes | DDPG (faster)
Policy Output | ϵ-greedy over 4 discrete actions | Deterministic 2-D throttle vector | DDPG (finer control)
Complexity | 2 networks (Q-local, Q-target) | 4 networks (Actor/Critic, local/target) | DQN (simpler)
5 Training Results and Discussion
The training sessions confirmed the relative complexity and efficiency of the two algorithms.
5.1 Part I: DQN Success
The DQN agent solved the environment at episode 1164, achieving a 100-episode moving-average score of
approximately 200.96. This validates deep Q-learning with its stabilization techniques (experience replay and
a target network) for environments with a continuous state space but a discrete action space.
5.2 Part II: DDPG Success and Efficiency
The DDPG agent demonstrated remarkable efficiency in solving the continuous challenge, achieving the success
criterion at episode 171 with an average score of approximately 200.55.
The speed difference is notable: DDPG’s direct policy optimization and smooth exploration strategy allowed
it to converge far more quickly than the DQN’s iterative estimation of Q-values for a discrete set of actions.
6 Result Visualization and Interpretation
6.1 DQN Score Curve
The visualization of the DQN scores showed a gradual but steady climb, indicating that the ϵ-greedy explo-
ration strategy was effective at discovering crucial, high-reward landing trajectories. As ϵ decayed, the agent
increasingly exploited the learned Q-values, leading to consistent high rewards and eventual satisfaction of the
success criterion.
6.2 DDPG Score and Loss Curves
The DDPG score curve exhibited extremely rapid learning, with scores rising from strongly negative values to
beyond the +200 threshold in relatively few episodes.
The corresponding loss curves validated the stability of the Actor-Critic system:
• Critic Loss: The Critic loss decreased rapidly and then stabilized, indicating that the value function was
accurately estimating the utility of the state-action pairs.
• Actor Loss: The Actor loss, which is the negative of the Critic's Q-value estimate for the Actor's chosen
actions, showed a consistent downward trajectory. This demonstrates that the policy network was continually
adjusting its parameters towards actions that maximize the Critic's evaluation, which translates directly into
the high episodic rewards (a sketch of how both losses are computed follows this list).
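For reference, the outline below shows how the two loss curves are produced: the Critic is regressed onto a bootstrapped TD target computed with the target networks, and the Actor's loss is the negative mean Q-value of its own actions. It reuses the network names from the earlier sketch, and γ = 0.99 is an assumed discount factor.

```python
import torch
import torch.nn.functional as F

gamma = 0.99  # assumed discount factor

def ddpg_losses(states, actions, rewards, next_states, dones):
    # Critic loss: mean-squared error against the TD target y, computed with the
    # *target* Actor and Critic so the regression target stays stable.
    with torch.no_grad():
        next_actions = actor_target(next_states)
        y = rewards + gamma * (1 - dones) * critic_target(next_states, next_actions).squeeze(-1)
    critic_loss = F.mse_loss(critic(states, actions).squeeze(-1), y)

    # Actor loss: negative mean Q-value of the Actor's current actions, so minimizing
    # it pushes the policy towards actions the Critic rates highly.
    actor_loss = -critic(states, actor(states)).mean()
    return critic_loss, actor_loss
```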
7 Conclusion
The project successfully implemented and demonstrated two foundational Deep Reinforcement Learning algo-
rithms across the two distinct control paradigms of the Lunar Lander problem. The results confirm:
• DQN is a viable, robust solution for continuous-state, discrete-action environments.
• DDPG is a more efficient and appropriate solution for continuous action control, requiring significantly
fewer interaction steps to achieve the solved criterion.
• The distinct stability measures—hard target updates in DQN versus soft updates and OU noise in DDPG—
highlight the architectural demands necessary to stabilize learning in different types of complex environ-
ments.