This repository implements Proximal Policy Optimization (PPO) using an Actor-Critic network to solve LunarLander-v3 with Gymnasium + PyTorch.
It trains from scratch, logs metrics, saves checkpoints, generates plots, and can record rollout videos.
- Training entry: `python scripts/train.py --config configs/default.yaml`
- Evaluation: `python scripts/evaluate.py --config configs/default.yaml --checkpoint <path>`
- Video: `python scripts/record_video.py --config configs/default.yaml --checkpoint <path>`
- Sweep: `python scripts/sweep.py --sweep configs/sweep.yaml`
Install dependencies and start training:

```bash
pip install -r requirements.txt
python scripts/train.py --config configs/default.yaml
```

Outputs go into a timestamped folder under `experiments/`:

- `train_metrics.csv` and `train_metrics.jsonl`
- `models/` (best/latest/final checkpoints)
- `plots/` (reward, loss, entropy, steps, elapsed time, eval curves)
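As a concrete example, a single run directory might look roughly like this. The timestamp format and most filenames are assumptions for illustration; `best_model.pth` is the name the evaluation command below references:

```text
experiments/
└── 2024-01-01_12-00-00/        # timestamped run folder (format assumed)
    ├── train_metrics.csv
    ├── train_metrics.jsonl
    ├── models/
    │   ├── best_model.pth
    │   ├── latest_model.pth    # "latest" checkpoint name assumed
    │   └── final_model.pth     # "final" checkpoint name assumed
    └── plots/
        ├── reward.png          # plot filenames assumed
        ├── loss.png
        └── ...
```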
Evaluate a checkpoint:

```bash
python scripts/evaluate.py \
  --config configs/default.yaml \
  --checkpoint experiments/<run>/models/best_model.pth \
  --episodes 20 \
  --deterministic
```

Record a rollout video:

```bash
python scripts/record_video.py \
  --config configs/default.yaml \
  --checkpoint experiments/<run>/models/best_model.pth \
  --out_dir assets/videos
```

Run a hyperparameter sweep:

```bash
python scripts/sweep.py --sweep configs/sweep.yaml
```

LunarLander-v3 is a classic discrete-action control environment with:
- Observation: 8-dimensional continuous state
- Actions: 4 discrete thrusters
- Reward shaping that encourages stable landing and penalizes fuel waste
See Gymnasium docs for full detail.
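A minimal sanity check of the environment's interface, using only the standard Gymnasium API (no code from this repository):

```python
import gymnasium as gym

env = gym.make("LunarLander-v3")
print(env.observation_space)  # Box of shape (8,): the 8-dimensional continuous state
print(env.action_space)       # Discrete(4): do nothing / fire left / fire main / fire right

obs, info = env.reset(seed=0)
terminated = truncated = False
total_reward = 0.0
while not (terminated or truncated):
    action = env.action_space.sample()  # random policy, just to exercise the API
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
print(f"random-policy episode return: {total_reward:.1f}")
env.close()
```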
This implementation uses an Actor-Critic model and the PPO clipped surrogate objective:

- Policy loss: `min(r_t · A_t, clip(r_t, 1−ε, 1+ε) · A_t)`, where `r_t = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)` is the probability ratio between the new and old policy
- Value loss: MSE between the predicted value and the return
- Entropy bonus: encourages exploration

Advantages are computed with GAE(λ) and then normalized for stability; a sketch of both pieces follows.
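A minimal PyTorch sketch of GAE and the clipped objective. This is not the repository's exact code: the λ value, tensor layout, and function signatures are assumptions, with coefficient names chosen to mirror the config keys below.

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout of 1-D tensors.
    lam=0.95 is a common default, assumed here."""
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]                       # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values                   # regression targets for the value head
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns

def ppo_loss(new_log_probs, old_log_probs, advantages, new_values, returns,
             entropy, clip_epsilon=0.2, value_loss_coef=0.5, entropy_coef=0.01):
    ratio = torch.exp(new_log_probs - old_log_probs)            # r_t
    clipped = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = torch.nn.functional.mse_loss(new_values, returns)
    return policy_loss + value_loss_coef * value_loss - entropy_coef * entropy.mean()
```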
The following table documents the hyperparameter configurations explored during experimentation. Here `eps_decay` is the per-update decay applied to the clip epsilon, bounded below by `min_clip_epsilon`, and `max_timesteps` is the per-episode step limit:
| Setup # | Learning Rate | Gamma | Clip Epsilon | Update Epochs | Mini-Batch Size | Entropy Coef | Value Loss Coef | Hidden Dim | Max Timesteps | Eps Decay | Min Clip Epsilon |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3e-4 | 0.99 | 0.2 | 4 | 64 | 0.01 | 0.5 | 128 | 1000 | 0.995 | 0.1 |
| 2 | 3e-4 | 0.99 | 0.2 | 4 | 64 | 0.02 | 0.5 | 128 | 1000 | 0.995 | 0.1 |
| 3 | 3e-4 | 0.99 | 0.2 | 8 | 64 | 0.01 | 0.5 | 128 | 1000 | 0.995 | 0.1 |
| 4 | 1e-3 | 0.99 | 0.2 | 4 | 64 | 0.01 | 0.5 | 128 | 1000 | 0.995 | 0.1 |
| 5 | 3e-4 | 0.99 | 0.3 | 4 | 64 | 0.01 | 0.5 | 128 | 1000 | 0.995 | 0.1 |
| 6 | 3e-4 | 0.99 | 0.2 | 4 | 128 | 0.01 | 0.5 | 128 | 1000 | 0.995 | 0.1 |
| 7 | 3e-4 | 0.99 | 0.2 | 4 | 64 | 0.01 | 0.7 | 128 | 1000 | 0.995 | 0.1 |
| 8 | 3e-4 | 0.99 | 0.2 | 4 | 64 | 0.01 | 0.5 | 128 | 1000 | 0.99 | 0.1 |
| 9 | 3e-4 | 0.99 | 0.1 | 4 | 64 | 0.01 | 0.5 | 128 | 1000 | 0.995 | 0.05 |
| 10 | 3e-4 | 0.99 | 0.2 | 4 | 64 | 0.005 | 0.5 | 128 | 1000 | 0.995 | 0.1 |
| 11 | 3e-4 | 0.995 | 0.2 | 4 | 64 | 0.01 | 0.5 | 128 | 1000 | 0.995 | 0.1 |
| 12 | 5e-4 | 0.99 | 0.2 | 4 | 64 | 0.01 | 0.5 | 128 | 1000 | 0.995 | 0.1 |
| 13 | 3e-4 | 0.99 | 0.2 | 6 | 64 | 0.01 | 0.5 | 128 | 1000 | 0.995 | 0.1 |
| 14 | 3e-4 | 0.99 | 0.2 | 4 | 32 | 0.01 | 0.5 | 128 | 1000 | 0.995 | 0.1 |
| 15 | 3e-4 | 0.99 | 0.2 | 4 | 64 | 0.015 | 0.4 | 128 | 1000 | 0.995 | 0.1 |
| 16 | 3e-4 | 0.99 | 0.25 | 4 | 64 | 0.01 | 0.5 | 128 | 1000 | 0.995 | 0.1 |
| 17 | 3e-4 | 0.99 | 0.2 | 4 | 64 | 0.01 | 0.5 | 128 | 1000 | 0.997 | 0.1 |
| 18 | 3e-4 | 0.99 | 0.2 | 4 | 64 | 0.01 | 0.5 | 256 | 1000 | 0.995 | 0.1 |
| 19 | 3e-4 | 0.98 | 0.2 | 4 | 64 | 0.01 | 0.5 | 128 | 1000 | 0.995 | 0.1 |
| 20 | 3e-4 | 0.99 | 0.2 | 10 | 32 | 0.01 | 0.5 | 128 | 1000 | 0.995 | 0.1 |
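If you want to reproduce a grid like the one above via `scripts/sweep.py`, a sweep file might look like the following. The actual schema of `configs/sweep.yaml` is an assumption; only the parameter names and values are taken from the table:

```yaml
# Hypothetical sweep file; the real configs/sweep.yaml schema may differ.
base_config: configs/default.yaml
grid:
  learning_rate: [3.0e-4, 5.0e-4, 1.0e-3]
  clip_epsilon: [0.1, 0.2, 0.25, 0.3]
  update_epochs: [4, 6, 8, 10]
  entropy_coef: [0.005, 0.01, 0.015, 0.02]
```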
This configuration is the selected “best” setup used for the longer training run:
| Parameter | Value |
|---|---|
| env_name | LunarLander-v3 |
| max_episodes | 10000 |
| learning_rate | 0.0003 |
| gamma | 0.99 |
| clip_epsilon | 0.2 |
| update_epochs | 4 |
| mini_batch_size | 64 |
| entropy_coef | 0.005 |
| value_loss_coef | 0.5 |
| hidden_dim | 128 |
| max_timesteps | 1000 |
| eps_decay | 0.995 |
| min_clip_epsilon | 0.1 |
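For reference, the same best configuration written as YAML. The parameter names and values match the table; whether `configs/default.yaml` uses exactly this flat layout is an assumption:

```yaml
env_name: LunarLander-v3
max_episodes: 10000
learning_rate: 0.0003
gamma: 0.99
clip_epsilon: 0.2
update_epochs: 4
mini_batch_size: 64
entropy_coef: 0.005
value_loss_coef: 0.5
hidden_dim: 128
max_timesteps: 1000
eps_decay: 0.995
min_clip_epsilon: 0.1
```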
Training plots (best run snapshot): see `plots/` in the run directory under `experiments/`.

Policy rollout video (best model): `video.mp4`
Run the test suite with:

```bash
pytest -q
```

License: MIT (see LICENSE).

