This repository implements a DQN-based agent for LunarLander-v3, with two knobs that actually matter:
- Experience Replay: Prioritized Experience Replay (PER) vs Uniform Replay vs Online updates
- Exploration: Parameter Noise (non-ε-greedy exploration) with optional ε-greedy baseline
It logs training metrics, saves checkpoints, plots reward curves, and records videos when a new best policy appears.
LunarLander-v3 (Gymnasium Box2D):
- Observation: 8D continuous state
- Actions: 4 discrete (do nothing, fire left engine, fire main engine, fire right engine)
- Reward shaping encourages a stable landing and penalizes crashes and fuel waste
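A quick way to sanity-check the environment in Gymnasium (requires the Box2D extra, e.g. `pip install "gymnasium[box2d]"`):

```python
import gymnasium as gym

env = gym.make("LunarLander-v3")
print(env.observation_space)  # Box(..., shape=(8,)): position, velocity, angle, leg contacts
print(env.action_space)       # Discrete(4)
obs, info = env.reset(seed=0)
env.close()
```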
Replay modes:
- Uniform replay: sample transitions uniformly from a replay buffer.
- PER: sample transitions with probability proportional to priority (derived from |TD error|), with importance-sampling weights and β annealing (a minimal sketch follows this list).
- Online: no replay buffer; update on every step as transitions arrive (baseline).
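For orientation, here is a minimal proportional-priority buffer sketch. It is illustrative only, not this repository's implementation; the class name, the linear β schedule, and the O(n) sampling (a real implementation would use a sum-tree) are all assumptions:

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Illustrative proportional PER sketch (O(n) sampling, no sum-tree)."""

    def __init__(self, capacity, alpha=0.6, beta_start=0.4, beta_frames=100_000, eps=1e-5):
        self.capacity = capacity
        self.alpha = alpha            # how strongly priorities skew sampling
        self.beta_start = beta_start  # initial importance-sampling correction
        self.beta_frames = beta_frames
        self.eps = eps                # keeps every priority strictly positive
        self.data = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos, self.frame = 0, 0

    def push(self, transition):
        # New transitions get the current max priority so they are seen at least once.
        max_prio = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        self.frame += 1
        # Anneal beta linearly from beta_start toward 1.
        beta = min(1.0, self.beta_start + (1.0 - self.beta_start) * self.frame / self.beta_frames)
        prios = self.priorities[: len(self.data)] ** self.alpha
        probs = prios / prios.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights correct the bias from non-uniform sampling.
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        self.priorities[idx] = np.abs(td_errors) + self.eps
```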
Exploration modes:
- Parameter noise: Gaussian noise applied to the network parameters to create a perturbed policy that is held fixed for one episode (sketch below).
- `epsilon_greedy` is included as a baseline (because reviewers love baselines).
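A minimal sketch of the parameter-noise idea, assuming a PyTorch Q-network. The helper name `perturbed_copy` is illustrative, and any adaptive noise-scale adjustment the repo may do is not shown:

```python
import copy
import torch

@torch.no_grad()
def perturbed_copy(q_net, noise_scale=0.1):
    """Return a copy of q_net with Gaussian noise added to every parameter.

    The perturbed network selects actions for one whole episode and is then
    discarded; the unperturbed q_net is the one that gets trained.
    """
    noisy = copy.deepcopy(q_net)
    for param in noisy.parameters():
        param.add_(torch.randn_like(param) * noise_scale)
    return noisy
```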
Install dependencies and train:

```bash
pip install -r requirements.txt
python scripts/train.py --config configs/default.yaml
```

Settings can be overridden from the command line, e.g.:

```bash
python scripts/train.py --config configs/default.yaml --set replay_mode=uniform exploration_mode=none
```

Common overrides:
- `replay_mode=per|uniform|online`
- `exploration_mode=param_noise|epsilon_greedy|none`
- `noise_scale=0.1`
- `num_episodes=800`
Run the 4-way comparison (uniform/PER × noise/no-noise):
```bash
python scripts/ablation.py --base configs/default.yaml --ablation configs/ablation.yaml
```

This produces separate run folders under `experiments/` with plots and models.
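Conceptually, the 2×2 grid amounts to a loop like the following hypothetical sketch, reusing the documented `--set` override syntax (the real `scripts/ablation.py` may be structured differently):

```python
import itertools
import subprocess

# Hypothetical sketch of the 4-way ablation grid; not the repo's actual code.
for replay_mode, exploration in itertools.product(["uniform", "per"], ["none", "param_noise"]):
    subprocess.run(
        ["python", "scripts/train.py", "--config", "configs/default.yaml",
         "--set", f"replay_mode={replay_mode}", f"exploration_mode={exploration}"],
        check=True,
    )
```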
Evaluate a trained checkpoint:

```bash
python scripts/evaluate.py --config experiments/<run>/config.yaml --checkpoint experiments/<run>/models/best_model.pth --episodes 20
```

Record a video of the best policy:

```bash
python scripts/record_video.py --config experiments/<run>/config.yaml --checkpoint experiments/<run>/models/best_model.pth --out_dir assets/videos --name best_lunarlander_dqn
```

Run the tests:

```bash
pytest -q
```

MIT (see LICENSE).