A workflow for evaluating trade-offs between data efficiency, compute efficiency, and performance for online off-policy RL, validated across multiple algorithms and environments.
Oleh Rybkin1,
Michal Nauman1,2,
Preston Fu1,
Charlie Snell1,
Pieter Abbeel1,
Sergey Levine1,
Aviral Kumar3
1UC Berkeley, 2University of Warsaw, 3Carnegie Mellon University
QScaled can be easily installed in any environment with Python >= 3.10.
pip install -e .
We collect run data from the Wandb API using the BaseCollector subclasses
defined in qscaled/wandb_utils. This data is then formatted into zip files,
saved to ~/.qscaled/zip by default. To run analyses on this data, experiments
reference these zip files by name.
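As a purely illustrative, stdlib-only sketch of this collect-then-archive pattern (the function names and JSON-per-run layout below are hypothetical, not the actual schema the qscaled collectors produce), run metrics can be round-tripped through a zip archive like so:

```python
import json
import tempfile
import zipfile
from pathlib import Path

def save_runs_to_zip(runs, out_path):
    """Serialize a mapping of run name -> metric dict into a zip archive.

    Each run becomes one JSON file inside the archive, so later analyses
    can reference the archive by name without touching the Wandb API.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for run_name, metrics in runs.items():
            zf.writestr(f"{run_name}.json", json.dumps(metrics))

def load_runs_from_zip(path):
    """Inverse of save_runs_to_zip: rebuild the run-name -> metrics mapping."""
    runs = {}
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            runs[Path(info.filename).stem] = json.loads(zf.read(info.filename))
    return runs

# Example: two seeds of the same configuration, with per-step returns.
runs = {
    "sac_utd1_seed0": {"step": [0, 1000], "return": [10.0, 95.0]},
    "sac_utd1_seed1": {"step": [0, 1000], "return": [12.0, 90.0]},
}
archive = Path(tempfile.mkdtemp()) / "example.zip"
save_runs_to_zip(runs, archive)
assert load_runs_from_zip(archive) == runs
```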
First, download the zip files from our experiments.
bash qscaled/scripts/download_zipdata.sh
Using data from our hyperparameter grid search, we can compute the "best" batch size and learning rate:
cd experiments/1_grid_search
python gym_compute_params.py
For a closer look at our method, see e.g. gym_explore.ipynb.
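As a toy illustration of what selecting a "best" configuration from a grid can look like (the numbers and selection criterion below are made up; gym_compute_params.py implements the actual procedure), one might pick, for each UTD, the (B, η) pair that reaches a target return with the least data:

```python
# Hypothetical grid-search results: rows of (utd, batch_size, lr, steps_to_target),
# where steps_to_target is the env steps needed to reach a fixed return threshold.
results = [
    (1, 128, 3e-4, 900_000),
    (1, 256, 3e-4, 700_000),
    (1, 256, 1e-3, 800_000),
    (4, 128, 3e-4, 500_000),
    (4, 256, 1e-4, 400_000),
]

# For each UTD, keep the (batch, lr) pair that needed the least data.
best = {}
for utd, batch, lr, steps in results:
    if utd not in best or steps < best[utd][2]:
        best[utd] = (batch, lr, steps)

for utd, (batch, lr, steps) in sorted(best.items()):
    print(f"UTD {utd}: B*={batch}, lr*={lr:g}, data={steps}")
```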
Then, using data from our runs with these fitted hyperparameters, we follow the rest of the procedure described in the paper:

- Fitting the minimum amount of data $\mathcal{D}_J(\sigma)$ needed to achieve performance level $J$ at UTD $\sigma$.
- Using these fits at different performance levels $J$ to fit $\sigma^*(\mathcal{F}_0)$.

This procedure is detailed in the notebooks in experiments/2_fitted.
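As a minimal sketch of the first fitting step on synthetic data (assuming a power-law form $\mathcal{D}_J(\sigma) = a\sigma^b$ for illustration; this is not the notebooks' actual fitting code), a least-squares line in log-log space recovers the exponent:

```python
import numpy as np

# Synthetic data: minimum data to reach performance J at each UTD sigma,
# generated from a ground-truth power law D = a * sigma^b plus small noise.
sigmas = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
rng = np.random.default_rng(0)
true_a, true_b = 1e6, -0.5
data_needed = true_a * sigmas**true_b * np.exp(rng.normal(0, 0.02, sigmas.size))

# Least-squares fit of log D = log a + b * log sigma.
X = np.stack([np.ones_like(sigmas), np.log(sigmas)], axis=1)
coef, *_ = np.linalg.lstsq(X, np.log(data_needed), rcond=None)
log_a, b = coef
print(f"fitted exponent b = {b:.3f}, prefactor a = {np.exp(log_a):.3g}")
```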
The full workflow is as follows:
1. Run a hyperparameter grid search over UTD $\sigma$, batch size $B$, and learning rate $\eta$, with logging to Wandb. In our experiments, we run 8-10 seeds per configuration. Using the results of this sweep, we will fit the "best" batch size and learning rate.
2. Depending on whether you are running one or multiple seeds in a single Wandb run, implement a subclass of OneSeedPerRunCollector or MultipleSeedsPerRunCollector; these have example subclass implementations ExampleOneSeedPerRunCollector and ExampleMultipleSeedsPerRunCollector, respectively.
3. Make a copy of a notebook in experiments/1_grid_search.
4. Label your Wandb runs with tags (or, if you don't have many runs, skip this step and leave wandb_tags as []). You can add tags by selecting runs in the Wandb UI and clicking "Tag".
5. Update the SweepConfig. This procedure takes ~10 minutes!
6. Run the notebook and inspect the fits to determine whether a fit with shared or separate log-log slopes works better for your use case (see the following section for more details). Hyperparameters are saved to experiments/outputs/grid_proposed_hparams.
7. Run experiments using these fits on a larger range of UTDs.
8. Make a copy of a notebook in experiments/2_fitted, follow the same setup as in steps 4 and 5, and run!
See experiments/outputs/grid_proposed_hparams.

- shared (recommended): Our batch size $B^*(\sigma)$ and learning rate $\eta^*(\sigma)$ log-linear fits use a shared slope across all tasks within the same benchmark.
- separate: We fit $B^*(\sigma)$ and $\eta^*(\sigma)$ separately for each task.
- baseline_utd{sigma}: We compare our approach against taking the best $B$ and $\eta$ for a given UTD $\sigma$ and reusing the same $B$ and $\eta$ at all other UTDs.
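To make the shared-versus-separate distinction concrete, here is a hedged numpy sketch on made-up numbers (not the notebooks' actual fitting code): the shared fit uses one log-log slope with a per-task intercept, while the separate fit gives each task its own slope and intercept.

```python
import numpy as np

# Hypothetical per-task grid results: best batch size at each UTD sigma.
sigmas = np.array([1.0, 2.0, 4.0, 8.0])
best_B = {
    "task_a": np.array([64.0, 96.0, 140.0, 210.0]),
    "task_b": np.array([128.0, 190.0, 280.0, 420.0]),
}

# Shared fit: one slope for all tasks, one intercept per task.
# Design-matrix columns: [1_{task_a}, 1_{task_b}, log sigma].
tasks = list(best_B)
rows, targets = [], []
for i, task in enumerate(tasks):
    for s, b in zip(sigmas, best_B[task]):
        one_hot = [0.0] * len(tasks)
        one_hot[i] = 1.0
        rows.append(one_hot + [np.log(s)])
        targets.append(np.log(b))
coef, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
*intercepts, shared_slope = coef
print(f"shared slope: {shared_slope:.3f}")

# Separate fit: independent slope and intercept per task.
for task in tasks:
    X = np.stack([np.ones_like(sigmas), np.log(sigmas)], axis=1)
    c, *_ = np.linalg.lstsq(X, np.log(best_B[task]), rcond=None)
    print(f"{task}: slope {c[1]:.3f}")
```

Pooling tasks under a shared slope trades a little per-task flexibility for a more robust fit when each task contributes only a few UTD settings.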
@inproceedings{
rybkin2025valuebased,
title={Value-Based Deep {RL} Scales Predictably},
author={Oleh Rybkin and Michal Nauman and Preston Fu and Charlie Victor Snell and Pieter Abbeel and Sergey Levine and Aviral Kumar},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
eprint={2502.04327},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.04327}
}