A workflow for evaluating trade-offs between data efficiency, compute efficiency, and performance for online off-policy RL, validated across multiple algorithms and environments.
Oleh Rybkin1,
Michal Nauman1,2,
Preston Fu1,
Charlie Snell1,
Pieter Abbeel1,
Sergey Levine1,
Aviral Kumar3
1UC Berkeley, 2University of Warsaw, 3Carnegie Mellon University
QScaled can be easily installed in any environment with Python >= 3.10.
pip install -e .
We collect run data from the Wandb API using the BaseCollector subclasses
defined in qscaled/wandb_utils. This data is then formatted into zip files,
saved to ~/.qscaled/zip by default. To run analyses on this data, experiments
reference these zip files by name.
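As a purely illustrative, stdlib-only sketch of this collect-then-archive pattern (the function names and JSON-per-run layout below are hypothetical, not the actual schema the qscaled collectors produce), run metrics can be round-tripped through a zip archive like so:

```python
import json
import tempfile
import zipfile
from pathlib import Path

def save_runs_to_zip(runs, out_path):
    """Serialize a mapping of run name -> metric dict into a zip archive.

    Each run becomes one JSON file inside the archive, so later analyses
    can reference the archive by name without touching the Wandb API.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for run_name, metrics in runs.items():
            zf.writestr(f"{run_name}.json", json.dumps(metrics))

def load_runs_from_zip(path):
    """Inverse of save_runs_to_zip: rebuild the run-name -> metrics mapping."""
    runs = {}
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            runs[Path(info.filename).stem] = json.loads(zf.read(info.filename))
    return runs

# Example: two seeds of the same configuration, with per-step returns.
runs = {
    "sac_utd1_seed0": {"step": [0, 1000], "return": [10.0, 95.0]},
    "sac_utd1_seed1": {"step": [0, 1000], "return": [12.0, 90.0]},
}
archive = Path(tempfile.mkdtemp()) / "example.zip"
save_runs_to_zip(runs, archive)
assert load_runs_from_zip(archive) == runs
```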
First, download the zip files from our experiments.
bash qscaled/scripts/download_zipdata.sh
Using data from our hyperparameter grid search, we can compute the "best" batch size and learning rate:
cd experiments/1_grid_search
python gym_compute_params.py
For a closer look at our method, see e.g. gym_explore.ipynb.
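As a toy illustration of what selecting a "best" configuration from a grid can look like (the numbers and selection criterion below are made up; gym_compute_params.py implements the actual procedure), one might pick, for each UTD, the (B, η) pair that reaches a target return with the least data:

```python
# Hypothetical grid-search results: rows of (utd, batch_size, lr, steps_to_target),
# where steps_to_target is the env steps needed to reach a fixed return threshold.
results = [
    (1, 128, 3e-4, 900_000),
    (1, 256, 3e-4, 700_000),
    (1, 256, 1e-3, 800_000),
    (4, 128, 3e-4, 500_000),
    (4, 256, 1e-4, 400_000),
]

# For each UTD, keep the (batch, lr) pair that needed the least data.
best = {}
for utd, batch, lr, steps in results:
    if utd not in best or steps < best[utd][2]:
        best[utd] = (batch, lr, steps)

for utd, (batch, lr, steps) in sorted(best.items()):
    print(f"UTD {utd}: B*={batch}, lr*={lr:g}, data={steps}")
```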
Then, using data from our runs with these fitted hyperparameters, we follow the rest of the procedure described in the paper:

- Fitting the minimum amount of data $\mathcal{D}_J(\sigma)$ needed to achieve performance level $J$ at UTD $\sigma$.
- Using these fits at different performance levels $J$ to fit $\sigma^*(\mathcal{F}_0)$.

This procedure is detailed in the notebooks in experiments/2_fitted.
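As a minimal sketch of the first fitting step on synthetic data (assuming a power-law form $\mathcal{D}_J(\sigma) = a\sigma^b$ for illustration; this is not the notebooks' actual fitting code), a least-squares line in log-log space recovers the exponent:

```python
import numpy as np

# Synthetic data: minimum data to reach performance J at each UTD sigma,
# generated from a ground-truth power law D = a * sigma^b plus small noise.
sigmas = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
rng = np.random.default_rng(0)
true_a, true_b = 1e6, -0.5
data_needed = true_a * sigmas**true_b * np.exp(rng.normal(0, 0.02, sigmas.size))

# Least-squares fit of log D = log a + b * log sigma.
X = np.stack([np.ones_like(sigmas), np.log(sigmas)], axis=1)
coef, *_ = np.linalg.lstsq(X, np.log(data_needed), rcond=None)
log_a, b = coef
print(f"fitted exponent b = {b:.3f}, prefactor a = {np.exp(log_a):.3g}")
```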
The full workflow is as follows:
1. Run a hyperparameter grid search over UTD $\sigma$, batch size $B$, and learning rate $\eta$, with logging to Wandb. In our experiments, we run 8-10 seeds per configuration. Using the results of this sweep, we will fit the "best" batch size and learning rate.
2. Depending on whether you are running one or multiple seeds in a single Wandb run, implement a subclass of OneSeedPerRunCollector or MultipleSeedsPerRunCollector; these have example subclass implementations ExampleOneSeedPerRunCollector and ExampleMultipleSeedsPerRunCollector, respectively.
3. Make a copy of a notebook in experiments/1_grid_search.
4. Label your Wandb runs with tags (or, if you don't have many runs, skip this step and leave wandb_tags as []). You can add tags by selecting runs in the Wandb UI and clicking "Tag".
5. Update the SweepConfig. This procedure takes ~10 minutes!
6. Run the notebook and inspect the fits to determine whether a fit with shared or separate log-log slopes works better for your use case (see the following section for more details). Hyperparameters are saved to experiments/outputs/grid_proposed_hparams.
7. Run experiments using these fits on a larger range of UTDs.
8. Make a copy of a notebook in experiments/2_fitted, follow the same setup as in steps 4 and 5, and run!
See experiments/outputs/grid_proposed_hparams.

- shared (recommended): Our batch size $B^*(\sigma)$ and learning rate $\eta^*(\sigma)$ log-linear fits use a shared slope across all tasks within the same benchmark.
- separate: We fit $B^*(\sigma)$ and $\eta^*(\sigma)$ separately for each task.
- baseline_utd{sigma}: We compare our approach against taking the best $B$ and $\eta$ for a given UTD $\sigma$ and reusing the same $B$ and $\eta$ at all other UTDs.
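To make the shared-versus-separate distinction concrete, here is a hedged numpy sketch on made-up numbers (not the notebooks' actual fitting code): the shared fit uses one log-log slope with a per-task intercept, while the separate fit gives each task its own slope and intercept.

```python
import numpy as np

# Hypothetical per-task grid results: best batch size at each UTD sigma.
sigmas = np.array([1.0, 2.0, 4.0, 8.0])
best_B = {
    "task_a": np.array([64.0, 96.0, 140.0, 210.0]),
    "task_b": np.array([128.0, 190.0, 280.0, 420.0]),
}

# Shared fit: one slope for all tasks, one intercept per task.
# Design-matrix columns: [1_{task_a}, 1_{task_b}, log sigma].
tasks = list(best_B)
rows, targets = [], []
for i, task in enumerate(tasks):
    for s, b in zip(sigmas, best_B[task]):
        one_hot = [0.0] * len(tasks)
        one_hot[i] = 1.0
        rows.append(one_hot + [np.log(s)])
        targets.append(np.log(b))
coef, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
*intercepts, shared_slope = coef
print(f"shared slope: {shared_slope:.3f}")

# Separate fit: independent slope and intercept per task.
for task in tasks:
    X = np.stack([np.ones_like(sigmas), np.log(sigmas)], axis=1)
    c, *_ = np.linalg.lstsq(X, np.log(best_B[task]), rcond=None)
    print(f"{task}: slope {c[1]:.3f}")
```

Pooling tasks under a shared slope trades a little per-task flexibility for a more robust fit when each task contributes only a few UTD settings.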
@inproceedings{
rybkin2025valuebased,
title={Value-Based Deep {RL} Scales Predictably},
author={Oleh Rybkin and Michal Nauman and Preston Fu and Charlie Victor Snell and Pieter Abbeel and Sergey Levine and Aviral Kumar},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
eprint={2502.04327},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.04327}
}