This code implements and evaluates algorithms for the paper *Reinforcement Learning with Combinatorial Actions for Coupled Restless Bandits* by Lily Xu, Bryan Wilder, Elias B. Khalil, and Milind Tambe, which appeared at the International Conference on Learning Representations (ICLR 2025). In this paper, we consider reinforcement learning (RL) with combinatorial actions, specifically for restless bandits.
Typical restless bandits assume a simple budget constraint, which allows planning to be decoupled across arms: an optimal policy is learned for each arm separately, then a simple threshold-based heuristic selects the joint action.
Here, we introduce coRMAB, a broader class of problems with combinatorial actions that cannot be decoupled across the arms of the restless bandit, requiring direct solving over the joint, exponentially large action space. We propose SEQUOIA, an algorithm that directly optimizes long-term reward by embedding a Q-network into a mixed-integer linear program (MILP) to select a combinatorial action at each timestep.
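To illustrate the core idea, here is a minimal, self-contained sketch (not the repo's implementation) of embedding a one-hidden-layer ReLU Q-network into a Gurobi MILP with standard big-M constraints, then maximizing the predicted Q-value over binary actions under a budget. All weights and sizes are toy placeholders:

```python
# Minimal sketch, NOT the repo's implementation: embed a tiny one-hidden-layer
# ReLU Q-network into a MILP via big-M constraints, then pick the binary
# action maximizing predicted Q under a budget. Weights (W1, b1, W2) and the
# bound M are toy placeholders.
import numpy as np
import gurobipy as gp
from gurobipy import GRB

J, H = 4, 8                                   # arms, hidden units (toy sizes)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(H, J)), rng.normal(size=H)
W2 = rng.normal(size=H)
M, budget = 100.0, 2                          # big-M bound on pre-activations

m = gp.Model("q_milp")
a = m.addVars(J, vtype=GRB.BINARY, name="action")   # combinatorial action
z = m.addVars(H, lb=0.0, name="pos")                # post-ReLU activation
s = m.addVars(H, lb=0.0, name="neg")                # negative part of pre-activation
y = m.addVars(H, vtype=GRB.BINARY, name="on")       # ReLU phase indicator

for h in range(H):
    pre = gp.quicksum(W1[h, j] * a[j] for j in range(J)) + b1[h]
    m.addConstr(z[h] - s[h] == pre)       # z - s equals the pre-activation
    m.addConstr(z[h] <= M * y[h])         # z forced to 0 when ReLU is off
    m.addConstr(s[h] <= M * (1 - y[h]))   # s forced to 0 when ReLU is on

m.addConstr(a.sum() <= budget)                       # budget constraint
m.setObjective(gp.quicksum(W2[h] * z[h] for h in range(H)), GRB.MAXIMIZE)
m.optimize()
print("best action:", [int(a[j].X) for j in range(J)])
```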
In this codebase, we implement and evaluate SEQUOIA. We include the simulation environments introduced in the paper:
- path-constrained,
- schedule-constrained,
- capacity-constrained, and
- multiple interventions.
This library also implements and compares against a variety of baselines: null (no action), random, sampling, myopic, iterative myopic, and iterative DQN.
Details about all the environments and their implementation are in the paper appendix.
Running `driver.py` runs all four domains in sequence.
The following options are available:
- `-H` horizon, default $H=20$
- `-J` number of arms, default $J=20$
- `-B` budget, default $B=8$
- `-N` number of actions (used for the multiple interventions setting), default $N=12$
- `-D` type of RMAB setting `{sigmoid, scheduling, constrained, routing}`, default `constrained`
- `-p` prefix for output files, default empty `''`
- `-s` random seed, default $s=0$
- `-K` number of samples (for the sampling baseline), default $K=100$
- `-V` number of episodes for evaluation, default $V=50$
- `-F` avoid training the model (for fast prototyping only), default `False`
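For reference, a hypothetical sketch of how these flags could be wired up with argparse; the long option names other than `--rmab_type` (which appears in the example commands below) are our invention, and the actual `driver.py` may differ:

```python
# Hypothetical parsing of the flags above with argparse; defaults follow
# the list above, but long names are assumptions.
import argparse

parser = argparse.ArgumentParser(description="SEQUOIA driver (sketch)")
parser.add_argument("-H", "--horizon", type=int, default=20)
parser.add_argument("-J", "--n_arms", type=int, default=20)
parser.add_argument("-B", "--budget", type=int, default=8)
parser.add_argument("-N", "--n_actions", type=int, default=12)
parser.add_argument("-D", "--rmab_type", default="constrained",
                    choices=["sigmoid", "scheduling", "constrained", "routing"])
parser.add_argument("-p", "--prefix", default="")
parser.add_argument("-s", "--seed", type=int, default=0)
parser.add_argument("-K", "--n_samples", type=int, default=100)
parser.add_argument("-V", "--n_episodes", type=int, default=50)
parser.add_argument("-F", "--no_train", action="store_true")
args = parser.parse_args()
```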
The four domains of coRMAB implemented here are:
**Path-constrained (routing).** Each arm lies within a network, and we must find a valid route through the network.
Network from the London Underground map: https://commons.wikimedia.org/wiki/London_Underground_geographic_maps
```
python driver.py -s 0 -H 20 -J 20 -B 5 -p run_note --rmab_type routing -V 1 -K 10 -F
python driver.py -s 0 -H 20 -J 40 -B 10 -p run_note --rmab_type routing -V 50 -K 100
```
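As a toy illustration of what the path constraint means (assumed semantics, not the repo's code): a feasible action is the set of arms visited by a single walk of at most $B$ edges from a starting node.

```python
# Toy illustration (assumed semantics): arms are graph nodes; a feasible
# action is the set of nodes visited by one walk of at most B edges from a
# depot. Uses a small stand-in graph, not the London Underground network.
import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 0), (1, 4)])  # toy network
depot, B = 0, 3

def feasible_actions(G, depot, B):
    """Enumerate arm sets reachable by a single walk of <= B edges."""
    actions, frontier = set(), [(depot, (depot,))]
    while frontier:
        node, path = frontier.pop()
        actions.add(frozenset(path))          # arms visited so far
        if len(path) - 1 < B:                 # edge budget not yet spent
            frontier.extend((n, path + (n,)) for n in G.neighbors(node))
    return actions

print(f"{len(feasible_actions(G, depot, B))} feasible arm sets")
```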
**Schedule-constrained (scheduling).** Each arm and each worker is available for only a limited number of scheduling slots; a worker can pull an arm only during a slot in which both are available.
```
python driver.py -s 0 -H 5 -J 5 -B 4 -p run_note --rmab_type scheduling -V 1 -K 10 -F
python driver.py -s 0 -H 20 -J 20 -B 5 -p run_note --rmab_type scheduling -V 50 -K 100
python driver.py -s 0 -H 20 -J 100 -B 20 -p run_note --rmab_type scheduling -V 50 -K 100
```
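A toy sketch of the compatibility this induces (simplified semantics we assume here; the repo handles the constraint inside its MILP formulation): workers and arms are compatible when their slot sets overlap, and a maximum bipartite matching gives one feasible assignment.

```python
# Toy sketch of the availability constraint (simplified, assumed semantics):
# a worker can pull an arm only if they share an available slot; a maximum
# bipartite matching on the compatibility graph gives a feasible assignment.
import networkx as nx

worker_slots = {"w0": {0, 1}, "w1": {1, 2}}             # slots per worker
arm_slots = {"a0": {1}, "a1": {2, 3}, "a2": {0}}        # slots per arm

G = nx.Graph()
G.add_nodes_from(worker_slots, bipartite=0)
G.add_nodes_from(arm_slots, bipartite=1)
for w, ws in worker_slots.items():
    for a, sl in arm_slots.items():
        if ws & sl:                           # overlapping slot -> compatible
            G.add_edge(w, a)

matching = nx.bipartite.maximum_matching(G, top_nodes=worker_slots)
print({w: matching[w] for w in worker_slots if w in matching})
```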
**Capacity-constrained (constrained).** Each worker is limited by a capacity constraint on how many arms they can pull; each arm has a different cost.
```
python driver.py -s 0 -H 5 -J 5 -B 4 -p run_note --rmab_type constrained -V 1 -K 10 -F
python driver.py -s 0 -H 20 -J 20 -B 5 -p run_note --rmab_type constrained -V 50 -K 100
python driver.py -s 0 -H 20 -J 100 -B 20 -p run_note --rmab_type constrained -V 50 -K 100
```
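A toy sketch of the constraint structure (assumed semantics; the objective below is a stand-in for the Q-value the real solver maximizes):

```python
# Toy sketch of the capacity structure (assumed semantics): assign arms to
# workers so each worker's total cost stays within capacity. The objective
# is a stand-in; the real solver maximizes a Q-network's predicted value.
import gurobipy as gp
from gurobipy import GRB

costs = [3, 1, 4, 2]                 # toy per-arm costs
capacity = [5, 4]                    # toy per-worker capacities
J, W = len(costs), len(capacity)

m = gp.Model("capacity")
x = m.addVars(J, W, vtype=GRB.BINARY)                 # x[j, w]: w pulls arm j
m.addConstrs(x.sum(j, "*") <= 1 for j in range(J))    # each arm pulled once
m.addConstrs(                                          # per-worker capacity
    gp.quicksum(costs[j] * x[j, w] for j in range(J)) <= capacity[w]
    for w in range(W))
m.setObjective(x.sum(), GRB.MAXIMIZE)                 # stand-in objective
m.optimize()
print([(j, w) for j in range(J) for w in range(W) if x[j, w].X > 0.5])
```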
**Multiple interventions (sigmoid).** The multiple-intervention RMAB setting is implemented in `multi_action.py` as `MultiActionRMAB` with `link_type='sigmoid'`.
The effects of these multiple cumulative actions are modeled with a sigmoid link function. The piecewise-linear version of the sigmoid function used is specified in `sigmoid_alpha4_beta2_x3_breaks10.pickle`.
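A sketch of one plausible piecewise-linear sigmoid construction; the parameter names mirror the pickle filename, but their exact interpretation here is our assumption:

```python
# Sketch of one plausible piecewise-linear sigmoid (our construction; the
# repo loads precomputed breakpoints from the pickle file). Parameter names
# mirror the filename, but their interpretation here is an assumption.
import numpy as np

alpha, beta, x_range, n_breaks = 4.0, 2.0, 3.0, 10
xs = np.linspace(-x_range, x_range, n_breaks)            # breakpoints
ys = 1.0 / (1.0 + np.exp(-(alpha * xs - beta)))          # sigmoid values

def pwl_sigmoid(x):
    """Linear interpolation between the sigmoid breakpoints."""
    return np.interp(x, xs, ys)

print(pwl_sigmoid(0.5), 1.0 / (1.0 + np.exp(-(alpha * 0.5 - beta))))
```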
Example execution:
```
python driver.py -s 0 -H 5 -J 5 -N 8 -B 4 -p run_note -V 1 -K 10 --rmab_type sigmoid -F
python driver.py -s 0 -H 20 -J 20 -N 10 -B 5 -p run_note -V 50 -K 100 --rmab_type sigmoid
```
We compare against the following baselines:
- Null (no action): `baseline_null_action()` in `baseline.py`
- Random: `baseline_random()` in `baseline.py`
- Sampling: `baseline_sample()` in `baseline.py`
- Myopic: `baseline_myopic()` in `baseline.py`
- Iterative myopic: `baseline_greedy_iterative_myopic()` in `baseline.py`
- Iterative DQN: `baseline_iterative_dqn()` in `baseline_iterative_dqn.py`
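As a sketch of how a sampling baseline of this kind typically works (assumed behavior; the repo's `baseline_sample()` may differ): draw $K$ random feasible actions, score each, and keep the best.

```python
# Sketch of a generic sampling baseline (assumed behavior; the repo's
# baseline_sample() may differ): draw K random feasible actions, score
# each with a reward/Q estimate, and keep the best.
import numpy as np

def sampling_baseline(score_fn, sample_action, K=100, seed=0):
    """score_fn: action -> float; sample_action: rng -> feasible action."""
    rng = np.random.default_rng(seed)
    candidates = [sample_action(rng) for _ in range(K)]
    return max(candidates, key=score_fn)

# Toy stand-ins: actions are binary vectors with exactly B ones.
J, B = 10, 3
sample = lambda rng: tuple(rng.permutation([1] * B + [0] * (J - B)))
score = lambda a: sum(i * v for i, v in enumerate(a))   # dummy value estimate
print(sampling_baseline(score, sample, K=50))
```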
- `run.sh` executes repeated versions of different problem instances
- `MultiStateRMAB` is the "base" restless bandit used in all experiments here
- `environment/bipartite.py` creates and visualizes bipartite graphs
- `environment/rmab_instances.py` generates the various RMAB problem instances
- `./approximator/*_approximator.py` implements the MILP solvers for each problem setting
- `./model2mip/*`: files in this folder implement the process of embedding a trained neural network into a MILP, from Neur2SP
This code was developed using Python version 3.10.14. We require the following packages:
- numpy==1.24.3
- pandas==2.0.3
- scipy==1.9.3
- matplotlib==3.7.1
- pytorch==2.0.0
- networkx==3.3
- gurobi==10.0.1
We build off Neur2SP from the work of Dumouchelle et al., whose implementation for embedding a trained neural network into a MILP is in the `./model2mip/` folder. Their code is available at https://github.com/khalil-research/Neur2SP
```bibtex
@inproceedings{xu2025reinforcement,
  title     = {Reinforcement learning with combinatorial actions for coupled restless bandits},
  author    = {Xu, Lily and Wilder, Bryan and Khalil, Elias B. and Tambe, Milind},
  booktitle = {Proceedings of the 13th International Conference on Learning Representations},
  year      = {2025},
}
```