Nstep Experience Replay
1 Overview
To reduce the fluctuation caused by random sampling, especially during the bootstrap phase, the N-step reward (a discounted summation) is useful. By expanding the Bellman equation, the N-step target of the Q function becomes \(\sum _{k=0}^{N-1} \gamma ^k r_{t+k} + \gamma ^N \max _{a} Q(s_{t+N},a)\).
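This target follows from unrolling the one-step target repeatedly, replacing \(\max _a Q(s_{t+k},a)\) with the stored reward plus the next bootstrap term at each of the \(N\) intermediate steps (the stored rewards come from the behavior policy, hence the approximation):

\[
r_t + \gamma \max _a Q(s_{t+1},a)
\approx r_t + \gamma r_{t+1} + \gamma ^2 \max _a Q(s_{t+2},a)
\approx \cdots
\approx \sum _{k=0}^{N-1} \gamma ^k r_{t+k} + \gamma ^N \max _a Q(s_{t+N},a)
\]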
According to W. Fedus et al., N-step rewards can utilize a larger buffer more effectively. Even though the N-step reward, which is based on the behavior policy used during exploration, is theoretically not justified for off-policy learning, it still works better in practice.
You can create an N-step version of the replay buffer by specifying the `Nstep`
parameter in the constructor of `ReplayBuffer` or
`PrioritizedReplayBuffer`. Without any modification to your environment,
cpprb sums up N-step rewards and shifts "next" values such as `next_obs`.
The `Nstep` parameter is a `dict` with keys `"size"`, `"rew"`,
`"gamma"`, and `"next"`. `Nstep["size"]` is the N-step size; a size of 1
is identical to an ordinary replay buffer (but
inefficient). `Nstep["rew"]`, whose type is `str` or array-like of
`str`, specifies the (set of) reward value(s). `Nstep["gamma"]` is the
discount factor for the reward summation. `Nstep["next"]`, whose type is
`str` or array-like of `str`, specifies the (set of) "next"-type
value(s); for these, the `sample` method returns the (i+N)-th value instead of
the (i+1)-th one.
`sample` also replaces `"done"` with its N-step version.
Since cpprb v10, the discount \(\gamma ^{N}\) is no longer returned, as users can always multiply by the fixed \(\gamma ^N\) themselves.
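The array-like form lets a single buffer apply the N-step transformation to multiple values at once. Below is a minimal sketch, assuming hypothetical keys `intrinsic_rew` and `next_extra` (any names registered in the buffer work):

```python
from cpprb import ReplayBuffer

# Sketch: both reward columns are summed over N steps with the same gamma,
# and both "next" columns are shifted from t+1 to t+N.
rb = ReplayBuffer(256,
                  {"obs": {"shape": (4,)},
                   "act": {},
                   "rew": {},
                   "intrinsic_rew": {},           # hypothetical extra reward
                   "next_obs": {"shape": (4,)},
                   "next_extra": {"shape": (2,)}, # hypothetical extra "next" value
                   "done": {}},
                  Nstep={"size": 3,
                         "gamma": 0.99,
                         "rew": ["rew", "intrinsic_rew"],
                         "next": ["next_obs", "next_extra"]})
```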
Since the N-step buffer temporarily stores values in local storage, you
need to call the `on_episode_end` member function at the end of every
episode to flush them into the main storage properly.
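A typical episode loop then looks like the sketch below (assuming a Gym-style `env` and a user-defined `policy`, neither of which is part of cpprb; `rb` is an N-step buffer such as the one sketched above):

```python
# Sketch of one episode: add transitions step by step, then flush.
obs = env.reset()
done = False
while not done:
    act = policy(obs)                       # user-provided policy (assumption)
    next_obs, rew, done, _ = env.step(act)  # Gym-style step (assumption)
    rb.add(obs=obs, act=act, rew=rew, next_obs=next_obs, done=done)
    obs = next_obs

# Move the temporarily stored N-step transitions into the main storage.
rb.on_episode_end()
```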
| Parameters | Type | Description |
|---|---|---|
| `size` | `int` | N-step size |
| `rew` | `str` or array-like of `str` | N-step reward(s) |
| `gamma` | `float` | Discount factor |
| `next` | `str` or array-like of `str` | Next item(s) (e.g. `next_obs`) |
| Return Value | Replace: From \(\to\) To |
|---|---|
| Next items (e.g. `next_obs`) | \(s_{t+1} \to s_{t+N}\) |
| `rew` | \(r_t \to \sum _{n=0}^{N-1} \gamma ^n r_{t+n}\) |
| `done` | \(d_t \to 1-\prod _{n=0}^{N-1} (1-d_{t+n})\) |
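The replacements in the table can be checked by hand. The following snippet recomputes the N-step reward and done flag for N = 3 with illustrative numbers (independent of cpprb):

```python
import numpy as np

gamma = 0.99
rews  = np.array([1.0, 0.5, 2.0])  # r_t, r_{t+1}, r_{t+2}
dones = np.array([0.0, 0.0, 1.0])  # d_t, d_{t+1}, d_{t+2} (episode ends at t+2)

nstep_rew  = np.sum(gamma ** np.arange(3) * rews)  # 1.0 + 0.99*0.5 + 0.99**2*2.0
nstep_done = 1.0 - np.prod(1.0 - dones)            # 1 if any of the N steps terminated

print(nstep_rew)   # 3.4552
print(nstep_done)  # 1.0
```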
2 Example Usage
```python
import numpy as np

from cpprb import ReplayBuffer

nstep = 4
gamma = 0.99

# With the N-step buffer, the bootstrap term is discounted by a fixed gamma**N.
discounts = gamma ** nstep

rb = ReplayBuffer(32,
                  {'obs': {"shape": (4,4)},
                   'act': {"shape": 3},
                   'rew': {},
                   'next_obs': {"shape": (4,4)},
                   'done': {}},
                  Nstep={"size": nstep,
                         "gamma": gamma,
                         "rew": "rew",
                         "next": "next_obs"})

for i in range(100):
    done = 1.0 if i % 10 == 9 else 0.0
    rb.add(obs=np.zeros((4,4)),
           act=np.ones(3),
           rew=1.0,
           next_obs=np.zeros((4,4)),
           done=done)
    if done:
        # Flush the temporary N-step storage into the main storage.
        rb.on_episode_end()

sample = rb.sample(16)

# N-step TD target: sample["rew"] is already the discounted N-step sum and
# sample["next_obs"] is the observation N steps ahead. Q is a user-provided
# action-value function (e.g. a neural network) returning per-action values.
nstep_target = sample["rew"] + (1 - sample["done"]) * discounts * Q(sample["next_obs"]).max(axis=1)
```

3 Notes
This N-step feature assumes that the transitions of a trajectory (episode) are stored sequentially. If you use a distributed agent configuration, you must add one episode at a time so that transitions from different agents are not interleaved.
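For instance, in a multi-actor setup each actor's episode should be added as one contiguous block. A hypothetical sketch (`agents` and `collect_episode` are illustrative, not cpprb API):

```python
# Sketch: never interleave transitions from different agents.
for agent in agents:
    episode = collect_episode(agent)  # hypothetical helper returning a dict of
                                      # arrays: obs, act, rew, next_obs, done
    rb.add(**episode)                 # add the whole episode at once
    rb.on_episode_end()               # flush before adding another agent's data
```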