Reverse Experience Replay
1 Overview
Reverse Experience Replay (RER) was proposed by E. Rotinov1 for environments with delayed (sparse) rewards. Since many transitions carry no immediate reward, uniform random sampling is inefficient in such environments.
In RER, a batch of equally strided transitions is sampled, starting from the latest transition. The next batch starts one step earlier, so it contains transitions that are one step older.
\[ \begin{align} B_1 &= \lbrace T_{t},\; T_{t-\text{stride}},\; \dots,\; T_{t-\text{batch\_size} \times \text{stride}} \rbrace \\ B_2 &= \lbrace T_{t-1},\; T_{t-\text{stride}-1},\; \dots,\; T_{t-\text{batch\_size} \times \text{stride}-1} \rbrace \\ &\;\;\vdots \end{align} \]
When the first sample index (\(t-i\)) falls \(2 \times \text{stride}\) steps behind the latest transition, it is reset to the latest transition.
| Parameters | Default | Description |
|---|---|---|
| stride | 300 | Sample stride |
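The striding-and-reset pattern above can be sketched in a few lines of plain Python. This is a hypothetical illustration of the index arithmetic only; `rer_indices` is not part of cpprb.

```python
# Hypothetical sketch of RER batch-index selection (not cpprb's code).
def rer_indices(latest, batch_size, stride, age):
    """Indices of one batch: the first index is `age` steps older than
    `latest`, and each later index steps back by `stride`."""
    return [latest - age - i * stride for i in range(batch_size)]

latest, batch_size, stride = 1000, 4, 20

# Successive batches shift one step older each call; the offset wraps
# back to 0 once the first index falls 2 * stride behind the latest.
batches = [rer_indices(latest, batch_size, stride, age % (2 * stride))
           for age in range(3)]
print(batches[0])  # [1000, 980, 960, 940]
print(batches[1])  # [999, 979, 959, 939]
```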
2 Example Usage
The usage of ReverseReplayBuffer is the same as that of the ordinary ReplayBuffer.
```python
import numpy as np

from cpprb import ReverseReplayBuffer

buffer_size = 256
obs_shape = 3
act_dim = 1
stride = 20

rb = ReverseReplayBuffer(buffer_size,
                         env_dict={"obs": {"shape": obs_shape},
                                   "act": {"shape": act_dim},
                                   "rew": {},
                                   "next_obs": {"shape": obs_shape},
                                   "done": {}},
                         stride=stride)

obs = np.ones(shape=(obs_shape))
act = np.ones(shape=(act_dim))
rew = 0
next_obs = np.ones(shape=(obs_shape))
done = 0

for i in range(500):
    rb.add(obs=obs, act=act, rew=rew, next_obs=next_obs, done=done)

    if done:
        # Together with resetting the environment, call ReplayBuffer.on_episode_end()
        rb.on_episode_end()

batch_size = 32
sample = rb.sample(batch_size)
# sample is a dictionary whose keys are 'obs', 'act', 'rew', 'next_obs', and 'done'
```
3 Notes
The author notes that the stride must not be a multiple of the episode length; otherwise the batch samples transitions at the same position within different episodes, which are similar to one another.
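To see why, consider a hypothetical fixed episode length of 100 with the default stride of 300: every sampled index then lands at the same within-episode position, so the batch holds near-duplicate situations. The numbers below are illustrative, not taken from the paper.

```python
episode_len = 100   # hypothetical fixed episode length
latest = 1000

bad_stride = 300    # a multiple of episode_len
good_stride = 290   # not a multiple

# Within-episode position of each sampled index
bad = [(latest - i * bad_stride) % episode_len for i in range(3)]
good = [(latest - i * good_stride) % episode_len for i in range(3)]

print(bad)   # [0, 0, 0] -> same position in every episode
print(good)  # [0, 10, 20] -> distinct positions
```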
4 Technical Detail
- E. Rotinov, "Reverse Experience Replay" (2019), arXiv:1910.08780