All videos are downloadable at this Google Drive folder: https://drive.google.com/drive/folders/0B_KFuCNKS7ZVRlFUOFBUSVZLOUE?usp=sharing
Note: we use an Ornstein–Uhlenbeck process (theta = 0.15, sigma = 0.1) to generate exploration noise that is added to the action outputs of the deterministic policies trained with DDPG.
No additional noise is added to the stochastic policy trained with soft Q-learning; the only stochasticity comes from the policy itself.
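As a minimal sketch of how such noise could be generated (assuming mu = 0 and a unit time step dt = 1, neither of which is stated above), in Python:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process for temporally correlated action noise."""

    def __init__(self, action_dim, theta=0.15, sigma=0.1, mu=0.0, dt=1.0):
        self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
        self.state = np.full(action_dim, mu, dtype=np.float64)

    def reset(self):
        self.state[:] = self.mu

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I)
        dx = (self.theta * (self.mu - self.state) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape))
        self.state = self.state + dx
        return self.state

# Hypothetical usage: noisy_action = deterministic_policy(obs) + ou.sample()
```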
Note: as above, OU noise (theta = 0.15, sigma = 0.1) is added to the action outputs of the DDPG policies, while no additional noise is added to the soft Q-learning policy.
The reward is a Gaussian centered at the goal position.
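As a sketch, assuming an unnormalized isotropic Gaussian with width sigma (the width is not specified above) and goal position x_goal:

$$ r(s_t) \propto \exp\!\left( -\frac{\lVert x_t - x_{\text{goal}} \rVert^2}{2\sigma^2} \right), $$

where x_t denotes the agent's position at time t.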
A quadrupedal robot is trained with reward = (speed of its center of mass). The ideal maximum entropy policy would move uniformly in all directions. However, deterministic or uni-modal policies typically cannot achieve this behavior. The video below demonstrates how an energy-based stochastic policy can correctly represent a maximum entropy policy of the form below:
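This is the standard energy-based form used in soft Q-learning; here Q_soft denotes the soft Q-function and alpha the temperature (symbols assumed, as they are not defined on this page):

$$ \pi(a_t \mid s_t) \propto \exp\!\left( \tfrac{1}{\alpha}\, Q_{\text{soft}}(s_t, a_t) \right). $$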
Note: the policies are not perfect, so it is common for the robot to flip over.
Similarly, OU noise (theta = 0.15, sigma = 0.1) is added to DDPG policies.
A quadrupedal robot is first pretrained on open, flat ground with reward = (speed of its center of mass). The robot is then placed in three new environments, where it may transfer knowledge from the pretraining environment. Soft Q-learning produces a maximum entropy policy, which is advantageous for such pretrain-and-fine-tune tasks.