All videos are downloadable at this Google Drive folder: https://drive.google.com/drive/folders/0B_KFuCNKS7ZVRlFUOFBUSVZLOUE?usp=sharing
Note: we use an Ornstein–Uhlenbeck process (theta = 0.15, sigma = 0.1) to generate exploration noise that is added to the action outputs of the deterministic policies trained with DDPG.
No additional noise is added to the stochastic policy trained with soft Q-learning; the only stochasticity comes from the policy itself.
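As a minimal sketch of how such noise could be generated (assuming mu = 0 and a unit time step dt = 1, neither of which is stated above), in Python:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process for temporally correlated action noise."""

    def __init__(self, action_dim, theta=0.15, sigma=0.1, mu=0.0, dt=1.0):
        self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
        self.state = np.full(action_dim, mu, dtype=np.float64)

    def reset(self):
        self.state[:] = self.mu

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I)
        dx = (self.theta * (self.mu - self.state) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape))
        self.state = self.state + dx
        return self.state

# Hypothetical usage: noisy_action = deterministic_policy(obs) + ou.sample()
```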
Note: as above, OU noise (theta = 0.15, sigma = 0.1) is added to the action outputs of the DDPG policies, while no additional noise is added to the soft Q-learning policy.
The reward is a Gaussian centered at the goal position.
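As a sketch, assuming an unnormalized isotropic Gaussian with width sigma (the width is not specified above) and goal position x_goal:

$$ r(s_t) \propto \exp\!\left( -\frac{\lVert x_t - x_{\text{goal}} \rVert^2}{2\sigma^2} \right), $$

where x_t denotes the agent's position at time t.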
A quadrupedal robot is trained with reward = (speed of its center of mass). The ideal maximum entropy policy would move uniformly in all directions. However, deterministic or uni-modal policies typically cannot achieve this behavior. The video below demonstrates how an energy-based stochastic policy can correctly represent a maximum entropy policy of the form below:
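This is the standard energy-based form used in soft Q-learning; here Q_soft denotes the soft Q-function and alpha the temperature (symbols assumed, as they are not defined on this page):

$$ \pi(a_t \mid s_t) \propto \exp\!\left( \tfrac{1}{\alpha}\, Q_{\text{soft}}(s_t, a_t) \right). $$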
Note: the policies are not perfect, so it is common for the robot to flip over.
Similarly, OU noise (theta = 0.15, sigma = 0.1) is added to DDPG policies.
A quadrupedal robot is first pretrained on open, flat ground with reward = (speed of its center of mass). The robot is then placed in three new environments, where it may transfer knowledge from the pretraining environment. Soft Q-learning produces a maximum entropy policy, which is advantageous for such pretrain-and-fine-tune tasks.