
Abstract

... LSPI [14], which allows searching for an optimal control more efficiently; however, it is a batch algorithm which does ... For example, incremental natural actor-critic algorithms are presented in [17 ... TD is used as the actor part instead of LSTD, mostly because of the inability of the latter ...

Key takeaways

  • KTD is a second order algorithm (and thus sample efficient): it updates the mean parameter vector, but also the associated variance matrix.
  • This is comparable to approaches such as LSTD [8] (nevertheless with the additional ability to handle nonlinear parameterization).
  • Thus, KTD-V fails to handle the stochastic case, as expected; however, it converges much faster than LSTD or TD in the deterministic one.
  • The optimistic policy iteration scheme used in this experiment implies non-stationarity of the learned Q-function, which explains why LSTD fails to learn a near-optimal policy.
  • TD is used as the actor part instead of LSTD, mostly because of the inability of the latter to handle non-stationarity.
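The first takeaway can be illustrated concretely. The sketch below contrasts a plain first-order TD(0) update with a Kalman-filter-style update that maintains both a mean parameter vector and its covariance matrix, on a tiny deterministic two-transition chain with one-hot linear features. This is a simplified linear sketch of the general idea, not the authors' unscented KTD algorithm; the chain, feature map, and noise variances (`prior_var`, `obs_var`) are illustrative assumptions.

```python
import numpy as np

GAMMA = 0.9
# Deterministic chain (illustrative): s0 -(r=0)-> s1 -(r=1)-> terminal.
# True values: V(s1) = 1, V(s0) = 0.9.
TRANSITIONS = [(0, 0.0, 1), (1, 1.0, None)]  # (state, reward, next state)

def phi(s, n=2):
    """One-hot features; the terminal state maps to the zero vector."""
    f = np.zeros(n)
    if s is not None:
        f[s] = 1.0
    return f

def td0(episodes=200, alpha=0.1):
    """First-order TD(0): updates only the mean parameter vector."""
    theta = np.zeros(2)
    for _ in range(episodes):
        for s, r, s_next in TRANSITIONS:
            delta = r + GAMMA * phi(s_next) @ theta - phi(s) @ theta
            theta += alpha * delta * phi(s)
    return theta

def kalman_td(episodes=20, prior_var=10.0, obs_var=1e-3):
    """Second-order Kalman-style TD sketch: treats theta as the hidden
    state and the reward as a noisy linear observation of it, so both
    the mean theta and its covariance P are updated at each step."""
    theta = np.zeros(2)
    P = prior_var * np.eye(2)          # parameter uncertainty
    for _ in range(episodes):
        for s, r, s_next in TRANSITIONS:
            H = phi(s) - GAMMA * phi(s_next)   # observation vector
            e = r - H @ theta                  # TD innovation
            S = H @ P @ H + obs_var            # innovation variance (scalar)
            K = P @ H / S                      # Kalman gain
            theta = theta + K * e              # mean update
            P = P - np.outer(K, H @ P)         # covariance shrinks with data
    return theta
```

On this deterministic problem the Kalman-style learner recovers the exact values in a handful of episodes, while TD(0) needs far more sweeps, matching the sample-efficiency claim; adding process noise to P would be the natural way to track the non-stationarity discussed in the last takeaways.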