2009, Lecture Notes in Computer Science
... LSPI [14], which allows searching for an optimal control more efficiently; however, it is a batch algorithm which does ... For example, incremental natural actor-critic algorithms are presented in [17 ... TD is used as the actor part instead of LSTD, mostly because of the inability of the latter ...
IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2021
This retrospective describes the overall research project that gave rise to the authors' paper "Neuronlike adaptive elements that can solve difficult learning control problems" that was published in the 1983 Neural and Sensory Information Processing special issue of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS. This look back explains how this project came about, presents the ideas and previous publications that influenced it, and describes our most closely related subsequent research. It concludes by pointing out some noteworthy aspects of this article that have been eclipsed by its main contributions, followed by commenting on some of the directions and cautions that should inform future research.
REINFORCEMENT LEARNING AND ITS APPLICATION TO CONTROL February 1992 Vijaykumar Gullapalli, B.S., Birla Institute of Technology and Science, India M.S., University of Massachusetts Ph.D., University of Massachusetts Directed by: Professor Andrew G. Barto Learning control involves modifying a controller's behavior to improve its performance as measured by some predefined index of performance (IP). If control actions that improve performance as measured by the IP are known, supervised learning methods, or methods for learning from examples, can be used to train the controller. But when such control actions are not known a priori, appropriate control behavior has to be inferred from observations of the IP. One can distinguish between two classes of methods for training controllers under such circumstances. Indirect methods involve constructing a model of the problem's IP and using the model to obtain training information for the controller. On the other hand, direct, or model-fr...
2013
The aim of this chapter is to provide guidance to the prospective user of Adaptive Critic / Approximate Dynamic Programming methods for designing the action device in certain kinds of control systems. While there are currently various successful “camps” in
Procedia Computer Science, 2012
The basic tenet of a learning process is for an agent to learn only as much and as long as is necessary. With reinforcement learning, the learning process is divided between exploration and exploitation. Given the complexity of the problem domain and the randomness of the learning process, the exact duration of the reinforcement learning process can never be known with certainty. Using an inaccurate number of training iterations leads either to non-convergence or to over-training of the learning agent. This work addresses these issues by proposing a technique to self-regulate the exploration rate and training duration, leading to efficient convergence. The idea originates from an intuitive understanding that exploration is only necessary when the success rate is low; this means exploration should be conducted in inverse proportion to the rate of success. In addition, the change in exploration-exploitation rates alters the duration of the learning process. Using this approach, the duration of the learning process becomes adaptive to the updated status of the learning process. Experimental results from the K-Armed Bandit and Air Combat Maneuver scenarios show that optimal action policies can be discovered using the right amount of training iterations. In essence, the proposed method eliminates the guesswork about the amount of exploration needed during reinforcement learning.
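The core idea of the abstract above — exploring in inverse proportion to the running success rate — can be sketched as an epsilon-greedy K-armed bandit. This is a minimal illustrative sketch, not the paper's implementation: the function name, the fixed step budget, and the exact epsilon formula are assumptions.

```python
import random

def adaptive_epsilon_bandit(arm_probs, max_steps=5000, seed=0):
    """K-armed Bernoulli bandit where the exploration rate is set
    inversely to the running success rate (illustrative sketch)."""
    rng = random.Random(seed)
    k = len(arm_probs)
    counts = [0] * k           # pulls per arm
    values = [0.0] * k         # sample-mean reward per arm
    successes = 0
    for t in range(1, max_steps + 1):
        # explore less as the success rate grows
        epsilon = 1.0 - successes / t
        if rng.random() < epsilon:
            arm = rng.randrange(k)                        # explore
        else:
            arm = max(range(k), key=lambda a: values[a])  # exploit
        reward = 1 if rng.random() < arm_probs[arm] else 0
        successes += reward
        counts[arm] += 1
        # incremental sample-mean update
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts

values, counts = adaptive_epsilon_bandit([0.2, 0.5, 0.8])
best = max(range(len(values)), key=lambda a: values[a])
```

Because epsilon shrinks as rewards accumulate, exploration self-regulates without a hand-tuned decay schedule, which is the intuition the abstract describes.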
emis.ams.org, 1999
The underlying concept in Reinforcement Learning is as simple as it is attractive: to learn by trial and error from interaction with the environment. This approach allows us to deal with problems where a learning technique seeks to improve the performance of the agent ...
Frontiers in Neurorobotics
Anais do 15. Congresso Brasileiro de Inteligência Computacional, 2021
Reinforcement learning has evolved in recent years, overcoming challenges found in this field. This area, unlike conventional machine learning, does not learn through a set of observational instances, but through interaction with an environment. The sampling efficiency of a reinforcement learning agent is a challenge: that is, how to make an agent learn within an environment with as little interaction as possible. In this work we perform an experimental study on the difficulties of integrating a strategy of intrinsic motivation into an actor-critic agent to improve sampling efficiency. We found results that point to the effectiveness of intrinsic motivation as an approach to improve the agent's sampling efficiency, as well as its performance. We share practical guidelines to assist in the implementation of actor-critic agents that deal with sparse-reward environments while making use of intrinsic motivation feedback.
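The abstract does not specify which intrinsic-motivation signal was used, but one common choice in sparse-reward settings is a count-based novelty bonus added to the extrinsic reward. The sketch below is an assumption-laden illustration of that idea; the function name and the `beta` scale are hypothetical.

```python
from collections import defaultdict

def count_based_bonus(visit_counts, state, beta=0.5):
    """Intrinsic reward that decays with the square root of the state's
    visitation count (one common count-based novelty bonus; the scale
    `beta` is an assumed hyperparameter)."""
    visit_counts[state] += 1
    return beta / (visit_counts[state] ** 0.5)

counts = defaultdict(int)
first = count_based_bonus(counts, "s0")   # novel state: large bonus
later = count_based_bonus(counts, "s0")   # repeat visit: smaller bonus
```

The agent would then learn from `r_extrinsic + bonus`, so novel states remain attractive even when the environment's own reward is zero almost everywhere.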
1993
Reinforcement learning concerns the gradual acquisition of associations between events in the context of specific rewarding outcomes, whereas model-based learning involves the construction of representations of causal or world knowledge outside the context of any specific task. This thesis investigates issues in reinforcement learning concerned with exploration, the adaptive recoding of continuous input spaces, and learning with partial state information. It also explores the borderline between reinforcement and model-based learning in the context of the problem of navigation. A connectionist learning architecture is developed for reinforcement and delayed reinforcement learning that performs adaptive recoding in tasks defined over continuous input spaces. This architecture employs networks of Gaussian basis function units with adaptive receptive fields. Simulation results show that networks with only a small number of units are capable of learning effective behaviour in real-time control tasks within reasonable time frames. A tactical/strategic split in navigation skills is proposed and it is argued that tactical, local navigation can be performed by reactive, task-specific systems. Acquisition of an adaptive local navigation behaviour is demonstrated within a modular control architecture for a simulated mobile robot. The delayed reinforcement learning system for this task acquires successful, often plan-like strategies for control using only partial state information. The algorithm also demonstrates adaptive exploration using performance related control over local search. Finally, it is suggested that strategic, way-finding navigation skills require model-based, task-independent knowledge. A method for constructing spatial models based on multiple, quantitative local allocentric frames is described and simulated.
This system exploits simple neural network learning, storage and search mechanisms, to support robust way-finding behaviour without the need to construct a unique global model of the environment.
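The recoding step the thesis describes — mapping a continuous input onto activations of Gaussian basis function units — can be sketched as follows. In the thesis the centres and receptive-field widths are adaptive; here they are fixed for illustration, and the function name is an assumption.

```python
import math

def gaussian_features(x, centers, width=0.2):
    """Recode a scalar continuous input x into activations of Gaussian
    basis units with fixed centers and a shared width (the adaptive
    version would also learn centers and widths)."""
    return [math.exp(-((x - c) ** 2) / (2 * width ** 2)) for c in centers]

centers = [0.0, 0.25, 0.5, 0.75, 1.0]
phi = gaussian_features(0.5, centers)   # strongest response at the nearest center
```

A linear learner over `phi` can then represent smoothly varying value functions or policies with only a handful of units, which is why small networks suffice in the reported control tasks.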
Automatica, 2009
We present four new reinforcement learning algorithms based on actor-critic, function approximation, and natural gradient ideas, and we provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. We present empirical results verifying the convergence of our algorithms.
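The actor-critic scheme the abstract describes — a critic whose TD error drives stochastic-gradient updates of a softmax policy — can be sketched on a one-state problem. This is a minimal vanilla-gradient sketch, not one of the paper's four algorithms (in particular, no natural gradient); names and hyperparameters are assumptions.

```python
import math
import random

def actor_critic_bandit(reward_probs, steps=20000, alpha=0.1, beta=0.1, seed=1):
    """One-state actor-critic sketch: the critic's one-step TD error
    updates both the value baseline and a softmax policy's preferences
    via the log-policy gradient (vanilla, not natural, gradient)."""
    rng = random.Random(seed)
    k = len(reward_probs)
    prefs = [0.0] * k    # actor: action preferences
    v = 0.0              # critic: state-value estimate (baseline)
    for _ in range(steps):
        m = max(prefs)
        exps = [math.exp(p - m) for p in prefs]       # stable softmax
        z = sum(exps)
        probs = [e / z for e in exps]
        arm = rng.choices(range(k), weights=probs)[0]
        r = 1.0 if rng.random() < reward_probs[arm] else 0.0
        delta = r - v                                 # one-step TD error
        v += alpha * delta                            # critic update
        for i in range(k):                            # actor update: grad log softmax
            grad = (1.0 - probs[i]) if i == arm else -probs[i]
            prefs[i] += beta * delta * grad
    return prefs

prefs = actor_critic_bandit([0.3, 0.9])   # policy should come to favor arm 1
```

The two learning rates `alpha` and `beta` play the role of the two timescales in the convergence analyses the abstract cites: in those proofs the critic is driven faster than the actor.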
2nd Slovak Con…, 1998
We consider reinforcement learning methods for the solution of complex sequential optimization problems. In particular, the soundness of two methods proposed for the solution of partially observable problems will be shown. The first method suggests a state-estimation scheme and requires mild a priori knowledge, while the second method assumes that a significant amount of abstract knowledge is available about the decision problem and uses this knowledge to set up a macro-hierarchy that turns the partially observable problem into another one which can already be handled using methods worked out for observable problems. This second method is also illustrated with some experiments on a real robot.
Advances in neural …, 2008
Studies in Systems, Decision and Control, 2021
Advances in Psychology, 1997
Systems, Man, and …, 1997
Cognitive science, 2009
The Journal of Machine Learning Research, 2003
arXiv (Cornell University), 2023
International Journal of Computer Applications, 2013