arXiv (Cornell University), 2021
Q-learning is a popular reinforcement learning algorithm. However, this algorithm has been studied and analysed mainly in the infinite horizon setting. There are several important applications that can be modeled in the framework of finite horizon Markov decision processes (MDPs). We develop a version of the Q-learning algorithm for finite horizon MDPs and provide a full proof of its stability and convergence. Our analysis of the stability and convergence of finite horizon Q-learning is based entirely on the ordinary differential equation (ODE) method. We also demonstrate the performance of our algorithm on a setting of randomly generated MDPs, along with an application to smart grids.
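A minimal sketch of the kind of stage-wise tabular update the abstract describes, with one Q table per decision epoch h = 0, ..., H-1 and the terminal stage bootstrapping to zero. The environment interface env_step(h, s, a), the step-size schedule, and the exploration scheme are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def finite_horizon_q_learning(env_step, n_states, n_actions, H,
                              episodes=5000, eps=0.1, seed=0):
    """Tabular Q-learning with a separate Q table per stage h = 0..H-1.

    env_step(h, s, a) is assumed to return (next_state, reward); this is an
    illustrative interface, not the one used in the paper.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((H, n_states, n_actions))       # Q[h, s, a]
    visits = np.zeros_like(Q)                    # per-entry counts for step sizes

    for _ in range(episodes):
        s = 0                                    # assumed fixed initial state
        for h in range(H):
            a = (rng.integers(n_actions) if rng.random() < eps
                 else int(np.argmax(Q[h, s])))
            s_next, r = env_step(h, s, a)
            visits[h, s, a] += 1
            alpha = 1.0 / visits[h, s, a]        # diminishing step size
            # last stage bootstraps to 0; earlier stages to the next-stage table
            target = r + (0.0 if h == H - 1 else np.max(Q[h + 1, s_next]))
            Q[h, s, a] += alpha * (target - Q[h, s, a])
            s = s_next
    return Q
```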
Decision and Control, 2006
We develop a simulation based algorithm for finite horizon Markov decision processes with finite state and finite action space. Illustrative numerical experiments with the proposed algorithm are shown for problems in flow control of communication networks and capacity switching in semiconductor fabrication.
Mathematics of Operations Research, 2021
In this paper, for POMDPs, we establish the convergence of a Q-learning algorithm for control policies using a finite history of past observations and control actions, and, consequently, the near optimality of the resulting limit Q functions under explicit filter stability conditions. We present explicit error bounds relating the approximation error to the length of the finite history window. We establish the convergence of such Q-learning iterations under mild ergodicity assumptions on the state process during the exploration phase. We further show that the limit fixed point equation gives an optimal solution for an approximate belief-MDP. We then provide bounds on the performance of the policy obtained using the limit Q values compared to the performance of the optimal policy for the POMDP, where we also present explicit conditions using recent results on filter stability in controlled POMDPs. While many experimental results exist, (i) the rigorous asymptotic convergence (to an approximate MDP value function) of such finite-memory Q-learning algorithms and (ii) the near optimality with an explicit rate of convergence (in the memory size) under filter stability are, to our knowledge, new to the literature.
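A minimal sketch of Q-learning over a sliding window of recent observations and actions used as a proxy state, in the spirit of the finite-memory scheme described above. The environment interface, the window encoding, and the constant step size are assumptions for illustration only; the paper's exploration conditions and error bounds are not reproduced here.

```python
from collections import deque
import random

def finite_window_q_learning(env_reset, env_step, actions, window=3,
                             episodes=2000, gamma=0.95, eps=0.1, alpha=0.1):
    """Q-learning where the 'state' is the tuple of the last `window`
    (observation, action) pairs of a POMDP. env_reset() -> obs and
    env_step(a) -> (obs, reward, done) are assumed, hypothetical interfaces;
    observations must be hashable.
    """
    Q = {}                                   # window tuple -> {action: value}

    def values(key):
        return Q.setdefault(key, {a: 0.0 for a in actions})

    for _ in range(episodes):
        obs = env_reset()
        hist = deque([(obs, None)], maxlen=window)
        done = False
        while not done:
            key = tuple(hist)
            qs = values(key)
            a = (random.choice(actions) if random.random() < eps
                 else max(qs, key=qs.get))
            obs, r, done = env_step(a)
            hist.append((obs, a))
            next_qs = values(tuple(hist))
            target = r + (0.0 if done else gamma * max(next_qs.values()))
            qs[a] += alpha * (target - qs[a])
    return Q
```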
Many reinforcement learning algorithms, like Q-Learning or R-Learning, correspond to adaptive methods for solving Markovian decision problems in infinite horizon when no model is available. In this article we consider the particular framework of non-stationary finite-horizon Markov Decision Processes. After establishing a relationship between the finite-horizon total reward criterion and the average-reward criterion in infinite horizon, we define QH-Learning and RH-Learning for finite-horizon...
Computers & Operations Research, 2011
Continuous time Markov decision processes (CTMDPs) with a finite state and action space have been considered for a long time. It is known that under fairly general conditions the reward gained over a finite horizon can be maximized by a so-called piecewise constant policy which changes only finitely often in a finite interval. Although this result is available for more...
We present in this article a variant of Q-learning with linear function approximation that is based on two-timescale stochastic approximation. Whereas it is difficult to prove convergence of regular Q-learning with linear function approximation because of the off-policy problem, we prove that our algorithm is convergent. Numerical results on a multi-stage stochastic shortest path problem show that our algorithm exhibits significantly better performance and is more robust as compared to Q-learning.
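A generic coupled two-timescale stochastic approximation iteration, shown only to illustrate the two-timescale methodology named in the abstract; it is not the paper's Q-learning update with linear function approximation. The step-size exponents, noise model, and target value c are arbitrary choices for the sketch.

```python
import numpy as np

def two_timescale_iterates(steps=200_000, c=2.0, seed=0):
    """The fast iterate y tracks the slow iterate x (its equilibrium is
    y*(x) = x), while x drifts on a slower step size toward the target c
    using the tracked value; both iterates converge to c.
    """
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    for n in range(1, steps + 1):
        b_n = 1.0 / n ** 0.6                          # faster step size (y)
        a_n = 1.0 / n                                 # slower step size (x)
        y += b_n * ((x - y) + rng.normal(scale=0.1))  # y chases x on the fast scale
        x += a_n * ((c - y) + rng.normal(scale=0.1))  # x moves using the tracked y
    return x, y
```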
Automatic Control, IEEE …, 1997
We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain, using a function approximator involving linear combinations of fixed basis functions. The algorithm we analyze performs on-line updating of a parameter vector during a single endless trajectory of an ergodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probability 1), a characterization of the limit of convergence, and a bound on the resulting approximation error. In addition to proving new and stronger results than those previously available, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. Finally, we prove that on-line updates, based on entire trajectories of the Markov chain, are in a certain sense necessary for convergence. This fact reconciles positive and negative results that have been discussed in the literature, regarding the soundness of temporal-difference learning.
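A minimal sketch of on-line TD(lambda) with linear function approximation over a single trajectory, the setting analysed in the abstract. The interfaces features(s) (returning the basis-function vector phi(s)) and sample_transition(s) (returning the next state and reward of the uncontrolled chain), together with the constant step size, are assumptions for illustration.

```python
import numpy as np

def td_lambda_linear(features, sample_transition, n_features, s0=0,
                     gamma=0.95, lam=0.9, alpha=0.01, steps=100_000):
    """On-line TD(lambda) for approximating the cost-to-go function as
    V(s) ~ theta . phi(s) along one endless trajectory of a Markov chain."""
    theta = np.zeros(n_features)
    z = np.zeros(n_features)                    # eligibility trace vector
    s = s0
    for _ in range(steps):
        phi = features(s)
        s_next, r = sample_transition(s)
        delta = r + gamma * features(s_next) @ theta - phi @ theta
        z = gamma * lam * z + phi               # accumulate the trace
        theta += alpha * delta * z              # move weights along the trace
        s = s_next
    return theta
```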
Machine Learning, 1996
This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic programming-based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic programming-based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm is demonstrated through computer simulations of the standard benchmark control problem of learning to balance a pole on a cart.
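For illustration, a sketch of Watkins's Q(lambda), a close relative of the Q(lambda) algorithm in the abstract: Q-learning with accumulating eligibility traces that are cut whenever an exploratory (non-greedy) action is selected. The paper's variant (following Peng and Williams) combines the TD(lambda) return differently, and env_reset/env_step are assumed interfaces.

```python
import numpy as np
import random

def watkins_q_lambda(env_reset, env_step, n_states, n_actions,
                     gamma=0.99, lam=0.8, alpha=0.1, eps=0.1, episodes=500):
    """Watkins's Q(lambda) with epsilon-greedy exploration.
    env_reset() -> state, env_step(s, a) -> (next_state, reward, done)."""
    def eps_greedy(Q, s):
        return (random.randrange(n_actions) if random.random() < eps
                else int(np.argmax(Q[s])))

    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        E = np.zeros_like(Q)                     # eligibility traces, reset per episode
        s = env_reset()
        a = eps_greedy(Q, s)
        done = False
        while not done:
            s_next, r, done = env_step(s, a)
            a_next = eps_greedy(Q, s_next)
            a_star = int(np.argmax(Q[s_next]))   # greedy action at the next state
            delta = r + (0.0 if done else gamma * Q[s_next, a_star]) - Q[s, a]
            E[s, a] += 1.0                       # accumulating trace
            Q += alpha * delta * E
            # decay traces while acting greedily; cut them on exploratory actions
            E = E * (gamma * lam) if a_next == a_star else np.zeros_like(Q)
            s, a = s_next, a_next
    return Q
```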
IEEE Transactions on Automatic Control, 2007
We present a sampling algorithm, called "Recursive Automata Sampling Algorithm" (RASA), for control of finite horizon Markov decision processes. By extending in a recursive manner the learning automata Pursuit algorithm of Rajaraman and Sastry [5] designed for solving stochastic optimization problems, RASA returns an estimate of both the optimal action from a given state and the corresponding optimal value. Based on the finite-time analysis of the Pursuit algorithm, we provide an analysis for the finite-time behavior of RASA. Specifically, for a given initial state, we derive the following probability bounds as a function of the number of samples: (i) a lower bound on the probability that RASA will sample the optimal action; and (ii) an upper bound on the probability that the deviation between the true optimal value and the RASA estimate exceeds a given error.
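A sketch of the basic (single-state) pursuit learning automaton that RASA builds on: the action-probability vector is pulled toward the action with the highest running reward estimate. The recursive, multi-stage construction of RASA and its finite-time bounds are not shown; sample_reward(a) is an assumed interface.

```python
import numpy as np

def pursuit_automaton(sample_reward, n_actions, mu=0.01, iters=10_000, seed=0):
    """Pursuit algorithm for a single stochastic optimization problem:
    returns the index of the estimated best action and the reward estimates."""
    rng = np.random.default_rng(seed)
    p = np.full(n_actions, 1.0 / n_actions)    # action-selection probabilities
    est = np.zeros(n_actions)                  # running reward estimates
    counts = np.zeros(n_actions)
    for _ in range(iters):
        a = rng.choice(n_actions, p=p)
        r = sample_reward(a)
        counts[a] += 1
        est[a] += (r - est[a]) / counts[a]     # incremental sample mean
        target = np.zeros(n_actions)
        target[np.argmax(est)] = 1.0           # unit vector at the greedy action
        p += mu * (target - p)                 # pursuit step; p stays a probability vector
    return int(np.argmax(est)), est
```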
SAIEE Africa Research Journal, 2021
This paper proposes an improved Q-learning method to obtain near-optimal schedules for grid and battery power in a grid-connected electric vehicle charging station for a 24-hour horizon. The charging station is supplied by a solar PV generator with a backup from the utility grid. The grid tariff model is dynamic in line with the smart grid paradigm. First, the mathematical formulation of the problem is developed highlighting each of the cost components considered including battery degradation cost and the real-time tariff for grid power purchase cost. The problem is then formulated as a Markov Decision Process (MDP), i.e., defining each of the parts of a reinforcement learning environment for the charging station's operation. The MDP is solved using the improved Q-learning algorithm proposed in this paper and the results are compared with the conventional Q-learning method. Specifically, the paper proposes to modify the action-space of a Q-learning algorithm so that each state has just the list of actions that meet a power balance constraint. The Q-table updates are done asynchronously, i.e., the agent does not sweep through the entire state-space in each episode. Simulation results show that the improved Q-learning algorithm returns a 14% lower global cost and achieves higher total rewards than the conventional Q-learning method. Furthermore, it is shown that the improved Q-learning method is more stable in terms of the sensitivity to the learning rate than the conventional Q-learning.
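A minimal sketch of the action-space idea described above: each state carries only the actions that pass a feasibility check (for example, a power-balance constraint), and the Q-table is updated asynchronously along simulated trajectories rather than by full state-space sweeps. The interfaces feasible(s), step(s, a) and reward(s, a), and all parameter values, are hypothetical; the charging-station model itself is not reproduced.

```python
import random

def masked_q_learning(states, feasible, step, reward, horizon=24,
                      episodes=1000, gamma=0.99, alpha=0.1, eps=0.1):
    """Tabular Q-learning over a per-state feasible action set.
    Assumes every state has at least one feasible action and that step()
    only returns states contained in `states`."""
    Q = {s: {a: 0.0 for a in feasible(s)} for s in states}   # masked action space
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(horizon):            # asynchronous, trajectory-based updates
            acts = list(Q[s])
            a = (random.choice(acts) if random.random() < eps
                 else max(acts, key=Q[s].get))
            s_next = step(s, a)
            target = reward(s, a) + gamma * max(Q[s_next].values())
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```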
Automatica, 2008
We propose two algorithms for Q-learning that use the two-timescale stochastic approximation methodology. The first of these updates the Q-values of all feasible state-action pairs at each instant, while the second updates the Q-values of states with actions chosen according to the 'current' randomized policy updates. A proof of convergence of the algorithms is given. Finally, numerical experiments with the proposed algorithms on an application to routing in communication networks are presented for a few different settings.
2013
We seek to learn an effective policy for a Markov Decision Process (MDP) with continuous states via Q-Learning. Given a set of basis functions over state-action pairs, we search for a corresponding set of linear weights that minimizes the mean Bellman residual. Our algorithm uses a Kalman filter model to estimate those weights, and we have developed a simpler approximate Kalman filter model that outperforms the current state-of-the-art projected TD-Learning methods on several standard benchmark problems.
Advances in neural information processing systems, 1995
Semi-Markov Decision Problems are continuous time generalizations of discrete time Markov Decision Problems. A number of reinforcement learning algorithms have been developed recently for the solution of Markov Decision Problems, based on the ideas of asynchronous dynamic programming and stochastic approximation. Among these are TD(λ), Q-learning, and Real-time Dynamic Programming. After reviewing semi-Markov Decision Problems and Bellman's optimality equation in that context, we propose algorithms similar to those named above, adapted to the solution of semi-Markov Decision Problems. We demonstrate these algorithms by applying them to the problem of determining the optimal control for a simple queueing system. We conclude with a discussion of circumstances under which these algorithms may be usefully applied.
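A minimal sketch of an SMDP Q-learning update of the kind the abstract proposes, where the usual discount factor is replaced by exp(-beta * tau) over the random sojourn time tau. The interface sample_transition(s, a) -> (next_state, lump_reward, tau), with lump_reward taken as the reward accrued over the transition already discounted to the decision epoch, is an assumption for illustration.

```python
import math
import random

def smdp_q_learning(sample_transition, n_states, n_actions,
                    beta=0.1, alpha=0.05, eps=0.1, steps=50_000):
    """Q-learning for a semi-Markov decision problem with continuous-time
    discounting at rate beta."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        a = (random.randrange(n_actions) if random.random() < eps
             else max(range(n_actions), key=lambda b: Q[s][b]))
        s_next, r, tau = sample_transition(s, a)
        # bootstrap with exp(-beta * tau) in place of a fixed discount factor
        target = r + math.exp(-beta * tau) * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
    return Q
```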
… of International Conference on Machine Learning
Universitext, 2011
In this chapter we will establish the theory of Markov Decision Processes with a finite time horizon and with general state and action spaces. Optimization problems of this kind can be solved by a backward induction algorithm. Since state and action space are arbitrary, we will impose a structure assumption on the problem in order to prove the validity of the backward induction and the existence of optimal policies. The chapter is organized as follows. Section 2.1 provides the basic model data and the definition of policies. The precise mathematical model is then presented in Section 2.2 along with a sufficient integrability assumption which implies a well-defined problem. The solution technique for these problems is explained in Section 2.3. Under structure assumptions on the model it will be shown that Markov Decision Problems can be solved recursively by the so-called Bellman equation. The next section summarizes a number of important special cases in which the structure assumption is satisfied. Conditions on the model data are given such that the value functions are upper semicontinuous, continuous, measurable, increasing, concave or convex respectively. Also the monotonicity of the optimal policy under some conditions is established. This is an essential property for computations. Finally the important concept of upper bounding functions is introduced in this section. Whenever an upper bounding function for a Markov Decision Model exists, the integrability assumption is satisfied. This concept will be very fruitful when dealing with infinite horizon Markov Decision Problems in Chapter 7. In Section 2.5 the important case of stationary Markov Decision Models is investigated. The notion 'stationary' indicates that the model data does not depend on the time index. The relevant theory is here adopted from the non-stationary case. Finally Section 2.6 highlights the application of the developed theory by investigating three simple examples. The first example is a special card game, the second one a cash balance problem and the last one deals with the classical stochastic LQ-problems. The last section contains some notes and references.
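A minimal sketch of the backward induction (Bellman recursion) discussed in the chapter, specialized to a finite state and action space with transition matrices P[a] and reward vectors r[a]; the general-space structure assumptions, terminal reward, and upper bounding functions treated in the chapter are omitted.

```python
import numpy as np

def backward_induction(P, r, H):
    """Finite-horizon backward induction for a finite MDP.
    P[a] is an (S x S) transition matrix, r[a] an (S,) reward vector.
    Returns the value function at time 0 and a stage-dependent policy."""
    n_actions = len(P)
    n_states = P[0].shape[0]
    V = np.zeros(n_states)                        # V_H = 0 (no terminal reward assumed)
    policy = np.zeros((H, n_states), dtype=int)
    for h in range(H - 1, -1, -1):
        # Q_h(s, a) = r(s, a) + sum_{s'} P(s' | s, a) * V_{h+1}(s')
        Q = np.stack([r[a] + P[a] @ V for a in range(n_actions)], axis=1)
        policy[h] = np.argmax(Q, axis=1)
        V = Q[np.arange(n_states), policy[h]]
    return V, policy
```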
2010
This paper considers the application of a variable neighborhood search (VNS) algorithm for finite-horizon (H stages) Markov Decision Processes (MDPs), for the purpose of alleviating the "curse of dimensionality" phenomenon in searching for the global optimum. The main idea behind the VNSMDP algorithm is that, based on the result of the stage just considered, the search for the optimal solution (action) of state x in stage t is conducted systematically in variable neighborhood sets of the current action. Thus, the VNSMDP algorithm is capable of searching for the optimum within some subsets of the action space, rather than over the whole action set. Analyses of the complexity and convergence attributes of the VNSMDP algorithm are conducted in the paper. It is shown by theoretical and computational analysis that the VNSMDP algorithm succeeds in searching for the global optimum in an efficient way. Finite-horizon MDPs can be solved by backward dynamic programming (standard DP algorithm for short), with known solution at stage H (see, e.g., Puterman 1994). The optimal policy for an MDP with finite horizon is a time-dependent policy, under discounted or total expected rewards (or costs), which makes its choice of actions dependent on the actual observation and on the number of steps the process has already performed. Even though finite-horizon MDPs can be solved in time which increases polynomially in the number of states and actions (Vincent, 2000), many problems of practical interest involve a very large number of states and (or) actions, while the problem data are succinctly described in terms of a small number of parameters. As a result, the practical applicability of the standard DP algorithm in finite-horizon MDP problems is somewhat restricted, a phenomenon that Bellman has termed the "curse of dimensionality". Quite a few researchers have devoted themselves to efficient methods for solving this problem.
2009
We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These "PAC-MDP" algorithms include the well-known E^3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state of the art by presenting bounds for the problem in a unified theoretical framework. A more refined analysis of upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX.
IEEE Control Systems Letters, 2020
In a discounted reward Markov Decision Process (MDP), the objective is to find the optimal value function, i.e., the value function corresponding to an optimal policy. This problem reduces to solving a functional equation known as the Bellman equation, and a fixed point iteration scheme known as value iteration is utilized to obtain the solution. In the literature, a successive over-relaxation based value iteration scheme has been proposed to speed up the computation of the optimal value function. The speed-up is achieved by constructing a modified Bellman equation that ensures faster convergence to the optimal value function. However, in many practical applications, the model information is not known and we resort to Reinforcement Learning (RL) algorithms to obtain the optimal policy and value function. One such popular algorithm is Q-learning. In this paper, we propose Successive Over-Relaxation (SOR) Q-learning. We first derive a modified fixed point iteration for SOR Q-values and utilize stochastic approximation to derive a learning algorithm to compute the optimal value function and an optimal policy. We then prove the almost sure convergence of SOR Q-learning to the SOR Q-values. Finally, through numerical experiments, we show that SOR Q-learning is faster than the standard Q-learning algorithm.
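A sketch of a successive over-relaxation style Q-learning update in the spirit of the abstract, assuming the target blends the standard Q-learning target (weight w) with the current greedy value at the same state (weight 1 - w); the paper's exact modified Bellman operator, the admissible range of the relaxation factor w, and the step-size conditions should be taken from the paper. env_step(s, a) -> (next_state, reward) is an assumed interface.

```python
import numpy as np
import random

def sor_q_learning(env_step, n_states, n_actions, gamma=0.9, w=1.2,
                   alpha=0.1, eps=0.1, steps=100_000, seed=0):
    """Tabular Q-learning with an over-relaxed target (illustrative form only)."""
    random.seed(seed)
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(steps):
        a = (random.randrange(n_actions) if random.random() < eps
             else int(np.argmax(Q[s])))
        s_next, r = env_step(s, a)
        # blend the standard target with the greedy value at the current state
        target = w * (r + gamma * np.max(Q[s_next])) + (1.0 - w) * np.max(Q[s])
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
    return Q
```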
Annals of Operations Research, 2013
We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1), 2012). The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy iteration-like alternative Q-learning schemes with as reliable convergence as classical Q-learning. We also discuss methods that use basis function approximations of Q-factors and we give an associated error bound. The notion of a proper policy is used to classify policies of the SSP: a policy is called proper if, under that policy, the destination state 0 is reached with probability 1 from every initial state, and improper otherwise.