To appear in Bayesian Brain, Doya, K. (ed), MIT Press (2006)
Optimal Control Theory
Emanuel Todorov
University of California San Diego
Optimal control theory is a mature mathematical discipline with numerous applications
in both science and engineering. It is emerging as the computational framework of choice
for studying the neural control of movement, in much the same way that probabilistic infer-
ence is emerging as the computational framework of choice for studying sensory information
processing. Despite the growing popularity of optimal control models, however, the elab-
orate mathematical machinery behind them is rarely exposed and the big picture is hard
to grasp without reading a few technical books on the subject. While this chapter cannot
replace such books, it aims to provide a self-contained mathematical introduction to opti-
mal control theory that is sufficiently broad and yet sufficiently detailed when it comes to
key concepts. The text is not tailored to the field of motor control (apart from the last
section, and the overall emphasis on systems with continuous state) so it will hopefully
be of interest to a wider audience. Of special interest in the context of this book is the
material on the duality of optimal control and probabilistic inference; such duality suggests
that neural information processing in sensory and motor areas may be more similar than
currently thought. The chapter is organized in the following sections:
1. Dynamic programming, Bellman equations, optimal value functions, value and policy
iteration, shortest paths, Markov decision processes.
2. Hamilton-Jacobi-Bellman equations, approximation methods, finite and infinite horizon formulations, basics of stochastic calculus.
3. Pontryagin's maximum principle, ODE and gradient descent methods, relationship to classical mechanics.
4. Linear-quadratic-Gaussian control, Riccati equations, iterative linear approximations to nonlinear problems.
5. Optimal recursive estimation, Kalman filter, Zakai equation.
6. Duality of optimal control and optimal estimation (including new results).
7. Optimality models in motor control, promising research directions.

1 Discrete control: Bellman equations
Let $x \in \mathcal{X}$ denote the state of an agent's environment, and $u \in \mathcal{U}(x)$ the action (or control) which the agent chooses while at state $x$. For now both $\mathcal{X}$ and $\mathcal{U}(x)$ are finite sets. Let $\mathrm{next}(x,u) \in \mathcal{X}$ denote the state which results from applying action $u$ in state $x$, and $\mathrm{cost}(x,u) \geq 0$ the cost of applying action $u$ in state $x$. As an example, $x$ may be the city where we are now, $u$ the flight we choose to take, $\mathrm{next}(x,u)$ the city where that flight lands, and $\mathrm{cost}(x,u)$ the price of the ticket. We can now pose a simple yet practical optimal control problem: find the cheapest way to fly to your destination. This problem can be formalized as follows: find an action sequence $(u_0, u_1, \ldots, u_{n-1})$ and corresponding state sequence $(x_0, x_1, \ldots, x_n)$ minimizing the total cost

$$J(x_\cdot, u_\cdot) = \sum_{k=0}^{n-1} \mathrm{cost}(x_k, u_k)$$

where $x_{k+1} = \mathrm{next}(x_k, u_k)$ and $u_k \in \mathcal{U}(x_k)$. The initial state $x_0 = x^{\mathrm{init}}$ and destination state $x_n = x^{\mathrm{dest}}$ are given. We can visualize this setting with a directed graph where the states are nodes and the actions are arrows connecting the nodes. If $\mathrm{cost}(x,u) = 1$ for all $(x,u)$ the problem reduces to finding the shortest path from $x^{\mathrm{init}}$ to $x^{\mathrm{dest}}$ in the graph.
1.1 Dynamic programming
Optimization problems such as the one stated above are efficiently solved via dynamic
programming (DP). DP relies on the following obvious fact: if a given state-action sequence
is optimal, and we were to remove the first state and action, the remaining sequence is also
optimal (with the second state of the original sequence now acting as initial state). This
is the Bellman optimality principle. Note the close resemblance to the Markov property of
stochastic processes (a process is Markov if its future is conditionally independent of the
past given the present state). The optimality principle can be reworded in similar language:
the choice of optimal actions in the future is independent of the past actions which led to
the present state. Thus optimal state-action sequences can be constructed by starting at
the final state and extending backwards. Key to this procedure is the optimal value function
(or optimal cost-to-go function)
(2) = "minimal total cost for completing the task starting from state 2"
‘This function captures the long-term cost for starting from a given state, and makes it
possible to find optimal actions through the following algorithm:
Consider every action available at the current state,
add its immediate cost to the optimal value of the resulting next state,
and choose an action for which the sum is minimal.
"The above algorithm is "greedy" in the sense that actions are chosen based on local infor-
mation, without explicit consideration of all future scenarios. And yet the resulting actions
are optimal. This is possible because the optimal value function contains all information
about future scenarios that is relevant to the present choice of action. Thus the optimal
value function is an extremely useful quantity, and indeed its calculation is at the heart of
many methods for optimal control.

The above algorithm yields an optimal action $u = \pi(x) \in \mathcal{U}(x)$ for every state $x$. A mapping from states to actions is called a control law or control policy. Once we have a control law $\pi : \mathcal{X} \to \mathcal{U}(\mathcal{X})$ we can start at any state $x_0$, generate action $u_0 = \pi(x_0)$, transition to state $x_1 = \mathrm{next}(x_0, u_0)$, generate action $u_1 = \pi(x_1)$, and keep going until we reach $x^{\mathrm{dest}}$.

Formally, an optimal control law $\pi$ satisfies

$$\pi(x) = \arg\min_{u \in \mathcal{U}(x)} \left\{\mathrm{cost}(x,u) + v\left(\mathrm{next}(x,u)\right)\right\} \qquad (1)$$

The minimum in (1) may be achieved for multiple actions in the set $\mathcal{U}(x)$, which is why $\pi$ may not be unique. However the optimal value function $v$ is always uniquely defined, and satisfies

$$v(x) = \min_{u \in \mathcal{U}(x)} \left\{\mathrm{cost}(x,u) + v\left(\mathrm{next}(x,u)\right)\right\} \qquad (2)$$

Equations (1) and (2) are the Bellman equations.
If for some $x$ we already know $v(\mathrm{next}(x,u))$ for all $u \in \mathcal{U}(x)$, then we can apply the
Bellman equations directly and compute $\pi(x)$ and $v(x)$. Thus dynamic programming is
particularly simple in acyclic graphs where we can start from $x^{\mathrm{dest}}$ with $v(x^{\mathrm{dest}}) = 0$, and
perform a backward pass in which every state is visited after all its successor states have
been visited. It is straightforward to extend the algorithm to the case where we are given
non-zero final costs for a number of destination states (or absorbing states).
1.2 Value iteration and policy iteration
The situation is more complex in graphs with cycles. Here the Bellman equations are
still valid, but we cannot apply them in a single pass. This is because the presence of
cycles makes it impossible to visit each state only after all its successors have been visited.
Instead the Bellman equations are treated as consistency conditions and used to design
iterative relaxation schemes - much like partial differential equations (PDEs) are treated as
consistency conditions and solved with corresponding relaxation schemes. By "relaxation
scheme" we mean guessing the solution, and iteratively improving the guess so as to make
it more compatible with the consistency condition.
The two main relaxation schemes are value iteration and policy iteration. Value iteration uses only (2). We start with a guess $v^{(0)}$ of the optimal value function, and construct a sequence of improved guesses:

$$v^{(i+1)}(x) = \min_{u \in \mathcal{U}(x)} \left\{\mathrm{cost}(x,u) + v^{(i)}\left(\mathrm{next}(x,u)\right)\right\} \qquad (3)$$

This process is guaranteed to converge to the optimal value function $v$ in a finite number of iterations. The proof relies on the important idea of contraction mappings: one defines the approximation error $e\left(v^{(i)}\right) = \max_x \left|v^{(i)}(x) - v(x)\right|$, and shows that the iteration (3) causes $e\left(v^{(i)}\right)$ to decrease as $i$ increases. In other words, the mapping $v^{(i)} \to v^{(i+1)}$ given by (3) contracts the "size" of $v^{(i)}$ as measured by the error norm $e\left(v^{(i)}\right)$.
Policy iteration uses both (1) and (2). It starts with a guess $\pi^{(0)}$ of the optimal control law, and constructs a sequence of improved guesses:

$$v^{\pi^{(i)}}(x) = \mathrm{cost}\left(x, \pi^{(i)}(x)\right) + v^{\pi^{(i)}}\left(\mathrm{next}\left(x, \pi^{(i)}(x)\right)\right) \qquad (4)$$
$$\pi^{(i+1)}(x) = \arg\min_{u \in \mathcal{U}(x)} \left\{\mathrm{cost}(x,u) + v^{\pi^{(i)}}\left(\mathrm{next}(x,u)\right)\right\}$$

The first line of (4) requires a separate relaxation to compute the value function $v^{\pi^{(i)}}$ for the control law $\pi^{(i)}$. This function is defined as the total cost for starting at state $x$ and acting according to $\pi^{(i)}$ thereafter. Policy iteration can also be proven to converge in a finite number of iterations. It is not obvious which algorithm is better, because each of the two nested relaxations in policy iteration converges faster than the single relaxation in value iteration. In practice both algorithms are used depending on the problem at hand.
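To make the two schemes concrete, below is a minimal Python sketch of value iteration (3) and policy iteration (4) on a tiny deterministic graph with an absorbing destination state; the graph, the costs, and the iteration counts are arbitrary illustrative choices, not taken from the text.

```python
# Minimal sketch of value iteration (3) and policy iteration (4) for the
# deterministic shortest-path setting above. The toy graph (nodes 0-3, with
# node 3 an absorbing, zero-cost destination) and the costs are illustrative.
next_state = {0: [1, 2], 1: [3, 2], 2: [3], 3: [3]}                # next(x, u)
cost       = {0: [1.0, 4.0], 1: [5.0, 1.0], 2: [2.0], 3: [0.0]}    # cost(x, u)
states = list(next_state.keys())

def value_iteration(n_iter=50):
    v = {x: 0.0 for x in states}                       # initial guess v^(0)
    for _ in range(n_iter):                            # relaxation (3)
        v = {x: min(cost[x][u] + v[next_state[x][u]]
                    for u in range(len(next_state[x]))) for x in states}
    return v

def policy_iteration(n_iter=10):
    pi = {x: 0 for x in states}                        # initial guess pi^(0)
    for _ in range(n_iter):
        # policy evaluation: relax v^pi using the first line of (4)
        v = {x: 0.0 for x in states}
        for _ in range(50):
            v = {x: cost[x][pi[x]] + v[next_state[x][pi[x]]] for x in states}
        # policy improvement: second line of (4)
        pi = {x: min(range(len(next_state[x])),
                     key=lambda u: cost[x][u] + v[next_state[x][u]])
              for x in states}
    return pi, v

print(value_iteration())    # converges to {0: 4.0, 1: 3.0, 2: 2.0, 3: 0.0}
print(policy_iteration())   # greedy policy consistent with those values
```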
1.3 Markov decision processes
The problems considered thus far are deterministic, in the sense that applying action u
at state $x$ always yields the same next state $\mathrm{next}(x,u)$. Dynamic programming easily generalizes to the stochastic case where we have a probability distribution over possible next states:

$$p(y|x,u) = \text{"probability that } \mathrm{next}(x,u) = y\text{"}$$

In order to qualify as a probability distribution the function $p$ must satisfy

$$\sum_{y \in \mathcal{X}} p(y|x,u) = 1 \quad \text{and} \quad p(y|x,u) \geq 0$$

In the stochastic case the value function equation (2) becomes

$$v(x) = \min_{u \in \mathcal{U}(x)} \left\{\mathrm{cost}(x,u) + E\left[v\left(\mathrm{next}(x,u)\right)\right]\right\} \qquad (5)$$

where $E$ denotes expectation over $\mathrm{next}(x,u)$, and is computed as

$$E\left[v\left(\mathrm{next}(x,u)\right)\right] = \sum_{y \in \mathcal{X}} p(y|x,u)\, v(y)$$
Equations (1, 3, 4) generalize to the stochastic case in the same way as equation (2) does.
An optimal control problem with discrete states and actions and probabilistic state
transitions is called a Markov decision process (MDP). MDPs are extensively studied in
reinforcement learning - which is a sub-field of machine learning focusing on optimal control
problems with discrete state. In contrast, optimal control theory focuses on problems with
continuous state and exploits their rich differential structure.
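As a small illustration of the stochastic backup, the expectation $E[v(\mathrm{next}(x,u))] = \sum_y p(y|x,u)\, v(y)$ can be computed with a single matrix product when the transition probabilities are stored as an array; the random MDP below (with one absorbing, cost-free state) is an arbitrary assumption.

```python
# Sketch of value iteration for an MDP: the stochastic backup
# v(x) = min_u { cost(x,u) + sum_y p(y|x,u) v(y) } becomes a matrix product.
# The random transition probabilities and costs below are illustrative only;
# state 2 is made absorbing and cost-free so the total cost stays finite.
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(y|x,u)
C = rng.uniform(0.0, 1.0, size=(n_states, n_actions))              # cost(x,u)
C[2, :] = 0.0
P[2, :, :] = np.array([0.0, 0.0, 1.0])

v = np.zeros(n_states)
for _ in range(200):                      # value iteration, eq. (3)
    v = np.min(C + P @ v, axis=1)         # (P @ v)[x, u] = E[v(next(x, u))]
pi = np.argmin(C + P @ v, axis=1)         # greedy policy, eq. (1)
```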
2 Continuous control: Hamilton-Jacobi-Bellman equations
We now turn to optimal control problems where the state $x \in \mathbb{R}^{n_x}$ and control $u \in \mathcal{U}(x) \subseteq \mathbb{R}^{n_u}$ are real-valued vectors. To simplify notation we will use the shortcut $\min_u$ instead of $\min_{u \in \mathcal{U}(x)}$, although the latter is implied unless noted otherwise. Consider the stochastic differential equation

$$dx = f(x,u)\, dt + F(x,u)\, dw \qquad (6)$$

where $dw$ is $n_w$-dimensional Brownian motion. This is sometimes called a controlled Ito diffusion, with $f(x,u)$ being the drift and $F(x,u)$ the diffusion coefficient. In the absence of noise, i.e. when $F(x,u) = 0$, we can simply write $\dot{x} = f(x,u)$. However in the stochastic case this would be meaningless because the sample paths of Brownian motion are not differentiable (the term $dw/dt$ is infinite). What equation (6) really means is that the integral of the left hand side is equal to the integral of the right hand side:

$$x(t) = x(0) + \int_0^t f\left(x(s), u(s)\right) ds + \int_0^t F\left(x(s), u(s)\right) dw(s)$$

The last term is an Ito integral, defined for square-integrable functions $g(t)$ as

$$\int_0^t g(s)\, dw(s) = \lim_{n \to \infty} \sum_{k=0}^{n-1} g(s_k)\left(w(s_{k+1}) - w(s_k)\right)$$

where $0 = s_0 < s_1 < \cdots < s_n = t$.
We will stay away from the complexities of stochastic calculus to the extent possible. Instead
we will discretize the time axis and obtain results for the continuous-time case in the limit
of infinitely small time step.
The appropriate Euler discretization of (6) is

$$x_{k+1} = x_k + \Delta\, f(x_k, u_k) + \sqrt{\Delta}\, F(x_k, u_k)\, \epsilon_k$$

where $\Delta$ is the time step, $\epsilon_k \sim \mathcal{N}\left(0, I^{n_w}\right)$ and $x_k = x(k\Delta)$. The $\sqrt{\Delta}$ term appears because the variance of Brownian motion grows linearly with time, and thus the standard deviation of the discrete-time noise should scale as $\sqrt{\Delta}$.
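For concreteness, the discretization above can be simulated directly. In the sketch below the scalar drift, the constant diffusion coefficient, and the constant control signal are arbitrary illustrative assumptions.

```python
# Sketch of the Euler discretization of the controlled diffusion (6):
# x_{k+1} = x_k + Delta f(x_k,u_k) + sqrt(Delta) F(x_k,u_k) eps_k.
# The scalar dynamics, constant noise, and constant control are illustrative.
import numpy as np

def f(x, u):                 # drift f(x,u): simple damped, driven dynamics
    return -x + u

def F(x, u):                 # diffusion coefficient F(x,u): constant here
    return 0.2

Delta, n_steps = 0.01, 1000
rng = np.random.default_rng(0)
x = np.zeros(n_steps + 1)
for k in range(n_steps):
    u = 1.0                                              # placeholder control
    eps = rng.standard_normal()                          # eps_k ~ N(0, 1)
    x[k + 1] = x[k] + Delta * f(x[k], u) + np.sqrt(Delta) * F(x[k], u) * eps
# x now holds one sample path of the discretized diffusion on [0, 10].
```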
To define an optimal control problem we also need a cost function. In finite-horizon problems, i.e. when a final time $t_f$ is specified, it is natural to separate the total cost into a time-integral of a cost rate $\ell(x,u,t) \geq 0$, and a final cost $h(x) \geq 0$ which is only evaluated at the final state $x(t_f)$. Thus the total cost for a given state-control trajectory $\{x(t), u(t) : 0 \leq t \leq t_f\}$ is defined as

$$J\left(x(\cdot), u(\cdot)\right) = h\left(x(t_f)\right) + \int_0^{t_f} \ell\left(x(t), u(t), t\right) dt$$

Keep in mind that we are dealing with a stochastic system. Our objective is to find a control law $u = \pi(x,t)$ which minimizes the expected total cost for starting at a given $(x,t)$ and acting according to $\pi$ thereafter.

In discrete time the total cost becomes

$$J(x_\cdot, u_\cdot) = h(x_n) + \Delta \sum_{k=0}^{n-1} \ell(x_k, u_k, k\Delta)$$

where $n = t_f/\Delta$ is the number of time steps (assume that $t_f/\Delta$ is an integer).

2.1 Derivation of the HJB equations
We are now ready to apply dynamic programming to the time-discretized stochastic problem. The development is similar to the MDP case except that the state space is now infinite: it consists of $n+1$ copies of $\mathbb{R}^{n_x}$. The reason we need multiple copies of $\mathbb{R}^{n_x}$ is that we have a finite-horizon problem, and therefore the time when a given $x \in \mathbb{R}^{n_x}$ is reached makes a difference.

The state transitions are now stochastic: the probability distribution of $x_{k+1}$ given $x_k, u_k$ is the multivariate Gaussian

$$x_{k+1} \sim \mathcal{N}\left(x_k + \Delta\, f(x_k, u_k),\ \Delta\, S(x_k, u_k)\right)$$

where $S(x,u) = F(x,u)\, F(x,u)^{\mathsf{T}}$.

The Bellman equation for the optimal value function $v$ is similar to (5), except that $v$ is now a function of space and time. We have

$$v(x,k) = \min_u \left\{\Delta\, \ell(x,u,k\Delta) + E\left[v\left(x + \Delta\, f(x,u) + \xi,\ k+1\right)\right]\right\} \qquad (7)$$

where $\xi \sim \mathcal{N}\left(0, \Delta\, S(x,u)\right)$, and $v(x,n) = h(x)$.
Consider the second-order Taylor-series expansion of $v$, with the time index $k+1$ suppressed for clarity:

$$v(x + \delta) = v(x) + \delta^{\mathsf{T}} v_x(x) + \tfrac{1}{2}\,\delta^{\mathsf{T}} v_{xx}(x)\,\delta + o\left(\delta^3\right)$$

where $\delta = \Delta\, f(x,u) + \xi$, $v_x = \frac{\partial v}{\partial x}$, $v_{xx} = \frac{\partial^2 v}{\partial x\, \partial x^{\mathsf{T}}}$.

Now compute the expectation of the optimal value function at the next state, using the above Taylor-series expansion and only keeping terms up to first order in $\Delta$. The result is:

$$E[v] = v(x) + \Delta\, f(x,u)^{\mathsf{T}} v_x(x) + \tfrac{1}{2}\,\mathrm{tr}\left(\Delta\, S(x,u)\, v_{xx}(x)\right) + o\left(\Delta^2\right)$$

The trace term appears because

$$E\left[\xi^{\mathsf{T}} v_{xx}\, \xi\right] = E\left[\mathrm{tr}\left(\xi \xi^{\mathsf{T}} v_{xx}\right)\right] = \mathrm{tr}\left(\mathrm{Cov}[\xi]\, v_{xx}\right) = \mathrm{tr}\left(\Delta\, S\, v_{xx}\right)$$

Note the second-order derivative $v_{xx}$ in the first-order approximation to $E[v]$. This is a recurrent theme in stochastic calculus. It is directly related to Ito's lemma, which states that if $x(t)$ is an Ito diffusion with coefficient $\sigma$, then

$$dg\left(x(t)\right) = g_x\left(x(t)\right) dx(t) + \tfrac{1}{2}\,\sigma^2 g_{xx}\left(x(t)\right) dt$$
Coming back to the derivation, we substitute the expression for $E[v]$ in (7), move the term $v(x)$ outside the minimization operator (since it does not depend on $u$), and divide by $\Delta$. Suppressing $x, u, k$ on the right hand side, we have

$$\frac{v(x,k) - v(x,k+1)}{\Delta} = \min_u \left\{\ell + f^{\mathsf{T}} v_x + \tfrac{1}{2}\,\mathrm{tr}\left(S\, v_{xx}\right) + o(\Delta)\right\}$$

Recall that $t = k\Delta$, and consider the optimal value function $v(x,t)$ defined in continuous time. The left hand side in the above equation is then

$$\frac{v(x,t) - v(x,t+\Delta)}{\Delta}$$

In the limit $\Delta \to 0$ the latter expression becomes $-\frac{\partial}{\partial t} v$, which we denote $-v_t$. Thus for $0 \leq t \leq t_f$, and with boundary condition $v(x, t_f) = h(x)$, the optimal value function satisfies the Hamilton-Jacobi-Bellman (HJB) equation

$$-v_t(x,t) = \min_{u \in \mathcal{U}(x)} \left\{\ell(x,u,t) + f(x,u)^{\mathsf{T}} v_x(x,t) + \tfrac{1}{2}\,\mathrm{tr}\left(S(x,u)\, v_{xx}(x,t)\right)\right\} \qquad (8)$$

An alternative to the finite-horizon formulation is the discounted-cost formulation, with total cost

$$J\left(x(\cdot), u(\cdot)\right) = \int_0^\infty \exp(-\alpha t)\, \ell\left(x(t), u(t)\right) dt$$

and $\alpha > 0$ being the discount factor. Intuitively this says that future costs are less costly (whatever that means). Here we do not have a final cost $h(x)$, and the cost rate $\ell(x,u)$ no longer depends on time explicitly. The HJB equation for the optimal value function
becomes
$$\alpha\, v(x) = \min_{u \in \mathcal{U}(x)} \left\{\ell(x,u) + f(x,u)^{\mathsf{T}} v_x(x) + \tfrac{1}{2}\,\mathrm{tr}\left(S(x,u)\, v_{xx}(x)\right)\right\} \qquad (10)$$
Another alternative is the average-cost-per-stage formulation, with total cost

$$J\left(x(\cdot), u(\cdot)\right) = \lim_{t_f \to \infty} \frac{1}{t_f} \int_0^{t_f} \ell\left(x(t), u(t)\right) dt$$

In this case the HJB equation for the optimal value function is

$$\lambda = \min_{u \in \mathcal{U}(x)} \left\{\ell(x,u) + f(x,u)^{\mathsf{T}} v_x(x) + \tfrac{1}{2}\,\mathrm{tr}\left(S(x,u)\, v_{xx}(x)\right)\right\} \qquad (11)$$

where $\lambda \geq 0$ is the average cost per stage, and $v$ now has the meaning of a differential value function.
Equations (10) and (11) do not depend on time, which makes them more amenable to
numerical approximations in the sense that we do not need to store a copy of the optimal
value function at each point in time. From another point of view, however, (8) may be easier to solve numerically. This is because dynamic programming can be performed in a single
backward pass through time: initialize v (x,t7) = h(x) and simply integrate (8) backward
in time, computing the spatial derivatives numerically along the way. In contrast, (10) and
(11) call for relaxation methods (such as value iteration or policy iteration) which in the
continuous-state case may take an arbitrary number of iterations to converge. Relaxation
methods are of course guaranteed to converge in a finite number of iterations for any finite
state approximation, but that number may increase rapidly as the discretization of the
continuous state space is refined.
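One simple way to carry out the backward pass described above is to apply the time-discretized Bellman equation (7) on a state grid, interpolating the value function between grid points. The sketch below does this for a deterministic scalar example ($F = 0$); the dynamics, cost rate, final cost, grid, and candidate control set are all illustrative assumptions.

```python
# Sketch of a single backward pass: solve the time-discretized Bellman
# equation (7) on a state grid for a deterministic scalar example (F = 0).
# The dynamics, costs, grid, and control set are illustrative assumptions.
import numpy as np

f   = lambda x, u: u                        # dx/dt = u (simple integrator)
ell = lambda x, u, t: 0.5 * (x**2 + u**2)   # cost rate
h   = lambda x: 5.0 * x**2                  # final cost

Delta, tf = 0.01, 1.0
n = int(tf / Delta)
xs = np.linspace(-2.0, 2.0, 201)            # state grid
us = np.linspace(-3.0, 3.0, 61)             # candidate controls

v = h(xs)                                   # boundary condition v(x, n) = h(x)
for k in range(n - 1, -1, -1):              # backward pass through time
    t = k * Delta
    # next states for every (grid point, control) pair, clipped to the grid
    x_next = np.clip(xs[:, None] + Delta * f(xs[:, None], us[None, :]),
                     xs[0], xs[-1])
    v_next = np.interp(x_next, xs, v)       # linear interpolation of v(., k+1)
    q = Delta * ell(xs[:, None], us[None, :], t) + v_next
    v = q.min(axis=1)                       # v(x, k) = min_u {...}
# v now approximates the optimal value function at t = 0 on the grid.
```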
3 Deterministic control: Pontryagin’s maximum principle
Optimal control theory is based on two fundamental ideas. One is dynamic programming
and the associated optimality principle, introduced by Bellman in the United States. The
other is the maximum principle, introduced by Pontryagin in the Soviet Union. The max-
imum principle applies only to deterministic problems, and yields the same solutions as
dynamic programming. Unlike dynamic programming, however, the maximum principle
avoids the curse of dimensionality. Here we derive the maximum principle indirectly via
the HJB equation, and directly via Lagrange multipliers. We also clarify its relationship to
classical mechanics.
3.1 Derivation via the HJB equations
For deterministic dynamics $\dot{x} = f(x,u)$ the finite-horizon HJB equation (8) becomes

$$-v_t(x,t) = \min_{u \in \mathcal{U}(x)} \left\{\ell(x,u,t) + f(x,u)^{\mathsf{T}} v_x(x,t)\right\} \qquad (12)$$

Suppose a solution to the minimization problem in (12) is given by an optimal control law $\pi(x,t)$ which is differentiable in $x$. Setting $u = \pi(x,t)$ we can drop the min operator in (12) and write

$$0 = v_t(x,t) + \ell\left(x, \pi(x,t), t\right) + f\left(x, \pi(x,t)\right)^{\mathsf{T}} v_x(x,t)$$
This equation is valid for all $x$, and therefore can be differentiated w.r.t. $x$ to obtain (in shortcut notation)

$$0 = v_{tx} + \ell_x + \pi_x^{\mathsf{T}} \ell_u + \left(f_x^{\mathsf{T}} + \pi_x^{\mathsf{T}} f_u^{\mathsf{T}}\right) v_x + v_{xx}\, f$$

Regrouping terms, and using the identity $\dot{v}_x = v_{xt} + v_{xx}\dot{x} = v_{xt} + v_{xx} f$, yields

$$0 = \dot{v}_x + \ell_x + f_x^{\mathsf{T}} v_x + \pi_x^{\mathsf{T}}\left(\ell_u + f_u^{\mathsf{T}} v_x\right)$$

We now make a key observation: the term in the brackets is the gradient w.r.t. $u$ of the quantity being minimized w.r.t. $u$ in (12). That gradient is zero (assuming unconstrained minimization), which leaves us with

$$-\dot{v}_x(x,t) = \ell_x\left(x, \pi(x,t), t\right) + f_x\left(x, \pi(x,t)\right)^{\mathsf{T}} v_x(x,t) \qquad (13)$$

This may look like a PDE for $v$, but if we think of $v_x$ as a vector $p$ instead of a gradient of
a function which depends on x, then (13) is an ordinary differential equation (ODE) for p.
That equation holds along any trajectory generated by $\pi(x,t)$. The vector $p$ is called the
costate vector.
We are now ready to formulate the maximum principle. If $\left\{x(t), u(t) : 0 \leq t \leq t_f\right\}$

$$\hat{x}_{k+1} = A\hat{x}_k + A\Sigma_k H^{\mathsf{T}}\left(P + H\Sigma_k H^{\mathsf{T}}\right)^{-1}\left(y_k - H\hat{x}_k\right) \qquad (30)$$
and covariance matrix

$$\Sigma_{k+1} = S + A\Sigma_k A^{\mathsf{T}} - A\Sigma_k H^{\mathsf{T}}\left(P + H\Sigma_k H^{\mathsf{T}}\right)^{-1} H\Sigma_k A^{\mathsf{T}} \qquad (31)$$

This completes the induction proof. Equation (31) is a Riccati equation. Equation (30) is usually written as

$$\hat{x}_{k+1} = A\hat{x}_k + K_k\left(y_k - H\hat{x}_k\right)$$

where $K_k = A\Sigma_k H^{\mathsf{T}}\left(P + H\Sigma_k H^{\mathsf{T}}\right)^{-1}$
The time-varying matrix $K_k$ is called the filter gain. It does not depend on the observation sequence and therefore can be computed offline. The quantity $y_k - H\hat{x}_k$ is called the innovation. It is the mismatch between the observed and the expected measurement. The covariance $\Sigma_k$ of the posterior probability distribution $p\left(x_k | y_{k-1}, \ldots, y_0\right)$ is the estimation error covariance. The estimation error is $x_k - \hat{x}_k$.
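For concreteness, here is a minimal sketch of the recursion in the notation above, propagating $\hat{x}_k$ through (30) (written in its gain form) and $\Sigma_k$ through (31); the system matrices and noise covariances are arbitrary illustrative choices.

```python
# Minimal sketch of the discrete-time Kalman filter in the notation above:
# dynamics x_{k+1} = A x_k + w_k with Cov[w] = S, observations y_k = H x_k + v_k
# with Cov[v] = P. The gain K_k and the updates follow (30)-(31). The matrices
# and noise levels below are arbitrary illustrative choices.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])    # state transition
H = np.array([[1.0, 0.0]])                # observation matrix
S = 0.01 * np.eye(2)                      # process-noise covariance
P = 0.1 * np.eye(1)                       # observation-noise covariance

def kalman_filter(ys, x_hat, Sigma):
    estimates = []
    for y in ys:
        K = A @ Sigma @ H.T @ np.linalg.inv(P + H @ Sigma @ H.T)  # filter gain K_k
        x_hat = A @ x_hat + K @ (y - H @ x_hat)                   # mean update (30)
        Sigma = S + A @ Sigma @ A.T - K @ H @ Sigma @ A.T         # covariance update (31)
        estimates.append(x_hat)
    return np.array(estimates), Sigma

# generate synthetic data and run the filter
rng = np.random.default_rng(0)
x, ys = np.array([0.0, 1.0]), []
for _ in range(50):
    x = A @ x + rng.multivariate_normal(np.zeros(2), S)
    ys.append(H @ x + rng.multivariate_normal(np.zeros(1), P))
est, Sigma_final = kalman_filter(ys, x_hat=np.zeros(2), Sigma=np.eye(2))
```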
The above derivation corresponds to the discrete-time Kalman filter. A similar result holds in continuous time, and is called the Kalman-Bucy filter. It is possible to write down the Kalman filter in equivalent forms which have numerical advantages. One such approach is to propagate the matrix square root of $\Sigma$. This is called a square-root filter, and involves Riccati-like equations which are more stable because the dynamic range of the elements of $\Sigma$ is reduced. Another approach is to propagate the inverse covariance $\Sigma^{-1}$. This is called an information filter, and again involves Riccati-like equations. The information filter can represent numerically very large covariances (and even infinite covariances, which are useful for specifying "non-informative" priors).
Instead of filtering one can do smoothing, i.e. obtain state estimates using observations
from the past and from the future. In that case the posterior probability of each state given
all observations is still Gaussian, and its parameters can be found by an additional backward
pass known as the Rauch recursion. The resulting Kalman smoother is closely related to the
forward-backward algorithm for probabilistic inference in hidden Markov models (HMMs).
The Kalman filter is optimal in many ways. First of all it is a Bayesian filter, in the sense that it computes the posterior probability distribution over the hidden state. In addition, the mean $\hat{x}$ is the optimal point estimator with respect to multiple loss functions. Recall that optimality of point estimators is defined through a loss function $\ell(x, \hat{x})$ which quantifies how bad it is to estimate $\hat{x}$ when the true state is $x$. Some possible loss functions are $\|x - \hat{x}\|^2$, $\|x - \hat{x}\|$, $\delta(x - \hat{x})$, and the corresponding optimal estimators are the mean, median, and mode of the posterior probability distribution. If the posterior is Gaussian then the mean, median and mode coincide. If we choose an unusual loss function for which the optimal point estimator is not $\hat{x}$, even though the posterior is Gaussian, the information contained in $\hat{x}$ and $\Sigma$ is still sufficient to compute the optimal point estimator. This is because $\hat{x}$ and $\Sigma$
are sufficient statistics which fully describe the posterior probability distribution, which in
turn captures all information about the state that is available in the observation sequence.
‘The set of sufficient statistics can be thought of as an augmented state, with respect to
which the partially-observed process has a Markov property. It is called the belief state or
alternatively the information state.
5.2 Beyond the Kalman filter
When the estimation problem involves non-linear dynamics or non-Gaussian noise the pos-
terior probability distribution rarely has a finite set of sufficient statistics (although there
are exceptions such as the Benes system). In that case one has to rely on numerical ap-
proximations. The most widely used approximation is the extended Kalman filter (EKF). It relies on local linearization centered at the current state estimate and closely resembles the LQG approximation to non-LQG optimal control problems. The EKF is not guaranteed to be optimal in any sense, but in practice it often yields good results, especially when the posterior is single-peaked. There is a recent improvement, called the unscented filter, which propagates the covariance using deterministic sampling instead of linearization of the system dynamics. The unscented filter tends to be superior to the EKF and requires a comparable amount of computation. An even more accurate, although computationally more expensive, approach is particle filtering. Instead of propagating a Gaussian approximation
it propagates a cloud of points sampled from the posterior (without actually computing the
posterior). Key to its success is the idea of importance sampling.
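As an illustration of the sampling idea, below is a minimal bootstrap particle filter: propagate the particles through the dynamics, reweight them by the observation likelihood (importance sampling), and resample. The scalar model and noise levels are arbitrary assumptions rather than anything from the text.

```python
# Minimal bootstrap particle filter sketch: propagate samples through the
# dynamics, reweight by the observation likelihood (importance sampling),
# then resample. The scalar model and noise levels are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 0.9 * x + np.sin(x)          # nonlinear dynamics
h = lambda x: x ** 2                       # nonlinear observation
sig_w, sig_v = 0.5, 1.0                    # process / observation noise std

def particle_filter(ys, n_particles=1000):
    particles = rng.normal(0.0, 1.0, n_particles)   # samples from the prior
    means = []
    for y in ys:
        particles = f(particles) + sig_w * rng.standard_normal(n_particles)
        w = np.exp(-0.5 * ((y - h(particles)) / sig_v) ** 2)  # likelihood weights
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)  # resampling step
        particles = particles[idx]
        means.append(particles.mean())     # posterior-mean point estimate
    return np.array(means)
```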
Even when the posterior does not have a finite set of sufficient statistics, it is still a
well-defined scalar function over the state space, and as such must obey some equation. In discrete time this equation is simply a recursive version of Bayes' rule - which is not too revealing. In continuous time, however, the posterior satisfies a PDE which resembles the HJB equation. Before we present this result we need some notation. Consider the stochastic differential equations

$$dx = f(x)\, dt + F(x)\, dw \qquad (32)$$
$$dy = h(x)\, dt + dv$$

where $w(t)$ and $v(t)$ are Brownian motion processes, $x(t)$ is the hidden state, and $y(t)$ is the observation sequence. Define $S(x) = F(x)\, F(x)^{\mathsf{T}}$ as before. One would normally think of the increments of $y(t)$ as being the observations, but in continuous time these increments are infinite and so we work with their time-integral.
Let $p(x,t)$ be the probability distribution of $x(t)$ in the absence of any observations. At $t = 0$ it is initialized with a given prior $p(x,0)$. For $t > 0$ it is governed by the first line of (32), and can be shown to satisfy

$$p_t = -f^{\mathsf{T}} p_x + \tfrac{1}{2}\,\mathrm{tr}\left(S\, p_{xx}\right) + \left(-\sum_i \frac{\partial f_i}{\partial x_i} + \tfrac{1}{2}\sum_{i,j} \frac{\partial^2 S_{ij}}{\partial x_i\, \partial x_j}\right) p + \sum_{i,j} \frac{\partial S_{ij}}{\partial x_i}\, \frac{\partial p}{\partial x_j}$$

This is called the forward Kolmogorov equation, or alternatively the Fokker-Planck equation. We have written it in expanded form to emphasize the resemblance to the HJB equation. The more usual form is

$$p_t = -\sum_i \frac{\partial}{\partial x_i}\left(f_i\, p\right) + \tfrac{1}{2}\sum_{i,j} \frac{\partial^2}{\partial x_i\, \partial x_j}\left(S_{ij}\, p\right)$$
Let $\tilde{p}(x,t)$ be an unnormalized posterior over $x(t)$ given the observations $\{y(s) : 0 \leq s \leq t\}$. Here "unnormalized" means that the actual posterior $\hat{p}(x,t)$ can be recovered by normalizing: $\hat{p}(x,t) = \tilde{p}(x,t) / \int \tilde{p}(z,t)\, dz$. It can be shown that some unnormalized posterior $\tilde{p}$ satisfies Zakai's equation

$$d\tilde{p} = \left(-\sum_i \frac{\partial}{\partial x_i}\left(f_i\, \tilde{p}\right) + \tfrac{1}{2}\sum_{i,j} \frac{\partial^2}{\partial x_i\, \partial x_j}\left(S_{ij}\, \tilde{p}\right)\right) dt + \tilde{p}\, h^{\mathsf{T}} dy$$
The first term on the right reflects the prior and is the same as in the Kolmogorov equation (except that we have multiplied both sides by $dt$). The second term incorporates the observation and makes Zakai's equation a stochastic PDE. After certain manipulations (conversion to Stratonovich form and a gauge transformation) the second term can be integrated by parts, leading to a regular PDE. One can then approximate the solution to that PDE numerically via discretization methods. As in the HJB equation, however, such methods are only applicable in low-dimensional spaces due to the curse of dimensionality.
6 Duality of optimal control and optimal estimation
Optimal control and optimal estimation are closely related mathematical problems. The
best-known example is the duality of the linear-quadratic regulator and the Kalman filter. To see that duality more clearly, we repeat the corresponding Riccati equations (27) and (31) side by side:

$$\text{control:} \quad V_k = Q + A^{\mathsf{T}} V_{k+1} A - A^{\mathsf{T}} V_{k+1} B\left(R + B^{\mathsf{T}} V_{k+1} B\right)^{-1} B^{\mathsf{T}} V_{k+1} A$$
$$\text{filtering:} \quad \Sigma_{k+1} = S + A\Sigma_k A^{\mathsf{T}} - A\Sigma_k H^{\mathsf{T}}\left(P + H\Sigma_k H^{\mathsf{T}}\right)^{-1} H\Sigma_k A^{\mathsf{T}}$$

These equations are identical up to a time reversal and some matrix transposes. The correspondence is given in the following table:

$$\begin{array}{lcccccc} \text{control:} & V & Q & A & B & R & k \\ \text{filtering:} & \Sigma & S & A^{\mathsf{T}} & H^{\mathsf{T}} & P & n-k \end{array} \qquad (33)$$
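The correspondence can be checked numerically: running the control recursion (27) and the filtering recursion (31) under the substitutions in (33), from matching initial matrices, produces identical iterates. In the sketch below the matrices, the horizon, and the terminal condition $V_n = Q$ are illustrative assumptions.

```python
# Numerical check of the correspondence (33): the control Riccati recursion (27)
# with (Q, A, B, R) reproduces the filtering recursion (31) with (S, A^T, H^T, P).
# Matrix sizes, values, horizon, and the terminal condition V_n = Q are assumed
# here purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 2))
Q = np.eye(3)                       # state cost   <-> S (process noise)
R = np.eye(2)                       # control cost <-> P (observation noise)
n = 20

# control: V_k = Q + A^T V_{k+1} A - A^T V_{k+1} B (R + B^T V_{k+1} B)^{-1} B^T V_{k+1} A
V = Q.copy()                        # terminal condition V_n = Q (assumed)
for _ in range(n):
    V = Q + A.T @ V @ A - A.T @ V @ B @ np.linalg.inv(R + B.T @ V @ B) @ B.T @ V @ A

# filtering: substitute A -> A^T, B -> H^T, Q -> S, R -> P as in (33)
Af, H, S, P = A.T, B.T, Q, R
Sigma = S.copy()                    # matching initial condition (assumed)
for _ in range(n):
    Sigma = (S + Af @ Sigma @ Af.T
             - Af @ Sigma @ H.T @ np.linalg.inv(P + H @ Sigma @ H.T) @ H @ Sigma @ Af.T)

print(np.allclose(V, Sigma))        # True: identical iterates under (33)
```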
The above duality was first described by Kalman in his famous 1960 paper introducing the discrete-time Kalman filter, and is now mentioned in most books on estimation and control. However its origin and meaning are not apparent. This is because the Kalman filter is optimal from multiple points of view and can be written in multiple forms, making it hard to tell which of its properties have a dual in the control domain.
Attempts to generalize the duality to non-LQG settings have revealed that the funda-
mental relationship is between the optimal value function and the negative log-posterior.
This is actually inconsistent with (33), although the inconsistency has not been made ex-
plicit before. Recall that $V$ is the Hessian of the optimal value function. The posterior is Gaussian with covariance matrix $\Sigma$, and thus the Hessian of the negative log-posterior is $\Sigma^{-1}$. However in (33) we have $V$ corresponding to $\Sigma$ and not to $\Sigma^{-1}$. Another problem is that while $A^{\mathsf{T}}$ in (33) makes sense for linear dynamics $\dot{x} = Ax$, the meaning of "transpose" for general non-linear dynamics $\dot{x} = f(x)$ is unclear. Thus the duality described by Kalman is specific to linear-quadratic-Gaussian systems and does not generalize.
6.1 General duality of optimal control and MAP smoothing
We now discuss an alternative approach which does generalize. In fact we start with the
general case and later specialize it to the LQG setting to obtain something known as a
minimum-energy estimator. As far as we know, the general treatment presented here is a
novel result. The duality we establish is between maximum a posteriori (MAP) smoothing
and deterministic optimal control for systems with continuous state. For systems with
discrete state (i.e. HMMs) an instance of such duality is the Viterbi algorithm — which
finds the most likely state sequence using dynamic programming.
Consider the discrete-time partially-observable system
$$p(x_{k+1}|x_k) = \exp\left(-a(x_{k+1}, x_k)\right) \qquad (34)$$
$$p(y_k|x_k) = \exp\left(-b(y_k, x_k)\right)$$
$$p(x_0) = \exp\left(-c(x_0)\right)$$

where $a, b, c$ are the negative log-probabilities of the state transitions, observation emissions, and initial state respectively. The states are hidden and we only have access to the observations $\{y_1, y_2, \ldots, y_n\}$. Our objective is to find the most probable sequence of states $\{x_0, x_1, \ldots, x_n\}$, that is, the sequence which maximizes the posterior probability

$$p(x_\cdot | y_\cdot) = \frac{p(y_\cdot | x_\cdot)\, p(x_\cdot)}{p(y_\cdot)}$$

The term $p(y_\cdot)$ does not affect the maximization and so it can be dropped. Using the Markov property of (34) we have

$$p(y_\cdot|x_\cdot)\, p(x_\cdot) = p(x_0) \prod_{k=1}^{n} p(x_k|x_{k-1})\, p(y_k|x_k)$$
$$= \exp\left(-c(x_0)\right) \prod_{k=1}^{n} \exp\left(-a(x_k, x_{k-1})\right) \exp\left(-b(y_k, x_k)\right)$$
$$= \exp\left(-c(x_0) - \sum_{k=1}^{n} \left(a(x_k, x_{k-1}) + b(y_k, x_k)\right)\right)$$

Maximizing $\exp(-J)$ is equivalent to minimizing $J$. Therefore the most probable state sequence is the one which minimizes

$$J(x_\cdot) = c(x_0) + \sum_{k=1}^{n} \left(a(x_k, x_{k-1}) + b(y_k, x_k)\right) \qquad (35)$$
This is beginning to look like a total cost for a deterministic optimal control problem. However we are still missing a control signal. To remedy that we will define the passive dynamics as the expected state transition:

$$f(x_k) = E\left[x_{k+1}|x_k\right] = \int x\, \exp\left(-a(x, x_k)\right) dx$$

and then define the control signal as the deviation from the expected state transition:

$$x_{k+1} = f(x_k) + u_k \qquad (36)$$

The control cost is now defined as
$$r(u_k, x_k) = a\left(f(x_k) + u_k,\ x_k\right), \quad 0 \leq k \leq n-1$$

In (33) we had $R \Leftrightarrow P$ and $Q \Leftrightarrow S$, while these two correspondences are now reversed.
A minimum-energy interpretation can help better understand the duality. From (29) we have $x_k - Ax_{k-1} = w_{k-1}$ and $y_k - Hx_k = v_k$. Thus the cost rate for the optimal control problem is of the form

$$w^{\mathsf{T}} S^{-1} w + v^{\mathsf{T}} P^{-1} v$$

This can be thought of as the energy of the noise signals $w$ and $v$. Note that an estimate for the states implies estimates for the two noise terms, and the likelihood of the estimated noise is a natural quantity to optimize. The first term above measures how far the estimated state is from the prior; it represents a control cost because the control signal pushes the estimate away from the prior. The second term measures how far the predicted observation (and thus the estimated state) is from the actual observation. One can think of this as a minimum-energy tracking problem with reference trajectory specified by the observations.
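To illustrate this interpretation, in the linear-Gaussian case the cost (35) is quadratic in the state sequence, so the MAP (minimum-energy) trajectory can be found by solving a single linear system. The sketch below assumes the scalar model $x_k = A x_{k-1} + w_{k-1}$, $y_k = H x_k + v_k$ with Gaussian noise; all numbers are illustrative.

```python
# Sketch of MAP smoothing as minimum-energy optimization for a scalar
# linear-Gaussian model: the cost (35) becomes the quadratic
# 0.5*(x0-m0)^2/Sigma0 + sum_k [0.5*(x_k - A x_{k-1})^2/S + 0.5*(y_k - H x_k)^2/P],
# which is minimized over the whole state sequence by solving dJ/dx = 0.
# All numbers below are illustrative assumptions.
import numpy as np

A, H, S, P = 0.95, 1.0, 0.1, 0.5
m0, Sigma0 = 0.0, 1.0
rng = np.random.default_rng(0)

# simulate a state sequence and observations y_1..y_n
n = 30
x_true, ys = np.zeros(n + 1), np.zeros(n + 1)        # ys[0] unused
for k in range(1, n + 1):
    x_true[k] = A * x_true[k - 1] + np.sqrt(S) * rng.standard_normal()
    ys[k] = H * x_true[k] + np.sqrt(P) * rng.standard_normal()

# J is quadratic in (x_0,...,x_n): assemble the Hessian M and linear term b
M = np.zeros((n + 1, n + 1))
b = np.zeros(n + 1)
M[0, 0] += 1.0 / Sigma0; b[0] += m0 / Sigma0         # prior term c(x_0)
for k in range(1, n + 1):
    # transition term a(x_k, x_{k-1}) = 0.5*(x_k - A x_{k-1})^2 / S
    M[k, k] += 1.0 / S; M[k - 1, k - 1] += A**2 / S
    M[k, k - 1] -= A / S; M[k - 1, k] -= A / S
    # observation term b(y_k, x_k) = 0.5*(y_k - H x_k)^2 / P
    M[k, k] += H**2 / P; b[k] += H * ys[k] / P

x_map = np.linalg.solve(M, b)        # MAP (smoothed) state-sequence estimate
```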
7 Optimal control as a theory of biological movement

To say that the brain generates the best behavior it can, subject to the constraints imposed by the body and environment, is almost trivial. After all, the brain has evolved for the sole purpose of generating behavior advantageous to the organism. It is then reasonable to
expect that, at least in natural and well-practiced tasks, the observed behavior will be close
to optimal. This makes optimal control theory an appealing computational framework
for studying the neural control of movement. Optimal control is also a very successful
framework in terms of explaining the details of observed movement. However we have
recently reviewed this literature [12] and will not repeat the review here. Instead we will
briefly summarize existing optimal control models from a methodological perspective, and
then list some research directions which we consider promising.
Most optimality models of biological movement assume deterministic dynamics and
impose state constraints at different points in time. These constraints can for example
specify the initial and final posture of the body in one step of locomotion, or the positions
of a sequence of targets which the hand has to pass through. Since the constraints guarantee
accurate execution of the task, there is no need for accuracy-related costs which specify what
the task is. The only cost is a cost rate which specifies the "style" of the movement. It
has been defined as (an approximation to) metabolic energy, or the squared derivative of
acceleration (i.e. jerk), or the squared derivative of joint torque. The solution method is
usually based on the maximum principle. Minimum-energy models are explicitly formulated
as optimal control problems, while minimum-jerk and minimum-torque-change models are
formulated in terms of trajectory optimization. However they can be easily transformed
into optimal control problems by relating the derivative being minimized to a control signal.
Here is an example. Let $q(t)$ be the vector of generalized coordinates (e.g. joint angles) for an articulated body such as the human arm. Let $\tau(t)$ be the vector of generalized forces (e.g. joint torques). The equations of motion are

$$\tau = M(q)\, \ddot{q} + n(q, \dot{q})$$

where $M(q)$ is the configuration-dependent inertia matrix, and $n(q, \dot{q})$ captures non-linear interaction forces, gravity, and any external force fields that depend on position or velocity. Unlike mechanical devices, the musculo-skeletal system has order higher than two because the muscle actuators have their own states. For simplicity assume that the torques $\tau$ correspond to the set of muscle activations, and have dynamics

$$\dot{\tau} = \left(u - \tau\right)/c$$

where $u(t)$ is the control signal sent by the nervous system, and $c$ is the muscle time constant (around 40 msec). The state vector of this system is

$$x = \left[q;\ \dot{q};\ \tau\right]$$
We will use the subscript notation $x_{[1]} = q$, $x_{[2]} = \dot{q}$, $x_{[3]} = \tau$. The general first-order dynamics $\dot{x} = f(x,u)$ are given by

$$\dot{x}_{[1]} = x_{[2]}$$
$$\dot{x}_{[2]} = M\left(x_{[1]}\right)^{-1}\left(x_{[3]} - n\left(x_{[1]}, x_{[2]}\right)\right)$$
$$\dot{x}_{[3]} = \left(u - x_{[3]}\right)/c$$

Note that these dynamics are affine in the control signal and can be written as

$$\dot{x} = a(x) + B\, u$$
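As a minimal illustration, the sketch below assembles $a(x)$ and $B$ for a single joint, assuming a constant inertia and $n(q, \dot{q}) = 0$; these simplifications and all numbers are assumptions made only for illustration.

```python
# Sketch of the affine dynamics xdot = a(x) + B u above for a single joint,
# assuming constant inertia M and n(q, qdot) = 0 (no gravity or interaction
# forces). These simplifications and the numbers are illustrative only.
import numpy as np

M_inertia = 0.05          # scalar inertia
c = 0.040                 # muscle time constant (40 msec)

def a(x):
    q, qdot, tau = x
    return np.array([qdot, tau / M_inertia, -tau / c])

B = np.array([0.0, 0.0, 1.0 / c])

def f(x, u):
    return a(x) + B * u   # first-order dynamics of the (q, qdot, tau) state

# simple Euler rollout under a constant control signal
dt, x = 0.001, np.zeros(3)
for _ in range(500):
    x = x + dt * f(x, 0.2)
```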
Now we can specify a desired movement time $t_f$, an initial state $x_0$, and a final state $x_f$. We can also specify a cost rate, such as

$$\text{control energy:} \quad \ell(x,u) = \tfrac{1}{2}\left\|u\right\|^2$$
$$\text{torque-change:} \quad \ell(x,u) = \tfrac{1}{2}\left\|\dot{\tau}\right\|^2 = \tfrac{1}{2c^2}\left\|u - x_{[3]}\right\|^2$$

In both cases the cost is quadratic in $u$ and the dynamics are affine in $u$. Therefore the Hamiltonian can be minimized explicitly. Focusing on the minimum-energy model, we have

$$H(x, u, p) = \tfrac{1}{2}\left\|u\right\|^2 + \left(a(x) + B\, u\right)^{\mathsf{T}} p$$