2010
Abstract Recent trends in robot learning are to use trajectory-based optimal control techniques and reinforcement learning to scale to complex robotic systems. On the one hand, increased computational power and multiprocessing, and on the other hand, probabilistic reinforcement learning methods and function approximation, have contributed to a steadily increasing interest in robot learning. Imitation learning has helped significantly by providing reasonable initial behavior from which learning can start.
In robotics, the ultimate goal of reinforcement learning is to endow robots with the ability to learn, improve, adapt and reproduce tasks with dynamically changing constraints based on exploration and autonomous learning. We give a summary of the state-of-the-art of reinforcement learning in the context of robotics, in terms of both algorithms and policy representations. Numerous challenges faced by the policy representation in robotics are identified. Three recent examples of the application of reinforcement learning to real-world robots are described: a pancake flipping task, a bipedal walking energy minimization task and an archery-based aiming task. In all examples, a state-of-the-art expectation-maximization-based reinforcement learning algorithm is used, and different policy representations are proposed and evaluated for each task. The proposed policy representations offer viable solutions to six rarely-addressed challenges in policy representations: correlations, adaptability, multi-resolution, globality, multi-dimensionality and convergence. Both the successes and the practical difficulties encountered in these examples are discussed. Based on insights from these particular cases, conclusions are drawn about the state-of-the-art and future directions for reinforcement learning in robotics.
Reinforcement learning offers to robotics a framework and set of tools for the design of sophisticated and hard-to-engineer behaviors. Conversely, the challenges of robotic problems provide inspiration, impact, and validation for developments in reinforcement learning. The relationship between the disciplines has sufficient promise to be likened to that between physics and mathematics. In this article, we attempt to strengthen the links between the two research communities by providing a survey of work in reinforcement learning for behavior generation in robots. We highlight both key challenges in robot reinforcement learning and notable successes. We discuss how contributions tamed the complexity of the domain and study the role of algorithms, representations, and prior knowledge in achieving these successes. As a result, a particular focus of our paper lies on the choice between model-based and model-free approaches, as well as between value function-based and policy search methods. By analyzing a simple problem in some detail we demonstrate how reinforcement learning approaches may be profitably applied, and we note throughout open questions and the tremendous potential for future research.
Robotics
Reinforcement Learning (RL) is gaining much research attention because it allows the system to learn from interaction with the environment. Yet, despite these successful applications, the application of RL to direct joint-torque control without the help of an underlying dynamic model has not been reported in the literature. This study presents a split network structure that enables successful RL training of direct torque control for trajectory following with a six-axis articulated robot, without prior knowledge of the robot's dynamic model. Although training took a very long time to converge, we were able to show successful control of four different trajectories without needing an accurate dynamics model or complex inverse kinematics computation. To show the RL-based control's effectiveness, we also compare it with Model Predictive Control (MPC), another popular trajectory control method. Our results show that while the MPC achieves smoother and more accura...
2002
Learning robot control, a subclass of the field of learning control, refers to the process of acquiring a sensory-motor control strategy for a particular movement task and movement system by trial and error. Learning control is usually distinguished from adaptive control (see ADAPTIVE CONTROL) in that the learning system is permitted to fail during the process of learning, while adaptive control emphasizes single trial convergence without failure.
IEEE Transactions on Robotics
Most policy search (PS) algorithms require thousands of training episodes to find an effective policy, which is often infeasible with a physical robot. This survey article focuses on the extreme other end of the spectrum: how can a robot adapt with only a handful of trials (a dozen) and a few minutes? By analogy with the word "big-data," we refer to this challenge as "micro-data reinforcement learning." In this article, we show that a first strategy is to leverage prior knowledge on the policy structure (e.g., dynamic movement primitives), on the policy parameters (e.g., demonstrations), or on the dynamics (e.g., simulators). A second strategy is to create data-driven surrogate models of the expected reward (e.g., Bayesian optimization) or the dynamical model (e.g., model-based PS), so that the policy optimizer queries the model instead of the real system. Overall, all successful micro-data algorithms combine these two strategies by varying the kind of model and prior knowledge. The current scientific challenges essentially revolve around scaling up to complex robots, designing generic priors, and optimizing the computing time.
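The surrogate-model strategy described above can be sketched in a few lines: a Gaussian-process model of expected reward is fit to a handful of real rollouts, and an optimistic acquisition function chooses the next policy parameter to try on the physical system. The reward landscape, kernel length-scale, and acquisition coefficient below are illustrative assumptions, not taken from the survey.

```python
import numpy as np

def true_reward(theta):
    # Stand-in for an expensive robot rollout (hypothetical reward landscape).
    return -(theta - 0.3) ** 2 + 0.1 * np.sin(8 * theta)

def rbf(a, b, ls=0.15):
    # Squared-exponential kernel between two 1-D parameter arrays.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(X, y, Xq, noise=1e-4):
    # Standard GP regression: posterior mean and variance at query points Xq.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xq)
    mu = Ks.T @ np.linalg.solve(K, y)
    v = np.linalg.solve(K, Ks)
    var = 1.0 - np.sum(Ks * v, axis=0)  # prior variance of the RBF kernel is 1
    return mu, np.maximum(var, 1e-12)

def bayes_opt(n_iters=8):
    grid = np.linspace(0.0, 1.0, 201)
    X = np.array([0.0, 1.0])               # two initial real rollouts
    y = true_reward(X)
    for _ in range(n_iters):
        mu, var = gp_posterior(X, y, grid)
        ucb = mu + 2.0 * np.sqrt(var)      # optimistic acquisition (UCB)
        theta = grid[np.argmax(ucb)]       # query the surrogate, not the robot...
        X = np.append(X, theta)            # ...then spend one real rollout
        y = np.append(y, true_reward(theta))
    return X[np.argmax(y)]                 # best parameter found so far

best = bayes_opt()
print(round(float(best), 2))
```

With only ten rollouts in total, the optimizer concentrates its real-system queries where the surrogate is both promising and uncertain, which is the essence of the micro-data regime.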
Advanced Robotics, 1995
In this paper a learning method is described which enables a conventional industrial robot to accurately execute the teach-in path in the presence of dynamical effects and high speed. After training the system is capable of generating positional commands that, in combination with the standard robot controller, lead the robot along the desired trajectory. The mean path deviation is reduced by a factor of 20 for our test configuration. For low-speed motion the learned controller's accuracy is in the range of the resolution of the positional encoders. The learned controller does not depend on specific trajectories. It acts as a general controller that can be used for non-recurring tasks as well as for sensor-based planned paths. For repetitive control tasks accuracy can be increased even further. Such improvements are caused by a three-level structure estimating a simple process model, optimal a posteriori commands, and a suitable feedforward controller, the latter including neural networks for the representation of nonlinear behaviour. The learning system is demonstrated in experiments with a Manutec R2 industrial robot. After training with only two sample trajectories, the learned control system is applied to other, totally different paths, which are executed with high precision as well.
2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566)
This paper discusses a method to accelerate reinforcement learning. First, a concept is defined that reduces the state space while conserving the policy. An algorithm is then given that calculates the optimal cost-to-go and the optimal policy in the reduced space from those in the original space. Using the reduced state space, learning convergence is accelerated. Its usefulness for both DP (dynamic programming) iteration and Q-learning is compared through a maze example. Convergence of the optimal cost-to-go in the original state space requires approximately N times (or more) as long as in the reduced state space, where N is the ratio of the number of states in the original space to that in the reduced space. The acceleration effect for Q-learning is more remarkable than that for the DP iteration. The proposed technique is also applied to a robot manipulator working on a peg-in-hole task with geometric constraints. The state space reduction can be considered as a model of the change of observation, i.e., one of the cognitive actions. The obtained results suggest that the change of observation is reasonable in terms of learning efficiency.
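For reference, the baseline tabular Q-learning that such state-space reduction accelerates can be sketched on a small maze-like gridworld. The layout, rewards, and hyperparameters below are illustrative assumptions, not taken from the paper.

```python
import random

# Hypothetical 4x4 gridworld: start at (0, 0), goal at (3, 3).
# Actions: 0=up, 1=down, 2=left, 3=right (moves clipped at the walls).
SIZE, GOAL = 4, (3, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(state, action):
    r, c = state
    dr, dc = ACTIONS[action]
    new = (max(0, min(SIZE - 1, r + dr)), max(0, min(SIZE - 1, c + dc)))
    return new, (1.0 if new == GOAL else -0.04), new == GOAL

def q_learning(episodes=2000, alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(r, c): [0.0] * 4 for r in range(SIZE) for c in range(SIZE)}
    for _ in range(episodes):
        s, done = (0, 0), False
        while not done:
            # Epsilon-greedy action selection.
            a = rng.randrange(4) if rng.random() < eps else max(range(4), key=lambda i: Q[s][i])
            s2, reward, done = step(s, a)
            # One-step temporal-difference update toward the greedy bootstrap.
            Q[s][a] += alpha * (reward + gamma * max(Q[s2]) * (not done) - Q[s][a])
            s = s2
    return Q

Q = q_learning()
best = max(range(4), key=lambda i: Q[(0, 0)][i])
print(best in (1, 3))  # True: greedy action at the start heads down or right
```

The convergence time of exactly this kind of loop grows with the number of states, which is why collapsing the state space while conserving the policy pays off.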
Learning Techniques In Robotics, 2019
The field of machine learning is one that is gathering much interest, and it has many areas of application, of which robotics is one. This paper discusses the learning techniques used in robotics, such as supervised learning, unsupervised learning, reinforcement learning, and deep learning.
Robust manipulation with tractability in unstructured environments is a prominent hurdle in robotics. Learning algorithms to control robotic arms have introduced elegant solutions to the complexities faced in such systems. A novel method of Reinforcement Learning (RL), Gaussian Process Dynamic Programming (GPDP), yields promising results for closed-loop control of a low-cost manipulator; however, research surrounding most RL techniques lacks a breadth of comparable experiments into the viability of particular learning techniques on equivalent environments. We introduce several model-based learning agents as mechanisms to control a noisy, low-cost robotic system. The agents were tested in a simulated domain for learning closed-loop policies of a simple task with no prior information. Then, the fidelity of the simulations is confirmed by applying GPDP to a physical system.
Revista Brasileira de Computação Aplicada, 2021
Since the establishment of robotics in industrial applications, industrial robot programming has involved the repetitive and time-consuming process of manually specifying a fixed trajectory, which results in machine idle time in production and the necessity of completely reprogramming the robot for different tasks. The increasing number of robotics applications in unstructured environments requires controllers that are not only intelligent but also reactive, due to the unpredictability of the environment and to safety requirements. This paper presents a comparative analysis of two classes of Reinforcement Learning algorithms, value iteration (Q-Learning/DQN) and policy iteration (REINFORCE), applied to the discretized task of positioning a robotic manipulator in an obstacle-filled simulated environment, with no previous knowledge of the obstacles' positions or of the robot arm dynamics. The agent's performance and algorithm convergence are analyzed under different reward functions and...
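The policy-iteration side of this comparison, REINFORCE, can be illustrated on a minimal discretized problem: a softmax policy updated along the score-function gradient with a running-mean baseline. The two-action setting and hyperparameters below are hypothetical stand-ins for the paper's manipulator-positioning task.

```python
import math
import random

# Expected reward per discrete action (hypothetical; action 1 is better).
REWARD = {0: 0.2, 1: 1.0}

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce(episodes=3000, lr=0.1, seed=0):
    rng = random.Random(seed)
    prefs = [0.0, 0.0]   # policy parameters (action preferences)
    baseline = 0.0       # running-mean baseline reduces gradient variance
    for _ in range(episodes):
        probs = softmax(prefs)
        a = 0 if rng.random() < probs[0] else 1
        r = REWARD[a] + rng.gauss(0, 0.1)   # noisy one-step rollout return
        baseline += 0.01 * (r - baseline)
        # Score-function gradient of log softmax: indicator - probability.
        for i in range(2):
            grad = (1.0 if i == a else 0.0) - probs[i]
            prefs[i] += lr * (r - baseline) * grad
    return softmax(prefs)

probs = reinforce()
print(round(probs[1], 2))
```

Unlike Q-Learning/DQN, which learns action values and derives the policy implicitly, this update adjusts the policy parameters directly from sampled returns.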
International Conference on Informatics in Control, Automation and Robotics, 2007
Research on robot techniques that are fast, user-friendly, and require little application-specific knowledge from the user is increasingly encouraged in a society where the demand for home-care and domestic-service robots is continuously increasing. In this context we propose a methodology which is able to achieve fast convergence towards good robot-control policies, and to reduce the random exploration the robot needs to carry out in order to find the solutions. The performance of our approach is due to the mutual influence that three different elements exert on each other: reinforcement learning, genetic algorithms, and a dynamic representation of the environment around the robot. The performance of our proposal is shown through its application to solving two common tasks in mobile robotics.
Policy learning approaches are among the methods best suited for high-dimensional, continuous control systems such as anthropomorphic robot arms and humanoid robots. In this paper, we make two contributions: first, we present a unified perspective that allows us to derive several policy learning algorithms from a common point of view, i.e., policy gradient algorithms, natural-gradient algorithms, and EM-like policy learning. Second, we present several applications to both robot motor-primitive learning and robot control in task space. Results from simulation and from several different real robots are shown.
Journal of Dynamic Systems, Measurement, and Control, 1993
Learning control encompasses a class of control algorithms for programmable machines such as robots which attain, through an iterative process, the motor dexterity that enables the machine to execute complex tasks. In this paper we discuss the use of function identification and adaptive control algorithms in learning controllers for robot manipulators. In particular, we discuss the similarities and differences between betterment learning schemes, repetitive controllers and adaptive learning schemes based on integral transforms. The stability and convergence properties of adaptive learning algorithms based on integral transforms are highlighted and experimental results illustrating some of these properties are presented.
In real-world robotic applications, many factors, both at low-level (e.g., vision and motion control parameters) and at high-level (e.g., the behaviors) determine the quality of the robot performance. Thus, for many tasks, robots require fine tuning of the parameters, in the implementation of behaviors and basic control actions, as well as in strategic decisional processes. In recent years, machine learning techniques have been used to find optimal parameter sets for different behaviors. However, a drawback of learning techniques is time consumption: in practical applications, methods designed for physical robots must be effective with small amounts of data. In this paper, we present a method for concurrent learning of best strategy and optimal parameters, by extending the policy gradient reinforcement learning algorithm. The results of our experimental work in a simulated environment and on a real robot show a very high convergence rate.
Journal of Physical Agents (JoPha), 2012
This article describes a proposal to achieve fast robot learning from the robot's interaction with its environment. Our proposal is suitable for continuous learning procedures, as it tries to limit the instability that appears every time the robot encounters a new situation it has not seen before. Moreover, the user does not have to set a degree of exploration (as is usual in reinforcement learning), which would otherwise hinder continual learning. Our proposal uses an ensemble of learners able to combine dynamic programming and reinforcement learning to predict when the robot will make a mistake. This information is used to dynamically evolve a set of control policies that determine the robot's actions.
International Journal of Machine Learning and Computing, 2015
Autonomous Robots, 2022
This paper presents a learning-based method that uses simulation data to learn an object manipulation task with two model-free reinforcement learning (RL) algorithms. The learning performance is compared across on-policy and off-policy algorithms: Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). In order to accelerate the learning process, a fine-tuning procedure is proposed that demonstrates the continuous adaptation of on-policy RL to new environments, allowing the learned policy to adapt to and execute the (partially) modified task. A dense reward function is designed for the task to enable efficient learning by the agent. A grasping task involving a Franka Emika Panda manipulator is considered as the reference task to be learned. The learned control policy is demonstrated to be generalizable across multiple object geometries and initial robot/parts configurations. The approach is finally tested on a real Franka Emika Panda robot, showing the possibility to tran...
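A dense reward of the kind mentioned above typically combines an always-informative distance term with staged bonuses for grasping and lifting. The terms and coefficients below are a hypothetical sketch, not the paper's actual reward function.

```python
import math

# Hypothetical dense reward for a reach-and-grasp task: the distance term
# gives a learning signal on every step, while the grasp and lift bonuses
# mark the staged sub-goals. All thresholds and weights are illustrative.
def dense_reward(ee_pos, obj_pos, gripper_closed, lifted):
    dist = math.dist(ee_pos, obj_pos)
    reward = -dist                         # dense shaping: always informative
    reward += 0.5 * math.exp(-10 * dist)   # sharp bonus for being very close
    if gripper_closed and dist < 0.02:
        reward += 1.0                      # grasp bonus
        if lifted:
            reward += 5.0                  # sparse success bonus
    return reward

far = dense_reward((0.5, 0.0, 0.3), (0.0, 0.0, 0.0), False, False)
near = dense_reward((0.0, 0.0, 0.01), (0.0, 0.0, 0.0), True, True)
print(near > far)  # True: the gradient of reward points toward the object
```

A purely sparse reward (the 5.0 term alone) would make the grasping task far harder to explore, which is why such shaping is common in manipulator RL.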
1992
We consider the problem of a robot manipulator operating in a noisy workspace. The manipulator is required to move from an initial position P_I to a final position P_F.
International Research Journal of Computer Science
Reinforcement learning (RL) is a subfield of machine learning being developed within Artificial Intelligence (AI). The technique is data-independent: the primary aim of systems of this kind is to maximize their reward signal, which drives the system toward its goal. Reinforcement learning differs from supervised and unsupervised techniques in that the RL agent builds its own insights and maps which action to perform in each situation, whereas supervised and unsupervised methods have the answers already embedded in their data. In the absence of new data, RL can learn from its own experience where other methods cannot. RL is used almost everywhere; its best applications are in robotics, specifically in motion control and planning, and it is also used in finance, gaming, etc. This paper demonstrates the development of navigation and motion control for a two-wheeled differential-drive robot with the help of a reinforcement learning topology. Traditionally, to design the behaviour of controllers in robots, we inevitably need models of how the robot actually behaves in the environment. Here, instead, we come up with an RL approach to design the control structure for the robot to navigate in an indoor environment.
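For context, the differential-drive kinematics that a traditional model-based controller would have to specify, and that the RL approach above learns to cope with implicitly, can be sketched as a simple pose-integration step (the wheel base and time step below are assumed values):

```python
import math

# Differential-drive kinematics: integrate the robot pose (x, y, heading)
# one time step forward given left/right wheel speeds.
# wheel_base and dt are hypothetical parameters, not from the paper.
def step_pose(x, y, theta, v_left, v_right, wheel_base=0.2, dt=0.05):
    v = (v_left + v_right) / 2.0             # forward velocity
    omega = (v_right - v_left) / wheel_base  # angular velocity
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + omega * dt)

# Equal wheel speeds drive straight along the current heading.
x, y, th = step_pose(0.0, 0.0, 0.0, 0.5, 0.5)
print(round(x, 3), round(y, 3), round(th, 3))  # 0.025 0.0 0.0

# Opposite wheel speeds rotate in place.
x2, y2, th2 = step_pose(0.0, 0.0, 0.0, -0.5, 0.5)
```

An RL navigation agent sidesteps writing this model down: it learns a mapping from sensed state to wheel commands directly from interaction.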