Utilizing the search gradient in this framework is similar to evolution strategies in that it iteratively evaluates the fitnesses of batches of vector-valued samples, the ES's so-called candidate solutions. It differs, however, in that it represents this 'population' as a parameterized distribution, and in that it updates the parameters of this distribution using a search gradient computed from the fitnesses:

θ ← θ + η ∇θ J(θ),

where η is a learning rate parameter. Algorithm 1 provides the pseudocode for this very general approach to black-box optimization by using a search gradient on search distributions. For a one-dimensional Gaussian search distribution with parameters θ = (μ, σ), the log-derivatives are

∇μ log π(z | θ) = (z − μ) / σ²,
∇σ log π(z | θ) = ((z − μ)² − σ²) / σ³,

and the updates, assuming simple hill-climbing (i.e., a population size λ = 1), read:

μ ← μ + η f(z) (z − μ) / σ²,
σ ← σ + η f(z) ((z − μ)² − σ²) / σ³.

Figure 1: Left: Schematic illustration of how the search distribution adapts in the one-dimensional case: from (1) to (2), μ is adjusted to make the distribution cover the optimum. From (2) to (3), σ is reduced to allow for a precise localization of the optimum. The step from (3) to (1) then is the problematic case, where a small σ induces a largely overshooting update, making the search start over again. Right: Progression of μ (black) and σ (red, dashed) when following the search gradient towards minimizing f(z) = z², executing Algorithm 2. Plotted are median values over 1000 runs, with a small learning rate η = 0.01 and population size λ = 10, both of which mitigate the instability somewhat, but still show the failure to precisely locate the optimum (for which both μ and σ need to approach 0).

In each iteration, the updated search distribution π(z | θ′) overlaps strongly with the previous one π(z | θ), so freshly drawn samples tend to fall in regions already covered by the previous batch. This leads to redundant fitness evaluations in that same area. We improve the efficiency with a new procedure called importance mixing, which aims at reusing fitness evaluations from the previous batch, while ensuring the updated batch conforms to the new search distribution.
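The hill-climbing updates above generalize to a batch of λ samples by averaging the per-sample gradient estimates. The following Python sketch implements this plain search-gradient loop for minimization (so the sign of the update is flipped relative to the ascent form above); the function name, defaults, and the small floor on σ are our own illustrative choices, not part of the original algorithm:

```python
import numpy as np

def plain_search_gradient(f, mu=2.0, sigma=1.0, eta=0.01, lam=10,
                          iterations=300, seed=0):
    """Plain search gradient on a 1-D Gaussian, minimizing f (sketch).

    Uses the log-derivatives
        d/dmu    log pi(z) = (z - mu) / sigma^2
        d/dsigma log pi(z) = ((z - mu)^2 - sigma^2) / sigma^3
    and descends the resulting Monte Carlo estimate of the gradient of E[f].
    """
    rng = np.random.default_rng(seed)
    for _ in range(iterations):
        z = mu + sigma * rng.standard_normal(lam)   # sample a batch of lam points
        fit = f(z)                                  # evaluate their fitnesses
        g_mu = np.mean(fit * (z - mu) / sigma**2)
        g_sigma = np.mean(fit * ((z - mu)**2 - sigma**2) / sigma**3)
        mu -= eta * g_mu          # descent step: we minimize f
        sigma -= eta * g_sigma
        sigma = max(sigma, 1e-4)  # floor, since the raw update can overshoot past 0
    return mu, sigma
```

Running this on f(z) = z² shows the qualitative behavior discussed above: μ and σ shrink together, but the noisy σ update prevents a precise localization of the optimum.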
These results give us the updates in the natural coordinate system, and thus the log-derivatives (at δ = 0, M = 0) take the following, surprisingly simple forms:

∇δ log π(z | θ) = s,
∇M log π(z | θ) = (s s⊤ − I) / 2,

where s = A⁻¹(z − μ) is the sample expressed in the natural coordinate system; these are then mapped back onto (μ, A) using equation (10). We decompose the parameter vector space (δ, M) ∈ ℝ^d × S_d into the product of its mean and covariance components, which can then be updated with separate learning rates.

4.4 Connection to CMA-ES

It has been noticed independently by Glasmachers et al. (2010) and shortly afterwards by Akimoto et al. (2010) that the natural gradient updates of xNES and the strategy updates of the CMA-ES algorithm (Hansen and Ostermeier, 2001) are closely connected. However, since xNES does not feature evolution paths, this connection is restricted to the so-called rank-μ-update (in the terminology of this study, rank-λ-update) of CMA-ES.

Algorithm 6: (1+1)-NES with radial Gaussian distribution

Obviously, this allows us to sample new offspring in O(d) time (per sample). Since the adaptation of each component's parameters is independent, the strategy update step also takes only O(d) time. Here p is a family of densities on the reals, and θ = (θ_1, ..., θ_d) collects the parameters of all of these distributions. In most cases these parameters amount to θ_i = (μ_i, σ_i), where μ_i ∈ ℝ is a position and σ_i ∈ ℝ⁺ is a scale parameter (i.e., mean and standard deviation, if they exist), such that z_i = μ_i + σ_i · s_i ~ p(· | μ_i, σ_i) for s_i ~ p(· | 0, 1).

Algorithm 8: (1+1)-NES with multi-variate Cauchy distribution

The full NES hill-climber with multi-variate Cauchy mutations is provided in Algorithm 8.

Table 1: Default parameter values for xNES and SNES (including the utility function) as a function of problem dimension d.

Figure 2: Left: Plotted are the cumulative success rates on the non-Markovian double-pole balancing task after a certain number of evaluations, empirically determined over 100 runs for each algorithm, using a single tanh-unit (n = 1) (i.e., optimizing 4 weights).
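To make the O(d) cost of the separable scheme concrete, the sketch below implements one generation of a separable, Gaussian-based update in the spirit of SNES, for minimization. The function name is ours, the exponential step-size update and rank-based utilities follow the forms described in this paper, and the default learning rate is only an assumed placeholder (consult Table 1 for the actual recommended values):

```python
import numpy as np

def snes_step(f, mu, sigma, lam, eta_mu=1.0, eta_sigma=None, rng=None):
    """One generation of a separable NES-style update (illustrative sketch).

    Sampling and adaptation are coordinate-wise, so both steps cost O(d)
    per sample instead of O(d^2) with a full covariance matrix.
    """
    d = len(mu)
    if eta_sigma is None:
        # placeholder default; see Table 1 for the actual recommended values
        eta_sigma = (3 + np.log(d)) / (5 * np.sqrt(d))
    if rng is None:
        rng = np.random.default_rng()
    s = rng.standard_normal((lam, d))     # base samples s_k ~ N(0, I)
    z = mu + sigma * s                    # offspring: z_k = mu + sigma * s_k
    fit = np.array([f(zk) for zk in z])
    # rank-based utilities (fitness shaping): best sample gets largest weight
    order = np.argsort(fit)               # ascending fitness = best first
    raw = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(np.arange(1, lam + 1)))
    u = np.empty(lam)
    u[order] = raw / raw.sum() - 1.0 / lam
    mu = mu + eta_mu * sigma * (u @ s)                        # mean update
    sigma = sigma * np.exp(eta_sigma / 2 * (u @ (s**2 - 1)))  # exponential scale update
    return mu, sigma
```

The exponential update keeps every σ_i strictly positive without any explicit constraint, and each coordinate's parameters evolve independently of the others.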
We find that all three algorithm variants give state-of-the-art results, with a slightly faster but less robust performance for xNES with importance mixing and adaptation sampling. Right: Median number of evaluations required to solve the same task, but with an increasing number of neurons (and correspondingly, number of weights). We limited the runtime to one hour per run, which explains why no results are available for xNES on higher dimensions (cubic time complexity). The fact that SNES quickly outperforms xNES, also in number of function evaluations, indicates that the benchmark is (sufficiently close to) separable, and that it is unnecessary to use the full covariance matrix. For reference, we also plot the corresponding results of the previously best performing algorithm, CoSyNE (Gomez et al., 2008).

Figure 13: Illustration of the best configuration found for 13 atoms (symmetric, left) and 22 atoms (asymmetric, right).

The objective is the Lennard-Jones potential energy of the atom cluster, a sum of pairwise interaction terms depending on r_ij, the distance between atoms i and j (see also Figure 13 for an illustration). For the setup here, we initialized μ near 0 and the step-sizes at σ_i = 0.01 to avoid jumping into a local optimum in the first generation. The results are plotted in Figure 14, showing how SNES scales convincingly to hundreds of parameters (each run using up to 500d function evaluations).

Table 2: Summary of enhancing techniques.

Table 2 summarizes the various techniques we introduced. The plain search gradient suffers from premature convergence and lack of scale invariance (see Section 2.2). Therefore, we use the natural gradient instead, which turns NES into a viable optimization method. To improve performance and robustness, we introduced several novel techniques. Fitness shaping makes the NES algorithm invariant to order-preserving transformations of the fitness function, thus increasing robustness.
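The invariance property of fitness shaping is easy to verify directly: the utilities depend only on the ranking of the fitness values, never on their magnitudes. A small Python check, using a rank-based utility of the form described in this paper (treat the exact constants as an assumed instantiation; see Table 1):

```python
import numpy as np

def rank_utilities(fitnesses):
    """Rank-based utility values for fitness shaping (minimization).

    The utilities depend only on the ordering of the fitness values,
    so any order-preserving transformation of f leaves them unchanged.
    """
    lam = len(fitnesses)
    order = np.argsort(fitnesses)   # ascending: best (lowest) fitness first
    k = np.arange(1, lam + 1)       # ranks 1..lam
    raw = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(k))
    u = np.empty(lam)
    u[order] = raw / raw.sum() - 1.0 / lam
    return u

# Applying a monotone transformation (here exp) leaves the utilities unchanged.
f = np.array([3.0, -1.0, 10.0, 0.5])
same = np.allclose(rank_utilities(f), rank_utilities(np.exp(f)))
```

The utilities also sum to zero by construction, so a uniformly shifted fitness landscape produces no net drift in the gradient estimate.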
Importance mixing reduces the number of necessary samples to estimate the search gradient, while adaptation sampling adjusts NES's learning rates online.

Figure 5: Left: Contour plot of the 2-dimensional double-Rosenbrock function f_2Rosen, illustrating its deceptive double-funnel structure. The global structure leads the search to the local optimum at (14, 14) (red circle), whereas the true optimum at (−11, −11) (black square) is located in the smaller valley. Right: Empirical success probabilities (of locating the global optimum), evaluated over 200 runs of 1000 function evaluations each, on the same benchmark, while varying the size of the initial search distribution. The results clearly show the robustness of using a heavy-tailed distribution.

Figure 6: Left: Contour plot of (one instantiation of) the deceptive global optimization benchmark function in two dimensions. It is constructed to contain local optima in unit-cube-sized spaces, whose best value is uniformly random. In addition, it is superimposed on 10%-sized regional plateaus, also with uniformly random values. Right: Value of the local optimum discovered on this benchmark (averaged over 250 runs, each with a budget of 100d function evaluations) as a function of problem dimension. Since the locally optimal values are uniformly distributed in [0, 1], the results can equivalently be interpreted as the top percentile in which the found local optimum is located. E.g., on the 4-dimensional benchmark, NES with the Cauchy distribution tends to find one of the 6% best local optima, whereas employing the Gaussian distribution only leads to one of the 22% best.
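The advantage of the heavy-tailed Cauchy distribution on such deceptive landscapes can be quantified by comparing tail masses directly. The sketch below samples from a multivariate Cauchy via the standard chi-square construction (it is the multivariate t-distribution with one degree of freedom); the function name and the radius-10 threshold are our illustrative choices:

```python
import numpy as np

def sample_multivariate_cauchy(mu, A, n, rng):
    """Draw n samples from a multivariate Cauchy distribution.

    Standard construction: z = mu + A g / sqrt(w), with g ~ N(0, I)
    and w ~ chi-squared with one degree of freedom.
    """
    d = len(mu)
    g = rng.standard_normal((n, d))
    w = rng.chisquare(1, size=n)
    return mu + (g / np.sqrt(w)[:, None]) @ A.T

rng = np.random.default_rng(0)
mu, A = np.zeros(2), np.eye(2)
cauchy = sample_multivariate_cauchy(mu, A, 100_000, rng)
gauss = rng.standard_normal((100_000, 2))
# Fraction of samples landing far from the center: the heavy-tailed Cauchy
# retains substantial mass beyond radius 10, the Gaussian essentially none.
far_cauchy = np.mean(np.linalg.norm(cauchy, axis=1) > 10)
far_gauss = np.mean(np.linalg.norm(gauss, axis=1) > 10)
```

This persistent tail mass is what lets a Cauchy-based search occasionally take long jumps out of a local valley, even after the scale parameter has shrunk.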