Many of the standard optimization algorithms focus on optimizing a single, scalar feedback signal. However, real-life optimization problems often require the simultaneous optimization of more than one objective. In this paper, we propose a multi-objective extension to the standard X-armed bandit problem. As the feedback signal is now vector-valued, the goal of the agent is to sample actions in the Pareto dominating area of the objective space. Therefore, we propose the multi-objective Hierarchical Optimistic Optimization strategy, which discretizes the continuous action space in relation to the Pareto optimal solutions obtained in the multi-objective space. We experimentally validate the approach on two well-known multi-objective test functions and a simulation of a real-life application, the filling phase of a wet clutch. We demonstrate that the strategy identifies the Pareto front after just a few epochs and samples accordingly. After learning, several multi-objective quality indicators show that the set of solutions sampled by the algorithm closely approximates the Pareto front.
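The Pareto dominance relation that underlies this sampling goal can be made concrete with a short sketch (illustrative Python, not taken from the paper; the function names are our own):

```python
import numpy as np

def dominates(u, v):
    """True if reward vector u Pareto-dominates v: u is at least as good
    in every objective and strictly better in at least one."""
    u, v = np.asarray(u), np.asarray(v)
    return bool(np.all(u >= v) and np.any(u > v))

def pareto_front(points):
    """Keep only the non-dominated reward vectors from a list of points."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]
```

For example, pareto_front([[1, 2], [2, 1], [0, 0]]) keeps [1, 2] and [2, 1] and discards the dominated point [0, 0].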
2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014
In the stochastic multi-objective multi-armed bandit (or MOMAB), arms generate a vector of stochastic rewards, one per objective, instead of a single scalar reward. As a result, there is not a single optimal arm but a set of optimal arms (the Pareto front) under the Pareto dominance relation, and there is a trade-off between finding the optimal arm set (exploration) and selecting the optimal arms fairly or evenly (exploitation). To trade off between exploration and exploitation, either the Pareto knowledge gradient (Pareto-KG for short) or the Pareto upper confidence bound (Pareto-UCB1 for short) can be used; they combine the KG-policy and the UCB1-policy, respectively, with the Pareto dominance relation. In this paper, we propose Pareto Thompson sampling, which uses the Pareto dominance relation to find the Pareto front. We also propose the annealing-Pareto algorithm, which trades off between exploration and exploitation by using a decaying parameter ε_t in combination with the Pareto dominance relation. The annealing-Pareto algorithm uses the decaying parameter to explore the Pareto optimal arms and uses the Pareto dominance relation to exploit the Pareto front. We experimentally compare Pareto-KG, Pareto-UCB1, Pareto Thompson sampling and the annealing-Pareto algorithm on multi-objective Bernoulli distribution problems and conclude that annealing-Pareto is the best performing algorithm.
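As an illustration of how Pareto Thompson sampling could select an arm in the Bernoulli setting described above, here is a minimal sketch (our own code, assuming independent Beta(1, 1) priors per arm and objective; it is not the authors' implementation):

```python
import numpy as np

def dominates(u, v):
    # u Pareto-dominates v: no worse in every objective, better in at least one
    return np.all(u >= v) and np.any(u > v)

def pareto_thompson_step(successes, failures, rng=None):
    """successes, failures: (n_arms, n_objectives) Beta posterior counts.
    Sample a mean vector per arm, keep the arms whose sampled vectors are
    Pareto optimal, and pull one of them uniformly at random."""
    rng = rng or np.random.default_rng()
    theta = rng.beta(successes + 1.0, failures + 1.0)   # posterior samples
    optimal = [a for a in range(len(theta))
               if not any(dominates(theta[b], theta[a])
                          for b in range(len(theta)) if b != a)]
    return int(rng.choice(optimal))                     # fair exploitation
```

The uniform choice among the sampled Pareto optimal arms is what plays the optimal arms fairly rather than concentrating on a single one.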
The 2013 International Joint Conference on Neural Networks (IJCNN), 2013
We propose an algorithmic framework for multi-objective multi-armed bandits with multiple rewards. Different partial order relationships from multi-objective optimization can be considered for a set of reward vectors, such as scalarization functions and Pareto search. A scalarization function transforms the multi-objective environment into a single-objective environment and is a popular choice in multi-objective reinforcement learning. Scalarization techniques can be straightforwardly implemented in the current multi-armed bandit framework, but the efficiency of these algorithms depends very much on their type, linear or non-linear (e.g. Chebyshev), and on their parameters. Using the Pareto dominance order relation allows the multi-objective environment to be explored directly; however, this can result in large sets of Pareto optimal solutions. In this paper we propose and evaluate the performance of multi-objective MABs using three regret metrics. The standard UCB1 is extended to a scalarized multi-objective UCB1, and we propose a Pareto UCB1 algorithm. Both algorithms are proven to have a logarithmic upper bound on their expected regret. We also introduce a variant of the scalarized multi-objective UCB1 that removes online inefficient scalarizations in order to improve the algorithm's efficiency. These algorithms are experimentally compared on multi-objective Bernoulli distributions, with Pareto UCB1 being the algorithm with the best empirical performance.
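A sketch of the Pareto UCB1 selection rule described here, under the assumption that the usual UCB1 exploration bonus is added to every objective of the empirical mean vector (the constant in the paper also involves the number of objectives and of optimal arms; it is simplified below):

```python
import numpy as np

def dominates(u, v):
    # u Pareto-dominates v: no worse in every objective, better in at least one
    return np.all(u >= v) and np.any(u > v)

def pareto_ucb1_step(means, counts, t, rng=None):
    """means: (n_arms, n_objectives) empirical mean rewards,
    counts: (n_arms,) number of pulls per arm, t: current time step.
    Build optimistic index vectors and play one non-dominated arm."""
    rng = rng or np.random.default_rng()
    bonus = np.sqrt(2.0 * np.log(t) / counts)       # per-arm exploration bonus
    index = means + bonus[:, None]                  # optimistic reward vectors
    optimal = [a for a in range(len(index))
               if not any(dominates(index[b], index[a])
                          for b in range(len(index)) if b != a)]
    return int(rng.choice(optimal))
```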
We extend the knowledge gradient (KG) policy to the multi-objective multi-armed bandit problem in order to efficiently explore the Pareto optimal arms. We consider two partial order relationships to order the mean vectors, i.e. Pareto dominance and scalarization functions. Pareto-KG finds the optimal arms using Pareto search, while scalarized-KG transforms the multi-objective arms into single-objective arms to find the optimal arms. To measure the performance of the proposed algorithms, we propose three regret measures. We compare the performance of the knowledge gradient policy with UCB1 on a multi-objective multi-armed bandit problem, where KG outperforms UCB1.
2014 International Joint Conference on Neural Networks (IJCNN), 2014
The multi-armed bandit (MAB) problem is the simplest sequential decision process with stochastic rewards, where an agent chooses repeatedly from different arms to identify as soon as possible the optimal arm, i.e. the one with the highest mean reward. Both the knowledge gradient (KG) policy and the upper confidence bound (UCB) policy work well in practice for the MAB problem because of a good balance between exploitation and exploration while choosing arms. In the multi-objective MAB (or MOMAB) problem, arms generate a vector of rewards, one per objective, instead of a single scalar reward. In this paper, we extend the KG-policy to address multi-objective problems using scalarization functions that transform reward vectors into a single scalar reward. We consider different scalarization functions and call the corresponding class of algorithms scalarized KG. We compare the resulting algorithms with the corresponding variants of the multi-objective UCB1 policy (MO-UCB1) on a number of MOMAB problems where the reward vectors are drawn from a multivariate normal distribution. We compare the exploration versus exploitation trade-off experimentally and conclude that scalarized KG outperforms MO-UCB1 on these test problems.
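The linear and Chebyshev scalarization functions mentioned above can be written compactly; the snippet below is an illustrative sketch (the Chebyshev variant assumes a reference point z chosen slightly below the rewards in every objective, which is our assumption here):

```python
import numpy as np

def linear_scalarization(reward, weights):
    """Weighted sum of the objectives: f(r) = sum_j w_j * r_j."""
    return float(np.dot(weights, reward))

def chebyshev_scalarization(reward, weights, z):
    """Weighted Chebyshev scalarization with reference point z:
    f(r) = min_j w_j * (r_j - z_j), to be maximized."""
    reward, weights, z = map(np.asarray, (reward, weights, z))
    return float(np.min(weights * (reward - z)))
```

A scalarized KG or UCB1 policy then applies its single-objective rule to these scalar values, with a separate instance per weight vector.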
In the stochastic multi-objective multi-armed bandit (MOMAB), arms generate a vector of stochastic normal rewards, one per objective, instead of a single scalar reward. As a result, there is not a single optimal arm but a set of optimal arms (the Pareto front) under the Pareto dominance relation. The goal of an agent is to find the Pareto front. To find the optimal arms, the agent can use a linear scalarization function that transforms the multi-objective problem into a single-objective one by summing the weighted objectives. Selecting the weights is crucial, since different weights will result in selecting different optimal arms from the Pareto front. Usually, a predefined weight set is used, which can be computationally inefficient when different weights optimize the same Pareto optimal arm and other arms in the Pareto front are not identified. In this paper, we propose a number of techniques that adapt the weights on the fly in order to improve the performance of the scalarized MOMAB. We use genetic and adaptive scalarization functions from multi-objective optimization to generate new weights. We propose to use a Thompson sampling policy to select more frequently the weights that identify new arms on the Pareto front. We experimentally show that Thompson sampling improves the performance of the genetic and adaptive scalarization functions. All the proposed techniques improve the performance of the standard scalarized MOMAB with a fixed set of weights.
Many real-world stochastic environments are inherently multi-objective environments with multiple, possibly conflicting objectives. Techniques from multi-objective optimization are imported into the multi-armed bandit (MAB) problem to obtain efficient exploration/exploitation mechanisms for reward vectors. We introduce the ε-approximate Pareto MAB algorithm, which uses the ε-dominance relation so that its upper confidence bound does not depend on the number of best arms, an important feature for environments with relatively many optimal arms. We experimentally show that the ε-approximate Pareto MAB algorithm outperforms the Pareto UCB1 algorithm on a multi-objective Bernoulli problem inspired by a real-world control application.
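A minimal sketch of the ε-dominance relation on which this algorithm relies (our own illustrative code; the additive form of ε-dominance is assumed, a multiplicative form is also used in the multi-objective literature):

```python
import numpy as np

def eps_dominates(u, v, eps):
    """Additive epsilon-dominance: u, shifted up by eps in every objective,
    is at least as good as v everywhere and strictly better somewhere."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return bool(np.all(u + eps >= v) and np.any(u + eps > v))
```

Relaxing dominance in this way typically shrinks the set of arms that must be treated as optimal, which is what keeps the confidence bound independent of the number of best arms.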
2015
The multi-objective multi-armed bandit (MOMAB) problem is a sequential decision process with stochastic rewards. Each arm generates a vector of rewards instead of a single scalar reward. Moreover, these multiple rewards might be conflicting. The MOMAB problem has a set of Pareto optimal arms, and an agent's goal is not only to find that set but also to play the arms in that set evenly or fairly. To find the Pareto optimal arms, a linear scalarized function or the Pareto dominance relation can be used. The linear scalarized function converts the multi-objective optimization problem into a single-objective one and is a very popular approach because of its simplicity. The Pareto dominance relation optimizes the multi-objective problem directly. In this paper, we extend the Thompson Sampling policy to the MOMAB problem. We propose Pareto Thompson Sampling and linear scalarized Thompson Sampling approaches. We empirically compare Pareto Thompson Sampling and linear scalarized Thompson Sampling on a test suite of MOMAB problems with Bernoulli distributions. Pareto Thompson Sampling is the approach with the best empirical performance.
Current algorithms for solving the multi-armed bandit (MAB) problem under stationary observations often perform well. Although this performance may be acceptable under carefully tuned, accurate parameter settings, most of them degrade under non-stationary observations. Among these, the ε-greedy iterative approach is still attractive and is more applicable to real-world tasks owing to its simple implementation. One main concern among the iterative models is parameter dependency, and more specifically the dependency on the step size. This study proposes an enhanced ε-greedy iterative model, termed the adaptive step size model (ASM), for solving the multi-armed bandit (MAB) problem. This model is inspired by the steepest descent optimization approach, which automatically computes the optimal step size of the algorithm. In addition, it also introduces a dynamic exploration parameter that becomes progressively ineffective with increasing process intelligence. This model is empirically evaluated and compared, under stationary as well as non-stationary situations, with previously proposed popular algorithms: traditional ε-greedy, Softmax, ε-decreasing and UCB-Tuned. ASM not only addresses the concerns on parameter dependency but also performs either comparably to or better than all other algorithms. With the proposed enhancements, ASM learning time is greatly reduced, and this feature of ASM is attractive to a wide range of online decision making applications such as autonomous agents, game theory, adaptive control, industrial robots and prediction tasks in management and economics.
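The abstract does not give the ASM update itself, but the ε-greedy skeleton it builds on, an incremental estimate with a configurable step size and a decaying exploration parameter, can be sketched as follows (a generic sketch under our own assumptions, not the authors' ASM):

```python
import numpy as np

def eps_greedy_step(estimates, t, eps0=1.0, rng=None):
    """Choose an arm with a decaying exploration probability eps0 / t."""
    rng = rng or np.random.default_rng()
    if rng.random() < eps0 / max(t, 1):
        return int(rng.integers(len(estimates)))    # explore a random arm
    return int(np.argmax(estimates))                # exploit the best estimate

def update_estimate(estimates, counts, arm, reward, step_size=None):
    """Incremental mean update Q <- Q + alpha * (r - Q); ASM would compute
    alpha adaptively, here it defaults to the sample-average step 1/n."""
    counts[arm] += 1
    alpha = step_size if step_size is not None else 1.0 / counts[arm]
    estimates[arm] += alpha * (reward - estimates[arm])
```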
Multi-objective multi-armed bandits (MOMABs) are multi-armed bandits (MABs) extended to reward vectors. We use the Pareto dominance relation, as opposed to scalarization functions, to assess the quality of reward vectors. In this paper, we study the exploration vs exploitation trade-off in infinite-horizon MOMAB algorithms. Single-objective MABs explore the suboptimal arms and exploit a single optimal arm. MOMABs explore the suboptimal arms, but they also need to exploit all optimal arms fairly. We study the exploration vs exploitation trade-off of the Pareto UCB1 algorithm. We extend UCB2, another popular infinite-horizon MAB algorithm, to reward vectors using the Pareto dominance relation. We analyse the properties of the proposed MOMAB algorithms in terms of upper regret bounds. We experimentally compare the exploration vs exploitation trade-off of the proposed MOMAB algorithms on a bi-objective Bernoulli environment coming from control theory.