2010
We consider the general, widely applicable problem of selecting from n real-valued random variables a subset of size m of those with the highest means, based on as few samples as possible. This problem, which we denote Explore-m, is a core aspect of several stochastic optimization algorithms and of applications in simulation and industrial engineering.
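For concreteness, here is a minimal sketch of the Explore-m setting under a naive uniform-sampling baseline, not one of the PAC algorithms analyzed in the paper; the Bernoulli arms, sampling budget, and function names are illustrative assumptions.

```python
import random

rng = random.Random(0)

def explore_m_uniform(arms, m, samples_per_arm=100):
    """Naive baseline for Explore-m: sample every arm equally often and
    return the indices of the m arms with the highest empirical means.
    `arms` is a list of zero-argument callables returning one reward."""
    means = []
    for i, pull in enumerate(arms):
        total = sum(pull() for _ in range(samples_per_arm))
        means.append((total / samples_per_arm, i))
    return [i for _, i in sorted(means, reverse=True)[:m]]

# Example: five Bernoulli arms, select the best two.
arms = [(lambda p=p: 1.0 if rng.random() < p else 0.0)
        for p in (0.1, 0.3, 0.5, 0.7, 0.9)]
print(explore_m_uniform(arms, m=2))  # likely [4, 3]
```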
2021
We explore the class of problems where a central planner needs to select a subset of agents, each with a different quality and cost. The planner wants to maximize its utility while ensuring that the average quality of the selected agents is above a certain threshold. When the agents' quality is known, we formulate our problem as an integer linear program (ILP) and propose a deterministic algorithm, namely DPSS, that provides an exact solution to our ILP. We then consider the setting when the qualities of the agents are unknown. We model this as a Multi-Armed Bandit (MAB) problem and propose DPSS-UCB to learn the qualities over multiple rounds. We show that after a certain number of rounds, τ, DPSS-UCB outputs a subset of agents that satisfies the average quality constraint with high probability. Next, we provide bounds on τ and prove that after τ rounds, the algorithm incurs a regret of O(ln T), where T is the total number of rounds. We further illustrate the efficacy of DPSS-UCB throug...
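A hedged illustration of the offline problem statement only: a brute-force reference solver for the constrained subset selection, assuming a simple utility of total quality minus total cost (the paper's DPSS dynamic program and its UCB extension are not reproduced here; all names are assumptions).

```python
from itertools import combinations

def best_subset(qualities, costs, alpha):
    """Brute-force reference for the constrained subset-selection problem:
    maximize an (assumed) utility sum(q_i - c_i) over non-empty subsets whose
    average quality is at least the threshold alpha. Exhaustive search for
    small n only; this is not the DPSS algorithm."""
    n = len(qualities)
    best_val, best_set = float("-inf"), None
    for r in range(1, n + 1):
        for subset in combinations(range(n), r):
            avg_q = sum(qualities[i] for i in subset) / r
            if avg_q < alpha:
                continue  # violates the average-quality constraint
            val = sum(qualities[i] - costs[i] for i in subset)
            if val > best_val:
                best_val, best_set = val, subset
    return best_set, best_val

print(best_subset([0.9, 0.6, 0.4], [0.2, 0.1, 0.05], alpha=0.5))
```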
Learning and Intelligent Optimization, 2008
2012
We consider the problem of selecting, from among the arms of a stochastic n-armed bandit, a subset of size m of those arms with the highest expected rewards, based on efficiently sampling the arms. This “subset selection” problem finds application in a variety of areas. In the authors' previous work (Kalyanakrishnan & Stone, 2010), this problem is framed under a PAC setting (denoted “Explore-m”), and corresponding sampling algorithms are analyzed.
ArXiv, 2017
Over the past few years, the multi-armed bandit model has become increasingly popular in the machine learning community, partly because of applications including online content optimization. This paper reviews two different sequential learning tasks that have been considered in the bandit literature; they can be formulated as (sequentially) learning which distribution has the highest mean among a set of distributions, with some constraints on the learning process. For both of them (regret minimization and best arm identification) we present recent, asymptotically optimal algorithms. We compare the behaviors of the sampling rule of each algorithm as well as the complexity terms associated with each problem.
2014 International Joint Conference on Neural Networks (IJCNN), 2014
The multi-armed bandit (MAB) problem is the simplest sequential decision process with stochastic rewards, where an agent chooses repeatedly from different arms to identify as soon as possible the optimal arm, i.e. the one with the highest mean reward. Both the knowledge gradient (KG) policy and the upper confidence bound (UCB) policy work well in practice for the MAB problem because of a good balance between exploitation and exploration while choosing arms. In the multi-objective MAB (or MOMAB) problem, arms generate a vector of rewards, one per objective, instead of a single scalar reward. In this paper, we extend the KG policy to address multi-objective problems using scalarization functions that transform reward vectors into a single scalar reward. We consider different scalarization functions and call the corresponding class of algorithms scalarized KG. We compare the resulting algorithms with the corresponding variants of the multi-objective UCB1 policy (MO-UCB1) on a number of MOMAB problems where the reward vectors are drawn from a multivariate normal distribution. We experimentally compare the exploration versus exploitation trade-off and conclude that scalarized KG outperforms MO-UCB1 on these test problems.
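As a rough illustration of the scalarization idea, here is a minimal sketch combining a fixed linear scalarization with standard UCB1 indices; the KG policy itself is more involved and is not reproduced, and the function and variable names are assumptions.

```python
import math
import numpy as np

def scalarized_ucb1(pull, n_arms, weights, horizon):
    """Scalarized UCB1 sketch: reward *vectors* are collapsed to scalars with a
    fixed linear scalarization w . r, then standard UCB1 indices are applied.
    `pull(i)` must return a reward vector (array-like) for arm i."""
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)  # sums of scalarized rewards
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1      # initialization: play every arm once
        else:
            ucb = sums / counts + np.sqrt(2.0 * math.log(t) / counts)
            arm = int(np.argmax(ucb))
        sums[arm] += float(np.dot(weights, pull(arm)))
        counts[arm] += 1
    return counts            # pull counts per arm
```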
NIPS 2006 Workshop on On-line …, 2006
2007
We present a sampling-based algorithm for solving stochastic discrete optimization problems based on Auer et al.'s Exp3 algorithm for "nonstochastic multi-armed bandit problems." The algorithm solves the sample average approximation (SAA) of the original problem by iteratively updating and sampling from a probability distribution over the search space. We show that as the number of samples goes to infinity, the value returned by the algorithm converges to the optimal objective-function value and the probability distribution to a distribution that concentrates only on the set of best solutions of the original problem. We then extend the Exp3-based algorithm to solving finite-horizon Markov decision processes (MDPs), where the underlying MDP is approximated by a recursive SAA problem. We show that the estimate of the "recursive" sample-average-maximum computed by the extended algorithm at a given state approaches the optimal value of the state as the sample size per state per stage goes to infinity. The recursive Exp3-based algorithm for MDPs is then further extended for finite-horizon two-person zero-sum Markov games (MGs), providing a finite-iteration bound to the equilibrium value of the induced SAA game problem and asymptotic convergence to the equilibrium value of the original game.
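The building block named in the abstract is Auer et al.'s Exp3; a minimal sketch of the basic Exp3 update (not the SAA or recursive MDP extensions developed in the paper) might look like this, assuming rewards in [0, 1].

```python
import math
import random

def exp3(pull, n_arms, horizon, gamma=0.1, rng=random.Random(0)):
    """Exp3 sketch: maintain exponential weights over arms, mix with uniform
    exploration, and feed back importance-weighted reward estimates.
    Rewards returned by `pull(i)` are assumed to lie in [0, 1]."""
    weights = [1.0] * n_arms
    for _ in range(horizon):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = pull(arm)
        estimate = reward / probs[arm]              # importance-weighted estimate
        weights[arm] *= math.exp(gamma * estimate / n_arms)
    return weights
```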
The Annals of Statistics, 2013
We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit problem, this setting allows for dynamically changing rewards that better describe applications where side information is available. We adopt a nonparametric model where the expected rewards are smooth functions of the covariate and where the hardness of the problem is captured by a margin parameter. To maximize the expected cumulative reward, we introduce a policy called Adaptively Binned Successive Elimination (ABSE) that adaptively decomposes the global problem into suitably "localized" static bandit problems. This policy constructs an adaptive partition using a variant of the Successive Elimination (SE) policy. Our results include sharper regret bounds for the SE policy in a static bandit problem and minimax optimal regret bounds for the ABSE policy in the dynamic problem.
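For reference, a minimal sketch of the plain Successive Elimination policy on a static bandit; the confidence radius is one common textbook choice, and the adaptive binning of ABSE is not shown.

```python
import math

def successive_elimination(pull, n_arms, delta=0.05, max_rounds=10000):
    """Successive Elimination sketch for a static bandit: sample every surviving
    arm once per round and drop any arm whose empirical mean falls more than
    two confidence radii below the current leader."""
    active = list(range(n_arms))
    sums = [0.0] * n_arms
    for t in range(1, max_rounds + 1):
        for i in active:
            sums[i] += pull(i)
        means = {i: sums[i] / t for i in active}
        radius = math.sqrt(math.log(4 * n_arms * t * t / delta) / (2 * t))
        leader = max(means.values())
        active = [i for i in active if leader - means[i] < 2 * radius]
        if len(active) == 1:
            break
    return active  # surviving arm(s)
```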
Advances in Applied Mathematics, 1986
An exact solution to certain multi-armed bandit problems with independent and simple arms is presented. An arm is simple if the observations associated with the arm have one of two distributions conditional on the value of an unknown dichotomous parameter. This solution is obtained by relating Gittins indices for the arms to ladder variables for associated random walks.
2012
We formulate the following combinatorial multi-armed bandit (MAB) problem: There are N random variables with unknown means that are each instantiated in an i.i.d. fashion over time. At each time multiple random variables can be selected, subject to an arbitrary constraint on weights associated with the selected variables. All of the selected individual random variables are observed at that time, and a linearly weighted combination of the selected variables is received as the reward.
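A minimal sketch in the spirit of combinatorial UCB for this reward model, assuming a problem-specific feasibility oracle; this is illustrative only and not the paper's specific policy or analysis.

```python
import math
import numpy as np

def combinatorial_ucb(pull_all, n, oracle, horizon):
    """Combinatorial-bandit sketch: keep a UCB index per base variable and, at
    each step, let `oracle(indices)` return a feasible subset (one satisfying
    the weight constraint) with maximum total index. `pull_all(subset)` returns
    the observed value of every selected variable."""
    counts = np.ones(n)  # initialization observes everything once (ignores the constraint)
    sums = np.array(pull_all(list(range(n))), dtype=float)
    for t in range(2, horizon + 1):
        ucb = sums / counts + np.sqrt(2.0 * math.log(t) / counts)
        subset = oracle(ucb)          # feasible subset chosen on optimistic indices
        observed = pull_all(subset)
        for i, x in zip(subset, observed):
            sums[i] += x
            counts[i] += 1
    return sums / counts              # final mean estimates
```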
Algorithmic Learning Theory, 2007
The stochastic multi-armed bandit problem is a popular model of the exploration/exploitation trade-off in sequential decision problems. We introduce a novel algorithm that is based on sub-sampling. Despite its simplicity, we show that the algorithm demonstrates excellent empirical performance against state-of-the-art algorithms, including Thompson sampling and KL-UCB. The algorithm is very flexible: it does not need to know the set of reward distributions in advance, nor the range of the rewards. It is not restricted to Bernoulli distributions and is also invariant under rescaling of the rewards. We provide a detailed experimental study comparing the algorithm to the state of the art, discuss the main intuition that explains the striking results, and conclude with a finite-time regret analysis for this algorithm in the simplified two-arm bandit setting.
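A hedged sketch of the sub-sampling duel idea on two arms (in the spirit of BESA-style comparisons); the full algorithm's handling of more than two arms and its tie-breaking details are omitted, and the function name is an assumption.

```python
import random

def subsample_duel(history_a, history_b, rng=random.Random(0)):
    """Sub-sampling duel sketch for two arms: sub-sample (without replacement)
    the longer reward history down to the length of the shorter one, compare
    the resulting empirical means, and return the index of the winner.
    Ties favour the less-sampled arm."""
    n = min(len(history_a), len(history_b))
    mean_a = sum(rng.sample(history_a, n)) / n
    mean_b = sum(rng.sample(history_b, n)) / n
    if mean_a == mean_b:
        return 0 if len(history_a) <= len(history_b) else 1
    return 0 if mean_a > mean_b else 1

# One bandit step: play the duel winner and append its reward to its history.
```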
2016
Drawing a sample from a discrete distribution is one of the building components for Monte Carlo methods. Like other sampling algorithms, discrete sampling suffers from the high computational burden in large-scale inference problems. We study the problem of sampling a discrete random variable with a high degree of dependency that is typical in large-scale Bayesian inference and graphical models, and propose an efficient approximate solution with a subsampling approach. We make a novel connection between the discrete sampling and Multi-Armed Bandits problems with a finite reward population and provide three algorithms with theoretical guarantees. Empirical evaluations show the robustness and efficiency of the approximate algorithms in both synthetic and real-world large-scale problems.
International Journal of Data Science and Analytics
We consider a variant of the stochastic multi-armed bandit with K arms where the rewards are not assumed to be identically distributed, but are generated by a nonstationary stochastic process. We first study the unique best arm setting, when there exists one unique best arm. Second, we study the general switching best arm setting, when the best arm switches at some unknown steps. For both settings, we target problem-dependent bounds, instead of the more conservative problem-free bounds. We consider two classical problems: (1) identify a best arm with high probability (best arm identification), for which performance is measured by the sample complexity (the number of samples needed before finding a near-optimal arm). To this end, we naturally extend the definition of sample complexity so that it makes sense in the switching best arm setting, which may be of independent interest. (2) Achieve the smallest cumulative regret (regret minimization), where the regret is measured with respect to the strategy pulling an arm with the best instantaneous mean at each step. This paper extends the work presented in the DSAA'2015 Long Presentation paper "EXP3 with Drift Detection for the Switching Bandit Problem" [1]. The algorithms SER3 and SER4 are original and presented for the first time.
arXiv (Cornell University), 2015
Adaptive and sequential experiment design is a well-studied area in numerous domains. We survey and synthesize the work on the online statistical learning paradigm referred to as multi-armed bandits, integrating the existing research as a resource for a certain class of online experiments. We first explore the traditional stochastic model of a multi-armed bandit, then explore a taxonomic scheme of complications to that model, relating each complication to a specific requirement or consideration of the experiment-design context. Finally, we present a table of known regret bounds for all studied algorithms, providing both perspective for future theoretical work and a decision-making tool for practitioners looking for theoretical guarantees. Primary 62K99, 62L05; secondary 68T05. Keywords and phrases: multi-armed bandits, adaptive experiments, sequential experiment design, online experiment design.
ArXiv, 2020
In this paper we propose the first multi-armed bandit algorithm based on re-sampling that achieves asymptotically optimal regret simultaneously for different families of arms (namely Bernoulli, Gaussian and Poisson distributions). Unlike Thompson Sampling, which requires specifying a different prior to be optimal in each case, our proposal RB-SDA does not need any distribution-dependent tuning. RB-SDA belongs to the family of Sub-sampling Duelling Algorithms (SDA), which combine the sub-sampling idea first used by the BESA [1] and SSMC [2] algorithms with different sub-sampling schemes. In particular, RB-SDA uses Random Block sampling. We perform an experimental study assessing the flexibility and robustness of this promising novel approach for exploration in bandit models.
2007
We provide a framework to exploit dependencies among arms in multi-armed bandit problems, when the dependencies are in the form of a generative model on clusters of arms. We find an optimal MDP-based policy for the discounted reward case, and also give an approximation of it with a formal error guarantee. We discuss lower bounds on regret in the undiscounted reward scenario, and propose a general two-level bandit policy for it. We propose three different instantiations of our general policy and provide theoretical justifications of how the regret of the instantiated policies depends on the characteristics of the clusters. Finally, we empirically demonstrate the efficacy of our policies on large-scale real-world and synthetic data, and show that they significantly outperform classical policies designed for bandits with independent arms.
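A minimal sketch of a generic two-level policy, UCB1 over clusters and then UCB1 within the chosen cluster, assuming rewards in [0, 1]; the paper's cluster-aware instantiations and their guarantees are not reproduced.

```python
import math

def two_level_ucb(clusters, pull, horizon):
    """Two-level bandit sketch: choose a cluster by UCB1, then an arm inside
    that cluster by UCB1. `clusters` is a list of lists of arm ids and
    `pull(arm)` returns a reward in [0, 1]."""
    arm_n = {a: 0 for c in clusters for a in c}
    arm_s = {a: 0.0 for c in clusters for a in c}
    clu_n = [0] * len(clusters)
    clu_s = [0.0] * len(clusters)

    def ucb(s, n, t):
        return float("inf") if n == 0 else s / n + math.sqrt(2 * math.log(t) / n)

    for t in range(1, horizon + 1):
        c = max(range(len(clusters)), key=lambda i: ucb(clu_s[i], clu_n[i], t))
        a = max(clusters[c], key=lambda j: ucb(arm_s[j], arm_n[j], t))
        r = pull(a)
        clu_n[c] += 1
        clu_s[c] += r
        arm_n[a] += 1
        arm_s[a] += r
    return arm_n  # pull counts per arm
```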
Theoretical Computer Science, 2009
In the stochastic multi-objective multi-armed bandit (MOMAB), arms generate a vector of stochastic normal rewards, one per objective, instead of a single scalar reward. As a result, there is not only one optimal arm; instead there is a set of optimal arms (the Pareto front) under the Pareto dominance relation. The goal of an agent is to find the Pareto front. To find the optimal arms, the agent can use a linear scalarization function that transforms the multi-objective problem into a single-objective problem by summing the weighted objectives. Selecting the weights is crucial, since different weights will result in selecting a different optimal arm from the Pareto front. Usually, a predefined weight set is used, and this can be computationally inefficient when different weights optimize the same Pareto-optimal arm and other arms in the Pareto front are not identified. In this paper, we propose a number of techniques that adapt the weights on the fly in order to improve the performance of the scalarized MOMAB. We use genetic and adaptive scalarization functions from multi-objective optimization to generate new weights. We propose to use a Thompson sampling policy to frequently select the weights that identify new arms on the Pareto front. We experimentally show that Thompson sampling improves the performance of the genetic and adaptive scalarization functions. All the proposed techniques improve the performance of the standard scalarized MOMAB with a fixed set of weights.
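A hedged sketch of the weight-selection idea only: Beta-Bernoulli Thompson sampling over a fixed pool of weight vectors, where "success" means the chosen weights uncovered a previously unseen Pareto-optimal arm; the genetic and adaptive weight-generation schemes from the paper are not shown, and the class and method names are assumptions.

```python
import random

class WeightSelector:
    """Thompson-sampling sketch over a fixed pool of weight vectors: each vector
    keeps a Beta posterior on the event 'using it uncovered a new Pareto-optimal
    arm'."""

    def __init__(self, weight_vectors, rng=random.Random(0)):
        self.weights = weight_vectors
        self.alpha = [1.0] * len(weight_vectors)   # successes + 1
        self.beta = [1.0] * len(weight_vectors)    # failures + 1
        self.rng = rng

    def select(self):
        # Sample one plausible success rate per weight vector, pick the best.
        samples = [self.rng.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, idx, found_new_pareto_arm):
        if found_new_pareto_arm:
            self.alpha[idx] += 1
        else:
            self.beta[idx] += 1
```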
IEEE Access, 2021
Computer experiments are widely used to mimic expensive physical processes as black-box functions. A typical challenge of expensive computer experiments is to find the set of inputs that produce the desired response. This study proposes a multi-armed bandit regularized expected improvement (BREI) method to adaptively adjust the balance between exploration and exploitation for efficient global optimization of long-running computer experiments with low noise. The BREI adds a stochastic regularization term to the objective function of the expected improvement to integrate the information of additional exploration and exploitation into the optimization process. The proposed study also develops a multi-armed bandit strategy based on Thompson sampling for adaptive optimization of the tuning parameter of the BREI based on the preexisting and newly tested points. The performance of the proposed method is validated against some of the existing methods in the literature under different levels of noise using a case study on optimization of the collision avoidance algorithm in mobile robot motion planning as well as extensive simulation studies. INDEX TERMS Computer experiments, Gaussian process regression, expected improvement, multi-armed bandit, Thompson sampling.