2008, Learning and Intelligent Optimization
AI
The paper addresses the challenge of balancing exploration and exploitation in K-armed bandit problems, emphasizing the inadequacy of random exploration strategies. It introduces the Probability of Correct Selection (PCS) as a measure to enhance decision-making during exploration, proposing a novel semi-uniform strategy termed ε-PCSgreedy. Experimental results demonstrate the superior performance of the ε-PCSgreedy approach over traditional ε-greedy strategies across various synthetic and real tasks.
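As context for the comparison above, here is a minimal sketch of the classic ε-greedy baseline on a Bernoulli bandit (the paper's ε-PCSgreedy variant replaces the uniform exploration step with PCS-guided selection, which is not reproduced here; function names and parameter values are illustrative, not taken from the paper):

```python
import random

def epsilon_greedy(arms, epsilon, horizon, rng=None):
    """Classic epsilon-greedy on a K-armed Bernoulli bandit.
    `arms` is a list of success probabilities (unknown to the player)."""
    rng = rng or random.Random(0)
    k = len(arms)
    counts = [0] * k          # pulls per arm
    means = [0.0] * k         # empirical mean reward per arm
    total = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                       # explore: uniform random arm
        else:
            arm = max(range(k), key=means.__getitem__)   # exploit: best current estimate
        reward = 1.0 if rng.random() < arms[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental sample mean
        total += reward
    return total, means
```

The "semi-uniform" label refers to exactly this structure: uniform behaviour on exploration steps, greedy behaviour otherwise.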
Annals of Mathematics and Artificial Intelligence, 2010
The K-armed bandit problem is a well-known formalization of the exploration versus exploitation dilemma. In this learning problem, a player is confronted with a gambling machine with K arms, where each arm is associated with an unknown gain distribution. The goal of the player is to maximize the sum of the rewards. Several approaches have been proposed in the literature to deal with the K-armed bandit problem. This paper first introduces the concept of “expected reward of greedy actions”, which is based on the notion of probability of correct selection (PCS), well known in the simulation literature. This concept is then used in an original semi-uniform algorithm which relies on the dynamic programming framework and on estimation techniques to optimally balance exploration and exploitation. Experiments with a set of simulated and realistic bandit problems show that the new DP-greedy algorithm is competitive with state-of-the-art semi-uniform techniques.
Abstract. We compare well-known action selection policies used in reinforcement learning, such as ε-greedy and softmax, with lesser-known ones such as the Gittins index and the knowledge gradient, on bandit problems. The latter two perform very well in comparison. Moreover, the knowledge gradient can be generalized to problems other than bandit problems.
2010
Abstract We consider the general, widely applicable problem of selecting from n real-valued random variables a subset of size m of those with the highest means, based on as few samples as possible. This problem, which we denote Explore-m, is a core aspect of several stochastic optimization algorithms and of applications in simulation and industrial engineering.
IEEE Transactions on Evolutionary Computation, 1998
We explore the two-armed bandit with Gaussian payoffs as a theoretical model for optimization. The problem is formulated from a Bayesian perspective, and the optimal strategy for both one and two pulls is provided. We present regions of parameter space where a greedy strategy is provably optimal. We also compare the greedy and optimal strategies to one based on a genetic algorithm. In doing so, we correct a previous error in the literature concerning the Gaussian bandit problem and the supposed optimality of genetic algorithms for this problem. Finally, we provide an analytically simple bandit model that is more directly applicable to optimization theory than the traditional bandit problem and determine a near-optimal strategy for that model.
Proceedings of the 6th International Conference on Operations Research and Enterprise Systems, 2017
Different allocation strategies can be found in the literature to deal with the multi-armed bandit problem, under a frequentist view or from a Bayesian perspective. In this paper, we propose a novel allocation strategy, the possibilistic reward method. First, possibilistic reward distributions are used to model the uncertainty about the arm expected rewards, which are then converted into probability distributions using a pignistic probability transformation. Finally, a simulation experiment is carried out to find the arm with the highest expected reward, which is then pulled. A parametric probability transformation of the proposed method is then introduced together with a dynamic optimization, which implies that neither previous knowledge nor a simulation of the arm distributions is required. A numerical study proves that the proposed method outperforms other policies in the literature in five scenarios: Bernoulli distributions with very low success probabilities, Bernoulli distributions with success probabilities close to 0.5, Gaussian rewards, and Poisson and exponential distributions truncated in [0,10]. Martín M., Jiménez-Martín A. and Mateos A.
International Conference on Machine Learning, 2013
We study the problem of exploration in stochastic multi-armed bandits. Even in the simplest setting of identifying the best arm, there remains a logarithmic multiplicative gap between the known lower and upper bounds for the number of arm pulls required for the task. This extra logarithmic factor is quite meaningful in today's large-scale applications. We present two novel, parameter-free algorithms for identifying the best arm, in two different settings: given a target confidence, and given a target budget of arm pulls. For both we prove upper bounds whose gap from the lower bound is only doubly logarithmic in the problem parameters. We corroborate our theoretical results with experiments demonstrating that our algorithm outperforms the state of the art and scales better as the size of the problem increases.
2021
We explore the class of problems where a central planner needs to select a subset of agents, each with different quality and cost. The planner wants to maximize its utility while ensuring that the average quality of the selected agents is above a certain threshold. When the agents’ quality is known, we formulate our problem as an integer linear program (ILP) and propose a deterministic algorithm, namely DPSS, that provides an exact solution to our ILP. We then consider the setting when the qualities of the agents are unknown. We model this as a Multi-Armed Bandit (MAB) problem and propose DPSS-UCB to learn the qualities over multiple rounds. We show that after a certain number of rounds, τ, DPSS-UCB outputs a subset of agents that satisfy the average quality constraint with a high probability. Next, we provide bounds on τ and prove that after τ rounds, the algorithm incurs a regret of O(ln T), where T is the total number of rounds. We further illustrate the efficacy of DPSS-UCB throug...
Theoretical Computer Science, 2009
2017
This paper explores Thompson sampling in the context of mechanism design for stochastic multi-armed bandit (MAB) problems. The setting is that of an MAB problem where the reward distribution of each arm consists of a stochastic component as well as a strategic component. Many existing MAB mechanisms use upper confidence bound (UCB) based algorithms for learning the parameters of the reward distribution. The randomized nature of Thompson sampling introduces certain unique, non-trivial challenges for mechanism design, which we address in this paper through a rigorous regret analysis. We first propose an MAB mechanism with a deterministic payment rule, namely TSM-D. We show that in TSM-D, the variance of agent utilities asymptotically approaches zero. However, the game-theoretic properties satisfied by TSM-D (incentive compatibility and individual rationality with high probability) are rather weak. As our main contribution, we then propose the mechanism TSM-R, with a randomized payment rule,...
Operations Research and Enterprise Systems
In this paper, we propose a novel allocation strategy based on possibilistic rewards for the multi-armed bandit problem. First, we use possibilistic reward distributions to model the uncertainty about the expected rewards from the arms, derived from a set of infinite confidence intervals nested around the expected value. They are then converted into probability distributions using a pignistic probability transformation. Finally, a simulation experiment is carried out to find the arm with the highest expected reward, which is then pulled. A parametric probability transformation of the proposed method is then introduced, together with a dynamic optimization. A numerical study proves that the proposed method outperforms other policies in the literature in five scenarios accounting for Bernoulli, Poisson and exponential distributions for the rewards. The regret analysis of the proposed methods suggests a logarithmic asymptotic convergence for the original possibilistic reward method, whereas a polynomial regret could be associated with the parametric extension and the dynamic optimization.
2014 International Joint Conference on Neural Networks (IJCNN), 2014
The multi-armed bandit (MAB) problem is the simplest sequential decision process with stochastic rewards, where an agent chooses repeatedly from different arms to identify as soon as possible the optimal arm, i.e. the one with the highest mean reward. Both the knowledge gradient (KG) policy and the upper confidence bound (UCB) policy work well in practice for the MAB problem because of a good balance between exploitation and exploration while choosing arms. In the multi-objective MAB (or MOMAB) problem, arms generate a vector of rewards, one per objective, instead of a single scalar reward. In this paper, we extend the KG policy to address multi-objective problems using scalarization functions that transform reward vectors into a single scalar reward. We consider different scalarization functions, and we call the corresponding class of algorithms scalarized KG. We compare the resulting algorithms with the corresponding variants of the multi-objective UCB1 policy (MO-UCB1) on a number of MOMAB problems where the reward vectors are drawn from a multivariate normal distribution. We compare experimentally the exploration versus exploitation trade-off and conclude that scalarized KG outperforms MO-UCB1 on these test problems.
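The single-objective UCB1 index that these policies build on can be sketched as follows (a standard textbook formulation assuming rewards in [0, 1]; the multi-objective MO-UCB1 and scalarized KG variants studied in the paper are not reproduced here):

```python
import math

def ucb1_index(mean, pulls, total_pulls):
    """UCB1 index: empirical mean plus an exploration bonus that grows
    with total plays but shrinks as this arm is pulled more often."""
    return mean + math.sqrt(2.0 * math.log(total_pulls) / pulls)

def ucb1_select(means, counts):
    """Pick the arm with the highest UCB1 index; each arm is pulled
    once first so every index is well defined."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # initialisation phase: unplayed arm goes next
    t = sum(counts)
    return max(range(len(means)), key=lambda i: ucb1_index(means[i], counts[i], t))
```

The bonus term is what produces the exploration/exploitation balance the abstract refers to: rarely pulled arms keep a large bonus and are eventually retried.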
2012
Abstract We consider the problem of selecting, from among the arms of a stochastic n-armed bandit, a subset of size m of those arms with the highest expected rewards, based on efficiently sampling the arms. This “subset selection” problem finds application in a variety of areas. In the authors' previous work (Kalyanakrishnan & Stone, 2010), this problem is framed under a PAC setting (denoted “Explore-m”), and corresponding sampling algorithms are analyzed.
Algorithmic Learning Theory, 2007
— Current algorithms for solving the multi-armed bandit (MAB) problem under stationary observations often perform well. Although this performance may be acceptable under carefully tuned, accurate parameter settings, most of them degrade under non-stationary observations. Among these, the ε-greedy iterative approach remains attractive and is more applicable to real-world tasks owing to its simple implementation. One main concern among the iterative models is parameter dependency, and more specifically the dependency on step size. This study proposes an enhanced ε-greedy iterative model, termed the adaptive step size model (ASM), for solving the multi-armed bandit (MAB) problem. This model is inspired by the steepest descent optimization approach and automatically computes the optimal step size of the algorithm. In addition, it introduces a dynamic exploration parameter that becomes progressively ineffective as process intelligence increases. The model is empirically evaluated and compared, under stationary as well as non-stationary situations, with previously proposed popular algorithms: traditional ε-greedy, Softmax, ε-decreasing and UCB-Tuned. ASM not only addresses the concerns on parameter dependency but also performs comparably to or better than all other algorithms. With the proposed enhancements, ASM's learning time is greatly reduced, a feature attractive to a wide range of online decision-making applications such as autonomous agents, game theory, adaptive control, industrial robots, and prediction tasks in management and economics.
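The step-size dependency discussed above concerns the standard incremental value update used by iterative bandit models. A minimal sketch of that update, contrasting the sample-average step size (suited to stationary rewards) with a constant step size (suited to non-stationary rewards), is below; the ASM paper's automatically computed step size is specific to that work and is not reproduced:

```python
def update_estimate(q, reward, n=None, alpha=None):
    """Incremental value update: Q <- Q + step * (reward - Q).
    With step = 1/n the estimate equals the sample average, appropriate
    for stationary rewards; a constant step `alpha` weights recent
    rewards more heavily, appropriate for non-stationary rewards."""
    step = alpha if alpha is not None else 1.0 / n
    return q + step * (reward - q)
```

The choice between these two regimes is exactly the tuning burden that an adaptive step size aims to remove.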
ArXiv, 2020
In this paper we propose the first multi-armed bandit algorithm based on re-sampling that achieves asymptotically optimal regret simultaneously for different families of arms (namely Bernoulli, Gaussian and Poisson distributions). Unlike Thompson Sampling, which requires specifying a different prior to be optimal in each case, our proposal RB-SDA does not need any distribution-dependent tuning. RB-SDA belongs to the family of Sub-sampling Duelling Algorithms (SDA), which combine the sub-sampling idea first used by the BESA [1] and SSMC [2] algorithms with different sub-sampling schemes. In particular, RB-SDA uses Random Block sampling. We perform an experimental study assessing the flexibility and robustness of this promising novel approach for exploration in bandit models.
The stochastic multi-armed bandit problem is a popular model of the exploration/exploitation trade-off in sequential decision problems. We introduce a novel algorithm that is based on sub-sampling. Despite its simplicity, we show that the algorithm demonstrates excellent empirical performance against state-of-the-art algorithms, including Thompson sampling and KL-UCB. The algorithm is very flexible: it does not need to know the set of reward distributions in advance, nor the range of the rewards. It is not restricted to Bernoulli distributions and is also invariant under rescaling of the rewards. We provide a detailed experimental study comparing the algorithm to the state of the art, give the main intuition that explains the striking results, and conclude with a finite-time regret analysis for this algorithm in the simplified two-arm bandit setting.
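The core sub-sampling idea behind this family of algorithms can be sketched as a "duel" between two arms: the arm with the longer reward history is sub-sampled down to the size of the other arm's history, so the two empirical means are compared on equal sample sizes. This is a simplified illustration in the spirit of BESA, not the exact algorithm from either abstract:

```python
import random

def subsample_duel(rewards_a, rewards_b, rng=None):
    """Compare two arms on equal sample sizes: sub-sample the longer
    reward history down to the length of the shorter one, then pick
    the arm with the higher empirical mean on the matched samples."""
    rng = rng or random.Random(0)
    if len(rewards_a) > len(rewards_b):
        sub_a = rng.sample(rewards_a, len(rewards_b))  # shrink a's history
        sub_b = list(rewards_b)
    else:
        sub_a = list(rewards_a)
        sub_b = rng.sample(rewards_b, len(rewards_a))  # shrink b's history
    mean = lambda xs: sum(xs) / len(xs)
    return "a" if mean(sub_a) >= mean(sub_b) else "b"
```

Because only raw reward samples are compared, the procedure needs no knowledge of the reward range or family, which is the flexibility the abstract highlights.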
Neurocomputing, 2018
In this paper, we propose a set of allocation strategies to deal with the multi-armed bandit problem: the possibilistic reward (PR) methods. First, we use possibilistic reward distributions to model the uncertainty about the expected rewards from the arms, derived from a set of infinite confidence intervals nested around the expected value. Depending on the inequality used to compute the confidence intervals, there are three possible PR methods with different features. Next, we use a pignistic probability transformation to convert these possibilistic functions into probability distributions following the insufficient reason principle. Finally, Thompson sampling techniques are used to identify the arm with the highest expected reward and play that arm. A numerical study analyses the performance of the proposed methods with respect to other policies in the literature. Two PR methods perform well in all representative scenarios under consideration, and are the best allocation strategies if truncated Poisson or exponential distributions in [0,10] are considered for the arms.
The multi-armed bandit problem is a classic example of the exploration vs. exploitation dilemma, in which a collection of one-armed bandits, each with an unknown but fixed reward probability, is given. The key idea is to develop a strategy which results in the arm with the highest reward probability being played, such that the total reward obtained is maximized. Although seemingly a simplistic problem, solution strategies are important because of their wide applicability in a myriad of areas such as adaptive routing, resource allocation, clinical trials, and more recently online recommendation of news articles, advertisements, coupons, etc. In this dissertation, we present different types of Bayesian-inference-based bandit algorithms for two- and multi-armed bandits which use order statistics to select the next arm to play. The Bayesian strategies, also known in the literature as the "Thompson method", are shown to function well for a whole range of values, including very small values, outperforming UCB and other commonly used strategies. Empirical analysis results show a significant improvement in both synthetic and real
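The Bayesian "Thompson method" for Bernoulli arms can be sketched in a few lines: maintain a Beta posterior per arm, draw one sample from each posterior, and play the arm with the largest sample. This is the standard textbook form, not the dissertation's order-statistics variant:

```python
import random

def thompson_pull(successes, failures, rng=None):
    """Bernoulli Thompson sampling: draw one sample from each arm's
    Beta(successes + 1, failures + 1) posterior (uniform prior) and
    return the index of the arm with the largest sample."""
    rng = rng or random.Random(0)
    samples = [rng.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)
```

Arms with uncertain posteriors occasionally produce large samples and get explored, while arms with confidently high posteriors dominate in the long run.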
Proceedings of the 4th International Conference on Agents and Artificial Intelligence (ICAART 2012), 2012
We propose a learning approach to pre-compute K-armed bandit playing policies by exploiting prior information describing the class of problems targeted by the player. Our algorithm first samples a set of K-armed bandit problems from the given prior, and then chooses, in a space of candidate policies, one that gives the best average performance over these problems. The candidate policies use an index for ranking the arms and pick at each play the arm with the highest index; the index for each arm is computed as a linear combination of features describing the history of plays (e.g., number of draws, average reward, variance of rewards and higher-order moments), and an estimation of distribution algorithm is used to determine its optimal parameters in the form of feature weights. We carry out simulations in the case where the prior assumes a fixed number of Bernoulli arms, a fixed horizon, and uniformly distributed parameters of the Bernoulli arms. These simulations show that learned strategies perform very well with respect to several other strategies previously proposed in the literature (UCB1, UCB2, UCB-V, KL-UCB and ε_n-GREEDY); they also highlight the robustness of these strategies with respect to wrong prior information.