Improving the Exploration Strategy in Bandit Algorithms

Olivier Caelen; Gianluca Bontempi

Improving the exploration strategy in bandit algorithms

Gianluca Bontempi

2008, Learning and Intelligent Optimization

Sign up for access to the world's latest research

checkGet notified about relevant papers

checkSave papers to use in your research

checkJoin the discussion with peers

checkTrack your impact

Abstract
AI

The paper addresses the challenge of balancing exploration and exploitation in K-armed bandit problems, emphasizing the inadequacy of random exploration strategies. It introduces the Probability of Correct Selection (PCS) as a measure to enhance decision-making during exploration, proposing a novel semi-uniform strategy termed ǫ-PCSgreedy. Experimental results demonstrate the superior performance of the ǫ-PCSgreedy approach over traditional ǫ-greedy strategies across various synthetic and real tasks.

Figures (11)

though naive, is known to be hard to beat [13]. Also, because of the non zero €
term, it is not a zero-regret strategy. A pseudo-code of the algorithm follows. — though naive, is known to be hard to beat [13]. Also, because of the non zero € term, it is not a zero-regret strategy. A pseudo-code of the algorithm follows.

Algorithm 2 The Interval Estimation algorithm

Algorithm 3 The Gittins’ index algorithm

The distribution of f;"" — 5°" is shown in Figure 1. This figure shows that in
Fig. 1. If K=2, the probability of correct selection (PCS) is the surface under the
Gaussian when the abscissa takes his value in the set [0, -+oo]. — The distribution of f;"" — 5°" is shown in Figure 1. This figure shows that in Fig. 1. If K=2, the probability of correct selection (PCS) is the surface under the Gaussian when the abscissa takes his value in the set [0, -+oo].

Since da is positive and the goal of the PCSstrategy is to minimize the
variance at 1+ 1 the resulting strategy for the exploration step is — Since da is positive and the goal of the PCSstrategy is to minimize the variance at 1+ 1 the resulting strategy for the exploration step is

Algorithm 5 The e-PCSgreedy algorithm
Note how this algorithm differentiates from the conventional e-greedy only
for what concerns the exploration step, while the exploitation step is the same. — Algorithm 5 The e-PCSgreedy algorithm Note how this algorithm differentiates from the conventional e-greedy only for what concerns the exploration step, while the exploitation step is the same.

Key takeaways

The idea is that an effective exploration step should lead to the largest increase of the probability PCS of correctly selecting the best arm.
At the lth step the algorithm either chooses a random arm with probability ǫ ∈ [0, 1] or chooses the arm with the highest sample average µ l k .
The Interval Estimation algorithm 1: z 1,2 k ← play arm k twice; ∀k ∈ [1, K] 2: n k ← 2; ∀k ∈ [1, K] 3: for l = (2K + 1) to H do 4:
If the rewards of the arms follow a normal probability distribution (unknown mean and standard deviation) the solution of the dynamic programming problem at the l step is the arm which maximizes the Gittins' index v k = µ l k + σ l k · v g n l k , D
The probability of correct selection (PCS) [9,8] at the lth step is the probability that a (ǫ = 0)-greedy algorithm will select the best arm (i.e. the arm one in our notation)

Gianluca Bontempi

Annals of Mathematics and Artificial Intelligence, 2010

The K-armed bandit problem is a well-known formalization of the exploration versus exploitation dilemma. In this learning problem, a player is confronted to a gambling machine with K arms where each arm is associated to an unknown gain distribution. The goal of the player is to maximize the sum of the rewards. Several approaches have been proposed in literature to deal with the K-armed bandit problem. This paper introduces first the concept of “expected reward of greedy actions” which is based on the notion of probability of correct selection (PCS), well-known in simulation literature. This concept is then used in an original semi-uniform algorithm which relies on the dynamic programming framework and on estimation techniques to optimally balance exploration and exploitation. Experiments with a set of simulated and realistic bandit problems show that the new DP-greedy algorithm is competitive with state-of-the-art semi-uniform techniques.

Log In

Improving the exploration strategy in bandit algorithms

Sign up for access to the world's latest research

AbstractAI

Key takeaways

Related papers

Related topics

Related papers

Abstract
AI