
Improving the exploration strategy in bandit algorithms

2008, Learning and Intelligent Optimization

Abstract

The paper addresses the challenge of balancing exploration and exploitation in K-armed bandit problems, emphasizing the inadequacy of random exploration strategies. It introduces the Probability of Correct Selection (PCS) as a measure to enhance decision-making during exploration, proposing a novel semi-uniform strategy termed ε-PCSgreedy. Experimental results demonstrate the superior performance of the ε-PCSgreedy approach over traditional ε-greedy strategies across various synthetic and real tasks.

Key takeaways

  • The idea is that an effective exploration step should lead to the largest increase in the probability of correct selection (PCS) of the best arm.
  • At the lth step the algorithm either chooses a random arm with probability ε ∈ [0, 1] or chooses the arm with the highest sample average μ_k^l.
  • The Interval Estimation algorithm initializes by playing each arm twice (z_k^{1,2} ← play arm k twice and n_k ← 2 for all k ∈ [1, K]) and then iterates for l = (2K + 1) to H.
  • If the rewards of the arms follow a normal probability distribution (unknown mean and standard deviation), the solution of the dynamic programming problem at the lth step is the arm which maximizes the Gittins index v_k = μ_k^l + σ_k^l · v_g(n_k^l, D).
  • The probability of correct selection (PCS) [9, 8] at the lth step is the probability that an (ε = 0)-greedy algorithm will select the best arm (i.e., arm one in our notation).
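The ε-greedy rule in the takeaways can be sketched as follows. This is a minimal illustration, not the paper's implementation; the `pull` callback, the horizon `H`, and the default `epsilon` value are assumptions for the example.

```python
import random

def epsilon_greedy(pull, K, H, epsilon=0.1):
    """Semi-uniform epsilon-greedy play over a horizon of H steps.

    `pull(k)` is a hypothetical callback returning a stochastic
    reward for arm k; it is not part of the paper's interface.
    Returns the final sample averages of the K arms.
    """
    counts = [0] * K
    sums = [0.0] * K
    for l in range(H):
        if random.random() < epsilon:
            k = random.randrange(K)  # explore: uniformly random arm
        else:
            # exploit: arm with highest sample average (untried arms first)
            k = max(range(K),
                    key=lambda a: sums[a] / counts[a] if counts[a] else float("inf"))
        r = pull(k)
        counts[k] += 1
        sums[k] += r
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```

With more pulls, the sample averages concentrate around the true means, so the greedy step increasingly picks the best arm while the ε fraction of random pulls keeps every arm's estimate updated.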
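The PCS quantity in the last takeaway can be approximated numerically. Below is a Monte Carlo sketch under a standard normal approximation, where each sample mean is drawn as N(μ_k, σ_k/√n_k); the paper's exact estimator may differ, and the arm indexing (best arm at index 0) is a convention of this example only.

```python
import random

def estimate_pcs(mu, sigma, n, trials=10000, rng=None):
    """Monte Carlo estimate of the probability of correct selection (PCS):
    the chance that a greedy (epsilon = 0) pick of the highest sample
    average selects arm 0, assumed here to be the true best arm.

    mu, sigma, n: per-arm true means, standard deviations, and sample
    counts. Sample means are drawn as N(mu_k, sigma_k / sqrt(n_k)), a
    normal approximation assumed for this sketch.
    """
    rng = rng or random.Random()
    K = len(mu)
    wins = 0
    for _ in range(trials):
        draws = [rng.gauss(mu[k], sigma[k] / n[k] ** 0.5) for k in range(K)]
        if max(range(K), key=draws.__getitem__) == 0:  # greedy pick = best arm?
            wins += 1
    return wins / trials
```

As the sample counts n_k grow, the sampling noise on each average shrinks and the estimate approaches 1, which is why PCS is a natural target for directing exploration toward the arms whose uncertainty most threatens a correct greedy choice.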