
Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes

Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, Rahul Jain
University of Southern California
{chenyu.wei, mjafarni, haipengl, hiteshis, rahul.jain}@usc.edu

arXiv:1910.07072v1 [cs.LG] 15 Oct 2019

Abstract

Model-free reinforcement learning is known to be memory and computation efficient and more amenable to large scale problems. In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov Decision Processes (MDPs). The first algorithm reduces the problem to the discounted-reward version and achieves O(T^{2/3}) regret after T steps, under the minimal assumption of weakly communicating MDPs. The second algorithm makes use of recent advances in adaptive algorithms for adversarial multi-armed bandits and improves the regret to O(√T), albeit with a stronger ergodic assumption. To the best of our knowledge, these are the first model-free algorithms with sub-linear regret (that is polynomial in all parameters) in the infinite-horizon average-reward setting.

1 Introduction

Reinforcement learning (RL) refers to the problem of an agent interacting with an unknown environment with the goal of maximizing its cumulative reward through time. The environment is usually modeled as a Markov Decision Process (MDP) with an unknown transition kernel and/or an unknown reward function. The fundamental trade-off between exploration and exploitation is the key challenge for RL: should the agent exploit the available information to optimize the immediate performance, or should it explore the poorly understood states and actions to gather more information to improve future performance?

There are two broad classes of RL algorithms: model-based and model-free. Model-based algorithms maintain an estimate of the underlying MDP and use that to determine a policy during the learning process. Examples include UCRL2 (Jaksch et al., 2010), REGAL (Bartlett and Tewari, 2009), PSRL (Ouyang et al., 2017), SCAL (Fruit et al., 2018b), UCBVI (Azar et al., 2017), EBF (Zhang and Ji, 2019) and EULER (Zanette and Brunskill, 2019). Model-based algorithms are well-known for their sample efficiency. However, there are two general disadvantages of model-based algorithms: first, they require large memory to store the estimate of the model parameters; second, it is hard to extend model-based approaches to non-parametric settings, e.g., continuous-state MDPs.

Model-free algorithms, on the other hand, try to resolve these issues by directly maintaining an estimate of the optimal Q-value function or the optimal policy. Examples include Q-learning (Watkins, 1989), Delayed Q-learning (Strehl et al., 2006), TRPO (Schulman et al., 2015), DQN (Mnih et al., 2013), A3C (Mnih et al., 2016), and more. Model-free algorithms are not only computation and memory efficient, but also easier to extend to large scale problems by incorporating function approximation.

It was believed that model-free algorithms are less sample-efficient than model-based algorithms. However, Jin et al. (2018) recently showed that a (model-free) Q-learning algorithm with UCB exploration achieves a nearly-optimal regret bound, implying the possibility of designing algorithms with the advantages of both model-free and model-based methods. Jin et al. (2018) addressed the problem for episodic finite-horizon MDPs. Following this work, Dong et al. (2019) extended the result to the infinite-horizon discounted-reward setting.

However, model-free algorithms with low regret for infinite-horizon average-reward MDPs, an equally heavily-studied setting in the RL literature, remain unknown. Designing such algorithms has proven to be rather challenging since the Q-value function estimate may grow unbounded over time and it is hard to control its magnitude in a way that guarantees efficient learning. Moreover, techniques such as backward induction in the finite-horizon setting or contraction mapping in the infinite-horizon discounted setting cannot be applied to the infinite-horizon average-reward setting.

Table 1: Regret comparisons for RL algorithms in infinite-horizon average-reward MDPs with S states, A actions, and T steps. D is the diameter of the MDP, sp(v*) ≤ D is the span of the optimal value function, V⋆_{s,a} := Var_{s′∼p(·|s,a)}[v*(s′)] ≤ sp(v*)² is the variance of the optimal value function, t_mix is the mixing time (Def. 5.1), and ρ is a distribution mismatch coefficient (Eq. (4)). For more concrete definitions of these parameters, see Sections 3-5.

                Algorithm                              | Regret                         | Comment
  Model-based   REGAL (Bartlett and Tewari, 2009)      | Õ(sp(v*)√(SAT))                | no efficient implementation
                UCRL2 (Jaksch et al., 2010)            | Õ(DS√(AT))                     | -
                PSRL (Ouyang et al., 2017)             | Õ(sp(v*)S√(AT))                | Bayesian regret
                OSP (Ortner, 2018)                     | Õ(√(t_mix SAT))                | ergodic assumption and no efficient implementation
                SCAL (Fruit et al., 2018b)             | Õ(sp(v*)S√(AT))                | -
                KL-UCRL (Talebi and Maillard, 2018)    | Õ(√(S Σ_{s,a} V⋆_{s,a} T))     | -
                UCRL2B (Fruit et al., 2019)            | Õ(S√(DAT))                     | -
                EBF (Zhang and Ji, 2019)               | Õ(√(DSAT))                     | no efficient implementation
  Model-free    Optimistic Q-learning (this work)      | Õ(sp(v*)(SA)^{1/3} T^{2/3})    | -
                MDP-OOMD (this work)                   | Õ(√(t_mix³ ρAT))               | ergodic assumption
                lower bound (Jaksch et al., 2010)      | Ω(√(DSAT))                     | -

In this paper, we take the first attempt in this direction and propose two model-free algorithms for learning infinite-horizon average-reward MDPs. The first algorithm, Optimistic Q-learning (Section 4), achieves a regret bound of Õ(T^{2/3}) with high probability for the broad class of weakly communicating MDPs.¹ The key idea of this algorithm is to artificially introduce a discount factor for the reward, to avoid the aforementioned unbounded Q-value estimate issue, and to trade off this effect against the approximation introduced by the discount factor.

The second algorithm, MDP-OOMD (Section 5), attains an improved regret bound of Õ(√T) for the more restricted class of ergodic MDPs. This algorithm maintains an instance of a multi-armed bandit algorithm at each state to learn the best action. Importantly, the multi-armed bandit algorithm needs to ensure several key properties to achieve our claimed regret bound, and to this end we make use of the recent advances in adaptive adversarial bandit algorithms from (Wei and Luo, 2018) in a novel way.

To the best of our knowledge, these are the first model-free algorithms for infinite-horizon average-reward MDPs with sub-linear regret (that is polynomial in all parameters). For comparisons with existing approaches for this problem, see Table 1.

¹ Throughout the paper, we use the notation Õ(·) to suppress log terms.

2 Related Work

We review the related literature with theoretical guarantees for learning MDPs with finite state and action spaces. Three common settings have been studied: 1) the finite-horizon episodic setting, 2) the infinite-horizon discounted setting, and 3) the infinite-horizon average-reward setting. For the first two settings, previous works have designed efficient algorithms with regret bounds or sample complexity that are (almost) information-theoretically optimal, using either model-based approaches such as (Azar et al., 2017), or model-free approaches such as (Jin et al., 2018; Dong et al., 2019).

For the infinite-horizon average-reward setting, many model-based algorithms have been proposed, such as (Auer and Ortner, 2007; Jaksch et al., 2010; Ouyang et al., 2017; Agrawal and Jia, 2017; Talebi and Maillard, 2018; Fruit et al., 2018a,b). These algorithms either conduct posterior sampling or follow the optimism in the face of uncertainty principle to build an MDP model estimate and then plan according to the estimate (hence model-based). They all achieve a regret bound of order Õ(√T), but the dependence on other parameters is suboptimal. Recent works made progress toward obtaining the optimal bound (Ortner, 2018; Zhang and Ji, 2019); however, their algorithms are not computationally efficient – the time complexity scales exponentially with the number of states.

On the other hand, to the best of our knowledge there are no existing model-free algorithms for the infinite-horizon average-reward setting, except for the naive approach of combining Q-learning with an ε-greedy exploration strategy, which is known to suffer regret exponential in some parameters (Osband et al., 2014). Extending model-free methods from the other two settings to this problem is highly nontrivial — a phenomenon already encountered when designing model-based methods. Our work provides the first solutions to this problem.

Two additional works are closely related to our second algorithm MDP-OOMD: (Neu et al., 2013) and (Wang, 2017). Both belong to the family of policy optimization methods, where the learner tries to learn the parameters of the optimal policy directly. Their settings are quite different from ours and the results are not comparable. We defer more detailed comparisons with these two works to the end of Section 5.1.

3 Preliminaries

An infinite-horizon average-reward Markov Decision Process (MDP) can be described by (S, A, r, p), where S is the state space, A is the action space, r : S × A → [0, 1] is the reward function and p : S² × A → [0, 1] is the transition probability such that p(s′|s, a) := P(s_{t+1} = s′ | s_t = s, a_t = a) for s_t ∈ S, a_t ∈ A and t = 1, 2, 3, · · ·. We assume that S and A are finite sets with cardinalities S and A, respectively. The average reward per stage of a deterministic/stationary policy π : S → A starting from state s is defined as

    J^π(s) := lim inf_{T→∞} (1/T) E[ Σ_{t=1}^T r(s_t, π(s_t)) | s_1 = s ],

where s_{t+1} is drawn from p(·|s_t, π(s_t)). Let J*(s) := max_{π∈A^S} J^π(s). A policy π* is said to be optimal if it satisfies J^{π*}(s) = J*(s) for all s ∈ S.

We consider two standard classes of MDPs in this paper: (1) weakly communicating MDPs, defined in Section 4, and (2) ergodic MDPs, defined in Section 5. The weakly communicating assumption is weaker than the ergodic assumption, and is in fact known to be necessary for learning infinite-horizon MDPs with low regret (Bartlett and Tewari, 2009).

Standard MDP theory (Puterman, 2014) shows that for these two classes, there exist q* : S × A → R (unique up to an additive constant) and a unique J* ∈ [0, 1] such that J*(s) = J* for all s ∈ S and the following Bellman equation holds:

    J* + q*(s, a) = r(s, a) + E_{s′∼p(·|s,a)}[v*(s′)],   (1)

where v*(s) := max_{a∈A} q*(s, a). The optimal policy is then obtained by π*(s) = argmax_a q*(s, a).

We consider a learning problem where S, A and the reward function r are known to the agent, but not the transition probability p (so one cannot directly solve the Bellman equation to find the optimal policy). The knowledge of the reward function is a typical assumption as in Bartlett and Tewari (2009); Gopalan and Mannor (2015); Ouyang et al. (2017), and can be removed at the expense of a constant factor in the regret bound.

Specifically, the learning protocol is as follows. An agent starts at an arbitrary state s_1 ∈ S. At each time step t = 1, 2, 3, · · ·, the agent observes state s_t ∈ S and takes action a_t ∈ A, which is a function of the history s_1, a_1, s_2, a_2, · · ·, s_{t−1}, a_{t−1}, s_t. The environment then determines the next state by drawing s_{t+1} according to p(·|s_t, a_t). The performance of a learning algorithm is evaluated through the notion of cumulative regret, defined as the difference between the total reward of the optimal policy and that of the algorithm:

    R_T := Σ_{t=1}^T ( J* − r(s_t, a_t) ).

Since r ∈ [0, 1] (and consequently J* ∈ [0, 1]), the regret can at worst grow linearly with T. If a learning algorithm achieves sub-linear regret, then R_T/T converges to zero, i.e., the average reward of the algorithm converges to the optimal per-stage reward J*. The best existing regret bound is Õ(√(DSAT)), achieved by a model-based algorithm (Zhang and Ji, 2019) (where D is the diameter of the MDP), and it matches the lower bound of Jaksch et al. (2010) up to logarithmic factors. As far as we know, there is no existing model-free algorithm with a sub-linear regret bound.
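To make the protocol above concrete, the following Python sketch simulates the interaction loop and accumulates the regret R_T for a tabular MDP. It is only an illustrative harness that we add for exposition: the arrays P and r, the value J_star, and the agent interface (act/update) are placeholder names of our own, not objects defined in the paper.

    import numpy as np

    def run_and_measure_regret(P, r, J_star, agent, T, s0=0, seed=0):
        """Simulate the learning protocol of Section 3 and return the regret R_T.

        P      : transition kernel, shape (S, A, S), with P[s, a, s'] = p(s'|s, a)
        r      : known reward function, shape (S, A), values in [0, 1]
        J_star : optimal average reward J* (used only to measure regret)
        agent  : any object with methods act(s) -> a and update(s, a, s_next)
        """
        rng = np.random.default_rng(seed)
        s, total_reward = s0, 0.0
        for t in range(T):
            a = agent.act(s)                             # a_t may depend on the whole history
            total_reward += r[s, a]
            s_next = rng.choice(P.shape[2], p=P[s, a])   # draw s_{t+1} ~ p(.|s_t, a_t)
            agent.update(s, a, s_next)
            s = s_next
        return T * J_star - total_reward                 # R_T = sum_t (J* - r(s_t, a_t))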

4 Optimistic Q-Learning

In this section, we introduce our first algorithm, Optimistic Q-learning (see Algorithm 1 for pseudocode). The algorithm works for any weakly communicating MDP. An MDP is weakly communicating if its state space S can be partitioned into two subsets: in the first subset, all states are transient under any stationary policy; in the second subset, every two states are accessible from each other under some stationary policy. It is well-known that the weakly communicating condition is necessary for ensuring low regret in this setting (Bartlett and Tewari, 2009).

Algorithm 1: Optimistic Q-learning
1  Parameters: H ≥ 2, confidence level δ ∈ (0, 1)
2  Initialization: γ = 1 − 1/H, ∀s : V̂_1(s) = H
3  ∀s, a : Q_1(s, a) = Q̂_1(s, a) = H, n_1(s, a) = 0
4  Define: ∀τ, α_τ = (H + 1)/(H + τ), b_τ = 4 sp(v*) √((H/τ) ln(2T/δ))
5  for t = 1, . . . , T do
6      Take action a_t = argmax_{a∈A} Q̂_t(s_t, a).
7      Observe s_{t+1}.
8      Update:
           n_{t+1}(s_t, a_t) ← n_t(s_t, a_t) + 1
           τ ← n_{t+1}(s_t, a_t)
           Q_{t+1}(s_t, a_t) ← (1 − α_τ) Q_t(s_t, a_t) + α_τ [ r(s_t, a_t) + γ V̂_t(s_{t+1}) + b_τ ]   (2)
           Q̂_{t+1}(s_t, a_t) ← min{ Q̂_t(s_t, a_t), Q_{t+1}(s_t, a_t) }
           V̂_{t+1}(s_t) ← max_{a∈A} Q̂_{t+1}(s_t, a).
       (All other entries of n_{t+1}, Q_{t+1}, Q̂_{t+1}, V̂_{t+1} remain the same as those in n_t, Q_t, Q̂_t, V̂_t.)

Define sp(v*) = max_s v*(s) − min_s v*(s) to be the span of the value function, which is known to be bounded for weakly communicating MDPs. In particular, it is bounded by the diameter of the MDP (see Lemma 38.1 of Lattimore and Szepesvári (2018)). We assume that sp(v*) is known and use it to set the parameters of our algorithm. In the case when it is unknown, we can replace sp(v*) with any upper bound on it (such as the diameter) in both the algorithm and the analysis.

The key idea of Algorithm 1 is to solve the undiscounted problem via learning a discounted MDP (with the same states, actions, reward function, and transition kernel), for some discount factor γ (defined in terms of a parameter H). We define V* and Q* to be the optimal value function and Q-function of the discounted MDP, satisfying the following Bellman equations:

    ∀(s, a), Q*(s, a) = r(s, a) + γ E_{s′∼p(·|s,a)}[V*(s′)],
    ∀s, V*(s) = max_{a∈A} Q*(s, a).

The way we learn this discounted MDP is essentially the same as the algorithm of Dong et al. (2019), which itself is based on the idea from Jin et al. (2018). Specifically, the algorithm maintains an estimate V̂_t of the optimal value function V* and Q̂_t of the optimal Q-function Q*, where Q̂_t is a clipped version of another estimate Q_t. Each time, the algorithm takes a greedy action with the maximum estimated Q-value (Line 6). After seeing the next state, the algorithm makes a stochastic update of Q_t based on the Bellman equation, importantly with an extra bonus term b_τ and a carefully chosen step size α_τ (see Eq. (2)). Here, τ is the number of times the current state-action pair has been visited, and the bonus term b_τ scales as O(√(H/τ)), which encourages exploration since it shrinks every time a state-action pair is executed. The choice of the step size α_τ is also crucial, as pointed out in Jin et al. (2018), and determines a certain effective period of the history for the current update.

While the algorithmic idea is similar to Dong et al. (2019), we emphasize that our analysis is different and novel:

• First, Dong et al. (2019) analyze the sample complexity of their algorithm while we analyze the regret.

• Second, we need to deal with the approximation effect due to the difference between the discounted MDP and the original undiscounted one.

• Last, but not the least, part of our analysis improves over that of Dong et al. (2019) (specifically our Lemma 3) and in fact can be used to improve their sub-optimal sample complexity bound (details omitted to keep the results of this work focused).

We state the main regret guarantee of Algorithm 1 in the following theorem.

Theorem 1. If the MDP is weakly communicating, Algorithm 1 with H = min{ √(sp(v*)T/(SA)), (T/(SA ln(4T/δ)))^{1/3} } ensures that with probability at least 1 − δ, R_T is of order

    O( √(sp(v*)SAT) + sp(v*) T^{2/3} (SA ln(T/δ))^{1/3} + √(T ln(1/δ)) ).

Our regret bound scales as Õ(T^{2/3}) and is suboptimal compared to model-based approaches with Õ(√T) regret (such as UCRL2) that match the information-theoretic lower bound (Jaksch et al., 2010). However, this is the first model-free algorithm with sub-linear regret, and how to achieve Õ(√T) regret via model-free algorithms (under only the weakly communicating condition) remains unknown.
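For concreteness, here is a minimal Python sketch of Algorithm 1, assuming (as in the text) that the reward function r and sp(v*) are known; the class and variable names are ours, and H should be set as in Theorem 1. It matches the act/update interface of the illustrative harness sketched in Section 3.

    import numpy as np

    class OptimisticQLearning:
        """A sketch of Algorithm 1 (Optimistic Q-learning) for a tabular MDP."""

        def __init__(self, S, A, r, sp_v_star, T, H, delta=0.1):
            self.r, self.T, self.H, self.delta = r, T, H, delta
            self.gamma = 1.0 - 1.0 / H                   # discount factor gamma = 1 - 1/H
            self.sp = sp_v_star
            self.Q = np.full((S, A), float(H))           # Q_1 = H
            self.Q_hat = np.full((S, A), float(H))       # clipped estimate, Q-hat_1 = H
            self.V_hat = np.full(S, float(H))            # V-hat_1 = H
            self.n = np.zeros((S, A), dtype=int)         # visit counts

        def act(self, s):
            return int(np.argmax(self.Q_hat[s]))         # greedy w.r.t. the clipped estimate

        def update(self, s, a, s_next):
            self.n[s, a] += 1
            tau = self.n[s, a]
            alpha = (self.H + 1.0) / (self.H + tau)      # step size alpha_tau
            bonus = 4.0 * self.sp * np.sqrt(
                self.H / tau * np.log(2 * self.T / self.delta))   # exploration bonus b_tau
            target = self.r[s, a] + self.gamma * self.V_hat[s_next] + bonus
            self.Q[s, a] = (1 - alpha) * self.Q[s, a] + alpha * target   # Eq. (2)
            self.Q_hat[s, a] = min(self.Q_hat[s, a], self.Q[s, a])       # clipping
            self.V_hat[s] = self.Q_hat[s].max()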

4.1 Proof sketch of Theorem 1

The proof starts by decomposing the regret as

    R_T = Σ_{t=1}^T (J* − r(s_t, a_t))
        = Σ_{t=1}^T (J* − (1 − γ)V*(s_t))
        + Σ_{t=1}^T (V*(s_t) − Q*(s_t, a_t))
        + Σ_{t=1}^T (Q*(s_t, a_t) − γV*(s_t) − r(s_t, a_t)).

Each of these three terms is handled through Lemmas 2, 3 and 4, whose proofs are deferred to the appendix. Plugging in γ = 1 − 1/H and picking the optimal H finishes the proof.

Lemma 2. The optimal value function V* of the discounted MDP satisfies
1. |J* − (1 − γ)V*(s)| ≤ (1 − γ) sp(v*), ∀s ∈ S,
2. sp(V*) ≤ 2 sp(v*).

The proof of this lemma is by combining the Bellman equations of the discounted and undiscounted settings.

Lemma 3. With probability at least 1 − δ, we have

    Σ_{t=1}^T (V*(s_t) − Q*(s_t, a_t)) ≤ 4HSA + 24 sp(v*) √(HSAT ln(2T/δ)).

This lemma is one of our key technical contributions. To prove this lemma one can write

    Σ_{t=1}^T (V*(s_t) − Q*(s_t, a_t)) = Σ_{t=1}^T (V*(s_t) − V̂_t(s_t)) + Σ_{t=1}^T (Q̂_t(s_t, a_t) − Q*(s_t, a_t)),

using the fact that V̂_t(s_t) = Q̂_t(s_t, a_t) by the greedy policy. The main part of the proof is to show that the second summation can in fact be bounded by Σ_{t=2}^{T+1} (V̂_t(s_t) − V*(s_t)) plus a small sub-linear term, which cancels with the first summation.

Lemma 4. With probability at least 1 − δ,

    Σ_{t=1}^T (Q*(s_t, a_t) − γV*(s_t) − r(s_t, a_t)) ≤ 2 sp(v*) √(2T ln(1/δ)) + 2 sp(v*).

This lemma is proven via the Bellman equation for the discounted setting and Azuma's inequality.

5 Õ(√T) Regret for Ergodic MDPs

In this section, we propose another model-free algorithm that achieves an Õ(√T) regret bound for ergodic MDPs, a sub-class of weakly communicating MDPs. An MDP is ergodic if for any stationary policy π, the induced Markov chain is irreducible and aperiodic. Learning ergodic MDPs is arguably easier than the general case because the MDP is explorative by itself. However, achieving an Õ(√T) regret bound in this case with model-free methods is still highly non-trivial and we are not aware of any such result in the literature. Below, we first introduce a few useful properties of ergodic MDPs, all of which can be found in (Puterman, 2014).

We use randomized policies in this approach. A randomized policy π maps every state s to a distribution over actions π(·|s) ∈ Δ_A, where Δ_A = {x ∈ R^A_+ : Σ_a x(a) = 1}. In an ergodic MDP, any policy π induces a Markov chain with a unique stationary distribution μ^π ∈ Δ_S satisfying (μ^π)^⊤ P^π = (μ^π)^⊤, where P^π ∈ R^{S×S} is the induced transition matrix defined as P^π(s, s′) = Σ_a π(a|s) p(s′|s, a). We denote the stationary distribution of the optimal policy π* by μ*.

For ergodic MDPs, the long-term average reward J^π of any fixed policy π is independent of the starting state and can be written as J^π = (μ^π)^⊤ r^π, where r^π ∈ [0, 1]^S is such that r^π(s) := Σ_a π(a|s) r(s, a). For any policy π, the following Bellman equation has a solution q^π : S × A → R that is unique up to an additive constant:

    J^π + q^π(s, a) = r(s, a) + E_{s′∼p(·|s,a)}[v^π(s′)],

where v^π(s) = Σ_a π(a|s) q^π(s, a). In this section, we impose an extra constraint Σ_s μ^π(s) v^π(s) = 0 so that q^π is indeed unique. In this case, it can be shown that v^π has the following form:

    v^π(s) = Σ_{t=0}^∞ ( e_s^⊤ (P^π)^t − (μ^π)^⊤ ) r^π,   (3)

where e_s is the basis vector with 1 in coordinate s.

Furthermore, ergodic MDPs have finite mixing time and hitting time, defined as follows.

Definition 5.1 ((Levin and Peres, 2017; Wang, 2017)). The mixing time of an ergodic MDP is defined as

    t_mix := max_π min{ t ≥ 1 : ‖(P^π)^t(s, ·) − μ^π‖_1 ≤ 1/4, ∀s },

that is, the maximum time required for any policy starting at any initial state to make the state distribution 1/4-close (in ℓ1 norm) to the stationary distribution.
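As an illustration of the quantities just defined, the following sketch computes μ^π, J^π, and the bias functions v^π and q^π of an ergodic MDP for a given randomized policy, by solving the corresponding linear systems with the normalization Σ_s μ^π(s) v^π(s) = 0 used in this section. The routine is our own illustration under the ergodicity assumption, not part of the algorithm.

    import numpy as np

    def policy_quantities(P, r, pi):
        """Stationary distribution, gain and bias of a policy in an ergodic MDP.

        P  : transition kernel, shape (S, A, S);  r : rewards, shape (S, A)
        pi : randomized policy, shape (S, A), rows summing to 1
        Returns (mu, J, v, q) with sum_s mu(s) v(s) = 0.
        """
        S, A, _ = P.shape
        P_pi = np.einsum('sa,sat->st', pi, P)       # P^pi(s, s') = sum_a pi(a|s) p(s'|s, a)
        r_pi = (pi * r).sum(axis=1)                 # r^pi(s) = sum_a pi(a|s) r(s, a)

        # mu^pi solves mu^T P^pi = mu^T together with sum_s mu(s) = 1.
        A_mat = np.vstack([P_pi.T - np.eye(S), np.ones(S)])
        b = np.zeros(S + 1)
        b[-1] = 1.0
        mu = np.linalg.lstsq(A_mat, b, rcond=None)[0]

        J = mu @ r_pi                               # long-term average reward J^pi
        # Bias v^pi solves (I - P^pi) v = r^pi - J, with sum_s mu(s) v(s) = 0.
        B_mat = np.vstack([np.eye(S) - P_pi, mu])
        c = np.concatenate([r_pi - J, [0.0]])
        v = np.linalg.lstsq(B_mat, c, rcond=None)[0]

        q = r - J + np.einsum('sat,t->sa', P, v)    # Bellman equation for q^pi
        return mu, J, v, q

From μ^π one can also read off the hitting time and the distribution mismatch coefficient introduced next.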

Algorithm 2: MDP-OOMD
1  Define: episode length B = 16 t_mix t_hit (log₂ T)² and number of episodes K = T/B
2  Initialize: π′_1(a|s) = π_1(a|s) = 1/A, ∀s, a.
3  for k = 1, 2, . . . , K do
4      for t = (k − 1)B + 1, . . . , kB do
5          Draw a_t ∼ π_k(·|s_t) and observe s_{t+1}.
6      Define trajectory T_k = (s_{(k−1)B+1}, a_{(k−1)B+1}, . . . , s_{kB}, a_{kB}).
7      for all s ∈ S do
8          β̂_k(s, ·) = EstimateQ(T_k, π_k, s).
9          (π_{k+1}(·|s), π′_{k+1}(·|s)) = OomdUpdate(π′_k(·|s), β̂_k(s, ·)).

Algorithm 3: EstimateQ
1  Input: T, π, s
       T : a state-action trajectory (s_{t1}, a_{t1}, . . . , s_{t2}, a_{t2}) from t1 to t2
       π : the policy used to sample the trajectory T
       s : target state
2  Define: N = 4 t_mix log₂ T (window length minus 1)
3  Initialize: τ ← t1, i ← 0
4  while τ ≤ t2 − N do
5      if s_τ = s then
6          i ← i + 1
7          Let R = Σ_{t=τ}^{τ+N} r(s_t, a_t).
8          Let y_i(a) = (R/π(a|s)) 1[a_τ = a], ∀a.   (y_i ∈ R^A)
9          τ ← τ + 2N
10     else
11         τ ← τ + 1
12 if i ≠ 0 then
13     return (1/i) Σ_{j=1}^i y_j.
14 else
15     return 0.

Algorithm 4: OomdUpdate
1  Input: π′ ∈ Δ_A, β̂ ∈ R^A
2  Define:
3      Regularizer ψ(x) = (1/η) Σ_{a=1}^A log(1/x(a)), for x ∈ R^A_+
4      Bregman divergence associated with ψ: D_ψ(x, x′) = ψ(x) − ψ(x′) − ⟨∇ψ(x′), x − x′⟩
5  Update:
       π′_next = argmax_{π∈Δ_A} { ⟨π, β̂⟩ − D_ψ(π, π′) }       (5)
       π_next  = argmax_{π∈Δ_A} { ⟨π, β̂⟩ − D_ψ(π, π′_next) }  (6)
6  return (π_next, π′_next).

Definition 5.2. The hitting time of an ergodic MDP is defined as

    t_hit := max_π max_s 1/μ^π(s).

Our regret bound also depends on the following distribution mismatch coefficient:

    ρ := max_π Σ_s μ*(s)/μ^π(s),   (4)

which has been used in previous work (Kakade and Langford, 2002; Agarwal et al., 2019). Clearly, one has ρ ≤ t_hit Σ_s μ*(s) = t_hit. We assume T is large enough so that the finite constants t_mix and t_hit are both smaller than T/4, and we also assume that these quantities are known.

5.1 Policy Optimization via Optimistic Online Mirror Descent

The key to getting an Õ(√T) bound is to learn the optimal policy π* directly, by reducing the problem to solving an adversarial multi-armed bandit (MAB) (Auer et al., 2002) instance at each individual state.

The details of our algorithm MDP-OOMD are shown in Algorithm 2. It proceeds in episodes, and maintains an independent copy of a specific MAB algorithm for each state. At the beginning of episode k, each MAB algorithm outputs an action distribution π_k(·|s) for the corresponding state s, which together induce a policy π_k. The learner then executes policy π_k throughout episode k. At the end of the episode, for every state s we feed a reward estimator β̂_k(s, ·) ∈ R^A to the corresponding MAB algorithm, where β̂_k is constructed using the samples collected in episode k (see Algorithm 3). Finally, all MAB algorithms update their distributions and output π_{k+1} for the next episode (Algorithm 4).

The reward estimator β̂_k(s, ·) is an almost unbiased estimator for

    β^{π_k}(s, ·) := q^{π_k}(s, ·) + N J^{π_k}   (7)

with negligible bias (N is defined in Algorithm 3).
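For concreteness, the following Python sketches implement the two sub-routines. The estimate_q sketch follows Algorithm 3 directly (we pass the trajectory as aligned state/action/reward arrays, a representation of our choosing). The OomdUpdate sketch solves the argmax in Eqs. (5)-(6) for the log-barrier regularizer via a one-dimensional bisection on the Lagrange multiplier of the simplex constraint; this solver is our own construction based on the first-order condition β̂(a) + 1/(η π(a)) − 1/(η π′(a)) = λ, not code from (Wei and Luo, 2018).

    import numpy as np

    def estimate_q(states, actions, rewards, pi_s, s, t_mix, T):
        """Algorithm 3 (EstimateQ) for a target state s.

        states, actions, rewards : aligned arrays for one episode's trajectory
        pi_s : the action distribution pi_k(.|s) used during the episode, shape (A,)
        Returns an estimate of beta^{pi_k}(s, .) = q^{pi_k}(s, .) + N * J^{pi_k}.
        """
        N = int(4 * t_mix * np.log2(T))                  # window length minus 1
        A = len(pi_s)
        ys, tau, end = [], 0, len(states) - 1
        while tau <= end - N:
            if states[tau] == s:
                R = rewards[tau:tau + N + 1].sum()       # reward over the (N+1)-step window
                y = np.zeros(A)
                y[actions[tau]] = R / pi_s[actions[tau]] # inverse-propensity weighting
                ys.append(y)
                tau += 2 * N                             # rest N extra steps to de-correlate
            else:
                tau += 1
        return np.mean(ys, axis=0) if ys else np.zeros(A)

    def log_barrier_step(pi_prev, beta_hat, eta, iters=100):
        """argmax_{pi in simplex} <pi, beta_hat> - D_psi(pi, pi_prev) for the
        log-barrier psi(x) = (1/eta) sum_a log(1/x(a)).  Stationarity gives
        pi(a) = 1 / (eta*(lam - beta_hat(a)) + 1/pi_prev(a)); bisect over lam."""
        def candidate(lam):
            return 1.0 / (eta * (lam - beta_hat) + 1.0 / pi_prev)

        lo = np.max(beta_hat - 1.0 / (eta * pi_prev))    # below this, some pi(a) <= 0
        hi = lo + 1.0
        while candidate(hi).sum() > 1.0:                 # push hi up until the sum drops below 1
            hi = lo + 2.0 * (hi - lo)
        for _ in range(iters):                           # bisection on the multiplier
            mid = 0.5 * (lo + hi)
            if candidate(mid).sum() > 1.0:
                lo = mid
            else:
                hi = mid
        pi = candidate(hi)
        return pi / pi.sum()

    def oomd_update(pi_prime, beta_hat, eta):
        """Algorithm 4 (OomdUpdate): mirror step Eq. (5), then optimistic step Eq. (6)."""
        pi_prime_next = log_barrier_step(pi_prime, beta_hat, eta)   # Eq. (5)
        pi_next = log_barrier_step(pi_prime_next, beta_hat, eta)    # Eq. (6)
        return pi_next, pi_prime_next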

The term N J πk is the same for all actions and thus that leads to a better regret bound in terms of ρ
the corresponding MAB algorithm is trying to learn instead of thit . See Lemma 8 and Lemma 9.
the best action at state s in terms of the average of
Q-value functions q π1 (s, ·), . . . , q πK (s, ·). To construct The regret guarantee of our algorithm is shown in the
the reward estimator for state s, the sub-routine Es- next theorem.
timateQ collects non-overlapping intervals of length Theorem 5. For ergodic MDPs, with an appropriate
N + 1 = O(te mix ) that start from state s, and use the
chosen learning rate η for Algorithm 4, MDP-OOMD
standard inverse-propensity scoring to construct an es- achieves
timator yi for interval i (Line 8). In fact, to reduce q 
the correlation among the non-overlapping intervals, E[RT ] = Oe 3
tmix ρAT .
we also make sure that these intervals are at least N
steps apart from each other (Line 9). The final esti-
mator βbk (s, ·) is simply the average of all estimators Discussion. Neu et al. (2013) considered learning
yi over these disjoint intervals. This averaging is im- ergodic MDPs with known transition kernel and ad-
portant for reducing variance as explained later. versarial rewards, a setting that is incomparable to
ours. Their algorithm MDP-Exp3 maintains a copy
The MAB algorithm we use is optimistic online mirror of Exp3 for each state, but the reward estimators fed
descent (OOMD) with log-barrier as the regularizer, to these algorithms are very different
p from ours. They
which is analyzed in depth in (Wei and Luo, 2018). e
proved a regret bound of order O t3mix thit AT . Our
Here, optimism refers to something different from the
optimistic exploration discussed in Section 4. It corre- bound is never worse than theirs since ρ ≤ thit .
sponds to the fact that after a standard mirror descent In another recent work, Wang (2017) considered
update (Eq. (5)), the algorithm further makes a sim- learning ergodic MDPs under the assumption that
ilar update using an optimistic prediction of the next the learner is provided with a generative model (an
reward vector, which in our case is simply the previ- oracle that takes in a state-action pair and out-
ous reward estimator (Eq. (6)). We refer the reader to put a sample of the next state). They  2 derived 
(Wei and Luo, 2018) for more details, but point out a sample-complexity bound of order O mixǫ2 e t τ 2 SA
that the optimistic prediction we use here is new.
for finding an ǫ-optimal policy,  where τ =
It is clear that each MAB algorithm is facing a non-  ∗ 2  2
µ (s) 1/S
max maxs 1/S , maxs′ ,π µπ (s′ ) , which is at
stochastic problem (since πk is changing over time) and

thus it is important to deploy an adversarial MAB al- least maxπ maxs,s′ µµπ (s
(s)
′ ) by AM-GM inequality. This
gorithm. The standard algorithm for adversarial MAB result is again incomparable to ours, but we point out
is Exp3 (Auer et al., 2002), which was also used for that our distribution mismatch coefficient ρ is always
solving adversarial MDPs by (Neu et al., 2013) (more bounded by τ S, while τ can be much larger than ρ on
comparisons with this work to follow). However, there the other hand.
are several important reasons for our choice of the re-
cently developed algorithm OOMD with log-barrier:
5.2 Proof sketch of Theorem 5

• First, the log-barrier regularizer produces a more We first decompose the regret as follows:
exploratory distribution compared to Exp3 (as
T
X
noticed in e.g. (Agarwal et al., 2017)), so we do
RT = J ∗ − r(st , at )
not need an explicit uniform exploration over the
t=1
actions, which significantly simplifies the analysis K K X
compared to (Neu et al., 2013). X X
=B (J ∗ − J πk ) + (J πk − r(st , at )) , (8)
k=1 k=1 t∈Ik
• Second, log-barrier regularizer provides more sta-
ble updates compared to Exp3 in the sense that where Ik := {(k − 1)B + 1, . . . , kB} is the set of time
πk (a|s) and πk−1 (a|s) are within a multiplicative steps for episode k. Using the reward difference lemma
factor of each other (see Lemma 7). This implies (Lemma 15 in the Appendix), the first term of Eq. (8)
that the corresponding policies and their Q-value can be written as
functions are also stable, which is critical for our "K #
analysis. X XX
∗ ∗ πk
B µ (s) (π (a|s) − πk (a|s))q (s, a) ,
s k=1 a
• Finally, the optimistic prediction of OOMD, to-
gether with our particular reward estimator from where the term in the square bracket can be recognized
EstimateQ, provides a variance reduction effect as exactly the regret of the MAB algorithm for state s
Model-free RL in Infinite-horizon Average-reward MDPs

and is analyzed in Lemma 8 of Section 5.3. Combining Lemma 6. Let Ek [x] denote the expectation of a
the regret of all MAB algorithms, Lemma 9 then shows random variable x conditioned on all history before
that in expectation the first term of Eq. (8) is at most episode k. Then for any k, s, a (recall β defined in
  Eq. (7)),
e BA ηT N 3 ρ h i  
O + + η3 T N 6 . (9) b πk 1
η B Ek βk (s, a) − β (s, a) ≤ O , (11)
T
 2   
b πk N 3 log T
On the other hand, the expectation of the second term Ek βk (s, a) − β (s, a) ≤O .
in Eq.(8) can be further written as Bπk (a|s)µπk (s)
(12)
" K X
#
X
E (J πk
− r(st , at )) The next lemma shows that in OOMD, πk and πk−1
k=1 t∈Ik are close in a strong sense, which further implies the
" K X
# stability for several other related quantities.
X
πk ′ πk
=E (Es′ ∼p(·|st ,at ) [v (s )] − q (st , at )) Lemma 7. For any k, s, a,
k=1 t∈Ik
|πk (a|s) − πk−1 (a|s)| ≤ O(ηN πk−1 (a|s)), (13)
(Bellman equation)
" # πk πk−1 2
K X |J −J | ≤ O(ηN ),
X
πk ′ πk πk πk−1
=E (Es′ ∼p(·|st ,at ) [v (s )] − v (st+1 )) |v (s) − v (s)| ≤ O(ηN 3 ),
k=1 t∈Ik
" # |q πk (s, a) − q πk−1 (s, a)| ≤ O(ηN 3 ),
K
X X
+E (v πk (st ) − q πk (st , at )) |β πk (s, a) − β πk−1 (s, a)| ≤ O(ηN 3 ).
k=1 t∈Ik
" K X
# The next lemma shows the regret bound of OOMD
X
+E πk
(v (st+1 ) − v (st )) πk based on an analysis similar to (Wei and Luo, 2018).
k=1 t∈Ik Lemma 8. For a specific state s, we have
" K
# "K #
X XX A ln T
πk πk
=E (v (skB+1 ) − v (s(k−1)B+1 )) E (π ∗ (a|s) − πk (a|s))βbk (s, a) ≤ O
k=1
η
k=1 a
(the first two terms above are zero) "K #!
XX  2
"K−1 #
X + ηE πk (a|s)2 βbk (s, a) − βbk−1 (s, a) ,
πk πk+1
=E (v (skB+1 ) − v (skB+1 )) k=1 a
k=1
h i where we define βb0 (s, a) = 0 for all s and a.
+ E v πK (sKB+1 ) − v π1 (s1 ) . (10)
Finally, we state a key lemma for proving Theorem 5.
The first term in the last expression can be bounded Lemma 9. MDP-OOMD ensures
" K #
by O(ηN 3 K) = O(ηN 3 T /B) due to the stability of XXX
OomdUpdate (Lemma 7) and the second term is at E B µ∗ (s) (π ∗ (a|s) − πk (a|s)) q πk (s, a)
most O(tmix ) according to Lemma 14 in the appendix. k=1 s a
 
e mix ), B = BA ln T T N 3ρ 3 6
Combining these facts with N = O(t =O +η + η TN .
e η B
O(tmix thit ), Eq .(8) and Eq. (9) and choosing the op-
timal η, we arrive at
6 Conclusions
 3

E[RT ] = O e BA + η tmix ρT + η 3 t6 T In this work we propose two model-free algorithms for
mix
η B
q  learning infinite-horizon average-reward MDPs. They
3 1 are based on different ideas: one reduces the problem
=Oe t3mix ρAT + t3mix thit A 4 T 4 + t2mix thit A .
to the discounted version, while the other optimizes
the policy directly via a novel application of adap-
5.3 Auxiliary Lemmas tive adversarial multi-armed bandit algorithms. The
main open question is how to achieve the information-
To analyze the regret, we establish several useful lem- theoretically optimal regret bound via a model-free al-
mas, whose proofs can be found in the Appendix. gorithm, if it is possible at all. We believe that the
First, we show that βbk (s, a) is an almost unbiased es- techniques we develop in this work would be useful in
timator for β πk (s, a). answering this question.
Wei, Jafarnia-Jahromi, Luo, Sharma, Jain

Acknowledgement

The authors would like to thank Mengxiao Zhang for helping us prove Lemma 6, Gergely Neu for clarifying the analysis in (Neu et al., 2013), and Ronan Fruit for discussions on a related open problem presented at ALT 2019. Support from NSF for MJ (award ECCS-1810447), HL (award IIS-1755781), HS (award CCF-1817212) and RJ (awards ECCS-1810447 and CCF-1817212) is gratefully acknowledged.

References

Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.

Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. (2017). Corralling a band of bandit algorithms. In Conference on Learning Theory, pages 12-38.

Agrawal, S. and Jia, R. (2017). Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184-1194.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77.

Auer, P. and Ortner, R. (2007). Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pages 49-56.

Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263-272.

Bartlett, P. L. and Tewari, A. (2009). REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35-42. AUAI Press.

Bubeck, S., Li, Y., Luo, H., and Wei, C.-Y. (2019). Improved path-length regret bounds for bandits. In Conference on Learning Theory.

Chiang, C.-K., Yang, T., Lee, C.-J., Mahdavi, M., Lu, C.-J., Jin, R., and Zhu, S. (2012). Online optimization with gradual variations. In Conference on Learning Theory.

Dong, K., Wang, Y., Chen, X., and Wang, L. (2019). Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. arXiv preprint arXiv:1901.09311.

Fruit, R., Pirotta, M., and Lazaric, A. (2018a). Near optimal exploration-exploitation in non-communicating Markov decision processes. In Advances in Neural Information Processing Systems, pages 2994-3004.

Fruit, R., Pirotta, M., and Lazaric, A. (2019). Improved analysis of UCRL2B. Available at rlgammazero.github.io/docs/ucrl2b_improved.pdf.

Fruit, R., Pirotta, M., Lazaric, A., and Ortner, R. (2018b). Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pages 1573-1581.

Gopalan, A. and Mannor, S. (2015). Thompson sampling for learning parameterized Markov decision processes. In Conference on Learning Theory, pages 861-898.

Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563-1600.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. (2018). Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863-4873.

Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning.

Lattimore, T. and Szepesvári, C. (2018). Bandit Algorithms. Cambridge University Press.

Levin, D. A. and Peres, Y. (2017). Markov Chains and Mixing Times, volume 107. American Mathematical Society.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928-1937.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Neu, G., György, A., Szepesvári, C., and Antos, A. (2013). Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59:676-691.

Ortner, R. (2018). Regret bounds for reinforcement learning via Markov chain concentration. arXiv preprint arXiv:1808.01813.

Osband, I., Van Roy, B., and Wen, Z. (2014). Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635.

Ouyang, Y., Gagrani, M., Nayyar, A., and Jain, R. (2017). Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems, pages 1333-1342.

Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897.

Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. (2006). PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881-888. ACM.

Talebi, M. S. and Maillard, O.-A. (2018). Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In Algorithmic Learning Theory, pages 770-805.

Wang, M. (2017). Primal-dual π learning: Sample complexity and sublinear run time for ergodic Markov decision problems. arXiv preprint arXiv:1710.06100.

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King's College, Cambridge.

Wei, C.-Y. and Luo, H. (2018). More adaptive algorithms for adversarial bandits. In Conference on Learning Theory, pages 1263-1291.

Zanette, A. and Brunskill, E. (2019). Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning.

Zhang, Z. and Ji, X. (2019). Regret minimization for reinforcement learning by evaluating the optimal bias function. In Advances in Neural Information Processing Systems.

A Omitted Proofs in Section 4


In this section, we provide detailed proofs for the lemmas used in Section 4. Recall that the learning rate α_τ = (H + 1)/(H + τ) is similar to the one used by Jin et al. (2018). For notational convenience, let

    α_τ^0 := Π_{j=1}^τ (1 − α_j),   α_τ^i := α_i Π_{j=i+1}^τ (1 − α_j).   (14)

It can be verified that α_τ^0 = 0 for τ ≥ 1 and we define α_0^0 = 1. These quantities are used in the proof of Lemma 3 and have some nice properties, summarized in the following lemma.

Lemma 10 (Jin et al. (2018)). The following properties hold for α_τ^i:

1. 1/√τ ≤ Σ_{i=1}^τ α_τ^i/√i ≤ 2/√τ for every τ ≥ 1.

2. Σ_{i=1}^τ (α_τ^i)² ≤ 2H/τ for every τ ≥ 1.

3. Σ_{i=1}^τ α_τ^i = 1 for every τ ≥ 1 and Σ_{τ=i}^∞ α_τ^i = 1 + 1/H for every i ≥ 1.
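These weights are easy to check numerically; the small sketch below (an illustrative check that we add, not part of the original argument) computes α_τ^i for a given H and τ and verifies the three properties.

    import numpy as np

    def alpha_weights(H, tau):
        """Return (alpha_tau^1, ..., alpha_tau^tau) for alpha_j = (H+1)/(H+j), as in Eq. (14)."""
        alpha = (H + 1.0) / (H + np.arange(1, tau + 1))
        w = np.zeros(tau)
        for i in range(1, tau + 1):
            w[i - 1] = alpha[i - 1] * np.prod(1.0 - alpha[i:])  # alpha_i * prod_{j>i} (1 - alpha_j)
        return w

    H, tau = 10, 50
    w, tol = alpha_weights(H, tau), 1e-8
    assert abs(w.sum() - 1.0) < tol                                    # property 3 (first identity)
    assert (w ** 2).sum() <= 2.0 * H / tau + tol                       # property 2
    s = (w / np.sqrt(np.arange(1, tau + 1))).sum()
    assert 1.0 / np.sqrt(tau) - tol <= s <= 2.0 / np.sqrt(tau) + tol   # property 1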

Also recall the well-known Azuma's inequality:

Lemma 11 (Azuma's inequality). Let X_1, X_2, · · · be a martingale difference sequence with |X_i| ≤ c_i for all i. Then, for any 0 < δ < 1,

    P( Σ_{i=1}^T X_i ≥ √(2 c̄_T² ln(1/δ)) ) ≤ δ,

where c̄_T² := Σ_{i=1}^T c_i².
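As a quick sanity check of this bound (an illustration we add, not part of the proof), one can estimate the tail probability by Monte Carlo for Rademacher increments with c_i = 1:

    import numpy as np

    rng = np.random.default_rng(0)
    T, delta, trials = 1000, 0.05, 20000
    c_bar_sq = float(T)                                     # sum_i c_i^2 with c_i = 1
    threshold = np.sqrt(2 * c_bar_sq * np.log(1.0 / delta))
    sums = rng.choice([-1.0, 1.0], size=(trials, T)).sum(axis=1)
    print((sums >= threshold).mean(), "<= delta =", delta)  # empirical tail is well below delta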

A.1 Proof of Lemma 2


Lemma 2 (Restated). Let V ∗ be the optimal value function in the discounted MDP with discount factor γ and v ∗ be
the optimal value function in the undiscounted MDP. Then,

1. |J ∗ − (1 − γ)V ∗ (s)| ≤ (1 − γ) sp(v ∗ ), ∀s ∈ S,

2. sp(V ∗ ) ≤ 2 sp(v ∗ ).

Proof. 1. Let π ∗ and πγ be the optimal policy under undiscounted and discounted settings, respectively. By Bellman’s
equation, we have

v ∗ (s) = r(s, π ∗ (s)) − J ∗ + Es′ ∼p(·|s,π∗ (s)) v ∗ (s′ ).

Consider a state sequence s_1, s_2, · · · generated by π*. Then, by sub-optimality of π* for the discounted setting, we have

    V*(s_1) ≥ E[ Σ_{t=1}^∞ γ^{t−1} r(s_t, π*(s_t)) | s_1 ]
            = E[ Σ_{t=1}^∞ γ^{t−1} (J* + v*(s_t) − v*(s_{t+1})) | s_1 ]
            = J*/(1 − γ) + v*(s_1) − E[ Σ_{t=2}^∞ (γ^{t−2} − γ^{t−1}) v*(s_t) | s_1 ]
            ≥ J*/(1 − γ) + min_s v*(s) − max_s v*(s) Σ_{t=2}^∞ (γ^{t−2} − γ^{t−1})
            = J*/(1 − γ) − sp(v*),

where the first equality is by the Bellman equation for the undiscounted setting.

Similarly, for the other direction, let s_1, s_2, · · · be generated by π_γ. We have

    V*(s_1) = E[ Σ_{t=1}^∞ γ^{t−1} r(s_t, π_γ(s_t)) | s_1 ]
            ≤ E[ Σ_{t=1}^∞ γ^{t−1} (J* + v*(s_t) − v*(s_{t+1})) | s_1 ]
            = J*/(1 − γ) + v*(s_1) − E[ Σ_{t=2}^∞ (γ^{t−2} − γ^{t−1}) v*(s_t) | s_1 ]
            ≤ J*/(1 − γ) + max_s v*(s) − min_s v*(s) Σ_{t=2}^∞ (γ^{t−2} − γ^{t−1})
            = J*/(1 − γ) + sp(v*),
where the first inequality is by sub-optimality of πγ for the undiscounted setting.
2. Using the previous part, for any s_1, s_2 ∈ S, we have

    |V*(s_1) − V*(s_2)| ≤ |V*(s_1) − J*/(1 − γ)| + |V*(s_2) − J*/(1 − γ)| ≤ 2 sp(v*).

Thus, sp(V*) ≤ 2 sp(v*).

A.2 Proof of Lemma 3


Lemma 3. With probability at least 1 − δ,

    Σ_{t=1}^T (V*(s_t) − Q*(s_t, a_t)) ≤ 4HSA + 24 sp(v*) √(HSAT ln(2T/δ)).

Proof. We condition on the statement of Lemma 12, which happens with probability at least 1 − δ. Let nt ≥ 1 denote
nt+1 (st , at ), that is, the total number of visits to the state-action pair (st , at ) for the first t rounds (including round t).
Also let ti (s, a) denote the timestep at which (s, a) is visited the i-th time. Recalling the definition of αint in Eq. (14), we
have
XT   XT
V̂t (st ) − V ∗ (st ) + (V ∗ (st ) − Q∗ (st , at )) (15)
t=1 t=1
T 
X 
= Q̂t (st , at ) − Q∗ (st , at ) (because at = argmaxa Q̂t (st , a))
t=1
T 
X  XT  
= Q̂t+1 (st , at ) − Q∗ (st , at ) + Q̂t (st , at ) − Q̂t+1 (st , at ) (16)
t=1 t=1
T r T X nt h i

X H 2T X
≤ 12 sp(v ) ln +γ αint V̂ti (st ,at ) (sti (st ,at )+1 ) − V ∗ (sti (st ,at )+1 ) + SAH. (17)
t=1
nt δ t=1 i=1

Here, we apply Lemma 12 to bound the first term of Eq .(16) (note α0nt = 0 by definition since nt ≥ 1), and also bound
the second term of Eq .(16) by SAH since for each fixed (s, a), Q̂t (s, a) is non-increasing in t and overall cannot decrease
by more than H (the initial value).
To bound the third term of Eq. (17) we write:
nt
T X
X h i
γ αint V̂ti (st ,at ) (sti (st ,at )+1 ) − V ∗ (sti (st ,at )+1 )
t=1 i=1
T X nt+1 (s,a) h i
X X
=γ 1[st =s,at =a] αint+1 (s,a) V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 )
t=1 s,a i=1
nT +1 (s,a)
X X X ih
j i
=γ αj V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 ) .
s,a j=1 i=1

By changing the order of summation on i and j, the latter is equal to

(s,a) nT +1 (s,a)
X nT +1
X X h i
γ αij V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 )
s,a i=1 j=i
nT +1 (s,a) h i nT +1 (s,a)
X X X
=γ V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 ) αij
s,a i=1 j=i

PnT +1 (s,a) i P∞ i 1
Now, we can upper bound j=i αj by j=i αj where the latter is equal to 1 + H
by Lemma 10. Since
V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 ) ≥ 0 (by Lemma 12), we can write:

(s,a) h (s,a)
X nT +1
X i nT +1
X
γ V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 ) αij
s,a i=1 j=i
nT +1 (s,a) h iX

X X
≤γ V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 ) αij
s,a i=1 j=i

X nT +1
X (s,a) h i 1

=γ V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 ) 1+
s,a i=1
H
  XT h i
1
= 1+ γ V̂t (st+1 ) − V ∗ (st+1 )
H t=1
  X T h i   T
1 1 Xh i
= 1+ γ V̂t+1 (st+1 ) − V ∗ (st+1 ) + 1 + V̂t (st+1 ) − V̂t+1 (st+1 )
H t=1
H t=1
T
X +1 h i  
1
≤ V̂t (st ) − V ∗ (st ) + 1 + SH.
t=2
H

1

The last inequality is because 1 + H γ ≤ 1 and that for any state s, V̂t (s) ≥ V̂t+1 (s) and the value can decrease by at
most H (the initial value). Substituting in Eq. (17) and telescoping with the left hand side, we have

T T r    
X X H 2T 1
(V ∗ (st ) − Q∗ (st , at )) ≤ 12 sp(v ∗ ) ln + V̂T +1 (sT +1 ) − V ∗ (sT +1 ) + 1 + SH + SAH
t=1 t=1
nt δ H
T r

X H 2T
≤ 12 sp(v ) ln + 4SAH.
t=1
nt δ

PT √
Moreover, √1 ≤ 2 SAT because
t=1 nt

T
X X X 1[s =s,a =a]
T X nT +1 (s,a)
X X p s X √
1 1
p = pt t
= √ ≤ 2 nT +1 (s, a) ≤ 2 SA nT +1 (s, a) = 2 SAT ,
t=1 nt+1 (st , at ) t=1 s,a nt+1 (s, a) s,a j=1
j s,a s,a

where the last inequality is by Cauchy-Schwarz inequality. This finishes the proof.
Lemma 12. With probability at least 1 − δ, for any t = 1, . . . , T and state-action pair (s, a), the following holds:

    0 ≤ Q̂_{t+1}(s, a) − Q*(s, a) ≤ H α_τ^0 + γ Σ_{i=1}^τ α_τ^i [ V̂_{t_i}(s_{t_i+1}) − V*(s_{t_i+1}) ] + 12 sp(v*) √((H/τ) ln(2T/δ)),

where τ = nt+1 (s, a) (i.e., the total number of visits to (s, a) for the first t timesteps), αiτ is defined by (14), and
t1 , . . . , tτ ≤ t are the timesteps on which (s, a) is taken.

Proof. Recursively substituting Qt (s, a) in Eq. (2) of the algorithm, we have


τ
X h i Xτ
Qt+1 (s, a) = Hα0τ + αiτ r(s, a) + γ V̂ti (sti +1 ) + αiτ bi .
i=1 i=1


Moreover, since Σ_{i=1}^τ α_τ^i = 1 (Lemma 10), by the Bellman equation we have

    Q*(s, a) = α_τ^0 Q*(s, a) + Σ_{i=1}^τ α_τ^i [ r(s, a) + γ E_{s′∼p(·|s,a)}[V*(s′)] ].


Taking their difference and adding and subtracting a term γ Σ_{i=1}^τ α_τ^i V*(s_{t_i+1}) leads to:

    Q_{t+1}(s, a) − Q*(s, a) = α_τ^0 (H − Q*(s, a)) + γ Σ_{i=1}^τ α_τ^i [ V̂_{t_i}(s_{t_i+1}) − V*(s_{t_i+1}) ]
        + γ Σ_{i=1}^τ α_τ^i ( V*(s_{t_i+1}) − E_{s′∼p(·|s,a)}[V*(s′)] ) + Σ_{i=1}^τ α_τ^i b_i.

The first term is clearly upper bounded by α_τ^0 H and lower bounded by 0, since Q*(s, a) ≤ Σ_{i=0}^∞ γ^i = 1/(1 − γ) = H.

The third term is a martingale difference sequence with each term boundedqin [−γαiτ sp(V ∗ ), γαiτ sp(V ∗ )]. Therefore,
q by
P
Azuma’s inequality (Lemma 11), its absolute value is bounded by γ sp(V ∗ ) 2 τi=1 (αiτ )2 ln 2T δ
≤ 2γ sp(V ∗
) H
τ
ln 2T
δ

q
4γ sp(v ∗ ) Hτ
ln 2T
δ
with probability at least 1 − Tδ , where the first inequality is by Lemma 10 and the last inequality is
by Lemma 2. Note that when t varies from 1 to T and (s, a) varies over all possible state-action pairs, the third term
only takes T different forms. Therefore,qby taking a union bound over these T events, we have: with probability 1 − δ,
the third term is bounded by 4γ sp(v ∗ ) ln 2T
H
τ
δ
in absolute value for all t and (s, a).
q q
The forth term is lower bounded by 4 sp(v ∗ ) H τ
ln 2T
δ
and upper bounded by 8 sp(V ∗
) H
τ
ln 2T
δ
, by Lemma 10.
n o
Combining all aforementioned upper bounds and the fact Q̂t+1 (s, a) = min Q̂t (s, a), Qt+1 (s, a) ≤ Qt+1 (s, a) we prove
the upper bound
h in the lemma statement. To provei the lower bound, further note that the second term can be written as
P
γ τi=1 αiτ maxa Q̂ti (sti +1 , a) − maxa Q∗ (sti +1 , a) . Using a direct induction with all aforementioned lower bounds and
n o
the fact Q̂t+1 (s, a) = min Q̂t (s, a), Qt+1 (s, a) we prove the lower bound in the lemma statement as well.

A.3 Proof of Lemma 4


Lemma 4. With probability at least 1 − δ,

    Σ_{t=1}^T (Q*(s_t, a_t) − γV*(s_t) − r(s_t, a_t)) ≤ 2 sp(v*) √(2T ln(1/δ)) + 2 sp(v*).

Proof. By Bellman equation  for the discounted problem, we have Q∗ (st , at ) − γV ∗ (st ) − r(st , at ) =
γ Es′ ∼p(·|st ,at ) [V ∗ (s′ )] − V ∗ (st ) . Adding and subtracting V ∗ (st+1 ) and summing over t we will get

T
X T
X T
X

(Q∗ (st , at ) − γV ∗ (st ) − r(st , at )) = γ Es′ ∼p(·|st ,at ) [V ∗ (s′ )] − V ∗ (st+1 ) + γ (V ∗ (st+1 ) − V ∗ (st ))
t=1 t=1 t=1

The summands of the first term on the right hand side constitute a martingale difference sequence. Thus, byqAzuma’s
inequality (Lemma 11) and the fact that sp(V ∗ ) ≤ 2 sp(v ∗ ) (Lemma 2), this term is upper bounded by 2γ sp(v ∗ ) 2T ln δ1 ,
with probability at least 1 − δ. The second term is equal to γ(V ∗ (sT +1 ) − V ∗ (s1 )) which is upper bounded by 2γ sp(v ∗ ).
Recalling γ < 1 completes the proof.

B Omitted Proofs in Section 5 — Proofs for Lemma 6 and Lemma 7


B.1 Auxiliary Lemmas
In this subsection, we state several lemmas that will be helpful in the analysis.
Lemma 13 ((Levin and Peres, 2017, Section 4.5)). Define
n o
tmix (ǫ) := max min t ≥ 1 k(P π )t (s, ·) − µπ k1 ≤ ǫ, ∀s ,
π

so that tmix = tmix ( 14 ). We have


 
1
tmix (ǫ) ≤ log2 tmix
ǫ

for any ǫ ∈ (0, 12 ].


Corollary 13.1. For an ergodic MDP with mixing time tmix , we have

−t t
k(P π )t (s, ·) − µπ k1 ≤ 2 · 2 mix , ∀π, s

for all π and all t ≥ 2tmix .

Proof. Lemma 13 implies for any ǫ ∈ (0, 12 ], as long as t ≥ ⌈log 2 (1/ǫ)⌉tmix , we have

k(P π )t (s, ·) − µπ k1 ≤ ǫ.

t −t t
This condition can be satisfied by picking log2 (1/ǫ) = tmix
− 1, which leads to ǫ = 2 · 2 mix .

Corollary 13.2. Let N = 4tmix log 2 T . For an ergodic MDP with mixing time tmix < T /4, we have for all π:

X 1
k(P π )t (s, ·) − µπ k1 ≤ .
t=N
T3

Proof. By Corollary 13.1,


X ∞
X −tN
−t t 2·2 mix 2tmix − N 2tmix 1 1
k(P π )t (s, ·) − µπ k1 ≤ 2·2 mix = ≤ · 2 · 2 tmix = · 2· 4 ≤ 3.
−t 1 ln 2 ln 2 T T
t=N t=N 1−2 mix

Lemma 14 (Stated in (Wang, 2017) without proof). For an ergodic MDP with mixing time tmix , and any π, s, a,

|v π (s)| ≤ 5tmix ,
π
|q (s, a)| ≤ 6tmix .

Proof. Using the identity of Eq. (3) we have


X
|v π (s)| = ((P π )t (s, ·) − µπ )⊤ r π
t=0

X
≤ (P π )t (s, ·) − µπ 1
kr π k∞
t=0
2tmix −1
X ∞ (i+1)t
X X mix −1

≤ (P π )t (s, ·) − µπ 1
+ (P π )t (s, ·) − µπ 1
t=0 i=2 t=itmix

X
≤ 4tmix + 2 · 2−i tmix (by (P π )t (s, ·) − µπ 1
≤ 2 and Corollary 13.1)
i=2

≤ 5tmix ,

and thus

|q π (s, a)| = r(s, a) + Es′ ∼p(·|s,a) [v π (s′ )] ≤ 1 + 5tmix ≤ 6tmix .

Lemma 15 ((Neu et al., 2013, Lemma 2)). For any two policies π, π̃,

    J^{π̃} − J^π = Σ_s Σ_a μ^{π̃}(s) (π̃(a|s) − π(a|s)) q^π(s, a).

Proof. Using Bellman equation we have

XX
µπ̃ (s)π̃(a|s)q π (s, a)
s a
!
XX π̃ π
X ′ π ′
= µ (s)π̃(a|s) r(s, a) − J + p(s |s, a)v (s )
s a s′
X
= J π̃ − J π + µπ̃ (s′ )v π (s′ )
s′
π̃ π
X
= J −J + µπ̃ (s)v π (s)
s
XX
= J π̃ − J π + µπ̃ (s)π(a|s)q π (s, a),
s a

P P P
where the second equality uses the facts J π̃ = s a µπ̃ (s)π̃(a|s)r(s, a) and s,a µπ̃ (s)π̃(a|s)p(s′ |s, a) = µπ̃ (s′ ). Rear-
ranging gives the desired equality.

Lemma 16. Let I = {t1 + 1, t1 + 2, . . . , t2 } be a certain period of an episode k of Algorithm 2 with |I| ≥ N = 4tmix log 2 T .
Then for any s, the probability that the algorithm never visits s in I is upper bounded by

j k
  |I|
3µπk (s) N
1− .
4

t 
2 −t1
Proof. Consider a subset of I: {t1 + N, t1 + 2N, . . .} which consists of at least N
rounds that are at least N -step
away from each other. By Corollary 13.1, we have for any i,

−tN 2
Pr[st1 +iN = s st1 +(i−1)N ] − µπk (s) ≤ 2 · 2 mix ≤ 2 · 2−4 log2 T ≤ ,
T4

that is, conditioned on the state at time t1 + (i − 1)N , the state distribution at time t1 + iN is close to the stationary
distribution induced by πk . Therefore we further have Pr[st1 +iN = s st1 +(i−1)N ] ≥ µπk (s) − T24 ≥ 43 µπk (s), where
the last step uses the fact µπk (s) ≥ t 1 ≥ T4 . The probability that the algorithm does not visit s in any of the rounds
hit
{t1 + N, t1 + 2N, . . .} is then at most

 j t2 −t1 k  j |I| k
3µπk (s) N 3µπk (s) N
1− = 1− ,
4 4

finishing the proof.

B.2 Proof for Lemma 6

Proof for Eq.(11). In this proof, we consider a specific episode k and a specific state s. For notation simplicity, we use
π for πk throughout this proof, and all the expectations or probabilities are conditioned on the history before episode k.
Suppose that when Algorithm 2 calls EstimateQ in episode k for state s, it finds M disjoint intervals that starts from s.
Denote the reward estimators corresponding to the i-th interval as βbk,i (s, ·) (i.e., the yi (·) in Algorithm 3), and the time
when the i-th interval starts as τi (thus sτi = s). Then by the algorithm, we have

( PM b
i=1 βk,i (s,a)
if M > 0,
βbk (s, a) = M (18)
0 if M = 0.

Since each βbk,i (s, a) is constructed by a length-(N + 1) trajectory starting from s at time τi ≤ kB − N , we can calculate
its conditional expectation as follows:
h i
E βbk,i (s, a) sτi = s
hP i
τi +N
r(s, a) + E t=τi +1 r(s t , a t ) (s τ i , a τ i ) = (s, a)
= Pr[aτi = a | sτi = s] ×
π(a|s)
" τ +N #
X ′
X
i

= r(s, a) + p(s |s, a)E r(st , at ) sτi +1 = s
s′ t=τi +1

X N−1
X
= r(s, a) + p(s′ |s, a) e⊤ π j π
s′ (P ) r
s′ j=0

X N−1
X
= r(s, a) + p(s′ |s, a) (e⊤ π j π ⊤ π
s′ (P ) − (µ ) )r + N J
π
(because µπ⊤ r π = J π )
s′ j=0

X X ∞
X
= r(s, a) + p(s′ |s, a)v π (s′ ) + N J π − p(s′ |s, a) (e⊤ π j π ⊤ π
s′ (P ) − (µ ) )r (By Eq. (3))
s′ s′ j=N

= q π (s, a) + N J π − δ(s, a)
= β π (s, a) − δ(s, a), (19)
P P∞
where δ(s, a) , s′ p(s′ |s, a) ⊤ π j
j=N (es′ (P ) − (µπ )⊤ )r π . By Corollary 13.2,

1
|δ(s, a)| ≤ . (20)
T3
Thus,
h i 1
E βbk,i (s, a) sτi = s − β π (s, a) ≤ 3 .
T

This shows that βbk,i (s, a) is an almost unbiased estimator for β π conditioned on all history before τi . Also, by our selection
of the episode length, M > 0 will happen with very high probability according to Lemma 16. These facts seem to indicate
that βbk (s, a) – an average of several βbk,i (s, a) – will also be an almost unbiased estimator for β π (s, a) with small error.
However, a caveat here is that the quantity M in Eq.(18) is random, and it is not independent from the reward estimators
PM b b π
i=1 βk,i (s, a). Therefore, to argue that the expectation of E[βk (s, a)] is close to β (s, a), more technical work is needed.
b π
Specifically, we use the following two steps to argue that E[βk (s, a)] is close to β (s, a).
Step 1. Construct an imaginary world where βbk (s, a) is an almost unbiased estimator of β π (s, a).
Step 2. Argue that the expectation of β̂_k(s, a) in the real world and the expectation of β̂_k(s, a) in the imaginary world are close.

[Figure 1: An illustration for the sub-algorithm EstimateQ with target state s. The red round points indicate that the algorithm "starts to wait" for a visit to s. When the algorithm reaches s (the blue stars) at time τ_i, it starts to record the sum of rewards in the following N + 1 steps, i.e. Σ_{t=τ_i}^{τ_i+N} r(s_t, a_t). This is used to construct β̂_{k,i}(s, ·). The next point at which the algorithm "starts to wait for s" would be τ_i + 2N, if this is still no later than kB − N.]

Step 1. We first examine what EstimateQ sub-algorithm does in an episode k for a state s. The goal of this sub-
algorithm is to collect disjoint intervals of length N + 1 that start from s, calculate a reward estimator from each of them,

and finally average the estimators over all intervals to get a good estimator for β π (s, ·). However, after our algorithm
collects an interval [τ, τ + N ], it rests for another N steps before starting to find the next visit to s – i.e., it restarts from
τ + 2N (see Line 9 in EstimateQ (Algorithm 3), and also the illustration in Figure 1).
The goal of doing this is to de-correlate the observed reward and the number of collected intervals: as shown in Eq.(18),
these two quantities affect the numerator and the denominator of βbk (s, ·) respectively, and if they are highly correlated,
then βbk (s, ·) may be heavily biased from β π (s, ·). On the other hand, if we introduce the “rest time” after we collect
each interval (i.e., the dashed segments in Figure 1), then since the length of the rest time (N ) is longer than the mixing
time, the process will almost totally “forget” about the reward estimators collected before. In Figure 1, this means that
the state distributions at the red round points (except for the left most one) will be close to µπ when conditioned on all
history that happened N rounds ago.
We first argue that if the process can indeed “reset its memory” at those red round points in Figure 1 (except for the
left most one), then we get almost unbiased estimators for β π (s, ·). That is, consider a process like in Figure 2 where
everything remains same as in EstimateQ except that after every rest interval, the state distribution is directly reset to
the stationary distribution µπ .

[Figure 2: The imaginary world: the same process as in Figure 1, except that after every rest interval the state distribution is reset to the stationary distribution μ^π.]

Below we calculate the expectation of βbk (s, a) in this imaginary world. As specified in Figure 2, we use τi to denote
the i-th time EstimateQ starts to record an interval (therefore sτi = s), and let wi = τi − (τi−1 + 2N ) for i > 1 and
w1 = τ1 − ((k − 1)B + 1) be the “wait time” before starting the i-th interval. Note the following facts in the imaginary
world:

1. M is determined by the sequence w1 , w2 , . . . because all other segments in the figures have fixed length.

2. w1 only depends on s(k−1)B+1 and P π , and wi only depends on the stationary distribution µπ and P π because of
the reset.

The above facts imply that in the imaginary world, w1 , w2 , . . ., as well as M , are all independent from
βbk,1 (s, a), βbk,2 (s, a), . . .. Let E′ denote the expectation in the imaginary world. Then
" #
h i 1 X ′ hb
M i
E βbk (s, a) = Pr[w1 ≤ B − N ] × E′{wi }

E βk,i (s, a) {wi } w1 ≤ B − N + Pr[w1 > B − N ] × 0
M i=1
" M
!#
1 X π
= Pr[w1 ≤ B − N ] × E′{wi } β (s, a) − δ(s, a) (by the same calculation as in (19))
M i=1
= Pr[w1 ≤ B − N ] × (β π (s, a) − δ(s, a))
= β π (s, a) − δ ′ (s, a), (21)

where E′{wi } denotes the expectation over the randomness of w1 , w2 , . . ., and δ ′ (s, a) = (1 − Pr[w1 ≤ B −
  B−N
N
N ]) (β π (s, a) − δ(s, a)) + δ(s, a). By Lemma 16, we have Pr[w1 ≤ B − N ] ≥ 1 − 1 − 4t3hit = 1 −
 4thit log2 T −1
1 − 4t3hit ≥ 1 − T13 . Together with Eq. (20) and Lemma 14, we have
 
1 1 1 1 1
|δ ′ (s, a)| ≤ (|β π (s, a)| + |δ(s, a)|) + |δ(s, a)| ≤ 3 (6tmix + N + 3 ) + 3 = O ,
T3 T T T T2
Wei, Jafarnia-Jahromi, Luo, Sharma, Jain

and thus
h i  
1
E′ βbk (s, a) − β π (s, a) = O . (22)
T2

Step 2. Note that βbk (s, a) is a deterministic function of X = (M, τ1 , T1 , τ2 , T2 , . . . , τM , TM ), where Ti =


(aτi , sτi +1 , aτi +1 , . . . , sτi +N , aτi +N ). We use βbk (s, a) = f (X) to denote this mapping. To say E[βbk (s, a)] and E′ [βbk (s, a)]
are close, we bound their ratio:

P
E[βbk (s, a)] f (X)P(X) P(X)
= PX ′
≤ max ′ , (23)
′ b
E [βk (s, a)] X f (X)P (X)
X P (X)

where we use P and P′ to denote the probability mass function in the real world and the imaginary world respectively,
and in the last inequality we use the non-negativeness of f (X).
For a fixed sequence of X, the probability of generating X in the real world is

P(X) = P(τ1 ) × P(T1 |τ1 ) × P(τ2 |τ1 , T1 ) × P(T2 |τ2 ) × · · · × P(τM |τM −1 , TM −1 )
h i
× P(TM |τM ) × Pr st 6= s, ∀t ∈ [τM + 2N, kB − N ] τM , TM . (24)

In the imaginary world, it is

P′ (X) = P(τ1 ) × P(T1 |τ1 ) × P′ (τ2 |τ1 , T1 ) × P(T2 |τ2 ) × · · · × P′ (τM |τM −1 , TM −1 )
h i
× P(TM |τM ) × Pr st 6= s, ∀t ∈ [τM + 2N, kB − N ] τM , TM . (25)

Their difference only comes from P(τi+1 |τi , Ti ) 6= P′ (τi+1 |τi , Ti ) because of the reset. Note that

X h i
P(τi+1 |τi , Ti ) = P(sτi +2N = s′ |τi , Ti ) × Pr st 6= s, ∀t ∈ [τi + 2N, τi+1 − 1], sτi+1 = s sτi +2N = s′ , (26)
s′ 6=s
X h i
P′ (τi+1 |τi , Ti ) = P′ (sτi +2N = s′ |τi , Ti ) × Pr st 6= s, ∀t ∈ [τi + 2N, τi+1 − 1], sτi+1 = s sτi +2N = s′ . (27)
s′ 6=s

Because of the reset in the imaginary world, P′ (sτi +2N = s′ |τi , Ti ) = µπ (s′ ) for all s′ ; in the real world, since at time
τi + 2N , the process has proceeded N steps from τi + N (the last step of Ti ), by Corollary 13.1 we have

P(sτi +2N = s′ |τi , Ti ) P(sτi +2N = s′ |τi , Ti ) − µπ (s′ ) 2 1


= 1 + ≤1+ 4 π ′ ≤1+ 3 for all s′ ,
P′ (sτi +2N = s′ |τi , Ti ) µπ (s′ ) T µ (s ) T

P(τ |τ ,T ) P(X) 
1 M
M 1
which implies P′ (τi+1 i i
i+1 |τi ,Ti )
≤ 1 + T13 by (26) and (27) . This further implies P′ (X)
≤ 1+ T3
≤ e T3 ≤ e T2 ≤ 1 + 2
T2
by (24) and (25). From (23), we then have

E[βbk (s, a)] 2


≤ 1+ 2.
E′ [βbk (s, a)] T

Thus, using the bound from Eq. (22) we have


       
2 2 1 1
E[βbk (s, a)] ≤ 1 + 2 E′ [βbk (s, a)] ≤ 1 + 2 βk (s, a) + O ≤ βk (s, a) + O .
T T T2 T


Similarly we can prove the other direction: βk (s, a) ≤ E[βbk (s, a)] + O 1
T
, finishing the proof.

Proof for Eq.(12). We use the same notations, and the similar approach as in the previous proof for Eq. (11). That is,
we first bound the expectation of the desired quantity in the imaginary world, and then argue that the expectation in the
imaginary world and that in the real world are close.
Model-free RL in Infinite-horizon Average-reward MDPs

Step 1. Define ∆i = βbk,i (s, a) − β π (s, a) + δ(s, a). Then E′ [∆i | {wi }] = 0 by Eq.(19). Thus in the imaginary world,
 2 
E′ βbk (s, a) − β π (s, a))
 ! 
1 XM   2
= E′  βbk,i (s, a) − β π (s, a) 1[M > 0] + β π (s, a)2 1[M = 0]
M i=1
 !2 
XM
1
= E′  ∆i − δ(s, a) 1[M > 0] + β π (s, a)2 1[M = 0]
M i=1
 !2  
XM
1
≤ E′ 2 ∆i + 2δ(s, a)2  1[M > 0] + β π (s, a)2 1[M = 0] (using (a − b)2 ≤ 2a2 + 2b2 )
M i=1
  !2  
XM
1
≤ Pr[w1 ≤ B − N ] × E′{wi } E′ 2 ∆i + 2δ(s, a)2 {wi } w1 ≤ B − N  + Pr[w1 > B − N ] × (N + 6tmix )2
M i=1
(β π (s, a) ≤ N + 6tmix by Lemma 14)
  !2  
M  
1 X 1
≤ E′{wi } E′ 2 ∆i {wi } w1 ≤ B − N  + O
M i=1 T
  B−N
3 N 1
(using Lemma 16: Pr[w1 > B − N ] ≤ 1 − 4thit
≤ T3
.)
" M
#  
2 X ′ 2  1
≤ E′{wi } 2
E ∆i
{wi } w1 ≤ B − N + O
M i=1 T
(∆i is zero-mean and independent of each other conditioned on {wi })
" #  
2 O(N 2 ) 1
≤ E′{wi } · M × w 1 ≤ B − N + O
M2 π(a|s) T
2
O(N 2 )
O(N )
(E′ [∆2i ] ≤ π(a|s) π(a|s) 2 = π(a|s)
by definition of βbk (s, a), Lemma 14, and Eq. (20))
   
O(N 2 ) ′ 1 1
≤ E w1 ≤ B − N + O . (28)
π(a|s) M T

Since Pr′ [M = 0] ≤ 1
T3
by Lemma 16, we have Pr′ [w1 ≤ B − N ] = Pr′ [M > 0] ≥ 1 − 1
T3
. Also note that if

B−N
M < M0 := ,
2N + 4N log T
µπ (s)

4N log T
then there exists at least one waiting interval (i.e., wi ) longer than µπ (s)
(see Figure 1 or 2) . By Lemma 16, this
 π
 4 log T
µπ (s)
happens with probability smaller than 1 − 3µ 4(s) ≤ T13 .

Therefore,
  P∞ 1
′ 1

m=1 m Pr [M = m]
1 × Pr′ [M < M0 ] + M10 × Pr′ [M ≥ M0 ]
E M >0 = ′ ≤
M Pr [M > 0] Pr′ [M > 0]
2N+ 4N log T
µπ (s)  
1
1× T3
+ B−N N log T
≤ 1 ≤O .
1− T3
Bµπ (s)

Combining with (28), we get


 2   
N 3 log T
E′ βbk (s, a) − β π (s, a)) ≤O .
Bπ(a|s)µπ (s)

Step 2. By the same argument as in the “Step 2” of the previous proof for Eq. (11), we have
 2     2   
2 N 3 log T
E βbk (s, a) − β π (s, a)) ≤ 1+ E′ βbk (s, a) − β π (s, a)) ≤O ,
T2 Bπ(a|s)µπ (s)
which finishes the proof.
Wei, Jafarnia-Jahromi, Luo, Sharma, Jain

B.3 Proof for Lemma 7


Proof. We defer the proof of Eq. (13) to Lemma 17 and prove the rest of the statements assuming Eq. (13). First, we
have
XX
|J πk − J πk−1 | = µπk (s) (πk (a|s) − πk−1 (a|s)) q πk−1 (s, a) (By Lemma 15)
s a
XX
≤ µπk (s) |(πk (a|s) − πk−1 (a|s))| |q πk−1 (s, a)|
s a
!
XX πk
=O µ (s)N ηπk−1 (a|s)tmix (By Eq. (13) and Lemma 14)
s a

= O (ηtmix N ) = O(ηN 2 ). (29)

Next, to prove a bound on |v πk (s) − v πk−1 (s)|, first note that for any policy π,
∞ 
X 
v π (s) = e⊤ π n π ⊤
s (P ) − (µ ) rπ (By Eq. (3))
n=0

X
N−1  ∞ 
X 
= e⊤ π n π ⊤
s (P ) − (µ ) rπ + e⊤ π n π ⊤
s (P ) − (µ ) rπ
n=0 n=N
N−1
X
= e⊤ π n π π π
s (P ) r − N J + error (s), (J π = (µπ )⊤ r π )
n=0

P∞ ⊤
where errorπ (s) := n=N e⊤ π n
s (P ) − µ
π
r π . By Corollary 13.2, |errorπ (s)| ≤ 1
T2
. Thus

N−1
X N−1
X 2
|v πk (s) − v πk−1 (s)| = e⊤
s ((P
πk n
) − (P πk−1 )n ) r πk + e⊤
s (P
πk−1 n πk
) (r − r πk−1 ) − N J πk + N J πk−1 +
n=0 n=0
T2
N−1
X N−1
X 2
≤ k((P πk )n − (P πk−1 )n ) r πk k∞ + kr πk − r πk−1 k∞ + N |J πk − J πk−1 | + . (30)
n=0 n=0
T2

Below we bound each individual term above (using notation π ′ := πk , π := πk−1 , P ′ := P πk , P := P πk−1 , r ′ := r πk , r :=
r πk−1 , µ := µπk−1 for simplicity). The first term can be bounded as

k(P ′n − P n )r ′ k∞

= k P ′ (P ′n−1 − P n−1 ) + (P ′ − P )P n−1 r ′ k∞
≤ kP ′ (P ′n−1 − P n−1 )r ′ k∞ + k(P ′ − P )P n−1 r ′ k∞
≤ k(P ′n−1 − P n−1 )r ′ k∞ + k(P ′ − P )P n−1 r ′ k∞ (because every row of P ′ sums to 1)
= k(P ′n−1 − P n−1 )r ′ k∞ + max e⊤ ′
s (P − P )P
n−1 ′
r
s

≤ k(P ′n−1 − P n−1 )r ′ k∞ + max ke⊤ ′


s (P − P )P
n−1
k1 ,
s

where the last term can be further bounded by

max ke⊤ ′
s (P − P )P
n−1
k1 ≤ max ke⊤ ′
s (P − P )k1
s s
!
X X ′ ′
= max (π (a|s) − π(a|s))p(s |s, a)
s
s′ a
!!
XX ′
≤O max ηN π(a|s)p(s |s, a) (By Eq. (13))
s
s′ a

= O (ηN ) .

Repeatedly applying this bound we arrive at k(P ′n − P n )r ′ k∞ ≤ O ηN 2 , and therefore,
N−1
X 
k((P πk )n − (P πk−1 )n ) r πk k∞ ≤ O ηN 3 .
n=0
Model-free RL in Infinite-horizon Average-reward MDPs

The second term in Eq. (30) can be bounded as (by Eq. (13) again)
N−1 N−1 N−1
!
X ′
X X ′
X X 
kr − rk∞ = max (π (a|s) − π(a|s))r(s, a) ≤ O max ηN π(a|s) = O ηN 2 ,
s s
n=0 n=0 a n=0 a

πk
and the third term in Eq. (30) is bounded via the earlier proof (for bounding |J − J πk−1 |):

N |J πk − J πk−1 | = O ηN 3 . (Eq.(29))

Plugging everything into Eq.(30), we prove |v πk (s) − v πk−1 (s)| = O ηN 3 .
Finally, it is straightforward to prove the rest of the two statements:

|q πk (s, a) − q πk−1 (s, a)| = r(s, a) + Es′ ∼p(·|s,a) [v πk (s′ )] − r(s, a) − Es′ ∼p(·|s,a) [v πk−1 (s′ )]

= Es′ ∼p(·|s,a) [v πk (s′ ) − v πk−1 (s′ )] = O ηN 3 .


|β πk (s, a) − β πk−1 (s, a)| ≤ |q πk (s, a) − q πk−1 (s, a)| + N |J πk − J πk−1 | = O ηN 3 .

This completes the proof.

C Analyzing Optimistic Online Mirror Descent with Log-barrier Regularizer —


Proofs for Eq.(13), Lemma 8, and Lemma 9
In this section, we derive the stability property (Eq.(13)) and the regret bound (Lemma 8 and Lemma 9) for optimistic
online mirror descent with the log-barrier regularizer. Most of the analysis is similar to that in (Wei and Luo, 2018; Bubeck
et al., 2019). Since in our MDP-OOMD algorithm, we run optimistic online mirror descent independently on each state,
the analysis in this section only focuses on a specific state s. We simplify our notations using πk (·) := πk (·|s), πk′ (·) :=
πk′ (·|s), βbk (·) := βbk (s, ·) throughout the whole section.
Our MDP-OOMD algorithm is effectively running Algorithm 5 on each state. We first verify that the condition in
Line 7 of Algorithm 5 indeed holds in our MDP-OOMD algorithm. Recall that in EstimateQ (Algorithm 3) we collect
trajectories in every episode for every state. Suppose for episode k and state P s it collects M trajectoriesPthat start from
time τ1 , . . . , τM and has total reward R1 , . . . , RM respectively. Let ma = M
i=1 1[aτi = a], then we have a ma = M . By
our way of constructing βbk (s, ·), we have

XM
Ri 1[aτi = a]
βbk (s, a) =
i=1
M πk (a|s)

P P P Ri 1[aτi =a] P
when M > 0. Thus we have a πk (a|s)βbk (s, a) = a M i=1 M
= M Ri
i=1 M ≤ (N + 1) because every Ri is the
total reward for an interval of length N + 1. This verifies the condition in Line 7 for the case M > 0. When M = 0,
b ·) to zero so the condition clearly still holds.
EstimateQ sets β(s,

C.1 The stability property of Algorithm 5 — Proof of Eq.(13)


The statement and the proofs of Lemmas 17 and 18 are almost identical to those of Lemma 9 and 10 in (Bubeck et al.,
2019).
1 1
Lemma 17. In Algorighm 5, if η ≤ 270C
= 270(N+1)
, then

|πk+1 (a) − πk (a)| ≤ 120ηCπk (a).


To prove this lemma we make use of the following auxiliary result, where we use the notation kakM = a⊤ M a for a
vector a ∈ RA and a positive semi-definite matrix M ∈ RA×A .
1
Lemma 18. For some arbitrary b1 , b2 ∈ RA , a0 ∈ ∆A with η ≤ 270C
, define
(
a1 = argmina∈∆A F1 (a), where F1 (a) , ha, b1 i + Dψ (a, a0 ),
a2 = argmina∈∆A F2 (a), where F2 (a) , ha, b2 i + Dψ (a, a0 ).

(ψ and Dψ are defined in Algorithm 5). Then as long as kb1 −b2 k∇−2 ψ(a1 ) ≤ 12 ηC, we have for all i ∈ [A], |a2,i −a1,i | ≤
60ηCa1,i .
Wei, Jafarnia-Jahromi, Luo, Sharma, Jain

Algorithm 5: Optimistic Online Mirror Descent (OOMD) with log-barrier regularizer


1 Define:
2 C := N + 1
PA
3
1
Regularizer ψ(x) = η1 a=1 log x(a) , for x ∈ RA
+
4 Bregman divergence associated with ψ:

Dψ (x, x′ ) = ψ(x) − ψ(x′ ) − h∇ψ(x′ ), x − x′ i


1
5 Initialization: π1′ = π1 = A 1
6 for k = 1, . . . , K do
P
7 Receive βbk ∈ RA + for which
b
a πk (a)βk (a) ≤ C.
8 Update
n o

πk+1 = argmax hπ, βbk i − Dψ (π, πk′ )
π∈∆A
n o
πk+1 = argmax hπ, βbk i − Dψ (π, πk+1

)
π∈∆A

√ √
Proof of Lemma 18. First, we prove ka1 − a2 k∇2 ψ(a1 ) ≤ 60 ηC by contradiction. Assume ka1 − a2 k∇2 ψ(a1 ) > 60 ηC.
′ ′ √
Then there exists some a2 lying in the line segment between a1 and a2 such that ka1 − a2 k∇2 ψ(a1 ) = 60 ηC. By Taylor’s
theorem, there exists a that lies in the line segment between a1 and a′2 such that

1 ′
F2 (a′2 ) = F2 (a1 ) + h∇F2 (a1 ), a′2 − a1 i + ka2 − a1 k2∇2 F2 (a)
2
1 ′
= F2 (a1 ) + hb2 − b1 , a′2 − a1 i + h∇F1 (a1 ), a′2 − a1 i + ka − a1 k2∇2 ψ(a)
2 2
1 ′
≥ F2 (a1 ) − kb2 − b1 k∇−2 ψ(a1 ) ka′2 − a1 k∇2 ψ(a1 ) + ka2 − a1 k2∇2 ψ(a)
2
√ √ 1
≥ F2 (a1 ) − 12 ηC × 60 ηC + ka′2 − a1 k2∇2 ψ(a) (31)
2
where in the first inequality we use Hölder inequality
√ and the first-order optimality
√ condition, and in the last inequality
we use the conditions kb1 − b2 k∇−2 ψ(a1 ) ≤ 12 ηC and ka1 − a′2 k∇2 ψ(a1 ) = 60 ηC. Note that ∇2 ψ(x) is a diagonal matrix
and ∇2 ψ(x)ii = η1 x12 . Therefore for any i ∈ [A],
i
v
u A
√ uX (a′2,j − a1,j )2 |a′2,i − a1,i |
60 ηC = ka′2 − a1 k∇2 ψ(a1 ) =t ≥ √
j=1
ηa21,j ηa1,i

|a′2,i −a1,i |
n a′ o
2 2,i a1,i 9
and thus a1,i
≤ 60ηC ≤ 9
, which implies max a1,i
, a′2,i
≤ 7
. Thus the last term in (31) can be lower bounded
by
A  2 XA
1X 1 ′ 1 7 1
ka′2 − a1 k2∇2 ψ(a) = (a 2,i − a 1,i ) 2
≥ (a′2,i − a1,i )2
η i=1 a2i η 9 i=1
a 2
1,i

≥ 0.6ka′2 − a1 k2∇2 ψ(a1 ) = 0.6 × (60 ηC)2 = 2160ηC 2 .

Combining with (31) gives


1
F2 (a′2 ) ≥ F2 (a1 ) − 720ηC 2 + × 2160ηC 2 > F2 (a1 ).
2
Recall that a′2 is a point in the line segment between a1 and a2 . By the convexity of F2 , the above inequality implies
F2 (a1 ) < F2 (a2 ), contradicting the optimality of a2 .
r
√ PA (a1,j −a2,j )2 |a −a1,i |
Thus we conclude ka1 − a2 k∇2 ψ(a1 ) ≤ 60 ηC. Since ka1 − a2 k∇2 ψ(a1 ) = j=1 ηa2
≥ 2,i

ηa1,i
for all i, we get
1,j
|a2,i −a1,i | √

ηa1,i
≤ 60 ηC, which implies |a2,i − a1,i | ≤ 60ηCa1,i .
Model-free RL in Infinite-horizon Average-reward MDPs

Proof of Lemma 17. We prove the following stability inequalities



πk (a) − πk+1 (a) ≤ 60ηCπk (a), (32)

πk+1 (a) − πk+1 (a) ≤ 60ηCπk (a). (33)
Note that (32) and (33) imply
|πk (a) − πk+1 (a)| ≤ 120ηCπk (a), (34)
which is the inequality we want to prove.
We use induction on k to prove (32) and (33). Note that (32) implies
′ 60
πk+1 (a) ≤ πk (a) + 60ηCπk (a) ≤ πk (a) + πk (a) ≤ 2πk (a), (35)
270
and (34) implies
120
πk+1 (a) ≤ πk (a) + 120ηCπk (a) ≤ πk (a) + πk (a) ≤ 2πk (a). (36)
270
Thus, (35) and (36) are also inequalities we may use in the induction process.

Base case. For the case k = 1, note that


(
π1 = argminπ∈∆A Dψ (π, π1′ ), (because π1 = π1′ )
π2 = argminπ∈∆A hπ, −βb1 i + Dψ (π, π1 ).
′ ′


To apply Lemma 18 and obtain (32), we only need to show kβb1 k∇−2 ψ(π1 ) ≤ 12 ηC. Recall ∇2 ψ(u)ii = 1 1
η u2
and
i
−2
∇ ψ(u)ii = ηu2i . Thus,
A
X
kβb1 k2∇−2 ψ(π1 ) ≤ ηπ1 (a)2 βb1 (a)2 ≤ ηC 2
a=1

P P 2
because a π1 (a)2 βb1 (a)2 ≤ b
a π1 (a)β1 (a) ≤ C 2 by the condition in Line 7 of Algorithm 5. This proves (32) for the
base case.
Now we prove (33) of the base case. Note that
( ′ ′
π2 = argminπ∈∆A D
D ψ (π, π2E), (37)
π2 = argmin π, −βb1 + Dψ (π, π2′ ).
π∈∆A


Similarly, with the help of Lemma 18, we only need to show kβb1 k∇−2 ψ(π2′ ) ≤ 12 ηC. This can be verified by
A
X A
X
kβb1 k2∇−2 ψ(π′ ) ≤ ηπ2′ (a)2 βb1 (a)2 ≤ 4 ηπ1 (a)2 βb1 (a)2 ≤ 4ηC 2 ,
2
a=1 a=1

where the second inequality uses (35) for the base case (implied by (32) for the base case, which we just proved).

Induction. Assume (32) and (33) hold before k. To prove (32), observe that
( D E
πk = argminπ∈∆A π, −βbk−1 + Dψ (π, πk′ ),
(38)

πk+1 = argmin hπ, −βbk i + Dψ (π, πk′ ).
π∈∆A


To apply Lemma 18 and obtain (32), we only need to show kβbk − βbk−1 k∇−2 ψ(πk ) ≤ 12 ηC. This can be verified by
A
X  2
kβbk − βbk−1 k2∇−2 ψ(πk ) ≤ ηπk (a)2 βbk (a) − βbk−1 (a)
a=1
A
X  
≤ 2η πk (a)2 βbk (a)2 + βbk−1 (a)2
a=1
A
X A
X
≤ 2η πk (a)2 βbk (a)2 + 2η 4πk−1 (a)2 βbk−1 (a)2
a=1 a=1

≤ 10ηC 2 ,
Wei, Jafarnia-Jahromi, Luo, Sharma, Jain

where the third inequality uses (36) for k − 1.


To prove (33), we observe:
( ′ ′
πk+1 = argminπ∈∆A D
D ψ (π, πk+1
E ), (39)
πk+1 = argmin π, −βbk + Dψ (π, πk+1
π∈∆A

).

Similarly, with the help of Lemma 18, we only need to show kβbk k∇−2 ψ(π′ ) ≤ 12 ηC. This can be verified by
k+1

A
X A
X
kβbk k2∇−2 ψ(π′ ) ≤ ′
ηπk+1 (a)2 βbk (a)2 ≤ 4 ηπk (a)2 βbk (a)2 ≤ 4ηC 2 ,
k+1
a=1 a=1

where in the second inequality we use (35) (implied by (32), which we just proved). This finishes the proof.

C.2 The regret bound of Algorithm 5 — Proof of Lemma 8


Proof of Lemma 8. By standard analysis for optimistic online mirror descent (e.g, (Wei and Luo, 2018, Lemma 6), (Chiang
et al., 2012, Lemma 5)), we have (recall βb0 is the all-zero vector)

hπ̃ − πk , βbk i ≤ Dψ (π̃, πk′ ) − Dψ (π̃, πk+1


′ ′
) + hπk − πk+1 , βbk−1 − βbk i (40)
for any π̃ ∈ ∆A . Summing over k and telescoping give
K
X K
X K
X
hπ̃ − πk , βbk i ≤ Dψ (π̃, π1′ ) − Dψ (π̃, πK+1

)+ ′
hπk − πk+1 , βbk−1 − βbk i ≤ Dψ (π̃, π1′ ) + ′
hπk − πk+1 , βbk−1 − βbk i.
k=1 k=1 k=1

As in (Wei and Luo, 2018), we pick π̃ = 1 − 1
T
π∗ + 1
1 ,
TA A
and thus

Dψ (π̃, π1′ ) = ψ(π̃) − ψ(π1′ ) − h∇ψ(π1′ ), π̃ − π1′ i


= ψ(π̃) − ψ(π1′ ) (∇ψ(π1′ ) = − A
η
1 and h1, π̃ − π1′ i = 0)
A A
1X 1 1X 1
= log − log ′
η a=1 π̃(a) η a=1 π1 (a)
A log(AT ) A log A A ln T
≤ − = .
η η η


On the other hand, to bound hπk − πk+1 , βbk−1 − βbk i, we follow the same approach as in (Wei and Luo, 2018, Lemma
b
14): define Fk (π) = hπ, −βk−1 i + Dψ (π, πk′ ) and Fk+1

(π) = hπ, −βbk i + Dψ (π, πk′ ). Then by definition we have πk =
′ ′
argminπ∈∆A Fk (π) and πk+1 = argminπ∈∆A Ft+1 (π).
Observe that

Fk+1 ′
(πk ) − Fk+1 ′
(πk+1 ′
) = (πk − πk+1 )⊤ (βbk−1 − βbk ) + Fk (πk ) − Fk (πk+1

)
′ ⊤ b b
≤ (πk − πk+1 ) (βk−1 − βk ) (by the optimality of πk )

≤ πk − πk+1 ∇2 ψ(πk )
βbk−1 − βbk . (41)
∇−2 ψ(πk )


On the other hand, for some ξ that lies on the line segment between πk and πk+1 , we have by Taylor’s theorem and the

optimality of πk+1 ,

′ ′ ′ ′ ′ 1 2
Fk+1 (πk ) − Fk+1 (πk+1 ) = ∇Fk+1 (πk+1 )⊤ (πk − πk+1

)+ ′
πk − πk+1 ′
∇2 Fk+1 (ξ)
2
1 ′ 2 ′
≥ πk − πk+1 ∇2 ψ(ξ)
(by the optimality of πk+1 and that ∇2 Fk+1

= ∇2 ψ)
2
(42)

1  1 
By Eq.(32) we know πk+1 (a) ∈ π (a), 2πk (a) , and hence ξ(a) ∈ 2 πk (a), 2πk (a) holds as well, because ξ is in the line
2 k

segment between πk and πk+1 . This implies for any x,
v v
u A u A
uX x(a)2 1 uX x(a)2 1
kxk∇2 ψ(ξ) = t ≥ t = kxk∇2 ψ(πk ) .
ηξ(a) 2 2 ηπ (a) 2 2
k
a=1 a=1
Model-free RL in Infinite-horizon Average-reward MDPs

Combine this with (41) and (42), we get

′ 1 ′ 2
πk − πk+1 ∇2 ψ(πk )
βbk−1 − βbk ≥ πk − πk+1 ∇2 ψ(πk )
,
∇−2 ψ(πk ) 8

which implies kπk − πk+1 k∇2 ψ(π ≤ 8 βbk−1 − βbk . Hence we can bound the third term in (40) by
k) ∇−2 ψ(πk )

2 X  2

πk − πk+1 ∇2 ψ(πk )
βbk−1 − βbk ≤ 8 βbk−1 − βbk = 8η πk (a)2 βbk−1 (a) − βbk (a) .
∇−2 ψ(πk ) ∇−2 ψ(πk )
a

Finally, combining everything we have


"K #
X ∗
E hπ − πk , βbk i
k=1
" K
#
X
=E hπ − π̃, βbk i + hπ̃ − πk , βbk i

k=1
" K  # !
1 X 1 b A log T XK X  2
≤ ∗
π − 1, βk +O +η πk (a)2 βbk−1 (a) − βbk (a) ,
T k=1 A η k=1 a

where the expectation of the first term is bounded by O KN
T
= O(1) by the fact E[βbk (s)] = O(N ) (implied by Lemma 6
and Lemma 14). This completes the proof.

C.3 Proof for Lemma 9


Lemma 19 (Restatement of Lemma 9).
" K #
XXX ∗ ∗ πk
E B µ (s) (π (a|s) − πk (a|s)) q (s, a)
k=1 s a
 
BA ln T T N 3ρ
+η e
=O + η3 T N 6 .
η B
 √ √

4
1
With the choice of η = min 270(N+1) , √B A 3 , √
4
BA
6
, the bound becomes
ρT N TN

p  q 
3 1 3 1
e
O N 3 ρAT + (BAN 2 ) 4 T 4 + BN A = O
e t3mix ρAT + (t3mix thit A) 4 T 4 + t2mix thit A .

Proof. For any s,


"K #
XX ∗
E (π (a|s) − πk (a|s))q πk (s, a)
k=1 a
" K X
#
X ∗ πk P ∗
=E (π (a|s) − πk (a|s))β (s, a) (by the definition of β πk and that a (π (a|s) − πk (a|s))J πk = 0)
k=1 a
" #  
K X
X h i K
≤E (π (a|s) − πk (a|s))Ek βbk (s, a)

+O (by Eq. (11))
k=1 a
T
  " K X
#!
A ln T X 2 2
=O +O ηE πk (a|s) (βbk (s, a) − βbk−1 (s, a)) (by Lemma 8)
η a
k=1
  "K #!
A ln T 2
XX 2 b πk 2
≤O + ηN + O ηE πk (a|s) (βk (s, a) − β (s, a))
η
k=2 a
"K #!
XX 2 πk πk−1 2
+ O ηE πk (a|s) (β (s, a) − β (s, a))
k=2 a
"K #!
XX
+O ηE πk (a|s) (β 2 πk−1
(s, a) − βbk−1 (s, a)) 2
. (43)
k=2 a
Wei, Jafarnia-Jahromi, Luo, Sharma, Jain

The second term in (43) can be bounded using Eq. (12):


" K X
#!
X
O ηE πk (a|s)2 (βbk (s, a) − β πk (s, a))2
k=2 a
"K #!
XX N 3 log T
=O ηE πk (a|s)2
k=2 a
Bπk (a|s)µπk (s)
"K #!
X N 3 log T
=O ηE .
Bµπk (s)
k=2

The fourth term in (43) can be bounded similarly,


 except
hP that3 we firstiuse Lemma
 hP17 to upper boundi πk (a|s) by 2πk−1 (a|s).
K N log T K N 3 log T
Eventually this term is upper bounded by O ηE π
k=2 Bµ k−1 (s) = O ηE π
k=1 Bµ k (s) .

The third term in (43) can be bounded using Lemma 7:


" K X
#!
X 2 πk πk−1 2
O ηE πk (a|s) (β (s, a) − β (s, a))
k=2 a
" K X
#!
X 2 3 2
=O ηE πk (a|s) (ηN )
k=2 a

= O η 3 KN 6
.

Combining all these bounds in (43), we get


" K X
# "K # !
X A ln T X N 3 log T
E (π ∗ (a|s) − πk (a|s))q πk (s, a) = O + ηE 3
+ η KN 6
.
a
η Bµπk (s)
k=1 k=1

Now multiplying both sides by Bµ∗ (s) and summing over s we get
" K XX
# "K # !
X ∗ ∗ πk BA ln T X X N 3 (log T )µ∗ (s) 3 6
E B µ (s)(π (a|s) − πk (a|s))q (s, a) = O + ηE + η BKN
k=1 s a
η k=1 s
µπk (s)
 
BA ln T
≤O + ηρKN 3 (log T ) + η 3 BKN 6
η
 
BA T N3
=Oe + ηρ + η3 T N 6 (T = BK)
η B
 √ √

4
1 √B A BA 1
Choosing η = min 270(N+1)
, , √
4 (η ≤ 270(N+1)
is required by Lemma 17), we finally obtain
ρT N 3 T N6

" #
K XX
X p 3 1

∗ ∗
E B µ (s)(π (a|s) − πk (a|s))q πk e
(s, a) = O N 3 ρAT + (BAN 2 ) 4 T 4 + BN A
k=1 s a
q 
3 1
e
=O t3mix ρAT + (t3mix thit A) 4 T 4 + t2mix thit A .

You might also like