Deep Reinforcement Learning in Recommendations
ABSTRACT

Services that introduce stores to users on the Internet are increasing in recent years. Each service conducts thorough analyses in order to display stores matching each user's preferences. In the field of recommendation, collaborative filtering performs well when there is sufficient click information from users. Generally, when building a user-item matrix, data sparseness becomes a problem. It is especially difficult to handle new users. When sufficient data cannot be obtained, a multi-armed bandit algorithm is applied. Bandit algorithms advance learning by testing each of a variety of options sufficiently and obtaining rewards (i.e., feedback). It is practically impossible to learn everything when the number of items to be learned periodically increases. The problem of having to collect sufficient data for a new user of a service is the same as the problem that collaborative filtering faces. In order to solve this problem, we propose a recommender system based on deep reinforcement learning. In deep reinforcement learning, a multilayer neural network is used to update the value function.

Keywords

Recommender systems, Latent Dirichlet Allocation, Deep Reinforcement Learning, Deep Deterministic Policy Gradient
1. INTRODUCTION

In services for introducing stores, the number of page views and active users greatly affects the service's ability to increase the number of stores that pay to be registered on the service. Service providers make content easy to use and post information that is beneficial to users. Improving the user experience increases monthly page views and the number of active users. Although some improvements benefit all users, such as improvements to the user interface, the store information that users require generally varies from user to user. Reducing the time the service's search function needs to retrieve a target store is therefore important, and recommending suitable stores to each user is also an important requirement for recommender systems.

Ekiten is a website that lists store reviews and rankings. It has store information in the following categories: relaxation, body care, beauty, medical care, hobbies, education, dining, shopping, leisure, and everyday life. These categories are called large genres. There are small genres in each large genre. Users of this website can gather store information by selecting a genre. When a user searches for stores on Ekiten, there are two main search methods. The first method is to select an area after clicking on a large or small genre on the top page. In other words, this method narrows stores down to an area close to a specific point and compiles a list of recommended stores. The second method for receiving a list of stores is to enter keywords into the search bar on the top page. To do so, a user must input an area and a genre as search keywords. On the page of listed stores, store names, ratings, descriptions, numbers of reviews, and related photos are displayed. Based on this information, users click on a store that interests them. Users can then go to the store's page and get detailed information. On the store's page, store information, reviews, photos, menus, staff information, and maps can be viewed. It is also possible to check a store's availability and make reservations. This is how Ekiten's service for finding and reserving stores works.

In our research, we optimized Ekiten's store recommendation function using the framework of deep reinforcement learning. We performed offline experiments with the goal of establishing a "stores recommended for you" section on the site. Appropriate store recommendations for each user were made on the basis of past delivery records (impressions and clicks) possessed by Ekiten.

In this paper, we introduce a method of applying deep reinforcement learning to the problem of recommending stores. As with general reinforcement learning, it is necessary to define a state, action, episode, and reward. When actions are treated discretely, the number of stores is enormous, and learning does not converge (the curse of dimensionality). Also, it is not practical to repeat the learning process each time a new store is added. In order to handle this problem, we expressed actions as continuous values. Specifically, Latent Dirichlet Allocation (LDA) was used to convert store information into a distributed representation. Because the state was created on the basis of browsing history and area information, it was also treated as a continuous value (Section 4). The deep deterministic policy gradient (DDPG), which performs well with continuous-valued output, was used for deep reinforcement learning. On the basis of Ekiten's delivery records, we compared our proposed method with bandit algorithms and showed the superiority of the proposed method.
We review related work on recommender systems and deep reinforcement learning in Section 2. Section 3 describes this research's problem setting. Section 4 presents a method for fitting the framework of deep reinforcement learning to recommending stores. Section 5 describes the experimental results of the proposed and comparative methods, and we also explain our evaluation method. Finally, Section 6 concludes this paper.

2. RELATED WORK

Collaborative filtering [1, 2] is a popular and widely applied method in recommender systems. It is possible to calculate evaluation values on the basis of similarities between users by creating a user-item preference matrix (user-based collaborative filtering). The method of calculating evaluation values on the basis of similarities between items is called item-based collaborative filtering. Collaborative filtering is very effective when the evaluation values are sufficiently filled in, but in general, the user-item matrix is only sparsely filled. This is called the problem of data sparseness. For new users, or when there are few delivery records, the problem of data sparseness always occurs. When making recommendations with collaborative filtering, it is therefore necessary to fall back on another method, such as random delivery, until data is accumulated; for this reason, the benefits of this recommendation method cannot be obtained immediately. This problem occurs especially for services that introduce many new products frequently.

Another feature of collaborative filtering is its ability to predict the click-through rate (CTR) and conversion rate (CVR). This is possible by making CTR or CVR the elements of the user-item matrix. Multi-armed bandit algorithms [3, 4, 5] use CTR or CVR as an indicator. These algorithms are one method for optimizing recommendations in advertisements. There are two phases in a bandit algorithm: exploration and exploitation. In exploration, advertisements are distributed randomly. In exploitation, the advertisements with the highest expected value are delivered, where the values are calculated on the basis of data gained through exploration. Both exploration and exploitation receive feedback, which is generally called a reward. The ε-greedy method [3], which explores with probability ε and exploits with probability (1 − ε), is the simplest bandit algorithm and does not require much implementation time. The probability ε is an important parameter in determining the balance between exploration and exploitation. If the value of ε is large, the exploration rate increases: data can be gathered sufficiently, but advertisements cannot be delivered effectively during this time. If the value of ε is small, advertisements with high expected value are displayed. Although this may produce temporary results, it runs the risk of missing the opportunity to deliver more effective advertisements. In general, exploration and exploitation have a trade-off relationship.
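To make the exploration-exploitation trade-off concrete, the following is a minimal sketch of an ε-greedy policy over per-advertisement CTR estimates. The ad set, the simulated click probabilities, and ε = 0.1 are illustrative assumptions, not settings from the paper.

```python
import random

def epsilon_greedy_select(estimated_ctr, epsilon):
    """Pick an ad index: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(estimated_ctr))                            # explore: uniform random ad
    return max(range(len(estimated_ctr)), key=estimated_ctr.__getitem__)       # exploit: highest estimated CTR

# Illustrative simulation with hypothetical true click probabilities.
true_ctr = [0.02, 0.05, 0.03]
clicks, impressions = [0, 0, 0], [0, 0, 0]
estimated_ctr = [0.0, 0.0, 0.0]

for step in range(10_000):
    ad = epsilon_greedy_select(estimated_ctr, epsilon=0.1)
    impressions[ad] += 1
    clicks[ad] += int(random.random() < true_ctr[ad])                          # reward = 1 on click, 0 otherwise
    estimated_ctr[ad] = clicks[ad] / impressions[ad]

print(estimated_ctr)
```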
Several methods have been proposed to change the value of ε as the number of steps increases and thus balance exploration and exploitation. The softmax method [3], for instance, recommends advertisements with a high expected value with high probability and advertisements with a low expected value with low probability; weighted averages are used to calculate the expected values. Upper confidence bounds (UCB) [4] distribute advertisements using the confidence of the expected value, so advertisements that have not been fully explored are actively selected. Thompson sampling [5, 6] counts the number of clicks and non-clicks. A random number is generated from a beta distribution for each advertisement, and the advertisement with the maximum expected value is displayed. The maximum expected value is calculated on the basis of a Bernoulli distribution; in the Bernoulli model, it is common to use the beta distribution, its conjugate prior, in order to facilitate calculation. As these examples show, bandit algorithms optimize delivery on the basis of advertisement click information.
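A correspondingly small sketch of Beta-Bernoulli Thompson sampling is shown below; the uniform Beta(1, 1) prior and the class interface are assumptions made for illustration.

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of ads."""
    def __init__(self, n_ads):
        self.alpha = [1.0] * n_ads   # 1 + number of clicks
        self.beta = [1.0] * n_ads    # 1 + number of non-clicks

    def select(self):
        # Draw one sample per ad from its Beta posterior and show the argmax.
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, ad, clicked):
        # Count clicks and non-clicks for the displayed ad.
        if clicked:
            self.alpha[ad] += 1.0
        else:
            self.beta[ad] += 1.0
```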
In recent years, linear-model bandit algorithms have been proposed. There are many cases where each advertisement in a system has some related features, and we can then estimate the reward of one advertisement using the reward of another advertisement. However, it is difficult for a plain linear-model bandit algorithm to handle cases in which the action space changes at each step. In practice, content on the web is constantly changing: new advertisements are delivered to users while others are deleted. Since this process does not always have the same action space, it cannot be represented by a simple linear model. Lihong Li proposed LinUCB [7, 8] as a linear model that can treat an action space that changes across steps. To distinguish it from previous models, LinUCB is called a contextual bandit.

Bandit algorithms can be discussed in the framework of reinforcement learning. In the case of advertisements, assume an agent selects an advertisement to be displayed each time a user accesses a page. When the user visits the page, the agent receives the state from the environment. The agent selects an advertisement as an action based on the state. The environment receives the user's click information as feedback and rewards the agent. The value function in the bandit algorithm defines the current CTR.

Q-learning [9] is one reinforcement learning algorithm. In this method, the agent learns a value for how beneficial it is to take a given action in a given state. This function is called the action-value function. In this research, we advanced learning to maximize it. The action-value function is not an immediate reward; it selects the action taking future rewards into consideration. Specifically, we considered the sum of discounted rewards. The sum of discounted rewards at the current time t is expressed as R_t = \sum_{\tau=0}^{T} \gamma^{\tau} R_{t+1+\tau}, where T is the time step at which the episode terminates and future rewards are discounted by a factor of γ per time step. The bandit algorithm maximizes the immediate reward. In contrast, Q-learning maximizes the cumulative reward obtained by a series of actions.
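As a small worked illustration of the discounted return and of a tabular Q-learning update (the dictionary-based Q-table, the learning rate, and the γ value are illustrative assumptions; the paper itself uses function approximation rather than a table):

```python
def discounted_return(rewards, gamma):
    """R_t = sum over tau of gamma**tau * reward for one finished episode."""
    total = 0.0
    for tau, r in enumerate(rewards):
        total += (gamma ** tau) * r
    return total

def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(q[next_state].values()) if q.get(next_state) else 0.0
    q.setdefault(state, {}).setdefault(action, 0.0)
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81: a reward two steps away is discounted twice
```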
Research on deep reinforcement learning, which expresses value functions with multi-layered neural networks, has been advancing actively in recent years. By expressing the value function with a neural network, it is possible to treat high-dimensional inputs such as images. However, data obtained from the environment is correlated, which can interfere with learning. The deep Q-network (DQN) [10, 11] solved this problem by using experience replay. Experience replay stores the obtained data in a storage area called a replay buffer. We randomly sample data from it during learning to reduce the correlation between data. The size of the buffer is predefined, and when the amount of data exceeds this size, the oldest data is deleted. DQN also uses a target network to stabilize learning. The action-value function is copied to the target network before the start of learning; within each batch, the action value is updated using the output of the target network as the teacher signal. This prevents the teacher signal from changing during learning and thereby stabilizes learning.
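A minimal sketch of such a replay buffer, assuming a simple (state, action, reward, next_state) tuple layout and an arbitrary capacity:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (state, action, reward, next_state) tuples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # the oldest data is dropped automatically

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```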
DQN makes it possible to learn in a high-dimensional state space by using a neural network for function approximation and to stabilize learning. However, DQN cannot be used when the action space is continuous, because it is necessary to obtain the action that maximizes the current action value in each state. The deep deterministic policy gradient (DDPG) [12] is a method that performs well even when both the state space and the action space of the agent are continuous. In this paper, we give an effective method for raising the click rate of displayed stores using DDPG. We conducted experiments on store recommendations, which can be represented as continuous values, and compare this method's performance with that of other methods.

3. PROBLEM SETTING

When users visit Ekiten and search with areas and genres as keywords, stores that match the keywords are displayed. Users can make inquiries to and reservations with stores they want to visit on the basis of the stores' introductions, reviews, photos, and ratings. In this paper, we assume that Ekiten aims to display the stores that are best for each individual user conducting a search. This means that even when the same search keywords are used, we recommend different stores for each user. The model is then updated so that the displayed stores fit the user's preferences better. The area and genre are input by the user, so it is not practical to recommend a store far from the area the user selects. Therefore, we recommend stores within the user-selected area. The store recommendations were made using deep reinforcement learning on the basis of logs of clicks to detailed store information pages. The agent receives the searching user's past browsing history and search keywords as the state and outputs desirable stores for the user as the action. The environment returns a reward calculated on the basis of the user's clicks as feedback on the agent's action. The agent updates its own model on the basis of the reward. When the action of the agent is discrete, the agent needs to learn all stores listed on Ekiten, and it takes a very long time for the learning to converge. In this study, we therefore used DDPG [12], which is compatible with problems that have continuous state and action spaces. We converted store information to an N-dimensional distributed representation using LDA [13]. This made it possible to calculate the degree of suitability of a store for a user even for stores that have never received feedback. We used the mean reciprocal rank (MRR) [14] to evaluate this recommendation system. In this paper, we determined whether the agent's store output increasingly fit the user's preferences as learning progressed.
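For reference, MRR and recall@k (reported as recall@3 in Section 5) can be computed along the following lines; the list-of-rankings input format is an assumption, not the paper's data format.

```python
def mean_reciprocal_rank(ranked_lists, clicked_items):
    """MRR: average of 1 / rank of the clicked store in each ranked recommendation list."""
    total = 0.0
    for ranking, clicked in zip(ranked_lists, clicked_items):
        rank = ranking.index(clicked) + 1 if clicked in ranking else None
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, clicked_items, k=3):
    """Fraction of queries whose clicked store appears in the top k of the ranking."""
    hits = sum(1 for ranking, clicked in zip(ranked_lists, clicked_items) if clicked in ranking[:k])
    return hits / len(ranked_lists)
```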
4. RECOMMENDING STORES WITH DEEP REINFORCEMENT LEARNING

We used deep reinforcement learning to recommend stores. In this research, the state space and the action space are treated as continuous values. A distributed representation of stores can be obtained using LDA, with Ekiten's reviews as the corpus. The agent learns to output the distributed representation of a store that matches the target user's preferences. This section describes the store vector representation method (Section 4.1), the elements necessary for deep reinforcement learning (Section 4.2), and the deep reinforcement learning algorithm used in this research (Section 4.3).

4.1 DISTRIBUTED REPRESENTATION METHOD FOR STORES

We use LDA [13] as the stochastic language model to turn stores into a distributed representation. A bag-of-words (BoW) model, which is a set of words and their occurrence frequencies, was built. In a BoW model, the order of words in a document is ignored. The model represents the phenomenon of word co-occurrence, and it can be assumed that there are several sets of words with a statistically high probability of co-occurring. One document is assumed to consist of N topics, and each topic is given a probability value. We applied LDA to documents that we made for each store from its review information. The resulting store vector item has N-dimensional probability values as its elements.
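The paper does not name a specific LDA implementation. As one possible sketch using scikit-learn, each store's concatenated reviews form one document and the inferred topic distribution serves as the N-dimensional store vector; the vocabulary cap and the default hyperparameters are assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def build_store_vectors(store_review_texts, n_topics=50):
    """Map each store's concatenated review text to an n_topics-dimensional topic distribution."""
    vectorizer = CountVectorizer(max_features=20_000)        # bag-of-words term counts per store document
    bow = vectorizer.fit_transform(store_review_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    store_vectors = lda.fit_transform(bow)                   # each row is a per-store topic probability vector
    return store_vectors
```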
4.2 DEFINITION OF ELEMENTS FOR DEEP REINFORCEMENT LEARNING

This section gives the definitions of the agent, environment, and episodes, which are the constituent elements of reinforcement learning.

4.2.1 AGENT

When the agent receives the state s_t from the environment at time t, it outputs the N-dimensional store vector as the action a_t, receives the reward r_t, and updates the model. a_t is an element of the action space A ∈ [0, 1]^N. When recommending a store at time t, the set of stores Item_t to be recommended is determined on the basis of the keywords entered by the user. We calculate the similarity between the action a_t output by the agent and the stores Item_t to be recommended. The similarity is defined as follows:

    sim(a_t, item) = \frac{\sum_{n=1}^{N} a_{t,n}\, item_n}{\sqrt{\sum_{n=1}^{N} a_{t,n}^2}\, \sqrt{\sum_{n=1}^{N} item_n^2}}    (1)

The similarity defined by Equation 1 is calculated between the distributed representation of each store in Item_t being recommended and the action vector a_t. The store recommend_item_t that maximizes the similarity is displayed to the user:

    recommend_item_t = \arg\max_{item \in Item_t} sim(a_t, item)    (2)
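A small sketch of this nearest-store lookup over the candidate set Item_t; the small constant added to the denominator only guards against division by zero and is not part of Equations 1 and 2.

```python
import numpy as np

def select_store(action_vec, candidate_vectors):
    """Return the index of the candidate store whose LDA vector is most cosine-similar
    to the action vector output by the agent (Equations 1 and 2)."""
    a = np.asarray(action_vec, dtype=float)
    items = np.asarray(candidate_vectors, dtype=float)
    sims = items @ a / (np.linalg.norm(items, axis=1) * np.linalg.norm(a) + 1e-12)
    return int(np.argmax(sims))
```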
4.2.2 ENVIRONMENT

After displaying to the user the store received from the agent, the environment sends the agent the reward r_t, calculated on the basis of click information, and the next state s_{t+1}. In this research, there are two ways of giving rewards. The first method gives as the reward (r_action_vec) the similarity between the N-dimensional vector output by DDPG and the clicked store. The reward calculation is defined as follows:

    r_{action\_vec} = \frac{\sum_{n=1}^{N} a_{t,n}\, clicked\_item_{t,n}}{\sqrt{\sum_{n=1}^{N} a_{t,n}^2}\, \sqrt{\sum_{n=1}^{N} clicked\_item_{t,n}^2}} - 0.5    (3)

The first term of Equation 3 is the cosine similarity. The action space is A ∈ [0, 1]^N, so the cosine similarity lies between 0 and 1. We wanted to give a negative reward when the degree of similarity between the clicked store and the recommended store is low, so the reward becomes r_t ∈ [−0.5, 0.5] by subtracting 0.5 from the cosine similarity. The second reward (r_item_vec) is calculated by replacing the action vector in Equation 3 with recommend_item_t obtained from Equation 2.

The state is expressed on the basis of the search keywords entered at time t and the stores that the user browsed up to time t − 1. The area area_t is represented by a two-dimensional vector of latitude and longitude. In this research, we recommended dining-genre stores. There are 40 small genres in the dining genre, and the genre genre_t is represented by a 40-dimensional one-hot vector. The stores that the user has browsed in the past are expressed as history_t. More precisely, history_t is obtained by adding the store vectors of the j stores that the user viewed most recently. The store viewed most recently is the most likely to reflect the user's preferences, so we adopted a discounted sum with a discount factor γ ∈ [0, 1]. The user's browsing history history_t is defined as follows:

    history_t = \sum_{i=1}^{j} \gamma^{i}\, clicked\_item_{t-i}    (4)

The store vector is represented by probability values, and the browsing histories are also represented by probability values. We normalized history_t as follows:

    normalized\_history_t = \frac{history_t}{|history_t|}    (5)

We also defined the state vector s_t using area_t, genre_t, and normalized_history_t as follows:

    s_t = concat(area_t, genre_t, normalized\_history_t)    (6)

concat is a function that concatenates vectors.
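A minimal sketch of the state construction of Equations 4-6 and the r_action_vec reward of Equation 3; γ = 0.9, the zero-norm guard, and the assumption of a non-empty click history are illustrative choices, not values from the paper.

```python
import numpy as np

def build_state(area_latlon, genre_onehot, clicked_item_history, gamma=0.9):
    """Concatenate area, genre, and the normalized discounted sum of recently clicked store
    vectors (Equations 4-6). Assumes clicked_item_history holds at least one store vector,
    ordered from most recently viewed to oldest."""
    history = sum((gamma ** i) * np.asarray(item, dtype=float)
                  for i, item in enumerate(clicked_item_history, start=1))
    norm = np.linalg.norm(history)
    normalized_history = history / norm if norm > 0 else history
    return np.concatenate([np.asarray(area_latlon, dtype=float),
                           np.asarray(genre_onehot, dtype=float),
                           normalized_history])

def action_vec_reward(action_vec, clicked_item_vec):
    """r_action_vec from Equation 3: cosine similarity shifted into [-0.5, 0.5]."""
    a = np.asarray(action_vec, dtype=float)
    c = np.asarray(clicked_item_vec, dtype=float)
    cos = a @ c / (np.linalg.norm(a) * np.linalg.norm(c) + 1e-12)
    return cos - 0.5
```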
4.2.3 EPISODE

An episode starts when the user first visits the web page; in this case, it starts when the user lands on a page of stores in the dining genre. A step occurs every time the user visits a page in the dining genre, and at each step in the episode a store is recommended. We call the move from one step to the next a state transition. If no state transition occurs for the user for 30 minutes, the episode ends. When the same user visits again after that, a new episode begins. Even for the same user, the purpose of a search may be different after several tens of minutes, which is why we defined the episode this way.

4.3 ALGORITHM FOR DEEP REINFORCEMENT LEARNING

The DDPG [12] used in this study is a combination of DQN [10, 11] and the deterministic policy gradient (DPG) [15]. As previously explained, DQN stabilizes learning by using the target network and reduces the correlation between data with experience replay. DPG uses a critic function to evaluate the value of a state and an actor function to select actions according to the state. By combining these two methods, we can handle continuous values in the action space. In the DDPG algorithm, at time t we use a mini-batch of B_t tuples sampled randomly from the replay buffer, where each tuple contains the current state, action, reward, and next state. The critic function and the actor function are updated using this mini-batch. (s_{t,i}, a_{t,i}, r_{t,i}, s_{t,i+1}) is an element of the tuple set included in the mini-batch. Regarding Q(s_{t,i}, a_{t,i}) as the action-value function, the critic function is updated to minimize the loss function L shown below:

    L = \frac{1}{B_t} \sum_{i} \left( y_{t,i} - Q(s_{t,i}, a_{t,i} \,|\, \theta^{Q}) \right)^2    (7)

The teacher signal y_{t,i} is expressed as Equation 8 using the reward r_{t,i} and the action value calculated from the target networks of the critic and the actor (θ^{Q'}, θ^{μ'}). γ ∈ [0, 1] is the discount rate.

    y_{t,i} = r_{t,i} + \gamma\, Q'(s_{t,i+1}, \mu'(s_{t,i+1} \,|\, \theta^{\mu'}) \,|\, \theta^{Q'})    (8)

The actor function is updated on the basis of the gradient expressed by the following equation:

    \nabla_{\theta^{\mu}} J \approx \frac{1}{B_t} \sum_{i} \nabla_{a} Q(s, a \,|\, \theta^{Q})\big|_{s=s_{t,i},\, a=\mu(s_{t,i})}\, \nabla_{\theta^{\mu}} \mu(s \,|\, \theta^{\mu})\big|_{s_{t,i}}    (9)

Within a mini-batch, we do not change the two target networks; we update only the critic network θ^Q and the actor network θ^μ.
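The paper gives no implementation details beyond Equations 7-9. The following PyTorch-style sketch shows how one mini-batch update of the critic and the actor could look; the actor/critic module interfaces, the optimizers, and the soft target-update rate tau are assumptions rather than the authors' settings (the paper only states that the target networks are held fixed within a mini-batch).

```python
import torch
import torch.nn as nn

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One mini-batch update corresponding to Equations 7-9.
    actor(s) -> action in [0, 1]^N; critic(s, a) -> scalar Q-value (both assumed nn.Modules)."""
    states, actions, rewards, next_states = batch  # tensors: (B, state_dim), (B, N), (B, 1), (B, state_dim)

    # Critic: minimize (y - Q(s, a))^2 with y computed from the target networks (Equations 7 and 8).
    with torch.no_grad():
        target_q = rewards + gamma * target_critic(next_states, target_actor(next_states))
    critic_loss = nn.functional.mse_loss(critic(states, actions), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the deterministic policy gradient of Equation 9
    # (equivalently, minimize -Q(s, mu(s))).
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks after the mini-batch (a common choice;
    # the paper does not specify its target-update schedule).
    with torch.no_grad():
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)
        for p, tp in zip(actor.parameters(), target_actor.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)
```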
5. EXPERIMENT

This section describes our experiment. In the experiment, we used Ekiten's delivery data for the past 14 months. Specifically, the target logs are those in which a user searched and selected (clicked) a store. Logs in which no click occurred were excluded from the experiment, because when a click does not occur it cannot be determined whether the user was not interested or simply left the page, which makes it difficult to assign an appropriate reward. The experiment used 239,846 data records. The experimental settings of DDPG [12] are explained in Section 5.1. In Section 5.2, we explain the comparison methods and their experimental settings. Section 5.3 explains the evaluation method, and Section 5.4 gives the experimental results, which are then discussed in Section 5.5. In the experiment, the number of dimensions of the distributed representation of stores was compared with N ∈ {10, 50, 100}.

5.1 PROPOSED METHOD

DDPG [12] is used as the proposed method with the two reward patterns described in Section 4, r_action_vec and r_item_vec, to reward the agent.

5.2 COMPARATIVE ALGORITHM

In order to test the performance of the proposed method, we compared its results with those of a random recommendation method and a bandit algorithm.
5.4 RESULTS

5.4.1 COMPARISON OF PERFORMANCE USING LDA

We conducted preliminary experiments to find the optimal number of dimensions N for DDPG. Comparative experiments were conducted with N ∈ {10, 50, 100} for the dimension of the distributed representation of stores. To verify the effect of the number of LDA dimensions, we used the reward r_action_vec described in Section 4.2.2. Table 1 shows the MRR and recall@3 scores when the DDPG algorithm was run with each dimension. Both the MRR and recall@3 scores are at their minimum with N = 10 (MRR = 0.288, recall@3 = 0.308) and at their maximum with N = 50 (MRR = 0.302, recall@3 = 0.327). Figure 1 shows the average reward and the MRR and recall@3 for every 10,000 steps when running the DDPG algorithm with each dimension. The average reward per step is always at its maximum when the number of LDA dimensions is N = 10 and at its minimum when N = 50. Although the value of the reward differs, the values tend to converge as the number of steps increases. When N = 10, the MRR per step ranges from 0.279 to 0.297, and recall@3 ranges from 0.296 to 0.321. When N = 100, the MRR per step ranges from 0.286 to 0.307, and recall@3 ranges from 0.299 to 0.332. When N = 50, the MRR per step ranges from 0.293 to 0.312, and recall@3 ranges from 0.313 to 0.345. As shown, when the number of LDA dimensions is N = 50, the score at each step is the highest, meaning this dimension is the most effective.

[Figure 1: Average reward, MRR, and recall@3 per 10,000 steps when changing the dimension N of LDA in DDPG]

5.4.2 COMPARISON OF PERFORMANCE IN GIVING REWARDS

As in Section 5.4.1, we conducted preliminary experiments to determine how to give rewards suitable for DDPG. Comparative experiments were conducted to compare r_action_vec and r_item_vec as the reward, with the LDA dimension fixed at N = 50. Table 2 shows the MRR and recall@3 scores for the two kinds of rewards. Both the MRR and recall@3 scores of r_action_vec exceed those of r_item_vec. Figure 2 shows the average reward and the MRR and recall@3 for every 10,000 steps when changing the reward in the DDPG algorithm. The average reward tends to increase with r_action_vec and decrease with r_item_vec. With r_action_vec, the MRR per step ranges from 0.293 to 0.312, and recall@3 ranges from 0.313 to 0.345. With r_item_vec, the MRR per step ranges from 0.287 to 0.301, and recall@3 ranges from 0.302 to 0.328. Learning is stabilized by giving r_action_vec as the reward, which shows in both the MRR and recall@3 scoring high at each step.

[Figure 2: Average reward, MRR, and recall@3 per 10,000 steps when changing the reward in DDPG]

5.4.3 COMPARISON OF PERFORMANCE WITH EXISTING METHODS

Sections 5.4.1 and 5.4.2 explained how we determined the optimal number of dimensions N for DDPG and the reward design. Here, we demonstrate the superiority of DDPG by comparing it with the existing methods. Table 3 gives the results of all the experimental settings for each method. DDPG (N = 50, r_action_vec), the proposed method, scored MRR = 0.302 and recall@3 = 0.327, which were the best scores among all conditions of the bandit algorithms and the random recommendation method. Softmax (τ = 0.2, N = 50) had the highest score among the bandit algorithms, with MRR = 0.292 and recall@3 = 0.311. Figure 3 shows higher MRR and recall@3 values from DDPG earlier in learning than from the other methods. DDPG also has the maximum score for both MRR and recall@3 in almost all steps even as learning progresses. Softmax exceeded DDPG in MRR only at steps 200,001-210,000, and the difference was slight.

[Figure 3: Average reward, MRR, and recall@3 per 10,000 steps in the experiment setting with the highest score for each method]
5.5 DISCUSSION

Here, we discuss the results of the experiment. DDPG had high MRR and recall@3 scores from an early learning stage compared to the scores of the bandit algorithms. In DDPG, store vectors are output as actions in the distributed representation. Therefore, it is possible to use the rewards gained when recommending a certain store for the recommendation of other stores. In the bandit algorithms, however, there is a big difference in learning speed because exploration has to be performed for each store separately. We consider the higher learning speed to be the effect of applying deep reinforcement learning to the system.

There was also a difference in the magnitude of the average reward in each experiment. The reward was calculated with the similarity calculation of Equation 3. The difference in the magnitude of the average reward when changing the dimension N of the LDA did not affect the accuracy of our model. For this research, it was not important for the size of the average reward to affect the magnitude of MRR or recall@3; rather, it was important to create an appropriate store vector with LDA and for learning to converge. Among the LDA dimensions, N = 10 gives the lowest MRR and recall@3 scores. If N is small, the state space and the action space are both reduced so that learning converges faster, but the MRR and recall@3 scores are considered to be low because store information is not represented properly.

For the reward design, r_action_vec was more stable than r_item_vec. The reason learning was not stable when the reward design was set to r_item_vec is thought to be that the output action vector was replaced with the store vector when obtaining the reward. There is a possibility that an appropriate reward is not given when the divergence between the action vector and the store vector is large, especially at

6. CONCLUSION

We proposed a framework for a recommender system using deep reinforcement learning and achieved higher recommendation accuracy than that of existing methods. Using reinforcement learning, we were able to make recommendations that consider state transitions. We also made the following contributions. First, we could output the stores recommended by deep reinforcement learning as vectors in the N-dimensional distributed representation. Therefore, compared with bandit algorithms that recommend stores directly, it was possible to learn efficiently, resulting in high MRR and recall values. Second, in our system the reward was given on the basis of the similarity of the recommendation to the clicked store vector, so we were able to give a reward even if a store other than the recommended store was clicked. Therefore, it was possible to learn even offline. Third, store vectors were made from reviews, so even newly listed stores could be recommended as long as LDA vectors could be made for those stores. In this paper, we could not verify the features used for the state or the hyperparameters of DDPG. It is necessary to verify this and other aspects of recommender systems using deep reinforcement learning in future research.

7. REFERENCES

[1] J. Ben Schafer, Joseph Konstan, and John Riedl. Recommender systems in e-commerce. E-COMMERCE 99, ACM, 1999.
[2] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. GroupLens: an open architecture for collaborative filtering of netnews. Computer-Supported Cooperative Work and Social Computing, 1994.
[3] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning. The MIT Press, 1998.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002.
[5] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. JMLR: Workshop and Conference Proceedings, Vol. 21, 2012.
[6] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
[7] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. World Wide Web, 2010.
[8] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. Web Search and Data Mining, 2012.