Deep Reinforcement Learning in Recommendations
ABSTRACT

Services that introduce stores to users on the Internet are increasing in recent years. Each service conducts thorough analyses in order to display stores matching each user's preferences. In the field of recommendation, collaborative filtering performs well when there is sufficient click information from users. Generally, when building a user-item matrix, data sparseness becomes a problem. It is especially difficult to handle new users. When sufficient data cannot be obtained, a multi-armed bandit algorithm is applied. Bandit algorithms advance learning by testing each of a variety of options sufficiently and obtaining rewards (i.e., feedback). It is practically impossible to learn everything when the number of items to be learned periodically increases. The problem of having to collect sufficient data for a new user of a service is the same as the problem that collaborative filtering faces. In order to solve this problem, we propose a recommender system based on deep reinforcement learning. In deep reinforcement learning, a multilayer neural network is used to update the value function.

Keywords

Recommender systems, Latent Dirichlet Allocation, Deep Reinforcement Learning, Deep Deterministic Policy Gradient
1. INTRODUCTION

In services for introducing stores, the number of page views and active users greatly affects the service's ability to increase the number of stores that pay to be registered on the service. Service providers make content easy to use and post information that is beneficial to users. Improving the user experience increases monthly page views and the number of active users. Although some improvements benefit all users, such as improvements to the user interface, the store information that users require generally varies from user to user. Reducing the time the service's search function needs to retrieve a target store is therefore important, and recommending suitable stores to each user is also an important requirement for recommender systems.

Ekiten is a website that lists store reviews and rankings. It has store information in the following categories: relaxation, body care, beauty, medical care, hobbies, education, dining, shopping, leisure, and everyday life. These categories are called large genres. There are small genres in each large genre. Users of this website can gather store information by selecting a genre. When a user searches for stores on Ekiten, there are two main search methods. The first method is to select an area after clicking on a large or small genre on the top page. In other words, this method narrows stores down to an area close to a specific point and compiles a list of recommended stores. The second method for receiving a list of stores is to enter keywords into the search bar on the top page. To do so, a user must input an area and a genre as search keywords. On the page of listed stores, store names, ratings, descriptions, numbers of reviews, and related photos are displayed. Based on this information, users click on a store that interests them. Users can then go to the store's page and get detailed information. On the store's page, store information, reviews, photos, menus, staff information, and maps can be viewed. It is also possible to check a store's availability and make reservations. This is how Ekiten's service for finding and reserving stores works.

In our research, we optimized Ekiten's store recommendation function using the framework of deep reinforcement learning. We performed offline experiments with the goal of establishing a "stores recommended for you" section on the site. Appropriate store recommendations for each user were made on the basis of past delivery records (impressions and clicks) possessed by Ekiten.

In this paper, we introduce a method of applying deep reinforcement learning to the problem of recommending stores. As with general reinforcement learning, it is necessary to define a state, action, episode, and reward. When actions are treated discretely, the number of stores is enormous, and learning does not converge (the curse of dimensionality). Also, it is not practical to repeat the learning process each time a new store is added. In order to handle this problem, we expressed actions as continuous values. Specifically, Latent Dirichlet Allocation (LDA) was used to convert store information into a distributed representation. Because the state was created on the basis of browsing history and area information, it was also treated as a continuous value (Section 4). The deep deterministic policy gradient (DDPG), which performs well with continuous-valued output, was used for deep reinforcement learning. On the basis of Ekiten's delivery records, we compared our proposed method with bandit algorithms and showed the superiority of the proposed method.
We review related work on recommender systems and deep reinforcement learning in Section 2. Section 3 describes this research's problem setting. Section 4 presents a method for fitting the framework of deep reinforcement learning to recommending stores. Section 5 describes the experimental results of the proposed and comparative methods, and we also explain our evaluation method. Finally, Section 6 concludes this paper.

2. RELATED WORK

Collaborative filtering [1, 2] is a popular and widely applied method in recommender systems. It is possible to calculate evaluation values on the basis of similarities between users by creating a user-item preference matrix (user-based collaborative filtering). The method of calculating evaluation values on the basis of similarities between items is called item-based collaborative filtering. Collaborative filtering is very effective when the evaluation values are sufficiently filled in, but in general, the user-item matrix is only sparsely filled. This is called the problem of data sparseness. For new users, or when there are few delivery records, the problem of data sparseness always occurs. When making recommendations with collaborative filtering, it is therefore necessary to fall back on another method, such as random delivery, until data is accumulated; for this reason, the benefits of this recommendation method cannot be obtained immediately. This problem occurs especially for services that introduce many new products frequently.

Another feature of collaborative filtering is its ability to predict the click-through rate (CTR) and conversion rate (CVR). This is possible by making CTR or CVR the elements of the user-item matrix. Multi-armed bandit algorithms [3, 4, 5] use CTR or CVR as an indicator. These algorithms are one method for optimizing recommendations in advertisements. There are two phases in a bandit algorithm: exploration and exploitation. In exploration, advertisements are distributed randomly. In exploitation, the advertisements with the highest expected value are delivered, where the values are calculated on the basis of data gained through exploration. Both exploration and exploitation receive feedback, which is generally called a reward. The ε-greedy method [3], which explores with probability ε and exploits with probability (1 − ε), is the simplest bandit algorithm and does not require much implementation time. The probability ε is an important parameter in determining the balance between exploration and exploitation. If the value of ε is large, the exploration rate increases: data can be gathered sufficiently, but advertisements cannot be delivered effectively during this time. If the value of ε is small, advertisements with high expected value are displayed. Although this may produce temporary results, it runs the risk of missing the opportunity to deliver more effective advertisements. In general, exploration and exploitation have a trade-off relationship.
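To make the exploration-exploitation trade-off concrete, the following is a minimal sketch of an ε-greedy policy over per-advertisement CTR estimates. The ad set, the simulated click probabilities, and ε = 0.1 are illustrative assumptions, not settings from the paper.

```python
import random

def epsilon_greedy_select(estimated_ctr, epsilon):
    """Pick an ad index: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(estimated_ctr))                            # explore: uniform random ad
    return max(range(len(estimated_ctr)), key=estimated_ctr.__getitem__)       # exploit: highest estimated CTR

# Illustrative simulation with hypothetical true click probabilities.
true_ctr = [0.02, 0.05, 0.03]
clicks, impressions = [0, 0, 0], [0, 0, 0]
estimated_ctr = [0.0, 0.0, 0.0]

for step in range(10_000):
    ad = epsilon_greedy_select(estimated_ctr, epsilon=0.1)
    impressions[ad] += 1
    clicks[ad] += int(random.random() < true_ctr[ad])                          # reward = 1 on click, 0 otherwise
    estimated_ctr[ad] = clicks[ad] / impressions[ad]

print(estimated_ctr)
```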
Several methods have been proposed to change the value of ε as the number of steps increases and thus balance exploration and exploitation. The softmax method [3], for instance, recommends advertisements with a high expected value with high probability and advertisements with a low expected value with low probability; weighted averages are used to calculate the expected values. Upper confidence bounds (UCB) [4] distribute advertisements using the confidence of the expected value, so advertisements that have not been fully explored are actively selected. Thompson sampling [5, 6] counts the number of clicks and non-clicks. A random number is generated from a beta distribution for each advertisement, and the advertisement with the maximum expected value is displayed. The maximum expected value is calculated on the basis of a Bernoulli distribution; in the Bernoulli model, it is common to use the beta distribution, its conjugate prior, in order to facilitate calculation. As these examples show, bandit algorithms optimize delivery on the basis of advertisement click information.
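A correspondingly small sketch of Beta-Bernoulli Thompson sampling is shown below; the uniform Beta(1, 1) prior and the class interface are assumptions made for illustration.

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of ads."""
    def __init__(self, n_ads):
        self.alpha = [1.0] * n_ads   # 1 + number of clicks
        self.beta = [1.0] * n_ads    # 1 + number of non-clicks

    def select(self):
        # Draw one sample per ad from its Beta posterior and show the argmax.
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, ad, clicked):
        # Count clicks and non-clicks for the displayed ad.
        if clicked:
            self.alpha[ad] += 1.0
        else:
            self.beta[ad] += 1.0
```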
In recent years, linear-model bandit algorithms have been proposed. There are many cases where each advertisement in a system has some related features, and we can then estimate the reward of one advertisement using the reward of another advertisement. However, it is difficult for a plain linear-model bandit algorithm to handle cases in which the action space changes at each step. In practice, content on the web is constantly changing: new advertisements are delivered to users while others are deleted. Since this process does not always have the same action space, it cannot be represented by a simple linear model. Lihong Li proposed LinUCB [7, 8] as a linear model that can treat an action space that changes across steps. To distinguish it from previous models, LinUCB is called a contextual bandit.

Bandit algorithms can be discussed in the framework of reinforcement learning. In the case of advertisements, assume an agent selects an advertisement to be displayed each time a user accesses a page. When the user visits the page, the agent receives the state from the environment. The agent selects an advertisement as an action based on the state. The environment receives the user's click information as feedback and rewards the agent. The value function in the bandit algorithm defines the current CTR.

Q-learning [9] is one reinforcement learning algorithm. In this method, the agent learns a value for how beneficial it is to take a given action in a given state. This function is called the action-value function. In this research, we advanced learning to maximize it. The action-value function is not an immediate reward; it selects the action taking future rewards into consideration. Specifically, we considered the sum of discounted rewards. The sum of discounted rewards at the current time t is expressed as R_t = \sum_{\tau=0}^{T} \gamma^{\tau} R_{t+1+\tau}, where T is the time step at which the episode terminates and future rewards are discounted by a factor of γ per time step. The bandit algorithm maximizes the immediate reward. In contrast, Q-learning maximizes the cumulative reward obtained by a series of actions.
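As a small worked illustration of the discounted return and of a tabular Q-learning update (the dictionary-based Q-table, the learning rate, and the γ value are illustrative assumptions; the paper itself uses function approximation rather than a table):

```python
def discounted_return(rewards, gamma):
    """R_t = sum over tau of gamma**tau * reward for one finished episode."""
    total = 0.0
    for tau, r in enumerate(rewards):
        total += (gamma ** tau) * r
    return total

def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(q[next_state].values()) if q.get(next_state) else 0.0
    q.setdefault(state, {}).setdefault(action, 0.0)
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81: a reward two steps away is discounted twice
```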
Research on deep reinforcement learning, which expresses value functions with multi-layered neural networks, has been advancing actively in recent years. By expressing the value function with a neural network, it is possible to treat high-dimensional inputs such as images. However, data obtained from the environment is correlated, which can interfere with learning. The deep Q-network (DQN) [10, 11] solved this problem by using experience replay. Experience replay stores the obtained data in a storage area called a replay buffer. We randomly sample data from it during learning to reduce the correlation between data. The size of the buffer is predefined, and when the amount of data exceeds this size, the oldest data is deleted. DQN also uses a target network to stabilize learning. The action-value function is copied to the target network before the start of learning; within each batch, the action value is updated using the output of the target network as the teacher signal. This prevents the teacher signal from changing during learning and thereby stabilizes learning.
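A minimal sketch of such a replay buffer, assuming a simple (state, action, reward, next_state) tuple layout and an arbitrary capacity:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (state, action, reward, next_state) tuples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # the oldest data is dropped automatically

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```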
DQN makes it possible to learn in a high-dimensional state space by using a neural network for function approximation and to stabilize learning. However, DQN cannot be used when the action space is continuous, because it is necessary to obtain the action that maximizes the current action value in each state. The deep deterministic policy gradient (DDPG) [12] is a method that performs well even when both the state space and the action space of the agent are continuous. In this paper, we give an effective method for raising the click rate of displayed stores using DDPG. We conducted experiments on store recommendations, which can be represented as continuous values, and compare this method's performance with that of other methods.

3. PROBLEM SETTING

When users visit Ekiten and search with areas and genres as keywords, stores that match the keywords are displayed. Users can make inquiries to and reservations with stores they want to visit on the basis of the stores' introductions, reviews, photos, and ratings. In this paper, we assume that Ekiten aims to display the stores that are best for each individual user conducting a search. This means that even when the same search keywords are used, we recommend different stores for each user. The model is then updated so that the displayed stores fit the user's preferences better. The area and genre are input by the user, so it is not practical to recommend a store far from the area the user selects. Therefore, we recommend stores within the user-selected area. The store recommendations were made using deep reinforcement learning on the basis of logs of clicks to detailed store information pages. The agent receives the searching user's past browsing history and search keywords as the state and outputs desirable stores for the user as the action. The environment returns a reward calculated on the basis of the user's clicks as feedback on the agent's action. The agent updates its own model on the basis of the reward. When the action of the agent is discrete, the agent needs to learn all stores listed on Ekiten, and it takes a very long time for the learning to converge. In this study, we therefore used DDPG [12], which is compatible with problems that have continuous state and action spaces. We converted store information to an N-dimensional distributed representation using LDA [13]. This made it possible to calculate the degree of suitability of a store for a user even for stores that have never received feedback. We used the mean reciprocal rank (MRR) [14] to evaluate this recommendation system. In this paper, we determined whether the agent's store output increasingly fit the user's preferences as learning progressed.
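For reference, MRR and recall@k (reported as recall@3 in Section 5) can be computed along the following lines; the list-of-rankings input format is an assumption, not the paper's data format.

```python
def mean_reciprocal_rank(ranked_lists, clicked_items):
    """MRR: average of 1 / rank of the clicked store in each ranked recommendation list."""
    total = 0.0
    for ranking, clicked in zip(ranked_lists, clicked_items):
        rank = ranking.index(clicked) + 1 if clicked in ranking else None
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, clicked_items, k=3):
    """Fraction of queries whose clicked store appears in the top k of the ranking."""
    hits = sum(1 for ranking, clicked in zip(ranked_lists, clicked_items) if clicked in ranking[:k])
    return hits / len(ranked_lists)
```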
4. RECOMMENDING STORES WITH DEEP REINFORCEMENT LEARNING

We used deep reinforcement learning to recommend stores. In this research, the state space and the action space are treated as continuous values. A distributed representation of stores can be obtained using LDA, with Ekiten's reviews as the corpus. The agent learns to output the distributed representation of a store that matches the target user's preferences. This section describes the store vector representation method (Section 4.1), the elements necessary for deep reinforcement learning (Section 4.2), and the deep reinforcement learning algorithm used in this research (Section 4.3).

4.1 DISTRIBUTED REPRESENTATION METHOD FOR STORES

We use LDA [13] as the stochastic language model to turn stores into a distributed representation. A bag-of-words (BoW) model, which is a set of words and their occurrence frequencies, was built. In a BoW model, the order of words in a document is ignored. The model represents the phenomenon of word co-occurrence, and it can be assumed that there are several sets of words with a statistically high probability of co-occurring. One document is assumed to consist of N topics, and each topic is given a probability value. We applied LDA to documents that we made for each store from its review information. The resulting store vector item has N-dimensional probability values as its elements.
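The paper does not name a specific LDA implementation. As one possible sketch using scikit-learn, each store's concatenated reviews form one document and the inferred topic distribution serves as the N-dimensional store vector; the vocabulary cap and the default hyperparameters are assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def build_store_vectors(store_review_texts, n_topics=50):
    """Map each store's concatenated review text to an n_topics-dimensional topic distribution."""
    vectorizer = CountVectorizer(max_features=20_000)        # bag-of-words term counts per store document
    bow = vectorizer.fit_transform(store_review_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    store_vectors = lda.fit_transform(bow)                   # each row is a per-store topic probability vector
    return store_vectors
```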
4.2 DEFINITION OF ELEMENTS FOR DEEP REINFORCEMENT LEARNING

This section gives the definitions of the agent, environment, and episodes, which are the constituent elements of reinforcement learning.

4.2.1 AGENT

When the agent receives the state s_t from the environment at time t, it outputs the N-dimensional store vector as the action a_t, receives the reward r_t, and updates the model. a_t is an element of the action space A ∈ [0, 1]^N. When recommending a store at time t, the set of stores Item_t to be recommended is determined on the basis of the keywords entered by the user. We calculate the similarity between the action a_t output by the agent and the stores Item_t to be recommended. The similarity is defined as follows:

    sim(a_t, item) = \frac{\sum_{n=1}^{N} a_{t,n}\, item_n}{\sqrt{\sum_{n=1}^{N} a_{t,n}^2}\, \sqrt{\sum_{n=1}^{N} item_n^2}}    (1)

The similarity defined by Equation 1 is calculated between the distributed representation of each store in Item_t being recommended and the action vector a_t. The store recommend_item_t that maximizes the similarity is displayed to the user:

    recommend_item_t = \arg\max_{item \in Item_t} sim(a_t, item)    (2)
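A small sketch of this nearest-store lookup over the candidate set Item_t; the small constant added to the denominator only guards against division by zero and is not part of Equations 1 and 2.

```python
import numpy as np

def select_store(action_vec, candidate_vectors):
    """Return the index of the candidate store whose LDA vector is most cosine-similar
    to the action vector output by the agent (Equations 1 and 2)."""
    a = np.asarray(action_vec, dtype=float)
    items = np.asarray(candidate_vectors, dtype=float)
    sims = items @ a / (np.linalg.norm(items, axis=1) * np.linalg.norm(a) + 1e-12)
    return int(np.argmax(sims))
```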
4.2.2 ENVIRONMENT

After displaying to the user the store received from the agent, the environment sends the agent the reward r_t, calculated on the basis of click information, and the next state s_{t+1}. In this research, there are two ways of giving rewards. The first method gives as the reward (r_action_vec) the similarity between the N-dimensional vector output by DDPG and the clicked store. The reward calculation is defined as follows:

    r_{action\_vec} = \frac{\sum_{n=1}^{N} a_{t,n}\, clicked\_item_{t,n}}{\sqrt{\sum_{n=1}^{N} a_{t,n}^2}\, \sqrt{\sum_{n=1}^{N} clicked\_item_{t,n}^2}} - 0.5    (3)

The first term of Equation 3 is the cosine similarity. The action space is A ∈ [0, 1]^N, so the cosine similarity lies between 0 and 1. We wanted to give a negative reward when the degree of similarity between the clicked store and the recommended store is low, so the reward becomes r_t ∈ [−0.5, 0.5] by subtracting 0.5 from the cosine similarity. The second reward (r_item_vec) is calculated by replacing the action vector in Equation 3 with recommend_item_t obtained from Equation 2.

The state is expressed on the basis of the search keywords entered at time t and the stores that the user browsed up to time t − 1. The area area_t is represented by a two-dimensional vector of latitude and longitude. In this research, we recommended dining-genre stores. There are 40 small genres in the dining genre, and the genre genre_t is represented by a 40-dimensional one-hot vector. The stores that the user has browsed in the past are expressed as history_t. More precisely, history_t is obtained by adding the store vectors of the j stores that the user viewed most recently. The store viewed most recently is the most likely to reflect the user's preferences, so we adopted a discounted sum with a discount factor γ ∈ [0, 1]. The user's browsing history history_t is defined as follows:

    history_t = \sum_{i=1}^{j} \gamma^{i}\, clicked\_item_{t-i}    (4)

The store vector is represented by probability values, and the browsing histories are also represented by probability values. We normalized history_t as follows:

    normalized\_history_t = \frac{history_t}{|history_t|}    (5)

We also defined the state vector s_t using area_t, genre_t, and normalized_history_t as follows:

    s_t = concat(area_t, genre_t, normalized\_history_t)    (6)

concat is a function that concatenates vectors.
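A minimal sketch of the state construction of Equations 4-6 and the r_action_vec reward of Equation 3; γ = 0.9, the zero-norm guard, and the assumption of a non-empty click history are illustrative choices, not values from the paper.

```python
import numpy as np

def build_state(area_latlon, genre_onehot, clicked_item_history, gamma=0.9):
    """Concatenate area, genre, and the normalized discounted sum of recently clicked store
    vectors (Equations 4-6). Assumes clicked_item_history holds at least one store vector,
    ordered from most recently viewed to oldest."""
    history = sum((gamma ** i) * np.asarray(item, dtype=float)
                  for i, item in enumerate(clicked_item_history, start=1))
    norm = np.linalg.norm(history)
    normalized_history = history / norm if norm > 0 else history
    return np.concatenate([np.asarray(area_latlon, dtype=float),
                           np.asarray(genre_onehot, dtype=float),
                           normalized_history])

def action_vec_reward(action_vec, clicked_item_vec):
    """r_action_vec from Equation 3: cosine similarity shifted into [-0.5, 0.5]."""
    a = np.asarray(action_vec, dtype=float)
    c = np.asarray(clicked_item_vec, dtype=float)
    cos = a @ c / (np.linalg.norm(a) * np.linalg.norm(c) + 1e-12)
    return cos - 0.5
```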
4.2.3 EPISODE

An episode starts when the user first visits the web page; in this case, it starts when the user lands on a page of stores in the dining genre. A step occurs every time the user visits a page in the dining genre, and at each step in the episode a store is recommended. We call the move from one step to the next a state transition. If no state transition occurs for the user for 30 minutes, the episode ends. When the same user visits again after that, a new episode begins. Even for the same user, the purpose of a search may be different after several tens of minutes, which is why we defined the episode this way.

4.3 ALGORITHM FOR DEEP REINFORCEMENT LEARNING

The DDPG [12] used in this study is a combination of DQN [10, 11] and the deterministic policy gradient (DPG) [15]. As previously explained, DQN stabilizes learning by using the target network and reduces the correlation between data with experience replay. DPG uses a critic function to evaluate the value of a state and an actor function to select actions according to the state. By combining these two methods, we can handle continuous values in the action space. In the DDPG algorithm, at time t we use a mini-batch of B_t tuples sampled randomly from the replay buffer, where each tuple contains the current state, action, reward, and next state. The critic function and the actor function are updated using this mini-batch. (s_{t,i}, a_{t,i}, r_{t,i}, s_{t,i+1}) is an element of the tuple set included in the mini-batch. Regarding Q(s_{t,i}, a_{t,i}) as the action-value function, the critic function is updated to minimize the loss function L shown below:

    L = \frac{1}{B_t} \sum_{i} \left( y_{t,i} - Q(s_{t,i}, a_{t,i} \,|\, \theta^{Q}) \right)^2    (7)

The teacher signal y_{t,i} is expressed as Equation 8 using the reward r_{t,i} and the action value calculated from the target networks of the critic and the actor (θ^{Q'}, θ^{μ'}). γ ∈ [0, 1] is the discount rate.

    y_{t,i} = r_{t,i} + \gamma\, Q'(s_{t,i+1}, \mu'(s_{t,i+1} \,|\, \theta^{\mu'}) \,|\, \theta^{Q'})    (8)

The actor function is updated on the basis of the gradient expressed by the following equation:

    \nabla_{\theta^{\mu}} J \approx \frac{1}{B_t} \sum_{i} \nabla_{a} Q(s, a \,|\, \theta^{Q})\big|_{s=s_{t,i},\, a=\mu(s_{t,i})}\, \nabla_{\theta^{\mu}} \mu(s \,|\, \theta^{\mu})\big|_{s_{t,i}}    (9)

Within a mini-batch, we do not change the two target networks; we update only the critic network θ^Q and the actor network θ^μ.
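The paper gives no implementation details beyond Equations 7-9. The following PyTorch-style sketch shows how one mini-batch update of the critic and the actor could look; the actor/critic module interfaces, the optimizers, and the soft target-update rate tau are assumptions rather than the authors' settings (the paper only states that the target networks are held fixed within a mini-batch).

```python
import torch
import torch.nn as nn

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One mini-batch update corresponding to Equations 7-9.
    actor(s) -> action in [0, 1]^N; critic(s, a) -> scalar Q-value (both assumed nn.Modules)."""
    states, actions, rewards, next_states = batch  # tensors: (B, state_dim), (B, N), (B, 1), (B, state_dim)

    # Critic: minimize (y - Q(s, a))^2 with y computed from the target networks (Equations 7 and 8).
    with torch.no_grad():
        target_q = rewards + gamma * target_critic(next_states, target_actor(next_states))
    critic_loss = nn.functional.mse_loss(critic(states, actions), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the deterministic policy gradient of Equation 9
    # (equivalently, minimize -Q(s, mu(s))).
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks after the mini-batch (a common choice;
    # the paper does not specify its target-update schedule).
    with torch.no_grad():
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)
        for p, tp in zip(actor.parameters(), target_actor.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)
```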
5. EXPERIMENT

This section describes our experiment. In the experiment, we used Ekiten's delivery data for the past 14 months. Specifically, the target logs are those in which a user searched and selected (clicked) a store. Logs in which no click occurred were excluded from the experiment, because when a click does not occur it cannot be determined whether the user was not interested or simply left the page, which makes it difficult to assign an appropriate reward. The experiment used 239,846 data records. The experimental settings of DDPG [12] are explained in Section 5.1. In Section 5.2, we explain the comparison methods and their experimental settings. Section 5.3 explains the evaluation method, and Section 5.4 gives the experimental results, which are then discussed in Section 5.5. In the experiment, the number of dimensions of the distributed representation of stores was compared with N ∈ {10, 50, 100}.

5.1 PROPOSED METHOD

DDPG [12] is used as the proposed method with the two reward patterns described in Section 4, r_action_vec and r_item_vec, to reward the agent.

5.2 COMPARATIVE ALGORITHM

In order to test the performance of the proposed method, we compared its results with those of a random recommendation method and a bandit algorithm.
5.4 RESULTS

5.4.1 COMPARISON OF PERFORMANCE USING LDA

We conducted preliminary experiments to find the optimal number of dimensions N for DDPG. Comparative experiments were conducted with N ∈ {10, 50, 100} for the dimension of the distributed representation of stores. To verify the effect of the number of LDA dimensions, we used the reward r_action_vec described in Section 4.2.2. Table 1 shows the MRR and recall@3 scores when the DDPG algorithm was run with each dimension. Both the MRR and recall@3 scores are at their minimum with N = 10 (MRR = 0.288, recall@3 = 0.308) and at their maximum with N = 50 (MRR = 0.302, recall@3 = 0.327). Figure 1 shows the average reward and the MRR and recall@3 for every 10,000 steps when running the DDPG algorithm with each dimension. The average reward per step is always at its maximum when the number of LDA dimensions is N = 10 and at its minimum when N = 50. Although the value of the reward differs, the values tend to converge as the number of steps increases. When N = 10, the MRR per step ranges from 0.279 to 0.297, and recall@3 ranges from 0.296 to 0.321. When N = 100, the MRR per step ranges from 0.286 to 0.307, and recall@3 ranges from 0.299 to 0.332. When N = 50, the MRR per step ranges from 0.293 to 0.312, and recall@3 ranges from 0.313 to 0.345. As shown, when the number of LDA dimensions is N = 50, the score at each step is the highest, meaning this dimension is the most effective.

[Figure 1: Average reward, MRR, and recall@3 per 10,000 steps when changing the dimension N of LDA in DDPG]

5.4.2 COMPARISON OF PERFORMANCE IN GIVING REWARDS

As in Section 5.4.1, we conducted preliminary experiments to determine how to give rewards suitable for DDPG. Comparative experiments were conducted to compare r_action_vec and r_item_vec as the reward, with the LDA dimension fixed at N = 50. Table 2 shows the MRR and recall@3 scores for the two kinds of rewards. Both the MRR and recall@3 scores of r_action_vec exceed those of r_item_vec. Figure 2 shows the average reward and the MRR and recall@3 for every 10,000 steps when changing the reward in the DDPG algorithm. The average reward tends to increase with r_action_vec and decrease with r_item_vec. With r_action_vec, the MRR per step ranges from 0.293 to 0.312, and recall@3 ranges from 0.313 to 0.345. With r_item_vec, the MRR per step ranges from 0.287 to 0.301, and recall@3 ranges from 0.302 to 0.328. Learning is stabilized by giving r_action_vec as the reward, which shows in both the MRR and recall@3 scoring high at each step.

[Figure 2: Average reward, MRR, and recall@3 per 10,000 steps when changing the reward in DDPG]

5.4.3 COMPARISON OF PERFORMANCE WITH EXISTING METHODS

Sections 5.4.1 and 5.4.2 explained how we determined the optimal number of dimensions N for DDPG and the reward design. Here, we demonstrate the superiority of DDPG by comparing it with the existing methods. Table 3 gives the results of all the experimental settings for each method. DDPG (N = 50, r_action_vec), the proposed method, scored MRR = 0.302 and recall@3 = 0.327, which were the best scores among all conditions of the bandit algorithms and the random recommendation method. Softmax (τ = 0.2, N = 50) had the highest score among the bandit algorithms, with MRR = 0.292 and recall@3 = 0.311. Figure 3 shows higher MRR and recall@3 values from DDPG earlier in learning than from the other methods. DDPG also has the maximum score for both MRR and recall@3 in almost all steps even as learning progresses. Softmax exceeded DDPG in MRR only at steps 200,001-210,000, and the difference was slight.

[Figure 3: Average reward, MRR, and recall@3 per 10,000 steps in the experiment setting with the highest score for each method]
5.5 DISCUSSION

Here, we discuss the results of the experiment. DDPG had high MRR and recall@3 scores from an early learning stage compared to the scores of the bandit algorithms. In DDPG, store vectors are output as actions in the distributed representation. Therefore, it is possible to use the rewards gained when recommending a certain store for the recommendation of other stores. In the bandit algorithms, however, there is a big difference in learning speed because exploration has to be performed for each store separately. We consider the higher learning speed to be the effect of applying deep reinforcement learning to the system.

There was also a difference in the magnitude of the average reward in each experiment. The reward was calculated with the similarity calculation of Equation 3. The difference in the magnitude of the average reward when changing the dimension N of the LDA did not affect the accuracy of our model. For this research, it was not important for the size of the average reward to affect the magnitude of MRR or recall@3; rather, it was important to create an appropriate store vector with LDA and for learning to converge. Among the LDA dimensions, N = 10 gives the lowest MRR and recall@3 scores. If N is small, the state space and the action space are both reduced so that learning converges faster, but the MRR and recall@3 scores are considered to be low because store information is not represented properly.

For the reward design, r_action_vec was more stable than r_item_vec. The reason learning was not stable when the reward design was set to r_item_vec is thought to be that the output action vector was replaced with the store vector when obtaining the reward. There is a possibility that an appropriate reward is not given when the divergence between the action vector and the store vector is large, especially at

6. CONCLUSION

We proposed a framework for a recommender system using deep reinforcement learning and achieved higher recommendation accuracy than that of existing methods. Using reinforcement learning, we were able to make recommendations that consider state transitions. We also made the following contributions. First, we could output the stores recommended by deep reinforcement learning as vectors in the N-dimensional distributed representation. Therefore, compared with bandit algorithms that recommend stores directly, it was possible to learn efficiently, resulting in high MRR and recall values. Second, in our system the reward was given on the basis of the similarity of the recommendation to the clicked store vector, so we were able to give a reward even if a store other than the recommended store was clicked. Therefore, it was possible to learn even offline. Third, store vectors were made from reviews, so even newly listed stores could be recommended as long as LDA vectors could be made for those stores. In this paper, we could not verify the features used for the state or the hyperparameters of DDPG. It is necessary to verify this and other aspects of recommender systems using deep reinforcement learning in future research.

7. REFERENCES

[1] J. Ben Schafer, Joseph Konstan, and John Riedl. Recommender systems in e-commerce. E-COMMERCE 99, ACM, 1999.
[2] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. GroupLens: an open architecture for collaborative filtering of netnews. Computer-Supported Cooperative Work and Social Computing, 1994.
[3] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning. The MIT Press, 1998.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002.
[5] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. JMLR: Workshop and Conference Proceedings, Vol. 21, 2012.
[6] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
[7] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. World Wide Web, 2010.
[8] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. Web Search and Data Mining, 2012.