Deep Reinforcement Learning for Chatbots Using Clustered Actions and Human-Likeness Rewards


Heriberto Cuayáhuitl (School of Computer Science, University of Lincoln, Lincoln, United Kingdom)
Donghyeon Lee, Seonghan Ryu, Sungja Choi, Inchul Hwang, Jihie Kim (Artificial Intelligence Research Group, Samsung Electronics, Seoul, South Korea)

Work carried out while the first author was visiting Samsung Research.

arXiv:1908.10331v1 [cs.AI] 27 Aug 2019

Abstract—Training chatbots using the reinforcement learning paradigm is challenging due to high-dimensional states, infinite action spaces and the difficulty in specifying the reward function. We address such problems using clustered actions instead of infinite actions, and a simple but promising reward function based on human-likeness scores derived from human-human dialogue data. We train Deep Reinforcement Learning (DRL) agents using chitchat data in raw text—without any manual annotations. Experimental results using different splits of training data report the following. First, that our agents learn reasonable policies in the environments they get familiarised with, but their performance drops substantially when they are exposed to a test set of unseen dialogues. Second, that the choice of sentence embedding size between 100 and 300 dimensions is not significantly different on test data. Third, that our proposed human-likeness rewards are reasonable for training chatbots as long as they use lengthy dialogue histories of ≥10 sentences.

Index Terms—neural networks, reinforcement / unsupervised / supervised learning, sentence embeddings, chatbots, chitchat

I. INTRODUCTION

What happens in the minds of humans during chatty interactions containing sentences that are not only coherent but also engaging? While not all chatty human dialogues are engaging, they are arguably coherent [1]. They also exhibit large vocabularies—according to the language in focus—because conversations can address any topic that comes to the minds of the partner conversants. In addition, each contribution by a partner conversant may exhibit multiple sentences instead of one, such as greeting+question or acknowledgement+statement+question. Furthermore, the topics raised in the conversation may go back and forth without losing coherence. This is a big challenge for data-driven chatbots.

We present a novel approach based on the reinforcement learning [2], unsupervised learning [3] and deep learning [4] paradigms. Our learning scenario is as follows: given a data set of human-human dialogues in raw text (without any manually provided labels), a Deep Reinforcement Learning (DRL) agent takes the role of one of the two partner conversants in order to learn to select human-like sentences when exposed to both human-like and non-human-like sentences. In our learning scenario the agent-environment interactions consist of agent-data interactions – there is no user simulator as in task-oriented dialogue systems.

Fig. 1. High-level architecture of the proposed deep reinforcement learning approach for chatbots—see text for details.
During each verbal contribution, the DRL agent (1) observes the state of the world via a deep neural network, which models a representation of all sentences raised in the conversation together with a set of candidate responses or agent actions (referred to as clustered actions in our approach); (2) it then selects an action so that its word-based representation is sent to the environment; and (3) it receives an updated dialogue history and a numerical reward for having chosen each action, until a termination condition is met. This process—illustrated in Figure 1—is carried out iteratively until the end of a dialogue for as many dialogues as necessary, i.e. until there is no further improvement in the agent's performance.

The contributions of this paper are as follows.
• We propose to train chatbots using value-based deep reinforcement learning using action spaces derived from unsupervised clustering, where each action cluster is a representation of a type of meaning (greeting, question around a topic, statements around a topic, etc.).
• We propose a simple though promising reward function. It is based on human-human dialogues and noisy dialogues for learning to rate good vs. bad dialogues. According to an analysis of dialogue reward prediction, dialogues with lengthy dialogue histories (of at least 10 sentences) report strong correlations between true and predicted rewards on test data.
• Our experiments comparing different sentence embedding sizes (100 vs. 300) did not report statistical differences on test data. This means that similar results can be obtained more efficiently with the smaller embedding than the larger one due to fewer features. In other words, sentence embeddings of 100 dimensions are as good as 300 dimensions but less computationally demanding.
• Last but not least, we found that training chatbots on multiple data splits is crucial for improved performance over training chatbots using the entire training set.

The remainder of the paper describes our proposed approach in more detail and evaluates it using a publicly available dataset of chitchat conversations. Although our learning agents indeed improve their performance over time with dialogues that they get familiarised with, their performance drops with dialogues that the agents are not familiar with. The former is promising and in favour of our proposed approach, and the latter is not, but it is a general problem faced by data-driven chatbots and an interesting avenue for future research.

II. RELATED WORK

Reinforcement Learning (RL) methods are typically based on value functions or policy search [2], which also applies to deep RL methods. While value functions have been particularly applied to task-oriented dialogue systems [5]–[10], policy search has been particularly applied to open-ended dialogue systems such as (chitchat) chatbots [11]–[15]. This is not surprising given the fact that task-oriented dialogue systems use finite action sets, while chatbot systems use infinite action sets. So far there is a preference for policy search methods for chatbots, but it is not clear whether they should be preferred because they face problems such as local optima rather than global optima, inefficiency and high variance. It is thus that this paper explores the feasibility of value function-based methods for chatbots, which has not been explored before—at least not from the perspective of deriving the action sets automatically as attempted in this paper.

Other closely related methods to deep RL include seq2seq models for dialogue generation [16]–[21]. These methods tend to be data-hungry because they are typically trained with millions of sentences, which implies high computational demands. While they can be used to address the same problem, in this paper we focus our attention on deep RL-based chatbots and leave their comparison or combination as future work. Nonetheless, these related works agree with the fact that evaluation is a difficult part and that there is a need for better evaluation metrics [22]. This is further supported by [23], where they found that metrics such as Bleu and Meteor amongst others do not correlate with human judgments.

With regard to performance metrics, the reward functions used by deep RL dialogue agents are either specified manually depending on the application, or learnt from dialogue data. For example, [11] conceives a reward function that rewards positively sentences that are easy to respond to and coherent while penalising repetitiveness. [12] uses an adversarial approach, where the discriminator is trained to score human vs. non-human sentences so that the generator can use such scores during training. [13] trains a reward function from human ratings. All these related works are neural-based, and there is no clear best reward function to use in future (chitchat) chatbots. This motivated us to propose a new metric that is easy to implement, practical due to requiring only data in raw text, and potentially promising as described below.

III. PROPOSED APPROACH

To explain the proposed learning approach we first describe how to conceive a finite set of dialogue actions from raw text, then we describe how to assign rewards, and finally describe how to bring everything together during policy learning.

A. Clustered Actions

Actions in reinforcement learning chatbots correspond to sentences, and their size is infinite assuming all possible combinations of word sequences in a given language. This is especially true in the case of open-ended conversations that make use of large vocabularies, as opposed to task-oriented conversations that make use of smaller (restricted) vocabularies. A clustered action is a group of sentences sharing a similar or related meaning via sentence vectors derived from word embeddings [24], [25]. While there are multiple ways of selecting features for clustering and also multiple clustering algorithms, the following requirements arise for chatbots: (1) unlabelled data due to human-human dialogues in raw text (this makes it difficult to evaluate the goodness of clustering features and algorithms), and (2) scalability to clustering a large set of data points (sentences in our case, which are mostly unique).
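As a rough illustration of this clustering step, which is formalised next, the following is a minimal sketch assuming pre-trained GloVe word vectors loaded into a Python dictionary and scikit-learn's KMeans with k-means++ initialisation; the function names and defaults are ours, not the authors' released code.

```python
import numpy as np
from sklearn.cluster import KMeans

def sentence_vector(sentence, glove, dim=100):
    """Mean word vector of a sentence; words missing from the vocabulary are skipped."""
    vectors = [glove[w] for w in sentence.lower().split() if w in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def fit_action_clusters(sentences, glove, k=100):
    """Cluster sentence vectors into k clustered actions using k-means++ seeding."""
    X = np.stack([sentence_vector(s, glove) for s in sentences])
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    return model  # model.predict(new_vectors) maps sentence vectors to cluster IDs (actions)
```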
Given a set of data points \{x_1, \dots, x_n\}, \forall x_i \in \mathbb{R}^m, and a similarity metric d(x_i, x_{i'}), the task is to find a set of k clusters with a clustering algorithm. Since in our case each data point x corresponds to a sentence within a dialogue, we represent sentences via their mean word vectors—similarly as in Deep Averaging Networks [26]—denoted as

x_i = \frac{1}{N_i} \sum_{j=1}^{N_i} c_j,

where c_j is the vector of coefficients of word j and N_i is the number of words in sentence i. For scalability purposes, we use the K-Means++ algorithm [27] with the Euclidean distance

d(x_i, x_{i'}) = \sqrt{\sum_{j=1}^{m} (x_i^j - x_{i'}^j)^2}

with m dimensions, and assume that k is provided rather than automatically induced – though other algorithms can be used with our approach. In this way, a trained clustering model assigns a cluster ID a \in A to features x_i, where the number of actions is equivalent to the number of clusters, i.e. |A| = k.

B. Human-Likeness Rewards

Reward functions in reinforcement learning dialogue agents are often a difficult aspect. We propose to derive the rewards from human-human dialogues by assigning positive values to contextualised responses seen in the data, and negative values to randomly generated responses due to lacking coherence (also referred to as 'non-human-like responses')—see the example in Table I. Thus, an episode or dialogue reward can be computed as R_i = \sum_{j=1}^{N} r_j^i(a), where i is the dialogue in focus, j the dialogue turn in focus, and r_j^i(a) is given according to

r_j^i(a) = \begin{cases} +1, & \text{if } a \text{ is a human response in dialogue-turn } i, j, \\ -1, & \text{if } a \text{ is human but randomly chosen (incoherent).} \end{cases}

C. Policy Learning

Our Deep Reinforcement Learning (DRL) agents aim to maximise their cumulative reward over time according to

Q^*(s, a; \theta) = \max_{\pi_\theta} \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s, a, \pi_\theta \right],

where r is the numerical reward given at time step t for choosing action a in state s, \gamma is a discounting factor, and Q^*(s, a; \theta) is the optimal action-value function using weights \theta in a neural network. During training, a DRL agent will choose actions in a probabilistic manner in order to explore new (s, a) pairs for discovering better rewards or to exploit already learnt values—with a reduced level of exploration and an increased level of exploitation over time. During testing, a DRL agent will choose the best actions a^* according to

\pi_{\theta}^*(s) = \arg\max_{a \in A} Q^*(s, a; \theta).

Our DRL agents implement the procedure above using a generalisation of the DQN method [28]—see Algorithm 1. After initialising replay memory D, dialogue history H, action-value function Q and target action-value function Q̂, we sample a training dialogue from our data of human-human conversations (lines 1-4). A human starts the conversation, which is mapped to its corresponding sentence embedding representation (lines 5-6). Then a set of candidate responses is generated, including (1) the true human response and (2) randomly chosen responses (distractors). The candidate responses are clustered as described in Section III-A and the resulting actions are taken into account by the agent for action selection (lines 8-10). Once an action is chosen, it is conveyed to the environment, a reward is observed as described in Section III-B, and the agent's partner response is observed as well in order to update the dialogue history H (lines 11-14). With such an update, the new sentence embedding representation is generated from H in order to update the replay memory D with learning experience (s, a, r, s') (lines 15-16). Then a minibatch of experiences MB = (s_j, a_j, r_j, s'_j) is sampled from D in order to update the weights \theta according to the error derived from the difference between the target value y_j and the predicted value Q(s, a; \theta) (see lines 18 and 20), which is based on the following loss function:

L(\theta_j) = \mathbb{E}_{MB}\left[ \left( r + \gamma \max_{a'} \hat{Q}(s', a'; \hat{\theta}_j) - Q(s, a; \theta_j) \right)^2 \right].

The target action-value function Q̂ and state s are updated accordingly (lines 21-22), and this iterative procedure continues until convergence.

IV. EXPERIMENTS AND RESULTS

A. Data

We used data from the Persona-Chat data set¹, which includes 17,877 dialogues for training (131,431 turns) and 999 dialogues for testing (7,793 turns). They represent averages of 7.35 and 7.8 dialogue turns for training and testing, respectively—see the example dialogue in Table I. The vocabulary of the entire data set contains 19,667 unique words.

¹ Data set downloaded from http://parl.ai/ on 18 May 2018 [29].

B. Experimental Setting

To analyse the performance of our ChatDQN agents we use subsets of training data vs. the entire training data set. The former are automatically generated by using sentence vectors to represent the features of each dialogue—as described in Section III-A. Similarly, the agents' states are modelled using sentence vectors of the dialogue history with the pretrained coefficients of the Glove model [25]. In all our experiments we use the following neural network architecture²:
• mean word vectors, one per sentence, in the input layer (maximum number of vectors=50, with zero-padding) – each word vector of 100 or 300 embedding size,
• two Gated Recurrent Unit (GRU) [30] layers with latent dimensionality of 256, and
• a fully connected layer with number of nodes=the number of clusters, i.e. each cluster corresponding to one action.

² Other hyperparameters include embedding batch size=128, dropout=0.2, latent dimensionality=256, discount factor=0.99, size of candidate responses=3, max. number of sentence vectors in H=50, burning steps=3K, memory size=10K, target model update (C)=10K, learning steps=50K, test steps=100K. The number of parameters in our neural nets with 100 and 300 sentence vector dimensions corresponds to 4.4 and 12.1 million, respectively.
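A minimal Keras sketch of the architecture listed above (50 zero-padded sentence vectors of dimension 100 as input, two GRU layers of 256 units, and a linear output with one Q-value per cluster). The optimiser and any detail not stated in the bullets or the footnote are our assumptions rather than the authors' code.

```python
from tensorflow.keras.layers import Input, GRU, Dense
from tensorflow.keras.models import Model

MAX_SENTENCES, EMBED_DIM, NUM_CLUSTERS = 50, 100, 100  # values described in Section IV-B

def build_chatdqn_network():
    """Q-network: dialogue history as a sequence of mean sentence vectors -> one Q-value per clustered action."""
    history = Input(shape=(MAX_SENTENCES, EMBED_DIM))       # zero-padded sentence vectors
    h = GRU(256, return_sequences=True)(history)            # first GRU layer
    h = GRU(256)(h)                                         # second GRU layer
    q_values = Dense(NUM_CLUSTERS, activation="linear")(h)  # fully connected layer, one node per cluster
    return Model(inputs=history, outputs=q_values)

model = build_chatdqn_network()
model.compile(optimizer="adam", loss="mse")  # optimiser/loss choice is an assumption
```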
TABLE I
Modified dialogue from the Persona-Chat dataset [21] with our proposed rewards: r=+1 means a human-like sentence and r=-1 means non-human-like. The latter sentences are sampled randomly from different dialogues in the same dataset.

Human Sentences | Distorted Human Sentences
hello what are doing today? | hello what are doing today?
i'm good, i just got off work and tired, i have two jobs. [r=+1] | do your cats like candy? [r=-1]
i just got done watching a horror movie | i just got done watching a horror movie
i rather read, i have read about 20 books this year. [r=+1] | do you have any hobbies? [r=-1]
wow! i do love a good horror movie. loving this cooler weather | wow! i do love a good horror movie. loving this cooler weather
but a good movie is always good. [r=+1] | good job! if you live to 100 like me, you will need all that learning. [r=-1]
yes! my son is in junior high and i just started letting him watch them | yes! my son is in junior high and i just started letting him watch them
i work in the movies as well. [r=+1] | what a nice gesture. i take my dog to compete in agility classes. [r=-1]
neat!! i used to work in the human services field | neat!! i used to work in the human services field
yes it is neat, i stunt double, it is so much fun and hard work. [r=+1] | you work very hard. i would like to do a handstand. can you teach it? [r=-1]
yes i bet you can get hurt. my wife works and i stay at home | yes i bet you can get hurt. my wife works and i stay at home
nice, i only have one parent so now i help out my mom. [r=+1] | yes i do, red is one of my favorite colors [r=-1]
i bet she appreciates that very much. | i bet she appreciates that very much.
she raised me right, i'm just like her. [r=+1] | haha, it is definitely attention grabbing! [r=-1]
my dad was always busy working at home depot | my dad was always busy working at home depot
now that i am older home depot is my toy r us. [r=+1] | i bet there will be time to figure it out. what are your interests? [r=-1]
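To make the reward assignment of Section III-B and Table I concrete, here is a minimal sketch of the per-turn reward r_j^i(a) and the episode reward R_i; the candidate-generation machinery is omitted and the function names are ours.

```python
def turn_reward(chosen_sentence, true_human_sentence):
    """r = +1 if the agent chose the true human response for this turn, -1 for a randomly sampled distractor."""
    return 1.0 if chosen_sentence == true_human_sentence else -1.0

def dialogue_reward(chosen_sentences, true_sentences):
    """Episode reward R_i: sum of turn rewards r_j^i(a) over all turns of dialogue i."""
    return sum(turn_reward(c, t) for c, t in zip(chosen_sentences, true_sentences))
```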

While a small number of sentence clusters could result in actions being assigned to potentially the same cluster, a larger number of sentence clusters would mitigate the problem, but the larger the number of clusters, the larger the computational expense—i.e. more parameters in the neural network. Figure 2(a) shows an example of our sentence clustering using 100 clusters on our training data. A manual inspection showed that greeting sentences were mostly assigned to the same cluster, and questions expressing preferences (e.g. What is your favourite X?) were also assigned to the same cluster. In this work we thus use a sentence clustering model with k=100 derived from our training data and prior to reinforcement learning³. In addition, we trained a second clustering model to analyse our experiments using different data splits, where instead of clustering sentences we cluster dialogues. Given that we represent a sentence using a mean word vector, a dialogue can thus be represented by a group of sentence vectors. Figure 2(b) shows an example of our dialogue clustering using 20 clusters on our training data.

³ Each experiment in this paper was run on a Tesla K80 GPU using the following libraries: Keras (https://github.com/keras-team/keras), OpenAI (https://github.com/openai) and Keras-RL (https://github.com/keras-rl/keras-rl).

Notice that while previous related works in task-oriented DRL-based agents typically use a user simulator, this paper does not use a simulator. Instead, we use the dataset of human-human dialogues directly and substitute one partner conversant in the dialogues by a DRL agent. The goal of the agent is to choose the human-generated sentences (actions) out of a set of candidate responses.

Algorithm 1 ChatDQN Learning
 1: Initialise Deep Q-Networks with replay memory D, dialogue history H, action-value function Q with random weights θ, and target action-value function Q̂ with θ̂ = θ
 2: Initialise clustering model from training dialogue data
 3: repeat
 4:   Sample a training dialogue (human-human sentences)
 5:   Append first sentence to dialogue history H
 6:   s = sentence embedding representation of H
 7:   repeat
 8:     Generate noisy candidate response sentences
 9:     A = cluster IDs of candidate response sentences
10:     a = rand(A) if random number ≤ ε; otherwise a = arg max_{a∈A} Q(s, a; θ)
11:     Execute chosen clustered action a
12:     Observe human-likeness dialogue reward r
13:     Observe environment response (agent's partner)
14:     Append agent and environment responses to H
15:     s' = sentence embedding representation of H
16:     Append transition (s, a, r, s') to D
17:     Sample random minibatch (s_j, a_j, r_j, s'_j) from D
18:     y_j = r_j if final step of episode; otherwise y_j = r_j + γ max_{a'∈A} Q̂(s', a'; θ̂)
19:     Set err = (y_j − Q(s, a; θ))²
20:     Gradient descent step on err with respect to θ
21:     Reset Q̂ = Q every C steps
22:     s ← s'
23:   until end of dialogue
24:   Reset dialogue history H
25: until convergence
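As a companion to Algorithm 1, a minimal NumPy sketch of two of its key steps: the ε-greedy choice over candidate cluster IDs (line 10) and the target value used in the squared error (line 18). This is our paraphrase of the listing above, not the authors' implementation.

```python
import numpy as np

def epsilon_greedy_action(q_values, candidate_actions, epsilon):
    """Algorithm 1, line 10: explore among the candidate cluster IDs or exploit the largest Q-value."""
    if np.random.rand() <= epsilon:
        return int(np.random.choice(candidate_actions))
    return max(candidate_actions, key=lambda a: q_values[a])

def dqn_target(reward, next_q_target, candidate_actions, gamma=0.99, terminal=False):
    """Algorithm 1, line 18: y_j = r_j for terminal steps, else r_j + gamma * max_a' Q_target(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_target[a] for a in candidate_actions)
```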
Fig. 2. Example clusters of our training data using Principal Component Analysis [31] for visualisations in 2D – black dots represent sentences or dialogues: (a) 100 clusters of training sentences; (b) 20 clusters of training dialogues.
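Figure 2 visualises the clusters by projecting the vectors to two dimensions with PCA [31]. A minimal sketch of such a projection, assuming a sentence (or dialogue) vector matrix X and its cluster assignments; the plotting details are ours.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_clusters_2d(X, cluster_ids):
    """Project high-dimensional vectors to 2-D with PCA and colour each point by its cluster ID."""
    points = PCA(n_components=2).fit_transform(X)
    plt.scatter(points[:, 0], points[:, 1], c=cluster_ids, s=2, cmap="tab20")
    plt.xlabel("principal component 1")
    plt.ylabel("principal component 2")
    plt.show()
```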

C. Experimental Results

The plots in Figure 3 show the training performance of our ChatDQN agents—all using 100 clustered actions. Each plot contains two learning curves, one per agent, where each agent uses a different sentence embedding size (100 or 300 dimensions). In addition, each plot uses an automatically generated data split according to our clustered dialogues. These plots show evidence that all agents indeed improve their behaviour over time even when they use only 100 actions. This can be observed from their average episode rewards—the higher the better—in all learning curves. From a visual inspection, we can observe that the agents using either embedding size (100 or 300) perform rather equivalently, but with a small trend for 300 dimensions to dominate its counterpart – more on this below.

The performance of our ChatDQN agents using all training dialogues is shown in Figure 4. It can be noted that in contrast to the previous agents, where their improvement in average reward reached values of around 2, the performance of these agents was lower (with average episode reward < 0). We attribute this to the larger amount of variation exhibited when going from about 1K dialogues to 17.8K dialogues.

We analysed the performance of our agents further by using a test set of totally unseen dialogues during training. Table II summarises our results, where we can note that the larger sentence embedding size (300) generally performed better. While a significant difference (according to a two-tailed Wilcoxon Signed Rank Test) at p = 0.05 was identified in testing on the training set, no significant difference was found in performance during testing on the test set. These results could be confirmed in other datasets and/or settings in future work. In addition, we can observe that the ChatDQN agents trained using all data (agents with id=20) were not able to achieve as good a performance as those agents using smaller data splits. Our results thus reveal that training chatbots on some sort of domains (groups of dialogues automatically discovered in our case) is useful for improved performance.
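The significance statement above relies on a two-tailed Wilcoxon Signed Rank Test at p = 0.05. A minimal sketch of such a paired comparison, assuming the two per-split reward columns of Table II (embedding sizes 100 and 300) are loaded as lists; SciPy's implementation is used here, which is an assumption about tooling rather than the authors' exact procedure.

```python
from scipy.stats import wilcoxon

def compare_embedding_sizes(rewards_100, rewards_300, alpha=0.05):
    """Two-tailed Wilcoxon Signed Rank Test over paired per-split results (e.g. agents 0 to 20)."""
    statistic, p_value = wilcoxon(rewards_100, rewards_300, alternative="two-sided")
    return p_value < alpha, p_value
```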
Fig. 3. Training performance of ChatDQN agents using different data splits of dialogues—see text for details. Panels: (a) data splits 0 to 4; (b) data splits 5 to 9; (c) data splits 10 to 14; (d) data splits 15 to 19 (each from left to right).

Fig. 4. Training performance of our ChatDQN agents using all training dialogues and two sentence embedding sizes.

V. ANALYSIS OF HUMAN-LIKENESS REWARDS

We employ the algorithm of [32] for extending a dataset of human-human dialogues with distorted dialogues. The latter include varying amounts of distortions, i.e. different degrees of human-likeness. We use such data for training and testing reward prediction models in order to analyse the goodness of our proposed reward function. Given the extended dataset D̂ = {(d̂_1, y_1), . . . , (d̂_N, y_N)} with (noisy) dialogue histories d̂_i, the goal is to predict dialogue scores y_i as accurately as possible. We represent a dialogue history via its sentence vectors as in Deep Averaging Networks [26], where sentences are represented with numerical feature vectors denoted as x = {x_1, ..., x_{|x|}}. In this way, a set of word sequences s_j^i in dialogue-sentence pair i, j is mapped to feature vectors

x_j^i = \frac{1}{N_j^i} \sum_{k=1}^{N_j^i} c_{j,k}^i,

where c_{j,k}^i is the vector of coefficients of word k, part of sentence j in dialogue i, and N_j^i is the number of words in the sentence in focus.

We assume that vector Y = {y_1, ..., y_{|Y|}} is the set of target labels, generated as described in the dialogue generation algorithm of [32], and we use the same test data as in the previous section. In this way, dataset D_train = (X_train, Y_train) is used for training neural regression models using varying amounts of dialogue history, and dataset D_test = (X_test, Y_test) is used for testing the learnt models.

Our experiments use a 2-layer Gated Recurrent Unit (GRU) neural network [30], similar to the one in Section IV-B but including Batch Normalisation [33] between hidden layers. We trained neural networks for six different lengths of dialogue history, ranging from 1 sentence to 50 sentences. Each length size involved a separate neural network, trained 10 times in order to report results over multiple runs. Figure 5 reports the average Pearson correlation coefficient—between true dialogue rewards and predicted dialogue rewards—for each length size. It can be observed that short dialogue histories contribute to obtain weak correlations, and that longer dialogue histories (≥ 10 sentences) contribute to obtain strong correlations. It can also be observed that the longest history may not be the best choice of length size: the network using 25 sentences achieved the best results. From these results we can conclude that our proposed human-likeness rewards—with lengthy dialogue histories—can be used for training future neural-based chatbots.
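A minimal Keras sketch of such a reward predictor and its evaluation, assuming X_train/X_test hold zero-padded sequences of sentence vectors for a fixed dialogue-history length and y_train/y_test hold the target dialogue scores; hyperparameters not stated in the text are our assumptions.

```python
from tensorflow.keras.layers import Input, GRU, Dense, BatchNormalization
from tensorflow.keras.models import Model
from scipy.stats import pearsonr

def build_reward_predictor(history_len, embed_dim=100):
    """2-layer GRU regressor with Batch Normalisation between hidden layers, predicting a scalar dialogue reward."""
    inputs = Input(shape=(history_len, embed_dim))
    h = GRU(256, return_sequences=True)(inputs)
    h = BatchNormalization()(h)
    h = GRU(256)(h)
    h = BatchNormalization()(h)
    score = Dense(1, activation="linear")(h)
    model = Model(inputs, score)
    model.compile(optimizer="adam", loss="mse")
    return model

# Evaluation as in Figure 5: Pearson correlation between true and predicted dialogue rewards.
# model = build_reward_predictor(history_len=25)
# model.fit(X_train, y_train, epochs=10, batch_size=128)
# r, _ = pearsonr(y_test, model.predict(X_test).ravel())
```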
TABLE II
Average reward results of ChatDQN agents 0 to 20 trained with different data splits and size of sentence embedding (42 agents in total), where † denotes a significant difference (at p = 0.05) using a two-tailed Wilcoxon Signed Rank Test. Columns (left half: |Embedding|=100; right half: |Embedding|=300): Data Split (|dialogues|), Training, Testing on the Training Set, Testing on the Test Set.
0 (861) 1.8778 3.7711 -1.1708 0 (1000) 1.8168 3.6785 -0.8618
1 (902) 1.3751 3.1663 -1.7006 1 (850) 2.0622 4.4598 -1.8688
2 (907) 1.4194 3.1579 -0.9723 2 (1010) 1.6896 3.6724 -1.4282
3 (785) 2.1532 4.2508 -1.3444 3 (1029) 1.9845 4.0136 -0.6109
4 (1046) 1.2204 2.1581 -1.5633 4 (951) 1.8255 4.0423 -1.4448
5 (767) 1.9456 3.9017 -1.2123 5 (832) 2.0860 4.2182 -0.8277
6 (1053) 0.4621 0.1370 -1.8443 6 (815) 2.1735 4.2592 -1.5193
7 (968) 1.8090 3.8368 -1.1137 7 (891) 2.1921 4.5799 -1.4233
8 (858) 1.7608 3.5531 -1.6678 8 (905) 1.8835 3.8337 -0.6628
9 (826) 1.8431 3.6254 -1.0919 9 (892) 2.0521 4.1882 -1.5267
10 (818) 1.9188 3.8629 -0.5394 10 (835) 2.0709 4.2852 -0.8831
11 (944) 1.8212 3.5724 -1.7020 11 (873) 2.1902 4.4848 -1.3329
12 (873) 2.0195 4.1895 -1.3456 12 (948) 1.7761 3.7927 -1.6167
13 (895) 2.0515 4.1873 -1.8034 13 (932) 1.8563 3.6208 -1.5149
14 (863) 1.9722 4.1479 -1.3244 14 (812) 1.9486 4.0347 -1.5866
15 (842) 1.8214 3.8942 -0.8921 15 (880) 1.1338 2.4880 -1.4084
16 (837) 1.8162 3.8817 -1.3784 16 (787) 2.2628 4.5583 -1.4290
17 (958) 1.6373 3.3373 -0.7726 17 (994) 0.9038 1.5106 -1.5925
18 (1012) 1.7631 3.6279 -1.2690 18 (853) 2.2405 4.4716 -1.4231
19 (862) 2.0683 4.2026 -1.5901 19 (788) 2.0686 4.2219 -0.9594
20 (17877) -0.4138 -1.2473 -1.9684 20 (17877) -0.3516 -0.3490 -2.0870
Average (0-20) 1.6353 3.2959† -1.3461 Average (0-20) 1.8031 3.7174† -1.3337
Sum (0-20) 34.3419 69.2146 -28.2674 Sum (0-20) 37.8656 78.0653 -28.0079
Upper Bound 7.1810 7.1810 7.5942 Upper Bound 7.1810 7.1810 7.5942
Lower Bound -7.2834 -7.2834 -7.7276 Lower Bound -7.2834 -7.2834 -7.7276
Random Sel. -2.4139 -2.4139 -2.5526 Random Sel. -2.4139 -2.4139 -2.5526

Fig. 5. Bar plot showing the performance of our dialogue reward predictors using different amounts of dialogue history (from 1 sentence to 50 sentences). Each bar reports an average Pearson correlation score over 10 runs, where the coefficients report the correlation between true dialogue rewards and predicted dialogue rewards in our test data.

VI. CONCLUSION AND FUTURE WORK

This paper presents a novel approach for training Deep Reinforcement Learning (DRL) chatbots, which uses clustered actions and rewards derived from human-human dialogues without any manual annotations. The task of the agents is to learn to choose human-like actions (sentences) out of candidate responses including human-generated and randomly generated sentences. In our proposed rewards we assume that the latter are generally incoherent throughout the dialogue history. Experimental results using chitchat data report that DRL agents learn reasonable policies using training dialogues, but their generalisation ability in a test set of unseen dialogues remains a key challenge for future research in this field. In addition, we found the following: (a) that sentence embedding sizes of 100 and 300 perform equivalently on test data; (b) that training agents using larger amounts of training data can deteriorate performance compared to training with smaller amounts; and (c) that our proposed dialogue rewards can be predicted with strong correlation (between true and predicted rewards) by using neural-based regressors with lengthy dialogue histories of ≥ 10 sentences (25 sentences was the best in our experiments).

Future work can explore the following avenues. First, confirm these findings with other datasets and settings in order to draw even stronger conclusions. Second, investigate further the proposed approach for improved generalisation in test data. For example, other methods of feature extraction, clustering algorithms, distance metrics, policy learning algorithms, architectures, and a comparison of reward functions can be explored. Last but not least, combine the proposed learning approach with more knowledge-intensive resources [34], [35] such as semantic parsers and coreference resolution, among others.
REFERENCES

[1] Barbara J. Grosz and Candace L. Sidner, "Attention, intentions, and the structure of discourse," Computational Linguistics, vol. 12, no. 3, pp. 175–204, 1986.
[2] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning. MIT Press, 2nd edition, 2018.
[3] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, Springer Series in Statistics. Springer, 2009.
[4] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[5] Iñigo Casanueva, Pawel Budzianowski, Pei-Hao Su, Stefan Ultes, Lina Maria Rojas-Barahona, Bo-Hsiang Tseng, and Milica Gasic, "Feudal reinforcement learning for dialogue management in large domains," in NAACL-HLT, 2018.
[6] Heriberto Cuayáhuitl, "SimpleDS: A simple deep reinforcement learning dialogue system," CoRR, vol. abs/1601.04574, 2016.
[7] Heriberto Cuayáhuitl, Seunghak Yu, Ashley Williamson, and Jacob Carse, "Scaling up deep reinforcement learning for multi-domain dialogue systems," in IJCNN, 2017.
[8] Heriberto Cuayáhuitl and Seunghak Yu, "Deep reinforcement learning of dialogue policies with less weight updates," in INTERSPEECH, 2017.
[9] Jason D. Williams, Kavosh Asadi, and Geoffrey Zweig, "Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning," in ACL, 2017.
[10] Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Çelikyilmaz, Sungjin Lee, and Kam-Fai Wong, "Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning," in EMNLP, 2017.
[11] Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao, "Deep reinforcement learning for dialogue generation," in EMNLP, 2016.
[12] Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky, "Adversarial learning for neural dialogue generation," in EMNLP, 2017.
[13] Iulian Vlad Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, Sai Rajeswar, Alexandre de Brébisson, Jose M. R. Sotelo, Dendi Suhubdy, Vincent Michalski, Alexandre Nguyen, Joelle Pineau, and Yoshua Bengio, "A deep reinforcement learning chatbot (short version)," CoRR, vol. abs/1801.06700, 2018.
[14] Chinnadhurai Sankar and Sujith Ravi, "Modeling non-goal oriented dialog with discrete attributes," in NeurIPS Workshop on Conversational AI: "Today's Practice and Tomorrow's Potential", 2018.
[15] Chih-Wei Lee, Yau-Shian Wang, Tsung-Yuan Hsu, Kuan-Yu Chen, Hung-yi Lee, and Lin-Shan Lee, "Scalable sentiment for sequence-to-sequence chatbot response with performance analysis," CoRR, vol. abs/1804.02504, 2018.
[16] Oriol Vinyals and Quoc V. Le, "A neural conversational model," CoRR, vol. abs/1506.05869, 2015.
[17] Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan, "A neural network approach to context-sensitive generation of conversational responses," in HLT-NAACL, 2015.
[18] Iulian Vlad Serban, Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua Bengio, and Aaron C. Courville, "Multiresolution recurrent neural networks: An application to dialogue response generation," in AAAI, 2017.
[19] Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, and William B. Dolan, "A persona-based neural conversation model," in ACL, 2016.
[20] Wenjie Wang, Minlie Huang, Xin-Shun Xu, Fumin Shen, and Liqiang Nie, "Chat more: Deepening and widening the chatting topic via a deep model," in SIGIR. 2018, ACM.
[21] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston, "Personalizing dialogue agents: I have a dog, do you have pets too?," CoRR, vol. abs/1801.07243, 2018.
[22] Rui Yan, "'Chitty-Chitty-Chat Bot': Deep learning for conversational AI," in IJCAI, 2018.
[23] Chia-Wei Liu, Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau, "How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation," in EMNLP, 2016.
[24] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean, "Distributed representations of words and phrases and their compositionality," in NIPS, 2013.
[25] Jeffrey Pennington, Richard Socher, and Christopher D. Manning, "GloVe: Global vectors for word representation," in EMNLP, 2014.
[26] Mohit Iyyer, Varun Manjunatha, Jordan L. Boyd-Graber, and Hal Daumé III, "Deep unordered composition rivals syntactic methods for text classification," in ACL (1), 2015.
[27] David Arthur and Sergei Vassilvitskii, "K-means++: The advantages of careful seeding," in SODA. 2007, SIAM.
[28] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, 2015.
[29] Alexander H. Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston, "ParlAI: A dialog research software platform," in EMNLP (System Demonstrations), 2017.
[30] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in EMNLP. 2014, Association for Computational Linguistics.
[31] M. E. Tipping and Christopher Bishop, "Probabilistic principal component analysis," Journal of the Royal Statistical Society, Series B, vol. 61, no. 3, pp. 611–622, 1999.
[32] Heriberto Cuayáhuitl, Seonghan Ryu, Donghyeon Lee, and Jihie Kim, "A study on dialogue reward prediction for open-ended conversational agents," in NeurIPS Workshop on Conversational AI: "Today's Practice and Tomorrow's Potential", 2018.
[33] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning (ICML), 2015.
[34] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky, "The Stanford CoreNLP natural language processing toolkit," in Association for Computational Linguistics (ACL) System Demonstrations, 2014.
[35] Nina Dethlefs, "Domain transfer for deep natural language generation from abstract meaning representations," IEEE Comp. Int. Mag., vol. 12, no. 3, pp. 18–28, 2017.
