Deep Reinforcement Learning For Chatbots Using Clustered Actions and Human-Likeness Rewards
I. INTRODUCTION
What happens in the minds of humans during chatty interactions containing sentences that are not only coherent but also engaging? While not all chatty human dialogues are engaging, they are arguably coherent [1]. They also exhibit large vocabularies—according to the language in focus—because conversations can address any topic that comes to the minds of the partner conversants. In addition, each contribution by a partner conversant may exhibit multiple sentences instead of one, such as greeting+question or acknowledgement+statement+question. Furthermore, the topics raised in the conversation may go back and forth without losing coherence. This is a big challenge for data-driven chatbots.

We present a novel approach based on the reinforcement learning [2], unsupervised learning [3] and deep learning [4] paradigms. Our learning scenario is as follows: given a data set of human-human dialogues in raw text (without any manually provided labels), a Deep Reinforcement Learning (DRL) agent takes the role of one of the two partner conversants in order to learn to select human-like sentences when exposed to both human-like and non-human-like sentences. In our learning scenario the agent-environment interactions consist of agent-data interactions – there is no user simulator as in task-oriented dialogue systems. During each verbal contribution, the DRL agent (1) observes the state of the world via a deep neural network, which models a representation of all sentences raised in the conversation together with a set of candidate responses or agent actions (referred to as clustered actions in our approach); (2) it then selects an action so that its word-based representation is sent to the environment; and (3) it receives an updated dialogue history and a numerical reward for having chosen each action, until a termination condition is met. This process—illustrated in Figure 1—is carried out iteratively until the end of a dialogue for as many dialogues as necessary, i.e. until there is no further improvement in the agent's performance.

Fig. 1. High-level architecture of the proposed deep reinforcement learning approach for chatbots—see text for details.

(Work carried out while the first author was visiting Samsung Research.)
The contributions of this paper are as follows.
• We propose to train chatbots using value-based deep reinforcement learning with action spaces derived from unsupervised clustering, where each action cluster is a representation of a type of meaning (greeting, question around a topic, statements around a topic, etc.).
• We propose a simple though promising reward function. It is based on human-human dialogues and noisy dialogues for learning to rate good vs. bad dialogues. According to an analysis of dialogue reward prediction, dialogues with lengthy dialogue histories (of at least 10 sentences) report strong correlations between true and predicted rewards on test data.
• Our experiments comparing different sentence embedding sizes (100 vs. 300) did not report statistical differences on test data. This means that similar results can be obtained more efficiently with the smaller embedding than the larger one due to fewer features. In other words, sentence embeddings of 100 dimensions are as good as 300 dimensions but less computationally demanding.
• Last but not least, we found that training chatbots on multiple data splits is crucial for improved performance over training chatbots using the entire training set.
The remainder of the paper describes our proposed approach in more detail and evaluates it using a publicly available dataset of chitchat conversations. Although our learning agents indeed improve their performance over time with dialogues that they get familiarised with, their performance drops with dialogues that the agents are not familiar with. The former is promising and in favour of our proposed approach, and the latter is not, but it is a general problem faced by data-driven chatbots and an interesting avenue for future research.
II. RELATED WORK

Reinforcement Learning (RL) methods are typically based on value functions or policy search [2], which also applies to deep RL methods. While value functions have been particularly applied to task-oriented dialogue systems [5]–[10], policy search has been particularly applied to open-ended dialogue systems such as (chitchat) chatbots [11]–[15]. This is not surprising given the fact that task-oriented dialogue systems use finite action sets, while chatbot systems use infinite action sets. So far there is a preference for policy search methods for chatbots, but it is not clear whether they should be preferred, because they face problems such as local optima rather than global optima, inefficiency and high variance. This paper thus explores the feasibility of value function-based methods for chatbots, which has not been explored before—at least not from the perspective of deriving the action sets automatically as attempted in this paper.

Other closely related methods to deep RL include seq2seq models for dialogue generation [16]–[21]. These methods tend to be data-hungry because they are typically trained with millions of sentences, which implies high computational demands. While they can be used to address the same problem, in this paper we focus our attention on deep RL-based chatbots and leave their comparison or combination as future work. Nonetheless, these related works agree with the fact that evaluation is a difficult part and that there is a need for better evaluation metrics [22]. This is further supported by [23], where they found that metrics such as Bleu and Meteor amongst others do not correlate with human judgments.

With regard to performance metrics, the reward functions used by deep RL dialogue agents are either specified manually depending on the application, or learnt from dialogue data. For example, [11] conceives a reward function that rewards positively sentences that are easy to respond to and coherent, while penalising repetitiveness. [12] uses an adversarial approach, where the discriminator is trained to score human vs. non-human sentences so that the generator can use such scores during training. [13] trains a reward function from human ratings. All these related works are neural-based, and there is no clear best reward function to use in future (chitchat) chatbots. This motivated us to propose a new metric that is easy to implement, practical due to requiring only data in raw text, and potentially promising as described below.
III. PROPOSED APPROACH

To explain the proposed learning approach we first describe how to conceive a finite set of dialogue actions from raw text, then we describe how to assign rewards, and finally we describe how to bring everything together during policy learning.

A. Clustered Actions

Actions in reinforcement learning chatbots correspond to sentences, and their number is infinite assuming all possible combinations of word sequences in a given language. This is especially true in the case of open-ended conversations that make use of large vocabularies, as opposed to task-oriented conversations that make use of smaller (restricted) vocabularies. A clustered action is a group of sentences sharing a similar or related meaning via sentence vectors derived from word embeddings [24], [25]. While there are multiple ways of selecting features for clustering and also multiple clustering algorithms, the following requirements arise for chatbots: (1) unlabelled data, because human-human dialogues are in raw text (this makes it difficult to evaluate the goodness of clustering features and algorithms), and (2) scalability to clustering a large set of data points (sentences in our case, which are mostly unique).

Given a set of data points {x_1, ..., x_n}, ∀x_i ∈ R^m, and a similarity metric d(x_i, x_{i'}), the task is to find a set of k clusters with a clustering algorithm. Since in our case each data point x corresponds to a sentence within a dialogue, we represent sentences via their mean word vectors—similarly as in Deep Averaging Networks [26]—denoted as

x_i = \frac{1}{N_i} \sum_{j=1}^{N_i} c_j,

where c_j is the vector of coefficients of word j and N_i is the number of words in sentence i. For scalability purposes, we use the K-Means++ algorithm [27] with the Euclidean distance

d(x_i, x_{i'}) = \sqrt{\sum_{j=1}^{m} \left(x^j_i - x^j_{i'}\right)^2}

with m dimensions, and assume that k is provided rather than automatically induced – though other algorithms can be used with our approach. In this way, a trained clustering model assigns a cluster ID a ∈ A to features x_i, where the number of actions is equivalent to the number of clusters, i.e. |A| = k.
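To make the clustering step concrete, the following minimal sketch maps sentences to mean word vectors and groups them with K-Means++. It is an illustration under stated assumptions (a local GloVe file, whitespace tokenisation, the variable names shown), not the authors' released implementation.

```python
# Minimal sketch of clustered actions: mean word vectors + K-Means++.
# Assumptions: a local GloVe text file and whitespace tokenisation.
import numpy as np
from sklearn.cluster import KMeans

def load_glove(path, dim=100):
    """Load GloVe vectors into a dict mapping word -> vector of size dim."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def mean_word_vector(sentence, vectors, dim=100):
    """x_i = (1/N_i) * sum_j c_j over the words with known embeddings."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim, np.float32)

glove = load_glove("glove.6B.100d.txt", dim=100)      # assumed local file
sentences = ["hello what are you doing today ?",
             "i just got done watching a horror movie",
             "what is your favourite color ?",
             "hi , how are you doing ?"]               # toy corpus
X = np.stack([mean_word_vector(s, glove) for s in sentences])
k = min(100, len(sentences))                           # k=100 in the paper
clusterer = KMeans(n_clusters=k, init="k-means++").fit(X)
action_ids = clusterer.predict(X)                      # cluster ID a in A per sentence
```

In practice the clusterer would be fit on all training sentences with k=100; the cap above only keeps the toy example runnable.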
B. Human-Likeness Rewards

Specifying reward functions for reinforcement learning dialogue agents is often a difficult aspect. We propose to derive the rewards from human-human dialogues by assigning positive values to contextualised responses seen in the data, and negative values to randomly generated responses due to their lack of coherence (also referred to as 'non-human-like responses') – see the example in Table I. Thus, an episode or dialogue reward can be computed as

R^i = \sum_{j=1}^{N} r^i_j(a),

where i is the dialogue in focus, j the dialogue turn in focus, and r^i_j(a) is given according to

r^i_j(a) = \begin{cases} +1, & \text{if } a \text{ is a human response in dialogue-turn } i,j, \\ -1, & \text{if } a \text{ is human but randomly chosen (incoherent).} \end{cases}
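A minimal sketch of this dialogue reward is shown below. The turn labels and function names are illustrative assumptions; the paper only specifies the +1/-1 scheme itself.

```python
# Sketch of the human-likeness dialogue reward R^i = sum_j r_j^i(a):
# +1 if the chosen action is the true human response for this turn,
# -1 if it is a (human but randomly sampled, hence incoherent) distractor.

def turn_reward(is_true_human_response: bool) -> int:
    return +1 if is_true_human_response else -1

def dialogue_reward(turn_labels) -> int:
    """turn_labels: iterable of booleans, one per agent turn in dialogue i."""
    return sum(turn_reward(label) for label in turn_labels)

# Example: 4 human-like choices and 1 distractor -> 4*(+1) + 1*(-1) = 3.
print(dialogue_reward([True, True, False, True, True]))  # -> 3
```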
C. Policy Learning

Our Deep Reinforcement Learning (DRL) agents aim to maximise their cumulative reward over time according to

Q^*(s, a; \theta) = \max_{\pi_\theta} \mathbb{E}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s, a, \pi_\theta\right],

where r is the numerical reward given at time step t for choosing action a in state s, γ is a discounting factor, and Q^*(s, a; θ) is the optimal action-value function using weights θ in a neural network. During training, a DRL agent will choose actions in a probabilistic manner in order to explore new (s, a) pairs for discovering better rewards or to exploit already learnt values—with a reduced level of exploration over time and an increased level of exploitation over time. During testing, a DRL agent will choose the best actions a^* according to

\pi^*_\theta(s) = \arg\max_{a \in A} Q^*(s, a; \theta).

Our DRL agents implement the procedure above using a generalisation of the DQN method [28]—see Algorithm 1. After initialising the replay memory D, the dialogue history H, the action-value function Q and the target action-value function Q̂, we sample a training dialogue from our data of human-human conversations (lines 1-4). A human starts the conversation, which is mapped to its corresponding sentence embedding representation (lines 5-6). Then a set of candidate responses is generated, including (1) the true human response and (2) randomly chosen responses (distractors). The candidate responses are clustered as described in Section III-A and the resulting actions are taken into account by the agent for action selection (lines 8-10). Once an action is chosen, it is conveyed to the environment, a reward is observed as described in Section III-B, and the agent's partner response is observed as well in order to update the dialogue history H (lines 11-14). With such an update, the new sentence embedding representation is generated from H in order to update the replay memory D with the learning experience (s, a, r, s') (lines 15-16). Then a minibatch of experiences MB = (s_j, a_j, r_j, s'_j) is sampled from D in order to update the weights θ according to the error derived from the difference between the target value y_j and the predicted value Q(s, a; θ) (see lines 18 and 20), which is based on the following loss function:

L(\theta_j) = \mathbb{E}_{MB}\left[\left(r + \gamma \max_{a'} \hat{Q}(s', a'; \hat{\theta}_j) - Q(s, a; \theta_j)\right)^2\right].

The target action-value function Q̂ and the state s are updated accordingly (lines 21-22), and this iterative procedure continues until convergence.
IV. EXPERIMENTS AND RESULTS

A. Data

We used data from the Persona-Chat data set¹, which includes 17,877 dialogues for training (131,431 turns) and 999 dialogues for testing (7,793 turns). They represent averages of 7.35 and 7.8 dialogue turns for training and testing, respectively—see the example dialogue in Table I. The vocabulary of the entire data set contains 19,667 unique words.

¹ Data set downloaded from http://parl.ai/ on 18 May 2018 [29].

B. Experimental Setting

To analyse the performance of our ChatDQN agents we use subsets of the training data vs. the entire training data set. The former are automatically generated by using sentence vectors to represent the features of each dialogue—as described in Section III-A. Similarly, the agents' states are modelled using sentence vectors of the dialogue history with the pretrained coefficients of the Glove model [25]. In all our experiments we use the following neural network architecture²:
• mean word vectors, one per sentence, in the input layer (maximum number of vectors=50, with zero-padding) – each word vector of 100 or 300 embedding size,
• two Gated Recurrent Unit (GRU) [30] layers with latent dimensionality of 256, and
• a fully connected layer with number of nodes=the number of clusters, i.e. each cluster corresponding to one action.

² Other hyperparameters include embedding batch size=128, dropout=0.2, latent dimensionality=256, discount factor=0.99, size of candidate responses=3, max. number of sentence vectors in H=50, burning steps=3K, memory size=10K, target model update (C)=10K, learning steps=50K, test steps=100K. The number of parameters in our neural nets with 100 and 300 sentence vector dimensions corresponds to 4.4 and 12.1 million, respectively.
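As an illustration of the architecture listed above, the following Keras sketch builds a GRU-based Q-network over padded sequences of sentence vectors. Layer sizes follow the text and footnote 2 where given; everything else (variable names, optimiser, dropout placement) is an assumption rather than the authors' released model.

```python
# Sketch of the ChatDQN Q-network: up to 50 mean word vectors (zero-padded)
# -> two GRU layers of 256 units -> dense layer with one Q-value per
# clustered action. Assumes Keras with a TensorFlow backend.
from tensorflow.keras import layers, models, optimizers

MAX_SENTENCES = 50      # max. number of sentence vectors in H (zero-padded)
EMBEDDING_SIZE = 100    # or 300
NUM_CLUSTERS = 100      # |A| = k clustered actions

def build_q_network():
    model = models.Sequential([
        layers.GRU(256, return_sequences=True,
                   input_shape=(MAX_SENTENCES, EMBEDDING_SIZE)),
        layers.GRU(256),
        layers.Dropout(0.2),
        layers.Dense(NUM_CLUSTERS, activation="linear"),  # Q(s, a) per action
    ])
    model.compile(optimizer=optimizers.Adam(), loss="mse")
    return model

q_network = build_q_network()
q_network.summary()
```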
TABLE I
MODIFIED DIALOGUE FROM THE PERSONA-CHAT DATASET [21] WITH OUR PROPOSED REWARDS: r=+1 MEANS A HUMAN-LIKE SENTENCE AND r=-1 MEANS NON-HUMAN-LIKE. THE LATTER SENTENCES (RIGHT COLUMN) ARE SAMPLED RANDOMLY FROM DIFFERENT DIALOGUES IN THE SAME DATASET.

Human Sentences | Distorted Human Sentences
hello what are doing today? | hello what are doing today?
i'm good, i just got off work and tired, i have two jobs. [r=+1] | do your cats like candy? [r=-1]
i just got done watching a horror movie | i just got done watching a horror movie
i rather read, i have read about 20 books this year. [r=+1] | do you have any hobbies? [r=-1]
wow! i do love a good horror movie. loving this cooler weather | wow! i do love a good horror movie. loving this cooler weather
but a good movie is always good. [r=+1] | good job! if you live to 100 like me, you will need all that learning. [r=-1]
yes! my son is in junior high and i just started letting him watch them | yes! my son is in junior high and i just started letting him watch them
i work in the movies as well. [r=+1] | what a nice gesture. i take my dog to compete in agility classes. [r=-1]
neat!! i used to work in the human services field | neat!! i used to work in the human services field
yes it is neat, i stunt double, it is so much fun and hard work. [r=+1] | you work very hard. i would like to do a handstand. can you teach it? [r=-1]
yes i bet you can get hurt. my wife works and i stay at home | yes i bet you can get hurt. my wife works and i stay at home
nice, i only have one parent so now i help out my mom. [r=+1] | yes i do, red is one of my favorite colors [r=-1]
i bet she appreciates that very much. | i bet she appreciates that very much.
she raised me right, i'm just like her. [r=+1] | haha, it is definitely attention grabbing! [r=-1]
my dad was always busy working at home depot | my dad was always busy working at home depot
now that i am older home depot is my toy r us. [r=+1] | i bet there will be time to figure it out. what are your interests? [r=-1]
Algorithm 1 ChatDQN Learning
1: Initialise Deep Q-Networks with replay memory D, dialogue history H, action-value function Q with random weights θ, and target action-value function Q̂ with θ̂ = θ
2: Initialise clustering model from training dialogue data
3: repeat
4:   Sample a training dialogue (human-human sentences)
5:   Append first sentence to dialogue history H
6:   s = sentence embedding representation of H
7:   repeat
8:     Generate noisy candidate response sentences
9:     A = cluster IDs of candidate response sentences
10:    a = rand_{a∈A} if random number ≤ ε, otherwise max_{a∈A} Q(s, a; θ)
11:    Execute chosen clustered action a
12:    Observe human-likeness dialogue reward r
13:    Observe environment response (agent's partner)
14:    s' = sentence embedding representation of H after appending agent and environment responses to H
15:    Append transition (s, a, r, s') to D
16:    Sample random minibatch (s_j, a_j, r_j, s'_j) from D
17:    y_j = r_j if final step of episode, otherwise r_j + γ max_{a'∈A} Q̂(s', a'; θ̂)
18:    Set err = (y_j − Q(s, a; θ))^2
19:    Gradient descent step on err with respect to θ
20:    Reset Q̂ = Q every C steps
21:    s ← s'
22:  until end of dialogue
23:  Reset dialogue history H
24: until convergence
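Lines 8-10 of Algorithm 1 reduce action selection to choosing among the cluster IDs of a small candidate set. A minimal sketch of that ε-greedy step is given below; the function names, interfaces and the fixed ε are assumptions for illustration only.

```python
# Sketch of the epsilon-greedy selection over the cluster IDs of the
# candidate responses (true response + randomly sampled distractors).
import random
import numpy as np

def select_action(state, candidate_sentences, cluster_model, q_network,
                  sentence_to_vector, epsilon=0.1):
    """Return (cluster_id, sentence) for the chosen candidate response."""
    vectors = np.stack([sentence_to_vector(s) for s in candidate_sentences])
    cluster_ids = cluster_model.predict(vectors)       # A = candidate actions
    if random.random() <= epsilon:                      # explore
        idx = random.randrange(len(candidate_sentences))
    else:                                               # exploit
        q_values = q_network.predict(state[None, ...], verbose=0)[0]
        idx = int(np.argmax(q_values[cluster_ids]))     # best candidate action
    return cluster_ids[idx], candidate_sentences[idx]
```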
While a small number of sentence clusters could result in sentences with different meanings being assigned to the same cluster, a larger number of sentence clusters would mitigate the problem; however, the larger the number of clusters, the larger the computational expense—i.e. more parameters in the neural network. Figure 2(a) shows an example of our sentence clustering using 100 clusters on our training data. A manual inspection showed that greeting sentences were mostly assigned to the same cluster, and questions expressing preferences (e.g. What is your favourite X?) were also assigned to the same cluster. In this work we thus use a sentence clustering model with k=100, derived from our training data and trained prior to reinforcement learning³. In addition, we trained a second clustering model to analyse our experiments using different data splits, where instead of clustering sentences we cluster dialogues. Given that we represent a sentence using a mean word vector, a dialogue can thus be represented by a group of sentence vectors. Figure 2(b) shows an example of our dialogue clustering using 20 clusters on our training data.
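A minimal sketch of this second clustering model, which groups whole dialogues to create the data splits, is shown below. Representing a dialogue by the mean of its sentence vectors is one simple reading of "a group of sentence vectors"; the exact aggregation and the variable names are assumptions, not the paper's specification.

```python
# Sketch of dialogue clustering for data splits: each dialogue is reduced to
# one feature vector built from its sentence vectors, then grouped into 20
# clusters. Aggregating by mean is an assumption for illustration.
import numpy as np
from sklearn.cluster import KMeans

def dialogue_vector(sentence_vectors):
    """sentence_vectors: array of shape (num_sentences, embedding_size)."""
    return np.mean(sentence_vectors, axis=0)

def split_dialogues(dialogues_as_sentence_vectors, num_splits=20):
    """Return a cluster/split ID for every dialogue."""
    features = np.stack([dialogue_vector(d) for d in dialogues_as_sentence_vectors])
    n_clusters = min(num_splits, len(features))   # 20 splits in the paper
    return KMeans(n_clusters=n_clusters, init="k-means++").fit_predict(features)
```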
Notice that while previous related works in task-oriented DRL-based agents typically use a user simulator, this paper does not use a simulator. Instead, we use the dataset of human-human dialogues directly and substitute one partner conversant in the dialogues by a DRL agent. The goal of the agent is to choose the human-generated sentences (actions) out of a set of candidate responses.

³ Each experiment in this paper was run on a GPU Tesla K80 using the following libraries: Keras (https://github.com/keras-team/keras), OpenAI (https://github.com/openai) and Keras-RL (https://github.com/keras-rl/keras-rl).
Fig. 2. Example clusters of our training data using Principal Component Analysis [31] for visualisations in 2D – black dots represent sentences or dialogues. (a) 100 clusters of training sentences.
C. Experimental Results

The plots in Figure 3 show the training performance of our ChatDQN agents—all using 100 clustered actions. Each plot contains two learning curves, one per agent, where each agent uses a different sentence embedding size (100 or 300 dimensions). In addition, each plot uses an automatically generated data split according to our clustered dialogues. These plots show evidence that all agents indeed improve their behaviour over time even when they use only 100 actions. This can be observed from their average episode rewards (the higher the better) in all learning curves. From a visual inspection, we can observe that the agents using either embedding size (100 or 300) perform rather equivalently, but with a small trend for 300 dimensions to dominate its counterpart – more on this below.

The performance of our ChatDQN agents using all training dialogues is shown in Figure 4. It can be noted that, in contrast to the previous agents, whose improvement in average reward reached values of around 2, the performance of these agents was lower (with average episode reward < 0). We attribute this to the larger amount of variation exhibited when moving from about 1K dialogues to 17.8K dialogues.

We analysed the performance of our agents further by using a test set of totally unseen dialogues during training. Table II summarises our results, where we can note that the larger sentence embedding size (300) generally performed better. While a significant difference (according to a two-tailed Wilcoxon Signed Rank Test) at p = 0.05 was identified in testing on the training set, no significant difference was found in performance during testing on the test set. These results could be confirmed in other datasets and/or settings in future work. In addition, we can observe that the ChatDQN agents trained using all data (agents with id=20) were not able to achieve as good performance as those agents using smaller data splits. Our results thus reveal that training chatbots on some sort of domains (groups of dialogues automatically discovered in our case) is useful for improved performance.
Fig. 3. Training performance of ChatDQN agents using different data splits of dialogues—see text for details. (a) ChatDQN agents using data splits 0 to 4 (from left to right).
TABLE II

|Embedding|=100
Data Split (|dialogues|) | Training | Testing on the Training Set | Testing on the Test Set
0 (861) | 1.8778 | 3.7711 | -1.1708
1 (902) | 1.3751 | 3.1663 | -1.7006
2 (907) | 1.4194 | 3.1579 | -0.9723
3 (785) | 2.1532 | 4.2508 | -1.3444
4 (1046) | 1.2204 | 2.1581 | -1.5633
5 (767) | 1.9456 | 3.9017 | -1.2123
6 (1053) | 0.4621 | 0.1370 | -1.8443
7 (968) | 1.8090 | 3.8368 | -1.1137
8 (858) | 1.7608 | 3.5531 | -1.6678
9 (826) | 1.8431 | 3.6254 | -1.0919
10 (818) | 1.9188 | 3.8629 | -0.5394
11 (944) | 1.8212 | 3.5724 | -1.7020
12 (873) | 2.0195 | 4.1895 | -1.3456
13 (895) | 2.0515 | 4.1873 | -1.8034
14 (863) | 1.9722 | 4.1479 | -1.3244
15 (842) | 1.8214 | 3.8942 | -0.8921
16 (837) | 1.8162 | 3.8817 | -1.3784
17 (958) | 1.6373 | 3.3373 | -0.7726
18 (1012) | 1.7631 | 3.6279 | -1.2690
19 (862) | 2.0683 | 4.2026 | -1.5901
20 (17877) | -0.4138 | -1.2473 | -1.9684
Average_0-20 | 1.6353 | 3.2959† | -1.3461
Sum_0-20 | 34.3419 | 69.2146 | -28.2674
Upper Bound | 7.1810 | 7.1810 | 7.5942
Lower Bound | -7.2834 | -7.2834 | -7.7276
Random Sel. | -2.4139 | -2.4139 | -2.5526

|Embedding|=300
Data Split (|dialogues|) | Training | Testing on the Training Set | Testing on the Test Set
0 (1000) | 1.8168 | 3.6785 | -0.8618
1 (850) | 2.0622 | 4.4598 | -1.8688
2 (1010) | 1.6896 | 3.6724 | -1.4282
3 (1029) | 1.9845 | 4.0136 | -0.6109
4 (951) | 1.8255 | 4.0423 | -1.4448
5 (832) | 2.0860 | 4.2182 | -0.8277
6 (815) | 2.1735 | 4.2592 | -1.5193
7 (891) | 2.1921 | 4.5799 | -1.4233
8 (905) | 1.8835 | 3.8337 | -0.6628
9 (892) | 2.0521 | 4.1882 | -1.5267
10 (835) | 2.0709 | 4.2852 | -0.8831
11 (873) | 2.1902 | 4.4848 | -1.3329
12 (948) | 1.7761 | 3.7927 | -1.6167
13 (932) | 1.8563 | 3.6208 | -1.5149
14 (812) | 1.9486 | 4.0347 | -1.5866
15 (880) | 1.1338 | 2.4880 | -1.4084
16 (787) | 2.2628 | 4.5583 | -1.4290
17 (994) | 0.9038 | 1.5106 | -1.5925
18 (853) | 2.2405 | 4.4716 | -1.4231
19 (788) | 2.0686 | 4.2219 | -0.9594
20 (17877) | -0.3516 | -0.3490 | -2.0870
Average_0-20 | 1.8031 | 3.7174† | -1.3337
Sum_0-20 | 37.8656 | 78.0653 | -28.0079
Upper Bound | 7.1810 | 7.1810 | 7.5942
Lower Bound | -7.2834 | -7.2834 | -7.7276
Random Sel. | -2.4139 | -2.4139 | -2.5526

V. ANALYSIS OF HUMAN-LIKENESS REWARDS

We employ the algorithm of [32] for extending a dataset of human-human dialogues with distorted dialogues. The latter include varying amounts of distortions, i.e. different degrees of human-likeness. We use such data for training and testing reward prediction models in order to analyse the goodness of our proposed reward function. Given the extended dataset D̂ = {(d̂_1, y_1), ..., (d̂_N, y_N)} with (noisy) dialogue histories d̂_i, the goal is to predict dialogue scores y_i as accurately as possible. We represent a dialogue history via its sentence
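Since the section above frames reward prediction as scoring (noisy) dialogue histories, a minimal supervised-regression sketch is given below. The GRU regressor, feature shapes and training call are assumptions for illustration; the excerpt ends before the authors' exact predictor is specified.

```python
# Sketch of dialogue reward prediction: regress the dialogue score y_i from
# the (noisy) dialogue history d_i, here encoded as padded sentence vectors.
# The regressor architecture and hyperparameters are illustrative assumptions.
import numpy as np
from tensorflow.keras import layers, models

MAX_SENTENCES, EMBEDDING_SIZE = 50, 100

regressor = models.Sequential([
    layers.GRU(256, input_shape=(MAX_SENTENCES, EMBEDDING_SIZE)),
    layers.Dense(1, activation="linear"),   # predicted dialogue score y_i
])
regressor.compile(optimizer="adam", loss="mse")

# X: dialogue histories as sentence-vector sequences, y: true dialogue rewards.
X = np.random.randn(32, MAX_SENTENCES, EMBEDDING_SIZE).astype("float32")  # toy data
y = np.random.randn(32).astype("float32")
regressor.fit(X, y, epochs=1, batch_size=8, verbose=0)
```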