Deep Reinforcement Learning For Chatbots Using Clustered Actions and Human-Likeness Rewards
I. INTRODUCTION
What happens in the minds of humans during chatty interactions containing sentences that are not only coherent but also engaging? While not all chatty human dialogues are engaging, they are arguably coherent [1]. They also exhibit large vocabularies—according to the language in focus—because conversations can address any topic that comes to the minds of the partner conversants. In addition, each contribution by a partner conversant may exhibit multiple sentences instead of one, such as greeting+question or acknowledgement+statement+question. Furthermore, the topics raised in the conversation may go back and forth without losing coherence. This is a big challenge for data-driven chatbots.

We present a novel approach based on the reinforcement learning [2], unsupervised learning [3] and deep learning [4] paradigms. Our learning scenario is as follows: given a data set of human-human dialogues in raw text (without any manually provided labels), a Deep Reinforcement Learning (DRL) agent takes the role of one of the two partner conversants in order to learn to select human-like sentences when exposed to both human-like and non-human-like sentences. In our learning scenario the agent-environment interactions consist of agent-data interactions – there is no user simulator as in task-oriented dialogue systems. During each verbal contribution, the DRL agent (1) observes the state of the world via a deep neural network, which models a representation of all sentences raised in the conversation together with a set of candidate responses or agent actions (referred to as clustered actions in our approach); (2) it then selects an action so that its word-based representation is sent to the environment; and (3) it receives an updated dialogue history and a numerical reward for having chosen each action, until a termination condition is met. This process—illustrated in Figure 1—is carried out iteratively until the end of a dialogue for as many dialogues as necessary, i.e. until there is no further improvement in the agent's performance.

Fig. 1. High-level architecture of the proposed deep reinforcement learning approach for chatbots—see text for details.

(Work carried out while the first author was visiting Samsung Research.)
The contributions of this paper are as follows.
• We propose to train chatbots using value-based deep reinforcement learning with action spaces derived from unsupervised clustering, where each action cluster is a representation of a type of meaning (greeting, question around a topic, statements around a topic, etc.).
• We propose a simple though promising reward function. It is based on human-human dialogues and noisy dialogues for learning to rate good vs. bad dialogues. According to an analysis of dialogue reward prediction, dialogues with lengthy dialogue histories (of at least 10 sentences) report strong correlations between true and predicted rewards on test data.
• Our experiments comparing different sentence embedding sizes (100 vs. 300) did not report statistical differences on test data. This means that similar results can be obtained more efficiently with the smaller embedding than the larger one due to fewer features. In other words, sentence embeddings of 100 dimensions are as good as 300 dimensions but less computationally demanding.
• Last but not least, we found that training chatbots on multiple data splits is crucial for improved performance over training chatbots using the entire training set.
The remainder of the paper describes our proposed approach in more detail and evaluates it using a publicly available dataset of chitchat conversations. Although our learning agents indeed improve their performance over time with dialogues that they get familiarised with, their performance drops with dialogues that the agents are not familiar with. The former is promising and in favour of our proposed approach, and the latter is not, but it is a general problem faced by data-driven chatbots and an interesting avenue for future research.
II. RELATED WORK

Reinforcement Learning (RL) methods are typically based on value functions or policy search [2], which also applies to deep RL methods. While value functions have been particularly applied to task-oriented dialogue systems [5]–[10], policy search has been particularly applied to open-ended dialogue systems such as (chitchat) chatbots [11]–[15]. This is not surprising given the fact that task-oriented dialogue systems use finite action sets, while chatbot systems use infinite action sets. So far there is a preference for policy search methods for chatbots, but it is not clear whether they should be preferred, because they face problems such as local optima rather than global optima, inefficiency and high variance. This paper thus explores the feasibility of value function-based methods for chatbots, which has not been explored before—at least not from the perspective of deriving the action sets automatically as attempted in this paper.

Other closely related methods to deep RL include seq2seq models for dialogue generation [16]–[21]. These methods tend to be data-hungry because they are typically trained with millions of sentences, which implies high computational demands. While they can be used to address the same problem, in this paper we focus our attention on deep RL-based chatbots and leave their comparison or combination as future work. Nonetheless, these related works agree with the fact that evaluation is a difficult part and that there is a need for better evaluation metrics [22]. This is further supported by [23], where they found that metrics such as Bleu and Meteor amongst others do not correlate with human judgments.

With regard to performance metrics, the reward functions used by deep RL dialogue agents are either specified manually depending on the application, or learnt from dialogue data. For example, [11] conceives a reward function that rewards positively sentences that are easy to respond to and coherent, while penalising repetitiveness. [12] uses an adversarial approach, where the discriminator is trained to score human vs. non-human sentences so that the generator can use such scores during training. [13] trains a reward function from human ratings. All these related works are neural-based, and there is no clear best reward function to use in future (chitchat) chatbots. This motivated us to propose a new metric that is easy to implement, practical due to requiring only data in raw text, and potentially promising as described below.
III. PROPOSED APPROACH

To explain the proposed learning approach we first describe how to conceive a finite set of dialogue actions from raw text, then we describe how to assign rewards, and finally we describe how to bring everything together during policy learning.

A. Clustered Actions

Actions in reinforcement learning chatbots correspond to sentences, and their number is infinite assuming all possible combinations of word sequences in a given language. This is especially true in the case of open-ended conversations that make use of large vocabularies, as opposed to task-oriented conversations that make use of smaller (restricted) vocabularies. A clustered action is a group of sentences sharing a similar or related meaning via sentence vectors derived from word embeddings [24], [25]. While there are multiple ways of selecting features for clustering and also multiple clustering algorithms, the following requirements arise for chatbots: (1) unlabelled data, because human-human dialogues are in raw text (this makes it difficult to evaluate the goodness of clustering features and algorithms), and (2) scalability to clustering a large set of data points (sentences in our case, which are mostly unique).

Given a set of data points {x_1, ..., x_n}, ∀x_i ∈ R^m, and a similarity metric d(x_i, x_{i'}), the task is to find a set of k clusters with a clustering algorithm. Since in our case each data point x corresponds to a sentence within a dialogue, we represent sentences via their mean word vectors—similarly as in Deep Averaging Networks [26]—denoted as

x_i = \frac{1}{N_i} \sum_{j=1}^{N_i} c_j,

where c_j is the vector of coefficients of word j and N_i is the number of words in sentence i. For scalability purposes, we use the K-Means++ algorithm [27] with the Euclidean distance

d(x_i, x_{i'}) = \sqrt{\sum_{j=1}^{m} \left(x^j_i - x^j_{i'}\right)^2}

with m dimensions, and assume that k is provided rather than automatically induced – though other algorithms can be used with our approach. In this way, a trained clustering model assigns a cluster ID a ∈ A to features x_i, where the number of actions is equivalent to the number of clusters, i.e. |A| = k.
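To make the clustering step concrete, the following minimal sketch maps sentences to mean word vectors and groups them with K-Means++. It is an illustration under stated assumptions (a local GloVe file, whitespace tokenisation, the variable names shown), not the authors' released implementation.

```python
# Minimal sketch of clustered actions: mean word vectors + K-Means++.
# Assumptions: a local GloVe text file and whitespace tokenisation.
import numpy as np
from sklearn.cluster import KMeans

def load_glove(path, dim=100):
    """Load GloVe vectors into a dict mapping word -> vector of size dim."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def mean_word_vector(sentence, vectors, dim=100):
    """x_i = (1/N_i) * sum_j c_j over the words with known embeddings."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim, np.float32)

glove = load_glove("glove.6B.100d.txt", dim=100)      # assumed local file
sentences = ["hello what are you doing today ?",
             "i just got done watching a horror movie",
             "what is your favourite color ?",
             "hi , how are you doing ?"]               # toy corpus
X = np.stack([mean_word_vector(s, glove) for s in sentences])
k = min(100, len(sentences))                           # k=100 in the paper
clusterer = KMeans(n_clusters=k, init="k-means++").fit(X)
action_ids = clusterer.predict(X)                      # cluster ID a in A per sentence
```

In practice the clusterer would be fit on all training sentences with k=100; the cap above only keeps the toy example runnable.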
B. Human-Likeness Rewards

Specifying reward functions for reinforcement learning dialogue agents is often a difficult aspect. We propose to derive the rewards from human-human dialogues by assigning positive values to contextualised responses seen in the data, and negative values to randomly generated responses due to their lack of coherence (also referred to as 'non-human-like responses') – see the example in Table I. Thus, an episode or dialogue reward can be computed as

R^i = \sum_{j=1}^{N} r^i_j(a),

where i is the dialogue in focus, j the dialogue turn in focus, and r^i_j(a) is given according to

r^i_j(a) = \begin{cases} +1, & \text{if } a \text{ is a human response in dialogue-turn } i,j, \\ -1, & \text{if } a \text{ is human but randomly chosen (incoherent).} \end{cases}
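A minimal sketch of this dialogue reward is shown below. The turn labels and function names are illustrative assumptions; the paper only specifies the +1/-1 scheme itself.

```python
# Sketch of the human-likeness dialogue reward R^i = sum_j r_j^i(a):
# +1 if the chosen action is the true human response for this turn,
# -1 if it is a (human but randomly sampled, hence incoherent) distractor.

def turn_reward(is_true_human_response: bool) -> int:
    return +1 if is_true_human_response else -1

def dialogue_reward(turn_labels) -> int:
    """turn_labels: iterable of booleans, one per agent turn in dialogue i."""
    return sum(turn_reward(label) for label in turn_labels)

# Example: 4 human-like choices and 1 distractor -> 4*(+1) + 1*(-1) = 3.
print(dialogue_reward([True, True, False, True, True]))  # -> 3
```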
C. Policy Learning

Our Deep Reinforcement Learning (DRL) agents aim to maximise their cumulative reward over time according to

Q^*(s, a; \theta) = \max_{\pi_\theta} \mathbb{E}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s, a, \pi_\theta\right],

where r is the numerical reward given at time step t for choosing action a in state s, γ is a discounting factor, and Q^*(s, a; θ) is the optimal action-value function using weights θ in a neural network. During training, a DRL agent will choose actions in a probabilistic manner in order to explore new (s, a) pairs for discovering better rewards or to exploit already learnt values—with a reduced level of exploration over time and an increased level of exploitation over time. During testing, a DRL agent will choose the best actions a^* according to

\pi^*_\theta(s) = \arg\max_{a \in A} Q^*(s, a; \theta).

Our DRL agents implement the procedure above using a generalisation of the DQN method [28]—see Algorithm 1. After initialising the replay memory D, the dialogue history H, the action-value function Q and the target action-value function Q̂, we sample a training dialogue from our data of human-human conversations (lines 1-4). A human starts the conversation, which is mapped to its corresponding sentence embedding representation (lines 5-6). Then a set of candidate responses is generated, including (1) the true human response and (2) randomly chosen responses (distractors). The candidate responses are clustered as described in Section III-A and the resulting actions are taken into account by the agent for action selection (lines 8-10). Once an action is chosen, it is conveyed to the environment, a reward is observed as described in Section III-B, and the agent's partner response is observed as well in order to update the dialogue history H (lines 11-14). With such an update, the new sentence embedding representation is generated from H in order to update the replay memory D with the learning experience (s, a, r, s') (lines 15-16). Then a minibatch of experiences MB = (s_j, a_j, r_j, s'_j) is sampled from D in order to update the weights θ according to the error derived from the difference between the target value y_j and the predicted value Q(s, a; θ) (see lines 18 and 20), which is based on the following loss function:

L(\theta_j) = \mathbb{E}_{MB}\left[\left(r + \gamma \max_{a'} \hat{Q}(s', a'; \hat{\theta}_j) - Q(s, a; \theta_j)\right)^2\right].

The target action-value function Q̂ and the state s are updated accordingly (lines 21-22), and this iterative procedure continues until convergence.
IV. EXPERIMENTS AND RESULTS

A. Data

We used data from the Persona-Chat data set¹, which includes 17,877 dialogues for training (131,431 turns) and 999 dialogues for testing (7,793 turns). They represent averages of 7.35 and 7.8 dialogue turns for training and testing, respectively—see the example dialogue in Table I. The vocabulary of the entire data set contains 19,667 unique words.

¹ Data set downloaded from http://parl.ai/ on 18 May 2018 [29].

B. Experimental Setting

To analyse the performance of our ChatDQN agents we use subsets of the training data vs. the entire training data set. The former are automatically generated by using sentence vectors to represent the features of each dialogue—as described in Section III-A. Similarly, the agents' states are modelled using sentence vectors of the dialogue history with the pretrained coefficients of the Glove model [25]. In all our experiments we use the following neural network architecture²:
• mean word vectors, one per sentence, in the input layer (maximum number of vectors=50, with zero-padding) – each word vector of 100 or 300 embedding size,
• two Gated Recurrent Unit (GRU) [30] layers with latent dimensionality of 256, and
• a fully connected layer with number of nodes=the number of clusters, i.e. each cluster corresponding to one action.

² Other hyperparameters include embedding batch size=128, dropout=0.2, latent dimensionality=256, discount factor=0.99, size of candidate responses=3, max. number of sentence vectors in H=50, burning steps=3K, memory size=10K, target model update (C)=10K, learning steps=50K, test steps=100K. The number of parameters in our neural nets with 100 and 300 sentence vector dimensions corresponds to 4.4 and 12.1 million, respectively.
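As an illustration of the architecture listed above, the following Keras sketch builds a GRU-based Q-network over padded sequences of sentence vectors. Layer sizes follow the text and footnote 2 where given; everything else (variable names, optimiser, dropout placement) is an assumption rather than the authors' released model.

```python
# Sketch of the ChatDQN Q-network: up to 50 mean word vectors (zero-padded)
# -> two GRU layers of 256 units -> dense layer with one Q-value per
# clustered action. Assumes Keras with a TensorFlow backend.
from tensorflow.keras import layers, models, optimizers

MAX_SENTENCES = 50      # max. number of sentence vectors in H (zero-padded)
EMBEDDING_SIZE = 100    # or 300
NUM_CLUSTERS = 100      # |A| = k clustered actions

def build_q_network():
    model = models.Sequential([
        layers.GRU(256, return_sequences=True,
                   input_shape=(MAX_SENTENCES, EMBEDDING_SIZE)),
        layers.GRU(256),
        layers.Dropout(0.2),
        layers.Dense(NUM_CLUSTERS, activation="linear"),  # Q(s, a) per action
    ])
    model.compile(optimizer=optimizers.Adam(), loss="mse")
    return model

q_network = build_q_network()
q_network.summary()
```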
TABLE I
MODIFIED DIALOGUE FROM THE PERSONA-CHAT DATASET [21] WITH OUR PROPOSED REWARDS: r=+1 MEANS A HUMAN-LIKE SENTENCE AND r=-1 MEANS NON-HUMAN-LIKE. THE LATTER SENTENCES (RIGHT COLUMN) ARE SAMPLED RANDOMLY FROM DIFFERENT DIALOGUES IN THE SAME DATASET.

Human Sentences | Distorted Human Sentences
hello what are doing today? | hello what are doing today?
i'm good, i just got off work and tired, i have two jobs. [r=+1] | do your cats like candy? [r=-1]
i just got done watching a horror movie | i just got done watching a horror movie
i rather read, i have read about 20 books this year. [r=+1] | do you have any hobbies? [r=-1]
wow! i do love a good horror movie. loving this cooler weather | wow! i do love a good horror movie. loving this cooler weather
but a good movie is always good. [r=+1] | good job! if you live to 100 like me, you will need all that learning. [r=-1]
yes! my son is in junior high and i just started letting him watch them | yes! my son is in junior high and i just started letting him watch them
i work in the movies as well. [r=+1] | what a nice gesture. i take my dog to compete in agility classes. [r=-1]
neat!! i used to work in the human services field | neat!! i used to work in the human services field
yes it is neat, i stunt double, it is so much fun and hard work. [r=+1] | you work very hard. i would like to do a handstand. can you teach it? [r=-1]
yes i bet you can get hurt. my wife works and i stay at home | yes i bet you can get hurt. my wife works and i stay at home
nice, i only have one parent so now i help out my mom. [r=+1] | yes i do, red is one of my favorite colors [r=-1]
i bet she appreciates that very much. | i bet she appreciates that very much.
she raised me right, i'm just like her. [r=+1] | haha, it is definitely attention grabbing! [r=-1]
my dad was always busy working at home depot | my dad was always busy working at home depot
now that i am older home depot is my toy r us. [r=+1] | i bet there will be time to figure it out. what are your interests? [r=-1]
Algorithm 1 ChatDQN Learning
1: Initialise Deep Q-Networks with replay memory D, dialogue history H, action-value function Q with random weights θ, and target action-value function Q̂ with θ̂ = θ
2: Initialise clustering model from training dialogue data
3: repeat
4:   Sample a training dialogue (human-human sentences)
5:   Append first sentence to dialogue history H
6:   s = sentence embedding representation of H
7:   repeat
8:     Generate noisy candidate response sentences
9:     A = cluster IDs of candidate response sentences
10:    a = rand_{a∈A} if random number ≤ ε, otherwise max_{a∈A} Q(s, a; θ)
11:    Execute chosen clustered action a
12:    Observe human-likeness dialogue reward r
13:    Observe environment response (agent's partner)
14:    s' = sentence embedding representation of H after appending agent and environment responses to H
15:    Append transition (s, a, r, s') to D
16:    Sample random minibatch (s_j, a_j, r_j, s'_j) from D
17:    y_j = r_j if final step of episode, otherwise r_j + γ max_{a'∈A} Q̂(s', a'; θ̂)
18:    Set err = (y_j − Q(s, a; θ))^2
19:    Gradient descent step on err with respect to θ
20:    Reset Q̂ = Q every C steps
21:    s ← s'
22:  until end of dialogue
23:  Reset dialogue history H
24: until convergence
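Lines 8-10 of Algorithm 1 reduce action selection to choosing among the cluster IDs of a small candidate set. A minimal sketch of that ε-greedy step is given below; the function names, interfaces and the fixed ε are assumptions for illustration only.

```python
# Sketch of the epsilon-greedy selection over the cluster IDs of the
# candidate responses (true response + randomly sampled distractors).
import random
import numpy as np

def select_action(state, candidate_sentences, cluster_model, q_network,
                  sentence_to_vector, epsilon=0.1):
    """Return (cluster_id, sentence) for the chosen candidate response."""
    vectors = np.stack([sentence_to_vector(s) for s in candidate_sentences])
    cluster_ids = cluster_model.predict(vectors)       # A = candidate actions
    if random.random() <= epsilon:                      # explore
        idx = random.randrange(len(candidate_sentences))
    else:                                               # exploit
        q_values = q_network.predict(state[None, ...], verbose=0)[0]
        idx = int(np.argmax(q_values[cluster_ids]))     # best candidate action
    return cluster_ids[idx], candidate_sentences[idx]
```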
While a small number of sentence clusters could result in sentences with different meanings being assigned to the same cluster, a larger number of sentence clusters would mitigate the problem; however, the larger the number of clusters, the larger the computational expense—i.e. more parameters in the neural network. Figure 2(a) shows an example of our sentence clustering using 100 clusters on our training data. A manual inspection showed that greeting sentences were mostly assigned to the same cluster, and questions expressing preferences (e.g. What is your favourite X?) were also assigned to the same cluster. In this work we thus use a sentence clustering model with k=100, derived from our training data and trained prior to reinforcement learning³. In addition, we trained a second clustering model to analyse our experiments using different data splits, where instead of clustering sentences we cluster dialogues. Given that we represent a sentence using a mean word vector, a dialogue can thus be represented by a group of sentence vectors. Figure 2(b) shows an example of our dialogue clustering using 20 clusters on our training data.
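A minimal sketch of this second clustering model, which groups whole dialogues to create the data splits, is shown below. Representing a dialogue by the mean of its sentence vectors is one simple reading of "a group of sentence vectors"; the exact aggregation and the variable names are assumptions, not the paper's specification.

```python
# Sketch of dialogue clustering for data splits: each dialogue is reduced to
# one feature vector built from its sentence vectors, then grouped into 20
# clusters. Aggregating by mean is an assumption for illustration.
import numpy as np
from sklearn.cluster import KMeans

def dialogue_vector(sentence_vectors):
    """sentence_vectors: array of shape (num_sentences, embedding_size)."""
    return np.mean(sentence_vectors, axis=0)

def split_dialogues(dialogues_as_sentence_vectors, num_splits=20):
    """Return a cluster/split ID for every dialogue."""
    features = np.stack([dialogue_vector(d) for d in dialogues_as_sentence_vectors])
    n_clusters = min(num_splits, len(features))   # 20 splits in the paper
    return KMeans(n_clusters=n_clusters, init="k-means++").fit_predict(features)
```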
Notice that while previous related works in task-oriented DRL-based agents typically use a user simulator, this paper does not use a simulator. Instead, we use the dataset of human-human dialogues directly and substitute one partner conversant in the dialogues by a DRL agent. The goal of the agent is to choose the human-generated sentences (actions) out of a set of candidate responses.

³ Each experiment in this paper was run on a GPU Tesla K80 using the following libraries: Keras (https://github.com/keras-team/keras), OpenAI (https://github.com/openai) and Keras-RL (https://github.com/keras-rl/keras-rl).
Fig. 2. Example clusters of our training data using Principal Component Analysis [31] for visualisations in 2D – black dots represent sentences or dialogues. (a) 100 clusters of training sentences.
C. Experimental Results

The plots in Figure 3 show the training performance of our ChatDQN agents—all using 100 clustered actions. Each plot contains two learning curves, one per agent, where each agent uses a different sentence embedding size (100 or 300 dimensions). In addition, each plot uses an automatically generated data split according to our clustered dialogues. These plots show evidence that all agents indeed improve their behaviour over time even when they use only 100 actions. This can be observed from their average episode rewards (the higher the better) in all learning curves. From a visual inspection, we can observe that the agents using either embedding size (100 or 300) perform rather equivalently, but with a small trend for 300 dimensions to dominate its counterpart – more on this below.

The performance of our ChatDQN agents using all training dialogues is shown in Figure 4. It can be noted that, in contrast to the previous agents, whose improvement in average reward reached values of around 2, the performance of these agents was lower (with average episode reward < 0). We attribute this to the larger amount of variation exhibited when moving from about 1K dialogues to 17.8K dialogues.

We analysed the performance of our agents further by using a test set of totally unseen dialogues during training. Table II summarises our results, where we can note that the larger sentence embedding size (300) generally performed better. While a significant difference (according to a two-tailed Wilcoxon Signed Rank Test) at p = 0.05 was identified in testing on the training set, no significant difference was found in performance during testing on the test set. These results could be confirmed in other datasets and/or settings in future work. In addition, we can observe that the ChatDQN agents trained using all data (agents with id=20) were not able to achieve as good performance as those agents using smaller data splits. Our results thus reveal that training chatbots on some sort of domains (groups of dialogues automatically discovered in our case) is useful for improved performance.
Fig. 3. Training performance of ChatDQN agents using different data splits of dialogues—see text for details. (a) ChatDQN agents using data splits 0 to 4 (from left to right).
TABLE II

|Embedding|=100
Data Split (|dialogues|) | Training | Testing on the Training Set | Testing on the Test Set
0 (861) | 1.8778 | 3.7711 | -1.1708
1 (902) | 1.3751 | 3.1663 | -1.7006
2 (907) | 1.4194 | 3.1579 | -0.9723
3 (785) | 2.1532 | 4.2508 | -1.3444
4 (1046) | 1.2204 | 2.1581 | -1.5633
5 (767) | 1.9456 | 3.9017 | -1.2123
6 (1053) | 0.4621 | 0.1370 | -1.8443
7 (968) | 1.8090 | 3.8368 | -1.1137
8 (858) | 1.7608 | 3.5531 | -1.6678
9 (826) | 1.8431 | 3.6254 | -1.0919
10 (818) | 1.9188 | 3.8629 | -0.5394
11 (944) | 1.8212 | 3.5724 | -1.7020
12 (873) | 2.0195 | 4.1895 | -1.3456
13 (895) | 2.0515 | 4.1873 | -1.8034
14 (863) | 1.9722 | 4.1479 | -1.3244
15 (842) | 1.8214 | 3.8942 | -0.8921
16 (837) | 1.8162 | 3.8817 | -1.3784
17 (958) | 1.6373 | 3.3373 | -0.7726
18 (1012) | 1.7631 | 3.6279 | -1.2690
19 (862) | 2.0683 | 4.2026 | -1.5901
20 (17877) | -0.4138 | -1.2473 | -1.9684
Average_0-20 | 1.6353 | 3.2959† | -1.3461
Sum_0-20 | 34.3419 | 69.2146 | -28.2674
Upper Bound | 7.1810 | 7.1810 | 7.5942
Lower Bound | -7.2834 | -7.2834 | -7.7276
Random Sel. | -2.4139 | -2.4139 | -2.5526

|Embedding|=300
Data Split (|dialogues|) | Training | Testing on the Training Set | Testing on the Test Set
0 (1000) | 1.8168 | 3.6785 | -0.8618
1 (850) | 2.0622 | 4.4598 | -1.8688
2 (1010) | 1.6896 | 3.6724 | -1.4282
3 (1029) | 1.9845 | 4.0136 | -0.6109
4 (951) | 1.8255 | 4.0423 | -1.4448
5 (832) | 2.0860 | 4.2182 | -0.8277
6 (815) | 2.1735 | 4.2592 | -1.5193
7 (891) | 2.1921 | 4.5799 | -1.4233
8 (905) | 1.8835 | 3.8337 | -0.6628
9 (892) | 2.0521 | 4.1882 | -1.5267
10 (835) | 2.0709 | 4.2852 | -0.8831
11 (873) | 2.1902 | 4.4848 | -1.3329
12 (948) | 1.7761 | 3.7927 | -1.6167
13 (932) | 1.8563 | 3.6208 | -1.5149
14 (812) | 1.9486 | 4.0347 | -1.5866
15 (880) | 1.1338 | 2.4880 | -1.4084
16 (787) | 2.2628 | 4.5583 | -1.4290
17 (994) | 0.9038 | 1.5106 | -1.5925
18 (853) | 2.2405 | 4.4716 | -1.4231
19 (788) | 2.0686 | 4.2219 | -0.9594
20 (17877) | -0.3516 | -0.3490 | -2.0870
Average_0-20 | 1.8031 | 3.7174† | -1.3337
Sum_0-20 | 37.8656 | 78.0653 | -28.0079
Upper Bound | 7.1810 | 7.1810 | 7.5942
Lower Bound | -7.2834 | -7.2834 | -7.7276
Random Sel. | -2.4139 | -2.4139 | -2.5526

V. ANALYSIS OF HUMAN-LIKENESS REWARDS

We employ the algorithm of [32] for extending a dataset of human-human dialogues with distorted dialogues. The latter include varying amounts of distortions, i.e. different degrees of human-likeness. We use such data for training and testing reward prediction models in order to analyse the goodness of our proposed reward function. Given the extended dataset D̂ = {(d̂_1, y_1), ..., (d̂_N, y_N)} with (noisy) dialogue histories d̂_i, the goal is to predict dialogue scores y_i as accurately as possible. We represent a dialogue history via its sentence
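Since the section above frames reward prediction as scoring (noisy) dialogue histories, a minimal supervised-regression sketch is given below. The GRU regressor, feature shapes and training call are assumptions for illustration; the excerpt ends before the authors' exact predictor is specified.

```python
# Sketch of dialogue reward prediction: regress the dialogue score y_i from
# the (noisy) dialogue history d_i, here encoded as padded sentence vectors.
# The regressor architecture and hyperparameters are illustrative assumptions.
import numpy as np
from tensorflow.keras import layers, models

MAX_SENTENCES, EMBEDDING_SIZE = 50, 100

regressor = models.Sequential([
    layers.GRU(256, input_shape=(MAX_SENTENCES, EMBEDDING_SIZE)),
    layers.Dense(1, activation="linear"),   # predicted dialogue score y_i
])
regressor.compile(optimizer="adam", loss="mse")

# X: dialogue histories as sentence-vector sequences, y: true dialogue rewards.
X = np.random.randn(32, MAX_SENTENCES, EMBEDDING_SIZE).astype("float32")  # toy data
y = np.random.randn(32).astype("float32")
regressor.fit(X, y, epochs=1, batch_size=8, verbose=0)
```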