Question Answering System with Deep
Learning
Jake Spracher
SCPD, Stanford University
jspracher@stanford.edu

Robert M. Schoenhals
Stanford Law School and Department of Management Science and Engineering
Stanford University
[email protected]
Abstract
The Stanford Question Answering Dataset (SQuAD) challenge is a machine reading comprehension task that has gained popularity in recent years. In this paper, we implement various existing deep learning methods with incremental improvements and conduct a comparative study of their performance on the SQuAD dataset. Our best model achieves 76.1 F1 and 66.1 EM scores on the test set. This project is completed for Assignment 4 of CS224n.
1 Introduction
The Stanford Question Answering Dataset (SQuAD) challenge, a machine comprehension task, has gained popularity in recent years from both theoretical and practical perspectives. The Stanford NLP group published the SQuAD [2] dataset, consisting of more than 100,000 question-answer tuples taken from a set of Wikipedia articles, in which the answer to each question is a contiguous segment of text in the corresponding reading passage. The primary task is to build models that take a paragraph and a question about it as input, and identify the answer to the question within the given paragraph. There has been a lot of research on building state-of-the-art deep learning systems on SQuAD that have been reported to accomplish outstanding performance [3][6][5]. Hence, the objective of this paper is to start with the provided starter code for CS224n: Natural Language Processing with Deep Learning (2017-2018 Winter Quarter), make successive improvements by implementing existing models, and compare their performance on SQuAD.
The rest of the paper is organized as follows. Section 2 illustrates the components and the architecture of our system, Section 3 describes the experiments and results, Section 4 demonstrates error analysis, and Section 5 concludes the paper and discusses future work.
2 Model
Our model is modular in that different components can be swapped with others due to their independent implementations. Thus, we first present all the modules considered in this paper, and then illustrate the specific combinations of the modules that we run in experiments.

2.1 Model Components
2.1.1 Encoding Layer
The $d$-dimensional word embeddings of a context $x_1, \ldots, x_N \in \mathbb{R}^d$ and a question $y_1, \ldots, y_M \in \mathbb{R}^d$ are fed into a bidirectional LSTM with weights shared between the question and the context. The encoding layer encodes the embeddings into the matrix of context hidden states $H \in \mathbb{R}^{N \times 2h}$ and the matrix of question hidden states $U \in \mathbb{R}^{M \times 2h}$, where $h$ is the size of the hidden states.
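As an illustration, a minimal PyTorch sketch of such a shared encoder is shown below; the module name, batch-first layout, and dimensions are our own assumptions and not the starter-code implementation.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Illustrative shared BiLSTM encoder (hypothetical names and sizes)."""
    def __init__(self, d, h):
        super().__init__()
        # One BiLSTM whose weights are reused for both context and question.
        self.lstm = nn.LSTM(input_size=d, hidden_size=h,
                            batch_first=True, bidirectional=True)

    def forward(self, context_emb, question_emb):
        # context_emb: (batch, N, d), question_emb: (batch, M, d)
        H, _ = self.lstm(context_emb)   # (batch, N, 2h) context hidden states
        U, _ = self.lstm(question_emb)  # (batch, M, 2h) question hidden states
        return H, U

# Toy usage with random embeddings.
enc = SharedEncoder(d=100, h=64)
H, U = enc(torch.randn(2, 30, 100), torch.randn(2, 10, 100))
print(H.shape, U.shape)  # torch.Size([2, 30, 128]) torch.Size([2, 10, 128])
```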
2.1.2 Bidirectional Attention Layer
The Bidirectional Attention layer [3] is one of the attention layers that we use. Given the context hidden states $h_1, \ldots, h_N \in \mathbb{R}^{2h}$ and the question hidden states $u_1, \ldots, u_M \in \mathbb{R}^{2h}$, a matrix $S \in \mathbb{R}^{N \times M}$ is computed according to $S_{ij} = w^\top [h_i; u_j; h_i \circ u_j] \in \mathbb{R}$, where $w \in \mathbb{R}^{6h}$ is a weight vector learned through training.
We first compute Context-To-Question (C2Q) attention. The C2Q attention distribution is obtained by $\alpha^i = \mathrm{softmax}(S_{i,:}) \in \mathbb{R}^M$, $i \in \{1, \ldots, N\}$. The question hidden states $u_j$ are then weighted according to $\alpha^i$ to get the C2Q attention output $a_i = \sum_{j=1}^{M} \alpha^i_j u_j \in \mathbb{R}^{2h}$. Next, we compute Question-To-Context (Q2C) attention. The Q2C attention distribution is obtained by $\beta = \mathrm{softmax}(m) \in \mathbb{R}^N$ for $m_i = \max_j S_{ij}$, $i \in \{1, \ldots, N\}$. The context hidden states $h_i$ are then weighted according to $\beta$ to get the Q2C attention output $c' = \sum_{i=1}^{N} \beta_i h_i \in \mathbb{R}^{2h}$. Then, we get the bidirectional attention encoding $b_i = [h_i; a_i; h_i \circ a_i; h_i \circ c'] \in \mathbb{R}^{8h}$, $i \in \{1, \ldots, N\}$.
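The computation above can be summarized by the following illustrative PyTorch sketch (batch dimension omitted); the function name and shapes are assumptions rather than the exact implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

def bidaf_attention(H, U, w):
    # H: (N, 2h) context hidden states, U: (M, 2h) question hidden states,
    # w: (6h,) trainable similarity weight.
    N, M = H.size(0), U.size(0)
    h_exp = H.unsqueeze(1).expand(N, M, -1)                    # (N, M, 2h)
    u_exp = U.unsqueeze(0).expand(N, M, -1)                    # (N, M, 2h)
    S = torch.cat([h_exp, u_exp, h_exp * u_exp], dim=-1) @ w   # (N, M)

    alpha = F.softmax(S, dim=1)        # C2Q distribution over question positions
    a = alpha @ U                      # (N, 2h) C2Q attention output

    m, _ = S.max(dim=1)                # (N,) row-wise maxima
    beta = F.softmax(m, dim=0)         # Q2C distribution over context positions
    c = (beta @ H).unsqueeze(0).expand(N, -1)  # (N, 2h) Q2C output, tiled per position

    # Bidirectional attention encoding b_i = [h_i; a_i; h_i*a_i; h_i*c']
    return torch.cat([H, a, H * a, H * c], dim=-1)             # (N, 8h)

B = bidaf_attention(torch.randn(30, 128), torch.randn(10, 128), torch.randn(3 * 128))
print(B.shape)  # torch.Size([30, 512])
```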
2.1.3 Coattention Layer
Another type of attention layer we implemented is the Coattention layer [6]. Given the question hidden states $u_1, \ldots, u_M \in \mathbb{R}^{2h}$, we first compute projected question hidden states $u'_j = \tanh(W u_j + b) \in \mathbb{R}^{2h}$, $j \in \{1, \ldots, M\}$. Also, sentinel vectors $h_\emptyset$ and $u_\emptyset$ are appended to the context and question hidden states, which gives us $\{h_1, \ldots, h_N, h_\emptyset\}$ and $\{u'_1, \ldots, u'_M, u_\emptyset\}$. We then compute an affinity matrix $L \in \mathbb{R}^{(N+1) \times (M+1)}$, where $L_{ij} = h_i^\top u'_j \in \mathbb{R}$.

Using the affinity matrix $L$, we apply the standard attention mechanism in both directions. The Context-To-Question attention output is obtained by $a_i = \sum_{j=1}^{M+1} \alpha^i_j u'_j \in \mathbb{R}^{2h}$ for $\alpha^i = \mathrm{softmax}(L_{i,:}) \in \mathbb{R}^{M+1}$, $i \in \{1, \ldots, N\}$. The Question-To-Context attention output is computed in a similar way: $b_j = \sum_{i=1}^{N+1} \beta^j_i h_i \in \mathbb{R}^{2h}$ for $\beta^j = \mathrm{softmax}(L_{:,j}) \in \mathbb{R}^{N+1}$, $j \in \{1, \ldots, M+1\}$. Next, we compute the second-level attention output $s_i = \sum_{j=1}^{M+1} \alpha^i_j b_j \in \mathbb{R}^{2h}$. Finally, $[a_i; s_i] \in \mathbb{R}^{4h}$, $i \in \{1, \ldots, N\}$ is fed into a bidirectional LSTM, and the resulting hidden states are the coattention encoding.
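A minimal sketch of this coattention computation, assuming learned sentinel vectors and a projection layer created inline, is given below for illustration; it is not the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def coattention(H, U, proj, h_sent, u_sent, out_lstm):
    # H: (N, 2h) context states, U: (M, 2h) question states.
    U_proj = torch.tanh(proj(U))              # (M, 2h) projected question states
    H_s = torch.cat([H, h_sent], dim=0)       # (N+1, 2h) with sentinel
    U_s = torch.cat([U_proj, u_sent], dim=0)  # (M+1, 2h) with sentinel

    L = H_s @ U_s.t()                         # (N+1, M+1) affinity matrix
    alpha = F.softmax(L, dim=1)               # C2Q distributions (rows)
    beta = F.softmax(L, dim=0)                # Q2C distributions (columns)

    a = alpha @ U_s                           # (N+1, 2h) C2Q attention output
    b = beta.t() @ H_s                        # (M+1, 2h) Q2C attention output
    s = alpha @ b                             # (N+1, 2h) second-level attention

    fused = torch.cat([a, s], dim=-1)[:-1]    # drop sentinel row -> (N, 4h)
    encoding, _ = out_lstm(fused.unsqueeze(0))  # (1, N, 2h) coattention encoding
    return encoding.squeeze(0)

two_h = 128
proj = nn.Linear(two_h, two_h)
out_lstm = nn.LSTM(2 * two_h, two_h // 2, batch_first=True, bidirectional=True)
enc = coattention(torch.randn(30, two_h), torch.randn(10, two_h), proj,
                  torch.randn(1, two_h), torch.randn(1, two_h), out_lstm)
print(enc.shape)  # torch.Size([30, 128])
```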
2.1.4 Modeling Layer
Following the example of BiDAF [3], we implement a modeling layer comprised of two layers of bidirectional LSTMs, which outputs $M \in \mathbb{R}^{N \times 2h}$.
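A minimal sketch of this modeling layer, assuming a batch-first layout and illustrative dimensions, is given below.

```python
import torch
import torch.nn as nn

h = 64
# Two stacked BiLSTMs over the attention encoding (N x 8h) -> M (N x 2h).
modeling_lstm = nn.LSTM(input_size=8 * h, hidden_size=h, num_layers=2,
                        batch_first=True, bidirectional=True)
B = torch.randn(1, 30, 8 * h)   # attention-layer output for one example
M, _ = modeling_lstm(B)         # (1, 30, 2h)
print(M.shape)                  # torch.Size([1, 30, 128])
```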
2.1.5 Self-Attention Layer
A self-attention layer [4] is used as an alternative to the modeling layer. Given the context hidden states $H \in \mathbb{R}^{N \times 2h}$, we apply the attention mechanism to obtain the attention distribution $A = \mathrm{softmax}(HH^\top / \sqrt{2h}) \in \mathbb{R}^{N \times N}$, where the softmax is taken with respect to the rows of $HH^\top / \sqrt{2h}$. Then, the self-attention output is computed as $AH \in \mathbb{R}^{N \times 2h}$.
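A minimal sketch of this scaled dot-product self-attention (no trainable parameters shown) is given below for illustration.

```python
import math
import torch
import torch.nn.functional as F

def self_attention(H):
    # H: (N, 2h) context hidden states.
    two_h = H.size(-1)
    scores = H @ H.t() / math.sqrt(two_h)  # (N, N) scaled dot products
    A = F.softmax(scores, dim=-1)          # row-wise attention distribution
    return A @ H                           # (N, 2h) self-attention output

out = self_attention(torch.randn(30, 128))
print(out.shape)  # torch.Size([30, 128])
```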
2.1.6 Output Layers
The basic output layer we consider has the identical structure to that of BiDAF [3]. This module is used in conjunction with the modeling layer. Let $G \in \mathbb{R}^{N \times 8h}$ denote the output of an attention layer. Then, the probability distribution of the start index is computed by $p^{\mathrm{start}} = \mathrm{softmax}(w_1^\top [G; M]) \in \mathbb{R}^N$, where $w_1 \in \mathbb{R}^{10h}$ is a trainable weight. The modeling output $M$ is then passed to a bidirectional LSTM that outputs $M_2 \in \mathbb{R}^{N \times 2h}$. Finally, the probability distribution of the end index is obtained by $p^{\mathrm{end}} = \mathrm{softmax}(w_2^\top [G; M_2]) \in \mathbb{R}^N$.
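For illustration, a minimal PyTorch sketch of this output layer is given below; the module name and hidden size are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidafOutput(nn.Module):
    """Illustrative BiDAF-style output layer (hypothetical names)."""
    def __init__(self, h):
        super().__init__()
        self.w1 = nn.Linear(10 * h, 1, bias=False)  # w1^T [G; M]
        self.w2 = nn.Linear(10 * h, 1, bias=False)  # w2^T [G; M2]
        self.lstm = nn.LSTM(2 * h, h, batch_first=True, bidirectional=True)

    def forward(self, G, M):
        # G: (batch, N, 8h), M: (batch, N, 2h)
        p_start = F.softmax(self.w1(torch.cat([G, M], dim=-1)).squeeze(-1), dim=-1)
        M2, _ = self.lstm(M)                        # (batch, N, 2h)
        p_end = F.softmax(self.w2(torch.cat([G, M2], dim=-1)).squeeze(-1), dim=-1)
        return p_start, p_end                       # each (batch, N)

out = BidafOutput(h=64)
p_s, p_e = out(torch.randn(2, 30, 512), torch.randn(2, 30, 128))
print(p_s.shape, p_e.shape)  # torch.Size([2, 30]) torch.Size([2, 30])
```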
Another type of output layer we implemented is the Answer-Pointer layer [6]. Given the blended representation $G$, the probability distribution of the start index is given by $p^{\mathrm{start}} = \mathrm{softmax}(w^\top F_s + c \otimes e_N) \in \mathbb{R}^N$, where $F_s = \tanh(V G^\top + b \otimes e_N) \in \mathbb{R}^{h \times N}$, and $w \in \mathbb{R}^h$, $c \in \mathbb{R}$, $V \in \mathbb{R}^{h \times 8h}$, $b \in \mathbb{R}^h$ are parameters to be trained. The operator $\otimes\, e_N$ produces a matrix by repeating the element on its left-hand side $N$ times. Then, we compute the hidden vector $h_s$ by applying the attention mechanism, $G^\top p^{\mathrm{start}} \in \mathbb{R}^{8h}$, and passing the result to a standard LSTM. Finally, the probability distribution of the end index is obtained by $p^{\mathrm{end}} = \mathrm{softmax}(w^\top F_e + c \otimes e_N) \in \mathbb{R}^N$, where $F_e = \tanh(V G^\top + (W_h h_s + b) \otimes e_N) \in \mathbb{R}^{h \times N}$, and $W_h \in \mathbb{R}^{h \times h}$ is another trainable weight.
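A minimal sketch of this answer-pointer layer for a single example is given below; the parameter names, the use of an LSTM cell to produce $h_s$, and the folding of the bias $b$ into the position-wise projection are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerPointer(nn.Module):
    """Illustrative answer-pointer output layer (hypothetical names)."""
    def __init__(self, h):
        super().__init__()
        g_dim = 8 * h
        self.V = nn.Linear(g_dim, h)       # V g_i + b, applied position-wise
        self.w = nn.Linear(h, 1)           # w^T f_i + c
        self.Wh = nn.Linear(h, h, bias=False)
        self.cell = nn.LSTMCell(g_dim, h)  # produces the hidden vector h_s

    def forward(self, G):
        # G: (N, 8h) blended representation for a single example.
        F_s = torch.tanh(self.V(G))                           # (N, h)
        p_start = F.softmax(self.w(F_s).squeeze(-1), dim=0)   # (N,)

        summary = p_start @ G                                 # (8h,) attention over G
        h_s, _ = self.cell(summary.unsqueeze(0))              # (1, h)

        F_e = torch.tanh(self.V(G) + self.Wh(h_s))            # (N, h), broadcast per position
        p_end = F.softmax(self.w(F_e).squeeze(-1), dim=0)     # (N,)
        return p_start, p_end

ptr = AnswerPointer(h=64)
p_s, p_e = ptr(torch.randn(30, 512))
print(p_s.shape, p_e.shape)  # torch.Size([30]) torch.Size([30])
```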
The start and end indices $(i^*, j^*)$ are selected such that the joint probability $p^{\mathrm{start}}_{i^*} p^{\mathrm{end}}_{j^*}$ is maximized subject to $i^* \le j^*$.
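A minimal sketch of this span-selection step, assuming the constraint $i^* \le j^*$, is given below; it scans the positions once while tracking the best start index seen so far.

```python
import torch

def select_span(p_start, p_end):
    # p_start, p_end: (N,) probability distributions over positions.
    best = (0, 0)
    best_prob = 0.0
    best_start = 0
    for j in range(p_start.size(0)):
        if p_start[j] > p_start[best_start]:
            best_start = j                  # best start index among 0..j
        prob = (p_start[best_start] * p_end[j]).item()
        if prob > best_prob:
            best_prob, best = prob, (best_start, j)
    return best

span = select_span(torch.rand(30).softmax(0), torch.rand(30).softmax(0))
print(span)
```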