This repository provides a simple PyTorch implementation of Question Answer Matching. Here we use an English corpus from Stack Exchange to build embeddings for entire questions. Using those embeddings, we find the questions most similar to a given question and show the answers corresponding to the matches.
TF-IDF is commonly used to measure how relevant a term is in a document. The TF-IDF value is the product of two statistics: Term Frequency (TF) and Inverse Document Frequency (IDF).
TF: Term Frequency, a value indicating how frequently a particular term occurs in a document. The higher it is, the more relevant the term is to that document. In other words, if a term occurs more often than other terms in a document, it carries more relevance for that document.
IDF: Inverse Document Frequency, the inverse of Document Frequency (DF). DF measures how common or rare a term is across all documents, and therefore how much information the term provides.
The higher a term's frequency within a particular document, and the fewer documents in the collection that contain the term, the higher the TF-IDF value. This value therefore has the effect of filtering out words that are common across all documents: the more documents contain a term, the closer the argument of the log function gets to 1, and the closer the IDF value, and hence the TF-IDF value, gets to 0.
Consider that there are two documents as follows:
- document1 = "a new car, used car, car review"
- document2 = "a friend in need is a friend indeed"
| word | TF (document1) | TF (document2) | IDF | TF-IDF (document1) | TF-IDF (document2) |
|---|---|---|---|---|---|
| a | 1/7 | 2/8 | log(2/2) = 0 | 0 | 0 |
| new | 1/7 | 0 | log(2/1) = 0.3 | 0.04 | 0 |
| car | 3/7 | 0 | log(2/1) = 0.3 | 0.13 | 0 |
| used | 1/7 | 0 | log(2/1) = 0.3 | 0.04 | 0 |
| review | 1/7 | 0 | log(2/1) = 0.3 | 0.04 | 0 |
| friend | 0 | 2/8 | log(2/1) = 0.3 | 0 | 0.08 |
| in | 0 | 1/8 | log(2/1) = 0.3 | 0 | 0.04 |
| need | 0 | 1/8 | log(2/1) = 0.3 | 0 | 0.04 |
| is | 0 | 1/8 | log(2/1) = 0.3 | 0 | 0.04 |
| indeed | 0 | 1/8 | log(2/1) = 0.3 | 0 | 0.04 |
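The table above can be reproduced with a few lines of Python. This is a minimal sketch using raw-count TF and base-10 log IDF, matching the table's convention (function names are illustrative):

```python
import math

doc1 = "a new car used car car review".split()
doc2 = "a friend in need is a friend indeed".split()
docs = [doc1, doc2]

def tf(term, doc):
    # term frequency: occurrences of the term / total tokens in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency: log10(N / number of documents containing the term)
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(round(tf("car", doc1), 2))            # 0.43  (3/7)
print(round(idf("car", docs), 2))           # 0.3   (log(2/1))
print(round(tf_idf("car", doc1, docs), 2))  # 0.13
print(tf_idf("a", doc1, docs))              # 0.0 -- "a" appears in both documents
```

Note that "a" gets a TF-IDF of 0 in both documents, exactly the common-word filtering effect described above.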
To feed variable-length sequences to a recurrent network such as a GRU or LSTM in PyTorch, we need to follow the steps below.
padding -> pack sequence -> recurrent network -> unpack sequence
To pack/unpack the sequences easily, PyTorch provides two useful functions: pack_padded_sequence and pad_packed_sequence.
- pack_padded_sequence: Packs a tensor containing padded sequences of variable length. The sequences should be sorted by length in decreasing order, i.e. input[:,0] should be the longest sequence and input[:,-1] the shortest one.
  - Input: a tensor of size T x B x *, where T is the length of the longest sequence (equal to the first element of the list of sequence lengths), B is the batch size, and * is any number of dimensions (including 0). If the batch_first argument is True, the input is expected in B x T x * format.
  - Returns: a PackedSequence object.
- pad_packed_sequence: Pads a packed batch of variable-length sequences. It is the inverse operation of pack_padded_sequence.
  - Input: a PackedSequence object.
  - Returns: a tuple of a tensor containing the padded sequences and a tensor containing the length of each sequence in the batch. The returned tensor's data will be of size T x B x *, where T is the length of the longest sequence and B is the batch size. If the batch_first argument is True, the data will be transposed into B x T x * format. Batch elements will be ordered decreasingly by their length.
- PackedSequence: Holds the data and the list of batch_sizes of a packed sequence. All RNN modules accept packed sequences as input. The data tensor contains the packed sequence, and the batch_sizes tensor contains integers describing the batch size at each sequence step.
  - For instance, given data 'abc' and 'x', the PackedSequence would contain 'axbc' with batch_sizes=[2,1,1].
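The 'abc' / 'x' example above can be run directly. Here is a small sketch using token ids as stand-ins for the characters (1, 2, 3 for 'a', 'b', 'c' and 7 for 'x'; 0 is padding):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Padded batch, sorted by decreasing length, batch-first layout.
padded = torch.tensor([[1, 2, 3],    # 'abc' (length 3)
                       [7, 0, 0]])   # 'x'   (length 1)
lengths = [3, 1]

packed = pack_padded_sequence(padded, lengths, batch_first=True)
print(packed.data)         # tensor([1, 7, 2, 3]) -> 'axbc'
print(packed.batch_sizes)  # tensor([2, 1, 1])

# An RNN module would consume the PackedSequence directly; afterwards we
# recover the padded layout with the inverse operation.
unpacked, out_lengths = pad_packed_sequence(packed, batch_first=True)
print(unpacked)            # the original padded tensor
print(out_lengths)         # tensor([3, 1])
```

Note that the packed data interleaves the sequences time step by time step, which is what lets the RNN skip computation on padding.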
To represent a sentence, here we use CBoW (Continuous Bag of Words), which can be defined as follows:
- Ignore the order of the tokens.
- Simply average the token vectors.
- Averaging is a differentiable operator.
- Just one operator node in the DAG(Directed Acyclic Graph).
- Generalizable to bag-of-n-grams.
- N-gram: a phrase of N tokens.
CBoW is extremely effective in text classification. For instance, if a review contains many positive words, the review is likely positive.
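The averaging described above can be sketched in a few lines of PyTorch (vocabulary size, embedding dimension, and token ids are illustrative):

```python
import torch

# Embed each token, then average the token vectors -- order is ignored.
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)

tokens = torch.tensor([2, 5, 5, 7])   # a tokenized sentence (4 tokens)
vectors = embedding(tokens)           # shape: (4, 4) -- one vector per token
sentence = vectors.mean(dim=0)        # shape: (4,)  -- one vector per sentence

# Averaging is differentiable, so gradients flow back into the embeddings:
sentence.sum().backward()
print(embedding.weight.grad.shape)    # torch.Size([10, 4])
```

Because the mean is just one node in the computation graph, the sentence representation can be trained end-to-end with whatever loss sits on top of it.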
Because the data from Stack Exchange is saved as XML, we first install beautifulsoup4 and the lxml parser. You can easily install them by running the following commands.
$ pip install beautifulsoup4
$ pip install lxml
To load the question/answer text from the XML file, run the dataLoader Python script below.
example usage:
$ python dataLoader.py
The dataset contains 91,517 records, and each record has 5 attributes: title (question), body (answer), tags, post type id, and view count. Here we mainly use title and body. The table below shows the first 5 rows of our dataset.
Preprocessing largely consists of 3 steps:
- Text cleaning/normalization
- Tokenization
- Build TF-IDF and word embedding matrix with pre-trained word representations
The tokenization and TF-IDF/embedding-matrix construction used here are not much different from those used in other NLP tasks. But, as shown in the results of the data loading step above, HTML tags and URLs (e.g. <p>, <a href=https://~>) appear in the title and body columns. To clean up and normalize the data, run the preprocessing script below.
example usage:
$ python preprocessing.py
The table below shows the first 5 rows of the preprocessing results. We can see that all the HTML tags have disappeared.
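The tag stripping itself can be sketched with beautifulsoup4 and the lxml parser installed above. This is a minimal example on a made-up string; the actual preprocessing script also performs further normalization:

```python
from bs4 import BeautifulSoup

# Hypothetical raw post body containing HTML tags and a link.
raw = '<p>Check <a href="https://example.com">this link</a> before flying.</p>'

# get_text() drops all tags and keeps only the visible text.
text = BeautifulSoup(raw, "lxml").get_text(separator=" ", strip=True)
print(text)  # Check this link before flying.
```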
$ python train.py -h
usage: train.py [-h] [--filename FILENAME] [--clean_drop CLEAN_DROP]
[--epochs EPOCHS] [--batch_size BATCH_SIZE]
[--learning_rate LEARNING_RATE] [--hidden_size HIDDEN_SIZE]
[--n_layers N_LAYERS] [--dropout_p DROPOUT_P]
optional arguments:
-h, --help show this help message and exit
--filename FILENAME
--clean_drop CLEAN_DROP
Drop if either title or body column is NaN
--epochs EPOCHS Number of epochs to train. Default=7
--batch_size BATCH_SIZE
Mini batch size for gradient descent. Default=2
--learning_rate LEARNING_RATE
Learning rate. Default=.001
--hidden_size HIDDEN_SIZE
Hidden size of LSTM. Default=64
--n_layers N_LAYERS Number of layers. Default=1
--dropout_p DROPOUT_P
Dropout ratio. Default=.1
example usage:
$ python train.py --epochs 15 --batch_size 2 --learning_rate .001 --hidden_size 64 --n_layers 1 --dropout_p .1
You may need to change the argument parameters.
$ python evaluate.py -h
usage: evaluate.py [-h] --model MODEL [--filename FILENAME]
[--clean_drop CLEAN_DROP] [--hidden_size HIDDEN_SIZE]
[--n_layers N_LAYERS] [--dropout_p DROPOUT_P]
optional arguments:
-h, --help show this help message and exit
--model MODEL Model file(.pth) path to load trained model's learned
parameters
--filename FILENAME
--clean_drop CLEAN_DROP
Drop if either title or body column is NaN
--hidden_size HIDDEN_SIZE
Hidden size of LSTM. Default=64
--n_layers N_LAYERS Number of layers. Default=1
--dropout_p DROPOUT_P
Dropout ratio. Default=.1
example usage:
$ python evaluate.py --model model.pth
You may need to change the argument parameters.
- training set: 60,114 question-answer pairs
  - With sampling, we doubled the original 30,057 pairs to 60,114 pairs.
- test set: 3,356 question-answer pairs
The models were trained on an NVIDIA Tesla K80 for 10 epochs.
The table below shows the results of the model on the question matching task. Given a sample question, the question matching model tries to find the closest questions based on cosine similarity (cf. if we have the answers corresponding to the closest questions, we can answer all the given questions).
| Given sample question | nearest 1 | nearest 2 | nearest 3 |
|---|---|---|---|
| Can you travel on an expired passport? | How serious is an expired passport? (similarity=.96) | What should I do with my expired passport? (similarity=.96) | Can I use a study visa in an expired passport? (similarity=.92) |
| Can I carry comics with me while traveling from the USA to india? | Can I take two laptops to india from united states? one bought in india and one in US. (similarity=.86) | Can I carry mobile phones to US from india? (similarity=.85) | Can I bring laptops to india from the US? (similarity=.85) |
| Do I need transit visa if I have to recheck my checked in baggage for a layover in dubai? | requirements for the transit visa and baggage information for layover in dubai. (similarity=.96) | Do I need a transit visa to collect and re-check my luggage at istanbul ataturk airport? (similarity=.96) | Do I need a transit visa for an hour layover in dubai? (similarity=.94) |
| What happens if I'm forced to overstay in the U.S. Because my flight is delayed or cancelled? | Returning plane ticket if delayed in the USA. (similarity=.87) | Is airline obliged to refund cost of flight if the passenger is unable to fly because his travel visa has been rejected? (similarity=.88) | What rights do I have if my flight is cancelled? (similarity=.86) |
| Are campsites always free on canary islands? | Where can I find authentic areas in canary islands? (similarity=.93) | Safe typical dishes to order away from the tourist trail in thailand when english is not supported? (similarity=.89) | Is there any region island mountain or village in japan known for hot spicy local cuisine? (similarity=.89) |
| Renting a car in israel when under? | Car rental in israel without additional insurance (similarity=.86) | What are the rules for renting a car in italy? (similarity=.85) | What should I be aware of when hiring a car from one of the cheaper rental firms in spain? (similarity=.87) |
| What does a technical stop mean in air travel? | What does it mean when a flight is delayed due to a tail swap if anything? (similarity=.92) | Is there a way to check that the conditions on your plane ticket are actually what your travel agent said they are? (similarity=.92) | Can you earn miles with different airlines for the same one flight if they are part of the same loyalty program? (similarity=.91) |
| Do I need a transit visa for paris when travelling to italy with a schengen visa? | Do I need a transit visa through italy from romania to algeria? (similarity=.93) | Do I need a transit visa for paris if i have a schengen visa issued by portugal? (similarity=.92) | Do I need a transit visa for frankfurt if i have a schengen italian visa? (similarity=.92) |
| Travel to italy via germany using us refugee travel document | Travel to italy with romanian travel document. (similarity=.79) | Do I need transit visa for germany travelling from india to poland via germany with polish d type national visa? (similarity=.79) | Schengen visa from italy without visiting italy. (similarity=.77) |
| Need to travel to the USA by ship from europe. | Do you need visa to connect in USA from Russia? (similarity=.74) | Can I travel to USA from a foreign country (similarity=.73) | Should I convert US dollars to euros while in USA or in Europe? (similarity=.72) |
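The ranking step behind the table can be sketched as follows. The embeddings here are random stand-ins; in the actual pipeline they would come from the trained encoder (hidden_size=64 matches the training default):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
query = torch.randn(64)         # embedding of the sample question
index = torch.randn(1000, 64)   # embeddings of 1,000 indexed questions

# Cosine similarity between the query and every indexed question.
scores = F.cosine_similarity(query.unsqueeze(0), index)  # shape: (1000,)

# The 3 nearest questions, as reported in the table above.
top = scores.topk(3)
print(top.indices)  # ids of the 3 closest questions
print(top.values)   # their cosine similarities, highest first
```

Cosine similarity is a natural choice here because it compares the direction of the sentence embeddings while ignoring their magnitude.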
- [Himanshu] Sentiment Analysis with Variable length sequences in Pytorch
- [William Falcon] Taming LSTMs: Variable-sized mini-batches and why PyTorch is good for your health
- [PyTorch] PyTorch official document - package reference - torch.nn
- [Sunwoo Park] Show, Attend, and Tell with Pytorch
- [edwith, Kyunghyun Cho] CBoW & RN & CNN
- [Minsuk Heo] [Deep Learning NLP] TF-IDF
- [DOsinga/deep_learning_cookbook] 06.1 Question matching



