TEXT SUMMARIZATION USING TRANSFORMER BASED
MODEL
BY
AUGUST, 2024
DECLARATION
We hereby declare that this project titled “TEXT SUMMARIZATION USING TRANSFORMER MODEL” is our own work and has not been submitted by any other person for any degree or qualification at any higher institution. We also declare that the information provided therein is ours, and material that is not ours is properly acknowledged.
CERTIFICATION
This is to certify that this project titled “Text summarization using transformer-based model” was carried out by Azeez Olonade Lekan. The project has been read and approved as meeting the
___________________ ________________
Dr. (Mrs.) J.F. Ajao Date
Supervisor
____________________ ______________
Dr. (Mrs.) R.S. Babatunde Date
Head of Department
________________________ _____________
External Examiner Date
DEDICATION
This study is dedicated to the Almighty God, who has been our source of strength, grace and wisdom throughout our study period. May His name be forever praised!
ACKNOWLEDGEMENTS
We want to thank Almighty God for His guidance and understanding. We would like to express our deepest gratitude to everyone who supported and contributed to the success of this project.
First and foremost, we are profoundly grateful to Dr. (Mrs.) J.F. Ajao for her invaluable guidance, encouragement, and insightful feedback throughout the duration of this project. Our sincere appreciation is extended to our instructors, especially the esteemed Head of Department, Dr. (Mrs.) R. S. Babatunde, for her unwavering attention and support during our time at KWASU. We are also grateful to Dr. R. M. Isiaka, Dr. A. N. Babatunde, Dr. S. O. Abdulsalam, Dr. S. R. Yusuff, and Dr. A. F. Kadri; we have the utmost gratitude for the wisdom bestowed on us. Their expertise and patience were instrumental in shaping the direction and quality of our work. We also want to extend our heartfelt thanks to Kwara State University for providing the necessary resources and facilities that made this project possible. The support of our group members is greatly appreciated. A special thanks to our family and friends for their unwavering support and encouragement. Their understanding and motivation have been a constant source of strength during the more challenging moments of this journey. Finally, we would like to acknowledge all our peers and colleagues who enhanced our experience during this project. Your feedback and ideas were invaluable.
TABLE OF CONTENTS
FRONT PAGE......................................................................................................................i
TITLE PAGE.......................................................................................................................ii
DECLARATION................................................................................................................iii
CERTIFICATION..............................................................................................................iv
DEDICATION.....................................................................................................................v
ACKNOWLEDGEMENTS................................................................................................vi
TABLE OF CONTENTS.....................................................................................................vii
LIST OF FIGURES...........................................................................................................xii
LIST OF TABLES...........................................................................................................xiii
ABSTRACT.....................................................................................................................xiv
CHAPTER ONE................................................................................................................1
INTRODUCTION.............................................................................................................1
1.6 Project Layout............................................................................................................3
CHAPTER TWO...............................................................................................................5
LITERATURE REVIEW.................................................................................................5
2.0 Introduction................................................................................................................5
2.0.3 TextRank................................................................................................................6
2.3.5 Libraries................................................................................................................15
2.3.5.1 Keras.................................................................................................................15
2.3.5.2 NLTK................................................................................................................15
2.3.5.3 Scikit-learn........................................................................................................15
2.3.5.4 Pandas...............................................................................................................16
2.3.5.5 Gensim..............................................................................................................16
2.3.5.6 Flask..................................................................................................................16
2.3.5.7 Bootstrap...........................................................................................................16
2.3.5.8 GloVe................................................................................................................16
2.3.5.9 LXML...............................................................................................................17
CHAPTER THREE.........................................................................................................21
METHODOLOGY..........................................................................................................21
3.1 Introduction..............................................................................................................21
3.2.1 Algorithms............................................................................................................28
3.3.3 Model....................................................................................................................34
3.4 Tune the model.........................................................................................................36
CHAPTER FOUR...........................................................................................................38
4.1 Introduction..............................................................................................................38
4.4 Result.......................................................................................................................39
4.5 Discussion................................................................................................................48
4.5.3 Limitations............................................................................................................49
CHAPTER FIVE.............................................................................................................51
5.0 Summary..................................................................................................................51
5.1 Conclusion...............................................................................................................51
REFERENCES....................................................................................................................52
Appendix............................................................................................................................58
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
CHAPTER ONE
INTRODUCTION
Text summarization is the process of generating short, fluent, and, most importantly, accurate summaries of longer text documents. The main idea behind automatic text summarization is to be able to find a short subset of the most essential information from the entire set and present it in a human-readable format. As online textual data grows, automatic text summarization methods have the potential to be very helpful, because more useful information can be read in a short time.
agreements ([Link], 2018). In order to satisfy its customer base, Juniper tries to resolve issues quickly and efficiently. Juniper Networks maintains a Knowledge Base (KB), which is a dataset composed of questions from customers with human-written solutions. The KB contains over twenty thousand articles. The company currently operates a chatbot that can search queries asked by the users in the KB and fetch links to the related articles. The aim is to summarize these articles so that the chatbot can present the summaries to the customers. The
customers can then decide if they would like to read the entire article. The summarization tool could be further used internally for summarizing tickets and other internal documents.
Over recent years, researchers have worked on improving the generation of informative and concise summaries. This project aims to offer a rich framework for enhancing text summarizers through adaptive summarization, and continued research and development in these areas is likely to yield further improvements. The goals of this project are to research methods for text summarization and to:
iv. Build and host an end-to-end tool which takes texts as input and outputs a
summary
Transformer-based models are significant for their ability to efficiently condense vast amounts of information into concise, contextually relevant summaries that are more accurate and cohesive than those of traditional methods. Their versatility allows them to handle diverse text genres and to be optimized for specific domains. By improving the accuracy of summaries and automating content management, they advance natural language processing by delivering fast, precise, and adaptable summarization solutions that can be applied widely.
Chapter one
1. Background study
2. Statement of problem
Chapter two
1. Literature review
Chapter three
1. Data collection
2. Model selection
3. Training/Tuning
4. Evaluation metrics
Chapter four
1. Implementation
2. Result
3. Recommendation
Chapter five
1. Summary
2. Conclusion
CHAPTER TWO
LITERATURE REVIEW
2.0 Introduction
This section explores the technologies which were used in this project (Sections 2.1 - 2.3). The section first discusses the key concepts for text summarization, followed by the metrics used to evaluate them, along with the environments (Section 2.4) and the libraries used.
Natural Language Processing (NLP) is a field in Computer Science that focuses on the
study of the interaction between human languages and computers (Chowdhury, 2003).
Text summarization is in this field because computers are required to understand what
humans have written and produce human-readable outputs. NLP can also be seen as a branch of artificial intelligence. Machine learning methods, including neural network models, are also used for solving NLP-related problems. In the existing research, researchers generally rely on two types of summarization: extractive and abstractive.
Extractive summarization means extracting keywords or key sentences from the original document without changing the sentences. These extracted sentences can then be used to form the summary.
2.0.3 TextRank
TextRank is a graph-based ranking mechanism that helps identify key sentences in a passage (Mihalcea and Tarau, 2004). The idea behind this algorithm is that the sentence that is similar to most other sentences in the passage is probably the most important sentence in the passage. Using this idea, one can create a graph of sentences connected with all the similar sentences and
run Google’s PageRank algorithm on it to find the most important sentences. These sentences would then form the extractive summary.
TF-IDF (Term Frequency-Inverse Document Frequency) measures the relevance of each word in the document (Ramos, 2003). The underlying algorithm calculates the frequency of the word in the document (term frequency) and multiplies it by the logarithm of the total number of documents in the dataset over the number of documents containing that word (inverse document frequency). Using the relevance of each word, one can compute the relevance of each sentence. Assuming that the most relevant sentences are the most important sentences, these sentences can then be used to form the summary.
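The TF-IDF scoring described above can be sketched in plain Python. This is a minimal illustration rather than the project's implementation; the tokenized toy documents and the raw term-frequency weighting are assumptions made for the sake of the example.

```python
import math

def tfidf_scores(documents):
    """Compute TF-IDF scores for every word of every document.

    documents: list of token lists. Returns one dict per document
    mapping word -> tf * idf, where idf = log(N / df).
    """
    n_docs = len(documents)
    # document frequency: number of documents containing each word
    df = {}
    for doc in documents:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    scored = []
    for doc in documents:
        scores = {}
        for word in set(doc):
            tf = doc.count(word) / len(doc)          # term frequency
            idf = math.log(n_docs / df[word])         # inverse document frequency
            scores[word] = tf * idf
        scored.append(scores)
    return scored

def sentence_score(sentence_tokens, word_scores):
    # relevance of a sentence = sum of the relevance of its words
    return sum(word_scores.get(w, 0.0) for w in sentence_tokens)
```

Note that a word appearing in every document gets an IDF of log(1) = 0, so ubiquitous words contribute nothing to a sentence's score.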
Abstractive summarization is closer to what humans usually expect from text summarization. The process is to understand the original document and rephrase it into a shorter text while capturing the key points (Dalal and Malik, 2013). Text abstraction is primarily done using the concept of artificial neural networks. This section introduces the key concepts needed to understand these networks.
Artificial neural networks are computing systems inspired by biological neural networks.
Such systems learn tasks by considering examples, usually without any prior knowledge. For example, in an email spam detector, each email in the dataset is manually labeled as “spam” or “not spam”. By processing this dataset, the artificial neural network evolves its own set of relevant characteristics that distinguish spam emails from legitimate ones.
To expand more, artificial neural networks are composed of artificial neurons called
units, usually arranged in a series of layers. Figure 1 shows the most common architecture
of a neural network model. It contains three types of layers: the input layer contains
units which receive inputs normally in the format of numbers; the output layer
contains units that “respond to the input information about how it has learned any
task”; the hidden layer contains units between input layer and output layer, and its job
is to transform the inputs to something that output layer can use (Schalkoff, 1997).
Traditional neural networks do not recall any previous work when building the
understanding of the task from the given examples. However, for tasks like text summarization, we want the model to remember the previous words when it processes the next one. To achieve that, we use recurrent neural networks, because they are
networks with loops in them where information can persist in the model
(Christopher, 2015).
Figure 2.4 shows what a recurrent neural network (RNN) looks like when it is unrolled.
For the symbols in the figure, “ht” represents the output units value after each timestamp (if the input is a list of strings, each timestamp can be the processing of one word), “xt” represents the input units for each timestamp, and “A” represents a chunk of the neural network. The figure shows that the result from the previous timestamp is
passed to the next step for part of the calculation that happens in a chunk of the neural
network. Therefore, the information gets captured from the previous timestamp.
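The feedback loop described above can be illustrated with a minimal scalar RNN. The weights below are arbitrary illustrative values, and a real RNN operates on vectors with learned weight matrices, but the sketch shows how the previous state persists into each new step.

```python
import math

def rnn_unrolled(inputs, w_x=0.5, w_h=0.8, h0=0.0):
    """Minimal scalar RNN: at each timestamp the new hidden state h_t
    depends on the current input x_t AND the previous state h_{t-1},
    which is how information from earlier timestamps persists.
    """
    h = h0
    outputs = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)  # previous h is fed back in
        outputs.append(h)
    return outputs
```

Even when a later input is zero, the output stays nonzero because the earlier input is carried forward through the hidden state.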
However, traditional RNNs cannot connect information efficiently as the distance between the related pieces of information increases. Long Short-Term Memory (LSTM) networks were designed to remember information over the long term. Different from the traditional RNN, inside each LSTM cell, there are
several simple linear operations which allow data to be conveyed without doing the
complex computation. As shown in Figure 3, the previous cell state containing all the
information so far smoothly goes through an LSTM cell by doing some linear
operations. Inside, each LSTM cell makes decisions about what information to keep, and when to allow reads, writes and erasures of information, via three gates that are described below.
As shown in Figure 2.6, the first gate is called the “forget gate layer”, which takes the
previous output units value ht-1 and the current input xt, and outputs a number between
0 and 1 to indicate the ratio of passing information: 0 means do not let any information through, while 1 means let all information through.
To decide what information needs to be updated, the LSTM contains the “input gate
layer”. It also takes in the previous output units value ht-1 and the current input xt and
outputs a number to indicate inside which cells the information should be updated. Then,
the previous cell state Ct-1 is updated to the new state Ct. The last gate is the “output gate layer”, which decides what the output should be. Figure 2.7 shows that in the output
layer, the cell state is going through a tanh function, and then it is multiplied by the
weighted output of the sigmoid function. The output units value ht is then passed to the next cell.
Simple linear operators connect the three gate layers. The vast LSTM neural network
consists of many LSTM cells, and all information is passed through all the cells while
the critical information is kept to the end, no matter how many cells the network has.
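The gate mechanics above can be sketched as a single scalar LSTM step. This is a toy with hypothetical weights `wf`, `wi`, `wc`, `wo` and no biases; real LSTM cells operate on vectors with learned weight matrices, but the sketch follows the same forget/input/output gate structure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, w):
    """One scalar LSTM step with the three gates described above.
    `w` holds hypothetical weights applied to a stand-in for the
    concatenation of h_{t-1} and x_t; biases are omitted.
    """
    z = h_prev + x_t                  # stand-in for [h_{t-1}, x_t]
    f_t = sigmoid(w["wf"] * z)        # forget gate: 0..1 keep ratio
    i_t = sigmoid(w["wi"] * z)        # input gate: what to update
    c_hat = math.tanh(w["wc"] * z)    # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat  # new cell state C_t
    o_t = sigmoid(w["wo"] * z)        # output gate
    h_t = o_t * math.tanh(c_t)        # output passed to the next cell
    return h_t, c_t
```

The cell state `c_t` is updated only through the linear combination `f_t * c_prev + i_t * c_hat`, which is the "simple linear operation" that lets information flow through many cells largely unchanged.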
Word embedding is a set of feature learning techniques in NLP where words are
mapped to vectors of real numbers. It allows similar words to have similar representation,
so it builds a relationship between words and allows calculations among them (Mikolov, Sutskever, Chen, Corrado, and Dean, 2013). A typical example is that after representing words as vectors, the expression “king - man + woman” would ideally give the vector representation for the word “queen”. The benefit of using word embedding is that it captures more of the meaning of each word and often improves task performance, primarily in deep learning models.
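The “king - man + woman ≈ queen” arithmetic can be demonstrated with hand-picked toy vectors. The 2-dimensional embeddings below are an assumption for illustration only; real embeddings such as Word2Vec are learned from data and have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy embeddings chosen so the second axis encodes the gender direction.
emb = {
    "king":  [0.9, 0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "queen": [0.9, -0.8],
}

def analogy(a, b, c, embeddings):
    """Return the word whose vector is closest to a - b + c."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))
```

With these toy vectors, `analogy("king", "man", "woman", emb)` returns `"queen"`, mirroring the example in the text.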
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to score a machine-generated summary using one or more reference summaries created by humans. ROUGE-N is the evaluation of N-gram recall over all
the reference summaries. The recall is calculated by dividing the number of overlapping
words over the total number of words in the reference summary (Lin, 2004).
The BLEU metric, contrary to ROUGE, is based on N-grams precision. It refers to the
percentage of the words in the machine generated summary overlapping with the
reference summaries (Papineni et al., 2002). For instance, if the reference
summary is “There is a cat and a tall dog” and the generated summary is “There is a tall
dog”, the ROUGE-1 score will be 5/8 and the BLEU score will be 5/5. This is because
the number of overlapping words is 5, and the numbers of words in the system summary and the reference summary are 5 and 8 respectively. These two metrics are among the most commonly used for evaluating automatic summarization.
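The worked example above can be checked directly. Below is a minimal sketch of ROUGE-1 recall and BLEU-1 precision for the single-reference, unigram case (full ROUGE and BLEU also handle longer N-grams, multiple references, and a brevity penalty).

```python
def _overlap(reference_tokens, candidate_tokens):
    """Count candidate words that also appear in the reference,
    clipping repeated matches so each reference word is used once."""
    overlap = 0
    remaining = list(reference_tokens)
    for w in candidate_tokens:
        if w in remaining:
            overlap += 1
            remaining.remove(w)
    return overlap

def rouge_1_recall(reference, candidate):
    """Unigram recall: overlapping words / words in the reference."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    return _overlap(ref, cand) / len(ref)

def bleu_1_precision(reference, candidate):
    """Unigram precision: overlapping words / words in the candidate."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    return _overlap(ref, cand) / len(cand)

ref = "There is a cat and a tall dog"
cand = "There is a tall dog"
# rouge_1_recall(ref, cand)  -> 5/8 = 0.625
# bleu_1_precision(ref, cand) -> 5/5 = 1.0
```

All five candidate words appear in the reference, reproducing the 5/8 ROUGE-1 and 5/5 BLEU scores stated above.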
2.3.5 Libraries
2.3.5.1 Keras
Keras is a Python library initially released in 2015, which is commonly used for machine
learning. Keras contains many implemented activation functions, optimizers, layers, etc.
So, it enables building neural networks conveniently and fast. Keras was developed and
maintained by François Chollet, and it is compatible with Python 2.7-3.6 ([Link], n.d.).
2.3.5.2 NLTK
Natural Language Toolkit (NLTK) is a text processing library that is widely used in
tokenization, parsing, classification, etc. The NLTK team initially released it in 2001
([Link], 2018).
2.3.5.3 Scikit-learn
Scikit-learn is a free machine learning library for Python that provides simple and efficient tools for classification, regression, clustering, and data preprocessing (“Scikit-learn”, 2018).
2.3.5.4 Pandas
Pandas provides a flexible platform for handling data in a data frame. It contains many
open-source data analysis tools written in Python, such as the methods to check missing
data, merge data frames, and reshape data structure, etc. (“Pandas”, n.d.).
2.3.5.5 Gensim
Gensim is a Python library for topic modeling. It can process raw text data and discover the semantic structure of the input text by using efficient algorithms such as Word2Vec.
2.3.5.6 Flask
Flask, released in mid-2010 and developed by Armin Ronacher, is a robust web framework for Python. Flask provides libraries and tools for building primarily simple and small web applications.
2.3.5.7 Bootstrap
Bootstrap is an open-source JavaScript and CSS framework that can be used as a basis to
develop web applications. Bootstrap has a collection of CSS classes that can be directly
used to create effects and actions for web elements. Twitter’s team developed it in 2011
(“Introduction”, n.d.).
2.3.5.8 GloVe
GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations of words. It performs better than other models on word similarity, word analogy, and named entity recognition tasks.
2.3.5.9 LXML
The XML toolkit lxml is a Python API bound to the C libraries libxml2 and libxslt. lxml can parse XML files faster than the ElementTree API, and it derives the completeness of its XML features from the libxml2 and libxslt libraries ([Link], 2017).
In order to build the text summarization tool for Juniper Networks, we first researched
existing ways of doing text summarization. Text summarization is still at an early stage of research. Deep learning has produced state-of-the-art results for common NLP tasks such as Named Entity Recognition (NER), Part of Speech (POS) tagging and sentiment analysis (Socher, Bengio & Manning, 2013). In the case of text summarization, the two main approaches are extractive and abstractive (Hasan & Ng, 2010). TextRank was first introduced by Mihalcea and Tarau in their paper TextRank: Bringing Order into Text
(2004). The paper proposed the idea of using a graph-based algorithm similar to Google’s PageRank to find the most important sentences. Juan Ramos proposed using TF-IDF to determine word relevance in documents (2003).
One such model that has been gaining popularity is the sequence to sequence model
(Nallapati, Zhou, Santos, Gulçehre, & Xiang, 2016). Sequence to sequence models were first applied to machine translation (Sutskever, Vinyals & Le, 2014). Recent studies on abstractive summarization have shown that sequence to sequence models using encoders and decoders beat other traditional ways
of summarizing text. The encoder part encodes the input document to a fixed-length
vector. Then the decoder part takes the fixed-length vector and decodes it to the target summary. We studied three works as references for our model of abstractive summarization (Rush, Chopra, & Weston, 2015; Nallapati, Zhou, Santos, Gulçehre, & Xiang, 2016; Lopyrev, 2015). All three works are reviewed below.
The model created by Rush et al., a group from Facebook AI Research, used a
convolutional network model for the encoder and a feedforward neural network model for the decoder (for details, please see Appendix A: Extended Technical Terms). In their
model, only the first sentence of each article content is used to generate the headline
(2015).
The model generated by Nallapati et al., a team from IBM Watson, used Long Short-
Term Memory (LSTM) in both encoder and decoder. They used the same news article
dataset as the one that the Facebook AI Research group used. In addition, the IBM
Watson group used the first two to five sentences of the articles’ content to generate
the headline (2016). Nallapati et al. were able to outperform Rush et al.’s models on particular datasets.
The article from Konstantin Lopyrev describes a model that uses four LSTM layers to achieve good performance (2015). Lopyrev also used a dataset of news articles, and the model predicts the headlines of the articles from the first paragraph of each article.
All three works show that the encoder-decoder model is a potential solution for text summarization, and that LSTMs can capture more long-range information from the original article content than traditional RNNs. In this project,
inspired by previous works we also used the encoder-decoder model with LSTM but
in a slightly different structure. We used three LSTM layers in the encoder and
another three LSTM layers in the decoder (details of the model are described in
Section 3.3). However, the datasets used in this project were not as clean as news
articles. Our datasets contain a lot of technical terms and code snippets as well as garbled characters. We expected that extractive summarization could help extract key sentences from the articles, which can be used as inputs to our abstractive deep learning models. This way, the input documents for the abstractive summarization would be neater than the original ones.
CHAPTER THREE
METHODOLOGY
3.1 Introduction
The goal of this project is to explore automatic text summarization and analyze its performance. The first step was collecting and cleaning the datasets.
We worked on five datasets — the Stack Overflow dataset (Stack Dataset), the news
articles dataset (News Dataset), the Juniper Knowledge Base dataset (KB Dataset), the
Juniper Technical Assistance Center Dataset (JTAC Dataset) and the JIRA Dataset. Each
dataset consists of many cases, where each case consists of an article and a summary or a
title. Since the raw News Dataset was already cleaned, we primarily focused on cleaning
the rest four datasets. Figure 7 below shows the changes in dataset sizes before and after
cleaning the data. As shown in the figure, after cleaning the datasets, we had two large
datasets (the Stack Dataset and the KB Dataset) with over 15,000 cases and two small datasets (the JTAC Dataset and the JIRA Dataset).
First, the Stack Dataset is drawn from Stack Overflow, dealing only with networking related issues. There are 39,320 cases in this data frame, which is the largest dataset we worked on. For each case, we filtered the dataset to only keep the unique question id, the question title, the question body, and the answer body. Then we cleaned the filtered dataset by removing chunks of code, non-English articles and short articles. Finally, we got 37,378 cases after cleaning. The reason we chose to work with the Stack Dataset is that it is similar in content to the KB Dataset.
Moreover, the Stack Dataset is supposedly cleaner than the KB Dataset, and by running our models on a cleaner dataset, we could first focus on designing our models.
Second, the News Dataset is a public dataset containing news articles from Indian
news websites. We used this dataset because the dataset includes a human-
generated summary for each article, which can be used to train our model. For
our purposes, we only used the article body and the summary of each article. This dataset was used just for extractive summarization, as its domain was not relevant to Juniper's technical articles.
Third, the KB Dataset, which is the one we put the most emphasis on, contains
technical questions and answers about networking issues. The raw dataset is in a
directory tree of 23,989 XML files, and each XML file contains the information
about one KB article. For our training and testing, we only kept a unique
document id, a title, a list of categories that the article belongs to, and a solution
body for each KB article in the data frame. We filtered out the top 30 categories, which contained 15,233 cases. Our goal was to use the KB articles’ solutions as inputs and their titles as target summaries.
Fourth, the JTAC Dataset contains information about JTAC cases. It has 8,241
cases, and each case has a unique id, a synopsis, and a description. The raw JTAC Dataset is in a JSON file.
At last, the JIRA Dataset is about JIRA bugs from various projects. JIRA is a
public project management tool developed by Atlassian for issue tracking. The
JIRA Dataset has 5,248 cases, and each case has a unique id, a summary, and a
description. Same as the JTAC Dataset, the raw JIRA Dataset is also in a JSON
file.
The five datasets we worked with were very noisy, containing snippets of code, invalid characters, and unreadable sentences. For efficient training, our models needed datasets with no missing values and no noisy words. Based on this guideline, we cleaned the data in the following steps.
1. Read the raw data into data frames.
The Stack Data are in CSV files and can be easily transferred to a data frame by using the pandas library. However, the KB Dataset is stored in a directory tree of XML files, so we used lxml to read each XML file from the root to each element and store the information in a data frame.
2. Remove articles with missing values.
Since fewer than 5% of the articles had missing values, the articles containing missing values were dropped.
3. Remove the chunks of code.
In the Stack Dataset and the KB Dataset, there are many chunks of code in the
question and answer bodies. The code snippets would cause problems during training, as they are not natural language. We identified the chunks of code by locating the code tags in the input strings and deleting everything between the “<code>” and “</code>” tags. Eventually, we found that this removed the vast majority of the code.
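The tag-based deletion can be sketched with a single regular expression. This is a minimal illustration of the approach; the non-greedy pattern and the whitespace replacement are assumptions about implementation details the text does not specify.

```python
import re

# Delete everything between <code> and </code>, including the tags.
# re.DOTALL lets '.' match newlines so multi-line snippets are removed;
# the non-greedy '*?' stops at the nearest closing tag.
CODE_RE = re.compile(r"<code>.*?</code>", re.DOTALL)

def strip_code(text):
    """Replace every code block with a single space."""
    return CODE_RE.sub(" ", text)
```

Without `re.DOTALL`, code snippets spanning several lines would survive the substitution, since `.` normally does not match a newline.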
4. Detect and remove the unknown words with “&” symbol in all texts.
In the KB Dataset, we found that 48.76% of the words could not be recognized
by Juniper’s word embedding. Some of the unrecognized words are proper nouns, but
some of them are garbled and meaningless words that start with “&” symbols such as
the word “&npma”. The proper nouns might catch the unique information in the
articles, so we did not remove them. However, we detected and removed all the garbled words beginning with the “&” symbol.
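A sketch of the ampersand-word removal is shown below. The pattern `&\w+` is an assumption: it targets tokens like “&npma” that begin with “&”, then collapses the leftover whitespace.

```python
import re

# Tokens such as "&npma" look like mis-decoded HTML entities; drop any
# word that starts with an ampersand, then normalize whitespace.
AMP_RE = re.compile(r"&\w+")

def remove_amp_words(text):
    without_amp = AMP_RE.sub(" ", text)
    return re.sub(r"\s+", " ", without_amp).strip()
```

Proper nouns and ordinary unrecognized words are untouched, matching the decision described above to keep them.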
5. Detect and remove the non-English articles.
In the KB Dataset, 164 articles are written in Spanish. Our project’s focus was only
on English words, and having words outside of English language would cause
problems while training our models. We identified the Spanish articles by looking
for some common Spanish words such as “de”, “la” and “los”. Any article containing these words was removed.
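The stop-word heuristic can be sketched as follows. The word list and the hit threshold are hypothetical choices for illustration; the text only states that common Spanish function words were used as the signal.

```python
# Hypothetical heuristic: flag an article as Spanish when enough of the
# common Spanish function words below appear in it.
SPANISH_WORDS = {"de", "la", "los", "el", "que", "en"}

def looks_spanish(text, threshold=2):
    """Return True when at least `threshold` Spanish stop words occur."""
    tokens = text.lower().split()
    hits = sum(1 for t in tokens if t in SPANISH_WORDS)
    return hits >= threshold
```

A threshold above one reduces false positives from English sentences that happen to contain a word like “la”.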
In the JTAC Dataset, nearly 19% of the articles have more than 20% of all characters as digits, and most of the digits are meaningless in context, such as “000x”. Digits are seldom useful in training and may affect the prediction, so we removed these digit-heavy articles.
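The 20% digit filter can be sketched directly. Counting only non-space characters is an assumption; the text does not say whether whitespace was included in the ratio.

```python
def digit_ratio(text):
    """Fraction of non-space characters that are digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isdigit() for c in chars) / len(chars)

def too_many_digits(text, limit=0.20):
    # Drop the article when more than 20% of its characters are digits.
    return digit_ratio(text) > limit
```

For the string "000x 111y", six of the eight non-space characters are digits, so the article would be dropped.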
8. Check duplicate articles and write the cleaned data into a CSV file.
We also checked whether there were duplicated data, and we found that all data are
unique. Finally, we wrote the cleaned data into a comma-separated values file.
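The duplicate check and CSV export can be sketched with the standard library. The column names are hypothetical placeholders; the actual columns differ per dataset as described above.

```python
import csv

def write_unique(rows, path):
    """Drop exact duplicate rows, then write the survivors to a CSV file.

    rows: list of (id, title, body) tuples. Returns the unique rows in
    their original order.
    """
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:        # exact-duplicate check
            seen.add(row)
            unique.append(row)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "title", "body"])  # hypothetical header
        writer.writerows(unique)
    return unique
```

Tracking seen rows in a set keeps the check O(1) per row, which matters for the larger datasets.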
We also organized a hierarchy of the existing KB categories. Each KB article was associated with a list of categories like “MX240”, “MX230”, “MX”, etc. In this case, the hierarchy of the
categories should reflect category “MX240” to be a child node of category “MX”. “MX”
is the name of a product series at Juniper, while “MX240” is the name of a product in the
MX series. The goal of categorizing KB data is to have a more precise structure of the
KB dataset, which could be used by Juniper Networks for future data-related projects as well.
1. Gather the unique categories.
We looped through the category lists in all cases and gathered all the unique categories into a sorted list.
2. Shrink the category names.
In order to efficiently categorize the data, we removed the digits and underscores at the beginning and the end of each category name. For example, “MX240_1” is shrunk to “MX”. This was helpful when we used the Longest Common Substring (LCS) method to categorize the data, because the longest common substring of the raw names “MX240_1” and “MX240_2” is “MX240_”, not the series name “MX”.
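The shrinking step can be sketched with Python's `str.strip`, which removes any of the listed characters from both ends of the string (but never from the middle).

```python
def shrink_category(name):
    """Strip digits and underscores from both ends of a category name,
    e.g. "MX240_1" -> "MX". Interior digits are kept, so only the
    leading/trailing product-number suffixes are removed."""
    return name.strip("0123456789_")
```

Note that `strip` here takes a set of characters, not a substring: every trailing `4`, `0`, `2`, and `_` is peeled off until a letter is reached.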
3. Find common substrings among neighbors.
Since similar category names are listed consecutively, we went through the entire set and found the LCS of at least two characters among the neighbors. If a specific string did not have a common substring with its previous string and its successive string, this string was treated as a category of its own.
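The neighbor comparison can be sketched with the standard library's `difflib.SequenceMatcher`; the helper names here are illustrative, not the project's own.

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    """Longest common substring of two strings."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def shares_neighbor_lcs(prev, current, nxt, min_len=2):
    # A category joins a group only when it shares a substring of at
    # least `min_len` characters with one of its sorted neighbors.
    return (len(longest_common_substring(prev, current)) >= min_len
            or len(longest_common_substring(current, nxt)) >= min_len)
```

For the sorted neighbors "MX240_1" and "MX240_2" the LCS is "MX240_", which easily clears the two-character minimum.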
4. Pick the target categories.
After we listed all the common substrings, we manually went through the list and picked
30 category names which were meaningful and contained many children nodes. For
example, we picked some names of main product series at Juniper such as “MX” and
“EX”, and we also chose some networking-related categories such as “SERVER” and
“VPN”.
5. Write all KB articles that belonged to those 30 categories into a CSV file.
The last step was to extract the KB articles that contained any category name that
belonged to the target 30 categories in the category list. We also generated a CSV file from these extractions.
We began the text summarization by exploring the extractive summarization. The goal
was to try the extractive approach first, and use the output from extraction as an input of
the abstractive summarization. After experimenting with the two approaches, we would
then pick the best approach for Juniper Network’s Knowledge Base (KB) dataset. The text extraction algorithms and controls were implemented in Python. The code contained three important components: the two algorithms used, the two control methods, and the evaluation metrics.
3.2.1 Algorithms
The algorithms used for text extraction were TextRank (Mihalcea and Tarau, 2004) and TF-IDF (Ramos, 2003). These two algorithms were run on three datasets — the News Dataset, the Stack Dataset and the KB Dataset. Each algorithm generates a list of sentences in the order of their importance. Out of that list, the top three sentences were selected to form the summary.
TextRank was implemented by creating a graph where each sentence formed a node, and the edges were weighted by the similarity score between the two node sentences. The similarity score was calculated using Google’s Word2Vec public model. The model provides a vector representation of each word, which can be used to compute the cosine similarity between sentences. Once the graph was formed, the PageRank algorithm was executed on the graph, and the top sentences were collected from the output.
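The graph-based extraction described above can be sketched in pure Python. As simplifying assumptions, word-overlap similarity stands in for the Word2Vec cosine similarity, and a plain power-iteration PageRank replaces a library implementation.

```python
def overlap_similarity(s1, s2):
    """Word-overlap similarity between two tokenized sentences
    (a stand-in for Word2Vec-based cosine similarity)."""
    w1, w2 = set(s1), set(s2)
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / (len(w1) + len(w2))

def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with power-iteration PageRank over the
    similarity graph; return indices ordered by importance."""
    n = len(sentences)
    sim = [[overlap_similarity(a, b) if i != j else 0.0
            for j, b in enumerate(sentences)]
           for i, a in enumerate(sentences)]
    # normalize outgoing edge weights (avoid division by zero)
    out = [sum(row) or 1.0 for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(scores[j] * sim[j][i] / out[j]
                                  for j in range(n))
                  for i in range(n)]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

Sentences that share words with many other sentences accumulate score through the iteration, while an isolated sentence keeps only the baseline `(1 - damping) / n`.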
Scikit-learn’s TF-IDF module was used to compute the score of each word with respect to its dataset. Each sentence was scored by calculating the sum of the scores of the words in the sentence. The idea behind this was that the most important sentence in the document is the sentence with the most uniqueness (the most unique words).
To verify the effectiveness of the two algorithms, two control experiments
were used. A control experiment is usually a naive procedure that helps in testing the
results of an experiment. The two control experiments used in text extraction were
forming a summary from the first three lines of the article and forming a summary from
three random lines in the article. By running the same metrics on these control
experiments as on the experiments with the algorithm being tested, a baseline can be created
for the algorithms. In an ideal case, the performance of the algorithms should always be
better than that of both controls.
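The two control baselines are simple to express; the function names below are illustrative:

```python
import random

def first_three_baseline(sentences):
    """Control 1: the summary is the first three sentences of the article."""
    return sentences[:3]

def random_three_baseline(sentences, seed=None):
    """Control 2: the summary is three sentences sampled at random."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(3, len(sentences)))
```

Running the same evaluation metrics on these baselines gives the floor that TextRank and TF-IDF must beat.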
The next step in our project was to work with abstractive summarization. This is one of
the most critical components of our project. We used deep learning neural network
models to create an application that could summarize a given text. The goal was to create
and train a model which can take sentences as the input and produce a summary as the
output.
The model would have pre-trained weights and a vocabulary of words that it would be
able to output. Figure 8 shows the basic data flow we wanted our model to
achieve. The first step was to convert each word to its index form. In the figure, the
sentences “I have a dog. My dog is tall.” are converted to their index form using a
word-to-index mapping.
The index form was then passed through an embedding layer, which in turn converted
the indexes to vectors. We used pre-trained word embedding matrices to achieve this.
The output from the embedding layer was then sent as input to the model. The
model would then compute a matrix of one-hot vectors for the summary. A
one-hot vector is a vector whose dimension equals the size of the model’s vocabulary,
where each index represents the probability that the output is the word at that index
of the vocabulary. For example, if index 2 in the one-hot vector is 0.7, the
probability that the result is the word at vocabulary index 2 is 0.7. This matrix
would then be converted to words by using the index-to-word mapping that was
created from the word-to-index map. In the figure, the final one-hot encoding when
converted to words forms the expected summary “I have a tall dog.”. Sections
4.3.1, 4.3.2 and 3.3.1 describe the architecture of the above model in detail. In
summary, the process involves converting the input text into numerical
indices, translating these indices into dense vectors via an embedding layer, and then
processing these vectors through the Transformer, whose self-attention
mechanism focuses on key elements in the text. This enables the model to generate
a concise summary, which in this case is "I have a tall dog." The Transformer excels at
text summarization by capturing the relationships and context within the text.
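The word-to-index encoding and one-hot decoding steps described above can be sketched as follows. The sentence is the toy example from the figure, and the code is illustrative rather than the project's actual implementation:

```python
import numpy as np

tokens = "I have a dog . My dog is tall .".split()

# Build the word-to-index map and its inverse (index-to-word).
word_to_index = {}
for w in tokens:
    word_to_index.setdefault(w, len(word_to_index))
index_to_word = {i: w for w, i in word_to_index.items()}

# Encode the text as indices -- the form fed to the embedding layer.
indices = [word_to_index[w] for w in tokens]

# A one-hot matrix: each row is a probability distribution over the
# vocabulary, and the argmax of a row recovers the predicted word.
vocab_size = len(word_to_index)
one_hot = np.zeros((len(indices), vocab_size))
for row, idx in enumerate(indices):
    one_hot[row, idx] = 1.0

decoded = [index_to_word[int(i)] for i in one_hot.argmax(axis=1)]
```

Here `decoded` reproduces the original tokens, illustrating how the model's one-hot output matrix is mapped back to words.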
Before training the model on the dataset, certain features of the dataset were extracted.
These features were used later when feeding the input data to the model and
interpreting its output. We first collected all the unique words from the input (the article)
and the expected output (the title) of the documents and created two Python dictionaries
of word-to-index mappings, one for the input and one for the output. We then converted each word in the
vocabulary to its vector form using pre-trained word embeddings. For public datasets like
Stack Overflow, we used the publicly available pre-trained GloVe model containing
word embeddings of one hundred dimensions each. For Juniper’s datasets, we used the
embedding matrix created and trained by Juniper Networks on their internal datasets. The
Juniper Network’s embedding matrix had vectors of one hundred and fifty dimensions
each. The word-to-index dictionaries, the embedding matrix, and
descriptive information about the dataset (such as the number of input words and the
maximum length of the input string) were stored in a Python dictionary for later use in
the model.
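A sketch of this dataset-preparation step is shown below. The function name, the special `<pad>`/`<unk>` tokens, and the zero-vector fallback for out-of-vocabulary words are assumptions made for illustration; in the project, `embeddings` would be loaded from the GloVe file or Juniper's internal matrix.

```python
import numpy as np

def prepare_dataset(articles, titles, embeddings, dim=100):
    """Build word-index dictionaries and an embedding matrix for the model.

    `embeddings` maps word -> vector (e.g. parsed from a GloVe file);
    words without a pre-trained vector keep a zero row.
    """
    def build_vocab(texts):
        vocab = {"<pad>": 0, "<unk>": 1}
        for text in texts:
            for w in text.lower().split():
                vocab.setdefault(w, len(vocab))
        return vocab

    in_vocab = build_vocab(articles)     # input (article) vocabulary
    out_vocab = build_vocab(titles)      # output (title) vocabulary
    matrix = np.zeros((len(in_vocab), dim))
    for w, i in in_vocab.items():
        if w in embeddings:
            matrix[i] = embeddings[w]
    return {
        "input_vocab": in_vocab,
        "output_vocab": out_vocab,
        "embedding_matrix": matrix,
        "max_input_len": max(len(t.split()) for t in articles),
    }
```

The returned dictionary corresponds to the saved metadata described above: the vocabularies, the embedding matrix, and descriptive statistics such as the maximum input length.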
The embedding layer was a precursor layer to our model. It used the
embedding matrix, saved in the previous step of preparing the dataset, to transform each
word to its vector form. The layer takes as input each sentence represented as
word indexes and outputs a vector for each word in the sentence. The dimension of this
vector depends on the embedding matrix used by the layer. Representing each word
in a vector space is important because it gives each word a mathematical context and
provides a way to calculate similarity among words. By representing the words in the
vector space, our model can run mathematical functions on them and train itself.
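Conceptually, the embedding layer is just a row lookup into the embedding matrix. A toy NumPy illustration (the vectors and vocabulary here are invented):

```python
import numpy as np

# A toy embedding matrix: row i holds the vector for vocabulary index i.
embedding_matrix = np.array([
    [0.0, 0.0, 0.0],   # index 0: <pad>
    [0.1, 0.3, 0.5],   # index 1: "dog"
    [0.7, 0.2, 0.9],   # index 2: "tall"
])

def embedding_layer(index_sequence):
    """Map a sequence of word indexes to their vectors (one lookup per word)."""
    return embedding_matrix[np.array(index_sequence)]

vectors = embedding_layer([1, 2, 1])
# shape: (sequence length, embedding dimension)
```

Each input sentence of length n thus becomes an n-by-d matrix of vectors, which is what the model actually trains on.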
3.3.3 Model
Text summarization is the process of condensing long documents into a shorter version that captures the main points, essential information, and
overall meaning of the original content. With the advent of Transformer-based models,
this process has been significantly refined and enhanced. The Transformer architecture,
introduced by Vaswani et al. in 2017, has revolutionized how machines process and
generate human language, making it a powerful tool for text summarization. The first
step in the summarization process is pre-processing the input text. This step involves
cleaning and structuring the text data, including tasks such as removing unnecessary
characters, tokenizing the text (breaking it down into individual words or subwords), and
sometimes stemming or lemmatizing words to their root forms. The text is also
frequently converted to lower case, and it may be padded or truncated to fit the fixed
input size required by the Transformer model. The pre-processed text is then run through
the encoder of the Transformer model. The input text must be converted by the encoder
into a number of contextualized word embeddings that represent the connections and
dependencies among the text's words. Thanks to the self-attention mechanism, the
model is able to take into account the significance of each word in relation to the full
text. The encoder records both local and global dependencies while preserving
the context. The self-attention mechanism computes attention scores for each word pair
in the text, allowing the model to focus on the most relevant parts of the text when
encoding the information. These attention scores are used to weigh the contribution of
each word in the final representation, ensuring that the model accurately captures the
nuances of the input text. After encoding the input text, the summarization process moves
to the decoding phase. The decoder is tasked with generating the summary based on the
encoded information. When abstractive summarization is used, the decoder creates new
sentences that more succinctly express the original text's meaning. The attention scores
calculated during the encoding step serve as a guide for this generation, making sure the
summary stays true to the most crucial information included in the original text. The
decoder runs in an autoregressive way, generating the summary one word at a time. It
predicts the next word at each stage by taking into account the encoded representation of
the input text and the words that have already been generated. This process continues
until the full summary is formed. Extractive summarization, by contrast, chooses
particular sentences or phrases directly from the input text. Sentences can be ranked
according to how relevant they are to the summary using transformer-based models
such as BERT. The top-ranked sentences are then used to construct a condensed version
of the original text. Post-processing the
generated summary is the last stage in the summarization process. At this stage, the
summary is refined to remove grammatical errors and
repetitions. Detokenizing the content, fixing any formatting errors, and, in certain
situations, manually examining and revising the summary for quality control are
also part of this stage.
During training, large collections of document and summary
pairings are used. This allows the model to pick up on common patterns and
structures found in well-written summaries. The model learns to produce summaries that
are concise, coherent, and faithful to the source.
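The self-attention computation at the heart of the encoder described above can be illustrated with the standard scaled dot-product formulation. This is a generic NumPy sketch, not the project's implementation, and the sizes are arbitrary:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # pairwise scores
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 "words", model dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, weights = scaled_dot_product_attention(Q, K, V)
# Each row of `weights` is a probability distribution over the 4 words,
# i.e. how much each word attends to every other word.
```

These per-pair weights are exactly the "attention scores" the text refers to: they decide how much each word's representation contributes to every other word's contextualized embedding.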
After running and testing the model on different datasets, the model’s parameters were
tuned: specifically, the number of hidden units was increased, the learning rate was
increased, the number of epochs was changed and a dropout parameter (percentage of
input words that will be dropped to avoid overfitting) was added at each LSTM layer in
the encoder. Different values were tested for each of the parameters while keeping in
mind the limited resources available for testing. The models were rerun on the
datasets, and the results were compared with the previous run. The summaries were also
evaluated manually and compared with the ones produced earlier. The best models
were then selected.
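The tuning procedure described above amounts to a small grid search over the named parameters. The sketch below is illustrative: the search-space values and the `evaluate` placeholder are invented, and in the real project `evaluate` would train the model and score its summaries on held-out data.

```python
import itertools

# Hypothetical search space mirroring the parameters described above.
search_space = {
    "hidden_units": [128, 256],
    "learning_rate": [1e-3, 5e-3],
    "epochs": [10, 20],
    "dropout": [0.2, 0.4],
}

def evaluate(params):
    """Placeholder: train with `params` and return a validation score."""
    # Stand-in scoring so the sketch runs; a real version would train the
    # model and compute ROUGE on a validation set.
    return -abs(params["dropout"] - 0.2) - abs(params["learning_rate"] - 1e-3)

best_params, best_score = None, float("-inf")
for values in itertools.product(*search_space.values()):
    params = dict(zip(search_space.keys(), values))
    score = evaluate(params)
    if score > best_score:
        best_params, best_score = params, score
```

Keeping the grid small, as above, reflects the limited compute budget mentioned in the text: each extra value multiplies the number of training runs.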
Once the model was completed and tested, an end-to-end web application was built (for
details, please see Section 5.5). The application was built with Python’s Streamlit library,
primarily because the models were implemented in Python. A Bootstrap front-end UI was
used to showcase the results. The UI consisted of a textbox for entering text, a
dropdown for choosing the desired model for each of the three datasets and an output
box for showing the results. The UI included a Summarize button which would send the
chosen options by making a POST AJAX request to the backend. The backend server
would run the text through the pre-trained model and send the result back to the front-end.
The front-end displayed the result sent by the backend. This application was the final product
of this project; it was hosted on a web server and can be viewed in any modern web
browser.
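The backend's role can be sketched as a single request handler that parses the POSTed options, dispatches to the chosen model, and returns JSON. Everything below is a placeholder: the registry keys and the toy "models" (which simply truncate the text) stand in for the project's pre-trained summarizers.

```python
import json

# Hypothetical registry of pre-trained summarizers, one per dataset.
MODELS = {
    "news": lambda text: text.split(".")[0].strip() + ".",
    "stack": lambda text: " ".join(text.split()[:10]),
    "kb": lambda text: text.splitlines()[0] if text else "",
}

def handle_summarize_request(body):
    """Parse the POSTed JSON body, run the chosen model, and return the
    JSON response that would be sent back to the front-end."""
    payload = json.loads(body)
    model = MODELS.get(payload.get("model", ""))
    if model is None:
        return json.dumps({"error": "unknown model"})
    return json.dumps({"summary": model(payload.get("text", ""))})
```

The front-end's AJAX call corresponds to invoking this handler with a body like `{"model": "news", "text": "..."}` and rendering the returned `summary` field.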
CHAPTER FOUR
4.1 Introduction
This chapter presents the results of text summarization and discusses their implications.
It summarizes the goal of the project, which is to explore transformer-based text
summarization. Furthermore, we evaluate the system's performance, discuss its
limitations, and propose potential improvements. The outcomes are analysed based on
the accuracy of the generated summaries.
This section also describes the experimental setup, including the five datasets used (Stack
Overflow, News, KB, JTAC, and JIRA), the data cleaning and preprocessing steps, and
the evaluation procedure. A screenshot of the web application's user interface is
provided, showing the input textbox, the dropdown menu for model selection, and the
output box for displaying the summary.
4.4 Results
The primary function of the software is to summarize any text supplied as input.
The interface allows users to perform a complex task like text summarization
without needing to understand the underlying code, providing them with a
simple and effective tool for processing textual data. The results are presented in the
table below:
The software provides a user-friendly interface where users can input text and receive
a generated summary. The feedback from users during testing indicated that the
interface is intuitive and easy to navigate. Users appreciated the immediate feedback
on their summarized content.
4.5 Discussion
Transformer-based text summarization has made significant progress in
recent years, driven by the transformer model's ability to handle long-range
dependencies and capture context. Studies have shown that transformer-based
summarization models can achieve ROUGE scores (a standard family of metrics for
comparing generated summaries against references) that surpass those of
traditional methods.
BLEU is typically used for evaluating machine translation but can also be applied to summarization.
METEOR can also be adapted for text summarization. It aims to address some of the
limitations of traditional metrics like BLEU by incorporating aspects that better reflect
human judgment, and it can be used to evaluate the quality of generated summaries.
4. Human Evaluation: Human judges evaluate coherence, relevance, and conciseness,
focusing on how well the summary captures the essential content of the source.
These metrics assess the model's ability to capture essential information, preserve
meaning, and maintain fluency. Additionally, human evaluation is often used to assess
qualities that automatic metrics cannot capture.
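As an illustration of how such metrics work, ROUGE-1 recall (unigram overlap with the reference summary) can be computed in a few lines; the example sentences are invented:

```python
from collections import Counter

def rouge_1(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams found in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # clipped unigram matches
    return overlap / max(sum(ref.values()), 1)

score = rouge_1("the cat sat on the mat", "the cat is on the mat")
```

Real evaluations typically also report ROUGE-2 (bigrams) and ROUGE-L (longest common subsequence), but the recall-over-n-grams idea is the same.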
4.5.3 Limitations
Despite their strengths, transformer-based summarization models still face several
limitations:
3. Bias and fairness: Models can inherit biases from the training data, leading to
biased or skewed summaries.
4. Handling out-of-domain data: Models may struggle with data that differs
significantly from the training data, and such models demand substantial computational
resources.
Possible improvements include refining the models for better summarization quality and
addressing bias and fairness concerns through data curation and model
regularization techniques.
CHAPTER FIVE
5.0 Summary
By leveraging the transformer model's ability to understand context and handle long-range
dependencies, these algorithms can generate concise and coherent summaries of large
volumes of text. Transformer-based summarization models have become a crucial tool
in various applications, including news aggregation and document analysis.
5.1 Conclusion
Transformer-based text summarization has made significant strides in recent years. The
transformer model's unique architecture and self-attention mechanism have enabled
substantial gains in summary quality. While there are still limitations and challenges to
be addressed, the future of transformer-based summarization is promising, with ongoing
research pointing toward even more advanced and efficient summarization models that
can handle complex tasks and multimodal data. Ultimately, transformer-based text
summarization has the potential to transform the way we process and consume
information, making it an exciting and rapidly evolving field.
REFERENCE
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly
Learning to Align and Translate. arXiv preprint arXiv:1409.0473.
Brownlee, J. (2017b, August 9). How to Use Metrics for Deep Learning with Keras in
Python. Retrieved from [Link]
Brownlee, J. (2017c, October 11). What Are Word Embeddings for Text? Retrieved
from [Link]
Olah, C. (2015, August 27). Understanding LSTM Networks. Retrieved March 2, 2018, from [Link]
Dalal, V., & Malik, L. G. (2013, December). A Survey of Extractive and Abstractive
Text Summarization Techniques. Retrieved from [Link]
Keras: The Python Deep Learning library. (n.d.). Retrieved February 27, 2018, from
[Link]
Ketkar, N. (2017). Introduction to Keras. In Deep Learning with Python (pp. 97-111).
[Link]
Lopyrev, K. (2015). Generating News Headlines with Recurrent Neural Networks. arXiv preprint.
LXML - Processing XML and HTML with Python. (2017, November 4). Retrieved from [Link]
Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing Order into Text. In Proceedings
of EMNLP 2004.
Nallapati, R., Zhou, B., Gulcehre, C., & Xiang, B. (2016). Abstractive Text Summarization using Sequence-to-Sequence RNNs and Beyond. In Proceedings of CoNLL 2016.
Natural Language Toolkit. (2017, September 24). Retrieved February 23, from
[Link]
Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL 2002.
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word
Representation. In Proceedings of EMNLP 2014.
Python Data Analysis Library. (n.d.). Retrieved March 02, 2018, from
[Link]
Rahm, E., & Do, H. H. (2000). Data Cleaning: Problems and Current Approaches. IEEE Data Engineering Bulletin, 23(4).
Rehurek, R. (2009). Gensim: Topic Modelling for Humans. Retrieved March 02, 2018,
from [Link]/gensim/[Link]
Rush, A. M., Chopra, S., & Weston, J. (2015). A Neural Attention Model for Abstractive
Sentence Summarization. In Proceedings of EMNLP 2015.
Scikit-Learn: Machine Learning in Python. (n.d.). Retrieved February 23, 2018, from
[Link]
Schalkoff, R. J. (1997, June). Artificial Neural Networks (Vol. 1). New York: McGraw-Hill.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems.
Socher, R., Bengio, Y., & Manning, C. (2013). Deep Learning for NLP. Tutorial.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems.
Appendix
In this appendix, we briefly explain some technical terms that are mentioned in this
report but are not necessarily related to the core concept of our project.
Named Entity Recognition (NER) is a way of finding and classifying named entities, such as people, organizations, and locations, in text.
Part-of-Speech (POS) Tagging is a way of tagging each word in a text with its corresponding grammatical category, such as noun or verb.
Convolutional Neural Network (CNN) is a kind of deep neural network that applies convolutional filters to its input and feeds the results forward through successive layers.
Feedforward Neural Network Model is a kind of neural network in which data flow in one direction, from the input layer through the hidden layers to the output layer.
5. Attention Mechanism
The attention mechanism is a way of helping the decoder focus on the important parts of the source text when generating each output word.