
KWARA STATE UNIVERSITY, MALETE

The Green University for Community Development and Entrepreneurship

Faculty of Information and Communication Technology

TEXT SUMMARIZATION USING TRANSFORMER BASED

MODEL

BY

ESTHER IBUKUNOLUWA AWODOYIN


21D/47CS/01565
ABDULMUIZ ABDULGAFAR
20/47CS/01318
OLONADE AZEEZ LEKAN
21D/47CS/01543

AUGUST 2024
TEXT SUMMARIZATION USING TRANSFORMER BASED

MODEL

BY

ESTHER IBUKUNOLUWA AWODOYIN


21D/47CS/01565
ABDULMUIZ ABDULGAFAR
20/47CS/01318
OLONADE AZEEZ LEKAN
21D/47CS/01543

A Project Research Submitted to the Department of Computer Science,


Faculty of Information and Communication Technology, Kwara State
University, Malete, In Partial Fulfillment of the Requirements for the
Award of Bachelor of Science (B.Sc.) Degree in Computer Science

AUGUST, 2024

DECLARATION

We hereby declare that this project titled “TEXT SUMMARIZATION USING

TRANSFORMER MODEL” is our own work and has not been submitted by any other

person for any degree or qualification at any higher institution. We also declare that the

information provided herein is ours, and sources that are not ours are properly

acknowledged.

ESTHER IBUKUNOLUWA AWODOYIN _______________________

Name of Student Signature and Date

ABDULMUIZ ABDULGAFAR _______________________

Name of Student Signature and Date

AZEEZ OLONADE LEKAN _____________________

Name of Student Signature and Date

CERTIFICATION

This is to certify that this project titled “Text summarization using transformer-based

model” was carried out by Esther Ibukunoluwa Awodoyin, Abdulmuiz Abdulgafar,

Azeez Olonade Lekan. The project has been read and approved as meeting the

requirements for the award of Bachelor of Science (B.Sc.) Degree in Computer Science,

Department of Computer Science, Faculty of Information and Communication

Technology, Kwara State University, Malete.

___________________ ________________
Dr. (Mrs.) J.F. AJAO Date
Supervisor

____________________ ______________
Dr. (Mrs.) R.S. Babatunde Date
Head of Department

________________________ _____________
External Examiner Date

DEDICATION

This study is dedicated to the Almighty God, who has been our source of Strength, Grace

and Wisdom throughout our study-period. May His name be forever praised!

ACKNOWLEDGEMENTS

We want to thank Almighty God for His guidance and understanding. We would like to

express our deepest gratitude to everyone who supported and contributed to the

successful completion of this project.

First and foremost, we are profoundly grateful to Dr. (Mrs.) J.F. Ajao for her invaluable

guidance, encouragement, and insightful feedback throughout the duration of this project.

Our sincere appreciation is extended to our instructors, especially the esteemed Head of

Department, Dr. (Mrs.) R. S. Babatunde, for her unwavering attention and support during

our time at KWASU. We are also grateful to Dr. R. M. Isiaka, Dr. A. N. Babatunde, Dr.

S. O. Abdulsalam, Dr. S. R. Yusuff, and Dr. A. F. Kadri; we have the utmost gratitude for

the wisdom bestowed on us. Their expertise and patience were instrumental in shaping the

direction and quality of our work. We also want to extend our heartfelt thanks to Kwara

State University for providing the necessary resources and facilities that made this project

possible. The support of group members is greatly appreciated. A special thanks to our

family and friends for their unwavering support and encouragement. Their understanding

and motivation have been a constant source of strength during the more challenging

moments of this journey. Finally, we would like to acknowledge all our peers and

colleagues who provided helpful discussions, collaboration, and camaraderie that

enhanced our experience during this project. Your feedback and ideas were invaluable.

Thank you all for being a part of this journey with us.

TABLE OF CONTENTS

FRONT PAGE......................................................................................................................i

TITLE PAGE.......................................................................................................................ii

DECLARATION................................................................................................................iii

CERTIFICATION..............................................................................................................iv

DEDICATION.....................................................................................................................v

ACKNOWLEDGEMENTS................................................................................................vi

TABLE OF CONTENTS...................................................................................................vii

LIST OF FIGURES...........................................................................................................xii

LIST OF TABLES...........................................................................................................xiii

ABSTRACT.....................................................................................................................xiv

CHAPTER ONE................................................................................................................1

INTRODUCTION.............................................................................................................1

1.1 Background of the study............................................................................................1

1.2 Statement of problem.................................................................................................2

1.3 Aim and Objectives....................................................................................................2

1.4 Significance of the study............................................................................................3

1.5 Scope of the study......................................................................................................3

1.6 Project Layout............................................................................................................3

CHAPTER TWO...............................................................................................................5

LITERATURE REVIEW.................................................................................................5

2.0 Introduction................................................................................................................5

2.0.1 Natural Language Processing..................................................................................5

2.0.2 Text Extraction........................................................................................................6

2.0.3 Transformer.............................................................................................................6

2.3 Text Abstraction.........................................................................................................8

2.3.1 Artificial Neural Network.......................................................................................8

2.3.2 RNN and LSTM......................................................................................................9

2.3.3 Word Embedding..................................................................................................13

2.3.4 ROUGE-N and BLEU Metrics.............................................................................14

2.3.5 Libraries................................................................................................................15

2.3.5.1 Keras..................................................................................................................15

2.3.5.2 NLTK.................................................................................................................15

2.3.5.3 Scikit-learn.........................................................................................................15

2.3.5.4 Pandas................................................................................................................16

2.3.5.5 Gensim...............................................................................................................16

2.3.5.6 Flask...................................................................................................................16

2.3.5.7 Bootstrap............................................................................................................16

2.3.5.8 GloVe.................................................................................................................16

2.3.5.9 LXML................................................................................................................17

2.4 Related work............................................................................................................17

CHAPTER THREE.........................................................................................................21

METHODOLOGY..........................................................................................................21

3.1 Introduction..............................................................................................................21

3.1.1 Datasets Information.............................................................................................21

3.1.2 Data Cleaning........................................................................................................24

3.1.3 Data Categorization...............................................................................................26

3.2 Extractive Summarization........................................................................................28

3.2.1 Algorithms............................................................................................................28

3.2.2 Control Experiment...............................................................................................29

3.3 Abstractive Summarization......................................................................................29

3.3.1 Preparing the dataset.............................................................................................32

3.3.2 Embedding layer...................................................................................................32

3.3.3 Model....................................................................................................................34

3.4 Tune the model.........................................................................................................36

3.5 Build an end-to-end application...............................................................................37

CHAPTER FOUR...........................................................................................................38

RESULT AND DISCUSSION........................................................................................38

4.1 Introduction..............................................................................................................38

4.2 Experimental Setup..................................................................................................38

4.3 User Interface Screenshot........................................................................................38

4.4 Result.......................................................................................................................39

4.4.1 Text summarization...............................................................................................39

4.4.2 User Interface and Experiences.............................................................................44

4.4.3 Summarization of Text Using QuillBot..........................................................45

4.5 Discussion................................................................................................................48

4.5.1 Accuracy of Text Summarization.........................................................................48

4.5.2 Performance Evaluation........................................................................................48

4.5.3 Limitations............................................................................................................49

4.5.4 Future improvements............................................................................50

CHAPTER FIVE.............................................................................................................51

SUMMARY AND CONCLUSION................................................................................51

5.0 Summary..................................................................................................................51

5.1 Conclusion...............................................................................................................51

REFERENCE....................................................................................................................52

Appendix............................................................................................................................58

LIST OF FIGURES

Figure 2.1: Natural Language Processing (NLP)................................................................6

Figure 2.2 Simple Transformer model.................................................................................7

Figure 2.3: Simple neural network (Cburnett, 2016)...........................................................8

Figure 2.4 An unrolled recurrent neural network (Christopher, 2015)................................9

Figure 2.5 The linear operations in an LSTM (Christopher, 2015)...................................10

Figure 2.6 The forget gate layer (Christopher, 2015)........................................................11

Figure 2.7 The input gate layer (Christopher, 2015).........................................................12

Figure 2.8 The output gate layer (Christopher, 2015).......................................................13

Figure 2.9: Word Embedding............................................................................................14

Figure 3.1: Dataset Information.........................................................................................22

Figure 3.2: Abstractive summarization basic flow diagram..............................................31

Figure 3.3: Embedding layer.............................................................................................33

Figure 3.4: Architecture of training model........................................................................36

Figure 4.1: User Interface Screenshot...............................................................................39

Figure 4.2: Text of Context (Hausa) Screenshot...............................................................43

Figure 4.3: Text of Context (Yoruba) Screenshot.............................................................43

Figure 4.4: Text of Context (English) Screenshot.............................................................44

LIST OF TABLES

Table 4.1: Text of Context (in English).............................................................................40

Table 4.2: Text of Context (Hausa)...................................................................................41

Table 4.3: Text of Context (Yoruba).................................................................................42

Table 4.4: Text of Context (in English).............................................................................45

Table 4.5: Text of Context (Hausa)...................................................................................46

Table 4.6: Text of Context (Yoruba).................................................................................47

ABSTRACT

The project on Text Summarization Using Transformer Models represents a significant


advancement in the field of natural language processing (NLP), aiming to enhance the
accuracy and coherence of automated text summarization. In today's information-driven
world, the ability to generate concise and meaningful summaries is crucial for efficient
information consumption. This project focuses on developing a state-of-the-art
summarization tool utilizing transformer-based models, which have revolutionized NLP
tasks with their attention mechanisms and superior contextual understanding. The
primary motivation behind this project is to address the limitations of existing
summarization methods, particularly for complex and diverse text genres. The project
leverages a Transformer model, specifically employing the BERT (Bidirectional Encoder
Representations from Transformers) or GPT (Generative Pre-trained Transformer)
architecture, implemented using the PyTorch framework, to improve the quality of
generated summaries.
Data for this project was carefully sourced from a wide range of high-quality textual
corpora to ensure the model is trained on diverse content types. The data was then pre-
processed, including tokenization, normalization, and filtering, to prepare it for model
training. The Transformer architecture was selected for its ability to capture long-range
dependencies and its effectiveness in generating coherent and contextually accurate
summaries. The expected outcome of this project is a powerful text summarization system
that produces concise, fluent, and contextually appropriate summaries, making
information more accessible and easier to digest. The system's performance will be
rigorously evaluated using both quantitative metrics and human evaluations to ensure it
meets high standards of accuracy, coherence, and relevance. Future work may explore
fine-tuning the Transformer model on domain-specific texts and experimenting with
alternative model architectures to further enhance summarization quality. This project
contributes to the ongoing development of advanced NLP tools, with a particular focus
on improving the efficiency and effectiveness of text summarization.


CHAPTER ONE

INTRODUCTION

1.1 Background of the study

Text summarization is the process of generating a short, fluent, and, most importantly,

accurate summary of a longer text document (Brownlee, 2017a). The

main idea behind automatic text summarization is to be able to find a short subset of

the most essential information from the entire set and present it in a human-readable

format. As online textual data grows, automatic text summarization methods have the

potential to be very helpful because more useful information can be read in a short

time.

Juniper Networks is a networking company that manufactures and supports

enterprise-grade routing, switching, and security products as well as service

agreements ([Link], 2018). In order to satisfy the customer base, Juniper tries to

resolve issues quickly and efficiently. Juniper Networks maintains a Knowledge Base

(KB) which is a dataset composed of questions from customers with human written

solutions. The KB contains over twenty thousand articles. The company is currently

developing a chatbot to provide 24x7 fast assistance on customer questions. The

chatbot can search queries asked by the users in the KB and fetch links to the related

articles. Juniper Networks is looking for ways to be able to automatically summarize

these articles so that chatbot can present the summaries to the customers. The


customers can then decide if they would like to read the entire article. The

summarization tool could be further used internally for summarizing tickets and

issues created by Juniper’s employees.

1.2 Statement of problem

In recent years, researchers have worked on improving and enhancing the generation of

informative and concise summaries. The common problems associated with today's text

summarizers include the following:

Understanding Context, Distinguishing Key Points, Handling Various Text Structures,

Maintaining Coherence and Readability, Managing Bias, Evaluating Summaries,

Abstractive Summarization Challenges, Resource Constraints.

This project aims to offer a rich framework for enhancing text summarizers by

leveraging transformer models' capabilities in semantic understanding, abstractive generation, context

modeling, multimodal integration, efficient attention mechanisms, fine-tuning, and

adaptive summarization. Continued research and development in these areas are likely to

drive significant advancements in the field of text summarization, leading to more

accurate, coherent, and versatile summarization systems.

1.3 Aim and Objectives

The goals of this project are to research methods for text

summarization, create an end-to-end prototype tool for summarizing documents and


identify if Juniper Networks’ datasets can be summarized effectively and efficiently. In

order to achieve these goals, we developed the following objectives:

i. Filter and clean datasets to be used for summarization.

ii. Implement a transformer-based algorithm for text summarization.

iii. Evaluate the models and tune them if necessary.

iv. Build and host an end-to-end tool which takes texts as input and outputs a

summary.

1.4 Significance of the study

The significance of text summarization, especially with transformer-based models, lies in

its ability to efficiently condense vast amounts of information into concise, contextually

relevant summaries. These models excel at highlighting essential content, producing

summaries that are more accurate and cohesive than traditional methods. Their versatility

allows them to handle diverse text genres and optimize for specific domains, making

them suitable for real-time applications. By advancing natural language understanding,

they improve the accuracy of summaries and automate content management, ultimately

enhancing information processing across various fields.

1.5 Scope of the study

The scope of this study is to apply transformer models to improve information

processing by delivering fast, precise, and adaptable summarization solutions that can be

customized for diverse fields and applications.


1.6 Project Layout

Chapter one

1. Background study

2. Statement of problem

3. Aim and objectives

4. Significance and scope of study

Chapter two

1. Literature review

Chapter three

1. Data collection

2. Model selection

3. Training/Tuning

4. Evaluation metrics

Chapter four

1. Implementation

2. Result and Evaluation

3. Recommendation

Chapter five

1. Summary

2. Conclusion


CHAPTER TWO

LITERATURE REVIEW

2.0 Introduction

This chapter explores the technologies which were used in this project. It first discusses the

key concepts for text summarization (Sections 2.0.1 - 2.3.3), followed by the metrics used to

evaluate them (Section 2.3.4), the libraries used to complete this project (Section 2.3.5), and

the related work (Section 2.4).

2.0.1 Natural Language Processing

Natural Language Processing (NLP) is a field in Computer Science that focuses on the

study of the interaction between human languages and computers (Chowdhury, 2003).

Text summarization is in this field because computers are required to understand what

humans have written and produce human-readable outputs. NLP can also be seen as a

branch of Artificial Intelligence (AI). Therefore, many existing AI algorithms and

methods, including neural network models, are also used for solving NLP related

problems. With the existing research, researchers generally rely on two types of

approaches for text summarization: extractive summarization and abstractive

summarization (Dalal and Malik, 2013).


Figure 2.1: Natural Language Processing (NLP)

2.0.2 Text Extraction

Extractive summarization means extracting keywords or key sentences from the original

document without changing the sentences. Then, these extracted sentences can be used to

form a summary of the document.

2.0.3 Transformer

Transformer is an algorithm inspired by Google and based on the multi-head attention

mechanism that helps identify key sentences from a passage (Vaswani and Shazeer,

2004). The idea behind this algorithm is that the sentence that is similar to most other

sentences in the passage is probably the most important sentence in the passage. Using

this idea, one can create a graph of sentences connected with all the similar sentences and


run Google’s algorithm on it to find the most important sentences. These sentences would

then be used to create the summary.

Term Frequency-Inverse Document Frequency (TF-IDF) is used to determine the relevance of a

word in the document (Ramos and Juan, 2003). The underlying algorithm calculates the

frequency of the word in the document (term frequency) and multiplies it by the

logarithm of the total number of documents in the dataset divided by the number of

documents containing that word (inverse document frequency). Using the relevance

of each word, one can compute the relevance of each sentence. Assuming that most

relevant sentences are the most important sentences, these sentences can then be used to

form a summary of the document.
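As a minimal, illustrative sketch of the scoring scheme just described (not the code used in this project), the following Python function scores each sentence by summing the TF-IDF weights of its words; the function and variable names are our own and the tokenization is deliberately simplistic.

import math
import re

def tfidf_sentence_scores(documents, target_doc):
    # Score each sentence of target_doc by summing the TF-IDF of its words.
    tokenized_docs = [re.findall(r"\w+", d.lower()) for d in documents]
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents does each word appear?
    df = {}
    for tokens in tokenized_docs:
        for word in set(tokens):
            df[word] = df.get(word, 0) + 1

    # Term frequency within the target document.
    target_tokens = re.findall(r"\w+", target_doc.lower())
    tf = {w: target_tokens.count(w) for w in set(target_tokens)}

    scores = []
    for sentence in re.split(r"(?<=[.!?])\s+", target_doc):
        words = re.findall(r"\w+", sentence.lower())
        # TF-IDF of a word = term frequency * log(total docs / docs containing the word).
        score = sum(tf.get(w, 0) * math.log(n_docs / df.get(w, 1)) for w in words)
        scores.append((score, sentence))

    # The highest-scoring sentences would form the extractive summary.
    return sorted(scores, reverse=True)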

Figure 2.2 Simple Transformer model


2.3 Text Abstraction

Compared to extractive summarization, abstractive summarization is closer to what

humans usually expect from text summarization. The process is to understand the

original document and rephrase the document to a shorter text while capturing the key

points (Dalal and Malik, 2013). Text abstraction is primarily done using the concept of

artificial neural networks. This section introduces the key concepts needed to understand

the models developed for text abstraction.

2.3.1 Artificial Neural Network

Artificial neural networks are computing systems inspired by biological neural networks.

Such systems learn tasks by considering examples and usually without any prior

knowledge. For example, in an email spam detector, each email in the dataset is

manually labeled as “spam” or “not spam”. By processing this dataset, the artificial

neural networks evolve their own set of relevant characteristics between the emails and

whether a new email is spam.

Figure 2.3: Simple neural network (Cburnett, 2016)


To expand more, artificial neural networks are composed of artificial neurons called

units usually arranged in a series of layers. Figure 2.3 shows the most common architecture

of a neural network model. It contains three types of layers: the input layer contains

units which receive inputs normally in the format of numbers; the output layer

contains units that “respond to the input information about how it has learned any

task”; the hidden layer contains units between input layer and output layer, and its job

is to transform the inputs to something that output layer can use (Schalkoff, 1997).

2.3.2 RNN and LSTM

Traditional neural networks do not recall any previous work when building the

understanding of the task from the given examples. However, for tasks like text

summarization, the sequence of words in input documents is critical. In this case, we

want the model to remember the previous words when it processes the next one. To

be able to achieve that, we have to use recurrent neural networks because they are

networks with loops in them where information can persist in the model

(Christopher, 2015).


Figure 2.4 An unrolled recurrent neural network (Christopher, 2015)

Figure 2.4 shows what a recurrent neural network (RNN) looks like when it is unrolled.

For the symbols in the figure, “ht” represents the output units value after each

timestamp (if the input is a list of strings, each timestamp can be the processing of

one word), “x” represents the input units for each timestamp, and A means a chunk of

the neural network. Figure 2.4 shows that the result from the previous timestamp is

passed to the next step for part of the calculation that happens in a chunk of the neural

network. Therefore, the information gets captured from the previous timestamp.

However, in practice, traditional RNNs often do not memorize information

efficiently with the increasing distance between the connected information. Since

each activation function is nonlinear, it is hard to trace back to hundreds or thousands

of operations to get the information.

Figure 2.5 The linear operations in an LSTM (Christopher, 2015)

Fortunately, Long Short-Term Memory (LSTM) networks can convey information in


the long term. Different from the traditional RNN, inside each LSTM cell, there are

several simple linear operations which allow data to be conveyed without doing the

complex computation. As shown in Figure 2.5, the previous cell state containing all the

information so far smoothly goes through an LSTM cell by doing some linear

operations. Inside, each LSTM cell makes decisions about what information to keep,

and when to allow reads, writes and erasures of information via three gates that

open and close.

Figure 2.6 The forget gate layer (Christopher, 2015)

As shown in Figure 2.6, the first gate is called the “forget gate layer”, which takes the

previous output units value ht-1 and the current input xt, and outputs a number between

0 and 1 to indicate the ratio of passing information. 0 means do not let any information

pass, while 1 means let all information pass.


Figure 2.7 The input gate layer (Christopher, 2015)

To decide what information needs to be updated, LSTM contains the “input gate

layer”. It also takes in the previous output units value ht-1 and the current input xt and

outputs a number to indicate inside which cells the information should be updated. Then,

the previous cell state Ct-1 is updated to the new state Ct. The last gate is “output gate

layer”, which decides what the output should be. Figure 2.8 shows that in the output

layer, the cell state is going through a tanh function, and then it is multiplied by the

weighted output of the sigmoid function. So, the output units value ht is passed to the

next LSTM cell (Christopher, 2015).


Figure 2.8 The output gate layer (Christopher, 2015)

Simple linear operators connect the three gate layers. The vast LSTM neural network

consists of many LSTM cells, and all information is passed through all the cells while

the critical information is kept to the end, no matter how many cells the network has.
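For reference, the three gates described above are usually written as the following standard LSTM equations (in the notation of Christopher, 2015), where \sigma is the sigmoid function and * denotes element-wise multiplication; this is the textbook formulation rather than anything specific to our model:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)          (forget gate)
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)          (input gate)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)   (candidate cell state)
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t               (new cell state)
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)          (output gate)
h_t = o_t * \tanh(C_t)                                (output units value)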

2.3.3 Word Embedding

Word embedding is a set of feature learning techniques in NLP where words are

mapped to vectors of real numbers. It allows similar words to have similar representation,

so it builds a relationship between words and allows calculations among them (Mikolov,

Sutskeve, Chen, Corrado, and Dean, 2013). A typical example is that after representing

words to vectors, the function “king - man + woman” would ideally give the vector

representation for the word “queen”. The benefit of using word embedding is that it

captures more meaning of the word and often improves the task performance, primarily

when working with natural language processing.
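A quick way to see this behaviour is with the publicly available gensim library and a pre-trained GloVe model; the model name below is an assumption chosen for illustration, not the embedding used in this project.

import gensim.downloader as api

# Load a small public pre-trained GloVe model (downloads on first use).
vectors = api.load("glove-wiki-gigaword-100")

# "king" - "man" + "woman" should land near the vector for "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))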


Figure 2.9: Word Embedding

2.3.4 ROUGE-N and BLEU Metrics

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is a set of

metrics that is used to score a machine-generated summary using one or more reference

summaries created by humans. ROUGE-N is the evaluation of N-grams recall over all

the reference summaries. The recall is calculated by dividing the number of overlapping

words over the total number of words in the reference summary (Lin, Chin-Yew, 2004).

The BLEU metric, contrary to ROUGE, is based on N-grams precision. It refers to the

percentage of the words in the machine generated summary overlapping with the

reference summaries (Papineni, Kishore, et al., 2002). For instance, if the reference

summary is “There is a cat and a tall dog” and the generated summary is “There is a tall


dog”, the ROUGE-1 score will be 5/8 and the BLEU score will be 5/5. This is because

the number of overlapping words is 5 and the numbers of words in the system summary and

reference summary are 5 and 8 respectively. These two metrics are the most commonly

used metrics when working with text summarization.
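The worked example above can be reproduced with a few lines of Python; this is a simplified, single-reference sketch that ignores BLEU's brevity penalty and clipping, so it only illustrates the recall/precision distinction.

def unigram_overlap_scores(reference, generated):
    # ROUGE-1 recall and BLEU-1 precision for a single reference summary.
    ref_tokens = reference.lower().split()
    gen_tokens = generated.lower().split()
    overlap = sum(1 for token in gen_tokens if token in ref_tokens)
    rouge_1_recall = overlap / len(ref_tokens)      # overlap / words in reference
    bleu_1_precision = overlap / len(gen_tokens)    # overlap / words in generated summary
    return rouge_1_recall, bleu_1_precision

# "There is a cat and a tall dog" vs "There is a tall dog" gives (0.625, 1.0), i.e. 5/8 and 5/5.
print(unigram_overlap_scores("There is a cat and a tall dog", "There is a tall dog"))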

2.3.5 Libraries

2.3.5.1 Keras

Keras is a Python library initially released in 2015, which is commonly used for machine

learning. Keras contains many implemented activation functions, optimizers, layers, etc.

So, it enables building neural networks conveniently and fast. Keras was developed and

maintained by François Chollet, and it is compatible with Python 2.7-3.6 ([Link], n.d.).

2.3.5.2 NLTK

Natural Language Toolkit (NLTK) is a text processing library that is widely used in

Natural Language Processing (NLP). It supports the high-performance functions of

tokenization, parsing, classification, etc. The NLTK team initially released it in 2001

([Link], 2018).

2.3.5.3 Scikit-learn

Scikit-learn is a machine learning library in Python. It performs easy-to-use

dimensionality reduction methods such as Principal Component Analysis (PCA),

clustering methods such as k-means, regression algorithms such as logistic

regression, and classification algorithms such as random forests ([Link],


2018).

2.3.5.4 Pandas

Pandas provides a flexible platform for handling data in a data frame. It contains many

open-source data analysis tools written in Python, such as the methods to check missing

data, merge data frames, and reshape data structure, etc. (“Pandas”, n.d.).

2.3.5.5 Gensim

Gensim is a Python library that implements topic modeling. It can process raw text

data and discover the semantic structure of input text data by using some efficient

algorithms, such as TF-IDF and Latent Dirichlet Allocation (Rehurek, 2009).

2.3.5.6 Flask

Flask, issued in mid-2010 and developed by Armin Ronacher, is a robust web framework

for Python. Flask provides libraries and tools to build primarily simple and small web

applications with one or two functions (Das., 2017).

2.3.5.7 Bootstrap

Bootstrap is an open-source JavaScript and CSS framework that can be used as a basis to

develop web applications. Bootstrap has a collection of CSS classes that can be directly

used to create effects and actions for web elements. Twitter’s team developed it in 2011

(“Introduction”, n.d.).


2.3.5.8 GloVe

Global Vectors for Word Representation (GloVe), which was initially developed by a

group of Stanford students in 2014, is a distributed word representations model that

performs better than other models on word similarity, word analogy, and named entity

recognition (Pennington, Richard and Christopher, 2014).

2.3.5.9 LXML

The XML toolkit lxml, which is a Python API, is bound to the C libraries libxml2 and

libxslt. lxml can parse XML files faster than the ElementTree API, and it also derives

the completeness of XML features from the libxml2 and libxslt libraries ([Link], 2017).

2.4 Related work

In order to build the text summarization tool for Juniper Networks, we first researched

existing ways of doing text summarization. Text summarization, still at its early

stage, is a field in Natural Language Processing (NLP). Deep learning, an area in

machine learning, has performed with state-of-the-art results for common NLP tasks

such as Named Entity Recognition (NER), Part of Speech (POS) tagging or sentiment

analysis (Socher, Bengio & Manning, 2013). In case of text summarization, the two

common approaches are extractive and abstractive summarization.

For extractive summarization, dominant techniques include TF-IDF and Transformer

(Hasan, Kazi Saidul, & Vincent, 2010). Transformer was first introduced by Vaswani

and Shazeer in their paper Transformer: Bringing order to text


(2004). The paper proposed the idea of using a graph-based algorithm similar to

Google’s Transformer to find the most important sentences. Juan Ramos proposed TF-IDF

(2003). He explored the idea of using a word’s uniqueness to perform keyword

extraction. This kind of extraction can also be applied to an entire sentence by

calculating the TF-IDF of each word in the sentence. We implemented both of these

algorithms, Transformer and TF-IDF, and compared their performances on different datasets.

Abstractive summarization is most commonly performed with deep learning models.

One such model that has been gaining popularity is sequence to sequence model

(Nallapati, Zhou, Santos, Gulçehre, & Xiang 2016). Sequence to sequence models

have been successful in speech recognition and machine translation (Sutskever,

Vinyals & Le 2014). Recent studies on abstractive summarization have shown that

sequence to sequence models using encoders and decoders beat other traditional ways

of summarizing text. The encoder part encodes the input document to a fixed-length

vector. Then the decoder part takes the fixed-length vector and decodes it to the

expected output (Bahdanau, Cho, & Bengio, 2014).

We focused on three recent pieces of research on text summarization as inspirations

for our model of abstractive summarization (Rush, Chopra, & Weston, 2015;

Nallapati, Zhou, Santos, Gulçehre, & Xiang 2016; Lopyrev, 2015). All three journals

have used encoder-decoder models to perform abstractive summarization on the

dataset of news articles to predict the headlines.

The model created by Rush et al., a group from Facebook AI Research, has used


convolutional network model for encoder, and a feedforward neural network model

for decoder (for details, please see Appendix A: Extended Technical Terms). In their

model, only the first sentence of each article content is used to generate the headline

(2015).

The model generated by Nallapati et al., a team from IBM Watson, used Long Short-

Term Memory (LSTM) in both encoder and decoder. They used the same news article

dataset as the one that the Facebook AI Research group used. In addition, the IBM

Watson group used the first two to five sentences of the articles’ content to generate

the headline (2016). Nallapati et al. were able to outperform Rush et al.’s models in

particular datasets.

The article from Konstantin Lopyrev talks about a model that uses four LSTM layers

and attention mechanism, a mechanism that helps improve encoder-decoder model’s

performance (2015). Lopyrev also used the dataset of news articles, and the model

predicts the headlines of the articles from the first paragraph of each article.

All three works show that the encoder-decoder model is a potential solution for text

summarization. Using LSTM layers in the encoder-decoder also allows capturing more

information from the original article content than traditional RNNs. In this project,

inspired by previous works we also used the encoder-decoder model with LSTM but

in a slightly different structure. We used three LSTM layers in the encoder and

another three LSTM layers in the decoder (details of the model are described in

Section 3.3). However, the datasets used in this project were not as clean as news


articles. Our datasets contain a lot of technical terms, coding languages as well as

unreadable characters. Therefore, we tried to combine the extractive summarization

and abstractive summarization to test if it provides better performance. We hoped that

extractive summarization could help extract key sentences from the articles, which

can be used as inputs to our abstractive deep learning models. This way, the input

documents for the abstractive summarization would be neater than the original ones.


CHAPTER THREE

METHODOLOGY

3.1 Introduction

The goal of this project is to explore automatic text summarization and analyze its

applications on Juniper’s datasets. To achieve the goal, we completed the following

steps:

1. Choose and clean datasets

2. Build the extractive summarization model

3. Build the abstractive summarization model

4. Test and compare models on different datasets

5. Tune the abstractive summarization model

6. Build an end-to-end application

3.1.1 Datasets Information

We worked on five datasets — the Stack Overflow dataset (Stack Dataset), the news

articles dataset (News Dataset), the Juniper Knowledge Base dataset (KB Dataset), the

Juniper Technical Assistance Center Dataset (JTAC Dataset) and the JIRA Dataset. Each

dataset consists of many cases, where each case consists of an article and a summary or a

title. Since the raw News Dataset was already cleaned, we primarily focused on cleaning

the remaining four datasets. Figure 3.1 below shows the changes in dataset sizes before and after

cleaning the data. As shown in the figure, after cleaning the datasets, we had two large


datasets (the Stack Dataset and the KB Dataset) with over 15,000 cases and two small

datasets with nearly 5,000 cases to work with.

Figure 3.1: Dataset Information

The Stack Dataset is a collection of questions and answers on the StackOverflow

website ([Link], 2018). We used a filtered version of the Stack

Dataset dealing only with networking related issues. There are 39,320 cases in

this data frame, which is the largest dataset we worked on. For each case, we

filtered the dataset to only keep the unique question id, the question title, the

question body, and the answer body. Then we cleaned the filtered dataset by

removing chunks of code, non-English articles and short articles. Finally, we got

37,378 cases after cleaning. The reason we chose to work with the Stack Dataset

is that it contains technical questions similar to those in the KB Dataset.


However, the Stack Dataset is supposedly cleaner than the KB Dataset, and by

running our models in a cleaner dataset, we could first focus on designing our

model to set a benchmark.

Second, the News Dataset is a public dataset containing news articles from Indian

news websites. We used this dataset because the dataset includes a human-

generated summary for each article, which can be used to train our model. For

our purposes, we only used the article body and the summary of each article. This

dataset was used just for extractive summarization as the dataset was not relevant

to the Juniper KB Dataset.

Third, the KB Dataset, which is the one we put the most emphasis on, contains

technical questions and answers about networking issues. The raw dataset is in a

directory tree of 23,989 XML files, and each XML file contains the information

about one KB article. For our training and testing, we only kept a unique

document id, a title, a list of categories that the article belongs to, and a solution

body for each KB article in the data frame. We filtered the dataset down to the top 30 categories,

which contained 15,233 cases. Our goal was to use the KB articles’ solutions as

input and predict the KB articles’ titles.

Fourth, the JTAC Dataset contains information about JTAC cases. It has 8,241

cases, and each case has a unique id, a synopsis, and a description. The raw

dataset is in a noisy JSON file.

At last, the JIRA Dataset is about JIRA bugs from various projects. JIRA is a


public project management tool developed by Atlassian for issue tracking. The

JIRA Dataset has 5,248 cases, and each case has a unique id, a summary, and a

description. Same as the JTAC Dataset, the raw JIRA Dataset is also in a JSON

file.

3.1.2 Data Cleaning

The five datasets we worked with were very noisy, containing snippets of code, invalid

characters, and unreadable sentences. For efficient training, our models needed

datasets with no missing values and no noisy words. Based on this guideline, we followed

these basic steps to clean our datasets:

1. Read data file and make a data frame

The Stack Dataset is in CSV files and can be easily transferred to a data frame by using the

pandas library. However, the KB Dataset is stored in a directory tree of XML files, so

we used lxml to read each XML file from the root to each element and store the

information we need into a data frame ([Link], 2017).

2. Check for missing values.

Since fewer than 5% of the articles had missing values, the articles containing

missing values were dropped from the data frames.

3. Detect and remove the code part in all texts.


In the Stack Dataset and the KB Dataset, there are many chunks of code in the

question and answer bodies. The code snippets would cause problems as the

summarization models cannot capture the meaning of the code. Therefore, we

identified the chunks of code by locating the code tags in the input strings and

deleting everything between “<code>” and “</code>” tags (a sketch of this step is given

after this list). Eventually, we found that nearly 33% of KB articles contain some type of

code in them.

4. Detect and remove the unknown words with “&” symbol in all texts.

In the KB Dataset, we found that 48.76% of the words could not be recognized

by Juniper’s word embedding. Some of the unrecognized words are proper nouns, but

some of them are garbled and meaningless words that start with “&” symbols such as

the word “&npma”. The proper nouns might catch the unique information in the

articles, so we did not remove them. However, we detected and removed all the

unknown words starting with the “&” symbol.

5. Detect and remove the Spanish articles.

In the KB Dataset, 164 articles are written in Spanish. Our project’s focus was only

on English words, and having words outside the English language would cause

problems while training our models. We identified the Spanish articles by looking

for some common Spanish words such as “de”, “la” and “los”. Any article

containing a Spanish word was removed.


6. Detect and remove the articles less than 10 characters or 3 words.

In the KB Dataset, some solution articles only contained a link or a period, so we

removed any article less than 10 characters or 3 words.

7. Detect and remove the articles that have many digits.

In the JTAC dataset, in nearly 19% of the articles more than 20% of all

characters are digits, and most of the digits are meaningless in the context such as

“000x”. Digits are seldom used in training and may affect the prediction, so we

removed such articles.

8. Check duplicate articles and write the cleaned data into a CSV file.

We also checked whether there were duplicated data, and we found that all data are

unique. Finally, we wrote the cleaned data into a comma-separated values file.
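As an illustration of step 3 above, the code-stripping step can be done with a single regular expression; this is a simplified sketch rather than the exact cleaning script used for the project.

import re

def strip_code_blocks(html_text):
    # Delete everything between <code> and </code> tags, tags included.
    # re.DOTALL lets the pattern span multi-line code snippets.
    return re.sub(r"<code>.*?</code>", " ", html_text, flags=re.DOTALL | re.IGNORECASE)

# Example: the <code>...</code> span in the sentence below is replaced by a space.
sample = "Configure the interface: <code>set interfaces ge-0/0/0 unit 0</code> then commit."
print(strip_code_blocks(sample))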

3.1.3 Data Categorization

Data Categorization in our project involved categorizing KB articles by creating a

hierarchy of the existing KB categories. Each KB article was associated with a list of

categories like “MX240”, “MX230”, “MX”, etc. In this case, the hierarchy of the

categories should reflect category “MX240” to be a child node of category “MX”. “MX”

is the name of a product series at Juniper, while “MX240” is the name of a product in the

MX series. The goal of categorizing KB data is to have a more precise structure of the


KB dataset which could be used by Juniper Networks for future data related projects as

well. The detailed steps of categorizing KB articles are listed below:

1. Get the set of all categories.

We looped through the category lists in all cases and gathered all the unique categories in

a set arranged alphabetically.

2. Remove digits and underscores in each category name.

In order to efficiently categorize the data, we removed the digits and underscores at the

beginning and the end of each category name. For example, “MX240_1” is shrunk to

“MX”. This was helpful when we used the Longest Common Substrings to categorize the

data because the longest common substring among “MX240_1” and “MX240_2” is

“MX240_”, whereas ideally, we wanted the category to be just “MX”.

3. Find a list of Longest Common Substrings (LCS) among category names.

Since similar category names are listed consecutively, we went through the entire set and

found the LCS of at least two characters among neighboring names (a sketch of this step

is given at the end of this section). If a specific string had no common substring with

either its previous or its successive string, it was regarded as a parent node with no child

node.

4. Manually examine the LCS and pick 30 decent category names.

After we listed all the common substrings, we manually took a look at the list and picked

30 category names which were meaningful and contained many child nodes. For

example, we picked some names of main product series at Juniper such as “MX” and

“EX”, and we also chose some networking-related categories such as “SERVER” and


“VPN”.

5. Write all KB articles that belonged to those 30 categories into a CSV file, and build

the hierarchy map for KB categories.

The last step was to extract the KB articles that contained any category name that

belonged to the target 30 categories in the category list. We also generated the

hierarchy map for all KB categories, so it is easier for future category-based

extractions.
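The following is a minimal sketch of steps 2 and 3 above (the names and the difflib-based helper are ours, not the exact project code): it finds the longest common substring between alphabetically neighboring category names and strips trailing digits and underscores to obtain candidate parent categories.

from difflib import SequenceMatcher

def longest_common_substring(a, b):
    # Longest common substring of two strings, via difflib's longest matching block.
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[match.a: match.a + match.size]

def parent_categories(sorted_names, min_len=2):
    # Derive candidate parent categories from alphabetically sorted category names.
    parents = set()
    for prev, curr in zip(sorted_names, sorted_names[1:]):
        lcs = longest_common_substring(prev, curr).strip("_0123456789")
        if len(lcs) >= min_len:
            parents.add(lcs)
    return sorted(parents)

# e.g. parent_categories(["EX2200", "EX4300", "MX240", "MX480", "VPN"]) -> ["EX", "MX"]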

3.2 Extractive Summarization

We began the text summarization by exploring the extractive summarization. The goal

was to try the extractive approach first, and use the output from extraction as an input of

the abstractive summarization. After experimenting with the two approaches, we would

then pick the best approach for Juniper Network’s Knowledge Base (KB) dataset. The

text extraction algorithms and controls were implemented in Python. The code contained

three important components - the two algorithms used, the two control methods, and the

metrics used to compare all the results.

3.2.1 Algorithms

The algorithms used for text extraction were Transformer (Vaswani and Shazeer, 2004)

and TF-IDF (Ramos and Juan, 2003). These two algorithms were run on three datasets - the

News Dataset, the Stack Dataset and the KB Dataset. Each algorithm generates a list of

sentences in the order of their importance. Out of that list, the top three sentences were

joined together to form the summary.


Transformer was implemented by creating a graph where each sentence formed a

node, and the edges were weighted by the similarity score between the two connected sentences. The

similarity score was calculated using Google’s Word2Vec public model. The model

provides a vector representation of each word which can be used to compute the cosine

similarity between them. Once the graph was formed, Google’s Transformer algorithm

was executed on the graph, and then the top sentences were collected from the output.
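A minimal sketch of this graph-based ranking is shown below; it assumes one dense vector per sentence (for example an average of Word2Vec word vectors, which is our assumption here) and uses networkx's PageRank implementation as a stand-in for the graph ranking step, so it illustrates the idea rather than reproducing the project's exact code.

import itertools
import networkx as nx
import numpy as np

def top_sentences(sentences, sentence_vectors, k=3):
    # Build a graph whose nodes are sentences and whose edge weights are
    # cosine similarities, then rank the nodes with PageRank.
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i, j in itertools.combinations(range(len(sentences)), 2):
        a, b = np.asarray(sentence_vectors[i]), np.asarray(sentence_vectors[j])
        similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
        graph.add_edge(i, j, weight=max(similarity, 0.0))   # keep edge weights non-negative
    scores = nx.pagerank(graph, weight="weight")
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [sentences[i] for i in ranked[:k]]   # the top sentences form the summary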

Scikit-learn’s TF-IDF module was used to compute the score of each word with respect to its

dataset. Each sentence was scored by calculating the sum of the TF-IDF scores of the words

in the sentence. The idea behind this was that the most important sentence in the

document is the sentence with the most uniqueness (most unique words).

3.2.2 Control Experiment

To be able to verify the effectiveness of the two algorithms, two control experiments

were used. A control experiment is usually a naive procedure that helps in testing the

results of an experiment. The two control experiments used in text extraction were -

forming a summary by combining the first three lines and forming a summary by

combining three random lines in the article. By running the same metrics on these control

experiments as on the experiments with the algorithms being tested, a baseline can be created

for the algorithms. In an ideal case, the performance of the algorithms should always be

better than the performance of the control experiments.


3.3 Abstractive Summarization

The next step in our project was to work with abstractive summarization. This is one of

the most critical components of our project. We used deep learning neural network

models to create an application that could summarize a given text. The goal was to create

and train a model which can take sentences as the input and produce a summary as the

output.

The model would have pre-trained weights and a list of vocabulary that it would be

able to output. Figure 3.2 shows the basic flow of data we wanted our model to

achieve. The first step was to convert each word to its index form. In the figure, the

sentences “I have a dog. My dog is tall.” are converted to their index form by using a

word to index map.

The index form was then passed through an embedding layer which in turn converted

the indexes to vectors. We used pre-trained word embedding matrices to achieve this.

The output from the embedding layer was then sent as an input to the model. The

model would then compute and create a one-hot vectors matrix of the summary. A

one-hot vector is a vector of dimension equal to the size of the model’s vocabulary

where each index represents the probability of the output to be the word in that index

of the vocabulary. For example, if the index 2 in the one-hot vector is 0.7, the

probability of the result to be the word in the vocabulary index 2 is 0.7. This matrix

would then be converted to words by using the index to word mapping that was

created using the word to index map. In the figure, the final one-hot encoding when


converted to words forms the expected summary “I have a tall dog.”. Section

3.3.1, 3.3.2 and 3.3.3 expand on the architecture of the above model in detail. In

summary, the process described involves converting the input text into numerical

indices, translating these indices into dense vectors via an embedding layer, and then

using the transformer architecture (encoder and decoder with self-attention

mechanisms) to focus on key elements in the text. This enables the model to generate

a concise summary, which in this case is "I have a tall dog." The transformer excels at

text summarization by capturing the relationships and context within the text,

allowing it to condense information effectively.


Figure 3.2: Abstractive summarization basic flow diagram
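A toy numeric illustration of the flow in Figure 3.2 is given below; the probabilities are faked with random numbers purely to show the decoding step, since the real values come from the trained model.

import numpy as np

text = "i have a dog . my dog is tall ."
vocab = sorted(set(text.split()))
word_to_index = {word: i for i, word in enumerate(vocab)}
index_to_word = {i: word for word, i in word_to_index.items()}

# Step 1: words -> indices; this sequence is what the embedding layer receives.
encoded = [word_to_index[word] for word in text.split()]

# Step 3: the model emits one vocabulary-sized probability vector per output word;
# taking the argmax and mapping back through index_to_word yields the summary word.
fake_probabilities = np.random.dirichlet(np.ones(len(vocab)))
predicted_word = index_to_word[int(np.argmax(fake_probabilities))]
print(encoded, predicted_word)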


3.3.1 Preparing the dataset

Before training the model on the dataset, certain features of the dataset were extracted.

These features were used later when feeding the input data to the model and

comprehending the output information from the model.

We first collected all the unique words from the input (the article) and the expected

output (the title) of the documents and created two Python dictionaries of vocabulary

with index mappings for both the input and the output of the documents.

In addition, we created an embedding matrix by converting all the words in the

vocabulary to their vector forms using pre-trained word embeddings. For public datasets like

Stack Overflow, we used the publicly available pre-trained GloVe model containing

word embeddings of one hundred dimensions each. For Juniper’s datasets, we used the

embedding matrix created and trained by Juniper Networks on their internal datasets. The

Juniper Networks’ embedding matrix had vectors of one hundred and fifty dimensions

each. The dictionaries of the word with index mappings, the embedding matrix, and the

descriptive information about the dataset (such as the number of input words and the

maximum length of the input string) were stored in a Python dictionary for later use in

the model.
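As a sketch of this preparation step (the file path is a placeholder for the publicly distributed 100-dimensional GloVe file, and the helper is illustrative rather than the project's exact code):

import numpy as np

def build_embedding_matrix(word_to_index, glove_path="glove.6B.100d.txt", dim=100):
    # Load GloVe vectors: each line of the file is "word v1 v2 ... v100".
    embeddings = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

    # One row per vocabulary word; words missing from GloVe keep a zero vector.
    matrix = np.zeros((len(word_to_index), dim), dtype="float32")
    for word, index in word_to_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[index] = vector
    return matrix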

3.3.2 Embedding layer

The embedding layer was a precursor layer to our model. This layer utilizes the

embedding matrix, saved in the previous step of preparing the dataset, to transform each


word to its vector form. The layer takes as input each sentence represented in the form of

word indexes and outputs a vector for each word in the sentence. The dimension of this

vector is dependent on the embedding matrix used by the layer. Representing each word

in a vector space is important because it gives each word a mathematical context and

provides a way to calculate similarity among them. By representing the words in the

vector space, our model can run mathematical functions on them and train itself.

Figure 3.3: Embedding layer
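A sketch of such a layer using the Keras API is shown below; the vocabulary size is an arbitrary illustrative value and embedding_matrix stands for the matrix built in Section 3.3.1, so the snippet shows the shape of the idea rather than the project's exact code.

import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 5000, 100                     # illustrative sizes
embedding_matrix = np.zeros((vocab_size, embedding_dim))  # e.g. built as in Section 3.3.1

embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,    # keep the pre-trained word vectors fixed
    mask_zero=True,     # treat index 0 as padding
)

# A batch of word-index sequences (batch, sequence_length) maps to
# vectors of shape (batch, sequence_length, embedding_dim).
vectors = embedding_layer(tf.constant([[1, 42, 7, 0, 0]]))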


3.3.3 Model

Text summarization is the process of condensing a lengthy document, or a set of documents, into a shorter version that captures the main points, essential information, and overall meaning of the original content. With the advent of Transformer-based models, this process has been significantly refined and enhanced. The Transformer architecture, introduced by Vaswani et al. in 2017, has revolutionized how machines process and generate human language, making it a powerful tool for text summarization.

The first step in the summarization process is pre-processing the input text. This step involves cleaning and structuring the text data, including removing unnecessary characters, tokenizing the text (breaking it down into individual words or subwords), and sometimes stemming or lemmatizing words to their root forms. The text is also frequently converted to lower case and may be padded or truncated so that it fits the fixed input size required by the Transformer model.

The pre-processed text is then passed through the encoder of the Transformer model. The encoder converts the input text into a sequence of contextualized word embeddings that represent the connections and dependencies among the words. The self-attention mechanism allows the model to take into account the significance of each word in relation to the full text: it computes attention scores for every pair of words, so the model can focus on the most relevant parts of the text when encoding the information. Because the encoder attends to every word at once, it captures both local and global dependencies and builds a comprehensive understanding of the context. The attention scores weight the contribution of each word in the final representation, helping the model capture the nuances of the input text.

After encoding the input text, the summarization process moves to the decoding phase. The decoder generates the summary from the encoded information. In abstractive summarization, the decoder creates new sentences that express the meaning of the original text more succinctly. The attention scores computed during encoding guide this generation, keeping the summary faithful to the most important information in the source. The decoder runs autoregressively, generating the summary one word at a time: at each step it predicts the next word from the encoded representation of the input and the words generated so far, and it continues until a stopping condition is met, typically when a predetermined summary length is reached or an end-of-sequence token is generated. In extractive summarization, by contrast, the model selects particular sentences or phrases directly from the input text; Transformer-based models such as BERT can rank sentences by their relevance to the summary, and the top-ranked sentences are combined into a condensed version of the original text.

Post-processing the generated summary is the last stage of the summarization process. At this stage the output is polished to ensure it is grammatically sound, cohesive, and free of repetition. Post-processing operations include detokenizing the text, fixing formatting errors, and, in some cases, manually reviewing and revising the summary for quality control. Transformer-based models are frequently fine-tuned on specific summarization datasets to improve the quality of the summaries they produce. Fine-tuning uses a large corpus of text-summary pairs, which allows the model to learn the common patterns and structures of well-written summaries and to produce summaries that are both clear and informative.

Figure 3.4: Architecture of training model
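Purely as an illustration of the encode-decode flow described above, the sketch below uses the publicly available Hugging Face transformers library with a pre-trained checkpoint. This is an assumption for demonstration only; it is not the model trained in this project, and the checkpoint name and length limits are placeholders.

```python
# Illustrative transformer-based abstractive summarization with a pre-trained model.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "A scientist creates a time machine and travels back to a critical moment "
    "in his past. He encounters his younger self on the verge of making a "
    "devastating mistake..."
)

# max_length / min_length bound the generated summary length (in tokens).
result = summarizer(article, max_length=60, min_length=15, do_sample=False)
print(result[0]["summary_text"])
```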

3.4 Tune the model

After running and testing the model on different datasets, the model's parameters were tuned: the number of hidden units was increased, the learning rate was increased, the number of epochs was changed, and a dropout parameter (the percentage of input words dropped to avoid overfitting) was added at each LSTM layer in the encoder. Different values were tested for each of the parameters, keeping in mind the limited resources available for testing the model. The models were rerun on the datasets, and the results were compared with those of the previous run. The summaries were also evaluated manually and compared with the ones produced earlier. The best-performing models were selected and formed the backend of our system.
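The sketch below shows where these tuned parameters enter a Keras encoder of the kind described above. The concrete values, vocabulary size, and embedding dimension are placeholders, not the project's final settings.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Placeholder values for the tuned hyperparameters.
hidden_units = 256      # number of hidden units per LSTM layer
dropout_rate = 0.2      # fraction of inputs dropped at each LSTM layer to limit overfitting
learning_rate = 1e-3
num_epochs = 20

encoder_inputs = Input(shape=(None,))
x = Embedding(input_dim=5000, output_dim=100, mask_zero=True)(encoder_inputs)
x = LSTM(hidden_units, return_sequences=True, dropout=dropout_rate)(x)
encoder_outputs, state_h, state_c = LSTM(
    hidden_units, return_state=True, dropout=dropout_rate
)(x)
encoder = Model(encoder_inputs, [encoder_outputs, state_h, state_c])

# The full encoder-decoder model would then be compiled and trained roughly as:
#   model.compile(optimizer=Adam(learning_rate=learning_rate), loss="categorical_crossentropy")
#   model.fit(train_inputs, train_targets, epochs=num_epochs)
```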

3.5 Build an end-to-end application

Once the model was completed and tested, an end-to-end web application was built (for details, please see Section 5.5). The application was built with Python's Streamlit library, primarily because the models were implemented in Python. A Bootstrap front-end UI was used to showcase the results. The UI consisted of a textbox for entering text, a dropdown for choosing the desired model for each of the three datasets, and an output box for showing the results. The UI included a summarize button which would send the chosen options to the backend by making a POST AJAX request. The backend server would run the text through the pre-trained model and send the result back to the front-end, which then displayed it. This application was the final product of the project; it was hosted on a web server and can be viewed in any modern web browser.
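A minimal sketch of this kind of Streamlit front-end is given below. The summarize() helper, the model names in the dropdown, and the dummy summary are placeholders standing in for the call to the pre-trained backend model.

```python
import streamlit as st

def summarize(text: str, model_name: str) -> str:
    """Placeholder for running the chosen pre-trained summarization model on the backend."""
    return text[:100] + "..."  # dummy summary for illustration only

st.title("Text Summarization")

text = st.text_area("Enter the text to summarize")
model_name = st.selectbox("Choose a model", ["Stack Overflow", "News", "KB"])

if st.button("Summarize"):
    summary = summarize(text, model_name)
    st.text_area("Summary", value=summary)
```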


CHAPTER FOUR

RESULTS AND DISCUSSION

4.1 Introduction

This chapter presents the results of the text summarization system and discusses their implications. It restates the goal of the project, which is to explore automatic text summarization and analyse its applications on Juniper's datasets. Furthermore, we evaluate the system's performance, discuss its limitations, and propose potential improvements. The outcomes are analysed based on the quality of the generated summaries and the performance of the models on the evaluation metrics.

4.2 Experimental Setup

This section describes the experimental setup, including the five datasets used (Stack

Overflow, News, KB, JTAC, and JIRA), the data cleaning and preprocessing steps, and

the extractive and abstractive summarization models implemented.

4.3 User Interface Screenshot

This section provides a screenshot of the web application's user interface, showing the input textbox, the dropdown menu for model selection, and the output box for displaying the summary.

1. Initial Input Screen: Where users input text.


Figure 4.1: User Interface Screenshot

4.4 Result

4.4.1 Text summarization

The primary function of the software is to summarize any text provided as input. The interface allows users to perform complex tasks like text summarization and translation without needing to understand the underlying code, providing them with a simple and effective tool for processing textual data. The results are presented in the tables below:


Table 4.1: Text of Context (in English)

S/N: 1

Text of Context (in English):
A scientist creates a time machine and travels back to a critical moment in his past. He encounters his younger self on the verge of making a devastating mistake. Armed with the knowledge of how his life will unfold, he wrestles with the temptation to change his fate. Warning his younger self could save years of pain and regret. But altering the past might create unforeseen ripple effects that could destroy everything he knows. The scientist is faced with an impossible choice: to protect his past self or preserve the stability of his future. The decision he makes will alter the course of both their lives.

Summarized Version (0.2):
The scientist is faced with an impossible choice: to protect his past self or preserve the stability of his future. A scientist creates a time machine and travels back to a critical moment in his past.

Summarized Version (0.7):
The scientist is faced with an impossible choice: to protect his past self or preserve the stability of his future. A scientist creates a time machine and travels back to a critical moment in his past. He encounters his younger self on the verge of making a devastating mistake. Warning his younger self could save years of pain and regret.


Table 4.2: Text of Context (Hausa)

S/N: 1

Text of Context (Hausa):
Wani masanin kimiyya ya kirkiro wata na’ura da zata iya tafiya cikin lokaci. Ya koma baya don ya hadu da kansa a lokacin da zai yi babban kuskure. Yana da sanin abin da zai faru, kuma yana kokarin yanke shawara ko ya shiga cikin al’amarin. Idan ya gaya wa kansa gaskiya, zai iya kauce wa wahalhalu da yawa. Amma, canza tarihi na iya haifar da mummunan sakamako marasa tabbas. Yana cikin rudani tsakanin gyara rayuwarsa da kuma guje wa illolin da ba zai iya hango ba.

Summarized Version (0.2):
Yana da sanin abin da zai faru, kuma yana kokarin yanke shawara ko ya shiga cikin al’amarin.

Summarized Version (0.7):
Yana da sanin abin da zai faru, kuma yana kokarin yanke shawara ko ya shiga cikin al’amarin. Ya koma baya don ya hadu da kansa a lokacin da zai yi babban kuskure. da kuma guje wa illolin da ba zai iya hango ba. Idan ya gaya wa kansa gaskiya, zai iya kauce wa wahalhalu da yawa.


Table 4.3: Text of Context (Yoruba)

S/N: 1

Text of Context (Yoruba):
Onímọ̀ sáyẹ́nsì kan ṣẹ̀dá ẹ̀rọ tó le ran an lọ sí ayé atijọ́, ó sì padà lọ sí ìgbà tí ó ti ṣe aṣìṣe tó le pọnjú sí i. Nígbà tí ó dé ibẹ̀, ó pàdé ara rẹ̀ nígbà ìgbàlódé tó ń ṣe aṣìṣe tó le yí ìwàláàyè rẹ̀ padà. Ó mọ̀ ohun tí yóò ṣẹlẹ̀, ó sì ń kọ́ láti pinnu bí ó yóò ṣe yẹ kọ́. Tí ó bá sọ òtítọ́ fún ara rẹ̀, ó lè yí ìrìnàjò rẹ̀ padà pẹ̀lú ìròyìn tó dára. Ṣùgbọ́n yíyí ayé padà lè mú àwọn abajade tó kọjá àfojúsùn wá. Ó ní láti yan lẹ́yìn ìdùnnú ara rẹ̀ ní ìgbà náà àti ìṣòro tí ó le jẹ́ kágbára ní ọjọ́ iwájú rẹ̀.

Summarized Version (0.2):
le pọnjú sí i. Nígbà tí ó dé ibẹ̀, ó pàdé ara rẹ̀ nígbà ìgbàlódé tó ń ṣe aṣìṣe tó le yí ìwàláàyè rẹ̀ padà.

Summarized Version (0.7):
le pọnjú sí i. Nígbà tí ó dé ibẹ̀, ó pàdé ara rẹ̀ nígbà ìgbàlódé tó ń ṣe aṣìṣe tó le yí ìwàláàyè rẹ̀ padà. Onímọ̀ sáyẹ́nsì kan ṣẹ̀dá ẹ̀rọ tó le ran an lọ sí ayé atijọ́, ó sì padà lọ sí ìgbà tí ó ti ṣe aṣìṣe tó ṣe yẹ kọ́. Tí ó bá sọ òtítọ́ fún ara rẹ̀, ó lè yí ìrìnàjò rẹ̀ padà pẹ̀lú ìròyìn tó dára. Ó ní láti yan lẹ́yìn ìdùnnú ara rẹ̀ ní ìgbà náà àti ìṣòro tí ó le jẹ́ kágbára ní ọjọ́ iwájú rẹ̀. Ó mọ̀ ohun tí yóò ṣẹlẹ̀, ó sì ń kọ́ láti pinnu bí ó


Figure 4.2: Text of Context (Hausa) Screenshot

Figure 4.3: Text of Context (Yoruba) Screenshot


Figure 4.4: Text of Context (English) Screenshot

4.4.2 User Interface and Experiences

The software provides a user-friendly interface where users can input text and receive a summary. The interface includes:

1. An input field for the text to be summarized.

2. A summarize button to process the input data.

3. A display of the result, showing the summarized text.

Feedback from users during testing indicated that the interface is intuitive and easy to navigate. Users appreciated the immediate feedback on their summarized content.


4.4.3 Summarization of Text Using QuillBot

Table 4.4: Text of Context (in English)

S/N: 1

Text of Context (in English):
A scientist creates a time machine and travels back to a critical moment in his past. He encounters his younger self on the verge of making a devastating mistake. Armed with the knowledge of how his life will unfold, he wrestles with the temptation to change his fate. Warning his younger self could save years of pain and regret. But altering the past might create unforeseen ripple effects that could destroy everything he knows. The scientist is faced with an impossible choice: to protect his past self or preserve the stability of his future. The decision he makes will alter the course of both their lives.

Summarized Version (Short):
A scientist creates a time machine, travels back to a critical past moment, confronting his younger self's decision to protect his past or preserve his future.

Summarized Version (Long):
A scientist creates a time machine and travels back to a critical moment in his past, where he encounters his younger self on the brink of making a disastrous mistake. He grapples with the temptation to change his fate, fearing it could save years of pain and regret, but also create unforeseen consequences. The scientist must decide whether to protect his past self or preserve his future stability, a decision that will alter their lives.


Table 4.5: Text of Context (Hausa)

S/N: 1

Text of Context (Hausa):
Wani masanin kimiyya ya kirkiro wata na’ura da zata iya tafiya cikin lokaci. Ya koma baya don ya hadu da kansa a lokacin da zai yi babban kuskure. Yana da sanin abin da zai faru, kuma yana kokarin yanke shawara ko ya shiga cikin al’amarin. Idan ya gaya wa kansa gaskiya, zai iya kauce wa wahalhalu da yawa. Amma, canza tarihi na iya haifar da mummunan sakamako marasa tabbas. Yana cikin rudani tsakanin gyara rayuwarsa da kuma guje wa illolin da ba zai iya hango ba.

Summarized Version (Short):
Wani masanin kimiyya ya kirkiro wata na’ura da zata iya tafiya cikin lokaci. Yana da sanin abin da zai faru, kuma yana kokarin yanke shawara ko.

Summarized Version (Long):
The speaker is a man who is adamant about his love for a woman, who is a woman who is not a woman. He is a woman who is not a woman, but a woman who is a woman who is not a woman


Table 4.6: Text of Context (Yoruba)

S/N: 1

Text of Context (Yoruba):
Onímọ̀ sáyẹ́nsì kan ṣẹ̀dá ẹ̀rọ tó le ran an lọ sí ayé atijọ́, ó sì padà lọ sí ìgbà tí ó ti ṣe aṣìṣe tó le pọnjú sí i. Nígbà tí ó dé ibẹ̀, ó pàdé ara rẹ̀ nígbà ìgbàlódé tó ń ṣe aṣìṣe tó le yí ìwàláàyè rẹ̀ padà. Ó mọ̀ ohun tí yóò ṣẹlẹ̀, ó sì ń kọ́ láti pinnu bí ó yóò ṣe yẹ kọ́. Tí ó bá sọ òtítọ́ fún ara rẹ̀, ó lè yí ìrìnàjò rẹ̀ padà pẹ̀lú ìròyìn tó dára. Ṣùgbọ́n yíyí ayé padà lè mú àwọn abajade tó kọjá àfojúsùn wá. Ó ní láti yan lẹ́yìn ìdùnnú ara rẹ̀ ní ìgbà náà àti ìṣòro tí ó le jẹ́ kágbára ní ọjọ́ iwájú rẹ̀.

Summarized Version (Short):
Onímọ̀ sáyẹ́nsì kan ṣẹ̀dá ẹ̀rọ tó le ran an lọ sí ayé atijọ́, ó sì padà lọ sí ìgbà tí ó ti ṣe aṣìṣe tó le pọnjú sí i.

Summarized Version (Long):
The text describes a situation where a man is thrown into a river, and the man is thrown into a river. The man is then thrown into a river, and the river is flooded with water. The man is then thrown into a river, and the river is filled with water.


4.5 Discussion

4.5.1 Accuracy of Text Summarization

Text summarization using transformer algorithms has shown impressive accuracy in

recent years. The transformer model's ability to handle long-range dependencies and

understand context has led to significant improvements in summarization quality. Studies

have shown that transformer-based summarization models can achieve ROUGE scores (a

common evaluation metric for summarization) of up to 45-50%, outperforming

traditional methods.

4.5.2 Performance Evaluation

Performance evaluation of transformer-based text summarization models typically

involves metrics such as:

1. ROUGE (Recall-Oriented Understudy for Gisting Evaluation):

ROUGE is a metric used to assess the quality of summaries by comparing them to

reference summaries. It includes variations such as ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, which measure n-gram overlap, longest common subsequence (LCS), weighted LCS, and skip-bigram overlap, respectively.

2. BLEU (Bilingual Evaluation Understudy):

BLEU is typically used for evaluating machine translation but can also be applied to

summarization. It measures how many n-grams in the generated summary appear in

the reference summaries.

3. METEOR (Metric for Evaluation of Translation with Explicit ORdering):

METEOR is a metric developed to evaluate the quality of machine-generated translations, but it can also be adapted for text summarization. It addresses some of the limitations of traditional metrics such as BLEU by incorporating aspects that better reflect human judgment. METEOR can be used to evaluate the quality of generated summaries by comparing them with human-written reference summaries.

4. Human Evaluation:

Human evaluation involves human judges assessing the quality of summaries by

evaluating coherence, relevance, and conciseness, focusing on how well the summary

covers key points and conveys information without redundancy.

These metrics assess the model's ability to capture essential information, preserve

meaning, and maintain fluency. Additionally, human evaluation is often used to assess

the model's performance in terms of coherence, relevance, and overall quality.
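As a brief illustration of how such metrics are computed in practice, the sketch below uses the rouge-score package; the package choice is an assumption about tooling, and the reference and generated sentences are invented examples rather than outputs from this project.

```python
from rouge_score import rouge_scorer

reference = "The scientist must choose between protecting his past self and preserving his future."
generated = "The scientist is faced with an impossible choice: protect his past self or preserve his future."

# Compute ROUGE-1, ROUGE-2 and ROUGE-L between the reference and the generated summary.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)

for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f} recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```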

4.5.3 Limitations

Despite their success, transformer-based text summarization models have some

limitations:

1. Computational resources: Training transformer models requires significant

computational power and memory.

2. Data requirements: Large amounts of high-quality training data are needed to

achieve optimal performance.


3. Bias and fairness: Models can inherit biases from the training data, leading to

unfair or inaccurate summaries.

4. Handling out-of-domain data: Models may struggle with data that differs

significantly from the training data.

4.5.4 Future Improvements

Future improvements in transformer-based text summarization may focus on:

1. Developing more efficient training methods to reduce computational

resources.

2. Incorporating multimodal data (e.g., images, videos) to enhance

summarization quality.

3. Addressing bias and fairness concerns through data curation and model

regularization techniques.

4. Exploring transfer learning and few-shot learning to adapt models to new

domains and tasks.


CHAPTER FIVE

SUMMARY AND CONCLUSION

5.0 Summary

Text summarization using transformer algorithms is a revolutionary approach that has

significantly improved the accuracy and efficiency of automatic summarization. By

leveraging the transformer model's ability to understand context and handle long-range

dependencies, these algorithms can generate concise and coherent summaries of large

documents. With the ability to process vast amounts of data, transformer-based

summarization models have become a crucial tool in various applications, including news

aggregation, document analysis, and information retrieval.

5.1 Conclusion

In conclusion, text summarization using transformer algorithms has made tremendous

strides in recent years. The transformer model's unique architecture and self-attention

mechanism have enabled it to outperform traditional summarization methods. While

there are still limitations and challenges to be addressed, the future of transformer-based

summarization looks promising. As research continues to evolve, we can expect to see

even more advanced and efficient summarization models that can handle complex tasks

and multimodal data. Ultimately, transformer-based text summarization has the potential

to transform the way we process and consume information, making it an exciting and

rapidly advancing field of study.


REFERENCES

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly

Learning to Align and Translate. arXiv preprint arXiv:1409.0473. Retrieved

February 28, 2018.

Brill, E. (2000). Part-of-speech Tagging. Handbook of Natural Language Processing,

403-414. Retrieved March 01, 2018.

Brownlee, J. (2017a, November 29). A Gentle Introduction to Text Summarization.

Retrieved March 02, 2018, from [Link]

introduction-text-summarization/

Brownlee, J. (2017b, August 09). How to Use Metrics for Deep Learning with Keras in Python. Retrieved February 28, 2018, from [Link]python/

Brownlee, J. (2017c, October 11). What Are Word Embeddings for Text? Retrieved February 28, 2018, from [Link]/what-are-word-embeddings/

Chowdhury, G. (2003). Natural Language Processing. Annual Review of Information

Science and Technology, 37(1), 51-89. doi:10.1002/aris.1440370103. Retrieved

March 02, 2018


Olah, C. (2015, August 27). Understanding LSTM Networks. Retrieved March 02,

2018, from [Link]/posts/2015-08-Understanding-LSTMs/

Dalal, V., & Malik, L. G. (2013, December). A Survey of Extractive and Abstractive

Text Summarization Techniques. In Emerging Trends in Engineering and

Technology (ICETET), 2013 6th International Conference on (pp. 109-110).

IEEE. Retrieved March 01, 2018.

Das, K. (2017). Introduction to Flask. Retrieved February 27, 2018, from

[Link]/en/latest/[Link]

Hasan, K. S., & Ng, V. (2010, August). Conundrums in Unsupervised Keyphrase

Extraction: Making Sense of the State-of-the-art. In Proceedings of the 23rd

International Conference on Computational Linguistics: Posters (pp. 365-373).

Association for Computational Linguistics. Retrieved February 28, 2018.

[Link]. (2018). Retrieved March 02, 2018, from

[Link]

Juniper Networks. (2018). Retrieved March 02, 2018, from

[Link]

Keras: The Python Deep Learning library. (n.d.). Retrieved February 27, 2018, from

[Link]


Ketkar, N. (2017). Introduction to Keras. In Deep Learning with Python (pp. 97-111).

Apress, Berkeley, CA. Retrieved February 26, 2018, from

[Link]

Lin, C. Y. (2004). Rouge: A Package for Automatic Evaluation of Summaries. Text

Summarization Branches Out. Retrieved February 25, 2018.

Lopyrev, K. (2015). Generating News Headlines with Recurrent Neural Networks. arXiv

preprint arXiv:1512.01712. Retrieved February 28, 2018.

LXML - Processing XML and HTML with Python. (2017, November 4). Retrieved

February 25, 2018, from [Link]/[Link].

Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing Order into Text. In Proceedings

of the 2004 Conference on Empirical Methods in Natural Language Processing.

Retrieved February 27, 2018.

Mohit, B. (2014). Named Entity Recognition. In Natural Language Processing of Semitic

Languages (pp. 221-245). Springer, Berlin, Heidelberg. Retrieved February 27,

2018.

Nallapati, R., Zhou, B., Gulcehre, C., & Xiang, B. (2016). Abstractive Text

Summarization Using Sequence-to-sequence RNNs and beyond. arXiv preprint

arXiv:1602.06023. Retrieved February 23, 2018.


Natural Language Toolkit. (2017, September 24). Retrieved February 23, 2018, from

[Link]

Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). BLEU: a Method for

Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual

Meeting on Association for Computational Linguistics (pp. 311-318). Association

for Computational Linguistics. Retrieved March 01, 2018.

Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global Vectors for Word

Representation. In Proceedings of the 2014 Conference on Empirical Methods in

Natural Language Processing (EMNLP) (pp. 1532-1543). Retrieved March 01,

2018.

Python Data Analysis Library. (n.d.). Retrieved March 02, 2018, from

[Link]

Radhakrishnan, P. (2017, October 16). Attention Mechanism in Neural Network. Hacker Noon. Retrieved March 02, 2018, from [Link]mechanism-in-neural-network-30aaf5e39512.

Rahm, E., & Do, H. H. (2000). Data Cleaning: Problems and Current Approaches. IEEE

Data Eng. Bull., 23(4), 3-13. Retrieved March 01, 2018.


Ramos, J. (2003, December). Using TF-IDF to Determine Word Relevance in Document

Queries. In Proceedings of the First Instructional Conference on Machine

Learning (Vol. 242, pp.133-142). Retrieved March 01, 2018.

Rehurek, R. (2009). Gensim: Topic Modelling for Humans. Retrieved March 02, 2018,

from [Link]/gensim/[Link]

Rush, A. M., Chopra, S., & Weston, J. (2015). A Neural Attention Model for Abstractive

Sentence Summarization. arXiv preprint arXiv:1509.00685. Retrieved February

25, 2018.

Scikit-Learn: Machine Learning in Python. (n.d.). Retrieved February 23, 2018, from

[Link]

Schalkoff, R. J. (1997, June). Artificial Neural Networks (Vol. 1). New York: McGraw-

Hill. Retrieved March 02, 2018.

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with

Neural Networks. In Advances in Neural Information Processing Systems (pp.

3104-3112). Retrieved March 02, 2018.

Socher, R., Bengio, Y., & Manning, C. (2013). Deep Learning for NLP. Tutorial at

Association for Computational Linguistics (ACL), 2012, and North American

Chapter of the Association of Computational Linguistics (NAACL). Retrieved

March 01, 2018.


[Link]. (2018). Retrieved March 02, 2018,

from [Link]


Appendix

A: Extended Technical Terms

In this appendix, we briefly explain some technical terms that are mentioned in this

report but are not necessarily related to the core concept of our project.

1. Named Entity Recognition (NER)

Named Entity Recognition (NER) is a way of finding named entities (proper nouns) in text and classifying them into predefined categories (Mohit, 2014).

2. Part of Speech (POS) Tagging

Part of Speech (POS) Tagging is a way of labelling each word in a text with its corresponding part of speech (Brill, 2000).

3. Convolutional Network Model

A Convolutional Network Model, also known as a Convolutional Neural Network (CNN), is a kind of deep neural network in which data is fed forward through its stacked layers (Glorot & Bengio, 2010).

4. Feedforward Neural Network Model


A Feedforward Neural Network Model is a kind of neural network in which data flows forward through all the hidden layers (Glorot & Bengio, 2010).

5. Attention Mechanism

The Attention Mechanism is a way of helping the decoder focus on the important parts of the source text when generating outputs in an encoder-decoder model (Radhakrishnan, 2017).

6. Keras Categorical Accuracy

Keras Categorical Accuracy is a metric that can be used for classification problems in Keras (Brownlee, 2017b).

7. Keras Categorical Loss

Keras Categorical Loss is a loss function used for classification problems in Keras to measure the cost of inaccurate predictions (Ketkar, 2017).

