Text Summarization Using Transformer Model
Abstract—The increased availability of online feedback and review tools, and the enormous amount of information on these platforms, have made text summarization a vital research area in natural language processing. Instead of potential consumers going through thousands of reviews to get the information they need, summarization enables them to see a concise form of a chunk of reviews with the relevant information. News and scientific articles have been used in text summarization models. This study proposes a text summarization method based on the Text-to-Text Transfer Transformer (T5) model. We use the University of California, Irvine (UCI) drug reviews dataset. We manually created human summaries for the ten most useful reviews of a particular drug for 500 different drugs from the dataset. We fine-tune the T5 model to perform abstractive text summarization. The model's effectiveness was evaluated using the ROUGE metrics, and our model achieved average ROUGE1, ROUGE2, and ROUGEL scores of 45.62, 25.58, and 36.53, respectively. We also fine-tuned this model on a standard dataset (the BBC News dataset) previously used for text summarization and obtained average ROUGE1, ROUGE2, and ROUGEL scores of 69.05, 59.70, and 52.97, respectively.

Index Terms—Deep Learning, Natural Language Processing, Text Summarization, T5 Model

I. INTRODUCTION

Natural language processing is an essential aspect of artificial intelligence because of the possibilities inherent in the field. Things that were almost impossible just a few decades ago are now possible, and diverse discoveries are made faster. Computers can read and understand text, translate from and to different languages, respond to messages, determine the sentiment or emotion in a text, and even make summaries from text [1]. Text summarization deals with creating a concise form of a text from a larger text while preserving its intended meaning [2].

With the increase in the availability of online feedback and review tools and the enormous amount of information on these platforms, it is vital to have effective methods that automatically produce good and informative summaries. It has become common practice for individuals to check online for the opinions of others who have used a particular product or enjoyed a service before deciding to buy a product or pay for a service. These online review platforms help potential customers judge whether a product is worth purchasing and help manufacturers understand how to adjust to better satisfy their customers' needs. For a prospective customer, however, going through thousands of reviews would be time-consuming. Text summarization helps create a concise form of these reviews while preserving helpful information with enough detail to inform a decision.

Researchers in the past have studied two types of summarization methods. Extractive summarization generates summaries by taking important sentences out of the original text and putting them together to form a concise and meaningful summary. In contrast, abstractive summarization creates summaries without reusing phrases from the original text [3], [4].

Rule-based methods have been used for text summarization tasks, as seen in [5], [6]. Many works also use neural network architectures like the convolutional neural network (CNN), long short-term memory (LSTM), recurrent neural networks (RNN), and autoencoders [1], [6], [7]. Transformers have brought about a great improvement in natural language processing tasks and in the text summarization field [8]–[13]. Most of the above studies use the CNN/DailyMail, BBC news, and clinical reports datasets.

This paper proposes a text summarization method based on the transformer architecture that summarizes the ten most useful reviews for each of 500 drugs from a database of about 3,671 drugs. We use the University of California, Irvine (UCI) drug reviews dataset introduced by Gräßer et al. [14], [15]. We fine-tune the Text-to-Text Transfer Transformer (T5) model [16] for the text summarization task. We created human summaries for the 500 drugs to train and evaluate the model. We also trained and tested the model on the BBC news dataset [17].

II. BACKGROUND

Considering how important and helpful text summarization is, it has drawn the attention of many researchers over the years. The following subsections highlight some past literature in this field.

A. Rule-Based Methods, Supervised Learning Algorithms, and Neural Networks

In the past, traditional rule-based methods [6] that followed a set of rules to determine the importance of sentences to be included in the final summary were prevalent. The advent of deep learning architectures like the long short-term memory (LSTM) model and recurrent neural networks (RNN) [1] has also helped in performing text summarization tasks. Joshi et al. [7] compared several deep learning models for text summarization and discovered that CNN-based approaches performed better for extractive summarization, while RNN-based methods performed better for abstractive summarization.
N. Yadav and N. Chatterjee [6] involved sentiment analysis in their text summarization process to clearly distinguish between positive and negative emotion in a text [18]. In [6], they perform sentiment analysis on the Document Understanding Conferences (DUC) dataset. The scores from that process help determine which sentences become part of the final summary, according to the specified number of sentences. Their method achieved better recall values compared with a random-indexing-based summarizer and a latent-semantic-analysis-based summarizer.

Shirwandkar et al. [2] used feature extraction methods to determine whether to include a sentence in the final summary. These features include sentence position, sentence length, numerical tokens, term frequency-inverse document frequency, cosine similarity between sentence and centroid, bi-grams, tri-grams, and proper nouns. They implemented the restricted Boltzmann machine (RBM) with fuzzy logic to produce two different summaries of one document and combined the output from both according to a set of rules to form the final summary. Their method demonstrated a significant improvement over the use of RBM alone, with an average precision, recall, and F-measure of 0.88, 0.80, and 0.84, respectively.

Krishnan et al. [19] used feature extraction methods to determine sentence scores on the text in the dataset. They implemented supervised learning algorithms like naive Bayes, k-nearest neighbors (KNN), random forest, sequential minimal optimization (SMO), J48, and bagging on the BBC news summary dataset to perform extractive text summarization. KNN, bagging, and random forest performed better than SMO, naive Bayes, and J48. The average precision for ROUGE1, ROUGE2, and ROUGEL across all classifiers was 0.597, 0.470, and 0.583, respectively, while the average recall was 0.488, 0.368, and 0.477.

Sharaff et al. [5] proposed an extractive text summarization model using word scoring, sentence scoring, and fuzzy analysis. They compared results obtained using the bell membership function and the triangular membership function, achieving a precision, recall, and F-measure of 0.410, 0.768, and 0.535, respectively, with the triangular membership function. They used the BBC news summary dataset.

B. Transformers

Recently, the use of transformers has brought about a revolution in the natural language processing field. Various researchers have fine-tuned pre-trained transformer models to achieve state-of-the-art performance.

Yang Liu and Mirella Lapata [10] present abstractive and extractive text summarization on multiple datasets (CNN/DailyMail news highlights, New York Times (NYT), and XSum). They proposed a novel model based on the BERT architecture, and the BERT-based model outperformed the LEAD-3 baselines referenced in the article.

Khandelwal et al. [9] trained a unidirectional transformer-based model on a 2-billion-word corpus based on Wikipedia (WikiLM). They fine-tuned this model with an encoder-decoder architecture to perform text summarization on the CNN/Daily Mail dataset. They evaluated their models with and without pre-training; with pre-training, their approach achieved ROUGE1, ROUGE2, and ROUGEL scores of 39.65, 17.74, and 36.85, respectively.

Torres [11] used the pre-trained BERTSUM model for extractive summarization of the CNN/Daily Mail dataset. BERTSUM is an extension of the BERT model but differs in that it can learn sentence representations and embed pairs of sentences to learn adjacency patterns between them. The author used the LEAD-3 baseline to compare with the fine-tuned model's summaries and discovered that LEAD-3 summaries were twice the length of the reference summaries. The precision of the BERTSUM model was lower, but its recall was better.

Vinod et al. [12] also fine-tuned the BERTSUM model to perform text summarization on a dataset of clinical reports. In [12], the BERTSUM model, trained initially on a corpus of news articles, is further trained with specific strategies to improve performance on the medical dataset. They used @highlights in all target files; the @highlights specify those sentences of the report that are part of a good summary. A doctor consulted for human evaluation deemed 76.2% of the generated summaries effective.

Zolotareva et al. [13] built a Seq2Seq model with LSTM layers for the encoder and decoder networks, based on the concept of the text-to-text transformer model, to perform text summarization on a dataset of BBC news articles. To compare results, they also fine-tuned the Text-to-Text Transfer Transformer (T5) model on the same task. The fine-tuned T5 model performed better, with ROUGE1, ROUGE2, and ROUGEL F1 scores of 0.473, 0.265, and 0.361, respectively. Gupta et al. [8] fine-tuned various transformer architecture models, including the Text-to-Text Transfer Transformer (T5), Bidirectional and Auto-Regressive Transformers (BART), and Pre-training with Extracted Gap sentences for Abstractive Summarization Sequence-to-sequence (PEGASUS), on the BBC news summary dataset. The T5 model outperformed the other models with ROUGE1, ROUGE2, and ROUGEL scores of 0.47, 0.33, and 0.42, respectively.

III. METHODS

A. Dataset

We used the University of California, Irvine (UCI) drug reviews dataset introduced by Gräßer et al. [14], [15]. It was crawled from pharmaceutical websites like drugs.com and druglib.com. This dataset has been widely used for research studies in sentiment analysis [15], [20], [21]. It contains about 215,063 records with six attributes (drug name, condition, useful count, review text, date collected, and rating). We use the drug name, useful count, and review fields. The dataset covers about 3,671 drug names in total, and the useful count field depicts how useful a review is.
B. Data Preprocessing

We summarize the ten most useful reviews for the first 500 drug names (after sorting the drug names alphabetically) in the dataset. We extracted these ten reviews based on the useful counts for each drug name and combined the review text for summarization. For each drug name, we manually created human summaries for the combined reviews so that we could later evaluate the performance of our model. We performed the following preprocessing steps on the review and target summary text.

Conversion to lowercase letters: We converted all review text and target summary text to lowercase letters.

Removal of punctuation: We removed all punctuation and special characters, excluding periods. Periods indicate the end of a sentence and are therefore important in the summarization process.

Removal of stop words: Stop words are a set of commonly used words in a language that do not add to the overall meaning of a sentence. Examples of stop words in English are "a", "the", "is", and "are".

For the summarization task with the T5 model, we added the keyword "summarize" as a prefix to all the reviews to specify the intended task. We took the average lengths of the review text and the target summaries, which informed our choice of a maximum input length of 2048 and a maximum target length of 128 words for padding and truncation as needed.
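As a minimal sketch of this selection-and-cleaning pipeline (assuming pandas and NLTK, and the drugs.com-style column names drugName, usefulCount, and review; the file name is illustrative, not the paper's exact setup):

import re
import pandas as pd
from nltk.corpus import stopwords  # requires a prior nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def clean(text: str) -> str:
    """Lowercase, drop punctuation except periods, and remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s.]", " ", text)  # keep periods only
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

df = pd.read_csv("drugsCom_raw.tsv", sep="\t")  # illustrative file name

# Ten most useful reviews per drug, concatenated into one input document.
top10 = (df.sort_values("usefulCount", ascending=False)
           .groupby("drugName").head(10)
           .groupby("drugName")["review"]
           .apply(" ".join))

# First 500 drug names alphabetically, with the T5 task prefix attached
# after cleaning so the prefix itself is untouched.
inputs = ("summarize: " + top10.map(clean)).sort_index().head(500)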
C. Text-to-Text Transfer Transformer (T5) Model

Raffel et al. [16] proposed a transformer-based sequence-to-sequence model. It is an encoder-decoder architecture trained with a unified objective that casts every problem in a text-to-text format: both the input and the output are text. The stack of encoders (each made up of a self-attention layer and a feed-forward network) takes the input as a sequence of tokens that is mapped to a sequence of embeddings [8], [16]. The decoder's architecture is like the encoder's but has a standard attention mechanism after every self-attention layer. The output of the last decoder passes into a dense layer with a softmax activation function, whose weights are shared with the input embedding matrix [16].

The model is trained on the Colossal Clean Crawled Corpus (C4) dataset, which is about 700 GB, and can be fine-tuned for tasks like summarization, classification, translation, and question answering. The model works by receiving a text input containing the desired task as a prefix, and it produces an output in text format.
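To make the text-to-text interface concrete, a minimal inference sketch using Hugging Face's transformers library (the input text and generation settings are illustrative, not the paper's exact configuration):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task is specified as a plain-text prefix; the output is also text.
text = "summarize: " + "the drug worked well for my migraines but caused mild nausea ..."
input_ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids

output_ids = model.generate(input_ids, max_length=128, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))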
D. Fine-Tuning

The datasets are split into 80% for training, 10% for testing, and 10% for validation. We use the T5 tokenizer to get the data into a form readable by the model. The tokenization process returns a dictionary with the input ids and an attention mask for each input. Table I shows an example of an input text and the corresponding tokenizer output.

TABLE I
AN EXAMPLE OUTPUT FROM THE TOKENIZATION PROCESS

Input Text:        "The day is bright!", "it is bright and fair."
Tokenizer Output:  {'input_ids': [[37, 239, 19, 2756, 55, 1], [34, 19, 2756, 11, 2725, 5, 1]],
                    'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]}

In this work, we fine-tune the T5 model on the UCI drug review dataset [14], [15] using the seq2seq variant of the T5-small model from Hugging Face's transformers library [22]. We chose T5-small due to a lack of adequate computational power. The T5 model has five variants: T5-small, T5-base, T5-large, T5-3B, and T5-11B. The T5-small architecture comprises six layers in each of the encoder and decoder, eight attention heads, and about 60 million parameters [16]. Table II shows the hyperparameter settings used for the model, and Table III gives a brief description of each hyperparameter.

TABLE II
LIST OF HYPERPARAMETERS AND VALUES

Hyperparameter                 Value
evaluation_strategy            epoch
learning_rate                  0.0005
per_device_train_batch_size    4
per_device_eval_batch_size     4
weight_decay                   0.001
save_total_limit               3
num_train_epochs               5
predict_with_generate          True
fp16                           True

TABLE III
EXPLANATION OF THE LIST OF HYPERPARAMETERS

Hyperparameter                 Description
evaluation_strategy            specifies when evaluation is done during training; one of "no", "steps", or "epoch".
learning_rate                  the learning rate, which determines the size of the weight updates during training.
per_device_train_batch_size    the batch size per GPU/TPU core/CPU for training.
per_device_eval_batch_size     the batch size per GPU/TPU core/CPU for evaluation.
weight_decay                   the weight decay applied to all layers excluding the bias and LayerNorm weights.
save_total_limit               regulates the number of checkpoints kept on disk.
num_train_epochs               the number of passes the training dataset makes through the model.
predict_with_generate          generates summaries during evaluation so that generation metrics such as ROUGE can be computed.
fp16                           uses 16-bit (fp16) precision training instead of the default 32-bit to speed up training.
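Putting the split, tokenization, and training arguments together, a minimal fine-tuning sketch under these settings (the toy data, column names, and output directory are illustrative; text_target assumes a recent transformers version):

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainingArguments,
                          Seq2SeqTrainer)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Illustrative toy data; in the paper, each input is the combined top-10
# reviews for a drug (with the "summarize: " prefix) and each target is
# the manually written summary.
data = Dataset.from_dict({
    "review": ["summarize: the drug worked well but caused mild nausea ..."],
    "summary": ["effective for most users with mild side effects"],
})

def tokenize(batch):
    model_inputs = tokenizer(batch["review"], max_length=2048, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = data.map(tokenize, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-drug-reviews",  # illustrative
    evaluation_strategy="epoch",
    learning_rate=5e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.001,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    eval_dataset=tokenized,  # a held-out validation split in practice
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()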
E. Evaluation

We use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics [23] to evaluate the model. This evaluation metric helps to assess the quality of a model-generated summary by comparing it with a corresponding human summary. We specifically worked with the ROUGE1, ROUGE2, and ROUGEL scores. Table IV gives an overview of these ROUGE metrics.

TABLE IV
EVALUATION METRICS

Name     Description
ROUGE1   calculates the unigram overlap between the model-generated summaries and a set of reference human-generated summaries [23].
ROUGE2   calculates the bigram overlap between the model-generated summaries and a set of reference human-generated summaries [23].
ROUGEL   calculates the longest common subsequence overlap between the model-generated summaries and a set of reference human-generated summaries [23].

These scores combine the recall and precision values into an F-measure, which is the corresponding ROUGE score [23]. Let NOV be the number of overlapping words, HS the total number of words in the human summary, and MS the total number of words in the model-generated summary. Then:

Recall = NOV / HS    (1)

Precision = NOV / MS    (2)

ROUGE Score = (2 x Precision x Recall) / (Precision + Recall)    (3)
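As a minimal sketch of Equations (1)-(3) for n-gram overlap (this covers ROUGE1 and ROUGE2; ROUGEL uses the longest common subsequence instead, and the reported scores would normally come from a maintained implementation such as the rouge_score package):

from collections import Counter

def rouge_n_f1(model_summary: str, human_summary: str, n: int = 1) -> float:
    """F-measure of n-gram overlap, following Eqs. (1)-(3)."""
    def ngrams(text):
        words = text.split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    model_counts, human_counts = ngrams(model_summary), ngrams(human_summary)
    nov = sum((model_counts & human_counts).values())  # overlapping n-grams (NOV)
    if nov == 0:
        return 0.0
    recall = nov / sum(human_counts.values())          # Eq. (1): NOV / HS
    precision = nov / sum(model_counts.values())       # Eq. (2): NOV / MS
    return 2 * precision * recall / (precision + recall)  # Eq. (3)

# Example: unigram (ROUGE1) score between a generated and a reference summary.
print(rouge_n_f1("the drug relieved pain quickly",
                 "the drug relieved my pain very quickly"))  # ~0.83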
IV. RESULTS

Since we manually summarized the drug review dataset, we also test our model on the BBC news dataset [17] for better evaluation. This dataset consists of 2,225 documents from the BBC news website in five topic areas from 2004-2005: business, entertainment, politics, sports, and tech. Table V shows the average ROUGE scores obtained on the test sets of the UCI drug review dataset [14], [15] and the BBC news dataset [17].

TABLE V
AVERAGE ROUGE SCORES

Dataset        ROUGE1   ROUGE2   ROUGEL
Drug Reviews   45.62    25.58    36.53
BBC News       69.05    59.70    52.97

From the ROUGE scores shown in Table V, we can see that the model performed considerably well. The results on the UCI drug reviews dataset were not as impressive as those on the BBC news dataset because there were fewer training samples: creating human summaries takes a lot of time and effort, which made it difficult for us to use all the drugs and reviews in the dataset.

On the BBC news dataset, our fine-tuned model's performance was an improvement over some previous works [8], [13] that also reported results from fine-tuning the T5 model for text summarization on the BBC news dataset.

A. Statistical Analysis

Since the training process of deep learning models generally involves pseudorandom numbers (for example, in weight initialization and random mini-batch sampling), we ran the models on both datasets three times each under the same parameters to obtain the average ROUGE scores. Table VI and Table VII show the 95% confidence intervals of these averages for both datasets.

TABLE VI
UCI DRUG REVIEWS DATASET

ROUGE Category   Confidence Interval
ROUGE1           43.80 - 47.44
ROUGE2           22.86 - 28.30
ROUGEL           34.18 - 38.88

TABLE VII
BBC NEWS DATASET

ROUGE Category   Confidence Interval
ROUGE1           68.49 - 69.61
ROUGE2           59.35 - 60.04
ROUGEL           52.35 - 53.59

From Table VI and Table VII, we can say that, at the 95% confidence level, the true mean of the ROUGE scores will lie within the intervals specified in the tables.
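A minimal sketch of how such an interval can be computed from three runs (the run scores below are hypothetical, not the paper's raw per-run values; a Student-t interval is appropriate for so small a sample):

import numpy as np
from scipy import stats

# Hypothetical ROUGE1 scores from three training runs on one dataset.
runs = np.array([45.1, 45.6, 46.2])

mean = runs.mean()
sem = stats.sem(runs)  # standard error of the mean
low, high = stats.t.interval(0.95, df=len(runs) - 1, loc=mean, scale=sem)
print(f"95% CI: {low:.2f} - {high:.2f}")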
V. CONCLUSIONS AND FUTURE WORK

Text summarization methods reduce a large text into a condensed but meaningful form, and these methods have evolved to produce better results. As information availability increases, text summarization remains an important research area for its ability to save time while producing satisfying results. In this paper, we create human summaries for 500 drugs from the UCI drug reviews dataset. We fine-tune the T5 model to generate summaries automatically from the ten most useful reviews for each of those 500 drugs. Our model achieves averages of 45.62, 25.58, and 36.53 for ROUGE1, ROUGE2, and ROUGEL, respectively. We also fine-tune the T5 model on the BBC News dataset and achieve average ROUGE1, ROUGE2, and ROUGEL scores of 69.05, 59.70, and 52.97, respectively. In the future, we intend to use all reviews in the database to build a more robust dataset for better model performance and to explore feature extraction and other text summarization methods.
REFERENCES

[1] E. Doǧan and B. Kaya, "Deep learning based sentiment analysis and text summarization in social networks," in 2019 International Artificial Intelligence and Data Processing Symposium (IDAP). IEEE, 2019, pp. 1–6.
[2] N. S. Shirwandkar and S. Kulkarni, "Extractive text summarization using deep learning," in 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), 2018, pp. 1–5.
[3] S. Chopra, M. Auli, and A. M. Rush, "Abstractive sentence summarization with attentive recurrent neural networks," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 93–98.
[4] O. Tas and F. Kiyani, "A survey automatic text summarization," PressAcademia Procedia, vol. 5, no. 1, pp. 205–213, 2017.
[5] A. Sharaff, A. S. Khaire, and D. Sharma, "Analysing fuzzy based approach for extractive text summarization," in 2019 International Conference on Intelligent Computing and Control Systems (ICCS). IEEE, 2019, pp. 906–910.
[6] N. Yadav and N. Chatterjee, "Text summarization using sentiment analysis for DUC data," in 2016 International Conference on Information Technology (ICIT), 2016, pp. 229–234.
[7] A. Joshi, E. Fidalgo, E. Alegre, and U. de León, "Deep learning based text summarization: approaches, databases and evaluation measures," in International Conference of Applications of Intelligent Systems, 2018.
[8] A. Gupta, D. Chugh, R. Katarya et al., "Automated news summarization using transformers," in Sustainable Advanced Computing. Springer, 2022, pp. 249–259.
[9] U. Khandelwal, K. Clark, D. Jurafsky, and L. Kaiser, "Sample efficient text summarization using a single pre-trained transformer," arXiv preprint arXiv:1905.08836, 2019.
[10] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," arXiv preprint arXiv:1908.08345, 2019.
[11] S. Torres, "Evaluating extractive text summarization with BERTSUM," 2021.
[12] P. Vinod, S. Safar, D. Mathew, P. Venugopal, L. M. Joly, and J. George, "Fine-tuning the BERTSUMEXT model for clinical report summarization," in 2020 International Conference for Emerging Technology (INCET). IEEE, 2020, pp. 1–7.
[13] E. Zolotareva, T. M. Tashu, and T. Horváth, "Abstractive text summarization using transfer learning," in ITAT, 2020, pp. 75–80.
[14] D. Dua and C. Graff, "UCI machine learning repository," 2017. [Online]. Available: http://archive.ics.uci.edu/ml
[15] F. Gräßer, S. Kallumadi, H. Malberg, and S. Zaunseder, "Aspect-based sentiment analysis of drug reviews applying cross-domain and cross-data learning," in Proceedings of the 2018 International Conference on Digital Health, 2018, pp. 121–125.
[16] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," arXiv preprint arXiv:1910.10683, 2019.
[17] D. Greene and P. Cunningham, "Practical solutions to the problem of diagonal dominance in kernel document clustering," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 377–384.
[18] H. Kumar, B. Harish, and H. Darshan, "Sentiment analysis on IMDb movie reviews using hybrid feature extraction method," International Journal of Interactive Multimedia & Artificial Intelligence, vol. 5, no. 5, 2019.
[19] D. Krishnan, P. Bharathy, M. Venugopalan et al., "A supervised approach for extractive text summarization using minimal robust features," in 2019 International Conference on Intelligent Computing and Control Systems (ICCS). IEEE, 2019, pp. 521–527.
[20] C. Colón-Ruiz and I. Segura-Bedmar, "Comparing deep learning architectures for sentiment analysis on drug reviews," Journal of Biomedical Informatics, vol. 110, p. 103539, 2020.
[21] N. Punith and K. Raketla, "Sentiment analysis of drug reviews using transfer learning," in 2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA). IEEE, 2021, pp. 1794–1799.
[22] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew, "HuggingFace's transformers: State-of-the-art natural language processing," CoRR, vol. abs/1910.03771, 2019. [Online]. Available: http://arxiv.org/abs/1910.03771
[23] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out, 2004, pp. 74–81.