SALAÜN Mathilde
KARUNATHASAN Nilany
JEGATHEESWARAN Janany
SAMBATH Sïndoumady
TEXT SUMMARIZATION
STRATEGIES
Theme Analysis and Evolution of NLP Techniques
OVERVIEW
About Us
Context
List of techniques
References
ABOUT US
Mathilde Salaün, Data Developer
https://www.linkedin.com/in/mathilde-salaun-13378b252/

Janany Jegatheeswaran, Big Data Engineer
https://www.linkedin.com/in/janany-jegatheeswaran-a729661ba/

Nilany Karunathasan, Data Scientist
https://www.linkedin.com/in/nilany-karunathasan-7b49691ba/

Sïndoumady Sambath, Software Engineer
https://www.linkedin.com/in/s%C3%AFndoumady-sambath-a7519a209/
CONTEXT
Issue: Information overload due to Internet growth
Purpose: Simplifying abundant material for accessibility
Demand: Need for complex and powerful summarization tools
Objective: Machine-generated summaries aligned with human-created ones
Analysis: Summarization concepts, techniques, metrics, and future scopes
TECHNIQUES
Text Summarization
- Extractive Summarization
- Abstractive Summarization
- Hybrid Summarization
PARADIGM I : EXTRACTIVE
SUMMARIZATION
An approach that involves selecting and combining crucial sentences or phrases directly from the original text to construct a summary.
Focuses on identifying and extracting the most pertinent information while preserving the exact wording from the source material.
TEXT INPUT → KEY INFORMATION IDENTIFICATION → SENTENCE SELECTION → COMBINATION → ORIGINAL WORDING PRESERVATION → SUMMARY OUTPUT
SPECIFIC METHOD : TF-IDF WEIGHTING
OF MULTI-WORD TERMS
Multi-word Terms
- Extends classic TF-IDF beyond single-word terms
- Introduces a maximal term length
- Recognizes document-specific phrases

Preprocessing
- Uses the Python nltk library
- Text splitting, tokenization, and symbol removal
- Custom stopword list
Creating the TF-IDF Matrix
Define Maximal Term Length (TL)
Generate Multi-word Terms
Calculate TF and IDF
Most Important Sequence
Find Sequences (up to 1000 words)
Calculate TF-IDF Scores
Rank Sequences
Select Highest-Ranking Sequence as
Summary
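The steps above can be sketched in a few lines of Python. This is a hedged toy version: the sentence-window candidate generation, the n-gram cap, and the small background corpus are simplifying assumptions, not the exact procedure of Krimberg et al.

```python
import math
import re
from collections import Counter

def terms(text, max_len=3):
    """Generate multi-word terms: all n-grams up to max_len words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]

def tfidf_summary(document, corpus, window=2, max_len=3):
    """Return the highest-scoring window of consecutive sentences."""
    # Document frequencies over the background corpus plus the document itself
    docs = [set(terms(d, max_len)) for d in corpus + [document]]
    idf = lambda t: math.log(len(docs) / sum(t in d for d in docs))

    tf = Counter(terms(document, max_len))   # term frequencies in the document
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())

    best, best_score = "", float("-inf")
    for i in range(len(sentences) - window + 1):
        seq = " ".join(sentences[i:i + window])  # candidate sequence
        score = sum(tf[t] * idf(t) for t in set(terms(seq, max_len)))
        if score > best_score:
            best, best_score = seq, score
    return best
```

Ranking whole windows of consecutive sentences rather than single sentences keeps local coherence, at the cost of a larger search space.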
SPECIFIC METHOD : TF-IDF WEIGHTING
OF MULTI-WORD TERMS
Pipeline of the Approach
DOCUMENT CORPUS → PREPROCESSING → MULTI-WORD TERMS → COMPUTE TF-IDF → GENERATE CANDIDATE SEQUENCES → TF-IDF SCORES FOR SEQUENCES → BEST SCORED SUMMARY
PARADIGM II : ABSTRACTIVE
SUMMARIZATION
Uses natural language techniques to interpret and understand the important aspects of a text and generate a more "human-friendly" summary.
- Needs a deeper analysis of the text
- Ability to generate new sentences
- Abstractive methods are classified into two categories: the structure-based approach and the semantic-based approach

Techniques of Abstractive Text Summarization
- Structure-based approach: tree-based, template-based, ontology-based, rule-based, and graph-based methods
- Semantic-based approach: semantic graph-based method, information item-based methods, and multimodal semantic model
EXAMPLE : PEGASUS
PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models)
- A model specially designed for abstractive summarization
- Uses deep learning in combination with natural language processing (NLP)
- Built on the Transformer architecture
Architecture Schema
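PEGASUS' pre-training objective, Gap Sentence Generation (GSG), masks the most informative sentences of a document and trains the model to regenerate them. A toy sketch of the selection step follows; scoring each sentence with a simplified set-based ROUGE-1 F1 against the rest of the document is an assumption for brevity (the paper uses proper ROUGE counting and a Transformer encoder-decoder for the generation itself).

```python
def rouge1_f(candidate, reference):
    """Simplified ROUGE-1 F1: unigram-set overlap between two strings."""
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    if not c or not r:
        return 0.0
    overlap = len(c & r)
    prec, rec = overlap / len(c), overlap / len(r)
    return 2 * prec * rec / (prec + rec) if overlap else 0.0

def select_gap_sentences(sentences, ratio=0.3):
    """Score each sentence against the rest of the document; the
    top-scoring ones become the masked 'gap' sentences."""
    scores = []
    for i, s in enumerate(sentences):
        rest = " ".join(sentences[:i] + sentences[i + 1:])
        scores.append((rouge1_f(s, rest), i))
    k = max(1, round(len(sentences) * ratio))
    gap = sorted(i for _, i in sorted(scores, reverse=True)[:k])
    masked = ["<MASK>" if i in gap else s for i, s in enumerate(sentences)]
    return masked, [sentences[i] for i in gap]
```

The model then sees the masked document as input and the removed sentences as the target, which makes pre-training look like the summarization task itself.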
EXAMPLE : PEGASUS
The rows of the table represent the different models evaluated, while the columns represent the ROUGE metrics for each dataset.
PEGASUS is highlighted with two different configurations, PEGASUS_LARGE (C4) and PEGASUS_LARGE (HugeNews),
which likely indicate two variants of the PEGASUS model trained with different datasets or hyperparameters.
ROUGE scores are generally higher for PEGASUS compared to other models, suggesting that PEGASUS performs better
for the automatic text summarization task on these specific datasets. This may be due to PEGASUS' specialized pre-
training method that is optimized for the summary task.
Models Performance
ALTERNATIVE APPROACH :
HYBRID SUMMARIZATION
Hybrid text summarization methods combine elements of both extractive and abstractive techniques. The aim is to leverage the factual accuracy of extractive techniques and the flexibility of abstractive methods.
Typically, a hybrid model first selects important sentences or phrases from the source text using extractive techniques and
then generates a concise and coherent summary by paraphrasing and rephrasing the extracted content in an abstractive
manner.
Example of Hybrid Summarization :
Link :
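A toy illustration of the extract-then-rewrite idea (not the Seq-to-Seq/Transformer system from the references): the extractive stage scores sentences by content-word frequency, and the abstractive stage is stubbed by a trivial rewrite that only drops parenthetical asides, where a real system would paraphrase with a seq2seq model.

```python
import re
from collections import Counter

STOP = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "that"}

def extract(sentences, k=2):
    """Extractive stage: rank sentences by content-word frequency."""
    freq = Counter(w for s in sentences
                   for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP)
    ranked = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in re.findall(r"[a-z]+", s.lower())),
                    reverse=True)
    return [s for s in sentences if s in ranked[:k]]  # keep document order

def compress(sentence):
    """Stand-in for the abstractive stage: a real system paraphrases with
    a seq2seq model; here we merely remove parenthetical asides."""
    return re.sub(r"\s*\([^)]*\)", "", sentence)

def hybrid_summary(text, k=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(compress(s) for s in extract(sentences, k))
```

The division of labor is the point: the extractive stage guarantees the content comes from the source, and the rewriting stage is free to smooth the surface form.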
SPECIFIC METHOD : GRAPH BASED
SUMMARIZATION
Input Processing
Text Pre-processing
- Word Tokenization
- POS Tagging
- Lemmatization
Graph Generation
- Node => Sentence
- Weighted Edge => Similarity Measure
- Semantic inclusion using doc2vec
Example
SPECIFIC METHOD : GRAPH BASED
SUMMARIZATION
Processing and Post-Processing
Ranking
- TextRank Algorithm
- Vertex Voting
- Score per sentence
Clustering & Selection
- Generate Clusters
- Cluster-based Rank Calculation
- Topic per Cluster
Output
- Generate Summary
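The ranking stage can be sketched with plain power iteration. Assumed simplifications: edge weights here use Jaccard word overlap instead of TextRank's length-normalized overlap or the doc2vec similarity mentioned earlier, and the clustering step is omitted.

```python
import re

def similarity(s1, s2):
    """Jaccard word overlap between two sentences (simplified edge weight)."""
    w1 = set(re.findall(r"[a-z]+", s1.lower()))
    w2 = set(re.findall(r"[a-z]+", s2.lower()))
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

def textrank(sentences, damping=0.85, iters=50):
    """PageRank-style vertex voting over the sentence-similarity graph."""
    n = len(sentences)
    w = [[similarity(a, b) if i != j else 0.0
          for j, b in enumerate(sentences)]
         for i, a in enumerate(sentences)]
    row_sum = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iters):
        scores = [(1 - damping) + damping * sum(
                      w[j][i] * scores[j] / row_sum[j]
                      for j in range(n) if row_sum[j] > 0)
                  for i in range(n)]
    return scores

def summarize(text, k=1):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(top))
```

Sentences that many other sentences "vote" for through shared vocabulary accumulate score; off-topic sentences stay near the (1 - damping) floor.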
SPECIFIC METHOD : GRAPH BASED
SUMMARIZATION
Results on English and Persian document based on ROUGE score
SPECIFIC METHOD : NEURAL NETWORK
BASED SUMMARIZATION
Neural network-based summarization methods use artificial neural networks to automatically generate concise and coherent summaries of text. These methods can fall into either the extractive or the abstractive category.
Extractive summarization using neural networks involves training a model to select and rank important sentences or phrases directly from the input text. Here is a basic outline of how a neural network for extractive summarization can be structured:
Example of Extractive Summarization using DL
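As a toy version of that outline, here is a tiny feed-forward scorer over hand-made features. The features, layer sizes, and random untrained weights are illustrative assumptions; real systems learn the weights from sentences labeled as summary-worthy, and replace the hand-made features with learned encoders such as BERT embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def featurize(sentence, position, doc_len):
    """Toy hand-crafted features: length, position, capitalization ratio."""
    words = sentence.split()
    return np.array([len(words) / 30.0,
                     position / max(doc_len - 1, 1),
                     sum(w.istitle() for w in words) / max(len(words), 1)])

class SentenceScorer:
    """Minimal feed-forward net: features -> hidden layer -> importance score."""
    def __init__(self, n_features=3, hidden=4):
        # Random weights here; in practice these are trained on labeled data.
        self.w1 = rng.normal(0.0, 0.1, (n_features, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, 1))

    def score(self, x):
        h = np.tanh(x @ self.w1)                               # hidden activation
        return (1.0 / (1.0 + np.exp(-(h @ self.w2)))).item()   # sigmoid score

def neural_extract(sentences, k=2):
    """Score every sentence, keep the top k in document order."""
    model = SentenceScorer()
    scores = [model.score(featurize(s, i, len(sentences)))
              for i, s in enumerate(sentences)]
    top = sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

The structure is what matters: encode each sentence, score it, then select; everything else is a design choice.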
SPECIFIC METHOD : NEURAL NETWORK
BASED SUMMARIZATION
Overview of Abstractive Summarization using Deep Learning
RNN, LSTM, GRU (formerly used): took preceding input into account, but had difficulties handling long-term dependencies and forgot information from the beginning of the document.
Attention Mechanisms: address these limitations by allowing the model to focus on different parts of the input text while generating each word of the summary.
Transformer Models: the self-attention mechanism allows considering the entire context of the input text, facilitating better capture of long-range dependencies.
SPECIFIC METHOD : NEURAL NETWORK
BASED SUMMARIZATION
Pre-trained Models (BERT, GPT...)
Fine-tuned for summarization tasks, they have shown impressive performance in NLP applications, including abstractive summarization.

Pointer-Generator Networks
Handle out-of-vocabulary words by incorporating a mechanism to copy words directly from the source document into the summary.

Metrics
Summaries are hard to evaluate due to subjectiveness. The most common metrics are:
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- BLEU (Bilingual Evaluation Understudy)
- METEOR
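Of these metrics, ROUGE is the one behind the results cited in these slides. A minimal ROUGE-N computation with clipped n-gram counts is shown below; real evaluations use an established implementation such as the rouge-score package and also report ROUGE-L.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """ROUGE-N: n-gram overlap between a candidate and a reference summary."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())          # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Recall rewards covering the reference; precision penalizes padding the candidate; published tables usually report F1.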
COMPARISON / PROS AND CONS

EXTRACTIVE
Pros:
- Respect for grammar
- Preservation of information
- Interpretability
- Reduced risk of information loss
Cons:
- Limited creativity
- Redundancy
- Difficulty with incoherent texts
- Dependency on sentence importance metrics

ABSTRACTIVE
Pros:
- Human-like summary
- Reduced redundancy
- Ability to grasp the context and its subtleties
- Language fluency
Cons:
- Increased complexity
- Training data challenges
- Computational resources
- Evaluation challenges
- Risk of redundancy

HYBRID
Pros:
- Preservation of information
- Adaptability
- Improved coherence
- Handling ambiguity
- Domain adaptability
- Non-structural information processing
- Customization and flexibility
Cons:
- Costly in terms of time and equipment
- Information loss risk
- Technical complexity
- Potential biases
- Uncertain interpretability
FUTURE CHALLENGES
- Handling Multiple-Document Summarization
- Real-time Summarization
- Domain-specific Summarization
REFERENCES
General Overview
Yadav, D., Desai, J., & Yadav, A. K. (Year). Automatic Text Summarization Methods: A Comprehensive Review. https://arxiv.org/ftp/arxiv/papers/2204/2204.01849.pdf
On Extractive Summarization
Krimberg, S., Vanetik, N., & Litvak, M. (2021). Summarization of financial documents with TF-IDF weighting of multi-word terms. FNP. https://doi.org/10.1016/j.mlwa.2022.100324
On Abstractive Summarization
Zhang, J., Zhao, Y., Saleh, M., & Liu, P. J. (2020). PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
Rawat, P., Ganpatrao, N. G., & Gupta, D. (2017). Text Summarization Using Abstractive Methods. Journal of Network Communications
and Emerging Technologies (JNCET)
On Hybrid Summarization
Elsaid, A., Mohammed, A., Fattouh, L., & Sakre, M. (2020). A Hybrid Arabic Text Summarization Approach Based on Seq-to-Seq and
Transformer
On Graph based Summarization
Mihalcea, R. (2004, July 1). TextRank: Bringing order into text. ACL Anthology. https://aclanthology.org/W04-3252/
Bichi, A. A., Samsudin, R., Hassan, R., Hasan, L., & Rogo, A. A. (2023). Graph-based Extractive Text summarization Method for Hausa
Text. PLOS ONE, 18(5), e0285376. https://doi.org/10.1371/journal.pone.0285376