
Exploring The Extractive Method of Text Summarization

The document explores the extractive method of text summarization. It discusses extractive summarization, which uses a ranking algorithm to select important sentences from the original text to include in the summary. It also briefly mentions abstractive summarization but focuses on explaining extractive summarization with an example using Python code.

Uploaded by

RAPTER GAMING

10/3/23, 3:19 PM Exploring the Extractive Method of Text Summarization

Exploring the Extractive Method of Text Summarization
In this article, we will explore the two main approaches of NLP text summarization, namely extractive and abstractive.

By Shilpi Mazumdar

15 min. read

Introduction
We often don’t have enough time to read and understand lengthy documents, research papers, or news articles. Summarizing a large volume of text while retaining its essential information is crucial in many fields, such as journalism, research, and business. This is where NLP text summarization comes into play: a technique that automatically generates a condensed version of a given text while preserving its essential meaning. In this article, we will explore the two main approaches to NLP text summarization, namely extractive and abstractive, and examine their applications, strengths, and weaknesses.

Learning Objectives

In this article, you will:

1. Understand the different categories of text summarization.
2. Understand the extractive and abstractive approaches through examples.
3. Learn the difference between the two techniques.
4. Look at the future aspects of text summarization.

Table of Contents
1. Types of Text Summarization
2. Extractive Summarization
3. Abstractive Summarization
4. Understanding with Code
5. Comparison of Extractive and Abstractive Text
Summarization
6. Future Outlook of Text Summarization
7. Conclusion

Types of Text Summarization


Broadly, NLP text summarization can be divided into two main categories:

Extractive Approach
Abstractive Approach

Let’s dive a little deeper into each of the above-mentioned categories.

Extractive Summarization
So, what exactly happens in the extractive summarization method? It simply picks out the important sentences or phrases from the original text and joins them to form a summary.

On what basis are those sentences considered important? A ranking algorithm assigns a score to each sentence in the text based on its relevance to the overall meaning of the document. The most relevant sentences are then chosen for inclusion in the summary.

The ranking of sentences can be performed in various ways:

TF-IDF (term frequency-inverse document frequency)
Graph-based methods such as TextRank
Machine learning-based methods such as Support Vector Machines (SVM) and Random Forests
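
To make the graph-based option more concrete, here is a minimal sketch in the spirit of TextRank. It is not the reference algorithm: the function name textrank_scores, the Jaccard word-overlap similarity, the damping factor of 0.85, and the fixed number of iterations are all assumptions chosen for illustration. Each sentence is a node, edges are weighted by word overlap, and a PageRank-style iteration spreads importance along the edges.

```python
import re

def textrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration over a similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    token_sets = [set(re.findall(r"\w+", s.lower())) for s in sentences]

    # Edge weight (assumption): Jaccard overlap between the word sets
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and token_sets[i] and token_sets[j]:
                shared = token_sets[i] & token_sets[j]
                union = token_sets[i] | token_sets[j]
                weights[i][j] = len(shared) / len(union)

    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Importance flowing into sentence i from every connected sentence j
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores
```

Sentences that share vocabulary with many other sentences end up with higher scores, while an isolated sentence keeps only the baseline score.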


The main aim of the extractive method is to preserve the original meaning of the text. It works well when the input text is already well structured, both physically and logically, as the content in newspapers usually is.

Abstractive Summarization
Now let’s come to the abstractive summarization method. The name comes from the word abstract, which means an outline, a summary, or the basic idea of a voluminous text. Unlike the extractive method, it doesn’t simply pick out the important sentences. Instead, it analyzes the input text and generates new phrases or sentences that capture the essence of the original text and convey the same meaning more concisely and coherently.

How exactly is the summary generated in this method? In brief, the input text is analyzed by a neural network model. The model is trained on large amounts of text data, learns the relationships between words and sentences, and generates new text that conveys the same meaning as the original in a more understandable manner.

This method uses advanced NLP techniques such as natural language generation (NLG) and deep learning to understand the context and generate the summary. The resulting summaries are usually shorter and more readable than those generated by the extractive method, but they can sometimes contain errors or inaccuracies.

Note that in this article, we’ll only deal with the extractive text summarization method.

Understanding with Code


Here, we’ll focus on the extractive method and understand it better with an example.

But, before that, let’s quickly understand it with a flowchart.

Here, we will use a Python library called NLTK (Natural Language Toolkit) to implement the extractive method. NLTK provides a wide range of functionality for natural language processing, including text tokenization and stopword lists. The sentence scoring itself is done in plain Python.

Let’s take a look at the following code, which demonstrates how to use NLTK to generate a summary from a given text:

Frequency-based Approach

# import the required libraries
import nltk
nltk.download('punkt')  # punkt tokenizer for sentence tokenization
# newer NLTK releases may additionally need nltk.download('punkt_tab')
nltk.download('stopwords')  # list of stop words, such as 'a', 'an', 'the', 'in', which will be dropped
from collections import Counter  # counts the frequency of words in a text
from nltk.corpus import stopwords  # stop words list from the NLTK corpus
# (a corpus is a large collection of text or speech data used for statistical analysis)
from nltk.tokenize import sent_tokenize, word_tokenize
# sent_tokenize splits text into sentences
# word_tokenize splits sentences into words

# this function takes 2 inputs: the text, and n, the number of
# sentences the summary should contain
def generate_summary(text, n):
    # Tokenize the text into individual sentences
    sentences = sent_tokenize(text)

    # Tokenize the text into words, convert them to lowercase, and drop
    # stop words and non-alphanumeric tokens
    stop_words = set(stopwords.words('english'))
    words = [word.lower() for word in word_tokenize(text)
             if word.lower() not in stop_words and word.isalnum()]

    # Compute the frequency of each word
    word_freq = Counter(words)

    # Score each sentence as the sum of the frequency counts of its words
    sentence_scores = {}  # maps each sentence to its score

    for sentence in sentences:
        sentence_words = [word.lower() for word in word_tokenize(sentence)
                          if word.lower() not in stop_words and word.isalnum()]
        sentence_score = sum(word_freq[word] for word in sentence_words)
        # Keep only sentences with fewer than 20 content words. This filters
        # out very long sentences; adjust the threshold to the desired
        # length of summary sentences.
        if len(sentence_words) < 20:
            sentence_scores[sentence] = sentence_score

    # Select the top n sentences with the highest scores
    summary_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:n]
    summary = ' '.join(summary_sentences)

    return summary

Using a Sample Text From Wikipedia to Generate Summary

text = '''
Weather is the day-to-day or hour-to-hour change in the
atmosphere.
Weather includes wind, lightning, storms, hurricanes,
tornadoes (also known as twisters), rain, hail, snow, and
lots more.
Energy from the Sun affects the weather too.
Climate tells us what kinds of weather usually happen in
an area at different times of the year.
Changes in weather can affect our mood and life. We wear
different clothes and do different things in different
weather conditions.
We choose different foods in different seasons.
Weather stations around the world measure different parts
of weather.
Ways to measure weather are wind speed, wind direction,
temperature and humidity.
People try to use these measurements to make weather
forecasts for the future.
These people are scientists that are called
meteorologists.
They use computers to build large mathematical models to
follow weather trends.'''

summary = generate_summary(text, 5)
summary_sentences = summary.split('. ')
formatted_summary = '.\n'.join(summary_sentences)

print(formatted_summary)

Output

The following output is what we would get as a summary. It contains 5 sentences.

We wear different clothes and do different things in different weather conditions.
Weather stations around the world measure different parts of weather.
Climate tells us what kinds of weather usually happen in an area at different times of the year.
Weather includes wind, lightning, storms, hurricanes, tornadoes (also known as twisters), rain, hail, snow, and lots more.
Ways to measure weather are wind speed, wind direction, temperature and humidity.

What’s happening in the above code?

The above code takes a text and a desired number of sentences as input and returns a summary generated using the extractive method. It first tokenizes the text into individual sentences and into individual words. Stopwords are removed from the words, and then the frequency of each remaining word is computed.

Then the score for each sentence is computed as the sum of the frequencies of its words, and the top n sentences with the highest scores are selected to form the summary. Finally, the summary is generated by joining the selected sentences together in descending order of score.
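
The scoring step can be seen in isolation with a small standard-library sketch. The helper name score_sentences, the regex tokenizer, and the tiny stop word list are assumptions made for illustration; the article's code uses NLTK for both tokenization and stop words.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; NLTK's list is much longer)
STOP_WORDS = {"the", "is", "a", "in", "and", "of", "too"}

def score_sentences(sentences):
    """Map each sentence to the sum of the document-wide frequencies of its words."""
    def content_words(text):
        return [w for w in re.findall(r"[a-z0-9]+", text.lower())
                if w not in STOP_WORDS]

    # Word frequencies are counted over all sentences together
    word_freq = Counter(w for s in sentences for w in content_words(s))
    return {s: sum(word_freq[w] for w in content_words(s)) for s in sentences}

sentences = [
    "Weather is the change in the atmosphere.",
    "Energy from the Sun affects the weather too.",
    "Weather stations measure weather.",
]
scores = score_sentences(sentences)
for sentence, score in scores.items():
    print(score, sentence)
```

Because the score is a plain sum, sentences that repeat a frequent word (here "weather") or simply contain more words tend to rank higher. That is worth keeping in mind when reading the summaries this approach produces.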

In the next section, we will explore how the extractive method can be further improved using advanced techniques such as TF-IDF.

TF-IDF Approach

# importing the required libraries

# sentence tokenizer, as in the frequency-based example
from nltk.tokenize import sent_tokenize

# TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer

# cosine_similarity computes the cosine similarity between vectors
from sklearn.metrics.pairwise import cosine_similarity

# nlargest returns the n largest elements from an iterable in descending order
from heapq import nlargest

def generate_summary(text, n):
    # Tokenize the text into individual sentences
    sentences = sent_tokenize(text)

    # Create the TF-IDF matrix (one row per sentence)
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)

    # Compute the cosine similarity between the last sentence and every
    # other sentence. Note that the last sentence, not the whole document,
    # serves as the reference vector here.
    sentence_scores = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])[0]

    # Select the indices of the top n sentences with the highest scores
    summary_sentences = nlargest(n, range(len(sentence_scores)),
                                 key=sentence_scores.__getitem__)

    # Join the selected sentences in their original order
    summary_tfidf = ' '.join([sentences[i] for i in sorted(summary_sentences)])

    return summary_tfidf


Using a Sample Text to Check the Summary

text = '''
Weather is the day-to-day or hour-to-hour change in the
atmosphere.
Weather includes wind, lightning, storms, hurricanes,
tornadoes (also known as twisters), rain, hail, snow, and
lots more.
Energy from the Sun affects the weather too.
Climate tells us what kinds of weather usually happen in
an area at different times of the year.
Changes in weather can affect our mood and life. We wear
different clothes and do different things in different
weather conditions.
We choose different foods in different seasons.
Weather stations around the world measure different parts
of weather.
Ways to measure weather are wind speed, wind direction,
temperature and humidity.
People try to use these measurements to make weather
forecasts for the future.
These people are scientists that are called
meteorologists.
They use computers to build large mathematical models to
follow weather trends.'''

summary = generate_summary(text, 5)
summary_sentences = summary.split('. ')
formatted_summary = '.\n'.join(summary_sentences)

print(formatted_summary)

The following output is what we would get as a summary. It contains 5 sentences.

Energy from the Sun affects the weather too.
Changes in weather can affect our mood and life.
We wear different clothes and do different things in different weather conditions.
Weather stations around the world measure different parts of weather.
People try to use these measurements to make weather forecasts for the future.

The above code generates a summary for a given text using a TF-IDF approach. The function takes a text parameter and an n parameter (the number of sentences in the summary). It tokenizes the text into individual sentences, creates a TF-IDF matrix using the TfidfVectorizer class, and scores the sentences using the cosine_similarity function. As written, the code compares every other sentence with the last sentence of the text, so the last sentence stands in for the document as a whole. Next, the function selects the top n sentences with the highest scores using the nlargest function from the heapq library and joins them, in their original order, into a string using the join method.
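
For comparison, here is a minimal standard-library sketch of scoring every sentence against a vector for the whole document, rather than against a single sentence. This is a variant for illustration, not the code used to produce the output above. The function names, the regex tokenizer, the smoothed IDF formula, and the choice of summing sentence vectors to form the document vector are all assumptions.

```python
import math
import re
from collections import Counter

def tfidf_vectors(sentences):
    """Return one {word: tf-idf weight} dict per sentence."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    n_docs = len(tokenized)
    # Document frequency: number of sentences that contain each word
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption) keeps shared words at a positive weight
        vectors.append({w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1)
                        for w in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_against_document(sentences):
    """Score each sentence by its similarity to the sum of all sentence vectors."""
    vectors = tfidf_vectors(sentences)
    doc = {}
    for vec in vectors:
        for word, weight in vec.items():
            doc[word] = doc.get(word, 0.0) + weight
    return [cosine(vec, doc) for vec in vectors]
```

With this variant, every sentence (including the last one) receives a score, and sentences whose vocabulary overlaps with the rest of the text score higher than off-topic ones.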

Before moving further, let’s quickly understand cosine similarity. You can jump to the next part if you are already familiar with it.

Cosine similarity considers the angle between the word-frequency vectors of two documents rather than just their magnitudes. This means that documents with similar word frequencies and distributions will have a smaller angle between their vectors and thus a higher cosine similarity score. Let’s understand this with a simple example.


We have two sentences.

1. “I love cats and dogs.”
2. “I love only cats.”

To calculate the similarity between these two sentences using cosine similarity, we first need to convert each sentence into a vector representation:

1. “I love cats and dogs.” -> [1, 1, 1, 1, 1, 1, 0]
2. “I love only cats.” -> [1, 1, 1, 0, 0, 1, 1]

How are we getting the vector representation? We need to perform the following steps.
1. Break each sentence into individual words (tokenization):

“I love cats and dogs.” -> [‘I’, ‘love’, ‘cats’, ‘and’, ‘dogs’, ‘.’]
“I love only cats.” -> [‘I’, ‘love’, ‘only’, ‘cats’, ‘.’]

2. Create a vocabulary of unique words from both sentences:

[‘I’, ‘love’, ‘cats’, ‘and’, ‘dogs’, ‘.’, ‘only’]

3. Convert each sentence into a binary vector of size equal to the vocabulary, where 1 represents the presence of the word in the sentence and 0 represents its absence.

“I love cats and dogs.” -> [1, 1, 1, 1, 1, 1, 0]
Explanation:
‘I’ is present -> 1
‘love’ is present -> 1
‘cats’ is present -> 1
‘and’ is present -> 1
‘dogs’ is present -> 1
‘.’ is present -> 1
‘only’ is absent -> 0

“I love only cats.” -> [1, 1, 1, 0, 0, 1, 1]
Explanation:
‘I’ is present -> 1
‘love’ is present -> 1
‘cats’ is present -> 1
‘and’ is absent -> 0
‘dogs’ is absent -> 0
‘.’ is present -> 1
‘only’ is present -> 1

Each vector has seven elements, one for each of the seven unique tokens in the vocabulary. The values in each vector indicate whether the token is present in the respective sentence.

In a full TF-IDF pipeline, the next step would be to weight each word by its inverse document frequency (IDF). With the plain definition idf = log(N / df) and N = 2 sentences, tokens that occur in both sentences (‘I’, ‘love’, ‘cats’, ‘.’) get an IDF of log(2/2) = 0, while tokens that occur in only one sentence (‘and’, ‘dogs’, ‘only’) get log(2/1) ≈ 0.69. For simplicity, we skip this weighting here and compute the cosine similarity directly on the binary vectors above.
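
As a quick check of how IDF behaves on this two-sentence example, the following sketch computes the document frequency and the unsmoothed IDF, log(N / df), for each token in the vocabulary. The variable names are illustrative, and the unsmoothed formula is an assumption; libraries such as scikit-learn use smoothed variants by default.

```python
import math

docs = [
    ["I", "love", "cats", "and", "dogs", "."],
    ["I", "love", "only", "cats", "."],
]
vocabulary = ["I", "love", "cats", "and", "dogs", ".", "only"]

# Document frequency: in how many sentences each token appears
df = {token: sum(token in doc for doc in docs) for token in vocabulary}

# Unsmoothed IDF: zero for tokens found in every sentence
idf = {token: math.log(len(docs) / df[token]) for token in vocabulary}

for token in vocabulary:
    print(f"{token!r}: df={df[token]}, idf={idf[token]:.2f}")
```

Only tokens that appear in every sentence get an IDF of zero. Tokens unique to one sentence keep a positive weight, which is what lets TF-IDF tell the sentences apart.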


Finally, we compute the cosine similarity between the two vectors using the formula:

cosine_similarity = (v1 . v2) / (||v1|| * ||v2||)

where v1 and v2 are the vector representations of the sentences, ‘.’ denotes the dot product of two vectors, and ||v1|| and ||v2|| are the Euclidean norms of the two vectors.

Using the vector representations and the formula above, we compute the cosine similarity between the two sentences as follows.

The dot product of the vectors [1, 1, 1, 1, 1, 1, 0] and [1, 1, 1, 0, 0, 1, 1] is:

1*1 + 1*1 + 1*1 + 1*0 + 1*0 + 1*1 + 0*1 = 4

The magnitude (or Euclidean length) of the first vector [1, 1, 1, 1, 1, 1, 0] is:

sqrt(1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 0^2) = sqrt(6) ≈ 2.45

Similarly, the magnitude of the second vector [1, 1, 1, 0, 0, 1, 1] is:

sqrt(1^2 + 1^2 + 1^2 + 0^2 + 0^2 + 1^2 + 1^2) = sqrt(5) ≈ 2.24

Therefore, the cosine similarity between the two sentences is:

cosine_similarity = 4 / (sqrt(6) * sqrt(5)) = 4 / sqrt(30) ≈ 4 / 5.48 ≈ 0.73

This indicates that the two sentences are somewhat similar but not very similar.
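
The same calculation can be reproduced in a few lines of Python. This is a small sketch using only the standard library; the function name cosine_similarity mirrors the formula above and is not the scikit-learn function of the same name. Without rounding the square roots early, the result is about 0.73.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

v1 = [1, 1, 1, 1, 1, 1, 0]  # "I love cats and dogs."
v2 = [1, 1, 1, 0, 0, 1, 1]  # "I love only cats."

similarity = cosine_similarity(v1, v2)
print(round(similarity, 2))  # 0.73
```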

Evaluation Metrics

Let’s now check how well our approach is working. I got this particular text from this link. Following is the text.

Weather is the day-to-day or hour-to-hour change in the atmosphere. Weather includes wind, lightning, storms, hurricanes, tornadoes (also known as twisters), rain, hail, snow, and lots more. Energy from the Sun affects the weather too. Climate tells us what kinds of weather usually happen in an area at different times of the year. Changes in weather can affect our mood and life. We wear different clothes and do different things in different weather conditions. We choose different foods in different seasons.

Weather stations around the world measure different parts of weather. Ways to measure weather are wind speed, wind direction, temperature and humidity. People try to use these measurements to make weather forecasts for the future. These people are scientists that are called meteorologists. They use computers to build large mathematical models to follow weather trends.

How can we check the quality of a summary generated from the above text? One way is to use human evaluation as the ground truth. In this approach, we generate summaries using each method (frequency-based and TF-IDF) and then ask human evaluators to rate the quality of each summary on criteria such as coherence, readability, and relevance to the original text. We can then calculate the average score for each method from the ratings given by the evaluators. This gives us a quantitative measure of the performance of each method.

Another approach is to use ROUGE (Recall-Oriented Understudy for Gisting Evaluation), a commonly used family of metrics for evaluating text summarization models. ROUGE measures the overlap between the generated summary and a reference summary (i.e., the ground truth).
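
To show what "overlap" means here, the following is a simplified sketch of ROUGE-1, which counts shared single words (unigrams). The function name rouge_1 and the whitespace tokenization are assumptions made for illustration; the rouge package used later in this article may tokenize and preprocess differently, so its scores need not match this sketch exactly.

```python
from collections import Counter

def rouge_1(summary, reference):
    """Return unigram-overlap precision, recall, and F1 (a simplified ROUGE-1)."""
    summary_counts = Counter(summary.lower().split())
    reference_counts = Counter(reference.lower().split())

    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((summary_counts & reference_counts).values())

    precision = overlap / max(sum(summary_counts.values()), 1)
    recall = overlap / max(sum(reference_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

precision, recall, f1 = rouge_1("the cat sat", "the cat sat on the mat")
print(precision, recall, round(f1, 3))
```

Precision is the share of summary words that also occur in the reference, recall is the share of reference words covered by the summary, and F1 balances the two.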

Let’s first go with the human evaluation method.

We got the following summary (5 sentences) as the output using the frequency-based approach.

We wear different clothes and do different things in different weather conditions.
Weather stations around the world measure different parts of weather.
Climate tells us what kinds of weather usually happen in an area at different times of the year.
Weather includes wind, lightning, storms, hurricanes, tornadoes (also known as twisters), rain, hail, snow, and lots more.
Ways to measure weather are wind speed, wind direction, temperature and humidity.

We got the following summary (5 sentences) as the output using the TF-IDF approach.

Energy from the Sun affects the weather too.
Changes in weather can affect our mood and life.
We wear different clothes and do different things in different weather conditions.
Weather stations around the world measure different parts of weather.
People try to use these measurements to make weather forecasts for the future.

On average, human evaluators rated the frequency-based approach 4/5 and the TF-IDF approach 3/5.

So, as per human evaluation, the frequency-based approach works better.

Now, let’s see how the machine evaluates, using ROUGE. The following code defines a human-generated reference summary, and we will check how well the automatically generated summary compares with it.

# in case it's not installed on your system (notebook syntax)
! pip install rouge

from rouge import Rouge

# evaluate_rouge takes two arguments, a reference text and a summary text,
# and uses the ROUGE metric to evaluate the quality of the summary text
# compared to the reference text.
# It returns the F1 score of the ROUGE-1 metric.
def evaluate_rouge(reference_text, summary_text):
    scorer = Rouge()
    # get_scores expects the generated summary (hypothesis) first
    # and the reference second
    scores = scorer.get_scores(summary_text, reference_text)
    return scores[0]['rouge-1']['f']

# the following is a human generated summary


reference_summary = '''
Weather is a gradual slow change through days and hours in
the atmosphere and can vary from wind to snow.
Climate tells a lot about the weather in an area.
The livelihood of people changes according to the change
in weather.
Weather stations measure different parts of weather.
People who use measurements to make weather forecasts for
the future are called meteorologists, and are
scientists.'''

# the sample text from Wikipedia


text = '''
Weather is the day-to-day or hour-to-hour change in the
atmosphere.
Weather includes wind, lightning, storms, hurricanes,
tornadoes (also known as twisters), rain, hail, snow, and
lots more.
Energy from the Sun affects the weather too.
Climate tells us what kinds of weather usually happen in
an area at different times of the year.
Changes in weather can affect our mood and life. We wear
different clothes and do different things in different
weather conditions.
We choose different foods in different seasons.
Weather stations around the world measure different parts
of weather.
Ways to measure weather are wind speed, wind direction,
temperature and humidity.
People try to use these measurements to make weather
forecasts for the future.
These people are scientists that are called
meteorologists.
They use computers to build large mathematical models to
follow weather trends.'''

# Generate summary using frequency-based/TF-IDF approach


summary = generate_summary(text, 5)

# Evaluate the summary using ROUGE


rouge_score = evaluate_rouge(reference_summary, summary)

print(f"ROUGE score: {rouge_score}")

# For frequency based approach we are getting a score of 0.336
# For TF-IDF approach we are getting a score of 0.465

Here, a reference summary and a text are defined. Then, a summary is generated from the text, once using the frequency-based approach and once using the TF-IDF approach (run the code once with each definition of generate_summary). Next, the ROUGE score of the generated summary is evaluated against the reference summary using the evaluate_rouge() function. The ROUGE score measures the overlap between the generated and reference summaries. The higher the ROUGE score, the more similar the two summaries are.

For the frequency-based approach, we get a score of 0.336; using the TF-IDF approach, we get a score of 0.465. So, in this evaluation method, the TF-IDF approach works better.

Comparison of Extractive and Abstractive Text Summarization

Future Outlook of Text Summarization


The field is advancing quickly, with R&D teams exploring new techniques every day. Advances in machine learning and NLP will gradually improve the quality and accuracy of generated summaries.

The field also makes use of deep learning models, such as recurrent neural networks and transformers, which lead to a better understanding of what the text is about. Additionally, further advances in language generation techniques will lead to the development of more sophisticated abstractive summarization methods.

Ultimately, these advanced solutions would help us save time, increase productivity, and make information more accessible and easily digestible.

Conclusion

Text summarization is a fast-growing field in natural language processing, and it has the potential to revolutionize the way we consume and process information. In this article, we covered the following points:

Extractive summarization techniques select and combine existing sentences from a text to create a summary. In contrast, abstractive techniques generate new sentences while keeping the essence of the original text intact.
Extractive summarization has advantages over abstractive summarization, including higher accuracy, lower computational complexity, and better preservation of factual information.
Abstractive summarization has advantages over extractive summarization, including the ability to create more concise and coherent summaries and the potential to capture the overall meaning of a text.
Text summarization has many real-world applications, including journalism, finance, healthcare, and the legal industry.
As the amount of digital information grows, text summarization will become an essential tool for efficiently processing and making sense of large volumes of text.
