Text Summarization In Python
With SpaCy Library
by Varsha Saini
According to a research paper by Anthony Cocciolo of the Pratt Institute, textual data
on the internet is decreasing gradually.
The paper suggests that online users are no longer as interested in long-form text:
tracking websites from the 1990s onward, it found that the amount of text content on
web pages has been falling year by year.
More and more websites are delivering content in smaller bits. These shorter pieces
of text are combined with images, videos, and infographics to convey the message in
a more compact form.
These facts emphasize the need for a process known as Text Summarization.
We will look at its definition and applications, and then build a Text Summarization
algorithm in Python with the help of the spaCy library.
What is Text Summarization?
The discussion above has probably already given you a picture of what a textual
summary is. Text Summarization is a technique for converting a long piece of content
into a shorter one without losing the actual context.
With the help of this technique, a summary of any text material can be generated.
It removes only the text that does not change the overall meaning of the content.
One practical example is the mobile application Inshorts, which provides 60-word
news summaries. Each summary contains only the headline and the important facts
rather than varied opinions.
It has an even greater scope of application in scientific research, where papers can
run to thousands of pages of important documentation. Text Summarization can help
scientists focus only on the key phrases in all that data.
Producing a summary of textual content used to be manual work. With advances in
artificial intelligence and Natural Language Processing techniques, the task has
become much easier.
In this article, we will use one such advanced Python library, spaCy. With the help
of spaCy, it becomes very easy to dig out important information from tons of text
data.
Types of Text Summarization
There are no fixed guidelines for categorizing the techniques used for it. However,
to keep things organized, they are generally divided into the following types:
1. Short Tail Summarization: Here the input content is already short and precise.
Despite its short length, it needs to be summarized in such a way that it can be
condensed further without any change in meaning.
2. Long Tail Summarization: As the name suggests, the content here may be too long
for a human being to handle alone. It could contain text data from thousands of pages
and books at once.
3. Single Entity: When the input contains elements from just one source.
4. Multiple Entities: When the input contains elements from different document
sources. This is one of the most useful applications of this technique.
5. General Purpose: In this type of Text Summarization, no information about the
domain of the input is provided. The algorithm has no sense of the domain the text
deals with; it is trained on many different kinds of textual content.
6. Domain-Specific: The summarization is performed within a specific domain each
time. For example, a Text Summarization algorithm that condenses a food recipe into
just a few words works within the domain of food recipes and knows the context behind
the data.
7. Informative Summarization: This summary keeps all the information related to
the actual content. No change in meaning is seen in the output results.
8. Headlines Generation: News channels and applications use this type of summary.
Headlines are generated according to the text content of an article.
9. Keyword Extraction: Only the most important keywords and phrases are
extracted from the whole data. For example, extracting the phrases where some verbal
conversation is going on and leaving all the narrations behind.
These were just a few of the types and for further explanation, you can check the
article here:
We can create Python algorithms for any of the types explained above with the help
of the spaCy library. There are basically two techniques for building the final Text
Summarization model with spaCy in Python.
Below is a simple explanation of both techniques:
Extractive Technique: In this technique, the important phrases from the
actual content are taken together to build a simple and short summary.
Abstractive Technique: Builds a summary with new phrases and words
but keeps the original meaning alive.
Human beings generally use the abstractive method to summarize something: we use
words we are more familiar with. But there is one problem with summaries created by
humans. They are rarely neutral, and the essence of the writer's opinion can easily
be seen in the final output.
Machines are better at this task because they have no opinion of their own. They
simply work on the patterns and training we have provided and make decisions with
the artificial intelligence they have gained.
Applications of Text Summarization
1. News: There are multiple applications of this technique in the field of news. It
includes creating introductions, generating headlines, and embedding captions on
pictures.
2. Scientific Research: Algorithms are used to dig out important information from
scientific research papers. AI is outperforming human beings at this task.
3. Social Media Posting: Content on social media is expected to be concise.
Companies use this technique to convert long blog articles into shorter pieces suited
to the audience.
4. Creating Study Notes: Many applications use this process to create study notes
from vast syllabi and course content.
5. Conversation Summary: Long conversations and meeting recordings can first be
converted into text, and then the important information can be extracted from them.
6. Movie Plots and Reviews: The whole movie plot could be converted into bullet
points through this process.
7. Deliverable Feeds: These are short pieces of information derived from complete
informative articles. They are generally delivered to people through emails or feed
delivery services.
8. Content Writing: Not from scratch, but given a topic and a few points, an outlined
summary can be generated.
Although there are hundreds of other applications, we have limited ourselves to a few
main topics.
Text Summarization in Python With spaCy
We will be building some Python algorithms for performing the basics of automated
Text Summarization.
The spaCy library is our choice here, but you could go with any other machine
learning library you prefer.
We have already written an article on the complete implementation of the spaCy
library; you can read it on our blog.
1. Importing the Library
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
Here spacy.lang.en.stop_words contains the English stop words. Similarly, you can
access the stop words for any other language through its language code, such as 'en'
for English.
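As a quick sanity check, the imported objects can be inspected directly; a minimal sketch (the sample words below are just examples) looks like this:

# Inspect the stop-word set and the punctuation string we just imported.
print(len(STOP_WORDS))                 # number of English stop words shipped with spaCy
print("the" in STOP_WORDS)             # True - common function words are stop words
print("summarization" in STOP_WORDS)   # False - content words are kept
print(punctuation)                     # the standard ASCII punctuation characters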
2. Getting data
extra_words = list(STOP_WORDS) + list(punctuation) + ['\n']
nlp = spacy.load('en_core_web_sm')
doc = """Your Text Content Here"""
docx = nlp(doc)
spacy.load() is used to load the language object for English. Recent spaCy versions
require the full model name (for example en_core_web_sm) rather than the old 'en'
shorthand. extra_words is created to hold all the stop words and punctuation.
We have used a text about NLP from Algorithmia, but you could choose any text
material you have.
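If your text lives in a file instead of a string literal, a small variation works just as well. This is only a sketch; "article.txt" is a placeholder path:

# Load the text to summarize from a local file instead of a string literal.
with open("article.txt", encoding="utf-8") as f:
    doc = f.read()

docx = nlp(doc)
print(len(docx), "tokens,", len(list(docx.sents)), "sentences")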
3. Creating Vocabulary with spaCy
All the extra words are removed, and the count of every other word is entered into a
dictionary.
all_words = [word.text for word in docx]
Freq_word = {}
for w in all_words:
    w1 = w.lower()
    if w1 not in extra_words and w1.isalpha():
        if w1 in Freq_word.keys():
            Freq_word[w1] += 1
        else:
            Freq_word[w1] = 1
Output Screen:
Creating Vocabulary
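The same frequency table can be built more compactly with collections.Counter. This is an equivalent alternative, not part of the original code; since Counter is a dict subclass, the rest of the tutorial works unchanged:

from collections import Counter

# Count every lowercase alphabetic token that is not a stop word or punctuation.
Freq_word = Counter(
    word.text.lower()
    for word in docx
    if word.text.lower() not in extra_words and word.text.lower().isalpha()
)
print(Freq_word.most_common(10))  # the ten most frequent content words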
4. Assigning a Title – Headline Generation
With the help of spaCy, we can actually find a title for the content we have entered.
In this way, it could be used for headline generation.
val = sorted(Freq_word.values())
max_freq = val[-3:]
print("Topic of document given :-")
for word, freq in Freq_word.items():
    if freq in max_freq:
        print(word, end=" ")
    else:
        continue
Output Screen:
Headline Generation
We got “NLP Human Language Text” as our title, which is quite close to our text. It
is not the best title, but this is just a basic test of the features provided by the
Python library spaCy.
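If you prefer the title words ranked by importance, a small variation (assuming the Freq_word dictionary from the previous step) sorts the counts first before printing:

# Pick the top four keywords, ordered from most to least frequent.
top_keywords = sorted(Freq_word.items(), key=lambda kv: kv[1], reverse=True)[:4]
print("Suggested title words:", " ".join(word for word, _ in top_keywords))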
5. TFIDF
Short for Term Frequency – Inverse Document Frequency, it is used to represent how
important a given word is to a document relative to a complete collection of
documents. The code below applies a simplified version of this idea, normalizing each
word's count by the count of the most frequent word:
for word in Freq_word.keys():
    Freq_word[word] = (Freq_word[word] / max_freq[-1])
After getting the strength of each individual word, we can compute the strength of
each sentence. This way we will know the importance of each sentence, so that
sentences with little importance can be left out of the summary.
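For comparison, a fuller TF-IDF weighting can be sketched with scikit-learn's TfidfVectorizer, treating every sentence as its own document so that the inverse-document-frequency part is meaningful. This is not part of the original spaCy pipeline and assumes scikit-learn is installed:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Treat every sentence as its own "document" so IDF can be computed.
sentences = [sent.text for sent in docx.sents]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(sentences)

# Score each sentence by the sum of the TF-IDF weights of its words.
scores = np.asarray(tfidf_matrix.sum(axis=1)).ravel()
print(scores[:5])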
6. Sentence Strength
The sentence containing the most important words will itself carry the most
importance. We can compute this with the code below:
sent_strength = {}
for sent in docx.sents:
    for word in sent:
        if word.text.lower() in Freq_word.keys():
            if sent in sent_strength.keys():
                sent_strength[sent] += Freq_word[word.text.lower()]
            else:
                sent_strength[sent] = Freq_word[word.text.lower()]
        else:
            continue
Output Screen:
Sentence Strength Calculation
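As a quick check on the scores, the single strongest sentence can be printed. This is a small sketch using the sent_strength dictionary built above:

# Print the highest-scoring sentence found so far.
strongest = max(sent_strength, key=sent_strength.get)
print(strongest.text)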
7. Getting Important Sentences
We will now be sorting the sentences according to their strength and choosing only
according to the requirement. Sometimes the strength requirement is high for example
in controversial topics it is better to choose all the important factors.
The code to perform this would be:
top_sentences=(sorted(sent_strength.values())[::-1])
top30percent_sentence=int(0.3*len(top_sentences))
top_sent=top_sentences[:top30percent_sentence]
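An equivalent way to pick the top 30% of sentences is heapq.nlargest, which selects the sentence spans directly instead of matching on their strength values. This is a variation on the code above, not the article's original approach:

from heapq import nlargest

# Keep the 30% highest-scoring sentence spans directly.
n_top = int(0.3 * len(sent_strength))
top_spans = nlargest(n_top, sent_strength, key=sent_strength.get)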
8. Creating the Final Summary
Here the final summary is created using only the valuable information. All the
content with little to no importance is removed.
We will be using the following code to perform this:
summary = []
for sent, strength in sent_strength.items():
    if strength in top_sent:
        summary.append(sent)
    else:
        continue
for i in summary:
    print(i, end="")
Output Screen:
Text Summarization in Python
Here is our text summary, created in Python with the help of spaCy. If you want to
learn Python programming, do check out the free Python courses we have listed.
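Before wrapping up, here is the whole pipeline gathered into a single reusable function. This is only a sketch of the steps covered above; the model name en_core_web_sm and the 30% ratio are assumptions you can change:

from heapq import nlargest
from string import punctuation

import spacy
from spacy.lang.en.stop_words import STOP_WORDS


def summarize(text, ratio=0.3):
    """Return an extractive summary keeping roughly `ratio` of the sentences."""
    nlp = spacy.load("en_core_web_sm")
    docx = nlp(text)
    extra_words = list(STOP_WORDS) + list(punctuation) + ["\n"]

    # Word frequencies, normalised by the count of the most frequent word.
    freq = {}
    for word in docx:
        w = word.text.lower()
        if w not in extra_words and w.isalpha():
            freq[w] = freq.get(w, 0) + 1
    if not freq:
        return ""
    max_count = max(freq.values())
    freq = {w: c / max_count for w, c in freq.items()}

    # Score every sentence by the weights of the words it contains.
    sent_strength = {}
    for sent in docx.sents:
        for word in sent:
            w = word.text.lower()
            if w in freq:
                sent_strength[sent] = sent_strength.get(sent, 0) + freq[w]

    # Keep the strongest `ratio` of sentences, restored to document order.
    n_top = max(1, int(ratio * len(sent_strength)))
    top = nlargest(n_top, sent_strength, key=sent_strength.get)
    top = sorted(top, key=lambda s: s.start)
    return " ".join(sent.text.strip() for sent in top)


print(summarize("""Your Text Content Here"""))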
What are your thoughts on this summary?
It could be refined further with other spaCy filters, which we will look at in
upcoming tutorials.
For the complete code, you can check our GitHub repository.