
Text Summarization using Natural Language Processing

Introduction- Every day we are inundated with information. We read numerous articles on a daily basis, so a great deal of data is moving about, largely in the form of text. To learn what an article says, we normally have to read the entire piece, and many articles become excessively long; a 5000-word article, for example, takes a long time to get through. If the useful information is contained in 1000 of those words, reading all 5000 is largely a waste of time, and if several such articles have to be read for work, the lost hours add up. The goal of text summarization is to devise a method, using natural language processing, that condenses a text automatically. Such a method not only saves time in comprehending a single text, it also allows someone to read multiple texts in a short period, saving time in the long term.
Objective-
1. Extract the useful information from a large amount of text.
2. Reduce reading time.
3. Enable reading of more articles, since the time spent on each article is reduced, so that more information can be gathered from different articles without losing much time.
4. Capture only the most significant aspects of the content, which allows one to process more information while reading.
Problem Statement- An article about Sachin Tendulkar, around 691 words long, has been collected from the internet. Text summarization will be performed using Natural Language Processing (NLP) to extract the important points of the article, which are enough to gain an understanding of the idea of the text.
The code to achieve this text summarization is written below.
Google Drive was mounted, and the .txt file containing the document was read and stored in a list named contents.

# google drive was mounted to read file

from google.colab import drive

drive.mount('/content/drive/')

# file was read and stored in a list named contents

f=open('/content/drive/MyDrive/Text.txt','r',encoding='latin1')

f1=f.readlines()

contents=[]
for line in f1:
    contents.append(line)

contents

The list was converted to a string, and then the characters "\x91" and "\x92" were replaced with an apostrophe; the result was kept in a variable named text.

#list contents was converted to string and stored in text

text = ' '.join([str(elem) for elem in contents])

# characters \x91 and \x92 were replaced with ' and the result kept in variable text

text=text.replace("\x91","'")

text=text.replace("\x92","'")

text
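As a side note, the \x91 and \x92 bytes are Windows-1252 curly quote characters; assuming the file really is Windows-1252 encoded, an alternative sketch is to open it with the cp1252 encoding so the quotes are decoded directly and no manual replacement is needed.

# alternative sketch (assumes Windows-1252 encoding): curly quotes are decoded automatically
with open('/content/drive/MyDrive/Text.txt', 'r', encoding='cp1252') as f:
    text = f.read()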

The length of the text, in characters, was found.

len(text)

The number of words in the string was counted.

f=len(text.split())

print ("The number of words in the given text is : " + str(f))


Importing the required libraries.
The spacy library was imported, and STOP_WORDS was imported from spaCy.
From the string module, the punctuation string was imported.

import spacy

from spacy.lang.en.stop_words import STOP_WORDS

from string import punctuation

The small English model "en_core_web_sm" has been loaded.

nlp= spacy.load("en_core_web_sm")

The whole text has been passed through the nlp model and the result assigned to a doc object.

doc=nlp(text)

A list comprehension was used to iterate over every single token; these are the tokens to be worked upon.

tokens=[token.text for token in doc]

print(tokens)

These are all the punctuation characters, and one extra character, '\n', has been added to them.

punctuation=punctuation+'\n'

Text Cleaning
An empty dictionary word_freq has been created.

word_freq={}

List of STOP_WORDS has been stored in the stop_words variable.

stop_words= list(STOP_WORDS)

A loop has been run over the doc to get those words that are not in the list of STOP_WORDS and also not in the punctuation list; each such word was added to the word_freq dictionary, and the number of times it appears in doc was stored as its value.

for word in doc:
    if word.text.lower() not in stop_words:
        if word.text.lower() not in punctuation:
            if word.text not in word_freq.keys():
                word_freq[word.text] = 1
            else:
                word_freq[word.text] += 1

print(word_freq)
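As an aside, the same frequency table can be built more compactly with collections.Counter from the standard library; the snippet below is only an equivalent sketch of the loop above, not a change of method.

from collections import Counter

# equivalent sketch: count every token that is neither a stop word nor punctuation
word_freq = Counter(
    word.text for word in doc
    if word.text.lower() not in stop_words and word.text.lower() not in punctuation
)
print(word_freq)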

The maximum number of times any word appears has been figured out and stored in the variable max_freq.

x=(word_freq.values())

a=list(x)

a.sort()

max_freq=a[-1]

max_freq
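Equivalently, the built-in max() function gives the same result in a single line:

# equivalent one-liner for the sort-and-take-last approach above
max_freq = max(word_freq.values())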
All the scores of the words in the word_freq dictionary have been normalized by dividing each value in the dictionary by max_freq; to do this, a loop was run over the word_freq dictionary and all the values were updated.

for word in word_freq.keys():
    word_freq[word]=word_freq[word]/max_freq

print(word_freq)

Sentence Tokenization

Sentences in the doc object have been segmented using a list comprehension and kept in the variable sent_tokens; an empty dictionary sent_score has also been created, which will be used in the next step.

sent_score={}

sent_tokens=[sent for sent in doc.sents]

print(sent_tokens)

The score of each individual sentence has been found based on the word_freq counter. The empty dictionary sent_score holds each sentence as a key and its score as the value. A loop iterated over each individual sentence, checking whether the words in that sentence appear in the word_freq dictionary; based on the scores of those words in word_freq, the sentence's score in sent_score was accumulated.

for sent in sent_tokens:
    for word in sent:
        if word.text.lower() in word_freq.keys():
            if sent not in sent_score.keys():
                sent_score[sent] = word_freq[word.text.lower()]
            else:
                sent_score[sent] += word_freq[word.text.lower()]

print(sent_score)

Select the 30% of sentences with the maximum score

The 30% of sentences having the maximum score in the sent_score dictionary have been grabbed. From the heapq module, the nlargest function was imported, and 30% of the total number of sentences in sent_score was evaluated; based on this, at most 13 sentences will be extracted, and these contain all the important information.

from heapq import nlargest

len(sent_score) *0.3
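To avoid hard-coding the sentence count later, the 30% figure can also be turned into an integer, as in the small sketch below (assuming the intent is to round down); the hard-coded n=13 in the next step could then be written as n=select_len instead.

# number of sentences to keep, derived from the 30% ratio
select_len = int(len(sent_score) * 0.3)
select_len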

Getting Summary
Three parameters were passed to the nlargest() function. The first parameter is the maximum number of sentences, which in this case is 13. The second parameter is the iterable the function is applied to, which here is sent_score. The third parameter is the key on which the selection is based, here sent_score.get; get is used as a function that returns the values of sent_score, and based on those values the 13 highest-scoring sentences are returned.
summary=nlargest(n=13,iterable=sent_score,key=sent_score.get)

print(summary)

List comprehension was applied to get the final summarized text.

final_summary=[word.text for word in summary]

final_summary
The re module was imported to perform a regex operation.

import re

An empty list f1 was created and a loop was run over the final extracted sentences; a regex substitution was used to remove '\n' from each sentence, and the result was appended to f1.

f1=[]

for sub in final_summary:
    f1.append(re.sub('\n', '', sub))

f1
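Since only a literal newline character is being removed, the built-in str.replace() method would work just as well without importing re; a minimal equivalent sketch:

# equivalent sketch without the re module
f1 = [sub.replace('\n', '') for sub in final_summary]
f1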

The list of final summarized sentences was converted to a string using the join() function and kept in the variable f2.

f2=" ".join(f1)

f2

The split() function was used to count the number of words in the final string.
f3=len(f2.split())

print ("The number of words in final summary is : " + str(f3))

Conclusion- The article on Sachin Tendulkar was condensed from a 691-word original into a 259-word document, and this condensed document contains the vital information that is the essence of the entire piece, making it understandable in a short amount of time.
My Linkedin ID is: Linkedin
