NLP Projects
Text Concordance

import nltk
from nltk.corpus import gutenberg
from nltk.text import Text

corpus = gutenberg.words("shakespeare-macbeth.txt")
text = Text(corpus)
text.concordance("monstrous")

Output

Displaying 1 of 1 matches:
Who cannot want the thought , how monstrous It was for Malcolme , and for Dona.

Vocabulary Count

import nltk
nltk.download('punkt')  # tokenizer models needed by word_tokenize

text = "welcome to the world"
words = nltk.word_tokenize(text)
num_words = len(words)
num_the = words.count('the')
unique_words = set(words)
num_unique_words = len(unique_words)
percent_unique = (num_unique_words / num_words) * 100
print(words)
print("the number of words:", num_words)
print('number of occurrences of "the":', num_the)
print("number of unique words:", num_unique_words)
print("percentage of unique words:", percent_unique)

Output

['welcome', 'to', 'the', 'world']
the number of words: 4
number of occurrences of "the": 1
number of unique words: 4
percentage of unique words: 100.0
Text Preprocessing

import nltk
nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')
nltk.download('words')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.corpus import words

text = 'This is a sample text that we used to demonstrate NLTK text processing 123'
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

alpha_tokens = [token.lower() for token in tokens if token.isalpha()]
english_words = set(words.words())
valid_tokens = [token for token in alpha_tokens if token in english_words]
filtered_tokens = [token for token in valid_tokens if token not in stop_words]
stemmer_tokens = [stemmer.stem(token) for token in filtered_tokens]

print("Original text :", text)
print("Tokenized text :", tokens)
print("Filtered text :", filtered_tokens)
print("Validated text :", valid_tokens)
print("Alpha text :", alpha_tokens)
print("Stemmed text :", stemmer_tokens)

Output

Original text : This is a sample text that we used to demonstrate NLTK text processing 123
Tokenized text : ['This', 'is', 'a', 'sample', 'text', 'that', 'we', 'used', 'to', 'demonstrate', 'NLTK', 'text', 'processing', '123']
Filtered text : ['sample', 'text', 'used', 'demonstrate', 'text']
Validated text : ['this', 'is', 'a', 'sample', 'text', 'that', 'we', 'used', 'to', 'demonstrate', 'text']
Alpha text : ['this', 'is', 'a', 'sample', 'text', 'that', 'we', 'used', 'to', 'demonstrate', 'nltk', 'text', 'processing']

Bag of Words

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document",
          "This document is the second document",
          "And this is the third one",
          "Is this the first document"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

for i in range(len(corpus)):
    print(f"BoW representation of Document {i+1}: {X[i].toarray()[0]}")

Output

BoW representation of Document 1: [0 1 1 1 0 0 1 0 1]
BoW representation of Document 2: [0 2 0 1 0 1 1 0 1]
BoW representation of Document 3: [1 0 0 1 1 0 1 1 1]
BoW representation of Document 4: [0 1 1 1 0 0 1 0 1]
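
The columns of each BoW vector follow the vectorizer's learned vocabulary (alphabetical); a small optional check continuing the example above (get_feature_names_out assumes scikit-learn >= 1.0, older versions expose get_feature_names):

print(vectorizer.get_feature_names_out())
# e.g. ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
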
TF-IDF

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from collections import Counter
import math

def calculate_tf(word, document):
    word_frequency = document.count(word)
    return word_frequency / len(document)

def calculate_idf(word, corpus):
    num_documents_containing_word = len([True for document in corpus if word in document])
    if num_documents_containing_word == 0:
        return 0
    else:
        return math.log10(len(corpus) / num_documents_containing_word)

def calculate_tfidf(document, corpus):
    PS = PorterStemmer()
    stop_words = set(stopwords.words('english'))
    words = [PS.stem(word.lower()) for word in word_tokenize(document) if word.lower() not in stop_words]
    word_tfidf_values = {}
    for word in words:
        if word not in word_tfidf_values:
            tf = calculate_tf(word, words)
            idf = calculate_idf(word, corpus)
            word_tfidf_values[word] = tf * idf
    return word_tfidf_values

corpus = ["This is the first document",
          "This document is the second document",
          "And this is the third one",
          "Is this the first document"]
document = "This is the second document"
tfidf_vector = calculate_tfidf(document, corpus)
print(tfidf_vector)

Output

{'second': 0.3010299956639812, 'document': 0.06246936830414996}
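
scikit-learn's TfidfVectorizer computes the same idea end to end, but with a smoothed IDF and L2-normalised rows, so its numbers will differ from the hand-rolled log10 version above; a minimal sketch reusing the same corpus list:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)          # one TF-IDF vector per document
print(tfidf.get_feature_names_out())     # vocabulary (column order)
print(X.toarray()[1])                    # weights for the second document
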

Pos Tagging

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

line = "quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(line)
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
pos_tags = nltk.pos_tag(lemmatized_tokens)
pos_word_corpus = [(word, tag) for word, tag in pos_tags]

for word, tag in pos_word_corpus:
    print(word, ":", tag)

Output

quick : JJ
brown : NN
fox : JJ
jump : NN
lazy : NN
dog : NN
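
The tags printed above are Penn Treebank tags; NLTK can describe any of them once the tagset help resource is downloaded (resource name assumed to be 'tagsets'):

import nltk
nltk.download('tagsets')
nltk.help.upenn_tagset('JJ')   # prints the definition and examples for the JJ (adjective) tag
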
Named Entity Recognition

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

text = "Josh works for Twitter in California."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)
entities = nltk.chunk.ne_chunk(tagged)
for entity in entities:
    if hasattr(entity, 'label'):
        print(entity.label(), ' '.join(c[0] for c in entity.leaves()))

Output

PERSON Josh
GPE Twitter
GPE California
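
ne_chunk also accepts binary=True, which collapses all entity types into a single NE label when only the spans matter; a small optional sketch reusing the tagged tokens above:

binary_entities = nltk.chunk.ne_chunk(tagged, binary=True)
for entity in binary_entities:
    if hasattr(entity, 'label'):
        print(entity.label(), ' '.join(c[0] for c in entity.leaves()))
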
Pos Tagging via HMM

import nltk
nltk.download('brown')
from nltk.corpus import brown

def train_hmm_tagger():
    tagged_sentence = brown.tagged_sents(categories='news')
    size = int(len(tagged_sentence) * 0.9)
    trained_sents = tagged_sentence[:size]
    test_sents = tagged_sentence[size:]
    symbols = set([word for sentence in tagged_sentence for word, _ in sentence])
    states = set([tag for sentence in tagged_sentence for _, tag in sentence])
    trainer = nltk.tag.hmm.HiddenMarkovModelTrainer(states=states, symbols=symbols)
    hmm_tagger = trainer.train_supervised(trained_sents)
    return hmm_tagger

def pos_tag_sentence(sentence, hmm_tagger):
    tokens = nltk.word_tokenize(sentence)
    tagged_tokens = hmm_tagger.tag(tokens)
    return tagged_tokens

hmm_tagger = train_hmm_tagger()
sentence = input("Enter the sentence to be tagged?")
tagged = pos_tag_sentence(sentence, hmm_tagger)
print(tagged)

Output

Enter the sentence to be tagged?
The sky is so beautiful.

[('The', 'AT'), ('sky', 'NN'), ('is', 'BEZ'), ('so', 'QL'), ('beautiful', 'JJ')]
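
train_hmm_tagger sets aside test_sents but never scores the tagger; a small optional sketch that re-derives the same 90/10 split and evaluates on a short slice (Viterbi tagging of the full held-out set is slow; on older NLTK use evaluate instead of accuracy):

tagged_sentence = brown.tagged_sents(categories='news')
size = int(len(tagged_sentence) * 0.9)
test_sents = tagged_sentence[size:]
print(hmm_tagger.accuracy(test_sents[:50]))   # fraction of tokens tagged correctly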

Chatbot

import nltk
from nltk.chat.util import Chat, reflections

pairs = [[r"Hello|hi|hey|hola",
          ["Hello, I am Aura, your AI assistant. How may I help you?"]],
         [r"How are you|How are you doing",
          ["I'm good, how about you?"]],
         [r"What song always gets you in a good mood?",
          ['"Happy" by Pharrell Williams never fails to put a smile on my face.']],
         [r"Suggest a trending song",
          ['Good 4 U by Olivia Rodrigo',
           'Montero (Call Me By Your Name) by Lil Nas X',
           'Save Your Tears by The Weeknd',
           'Levitating by Dua Lipa']],
         [r"quit", ["Good bye"]],
         [r"(.*)", ["Could you try again?"]]]

bot = Chat(pairs, reflections)
bot.converse()

Output

>hi
Hello, I am Aura, your AI assistant. How may I help you?
>how are you
I'm good, how about you?
>What song always gets you in a good mood?
"Happy" by Pharrell Williams never fails to put a smile on my face.
>Suggest a trending song
Save Your Tears by The Weeknd
>bye
Good bye
TEXT CLASSIFICATION USING LOGISTIC REGRESSION

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def preprocess(text):
    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))
    words = [word_tokenize(sentence) for sentence in text]
    filtered_words = [[ps.stem(word) for word in tokenized
                       if word not in stop_words and word.isalpha()]
                      for tokenized in words]
    filtered_sentences = [' '.join(sentence) for sentence in filtered_words]
    return filtered_sentences

sentences = ["The food is tasty", "the quality of food is low",
             "i will never recommend their food",
             "I got sick after having their food",
             "I was in cloudnine after tasting their food",
             "My favourite is their desserts",
             "the food was not cooked properly"]
classes = [1, 0, 0, 0, 1, 1, 0]
test_sentences = ["food is not cooked properly", "I feel sick after having food",
                  "I love their desserts", "was in cloudnine after tasting their food"]

vectorizer = CountVectorizer()
sentences = preprocess(sentences)
vect1 = vectorizer.fit_transform(sentences)
# Splitting data for testing
# train_data, test_data, train_labels, test_labels = train_test_split(vect1, classes, test_size=0.2, random_state=42)
nb = LogisticRegression()
nb.fit(vect1, classes)
test_sentences = preprocess(test_sentences)
vect2 = vectorizer.transform(test_sentences)
pred_classes = nb.predict(vect2)
print(pred_classes)

Output

[0 0 1 1]
TEXT CLASSIFICATION USING NAÏVE BAYES

import nltk
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.classify import NaiveBayesClassifier
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
nltk.download('movie_reviews')
nltk.download('stopwords')
nltk.download('punkt')    # needed by word_tokenize
nltk.download('wordnet')  # needed by WordNetLemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())
    tokens = [lemmatizer.lemmatize(token) for token in tokens
              if token not in stop_words and token.isalpha()]
    return dict(nltk.FreqDist(tokens))

pos_reviews = [(movie_reviews.raw(fileid), 'positive') for fileid in movie_reviews.fileids('pos')]
neg_reviews = [(movie_reviews.raw(fileid), 'negative') for fileid in movie_reviews.fileids('neg')]
tot_rev = pos_reviews + neg_reviews
processed_data = [(preprocess(text), category) for (text, category) in tot_rev]
train_data, val_data = train_test_split(processed_data, test_size=0.2, random_state=42)

classifier = NaiveBayesClassifier.train(train_data)
new_text = ["The movie was amazing", "the movie was terrible", "The movie was awful"]
for text in new_text:
    new_features = preprocess(text)
    predicted_category = classifier.classify(new_features)
    print(f"The predicted category for '{text}' is '{predicted_category}'")

Output

The predicted category for 'The movie was amazing' is 'positive'
The predicted category for 'the movie was terrible' is 'negative'
The predicted category for 'The movie was awful' is 'negative'
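
The split above produces val_data that is never used; a small optional sketch scoring the classifier on it and listing its most informative features:

from nltk.classify import accuracy

print("Validation accuracy:", accuracy(classifier, val_data))
classifier.show_most_informative_features(10)   # top features by likelihood ratio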
