Introduction to Data Science
CSL487
Lab Journal 7
Student Name
Enrolment No.
Class and Section
Department of Computer Science
BAHRIA UNIVERSITY, ISLAMABAD
Lab # 7: Natural Language Processing
Objectives:
o Lowercasing
o Remove HTML
o Remove URLs
o Remove punctuation
o Stop words removal
o Tokenization
o TF
o TF-IDF
Tools Used:
Anaconda (Jupyter Notebook)
Submission Date:
Evaluation: Signatures of Lab Engineer:
• Task 1: Perform the following preprocessing techniques on the movie reviews dataset:
o Lowercasing
o Remove HTML
o Remove URLs
o Remove punctuation
o Stop words removal
o Tokenization
o TF
o TF-IDF (see the note after this list)
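Note on TF and TF-IDF, as computed by scikit-learn's TfidfVectorizer in the code below: with use_idf=False and norm='l1', TF is the relative term frequency, tf(t, d) = count(t, d) / (total number of terms in d). TF-IDF multiplies the term frequency by an inverse document frequency weight; by default TfidfVectorizer uses the smoothed form idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, and then L2-normalizes each document vector.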
Program/Procedure:
import pandas as pd
import re
import nltk
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
# Download necessary NLTK data files
nltk.download('punkt')
nltk.download('stopwords')
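# Note: some recent NLTK versions also require the 'punkt_tab' resource for
# word_tokenize; if tokenization raises a LookupError naming it, add:
# nltk.download('punkt_tab')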
# Sample Movie Reviews Dataset
data = {
    'Review': [
        "<html><p>This movie was amazing! 10/10</p></html>",
        "Check out this review at https://moviereviews.com/123",
        "I didn't like this movie; it was too long...",
        "WOW! The acting was great, but the story was slow.",
        "An absolute masterpiece. Would recommend to everyone!"
    ]
}
df = pd.DataFrame(data)
print("Original Data:\n", df)
# 1. Lowercasing
df['Processed_Review'] = df['Review'].str.lower()
# 2. Remove HTML Tags
df['Processed_Review'] = df['Processed_Review'].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())
# 3. Remove URLs
df['Processed_Review'] = df['Processed_Review'].apply(lambda x: re.sub(r'http\S+|www\.\S+', '', x))
# 4. Remove Punctuation
df['Processed_Review'] = df['Processed_Review'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
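# Note: [^\w\s] keeps only word characters and whitespace, so digits and
# underscores survive (e.g. "10/10" becomes "1010").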
# 5. Tokenization
df['Tokens'] = df['Processed_Review'].apply(lambda x: word_tokenize(x))
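# word_tokenize first sentence-splits with the Punkt model downloaded above,
# then applies NLTK's Treebank word tokenizer to each sentence.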
# 6. Remove Stopwords (Handling Error for Missing Stopwords)
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))
df['Tokens'] = df['Tokens'].apply(lambda x: [word for word in x if word not in stop_words])
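# The comparison is case-sensitive; it works here because the text was
# lowercased in step 1, matching NLTK's lowercase stopword list. One caveat:
# punctuation removal turned "didn't" into "didnt", which no longer matches
# NLTK's stopword entries ("didn't", "didn").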
# 7. Convert Tokens Back to Processed Text (For TF and TF-IDF)
df['Processed_Review'] = df['Tokens'].apply(lambda x: ' '.join(x))
# 8. Term Frequency (TF)
tf_vectorizer = TfidfVectorizer(use_idf=False, norm='l1')
tf_matrix = tf_vectorizer.fit_transform(df['Processed_Review'])
df_tf = pd.DataFrame(tf_matrix.toarray(), columns=tf_vectorizer.get_feature_names_out())
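# With use_idf=False and norm='l1', each row holds relative term frequencies:
# the count of a term in the review divided by the total terms in that review.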
# 9. Term Frequency-Inverse Document Frequency (TF-IDF)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Processed_Review'])
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
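# TfidfVectorizer defaults to smooth_idf=True and norm='l2':
# idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each row is scaled to unit
# Euclidean length, making weights comparable across reviews.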
# Display Processed DataFrames
print("\nProcessed Reviews:\n", df[['Review', 'Processed_Review']])
print("\nTerm Frequency (TF):\n", df_tf)
print("\nTF-IDF:\n", df_tfidf)
Analysis: