
NLP Lab

The document outlines Lab Journal 7 for a Data Science course focusing on Natural Language Processing techniques. It details objectives such as lowercasing, HTML removal, URL removal, punctuation removal, stop words removal, tokenization, and calculating term frequency (TF) and term frequency-inverse document frequency (TF-IDF). The lab utilizes Python with libraries like Pandas, NLTK, and BeautifulSoup to preprocess movie reviews data and analyze the results.

Introduction to Data Science

CSL487

Lab Journal 7

Student Name
Enrolment No.
Class and Section

Department of Computer Science


BAHRIA UNIVERSITY, ISLAMABAD
Lab # 7: Natural Language Processing

Objectives:

o Lowercasing

o Remove HTML

o Remove URLs

o Remove punctuation

o Stop words removal

o Tokenization

o Term frequency (TF)

o Term frequency-inverse document frequency (TF-IDF)

Tools Used:

Anaconda-Jupyter notebook

Submission Date:

Evaluation:

Signatures of Lab Engineer:


• Task 1: Apply the following preprocessing techniques to the movie reviews data:
o Lowercasing
o Remove HTML
o Remove URLs
o Remove punctuation
o Stop words removal
o Tokenization
o Term frequency (TF)
o Term frequency-inverse document frequency (TF-IDF)

Program/Procedure:
import re

import nltk
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

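# Download the NLTK data files needed for tokenization and stop words
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead of 'punkt'
nltk.download('stopwords')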

# Sample movie reviews dataset
data = {
    'Review': [
        "<html><p>This movie was amazing! 10/10</p></html>",
        "Check out this review at https://moviereviews.com/123",
        "I didn't like this movie; it was too long...",
        "WOW! The acting was great, but the story was slow.",
        "An absolute masterpiece. Would recommend to everyone!"
    ]
}

df = pd.DataFrame(data)
print("Original Data:\n", df)

# 1. Lowercasing

df['Processed_Review'] = df['Review'].str.lower()

# 2. Remove HTML tags
df['Processed_Review'] = df['Processed_Review'].apply(
    lambda x: BeautifulSoup(x, "html.parser").get_text()
)

# 3. Remove URLs (the dot after "www" is escaped so it matches a literal dot)
df['Processed_Review'] = df['Processed_Review'].apply(lambda x: re.sub(r'http\S+|www\.\S+', '', x))

# 4. Remove Punctuation

df['Processed_Review'] = df['Processed_Review'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

# 5. Tokenization
df['Tokens'] = df['Processed_Review'].apply(word_tokenize)

# 6. Remove stop words (download the list first if it is missing)
# Note: step 4 already stripped apostrophes, so a contraction such as
# "didn't" is now "didnt" and is unlikely to match the stop word list.
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

df['Tokens'] = df['Tokens'].apply(lambda x: [word for word in x if word not in stop_words])

# 7. Convert Tokens Back to Processed Text (For TF and TF-IDF)

df['Processed_Review'] = df['Tokens'].apply(lambda x: ' '.join(x))

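# 8. Term Frequency (TF)
# With use_idf=False and norm='l1', each row holds term counts divided by
# the total number of counted terms in that review (relative frequency).
tf_vectorizer = TfidfVectorizer(use_idf=False, norm='l1')
tf_matrix = tf_vectorizer.fit_transform(df['Processed_Review'])
df_tf = pd.DataFrame(tf_matrix.toarray(), columns=tf_vectorizer.get_feature_names_out())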

# 9. Term Frequency-Inverse Document Frequency (TF-IDF)

tfidf_vectorizer = TfidfVectorizer()

tfidf_matrix = tfidf_vectorizer.fit_transform(df['Processed_Review'])

df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Display Processed DataFrames

print("\nProcessed Reviews:\n", df[['Review', 'Processed_Review']])

print("\nTerm Frequency (TF):\n", df_tf)

print("\nTF-IDF:\n", df_tfidf)
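
To help interpret the TF and TF-IDF tables printed above, the following is a minimal pure-Python sketch of the arithmetic behind them. It assumes scikit-learn's default settings for TfidfVectorizer (smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row). The three documents and the names docs, doc_freq, term_frequency and tfidf are hypothetical and chosen only for illustration; they are not the lab's actual output.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned documents (illustration only)
docs = [
    "movie amazing",
    "like movie long",
    "acting great story slow",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)


def term_frequency(tokens):
    """Relative frequency: each term count divided by the document length."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: counts[t] / total for t in vocab}


# Document frequency: number of documents that contain each term
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed IDF (assumed to match scikit-learn's default, smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}


def tfidf(tokens):
    """Raw count times IDF, then scaled so the row has unit L2 norm."""
    counts = Counter(tokens)
    raw = {t: counts[t] * idf[t] for t in vocab}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: (v / norm if norm else 0.0) for t, v in raw.items()}


tf_rows = [term_frequency(doc) for doc in tokenized]
tfidf_rows = [tfidf(doc) for doc in tokenized]

for doc, row in zip(docs, tfidf_rows):
    # Show only the terms that occur in the document
    print(doc, {t: round(v, 3) for t, v in row.items() if v > 0})
```

In this sketch "movie" appears in two of the three documents, so its IDF is lower than that of "amazing", and it receives the smaller TF-IDF weight in the first document even though both terms occur once there. The same effect should be visible in the TF-IDF table produced by the lab code for words shared across several reviews.
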
Analysis:
