Introduction to Data Science
CSL487
Lab Journal 7
Student Name
Enrolment No.
Class and Section
Department of Computer Science
BAHRIA UNIVERSITY, ISLAMABAD
Lab # 7: Natural Language Processing
Objectives:
o Lowercasing
o Remove HTML
o Remove URLs
o Remove punctuation
o Stop words removal
o Tokenization
o TF
o TF-IDF
Tools Used:
Anaconda (Jupyter Notebook)
Submission Date:
Evaluation: Signatures of Lab Engineer:
• Task 1: Perform the following preprocessing techniques on the movie reviews dataset:
o Lowercasing
o Remove HTML
o Remove URLs
o Remove punctuation
o Stop words removal
o Tokenization
o TF
o TF-IDF (see the note after this list)
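Note on TF and TF-IDF, as computed by scikit-learn's TfidfVectorizer in the code below: with use_idf=False and norm='l1', TF is the relative term frequency, tf(t, d) = count(t, d) / (total number of terms in d). TF-IDF multiplies the term frequency by an inverse document frequency weight; by default TfidfVectorizer uses the smoothed form idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, and then L2-normalizes each document vector.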
Program/Procedure:
import pandas as pd
import re
import nltk
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
# Download necessary NLTK data files
nltk.download('punkt')
nltk.download('stopwords')
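# Note: some recent NLTK versions also require the 'punkt_tab' resource for
# word_tokenize; if tokenization raises a LookupError naming it, add:
# nltk.download('punkt_tab')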
# Sample Movie Reviews Dataset
data = {
    'Review': [
        "<html><p>This movie was amazing! 10/10</p></html>",
        "Check out this review at https://moviereviews.com/123",
        "I didn't like this movie; it was too long...",
        "WOW! The acting was great, but the story was slow.",
        "An absolute masterpiece. Would recommend to everyone!"
    ]
}
df = pd.DataFrame(data)
print("Original Data:\n", df)
# 1. Lowercasing
df['Processed_Review'] = df['Review'].str.lower()
# 2. Remove HTML Tags
df['Processed_Review'] = df['Processed_Review'].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())
# 3. Remove URLs
df['Processed_Review'] = df['Processed_Review'].apply(lambda x: re.sub(r'http\S+|www\.\S+', '', x))
# 4. Remove Punctuation
df['Processed_Review'] = df['Processed_Review'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
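# Note: [^\w\s] keeps only word characters and whitespace, so digits and
# underscores survive (e.g. "10/10" becomes "1010").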
# 5. Tokenization
df['Tokens'] = df['Processed_Review'].apply(lambda x: word_tokenize(x))
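# word_tokenize first sentence-splits with the Punkt model downloaded above,
# then applies NLTK's Treebank word tokenizer to each sentence.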
# 6. Remove Stopwords (Handling Error for Missing Stopwords)
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))
df['Tokens'] = df['Tokens'].apply(lambda x: [word for word in x if word not in stop_words])
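# The comparison is case-sensitive; it works here because the text was
# lowercased in step 1, matching NLTK's lowercase stopword list. One caveat:
# punctuation removal turned "didn't" into "didnt", which no longer matches
# NLTK's stopword entries ("didn't", "didn").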
# 7. Convert Tokens Back to Processed Text (For TF and TF-IDF)
df['Processed_Review'] = df['Tokens'].apply(lambda x: ' '.join(x))
# 8. Term Frequency (TF)
tf_vectorizer = TfidfVectorizer(use_idf=False, norm='l1')
tf_matrix = tf_vectorizer.fit_transform(df['Processed_Review'])
df_tf = pd.DataFrame(tf_matrix.toarray(), columns=tf_vectorizer.get_feature_names_out())
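# With use_idf=False and norm='l1', each row holds relative term frequencies:
# the count of a term in the review divided by the total terms in that review.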
# 9. Term Frequency-Inverse Document Frequency (TF-IDF)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Processed_Review'])
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
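# TfidfVectorizer defaults to smooth_idf=True and norm='l2':
# idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each row is scaled to unit
# Euclidean length, making weights comparable across reviews.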
# Display Processed DataFrames
print("\nProcessed Reviews:\n", df[['Review', 'Processed_Review']])
print("\nTerm Frequency (TF):\n", df_tf)
print("\nTF-IDF:\n", df_tfidf)
Analysis: