mapreduce.ipynb
import nltk
import string
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
# Download necessary resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
# Download the 'punkt_tab' data package (required by recent NLTK releases for word_tokenize)
nltk.download('punkt_tab')
# Sample large text (replace this with your own text)
large_text = """
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) focused on enabling machines to understand and process human languages.
It encompasses a range of tasks such as text analysis, machine translation, and sentiment analysis. NLP applications are widely used in chatbots, virtual assistants,
language translation services, and more. Despite significant advancements, NLP still faces challenges such as ambiguity, context understanding, and linguistic diversity.
"""
### Preprocessing Techniques ###
# 1. Remove special characters, numbers, and extra spaces
def clean_text(text):
    text = re.sub(r'\d+', '', text)      # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove special characters
    text = text.strip()                  # Remove leading/trailing spaces
    text = re.sub(r'\s+', ' ', text)     # Collapse extra spaces
    return text
# 2. Tokenize the text into words
def tokenize_text(text):
    return word_tokenize(text)
# 3. Convert tokens to lowercase
def to_lowercase(tokens):
    return [token.lower() for token in tokens]
# 4. Remove stopwords
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    return [token for token in tokens if token not in stop_words]
# 5. Lemmatize tokens
def lemmatize_tokens(tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]
# 6. Stem tokens (Optional, for comparison)
def stem_tokens(tokens):
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]
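# Aside (illustrative sketch, not from the original notebook): the lemmatizer
# and the stemmer disagree on verb forms because WordNetLemmatizer assumes a
# noun POS unless told otherwise, while the Porter stemmer strips suffixes
# unconditionally.
_lemmatizer = WordNetLemmatizer()
print(_lemmatizer.lemmatize("focused"))           # 'focused' (noun POS assumed)
print(_lemmatizer.lemmatize("focused", pos='v'))  # 'focus'
print(PorterStemmer().stem("focused"))            # 'focus'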
### Applying the Preprocessing Steps ###
# Step 1: Clean text
cleaned_text = clean_text(large_text)
print("\n--- Cleaned Text ---\n", cleaned_text)
# Step 2: Tokenize
tokens = tokenize_text(cleaned_text)
print("\n--- Tokens ---\n", tokens)
# Step 3: Convert to lowercase
lower_tokens = to_lowercase(tokens)
print("\n--- Lowercase Tokens ---\n", lower_tokens)
# Step 4: Remove Stopwords
filtered_tokens = remove_stopwords(lower_tokens)
print("\n--- Tokens without Stopwords ---\n", filtered_tokens)
# Step 5: Lemmatize
lemmatized_tokens = lemmatize_tokens(filtered_tokens)
print("\n--- Lemmatized Tokens ---\n", lemmatized_tokens)
# Step 6: Stem (Optional)
stemmed_tokens = stem_tokens(filtered_tokens)
print("\n--- Stemmed Tokens ---\n", stemmed_tokens)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt_tab.zip.
--- Cleaned Text ---
 Natural Language Processing NLP is a subfield of artificial intelligence AI focused on enabling machines to understand and process human languages It en…

--- Tokens ---
 ['Natural', 'Language', 'Processing', 'NLP', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', 'AI', 'focused', 'on', 'enabling', 'machines', …

--- Lowercase Tokens ---
 ['natural', 'language', 'processing', 'nlp', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', 'ai', 'focused', 'on', 'enabling', 'machines', …

--- Tokens without Stopwords ---
 ['natural', 'language', 'processing', 'nlp', 'subfield', 'artificial', 'intelligence', 'ai', 'focused', 'enabling', 'machines', 'understand', 'process', …

--- Lemmatized Tokens ---
 ['natural', 'language', 'processing', 'nlp', 'subfield', 'artificial', 'intelligence', 'ai', 'focused', 'enabling', 'machine', 'understand', 'process', …

--- Stemmed Tokens ---
 ['natur', 'languag', 'process', 'nlp', 'subfield', 'artifici', 'intellig', 'ai', 'focus', 'enabl', 'machin', 'understand', 'process', 'human', 'languag', …
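The six steps above compose into a single reusable function. A minimal sketch (the preprocess helper below is an illustration added here, not part of the original notebook):

def preprocess(text, use_stemming=False):
    """Run the full cleaning pipeline and return normalized tokens."""
    tokens = remove_stopwords(to_lowercase(tokenize_text(clean_text(text))))
    return stem_tokens(tokens) if use_stemming else lemmatize_tokens(tokens)

# preprocess(large_text)                     -> same as the lemmatized tokens above
# preprocess(large_text, use_stemming=True)  -> same as the stemmed tokens above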
from collections import defaultdict
# Sample large article
article = """
MapReduce is a programming model that is widely used for processing and generating large datasets.
This model was introduced by Google and is now a core concept in big data. MapReduce involves two key functions: Map and Reduce.
The Map function processes key-value pairs and produces intermediate key-value pairs.
The Reduce function merges all intermediate values associated with the same key.
Together, MapReduce enables distributed processing of large datasets across a cluster of computers.
"""
# ----------------------------------------
# Step 1: Mapper
# ----------------------------------------
def mapper(text):
    """
    The mapper splits the text into words and emits key-value pairs (word, 1).
    """
    print("### Step 1: Mapper ###")
    words = text.lower().replace('.', '').replace(',', '').split()  # Normalize text
    mapped_data = [(word, 1) for word in words]  # Emit (word, 1) for each word
    print("Mapper Output:", mapped_data[:10], "...")  # Print first 10 pairs for illustration
    return mapped_data
# ----------------------------------------
# Step 2: Shuffle and Sort
# ----------------------------------------
def shuffle_and_sort(mapped_data):
    """
    The shuffle and sort phase groups the key-value pairs by key.
    """
    print("\n### Step 2: Shuffle and Sort ###")
    grouped_data = defaultdict(list)
    for word, count in mapped_data:
        grouped_data[word].append(count)
    print("Shuffled and Sorted Output (Sample):", dict(list(grouped_data.items())[:5]))  # Print first 5 groups
    return grouped_data
# ----------------------------------------
# Step 3: Reducer
# ----------------------------------------
def reducer(shuffled_data):
    """
    The reducer aggregates the values for each key by summing them up.
    """
    print("\n### Step 3: Reducer ###")
    reduced_data = {word: sum(counts) for word, counts in shuffled_data.items()}
    print("Reducer Output (Sample):", dict(list(reduced_data.items())[:5]))  # Print first 5 reduced results
    return reduced_data
# ----------------------------------------
# Combine the Steps in a MapReduce Pipeline
# ----------------------------------------
def mapreduce_pipeline(text):
    """
    Executes the full MapReduce process: Map, Shuffle/Sort, Reduce.
    """
    print("### MapReduce Pipeline ###\n")
    # Step 1: Map
    mapped_data = mapper(text)
    # Step 2: Shuffle and Sort
    shuffled_data = shuffle_and_sort(mapped_data)
    # Step 3: Reduce
    reduced_data = reducer(shuffled_data)
    return reduced_data
# ----------------------------------------
# Run the MapReduce Pipeline
# ----------------------------------------
result = mapreduce_pipeline(article)
# ----------------------------------------
# Display Final Results
# ----------------------------------------
print("\n--- Final Word Count ---")
for word, count in sorted(result.items(), key=lambda x: x[1], reverse=True):  # Sort by frequency
    print(f"{word}: {count}")
### MapReduce Pipeline ###
### Step 1: Mapper ###
Mapper Output: [('mapreduce', 1), ('is', 1), ('a', 1), ('programming', 1), ('model', 1), ('that', 1), ('is', 1), ('widely', 1), ('used', 1), ('for', 1)] ...
### Step 2: Shuffle and Sort ###
Shuffled and Sorted Output (Sample): {'mapreduce': [1, 1, 1], 'is': [1, 1, 1], 'a': [1, 1, 1], 'programming': [1], 'model': [1, 1]}
### Step 3: Reducer ###
Reducer Output (Sample): {'mapreduce': 3, 'is': 3, 'a': 3, 'programming': 1, 'model': 2}
--- Final Word Count ---
and: 4
mapreduce: 3
is: 3
a: 3
the: 3
model: 2
processing: 2
large: 2
datasets: 2
key: 2
map: 2
reduce: 2
function: 2
key-value: 2
pairs: 2
intermediate: 2
of: 2
programming: 1
that: 1
widely: 1
used: 1
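The pipeline above runs in a single process, but the same structure is what lets real MapReduce systems distribute work across machines: each node maps its own chunk, and the shuffle brings matching keys together before the reduce. A minimal sketch of that idea using Python's standard concurrent.futures (the chunking strategy and the parallel_mapreduce helper are illustrative assumptions, not part of the original notebook):

from concurrent.futures import ProcessPoolExecutor

def parallel_mapreduce(text, n_chunks=4):
    """Map text chunks in parallel, then shuffle and reduce the merged pairs."""
    lines = text.strip().split('\n')
    size = max(1, len(lines) // n_chunks)
    chunks = ['\n'.join(lines[i:i + size]) for i in range(0, len(lines), size)]
    with ProcessPoolExecutor() as pool:
        mapped_parts = list(pool.map(mapper, chunks))  # one mapper call per chunk
    merged = [pair for part in mapped_parts for pair in part]  # gather phase
    return reducer(shuffle_and_sort(merged))

# parallel_mapreduce(article) should produce the same counts as mapreduce_pipeline(article)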