Natural Language Processing

The document provides an overview of Natural Language Processing (NLP), highlighting its focus on creating models from text data and the unique challenges it presents. It outlines a basic NLP process, introduces the TF-IDF method for featurizing text, and suggests optional reading materials. It also mentions a practical code-along project for building a spam detection filter using Python and Spark.

Uploaded by

abhimanyu thakur


Natural Language Processing
Let’s learn something!
Python and Spark

● Let’s now learn about the basics of Natural Language Processing!
● This is the field of machine learning that focuses on creating models from a text data source (for example, the raw words of articles).
Python and Spark

● The NLP section of the course contains just a single custom code-along example, because the documentation doesn’t really have a full example and the custom code-along is a larger, multi-step process.
Python and Spark

● This is a very large field of machine learning with its own unique challenges, algorithms, and features, so what we cover here just scratches the surface!
Python and Spark

● Optional Reading Suggestions:
○ Wikipedia Article on NLP
○ NLTK Book (separate Python library)
○ Foundations of Statistical Natural Language Processing (Manning and Schütze)
Python and Spark

● Examples of NLP
○ Clustering News Articles
○ Suggesting similar books
○ Grouping Legal Documents
○ Analyzing Consumer Feedback
○ Spam Email Detection
Python and Spark

● Our basic process for NLP:
○ Compile all documents into a corpus
○ Featurize the words as numeric values
○ Compare the features of the documents
Python and Spark

● A standard way of doing this is through the use of what are known as “TF-IDF” methods.
● TF-IDF stands for Term Frequency - Inverse Document Frequency.
● Let’s explain how it works!
NLP

Simple Example:
● You have 2 documents:
○ “Blue House”
○ “Red House”
● Featurize based on word count:
○ “Blue House” -> (red,blue,house) -> (0,1,1)
○ “Red House” -> (red,blue,house) -> (1,0,1)
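
The word-count featurization above can be sketched in plain Python. This is only an illustration: the helper name `count_vector` is hypothetical, and the vocabulary order (red, blue, house) is fixed by hand to match the slide.

```python
from collections import Counter

def count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    # Lowercase and split on whitespace; a real tokenizer would also
    # deal with punctuation (this simple split is an assumption).
    counts = Counter(document.lower().split())
    # A Counter returns 0 for words that never occur.
    return [counts[word] for word in vocabulary]

# Vocabulary order chosen to match the slide: (red, blue, house)
vocabulary = ["red", "blue", "house"]

blue_house = count_vector("Blue House", vocabulary)  # [0, 1, 1]
red_house = count_vector("Red House", vocabulary)    # [1, 0, 1]
```

Every document is mapped onto the same fixed vocabulary, so the resulting vectors have the same length and can be compared position by position.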
NLP

● A document represented as a vector of word counts is called a “Bag of Words”
○ “Blue House” -> (red,blue,house) -> (0,1,1)
○ “Red House” -> (red,blue,house) -> (1,0,1)
● These are now vectors in an N-dimensional space, so we can compare them with cosine similarity:
sim(A, B) = cos(θ) = (A · B) / (‖A‖ ‖B‖)
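
As a minimal sketch, cosine similarity for the two example vectors can be computed in plain Python. The function name and the choice to return 0.0 for an all-zero vector are assumptions made for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: an empty document is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

blue_house = [0, 1, 1]  # (red, blue, house)
red_house = [1, 0, 1]

# The documents share one of their two words, so the value is about 0.5.
similarity = cosine_similarity(blue_house, red_house)
```

A value near 1 means the documents use words in nearly the same proportions; a value of 0 means they share no words at all.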
NLP

● We can improve on Bag of Words by adjusting word counts based on their frequency in the corpus (the group of all the documents)
● We can use TF-IDF (Term Frequency - Inverse Document Frequency)
NLP

● Term Frequency - Importance of the term within that document
○ TF(x,y) = Number of occurrences of term x in document y
● Inverse Document Frequency - Importance of the term in the corpus
○ IDF(x) = log(N/dfx) where
■ N = total number of documents
■ dfx = number of documents containing term x
NLP

● Mathematically, TF-IDF is then expressed as the product of the two terms:
TF-IDF(x,y) = TF(x,y) × log(N/dfx)
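
To make the weighting concrete, here is a small plain-Python sketch that multiplies the term frequency by log(N/dfx) for the two example documents. The function name `tf_idf` is hypothetical, and the natural logarithm is an assumption, since the slides do not state a base.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # df[x] = number of documents that contain term x at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # tf[x] = occurrences of x in this document
        weights.append(
            {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
        )
    return weights

weights = tf_idf(["Blue House", "Red House"])
# "house" appears in both documents, so log(2/2) = 0 and its weight is 0.
# "blue" appears in only one document, so its weight is log(2/1), about 0.693.
```

The word shared by every document contributes nothing, while the words that distinguish the documents keep a positive weight.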
Python and Spark

● Spark has a lot of pyspark.ml.feature tools to help out with this entire process and make it all easy for you!
● Let’s jump to a custom code-along example!
Tools for NLP
Part One
Python and Spark

● Before we jump into the code-along project, let’s explore a few of the tools Spark has for dealing with text data.
● Then we’ll be able to use them easily in our project!
Tools for NLP
Part Two
NLP Code Along
Python and Spark

● Let’s work through building a spam detection filter using Python and Spark!
● Our data set consists of volunteered text messages from a study in Singapore and some spam texts from a UK reporting site.
● Let’s get started!
