Introduction to
Natural Language
Processing (NLP)
How do machines
understand human
languages?
What You’ll Learn in This Session
➢ What NLP is and why it’s important.
➢ Everyday applications of NLP
➢ Challenges in making machines understand
language. GeeksforGeeks (2024).
➢ Preliminary methods
➢ Regular Expressions
➢ Probability Theory
Learning Outcomes
By the end of this session, learners will:
Define Natural Language Processing (NLP) and understand its role in
enabling machines to process human language.
Identify real-world applications of NLP such as chatbots, machine
translation, and sentiment analysis.
Explain key challenges in NLP, including ambiguity, context
dependency, and multilingual processing.
Apply basic Python concepts like regular expressions to preprocess
and analyze text data.
Challenges in Making Machines
Understand Language
What is Natural Language Processing (NLP)?
➢ NLP is the ability of computers to understand, process, and respond to human
language.
➢ It’s a combination of computer science, linguistics, and artificial intelligence.
➢ NLP enables machines to:
Understand human communication.
Extract meaning from text and speech.
Input ("Translate: Hello") → Processing (Machine Translation) → Output ("Bonjour")
Everyday Applications of NLP
Where Do We Use NLP?
1. “What’s the weather today?” (Answered by Siri/Google Assistant).
2. Google Translate for multilingual communication.
3. Analyzing product reviews as positive or negative.
4. YouTube Auto-Captions.
5. Email spam filters.
Conversational Agents
A generic language model is essentially a next-word predictor…
Seen language modelling before?
The Complexity of Language
➢ Humans speak over 7,000 languages, and each language has unique
grammar, vocabulary, and expressions.
➢ Machines must learn to process not just different languages but also context,
emotion, and nuance.
The same greeting in different languages: Bawo ni, Sannu, Hello, Bonjour, Ndewo.
The man saw the girl with the telescope.
➢ The man used the telescope to see the girl.
➢ The girl had the telescope, and the man saw her.
How do machines handle this
complexity to understand what we
mean?
Challenges in Making Machines Understand Us
Ambiguity:
Example: “I saw the man with the telescope.” (Who has the telescope?)
Multilingualism:
Machines must understand multiple languages and code-switching.
Context Dependency:
Example: The word “bank” could mean a riverbank or a financial institution.
Noisy Data:
Social media text is messy with abbreviations, emojis, and slang (e.g., "OMG dat’s gr8").
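One simple way to tame noisy text is a lookup table that maps slang to standard words. The sketch below is only an illustration: the `SLANG` table and the `normalize` function are hypothetical names, and a real system would need a far larger dictionary or a learned model.

```python
import re

# Hypothetical slang table, made up for this example.
SLANG = {"omg": "oh my god", "dat's": "that's", "gr8": "great"}

def normalize(text):
    """Lowercase the text and replace known slang tokens."""
    # Keep letters, digits, and apostrophes together as one token.
    tokens = re.findall(r"[\w']+", text.lower())
    return " ".join(SLANG.get(token, token) for token in tokens)

print(normalize("OMG dat's gr8"))
```

Words missing from the table pass through unchanged, which is why dictionary-based cleanup only covers the slang someone has already listed.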
How Machines “See” Text
Machines don’t "understand" text the way we do. NLP breaks text into manageable parts:
– Tokenization: Splitting text into words or sentences.
– Regular Expressions (Regex): Pattern matching for cleaning and extracting information.
– Stopword Removal: Removing common words like “is” and “the”.
– Stemming/Lemmatization: Converting words to their base/dictionary form.
The preprocessing steps at a glance:
– Tokenization
– Lowercasing
– Punctuation and Special Character Removal
– Stopword Removal
– Stemming/Lemmatization
We’ll explore these concepts step-by-step in Python later!
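As a preview, the steps named above can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the stopword list is a tiny made-up sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"is", "the", "a", "an", "and", "are", "of", "to", "in"}

def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                  # lowercasing
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # punctuation and special characters
    tokens = text.split()                                # tokenization on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [naive_stem(t) for t in tokens]               # naive stemming

print(preprocess("The cats are chasing the mice!"))
```

The result contains "chas" rather than "chase" and leaves "mice" untouched, which shows why stemming is only an approximation and why lemmatization uses a dictionary instead.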
Preliminary Methods
Fundamental concepts and tools necessary for basic text processing in NLP.
Raw text is messy and often contains irrelevant information. These methods
clean and structure data for efficient analysis.
Example:
➢ Processing customer feedback by removing irrelevant characters like
hashtags or emojis.
➢ Preparing text for chatbots by breaking down long sentences.
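The two examples above might look like this in Python. It is a rough sketch under simple assumptions: emojis are treated as "any non-ASCII character" (which would also drop accented letters), sentences are assumed to end with ., ! or ?, and the function names and feedback text are made up for illustration.

```python
import re

def clean_feedback(text):
    """Drop hashtags and non-ASCII symbols (which covers most emojis)."""
    text = re.sub(r"#\w+", "", text)            # remove hashtags such as #happy
    text = re.sub(r"[^\x00-\x7F]+", "", text)   # remove non-ASCII characters, e.g. emojis
    return re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace

def split_sentences(text):
    """Split after ., ! or ? followed by whitespace (a rough heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

feedback = "Great phone! 😀 #happy Battery lasts long. Would buy again?"
cleaned = clean_feedback(feedback)
print(cleaned)
print(split_sentences(cleaned))
```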
Regular Expressions (Regex)
Regular Expressions (Regex) are powerful tools used to find patterns in
text.
They help extract relevant information such as:
01 Emails
02 Phone numbers
03 Dates
The Regular Expression Module
You need to import the regex library: import re
Use re.search() to see if a string matches a regular expression.
Use re.findall() to extract the portions of a string that match your regular expression.
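A short sketch of both functions from Python's built-in `re` module. The sample text and the three patterns are assumptions chosen for illustration; real emails, phone numbers, and dates come in many more formats than these patterns cover.

```python
import re

# Made-up sample text for illustration.
text = "Call 555-1234 before 2024-05-17 or write to ana@example.com."

# re.search returns a match object for the first hit, or None if nothing matches.
has_date = re.search(r"\d{4}-\d{2}-\d{2}", text) is not None

# re.findall returns every non-overlapping match as a list of strings.
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
phones = re.findall(r"\b\d{3}-\d{4}\b", text)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

print(has_date, emails, phones, dates)
```

`re.search` suits yes/no checks, while `re.findall` suits collecting every occurrence.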
Regular Expression Quick Guide
^ Marks the start of a line.
$ Indicates the end of a line.
. Represents any single character.
\s Matches any whitespace character.
\S Matches any non-whitespace character.
* Repeats a character zero or more times
*? Repeats a character zero or more times, but non-greedily.
+ Repeats a character one or more times
+? Repeats a character one or more times, but in a non-greedy way.
[aeiou] Matches a single character from the specified set of vowels.
[^XYZ] Matches a single character not in the listed set.
[a-z0-9] Matches a single character from a range, such as lowercase letters or digits.
( Marks the starting point for a string to be extracted.
) Marks the endpoint for a string extraction.
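The symbols in the quick guide can be tried out directly. The sketch below uses made-up sample strings; each line exercises one entry from the guide.

```python
import re

# Made-up sample line for illustration.
line = "From: ana@example.com Sat Jan  5 09:14:16 2008"

starts = re.findall(r"^From", line)          # ^ anchors at the start of the line
year = re.findall(r"\d+$", line)             # $ anchors at the end of the line
vowels = re.findall(r"[aeiou]", "regex")     # one character from the set
others = re.findall(r"[^a-z0-9]", "a-b_c1")  # one character NOT in the set
address = re.findall(r"\S+@\S+", line)       # runs of non-whitespace around '@'

tag = "<b>bold</b>"
greedy = re.findall(r"<.+>", tag)   # + grabs as much as possible
lazy = re.findall(r"<.+?>", tag)    # +? grabs as little as possible

# Parentheses mark the part to extract: only the text after '@' is returned.
domain = re.findall(r"@(\S+)", line)

print(starts, year, vowels, others, address)
print(greedy, lazy, domain)
```

The greedy pattern swallows the whole tag pair in one match, while the non-greedy version stops at the first closing bracket and so finds two separate tags.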
Probability Theory
The branch of mathematics that deals with the study of random
events, quantifying uncertainty, and modeling the likelihood of
outcomes.
It is based on the concept of a probability measure P, defined on a
sample space Ω (the set of all possible outcomes), with events
being subsets of Ω.
Mathematical language for quantifying uncertainty
Probability Theory
➢ Ω : Sample Space, set of all outcomes of a random experiment
➢ A : Event (A ⊆ Ω), collection of possible outcomes of an experiment
➢ P(A): Probability of event A, P is a function: events → ℝ
❑ 0 ≤ P(A) ≤ 1 for any event A ⊆ Ω.
❑ P(Ω) = 1
❑ P(A) ≥ 0 , for all A
❑ If A1, A2, … are disjoint (mutually exclusive) events, then:
$$P(A_1 \cup A_2 \cup \cdots) = \sum_{i=1}^{\infty} P(A_i)$$
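These properties can be checked on a small example. The sketch below assumes a fair six-sided die, so every outcome in Ω is equally likely; the `prob` helper is a hypothetical name, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction

# Sample space for one roll of a fair die (assumed equally likely outcomes).
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(A) = |A| / |Ω| when all outcomes are equally likely."""
    return Fraction(len(event & omega), len(omega))

even = {2, 4, 6}
one = {1}

assert 0 <= prob(even) <= 1                          # probabilities lie in [0, 1]
assert prob(omega) == 1                              # P(Ω) = 1
assert even & one == set()                           # the two events are disjoint
assert prob(even | one) == prob(even) + prob(one)    # additivity for disjoint events

print(prob(even), prob(one), prob(even | one))
```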
Probability Theory
NLP models use probabilities to make decisions, such as predicting the
next word in a sentence.
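A toy illustration of next-word prediction: count which word follows which in a tiny corpus, then turn the counts into probabilities. This is a sketch under strong assumptions. The corpus is made up, and only the single previous word is used as context, which is a simplification of how real language models work.

```python
from collections import Counter, defaultdict

# Made-up toy corpus, already tokenized by whitespace.
corpus = "i like nlp . i like python . i love nlp .".split()

# Count how often each word follows each previous word.
following = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    following[prev][word] += 1

def next_word_probs(prev):
    """Estimate P(word | prev) from relative frequencies."""
    counts = following.get(prev, Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def predict_next(prev):
    """Return the most frequent follower of prev, or None if prev is unseen."""
    counts = following.get(prev)
    return counts.most_common(1)[0][0] if counts else None

print(next_word_probs("i"))
print(predict_next("i"))
```

In this corpus "i" is followed by "like" twice and "love" once, so the model prefers "like".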
Probability Theory: Key Concepts
NLP deals with uncertainty due to ambiguity in language, data sparsity, and context.
Probabilities enable models to quantify uncertainty and make data-driven predictions.
Independence: Events A and B are independent if P(A∩B) = P(A)P(B).
Bayes’ Theorem (Conditional Probability):
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
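A small worked example of Bayes’ theorem and the independence test. All numbers are hypothetical, chosen only to make the arithmetic easy to follow: A is "the email is spam" and B is "the email contains the word free".

```python
import math

def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers for illustration only.
p_spam = 0.2               # P(A): an email is spam
p_free_given_spam = 0.5    # P(B|A): "free" appears in a spam email
p_free_given_ham = 0.05    # P(B|not A): "free" appears in a normal email

# Law of total probability gives P(B).
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = bayes(p_free_given_spam, p_spam, p_free)
print(round(p_spam_given_free, 3))

# Independence check: is P(A∩B) equal to P(A)P(B)?
p_spam_and_free = p_free_given_spam * p_spam
independent = math.isclose(p_spam_and_free, p_spam * p_free)
print(independent)
```

Seeing the word "free" raises the probability of spam from 0.2 to about 0.71, and the two events are not independent because P(A∩B) = 0.1 while P(A)P(B) = 0.028.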
Example
➢ Language Modeling: Predict the next word w_n given the prior context w_1, …, w_{n-1}
$$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$$
➢ Text Classification (Naive Bayes): Assign a document D to a class C
$$P(C \mid D) = \frac{P(D \mid C) \cdot P(C)}{P(D)}$$
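The Naive Bayes formula can be sketched from scratch on a toy dataset. Several assumptions are made here that the slides do not state: the training sentences are made up, words are treated as independent given the class (the "naive" part), add-one smoothing handles unseen word/class pairs, and words outside the training vocabulary are skipped. P(D) is dropped because it is the same for every class and does not change which class scores highest.

```python
import math
from collections import Counter

# Made-up training data for illustration.
train = [
    ("great phone love it", "pos"),
    ("love the great battery", "pos"),
    ("terrible phone hate it", "neg"),
    ("hate the terrible screen", "neg"),
]

class_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_counts}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def log_score(text, label):
    """log P(C) + sum of log P(w|C), using add-one smoothing."""
    score = math.log(class_counts[label] / len(train))
    total = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocab:
            continue  # skip words never seen in training
        score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
    return score

def classify(text):
    """Pick the class with the highest (unnormalized) log posterior."""
    return max(class_counts, key=lambda label: log_score(text, label))

print(classify("love this great screen"))
print(classify("terrible battery hate"))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer documents.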
Hands-On Demonstration
Let’s demonstrate a few examples
in Python…
Key Takeaways
NLP enables machines to understand, process, and generate human language.
Applications span chatbots, machine translation, sentiment analysis, and more.
Ambiguity, context dependency, resource limitations, and multilingual processing
make NLP complex.
Regex tools like re.search() and re.findall() help clean and preprocess text.
Probability theory provides a structured way to handle uncertainty and make
predictions based on language data.
Assignment Questions
1. Define Natural Language Processing (NLP) in your own words.
2. List at least three real-world applications of NLP and explain their significance.
3. Identify and explain two challenges that make NLP complex.
4. Extract the following patterns using regex:
a) All email addresses from the text below:
“Contact us at support@[Link] or sales@[Link]. For more, email
info@[Link].”
b) All words that end with "ing" from this sentence:
“NLP is amazing for cleaning and processing text while learning new techniques.”
5. Write a Python program to clean the following text by:
a) Removing all punctuation.
b) Converting it to lowercase.
c) Splitting it into words.
Resources
➢ Python regex Documentation: [Link]
➢ Regex cheat sheet: [Link]
expresso
➢ Friedl, J. E. (2006). Mastering Regular Expressions, 3rd Edition. O'Reilly Media,
Inc. [Link]
➢ Jurafsky, D., & Martin, J. H. (2024). Speech and language processing (3rd ed.
Draft). [Link]
Next Class
Text Preprocessing, Tokenization, Stemming &
Lemmatization