
NLP Unit-1 Notes

Natural Language Processing (NLP) is a field of artificial intelligence focused on enabling computers to understand and interpret human languages, addressing the complexities and ambiguities inherent in natural language. NLP combines linguistics and computer science techniques to analyze and manipulate text and speech, facilitating applications such as smart assistants, machine translation, and sentiment analysis. Key components of NLP include Natural Language Understanding (NLU) and Natural Language Generation (NLG), which work together to enhance human-computer interaction.


KESHAV MEMORIAL INSTITUTE OF TECHNOLOGY

(AN AUTONOMOUS INSTITUTE)

Accredited by NBA & NAAC, Approved by AICTE, Affiliated to JNTUH, Hyderabad


1.1 INTRODUCTION TO NLP

• Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interaction between computers and humans in natural language.

• The ultimate goal of NLP is to help computers understand language the way we do.

• Natural language processing is a field of artificial intelligence in which computers analyze, understand, and derive meaning from human language in a smart and useful way.

• NLP strives to build machines that understand text or voice and respond in text or voice in the same way humans do.

1.1a Why NLP? - NEED FOR NLP

Computers can understand structured data such as spreadsheets and database tables, but human languages, texts, and voices form an unstructured category of data. This makes it difficult for computers to understand, and there arises the need for Natural Language Processing.

The word “natural” here is used to contrast natural languages with formal languages. In this
sense, all the languages humans speak are natural. Many experts believe that language emerged
naturally tens of thousands of years ago and has evolved organically ever since. Formal
languages, on the other hand, are types of languages that are invented by humans and have
strictly and explicitly defined syntax (i.e., what is grammatical) and semantics (i.e., what it
means).

Programming languages such as C and Python are good examples of formal languages. These
languages are defined in such a strict way that it is always clear what is grammatical and
ungrammatical. When you run a compiler or an interpreter on the code you write in those
languages, you either get a syntax error or not. The compiler won’t say something like, “Hmm,
this code is maybe 50% grammatical.” Also, the behaviour of your program is always the same
if it’s run on the same code, assuming external factors such as the random seed and the system
states remain constant. Your interpreter won’t show one result 50% of the time and another the
other 50% of the time.

This is not the case for human languages. You can write a sentence that is maybe grammatical.
For example, do you consider the phrase “The person I spoke to” ungrammatical? There are
some grammar topics where even experts disagree with each other. This is what makes human
languages interesting but challenging, and why the entire field of NLP even exists. Human
languages are ambiguous, meaning that their interpretation is often not unique. Both structures
(how sentences are formed) and semantics (what sentences mean) can have ambiguities in
human language. As an example, consider the sentence “He saw a girl with a telescope.”

• Is it the boy who is using a telescope to see a girl (from somewhere far away), or

• the girl who has a telescope and is seen by the boy?

There seem to be at least two interpretations of this sentence.

• The reason you are confused upon reading this sentence is that you don’t know what the phrase “with a telescope” is about.

• More technically, you don’t know what this prepositional phrase (PP) modifies.

• This is called the PP-attachment problem and is a classic example of syntactic ambiguity. A syntactically ambiguous sentence has more than one interpretation of how the sentence is structured. You can interpret the sentence in multiple ways, depending on which structure you believe.

Another type of ambiguity that may arise in natural language:

“I saw a bat”

There is no question how this sentence is structured. The subject of the sentence is “I” and the
object is “a bat,” connected by the verb “saw.”

In other words, there is no syntactical ambiguity in it. But how about its meaning? “Saw” has
at least two meanings. One is the past tense of the verb “to see.” The other is to cut some object
with a saw.

Similarly, “a bat” can mean two very different things: is it a nocturnal flying mammal or a
piece of wood used to hit a ball?

All in all, does this sentence mean that I observed a flying mammal or that I cut a baseball or
cricket bat? Or even (cruelly) that I cut a nocturnal animal with a saw? You never know, at least
from this sentence alone.
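One way to see this lexical ambiguity programmatically is to list the dictionary senses of each word. A minimal sketch (not part of the notes), assuming nltk is installed and the WordNet corpus has been downloaded once with nltk.download("wordnet"):

# List the WordNet senses of the ambiguous words in "I saw a bat".
from nltk.corpus import wordnet as wn

# "bat" has several unrelated noun senses: the flying mammal, the club, etc.
for synset in wn.synsets("bat", pos=wn.NOUN):
    print(synset.name(), "-", synset.definition())

# "saw" is ambiguous too: WordNet maps it both to the noun "saw" (the
# cutting tool) and, via lemmatization, to the verb "see" (past tense).
for synset in wn.synsets("saw"):
    print(synset.name(), "-", synset.definition())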

• Ambiguity is what makes natural languages rich but also challenging to process.

• Human languages are interesting but challenging; they are ambiguous, meaning that their interpretation is often not unique.

• Both structures (how sentences are formed) and semantics (what sentences mean) can have ambiguities in human language.

1.1b HOW NLP WORKS?


• NLP combines the fields of linguistics and computer science to decipher language structure and guidelines, in order to build models that can comprehend, break down, and separate significant details from text and speech.

• NLP involves a variety of techniques, including computational linguistics, machine learning, and statistical modeling. These techniques are used to analyze, understand, and manipulate human language data, including text, speech, and other forms of communication.

• NLP includes a range of algorithms, tasks, and problems that take human-produced text as an input and produce some useful information, such as labels, semantic representations, and so on, as an output.

• Other tasks, such as translation, summarization, and text generation, directly produce text as output.

AI, ML, DL & NLP


• Natural Language Processing, Machine Learning, and Artificial Intelligence are often used interchangeably, yet they have different definitions.

• AI is an umbrella term for machines that can simulate human intelligence, while ML, DL, and NLP are subsets of AI.

• Artificial Intelligence is a part of the greater field of Computer Science that enables computers to solve problems previously handled by biological systems.

• Machine Learning is an application of AI that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.

• Deep Learning is a method in artificial intelligence (AI) that teaches computers to process data in a way that is inspired by the human brain. Deep learning models can recognize complex patterns in pictures, text, sounds, and other data to produce accurate insights and predictions.

• Natural Language Processing is a form of AI that gives machines the ability to not just read, but to understand and interpret human language.
1.2 APPLICATIONS OF NLP

Smart Assistants

Amazon's Alexa and Apple's Siri are two of the prime examples of such interaction where
humans use speech to interact with the system and perform different tasks.

Another example of natural interaction is Google's homepage, where you can perform search operations via speech. Natural language processing lies at the foundation of such interaction.

Spelling correction

Microsoft provides word-processing software, such as MS Word and PowerPoint, with built-in spelling correction.
Machine Translation
Machine translation is used to translate text or speech from one natural language to another
natural language.

Chatbot

Implementing a chatbot is one of the important applications of NLP. Chatbots provide customer chat services.

Sentiment Analysis

Sentiment Analysis is also known as opinion mining. It is used on the web to analyse attitude, behaviour, and emotional state. This application is implemented through a combination of NLP and statistics: values are assigned to the text in order to identify the mood of the context.
Email Filtering
It checks incoming emails for red flags that signal spam or phishing content and then automatically moves those emails to a separate folder. For example, filters look for common trigger words, such as "free" and "earn money".

1.3 Building an NLP Application


NLP enables humans to interact with computers in natural language, making technology more
accessible and easier to use for a wide range of people.

NLP allows machines to analyze and understand large amounts of text data, which is valuable
for extracting insights, detecting patterns, and making data-driven decisions.

Large volumes of textual data

NLP makes it possible for computers to read text, hear speech, interpret it, measure sentiment, and so on.

Structuring a highly unstructured data source


NLP is important because it helps resolve ambiguity in language and adds useful numeric
structure to the data for many downstream applications, such as speech recognition or text
analytics.
Evolution of NLP

Evolving from human-computer interaction to human-computer conversation

The first critical part of NLP Advancements – Biometrics

The second critical part of NLP advancements - Humanoid Robotics

NLP is broadly made of two parts:


Natural Language Understanding (NLU)

Natural Language Generation (NLG)

Natural Language Understanding (NLU)

NLU is a branch of natural language processing (NLP) that helps computers understand and interpret human language by breaking down the elemental pieces of speech. In NLU, machine learning models improve over time as they learn to recognize syntax, context, language patterns, unique definitions, sentiment, and intent.

NLU is a subset of natural language processing, which uses syntactic and semantic analysis
of text and speech to determine the meaning of a sentence. Syntax refers to the grammatical
structure of a sentence, while semantics alludes to its intended meaning. NLU also
establishes a relevant ontology: a data structure which specifies the relationships between
words and phrases. While humans naturally do this in conversation, the combination of
these analyses is required for a machine to understand the intended meaning of different
texts.

Natural Language Generation (NLG)

Natural language generation is another subset of natural language processing. While natural
language understanding focuses on computer reading comprehension, natural language
generation enables computers to write. NLG is the process of producing a human language
text response based on some data input. This text can also be converted into a speech format
through text-to-speech services.

Delivering a meaningful, personalized experience beyond pre-scripted responses requires natural language generation. This enables the chatbot to interrogate data repositories, including integrated back-end systems and third-party databases, and to use that information in creating a response.

NLU is about analysis. NLG is about synthesis.

An NLP application may involve one or both.


Sentiment analysis and semantic search are examples of NLU.

Captioning an image or video is mainly an NLG task since input is not textual.

Text summarization and chatbots are applications that involve both NLU and NLG.

There's also Natural Language Interaction (NLI) of which Amazon Alexa and Siri are
examples.

Challenges

Systems are as yet incapable of understanding the way humans do. Until then, progress will
be limited to better pattern matching.

In the area of chatbots, there's a need to model common sense.

Africa alone has about 2100 languages. We need to find a way to solve this even if training
data is limited.

Just measuring progress is a challenge. We need datasets and evaluation procedures tuned
to concrete goals.

Language is inherently ambiguous, with words and phrases often having multiple meanings
depending on context. Resolving ambiguity is challenging for NLP systems.

Understanding the context in which a word or phrase is used is crucial for accurately
interpreting meaning. NLP systems need to be able to understand and use context
effectively.

NLP models often require large amounts of annotated data for training, and obtaining such
data can be costly and time-consuming, especially for languages with fewer resources.

1.4 NLP TASKS

NLP stands for Natural Language Processing, which is a field of artificial intelligence focused on enabling computers to understand, interpret, and generate human language. NLP tasks include:

1. Text Classification:

Assigning predefined categories or labels to text, such as spam detection or sentiment analysis.

Text classification is the process of classifying pieces of text into different categories. This NLP task is one of the simplest yet most widely used.

For example, spam filtering is one type of text classification. It classifies emails (or other types of text, such as web pages) into two categories: spam or not spam.
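A minimal spam-vs-ham sketch, assuming scikit-learn is installed; the four-example training set is invented purely for illustration (real filters train on thousands of emails):

# Bag-of-words features + Naive Bayes, chained into one pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "win a free prize now", "earn money fast",               # spam
    "meeting at noon tomorrow", "please review the report",  # not spam
]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["free money prize"]))        # likely ['spam']
print(model.predict(["see you at the meeting"]))  # likely ['ham']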

2. PART-OF-SPEECH TAGGING

Part-of-speech tagging is the process of tagging each word in a sentence with a corresponding part-of-speech tag.

As an example, let’s take the sentence “I saw a girl with a telescope.” The POS tags for this sentence are shown in the figure.

These tags come from the Penn Treebank POS tagset, which is the most popular standard corpus for training and evaluating various NLP tasks such as POS tagging and parsing. The results of POS tagging are often used as the input to other downstream NLP tasks, such as machine translation and parsing.
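A quick way to reproduce Penn Treebank-style tags is NLTK's built-in tagger. A minimal sketch, assuming nltk is installed and its tokenizer and tagger models have been downloaded once (nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")):

# Tokenize the example sentence and tag it with Penn Treebank POS tags.
import nltk

tokens = nltk.word_tokenize("I saw a girl with a telescope.")
print(nltk.pos_tag(tokens))
# Expected output (roughly):
# [('I', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('girl', 'NN'),
#  ('with', 'IN'), ('a', 'DT'), ('telescope', 'NN'), ('.', '.')]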

3. PARSING

Parsing is the task of analyzing the structure of a sentence.

Broadly speaking, there are two main types of parsing: constituency parsing and dependency parsing.

Constituency parsing uses context-free grammars to represent natural language sentences.

A context-free grammar is a way to specify how smaller building blocks of a language (e.g., words) are combined to form larger building blocks (e.g., phrases and clauses) and eventually sentences. To put it another way, it specifies how the largest unit (a sentence) is broken down into phrases and clauses and all the way down to words. The ways the linguistic units interact with each other are specified by a set of production rules, for example:

S → NP VP
NP → DT NN
DT → a
NN → girl

A production rule describes a transformation from the symbol on the left-hand side (e.g., “S”) to the symbols on the right-hand side (e.g., “NP VP”).

The first rule means that a sentence is a noun phrase (NP) followed by a verb phrase (VP). Now the parser’s job is to figure out how to reach the final symbol (in this case, “S”) starting from the raw words in the sentence. You can think of those rules as transformation rules from the symbols on the right to the ones on the left by traversing the arrow backward.

For example, using the rules “DT → a” and “NN → girl,” you can convert “a girl” to “DT NN.” Then, if you use “NP → DT NN,” you can reduce the entire phrase to “NP.” If you illustrate this process in a tree-like diagram, you get something like the one shown in the figure below.

Tree structures that are created in the process of parsing are called parse trees, or simply parses. The figure is a subtree because it doesn’t cover the entirety of the tree (i.e., it doesn’t show all the way from “S” down to words). Take the sentence “I saw a girl with a telescope” that we discussed earlier and see if you can parse it by hand. If you keep breaking down the sentence using the production rules until you get the final “S” symbol, you get the tree-like structure shown in the figure.
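If you want to verify the two PP attachments programmatically, here is a minimal sketch using NLTK with a hand-written toy grammar (these rules are an assumption for illustration, not the notes' exact rule set). The chart parser finds both trees, one per attachment:

# A toy CFG for "I saw a girl with a telescope"; NP -> NP PP and
# VP -> VBD NP PP create the two possible PP attachments.
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> PRP | DT NN | NP PP
    VP -> VBD NP | VBD NP PP
    PP -> IN NP
    PRP -> 'I'
    VBD -> 'saw'
    DT -> 'a'
    NN -> 'girl' | 'telescope'
    IN -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I saw a girl with a telescope".split()):
    tree.pretty_print()  # prints two distinct parse trees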


The other type of parsing is called dependency parsing. Dependency parsing uses dependency grammars to describe the structure of sentences, not in terms of phrases but in terms of words and the binary relations between them. For example, the result of dependency parsing of the earlier sentence is shown in the figure below.

Notice that each relation is directional and labeled. A relation specifies which word depends on which word and the type of relationship between the two. For example, the relation connecting “a” to “girl” is labeled “det,” meaning the first word is the determiner of the second. If you take the most central word, “saw,” and pull it upward, you’ll notice that these words and relations form a tree. Such trees are called dependency trees.
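A short dependency-parsing sketch, assuming spaCy and its small English model are installed (pip install spacy, then python -m spacy download en_core_web_sm). Each token points to its head with a labeled relation, e.g., "a" depends on "girl" with label "det":

# Print each word, its dependency label, and the head word it depends on.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I saw a girl with a telescope.")

for token in doc:
    print(f"{token.text:10s} --{token.dep_}--> {token.head.text}")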

4. TEXT GENERATION

Text generation, also called natural language generation (NLG), is the process of generating natural language text from something else. Summarization, text simplification, and grammatical error correction all produce natural language text as output and are instances of text-generation tasks.

Because all of these tasks take natural language text as their input, they are called text-to-text generation.

Another class of text-generation task is called data-to-text generation. For those tasks, the input is data that is not text. A publisher may wish to generate news text based on events such as sports game outcomes and weather. There is also growing interest in generating natural language text that best describes a given image, called image captioning.

Finally, a third class of text generation is unconditional text generation, where natural language text is generated randomly from a model. You can train models so that they can generate random academic papers, Linux source code, or even poems and play scripts.

5. Named Entity Recognition (NER): Identifying and classifying named entities mentioned in text, such as names of people, organizations, and locations.

NER attempts to extract entities (for example, person, location, and organization) from a given body of text or a text corpus.

For example, the sentence “John gave Mary two apples at school on Monday” will be transformed to “[John] (name) gave [Mary] (name) [two] (number) apples at [school] (organization) on [Monday] (time).” NER is an important topic in fields such as information retrieval and knowledge representation.
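A minimal NER sketch, again assuming spaCy's small English model; the exact entity labels (PERSON, DATE, CARDINAL, and so on) depend on the model used:

# Extract named entities and their labels from the example sentence.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John gave Mary two apples at school on Monday.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# e.g., "John" and "Mary" usually come out as PERSON, "Monday" as DATE.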

6. Question Answering: Providing relevant answers to questions posed in natural language.

QA techniques have high commercial value, and such techniques are found at the foundation of chatbots and virtual assistants (for example, Google Assistant and Apple Siri).

Chatbots have been adopted by many companies for customer support. Chatbots can be used to answer and resolve straightforward customer concerns (for example, changing a customer's monthly mobile plan), which can be solved without human intervention.

QA touches upon many other aspects of NLP, such as information retrieval and knowledge representation. Consequently, all this makes developing a QA system very difficult.

7. Machine Translation (MT)

MT is the task of transforming a sentence or phrase from a source language (for example, German) to a target language (for example, English). This is a very challenging task, as different languages have highly different morphological structures, which means that it is not a one-to-one transformation. Furthermore, word-to-word relationships between languages can be one-to-many, one-to-one, many-to-one, or many-to-many.

8. Sentiment Analysis: Determining the sentiment (positive, negative, or neutral) expressed in a piece of text.

9. Language Generation: Generating human-like text, such as in chatbots or automated content creation.
1.5 Development of NLP Applications
The development of NLP applications is a highly iterative process, consisting of many phases of research, development, and operations.

DATA COLLECTION

• Most modern NLP applications are based on machine learning.

• Machine learning, by definition, requires data on which NLP models are trained. Data
can be collected from humans (e.g., by hiring in-house annotators and having them go
through a bunch of text instances), crowdsourcing (e.g., using platforms such as
Amazon Mechanical Turk), or automated mechanisms (e.g., from application logs or
clickstreams).

ANALYSIS AND EXPERIMENTING

• After collecting the data, you move on to the next phase where you analyze and run
some experiments.

• For analyses, you usually look for signals such as: What are the characteristics of the
text instances? How are the training labels distributed? Can you come up with signals
that are correlated with the training labels? Can you come up with some simple rules
that can predict the training labels with reasonable accuracy? Should we even use ML?
This list goes on and on.
• This analysis phase includes aspects of data science, where various statistical
techniques may come in handy. The goal in this phase is to narrow down the possible
set of approaches to a couple of promising ones, before you go all-in and start training
a gigantic model.

TRAINING

This is when you start adding more data and computational resources (e.g., GPUs) for
training your model. It is not uncommon for modern NLP models to take days if not
weeks to train, especially if they are based on neural network models. It is critical at
this phase that you keep your training pipeline reproducible. Chances are, you will need
to run this several times with different sets of hyperparameters, which are tuning values
set before starting the model’s learning process. It is also likely that you will need to
run this pipeline again several months, if not years, later.

IMPLEMENTATION

When you have a model that is working with acceptable performance, you move on to
the implementation phase. This is when you start making your application “production
ready.” This process basically follows software engineering best practices, including:
writing unit and integration tests for your NLP modules, refactoring your code, having
your code reviewed by other developers, improving the performance of your NLP
modules, and dockerizing your application.

DEPLOYING

Your NLP application is finally ready to deploy. You can deploy your NLP application
in many ways—it can be an online service, a recurring batch job, an offline application,
or an offline one-off task.

If this is an online service that needs to serve its predictions in real time, it is a good
idea to make this a microservice to make it loosely coupled with other services.

MONITORING

An important final step for developing NLP applications is monitoring. This not only
includes monitoring the infrastructure such as server CPU, memory, and request
latency, but also higher-level ML statistics such as the distributions of the input and the
predicted labels. Some of the important questions to ask at this stage are:

• What do the input instances look like?

• Are they what you expected when you built your model?
• What do the predicted labels look like?

• Does the predicted label distribution match the one in the training data?

The purpose of monitoring is to check that the model you built is behaving as intended. If the incoming text or data instances or the predicted labels do not match your expectations, you may have an out-of-domain problem, meaning that the domain of the natural language data you are receiving is different from the domain of the data the model was trained on.

1.6 Structure of NLP applications

The structures of modern, machine learning–based NLP applications are becoming surprisingly
similar for two main reasons—

1. one is that most modern NLP applications rely on machine learning to some degree,
and they should follow best practices for machine learning applications.

2. The other is that, due to the advent of neural network models, a number of NLP tasks,
including text classification, machine translation, dialog systems, and speech
recognition, can now be trained end-to-end.

Figure below illustrates the typical structure of a modern NLP application.

There are two main infrastructures: the training and the serving infrastructure.

The training infrastructure is usually offline and serves the purpose of training the machine
learning model necessary for the application. It takes the training data, converts it to some data
structure that can be handled by the pipeline, and further processes it by transforming the data
and extracting the features. This part varies greatly from task to task. Finally, if the model is a
neural network, data instances are batched and fed to the model, which is optimized to
minimize the loss. The trained model is usually serialized and stored to be passed to the serving
infrastructure.

The serving infrastructure’s job is to, given a new instance, produce the prediction, such as
classes, tags, or translations. The first part of this infrastructure, which reads the instance and
transforms it into some numbers, is similar to the one for training. In fact, you must keep the
dataset reader and the transformer identical. Otherwise, discrepancies will arise in the way
those two process the data, also known as training-serving skew. After the instance is processed,
it’s fed to the pretrained model to produce the prediction.
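One common way to keep the dataset reader and transformer identical between training and serving is to bundle the text transformer and the model into a single object and serialize that whole object. A minimal sketch, assuming scikit-learn and joblib (neither is prescribed by the notes) and an invented four-example dataset:

# --- training infrastructure (offline) ---
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product", "terrible service", "loved it", "awful"]
labels = ["pos", "neg", "pos", "neg"]
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)
joblib.dump(pipeline, "model.joblib")  # store transformer + model together

# --- serving infrastructure (online) ---
serving_pipeline = joblib.load("model.joblib")
# The same transformation is applied at serving time, avoiding
# training-serving skew.
print(serving_pipeline.predict(["the service was great"]))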

1.7 Your first NLP application: Sentiment Analysis

Sentiment analysis in natural language processing (NLP) is the process of determining the sentiment or emotion expressed in a piece of text. It involves using computational methods to analyze and classify the sentiment of text as positive, negative, or neutral. Sentiment analysis is widely used in various applications, including social media monitoring, customer feedback analysis, and market research. In machine learning, classification means categorizing something into a set of predefined, discrete categories.

• One of the most basic tasks in sentiment analysis is the classification of polarity, that is, to classify whether the expressed opinion is positive, negative, or neutral.

• Classification of polarity is one type of sentence classification task.

• Another type of sentence classification task is spam filtering, where each sentence is categorized into two classes: spam or not spam. It’s called binary classification if there are only two classes. If there are more than two classes (a five-star rating system, for example), it’s called multiclass classification.

In contrast, when the prediction is a continuous value instead of discrete categories, it’s called regression. Examples include:

• predicting the price of a house based on its properties

• predicting stock prices based on information collected from news articles and social media posts

But most linguistic units, such as characters, words, and part-of-speech tags, are discrete. For this reason, most uses of machine learning in NLP are classification, not regression.

Steps Involved In Sentiment Analysis Classification
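As a minimal sketch of these steps (assuming scikit-learn; the tiny labeled dataset is invented purely for illustration): collect labeled text, convert it to numeric features, train a classifier, and predict the polarity of new text.

# Step-by-step polarity classification sketch (toy data, scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Step 1: collect labeled text (invented examples).
texts = ["I love this movie", "What a great product",
         "Terrible experience", "I hate the new design"]
labels = ["positive", "positive", "negative", "negative"]

# Step 2: convert text into numeric features.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)

# Step 3: train a classifier on the features.
classifier = LogisticRegression()
classifier.fit(features, labels)

# Step 4: predict the polarity of unseen text.
new_features = vectorizer.transform(["I really love it"])
print(classifier.predict(new_features))  # likely ['positive']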

Applications

Applications of sentiment analysis in NLP are diverse and include:

1. Social Media Monitoring: Analyzing the sentiment of social media posts to understand public opinion or customer feedback.

2. Customer Feedback Analysis: Analyzing reviews, survey responses, and customer support interactions to gauge customer satisfaction.

3. Brand Monitoring: Monitoring mentions of a brand online to understand public perception and sentiment.

4. Market Research: Analyzing sentiment in market reports, news articles, and other sources to understand market trends and consumer behavior.

5. Political Analysis: Analyzing sentiment in political speeches, news articles, and social media to understand public opinion and political trends.

Challenges

1. Ambiguity: Sentiment can be expressed in complex ways, making it challenging to interpret accurately.

2. Context: Understanding the context of text is crucial for accurate sentiment analysis, as the same words can have different meanings in different contexts.

3. Sarcasm and Irony: Sentiment analysis algorithms can struggle to detect sarcasm and irony, which can lead to misinterpretation.

4. Language Variations: Sentiment analysis models trained on one language or dialect may not perform well on text written in a different language or dialect.

5. Data Quality: The quality of the training data used to train sentiment analysis models can significantly impact their performance.


1.8 What is a dataset? (CORPUS)

A dataset simply means a collection of data. It consists of pieces of data that follow the same format.

In NLP, records in a dataset are usually some type of linguistic unit, such as words, sentences, or documents. A dataset of natural language texts is called a corpus (plural: corpora).

As an example, let’s think of a (hypothetical) dataset for spam filtering. Each record in this dataset is a pair of a piece of text and a label, where the text is a sentence or a paragraph (e.g., from an email) and the label specifies whether the text is spam. Both the text and the label are fields of a record.

Some NLP datasets and corpora have more complex structures. For example, a dataset may contain a collection of sentences, where each sentence is annotated with detailed linguistic information, such as part-of-speech tags, parse trees, dependency structures, and semantic roles.

If a dataset contains a collection of sentences annotated with their parse trees, the dataset is called a treebank. The most famous example of this is the Penn Treebank (PTB), which has been serving as the de facto standard dataset for training and evaluating NLP tasks such as part-of-speech (POS) tagging and parsing.

• A closely related term to a record is an instance.

• In machine learning, an instance is a basic unit for which a prediction is made. For example, in the spam-filtering task mentioned earlier, an instance is one piece of text.

• An instance is usually created from a record in a dataset, as is the case for the spam-filtering task.

• Finally, a label is a piece of information attached to some linguistic unit in a dataset.

• A spam-filtering dataset has labels that correspond to whether each text is spam.

• A treebank may have one label per word for its part of speech.

1.9 What are word embeddings?

Word embeddings are one of the most important concepts in modern NLP. Word embeddings
are a type of representation for words in a continuous vector space where the positioning of
words captures semantic relationships between them.

In simpler terms, word embeddings are numerical representations of words that allow computers to understand the words’ meanings and their relationships with other words, based on how the words are used in text.

Word embeddings are typically learned from large text corpora using neural network models,
such as Word2Vec, GloVe, or FastText. These models map words to high-dimensional vectors
in such a way that words with similar meanings are represented by vectors that are close to
each other in the vector space. Word embeddings are useful in NLP tasks such as language
modeling, sentiment analysis, and machine translation, as they allow models to capture the
meaning of words and the relationships between them.

In the eyes of computers, “cat” is no closer to “dog” than it is to “pizza.”

One way to deal with discrete words programmatically is to assign indices to individual words as follows (here we simply assume that these indices are assigned alphabetically):

index("cat") = 1
index("dog") = 2
index("pizza") = 3

The entire, finite set of words that one NLP application or task deals with is called the vocabulary.
Just because words are now represented by numbers doesn’t mean you can do arithmetic
operations on them and conclude that “cat” is equally similar to “dog” (difference between 1
and 2), as “dog” is to “pizza” (difference between 2 and 3). Those indices are still discrete and
arbitrary.

“What if we can represent them on a numerical scale?”

Conceptually, the numerical scale would look like the one shown in the figure below.

This is a step forward. Now we can represent the fact that “cat” and “dog” are more similar to each other than “pizza” is to those words.

But still, “pizza” is slightly closer to “dog” than it is to “cat.”

• What if we wanted to place it somewhere that is equally far from “cat” and “dog?”

• Maybe only one dimension is too limiting.

• How about adding another dimension to this, as shown in figure

Much better! Because computers are really good at dealing with multidimensional spaces, you can simply keep doing this until you have a sufficient number of dimensions.

Let’s have three dimensions.

In this 3-D space, you can represent those three words as follows:

¡ vec("cat") = [0.7, 0.5, 0.1]

¡ vec("dog") = [0.8, 0.3, 0.1]

¡ vec("pizza") = [0.1, 0.2, 0.8]


The figure below illustrates this three-dimensional space.

The x-axis (the first element) here represents some concept of “animal-ness,” and the z-axis (the third element) corresponds to “food-ness.”

This is essentially what word embeddings are.

Think of a multidimensional space that has as many dimensions as there are words.

Then, give each word a vector that is filled with zeros except for a single 1, as shown:

vec("cat") = [1, 0, 0]

vec("dog") = [0, 1, 0]

vec("pizza") = [0, 0, 1]

Notice that each vector has only one 1 at the position corresponding to the word’s index.
These special vectors are called one-hot vectors.
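A minimal NumPy sketch of constructing such one-hot vectors from the word indices above; the three-word vocabulary is the toy example from the text:

# Build one-hot vectors from word indices.
import numpy as np

vocabulary = ["cat", "dog", "pizza"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    vector = np.zeros(len(vocabulary))
    vector[word_to_index[word]] = 1.0  # single 1 at the word's index
    return vector

print(one_hot("cat"))    # [1. 0. 0.]
print(one_hot("pizza"))  # [0. 0. 1.]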

Need for Word Embedding?

Word embeddings are crucial in natural language processing (NLP) for several reasons:

1. Semantic Representation: Word embeddings provide a way to represent words in a continuous vector space, where words with similar meanings are closer together. This allows NLP models to capture semantic relationships between words and understand their meanings in context.

2. Dimensionality Reduction: Word embeddings reduce the dimensionality of the input space, making it easier for NLP models to process and learn from text data. This can lead to more efficient and effective models.

3. Generalization: Word embeddings can generalize to unseen words based on their similarity to words in the training data. This is particularly useful in NLP tasks where the vocabulary is large and constantly evolving.

4. Improved Performance: NLP models that use word embeddings often achieve better performance compared to models that use traditional sparse representations of words, such as one-hot encoding. Word embeddings capture more nuanced relationships between words, leading to improved performance on tasks like text classification, sentiment analysis, and machine translation.

5. Transfer Learning: Pre-trained word embeddings can be used as a starting point for training NLP models on specific tasks. This allows models to leverage knowledge learned from large text corpora and achieve better performance with less training data (see the training sketch below).
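A minimal Word2Vec training sketch, assuming gensim 4.x is installed; the two-sentence corpus is invented for illustration and is far too small to produce meaningful embeddings (real models train on millions of words):

# Train a tiny Word2Vec model and inspect the learned vectors.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "chased", "the", "dog"],
    ["i", "ate", "a", "pizza", "for", "dinner"],
]
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1)

print(model.wv["cat"].shape)              # (50,) - the learned vector
print(model.wv.similarity("cat", "dog"))  # cosine similarity of two words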
