Language Modelling

The document discusses Natural Language Processing (NLP) with a focus on language modeling, which predicts the probability of word sequences. It outlines two main types of language modeling: statistical and neural, detailing methods such as N-grams and Maximum Likelihood Estimation. Additionally, it covers evaluation techniques for language models, including intrinsic and extrinsic evaluations, along with metrics like cosine similarity, accuracy, F1 score, and perplexity.


Natural Language Processing
By
Dr. Pankaj Dadure
Assistant Professor
School of Computer Science
UPES Dehradun
Language Modelling
• Language modeling is the task of determining the
probability of a sequence of words.
• It is typically framed as predicting the next word or
character in a document; models trained this way can
be applied to a wide range of natural language tasks
such as text generation, text classification, and
question answering.
Methods of Language Modeling

There are two types of language modeling:

• Statistical Language Modeling: the development of probabilistic models that
are able to predict the next word in a sequence given the words that precede
it. N-gram language modeling is an example.

• Neural Language Modeling: neural network methods achieve better results
than classical methods, both as standalone language models and when
incorporated into larger models for challenging tasks like speech recognition
and machine translation. One way of building a neural language model is
through word embeddings.
Statistical Language Modeling
• It is also called probabilistic language modeling.
• Goal: compute the probability of a sentence or sequence of words:

P(W) = P(w1, w2, w3, w4, …, wn)

• Related task: the probability of an upcoming word:

P(w5 | w1, w2, w3, w4)


Reminder: The Chain Rule
• The definition of conditional probability:

P(A|B) = P(B|A) P(A) / P(B), rewritten as: P(A|B) = P(A, B) / P(B)

P(A|B) P(B) = P(A, B)
P(A, B) = P(A|B) P(B)

• More variables:
P(A, B, C, D) = P(A) P(B|A) P(C|A, B) P(D|A, B, C)
• The Chain Rule in general:
P(x1, x2, x3, …, xn) = P(x1) P(x2|x1) P(x3|x1, x2) … P(xn|x1, …, xn−1)
The chain rule applied to compute joint probability
of words in sentence
• P(its water is so transparent) = P(its) × P(water | its) × P(is | its water)
× P(so | its water is) × P(transparent | its water is so)

P(w1, w2, w3, …, wn) = ∏i P(wi | w1, w2, w3, …, wi−1)

Note: The chain rule shows the link between the joint probability of a
sequence and the conditional probability of a word given the previous
words.
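The multiplication above can be sketched in a few lines of Python. The conditional probabilities below are made-up toy values chosen for illustration, not estimates from any real corpus:

```python
# Chain-rule decomposition: the joint probability of a sentence is the
# product of each word's probability given all preceding words.
# These conditional probabilities are hypothetical toy values.
cond_probs = [
    ("its", 0.05),                           # P(its)
    ("water | its", 0.10),                   # P(water | its)
    ("is | its water", 0.40),                # P(is | its water)
    ("so | its water is", 0.20),             # P(so | its water is)
    ("transparent | its water is so", 0.05), # P(transparent | ...)
]

joint = 1.0
for _, p in cond_probs:
    joint *= p  # multiply successive conditional probabilities

print(f"P(its water is so transparent) = {joint:.2e}")  # 2.00e-05
```

Note how quickly the product shrinks: this is why real implementations work with log probabilities instead of raw products.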
Chain rule to Markov Model
• The previous equation suggests that we could estimate the joint probability of an
entire sequence of words by multiplying together a number of conditional
probabilities.
• But using the chain rule doesn’t really seem to help us! We don’t know any way to
compute the exact probability of a word given a long sequence of preceding words,
P(wn | w1, …, wn−1).
• As we said above, we can’t just estimate by counting the number of times every
word occurs following every long string, because language is creative, and any
particular context might have never occurred before!
• The intuition of the N-gram model is that instead of computing the probability of a
word given its entire history, we will approximate the history by just the last few
words
Markov Model
• Models that assign probabilities to sequences of words are called language models.
• The simplest model that assigns probabilities to sentences and sequences of words is
the n-gram.
• An n-gram is a sequence of n words: a 2-gram (which we’ll call a bigram) is a two-word
sequence like “please turn”, “turn your”, or “your homework”, and a 3-gram (a trigram)
is a three-word sequence like “please turn your” or “turn your homework”.
Maximum Likelihood Estimation of N-Gram Model
Parameters
• Maximum Likelihood Estimation (MLE) is a statistical method used to estimate the
parameters of a probability distribution that best describe a given dataset.
• The fundamental idea behind MLE is to find the values of the parameters that maximize
the likelihood of the observed data, assuming that the data are generated by the
specified distribution.
• A parameter is a numerical characteristic of a distribution.
• Normal distributions, as we know, have mean (µ) and variance (σ²) as parameters.
Binomial distributions have the number of trials (n) and probability of success (p) as
parameters. Gamma distributions have shape (k) and scale (θ) as parameters.
Exponential distributions have the rate (λ), the inverse of the mean, as the parameter.
Maximum Likelihood Estimation for bigram
probabilities

P(wi | wi−1) = C(wi−1, wi) / C(wi−1)

For example, given the corpus:

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like the green eggs and ham </s>

P(I | <s>) = 2/3 ≈ 0.67   P(Sam | <s>) = 1/3 ≈ 0.33   P(am | I) = 2/3 ≈ 0.67
P(</s> | Sam) = 1/2 = 0.5   P(Sam | am) = 1/2 = 0.5   P(do | I) = 1/3 ≈ 0.33
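The bigram estimates above can be reproduced directly from the toy corpus. This is a minimal sketch of MLE counting, not a full n-gram toolkit (no smoothing, no handling of unseen bigrams):

```python
from collections import Counter

# The three-sentence toy corpus from the example above.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like the green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs

def bigram_prob(prev, word):
    """MLE estimate: P(word | prev) = C(prev, word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<s>", "I"))     # 2/3
print(bigram_prob("I", "am"))      # 2/3
print(bigram_prob("Sam", "</s>"))  # 1/2
```

Because any unseen bigram gets probability zero under plain MLE, real systems add smoothing (e.g. Laplace or Kneser–Ney) on top of these counts.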
Intrinsic evaluation
• Intrinsic evaluation - Aims to measure the quality of embeddings by assessing their
performance on specific NLP tasks that are related to the embedding space itself, such
as word similarity, analogy, and classification.
• Cosine similarity

• Spearman correlation

• Accuracy
Cosine Similarity
Cosine similarity measures the similarity between two vectors by computing the
cosine of the angle between them. In the context of embeddings, cosine
similarity is often used to measure the similarity between two words, or between
a word and its context. The formula for cosine similarity is as follows:

cosine_similarity(V1, V2) = (V1 · V2) / (‖V1‖ ‖V2‖)

where V1 and V2 are the embeddings of two words.
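The formula translates directly into code. The three-dimensional "embeddings" below are made-up toy vectors, not output from any trained model:

```python
import math

def cosine_similarity(v1, v2):
    """cos(theta) = (v1 . v2) / (||v1|| * ||v2||)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Hypothetical toy embeddings:
king  = [0.8, 0.3, 0.1]
queen = [0.7, 0.4, 0.1]
apple = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))  # close to 1: similar words
print(cosine_similarity(king, apple))  # much lower: dissimilar words
```

Cosine similarity ignores vector length and compares only direction, which is why it is preferred over Euclidean distance for embeddings of differing magnitudes.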


Spearman Correlation
• Spearman correlation measures the
monotonic relationship between two
variables, which can be the similarity scores
of two sets of words or phrases computed
by humans and by embeddings.
• A high Spearman correlation indicates that
the embeddings are able to capture the
semantic relationships between words that
humans perceive.
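A minimal sketch of the rank-difference form of Spearman correlation, assuming no tied values (libraries such as SciPy handle ties properly). The similarity scores below are hypothetical:

```python
def spearman(x, y):
    """Spearman rank correlation for lists with no tied values:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the difference between the ranks of x_i and y_i."""
    n = len(x)
    rank_x = {v: i for i, v in enumerate(sorted(x), start=1)}
    rank_y = {v: i for i, v in enumerate(sorted(y), start=1)}
    d2 = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical similarity scores for five word pairs,
# as judged by humans and as computed from embeddings:
human_scores = [9.1, 7.4, 6.0, 3.2, 1.5]
model_scores = [0.92, 0.80, 0.55, 0.40, 0.05]

print(spearman(human_scores, model_scores))  # 1.0: identical ranking
```

The scales differ, but the rankings agree perfectly, so the correlation is 1.0; Spearman cares only about the ordering, not the raw scores.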
Accuracy
• Accuracy measures the performance of embeddings on classification tasks, such as
sentiment analysis or topic classification.
• Given a dataset of labeled examples, the embeddings are used to represent each
example, and a classifier is trained on these representations. The accuracy of the
classifier on a held-out test set is then used as a measure of the quality of the
embeddings.
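The final accuracy computation is simply the fraction of correct predictions. The sentiment labels below are a hypothetical held-out test set:

```python
def accuracy(predictions, labels):
    """Fraction of test examples the classifier labels correctly."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical predictions from an embedding-based sentiment classifier,
# compared against the true test-set labels:
preds  = ["pos", "neg", "pos", "pos", "neg"]
labels = ["pos", "neg", "neg", "pos", "neg"]
print(accuracy(preds, labels))  # 0.8
```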
Extrinsic evaluation
Extrinsic evaluation - aims to measure the quality of embeddings by assessing their
performance on downstream NLP tasks, such as machine translation or text classification,
that are not directly related to the embedding space itself.
F1 Score
F1 score is a metric commonly used in binary classification problems, such as sentiment
analysis or named entity recognition. It combines precision and recall into a single score
that ranges from 0 to 1. A high F1 score indicates that the embeddings are able to
capture the relevant features of the input data. The formula for F1 score is as follows:

F1 = 2 · (precision · recall) / (precision + recall)
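Starting from true-positive, false-positive, and false-negative counts, the F1 score can be sketched as follows; the confusion counts in the example are hypothetical:

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp)  # of the predicted positives, how many were right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts from a binary sentiment classifier:
# precision = 8/10 = 0.8, recall = 8/12 = 0.667, F1 = 8/11
print(f1_score(tp=8, fp=2, fn=4))
```

Because F1 is the harmonic mean of precision and recall, it is pulled toward the lower of the two, penalizing classifiers that trade one off against the other.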
Perplexity
• It measures how well a language model can predict a held-out test
set of text, given the embeddings as input. A low perplexity
indicates that the embeddings are able to capture the semantic and
syntactic structures of the language. For a test set W = w1 w2 … wN,
perplexity is the inverse probability of the test set, normalized by
the number of words:

PP(W) = P(w1, w2, …, wN)^(−1/N)
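Perplexity, the inverse probability of the test set normalized by its length, can be sketched from per-token probabilities; computing in log space avoids numerical underflow on long texts. The probabilities in the example are hypothetical model outputs:

```python
import math

def perplexity(token_probs):
    """PP = (prod p_i) ** (-1/N), computed in log space for stability."""
    n = len(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / n)

# Hypothetical per-token probabilities a model assigns to a test sentence:
probs = [0.2, 0.1, 0.25, 0.05]
print(perplexity(probs))

# Sanity check: a model that assigns 1/4 to every token has perplexity 4,
# as if it were choosing uniformly among 4 words at each step.
print(perplexity([0.25] * 8))  # 4.0
```

This is why perplexity is often read as an effective branching factor: lower perplexity means the model is, on average, less "surprised" by each test token.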
