Module 5
Discourse Analysis is the task of extracting meaning from a corpus or text at the level of connected sentences rather than isolated ones. Discourse Analysis is very important in Natural Language Processing and helps in training better NLP models.
Concept of Coherence
Coherence, in terms of discourse in NLP, means that the utterances make sense together, i.e., that there are meaningful connections and correlations among them. There is a strong connection between coherence and discourse structure (discussed in the next section). Properties of good text, such as coherence, are used to evaluate the quality of the output generated by natural language generation systems.
What are coherent discourse texts? Well, if we read a paragraph from a newspaper, we can see that the entire paragraph is interrelated; hence we can say that the discourse is coherent. But if we simply string newspaper headlines together, the result is not a discourse; it is just a group of sentences that is also non-coherent.
Let us now learn about the two major properties of coherence, i.e., Coherence relation
between utterances and Coherence relation between entities.
When we say that a discourse is coherent, it simply means that the discourse has some sort of meaningful connection. A coherence relation between utterances tells us that such a connection is present between the utterances.
If there is some kind of relationship between the entities, then we can also say that the discourse in NLP is coherent. This coherence between entities is known as entity-based coherence.
Discourse Structure
So far, we have discussed discourse and coherence, but we have not discussed the structure
of the discourse in NLP. Let us now look at the structure that discourse in NLP must have.
Now, the structure of the discourse depends on the type of segmentation applied to the
discourse.
What is discourse segmentation? Well, determining the underlying structure of a large discourse is termed discourse segmentation. Segmentation is difficult to implement, but it is very necessary, as discourse segmentation is used in fields like:
Information Retrieval,
Text summarization,
Information Extraction, etc.
Suppose we have a text, and the task is to segment it into multi-paragraph units, where each unit represents a passage of the text.
An unsupervised algorithm takes the help of cohesion, classifying related stretches of text and tying them together using linguistic devices. In simpler terms, unsupervised discourse segmentation means classifying and grouping similar texts with the help of coherence, without any labeled boundaries.
Unsupervised discourse segmentation can also be performed with the help of lexical cohesion. Lexical cohesion indicates the relationship among similar units, for example, synonyms.
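To make the idea concrete, here is a minimal TextTiling-style sketch in Python; the function names (segment_by_cohesion, cohesion), the window size, and the threshold are illustrative assumptions rather than a standard API. It places a boundary wherever word overlap between adjacent sentence windows drops below a threshold.

# Minimal TextTiling-style sketch: place segment boundaries where lexical
# cohesion (word overlap) between adjacent sentence windows drops.
# Function names, window size, and threshold are illustrative choices.

def cohesion(window_a, window_b):
    """Jaccard overlap between the word sets of two sentence windows."""
    words_a = {w.lower() for s in window_a for w in s.split()}
    words_b = {w.lower() for s in window_b for w in s.split()}
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def segment_by_cohesion(sentences, window=2, threshold=0.1):
    """Start a new segment at every position where cohesion falls below the threshold."""
    boundaries = [i for i in range(window, len(sentences) - window + 1)
                  if cohesion(sentences[i - window:i], sentences[i:i + window]) < threshold]
    segments, start = [], 0
    for b in boundaries:
        segments.append(sentences[start:b])
        start = b
    segments.append(sentences[start:])
    return segments

sentences = [
    "Rahul went to the bank to deposit money.",
    "The bank was crowded, so the deposit took an hour.",
    "Cricket is popular in India.",
    "The national cricket team plays matches all over the world.",
]
for segment in segment_by_cohesion(sentences):
    print(segment)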
In the segmentation described so far, there were no labeled segment boundaries separating the discourse segments. In supervised discourse segmentation, by contrast, we work with a training data set that has labeled boundaries. To differentiate or structure the discourse segments, we make use of cue words or discourse markers, which signal the discourse structure. As discourse in NLP can come from varied domains, these cue words or discourse markers are domain specific.
Text Coherence
As we have previously discussed, coherent discourse in NLP aims to find the coherence relations among the discourse text. To find structure in a discourse we can use lexical repetition, but lexical repetition alone cannot satisfy the conditions of coherent discourse. To capture such discourse relations, Hobbs proposed a set of coherence relations, described below.
Result
We can say that the state asserted by the first statement, i.e., S0, causes the state asserted by the second statement, i.e., S1. For example, Rahul is late. He will be punished.
In this example, the first statement, S0, i.e., Rahul is late, has caused the second statement, S1, i.e., He will be punished.
Explanation
In contrast to Result, here the second statement, i.e., S1, is the cause of the first statement, i.e., S0. For example, Rahul fought with his friend. He was drunk. Being drunk (S1) explains why the fight in S0 happened.
Parallel
By the term parallel, we mean that for the assertion from statement S0, i.e., p(a1, a2, …), and the assertion from statement S1, i.e., p(b1, b2, …), ai and bi are similar for all values of i.
In simpler terms, it shows us that the sentences are parallel. For example, He wants food.
She wants money. Both of the statements are parallel as there is a sense of want in both
sentences.
Elaboration
Elaboration means that the same proposition P can be inferred from both assertions, S0 and S1. For example, Rahul is from Delhi. Rohan is from Mumbai.
Occasion
Occasion holds when a change of state can be inferred from the first assertion, S0, and its final state can be inferred from statement S1, or vice versa. Let us take an example to understand the Occasion relation better: Rahul took the money. He gave it to Rohan.
In the previous section, we discussed how text coherence arises. Let us now try to build a hierarchical discourse structure from a group of statements. We generally create a hierarchical structure among the coherence relations to represent the entire discourse in NLP.
S1: Rahul went to the bank to deposit money.
S2: He then went to Rohan's shop.
S3: He wanted a phone.
S4: He did not have a phone.
S5: He also wanted to buy a laptop from Rohan's shop.
The entire discourse can then be represented using a hierarchical discourse structure built from the coherence relations among these statements.
Reference Resolution
Extracting the meaning or interpretation of the sentences of a discourse is one of the most important tasks in natural language processing, and to do so, we first need to know what or who is the entity being talked about. Reference resolution means determining which entity is being talked about.
By the term reference, we mean the linguistic expression that is used to denote an individual or an entity. For example, consider the passage about Rahul (S1–S5) from the previous section.
Let us now look at the various terminologies used in the reference resolution.
Referring expression:
The NLP expression that performs the reference is termed a referring expression. For example, in the passage above, expressions such as Rahul and He are referring expressions.
Referent:
Referent is the entity we have referred to. For example, in the above
passage, Rahul is the referent.
Co-refer:
As the name suggests, co-refer is the term used when two or more expressions refer to the same entity. For example, Rahul and He are used for the same entity, i.e., Rahul.
Antecedent:
The term that licenses the use of another referring term is called the antecedent. For example, in the above passage, Rahul is the antecedent of the reference He.
Anaphora & Anaphoric:
A referring expression that points back to an entity previously introduced in the discourse is termed anaphoric, and the phenomenon is called anaphora.
Discourse model:
It is the model that holds the overall representation of the entities that have been referred to in the discourse text, along with the relationships between them.
As we have previously discussed, the NLP expression that performs the reference is termed
a referring expression. We have mainly five types of referring expressions in Natural
Language Processing. Let us discuss them one by one.
1. Indefinite Noun Phrases
An indefinite noun reference is a kind of reference that represents an entity that is new to the hearer in the discourse context. To understand the indefinite noun phrase, let us take an example.
For example:
In the sentence Rahul is doing some work., some work is an indefinite noun phrase.
2. Definite Noun Phrases
A definite noun reference is a kind of reference that represents the entity that is not new to
the discourse context's hearer. The discourse context's hearer can easily identify the
definite noun reference. To understand the definite noun phrase, let us take an example.
For example:
In the sentence Rahul loves reading the Times of India., the Times of India is a definite noun phrase.
3. Pronouns
Pronouns are a form of definite reference (they work the same way as we have learned in English grammar).
For example:
In the sentence Rahul learned as much as he could., he is the pronoun referring to the noun Rahul.
4. Demonstratives
Demonstratives also point to nouns, but they behave differently from simple pronouns.
For example, that, this, these, and those are demonstratives.
5. Names
Names can be the name of a person, location, organization, etc., and are the simplest form of referring expression.
For example, in the above examples, Rahul is the name referring expression.
To resolve the reference, we can use the two resolution tasks. Let us discuss them one by
one.
1. Co-reference Resolution
In the Co-reference Resolution, the main aim is to find the referring expression from the
provided text that refers to the same entity. In a discourse in NLP, Co-refer is a term used
for an entity if two or more expressions are referring to the same entity.
For example, Rahul and He are used for the same entity, i.e., Rahul.
The Co-reference Resolution can be simply termed as finding the relevant co-refer
expressions among the provided discourse text. Let us take an example for more clarity.
For example, Rahul went to the farm. He cooked food. In this example, Rahul and He are the referring expressions.
There are some constraints on Co-reference Resolution. Let us learn about them.
In the English language, we have many pronouns. If we are using the pronouns he and she,
then we can easily resolve it. But if we are using the pronoun it, the resolution can be tricky,
and if we have a set of co-referring expressions, then it becomes more complex to resolve it.
In simpler terms, if we are using the it pronoun, then the exact determination of the
referred noun is complex.
2. Pronominal Anaphora Resolution
In Pronominal Anaphora Resolution, the aim is to find the antecedent of a single pronoun.
For example, in the passage - Rahul went to the farm. He cooked food., Rahul is the
antecedent of the reference He.
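As a heavily simplified illustration of pronominal anaphora resolution, the recency-based sketch below links a pronoun to the most recently mentioned capitalized token; the function name resolve_pronouns and the pronoun list are our own assumptions, and real resolvers also check gender, number, and syntactic constraints.

# Naive pronominal anaphora resolution: link each pronoun to the most
# recently seen capitalized (name-like) token. Purely illustrative.
PRONOUNS = {"he", "she", "it", "him", "her", "his", "they", "them"}

def resolve_pronouns(tokens):
    """Return (pronoun_index, antecedent_index) pairs using simple recency."""
    links = []
    last_entity = None
    for i, tok in enumerate(tokens):
        if tok.lower() in PRONOUNS:
            if last_entity is not None:
                links.append((i, last_entity))
        elif tok[:1].isupper():
            last_entity = i  # crude: treat any capitalized token as a potential referent
    return links

tokens = "Rahul went to the farm . He cooked food .".split()
for pron, ante in resolve_pronouns(tokens):
    print(tokens[pron], "->", tokens[ante])   # He -> Rahul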
Language Modeling: Introduction, N-Gram Models, Language Model Evaluation, Parameter Estimation, Language Model Adaptation, Types of Language Models, Language-Specific Modeling Problems, Multilingual and Cross-Lingual Language Modeling.
Introduction
In the modern world, a vast array of languages helps us express our feelings and thoughts. This capacity is unique to the human species, as it is a way to express distinct ideas and customs within different cultures and societies.
o Humans also have the capacity to use complex language far more than any
other species on Earth.
AI systems that understand these languages and generate text are known as
language models and are the latest and trending software technology in this decade.
Language modeling is a crucial element in modern NLP applications and makes the
machines understand qualitative information. Each language model type, in one way
or another, turns the qualitative information generated by humans into quantitative
information, which in turn allows people to communicate with machines as they do
with each other to a limited extent.
Let us understand further about N-grams and language models and how they are built using
N-grams.
Language Modeling
Language modeling (LM) is the use of various statistical and probabilistic techniques to
determine the probability of a given sequence of words occurring in a sentence. Language
models assign the probabilities to a sentence or a sequence of words or the probability of an
upcoming word given a previous set of words.
Language models are useful for a vast number of NLP applications such as next word
prediction, machine translation, spelling correction, authorship Identification, and
natural language generation.
The central idea in language models is to use probability distributions over word
sequences that describe how often the sequence occurs as a sentence in some
domain of interest.
Language models estimate the likelihood of texts belonging to a language. The
sequences are divided into multiple elements, and the language model models the
probability of an element given the previous elements.
o The elements can be bytes, characters, subwords or tokens. Then, the
sequence likelihood is the product of the elements’ probabilities.
Language models are primarily of two kinds: N-Gram language models and Grammar-
based language models such as probabilistic context-free grammar.
Language models can also be classified into Statistical Language Models and Neural
language models.
Utility of generating language text using language modeling: Independently of any
application, we could use a language model as a random sentence generator where
we sample sentences according to their language model probability.
o There are very few real-world use cases where you want to actually generate
language randomly.
o But the understanding of how to do this and what happens when you do so
will allow us to do more interesting things later.
A statistical language model is simply a probability distribution over all possible sentences.
Statistical language models learn the probability of word occurrence based on examples of
text.
Neural Language Models use different kinds of approaches, such as feedforward neural networks, recurrent neural nets, attention-based networks, and transformer-based neural nets, to model the language, and they have surpassed statistical language models in their effectiveness.
Neural language models have many advantages over the statistical language models
as they can handle much longer histories and also can generalize better over contexts
of similar words and are more accurate at word prediction.
Neural net language models are also much more complex and slower and need
more energy to train and are less interpretable than statistical language models.
Hence for practical purposes where there are not a lot of computing power and
training data, and especially for smaller tasks, statistical language models like the n-
gram language model are the right tool.
On the other side, Large Language Models (LLMs) based on neural networks, in particular, represent the state of the art and gave rise to major advancements in NLP and AI.
o They hold the promise of transforming domains through learned knowledge and are slowly becoming ubiquitous in day-to-day text and speech use cases.
o LLM sizes have also been increasing roughly 10X every year for the last few years, and these models grow in complexity and size along with their capabilities, for example, becoming few-shot learners.
N-gram Models
N-gram models are a particular set of language models based on the statistical frequency of
groups of tokens.
An n-gram is an ordered group of n tokens. The bigrams of the sentence The cat eats
fish. are (The, cat), (cat, eats), (eats, fish) and (fish, .). The trigrams are (The, cat,
eats), (cat, eats, fish) and (eats, fish, .).
The smallest n-grams with n =1 are called unigrams. Unigrams are simply the tokens
appearing in the sentence.
The conditional probability that a certain token appears after the previous tokens is estimated by Maximum Likelihood Estimation on a set of training sequences.
N-grams - the central concept is that the next word depends on the previous n words.
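The bigram and trigram example above can be reproduced with a few lines of Python; this is a minimal sketch using only the standard library, with a deliberately naive tokenization that treats the final period as its own token.

# Extract unigrams, bigrams, and trigrams from the example sentence.
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["The", "cat", "eats", "fish", "."]
print(ngrams(tokens, 1))  # unigrams: ('The',), ('cat',), ('eats',), ('fish',), ('.',)
print(ngrams(tokens, 2))  # bigrams: ('The', 'cat'), ('cat', 'eats'), ('eats', 'fish'), ('fish', '.')
print(ngrams(tokens, 3))  # trigrams: ('The', 'cat', 'eats'), ('cat', 'eats', 'fish'), ('eats', 'fish', '.')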
Intuitive Formulation
The intuitive idea behind n-grams and n-gram models is that instead of computing the
probability of a word given its entire history, we can approximate the history by just the last
few words like humans do while understanding speech and text.
P(wn | w1 … wn−1) ≈ P(wn)                              (unigram)
P(wn | w1 … wn−1) ≈ P(wn | wn−1)                       (bigram)
P(wn | w1 … wn−1) ≈ P(wn | wn−2, wn−1)                 (trigram)
P(wn | w1 … wn−1) ≈ P(wn | wn−3, wn−2, wn−1)           (4-gram)
P(wn | w1 … wn−1) ≈ P(wn | wn−4, wn−3, wn−2, wn−1)     (5-gram)
Hence N-Gram models are also a simple class of language models (LM's) that assign
probabilities to sequences of words using shorter context.
Using these assigned probabilities, we can predict the chance of a word occurring in a given position, for example, the last word of an n-gram given the previous words.
N-gram models are the simplest and most common kind of language model.
The bigram model approximates the probability of a word given all the previous
words by using only the conditional probability of one preceding word.
So, the prediction for the next word is dependent on the previous word alone.
It is also one of the widely used models.
Probability Estimation
There are two main steps generally in building a machine learning model: Defining
the model and Estimating the model’s parameters, which is called the training or the
learning step. For language models, the definition depends on the model we choose.
There are also two quantities we need to estimate for developing the language
models for all words in the vocabulary of the language for which we are working
with:
o The Probability of observing a sequence of words from a language. For
example, Pr(Colorless green ideas sleep furiously)
This is the probability of a sentence or sequence Pr(w1,w2,…,wn)
o The Probability of observing a word having observed a sequence. For
example, Pr(furiously | Colorless green ideas)
This is the probability of the next word in a sequence Pr(wk+1∣w1,
…,wk)
o Pr(w1, w2, …, wn) is short for Pr(W1=w1, W2=w2, …, Wn=wn).
o The notation denotes that the random variable W1 takes on the value w1, and so on. E.g., Pr(I, love, fish) = Pr(W1=I, W2=love, W3=fish).
Typical assumptions made in probability estimation methods: Probability models
almost always make independence assumptions.
o Even though two variables, X and Y, are not actually independent, the model
may treat them as independent, which can drastically reduce the number of
parameters to be estimated.
o Models without independence assumptions have way too many parameters
to estimate reliably from the data we may have and are intractable.
o Sometimes, the independence assumptions may not be correct, and models relying on those incorrect assumptions may often be incorrect as well; for example, they may assign probability mass to events that cannot occur.
Probability estimation without independence assumption: If there are no
independence assumptions about the sequence, then one way to estimate is the
fraction of times we see it. Pr(w1, w2, …, wn) = #(w1, w2, …, wn) / N where N is the
total number of sequences.
o Estimating from frequency fractions is problematic because we would need to have seen the particular sentence many times to assign a good probability via Pr(w1, w2, …, wn) = #(w1, w2, …, wn) / N.
o Also, estimating from sparse observations is unreliable, and we won't have a
solution for new sequences.
Maximum Likelihood estimation: The most basic parameter estimation technique is
the relative frequency estimation (frequencies are counts) which is also called the
method of Maximum Likelihood Estimation (MLE).
o The estimation simply works by counting the number of times a word appears in a given context and then normalizing the counts into probabilities. We also need some source text corpora.
o Chain rule of probability in estimation: To estimate the probabilities, we usually rely on the Chain Rule of Probability, which decomposes the joint probability into a product of conditional probabilities; independence assumptions then shorten each conditional's history.
It is also to be kept in mind that estimating conditional probabilities
with long contexts is usually difficult, and for example, conditioning
on 4 or more words itself is very hard.
Markov assumption in probability estimation: The use of Markov assumption in
probability estimation solves a lot of problems we encounter with data sparsity and
conditional probability calculations.
o The assumption is that the probability of a word depends only on the
previous word(s). It is like saying the next event in a sequence depends only
on its immediate past context.
o Markov models are the class of probabilistic models that assume that we can
predict the probability of some future unit without looking too far into the
past.
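Putting the chain rule, the Markov assumption, and relative-frequency (MLE) estimation together, a bigram model can be sketched as follows; the tiny corpus, the <s> and </s> boundary markers, and the function names are illustrative assumptions.

# Bigram language model by Maximum Likelihood Estimation:
#   P(w_n | w_{n-1}) = count(w_{n-1}, w_n) / count(w_{n-1})
# Sentence probability uses the chain rule with a first-order Markov assumption.
from collections import Counter

corpus = [
    "<s> I love fish </s>".split(),
    "<s> I love food </s>".split(),
    "<s> He wants food </s>".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1])
                        for sent in corpus for i in range(len(sent) - 1))

def bigram_prob(prev, word):
    """MLE estimate of P(word | prev); 0.0 for unseen events (no smoothing)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(tokens):
    """P(w1..wn) approximated as a product of bigram probabilities."""
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(bigram_prob("I", "love"))                       # 1.0: 'love' always follows 'I' here
print(sentence_prob("<s> I love food </s>".split()))  # ~0.333 on this toy corpus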
Most language models currently in practice, especially statistical language models, are
extremely sensitive to changes in the style, topic, or genre of the text on which they are
trained.
Hence they are very sensitive to the training corpus on which they are trained.
For example, one is much better off using a very big corpus of words of transcripts
from telephone conversations for modeling casual phone conversations than using a
corpus even of millions of words of transcripts from TV and radio news broadcasts.
The effect is quite strong, even for changes that seem trivial to a human. For
example, a language model trained on a certain publication of news wire text will see
its perplexity doubled when applied to a very similar news publication of wire text
from the same time period.
Smoothing
Laplace smoothing merely adds the number one to each count (hence the alternate name
adds one smoothing). We also need to adjust the denominator to take into account the
extra observations since there is a fixed number of words in the vocabulary, and for unseen
words, there is an increment of one.
Laplace smoothing does not perform well enough to be used in modern n-gram
models but is a useful tool and introduction to most other concepts.
A related way to view Laplace smoothing is as discounting: lowering some non-zero counts in order to free up the probability mass that will be assigned to the zero counts.
Add-k smoothing: One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k. This algorithm is therefore called add-k smoothing.
o Add-k smoothing requires that we have a method for choosing k, and that can
be done by optimizing for k by trying different values on a holdout set.
o Add-k smoothing is useful for some tasks like text classification also, in
addition to probability estimation, but still doesn’t work well for language
modeling as it generates counts with poor variances and often inappropriate
discounts.
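A minimal sketch of add-one (Laplace) and add-k smoothed bigram probabilities; the toy corpus, the vocabulary size V, and the value of k are illustrative assumptions.

# Add-k smoothed bigram probability:
#   P(w_n | w_{n-1}) = (count(w_{n-1}, w_n) + k) / (count(w_{n-1}) + k * V)
# With k = 1 this is Laplace (add-one) smoothing; a fractional k moves less
# probability mass from seen to unseen events.
from collections import Counter

corpus = [
    "<s> I love fish </s>".split(),
    "<s> I love food </s>".split(),
]
unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
V = len(unigram_counts)  # vocabulary size

def smoothed_bigram_prob(prev, word, k=1.0):
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * V)

print(smoothed_bigram_prob("love", "fish"))           # seen bigram, add-one
print(smoothed_bigram_prob("love", "money"))          # unseen bigram still gets some mass
print(smoothed_bigram_prob("love", "money", k=0.05))  # add-k with a small fractional k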
Language Model Evaluation
Language models are very useful in a broad range of applications like speech recognition, machine translation, part-of-speech tagging, parsing, Optical Character Recognition (OCR), handwriting recognition, information retrieval, and many other daily tasks.
One of the main steps in the usage of language models is to evaluate the
performance beforehand and use them in further tasks.
This lets us build confidence in the handling of the language models in NLP and also
lets us know if there are any places where the model may behave
uncharacteristically.
In practice, we need to decide on the dataset to use, the method to evaluate, and also
select a metric to evaluate language models. Let us learn about each of the elements
further.
Evaluating a language model lets us know whether one language model is better
than another during experimentation and also to choose among already trained
models.
There are two ways to evaluate language models in NLP: Extrinsic
evaluation and Intrinsic evaluation.
o Intrinsic evaluation captures how well the model captures what it is supposed
to capture, like probabilities.
o Extrinsic evaluation (or task-based evaluation) captures how useful the model
is in a particular task.
Comparing among language models: We compare models by collecting a corpus of text that is common to the models being compared.
o We then divide the data into training and test sets and train the parameters
of both models on the training set.
o We then compare how well the two trained models fit the test set.
After we train the models, whichever model assigns a higher probability to the test set is generally considered to predict the test set more accurately and is hence the better model.
Among multiple probabilistic language models, the better model is the one that has a
tighter fit to the test data or that better predicts the details of the test data and
hence will assign a higher probability to the test data.
Extrinsic Evaluation
Extrinsic evaluation is the best way to evaluate the performance of a language model
by embedding it in an application and measuring how much the application improves.
Intrinsic Evaluation
We need to take advantage of intrinsic measures because running big language models in
NLP systems end-to-end is often very expensive, and it is easier to have a metric that can be
used to quickly evaluate potential improvements in a language model.
We also need a test set for an intrinsic evaluation of a language model in NLP.
The probabilities of an N-gram model come from the corpus it is trained on, called the training set or training corpus.
We can then measure the quality of an N-gram model by its performance on some
unseen test set data called the test set or test corpus.
We will also sometimes call test sets and other datasets that are not in our training
sets held out corpora because we hold them out from the training data.
Good scores during intrinsic evaluation do not always mean better scores during extrinsic
evaluation, so we need both types of evaluation in practice.
Perplexity
Perplexity is a very common method to evaluate the language model on some held-out
data. It is a measure of how well a probability model predicts a sample.
The Intuition
The basic intuition is that the lower the perplexity measure is, the better the language model is at modeling unseen sentences.
Perplexity can also be seen as a simple monotonic function of entropy. But
perplexity is often used instead of entropy due to the fact that it is arguably more
intuitive to our human minds than entropy.
Calculating Perplexity
PP(W) = P(w1 w2 … wN)^(−1/N), i.e., the N-th root of 1 / P(w1 w2 … wN).
o We know that if a language model assigns a high probability to unseen sentences from the test set, then such a language model is more accurate.
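The formula above can be applied directly once a model has assigned a probability to each test token; the sketch below uses made-up per-token probabilities purely for illustration and computes perplexity in log space to avoid underflow.

# Perplexity as the inverse probability of the test sequence,
# normalized by its length: PP(W) = P(w1..wN) ** (-1/N).
import math

def perplexity(token_probs):
    """token_probs: probability the model assigned to each token of the test set."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# Hypothetical per-token probabilities from two different models on the same test sentence.
better_model = [0.2, 0.5, 0.4, 0.6]
worse_model = [0.05, 0.1, 0.02, 0.1]
print(perplexity(better_model))  # lower perplexity -> tighter fit to the test data
print(perplexity(worse_model))   # higher perplexity -> worse fit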
Interpreting Perplexity
Perplexity intuitively provides a more human way of thinking about the random
variable’s uncertainty. The reasoning is that the perplexity of a uniform discrete
random variable with K outcomes is K.
o Example: The perplexity of a fair coin is two and the perplexity of a fair six-
sided die is six.
o This kind of framework provides a frame of reference for interpreting a
perplexity value.
Simple framework to interpret perplexity: If the perplexity of some random variable
X is 10, our uncertainty towards the outcome of X is equal to the uncertainty we
would feel towards a 10-sided die, helping us intuit the uncertainty more deeply.
Entropy
Entropy is a metric that has been used to quantify the randomness of a process in many
fields and compare worldwide languages, specifically in computational linguistics.
Definition for Entropy: The entropy (also called self-information) of a random variable is
the average level of the information, surprise, or uncertainty inherent to the single
variable's possible outcomes.
The more certain or deterministic an event is, the less information it contains. In a nutshell, the more uncertainty (entropy) there is in a process, the more information each outcome carries.
The entropy of a discrete distribution p(x) over the event space X is given by H(p) = −Σ x∈X p(x) log p(x).
o H(X) >= 0; H(X) = 0 only when the value of X is fully determined, hence providing no new information.
o The smallest possible entropy for any distribution is zero.
o We also know that the entropy of a probability distribution is maximized
when it is uniform.
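The fair-coin and fair-die examples mentioned under perplexity can be checked directly with a short sketch that computes H(p) in bits and the corresponding perplexity 2^H; the distributions are illustrative.

# Entropy of a discrete distribution: H(p) = -sum_x p(x) * log2(p(x)).
# The perplexity of the distribution is 2 ** H(p).
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
fair_die = [1 / 6] * 6
biased_coin = [0.9, 0.1]

print(entropy(fair_coin), 2 ** entropy(fair_coin))      # 1.0 bit, perplexity 2
print(entropy(fair_die), 2 ** entropy(fair_die))        # ~2.585 bits, perplexity ~6
print(entropy(biased_coin), 2 ** entropy(biased_coin))  # lower entropy: more predictable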
Entropy in Different Fields of NLP & AI
From a language and probability-theory perspective in NLP, entropy can also be defined as a statistical parameter that measures how much information is produced, on average, for each letter of a text in the language.
o If the language is translated into binary digits (0 or 1) in the most efficient
way, the entropy H is the average number of binary digits required per letter
of the original language.
From a machine learning perspective, entropy is a measure of uncertainty, and the
objective of the machine learning model is to minimize uncertainty.
o Decision tree learning algorithms use relative entropy to determine the
decision rules that govern the data at each node.
o Classification algorithms in machine learning like logistic regression or
artificial neural networks often employ a standard loss function called cross
entropy loss that minimizes the average cross entropy between ground truth
and predicted distributions.
Cross Entropy
Due to the fact that we can not access an infinite amount of text in the language, and the
true distribution of the language is unknown, we define a more useful and usable
metric called Cross Entropy.
Intuition for Cross entropy: It is often used to measure the closeness of two distributions, where one is the distribution Q that the language model learns from the sample text, aiming to approximate as closely as possible the other, the empirical distribution P of the language.
o Mathematically, cross-entropy is defined as H(P, Q) = EP[−log Q], which can also be written as H(P, Q) = H(P) + DKL(P||Q).
H(P) is the entropy of P and DKL(P||Q) is the Kullback–Leibler (KL) divergence of Q from P. It is also known as the relative entropy of P with respect to Q.
From the formulation, we can see that the cross entropy of Q with respect to P is the sum of two terms, entropy and relative entropy:
o H(P), the entropy of P, is the average number of bits needed to encode any possible outcome of P.
o DKL(P||Q), the number of extra bits required to encode any possible outcome of P using a code optimized for Q.
The empirical entropy H(P) cannot be optimized, so when we train a language model with the objective of minimizing the cross-entropy loss, the true objective is to minimize the KL divergence between the distribution learned by our language model and the empirical distribution of the language.
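A short numeric check of the decomposition H(P, Q) = H(P) + DKL(P||Q) on two small made-up distributions (P standing in for the empirical distribution, Q for the model's learned distribution):

# Cross entropy: H(P, Q) = -sum_x P(x) * log2(Q(x))
# KL divergence: DKL(P||Q) = sum_x P(x) * log2(P(x) / Q(x))
# The identity H(P, Q) = H(P) + DKL(P||Q) should hold exactly.
import math

P = [0.7, 0.2, 0.1]  # illustrative "empirical" distribution
Q = [0.5, 0.3, 0.2]  # illustrative distribution learned by a model

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(cross_entropy(P, Q))               # H(P, Q)
print(entropy(P) + kl_divergence(P, Q))  # H(P) + DKL(P||Q): the same value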
Tokenizers in Language Models: Tokenization is the first and important step in any
NLP pipeline, especially for language models which break unstructured data and
natural language text into chunks of information that can be considered as discrete
elements.
o The token occurrences in a document can be used directly as
a vector representing that document.
o The goal when crafting the vocabulary with tokenizers is to do it in such a way
that the tokenizer tokenizes as few words as possible into the unknown token.
Issue with unknown vocabulary/tokens: The general approach in most tokenizers is
to encode the rare words in your dataset using a special token UNK by convention so
that any new out-of-vocabulary word would be labeled as belonging to the rare word
category.
o We expect the model to learn how to deal with the other words from the
custom UNK token.
o It is also generally a bad sign if we see the tokenizer producing a lot of these unknown tokens, as it means the tokenizer was not able to retrieve a sensible representation of a word and we are losing information along the way.
Methods to handle unknown tokens / OOV (out of vocabulary): Character-level embeddings and sub-word tokenization are some effective ways to handle unknown tokens.
o Under sub-word tokenization, WordPiece and BPE are de facto
methods employed by successful language models such as BERT and GPT, etc.
Character level embeddings: Character and subword embeddings are introduced as
an attempt to limit the size of embedding matrices such as in BERT but they have
the advantage of being able to handle new slang words, misspellings, and OOV
words.
o The required embedding matrix is much smaller than what is required for
word-level embeddings. Generally, the vectors represent each character in
any language
o Example: Instead of a single vector for "king" like in word embeddings, there
would be a separate vector for each of the letters "k", "i", "n", and "g".
o Character embeddings do not encode the same type of information that word
embeddings contain and can be thought of as encoding lexical information
and may be used to enhance or enrich word-level embeddings.
o Character-level embeddings are also generally shallow in meaning, but given the character embeddings, a vector can be formed for every single word, even an out-of-vocabulary one.
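As a toy illustration of how a vector for an out-of-vocabulary word could be assembled from character embeddings, the sketch below simply averages fixed random character vectors; the dimensionality and the averaging scheme are assumptions for illustration, while real systems typically run a CNN or RNN over the character sequence.

# Build a word vector for any string, including OOV words and misspellings,
# by averaging random (untrained) character embeddings. Purely illustrative.
import random

DIM = 8
random.seed(0)
char_vectors = {c: [random.uniform(-1, 1) for _ in range(DIM)]
                for c in "abcdefghijklmnopqrstuvwxyz"}

def word_vector(word):
    vectors = [char_vectors[c] for c in word.lower() if c in char_vectors]
    if not vectors:
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]

print(word_vector("king"))   # a known word
print(word_vector("kiing"))  # a misspelling still receives a (similar) vector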
Subword tokenization: Subword tokenization allows the model to have a reasonable
vocabulary size while being able to learn meaningful context-independent
representations and also enables the model to process words it has never seen
before by decomposing them into known subwords.
o Example: The word refactoring can be split into re, factor, and ing.
Subwords re, factor, and ing occur more frequently than the word refactoring,
and their overall meaning is also kept intact.
Byte-Pair Encoding (BPE): BPE was initially developed as an algorithm to compress
texts and then used by OpenAI for tokenization when pretraining the GPT model.
o It is used by a lot of Transformer models like GPT, GPT-2, RoBERTa, BART, and
DeBERTa.
o BPE strikes a balance between character-level and word-level representations, which makes it capable of managing large corpora.
o This kind of behavior also enables the encoding of any rare words in the
vocabulary with appropriate subword tokens without introducing any
“unknown” tokens.
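The training side of BPE can be sketched compactly: repeatedly count adjacent symbol pairs over the corpus vocabulary and merge the most frequent pair. The toy word frequencies, the number of merges, and the end-of-word marker </w> below are illustrative choices, not the exact procedure of any particular model.

# Byte-Pair Encoding (training): repeatedly merge the most frequent adjacent
# symbol pair. Toy corpus and 10 merges chosen purely for illustration.
from collections import Counter

word_freqs = {"refactoring": 3, "factoring": 2, "refactor": 2, "ring": 1}
# Represent each word as a sequence of characters plus an end-of-word marker.
vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}

def pair_counts(vocab):
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    new_vocab = {}
    for symbols, freq in vocab.items():
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[tuple(merged)] = freq
    return new_vocab

merges = []
for _ in range(10):
    pairs = pair_counts(vocab)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)       # learned merge rules, applied in order at tokenization time
print(list(vocab))  # words now represented as sequences of learned subword units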
Multilingual and Cross-Lingual Language Modeling
Although studies in the field of NLP, or Natural Language Processing, have been carried out for years, most studies in the field have focused on English. The vast majority of sentences that machines could understand, or rather perceive and encode, were in English. Breaking this English-dominated orientation and enabling machines to perceive and encode almost every language that exists, in a global manner, is called Cross-Lingual Language study.
Cross-Lingual NLP is a very difficult and complex process. The reason for this complexity and difficulty lies in the fundamental differences between languages. Each of the more than 5,000 languages spoken around the world has its own rules and structures, so machines need to be trained to recognize these languages and make sense of their forms. For this reason, although there are language recognition systems that work quite well across different languages today, the process is still evolving, and this takes time.
As the world we live in globalizes, technologies that can only analyze English are far from
reality. For important tasks to be performed by computers, machines need to understand
many languages and be able to scan documents in these languages. This is especially
important in the finance and education sectors.
There are many different challenges in the development of Multi-Language NLP. For this
reason, many different approaches have been adopted over time for the development of
technology. We will discuss some of these approaches and models in detail.
One of the less successful models or approaches is to write algorithms that are trained on
each language separately. This proved to be very costly and time-consuming. Given the low
success rate, this approach has fallen behind.
There are some examples of this approach, for example, there are some models that are
trained in German only. The downside of this endeavor is that it is extremely difficult. Many
companies need models that can recognize several languages at once, and training models
for each language individually would cost millions of dollars and take months.
NLP is based on analyzing a lot of data. Machines can analyze huge amounts of text at the same time and recognize languages and their vector representations. We will discuss some of the multilingual NLP models in detail.
We can take a brief look at how Non-English pipelines are formed to better understand
these models. First, large amounts of data need to be collected. This data will then be
labeled. A large amount of cleaning may be required within the data. The more resources
available in the language, the easier it will be to prepare these pipelines. In this way, models
can be trained more accurately and faster.
It is very important to mention Multilingual NLP Tasks. There is still a digital perception that
English is the language that everyone knows, that it is innate, and that it is used worldwide.
This idea leads to social inequalities and these social inequalities are reflected in the future,
in the technological world.
First of all, when machines can analyze a language, they encode and decode not only the linguistic structure but also the culture to which this language is connected. Therefore, this process, which appears to be purely technological, becomes cultural and social to a great extent. Problems such as racism, sexism, and discrimination in algorithms stem from such one-sided approaches.
On the other hand, the world is far from being a place where nations live in isolation; it is very global. The whole world interacts with each other. This increases the need for multilingual NLP, and the technology is actively used in many different fields.
Many different models make this possible. Different approaches and diversity in the field
increase the chances of success. But some models stand out because they are more popular
or more successful than others. The most important factor in the popularity of these models
is their ease of use. Models that can be developed without the need for too much time and
financial resources are of course preferred by many people. Let's take a brief look at some
of these popular models.
mBERT
· Masked Language Modeling (MLM): Here, the model randomly masks 15% of the words in the sentence or text, and the model has to guess these masked words. This distinguishes it from approaches such as RNNs, which process the words one after the other.
· Next Sentence Prediction (NSP): Here the model combines two masked sentences as
input. These sentences may or may not be ordered in the text. The model then needs to
predict whether these sentences follow one after the other.
Through them, the model gains insight into how languages work. It can perceive languages
without human supervision.
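If the Hugging Face transformers library and the bert-base-multilingual-cased checkpoint are available (an assumption on our part, not something the text above prescribes), masked word prediction with mBERT can be tried in a few lines:

# Masked language modeling with mBERT via the fill-mask pipeline.
# Assumes: pip install transformers torch; the checkpoint downloads on first use.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-multilingual-cased")

# mBERT uses [MASK] as its mask token; the pipeline returns candidate fillers with scores.
for prediction in unmasker("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))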
XLM
XLM uses two different pre-training methods, one unsupervised and one supervised. The supervised objective uses pairs of parallel sentences, where one sentence is in the source language and the other is in the target language. XLM is released with many different pre-trained checkpoints covering these setups.
MultiFiT
MultiFiT is a model that works differently than the other two models. It is based on the tokenization of subwords, not words, and uses QRNN models.
Let's briefly explain subword tokenization. Morphology is the study of the structure, formation, and inflection of words. Therefore, working only on whole words does not give accurate results in morphologically rich languages.
The tokenization of these subwords allows the machines to detect words that are not very
common.