Module 5
Discourse Analysis is the task of extracting meaning from a corpus or text at the level of connected sentences rather than isolated ones. Discourse Analysis is very important in Natural Language Processing and helps in training better NLP models.
Concept of Coherence
Coherence, in terms of discourse in NLP, means that the utterances make sense together, i.e., that there are meaningful connections and correlations among them. There is a strong connection between coherence and discourse structure (discussed in the next section). Properties of good text, such as coherence, are used to evaluate the quality of the output generated by natural language generation systems.
What are coherent discourse texts? Well, if we read a paragraph from a newspaper, we can see that the entire paragraph is interrelated; hence we can say that the discourse is coherent. But if we simply string newspaper headlines together, the result is not a discourse; it is just a group of sentences that is also non-coherent.
Let us now learn about the two major properties of coherence, i.e., Coherence relation
between utterances and Coherence relation between entities.
When we say that a discourse is coherent, it simply means that the discourse has some sort of meaningful connection. A coherence relation between utterances tells us that such a connection is present between the utterances.
If there is some kind of relationship between the entities, then we can also say that the discourse in NLP is coherent. This coherence between entities is known as entity-based coherence.
Discourse Structure
So far, we have discussed discourse and coherence, but we have not discussed the structure
of the discourse in NLP. Let us now look at the structure that discourse in NLP must have.
Now, the structure of the discourse depends on the type of segmentation applied to the
discourse.
What is discourse segmentation? Well, determining the underlying structure of a large discourse is termed discourse segmentation. Segmentation is difficult to implement, but it is very necessary, as discourse segmentation is used in fields like:
Information Retrieval,
Text summarization,
Information Extraction, etc.
Suppose we have a text, and the task is to segment it into multi-paragraph units, where each unit represents a passage of the text.
An unsupervised algorithm takes the help of cohesion, classifying related stretches of text and tying them together using linguistic devices. In simpler terms, unsupervised discourse segmentation means classifying and grouping similar texts with the help of coherence, without any labeled boundaries.
Unsupervised discourse segmentation can also be performed with the help of lexical cohesion. Lexical cohesion indicates the relationship among similar units, for example, synonyms.
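To make the idea concrete, here is a minimal TextTiling-style sketch in Python; the function names (segment_by_cohesion, cohesion), the window size, and the threshold are illustrative assumptions rather than a standard API. It places a boundary wherever word overlap between adjacent sentence windows drops below a threshold.

# Minimal TextTiling-style sketch: place segment boundaries where lexical
# cohesion (word overlap) between adjacent sentence windows drops.
# Function names, window size, and threshold are illustrative choices.

def cohesion(window_a, window_b):
    """Jaccard overlap between the word sets of two sentence windows."""
    words_a = {w.lower() for s in window_a for w in s.split()}
    words_b = {w.lower() for s in window_b for w in s.split()}
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def segment_by_cohesion(sentences, window=2, threshold=0.1):
    """Start a new segment at every position where cohesion falls below the threshold."""
    boundaries = [i for i in range(window, len(sentences) - window + 1)
                  if cohesion(sentences[i - window:i], sentences[i:i + window]) < threshold]
    segments, start = [], 0
    for b in boundaries:
        segments.append(sentences[start:b])
        start = b
    segments.append(sentences[start:])
    return segments

sentences = [
    "Rahul went to the bank to deposit money.",
    "The bank was crowded, so the deposit took an hour.",
    "Cricket is popular in India.",
    "The national cricket team plays matches all over the world.",
]
for segment in segment_by_cohesion(sentences):
    print(segment)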
In the segmentation described so far, there were no labeled segment boundaries separating the discourse segments. In supervised discourse segmentation, by contrast, we work with a training data set that has labeled boundaries. To differentiate or structure the discourse segments, we make use of cue words or discourse markers, which signal the discourse structure. As discourse in NLP can come from varied domains, these cue words or discourse markers are domain specific.
Text Coherence
As we have previously discussed, coherent discourse in NLP aims to find the coherence relations among the discourse text. To find structure in a discourse we can use lexical repetition, but lexical repetition alone cannot satisfy the conditions of coherent discourse. To capture such discourse relations, Hobbs proposed a set of coherence relations, described below.
Result
We can say that the state asserted by the first statement, i.e., S0, causes the state asserted by the second statement, i.e., S1. For example, Rahul is late. He will be punished.
In this example, the first statement, S0, i.e., Rahul is late, has caused the second statement, S1, i.e., He will be punished.
Explanation
In contrast to Result, here the second statement, i.e., S1, is the cause of the first statement, i.e., S0. For example, Rahul fought with his friend. He was drunk. Being drunk (S1) explains why the fight in S0 happened.
Parallel
By the term parallel, we mean that for the assertion from statement S0, i.e., p(a1, a2, …), and the assertion from statement S1, i.e., p(b1, b2, …), ai and bi are similar for all values of i.
In simpler terms, it shows us that the sentences are parallel. For example, He wants food.
She wants money. Both of the statements are parallel as there is a sense of want in both
sentences.
Elaboration
Elaboration means that the same proposition P can be inferred from both assertions, S0 and S1. For example, Rahul is from Delhi. Rohan is from Mumbai.
Occasion
Occasion holds when a change of state can be inferred from the first assertion, S0, and its final state can be inferred from statement S1, or vice versa. Let us take an example to understand the Occasion relation better: Rahul took the money. He gave it to Rohan.
In the previous section, we discussed how text coherence arises. Let us now try to build a hierarchical discourse structure from a group of statements. We generally create a hierarchical structure among the coherence relations to represent the entire discourse in NLP.
S1: Rahul went to the bank to deposit money.
S2: He then went to Rohan's shop.
S3: He wanted a phone.
S4: He did not have a phone.
S5: He also wanted to buy a laptop from Rohan's shop.
The entire discourse can then be represented using a hierarchical discourse structure built from the coherence relations among these statements.
Reference Resolution
Extracting the meaning or interpretation of the sentences of a discourse is one of the most important tasks in natural language processing, and to do so, we first need to know what or who is the entity being talked about. Reference resolution means determining which entity is being talked about.
By the term reference, we mean the linguistic expression that is used to denote an individual or an entity. For example, consider the passage about Rahul (S1–S5) from the previous section.
Let us now look at the various terminologies used in the reference resolution.
Referring expression:
The NLP expression that performs the reference is termed a referring expression. For example, in the passage above, expressions such as Rahul and He are referring expressions.
Referent:
Referent is the entity we have referred to. For example, in the above
passage, Rahul is the referent.
Co-refer:
As the name suggests, co-refer is the term used when two or more expressions refer to the same entity. For example, Rahul and He are used for the same entity, i.e., Rahul.
Antecedent:
The term that licenses the use of another referring term is called the antecedent. For example, in the above passage, Rahul is the antecedent of the reference He.
Anaphora & Anaphoric:
A referring expression that points back to an entity previously introduced in the discourse is termed anaphoric, and the phenomenon is called anaphora.
Discourse model:
It is the model that holds the overall representation of the entities that have been referred to in the discourse text, along with the relationships between them.
As we have previously discussed, the NLP expression that performs the reference is termed
a referring expression. We have mainly five types of referring expressions in Natural
Language Processing. Let us discuss them one by one.
1. Indefinite Noun Phrases
An indefinite noun reference is a kind of reference that represents an entity that is new to the hearer in the discourse context. To understand the indefinite noun phrase, let us take an example.
For example:
In the sentence Rahul is doing some work., some work is an indefinite noun phrase.
2. Definite Noun Phrases
A definite noun reference is a kind of reference that represents the entity that is not new to
the discourse context's hearer. The discourse context's hearer can easily identify the
definite noun reference. To understand the definite noun phrase, let us take an example.
For example:
In the sentence Rahul loves reading the Times of India., the Times of India is a definite noun phrase.
3. Pronouns
Pronouns are a form of definite reference (they work the same way as we have learned in English grammar).
For example:
In the sentence Rahul learned as much as he could., he is the pronoun referring to the noun Rahul.
4. Demonstratives
Demonstratives also point to nouns, but they behave differently from simple pronouns.
For example, that, this, these, and those are demonstratives.
5. Names
Names can be the name of a person, location, organization, etc., and are the simplest form of referring expression.
For example, in the above examples, Rahul is the name referring expression.
To resolve the reference, we can use the two resolution tasks. Let us discuss them one by
one.
1. Co-reference Resolution
In the Co-reference Resolution, the main aim is to find the referring expression from the
provided text that refers to the same entity. In a discourse in NLP, Co-refer is a term used
for an entity if two or more expressions are referring to the same entity.
For example, Rahul and He are used for the same entity, i.e., Rahul.
The Co-reference Resolution can be simply termed as finding the relevant co-refer
expressions among the provided discourse text. Let us take an example for more clarity.
For example, Rahul went to the farm. He cooked food. In this example, Rahul and He are the referring expressions.
There are some constraints on Co-reference Resolution. Let us learn about them.
In the English language, we have many pronouns. If we are using the pronouns he and she,
then we can easily resolve it. But if we are using the pronoun it, the resolution can be tricky,
and if we have a set of co-referring expressions, then it becomes more complex to resolve it.
In simpler terms, if we are using the it pronoun, then the exact determination of the
referred noun is complex.
2. Pronominal Anaphora Resolution
In Pronominal Anaphora Resolution, the aim is to find the antecedent of a single pronoun.
For example, in the passage - Rahul went to the farm. He cooked food., Rahul is the
antecedent of the reference He.
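As a heavily simplified illustration of pronominal anaphora resolution, the recency-based sketch below links a pronoun to the most recently mentioned capitalized token; the function name resolve_pronouns and the pronoun list are our own assumptions, and real resolvers also check gender, number, and syntactic constraints.

# Naive pronominal anaphora resolution: link each pronoun to the most
# recently seen capitalized (name-like) token. Purely illustrative.
PRONOUNS = {"he", "she", "it", "him", "her", "his", "they", "them"}

def resolve_pronouns(tokens):
    """Return (pronoun_index, antecedent_index) pairs using simple recency."""
    links = []
    last_entity = None
    for i, tok in enumerate(tokens):
        if tok.lower() in PRONOUNS:
            if last_entity is not None:
                links.append((i, last_entity))
        elif tok[:1].isupper():
            last_entity = i  # crude: treat any capitalized token as a potential referent
    return links

tokens = "Rahul went to the farm . He cooked food .".split()
for pron, ante in resolve_pronouns(tokens):
    print(tokens[pron], "->", tokens[ante])   # He -> Rahul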
Language Modeling: Introduction, N-Gram Models, Language Model Evaluation, Parameter Estimation, Language Model Adaptation, Types of Language Models, Language-Specific Modeling Problems, Multilingual and Cross-Lingual Language Modeling.
Introduction
In the modern world, a vast array of languages helps us express our feelings and thoughts. This capacity is unique to the human species, as it is a way to express distinct ideas and customs within different cultures and societies.
o Humans also have the capacity to use complex language far more than any
other species on Earth.
AI systems that understand these languages and generate text are known as
language models and are the latest and trending software technology in this decade.
Language modeling is a crucial element in modern NLP applications and makes the
machines understand qualitative information. Each language model type, in one way
or another, turns the qualitative information generated by humans into quantitative
information, which in turn allows people to communicate with machines as they do
with each other to a limited extent.
Let us understand further about N-grams and language models and how they are built using
N-grams.
Language Modeling
Language modeling (LM) is the use of various statistical and probabilistic techniques to
determine the probability of a given sequence of words occurring in a sentence. Language
models assign the probabilities to a sentence or a sequence of words or the probability of an
upcoming word given a previous set of words.
Language models are useful for a vast number of NLP applications such as next word
prediction, machine translation, spelling correction, authorship Identification, and
natural language generation.
The central idea in language models is to use probability distributions over word
sequences that describe how often the sequence occurs as a sentence in some
domain of interest.
Language models estimate the likelihood of texts belonging to a language. The
sequences are divided into multiple elements, and the language model models the
probability of an element given the previous elements.
o The elements can be bytes, characters, subwords or tokens. Then, the
sequence likelihood is the product of the elements’ probabilities.
Language models are primarily of two kinds: N-Gram language models and Grammar-
based language models such as probabilistic context-free grammar.
Language models can also be classified into Statistical Language Models and Neural
language models.
Utility of generating language text using language modeling: Independently of any
application, we could use a language model as a random sentence generator where
we sample sentences according to their language model probability.
o There are very few real-world use cases where you want to actually generate
language randomly.
o But the understanding of how to do this and what happens when you do so
will allow us to do more interesting things later.
A statistical language model is simply a probability distribution over all possible sentences.
Statistical language models learn the probability of word occurrence based on examples of
text.
Neural Language Models use different kinds of approaches, such as feedforward neural networks, recurrent neural nets, attention-based networks, and transformer-based neural nets, to model the language, and they have surpassed statistical language models in their effectiveness.
Neural language models have many advantages over the statistical language models
as they can handle much longer histories and also can generalize better over contexts
of similar words and are more accurate at word prediction.
Neural net language models are also much more complex and slower and need
more energy to train and are less interpretable than statistical language models.
Hence for practical purposes where there are not a lot of computing power and
training data, and especially for smaller tasks, statistical language models like the n-
gram language model are the right tool.
On the other side, Large Language Models (LLMs) based on neural networks, in particular, represent the state of the art and gave rise to major advancements in NLP and AI.
o They hold the promise of transforming domains through learned knowledge and are slowly becoming ubiquitous in day-to-day text and speech use cases.
o LLM sizes have also been increasing roughly 10X every year for the last few years, and these models grow in complexity and size along with their capabilities, for example, becoming few-shot learners.
N-gram Models
N-gram models are a particular set of language models based on the statistical frequency of
groups of tokens.
An n-gram is an ordered group of n tokens. The bigrams of the sentence The cat eats
fish. are (The, cat), (cat, eats), (eats, fish) and (fish, .). The trigrams are (The, cat,
eats), (cat, eats, fish) and (eats, fish, .).
The smallest n-grams with n =1 are called unigrams. Unigrams are simply the tokens
appearing in the sentence.
The conditional probability that a certain token appears after the previous tokens is estimated by Maximum Likelihood Estimation on a set of training sequences.
N-grams - the central concept is that the next word depends on the previous n words.
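The bigram and trigram example above can be reproduced with a few lines of Python; this is a minimal sketch using only the standard library, with a deliberately naive tokenization that treats the final period as its own token.

# Extract unigrams, bigrams, and trigrams from the example sentence.
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["The", "cat", "eats", "fish", "."]
print(ngrams(tokens, 1))  # unigrams: ('The',), ('cat',), ('eats',), ('fish',), ('.',)
print(ngrams(tokens, 2))  # bigrams: ('The', 'cat'), ('cat', 'eats'), ('eats', 'fish'), ('fish', '.')
print(ngrams(tokens, 3))  # trigrams: ('The', 'cat', 'eats'), ('cat', 'eats', 'fish'), ('eats', 'fish', '.')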
Intuitive Formulation
The intuitive idea behind n-grams and n-gram models is that instead of computing the
probability of a word given its entire history, we can approximate the history by just the last
few words like humans do while understanding speech and text.
P(wn | w1 … wn−1) ≈ P(wn)                              (unigram)
P(wn | w1 … wn−1) ≈ P(wn | wn−1)                       (bigram)
P(wn | w1 … wn−1) ≈ P(wn | wn−2, wn−1)                 (trigram)
P(wn | w1 … wn−1) ≈ P(wn | wn−3, wn−2, wn−1)           (4-gram)
P(wn | w1 … wn−1) ≈ P(wn | wn−4, wn−3, wn−2, wn−1)     (5-gram)
Hence N-Gram models are also a simple class of language models (LM's) that assign
probabilities to sequences of words using shorter context.
Using these assigned probabilities, we can predict the chance of a word occurring in a given position, for example, the last word of an n-gram given the previous words.
N-gram models are the simplest and most common kind of language model.
The bigram model approximates the probability of a word given all the previous
words by using only the conditional probability of one preceding word.
So, the prediction for the next word is dependent on the previous word alone.
It is also one of the widely used models.
Probability Estimation
There are two main steps generally in building a machine learning model: Defining
the model and Estimating the model’s parameters, which is called the training or the
learning step. For language models, the definition depends on the model we choose.
There are also two quantities we need to estimate for developing the language
models for all words in the vocabulary of the language for which we are working
with:
o The Probability of observing a sequence of words from a language. For
example, Pr(Colorless green ideas sleep furiously)
This is the probability of a sentence or sequence Pr(w1,w2,…,wn)
o The Probability of observing a word having observed a sequence. For
example, Pr(furiously | Colorless green ideas)
This is the probability of the next word in a sequence Pr(wk+1∣w1,
…,wk)
o Pr(w1, w2, …, wn) is short for Pr(W1=w1, W2=w2, …, Wn=wn).
o The notation denotes that the random variable W1 takes on the value w1, and so on. E.g., Pr(I, love, fish) = Pr(W1=I, W2=love, W3=fish).
Typical assumptions made in probability estimation methods: Probability models
almost always make independence assumptions.
o Even though two variables, X and Y, are not actually independent, the model
may treat them as independent, which can drastically reduce the number of
parameters to be estimated.
o Models without independence assumptions have way too many parameters
to estimate reliably from the data we may have and are intractable.
o Sometimes, the independence assumptions may not be correct, and models relying on those incorrect assumptions may often be incorrect as well; for example, they may assign probability mass to events that cannot occur.
Probability estimation without independence assumption: If there are no
independence assumptions about the sequence, then one way to estimate is the
fraction of times we see it. Pr(w1, w2, …, wn) = #(w1, w2, …, wn) / N where N is the
total number of sequences.
o Estimating from frequency fractions is problematic because we would need to have seen the particular sentence many times to assign a good probability via Pr(w1, w2, …, wn) = #(w1, w2, …, wn) / N.
o Also, estimating from sparse observations is unreliable, and we won't have a
solution for new sequences.
Maximum Likelihood estimation: The most basic parameter estimation technique is
the relative frequency estimation (frequencies are counts) which is also called the
method of Maximum Likelihood Estimation (MLE).
o The estimation simply works by counting the number of times a word appears in a given context and then normalizing the counts into probabilities. We also need some source text corpora.
o Chain rule of probability in estimation: To estimate the probabilities, we usually rely on the Chain Rule of Probability, which decomposes the joint probability into a product of conditional probabilities; independence assumptions then shorten each conditional's history.
It is also to be kept in mind that estimating conditional probabilities
with long contexts is usually difficult, and for example, conditioning
on 4 or more words itself is very hard.
Markov assumption in probability estimation: The use of Markov assumption in
probability estimation solves a lot of problems we encounter with data sparsity and
conditional probability calculations.
o The assumption is that the probability of a word depends only on the
previous word(s). It is like saying the next event in a sequence depends only
on its immediate past context.
o Markov models are the class of probabilistic models that assume that we can
predict the probability of some future unit without looking too far into the
past.
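Putting the chain rule, the Markov assumption, and relative-frequency (MLE) estimation together, a bigram model can be sketched as follows; the tiny corpus, the <s> and </s> boundary markers, and the function names are illustrative assumptions.

# Bigram language model by Maximum Likelihood Estimation:
#   P(w_n | w_{n-1}) = count(w_{n-1}, w_n) / count(w_{n-1})
# Sentence probability uses the chain rule with a first-order Markov assumption.
from collections import Counter

corpus = [
    "<s> I love fish </s>".split(),
    "<s> I love food </s>".split(),
    "<s> He wants food </s>".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1])
                        for sent in corpus for i in range(len(sent) - 1))

def bigram_prob(prev, word):
    """MLE estimate of P(word | prev); 0.0 for unseen events (no smoothing)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(tokens):
    """P(w1..wn) approximated as a product of bigram probabilities."""
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(bigram_prob("I", "love"))                       # 1.0: 'love' always follows 'I' here
print(sentence_prob("<s> I love food </s>".split()))  # ~0.333 on this toy corpus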
Most language models currently in practice, especially statistical language models, are
extremely sensitive to changes in the style, topic, or genre of the text on which they are
trained.
Hence they are very sensitive to the training corpus on which they are trained.
For example, one is much better off using a very big corpus of words of transcripts
from telephone conversations for modeling casual phone conversations than using a
corpus even of millions of words of transcripts from TV and radio news broadcasts.
The effect is quite strong, even for changes that seem trivial to a human. For
example, a language model trained on a certain publication of news wire text will see
its perplexity doubled when applied to a very similar news publication of wire text
from the same time period.
Smoothing
Laplace smoothing merely adds the number one to each count (hence the alternate name
adds one smoothing). We also need to adjust the denominator to take into account the
extra observations since there is a fixed number of words in the vocabulary, and for unseen
words, there is an increment of one.
Laplace smoothing does not perform well enough to be used in modern n-gram
models but is a useful tool and introduction to most other concepts.
A related way to view Laplace smoothing is as discounting: lowering some non-zero counts in order to free up the probability mass that will be assigned to the zero counts.
Add-k smoothing: One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k. This algorithm is therefore called add-k smoothing.
o Add-k smoothing requires that we have a method for choosing k, and that can
be done by optimizing for k by trying different values on a holdout set.
o Add-k smoothing is useful for some tasks like text classification also, in
addition to probability estimation, but still doesn’t work well for language
modeling as it generates counts with poor variances and often inappropriate
discounts.
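A minimal sketch of add-one (Laplace) and add-k smoothed bigram probabilities; the toy corpus, the vocabulary size V, and the value of k are illustrative assumptions.

# Add-k smoothed bigram probability:
#   P(w_n | w_{n-1}) = (count(w_{n-1}, w_n) + k) / (count(w_{n-1}) + k * V)
# With k = 1 this is Laplace (add-one) smoothing; a fractional k moves less
# probability mass from seen to unseen events.
from collections import Counter

corpus = [
    "<s> I love fish </s>".split(),
    "<s> I love food </s>".split(),
]
unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
V = len(unigram_counts)  # vocabulary size

def smoothed_bigram_prob(prev, word, k=1.0):
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * V)

print(smoothed_bigram_prob("love", "fish"))           # seen bigram, add-one
print(smoothed_bigram_prob("love", "money"))          # unseen bigram still gets some mass
print(smoothed_bigram_prob("love", "money", k=0.05))  # add-k with a small fractional k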
Language Model Evaluation
Language models are very useful in a broad range of applications like speech recognition, machine translation, part-of-speech tagging, parsing, Optical Character Recognition (OCR), handwriting recognition, information retrieval, and many other daily tasks.
One of the main steps in the usage of language models is to evaluate the
performance beforehand and use them in further tasks.
This lets us build confidence in the handling of the language models in NLP and also
lets us know if there are any places where the model may behave
uncharacteristically.
In practice, we need to decide on the dataset to use, the method to evaluate, and also
select a metric to evaluate language models. Let us learn about each of the elements
further.
Evaluating a language model lets us know whether one language model is better
than another during experimentation and also to choose among already trained
models.
There are two ways to evaluate language models in NLP: Extrinsic
evaluation and Intrinsic evaluation.
o Intrinsic evaluation captures how well the model captures what it is supposed
to capture, like probabilities.
o Extrinsic evaluation (or task-based evaluation) captures how useful the model
is in a particular task.
Comparing among language models: We compare models by collecting a corpus of text that is common to the models being compared.
o We then divide the data into training and test sets and train the parameters
of both models on the training set.
o We then compare how well the two trained models fit the test set.
After we train the models, whichever model assigns a higher probability to the test set is generally considered to predict the test set more accurately and is hence the better model.
Among multiple probabilistic language models, the better model is the one that has a
tighter fit to the test data or that better predicts the details of the test data and
hence will assign a higher probability to the test data.
Extrinsic Evaluation
Extrinsic evaluation is the best way to evaluate the performance of a language model
by embedding it in an application and measuring how much the application improves.
Intrinsic Evaluation
We need to take advantage of intrinsic measures because running big language models in
NLP systems end-to-end is often very expensive, and it is easier to have a metric that can be
used to quickly evaluate potential improvements in a language model.
We also need a test set for an intrinsic evaluation of a language model in NLP.
The probabilities of an N-gram model come from the corpus it is trained on, called the training set or training corpus.
We can then measure the quality of an N-gram model by its performance on some
unseen test set data called the test set or test corpus.
We will also sometimes call test sets and other datasets that are not in our training
sets held out corpora because we hold them out from the training data.
Good scores during intrinsic evaluation do not always mean better scores during extrinsic
evaluation, so we need both types of evaluation in practice.
Perplexity
Perplexity is a very common method to evaluate the language model on some held-out
data. It is a measure of how well a probability model predicts a sample.
The Intuition
The basic intuition is that the lower the perplexity measure is, the better the language model is at modeling unseen sentences.
Perplexity can also be seen as a simple monotonic function of entropy. But
perplexity is often used instead of entropy due to the fact that it is arguably more
intuitive to our human minds than entropy.
Calculating Perplexity
PP(W) = P(w1 w2 … wN)^(−1/N), i.e., the N-th root of 1 / P(w1 w2 … wN).
o We know that if a language model assigns a high probability to unseen sentences from the test set, then such a language model is more accurate.
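The formula above can be applied directly once a model has assigned a probability to each test token; the sketch below uses made-up per-token probabilities purely for illustration and computes perplexity in log space to avoid underflow.

# Perplexity as the inverse probability of the test sequence,
# normalized by its length: PP(W) = P(w1..wN) ** (-1/N).
import math

def perplexity(token_probs):
    """token_probs: probability the model assigned to each token of the test set."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# Hypothetical per-token probabilities from two different models on the same test sentence.
better_model = [0.2, 0.5, 0.4, 0.6]
worse_model = [0.05, 0.1, 0.02, 0.1]
print(perplexity(better_model))  # lower perplexity -> tighter fit to the test data
print(perplexity(worse_model))   # higher perplexity -> worse fit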
Interpreting Perplexity
Perplexity intuitively provides a more human way of thinking about the random
variable’s uncertainty. The reasoning is that the perplexity of a uniform discrete
random variable with K outcomes is K.
o Example: The perplexity of a fair coin is two and the perplexity of a fair six-
sided die is six.
o This kind of framework provides a frame of reference for interpreting a
perplexity value.
Simple framework to interpret perplexity: If the perplexity of some random variable
X is 10, our uncertainty towards the outcome of X is equal to the uncertainty we
would feel towards a 10-sided die, helping us intuit the uncertainty more deeply.
Entropy
Entropy is a metric that has been used to quantify the randomness of a process in many
fields and compare worldwide languages, specifically in computational linguistics.
Definition for Entropy: The entropy (also called self-information) of a random variable is
the average level of the information, surprise, or uncertainty inherent to the single
variable's possible outcomes.
The more certain or deterministic an event is, the less information it contains. In a nutshell, the more uncertainty (entropy) there is in a process, the more information each outcome carries.
The entropy of a discrete distribution p(x) over the event space X is given by H(p) = −Σ x∈X p(x) log p(x).
o H(X) >= 0; H(X) = 0 only when the value of X is fully determined, hence providing no new information.
o The smallest possible entropy for any distribution is zero.
o We also know that the entropy of a probability distribution is maximized
when it is uniform.
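The fair-coin and fair-die examples mentioned under perplexity can be checked directly with a short sketch that computes H(p) in bits and the corresponding perplexity 2^H; the distributions are illustrative.

# Entropy of a discrete distribution: H(p) = -sum_x p(x) * log2(p(x)).
# The perplexity of the distribution is 2 ** H(p).
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
fair_die = [1 / 6] * 6
biased_coin = [0.9, 0.1]

print(entropy(fair_coin), 2 ** entropy(fair_coin))      # 1.0 bit, perplexity 2
print(entropy(fair_die), 2 ** entropy(fair_die))        # ~2.585 bits, perplexity ~6
print(entropy(biased_coin), 2 ** entropy(biased_coin))  # lower entropy: more predictable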
Entropy in Different Fields of NLP & AI
From a language and probability-theory perspective in NLP, entropy can also be defined as a statistical parameter that measures how much information is produced, on average, for each letter of a text in the language.
o If the language is translated into binary digits (0 or 1) in the most efficient
way, the entropy H is the average number of binary digits required per letter
of the original language.
From a machine learning perspective, entropy is a measure of uncertainty, and the
objective of the machine learning model is to minimize uncertainty.
o Decision tree learning algorithms use relative entropy to determine the
decision rules that govern the data at each node.
o Classification algorithms in machine learning like logistic regression or
artificial neural networks often employ a standard loss function called cross
entropy loss that minimizes the average cross entropy between ground truth
and predicted distributions.
Cross Entropy
Due to the fact that we can not access an infinite amount of text in the language, and the
true distribution of the language is unknown, we define a more useful and usable
metric called Cross Entropy.
Intuition for Cross entropy: It is often used to measure the closeness of two distributions, where one is the distribution Q that the language model learns from the sample text, aiming to approximate as closely as possible the other, the empirical distribution P of the language.
o Mathematically, cross-entropy is defined as H(P, Q) = EP[−log Q], which can also be written as H(P, Q) = H(P) + DKL(P||Q).
H(P) is the entropy of P and DKL(P||Q) is the Kullback–Leibler (KL) divergence of Q from P. It is also known as the relative entropy of P with respect to Q.
From the formulation, we can see that the cross entropy of Q with respect to P is the sum of two terms, entropy and relative entropy:
o H(P), the entropy of P, is the average number of bits needed to encode any possible outcome of P.
o DKL(P||Q), the number of extra bits required to encode any possible outcome of P using a code optimized for Q.
The empirical entropy H(P) cannot be optimized, so when we train a language model with the objective of minimizing the cross-entropy loss, the true objective is to minimize the KL divergence between the distribution learned by our language model and the empirical distribution of the language.
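A short numeric check of the decomposition H(P, Q) = H(P) + DKL(P||Q) on two small made-up distributions (P standing in for the empirical distribution, Q for the model's learned distribution):

# Cross entropy: H(P, Q) = -sum_x P(x) * log2(Q(x))
# KL divergence: DKL(P||Q) = sum_x P(x) * log2(P(x) / Q(x))
# The identity H(P, Q) = H(P) + DKL(P||Q) should hold exactly.
import math

P = [0.7, 0.2, 0.1]  # illustrative "empirical" distribution
Q = [0.5, 0.3, 0.2]  # illustrative distribution learned by a model

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(cross_entropy(P, Q))               # H(P, Q)
print(entropy(P) + kl_divergence(P, Q))  # H(P) + DKL(P||Q): the same value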
Tokenizers in Language Models: Tokenization is the first and important step in any
NLP pipeline, especially for language models which break unstructured data and
natural language text into chunks of information that can be considered as discrete
elements.
o The token occurrences in a document can be used directly as
a vector representing that document.
o The goal when crafting the vocabulary with tokenizers is to do it in such a way
that the tokenizer tokenizes as few words as possible into the unknown token.
Issue with unknown vocabulary/tokens: The general approach in most tokenizers is
to encode the rare words in your dataset using a special token UNK by convention so
that any new out-of-vocabulary word would be labeled as belonging to the rare word
category.
o We expect the model to learn how to deal with the other words from the
custom UNK token.
o It is also generally a bad sign if we see the tokenizer producing a lot of these unknown tokens, as it means the tokenizer was not able to retrieve a sensible representation of a word and we are losing information along the way.
Methods to handle unknown tokens / OOV (out of vocabulary): Character-level embeddings and sub-word tokenization are some effective ways to handle unknown tokens.
o Under sub-word tokenization, WordPiece and BPE are de facto
methods employed by successful language models such as BERT and GPT, etc.
Character level embeddings: Character and subword embeddings are introduced as
an attempt to limit the size of embedding matrices such as in BERT but they have
the advantage of being able to handle new slang words, misspellings, and OOV
words.
o The required embedding matrix is much smaller than what is required for
word-level embeddings. Generally, the vectors represent each character in
any language
o Example: Instead of a single vector for "king" like in word embeddings, there
would be a separate vector for each of the letters "k", "i", "n", and "g".
o Character embeddings do not encode the same type of information that word
embeddings contain and can be thought of as encoding lexical information
and may be used to enhance or enrich word-level embeddings.
o Character-level embeddings are also generally shallow in meaning, but given the character embeddings, a vector can be formed for every single word, even an out-of-vocabulary one.
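As a toy illustration of how a vector for an out-of-vocabulary word could be assembled from character embeddings, the sketch below simply averages fixed random character vectors; the dimensionality and the averaging scheme are assumptions for illustration, while real systems typically run a CNN or RNN over the character sequence.

# Build a word vector for any string, including OOV words and misspellings,
# by averaging random (untrained) character embeddings. Purely illustrative.
import random

DIM = 8
random.seed(0)
char_vectors = {c: [random.uniform(-1, 1) for _ in range(DIM)]
                for c in "abcdefghijklmnopqrstuvwxyz"}

def word_vector(word):
    vectors = [char_vectors[c] for c in word.lower() if c in char_vectors]
    if not vectors:
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]

print(word_vector("king"))   # a known word
print(word_vector("kiing"))  # a misspelling still receives a (similar) vector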
Subword tokenization: Subword tokenization allows the model to have a reasonable
vocabulary size while being able to learn meaningful context-independent
representations and also enables the model to process words it has never seen
before by decomposing them into known subwords.
o Example: The word refactoring can be split into re, factor, and ing.
Subwords re, factor, and ing occur more frequently than the word refactoring,
and their overall meaning is also kept intact.
Byte-Pair Encoding (BPE): BPE was initially developed as an algorithm to compress
texts and then used by OpenAI for tokenization when pretraining the GPT model.
o It is used by a lot of Transformer models like GPT, GPT-2, RoBERTa, BART, and
DeBERTa.
o BPE strikes a balance between character-level and word-level representations, which makes it capable of managing large corpora.
o This kind of behavior also enables the encoding of any rare words in the
vocabulary with appropriate subword tokens without introducing any
“unknown” tokens.
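The training side of BPE can be sketched compactly: repeatedly count adjacent symbol pairs over the corpus vocabulary and merge the most frequent pair. The toy word frequencies, the number of merges, and the end-of-word marker </w> below are illustrative choices, not the exact procedure of any particular model.

# Byte-Pair Encoding (training): repeatedly merge the most frequent adjacent
# symbol pair. Toy corpus and 10 merges chosen purely for illustration.
from collections import Counter

word_freqs = {"refactoring": 3, "factoring": 2, "refactor": 2, "ring": 1}
# Represent each word as a sequence of characters plus an end-of-word marker.
vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}

def pair_counts(vocab):
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    new_vocab = {}
    for symbols, freq in vocab.items():
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[tuple(merged)] = freq
    return new_vocab

merges = []
for _ in range(10):
    pairs = pair_counts(vocab)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)       # learned merge rules, applied in order at tokenization time
print(list(vocab))  # words now represented as sequences of learned subword units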
Multilingual and Cross-Lingual Language Modeling
Although studies in the field of NLP, or Natural Language Processing, have been carried out for years, most studies in the field have focused on English. The vast majority of sentences that machines could understand, or rather perceive and encode, were in English. Breaking this English-dominated orientation and enabling machines to perceive and encode almost every language that exists, in a global manner, is called Cross-Lingual Language study.
Cross-Lingual NLP is a very difficult and complex process. The reason for this complexity and difficulty lies in the fundamental differences between languages. Each of the more than 5,000 languages spoken around the world has its own rules and structures, so machines need to be trained to recognize these languages and make sense of their forms. For this reason, although there are language recognition systems that work quite well across different languages today, the process is still evolving, and this takes time.
As the world we live in globalizes, technologies that can only analyze English are far from
reality. For important tasks to be performed by computers, machines need to understand
many languages and be able to scan documents in these languages. This is especially
important in the finance and education sectors.
There are many different challenges in the development of Multi-Language NLP. For this
reason, many different approaches have been adopted over time for the development of
technology. We will discuss some of these approaches and models in detail.
One of the less successful models or approaches is to write algorithms that are trained on
each language separately. This proved to be very costly and time-consuming. Given the low
success rate, this approach has fallen behind.
There are some examples of this approach, for example, there are some models that are
trained in German only. The downside of this endeavor is that it is extremely difficult. Many
companies need models that can recognize several languages at once, and training models
for each language individually would cost millions of dollars and take months.
NLP is based on analyzing a lot of data. Machines can analyze huge amounts of text at the same time and recognize languages and their vector representations. We will discuss some of the multilingual NLP models in detail.
We can take a brief look at how Non-English pipelines are formed to better understand
these models. First, large amounts of data need to be collected. This data will then be
labeled. A large amount of cleaning may be required within the data. The more resources
available in the language, the easier it will be to prepare these pipelines. In this way, models
can be trained more accurately and faster.
It is very important to mention Multilingual NLP Tasks. There is still a digital perception that
English is the language that everyone knows, that it is innate, and that it is used worldwide.
This idea leads to social inequalities and these social inequalities are reflected in the future,
in the technological world.
First of all, when machines can analyze a language, they encode and decode not only the linguistic structure but also the culture to which this language is connected. Therefore, this process, which appears to be purely technological, becomes cultural and social to a great extent. Problems such as racism, sexism, and discrimination in algorithms stem from such one-sided approaches.
On the other hand, the world is far from being a place where nations live in isolation; it is very global. The whole world interacts with each other. This increases the need for multilingual NLP, and the technology is actively used in many different fields.
Many different models make this possible. Different approaches and diversity in the field
increase the chances of success. But some models stand out because they are more popular
or more successful than others. The most important factor in the popularity of these models
is their ease of use. Models that can be developed without the need for too much time and
financial resources are of course preferred by many people. Let's take a brief look at some
of these popular models.
mBERT
· Masked Language Modeling (MLM): Here, the model randomly masks 15% of the words in the sentence or text, and the model has to guess these masked words. This distinguishes it from approaches such as RNNs, which process the words one after the other.
· Next Sentence Prediction (NSP): Here the model combines two masked sentences as
input. These sentences may or may not be ordered in the text. The model then needs to
predict whether these sentences follow one after the other.
Through them, the model gains insight into how languages work. It can perceive languages
without human supervision.
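If the Hugging Face transformers library and the bert-base-multilingual-cased checkpoint are available (an assumption on our part, not something the text above prescribes), masked word prediction with mBERT can be tried in a few lines:

# Masked language modeling with mBERT via the fill-mask pipeline.
# Assumes: pip install transformers torch; the checkpoint downloads on first use.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-multilingual-cased")

# mBERT uses [MASK] as its mask token; the pipeline returns candidate fillers with scores.
for prediction in unmasker("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))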
XLM
XLM uses two different pre-training methods, one unsupervised and one supervised. The supervised objective uses pairs of parallel sentences, where one sentence is in the source language and the other is in the target language. XLM is released with many different pre-trained checkpoints covering these setups.
MultiFiT
MultiFiT is a model that works differently than the other two models. It is based on the tokenization of subwords, not words, and uses QRNN models.
Let's briefly explain subword tokenization. Morphology is the study of the structure, formation, and inflection of words. Therefore, working only on whole words does not give accurate results in morphologically rich languages.
The tokenization of these subwords allows the machines to detect words that are not very
common.