
Unit-5

What Is Language Modeling?

A language model is a statistical or machine learning model that learns to predict the next
word (or character) in a sentence, given the previous words. It's a fundamental concept in
natural language processing (NLP).

Example:

If the input is:

"The cat sat on the"

A language model might predict:

"mat" as the next word.

Why Is Language Modeling Important?

Language modeling is the backbone of many NLP applications like:

 Text generation (e.g., ChatGPT!)


 Speech recognition
 Machine translation
 Spelling and grammar correction
 Autocomplete features in your phone or email

Types of Language Models

1. Statistical Models:
o n-gram models: Predict the next word from the previous n−1 words.
 Simple, fast, but limited by fixed-length memory.
2. Neural Network-Based Models:
o RNNs (Recurrent Neural Networks): Handle sequences and keep memory.
o LSTMs/GRUs: Improved RNNs for longer dependencies.
o Transformers (like GPT, BERT): Use attention mechanisms to model
context more effectively and in parallel.

The main goal is to estimate the probability of a sequence of words:

P(w_1, w_2, w_3, ..., w_n)

Often decomposed using the chain rule:

P(w_1, w_2, ..., w_n) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_1, w_2) · ... · P(w_n | w_1, ..., w_{n-1})

What Is an N-Gram Model in NLP?

An n-gram model is a type of language model used in Natural Language Processing (NLP) to predict
the next word in a sequence, based on the previous ones.

It uses the idea that:

"The probability of a word depends on a limited context — the previous n−1 words."

So instead of analyzing an entire sentence at once, it focuses on short chunks.

📏 What Does “N-Gram” Mean?

An n-gram is a sequence of n words:

n    Name      Example
1    Unigram   "the"
2    Bigram    "the cat"
3    Trigram   "the cat sat"
4    4-gram    "the cat sat on"

How Does an N-Gram Model Work?

The idea is to predict the next word based only on the previous (n-1) words.

So instead of:

P(w_4 | w_1, w_2, w_3)

…you approximate it as:

P(w_4 | w_2, w_3)

(for a trigram model, where n=3).

Probability Estimation

For a trigram model, the probability of a word given the previous two is estimated as:
P(w_n | w_{n-2}, w_{n-1}) = Count(w_{n-2}, w_{n-1}, w_n) / Count(w_{n-2}, w_{n-1})

Example: If "on the mat" appears 30 times and "on the" appears 100 times:

P("mat"∣"on the")=30100=0.3P(\text{"mat"} | \text{"on the"}) = \frac{30}{100} =


0.3P("mat"∣"on the")=10030=0.3

Challenges with N-Gram Models

 Data sparsity: Rare combinations won’t be seen often.


 Memory-based: Can’t capture long-term dependencies well.
 Zero probabilities: If a sequence isn’t in training data, probability becomes zero.

What Is Language Model Evaluation?

In NLP, evaluating a language model means testing how accurately it:

 Predicts the next word


 Generates fluent and meaningful text
 Understands and uses language in context

There are quantitative (automatic) metrics and qualitative (human) evaluations.

📏 Common Evaluation Metrics

1. Perplexity (most common for predictive models)

 Measures how “surprised” the model is by the actual data.


 Lower perplexity = better model
 Formula:

Perplexity(P) = 2^( −(1/N) · Σ_{i=1}^{N} log₂ P(w_i | w_1, ..., w_{i−1}) )

 Used heavily for evaluating n-gram models, RNNs, and LMs like GPT.
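
As a sketch, the formula above can be computed directly from the per-token probabilities a model assigns to a held-out text (the probabilities below are invented for illustration):

import math

def perplexity(token_probs):
    """Perplexity = 2^(-(1/N) * sum(log2 p_i)) over the N predicted tokens."""
    n = len(token_probs)
    log_sum = sum(math.log2(p) for p in token_probs)
    return 2 ** (-log_sum / n)

# Hypothetical probabilities a model assigned to each word of a test sentence.
probs = [0.2, 0.1, 0.4, 0.25, 0.05]
print(perplexity(probs))   # lower is better; a uniform model over V words has perplexity V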

2. Accuracy / Precision / Recall

 For masked language models (like BERT):


o Accuracy = % of correctly predicted masked tokens.
3. BLEU / ROUGE / METEOR

 Used for tasks like machine translation and summarization.


 Compare generated text to human-written reference text.

Metric    Used For        How It Works
BLEU      Translation     Matches n-grams with the reference
ROUGE     Summarization   Measures recall of overlapping words
METEOR    Translation     Considers synonyms, stemming, word order
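
A minimal sketch of computing sentence-level BLEU with NLTK (assuming nltk is installed; the reference and candidate sentences are invented):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat is on the mat".split()]   # list of tokenized reference sentences
candidate = "the cat sat on the mat".split()    # tokenized system output

# Smoothing avoids zero scores when some higher-order n-grams have no match.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")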

4. Human Evaluation

 Ask humans to rate:


o Fluency
o Coherence
o Relevance
o Creativity or naturalness
 Often considered the gold standard but expensive and time-consuming.

5. Task-Specific Metrics

 For dialogue: F1 score, distinct-n (diversity), or engagement rating


 For code generation: Exact match or code execution accuracy
What Is Bayesian Parameter Estimation?

Bayesian parameter estimation is a statistical approach where parameters are treated as random
variables. Instead of estimating a single best value (like in frequentist methods), we compute a
probability distribution over possible values.

In NLP, this helps with:

 Handling data sparsity


 Improving generalization
 Incorporating prior knowledge

📏 Key Idea: Bayes’ Theorem

Bayes' Theorem is the foundation:

P(θ | D) = P(D | θ) · P(θ) / P(D)

Where:

 θ: Parameters (e.g., word probabilities)
 D: Observed data (e.g., a corpus)
 P(θ): Prior belief about the parameters
 P(D | θ): Likelihood of the data given the parameters
 P(θ | D): Posterior, the updated belief after seeing the data

✍️ Example in NLP: Word Probability Estimation

Let’s say we want to estimate the probability of the word “cat” occurring.

Frequentist (Maximum Likelihood):

P("cat")=Count("cat")Total wordsP(\text{"cat"}) = \frac{\text{Count("cat")}}{\text{Total


words}}P("cat")=Total wordsCount("cat")

If “cat” doesn’t appear, this is 0 — not ideal.

✅ Bayesian (Smoothed Estimate):

Use a Dirichlet prior (common in NLP) to smooth the estimate:


P("cat")=Count("cat")+αN+α⋅VP(\text{"cat"}) = \frac{\text{Count("cat")} + \alpha}{N + \alpha \cdot
V}P("cat")=N+α⋅VCount("cat")+α

Where:

 α: Smoothing factor (from the prior)
 N: Total word count
 V: Vocabulary size

This is additive (Laplace) smoothing, which is exactly the Bayesian posterior mean under a symmetric Dirichlet prior.
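
A minimal sketch of the two estimates side by side, using toy counts and an assumed vocabulary size:

from collections import Counter

corpus = "the cat sat on the mat".split()
counts = Counter(corpus)
N = len(corpus)               # total word count
V = 10_000                    # assumed vocabulary size
alpha = 1.0                   # pseudo-count from the prior (Laplace when alpha = 1)

def p_mle(word):
    return counts[word] / N                           # 0 for unseen words

def p_bayes(word):
    return (counts[word] + alpha) / (N + alpha * V)   # never exactly 0

print(p_mle("dog"), p_bayes("dog"))    # 0.0 versus a small positive probability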

📏 Applications in NLP

1. Text Classification

 Naive Bayes uses Bayesian estimation to calculate:

P(word | class) ∝ (Count(word, class) + α) / (Total words in class + α · V)
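
This α is exactly the alpha parameter of scikit-learn's MultinomialNB; a minimal sketch on a toy dataset (assuming scikit-learn is installed):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs   = ["the cat sat on the mat", "dogs are great pets",
          "stock prices fell today", "the market rallied strongly"]
labels = ["animals", "animals", "finance", "finance"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

clf = MultinomialNB(alpha=1.0)        # alpha is the additive-smoothing pseudo-count
clf.fit(X, labels)
print(clf.predict(vec.transform(["the dog sat quietly"])))   # likely ['animals']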

2. Topic Modeling

 Latent Dirichlet Allocation (LDA) is a fully Bayesian model using priors over topics and
words.

3. Language Modeling

 Smoothing of n-gram probabilities (like Kneser-Ney) can be interpreted as a Bayesian approach.

✅Benefits in NLP

 Handles rare/unseen words better


 Prevents overfitting in low-data scenarios
 Allows incorporation of expert knowledge or previous models

📏 Bonus: Variational Inference & MCMC

In complex models like LDA or Bayesian neural networks:

 Use Variational Inference to approximate the posterior


 Or MCMC (Markov Chain Monte Carlo) to sample from it


📏 Where It’s Used in NLP

1. Text prediction (autocomplete on phones, search engines)


2. Speech recognition (choosing the most likely sentence from audio)
3. Spell check and correction (based on context)
4. Machine translation (evaluating how natural a translation sounds)

Problems with N-Grams

 Sparsity: Many n-grams won’t appear in the training data → 0 probability


 Memory hungry: Higher n = exponentially more combinations
 No understanding of meaning (they're based purely on frequency)
 Limited context: Can’t remember information more than n words back

Solutions & Evolution

 Smoothing techniques: Handle unseen words (like Laplace or Kneser-Ney)


 Backoff/Interpolation: Fall back to smaller n-grams if needed
 Neural language models: Replaced traditional n-gram models for better context
understanding (e.g., LSTMs, Transformers like GPT)
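
Backoff and interpolation can be sketched in a few lines; here is a minimal linear-interpolation example (toy corpus, illustrative weights) that mixes trigram, bigram, and unigram estimates so unseen trigrams no longer get zero probability:

from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
uni = Counter(corpus)
bi  = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def interp_prob(w1, w2, w3, l3=0.6, l2=0.3, l1=0.1):
    """Interpolated P(w3 | w1, w2): a weighted mix of trigram, bigram, and unigram MLEs."""
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p1 = uni[w3] / N
    return l3 * p3 + l2 * p2 + l1 * p1

print(interp_prob("on", "the", "mat"))   # nonzero even when the exact trigram is unseen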

1. Class-Based Language Models

In class-based language models, we group words into predefined classes (e.g., nouns, verbs,
adjectives) and then model the probability of a word depending on its class rather than the word
itself. This can help generalize better by reducing the vocabulary size and making the model more
robust, especially when data is sparse.

For example:

 Instead of modeling the probability of each word directly (e.g., P("cat")), the model might predict the class of a word first (e.g., P("Noun")), and then predict the word given that class (e.g., P("cat" | "Noun")).

This approach helps when:

 Smoothing the probabilities: Class-based models reduce the dimensionality of the problem.
 Handling sparse data: By grouping words, we reduce the risk of encountering zero
probabilities for rare words.
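
A minimal sketch of this class-then-word factorization, with hand-assigned classes and illustrative probabilities (in practice both tables would be estimated from counts over a tagged or clustered corpus):

# P(w_i | w_{i-1}) ≈ P(class(w_i) | class(w_{i-1})) * P(w_i | class(w_i))
word_class = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN",
              "sat": "VERB", "ran": "VERB"}

# Class-transition and word-emission probabilities (illustrative values).
p_class = {("DET", "NOUN"): 0.8, ("NOUN", "VERB"): 0.7}
p_word_given_class = {"the": 0.6, "a": 0.4, "cat": 0.3, "dog": 0.2, "sat": 0.4, "ran": 0.1}

def class_bigram_prob(prev_word, word):
    """Class-based bigram estimate of P(word | prev_word)."""
    c_prev, c_cur = word_class[prev_word], word_class[word]
    return p_class.get((c_prev, c_cur), 0.0) * p_word_given_class.get(word, 0.0)

print(class_bigram_prob("the", "cat"))   # 0.8 * 0.3 = 0.24
print(class_bigram_prob("the", "dog"))   # same class transition, different emission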

2. Variable-Length N-Gram Models

In variable-length n-gram models, we don't limit ourselves to a fixed window size (like 2-grams or 3-
grams), but instead, the model dynamically adjusts the length of the context it uses to predict the
next word.

 Traditional n-gram models have a fixed n (e.g., bigram, trigram) — predicting a word based
on the previous two or three words.
 Variable-length models can adapt based on the data. For example, a model might use a 3-
gram in one case, but fall back to a bigram or unigram if necessary, allowing more flexibility
in context.

This can help in:

 Capturing long-term dependencies more effectively.


 Dealing with sparsity by allowing the model to “back off” to simpler models if necessary.

3. Bayesian Topic Models (e.g., LDA)

Bayesian topic models like Latent Dirichlet Allocation (LDA) are used to model topics in a collection
of documents. These models assume that each document is a mixture of topics, and each topic is a
distribution over words.

 Bayesian here means that we estimate the posterior distribution of topics and words given
the observed data (documents). We use a prior distribution to encode our beliefs about the
topics before seeing the data, and then update those beliefs after observing the data.

In LDA, the generative process looks like this:

1. Choose a topic distribution for each document (from a Dirichlet prior).


2. For each word in the document, choose a topic and then a word from that topic (using
multinomial distributions).

The Bayesian inference process helps estimate the posterior distribution of topics and words. Since
LDA is a probabilistic model, it helps account for uncertainty in the data.
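
As a hedged illustration, scikit-learn's LatentDirichletAllocation exposes these Dirichlet priors as hyperparameters; a minimal sketch on a toy corpus (assuming scikit-learn is installed):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat chased the mouse", "dogs and cats are pets",
        "stocks rose as markets rallied", "investors sold shares today"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# doc_topic_prior and topic_word_prior are the Dirichlet hyperparameters (alpha, eta).
lda = LatentDirichletAllocation(n_components=2, doc_topic_prior=0.5,
                                topic_word_prior=0.01, random_state=0)
doc_topics = lda.fit_transform(X)      # per-document topic mixtures
print(doc_topics.round(2))
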
4. Combining These Concepts: A Bayesian Class-Based Variable-Length
Topic Model

Now, let's bring it all together in a model that combines:

 Class-based modeling (using word classes),


 Variable-length context (adjusting the length of the context dynamically),
 Bayesian inference (for topic modeling).

Imagine a Bayesian class-based variable-length topic model where:

 Each word belongs to a class (e.g., noun, verb, adjective).


 The model predicts a topic based on the word's class and a variable-length context of the
surrounding words.
 Bayesian methods are used to estimate the topics, class distributions, and the word-topic
associations.

This model would:

1. Assign topics based on a Bayesian approach (like LDA) to capture uncertainty and
distribution of topics.
2. Use classes to group words and generalize across similar types of words.
3. Adapt the length of the context based on the structure of the sentence, so the model can
flexibly adjust and capture dependencies at different levels.

Such a model would have advantages over simpler models:

 Improved generalization due to class-based assumptions.


 Better handling of long-range dependencies with the variable-length context.
 Improved uncertainty quantification using Bayesian methods.

Practical Example: Using LDA with Class-Based Word Groups

To give a more practical perspective, imagine applying LDA to a text corpus where words are first
grouped by their part of speech (e.g., nouns, verbs, adjectives). You could then use a Bayesian
approach to infer the topic distributions across these classes, giving each document a mixture of
topics that might be biased towards certain word classes.

In this setup:

 Class-based priors could bias topics towards certain categories (e.g., a topic related to
“sports” might have more verbs and nouns related to physical activities).
 Variable-length contexts could be used in the inference process to better capture sentence
structure or discourse-level patterns.
📏 Summary:

 Class-based language models use predefined categories of words (like nouns or verbs) to
generalize better.
 Variable-length models dynamically adjust the context they use for predictions.
 Bayesian methods allow us to infer topics, word distributions, and relationships while
managing uncertainty in the data.

This combination of techniques provides a robust, flexible approach to modeling language and
topics, improving generalization, handling sparse data, and capturing long-range dependencies.

Multilingual language modeling refers to training a language model that can handle multiple
languages simultaneously. Rather than creating separate models for each language, multilingual
models aim to learn shared representations that can be used across languages.

Key Points:

 The model is trained on text from multiple languages and learns to generalize across them.
 It can handle multiple tasks (e.g., text classification, translation, question answering) for
different languages.
 Multilingual models are typically trained on large, diverse corpora that include text from
various languages.

📊 Popular Multilingual Models:

1. mBERT (Multilingual BERT):


o A version of BERT trained on 104 languages.
o Uses WordPiece tokenization and learns shared word representations across all
these languages.
o Bidirectional: it can understand both left and right context in a sentence.

Example: You can use mBERT for tasks like sentiment analysis in English, German, or French,
and it will leverage the shared knowledge learned from the multiple languages.
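
A hedged sketch of using mBERT for masked-word prediction via the Hugging Face transformers pipeline (assuming transformers is installed; bert-base-multilingual-cased is the public mBERT checkpoint):

from transformers import pipeline

# Masked-word prediction with multilingual BERT; one model handles many languages.
fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

print(fill("The cat sat on the [MASK].")[0]["token_str"])        # English input
print(fill("Le chat est assis sur le [MASK].")[0]["token_str"])  # French input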

2. XLM (Cross-lingual Language Model):


o XLM-R (XLM-RoBERTa) is a variant trained on 100+ languages.
o Built on RoBERTa's architecture and trained with a masked language model
objective.
o Uses unsupervised learning on large multilingual corpora for cross-lingual transfer
tasks.

Example: You can use XLM-R for translation, classification, and other multilingual tasks, even
when the training data for the target language is limited.

3. T5 (Text-to-Text Transfer Transformer):


o A model that frames all NLP tasks as a text-to-text problem (e.g., translation,
summarization).
o Multilingual T5 (mT5) is a variant of T5 that handles multiple languages by training
on a massive multilingual corpus.

📊 How Does Multilingual Language Modeling Work?

1. Shared Embedding Space:

 All languages share a common embedding space for words or subwords. This means that
similar words in different languages (like "house" in English and "maison" in French) will
have similar representations in the model's internal space.
 The model can map linguistic structures from different languages to this shared space,
learning patterns that can be transferred across languages.
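
As a hedged illustration of this shared space, a multilingual sentence encoder can be used to compare words across languages (assuming the sentence-transformers library; the checkpoint named below is one publicly available multilingual model, used here only as an example):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

emb = model.encode(["house", "maison", "banana"])
print(util.cos_sim(emb[0], emb[1]))   # "house" vs "maison": typically higher similarity
print(util.cos_sim(emb[0], emb[2]))   # "house" vs "banana": typically lower similarity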

2. Training on Multilingual Corpora:

 The model is trained on parallel corpora (texts in multiple languages) or a combination of monolingual data from each language.
 The objective is often to predict the next word, sentence, or masked token, but in multiple
languages simultaneously.

3. Multilingual Pretraining:

 The model is pretrained using a language-agnostic objective that encourages it to learn general representations of words, sentences, or documents across languages.
 Cross-lingual transfer: The model learns knowledge that can be transferred from one
language to another. For example, the model might learn grammar rules, word order, or
syntax structures that are common between languages.

📏 Cross-Lingual Language Modeling

Cross-lingual language modeling focuses on transferring knowledge from one language (the source
language) to another language (the target language). Cross-lingual models are particularly useful
when training data is available in the source language but limited or nonexistent in the target
language.

Key Concepts:

1. Zero-shot learning: Using a model trained on one language to perform a task in a different,
unseen language without additional training data in that language.
2. Cross-lingual transfer learning: Leveraging labeled data from one language to improve
performance in a different language.
3. Multilingual Pretrained Models: Models like mBERT, XLM-R, and mT5 perform well on
cross-lingual tasks, even without explicit training on the target language.
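
A hedged sketch of zero-shot cross-lingual classification with the transformers pipeline; the exact checkpoint name below is an assumption, and any XLM-R model fine-tuned on multilingual NLI data would work similarly:

from transformers import pipeline

clf = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")

result = clf("Das Spiel gestern Abend war unglaublich spannend.",
             candidate_labels=["sports", "politics", "technology"])
print(result["labels"][0])   # typically "sports", despite the German input
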
Cross-Lingual Applications in NLP

1. Machine Translation:
o Multilingual models can support translation between languages even when no parallel corpus exists for a particular language pair.
o For instance, the shared representations in mBERT or XLM-R can support transfer between English and Swahili, even if no direct translation dataset exists.
2. Text Classification:
o Cross-lingual text classification allows you to classify text in one language based on
the knowledge learned from another language.
o For example, you can train a model to classify news articles in English and then apply
it to classify articles in Spanish, even if you haven’t trained on any Spanish text.
3. Named Entity Recognition (NER):
o Cross-lingual NER uses multilingual models to recognize entities (such as names,
organizations, or locations) in multiple languages.
o mBERT and XLM-R can identify named entities in languages like German, French,
Hindi, and many others, without needing separate models for each language.
4. Question Answering:
o Cross-lingual QA models, such as mBERT, can answer questions in one language
based on context from another language.
o This is useful in environments where knowledge bases are in one language (like
English), but the user may ask questions in other languages.

Advantages of Multilingual and Cross-Lingual Models

1. Resource Efficiency:
o You don’t need separate models for each language.
o Training on multiple languages simultaneously is more resource-efficient than
training individual models for each language.
2. Better Generalization:
o Shared representations learned from multiple languages help the model generalize better, especially in low-resource languages (those with less training data available).
3. Zero-Shot and Few-Shot Learning:
o Cross-lingual models enable zero-shot and few-shot learning, allowing the model to
perform tasks in languages it has never seen during training.
4. Scalability:
o Once a multilingual model is trained, you can easily apply it to additional languages
without retraining from scratch, making it highly scalable.

Challenges in Multilingual and Cross-Lingual Modeling

1. Data Imbalance:
o Some languages (like English or Chinese) have large amounts of training data, while
others (e.g., Swahili or Xhosa) have less data available, which can lead to bias in the
model.
2. Complexity in Multilingual Syntax:
o Different languages have vastly different syntax and grammar. For example, English
follows a Subject-Verb-Object word order, while Japanese uses Subject-Object-
Verb. Multilingual models must learn how to navigate these differences.
3. Translation Bias:
o In cross-lingual tasks, translation errors or biases from machine translation systems
can be transferred to the model, leading to lower performance in the target
language.

📏 Popular Pretrained Multilingual Models

1. mBERT: A multilingual BERT model that can be fine-tuned for various tasks like text
classification, NER, and translation across multiple languages.
2. XLM-R: A robust version of XLM for cross-lingual tasks trained on 100+ languages.
3. mT5: A multilingual T5 that is designed for multilingual text-to-text tasks (like
summarization, translation, etc.).
4. ByT5: A variant of T5 that operates directly on bytes, allowing it to handle any language or script without a fixed subword vocabulary.
