NLP Final
Part-of-Speech (POS) Tagging: POS tagging involves labeling each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, etc. For example, in the sentence "The cat sits on the mat," the tags might be:
i. The/DT (determiner)
ii. cat/NN (noun)
iii. sits/VB (verb)
iv. on/IN (preposition)
v. the/DT (determiner)
vi. mat/NN (noun)

Named Entity Recognition (NER): NER identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, dates, and more. For example, in the sentence "Barack Obama was born in Hawaii," the entities are:
i. Barack Obama (Person)
ii. Hawaii (Location)

Lemmatization and Stemming: These are techniques used to reduce words to their base or root form.
i. Stemming: Reduces words to their root form by removing suffixes (e.g., "running" becomes "run").
ii. Lemmatization: Reduces words to their base form considering the context (e.g., "running" becomes "run", "better" becomes "good").

Stop Words: Stop words are common words that are often filtered out in NLP tasks because they carry less meaning (e.g., "the", "is", "in"). Removing stop words can improve the efficiency of text processing.

Bag of Words (BoW): BoW is a representation of text that describes the occurrence of words within a document. It involves creating a vocabulary of all words in the document and then representing each document by a vector of word counts. This model disregards grammar and word order but keeps multiplicity.
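As a quick illustration of POS tagging and NER, here is a minimal sketch using NLTK and spaCy (the library choice and the exact tags printed are assumptions; the NLTK data packages 'punkt' and 'averaged_perceptron_tagger' and the spaCy model 'en_core_web_sm' must be downloaded first).

```python
# Minimal sketch: POS tagging with NLTK and named entity recognition with spaCy.
import nltk
import spacy

tokens = nltk.word_tokenize("The cat sits on the mat")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('cat', 'NN'), ('sits', 'VBZ'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was born in Hawaii.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Barack Obama', 'PERSON'), ('Hawaii', 'GPE')]
```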
TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It combines:
i. Term Frequency (TF): How often a word appears in a document.
ii. Inverse Document Frequency (IDF): How common or rare a word is across all documents in the corpus.

Word Embeddings: Word embeddings are dense vector representations of words that capture semantic meaning. Examples include Word2Vec, GloVe, and FastText. These embeddings map words to vectors in a high-dimensional space where semantically similar words are closer together.

N-grams: N-grams are contiguous sequences of n items (words, characters, etc.) from a given text. Common examples are:
i. Unigrams (1-gram): ["The", "cat", "sits"]
ii. Bigrams (2-gram): ["The cat", "cat sits"]
iii. Trigrams (3-gram): ["The cat sits"]

Sentiment Analysis: Sentiment analysis is the process of determining the emotional tone behind words, sentences, or texts. It classifies the text into positive, negative, or neutral sentiments.

Syntax and Parsing: Parsing involves analyzing the grammatical structure of a sentence to identify relationships between words. Syntactic parsing can be:
i. Dependency Parsing: Identifies dependencies between words (e.g., subject-verb relationships).
ii. Constituency Parsing: Breaks sentences into sub-phrases or constituents (e.g., noun phrases, verb phrases).

Machine Translation: Machine translation is the automatic translation of text from one language to another. Techniques range from rule-based approaches to statistical and neural machine translation models.
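To make the BoW and TF-IDF representations concrete, here is a minimal sketch using scikit-learn (the library choice and the toy two-document corpus are assumptions).

```python
# Minimal sketch: Bag-of-Words counts and TF-IDF weights with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "The cat sits on the mat",
    "The dog sits on the log",
]

bow = CountVectorizer()
counts = bow.fit_transform(corpus)          # document-term count matrix
print(bow.get_feature_names_out())
print(counts.toarray())

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)       # TF-IDF weighted matrix
print(weights.toarray().round(2))
```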
Language Models: Language models predict the probability of a sequence of words. They are fundamental for tasks like text generation and are often built using neural networks. Examples include LSTM-based models.

Text Classification: Text classification is the process of assigning predefined categories to text. Examples include spam detection, topic classification, and sentiment analysis.

Speech Recognition: Speech recognition involves converting spoken language into text. It combines NLP with signal processing and often uses models like Hidden Markov Models (HMMs) and deep learning techniques.

Ambiguity in Language
Ambiguity refers to the phenomenon where a word, phrase, or sentence has multiple interpretations. Ambiguity can occur at various levels of language processing, such as the lexical (word-level), syntactic (sentence structure), and semantic (meaning) levels. Understanding and resolving ambiguity is a significant challenge in natural language processing (NLP).

1. Lexical Ambiguity
• Lexical ambiguity arises when a word has multiple meanings.
Example:
• "I went to the bank."
Explanation:
• The word "bank" can refer to a financial institution or the side of a river.
Resolution:
• Context is used to determine the correct meaning. For example, additional context like "to deposit money" clarifies that "bank" refers to a financial institution.
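One classical, simple way to attack lexical ambiguity in code is the Lesk algorithm over WordNet. The sketch below is an assumption-laden illustration (NLTK is the assumed library, and the WordNet data must be downloaded); it scores senses of "bank" by their overlap with the context words.

```python
# Minimal sketch: resolving the sense of "bank" with the Lesk algorithm (NLTK).
from nltk import word_tokenize
from nltk.wsd import lesk

sentence = "I went to the bank to deposit money"
sense = lesk(word_tokenize(sentence), "bank")
print(sense, "-", sense.definition() if sense else "no sense found")
# Lesk scores senses by word overlap with the context, so the chosen sense
# shifts as the surrounding words change.
```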
2. Syntactic Ambiguity
• Syntactic ambiguity occurs when a sentence can be parsed in multiple ways due to its structure.
Example:
• "I saw the man with the telescope."
Explanation:
• This sentence can be interpreted as either:
• "I used the telescope to see the man."
• "I saw a man who had a telescope."
Resolution:
• Parsing algorithms and contextual understanding are used to determine the most likely structure.

3. Semantic Ambiguity
• Semantic ambiguity happens when a sentence can have multiple meanings, even if its syntactic structure is clear.
Example:
• "He gave her cat food."
Explanation:
• This sentence can mean:
• "He gave food to her cat."
• "He gave her some cat food."
Resolution:
• Semantic analysis and context are used to infer the intended meaning.
4. Pragmatic Ambiguity
• Pragmatic ambiguity involves the interpretation of language in context, considering the speaker's intentions and the situational context.
Example:
• "Can you pass the salt?"
Explanation:
• This sentence can be interpreted as:
• A question about the listener's ability to pass the salt.
• A polite request for the listener to pass the salt.
Resolution:
• Understanding the social and conversational context helps resolve pragmatic ambiguity.

QUESTIONS
• "More than 100 students attended the seminar. 50 of them were from our college."
• "The project will be completed in 10 days."
• "The temperature will rise by 5 to 10 degrees."
Segmentation
• Segmentation in Natural Language Processing (NLP) refers to the process of dividing text into smaller meaningful units. These units can be sentences, words, phrases, or other subunits.
• Effective segmentation is crucial for many downstream NLP tasks such as tokenization, part-of-speech tagging, named entity recognition, and parsing.
• 1. Sentence Segmentation: Sentence segmentation, also known as sentence boundary detection, involves splitting a text into individual sentences.
• 2. Word Segmentation: Word segmentation, also known as tokenization, involves splitting a sentence into individual words or tokens.
• 3. Subword Segmentation: Subword segmentation involves splitting words into smaller units, such as morphemes or subwords, which can be useful for handling out-of-vocabulary words in machine translation or language modeling.
• 4. Paragraph Segmentation: Paragraph segmentation involves splitting a text into paragraphs. This is less common in typical NLP tasks but can be important for document-level analysis.
• 5. Chunking (Shallow Parsing): Chunking involves segmenting and labeling multi-token sequences, such as noun phrases (NP), verb phrases (VP), etc.
Stemming
Stemming is a text normalization technique in Natural Language Processing (NLP) that reduces words to their base or root form.
The root form is usually not a valid word by itself but is a common representation of words that allows for the conflation of different inflected forms of a word.
Stemming helps in reducing the dimensionality of text data and is particularly useful in search engines, text mining, and information retrieval systems.
Common Stemming Algorithms
i. Porter Stemmer: One of the most widely used stemming algorithms, known for its simplicity and efficiency.
ii. Lancaster Stemmer: A more aggressive stemming algorithm compared to the Porter Stemmer.
iii. Snowball Stemmer: Also known as the Porter2 stemmer, it is an improvement over the original Porter stemmer and is available for multiple languages.

Tokenization
Tokenization is a fundamental step in natural language processing (NLP) that involves splitting text into individual units called tokens. These tokens can be words, phrases, or other meaningful elements. Tokenization facilitates further processing and analysis of text data by breaking it down into manageable pieces.
Types of Tokenization
• Word Tokenization: Splitting text into individual words.
• Sentence Tokenization: Splitting text into individual sentences.
• Subword Tokenization: Splitting words into smaller units, such as morphemes or subwords, useful in dealing with unknown words or for languages with rich morphology.
Libraries for Tokenization
• Several NLP libraries provide robust tokenization tools, including:
• NLTK (Natural Language Toolkit)
• spaCy
• Transformers by Hugging Face
• Gensim
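A minimal sketch of the stemmers and tokenizers listed above, using NLTK (the library choice and the sample words are assumptions; the 'punkt' tokenizer data must be downloaded first).

```python
# Minimal sketch: comparing stemmers and tokenizing text with NLTK.
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

words = ["running", "studies", "happily", "connection"]
porter, lancaster, snowball = PorterStemmer(), LancasterStemmer(), SnowballStemmer("english")
for w in words:
    print(w, porter.stem(w), lancaster.stem(w), snowball.stem(w))

text = "The cat sits on the mat. The dog barks."
print(sent_tokenize(text))   # sentence tokenization
print(word_tokenize(text))   # word tokenization
```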
Word Embedding
• Word embedding refers to a technique for representing words as dense vectors of real numbers in a continuous vector space.
• Unlike traditional methods such as one-hot encoding, which represent words as sparse, high-dimensional vectors, word embeddings capture semantic relationships between words in a more compact and meaningful way.
• Key points are as follows:
i. Dimensionality Reduction: Word embeddings reduce the dimensionality of word representations compared to one-hot encoding, which typically results in a sparse vector of the size of the vocabulary. Embeddings represent each word as a dense vector of fixed size, often in the range of 50 to 300 dimensions.
ii. Semantic Meaning: Word embeddings capture semantic meaning and relationships between words. Words with similar meanings or contexts are represented by similar vectors. For example, "king" and "queen" may have vectors that are closer to each other than "king" and "car."
• Contextual Information: Word embeddings are learned from large corpora of text and can reflect syntactic and semantic patterns. Popular embeddings like Word2Vec, GloVe, and FastText are trained using various methods to capture these patterns.
• Pre-trained Embeddings: Pre-trained word embeddings can be used to initialize models, allowing them to leverage learned semantic relationships from large datasets without having to train embeddings from scratch.
• Applications: Word embeddings are used in various NLP tasks such as text classification, sentiment analysis, machine translation, and information retrieval. They are foundational for many modern NLP techniques and models.
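A minimal sketch of training a small Word2Vec model with Gensim (the library, the toy corpus, and the hyperparameters are assumptions; real embeddings are trained on much larger corpora, and the API shown is Gensim 4.x).

```python
# Minimal sketch: training Word2Vec embeddings on a toy corpus with Gensim 4.x.
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sits", "on", "the", "mat"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["king"][:5])                 # first 5 dimensions of the "king" vector
print(model.wv.similarity("king", "queen")) # cosine similarity between two words
```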
Word Senses: Applications
• Applications: Understanding word senses is critical for many NLP applications, including:
• Machine Translation: Ensuring the correct translation of words based on their intended meanings.
• Information Retrieval: Improving search results by understanding the context of search queries.
• Text Summarization: Generating accurate summaries that reflect the correct meanings of words.

Dependency Parsing
• Dependency parsing is a key aspect of syntactic analysis in natural language processing (NLP) and computational linguistics.
• It focuses on analyzing the grammatical structure of a sentence by identifying the relationships between words, particularly how each word depends on others.
Key Concepts in Dependency Parsing:
• Dependency Relations: In dependency parsing, the grammatical structure of a sentence is represented by a set of dependency relations. Each relation consists of a head and a dependent. The head is a word that governs or influences another word (the dependent), establishing a syntactic connection between them.
• Dependency Tree: The result of dependency parsing is often visualized as a dependency tree or dependency graph. In this tree, each node represents a word, and directed edges represent dependency relations. The root of the tree is typically the main verb or another central element of the sentence.
Head and Dependent:
• Head: The governing word in a dependency relation.
• Dependent: The word that is governed by the head. For example, in the phrase "The cat sleeps," "sleeps" is the head of "cat," which is the dependent.
• Types of Dependencies: Common dependency relations include:
• Subject: The noun or noun phrase that performs the action (e.g., "cat" in "The cat sleeps").
• Object: The noun or noun phrase that receives the action (e.g., "ball" in "She throws the ball").
• Modifier: Words that provide additional information about another word (e.g., adjectives describing nouns).
• Dependency Parsing Models: Several algorithms and models are used for dependency parsing, including:
• Transition-based parsing: Constructs the dependency tree by making a sequence of parsing decisions based on transitions between different states.
• Graph-based parsing: Constructs the entire dependency graph and selects the best tree by optimizing a scoring function.
• Neural network-based models: Leverage deep learning techniques to learn complex patterns in dependency structures, improving accuracy and flexibility.
• Applications: Dependency parsing is crucial for various NLP tasks, including:
• Semantic Role Labeling: Understanding the roles played by different words in a sentence.
• Machine Translation: Improving the accuracy of translations by capturing grammatical relationships.
• Information Extraction: Identifying and extracting specific information based on grammatical structure.
• Text Summarization: Generating coherent summaries by understanding sentence structure.
• Tools and Resources: Popular tools for dependency parsing include:
• spaCy: An NLP library with built-in support for dependency parsing.
• Stanford Parser: A widely used tool from the Stanford NLP group that provides dependency parsing capabilities.
• NLTK: The Natural Language Toolkit, which includes functions for dependency parsing.
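A minimal sketch of dependency parsing with spaCy (the library choice is an assumption; the 'en_core_web_sm' model must be downloaded first).

```python
# Minimal sketch: printing the dependency relations of a sentence with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sleeps")
for token in doc:
    # token.dep_ is the relation label, token.head is the governing word
    print(f"{token.text:<8} {token.dep_:<8} head={token.head.text}")
# Expected output (roughly): "cat" is the nominal subject (nsubj) of "sleeps",
# and "sleeps" is the root of the sentence.
```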
Word Window Classification
• Word Window Classification is a technique used in Natural Language Processing (NLP) to classify words based on the context provided by surrounding words, known as a "window".
• Model Training: A machine learning model (e.g., logistic regression, SVM, or a neural network) is trained on these features to classify the target word.
• Sliding Window: The window slides over the text, classifying each word based on its surrounding context.
Example
• Consider the sentence: "The quick brown fox jumps over the lazy dog."
• If the target word is "fox" and the window size is 3, the window will look like this:
• Previous word: "brown"
• Target word: "fox"
• Next word: "jumps"
• Features for "fox" could include the embeddings of "brown", "fox", and "jumps".

Example to illustrate how word window classification works:
• Task: Part-of-Speech Tagging
Example Sentence:
• The cat sat on the mat.
Goal:
• Assign each word in the sentence its correct part of speech (POS) tag.
Word Window:
• We'll use a word window of size 3 (1 word to the left, the target word, and 1 word to the right).
• For the sentence "The cat sat on the mat":
• Target Word: "cat"
• Word Window: [The, cat, sat]
• POS Tags: [DT (Determiner), NN (Noun), VB (Verb)]
• Classification: NN (Noun)
• Target Word: "sat"
• Word Window: [cat, sat, on]
• POS Tags: [NN, VB, IN]
• Classification: VB (Verb)
• Target Word: "on"
• Word Window: [sat, on, the]
• POS Tags: [VB, IN, DT]
• Classification: IN (Preposition)

Applications
• Named Entity Recognition (NER): Classifying words into categories like person, location, organization, etc.
• Part-of-Speech (POS) Tagging: Assigning parts of speech to each word in a sentence.
• Chunking: Dividing a text into syntactically correlated parts like noun or verb phrases.
Benefits
• Context-Aware: Takes into account the surrounding context, leading to better classification performance.
• Simplicity: Relatively simple to implement and understand.
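A minimal sketch of word window classification as described above, using a ±1-word window, one-hot word features, and scikit-learn's logistic regression (the toy hand-tagged corpus, the feature scheme, and the library are all assumptions).

```python
# Minimal sketch: classify a word's POS tag from a +/-1 word window.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hand-tagged corpus, used purely for illustration.
tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
words = [w for w, _ in tagged]

def window_features(i):
    """Features for position i: previous word, current word, next word."""
    return {
        "prev": words[i - 1] if i > 0 else "<s>",
        "curr": words[i],
        "next": words[i + 1] if i < len(words) - 1 else "</s>",
    }

X = [window_features(i) for i in range(len(words))]
y = [tag for _, tag in tagged]

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

# Classify "cat" (position 1) from its window [The, cat, sat].
print(clf.predict(vec.transform([window_features(1)])))  # expected: ['NN']
```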
Word Embeddings: Example
The vectors for "king" and "queen" would be close to each other in this vector space.
Example:
• Suppose you have the sentence: "The cat sits on the mat."
• After embedding, the word "cat" might be turned into a vector like [0.2, 0.8, -0.1], and "mat" might be [0.5, 0.6, 0.1].
Applications:
• Text Classification: Embeddings help classify text by giving the model a numerical understanding of words.
• Search Engines: They can find similar documents by comparing the embeddings of the text in them.
• Machine Translation: Embeddings allow models to understand and translate words between languages by finding similar vectors in different languages.

N-gram Language Models
N-gram language models are a type of statistical model used in natural language processing (NLP) to predict the probability of a sequence of words in a sentence. They are called "N-gram" models because they consider sequences of "N" words at a time.
Key Concepts:
• N-gram:
• An N-gram is a contiguous sequence of "N" items (usually words) from a given text.
• For example:
• Unigram (1-gram): A single word (e.g., "The")
• Bigram (2-gram): A sequence of two words (e.g., "The cat")
• Trigram (3-gram): A sequence of three words (e.g., "The cat sits")
• And so on...
Language Model:
• A language model assigns probabilities to sequences of words.
• For an N-gram model, the probability of a word depends on the previous N-1 words.
• For example, in a trigram model, the probability of a word depends on the two preceding words.
How It Works?
• The model is trained on a large corpus of text, counting how often different N-grams occur.
• It uses these counts to estimate the probability of a word following a given sequence of N-1 words.
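A minimal sketch of a count-based bigram model over a toy corpus (the corpus and the absence of smoothing are assumptions; real systems add smoothing so unseen N-grams do not get zero probability).

```python
# Minimal sketch: estimating bigram probabilities from counts.
from collections import defaultdict

corpus = ["the cat sits on the mat", "the cat sleeps"]
bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)

for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[(prev, curr)] += 1
        unigram_counts[prev] += 1

def p(curr, prev):
    """P(curr | prev) = count(prev, curr) / count(prev), with no smoothing."""
    return bigram_counts[(prev, curr)] / unigram_counts[prev] if unigram_counts[prev] else 0.0

print(p("cat", "the"))   # 2/3: "the" is followed by "cat" twice and "mat" once
print(p("sits", "cat"))  # 1/2
```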
Applications
• Text Prediction: N-gram models can predict the next word or sequence of words in a sentence (e.g., in autocomplete).
• Speech Recognition: They help in determining the most probable words spoken in a sequence.
• Spelling Correction: They can suggest corrections by considering the most likely word sequences.
Limitations:
• Limited Context: Higher-order N-gram models (e.g., trigram, 4-gram) capture more context but require much more data and computational power.
• Data Sparsity: Rare N-grams might not appear often enough in the training data, leading to poor probability estimates for some word sequences.
• Overfitting: High-order N-gram models might fit the training data too closely and not generalize well to new text.

Perplexity
• Perplexity is a measurement used in natural language processing to evaluate the quality of a language model.
• It essentially tells us how well a probability model predicts a sample of text.
• Lower perplexity indicates a better model because it suggests the model is better at predicting the text.
What is Perplexity?
• Understanding Perplexity:
• Perplexity is the exponentiation of the average negative log-likelihood of a test set, which can be interpreted as the average branching factor of a language model.
• In simpler terms, it tells us how "surprised" the model is by the text. If a model is well-trained and predicts the text well, it will have low perplexity (low surprise). If the model is poorly trained, it will have high perplexity (high surprise).
Perplexity and Language Models:
• N-gram Models: Perplexity is often used to evaluate N-gram models. A trigram model, for example, will have lower perplexity than a bigram model if it better captures the text's patterns.
• Neural Language Models: Modern neural language models (e.g., RNNs, Transformers) often achieve much lower perplexity than traditional N-gram models, indicating they are better at predicting sequences of words.
Interpreting Perplexity:
• A lower perplexity score indicates a better model. For example, if one model has a perplexity of 50 and another has 100, the first model is considered to be better at predicting the text.
• However, perplexity is relative; it should be compared within the same dataset and task.
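To make the definition concrete: perplexity is the exponential of the average negative log-probability the model assigns to each word. A minimal sketch follows (the per-word probability values are made up purely for illustration).

```python
# Minimal sketch: perplexity = exp( -(1/N) * sum(log p_i) ) over the test words.
import math

# Hypothetical per-word probabilities assigned by a language model to a test sentence.
word_probs = [0.2, 0.1, 0.25, 0.05, 0.15]

n = len(word_probs)
perplexity = math.exp(-sum(math.log(p) for p in word_probs) / n)
print(round(perplexity, 2))  # lower is better; a perfect model (p = 1 everywhere) gives 1.0
```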
Example
• Suppose a language model predicts the following sequence: "The cat sat on the mat."
• If the model predicts each word with high probability, the perplexity will be low, suggesting the model understands the text well.
• If the model predicts each word with low probability, the perplexity will be high, suggesting the model is less effective.

Hidden Markov Models
• A Hidden Markov Model (HMM) is a statistical model used to represent systems that are governed by a Markov process with hidden states.
• HMMs are widely used in areas such as speech recognition, natural language processing, and bioinformatics.
States:
• Hidden States: The actual states of the system are not directly observable. Instead, they are inferred based on observable outputs.
• Observable States: These are the outputs or observations that can be directly seen or measured.
• Markov Property:
• The Markov property assumes that the probability of transitioning to the next state depends only on the current state, not on the sequence of previous states. This is known as the first-order Markov property.
Initial State Probabilities: These are the probabilities of the system starting in each possible hidden state.
Example
• Imagine we have a person who can be either Happy or Sad on any given day (these are the hidden states). We can't directly observe their mood, but we can observe their behavior: whether they are Singing or Not Singing (these are the observable states).
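A minimal sketch of this Happy/Sad HMM using Viterbi decoding to recover the most likely sequence of hidden moods from observed behaviour (all probability values are assumptions chosen only for illustration).

```python
# Minimal sketch: Viterbi decoding for the Happy/Sad HMM.
states = ["Happy", "Sad"]
observations = ["Singing", "NotSinging", "Singing"]

start_p = {"Happy": 0.6, "Sad": 0.4}                      # initial state probabilities
trans_p = {"Happy": {"Happy": 0.7, "Sad": 0.3},           # state transition probabilities
           "Sad":   {"Happy": 0.4, "Sad": 0.6}}
emit_p  = {"Happy": {"Singing": 0.8, "NotSinging": 0.2},  # emission probabilities
           "Sad":   {"Singing": 0.1, "NotSinging": 0.9}}

def viterbi(obs):
    # V[t][s] = (probability of the best path ending in state s at time t, previous state)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max((V[t - 1][ps][0] * trans_p[ps][s] * emit_p[s][obs[t]], ps)
                             for ps in states)
            V[t][s] = (prob, prev)
    # Backtrack from the most probable final state.
    best = max(V[-1], key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

print(viterbi(observations))  # -> ['Happy', 'Sad', 'Happy'] with these numbers
```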
Recurrent Neural Networks
• A Recurrent Neural Network (RNN) is a class of artificial neural networks designed for sequence data or time-series data.
• Unlike traditional feedforward neural networks, RNNs have connections that form directed cycles, allowing them to maintain a form of memory by processing information from previous time steps.
• This makes them highly suitable for tasks where the order of inputs matters, such as language modeling, speech recognition, and time-series prediction.
Key Concepts of RNNs:
• Sequential Data: RNNs handle data where order matters, such as sentences or time-series data.
• Hidden States: RNNs have a hidden state that carries information from one step to the next, allowing the network to have a memory.
• Weight Sharing: The same weights are applied to inputs across time steps, reducing the number of parameters compared to feedforward networks.
• Backpropagation Through Time (BPTT): During training, errors are propagated back through time, allowing RNNs to learn from the sequence's dependencies.
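A minimal NumPy sketch of a vanilla RNN forward pass over a short sequence, showing how the hidden state carries information between time steps and how the same weights are reused at every step (the dimensions and random weights are assumptions).

```python
# Minimal sketch: vanilla RNN forward pass, h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 3, 5

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h  = np.zeros(hidden_size)

x_seq = rng.normal(size=(seq_len, input_size))  # toy input sequence
h = np.zeros(hidden_size)                       # initial hidden state

for t, x_t in enumerate(x_seq):
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)    # same weights reused at every time step
    print(f"t={t} hidden state: {np.round(h, 3)}")
```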
Structure of an RNN
RNN Architecture:
• Input Layer: Receives the input sequence (e.g., words, time-series data).
• Hidden Layers: Receive input at each time step and maintain a hidden state that captures past information.
• Output Layer: Generates output based on the current hidden state, which can be used for tasks like classification or regression.
Types of RNNs:
• Vanilla RNN: The simplest form of RNN, with a single hidden layer connecting back to itself. It can suffer from vanishing/exploding gradient problems.
• Long Short-Term Memory (LSTM): An advanced RNN architecture that introduces gates (input, forget, and output) to control the flow of information and mitigate issues like vanishing gradients.
• Gated Recurrent Unit (GRU): A simplified version of LSTM that combines the forget and input gates into a single update gate.
RNN Workflow
• Data Preparation: RNNs expect sequential data, so the input data is often organized into sequences.
• Forward Propagation: The input at each time step is processed, and the hidden state is updated to reflect past information.
• Loss Calculation: The loss is calculated based on the model's predictions and the actual targets.
• Backpropagation Through Time (BPTT): The gradients are computed, and weights are updated using backpropagation over multiple time steps.

Vanishing and Exploding Gradients
• Vanishing gradients and exploding gradients are two common problems encountered when training deep neural networks, especially Recurrent Neural Networks (RNNs) or very deep feedforward networks.
• Both issues arise during the backpropagation process, where the gradients are propagated backward through the network to update the weights.
• Vanishing gradients occur when the gradients become extremely small as they are propagated back through the layers of a neural network, especially in deep networks.
• This results in extremely slow learning or no learning at all for the earlier layers of the network.
• It is particularly problematic in RNNs because gradients need to be propagated through many time steps, and information from earlier time steps can get "lost."
Consequences
• Unstable training: The loss function can fluctuate wildly, making it difficult for the network to converge.
• Divergence: The model might fail to learn and diverge, causing errors and numerical instability (e.g., NaN values in the weights).

UNIT-3
LSTM (Long Short-Term Memory)
Long Short-Term Memory (LSTM) is a special type of Recurrent Neural Network (RNN) designed to overcome the limitations of traditional RNNs, particularly the vanishing gradient problem.
Architecture of an LSTM
Each LSTM unit consists of:
• Cell State (C_t): The memory of the LSTM unit.
• Hidden State (h_t): The output of the LSTM unit at each time step, which can be passed to the next unit.
• Forget Gate (f_t): Controls what proportion of the previous memory to retain.
• Input Gate (i_t): Determines how much of the new input should be added to the memory.
• Candidate Memory (C̃_t): A temporary memory update based on the input.
• Output Gate (o_t): Controls how much of the memory affects the next hidden state.
• Final Memory Update: The cell state is updated using both the forget gate and the input gate.
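For reference, the standard LSTM gate equations (with sigmoid σ, element-wise product ⊙, and learned weights W and biases b; this standard formulation is given as a supplement to the gate descriptions above):

```latex
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
```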
Gated Recurrent Unit (GRU)
The Gated Recurrent Unit (GRU) is a type of recurrent neural network (RNN) architecture designed to solve issues related to learning long-term dependencies, such as vanishing gradients, that traditional RNNs face.
It is similar to the Long Short-Term Memory (LSTM) network but has a simpler architecture.
Key Concepts of the Gated Recurrent Unit (GRU)
Recurrent Neural Networks (RNNs) Background:
• RNNs are used to process sequential data (e.g., time series, text, speech). They maintain a hidden state that is updated with each time step, allowing them to capture the dependencies between elements in the sequence.
• However, traditional RNNs struggle to capture long-range dependencies due to the vanishing gradient problem during training.
• GRU Architecture: A GRU addresses the limitations of traditional RNNs by introducing two gates: the reset gate and the update gate. These gates help control the flow of information and manage what information should be carried forward or forgotten.
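For reference, the standard GRU update equations (σ is the sigmoid, ⊙ the element-wise product, W and b learned parameters; this standard formulation is given as a supplement to the description above):

```latex
\begin{aligned}
z_t &= \sigma(W_z [h_{t-1}, x_t] + b_z) &&\text{(update gate)} \\
r_t &= \sigma(W_r [h_{t-1}, x_t] + b_r) &&\text{(reset gate)} \\
\tilde{h}_t &= \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h) &&\text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t &&\text{(new hidden state)}
\end{aligned}
```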
Advantages of GRU
Comparison of GRU vs. LSTM
• LSTM: Uses three gates (input, forget, output), which provide more control over memory, but at the cost of more computational complexity.
• GRU: Uses two gates (update, reset), which makes it more efficient while still performing well on many tasks.

Use Cases of GRU
• Time Series Forecasting: GRUs are used to capture dependencies in time-series data for tasks like weather prediction or stock price forecasting.
• Natural Language Processing (NLP): GRUs are used in tasks like machine translation, speech recognition, and text generation, where sequential data is prevalent.
• Anomaly Detection: In detecting irregularities in sequences, such as fraud detection in transactional data or identifying unusual patterns in sensor data.
How POS Tagging Works
• Tokenization: The sentence is first split into individual words (tokens).
• Assign Tags: Each token is then tagged with its part of speech based on its role in the sentence.
For example:
Sentence: "The cat chased the mouse."
POS Tags:
• "The" → DT (Determiner)
• "cat" → NN (Noun)
• "chased" → VBD (Verb, Past Tense)
• "the" → DT (Determiner)
• "mouse" → NN (Noun)

POS Tagging Algorithms
Rule-based POS Tagging:
• Uses a set of hand-crafted linguistic rules to tag words.
• Example: "If a word ends in 'ing', it's likely a verb (e.g., running)."
Statistical POS Tagging:
• Uses machine learning models to assign POS tags based on probabilities derived from a large annotated corpus.
• Example: Hidden Markov Model (HMM), which predicts the most likely tag sequence given the word sequence.
Neural Network-based POS Tagging:
• Uses deep learning methods like Recurrent Neural Networks (RNNs) or Transformer-based models (e.g., BERT) to learn tagging from large amounts of labeled data.
Hybrid Methods:
• Combine rule-based and statistical or machine learning approaches to improve accuracy.

Explanation of Tags
• DT: Determiner
• JJ: Adjective
• NN: Noun
• VBZ: Verb, 3rd person singular present
• IN: Preposition
Applications of POS Tagging
• Syntactic Parsing: Helps in building parse trees for sentences.
• Named Entity Recognition (NER): POS tagging is often a preprocessing step in recognizing named entities like persons, organizations, etc.
• Machine Translation: Helps in understanding sentence structure for better translation.
• Information Extraction: Assists in extracting useful information (e.g., events, dates) from text.
• Sentiment Analysis: POS tags can help identify sentiments by focusing on adjectives and verbs.

Challenges in POS Tagging
• Ambiguity: Some words can belong to different parts of speech depending on the context (e.g., "book" can be a noun or a verb).
• Unknown Words: Handling out-of-vocabulary (OOV) words not seen in the training data can be difficult.
BERT: Training, Pre-training and Fine-tuning
Training:
• Masked Language Model (MLM): During training, some words in the input text are randomly masked, and the model learns to predict these masked words based on their context. This helps BERT understand the meaning of words based on surrounding words.
Pre-training and Fine-tuning:
• BERT is first pre-trained on a large corpus of text (e.g., Wikipedia and BooksCorpus) to learn general language representations. This pre-training allows it to capture a broad understanding of language.
• After pre-training, BERT can be fine-tuned on specific tasks (e.g., sentiment analysis, question answering) using a smaller dataset. Fine-tuning typically involves adding a simple output layer to the pre-trained model and training it on the task-specific data.
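A minimal sketch of BERT's masked-language-model behaviour using the Hugging Face transformers pipeline (the library, the model name, and the example sentence are assumptions; the model weights are downloaded on first use).

```python
# Minimal sketch: asking a pre-trained BERT to fill in a masked word.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The cat sat on the [MASK].")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
# Top predictions are typically plausible nouns such as "floor", "bed", or "couch".
```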
Limitations of the BERT Model

XLNet
XLNet is a cutting-edge language model created by the Google AI Brain team and Carnegie Mellon University, designed to improve on the limitations of previous models like BERT.
Key Features of XLNet
Autoregressive Approach with Permutation:
• XLNet employs an autoregressive model with permutation, meaning it predicts tokens based on every possible ordering of the input, not just left-to-right or right-to-left. This enables it to learn bidirectional context while preserving the autoregressive nature.
Training Objective:
• Unlike BERT, which uses masked language modeling (where certain tokens are masked and then predicted), XLNet does not mask tokens. Instead, it trains by permuting input sequences and predicting tokens, which allows it to leverage more context information without introducing artificial masking.
Transformer-XL Architecture:
• XLNet builds on the Transformer-XL architecture, which enables it to handle longer sequences by remembering past tokens through a recurrence mechanism. This helps the model capture dependencies across longer contexts, which is beneficial in tasks requiring understanding of large text inputs.
Superior Performance on NLP Tasks:
• XLNet has outperformed previous models like BERT on various NLP benchmarks, including text classification, question answering, and sentiment analysis.
Applications of XLNet
• Question Answering: XLNet performs exceptionally well on QA tasks where understanding bidirectional context is key.
• Text Classification: It is useful in tasks where longer dependencies and contextual understanding improve classification accuracy.
• Sentiment Analysis: XLNet can capture nuanced emotions in text, making it effective for sentiment analysis in both short and long text passages.
• Language Translation: Its enhanced context understanding and ability to handle long sequences make it promising for translation tasks.

UNIT-4
Statistical Machine Translation (SMT)
Statistical Machine Translation is a method in Natural Language Processing (NLP) for translating text from one language to another based on statistical models.
SMT models build translations by learning patterns and probabilities from large amounts of bilingual text data (called parallel corpora) without needing extensive human-written grammar rules or vocabularies.
Key Concepts in Statistical Machine Translation
Parallel Corpora:
• SMT requires parallel corpora, which consist of texts in two languages that correspond sentence by sentence. Examples include news articles, government documents, and literature in multiple languages.
Translation Model:
• The translation model captures the probability that a given phrase in the source language translates to a phrase in the target language. For instance, if the model is trained to translate from English to French, it will estimate the probability of each possible French phrase for a given English phrase based on the data it has seen.
Language Model:
• The language model ensures that translations make sense in the target language. It learns the natural flow of the target language by analyzing large amounts of monolingual text, assigning higher probabilities to sequences that are grammatically correct and commonly used.
Decoding:
• During translation, SMT systems use a decoder to generate the most probable sequence of words in the target language based on the probabilities from both the translation model and the language model. The decoder searches for the best combination that maximizes both translation accuracy and fluency.
Alignment and Phrase Tables:
• SMT involves alignment, which maps words and phrases in the source language to their counterparts in the target language. These alignments form phrase tables that the model uses to translate segments rather than just single words, making translations more fluent and accurate.
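The way the translation model and the language model combine during decoding is commonly written as the noisy-channel objective (a standard formulation, added here for reference): for a source sentence f, the decoder searches for the target sentence e that maximizes

```latex
\hat{e} = \arg\max_{e} \; P(e \mid f) = \arg\max_{e} \; P(f \mid e)\, P(e)
```

where P(f | e) is supplied by the translation model and P(e) by the language model.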
Advantages of SMT:
• Requires less manual rule-building, relying instead on learning from data.
• Works well with sufficient high-quality parallel data.
Limitations of SMT:
• Translation quality depends heavily on the size and quality of the parallel corpus.
• Cannot handle idiomatic expressions or complex syntax as effectively as more advanced models, such as Neural Machine Translation (NMT).

Neural Machine Translation (NMT)
• Neural Machine Translation (NMT) is a technique for automatically translating text from one language to another using artificial neural networks.
• Unlike traditional statistical methods, which rely on separately modeled components (like language models and translation rules), NMT is an end-to-end learning approach where the entire translation process is handled by a single neural network.
• NMT systems are based on deep learning, allowing them to understand and generate more natural and accurate translations by learning directly from vast amounts of bilingual text data.
UNIT-5
1D-CNN for NLP
1D-CNNs (1-Dimensional Convolutional Neural Networks) for Natural Language Processing (NLP) have become a popular approach due to their simplicity, computational efficiency, and effectiveness in capturing local features in sequential data, such as text.
While CNNs are often associated with image processing, they can be adapted for text data using 1D convolutions over sequences, such as words or characters.
Basic Concepts of 1D-CNNs in NLP
• Convolutional Layer: 1D-CNNs use a convolution filter (kernel) that slides over the input text sequence. For NLP, this sequence typically consists of word embeddings (or character embeddings). The convolution operation helps capture local patterns, like n-grams, that are common in text sequences.
• Pooling Layer: After the convolution operation, a pooling layer (e.g., max-pooling) can be used to reduce the dimensionality and retain important features. This helps the model generalize well and avoid overfitting.
• Non-linearity (Activation Function): A non-linear activation function like ReLU (Rectified Linear Unit) is often applied to the output of the convolution operation to introduce non-linearity.
• Fully Connected Layer: The outputs from the convolution and pooling layers are often fed into one or more fully connected layers (dense layers), which are used to perform classification or regression tasks.
Why Use 1D-CNNs for NLP?
• Capturing Local Dependencies: 1D-CNNs are excellent at capturing local dependencies (e.g., word combinations, n-grams) within a text window. This can be particularly useful in tasks such as sentiment analysis, where phrases may have strong local contextual cues.
• Parallel Computation: Unlike RNNs (Recurrent Neural Networks), 1D-CNNs do not require sequential processing, making them faster and more efficient to train.
• Simple and Effective: Compared to other deep learning models for text, such as LSTMs or Transformers, 1D-CNNs are often simpler and can perform well with fewer computational resources.
Architecture of a Simple 1D-CNN for NLP
• Input Layer: Tokenized input text is represented as sequences of word embeddings (e.g., GloVe, Word2Vec, or trainable embeddings).
• Convolution Layer(s): A set of 1D convolution filters slides over the input text, capturing local patterns and producing feature maps.
• Pooling Layer: Max-pooling or average-pooling is applied to reduce the dimensionality of the feature maps and focus on the most relevant features.
• Dropout Layer (optional): This layer is used to prevent overfitting by randomly setting a fraction of the input units to zero during training.
• Fully Connected (Dense) Layer: The extracted features are flattened and fed into one or more fully connected layers, which lead to the output.
• Output Layer: Depending on the task (e.g., binary classification, multi-class classification), this layer typically uses a softmax or sigmoid activation function.

Applications
• Sentiment Analysis: Classifying text into positive, negative, or neutral sentiments.
• Spam Detection: Distinguishing between spam and non-spam messages.
• Text Categorization: Classifying text into predefined categories (e.g., news categorization).
• Named Entity Recognition (NER): Identifying named entities in a piece of text.
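A minimal sketch of this architecture in Keras for binary text classification (the library choice, vocabulary size, sequence length, and all hyperparameters are assumptions).

```python
# Minimal sketch: embedding -> 1D convolution -> pooling -> dense layers -> output.
from tensorflow.keras import layers, models

vocab_size, seq_len, embed_dim = 10000, 100, 64  # assumed sizes for illustration

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),                       # input layer: word embeddings
    layers.Conv1D(filters=128, kernel_size=3, activation="relu"),  # captures local 3-gram patterns
    layers.GlobalMaxPooling1D(),                                   # pooling layer
    layers.Dropout(0.5),                                           # optional dropout against overfitting
    layers.Dense(64, activation="relu"),                           # fully connected layer
    layers.Dense(1, activation="sigmoid"),                         # output: binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.build(input_shape=(None, seq_len))
model.summary()
```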
Sub-word Models
• Sub-word models are a powerful approach used in Natural Language Processing (NLP) for representing and processing text at a level below whole words, such as at the character or sub-word unit level.
• This concept addresses challenges related to handling unknown words (out-of-vocabulary words), language morphology, and variations in text.
• Instead of representing words as atomic units, sub-word models break words into smaller, meaningful units or segments.
• These segments can be characters, character n-grams, or units derived through algorithms like Byte Pair Encoding (BPE) and WordPiece.

Why Use Sub-word Models?
Sub-word models help overcome several common issues in NLP:
• Handling Unknown Words: In traditional word-based models, any word not seen during training is treated as unknown. Sub-word models, however, break unseen words into smaller known units, which helps improve generalization and vocabulary coverage.
• Morphological Richness: Many languages are morphologically complex, meaning that words can have numerous forms due to inflections, derivations, or compounds. Sub-word models can handle different word forms by representing words as combinations of sub-word units.
• Vocabulary Size Reduction: By representing words as sub-word units, the vocabulary size can be significantly smaller, leading to more efficient memory usage and faster computations.
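To see sub-word segmentation in action, here is a minimal sketch using the WordPiece tokenizer that ships with a pre-trained BERT model in Hugging Face transformers (the library and model name are assumptions; the tokenizer files are downloaded on first use).

```python
# Minimal sketch: WordPiece sub-word segmentation with a pre-trained BERT tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for word in ["unhappiness", "tokenization", "cat"]:
    print(word, "->", tokenizer.tokenize(word))
# Rare or complex words are split into pieces marked with '##' (the exact split
# depends on the learned vocabulary), while frequent words like "cat" stay whole.
```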
Self-Attention for Generative Models
• Self-attention is a key mechanism used in generative models for Natural Language Processing (NLP) tasks, particularly within transformer-based models like GPT (Generative Pre-trained Transformer).
• This mechanism allows models to weigh the importance of different words in a sequence relative to each other when generating or predicting the next word.
• Self-attention is the backbone of transformer models, enabling them to capture complex dependencies and relationships within text.

What is Self-Attention?
• At its core, self-attention allows a model to focus on different parts of an input sequence to make decisions.
• Unlike earlier models like RNNs or LSTMs, which process input sequentially, self-attention considers the entire input sequence at once.
• This means the model can learn relationships between all words in a sentence simultaneously, regardless of their distance from each other.
• Example: Consider the sentence:
• "The cat sat on the mat because it was soft."
• When processing the word "it," the model needs to understand that "it" refers to "the mat" and not "the cat." The self-attention mechanism helps the model identify this relationship by assigning a higher weight (importance) to the connection between "it" and "the mat."
Natural Language Generation
Natural Language Generation (NLG) is a subfield of Natural Language Processing (NLP) focused on the creation of human-like text by machines.
It plays a crucial role in enabling computers to produce coherent and contextually relevant responses, transforming structured data or textual input into comprehensible and meaningful natural language output.
Basics of NLG in the context of NLP:
NLG converts structured data, such as numbers, coded data, or other machine-readable content, into natural language text that is easily understandable by humans.
It is often used to generate reports, summarize data, create chatbots, automate content creation, and provide conversational agents in various industries.

Key Components of NLG Systems
• Content Determination: Decides what information to include in the generated text based on input data.
• Document Structuring: Organizes the content into a coherent structure, such as sentences, paragraphs, or specific formats, like reports or narratives.
• Sentence Aggregation: Merges related pieces of information into cohesive and well-formed sentences.
• Lexicalization: Selects the words and phrases that best convey the intended meaning.
• Surface Realization: Converts the structured representation of text into grammatically correct and fluent sentences.
• Refinement: Involves adjustments for clarity, stylistic preferences, or contextual appropriateness, ensuring the final output is polished and effective.