
(21CS121) NATURAL LANGUAGE PROCESSING

NLP CONCEPT

• Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics, focusing on the interaction between computers and human (natural) languages.

The goal of NLP is to enable computers to understand, interpret, and respond to human language in a way that is both meaningful and useful. Some basic concepts in NLP are shown below:

Tokenization: Tokenization is the process of splitting text into individual words, phrases, symbols, or other meaningful elements called tokens. For example, the sentence "Hello world!" might be tokenized into ["Hello", "world", "!"].
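A minimal tokenization sketch with NLTK (added illustration, not from the original slides; assumes NLTK is installed and can download its tokenizer data):

```python
# Minimal word tokenization sketch (assumes: pip install nltk).
import nltk
nltk.download("punkt", quiet=True)  # tokenizer data; newer NLTK versions may also need "punkt_tab"
from nltk.tokenize import word_tokenize

text = "Hello world!"
tokens = word_tokenize(text)
print(tokens)  # expected: ['Hello', 'world', '!']
```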

CONT...

• Part-of-Speech (POS) Tagging: POS tagging involves labeling each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, etc. For example, in the sentence "The cat sits on the mat," the tags might be:
i. The/DT (determiner)
ii. cat/NN (noun)
iii. sits/VB (verb)
iv. on/IN (preposition)
v. the/DT (determiner)
vi. mat/NN (noun)

• Named Entity Recognition (NER): NER identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, dates, and more. For example, in the sentence "Barack Obama was born in Hawaii," the entities are:
i. Barack Obama (Person)
ii. Hawaii (Location)

CONT...

• Lemmatization and Stemming: These are techniques used to reduce words to their base or root form.
i. Stemming: Reduces words to their root form by removing suffixes (e.g., "running" becomes "run").
ii. Lemmatization: Reduces words to their base form considering the context (e.g., "running" becomes "run", "better" becomes "good").

• Stop Words: These are common words that are often filtered out in NLP tasks because they carry less meaning (e.g., "the", "is", "in"). Removing stop words can improve the efficiency of text processing.

• Bag of Words (BoW): This is a representation of text that describes the occurrence of words within a document. It involves creating a vocabulary of all words in the document and then representing each document by a vector of word counts. This model disregards grammar and word order but keeps multiplicity.
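A minimal Bag-of-Words sketch using scikit-learn's CountVectorizer (added illustration; assumes scikit-learn is installed):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sits on the mat.", "The dog sits on the log."]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)        # sparse document-term matrix of word counts
print(vectorizer.get_feature_names_out())   # learned vocabulary
print(bow.toarray())                        # counts per document (grammar and order are ignored)
```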
CONT...

TF-IDF (Term Frequency-Inverse Document Frequency): This is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It combines:
i. Term Frequency (TF): How often a word appears in a document.
ii. Inverse Document Frequency (IDF): How common or rare a word is across all documents in the corpus.

Word Embeddings: These are dense vector representations of words that capture semantic meaning. Examples include Word2Vec, GloVe, and FastText. These embeddings map words to vectors in a high-dimensional space where semantically similar words are closer together.

• N-grams are contiguous sequences of n items (words, characters, etc.) from a given text. Common examples are:
i. Unigrams (1-gram): ["The", "cat", "sits"]
ii. Bigrams (2-gram): ["The cat", "cat sits"]
iii. Trigrams (3-gram): ["The cat sits"]

CONT...

Sentiment Analysis: This is the process of determining the emotional tone behind words, sentences, or texts. It classifies the text into positive, negative, or neutral sentiments.

Syntax and Parsing: Parsing involves analyzing the grammatical structure of a sentence to identify relationships between words. Syntax parsing can be:
i. Dependency Parsing: Identifies dependencies between words (e.g., subject-verb relationships).
ii. Constituency Parsing: Breaks sentences into sub-phrases or constituents (e.g., noun phrases, verb phrases).

Machine Translation: This is the automatic translation of text from one language to another. Techniques range from rule-based approaches to statistical and neural machine translation models.
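A minimal TF-IDF sketch with scikit-learn's TfidfVectorizer (added illustration; assumes scikit-learn is installed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The cat sits on the mat.", "The dog barks at the cat.", "Dogs and cats are pets."]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)      # each row holds the TF-IDF weights for one document
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))          # rare-but-frequent-in-a-document words get higher weight
```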

CONT...

Language Models: predict the probability of a sequence of words. They are fundamental for tasks like text generation and are often built using neural networks. Examples include LSTM-based models.

Text Classification: is the process of assigning predefined categories to text. Examples include spam detection, topic classification, and sentiment analysis.

Speech Recognition: involves converting spoken language into text. It combines NLP with signal processing and often uses models like Hidden Markov Models (HMM) and deep learning techniques.

Ambiguity in language

• It refers to the phenomenon where a word, phrase, or sentence has multiple interpretations.

Ambiguity can occur at various levels of language processing, such as lexical (word-level), syntactic (sentence structure), and semantic (meaning) levels. Understanding and resolving ambiguity is a significant challenge in natural language processing (NLP).

1. Lexical Ambiguity
• Lexical ambiguity arises when a word has multiple meanings.
Example:
• "I went to the bank."
Explanation:
• The word "bank" can refer to a financial institution or the side of a river.
Resolution:
• Context is used to determine the correct meaning. For example, additional context like "to deposit money" clarifies that "bank" refers to a financial institution.
CONT...

2. Syntactic Ambiguity
• Syntactic ambiguity occurs when a sentence can be parsed in multiple ways due to its structure.
Example:
• "I saw the man with the telescope."
Explanation:
• This sentence can be interpreted as either:
• "I used the telescope to see the man."
• "I saw a man who had a telescope."
Resolution:
• Parsing algorithms and contextual understanding are used to determine the most likely structure.

CONT...

3. Semantic Ambiguity
• Semantic ambiguity happens when a sentence can have multiple meanings, even if its syntactic structure is clear.
• Example:
• "He gave her cat food."
• Explanation:
• This sentence can mean:
• "He gave food to her cat."
• "He gave her some cat food."
• Resolution:
• Semantic analysis and context are used to infer the intended meaning.

CONT...

4. Pragmatic Ambiguity
• Pragmatic ambiguity involves the interpretation of language in context, considering the speaker's intentions and the situational context.
• Example:
• "Can you pass the salt?"
• Explanation:
• This sentence can be interpreted as:
• A question about the listener's ability to pass the salt.
• A polite request for the listener to pass the salt.
• Resolution:
• Understanding the social and conversational context helps resolve pragmatic ambiguity.

QUESTIONS

• More than 100 students attended the seminar. 50 of them were from our college.
• "The project will be completed in 10 days".
• "The temperature will rise by 5 to 10 degrees".
Segmentation

• Segmentation in Natural Language Processing (NLP) refers to the process of dividing text into smaller meaningful units. These units can be sentences, words, phrases, or other subunits.
• Effective segmentation is crucial for many downstream NLP tasks such as tokenization, part-of-speech tagging, named entity recognition, and parsing.
• 1. Sentence Segmentation
• Sentence segmentation, also known as sentence boundary detection, involves splitting a text into individual sentences.
• 2. Word Segmentation
• Word segmentation, also known as tokenization, involves splitting a sentence into individual words or tokens.
• 3. Subword Segmentation
• Subword segmentation involves splitting words into smaller units, such as morphemes or subwords, which can be useful for handling out-of-vocabulary words in machine translation or language modeling.

CONT...

• 4. Paragraph Segmentation
• Paragraph segmentation involves splitting a text into paragraphs. This is less common in typical NLP tasks but can be important for document-level analysis.
• 5. Chunking (Shallow Parsing)
• Chunking involves segmenting and labeling multi-token sequences, such as noun phrases (NP), verb phrases (VP), etc.
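A minimal sentence- and word-segmentation sketch with NLTK (added illustration; assumes the punkt tokenizer data is available):

```python
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Segmentation splits text into units. Sentences come first. Words come next."
sentences = sent_tokenize(text)                 # sentence segmentation
words = [word_tokenize(s) for s in sentences]   # word segmentation within each sentence
print(sentences)
print(words)
```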

Stemming

• Stemming is a text normalization technique in Natural Language Processing (NLP) that reduces words to their base or root form.

The root form is usually not a valid word by itself but is a common representation of words that allows for the conflation of different inflected forms of a word.

Stemming helps in reducing the dimensionality of text data and is particularly useful in search engines, text mining, and information retrieval systems.

• Common Stemming Algorithms
i. Porter Stemmer: One of the most widely used stemming algorithms, known for its simplicity and efficiency.
ii. Lancaster Stemmer: A more aggressive stemming algorithm compared to the Porter Stemmer.
iii. Snowball Stemmer: Also known as the Porter2 stemmer, it is an improvement over the original Porter stemmer and is available for multiple languages.

Tokenization

Tokenization is a fundamental step in natural language processing (NLP) that involves splitting text into individual units called tokens. These tokens can be words, phrases, or other meaningful elements. Tokenization facilitates further processing and analysis of text data by breaking it down into manageable pieces.

Types of Tokenization
• Word Tokenization: Splitting text into individual words.
• Sentence Tokenization: Splitting text into individual sentences.
• Subword Tokenization: Splitting words into smaller units, such as morphemes or subwords, useful in dealing with unknown words or for languages with rich morphology.

Libraries for Tokenization
• Several NLP libraries provide robust tokenization tools, including:
• NLTK (Natural Language Toolkit)
• spaCy
• Transformers by Hugging Face
• Gensim
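A minimal comparison of the three stemmers above using NLTK (added illustration):

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

words = ["running", "studies", "happily", "maximum"]
porter, lancaster, snowball = PorterStemmer(), LancasterStemmer(), SnowballStemmer("english")

for w in words:
    # Lancaster tends to cut more aggressively than Porter/Snowball
    print(w, "->", porter.stem(w), "|", lancaster.stem(w), "|", snowball.stem(w))
```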
Word embedding

• Word Embedding refers to a technique for representing words as dense vectors of real numbers in a continuous vector space.
• Unlike traditional methods such as one-hot encoding, which represent words as sparse, high-dimensional vectors, word embeddings capture semantic relationships between words in a more compact and meaningful way.
• Key points are as follows:
i. Dimensionality Reduction: Word embeddings reduce the dimensionality of word representations compared to one-hot encoding, which typically results in a sparse vector of the size of the vocabulary. Embeddings represent each word as a dense vector of fixed size, often in the range of 50 to 300 dimensions.
ii. Semantic Meaning: Word embeddings capture semantic meaning and relationships between words. Words with similar meanings or contexts are represented by similar vectors. For example, "king" and "queen" may have vectors that are closer to each other than "king" and "car."

CONT...

• Contextual Information: Word embeddings are learned from large corpora of text and can reflect syntactic and semantic patterns. Popular embeddings like Word2Vec, GloVe, and FastText are trained using various methods to capture these patterns.

• Pre-trained Embeddings: Pre-trained word embeddings can be used to initialize models, allowing them to leverage learned semantic relationships from large datasets without having to train embeddings from scratch.

• Applications: Word embeddings are used in various NLP tasks such as text classification, sentiment analysis, machine translation, and information retrieval. They are foundational for many modern NLP techniques and models.
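A minimal Word2Vec training sketch with Gensim (added illustration; the toy corpus and hyperparameters are arbitrary assumptions, real embeddings need large corpora):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sits", "on", "the", "mat"],
             ["the", "dog", "sits", "on", "the", "log"],
             ["cats", "and", "dogs", "are", "pets"]]

# sg=1 selects the skip-gram variant; vector_size is the embedding dimension
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
print(model.wv["cat"][:5])                    # first few dimensions of the "cat" vector
print(model.wv.most_similar("cat", topn=3))   # nearest neighbours in the embedding space
```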

Word Senses

• Word senses refer to the different meanings or interpretations that a word can have depending on its context. A single word can have multiple senses, each with its own specific meaning.

• Key Points About Word Senses:
• Polysemy: This is the phenomenon where a single word has multiple related meanings. For example, the word "bank" can refer to a financial institution or the side of a river. The different meanings are considered different senses of the word.
• Homonymy: This is when a word has multiple meanings that are unrelated or only loosely related. For instance, "bat" can refer to a flying mammal or a piece of sports equipment. These are considered different senses of the word and are usually distinguished by context.
• Contextual Disambiguation: To understand the intended sense of a word in a given context, disambiguation techniques are used. This process is crucial for tasks such as machine translation, information retrieval, and text understanding.

CONT...

• Word Sense Disambiguation (WSD): This is a subtask of NLP focused on determining which sense of a word is used in a particular context. WSD can be approached using various methods, including:
• Dictionary-based methods: Leveraging predefined lexical resources like WordNet, which provide detailed sense definitions and relations.
• Supervised learning: Training models on labeled datasets where the senses of words are annotated.
• Unsupervised and semi-supervised learning: Using clustering or co-occurrence patterns to infer word senses without extensive labeled data.
• Lexical Resources: Resources such as WordNet provide structured information about word senses and their relationships, including synonyms, antonyms, hypernyms, and hyponyms. These resources are valuable for sense disambiguation and other NLP tasks.
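A minimal dictionary-based WSD sketch using NLTK's implementation of the Lesk algorithm over WordNet (added illustration; assumes NLTK can download the wordnet and punkt data):

```python
import nltk
nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sentence = "I went to the bank to deposit money"
sense = lesk(word_tokenize(sentence), "bank")        # picks the WordNet sense with most context overlap
print(sense, "-", sense.definition() if sense else "no sense found")
```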
CONT...

• Applications: Understanding word senses is critical for many NLP applications, including:
• Machine Translation: Ensuring the correct translation of words based on their intended meanings.
• Information Retrieval: Improving search results by understanding the context of search queries.
• Text Summarization: Generating accurate summaries that reflect the correct meanings of words.

Dependency Parsing

• It is a key aspect of syntactic analysis in natural language processing (NLP) and computational linguistics.
• It focuses on analyzing the grammatical structure of a sentence by identifying the relationships between words, particularly how each word depends on others.

Key Concepts in Dependency Parsing:
• Dependency Relations: In dependency parsing, the grammatical structure of a sentence is represented by a set of dependency relations. Each relation consists of a head and a dependent. The head is a word that governs or influences another word (the dependent), establishing a syntactic connection between them.

CONT...

• Dependency Tree: The result of dependency parsing is often visualized as a dependency tree or dependency graph. In this tree, each node represents a word, and directed edges represent dependency relations. The root of the tree is typically the main verb or another central element of the sentence.

Head and Dependent:
• Head: The governing word in a dependency relation.
• Dependent: The word that is governed by the head. For example, in the phrase "The cat sleeps," "sleeps" is the head of "cat," which is the dependent.

• Types of Dependencies: Common dependency relations include:
• Subject: The noun or noun phrase that performs the action (e.g., "cat" in "The cat sleeps").
• Object: The noun or noun phrase that receives the action (e.g., "ball" in "She throws the ball").
• Modifier: Words that provide additional information about another word (e.g., adjectives describing nouns).

CONT...

• Dependency Parsing Models: Several algorithms and models are used for dependency parsing, including:
• Transition-based parsing: Constructs the dependency tree by making a sequence of parsing decisions based on transitions between different states.
• Graph-based parsing: Constructs the entire dependency graph and selects the best tree by optimizing a scoring function.
• Neural network-based models: Leverage deep learning techniques to learn complex patterns in dependency structures, improving accuracy and flexibility.

• Applications: Dependency parsing is crucial for various NLP tasks, including:
• Semantic Role Labeling: Understanding the roles played by different words in a sentence.
• Machine Translation: Improving the accuracy of translations by capturing grammatical relationships.
• Information Extraction: Identifying and extracting specific information based on grammatical structure.
• Text Summarization: Generating coherent summaries by understanding sentence structure.

• Tools and Resources: Popular tools for dependency parsing include:
• SpaCy: An NLP library with built-in support for dependency parsing.
• Stanford Parser: A widely used tool from the Stanford NLP group that provides dependency parsing capabilities.
• NLTK: The Natural Language Toolkit, which includes functions for dependency parsing.
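A minimal dependency-parsing sketch with spaCy (added illustration; assumes spaCy and its small English model en_core_web_sm are installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sleeps")

for token in doc:
    # token.dep_ is the dependency label, token.head is the governing word
    print(f"{token.text:>8} --{token.dep_}--> {token.head.text}")
```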
UNIT 2

Word Window Classification

• Word Window Classification is a technique used in Natural Language Processing (NLP) to classify words based on the context provided by surrounding words, known as a "window".
• This approach is particularly useful in tasks like Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and other sequence labeling tasks.

• How Word Window Classification Works?
• Word Window: A word window is a fixed-size context around a target word. For instance, if the window size is 3, the window will include the target word, one word to its left, and one word to its right.
• Feature Extraction: The features for the target word are extracted from this window. These features can include the words themselves, their embeddings, POS tags, or any other relevant linguistic features.

CONT...

• Model Training: A machine learning model (e.g., logistic regression, SVM, or a neural network) is trained on these features to classify the target word.
• Sliding Window: The window slides over the text, classifying each word based on its surrounding context.

Example
• Consider the sentence: "The quick brown fox jumps over the lazy dog."
• If the target word is "fox" and the window size is 3, the window will look like this:
• Previous word: "brown"
• Target word: "fox"
• Next word: "jumps"
• Features for "fox" could include the embeddings of "brown", "fox", and "jumps".

CONT...

Example to illustrate how word window classification works:
• Task: Part-of-Speech Tagging
Example Sentence:
• The cat sat on the mat.
Goal:
• Assign each word in the sentence its correct part of speech (POS) tag.
Word Window:
• We'll use a word window of size 3 (1 word to the left, the target word, and 1 word to the right).
• For the sentence "The cat sat on the mat":
CONT...

• Target Word: "cat"
• Word Window: [The, cat, sat]
• POS Tags: [DT (determiner), NN (Noun), VB (Verb)]
• Classification: NN (Noun)
• Target Word: "sat"
• Word Window: [cat, sat, on]
• POS Tags: [NN, VB, IN]
• Classification: VB (Verb)
• Target Word: "on"
• Word Window: [sat, on, the]
• POS Tags: [VB, IN, DT]
• Classification: IN (Preposition)

CONT...

Applications
• Named Entity Recognition (NER): Classifying words into categories like person, location, organization, etc.
• Part-of-Speech (POS) Tagging: Assigning parts of speech to each word in a sentence.
• Chunking: Dividing a text into syntactically correlated parts like noun or verb phrases.

Benefits
• Context-Aware: Takes into account the surrounding context, leading to better classification performance.
• Simplicity: Relatively simple to implement and understand.
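A minimal sketch of extracting word windows as features (added illustration; the `<PAD>` token and the helper function are hypothetical names used only to handle sentence edges):

```python
def word_windows(tokens, size=3):
    pad = size // 2
    padded = ["<PAD>"] * pad + tokens + ["<PAD>"] * pad
    # one window (list of tokens) per target word, with the target in the middle
    return [padded[i:i + size] for i in range(len(tokens))]

sentence = ["The", "cat", "sat", "on", "the", "mat"]
for target, window in zip(sentence, word_windows(sentence)):
    print(f"target={target!r:7} window={window}")
```

In a real system each window would be mapped to embeddings or POS-tag features and fed to the classifier described above.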

Neural Networks for text

Neural networks are like smart algorithms that learn patterns from data. When it comes to text, here's a simple explanation:

Basic Concept:
• Neurons (Nodes): Think of neurons as tiny decision-makers. Each neuron takes some input (like a word or a number) and decides if it should pass that input along to the next layer of neurons based on how "important" it thinks the input is.

Layers:
• Input Layer: This is where the text data first enters the network. If we're working with text, each word or character can be turned into numbers (using something called embeddings or one-hot encoding) and fed into the input layer.
• Hidden Layers: These are layers between the input and output. They do the heavy lifting by learning complex patterns in the data. Each layer passes its output to the next one.
• Output Layer: This gives the final prediction, like identifying whether a text is positive or negative in sentiment analysis.

CONT...

Learning:
• The network learns by adjusting the importance (weights) of the connections between neurons. It does this over many cycles (called epochs) using examples of text and the correct answers.
• It tries to minimize errors using a method called backpropagation, where it checks how far off its prediction was and adjusts the weights to do better next time.

Activation Functions: These are like filters that decide if a neuron should activate (send a signal forward). They add non-linearity, which helps the network learn complex patterns.
Applying to Text

• Text as Input: Text is turned into numbers (embeddings) that the network can understand.
• Pattern Recognition: The neural network learns patterns like which words usually appear together, sentence structures, or even the sentiment behind phrases.
• Prediction: After learning from lots of examples, it can predict things like the sentiment of a sentence, classify topics, or generate new text.

Embeddings

• Embeddings are a way to represent words or phrases as numerical vectors, making them understandable for machine learning models, especially neural networks.
• Why Use Embeddings?
Computers Understand Numbers: Text data needs to be converted into numbers because computers work with numbers, not words.
Capture Meaning: Simple methods like assigning a unique number to each word don't capture the meaning or relationships between words. Embeddings solve this by encoding semantic relationships between words.

How Embeddings Work?

• Word as Vector: In an embedding, each word is represented as a vector (a list of numbers). For example, the word "cat" might be represented as [0.2, 0.8, -0.1, ...] in a high-dimensional space.

Similar Words, Similar Vectors: Words that are similar in meaning or context have similar vectors. For example:
"King" might be [0.6, 0.1, 0.7, ...]
"Queen" might be [0.5, 0.2, 0.7, ...]
The vectors for "king" and "queen" would be close to each other in this vector space.

CONT....

• Training Embeddings: Embeddings can be learned using large text datasets:
Word2Vec: One popular method where the model learns to predict a word based on its surrounding context. As it does this, it adjusts the word vectors so that similar words have similar vectors.
GloVe: Another method that looks at word co-occurrence across the whole text corpus, learning vectors that capture global statistical information.
BERT: More advanced models like BERT create embeddings that understand context better, so the same word can have different vectors depending on how it's used.
CONT...

Example:
• Suppose you have the sentence: "The cat sits on the mat."
• After embedding, the word "cat" might be turned into a vector like [0.2, 0.8, -0.1], and "mat" might be [0.5, 0.6, 0.1].

Applications:
• Text Classification: Embeddings help classify text by giving the model a numerical understanding of words.
• Search Engines: They can find similar documents by comparing the embeddings of the text in them.
• Machine Translation: Embeddings allow models to understand and translate words between languages by finding similar vectors in different languages.

N-gram Language Models

• N-gram language models are a type of statistical model used in natural language processing (NLP) to predict the probability of a sequence of words in a sentence. They are called "N-gram" models because they consider sequences of "N" words at a time.

• Key Concepts:
• N-gram:
• An N-gram is a contiguous sequence of "N" items (usually words) from a given text.
• For example:
• Unigram (1-gram): A single word (e.g., "The")
• Bigram (2-gram): A sequence of two words (e.g., "The cat")
• Trigram (3-gram): A sequence of three words (e.g., "The cat sits")
• And so on...

CONT...

• Language Model:
• A language model assigns probabilities to sequences of words.
• For an N-gram model, the probability of a word depends on the previous N-1 words.
• For example, in a trigram model, the probability of a word depends on the two preceding words.

• How It Works?
• The model is trained on a large corpus of text, counting how often different N-grams occur.
• It uses these counts to estimate the probability of a word following a given sequence of N-1 words.
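A minimal count-based sketch of this idea for bigrams (added illustration; it estimates P(w_i | w_{i-1}) by maximum likelihood from raw counts):

```python
from collections import Counter

corpus = "the cat sits on the mat the cat sleeps".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("the", "cat"))   # 2/3: "the cat" occurs twice, "the" occurs three times
print(bigram_prob("cat", "sits"))  # 1/2
```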
Applications

• Text Prediction: N-gram models can predict the next word or sequence of words in a sentence (e.g., in autocomplete).
• Speech Recognition: They help in determining the most probable words spoken in a sequence.
• Spelling Correction: They can suggest corrections by considering the most likely word sequences.

Limitations:
• Limited Context: Higher-order N-gram models (e.g., trigram, 4-gram) capture more context but require much more data and computational power.
• Data Sparsity: Rare N-grams might not appear often enough in the training data, leading to poor probability estimates for some word sequences.
• Overfitting: High-order N-gram models might fit the training data too closely and not generalize well to new text.

Perplexity

• Perplexity is a measurement used in natural language processing to evaluate the quality of a language model.
• It essentially tells us how well a probability model predicts a sample of text.
• Lower perplexity indicates a better model because it suggests the model is better at predicting the text.

• What is Perplexity?
• Understanding Perplexity:
• Perplexity is the exponentiation of the average negative log-likelihood of a test set, which can be interpreted as the average branching factor of a language model.
• In simpler terms, it tells us how "surprised" the model is by the text. If a model is well-trained and predicts the text well, it will have low perplexity (low surprise). If the model is poorly trained, it will have high perplexity (high surprise).
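A minimal numeric sketch of this definition (added illustration; the per-word probabilities below are made-up numbers, not from the slides):

```python
import math

word_probs = [0.2, 0.1, 0.25, 0.05, 0.3]  # p(w_i | history) assigned by some language model
avg_neg_log_likelihood = -sum(math.log(p) for p in word_probs) / len(word_probs)
perplexity = math.exp(avg_neg_log_likelihood)  # exponentiated average negative log-likelihood
print(round(perplexity, 2))                    # lower is better
```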

CONT...

Perplexity and Language Models:
• N-gram Models: Perplexity is often used to evaluate N-gram models. A trigram model, for example, will have lower perplexity than a bigram model if it better captures the text's patterns.
• Neural Language Models: Modern neural language models (e.g., RNNs, Transformers) often achieve much lower perplexity than traditional N-gram models, indicating they are better at predicting sequences of words.

• Interpreting Perplexity:
• A lower perplexity score indicates a better model. For example, if one model has a perplexity of 50 and another has 100, the first model is considered to be better at predicting the text.
• However, perplexity is relative; it should be compared within the same dataset and task.
Example

• Suppose a language model predicts the following sequence: "The cat sat on the mat."
• If the model predicts each word with high probability, the perplexity will be low, suggesting the model understands the text well.
• If the model predicts each word with low probability, the perplexity will be high, suggesting the model is less effective.

Hidden Markov Models

• A Hidden Markov Model (HMM) is a statistical model used to represent systems that are governed by a Markov process with hidden states.
• HMMs are widely used in areas such as speech recognition, natural language processing, and bioinformatics.

• States:
• Hidden States: The actual states of the system are not directly observable. Instead, they are inferred based on observable outputs.
• Observable States: These are the outputs or observations that can be directly seen or measured.

• Markov Property:
• The Markov property assumes that the probability of transitioning to the next state depends only on the current state, not on the sequence of previous states. This is known as the first-order Markov property.

CONT...

Structure of an HMM

Transition Probabilities: These define the likelihood of moving from one hidden state to another. They are usually represented in a matrix called the transition matrix.

Emission Probabilities: These define the likelihood of observing a particular output given a specific hidden state. They are represented in an emission matrix.

Initial State Probabilities: These are the probabilities of the system starting in each possible hidden state.
EXAMPLE

• Imagine we have a person who can be either Happy or Sad on any given day (these are the hidden states). We can't directly observe their mood, but we can observe their behavior: whether they are Singing or Not Singing (these are the observable states).

CONT... CONT...

51 52
CONT... CONT...

53 54

Viterbi algorithm

• The Viterbi algorithm is used to find the most likely sequence of hidden states in a Hidden Markov Model (HMM) given a sequence of observed events. Here's a step-by-step breakdown of the algorithm:
i. Initialization: For each hidden state, combine its initial probability with the emission probability of the first observation.
ii. Recursion: For each later observation and each state, keep the highest-scoring path that could have led there, and remember which previous state produced it.
iii. Termination: Pick the state with the highest score at the final observation.
iv. Backtracking: Follow the remembered back-pointers to recover the most likely sequence of hidden states.
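A compact Python sketch of these steps, using the Happy/Sad HMM from the earlier example (added illustration; the probability values are assumed for demonstration only, they are not from the original slides):

```python
states = ["Happy", "Sad"]
start_p = {"Happy": 0.6, "Sad": 0.4}                                                   # assumed initial probabilities
trans_p = {"Happy": {"Happy": 0.7, "Sad": 0.3}, "Sad": {"Happy": 0.4, "Sad": 0.6}}     # assumed transition matrix
emit_p = {"Happy": {"Singing": 0.8, "Not Singing": 0.2},
          "Sad":   {"Singing": 0.1, "Not Singing": 0.9}}                               # assumed emission matrix

def viterbi(observations):
    # V[t][state] = (best probability of any path ending in `state` at time t, previous state on that path)
    V = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        V.append({s: max(((V[-1][prev][0] * trans_p[prev][s] * emit_p[s][obs]), prev) for prev in states)
                  for s in states})
    # backtrack from the best final state
    last = max(V[-1], key=lambda s: V[-1][s][0])
    path = [last]
    for step in reversed(V[1:]):
        path.insert(0, step[path[0]][1])
    return path, V[-1][last][0]

print(viterbi(["Singing", "Singing", "Not Singing"]))
```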
Recurrent Neural Network

• A Recurrent Neural Network (RNN) is a class of artificial neural networks designed for sequence data or time-series data.
• Unlike traditional feedforward neural networks, RNNs have connections that form directed cycles, allowing them to maintain a form of memory by processing information from previous time steps.
• This makes them highly suitable for tasks where the order of inputs matters, such as language modeling, speech recognition, and time-series prediction.

• Key Concepts of RNNs:
• Sequential Data: RNNs handle data where order matters, such as sentences or time-series data.
• Hidden States: RNNs have a hidden state that carries information from one step to the next, allowing the network to have a memory.
• Weight Sharing: The same weights are applied to inputs across time steps, reducing the number of parameters compared to feedforward networks.
• Backpropagation Through Time (BPTT): During training, errors are propagated back through time, allowing RNNs to learn from the sequence's dependencies.
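For reference (added here, not from the slides), a vanilla RNN updates its hidden state at each time step as:

```latex
h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = W_{hy} h_t + b_y
```

where x_t is the input at step t, h_t the hidden state, and W_{xh}, W_{hh}, W_{hy}, b_h, b_y are the shared learned weights and biases.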

Structure of an RNN
RNN Architecture
• Input Layer: Receives the input sequence (e.g., words, time-series data).
• Hidden Layers: Receives input at each time step and maintains a hidden state
that captures past information.
• Output Layer: Generates output based on the current hidden state, which can be
used for tasks like classification or regression.
Types of RNNs:
• Vanilla RNN: The simplest form of RNN, with a single hidden layer connecting
back to itself. It can suffer from vanishing/exploding gradient problems.
• Long Short-Term Memory (LSTM): An advanced RNN architecture that
introduces gates (input, forget, and output) to control the flow of information and
mitigate issues like vanishing gradients.
• Gated Recurrent Unit (GRU): A simplified version of LSTM that combines the
forget and input gates into a single update gate.

RNN Workflow

• Data Preparation: RNNs expect sequential data, so the input data is often organized into sequences.
• Forward Propagation: The input at each time step is processed, and the hidden state is updated to reflect past information.
• Loss Calculation: The loss is calculated based on the model's predictions and the actual targets.
• Backpropagation Through Time (BPTT): The gradients are computed, and weights are updated using backpropagation over multiple time steps.

Vanishing and Exploding Gradients

• Vanishing gradients and exploding gradients are two common problems encountered when training deep neural networks, especially Recurrent Neural Networks (RNNs) or very deep feedforward networks.
• Both issues arise during the backpropagation process, where the gradients are propagated backward through the network to update the weights.
• Vanishing gradients occur when the gradients become extremely small as they are propagated back through layers of a neural network, especially in deep networks.
• This results in extremely slow learning or no learning at all for the earlier layers of the network.
• It is particularly problematic in RNNs because gradients need to be propagated through many time steps, and information from earlier time steps can get "lost."

Why does this happen?

• During backpropagation, gradients are multiplied by the weights of the network at each layer.
• If these weights are small (e.g., values between 0 and 1), the product of many small numbers leads to even smaller numbers.
• As a result, the gradients shrink exponentially as they are propagated backward, leading to vanishing gradients.

Consequences:
• Slow learning: The early layers of the network (or time steps in an RNN) learn very slowly or not at all because their gradients are nearly zero.
• Loss of long-term dependencies: In RNNs, the network fails to learn relationships from far-away time steps because the gradient diminishes as it moves backward.

Exploding Gradients

• Exploding gradients occur when the gradients grow exponentially during backpropagation, leading to extremely large updates to the network's weights.
• This causes the model's parameters to become unstable and the training process to diverge, often leading to overflow and large fluctuations in the loss function.

• Why does this happen?
• If the weights in the network are large (greater than 1), the gradients can exponentially grow larger as they are propagated backward through the layers.
• This leads to increasingly larger updates, which can destabilize the model.
Consequences

• Unstable training: The loss function can fluctuate wildly, making it difficult for the network to converge.
• Divergence: The model might fail to learn and diverge, causing errors and numerical instability (e.g., NaN values in the weights).

UNIT-3

LSTM (Long Short-Term Memory)

• Long Short-Term Memory (LSTM) is a special type of Recurrent Neural Network (RNN) designed to overcome the limitations of traditional RNNs, particularly the vanishing gradient problem.

LSTMs are especially useful for tasks involving sequences where long-term dependencies are important, making them highly effective for Natural Language Processing (NLP) and other sequential data tasks.

Key Features of LSTMs

• Memory Cells: LSTMs have a memory cell that can maintain information over long sequences, allowing them to capture long-term dependencies in data.
• Gating Mechanisms: LSTMs use three gates to regulate the flow of information:
• Forget Gate: Decides which information to discard from the memory cell.
• Input Gate: Controls which new information to store in the memory cell.
• Output Gate: Regulates what part of the memory is used to compute the output at each time step.

LSTM FIGURE
Architecture of an LSTM

• Each LSTM unit consists of:
• Cell State (C_t): The memory of the LSTM unit.
• Hidden State (h_t): The output of the LSTM unit at each time step, which can be passed to the next unit.
• Forget Gate (f_t): Controls what proportion of the previous memory to retain.
• Input Gate (i_t): Determines how much of the new input should be added to the memory.

CONT...

• Candidate Memory (C̃_t): A temporary memory update based on the input.
• Output Gate (o_t): Controls how much of the memory affects the next hidden state.
• Final Memory Update: The cell state is updated using both the forget gate and input gate.
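The gate computations referenced above, in their standard form (added here for reference; σ is the sigmoid function, ⊙ the element-wise product, W and b the learned weights and biases):

```latex
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
```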

Why Use LSTMs?

• Overcome Vanishing Gradients: Traditional RNNs struggle with learning long-term dependencies because gradients diminish as they propagate back through time. LSTMs solve this by maintaining a more constant gradient over time steps.
• Capturing Long-Term Dependencies: They are particularly useful for tasks where the context over long sequences matters, such as in speech recognition, machine translation, and text generation.
• Flexibility: LSTMs can effectively learn when to retain or forget information over time, making them highly adaptable to various types of sequential data.

Applications of LSTMs in NLP

• Language Modeling: Predicting the next word in a sentence by learning from the previous sequence of words. LSTMs help capture long-range dependencies that improve accuracy.
• Machine Translation: LSTMs are used in encoder-decoder architectures, where the encoder LSTM processes a sentence in one language and the decoder LSTM generates its translation in another language.
• Text Generation: LSTMs are used to generate new text (e.g., poetry, music lyrics) by learning patterns in a large corpus.
• Sentiment Analysis: LSTMs can model the sentiment of a sentence or document by remembering important information from earlier parts of the sequence.
• Speech Recognition: In speech-to-text systems, LSTMs are used to map sequences of audio features to sequences of text.
• Named Entity Recognition (NER): LSTMs are used to identify and classify entities like people, organizations, and locations within a sequence of text.
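As a concrete illustration of an LSTM-based text classifier, here is a minimal Keras sketch (added; the vocabulary size and layer widths are arbitrary assumptions, not values from the slides):

```python
from tensorflow.keras import layers, models

vocab_size, embed_dim = 10000, 128

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),  # word indices -> dense vectors
    layers.LSTM(64),                          # sequence -> single hidden representation
    layers.Dense(1, activation="sigmoid"),    # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```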
Gated Recurrent Unit (GRU)

The Gated Recurrent Unit (GRU) is a type of recurrent neural network (RNN) architecture designed to solve issues related to learning long-term dependencies, such as vanishing gradients, that traditional RNNs face.

It is similar to the Long Short-Term Memory (LSTM) network but has a simpler architecture.

• The GRU has fewer gates and parameters compared to the LSTM, which makes it computationally less expensive and easier to train while still effectively capturing temporal dependencies in sequential data.

Key Concepts of Gated Recurrent Unit (GRU)

Recurrent Neural Networks (RNNs) Background:
• RNNs are used to process sequential data (e.g., time series, text, speech). They maintain a hidden state that is updated with each time step, allowing them to capture the dependencies between elements in the sequence.
• However, traditional RNNs struggle to capture long-range dependencies due to the vanishing gradient problem during training.

CONT...

• GRU Architecture: A GRU addresses the limitations of traditional RNNs by introducing two gates: the reset gate and the update gate. These gates help control the flow of information and manage what information should be carried forward or forgotten.
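For reference, the standard GRU update equations (added here; z_t is the update gate, r_t the reset gate, σ the sigmoid function and ⊙ the element-wise product):

```latex
\begin{aligned}
z_t &= \sigma(W_z [h_{t-1}, x_t] + b_z) \\
r_t &= \sigma(W_r [h_{t-1}, x_t] + b_r) \\
\tilde{h}_t &= \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```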

Advantages of GRU

Comparison of GRU vs. LSTM

• LSTM: Uses three gates (input, forget, output), which provide more control over memory, but at the cost of more computational complexity.

• GRU: Uses two gates (update, reset), which makes it more efficient while still performing well on many tasks.

Use Cases of GRU

• Time Series Forecasting: GRUs are used to capture dependencies in time-series data for tasks like weather prediction or stock price forecasting.
• Natural Language Processing (NLP): GRUs are used in tasks like machine translation, speech recognition, and text generation, where sequential data is prevalent.
• Anomaly Detection: In detecting irregularities in sequences, such as fraud detection in transactional data or identifying unusual patterns in sensor data.

Part of Speech Tagging

• Part-of-Speech (POS) Tagging is a fundamental task in Natural Language Processing (NLP) where each word in a sentence is assigned a part of speech, such as a noun, verb, adjective, etc.

This helps in understanding the grammatical structure of a sentence and provides information about the syntactic role of words.

Common Parts of Speech

• Noun (NN): Names of people, places, things, or ideas (e.g., "dog", "happiness").
• Verb (VB): Action words (e.g., "run", "jump").
• Adjective (JJ): Describes a noun (e.g., "happy", "blue").
• Adverb (RB): Describes a verb, adjective, or other adverb (e.g., "quickly", "very").
• Pronoun (PRP): Substitutes for nouns (e.g., "he", "she").
• Preposition (IN): Shows relationships between nouns (e.g., "in", "on").
• Conjunction (CC): Connects clauses, sentences, or words (e.g., "and", "but").
• Determiner (DT): Introduces nouns (e.g., "the", "a").
How POS Tagging Works

Tokenization: The sentence is first split into individual words (tokens).
Assign Tags: Each token is then tagged with its part of speech based on its role in the sentence.
For example:
Sentence: "The cat chased the mouse."
POS Tags:
• "The" → DT (Determiner)
• "cat" → NN (Noun)
• "chased" → VBD (Verb, Past Tense)
• "the" → DT (Determiner)
• "mouse" → NN (Noun)

POS Tagging Algorithms

Rule-based POS Tagging:
• Uses a set of hand-crafted linguistic rules to tag words.
• Example: "If a word ends in 'ing', it's likely a verb (e.g., running)."
Statistical POS Tagging:
• Uses machine learning models to assign POS tags based on probabilities derived from a large annotated corpus.
• Example: Hidden Markov Model (HMM), which predicts the most likely tag sequence given the word sequence.
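A minimal POS-tagging sketch with NLTK's statistical tagger (added illustration; assumes NLTK can download the punkt and averaged_perceptron_tagger data):

```python
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
from nltk import pos_tag, word_tokenize

tokens = word_tokenize("The cat chased the mouse.")
print(pos_tag(tokens))
# expected something like: [('The', 'DT'), ('cat', 'NN'), ('chased', 'VBD'), ('the', 'DT'), ('mouse', 'NN'), ('.', '.')]
```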

CONT...

Neural Network-based POS Tagging:
• Uses deep learning methods like Recurrent Neural Networks (RNNs) or Transformer-based models (e.g., BERT) to learn tagging from large amounts of labeled data.
Hybrid Methods:
• Combines rule-based and statistical or machine learning approaches to improve accuracy.

Explanation of Tags

• DT: Determiner
• JJ: Adjective
• NN: Noun
• VBZ: Verb, 3rd person singular present
• IN: Preposition
Applications of POS Tagging

Syntactic Parsing: Helps in building parse trees for sentences.
Named Entity Recognition (NER): POS tagging is often a preprocessing step in recognizing named entities like persons, organizations, etc.
Machine Translation: Helps in understanding sentence structure for better translation.
Information Extraction: Assists in extracting useful information (e.g., events, dates) from text.
Sentiment Analysis: POS tags can help identify sentiments by focusing on adjectives and verbs.

Challenges in POS Tagging

• Ambiguity: Some words can belong to different parts of speech depending on the context (e.g., "book" can be a noun or a verb).
• Unknown Words: Handling out-of-vocabulary (OOV) words not seen in the training data can be difficult.

BERT (Bidirectional Encoder Representations from Transformers)

It is a groundbreaking model in Natural Language Processing (NLP) developed by Google.

It has significantly advanced the field by introducing new methods for understanding the context of words in a sentence.

• The self-attention mechanism is one of the core components of the Transformer model used in Natural Language Processing (NLP).
It enables the model to weigh the importance of different words in a sentence when generating an output for a particular word.
This mechanism allows the Transformer to efficiently capture dependencies between words, regardless of their distance in the input sequence, which is a major advantage over traditional sequence models like RNNs.

CONT…

Overview of BERT
• Architecture:
• Transformers: BERT is based on the Transformer architecture, which uses self-attention mechanisms to process input data in parallel rather than sequentially, allowing it to understand context better.
• Bidirectional Context: Unlike traditional models that process text in a left-to-right or right-to-left manner, BERT reads the entire sequence of words simultaneously. This bidirectional approach allows BERT to grasp the context of a word based on all its surrounding words.
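For reference, the scaled dot-product self-attention at the heart of the Transformer is usually written as (added here; Q, K, V are the query, key and value matrices and d_k is the key dimension):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```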
CONT…

Training:
• Masked Language Model (MLM): During training, some words in the input text are randomly masked, and the model learns to predict these masked words based on their context. This helps BERT understand the meaning of words based on surrounding words.
• Next Sentence Prediction (NSP): BERT is also trained on pairs of sentences to understand the relationship between them. For example, given two sentences, it learns to predict whether the second sentence follows the first in the original text.

CONT…

Pre-training and Fine-tuning:
• BERT is first pre-trained on a large corpus of text (e.g., Wikipedia and BooksCorpus) to learn general language representations. This pre-training allows it to capture a broad understanding of language.
• After pre-training, BERT can be fine-tuned on specific tasks (e.g., sentiment analysis, question answering) using a smaller dataset. Fine-tuning typically involves adding a simple output layer to the pre-trained model and training it on the task-specific data.
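Since BERT is trained to predict masked tokens, a quick way to see this behaviour is the fill-mask pipeline from Hugging Face Transformers (added illustration; assumes the transformers library is installed and that the bert-base-uncased weights can be downloaded):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The cat sat on the [MASK].")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```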

Applications of BERT

• BERT has been applied successfully across various NLP tasks:
• Text Classification: Assigning categories to text documents (e.g., sentiment analysis).
• Named Entity Recognition (NER): Identifying entities such as names, dates, and locations in text.
• Question Answering: Finding answers to questions based on a given passage of text.
• Text Summarization: Generating concise summaries of longer texts.
• Machine Translation: Translating text from one language to another.

Impact of BERT

• Performance: BERT has set new state-of-the-art performance records on multiple NLP benchmarks, significantly improving results for many tasks.
• Transfer Learning: It popularized the use of pre-trained language models, leading to the development of numerous other models based on BERT (e.g., RoBERTa, DistilBERT, ALBERT).
• Community Adoption: BERT's release has sparked a wave of research and development in NLP, influencing academic work and industry applications.
LIMITATION OF BERT MODEL

Static Masking During Training:
• BERT uses a Masked Language Model (MLM) objective, where it randomly masks 15% of tokens during training and then predicts them. However, once a token is masked, BERT does not consider it in its original, unmasked context. This approach introduces a discrepancy between pre-training (with masked tokens) and downstream use (where no tokens are masked).

Limited Handling of Long Sequences:
• BERT has a fixed maximum sequence length, typically set to 512 tokens. This can be restrictive when processing longer documents, forcing text to be split or truncated, which can lead to a loss of contextual information.

CONT…

High Computational Cost:
• BERT's deep, bidirectional nature makes it computationally expensive, especially for large models like BERT-Large, which has 24 layers and 340 million parameters. This can be an obstacle for real-time applications or environments with limited computational resources.

Bidirectionality as a Limitation for Autoregressive Tasks:
• BERT's bidirectional (non-autoregressive) training structure makes it suboptimal for tasks that require autoregressive behavior, like text generation. It's designed more for understanding and classification rather than generation.

XLNet

XLNet is a cutting-edge language model created by Google Brain and Carnegie Mellon University, designed to improve on the limitations of previous models like BERT.

Key Features of XLNet

Autoregressive Approach with Permutation:
• XLNet employs an autoregressive model with permutation, meaning it predicts tokens based on every possible ordering of the input, not just left-to-right or right-to-left. This enables it to learn bidirectional context while preserving the autoregressive nature.

CONT…

Training Objective:
• Unlike BERT, which uses masked language modeling (where certain tokens are masked and then predicted), XLNet does not mask tokens. Instead, it trains by permuting input sequences and predicting tokens, which allows it to leverage more context information without introducing artificial masking.

Transformer-XL Architecture:
• XLNet builds on the Transformer-XL architecture, which enables it to handle longer sequences by remembering past tokens through a recurrence mechanism. This helps the model capture dependencies across longer contexts, beneficial in tasks requiring understanding of large text inputs.

CONT…

Superior Performance on NLP Tasks:
• XLNet has outperformed previous models like BERT on various NLP benchmarks, including text classification, question answering, and sentiment analysis.

Applications of XLNet
Question Answering: XLNet performs exceptionally well on QA tasks where understanding bidirectional context is key.
Text Classification: It is useful in tasks where longer dependencies and contextual understanding improve classification accuracy.
Sentiment Analysis: XLNet can capture nuanced emotions in text, making it effective for sentiment analysis in both short and long text passages.
Language Translation: Its enhanced context understanding and ability to handle long sequences make it promising for translation tasks.

UNIT-4

Statistical Machine Translation (SMT)

• It is a method in Natural Language Processing (NLP) for translating text from one language to another based on statistical models.
SMT models build translations by learning patterns and probabilities from large amounts of bilingual text data (called parallel corpora) without needing extensive human-written grammar rules or vocabularies.

Key Concepts in Statistical Machine Translation

Parallel Corpora:
SMT requires parallel corpora, which consist of texts in two languages that correspond sentence by sentence. Examples include news articles, government documents, and literature in multiple languages.
CONT…

Translation Model:
• The translation model captures the probability that a given phrase in the source language translates to a phrase in the target language. For instance, if the model is trained to translate from English to French, it will estimate the probability of each possible French phrase for a given English phrase based on the data it has seen.

Language Model:
• The language model ensures that translations make sense in the target language. It learns the natural flow of the target language by analyzing large amounts of monolingual text, assigning higher probabilities to sequences that are grammatically correct and commonly used.

CONT…

Decoding:
• During translation, SMT systems use a decoder to generate the most probable sequence of words in the target language based on the probabilities from both the translation model and the language model. The decoder searches for the best combination that maximizes both translation accuracy and fluency.

Alignment and Phrase Tables:
• SMT involves alignment, which maps words and phrases in the source language to their counterparts in the target language. These alignments form phrase tables that the model uses to translate segments rather than just single words, making translations more fluent and accurate.
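The way the translation model and language model are combined is often summarised (added here for reference, not from the original slides) by the noisy-channel formulation, where the decoder searches for the target sentence e that maximises:

```latex
\hat{e} = \arg\max_{e} \; P(e \mid f) = \arg\max_{e} \; P(f \mid e)\, P(e)
```

with f the source sentence, P(f | e) the translation model, and P(e) the language model.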

Types of Statistical Machine Translation

Word-Based SMT:
• Translates words individually, often resulting in poor fluency as word-for-word translation misses context.
Phrase-Based SMT:
• The most common form, where translations are generated for sequences or phrases of words instead of individual words, improving fluency and context awareness.
Syntax-Based SMT:
• Uses syntactic structures to improve the translation by understanding grammatical relationships, enabling translations that better respect grammar rules.
Hierarchical Phrase-Based SMT:
• Combines syntax-based and phrase-based approaches by using hierarchical structures to capture phrase dependencies and provide more accurate translations.

Example of Statistical Machine Translation

Imagine translating the English sentence "The cat is on the mat" to French. An SMT model would:
Break down the sentence into phrases like "The cat," "is on," and "the mat."
Use a parallel corpus to find common translations for these phrases, e.g., "The cat" might translate to "Le chat."
Apply the language model to ensure the resulting French sentence is natural and fluent, likely producing "Le chat est sur le tapis."
Advantages and Limitations of SMT

Advantages: Requires less manual rule-building, relying instead on learning from data.
• Works well with sufficient high-quality parallel data.

Limitations: Translation quality depends heavily on the size and quality of the parallel corpus.
• Cannot handle idiomatic expressions or complex syntax as effectively as more advanced models, such as Neural Machine Translation (NMT).

Neural Machine Translation

• Neural Machine Translation (NMT) is a technique for automatically translating text from one language to another using artificial neural networks.
• Unlike traditional statistical methods, which rely on separately modeled components (like language models and translation rules), NMT is an end-to-end learning approach where the entire translation process is handled by a single neural network.
• NMT systems are based on deep learning, allowing them to understand and generate more natural and accurate translations by learning directly from vast amounts of bilingual text data.

Basic Concept

• The core idea behind NMT is to map a sequence of words from the source language (e.g., English) to a sequence of words in the target language (e.g., French).
• The model is designed to learn patterns in language data, understanding the structure, grammar, and context of sentences.
• NMT achieves this through a specialized type of neural network architecture known as the Sequence-to-Sequence (Seq2Seq) model. It consists of two parts:
• Encoder: Reads the input sentence in the source language and converts it into a fixed-size context vector, which represents the sentence's meaning.
• Decoder: Takes the context vector from the encoder and generates the translated sentence in the target language.

Working Mechanism of NMT

• Encoding the Input Sentence: The encoder, typically implemented using recurrent layers like Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) networks, processes the input sentence word by word.
• Each word is converted into an embedding (a numerical representation) and is fed into the encoder sequentially.
• The encoder processes these embeddings and outputs a final context vector that captures the overall meaning of the sentence.
• Generating the Context Vector: Once the encoder has processed all words, it outputs a context vector (also known as the hidden state) that contains the encoded information of the entire input sentence. This vector is then passed to the decoder.
CONT…

• Decoding and Generating the Translation: The decoder uses the context vector from the encoder and generates the translated sentence, one word at a time.
• Like the encoder, the decoder is often implemented using LSTM or GRU layers.
• It starts by generating the first word based on the context and then uses this first word to help predict the second word, and so on, until the full translated sentence is produced.

CONT…

• Attention Mechanism (Improvement): While early NMT models relied solely on the context vector, the attention mechanism greatly improved translation quality.
• Attention allows the model to focus on specific parts of the input sentence when generating each word of the output. For example, when translating a sentence, the model can "attend" more closely to certain words in the input that are relevant to the next output word, leading to more accurate and fluent translations.
• Training the Model: During training, the model learns by comparing its translations to the correct translations in the training data and adjusting its internal parameters to minimize the error. It uses a large dataset of paired sentences in the source and target languages and is trained over many iterations to improve translation accuracy.

UNIT-5
1D-CNN for NLP
1D-CNNs (1-Dimensional Convolutional Neural Networks) for Natural Language Processing (NLP) have become a popular approach due to their simplicity, computational efficiency, and effectiveness in capturing local features in sequential data, such as text.
While CNNs are often associated with image processing, they can be adapted for text data using 1D convolutions over sequences, such as words or characters.
Basic Concepts of 1D-CNNs in NLP
• Convolutional Layer: 1D-CNNs use a convolution filter (kernel) that slides over the input text sequence.
• For NLP, this sequence typically consists of word embeddings (or character embeddings).
• The convolution operation helps capture local patterns, like n-grams, that are common in text sequences (illustrated in the sketch below).

CONT...
• Pooling Layer: After the convolution operation, a pooling layer (e.g., max-pooling) can be used to reduce the dimensionality and retain important features. This helps the model generalize well and avoid overfitting.
• Non-linearity (Activation Function): A non-linear activation function like ReLU (Rectified Linear Unit) is often applied to the output of the convolution operation to introduce non-linearity.
• Fully Connected Layer: The outputs from the convolution and pooling layers are often fed into one or more fully connected layers (dense layers), which are used to perform classification or regression tasks.
115 116
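As a toy illustration of the convolution and pooling operations described above, the NumPy sketch below slides a single filter of width 3 over a sequence of word embeddings, producing one feature per window (a tri-gram detector), and then max-pools the resulting feature map. All sizes and values are made up.

```python
# Illustrative 1D convolution over word embeddings (NumPy): one filter of width 3
# acts like a tri-gram detector, producing a single feature value per window.
import numpy as np

seq_len, emb_dim, kernel_size = 7, 4, 3          # assumed toy sizes
X = np.random.randn(seq_len, emb_dim)            # a sentence as 7 word embeddings
W = np.random.randn(kernel_size, emb_dim)        # one convolution filter (kernel)
b = 0.1

features = []
for i in range(seq_len - kernel_size + 1):       # slide the filter over the sequence
    window = X[i:i + kernel_size]                # 3 consecutive word vectors
    features.append(np.maximum(0.0, np.sum(window * W) + b))  # ReLU(conv output)

feature_map = np.array(features)                 # shape: (seq_len - kernel_size + 1,)
pooled = feature_map.max()                       # max-pooling keeps the strongest match
print(feature_map.shape, pooled)
```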
CONT...
Why Use 1D-CNNs for NLP?
• Capturing Local Dependencies: 1D-CNNs are excellent at capturing local dependencies (e.g., word combinations, n-grams) within a text window.
• This can be particularly useful in tasks such as sentiment analysis, where phrases may have strong local contextual cues.
• Parallel Computation: Unlike RNNs (Recurrent Neural Networks), 1D-CNNs do not require sequential processing, making them faster and more efficient to train.
• Simple and Effective: Compared to other deep learning models for text, such as LSTMs or Transformers, 1D-CNNs are often simpler and can perform well with fewer computational resources.

117 118

CONT...
Architecture of a Simple 1D-CNN for NLP (see the sketch below)
• Input Layer: Tokenized input text is represented as sequences of word embeddings (e.g., GloVe, Word2Vec, or trainable embeddings).
• Convolution Layer(s): A set of 1D convolution filters slides over the input text, capturing local patterns and producing feature maps.
• Pooling Layer: Max-pooling or average-pooling is applied to reduce the dimensionality of the feature maps and focus on the most relevant features.
• Dropout Layer (optional): This layer is used to prevent overfitting by randomly setting a fraction of the input units to zero during training.
• Fully Connected (Dense) Layer: The extracted features are flattened and fed into one or more fully connected layers, which lead to the output.
• Output Layer: Depending on the task (e.g., binary classification, multi-class classification), this layer typically uses a softmax or sigmoid activation function.

APPLICATIONS
• Sentiment Analysis: Classifying text into positive, negative, or neutral sentiments.
• Spam Detection: Distinguishing between spam and non-spam messages.
• Text Categorization: Classifying text into predefined categories (e.g., news categorization).
• Named Entity Recognition (NER): Identifying named entities in a piece of text.

119 120
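A minimal sketch of the architecture listed above, assuming Keras/TensorFlow and a binary sentiment-classification task; the vocabulary size, sequence length, and filter settings are illustrative choices, not prescribed values.

```python
# Sketch of the simple 1D-CNN text classifier described above (Keras, assumed settings).
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size, max_len, emb_dim = 10000, 100, 100   # assumed hyperparameters

model = models.Sequential([
    layers.Input(shape=(max_len,)),                        # token-id sequences, padded to max_len
    layers.Embedding(vocab_size, emb_dim),                 # input layer: trainable word embeddings
    layers.Conv1D(128, kernel_size=3, activation="relu"),  # convolution layer: tri-gram features + ReLU
    layers.GlobalMaxPooling1D(),                           # pooling layer: strongest feature per filter
    layers.Dropout(0.5),                                   # dropout layer: reduce overfitting
    layers.Dense(64, activation="relu"),                   # fully connected layer
    layers.Dense(1, activation="sigmoid"),                 # output layer: binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=5)  # X_train/y_train: assumed preprocessed data
```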
Sub-word Models
• Sub-word models are a powerful approach used in Natural Language Processing (NLP) for representing and processing text at a level below whole words, such as at the character or sub-word unit level.
• This concept addresses challenges related to handling unknown words (out-of-vocabulary words), language morphology, and variations in text.
• Instead of representing words as atomic units, sub-word models break words into smaller, meaningful units or segments.
• These segments can be characters, character n-grams, or units derived through algorithms like Byte Pair Encoding (BPE) and WordPiece (see the BPE sketch below).

Why Use Sub-word Models?
Sub-word models help overcome several common issues in NLP:
• Handling Unknown Words: In traditional word-based models, any word not seen during training is treated as unknown.
• Sub-word models, however, break unseen words into smaller known units, which helps improve generalization and vocabulary coverage.
• Morphological Richness: Many languages are morphologically complex, meaning that words can have numerous forms due to inflections, derivations, or compounds.
• Sub-word models can handle different word forms by representing words as combinations of sub-word units.
• Vocabulary Size Reduction: By representing words as sub-word units, the vocabulary size can be significantly smaller, leading to more efficient memory usage and faster computations.

121 122
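The toy sketch below illustrates the core idea of BPE: repeatedly merge the most frequent adjacent pair of symbols in a small word-frequency table. The corpus and the number of merges are made up, and a production tokenizer handles many details this sketch ignores.

```python
# Toy Byte Pair Encoding (BPE) sketch: repeatedly merge the most frequent
# adjacent pair of symbols in a tiny word-frequency corpus (assumed data).
from collections import Counter

corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}  # word -> count

def most_frequent_pair(vocab):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, vocab):
    # Replace the spaced pair "a b" with the merged symbol "ab" in every word.
    spaced = " ".join(pair)
    return {word.replace(spaced, "".join(pair)): freq for word, freq in vocab.items()}

for step in range(5):                     # 5 merges, just for illustration
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(pair, corpus)
    print(f"merge {step + 1}: {pair} ->", list(corpus))
```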

Contextual Representations
Contextual representations in Natural Language Processing (NLP) are vector-based representations of words that capture the context in which words appear in a sentence.
• Unlike traditional approaches that assign each word a fixed embedding (such as Word2Vec or GloVe), contextual representations produce word embeddings that vary depending on the surrounding words.
• This approach allows models to understand and differentiate between different meanings and usages of words based on their context.
• Contextual representations are usually derived from models trained on large corpora of text using advanced architectures such as recurrent neural networks (RNNs), transformers, or a combination of these.

What are Contextual Representations?
To break it down:
• Static Word Embeddings (e.g., Word2Vec, GloVe): In these approaches, each word is assigned a single vector regardless of its meaning in different contexts.
• For example, the word "bank" would have the same embedding whether it's used in the context of a "river bank" or a "financial bank."
• Contextual Word Embeddings (e.g., ELMo, BERT, GPT): Here, the embedding of a word is computed based on the words surrounding it in a given sentence.
• Thus, the word "bank" would have different embeddings in "river bank" and "financial bank," capturing its meaning based on the context (see the sketch below).

123 124
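The "bank" example can be checked directly with a pretrained contextual model. The sketch below assumes the Hugging Face transformers and PyTorch libraries are installed and downloads the bert-base-uncased weights; it compares the contextual vectors of "bank" in two sentences, which typically come out clearly different (cosine similarity well below 1).

```python
# Sketch: contextual embeddings of "bank" differ across sentences
# (transformers and torch assumed installed; bert-base-uncased is downloaded).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the hidden state of the first "bank" token in the sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_river = bank_vector("He sat on the river bank and watched the water.")
v_money = bank_vector("She deposited the cheque at the bank this morning.")
similarity = torch.cosine_similarity(v_river, v_money, dim=0)
print(f"cosine similarity between the two 'bank' vectors: {similarity.item():.3f}")
```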
Self-Attention for Generative Models
• Self-Attention is a key mechanism used in generative models for Natural Language Processing (NLP) tasks, particularly within transformer-based models like GPT (Generative Pre-trained Transformer).
• This mechanism allows models to weigh the importance of different words in a sequence relative to each other when generating or predicting the next word.
• Self-attention is the backbone of transformer models, enabling them to capture complex dependencies and relationships within text.

What is Self-Attention?
• At its core, self-attention allows a model to focus on different parts of an input sequence to make decisions.
• Unlike earlier models like RNNs or LSTMs, which process input sequentially, self-attention considers the entire input sequence at once.
• This means the model can learn relationships between all words in a sentence simultaneously, regardless of their distance from each other.
• Example: Consider the sentence: "The cat sat on the mat because it was soft."
• When processing the word "it," the model needs to understand that "it" refers to "the mat" and not "the cat." The self-attention mechanism helps the model identify this relationship by assigning a higher weight (importance) to the connection between "it" and "the mat." (A numerical sketch follows below.)

125 126
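The weighting described above is usually computed as scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The NumPy sketch below shows this for a toy sequence with random projection matrices; in a real transformer the projections are learned and multiple attention heads are used.

```python
# Minimal scaled dot-product self-attention (NumPy), for a toy sequence.
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, d_model, d_k = 6, 16, 8                 # assumed: 6 tokens, toy dimensions
X = np.random.randn(T, d_model)            # token embeddings for the sequence

Wq = np.random.randn(d_model, d_k)         # projection matrices (learned in practice,
Wk = np.random.randn(d_model, d_k)         # random here for illustration)
Wv = np.random.randn(d_model, d_k)

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d_k)            # how strongly each token attends to every other
weights = softmax(scores, axis=-1)         # each row sums to 1
output = weights @ V                       # context-mixed representation per token

print("attention weights shape:", weights.shape)    # (6, 6)
print("output shape:", output.shape)                 # (6, 8)
```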

Why is Self-Attention Important for Generative Models?
• Capturing Long-Range Dependencies: Self-attention allows generative models to consider all words in the input sequence, enabling the model to capture dependencies between words that are far apart. This is especially useful for generating coherent and contextually accurate text.
• Parallelization: Unlike RNNs, which process words sequentially, self-attention processes the entire sequence simultaneously. This leads to faster training and inference times, which is crucial for large-scale models like GPT.
• Context-Awareness: By assigning varying attention weights to different words, the model can focus on the most relevant words needed to generate or predict the next token. This context-awareness leads to better understanding and generation of language.

Applications of Self-Attention in Generative Models
• Text Generation: Self-attention is fundamental to models like GPT-2 and GPT-3, which generate human-like text based on input prompts (see the sketch below).
• Machine Translation: Self-attention helps models like Transformer (which uses both self-attention and encoder-decoder attention) to translate languages by focusing on relevant parts of the source and target text.
• Summarization: By attending to the most important parts of a document, self-attention models can generate concise and accurate summaries.

127 128
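As a quick illustration of text generation with a self-attention model, the sketch below uses the Hugging Face transformers pipeline with the small GPT-2 checkpoint (assumed installed; the prompt and generation length are arbitrary choices).

```python
# Sketch: generating text with GPT-2 via the Hugging Face pipeline API
# (transformers assumed installed; the small "gpt2" checkpoint is downloaded).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator(
    "Natural language processing enables computers to",  # arbitrary prompt
    max_new_tokens=30,
    num_return_sequences=1,
)
print(result[0]["generated_text"])
```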
Natural Language Generation
Natural Language Generation (NLG) is a subfield of Natural Language Processing (NLP) focused on the creation of human-like text by machines.
• It plays a crucial role in enabling computers to produce coherent and contextually relevant responses, transforming structured data or textual input into comprehensible and meaningful natural language output.
Basics of NLG in the context of NLP:
• NLG converts structured data, such as numbers, coded data, or other machine-readable content, into natural language text that is easily understandable by humans.
• It is often used to generate reports, summarize data, create chatbots, automate content creation, and provide conversational agents in various industries.

CONT...
Key Components of NLG Systems (a toy pipeline sketch follows below)
• Content Determination: Decides what information to include in the generated text based on input data.
• Document Structuring: Organizes the content into a coherent structure, such as sentences, paragraphs, or specific formats, like reports or narratives.
• Sentence Aggregation: Merges related pieces of information into cohesive and well-formed sentences.
• Lexicalization: Selects the words and phrases that best convey the intended meaning.
• Surface Realization: Converts the structured representation of text into grammatically correct and fluent sentences.
• Refinement: Involves adjustments for clarity, stylistic preferences, or contextual appropriateness, ensuring the final output is polished and effective.
129 130
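A toy, template-based sketch of the pipeline above (not a full NLG system): from an assumed structured weather record, it picks what to say (content determination), chooses wording (lexicalization), and assembles a sentence (surface realization).

```python
# Toy template-based NLG sketch: turn structured weather data (assumed) into a
# sentence, loosely mirroring content determination, lexicalization, and
# surface realization.
record = {"city": "Hyderabad", "temp_c": 34, "rain_prob": 0.7}   # assumed input data

def generate_weather_report(data):
    # Content determination: mention rain only when it is likely.
    mention_rain = data["rain_prob"] >= 0.5
    # Lexicalization: choose words that fit the numbers.
    heat_word = "hot" if data["temp_c"] >= 30 else "mild"
    # Surface realization: assemble a grammatical sentence from the chosen content.
    sentence = f"It will be a {heat_word} day in {data['city']} at {data['temp_c']} degrees Celsius"
    if mention_rain:
        sentence += ", with a high chance of rain"
    return sentence + "."

print(generate_weather_report(record))
```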
