UNIT-3
LSTM (Long Short-Term Memory)
Long Short-Term Memory (LSTM) is a special type of Recurrent Neural Network (RNN) designed to overcome the limitations of traditional RNNs, particularly the vanishing gradient problem. LSTMs are especially useful for tasks involving sequences where long-term dependencies are important, making them highly effective for Natural Language Processing (NLP) and other sequential data tasks.
Key Features of LSTMs
• Memory Cells: LSTMs have a memory cell that can maintain information over long sequences, allowing them to capture long-term dependencies in data.
• Gating Mechanisms: LSTMs use three gates to regulate the flow of information.
• Forget Gate: Decides which information to discard from the memory cell.
• Input Gate: Controls which new information to store in the memory cell.
• Output Gate: Regulates what part of the memory is used to compute the output at each time step.
Architecture of an LSTM
Each LSTM unit consists of:
• Cell State (C_t): The memory of the LSTM unit.
• Hidden State (h_t): The output of the LSTM unit at each time step, which can be passed to the next unit.
• Forget Gate (f_t): Controls what proportion of the previous memory to retain. It is computed as:
  f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
• Input Gate (i_t): Determines how much of the new input should be added to the memory:
  i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
• Candidate Memory (C̃_t): A temporary memory update based on the input:
  C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
• Output Gate (o_t): Controls how much of the memory affects the next hidden state:
  o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
• Final Memory Update: The cell state is updated using both the forget gate and input gate, and the output gate then produces the new hidden state:
  C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
  h_t = o_t ⊙ tanh(C_t)
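The gate equations above can be traced step by step in code. Below is a minimal NumPy sketch of a single LSTM cell update, written to mirror the formulas for f_t, i_t, C̃_t, o_t, C_t, and h_t; the layer sizes, random weights, and function names are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the equations above.

    W and b hold the four gate parameter sets, keyed by
    'f' (forget), 'i' (input), 'c' (candidate), 'o' (output).
    Each W[k] has shape (hidden, hidden + input)."""
    # Concatenate previous hidden state and current input: [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t])

    f_t = sigmoid(W['f'] @ z + b['f'])        # forget gate
    i_t = sigmoid(W['i'] @ z + b['i'])        # input gate
    c_tilde = np.tanh(W['c'] @ z + b['c'])    # candidate memory
    o_t = sigmoid(W['o'] @ z + b['o'])        # output gate

    c_t = f_t * c_prev + i_t * c_tilde        # final memory update
    h_t = o_t * np.tanh(c_t)                  # new hidden state
    return h_t, c_t

# Toy usage with illustrative sizes (input dim 4, hidden dim 3)
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for k in 'fico'}
b = {k: np.zeros(n_hid) for k in 'fico'}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):          # a short sequence of 5 inputs
    h, c = lstm_step(x, h, c, W, b)
print(h)
```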
Why Use LSTMs?
• Overcome Vanishing Gradients: Traditional RNNs struggle with learning long-term dependencies because gradients diminish as they propagate back through time. LSTMs solve this by maintaining a more constant gradient over time steps.
• Capturing Long-Term Dependencies: They are particularly useful for tasks where the context over long sequences matters, such as in speech recognition, machine translation, and text generation.
• Flexibility: LSTMs can effectively learn when to retain or forget information over time, making them highly adaptable to various types of sequential data.
Applications of LSTMs in NLP
• Language Modeling: Predicting the next word in a sentence by learning from the previous sequence of words. LSTMs help capture long-range dependencies that improve accuracy.
• Machine Translation: LSTMs are used in encoder-decoder architectures, where the encoder LSTM processes a sentence in one language and the decoder LSTM generates its translation in another language.
• Text Generation: LSTMs are used to generate new text (e.g., poetry, music lyrics) by learning patterns in a large corpus.
• Sentiment Analysis: LSTMs can model the sentiment of a sentence or document by remembering important information from earlier parts of the sequence.
• Speech Recognition: In speech-to-text systems, LSTMs are used to map sequences of audio features to sequences of text.
• Named Entity Recognition (NER): LSTMs are used to identify and classify entities like people, organizations, and locations within a sequence of text.
Gated Recurrent Unit (GRU)
The Gated Recurrent Unit (GRU) is a type of recurrent
neural network (RNN) architecture designed to solve
issues related to learning long-term dependencies,
such as vanishing gradients, that traditional RNNs face.
It is similar to the Long Short-Term Memory (LSTM)
network but has a simpler architecture.
The GRU has fewer gates and parameters compared
to the LSTM, which makes it computationally less
expensive and easier to train while still effectively
capturing temporal dependencies in sequential data.
Key Concepts of Gated Recurrent Unit (GRU)
Recurrent Neural Networks (RNNs) Background:
• RNNs are used to process sequential data (e.g., time
series, text, speech). They maintain a hidden state that
is updated with each time step, allowing them to
capture the dependencies between elements in the
sequence.
• However, traditional RNNs struggle to capture long-
range dependencies due to the vanishing gradient
problem during training.
• GRU Architecture: A GRU addresses the limitations of traditional RNNs by introducing two gates: the reset gate and the update gate. These gates help control the flow of information and manage what information should be carried forward or forgotten (a sketch of the gate computations follows the comparison below).
Comparison of GRU vs. LSTM
LSTM: Uses three gates (input, forget, output), which provide more control over memory, but at the cost of more computational complexity.
GRU: Uses two gates (update, reset), which makes it more efficient while still performing well on many tasks.
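To make the two gates concrete, here is a minimal NumPy sketch of a single GRU step with an update gate z_t, a reset gate r_t, and a candidate state h̃_t. The sizes, random weights, and the particular interpolation convention (1 − z_t weighting the old state) are illustrative assumptions; some libraries use the mirrored convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, b):
    """One GRU time step: update gate z_t, reset gate r_t, candidate h~_t."""
    zcat = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    z_t = sigmoid(W['z'] @ zcat + b['z'])       # update gate
    r_t = sigmoid(W['r'] @ zcat + b['r'])       # reset gate
    # Candidate state is computed from the *reset* previous hidden state
    hcat = np.concatenate([r_t * h_prev, x_t])
    h_tilde = np.tanh(W['h'] @ hcat + b['h'])
    # Interpolate between old state and candidate using the update gate
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Toy usage with illustrative sizes (input dim 4, hidden dim 3)
rng = np.random.default_rng(1)
n_in, n_hid = 4, 3
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for k in 'zrh'}
b = {k: np.zeros(n_hid) for k in 'zrh'}
h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h = gru_step(x, h, W, b)
print(h)
```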
Use Cases of GRU
• Time Series Forecasting: GRUs are used to capture
dependencies in time-series data for tasks like
weather prediction or stock price forecasting.
• Natural Language Processing (NLP): GRUs are used in
tasks like machine translation, speech recognition, and
text generation, where sequential data is prevalent.
• Anomaly Detection: In detecting irregularities in
sequences, such as fraud detection in transactional
data or identifying unusual patterns in sensor data.
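In practice a GRU layer is usually taken from a deep learning framework rather than written by hand. The following sketch uses PyTorch's torch.nn.GRU on a random toy batch; the tensor sizes are arbitrary and the snippet only shows the forward pass, not training for any of the tasks above.

```python
import torch
import torch.nn as nn

# A single-layer GRU: 8 input features per time step, hidden size 16
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 20, 8)    # batch of 4 sequences, 20 steps, 8 features each
output, h_n = gru(x)         # output: per-step hidden states, h_n: final state

print(output.shape)          # torch.Size([4, 20, 16])
print(h_n.shape)             # torch.Size([1, 4, 16])
```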
Part of Speech Tagging
Part-of-Speech (POS) Tagging is a fundamental task
in Natural Language Processing (NLP) where each
word in a sentence is assigned a part of speech, such
as a noun, verb, adjective, etc.
This helps in understanding the grammatical
structure of a sentence and provides information
about the syntactic role of words.
Common Parts of Speech
• Noun (NN): Names of people, places, things, or ideas (e.g., "dog", "happiness").
• Verb (VB): Action words (e.g., "run", "jump").
• Adjective (JJ): Describes a noun (e.g., "happy", "blue").
• Adverb (RB): Describes a verb, adjective, or other adverb (e.g., "quickly", "very").
• Pronoun (PRP): Substitutes for nouns (e.g., "he", "she").
• Preposition (IN): Shows relationships between nouns (e.g., "in", "on").
• Conjunction (CC): Connects clauses, sentences, or words (e.g., "and", "but").
• Determiner (DT): Introduces nouns (e.g., "the", "a").
How POS Tagging Works
Tokenization: The sentence is first split into individual words (tokens).
Assign Tags: Each token is then tagged with its part of speech based on its role in the sentence.
For example:
Sentence: "The cat chased the mouse."
POS Tags:
– "The" → DT (Determiner)
– "cat" → NN (Noun)
– "chased" → VBD (Verb, Past Tense)
– "the" → DT (Determiner)
– "mouse" → NN (Noun)
Explanation of Tags
• DT: Determiner
• JJ: Adjective
• NN: Noun
• VBZ: Verb, 3rd person singular present
• IN: Preposition
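As a quick illustration of automatic tagging, the sketch below runs NLTK's off-the-shelf tokenizer and perceptron tagger on the example sentence. nltk.word_tokenize and nltk.pos_tag are standard NLTK functions, but the required resource downloads and the exact output shown in the comment may vary with the installed NLTK version.

```python
import nltk

# One-time resource downloads (resource names may vary across NLTK versions)
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The cat chased the mouse."
tokens = nltk.word_tokenize(sentence)   # ['The', 'cat', 'chased', 'the', 'mouse', '.']
tags = nltk.pos_tag(tokens)             # Penn Treebank tags

print(tags)
# Expected to resemble:
# [('The', 'DT'), ('cat', 'NN'), ('chased', 'VBD'),
#  ('the', 'DT'), ('mouse', 'NN'), ('.', '.')]
```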
POS Tagging Algorithms
Rule-based POS Tagging:
• Uses a set of hand-crafted linguistic rules to tag words.
• Example: "If a word ends in 'ing', it's likely a verb (e.g., running)."
Statistical POS Tagging:
• Uses machine learning models to assign POS tags based on probabilities derived from a large annotated corpus.
• Example: Hidden Markov Model (HMM), which predicts the most likely tag sequence given the word sequence.
Neural Network-based POS Tagging:
• Uses deep learning methods like Recurrent Neural Networks (RNNs) or Transformer-based models (e.g., BERT) to learn tagging from large amounts of labelled data.
Hybrid Methods:
• Combines rule-based and statistical or machine learning approaches to improve accuracy.
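A rule-based tagger of the kind described above can be approximated with NLTK's RegexpTagger, which assigns each token the tag of the first matching pattern. The patterns below are illustrative assumptions chosen to mirror the "ends in 'ing'" rule, with a noun fallback; a realistic rule set would be much larger.

```python
from nltk.tag import RegexpTagger

# Hand-crafted patterns, tried in order; the first match wins.
patterns = [
    (r"^(the|a|an)$", "DT"),   # determiners
    (r".*ing$", "VBG"),        # gerunds / present participles ("running")
    (r".*ed$", "VBD"),         # simple past ("chased")
    (r".*ly$", "RB"),          # adverbs ("quickly")
    (r".*", "NN"),             # fallback: tag everything else as a noun
]

tagger = RegexpTagger(patterns)
print(tagger.tag(["the", "dog", "is", "running", "quickly"]))
# [('the', 'DT'), ('dog', 'NN'), ('is', 'NN'), ('running', 'VBG'), ('quickly', 'RB')]
# Note: "is" is mis-tagged as NN, showing the limits of a small rule set.
```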
Applications of POS Tagging
Syntactic Parsing: Helps in building parse trees for sentences.
Named Entity Recognition (NER): POS tagging is often a preprocessing step in recognizing named entities like persons, organizations, etc.
Machine Translation: Helps in understanding sentence structure for better translation.
Information Extraction: Assists in extracting useful information (e.g., events, dates) from text.
Sentiment Analysis: POS tags can help identify the sentiments by focusing on adjectives and verbs.
Challenges in POS Tagging
• Ambiguity: Some words can belong to different parts of speech depending on the context (e.g., "book" can be a noun or a verb).
• Unknown Words: Handling out-of-vocabulary (OOV) words not seen in the training data can be difficult.
BERT (Bidirectional Encoder Representations from Transformers)
It is a groundbreaking model in Natural Language Processing (NLP) developed by Google.
It has significantly advanced the field by introducing new methods for understanding the context of words in a sentence.
Overview of BERT
• Architecture:
– Transformers: BERT is based on the Transformer architecture, which uses self-attention mechanisms to process input data in parallel rather than sequentially, allowing it to understand context better.
– Bidirectional Context: Unlike traditional models that process text in a left-to-right or right-to-left manner, BERT reads the entire sequence of words simultaneously. This bidirectional approach allows BERT to grasp the context of a word based on all its surrounding words.
Training:
• Masked Language Model (MLM): During training, some words in the input text are randomly masked, and the model learns to predict these masked words based on their context. This helps BERT understand the meaning of words based on surrounding words.
• Next Sentence Prediction (NSP): BERT is also trained
on pairs of sentences to understand the relationship
between them. For example, given two sentences, it
learns to predict whether the second sentence follows
the first in the original text.
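The MLM objective is easy to see from the user side with the Hugging Face transformers library: a fill-mask pipeline loaded with a pre-trained BERT checkpoint predicts the hidden word. The pipeline task and the bert-base-uncased checkpoint are real, but the predictions mentioned in the comment are only what one would typically expect, not guaranteed output.

```python
from transformers import pipeline

# Masked-word prediction with a pre-trained BERT checkpoint
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for candidate in unmasker("The cat chased the [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
# Typically surfaces plausible nouns such as "mouse", "ball", "dog", ...
```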
Pre-training and Fine-tuning:
• BERT is first pre-trained on a large corpus of text
(e.g., Wikipedia and Books Corpus) to learn general
language representations. This pre-training allows it to
capture a broad understanding of language.
• After pre-training, BERT can be fine-tuned on
specific tasks (e.g., sentiment analysis, question
answering) using a smaller dataset. Fine-tuning
typically involves adding a simple output layer to the
pre-trained model and training it on the task-specific
data.
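Fine-tuning typically means loading the pre-trained weights, adding a small task head, and training on labelled examples. The sketch below shows the skeleton of that idea for binary sentiment classification using the transformers and PyTorch APIs; the two-example "dataset", the label values, and the single optimizer step are illustrative stand-ins for a real fine-tuning run.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained BERT body + a freshly initialised 2-class classification head
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# A tiny illustrative batch: 1 = positive, 0 = negative
texts = ["I loved this movie!", "This was a waste of time."]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# One gradient step on the task-specific data (a real run loops over a dataset)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)   # loss comes from the new output layer
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```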
Applications of BERT
• BERT has been applied successfully across various
NLP tasks, including:
• Text Classification: Assigning categories to text
documents (e.g., sentiment analysis).
• Named Entity Recognition (NER): Identifying entities
such as names, dates, and locations in text.
• Question Answering: Finding answers to questions
based on a given passage of text.
• Text Summarization: Generating concise summaries
of longer texts.
• Machine Translation: Translating text from one
language to another.
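For instance, extractive question answering with BERT can be tried in a few lines through the transformers question-answering pipeline. The checkpoint named below is one publicly released BERT model fine-tuned on SQuAD and is an assumption of this sketch; any comparable QA checkpoint would work.

```python
from transformers import pipeline

# A BERT checkpoint fine-tuned for extractive question answering (SQuAD);
# assumed available on the Hugging Face Hub under this name.
qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

context = ("BERT was developed by researchers at Google and introduced "
           "masked language modelling for pre-training deep bidirectional "
           "representations of text.")
result = qa(question="Who developed BERT?", context=context)
print(result["answer"], result["score"])   # expected answer span: "Google"
```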
Impact of BERT
• Performance: BERT has set new state-of-the-art
performance records on multiple NLP benchmarks,
significantly improving results for many tasks.
• Transfer Learning: It popularized the use of pre-
trained language models, leading to the development
of numerous other models based on BERT (e.g.,
RoBERTa, DistilBERT, ALBERT).
• Community Adoption: BERT's release has sparked a
wave of research and development in NLP, influencing
academic work and industry applications.