NLP Lab Manual
LAB MANUAL
Subject Code :- ACTDCNLP00IP
Subject Name :- NATURAL LANGUAGE PROCESSING LAB
Semester :- VII SEM (IV YEAR)
Session :- JULY - DEC 2024
Submitted to: Prof. Rishi Yadav
Submitted by: Name :-
Enrollment No. :- 21ADV3CSE0
Experiment No. :- 01
Word Analysis and Word Generation
Word Analysis
Definition:
Word Analysis is a crucial task in Natural Language Processing (NLP) that involves
examining and understanding the structure and meaning of words within a text. It
aims to break down text into its fundamental components, categorize these
components, and extract meaningful information from them. This analysis is
foundational for various NLP applications, including information retrieval, sentiment
analysis, and machine translation.
Techniques:
1. Tokenization:
○ Definition: Tokenization is the process of splitting a text into
individual words or tokens. Tokens can be words, phrases, or symbols
that serve as the building blocks for further analysis.
○ Purpose: Tokenization helps in simplifying the text into manageable
units, making it easier to perform subsequent analyses.
○ Example: The sentence "Natural Language Processing is fascinating."
can be tokenized into ["Natural", "Language", "Processing", "is",
"fascinating", "."].
2. Part-of-Speech (POS) Tagging:
○ Definition: POS tagging involves identifying the grammatical category
of each word in a sentence. Categories include nouns, verbs, adjectives,
adverbs, etc.
○ Purpose: POS tagging helps in understanding the syntactic structure of
a sentence, which is useful for parsing, machine translation, and
text-to-speech systems.
○ Example: In the sentence "Natural Language Processing is
fascinating," POS tagging might label "Natural" as an adjective,
"Language" as a noun, and "is" as a verb.
3. Named Entity Recognition (NER):
○ Definition: NER is the process of identifying and classifying entities
mentioned in a text, such as names of people, organizations, locations,
dates, and other proper nouns.
○ Purpose: NER is used to extract structured information from
unstructured text, enabling applications like information retrieval,
question answering, and knowledge graph construction.
○ Example: In the sentence "Barack Obama was born in Hawaii," NER
identifies "Barack Obama" as a person and "Hawaii" as a location.
Applications: word analysis underpins information retrieval, sentiment analysis, and machine translation, as noted in the definition above.
CODE :-
Suffix Stripping
suffixes = ['ed', 'ing']
s = input("Enter the word : ")
# Check for a known suffix explicitly: s.strip('ed') would remove the
# characters 'e' and 'd' from both ends, not the suffix "ed"
for suf in suffixes:
    if s.endswith(suf):
        s = s[:-len(suf)]
        break
print(s)
Output:
Enter the word : talented
talent
Word Generation
Definition:
Word Generation is the process of creating coherent and contextually
appropriate words or sequences of words in natural language. This process is
essential for various NLP applications, including text completion, machine
translation, dialogue systems, and content generation. Effective word generation
requires understanding both the syntactic structure and the semantic context of
the text.
Techniques:
1. Language Modeling:
○ Definition: Language modeling involves predicting the next word
in a sequence based on the preceding words. This technique helps
in generating text that follows a natural language structure and
context.
○ Purpose: Language models are used to complete sentences,
generate coherent text, and predict the next word in applications
such as autocomplete and text-based games.
○ Example: Given the input "The cat sat on the," a language model
might predict "mat" as the next word.
2. N-Grams:
○ Definition: N-Grams are statistical models that use the previous
n−1 words to predict the n-th word in a sequence. They
rely on the frequency of word sequences in a training corpus.
○ Purpose: N-Gram models are used to capture statistical
dependencies between words, which helps in predicting the next
word based on historical word patterns.
○ Example: In a bigram (2-gram) model, the prediction of the next
word is based on the preceding word. For the input "The cat," the
model might predict "sat" if "cat sat" is a frequent bigram in the
corpus.
3. Neural Language Models:
○ Definition: Neural language models utilize deep learning
techniques to generate text. They can learn complex patterns and
dependencies in language data through neural networks.
○ Purpose: These models generate more coherent and contextually
appropriate text compared to traditional methods. They are
particularly useful for tasks like text generation, translation, and
summarization.
○ Example: Recurrent Neural Networks (RNNs), Long Short-Term
Memory (LSTM) networks, and Transformer-based models are
commonly used neural language models.
4. Word Embeddings:
○ Definition: Word embeddings are dense vector representations of
words where similar words have similar vectors. They capture
semantic relationships between words by mapping them into a
continuous vector space.
○ Purpose: Word embeddings improve the performance of NLP
tasks by providing meaningful word representations that capture
context and similarity.
○ Example: The word "king" might be represented as a vector that is
close to the vector for "queen" in the embedding space.
5. Word2Vec:
○ Definition: Word2Vec is a technique for learning word
embeddings using shallow neural networks. It generates word
vectors based on the context of words in a corpus.
○ Purpose: Word2Vec helps in capturing semantic relationships and
similarities between words, which enhances various NLP
applications.
○ Example: Word2Vec can be used to find word similarities or
analogies, such as "man" is to "woman" as "king" is to "queen."
6. GloVe (Global Vectors for Word Representation):
○ Definition: GloVe is a model that generates word vectors by
aggregating global word-word co-occurrence statistics from a
corpus. It creates embeddings that capture the global statistical
information of words.
○ Purpose: GloVe aims to provide meaningful word vectors by
leveraging global co-occurrence patterns, improving the
performance of tasks like semantic similarity and word analogy.
○ Example: GloVe can be used to analyze relationships between
words and generate similar word vectors based on their usage in a
large corpus.
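To make the N-Gram technique concrete, here is a minimal bigram sketch (the toy corpus below is our own assumption; real models are trained on large corpora):
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat sat down .".split()
# Count how often each word follows each other word
bigrams = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1][w2] += 1

def predict_next(word):
    # Return the most frequent successor of `word` in the corpus
    return bigrams[word].most_common(1)[0][0] if bigrams[word] else None

print(predict_next("cat"))  # 'sat', because "cat sat" is the most frequent bigram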
Tools and Libraries:
1. TensorFlow/Keras
2. Gensim
3. Transformers (by Hugging Face)
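As an illustration of these libraries, a minimal Gensim (4.x) Word2Vec sketch; the toy sentences are our own assumption, and a real model needs a much larger corpus:
from gensim.models import Word2Vec

sentences = [["the", "king", "rules"],
             ["the", "queen", "rules"],
             ["the", "cat", "sleeps"]]
# Train a small Word2Vec model on the toy corpus
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=42)
# Words that appear in similar contexts receive similar vectors
print(model.wv.most_similar("king", topn=2))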
Applications: text completion, machine translation, dialogue systems, and content generation, as noted in the definition above.
CODE
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Encode input text
input_text = "Natural Language Processing is"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
# Generate text
output = model.generate(input_ids, max_length=50,
                        num_return_sequences=1, no_repeat_ngram_size=2)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Text:", generated_text)
OUTPUT
Generated Text: Natural Language Processing is a field of computer science
and artificial intelligence concerned with the interactions between computers and
human languages. It is used to develop applications that can understand, interpret, and
generate human language.
Experiment No. :- 02
Study of Morphology, N-Grams, and N-Grams Smoothing
Study of Morphology
Definition: Morphology is the branch of linguistics concerned with the structure and
formation of words. It examines how words are constructed from smaller units called
morphemes, which are the smallest meaningful units of language. Understanding
morphology is crucial for tasks such as text processing, language modeling, and
information retrieval.
Key Concepts:
1. Morphemes:
The smallest units of meaning in a language. Morphemes can be roots,
prefixes, suffixes, or infixes.
Techniques:
1. Morphological Analysis:
○ Objective: Break down words into their constituent morphemes and
understand their grammatical role.
2. Part-of-Speech Tagging (POS Tagging):
○ Objective: Identify the grammatical category of each word in a
sentence, which is essential for morphological analysis.
3. Morphological Parsing:
○ Objective: Analyze the structure of words and determine their root
forms and affixes.
4. Stemmer and Lemmatizer Tools:
○ Stemmers: Algorithms like Porter Stemmer and Snowball Stemmer
are used for stemming.
○ Lemmatizers: Tools like WordNet Lemmatizer from NLTK or spaCy's
lemmatizer are used for lemmatization.
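A minimal sketch of both tool types with NLTK (assuming the wordnet resource has been downloaded):
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # required once for the lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("running"))                  # run
print(lemmatizer.lemmatize("better", pos="a"))  # good (pos="a" = adjective)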
CODE
Objective: Perform POS tagging to understand the grammatical role of each word.
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Natural Language Processing is fascinating."
doc = nlp(text)
pos_tags = [(token.text, token.pos_) for token in doc]
print("POS Tags:", pos_tags)
OUTPUT:
POS Tags: [('Natural', 'ADJ'), ('Language', 'NOUN'), ('Processing', 'NOUN'),
('is', 'AUX'), ('fascinating', 'ADJ'), ('.', 'PUNCT')]
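The second output below shows the morphological analysis of a Russian word, in the format produced by the pymorphy2 analyzer; the following sketch is our assumption of the code behind it (pymorphy2 and its Russian dictionaries must be installed):
import pymorphy2

morph = pymorphy2.MorphAnalyzer()
# Take the most probable parse of the word
p = morph.parse('уходил')[0]
print("Word:", p.word)
print("Normal Form:", p.normal_form)  # the lemma ('уходить', "to leave")
print("Tag:", p.tag)                  # grammatical features of the word form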
OUTPUT
Word: уходил
Normal Form: уходить
Tag: VERB,impf,past,sg,m1
Experiment No. :- 03
Implementation of POS Tagging with Hidden Markov Model
To implement an HMM for POS tagging, you need to train the model on a
labeled corpus. For simplicity, we’ll use the nltk library, which provides tools
for this purpose. Install the necessary library first (pip install nltk).
3. Code Example
Here’s a complete example of POS tagging using HMM in Python with the nltk
library (see the CODE section below):
4. Explanation
1. Loading Data:
○ The treebank corpus from NLTK is used, which contains a set of
sentences tagged with POS tags.
2. Preparing Data:
○ Convert the tagged sentences into the format required by the HMM
trainer: a list of tuples where each tuple contains a word and its
POS tag.
3. Training the Model:
○ Use HiddenMarkovModelTrainer to create and train the HMM
POS tagger.
4. Tagging:
○ Use the trained tagger to assign POS tags to words in new
sentences.
5. Output and Practical Notes
Sample output appears after the code below; a few practical notes:
● Training Data: The quality and size of the training data can significantly
impact the performance of the HMM.
● Model Evaluation: For practical use, you should evaluate the model on a
separate test set to measure its accuracy.
● Extensions: You can extend this basic implementation with more
advanced techniques, such as smoothing, to handle unknown words or
tags more effectively.
CODE
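The source omits the code itself; the following minimal sketch follows steps 1–4 of the explanation above (the treebank corpus must be downloaded, and the 3000-sentence training split is our assumption):
import nltk
from nltk.corpus import treebank
from nltk.tag import hmm

# 1-2. Load and prepare tagged sentences: lists of (word, tag) tuples
nltk.download('treebank')
train_data = treebank.tagged_sents()[:3000]

# 3. Train the HMM POS tagger
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_data)

# 4. Tag a new sentence
sentence = "Natural language processing is fascinating".split()
print(tagger.tag(sentence))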
Output: a list of (word, tag) tuples for the test sentence; the exact tags depend on the training data.
By offering insights into the grammatical structure, this tagging aids machines
in comprehending not just individual words but also the connections between
them inside a phrase. For many NLP applications, like text summarization,
sentiment analysis, and machine translation, this kind of data is essential.
The following are the processes in a typical natural language processing (NLP)
example of part-of-speech (POS) tagging:
● Tokenization: Divide the input text into discrete tokens, which are
usually units of words or subwords. The first stage in NLP tasks is
tokenization.
● Loading Language Models: To utilize a library such as NLTK or
SpaCy, be sure to load the relevant language model. These models
offer a foundation for comprehending a language’s grammatical
structure since they have been trained on a vast amount of linguistic
data.
● Text Processing: If required, preprocess the text to handle special
characters, convert it to lowercase, or eliminate superfluous
information. Correct PoS labeling is aided by clear text.
● Linguistic Analysis: To determine the text’s grammatical structure,
use linguistic analysis. This entails understanding each word’s
purpose inside the sentence, including whether it is an adjective, verb,
noun, or other.
● Part-of-Speech Tagging: Assign a grammatical tag (noun, verb,
adjective, and so on) to each token, based on the linguistic analysis
and the loaded language model.
● Results Analysis: Verify the accuracy and consistency of the PoS
tagging results against the source text, and identify and correct any
problems or mistakes.
CODE
# Sample text
text = "NLTK is a powerful library for natural language processing."
OUTPUT:
Original Text:
NLTK is a powerful library for natural language processing.
Experiment No. :- 04
Cricket Match Outcome Prediction using Machine Learning
Features to Consider:
Sources:
● Historical match data from sources like ESPN Cricinfo, Kaggle datasets,
or cricket statistics websites.
Example Data:
Data Cleaning:
Step-by-Step Implementation:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
3. Prepare Data
X = data[features]
y = data['match_result_encoded']
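4. Split Data
This step is missing in the source; a minimal sketch using the train_test_split imported above (the 80/20 split is an assumption):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)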
5. Train a Model
# Initialize and train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
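6. Evaluate the Model
The evaluation code is missing in the source; a minimal completion (our assumption) using the metrics imported in the first step, producing the values printed below:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)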
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", report)
Experiment No. :- 05
Machine Translation from English to Hindi
First, ensure you have the required libraries installed. You’ll need transformers
and torch.
Python Code:
from transformers import MarianMTModel, MarianTokenizer

# Load the English-to-Hindi MarianMT checkpoint
# (assumption: the widely used Helsinki-NLP/opus-mt-en-hi model)
model_name = 'Helsinki-NLP/opus-mt-en-hi'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate_text(text):
    # Tokenize input text
    tokens = tokenizer(text, return_tensors='pt', padding=True)
    # Perform translation
    translated_tokens = model.generate(**tokens)
    # Decode generated token IDs back into a string
    translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
    return translated_text

# Example usage
english_text = "Hello, how are you?"
hindi_translation = translate_text(english_text)
print("Hindi Translation:", hindi_translation)
You can test the model with different English sentences to see how well it
translates them into Hindi. Simply replace the english_text variable with other
sentences.
5. Advanced Considerations
● Fine-tuning: If you have specific domain data, you can fine-tune the
model on your own dataset for better accuracy.
● Data Preprocessing: Ensure your text data is clean and free from
unwanted characters to get better translation results.
● Error Handling: Implement error handling for cases where the model
might produce unexpected results or fail to translate.
6. Example Output
7. Conclusion
Experiment No. :- 06
Study of Query Expansion in Information Retrieval
1. Basic Concepts
1. Synonym Expansion:
○ Replaces terms in the query with synonyms or related terms. For
instance, “car” might be expanded to include “automobile” or
“vehicle.”
2. Stemming and Lemmatization:
○ Stemming: Reduces words to their root form (e.g., “running”
becomes “run”).
○ Lemmatization: Reduces words to their base or dictionary form,
often considering the context (e.g., “better” becomes “good”).
3. Contextual Expansion:
○ Uses the context of the original query to add related terms. For
example, adding terms related to the query’s subject matter or
context.
4. Relevance Feedback:
○ Involves using feedback from the user about the relevance of
retrieved documents to adjust and expand the query. This can be
explicit (user provides feedback) or implicit (based on user
behavior).
5. Thesaurus-based Expansion:
○ Uses a thesaurus or ontology to find related terms or concepts to
include in the query.
6. Probabilistic Models:
○ Utilizes statistical models to determine which terms should be
added to the query based on their likelihood of improving retrieval
performance.
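A minimal synonym-expansion sketch using WordNet through NLTK, matching the “car” → “automobile” example above (assuming the wordnet resource has been downloaded):
from nltk.corpus import wordnet

def expand_query(term):
    # Collect synonyms of `term` from all of its WordNet synsets
    synonyms = set()
    for syn in wordnet.synsets(term):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name().replace('_', ' '))
    return synonyms

print(expand_query("car"))  # includes 'automobile', 'auto', 'motorcar', ...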
3. Historical Background
The work on query expansion following relevance feedback dates back to 1965,
when Rocchio formalized relevance feedback in the vector-space model. Early
work on using collection-based term co-occurrence statistics to select query
expansion terms was done by Spärck Jones and van Rijsbergen.
Approaches to Query Expansion:
1. Automatic Expansion:
○ Algorithms and models that automatically determine expansion
terms without user intervention. Examples include Latent Semantic
Analysis (LSA) and Latent Dirichlet Allocation (LDA).
2. Manual Expansion:
○ Users or experts manually add terms to improve the query. This
can be more tailored but requires more effort.
3. Machine Learning-Based Expansion:
○ Utilizes machine learning algorithms to learn patterns from past
searches and expansions to make informed decisions about query
expansion.
6. Evaluation Metrics
Experiment No. :- 07
Emotion Detection from Text
1. Basic Concepts
● Emotion Detection: The process of identifying and categorizing
emotions expressed in textual data. This can include emotions like joy,
sadness, anger, surprise, fear, and disgust.
● Sentiment Analysis: A related field that typically focuses on identifying
the overall sentiment of a text as positive, negative, or neutral. Emotion
detection can be seen as a more granular approach to sentiment analysis.
2. Types of Emotions
Different models categorize emotions in various ways. Common frameworks
include:
1. Basic Emotions:
○ Paul Ekman’s model identifies basic emotions such as happiness,
sadness, anger, surprise, disgust, and fear.
2. Dimensional Models:
○ The Pleasure-Arousal-Dominance (PAD) model describes
emotions along three dimensions: pleasure (valence), arousal
(intensity), and dominance (control).
3. Extended Models:
○ Models like the Geneva Affective Picture Database (GAPD)
provide more nuanced categories, including secondary emotions
and mixed emotions.
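A minimal emotion-classification sketch using the Hugging Face transformers pipeline; the checkpoint named below (j-hartmann/emotion-english-distilroberta-base, which predicts Ekman-style labels) is our assumption, not part of the source:
from transformers import pipeline

# Load a pretrained emotion classifier (assumed checkpoint)
classifier = pipeline("text-classification",
                      model="j-hartmann/emotion-english-distilroberta-base")
print(classifier("I can't believe we won the match!"))
# e.g. [{'label': 'joy', 'score': ...}]; exact scores depend on the model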
5. Applications
1. Customer Sentiment Analysis:
○ Understanding customer feedback and improving customer service
by analyzing emotional content in reviews and surveys.
2. Social Media Monitoring:
○ Tracking public sentiment and emotions about brands, events, or
political issues.
3. Healthcare:
○ Analyzing patient feedback or clinical notes to identify emotional
states and mental health conditions.
4. Entertainment and Media:
○ Enhancing user experiences in games, movies, and content
recommendations by analyzing emotional responses.
5. Human-Computer Interaction:
○ Improving interactions between users and virtual assistants or
chatbots by understanding user emotions.