
SAGE University Indore

LAB MANUAL
Subject code :- ACTDCNLP00IP
SUBJECT NAME :- NATURAL LANGUAGE PROCESSING LAB
SEMESTER: VII SEM, IV YEAR
Session :- JULY - DEC 2024

Institute of Advance Computing

Submitted to: Prof. Rishi Yadav
Submitted by:
Name :-
Enrollment No. 21ADV3CSE0
INDEX

Sr. No.  Name of Experiment                                        Date of Submission    Remarks

01.  Study of Word Analysis and Word Generation in NLP
02.  Study of Morphology, N-Grams, and N-Grams Smoothing
03.  Implementation of POS Tagging with Hidden Markov Model
04.  Building a POS Tagger in NLP
05.  Implementing Cricket Game Prediction
06.  Implementing Machine Translation from English to Hindi
07.  Study of Query Expansion for Information Retrieval
08.  Study of Emotion Detection in Texts
Experiment No. :- 01
Study of Word Analysis and Word Generation in NLP

Word Analysis
Definition:
Word Analysis is a crucial task in Natural Language Processing (NLP) that involves
examining and understanding the structure and meaning of words within a text. It
aims to break down text into its fundamental components, categorize these
components, and extract meaningful information from them. This analysis is
foundational for various NLP applications, including information retrieval, sentiment
analysis, and machine translation.

Techniques:

1. Tokenization:
○ Definition: Tokenization is the process of splitting a text into
individual words or tokens. Tokens can be words, phrases, or symbols
that serve as the building blocks for further analysis.
○ Purpose: Tokenization helps in simplifying the text into manageable
units, making it easier to perform subsequent analyses.
○ Example: The sentence "Natural Language Processing is fascinating."
can be tokenized into ["Natural", "Language", "Processing", "is",
"fascinating", "."].
2. Part-of-Speech (POS) Tagging:
○ Definition: POS tagging involves identifying the grammatical category
of each word in a sentence. Categories include nouns, verbs, adjectives,
adverbs, etc.
○ Purpose: POS tagging helps in understanding the syntactic structure of
a sentence, which is useful for parsing, machine translation, and
text-to-speech systems.
○ Example: In the sentence "Natural Language Processing is
fascinating," POS tagging might label "Natural" as an adjective,
"Language" as a noun, and "is" as a verb.
3. Named Entity Recognition (NER):
○ Definition: NER is the process of identifying and classifying entities
mentioned in a text, such as names of people, organizations, locations,
dates, and other proper nouns.
○ Purpose: NER is used to extract structured information from
unstructured text, enabling applications like information retrieval,
question answering, and knowledge graph construction.
○ Example: In the sentence "Barack Obama was born in Hawaii," NER
identifies "Barack Obama" as a person and "Hawaii" as a location.

Tools and Libraries:

1. NLTK (Natural Language Toolkit)


2. spaCy
3. Stanford NLP

Applications:

● Information Retrieval: Enhancing search engines by improving the


understanding of query intent and content relevance.
● Sentiment Analysis: Analyzing opinions and emotions expressed in text.
● Machine Translation: Facilitating the translation of text from one language
to another by understanding the grammatical structure and entities.

CODE :-

Count Number of words


string = input(" Enter a line ")
print("The original string is :", string)
wc = len(string.split())
print("The number of words are :" ,wc)
Output :
Enter a line This is string
The original string is : This is string
The number of words are : 3

Stripping (suffix removal)
suffixes = ['ed', 'ing']
s = input("Enter the word : ")
ss = s
for suf in suffixes:
    if s.endswith(suf):
        ss = s[:-len(suf)]
        break
print(ss)
Output:
Enter the word : talented
talent
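
Tokenization, POS Tagging, and NER with spaCy

The following is a minimal sketch of the three analysis techniques described above, assuming spaCy and its small English model (en_core_web_sm) are installed:

import spacy

# Load the small English pipeline
# (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was born in Hawaii.")

# Tokenization: the individual tokens
print("Tokens:", [token.text for token in doc])

# POS tagging: (word, part-of-speech) pairs
print("POS Tags:", [(token.text, token.pos_) for token in doc])

# Named Entity Recognition: (entity text, entity label) pairs
print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])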
Word Generation
Definition:
Word Generation is the process of creating coherent and contextually
appropriate words or sequences of words in natural language. This process is
essential for various NLP applications, including text completion, machine
translation, dialogue systems, and content generation. Effective word generation
requires understanding both the syntactic structure and the semantic context of
the text.

Techniques:

1. Language Modeling:
○ Definition: Language modeling involves predicting the next word
in a sequence based on the preceding words. This technique helps
in generating text that follows a natural language structure and
context.
○ Purpose: Language models are used to complete sentences,
generate coherent text, and predict the next word in applications
such as autocomplete and text-based games.
○ Example: Given the input "The cat sat on the," a language model
might predict "mat" as the next word.
2. N-Grams:
○ Definition: N-Grams are statistical models that use the previous
n−1 words to predict the n-th word in a sequence. They rely on the
frequency of word sequences in a training corpus.
○ Purpose: N-Gram models are used to capture statistical
dependencies between words, which helps in predicting the next
word based on historical word patterns.
○ Example: In a bigram (2-gram) model, the prediction of the next
word is based on the preceding word. For the input "The cat," the
model might predict "sat" if "cat sat" is a frequent bigram in the
corpus (a small bigram sketch is shown after this list).
3. Neural Language Models:
○ Definition: Neural language models utilize deep learning
techniques to generate text. They can learn complex patterns and
dependencies in language data through neural networks.
○ Purpose: These models generate more coherent and contextually
appropriate text compared to traditional methods. They are
particularly useful for tasks like text generation, translation, and
summarization.
○ Example: Recurrent Neural Networks (RNNs), Long Short-Term
Memory (LSTM) networks, and Transformer-based models are
commonly used neural language models.
4. Word Embeddings:
○ Definition: Word embeddings are dense vector representations of
words where similar words have similar vectors. They capture
semantic relationships between words by mapping them into a
continuous vector space.
○ Purpose: Word embeddings improve the performance of NLP
tasks by providing meaningful word representations that capture
context and similarity.
○ Example: The word "king" might be represented as a vector that is
close to the vector for "queen" in the embedding space.
5. Word2Vec:
○ Definition: Word2Vec is a technique for learning word
embeddings using shallow neural networks. It generates word
vectors based on the context of words in a corpus.
○ Purpose: Word2Vec helps in capturing semantic relationships and
similarities between words, which enhances various NLP
applications.
○ Example: Word2Vec can be used to find word similarities or
analogies, such as "man" is to "woman" as "king" is to "queen."
6. GloVe (Global Vectors for Word Representation):
○ Definition: GloVe is a model that generates word vectors by
aggregating global word-word co-occurrence statistics from a
corpus. It creates embeddings that capture the global statistical
information of words.
○ Purpose: GloVe aims to provide meaningful word vectors by
leveraging global co-occurrence patterns, improving the
performance of tasks like semantic similarity and word analogy.
○ Example: GloVe can be used to analyze relationships between
words and generate similar word vectors based on their usage in a
large corpus.
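
As referenced in technique 2, here is a tiny bigram sketch: it counts word pairs in a toy corpus and predicts the most frequent continuation of a given word (the corpus is illustrative).

from collections import defaultdict, Counter

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word
bigram_counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigram_counts[w1][w2] += 1

def predict_next(word):
    # Return the most frequent word observed after `word`
    if word not in bigram_counts:
        return None
    return bigram_counts[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' (the most frequent continuation of 'the')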

Tools and Libraries:

1. TensorFlow/Keras
2. Gensim
3. Transformers (by Hugging Face)
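
As a small illustration of the Word2Vec technique described above, the following sketch trains embeddings with Gensim on a toy corpus (real models require far more data, so the resulting similarities are only indicative):

from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# Train a small skip-gram model (sg=1) with 50-dimensional vectors
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Inspect the learned vector for a word and its nearest neighbours
print(model.wv["king"][:5])
print(model.wv.most_similar("king", topn=3))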

Applications:

● Text Generation: Creating coherent and contextually relevant text for


chatbots, content creation, and storytelling.
● Autocomplete: Enhancing user input experiences by predicting the next
word or phrase in text input fields.
● Machine Translation: Generating translations of text from one language
to another by understanding and generating contextually appropriate
words.
● Dialogue Systems: Generating responses in conversational agents and
virtual assistants.

CODE
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Encode input text
input_text = "Natural Language Processing is"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Generate text
output = model.generate(input_ids, max_length=50,
                        num_return_sequences=1, no_repeat_ngram_size=2)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Text:", generated_text)

OUTPUT
Generated Text: Natural Language Processing is a field of computer science
and artificial intelligence concerned with the interactions between computers and
human languages. It is used to develop applications that can understand, interpret, and
generate human language.
Experiment No. :- 02
Study of Morphology, N-Grams, and N-Grams Smoothing

Study of Morphology
Definition: Morphology is the branch of linguistics concerned with the structure and
formation of words. It examines how words are constructed from smaller units called
morphemes, which are the smallest meaningful units of language. Understanding
morphology is crucial for tasks such as text processing, language modeling, and
information retrieval.

Key Concepts:

1. Morphemes:
The smallest units of meaning in a language. Morphemes can be roots,
prefixes, suffixes, or infixes.

2. Derivation and Inflection:


○ Derivation: The process of creating new words by adding prefixes or
suffixes (e.g., "teach" -> "teacher").
○ Inflection: The process of modifying a word to express different
grammatical categories (e.g., "walk" -> "walked").
3. Stemming and Lemmatization:
○ Stemming: The process of reducing words to their base or root form,
often by removing suffixes (e.g., "running" -> "run").
○ Lemmatization: The process of reducing words to their base or
dictionary form (lemma), which considers the word's context and part of
speech (e.g., "better" -> "good").
4. Compound Words:
○ Definition: Words formed by combining two or more words (e.g.,
"notebook" from "note" + "book").

Techniques and Tools:

1. Morphological Analysis:
○ Objective: Break down words into their constituent morphemes and
understand their grammatical role.
2. Part-of-Speech Tagging (POS Tagging):
○ Objective: Identify the grammatical category of each word in a
sentence, which is essential for morphological analysis.
3. Morphological Parsing:
○ Objective: Analyze the structure of words and determine their root
forms and affixes.
4. Stemmer and Lemmatizer Tools:
○ Stemmers: Algorithms like Porter Stemmer and Snowball Stemmer
are used for stemming.
○ Lemmatizers: Tools like WordNet Lemmatizer from NLTK or spaCy's
lemmatizer are used for lemmatization.

CODE

POS Tagging with spaCy:

Objective: Perform POS tagging to understand the grammatical role of each word.
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Natural Language Processing is fascinating."
doc = nlp(text)
pos_tags = [(token.text, token.pos_) for token in doc]
print("POS Tags:", pos_tags)

OUTPUT:
POS Tags: [('Natural', 'ADJ'), ('Language', 'NOUN'), ('Processing', 'NOUN'),
('is', 'AUX'), ('fascinating', 'ADJ'), ('.', 'PUNCT')]

Morphological Parsing with pymorphy2:

Objective: Parse the morphology of Russian words using pymorphy2.

from pymorphy2 import MorphAnalyzer


morph = MorphAnalyzer()
word = "уходил" # Russian word for "was leaving"
parsed_word = morph.parse(word)[0]
print(f"Word: {word}")
print(f"Normal Form: {parsed_word.normal_form}")
print(f"Tag: {parsed_word.tag}")

OUTPUT
Word: уходил
Normal Form: уходить
Tag: VERB,impf,past,sg,m1
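
Stemming and Lemmatization with NLTK:

Objective: Reduce words to their root or dictionary form using the stemmer and lemmatizer tools listed above. (A minimal sketch with NLTK's Porter Stemmer and WordNet Lemmatizer; the example words are illustrative.)

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                  # expected: run
print(stemmer.stem("studies"))                  # expected: studi (a stem, not a word)
print(lemmatizer.lemmatize("studies"))          # expected: study
print(lemmatizer.lemmatize("better", pos="a"))  # expected: good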
Experiment No. :- 03
Implementation of POS Tagging with Hidden Markov Model

Implementation of POS Tagging with Hidden Markov Model (HMM)

Part-of-Speech (POS) tagging with a Hidden Markov Model (HMM) is a classic


approach in Natural Language Processing (NLP) to assign grammatical
categories to words in a text. The HMM is a probabilistic model that assumes a
sequence of words can be generated by a sequence of hidden states, where each
state corresponds to a POS tag.

Here’s a step-by-step guide to implementing POS tagging with an HMM:

1. Understand HMM for POS Tagging

Hidden Markov Model (HMM):

● States: These correspond to POS tags (e.g., noun, verb).


● Observations: These are the words in the text.
● Transition Probabilities: The probability of moving from one POS tag
to another.
● Emission Probabilities: The probability of a word given a POS tag.
● Initial Probabilities: The probability of a POS tag at the start of a
sequence.

2. Setup and Dependencies

To implement an HMM for POS tagging, you need to train the model on a
labeled corpus. For simplicity, we’ll use the nltk library, which provides tools
for this purpose. Install the necessary libraries

3. Code Example

Here’s a complete example of POS tagging using HMM in Python with the nltk
library:
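
A minimal sketch, assuming NLTK's treebank corpus and its HiddenMarkovModelTrainer (Lidstone smoothing is added so that unseen words do not receive zero probability):

import nltk
from nltk.corpus import treebank
from nltk.tag import hmm
from nltk.probability import LidstoneProbDist

nltk.download('treebank')

# 1. Load tagged sentences: each sentence is a list of (word, tag) tuples
tagged_sentences = treebank.tagged_sents()

# 2. Split into training and test portions
split = int(len(tagged_sentences) * 0.9)
train_data = tagged_sentences[:split]
test_data = tagged_sentences[split:]

# 3. Train the HMM POS tagger with a smoothed probability estimator
trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(
    train_data,
    estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))

# 4. Tag new sentences
print("Tagged Sentence:",
      hmm_tagger.tag("Natural Language Processing is fascinating".split()))
print("Tagged Tokenized Sentence:",
      hmm_tagger.tag("Machine learning is transforming technology".split()))

# 5. Evaluate on the held-out data (older NLTK versions call this .evaluate())
print("Accuracy:", hmm_tagger.accuracy(test_data))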
4. Explanation

1. Loading Data:
○ The treebank corpus from NLTK is used, which contains a set of
sentences tagged with POS tags.
2. Preparing Data:
○ Convert the tagged sentences into the format required by the HMM
trainer: a list of tuples where each tuple contains a word and its
POS tag.
3. Training the Model:
○ Use HiddenMarkovModelTrainer to create and train the HMM
POS tagger.
4. Tagging:
○ Use the trained tagger to assign POS tags to words in new
sentences.

5. Output

Running the code will output POS-tagged sentences. For example:

Sample Output:

Tagged Sentence: [('Natural', 'JJ'), ('Language', 'NN'), ('Processing', 'NN'), ('is', 'VBZ'), ('fascinating', 'VBG')]

Tagged Tokenized Sentence: [('Machine', 'NN'), ('learning', 'VBG'), ('is', 'VBZ'), ('transforming', 'VBG'), ('technology', 'NN')]
6. Considerations

● Training Data: The quality and size of the training data can significantly
impact the performance of the HMM.
● Model Evaluation: For practical use, you should evaluate the model on a
separate test set to measure its accuracy.
● Extensions: You can extend this basic implementation with more
advanced techniques, such as smoothing, to handle unknown words or
tags more effectively.

CODE

# Parts of Speech Tagging with NLTK's default tagger
import nltk
from nltk.tokenize import word_tokenize

# Download NLTK tokenizer and POS tagging models
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def pos_tagging(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    # Perform POS tagging
    tagged_words = nltk.pos_tag(words)
    return tagged_words

# Example text
text = "NLTK is a leading platform for building Python programs to work with human language data."

# Perform POS tagging
tagged_text = pos_tagging(text)

# Print POS tagged text
print(tagged_text)

Output:

[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ('platform', 'NN'), ('for', 'IN'), ('building', 'VBG'), ('Python', 'NNP')]
Experiment No. :- 04
Building a POS Tagger in NLP

POS(Parts-Of-Speech) Tagging in NLP


One of the core tasks in Natural Language Processing (NLP) is Parts of Speech
(PoS) tagging, which is giving each word in a text a grammatical category, such
as nouns, verbs, adjectives, and adverbs. Through improved comprehension of
phrase structure and semantics, this technique makes it possible for machines to
study and comprehend human language more accurately.
In many NLP applications, including machine translation, sentiment analysis,
and information retrieval, PoS tagging is essential. PoS tagging serves as a link
between language and machine understanding, enabling the creation of complex
language processing systems and serving as the foundation for advanced
linguistic analysis.
What is POS(Parts-Of-Speech) Tagging?
Parts of Speech tagging is a linguistic activity in Natural Language
Processing (NLP) wherein each word in a document is given a particular part of
speech (adverb, adjective, verb, etc.) or grammatical category. Through the
addition of a layer of syntactic and semantic information to the words, this
procedure makes it easier to comprehend the sentence’s structure and meaning.
In NLP applications, POS tagging is useful for machine translation, named
entity recognition, and information extraction, among other things. It also works
well for clearing out ambiguity in terms with numerous meanings and revealing
a sentence’s grammatical structure.

Default tagging is a basic first step in part-of-speech tagging. It is performed
using the DefaultTagger class, which takes a single ‘tag’ as its argument and
assigns that tag to every token. NN is the tag for a singular noun. A default
tagger is most useful when it uses the most common part-of-speech tag, which is
why the noun tag NN is recommended. A short sketch is shown below.
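
A minimal sketch of DefaultTagger (assuming NLTK is installed); every token simply receives the default tag:

from nltk.tag import DefaultTagger

# Assign the tag 'NN' to every token, regardless of the word
tagger = DefaultTagger('NN')
print(tagger.tag(['The', 'quick', 'brown', 'fox']))
# [('The', 'NN'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN')]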
Example of POS Tagging
Consider the sentence: “The quick brown fox jumps over the lazy dog.”
After performing POS Tagging:
● “The” is tagged as determiner (DT)
● “quick” is tagged as adjective (JJ)
● “brown” is tagged as adjective (JJ)
● “fox” is tagged as noun (NN)
● “jumps” is tagged as verb (VBZ)
● “over” is tagged as preposition (IN)
● “the” is tagged as determiner (DT)
● “lazy” is tagged as adjective (JJ)
● “dog” is tagged as noun (NN)

By offering insights into the grammatical structure, this tagging aids machines
in comprehending not just individual words but also the connections between
them inside a phrase. For many NLP applications, like text summarization,
sentiment analysis, and machine translation, this kind of data is essential.

Workflow of POS Tagging in NLP

The following are the processes in a typical natural language processing (NLP)
example of part-of-speech (POS) tagging:
● Tokenization: Divide the input text into discrete tokens, which are
usually units of words or subwords. The first stage in NLP tasks is
tokenization.
● Loading Language Models: To utilize a library such as NLTK or
SpaCy, be sure to load the relevant language model. These models
offer a foundation for comprehending a language’s grammatical
structure since they have been trained on a vast amount of linguistic
data.
● Text Processing: If required, preprocess the text to handle special
characters, convert it to lowercase, or eliminate superfluous
information. Correct PoS labeling is aided by clear text.
● Linguistic Analysis: To determine the text’s grammatical structure,
use linguistic analysis. This entails understanding each word’s
purpose inside the sentence, including whether it is an adjective, verb,
noun, or other.
● Part-of-Speech Tagging: Assign each token its grammatical category
(noun, verb, adjective, adverb, etc.) using the loaded model’s tagger,
based on the word itself and its context within the sentence.
● Results Analysis: Verify the accuracy and consistency of the PoS
tagging findings with the source text. Determine and correct any
possible problems or mistakes

CODE

# Importing the NLTK library
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Sample text
text = "NLTK is a powerful library for natural language processing."

# Tokenizing the text into words
words = word_tokenize(text)

# Performing PoS tagging
pos_tags = pos_tag(words)

# Displaying the PoS tagged result in separate lines
print("Original Text:")
print(text)

print("\nPoS Tagging Result:")
for word, tag in pos_tags:
    print(f"{word}: {tag}")

OUTPUT:

Original Text:

NLTK is a powerful library for natural language processing.


PoS Tagging Result:
NLTK: NNP
is: VBZ
a: DT
powerful: JJ
library: NN
for: IN
natural: JJ
language: NN
processing: NN
Experiment No. :- 05
Implementing Cricket Game Prediction


Implementing a cricket game prediction system involves creating a model that


can predict the outcome of cricket matches based on historical data and relevant
features. This can be achieved using various machine learning algorithms.
Below is a step-by-step guide to implementing a cricket game prediction system
using a machine learning approach.

1. Define the Problem

Objective: Predict the outcome of a cricket match (e.g., win/loss/draw) based


on historical match data and features such as team statistics, player
performance, and match conditions.

Features to Consider:

● Team statistics (e.g., batting average, bowling average)


● Player statistics (e.g., runs scored, wickets taken)
● Match conditions (e.g., venue, weather)
● Historical head-to-head results

2. Collect and Prepare Data

Sources:

● Historical match data from sources like ESPN Cricinfo, Kaggle datasets,
or cricket statistics websites.

Example Data:

● Match results: Date, teams, venue, winner, etc.


● Team performance: Runs scored, wickets taken, etc.
● Player performance: Runs scored, wickets taken, etc.

Data Cleaning:

● Handle missing values.


● Convert categorical features into numerical format (e.g., encoding team
names).
● Normalize or standardize numerical features if needed.

3. Example Implementation Using a Dataset


Let's use a hypothetical dataset and demonstrate how to build a simple
prediction model using Python and scikit-learn.

Step-by-Step Implementation:

1. Install Necessary Libraries

pip install pandas scikit-learn

2. Import Libraries and Load Data

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset


# Replace 'cricket_matches.csv' with the path to your dataset
data = pd.read_csv('cricket_matches.csv')

# Display the first few rows of the dataset


print(data.head())

3. Prepare Data


# Example columns: 'team1', 'team2', 'team1_score', 'team2_score', 'match_result'

# Encode categorical variables
# Fit one encoder on all team names so both columns share the same mapping
team_encoder = LabelEncoder()
team_encoder.fit(pd.concat([data['team1'], data['team2']]))
data['team1_encoded'] = team_encoder.transform(data['team1'])
data['team2_encoded'] = team_encoder.transform(data['team2'])

# Define features and target variable
features = ['team1_encoded', 'team2_encoded', 'team1_score', 'team2_score']
target = 'match_result'  # Assuming 'match_result' contains 'team1', 'team2', or 'draw'

# Encode target variable with a separate encoder
result_encoder = LabelEncoder()
data['match_result_encoded'] = result_encoder.fit_transform(data[target])

X = data[features]
y = data['match_result_encoded']

4. Split Data into Training and Test Sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

5. Train a Model
# Initialize and train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

6. Make Predictions and Evaluate the Model


# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=result_encoder.classes_)

print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", report)

7. Example Data

Example CSV Structure (cricket_matches.csv):


team1,team2,team1_score,team2_score,match_result
Team A,Team B,250,230,Team A
Team C,Team D,300,280,Team C


Experiment No. :- 06
Implementing Machine Translation from English to Hindi

Implementing machine translation from English to Hindi involves creating a


system that can automatically translate text from English to Hindi. This can be
accomplished using various approaches, including statistical machine
translation (SMT) and neural machine translation (NMT). Today, NMT,
especially with transformer-based models, is the most advanced and commonly
used approach.

Here’s a step-by-step guide to implementing an English-to-Hindi machine


translation system using pre-trained models from the Hugging Face
Transformers library, which provides state-of-the-art models and tools for NLP
tasks.

1. Install Necessary Libraries

First, ensure you have the required libraries installed. You’ll need transformers
and torch.

pip install transformers torch

2. Load a Pre-trained Model

Hugging Face’s transformers library provides pre-trained translation models


that can be used out-of-the-box. For this example, we will use the
Helsinki-NLP/opus-mt-en-hi model, which is specifically trained for
English-to-Hindi translation.

Python Code:
from transformers import MarianMTModel, MarianTokenizer

# Load pre-trained model and tokenizer


model_name = 'Helsinki-NLP/opus-mt-en-hi'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate_text(text):
    # Tokenize input text
    tokens = tokenizer(text, return_tensors='pt', padding=True)

    # Perform translation
    translated_tokens = model.generate(**tokens)

    # Decode the translated tokens
    translated_text = tokenizer.decode(translated_tokens[0],
                                       skip_special_tokens=True)

    return translated_text

# Example usage
english_text = "Hello, how are you?"
hindi_translation = translate_text(english_text)

print(f"English Text: {english_text}")


print(f"Hindi Translation: {hindi_translation}")

3. Explanation of the Code

1. Loading the Model and Tokenizer:


○ MarianTokenizer and MarianMTModel from the transformers
library are used to load the pre-trained English-to-Hindi translation
model.
2. Translation Function:
○ Tokenization: The input English text is tokenized into a format
suitable for the model.
○ Translation: The model generates the Hindi translation based on
the tokenized input.
○ Decoding: The translated tokens are decoded back into
human-readable text.

4. Test the Model

You can test the model with different English sentences to see how well it
translates them into Hindi. Simply replace the english_text variable with other
sentences.
5. Advanced Considerations

● Fine-tuning: If you have specific domain data, you can fine-tune the
model on your own dataset for better accuracy.
● Data Preprocessing: Ensure your text data is clean and free from
unwanted characters to get better translation results.
● Error Handling: Implement error handling for cases where the model
might produce unexpected results or fail to translate.

6. Example Output

Here’s an example of what the output might look like:

English Text: Hello, how are you?

Hindi Translation: नमस्ते, आप कैसे हैं?

7. Conclusion

This guide demonstrates how to implement a machine translation system from


English to Hindi using a pre-trained model with the Hugging Face Transformers
library. The provided code is a starting point and can be extended with more
features, such as batch translation, integration into applications, or additional
language support. For more advanced use cases, consider exploring custom
fine-tuning and evaluation techniques to tailor the model to specific needs.
Experiment No. :- 07
Study of Query Expansion for Information Retrieval

Query expansion is a technique used in information retrieval systems to


improve the effectiveness of search queries. By expanding or refining a user’s
original query, the system aims to retrieve more relevant documents. Here’s an
overview of the study of query expansion:

1. Basic Concepts

● Query Expansion: The process of reformulating a user’s search query to


improve the chances of retrieving relevant documents. This involves
adding terms to the original query to cover more related topics or
synonyms.
● Information Retrieval (IR): The field concerned with obtaining
information from a large repository, such as a database or the internet,
based on user queries.

2. Types of Query Expansion

1. Synonym Expansion:
○ Replaces terms in the query with synonyms or related terms. For
instance, “car” might be expanded to include “automobile” or
“vehicle.” (A small WordNet-based sketch follows this list.)
2. Stemming and Lemmatization:
○ Stemming: Reduces words to their root form (e.g., “running”
becomes “run”).
○ Lemmatization: Reduces words to their base or dictionary form,
often considering the context (e.g., “better” becomes “good”).
3. Contextual Expansion:
○ Uses the context of the original query to add related terms. For
example, adding terms related to the query’s subject matter or
context.
4. Relevance Feedback:
○ Involves using feedback from the user about the relevance of
retrieved documents to adjust and expand the query. This can be
explicit (user provides feedback) or implicit (based on user
behavior).
5. Thesaurus-based Expansion:
○ Uses a thesaurus or ontology to find related terms or concepts to
include in the query.
6. Probabilistic Models:
○ Utilizes statistical models to determine which terms should be
added to the query based on their likelihood of improving retrieval
performance.
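
A minimal sketch of the synonym expansion mentioned in type 1 above, using NLTK's WordNet to add synonyms of each query term (the query and expansion strategy are illustrative):

import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')

def expand_query(query):
    # Start with the original terms, then add WordNet synonyms for each term
    expanded = set(query.lower().split())
    for term in query.lower().split():
        for synset in wordnet.synsets(term):
            for lemma in synset.lemmas():
                expanded.add(lemma.name().replace('_', ' ').lower())
    return expanded

print(expand_query("car repair"))
# e.g. {'car', 'automobile', 'auto', 'machine', 'repair', 'fix', 'mend', ...}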
3. Historical Background

The work on query expansion following relevance feedback dates back to 1965,
when Rocchio formalized relevance feedback in the vector-space model. Early
work on using collection-based term co-occurrence statistics to select query
expansion terms was done by Spärck Jones and van Rijsbergen.

4. Techniques and Approaches

1. Automatic Expansion:
○ Algorithms and models that automatically determine expansion
terms without user intervention. Examples include Latent Semantic
Analysis (LSA) and Latent Dirichlet Allocation (LDA).
2. Manual Expansion:
○ Users or experts manually add terms to improve the query. This
can be more tailored but requires more effort.
3. Machine Learning-Based Expansion:
○ Utilizes machine learning algorithms to learn patterns from past
searches and expansions to make informed decisions about query
expansion.

5. Challenges in Query Expansion

● Ambiguity: Identifying the correct sense of a word or phrase can be


challenging, leading to irrelevant expansions.
● Over-expansion: Adding too many terms can dilute the relevance of the
query, making the search results less precise.
● Performance: Expansion algorithms can be computationally expensive
and may impact system performance, especially with large-scale datasets.

6. Evaluation Metrics

● Precision: The proportion of relevant documents retrieved out of all


retrieved documents.
● Recall: The proportion of relevant documents retrieved out of all relevant
documents available.
● F1 Score: The harmonic mean of precision and recall.
● Mean Average Precision (MAP): Measures the average precision of the
retrieved documents.
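
A small worked example of the first three metrics, with an illustrative set of retrieved and relevant documents:

# Suppose a query retrieves 4 documents, 2 of which are among the 3 relevant ones
retrieved = {'d1', 'd2', 'd3', 'd4'}
relevant = {'d2', 'd4', 'd7'}

true_positives = len(retrieved & relevant)          # 2
precision = true_positives / len(retrieved)         # 2/4 = 0.50
recall = true_positives / len(relevant)             # 2/3 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.57

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")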

7. Recent Trends and Advances

● Neural Query Expansion: Leveraging deep learning models and


embeddings (like Word2Vec, BERT) to understand and expand queries
more contextually.
● Context-Aware Systems: Incorporating user context, such as search
history or location, to make more intelligent query expansions.
● Interactive Systems: Combining query expansion with interactive search
interfaces where users can refine their queries dynamically.

Query expansion continues to be a dynamic area of research in information


retrieval, with ongoing advancements aiming to enhance search quality and user
satisfaction.
Experiment No. :- 08
Study of Emotion Detection in Texts

Emotion detection in text is essentially a content-based classification challenge
that combines concepts from natural language processing and machine learning.
This experiment surveys algorithms for identifying emotions from textual data.
Emotion detection in texts, often referred to as sentiment analysis or affective
computing, is an important area of natural language processing (NLP) that
focuses on identifying and classifying emotions conveyed in written language. It
has various applications, from understanding customer feedback to improving
human-computer interactions. Here's an overview of the field:

1. Basic Concepts
● Emotion Detection: The process of identifying and categorizing
emotions expressed in textual data. This can include emotions like joy,
sadness, anger, surprise, fear, and disgust.
● Sentiment Analysis: A related field that typically focuses on identifying
the overall sentiment of a text as positive, negative, or neutral. Emotion
detection can be seen as a more granular approach to sentiment analysis.

2. Types of Emotions
Different models categorize emotions in various ways. Common frameworks
include:

1. Basic Emotions:
○ Paul Ekman’s model identifies basic emotions such as happiness,
sadness, anger, surprise, disgust, and fear.
2. Dimensional Models:
○ The Pleasure-Arousal-Dominance (PAD) model describes
emotions along three dimensions: pleasure (valence), arousal
(intensity), and dominance (control).
3. Extended Models:
○ Models like the Geneva Affective Picture Database (GAPD)
provide more nuanced categories, including secondary emotions
and mixed emotions.

3. Techniques and Approaches


1. Rule-Based Methods:
○ Use predefined lists of words (lexicons) associated with specific
emotions. These methods rely on simple matching of words or
phrases with emotion-labeled dictionaries.
2. Machine Learning-Based Methods:
○ Supervised Learning: Requires labeled training data to build
models that can classify emotions. Techniques include:
■ Naive Bayes: A probabilistic classifier based on Bayes'
theorem.
■ Support Vector Machines (SVM): Finds the optimal
hyperplane to separate different emotion classes.
■ Decision Trees: Classifies based on feature values and
decision rules.
○ Deep Learning: Uses neural networks to capture complex patterns
in data. Techniques include:
■ Recurrent Neural Networks (RNNs): Suitable for
sequential data like text.
■ Long Short-Term Memory Networks (LSTMs): A type of
RNN that handles long-term dependencies.
■ Transformers: Advanced architectures like BERT and GPT
that leverage contextual embeddings for more accurate
emotion detection.
3. Lexicon-Based Methods:
○ Utilize sentiment and emotion lexicons, such as WordNet-Affect
or SentiWordNet, to identify and analyze emotional content based
on word associations.
4. Hybrid Approaches:
○ Combine rule-based and machine learning methods to leverage the
strengths of both approaches.
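
As an illustration of the transformer-based approach described above, the following is a minimal sketch using the Hugging Face pipeline API (the model checkpoint named below is an assumption; any emotion-classification model from the Hub can be substituted):

from transformers import pipeline

# Load a pre-trained emotion classification model (downloaded on first use)
classifier = pipeline("text-classification",
                      model="j-hartmann/emotion-english-distilroberta-base")

texts = ["I am so happy with this product!",
         "This is the worst day of my life."]

for text in texts:
    result = classifier(text)[0]
    print(f"{text} -> {result['label']} ({result['score']:.2f})")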

4. Challenges in Emotion Detection


1. Ambiguity and Sarcasm:
○ Detecting sarcasm, irony, or ambiguous language can be difficult
and often requires advanced understanding of context.
2. Context and Cultural Differences:
○ Emotions can be expressed differently across cultures and contexts,
impacting the accuracy of detection models.
3. Subtle and Mixed Emotions:
○ Identifying subtle or mixed emotions in texts can be challenging.
People often express complex feelings that do not fit neatly into
predefined categories.
4. Domain-Specific Language:
○ Emotion detection models may need to be adapted for specific
domains or types of text, such as social media, reviews, or clinical
notes.

5. Applications
1. Customer Sentiment Analysis:
○ Understanding customer feedback and improving customer service
by analyzing emotional content in reviews and surveys.
2. Social Media Monitoring:
○ Tracking public sentiment and emotions about brands, events, or
political issues.
3. Healthcare:
○ Analyzing patient feedback or clinical notes to identify emotional
states and mental health conditions.
4. Entertainment and Media:
○ Enhancing user experiences in games, movies, and content
recommendations by analyzing emotional responses.
5. Human-Computer Interaction:
○ Improving interactions between users and virtual assistants or
chatbots by understanding user emotions.

6. Recent Trends and Advances


1. Multimodal Emotion Detection:
○ Integrating text with other modalities like audio and video to
enhance emotion detection accuracy.
2. Transfer Learning:
○ Leveraging pre-trained models on large datasets and fine-tuning
them for specific emotion detection tasks.
3. Emotion Recognition in Diverse Languages:
○ Expanding models to work effectively across multiple languages
and dialects.
4. Ethical Considerations:
○ Addressing concerns related to privacy, data security, and the
ethical use of emotion detection technologies.

Emotion detection in texts is a dynamic field that combines computational


techniques with psychological insights to better understand and respond to
human emotions in various contexts.
