1 Imagine two students, Priya and Arjun, in a university library where a group nearby is talking
loudly. Priya says to Arjun, "This noise is making it hard to study."
a. Discourse Meaning: Analyze Priya’s utterance from a discourse meaning perspective. What
is the primary information conveyed at this level?
b. Pragmatic Meaning: Analyze the utterance from a pragmatic meaning perspective. How
does the pragmatic interpretation build upon or differ from the discourse meaning in this
context?
c. Pragmatic Inferences: Identify at least two potential pragmatic inferences Arjun might make
based on Priya’s statement, considering the library context and their relationship as students.
2 Consider the following two sentences:
Sentence 1: "The clever rabbit runs from danger."
Sentence 2: "The fox chases the rabbit quickly."
a. Vocabulary Creation: Create a vocabulary of unique words from both sentences, listing
them in alphabetical order.
b. Bag-of-Words Representation: Represent each sentence using the Bag-of-Words (BOW)
model. Provide the frequency count for each word in the vocabulary for both sentences.
c. One-Hot Encoding: Create One-Hot Encoding (OHE) vectors for the words “rabbit” and
“fox” based on the vocabulary, using the alphabetical order to determine the index in the
OHE vector.
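For illustration, a minimal Python sketch of parts (a)–(c), assuming lowercasing and stripping the final period (choices the question deliberately leaves open):

```python
from collections import Counter

sent1 = "The clever rabbit runs from danger."
sent2 = "The fox chases the rabbit quickly."

def tokenize(text):
    # Lowercase and strip trailing punctuation (an assumption, not mandated by the question).
    return [w.strip(".").lower() for w in text.split()]

tokens1, tokens2 = tokenize(sent1), tokenize(sent2)

# (a) Alphabetically ordered vocabulary of unique words from both sentences.
vocab = sorted(set(tokens1) | set(tokens2))

# (b) Bag-of-Words frequency counts over the shared vocabulary.
bow1 = [Counter(tokens1)[w] for w in vocab]
bow2 = [Counter(tokens2)[w] for w in vocab]

# (c) One-hot vectors: a 1 at the word's alphabetical index, 0 elsewhere.
def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

print(vocab)
print(bow1, bow2)
print(one_hot("rabbit"), one_hot("fox"))
```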
3 Analyze the limitations of the Hidden Markov Model (HMM) in Natural Language Processing
(NLP) when applied to tasks like part-of-speech tagging or speech recognition.
a. Key Limitations: Identify and explain three main limitations of HMMs in handling complex
language tasks.
b. Impact on Performance: Discuss how these limitations affect HMM performance in a
specific NLP task (e.g., part-of-speech tagging for social media text).
c. Alternatives: Suggest one alternative model or approach that addresses at least one of
these limitations, and briefly explain why it’s more effective.
4 Using the corpus: “dog runs in the park”
Tokenize the corpus and list all tokens. Create a vocabulary with word-to-ID mapping. Justify
your tokenization choices (e.g., handling punctuation, case sensitivity).
Choose a window size for a Skip-Gram model, justify your choice, and generate all (target,
context) training pairs from the numerically encoded corpus.
Discuss two advantages and two disadvantages of the Skip-Gram model for word embedding
generation in NLP.
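One possible sketch of the mechanical parts of this question, assuming a window size of 2 and whitespace tokenization of the lowercased corpus (both are choices the question asks you to justify):

```python
corpus = "dog runs in the park"
tokens = corpus.lower().split()          # ['dog', 'runs', 'in', 'the', 'park']

# Word-to-ID mapping in order of first appearance.
word_to_id = {w: i for i, w in enumerate(dict.fromkeys(tokens))}
encoded = [word_to_id[w] for w in tokens]

# Generate (target, context) pairs for a Skip-Gram model with window size 2.
window = 2
pairs = []
for i, target in enumerate(encoded):
    for j in range(max(0, i - window), min(len(encoded), i + window + 1)):
        if j != i:
            pairs.append((target, encoded[j]))

print(word_to_id)
print(pairs)
```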
5 Explain the core principle of the Lesk Algorithm for Word Sense Disambiguation (WSD).
Describe how it leverages dictionary definitions to disambiguate a polysemous word in a
specific context.
a. Core Principle: Outline the Lesk Algorithm’s approach, focusing on how it compares word
contexts to dictionary senses.
b. Application Example: Consider the sentence “She deposited money in the bank.” Apply the
Lesk Algorithm to disambiguate “bank” (financial institution vs. riverbank). Assume dictionary
definitions:
Sense 1 (financial): “A place where money is stored or managed.”
Sense 2 (riverbank): “The edge of a river or stream.”
Show the overlap between the sentence context and each sense, and determine the correct sense.
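An illustrative, simplified Lesk-style overlap count for part (b), assuming whitespace tokenization and a small hand-picked stopword list (a full implementation would preprocess the glosses more carefully):

```python
def content_words(text, stopwords={"a", "the", "in", "of", "or", "is", "she"}):
    # Lowercase, strip punctuation, and drop a few function words.
    return {w.strip(".,").lower() for w in text.split()} - stopwords

sentence = "She deposited money in the bank."
senses = {
    "financial": "A place where money is stored or managed.",
    "riverbank": "The edge of a river or stream.",
}

context = content_words(sentence)
# Pick the sense whose definition shares the most content words with the sentence context.
overlaps = {name: len(context & content_words(gloss)) for name, gloss in senses.items()}
best = max(overlaps, key=overlaps.get)
print(overlaps, "->", best)
```

Here the financial sense overlaps with the context on “money”, while the riverbank sense has no overlap, so the algorithm selects the financial sense.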
7 Given the text: “The fasttt fox leaps over the idle dog at the field!! #fox #wildfox #fast”
Process this text using tokenization, stopword removal, and lemmatization.
a. Tokenization: Tokenize the text, listing all tokens. Explain your tokenization choices (e.g.,
handling hashtags, extra letters, punctuation).
b. Stopword Removal: Remove stopwords from the tokens, using a standard stopword list
(e.g., “the”, “at”, “over”). List the remaining tokens and justify any exclusions.
c. Lemmatization: Apply lemmatization to the remaining tokens. Provide the final processed
output and explain how lemmatization affects each token (e.g., “leaps” to “leap”).
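A minimal pure-Python sketch of the three steps, assuming hashtags are kept as single tokens, the repeated letters in “fasttt” are collapsed, and a tiny illustrative stopword list and lemma table (in practice a library such as NLTK or spaCy would supply these resources):

```python
import re

text = "The fasttt fox leaps over the idle dog at the field!! #fox #wildfox #fast"

# (a) Tokenization: keep hashtags intact, drop punctuation, collapse 3+ repeated letters.
raw = re.findall(r"#\w+|\w+", text.lower())
tokens = [re.sub(r"(.)\1{2,}", r"\1", t) for t in raw]

# (b) Stopword removal with a small illustrative list.
stopwords = {"the", "at", "over", "a", "an", "in"}
content = [t for t in tokens if t not in stopwords]

# (c) Lemmatization via a tiny hand-written lemma table (hypothetical; a real
# lemmatizer would consult a dictionary such as WordNet and use POS information).
lemmas = {"leaps": "leap"}
processed = [lemmas.get(t, t) for t in content]

print(tokens)
print(content)
print(processed)
```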
8 Consider a collection of three documents:
Document 1: “The new algorithm improves performance.”
Document 2: “Performance of the algorithm is key.”
Document 3: “This algorithm is efficient.”
a. Term Frequency (TF): Compute the TF of “algorithm” in Document 1.
b. Inverse Document Frequency (IDF): Calculate the IDF of “algorithm” across the collection.
c. TF-IDF Score: Multiply the TF and IDF values to obtain the TF-IDF score for “algorithm” in
Document 1.
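A short sketch of the computation, assuming TF is the raw count normalized by document length and idf(t) = log(N / df(t)) with the natural log; other textbook variants (raw counts, log base 10, +1 smoothing) are equally acceptable since the question does not pin one down:

```python
import math

docs = [
    "The new algorithm improves performance.",
    "Performance of the algorithm is key.",
    "This algorithm is efficient.",
]
tokenized = [[w.strip(".").lower() for w in d.split()] for d in docs]

term = "algorithm"
# (a) Term frequency of "algorithm" in Document 1 (count / document length).
tf = tokenized[0].count(term) / len(tokenized[0])
# (b) Inverse document frequency across the three documents.
df = sum(term in doc for doc in tokenized)
idf = math.log(len(docs) / df)
# (c) TF-IDF score for "algorithm" in Document 1.
print(tf, idf, tf * idf)
```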
9 S → NP VP
NP → DT NN
VP → VB NP
DT → the
NN → cat | dog
VB → chases
Sentence: "the cat chases the dog"
Task: Determine if the sentence is valid (i.e., can be derived from the grammar) using the CKY
algorithm.
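A compact CKY recognizer sketch for this grammar (which is already in Chomsky Normal Form), checking whether S can span the whole sentence:

```python
from itertools import product

# Binary and lexical (terminal) rules of the given grammar.
binary = {("NP", "VP"): "S", ("DT", "NN"): "NP", ("VB", "NP"): "VP"}
lexical = {"the": {"DT"}, "cat": {"NN"}, "dog": {"NN"}, "chases": {"VB"}}

words = "the cat chases the dog".split()
n = len(words)
# table[i][j] holds the non-terminals that derive words[i..j].
table = [[set() for _ in range(n)] for _ in range(n)]

for i, w in enumerate(words):
    table[i][i] = set(lexical[w])

for span in range(2, n + 1):                  # span length
    for i in range(n - span + 1):             # start index
        j = i + span - 1                      # end index
        for k in range(i, j):                 # split point
            for left, right in product(table[i][k], table[k + 1][j]):
                if (left, right) in binary:
                    table[i][j].add(binary[(left, right)])

print("valid" if "S" in table[0][n - 1] else "invalid")
```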
10 Given a trigram model with vocabulary {the, cat, eats, fish} (size 4), and counts:
Trigrams: the cat eats: 5, cat eats fish: 3, others: 0
Bigrams: the cat: 7, cat eats: 4, eats fish: 3, others: 0
Total trigrams: 20
Using add-one smoothing:
- Compute smoothed trigram probabilities P(eats | the cat) and P(fish | cat eats).
- Calculate the sentence probability P(the cat eats fish)
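A small sketch of the add-one (Laplace) smoothed computation, assuming the usual formulation P(w3 | w1 w2) = (C(w1 w2 w3) + 1) / (C(w1 w2) + V) with V = 4, and taking the sentence probability as the product of the two smoothed trigram probabilities (the question leaves open how the first two words are modeled):

```python
V = 4  # vocabulary size: {the, cat, eats, fish}
trigram_counts = {("the", "cat", "eats"): 5, ("cat", "eats", "fish"): 3}
bigram_counts = {("the", "cat"): 7, ("cat", "eats"): 4, ("eats", "fish"): 3}

def p_add_one(w1, w2, w3):
    # Add-one smoothed trigram probability P(w3 | w1 w2).
    return (trigram_counts.get((w1, w2, w3), 0) + 1) / (bigram_counts.get((w1, w2), 0) + V)

p1 = p_add_one("the", "cat", "eats")    # (5 + 1) / (7 + 4)
p2 = p_add_one("cat", "eats", "fish")   # (3 + 1) / (4 + 4)
print(p1, p2, p1 * p2)
```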
11 Scenario: A startup is building a chatbot to handle customer inquiries for an e-commerce
platform. The chatbot needs to understand user queries like "Where is my order?" and
respond appropriately.
Question: Explain how Natural Language Processing (NLP) can enable the chatbot to achieve
this goal. Describe two key NLP tasks involved and how they contribute to understanding and
responding to customer queries.
12 Scenario: A news agency wants to analyze thousands of articles to identify trending topics.
The raw text contains uppercase letters, punctuation, and irrelevant words like "the" and "is."
Question: Design a text preprocessing pipeline for this task. List at least four preprocessing
steps, explain their purpose, and provide an example of how the sentence "The Quick Fox
Jumps!" is transformed after each step.
13 Scenario: A language learning app aims to help students understand sentence structure by
highlighting parts of speech in sentences like "The dog runs quickly."
Question: Explain how POS tagging can be used to achieve this. Provide the expected POS
tags for the given sentence and describe one challenge in accurately tagging ambiguous
words like "runs" (verb vs. noun).
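If a library-based illustration is wanted, NLTK’s off-the-shelf tagger (a sketch; it assumes the tokenizer and tagger resources have been downloaded) produces Penn Treebank tags for the example sentence:

```python
import nltk

# One-time downloads, uncomment if the resources are missing:
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The dog runs quickly.")
print(nltk.pos_tag(tokens))
# Expected output along the lines of:
# [('The', 'DT'), ('dog', 'NN'), ('runs', 'VBZ'), ('quickly', 'RB'), ('.', '.')]
# "runs" is tagged VBZ here, but in a noun context ("the runs") it would need NNS --
# exactly the ambiguity the question asks about.
```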
14 Why is ambiguity a major challenge in NLP? Provide an example of a sentence with multiple
interpretations.
15 A blog analysis tool processes posts with mixed case and punctuation, like "Amazing Trip!!!".
Show the output after applying tokenization, lowercasing, and stop word removal.
16 What is the difference between stemming and lemmatization? Provide an example where
lemmatization is preferred.
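A brief NLTK-based sketch contrasting the two (assuming the WordNet data has been downloaded): stemming clips suffixes heuristically, while lemmatization maps each word to a dictionary form.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download("wordnet")   # one-time download, if needed

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

print(stemmer.stem("studies"), lemmatizer.lemmatize("studies", pos="v"))  # studi vs study
print(stemmer.stem("better"), lemmatizer.lemmatize("better", pos="a"))    # better vs good
```

The second line is a case where lemmatization is clearly preferable: the stemmer leaves “better” unchanged, while the lemmatizer maps it to its base adjective “good”.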
17 A language learning tool tags words in "She runs fast" to teach grammar. Provide the POS tags
and explain how one ambiguous word could be mistagged.
18 What are two limitations of the Bag of Words model in capturing text meaning?
19 Explain how TF-IDF weights words differently from raw frequency counts. Why is the IDF
component important?
20 Compare one-hot encoding to dense embeddings like Word2Vec in terms of memory
efficiency and semantic representation.
21 Explain how Word2Vec creates word embeddings and what makes them capture semantic
relationships.
22 A job portal matches resumes to job postings, e.g., linking "engineer" to "technician". Explain
how Word2Vec helps and suggest whether CBOW or Skip-gram is better for this task.
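A gensim-based sketch (the corpus and hyperparameters here are purely illustrative assumptions) showing how a Word2Vec model could be trained on tokenized resume/job-posting text and queried for terms related to “engineer”; sg=1 selects Skip-gram, sg=0 selects CBOW:

```python
from gensim.models import Word2Vec

# Hypothetical tokenized corpus of resume / job-posting sentences.
sentences = [
    ["senior", "software", "engineer", "with", "python", "experience"],
    ["maintenance", "technician", "for", "industrial", "equipment"],
    ["electrical", "engineer", "and", "field", "technician", "roles"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1: Skip-gram
print(model.wv.most_similar("engineer", topn=3))
```

With a toy corpus of this size the similarity scores are not meaningful; the point is only the API shape and the sg switch between Skip-gram and CBOW.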
23 Explain the difference between syntactic, semantic, and pragmatic analysis in NLP. Provide an
example of how each level of analysis contributes to understanding a sentence like "Can you
open the window?"
24 Describe the mathematical formulation of TF-IDF. Why does the inverse document frequency
(IDF) component amplify the importance of rare terms in a corpus?
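For reference, one standard formulation (variants differ in log base and smoothing) can be written as:

```latex
\mathrm{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

where N is the total number of documents and df(t) is the number of documents containing t; because df(t) sits in the denominator inside the logarithm, rare terms receive larger idf values and hence larger weights, which is the amplification the question asks about.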
25 Why does the Bag of Words model fail to capture contextual relationships between words?
Discuss how this limitation impacts tasks like sentiment analysis.
26 Explain why one-hot encoding leads to sparse representations in NLP. How does this sparsity
affect the scalability of models for large vocabularies?
27 Discuss the role of discourse analysis in NLP. How does it differ from syntactic and semantic
analysis in processing multi-sentence texts?
28 Describe how n-grams balance context and computational complexity in language modeling.
Why does increasing n (e.g., from bigrams to trigrams) improve accuracy but exacerbate data
sparsity?