2/7/22, 8:34 AM IR Midsem
IR Midsem
Description:
1. The exam contains 24 MCQs.
2. There may be more than one option correct for each question.
3. Some questions are worth 1 point; the rest are worth 2 points.
4. There is no partial marking. Full marks will be awarded for a question if and only if all
correct and no wrong options are selected.
5. No negative marking.
Important Guidelines:
1. Open book
2. You may use a calculator (do not use a mobile phone calculator).
3. Kindly ensure your videos are on.
4. No extension will be given.
Your email will be recorded when you submit this form
If we use bigram indexes, which of the following words would be falsely enumerated by co*me? (2 points)
come
comment
income
coulome
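As background for the question above, here is a minimal sketch of how a bigram index can enumerate terms that do not actually match a wildcard query. It assumes the common convention of padding terms with a `$` boundary marker; the function names are illustrative and not taken from any particular library.

```python
from fnmatch import fnmatchcase


def kgrams(term, k=2):
    """Character k-grams of a term padded with '$' boundary markers."""
    padded = f"${term}$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def wildcard_kgrams(pattern, k=2):
    """k-grams that every term matching a '*' wildcard pattern must contain."""
    grams = set()
    for piece in f"${pattern}$".split("*"):
        grams |= {piece[i:i + k] for i in range(len(piece) - k + 1)}
    return grams


def enumerate_candidates(pattern, vocabulary, k=2):
    """Terms the k-gram index returns: those containing all required k-grams."""
    required = wildcard_kgrams(pattern, k)
    return [w for w in vocabulary if required <= kgrams(w, k)]


def false_positives(pattern, vocabulary, k=2):
    """Candidates that pass the k-gram filter but fail the real wildcard match."""
    return [w for w in enumerate_candidates(pattern, vocabulary, k)
            if not fnmatchcase(w, pattern)]
```

For instance, `false_positives("red*", ["red", "retired", "reduce"])` flags `retired`, which contains the bigrams `$r`, `re`, and `ed` without starting with `red`. This is why a post-filtering step is normally applied after the index lookup.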
https://docs.google.com/forms/d/e/1FAIpQLSfLz0PuhiDjyQ5fV4QDqfagw-H3-9EuV4Iqvmn7ZZM_Qx0UJg/viewform 1/11
What is the minimum number of copies of data maintained in HDFS? (1 point)
What is the idf of a term that occurs in every document? (1 point)
log10(N)
log10(1/N)
Rank the following documents in decreasing order of their tf-idf score with respect to the query “All vehicles including car auto bike bus are stopped due to accident”. Vocabulary = {car, auto, bike, bus}. (Use tf-idf = tf × idf.) (2 points)
doc1, doc2 , doc3
doc2, doc3, doc1
doc3, doc2, doc1
doc1, doc3, doc2
In logarithmic merge with n = 3, we have 47 tokens to process. Which indexes, including the auxiliary index (Z0, I0, I1, I2, I3, I4), are in use after all the tokens have been processed? See the table for representation. (Consider Z0 < n.) (2 points)
1, 1, 1, 0, 1, 0
0, 1, 1, 1, 1, 0
1, 1, 1, 1, 1, 0
0, 0, 0, 1, 1, 1
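The bookkeeping behind logarithmic merging can be sketched as a small simulation. This is a simplified model under stated assumptions: the in-memory index Z0 is flushed as soon as it holds n tokens, and a flush cascades through I0, I1, ... like a binary counter. The function name and the `levels` parameter are hypothetical.

```python
def logarithmic_merge_state(num_tokens, n, levels=5):
    """Return usage flags [Z0, I0, I1, ...] after processing num_tokens tokens.

    Assumes Z0 is flushed when it reaches n tokens, and that a flush merges
    with every occupied on-disk index it meets until it finds a free slot.
    """
    z0 = 0
    on_disk = [False] * levels
    for _ in range(num_tokens):
        z0 += 1
        if z0 == n:
            i = 0
            # Merge with each occupied index in turn, freeing it as we go.
            while i < levels and on_disk[i]:
                on_disk[i] = False
                i += 1
            if i == levels:
                raise OverflowError("not enough on-disk index levels")
            on_disk[i] = True
            z0 = 0
    return [int(z0 > 0)] + [int(flag) for flag in on_disk]
```

Calling `logarithmic_merge_state(47, 3)` returns the flags in the order Z0, I0, I1, I2, I3, I4 used by the options.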
Which of the following are functions of the parser in distributed indexing? (1 point)
Sorts and writes to a posting list.
Writes pairs into k partitions, where k ∈ N.
Reads document at a time and emits a pair.
Assigns a split into an idle machine.
Collects all pairs for one partition
How would the wildcard query qu*ry be expressed for lookup in the permuterm index? (1 point)
ry$*qu
ry$qu*
$qu*ry
qu*ry$
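For reference, the rotation step behind a permuterm lookup can be sketched as follows. The sketch assumes a query with exactly one `*` and uses `$` as the end-of-term marker; both function names are illustrative.

```python
def permuterm_rotations(term):
    """All rotations of term + '$', as they would be stored in the index."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]


def permuterm_lookup_key(pattern):
    """Rotate a single-wildcard query X*Y into Y$X* so the '*' ends up last."""
    prefix, suffix = pattern.split("*")  # raises ValueError unless exactly one '*'
    return f"{suffix}${prefix}*"
```

The lookup then becomes a prefix search over the stored rotations: any term with a rotation starting with the part of the key before `*` is a match.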
Compute the edit distance between “cats” and “fast” (with insertion, deletion, and substitution only). (2 points)
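A minimal sketch of the standard dynamic-programming (Levenshtein) computation, assuming unit cost for each insertion, deletion, and substitution. The function name is illustrative.

```python
def edit_distance(s1, s2):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # dist[i][j] = edit distance between s1[:i] and s2[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[-1][-1]
```

The table has (|s1| + 1) × (|s2| + 1) cells, so the running time is proportional to the product of the two string lengths.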
Which of the following does not improve the performance of distributed processing? (2 points)
None of the above
maintaining a checksum of data
replication of data
partitioning of data
Which of the following cannot run on HDFS? (1 point)
MapReduce
Spark
Oracle Database
HBase
Real-time processing is also known as: (2 points)
Processing a group of events in less than a minute
Per day processing
Per event processing
Per hour processing
In which language is MapReduce written? (1 point)
Python
C++
Java
Scala
The observed word is “acress”. Use the table below to find the most suitable correct word. (The dictionary contains only the candidate words.) (2 points)
across
actress
access
acres
Which of the following is not a data ingestion tool? (2 points)
Spark
Kafka
Flume
Sqoop
What is the purpose of the NameNode? (2 points)
Store data
None of the above
Store metadata
Schedule jobs
For the query 'bord', state the word from the dictionary that has the second-lowest Jaccard coefficient using a character 2-gram index. Dictionary = {aboard, border, dropped, lord}. (2 points)
border
lord
dropped
aboard
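The Jaccard coefficient over character 2-grams can be sketched as below. Whether terms are padded with a boundary marker changes the numbers, so padding is exposed as an explicit option and is off by default; this is an assumption, since the question does not specify it. The function names are illustrative.

```python
def char_ngrams(word, k=2, pad=False):
    """Set of character k-grams; optionally pad the word with '$' markers."""
    text = f"${word}$" if pad else word
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(query, word, k=2, pad=False):
    """|A ∩ B| / |A ∪ B| over the k-gram sets of the two strings."""
    a, b = char_ngrams(query, k, pad), char_ngrams(word, k, pad)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_jaccard(query, dictionary, k=2, pad=False):
    """Dictionary words sorted from lowest to highest Jaccard coefficient."""
    return sorted(dictionary, key=lambda w: jaccard(query, w, k, pad))
```

A call such as `rank_by_jaccard("bord", ["aboard", "border", "dropped", "lord"])` orders the dictionary so that the second element is the word with the second-lowest coefficient.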
The edit distance between any two strings s1 and s2 is upper bounded by? (|s| denotes the length of string s.) (1 point)
|s1| - |s2|
min( |s1| , |s2| )
max( |s1| , |s2| )
|s1| + |s2|
Can the tf-idf weight of a term in a document exceed 1? (1 point)
True
False
Suppose the length of a song's embedding vector is directly proportional to the song's popularity. You want to calculate the similarity between songs. Which of the following is/are true? (2 points)
If you switch from cosine similarity to dot product, popular songs become more
similar to only other popular songs.
If you switch from cosine similarity to dot product, popular songs become more
similar to all songs in general.
If you switch from dot product to cosine similarity, popular songs become less similar
than less popular songs.
If you switch from dot product to cosine similarity, popular songs become more
similar than less popular songs.
No change in song similarities when switching from cosine similarity to dot product
No change in song similarities when switching from dot product to cosine similarity
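To make the difference between the two measures concrete, here is a small sketch with made-up 2-dimensional vectors (the values are purely illustrative). Cosine similarity divides out vector length, while the dot product scales with it.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical embeddings: same direction, different lengths.
popular_song = (6.0, 8.0)  # long vector, standing in for a popular song
niche_song = (0.6, 0.8)    # short vector, standing in for a less popular song
other_song = (1.0, 0.0)

cosine_scores = (cosine(popular_song, other_song), cosine(niche_song, other_song))
dot_scores = (dot(popular_song, other_song), dot(niche_song, other_song))
```

Here the two cosine scores coincide because the vectors point the same way, whereas the dot-product score of the longer vector is ten times larger.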
Paragraph for the next 3 questions
Q - a b c d
D1 - a a c c
D2 - b d
Here a, b, c, d are individual tokens.
For the above query (Q) and documents (D1, D2), use the lnc.ltc weighting scheme to compute the ranking score and answer the following:
(Round each calculation to 2 decimal places. Use log10.)
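A rough sketch of lnc.ltc scoring follows: documents use log tf, no idf, and cosine normalization; the query uses log tf, idf, and cosine normalization. It assumes idf is computed over the two listed documents only, and it does not round intermediate values, so hand calculations rounded to 2 decimal places may differ slightly. All function names are illustrative.

```python
import math
from collections import Counter


def log_tf(tokens):
    """Logarithmic term frequency: 1 + log10(tf) for each term present."""
    return {t: 1 + math.log10(c) for t, c in Counter(tokens).items()}


def cosine_normalize(weights):
    """Scale a weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def lnc_document(doc_tokens):
    """lnc: log tf, no idf, cosine normalization."""
    return cosine_normalize(log_tf(doc_tokens))


def ltc_query(query_tokens, collection):
    """ltc: log tf times idf = log10(N / df), then cosine normalization."""
    n = len(collection)
    weights = {}
    for term, ltf in log_tf(query_tokens).items():
        df = sum(1 for doc in collection if term in doc)
        if df:  # terms absent from the collection get no weight
            weights[term] = ltf * math.log10(n / df)
    return cosine_normalize(weights)


def score(query_vec, doc_vec):
    """Dot product of the two normalized vectors."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())


collection = [["a", "a", "c", "c"], ["b", "d"]]  # D1, D2
query_vec = ltc_query(["a", "b", "c", "d"], collection)
scores = [score(query_vec, lnc_document(doc)) for doc in collection]
```

The list `scores` holds the ranking scores for D1 and D2 in that order.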
Q1
Which of the following is/are true? (2 points)
D2 has better/larger score than D1
Whichever has better score, it is by a low margin (|difference| <= 0.02)
Whichever has better score, it is by a high margin (|difference| > 0.02)
D1 has better/larger score than D2
Q2
Now, if we take the Euclidean distance between the normalized vectors (instead of the dot product), which of the following is/are true? (The ranking order referred to in this question is the one obtained in Q1.) (2 points)
The ranking order remains the same and the margin is low (|difference| <= 0.02)
The ranking order remains the same and the margin is high (|difference| > 0.02)
The ranking order remains the same
The ranking order reverses
Q3
Now, if we take the dot product without normalizing the document vectors, which of the following is/are true? (The ranking order referred to in this question is the one obtained in Q1.) (2 points)
The ranking order remains the same and the margin is low (|difference| <= 0.02)
The ranking order remains the same and the margin is high (|difference| > 0.02)
The ranking order reverses
The ranking order remains the same
Paragraph for the next 2 questions
Q - a b c
D1 - a a d
D2 - b c a
D3 - a a
Here a, b, c, d are individual tokens.
While ranking the documents using the Binary Independence Model (BIM), in a particular iteration, we get user feedback which tells us that:
(i) All documents are relevant
(ii) A term/token is relevant to a document if the document contains that specific term/token.
Now for this particular iteration, answer the following:
(Use log10 wherever log is required)
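The smoothed log-odds term weight from the relevance contingency table can be sketched as follows. The parameter names are assumptions made for illustration: `N` is the number of documents, `S` the number of relevant documents, `df` the number of documents containing the term, and `s` the number of relevant documents containing the term.

```python
import math


def bim_term_weight(N, S, df, s, smooth=0.5):
    """Log-odds ratio c_t with `smooth` added to every contingency-table cell.

    Table cells: relevant & present = s, relevant & absent = S - s,
    non-relevant & present = df - s, non-relevant & absent = N - df - S + s.
    """
    relevant_odds = (s + smooth) / (S - s + smooth)
    nonrelevant_odds = (df - s + smooth) / (N - df - S + s + smooth)
    return math.log10(relevant_odds / nonrelevant_odds)


def rsv(query_terms, doc_terms, weights):
    """Retrieval Status Value: sum of c_t over query terms present in the doc."""
    return sum(weights[t] for t in set(query_terms)
               if t in doc_terms and t in weights)
```

Note that `rsv` treats the document as a set of terms, in line with the binary assumption of the model, so repeated occurrences of a term are not counted more than once.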
Q1
Which of the following is/are true? (Hint: Use the contingency table. For smoothing, add 0.5 to every count in the table.) (2 points)
The log-odds ratio for term ‘a’ is 0.845
The log-odds ratio for term ‘c’ is -0.14
The log-odds ratio for term ‘a’ is 0.645
The log-odds ratio for term ‘b’ is -0.22
Q2
Which of the following is/are true? (RSV(D) denotes the Retrieval Status Value for document D.) (2 points)
RSV(D1) = 1.69
RSV(D1) = 1.29
RSV(D1) = RSV(D3)
RSV(D2) = 0.60
This form was created inside of IIIT Delhi.