INTRODUCTION TO INFORMATION
RETRIEVAL
Unit Structure
1.0 Objectives
1.1 Introduction and History of IR
1.2 Components of IR
1.3 Issues related to IR
1.4 Boolean retrieval
1.5 Dictionaries and tolerant retrieval
1.5.1 Search structures for dictionary
1.5.2 Wildcard queries
1.6 Summary
1.7 List of References
1.0 OBJECTIVES
1.1 INTRODUCTION AND HISTORY OF IR
Since the 1950s, text and text documents have been the field's main focus.
Documents come in a wide variety of forms, including web pages, emails,
books, academic papers, and news articles to name just a few. Every one
of these documents has some structure to it, including the title, author,
date, and abstract details related to the content of articles published in
scholarly journals. When referring to database records, the components
of this structure are referred to as attributes or fields. The key distinction
between a document and a normal database record, like one for a bank
account or a ticket reservation, is that a document's content is mostly
presented as text, which is a rather unstructured format.
Consider the details present in the account number and current amount,
two typical account record attributes, to demonstrate this distinction. Both
have very clearly defined formats (a six-digit integer for an account
number, for instance, and a real number with two decimal places for
balance), as well as clearly defined semantics. As a result, it is simple to develop
algorithms to find the records that respond to queries like "Find account
number 321456" or "Find accounts with balances larger than $50,000.00."
It is very easy to compare the values of these attributes.
Think about a recent news report about a bank merger. The headline and the story's source are among the story's attributes, but the story's actual content is what matters most. This important piece of data would normally be recorded in a database system as a single huge attribute with no internal structure. The majority of searches for this topic on search engines like Google will be of the type "bank merger" or "bank takeover." To conduct this search, we must create algorithms that can decide whether the story contains the information sought by comparing the text of the query with the text of the story. Providing a definition for a term, sentence, paragraph, or piece of news is harder than defining an account number, and comparing such texts is correspondingly challenging. The comparison of query text with document text, and the determination of what constitutes a good match, is the foundation of information retrieval.
There are other text-based tasks that are researched in information retrieval besides search based on a user query (sometimes referred to as ad hoc search because the range of possible queries is broad and not prespecified). Filtering, classification, and question answering are additional tasks. Filtering, or monitoring, involves finding stories that are relevant to a person based on their interests and sending them an alert by email or another method. Classification assigns documents to a predetermined set of labels or categories, and question answering returns specific answers to a user's question rather than whole documents.
1.2 COMPONENTS OF IR
The main parts of an IR system are shown in Figure 1. A user's information need, which underpins and motivates the search process, exists prior to performing a search. When this information need is expressed in written form as part of a test collection for IR evaluation, we occasionally refer to it as a topic. The user creates and submits a query to the IR system as a result of her information need. This query usually consists of only a small number of terms, with two to three terms being typical for a Web search. Because a query term might not actually be a word at all, we use "term" instead of "word" here. A query term could be a date, a number, a musical note, or a phrase, depending on the information required. Query terms may also be allowed to contain wildcard operators and other partial-match operators. The term "inform*," for instance, might refer to any word beginning with that prefix (e.g., "informs," "informed," "informal," "informant," "informative," etc.).
Further post-processing, such as the removal of duplicate results, may be applied to the result list. For instance, a web search engine might report only one or two results from a single host or domain, with the remaining pages being replaced by pages from other sources. The scoring of documents in relation to a user's query is one of the most essential issues in the field.
1.3 ISSUES RELATED TO IR
Since the field's early days, researchers have concentrated on a few core challenges that are still vitally important in the era of commercial web search engines dealing with billions of web pages. The following are a few of the issues related to IR:
The first issue concerns relevance and the retrieval models on which ranking algorithms are based. Linguistic features are included in more sophisticated models, but they often play a supporting role. H.P. Luhn, another information retrieval pioneer, introduced the use of word frequency data to represent text in the 1950s. It took until the 1990s for this perspective on text to catch on in other branches of computer science, such as natural language processing.
A second issue is evaluation: because the effectiveness of ranking algorithms must be measured, evaluation methods were required. Early in the 1960s, Cyril Cleverdon took the lead in creating evaluation techniques, and today precision and recall are still widely used. Precision, a relatively simple measure, is the proportion of retrieved documents that are relevant; recall is the proportion of relevant documents that are retrieved.
1.4 BOOLEAN RETRIEVAL
Boolean retrieval continues to be used in domains like digital libraries and the legal sector, in addition to the implicit Boolean filters applied by Web search engines. In contrast to ranked retrieval, Boolean retrieval delivers sets of documents rather than ranked lists. A term t is regarded by the Boolean retrieval model as specifying the set of documents that contain it. Boolean queries, which are regarded as operations on these sets, are built using the common Boolean operators AND, OR, and NOT as follows:
A AND B: intersection of A and B (A ∩ B)
A OR B: union of A and B (A ∪ B)
NOT A: complement of A with respect to the document collection (Ā)
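To make this set-based view concrete, here is a minimal Python sketch; the inverted index, document IDs, and query are hypothetical, and a real system would operate on compressed postings lists rather than in-memory sets.

# Inverted index: term -> set of docIDs containing it (toy data).
index = {
    "merger":   {1, 2, 5},
    "bank":     {1, 3, 5, 7},
    "takeover": {2, 4},
}
all_docs = {1, 2, 3, 4, 5, 6, 7}          # the whole document collection

def AND(a, b): return a & b               # intersection
def OR(a, b):  return a | b               # union
def NOT(a):    return all_docs - a        # complement with respect to the collection

# Evaluate the Boolean query: bank AND (merger OR takeover)
result = AND(index["bank"], OR(index["merger"], index["takeover"]))
print(sorted(result))                     # -> [1, 5]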
Figure 2: Function to locate the first occurrence of a phrase after a given position. The function calls the next and prev methods of the inverted index.
Figure 3: Function to locate the next occurrence of a cover for the term
vector <t1, ..., tn> after a given position
Both of the aforementioned algorithms have the same fundamental mode of operation. Lines 1-6 of the phrase search algorithm identify a range that contains every term of the phrase in the order in which it appears, such that no smaller range contained within it also contains every term of the phrase. Lines 1-4 of the cover-searching method similarly identify a range that contains all the terms as closely together as possible. Then an additional constraint is imposed on both methods.
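The phrase-search idea can be sketched roughly as follows; this is a simplified adaptation rather than the book's exact algorithm, with positions as plain integers inside one document and next_pos/prev_pos as assumed helpers over sorted per-term position lists.

import bisect

INF = float("inf")

def next_pos(postings, t, pos):
    # Smallest position of term t strictly greater than pos (INF if none).
    plist = postings.get(t, [])
    i = bisect.bisect_right(plist, pos)
    return plist[i] if i < len(plist) else INF

def prev_pos(postings, t, pos):
    # Largest position of term t strictly smaller than pos (-INF if none).
    plist = postings.get(t, [])
    i = bisect.bisect_left(plist, pos) - 1
    return plist[i] if i >= 0 else -INF

def next_phrase(postings, terms, position):
    # First occurrence of the phrase after 'position', returned as a (start, end) pair.
    v = position
    for t in terms:                          # find a range containing all terms in order
        v = next_pos(postings, t, v)
    if v == INF:
        return (INF, INF)
    u = v
    for t in reversed(terms[:-1]):           # pull the start of the range back as far as possible
        u = prev_pos(postings, t, u)
    if v - u == len(terms) - 1:              # terms are adjacent: a true phrase occurrence
        return (u, v)
    return next_phrase(postings, terms, u)   # otherwise continue searching past u

# Toy positional index: term -> sorted positions within one document.
postings = {"new": [3, 10], "york": [4, 30]}
print(next_phrase(postings, ["new", "york"], -INF))   # -> (3, 4)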
To simplify our definition of our Boolean search algorithm, we define two
functions that operate over Boolean queries, extending the nextDoc and
prevDoc methods of schema-dependent inverted indices.
Definitions for the NOT operator are more difficult, so we defer their discussion for the moment.
Figure 4: Function to locate the next solution to the Boolean query Q after
a given position. The function nextSolution calls docRight and docLeft
to generate a candidate solution. These functions make recursive calls that
depend on the structure of the query
For the purpose of producing a potential solution, the function calls
docRight and docLeft. This potential answer can be found in the interval
[u,v] right after line 4. The potential solution is returned if it only consists
of one document. Otherwise, the function makes a recursive call. Given this function, all solutions to the Boolean query Q may be generated as follows:
Its time complexity is comparable to that of our proximity ranking and phrase search algorithms. The time complexity becomes O(n·κ·log(L/κ)) when expressed in terms of the number of candidate solutions, which illustrates the adaptive character of the algorithm.
Although the call to the docLeft function in line 4 of the algorithm could be eliminated, it aids in our analysis of the algorithm's complexity by clearly defining a candidate solution.
We did not take the NOT operator into account while defining docRight and docLeft. In fact, implementing generalized versions of these functions is not required in order to support the NOT operator. Instead, a query can be transformed by applying De Morgan's laws, which push any NOT operators inward until they are directly adjacent to the query terms:
NOT (A AND B) ≡ (NOT A) OR (NOT B)
NOT (A OR B) ≡ (NOT A) AND (NOT B)
This transformation does not alter the number of AND and OR operators, and as a result does not alter the number of terms in the query (n). After applying De Morgan's laws in this way, we are left with a query that contains expressions of the form NOT t, where t is a term. To process queries containing expressions of this type, we need comparable definitions of docRight and docLeft, which can be expressed in terms of nextDoc and prevDoc.
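Purely as an illustration, the sketch below shows how a docRight-style method for an expression of the form NOT t might behave; the postings map, the helper next_doc, and the docID range are hypothetical, and the loop advances one document at a time, so it conveys the idea rather than an efficient definition.

def next_doc(postings, t, current):
    # Next docID after 'current' that contains t (None if there is none).
    for d in postings.get(t, []):            # postings lists are sorted docID lists
        if d > current:
            return d
    return None

def next_doc_not(postings, t, current, max_doc):
    # docRight for 'NOT t': the next docID after 'current' that does NOT contain t.
    d = current + 1
    while d <= max_doc:
        if next_doc(postings, t, d - 1) != d:   # t's next occurrence is not d, so d lacks t
            return d
        d += 1                                  # d contains t; try the following document
    return None

postings = {"merger": [2, 3, 4]}
print(next_doc_not(postings, "merger", 1, 6))   # -> 5 (documents 2, 3, 4 all contain "merger")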
1.5 DICTIONARIES AND TOLERANT RETRIEVAL
1.5.1 Search structures for dictionary
Given an inverted index and a query, our initial objective is to check whether each query term is present in the lexicon and, if it is, to find the reference to the associated postings. There are two main classes of solutions for this: hashing and search trees. The choice between them depends on questions such as: (1) How many keys are we likely to have? (2) Will keys only be added, or will some keys also be removed from the dictionary? (3) How frequently will different keys be accessed, on average?
In some search engines, hashing has been employed for dictionary lookup. Each word in the dictionary is hashed into an integer over a sufficiently large space to make hash collisions unlikely; collisions, if any, are resolved using auxiliary structures that can be labor-intensive to maintain. At query time we hash each query term individually, follow a pointer to the corresponding postings, and apply any logic needed to handle hash collisions. Minor variants of a search term (such as the accented and unaccented forms of a word like "resume") might have very different hash values, so they are difficult to discover. Finally, a hash function created to meet present demands may not be sufficient in a few years in a context (like the Web) where the vocabulary keeps growing.
Many of these problems are resolved by search trees, which, for example, allow us to enumerate all vocabulary terms beginning with automat.
The binary tree, which has two children at each internal node, is the best-known search tree. The search for a term begins at the tree's root. Each internal node (including the root) represents a binary test, and the outcome of the test determines which of the two subtrees below that node the search proceeds into. An illustration of a binary search tree used for a dictionary is given in Figure 5. Efficient search (with a number of comparisons that is O(log M)) requires the tree to be balanced. To mitigate rebalancing, one approach is to allow the number of sub-trees under an internal node to vary within a fixed interval.
Figure 5: A binary search tree. In this example the branch at the root partitions vocabulary terms into two subtrees, those whose first letter is between a and m, and the rest.
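For illustration only, the dictionary can be approximated by binary search over a sorted array of terms; the terms and postings pointers below are hypothetical, and the bisect calls stand in for the descent through a balanced search tree.

import bisect

# Hypothetical dictionary: sorted terms with parallel postings pointers.
terms = ["automat", "automate", "automatic", "automation", "boolean", "retrieval"]
postings_ptrs = [101, 102, 103, 104, 205, 317]    # e.g. offsets into a postings file

def lookup(term):
    # O(log M) dictionary lookup via binary search over the sorted term array.
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings_ptrs[i]
    return None

def prefix_range(prefix):
    # Enumerate all vocabulary terms beginning with 'prefix' (e.g. 'automat').
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")   # just past the last term with this prefix
    return terms[lo:hi]

print(lookup("boolean"))          # -> 205
print(prefix_range("automat"))    # -> ['automat', 'automate', 'automatic', 'automation']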
1.5.2 Wildcard queries
But what about wildcard queries in which the * symbol is not required to appear at the end of the search string? Before dealing with this general case, we briefly generalize trailing wildcard queries.
Consider leading wildcard queries, or *mon-style queries, first. Think of a reverse B-tree on the dictionary, one in which each root-to-leaf path corresponds to a term in the dictionary spelled backwards. For example, the term "lemon" would be represented by the path root-n-o-m-e-l in this B-tree. Walking down the reverse B-tree then enumerates all terms R in the lexicon that end with a given suffix (which appears as a prefix in the reversed tree).
In fact, we can handle an even more general case by combining a conventional B-tree with a reverse B-tree: wildcard queries with a single * symbol, such as se*mon. To do this, we first use the ordinary B-tree to enumerate the set W of dictionary terms beginning with the prefix se, and then use the reverse B-tree to enumerate the set R of terms ending in the suffix mon. The set of terms that start with the prefix se and end with the suffix mon is then obtained by taking the intersection W ∩ R of these two sets. Finally, we obtain all documents that contain any of the terms in this intersection using the conventional inverted index. We can thus handle wildcard queries that contain a single * symbol using two B-trees, the normal B-tree and a reverse B-tree.
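A rough sketch of this two-tree scheme, using sorted term lists in place of real B-trees and a small hypothetical vocabulary, might look like this; the linear prefix scan stands in for a range descent in the tree.

# Sketch of single-* wildcard handling (e.g. "se*mon") with two sorted term lists.
vocabulary = ["salmon", "second", "sermon", "session", "summon"]

forward = sorted(vocabulary)                    # stands in for the ordinary B-tree
reverse = sorted(t[::-1] for t in vocabulary)   # stands in for the reverse B-tree

def with_prefix(sorted_terms, prefix):
    return {t for t in sorted_terms if t.startswith(prefix)}

def wildcard(query):
    prefix, suffix = query.split("*")
    W = with_prefix(forward, prefix)                             # terms starting with "se"
    R = {t[::-1] for t in with_prefix(reverse, suffix[::-1])}    # terms ending in "mon"
    return sorted(W & R)                                         # the intersection W ∩ R

print(wildcard("se*mon"))   # -> ['sermon']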
1.6 SUMMARY
Calvin Mooers coined the phrase "information retrieval" in 1950. From 1961 onwards, computers were increasingly applied to information handling. Although IR systems still operate on the same basic principles today, numerous cutting-edge design methods have been created and implemented over time. The meaning of information retrieval has evolved over time, and scholars and information specialists have used many terminologies to describe it. Information access, text retrieval, information representation and retrieval, information processing and retrieval, and information storage and retrieval are only a few of them. We have also discussed the components used in an information retrieval system, along with the issues associated with it.
7] What are wildcard queries?
2
LINK ANALYSIS AND SPECIALIZED
SEARCH
Unit Structure
2.0 Objectives
2.1 Introduction
2.2 Link Analysis
2.2.1 Why Link Analysis Is Done?
2.2.2 The Web as a graph
2.2.3 Graph-based representation of the World Wide Web (WWW)
2.2.4 Types of links
2.2.5 Link spam
2.3 Hubs and Authorities
2.4 Page Rank and HITS algorithms
2.5 Similarity
2.5.1 Similarity of Documents
2.8 Collaborative filtering and content-based recommendation of
documents and products
2.8.1 What is CF?
2.8.2 Content-based recommendation of documents
2.9 Handling “invisible” Web
2.9.1 Open Access Journal Databases
2.9.2 Invisible Web Search Engines
2.9.3 Ways to Make Content More Visible
2.9.4 How to Access and Search for Invisible Content
2.9.5 Invisible Web Search Tools
2.10 Snippet generation
2.11 Summarization, Question Answering
2.11.1 Summarizing Single Documents
2.11.2 Summarizing Multiple Documents
2.12 Cross-Lingual Retrieval
2.0 OBJECTIVES
After completion of this module, you will learn:
2.1 INTRODUCTION
In this part of the book, we delve into the intriguing area of link analysis, a powerful tool for studying the relationships between entities in a network. The purpose of link analysis is to identify patterns, trends, and other significant insights by investigating the connections that exist between the nodes that make up a network. The PageRank algorithm, which determines the relative importance of web pages based on the number and quality of incoming and outgoing links, is one of the most widely used link analysis methods.
The PageRank algorithm represents the web as a graph and computes a rank for a page based on the probability that a user would reach it by following links. The huge size of the web necessitates procedures that operate on very large matrices in order to complete the calculation. As a result, the PageRank computation is typically implemented with the MapReduce programming paradigm on top of a distributed file system.
In this section, we will go over the fundamentals of link analysis, demonstrate how graphs may be used to perform link analysis, and discuss how PageRank is calculated. In addition, we will learn about the various methods that may be used to calculate PageRank, as well as how MapReduce can be used to carry out this calculation. This lesson will give a good basis for understanding this significant area, whether you are a student, a researcher, or simply curious about the realm of link analysis.
The use of hyperlinks for ranking web search results is the main topic of this chapter. Web search engines consider a variety of factors when calculating a web page's overall score for a specific query, one of which is link analysis. PageRank and HITS are two different techniques for link analysis.
2.2 LINK ANALYSIS
Network theory uses the data-analysis method of link analysis to examine the connections of the web graph. A graph consists of a set of nodes and a set of edges, links, or intersections between nodes. Graphs are used everywhere, including on social networking platforms like Facebook, Twitter, and others.
2.2.2 The Web as a graph
The Web can be viewed as a directed graph of static HTML pages connected by hyperlinks, where each page is a node and each hyperlink is a directed edge.
2.2.3 Graph-based representation of the World Wide Web (WWW)
A directed graph may be used to depict the Web: web pages are the nodes of the graph, and the hyperlinks between them are its directed edges. The associations created by these hyperlinks may be utilised to build a network.
2.2.4 Types of links
1. Inbound links:
Links leading to a website from outside sources are called inbound links.
Inlinks are one technique to raise a site's overall PageRank, and inlinks do not penalise websites.
2. Outbound links:
A page may connect to other pages on the same website or to other
websites via outbound links.
3. Dangling links:
Links that point to a page which itself has no outgoing links are known as dangling links.
2.2.5 Link spam
Since it is commonly known that commercial search engines utilise methods like PageRank and anchor-text extraction, unethical web page designers may attempt to generate pointless links in order to elevate one of their web pages in the search engine results. This is known as link spam.
2.3 HUBS AND AUTHORITIES
The primary tenet of the hubs and authorities model is that every web page is given two scores rather than the single score used by the PageRank algorithm. One is the hub score; the other is the authority score. A page's authority value is the sum of the scaled hub values of the pages pointing to it, and a page's hub value is the sum of the scaled authority values of the pages it points to.
The well-known, highly recommended answers to a query will be referred to as the authorities for the query. These are the kind of pages we were initially looking for. The high-value lists will be referred to as the hubs for the query. We now attempt to assess the worth of each page p as a prospective authority and hub by giving it two numerical scores: auth(p) and hub(p). Each of these begins with a value of 1, indicating that we are initially unsure which pages fall into which of these categories. Voting then proceeds as follows: we use the quality of the hubs to refine our estimates of the quality of the authorities.
Authority Update Rule: Update auth(p) to be the total of all hub scores for
pages that point to page p.
On the other hand, the list-finding rule, in which we refine our estimates of the quality of the hubs by the calibre of the authorities, is as follows:
Hub Update Rule: Update hub(p) to be the total of all the authority ratings
of the pages it points to for each page p.
Keep in mind that a single application of the Authority Update Rule (starting from all scores equal to 1) is simply the original counting of in-link votes.
Figure 2.3.1 Re-weighting votes after normalizing for the query
“newspapers.”[1]
• All hub scores and authority scores are set to 1 at the beginning.
• We choose a number of steps, k.
• For each of the k steps, first apply the Authority Update Rule to the current set of scores, then apply the Hub Update Rule to the resulting set of scores.
• The hub and authority scores may ultimately entail extremely big
numbers. But, because we are only interested in their relative sizes, we
may normalise them by dividing each authority score by the total of all
authority scores and each hub score by the total of all hub scores.
Figure 2.3.2 Limiting hub and authority values for the query
“newspapers.”[1]
The key points are that the normalized hub and authority values actually converge to limits as k goes to infinity, and that these limiting values are a property of the link structure, not of the initial estimates used to start the computation. The limiting hub and authority values remain unchanged if we apply the Authority Update Rule or the Hub Update Rule to them, reflecting the balance between hubs and authorities that provided the initial intuition: the authority score of a page is proportional to the hub scores of the pages that point to it, and its hub score is proportional to the authority scores of the pages it points to.
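As a small illustration of the two update rules, the following sketch runs them for k iterations on a hypothetical link graph with three hub-like pages (p1-p3) and two authority-like pages (a1, a2), normalizing after every step.

# Hypothetical link graph: page -> pages it points to.
links = {
    "p1": ["a1", "a2"],
    "p2": ["a1"],
    "p3": ["a2"],
    "a1": [],
    "a2": [],
}
pages = list(links)
auth = {p: 1.0 for p in pages}          # all authority scores start at 1
hub  = {p: 1.0 for p in pages}          # all hub scores start at 1

k = 20                                  # number of update steps
for _ in range(k):
    # Authority Update Rule: sum of the hub scores of the pages pointing to p.
    new_auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub Update Rule: sum of the (updated) authority scores of the pages p points to.
    new_hub = {p: sum(new_auth[q] for q in links[p]) for p in pages}
    # Normalize: only the relative sizes of the scores matter.
    a_total, h_total = sum(new_auth.values()), sum(new_hub.values())
    auth = {p: v / a_total for p, v in new_auth.items()}
    hub  = {p: v / h_total for p, v in new_hub.items()}

print({p: round(v, 3) for p, v in auth.items()})   # limiting authority values
print({p: round(v, 3) for p, v in hub.items()})    # limiting hub values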
2.4 PAGE RANK AND HITS ALGORITHMS
Figure 2.4.1 PageRank Scores [1]
Figure 2.4.1 shows a graph with a set of directed edges, where the size of a node is related to its PageRank score. The PageRank scores are scaled so that they sum to 100.
• Node C has a high PageRank even though it only gets one incoming link, because the very important node B points to it.
• The very small purple link nodes have a low PageRank score because
they are pointed to by a website with a low PageRank.
• The PageRank score of node D is higher than that of node A because D
points to A.
• The PageRank score of node E is kind of in the middle, and E points to
node B, and so on.
Figure 2.4.1 shows that PageRank scores are fairly intuitive and match how important we consider a node in our graph to be.
In this section, we talk about the simple recursive formulation and the
flow model, which are two important ways to figure out PageRank.
The simple recursive formulation method will be used to find PageRank scores, in which:
Figure 2.4.2 is a simple graph that shows how many votes a page called j receives. The score of node j is the sum of the votes it receives from nodes i and k:
rj = ri/3 + rk/4
This is because there are 3 out-links on page i and 4 out-links on page k. Page j itself has three outgoing links, so the score of page j is in turn shared among them: each of these links carries one-third of the importance of node j.
un
m
.in
where di is node i's out-degree. You may recall that a node's out-degree is the number of links that leave it. So page j's importance score is just the sum, over all pages i that point to it, of the importance of page i divided by its out-degree.
This means that we get one such equation for each node in the network, based on its links. For example, the importance scores of nodes a, b, and c in the graph of Figure 2.4.3 satisfy:
ra = ra/2 + rb/2
rb = ra/2 + rc
rc = rb/2
Figure 2.4.3 A very simple web graph [1]
The problem is that these three equations do not have a unique solution, so we need to add another constraint if we want the solution to be unique. Let's add a rule: the sum of all the PageRank scores should be 1. This is expressed by the equation below:
ra + rb + rc = 1
You can solve these equations and find the PageRank scores, which are:
ra = 2/5, rb = 2/5, rc = 1/5
Input:
- A directed graph G = (V, E), where V is the set of nodes (web pages) and E is the set of edges (links between pages)
- A damping factor d (usually set to 0.85)
- A threshold value ε (usually set to a small value, such as 0.0001)
Output:
- A vector PR containing the PageRank scores for each node in V
1. Record, for each node, its outgoing links and out-degree.
2. Initialize all PageRank scores to 1/|V|.
3. Repeat until every score changes by less than ε: for each node v in V, set PR(v) = (1 - d)/|V| + d · Σ over all nodes u linking to v of PR(u)/outdegree(u).
4. Return the PageRank scores vector PR.
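A minimal Python rendering of this pseudocode (a sketch under the stated assumptions, not production code) can be run on the toy graph of Figure 2.4.3; with the damping factor set to 1 it reproduces the flow-model solution ra = rb = 2/5, rc = 1/5.

def pagerank(out_links, d=0.85, eps=0.0001):
    # Power iteration for PageRank; out_links maps each node to the nodes it links to.
    nodes = list(out_links)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}                 # initialize all scores to 1/|V|
    while True:
        new = {}
        for v in nodes:
            incoming = sum(pr[u] / len(out_links[u])
                           for u in nodes if v in out_links[u])
            new[v] = (1 - d) / n + d * incoming
        if max(abs(new[v] - pr[v]) for v in nodes) < eps:
            return new                               # converged: every change is below eps
        pr = new

graph = {"a": ["a", "b"], "b": ["a", "c"], "c": ["b"]}   # the graph of Figure 2.4.3
print({v: round(s, 3) for v, s in pagerank(graph, d=1.0).items()})
# -> approximately {'a': 0.4, 'b': 0.4, 'c': 0.2}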
• We refer to the quality score a page receives as a content provider as its authority score.
• The hub score of a page is the total of the votes it casts for the authorities it references, while the authority score of a page is the total number of hub (expert) votes that the page receives.
• Next, we calculate the steady-state scores using the repeated-improvement idea.
A brief comparison: PageRank is fast to compute and is calculated offline over the structure of the entire web graph, while HITS is easy to compute and is typically evaluated at query time. PageRank uses backlinks, which are links pointing to a page from other pages; backlinks from merely popular pages are not as important as backlinks from important pages.
Since it is hard for the owner of a Web page to get links from other
important pages to his or her page, it is hard to change Page Rank.
• Page Rank is a global measure that doesn't depend on the search query.
All of the pages' Page Rank values are calculated and stored offline, not
when a query is made.
2.5 SIMILARITY
2.5.1 Similarity of Documents
Textually similar documents in a large corpus like the Web or a collection
of news articles are easy to find with Jaccard similarity. This is an
important class of problems that Jaccard similarity solves well. We should
know that the level of similarity we are looking for is character-level
similarity, not "similar meaning," which requires us to look at the words in
the documents and how they are used. This problem is also interesting, but
there are other ways to solve it. But textual similarity is also useful in
important ways. Many of them have to do with finding duplicates or near-duplicates. First, note that it is easy to tell whether two documents are exact copies of each other: simply compare them character by character, and if there is any difference they are not the same. In many applications, however, the documents are not identical, yet they share a lot of text. Here are just a few examples.
Plagiarism. Plagiarized documents give us a chance to see how well we can find textual similarities. A plagiarist may take only parts of a document to use as his own, and he might change a few words or the order in which the original sentences are written. Still, the new document may contain at least 50% of the original. A sophisticated case of plagiarism cannot be found by comparing documents character by character.
Page mirrors. It is common for important or popular websites to be hosted on more than one server so that the load can be spread out. Most of the time, the pages on these mirror sites will be very similar, but not the same. For example, each one might have information about its own host and links to the other mirror sites but not to itself. A related case is courses whose pages are copied from one year to the next: these pages might contain class notes, homework, and slides from lectures, and from one year to the next, similar pages might change the name of the course and the year and make small changes. It is important to be able to find these kinds of similar pages, because search engines give better results when they do not show two nearly identical pages on the first page of results.
Articles from the same source. It is common for one reporter to write a news story that is sent to many newspapers, for example through the Associated Press, and the newspapers then put the story on their websites. Each newspaper makes small changes to the article: they may remove paragraphs or even add their own material, and they will most likely surround the article with their logo, ads, and links to other articles on their site. But the original article will be the most important part of each newspaper page. News aggregators such as Google News try to find all versions of an article so they can show only one; to do this, they have to figure out when two web pages have similar text but are not the same.
J(A, B) = |A ∩ B| / |A ∪ B|
Jaccard similarity ranges from 0 to 1, where 0 indicates no similarity and 1
indicates complete similarity.
Jaccard-bag similarity, on the other hand, is a modified version of Jaccard similarity that takes into account repeated items. The Jaccard-bag similarity of two bags (multisets) A and B is defined as the size of the intersection of the multisets divided by the size of the union of the multisets:
J_bag(A, B) = |A ∩_bag B| / |A ∪_bag B|
where ∩_bag and ∪_bag represent the intersection and union of multisets. Jaccard-bag similarity ranges from 0 to 1, where 0 indicates no similarity and 1 indicates complete similarity. It is typically used in applications where repeated items are common, such as text analysis and data mining.
Example 2.5.1: In Figure 2.5.2 we can see two sets S and T. There are three elements in their intersection and a total of eight elements that appear in S or T. Therefore SIM(S, T) = 3/8.
Figure 2.5.2 Two Sets with Jaccard Similarity 3/8
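For illustration, both measures can be computed directly; the sets below are hypothetical five- and six-element sets chosen to reproduce the 3/8 of Example 2.5.1, and the bag example corresponds to the multisets {a,a,a,b} and {a,a,b,b,c} from the unit-end exercises.

from collections import Counter

def jaccard(a, b):
    # Set Jaccard similarity |A ∩ B| / |A ∪ B|.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def jaccard_bag(a, b):
    # Bag Jaccard: minimum counts in the intersection over summed counts in the union.
    ca, cb = Counter(a), Counter(b)
    inter = sum(min(ca[x], cb[x]) for x in ca.keys() & cb.keys())
    union = sum(ca.values()) + sum(cb.values())
    return inter / union if union else 0.0

S = {1, 2, 3, 4, 5}
T = {3, 4, 5, 6, 7, 8}
print(jaccard(S, T))                  # -> 0.375, i.e. 3/8

print(jaccard_bag("aaab", "aabbc"))   # -> 3/9: the intersection counts a twice and b once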
Use Case
Movie Ratings. Netflix keeps track of which movies each customer has rented, as well as how they rated those movies. We can think of two movies as similar if many of the same customers rented them or gave them high ratings, and we can think of two customers as similar if they rented or gave high ratings to many of the same movies. The same observations we made for Amazon apply here: similarities don't have to be large to be significant, and grouping movies by genre will make things easier. The question of ratings also adds a new element. There are several options:
1. Ignore customers who gave low ratings to the movies they rented. In
other words, act as if the customer never rented the movie.
2. To compare customers, think of "liked" and "hated" as two set elements
for each movie. If a customer gave a movie a high rating, add a "liked" to
the customer's set for that movie. If they gave a movie a low score, put
"hated" in their set for that movie. Then, we can look for sets with a high
Jaccard similarity. We can use the same method to compare movies.
3. If ratings range from one to five stars, add a movie to a customer's set n
times if they gave it n stars. Then, use Jaccard similarity for bags to figure
out how much customers are alike. The Jaccard similarity for bags B and
C is found by counting an element n times in the intersection if n is the
least number of times the element appears in both B and C. In the union,
we add up the number of times an element appears in both B and C.
2.5.3 Similarity measure between two vectors
After making the necessary adjustments to the term weights, we now have k-dimensional document and query vectors (where k is the number of index terms in the vocabulary), and we need to determine how similar they are. The cosine similarity is a commonly used and broadly accepted measure of similarity. The inner product of the two vectors (the sum of the pairwise products of their components) is divided by the product of the lengths of the vectors. This normalises the lengths of the vectors, so the similarity between them depends only on the angle between them, or more precisely on the cosine of that angle.
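In symbols, cosine(q, d) = (Σ qi·di) / (√(Σ qi²) · √(Σ di²)). The following small sketch uses hypothetical term weights for one query and two documents:

import math

def cosine(q, d):
    # Cosine similarity between two equal-length weight vectors.
    dot  = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return dot / norm if norm else 0.0

query = [0.5, 0.0, 0.7, 0.0]          # hypothetical tf-idf style weights over 4 index terms
doc1  = [0.4, 0.1, 0.8, 0.0]
doc2  = [0.0, 0.9, 0.0, 0.3]
print(round(cosine(query, doc1), 3))  # high value: query and doc1 share vocabulary
print(round(cosine(query, doc2), 3))  # 0.0: no terms in common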
Documents that do not share a single word are given a similarity value of zero, while documents that have a comparable vocabulary are given larger values (up to one in the case of identical documents). Since a query may be thought of as a brief document, it is of course feasible to generate a vector for the query. This vector can then be used to compute the cosine similarity between the query vector and the vectors of the documents that match the query. The order of the results is determined, in the end, by the similarity values between the query and the retrieved documents.
There are several ways to measure similarity, including:
The Jaccard similarity between two sets A and B is defined as:
J(A,B) = |A ∩ B| / |A ∪ B|
In the PageRank formula, the damping factor represents the probability of a user following a link.
The HITS score of a page i is calculated in two steps. First, the authority score A(i) of page i is calculated based on the incoming links to it:
A(i) = Σ H(j)
where H(j) is the hub score of page j, and j links to page i. Second, the hub score H(i) of page i is calculated based on the outgoing links from it:
H(i) = Σ A(k)
where A(k) is the authority score of page k, and i links to page k.
2.6.1 What is Hadoop?
Hadoop is an open-source framework developed by Apache. It enables users to store and process large amounts of data in a distributed manner across clusters of computers using simple programming models. It is built to scale from a single server to thousands of machines, each of which provides local computation and storage.
• Computing power. Big data is processed fast using this model of distributed computing: the more computing nodes the user utilises, the more processing power they have.
•Adaptability. Users do not need to preprocess data before saving it, in
contrast to conventional relational databases. They are free to keep as
much data as they like, and they may choose afterwards how to utilise it.
Data that is not organised, such as text, photos, and videos, is included.
•Tolerant of faults. Processing of data and applications is safeguarded
against hardware malfunction. To ensure that distributed computing does
not fail, tasks are immediately routed to other nodes if one node goes
down. Moreover, all data is automatically stored in numerous copies.
•Low price. The open-source framework is free and uses commodity
hardware to store large quantities of data.
• Scalability. By simply adding new nodes, the user may quickly expand the system. Minimal administration is necessary.
Big data
Big data is a collection of expansive datasets that cannot be analysed using
conventional computer methods. Big data has evolved into a field in and
of itself, including a variety of tools, methodologies, and frameworks.
Hadoop has three main components:
1. Hadoop HDFS - Hadoop's storage unit is the Hadoop Distributed File System (HDFS).
2. Hadoop MapReduce - Hadoop's processing engine is called MapReduce.
3. YARN for Hadoop - A Hadoop resource management component is
called YARN.
The Big Data framework in healthcare may assist in a thorough examination of the data available on-site, covering availability, growing costs, and even monitoring the progression of chronic illness. Big Data is utilised in the media and entertainment industry to gather, examine, and derive meaningful customer insights.
Hadoop has been used in the transportation industry to manage traffic, develop intelligent transportation systems, design routes, and reduce congestion.
Big Data may be utilised, particularly in the logistics division, to monitor shipments and journey times, and to conserve fuel by implementing best practices and giving directions to vehicles.
A more advanced electric grid with smart metres to monitor the reading
every 15 minutes will be introduced in Energy and Utilities. To improve
system performance, granular data will be used to evaluate data from
numerous devices and combine it with user input.
Big Data may monitor customer purchasing patterns in the retail and
wholesale industries and compare them to marketing strategies to increase
the value of the company. Similar applications include consumer loyalty
cards, RFID, point-of-sale scanners, neighbourhood activities, inventory
management, and even fraud prevention.
Big Data can monitor consumer insights in the insurance industry to
streamline products and forecast behaviour through GPS devices, social
media connections, and investment prospects. Claims management may
benefit from optimised concepts to provide services more quickly.
2.6.5 Hadoop's Architecture
In 2005, Doug Cutting, Mike Cafarella, and their team drew on the solution Google had described in its papers to start the open-source project that became Hadoop.
Figure 2.6.1 Hadoop's Architecture
Since 2012, the name "Hadoop" has also come to refer to a number of other software packages that may be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, and Apache Spark, in the wider Hadoop ecosystem.
HDFS
Under the conventional approach, a single central database housed all of the data, but with the emergence of big data a single database could no longer handle the job. The large amount of data needed to be stored, so a distributed approach was chosen as the answer: the data was broken up and distributed among many separate databases. HDFS is a file system created specifically for storing large datasets on commodity hardware, storing data in diverse formats across many machines.
YARN
1. The master daemon is ResourceManager (Master). It controls how
resources like CPU, memory, and network bandwidth are distributed.
2. NodeManager (Slave) - The slave daemon that informs the Resource
Manager of resource utilisation is called NodeManager.
MapReduce
MapReduce processes data in two main phases. In the map phase, the input data is processed by a mapping function, called Map, to create intermediate key-value pairs. The output of the mapping step is collected and organised into blocks of related data during the shuffle and sort phase. The grouped values from the shuffle phase are then combined in the reduce phase, which returns a single output value for each key.
In summary, the three parts of Hadoop are HDFS, MapReduce, and
YARN.
Figure 2.6.2: Map and reduce functions for distributed index construction. Key-value pairings do not need to arrive sorted; terms are shown instead of termIDs for readability [1].
In order to guarantee that the work can be distributed fairly and efficiently, the input data, in our case a collection of web pages, is first divided into n splits. For distributed indexing, splits of 16 or 64 MB are suitable. Splits are allocated on an ongoing basis by the master node rather than being pre-assigned to machines. The map phase produces key-value pairs; a key-value pair for indexing takes the form (termID, docID). The mapping from terms to termIDs is also distributed in distributed indexing, making it more complicated than in single-machine indexing. A straightforward approach is to maintain a (perhaps precomputed) mapping for frequent terms that is replicated to all nodes, and to use the terms directly (instead of termIDs) for infrequent terms.
Figure 2.6.3: Data flow in MapReduce-based distributed indexing, assuming a consistent term → termID mapping [1].
MapReduce's map phase involves splitting up the incoming data into key-value pairs. Each parser writes its output to local intermediate files, also known as segment files (shown as a-f, g-p, and q-z in Figure 2.6.3).
In the reduce phase we want to keep all values for a particular key close together, so that they may be read and processed rapidly. After the keys are divided into j term partitions, the parsers write the key-value pairs for each term partition into a separate segment file. In the figure, the term partitions are defined according to initial letters (a-f, g-p, q-z), so j = 3. The term partitions are defined by the person who operates the indexing system, and the parsers then produce corresponding segment files for each of them. Each term partition thus corresponds to r segment files, where r is the total number of parsers. For instance, Figure 2.6.3 depicts three parsers as well as the three a-f segment files of the a-f partition. The inverters' job during the reduce phase is to collect all values (here, docIDs) for a given key (here, a termID) into a single list. The master assigns each term partition to a distinct inverter and, as with parsers, reassigns term partitions in the event of slow or failed inverters. Each term partition (corresponding to r segment files, one on each parser) is processed by one inverter.
We assume here that segment files are of a size that a single machine can handle. For each key, the list of values is sorted and then written to the final sorted postings list. Figure 2.6.3 displays the data flow for the a-f partition. This completes the construction of the inverted index. Parsers and inverters are not separate sets of machines: the master identifies idle machines and assigns tasks to them. The same machine may function as a parser in the map phase and as an inverter in the reduce phase. Moreover, other jobs often run concurrently with index construction, so a machine may be doing crawling or other unrelated work in between acting as a parser and as an inverter.
Each parser writes its segment files to its local disc to decrease writing
times before inverters reduce the data. The location of the appropriate
segment files is sent by the master to an inverter during the reduction
phase (e.g., of the r segment files of the a–f partition). Each segment file
only has to be read sequentially once since the parser wrote all information
pertinent to a certain inverter to a single segment file. This configuration
reduces the amount of network traffic required for indexing.
The overall structure of the MapReduce functions is shown in Figure 2.6.2. Input and output are often lists of key-value pairs, so that MapReduce jobs may be chained one after the other. Another MapReduce job converts the newly produced term-partitioned index into a document-partitioned index. MapReduce provides an effective and conceptually straightforward framework for performing index construction in a distributed setting: with sufficiently large computer clusters, it can scale to virtually arbitrarily large collections by offering an automated way of breaking index construction into smaller tasks.
Consider the problem of calculating the number of times each word appears in a large collection of documents.
Map phase: each map task takes a portion of the input documents and emits a key-value pair (word, 1) for every word occurrence it encounters. For the example input "the quick brown fox jumps over the lazy dog", the map output would be:
("the", 1)
("quick", 1)
("brown", 1)
("fox", 1)
("jumps", 1)
es
("over", 1)
("the", 1)
ot
("lazy", 1)
("dog", 1)
Shuffle and sort phase: In this phase, the key-value pairs are shuffled and
sorted by key, so that all pairs with the same key are grouped together.
This allows the next phase to efficiently count the number of occurrences
of each word.
Reduce phase: In this phase, the key-value pairs are processed in parallel
by multiple reduce tasks. Each task takes a key and a list of values as input
and outputs a key-value pair representing the count of occurrences for that
key. In the case of Word Count, the value list for each key consists of a
sequence of 1s, indicating the number of times that word appears in the
input documents. The reduce task then sums up the values and outputs a
key-value pair with the word as the key and the count as the value. For
example, the output for the previous example input would be:
("the", 2)
("quick", 1)
("brown", 1)
("fox", 1)
("jumps", 1)
("over", 1)
("lazy", 1)
("dog", 1)
Final output: The output from the reduce tasks is combined to produce
the final output of the Word Count program, which is a list of key-value
pairs representing the word and its count. This output can be further
processed or stored for later use. Overall, MapReduce provides an
efficient and scalable approach to solving the Word Count problem, and
can be applied to a wide range of other data processing tasks as well.
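The whole pipeline can be mimicked in a few lines of ordinary Python; this is only a sketch of the logic, since a real MapReduce framework runs the map and reduce tasks in parallel across a cluster.

from itertools import groupby

def map_phase(document):
    # Map: emit a (word, 1) pair for every word occurrence.
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    # Shuffle and sort: group all values by key.
    pairs.sort(key=lambda kv: kv[0])
    return {k: [v for _, v in grp] for k, grp in groupby(pairs, key=lambda kv: kv[0])}

def reduce_phase(word, counts):
    # Reduce: sum the 1s emitted for each word.
    return (word, sum(counts))

documents = ["the quick brown fox", "jumps over the lazy dog"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle(mapped)
print(dict(reduce_phase(w, c) for w, c in grouped.items()))
# -> {'brown': 1, 'dog': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'over': 1, 'quick': 1, 'the': 2}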
2.7 PERSONALIZED WEB SEARCH
Suppose a user is searching for culinary advice. The search results might be readily improved by categorization, or perhaps the person has recently been browsing for recipes; in that case the search history, or even the browsing history, might be taken into account. There are several methods to customise search results, and building a user profile is a fundamental and popular approach.
There are several options available for gathering user-input data. The user's prior search requests or the search results they clicked may be collected, and the browsing history is a further source of data.
Data about the customer might also be a source: The access mode (desktop
or mobile device), the IP address (the location may be discovered from the
IP address), or the user's browser and operating system can all be
identified.
The Open Directory Project, for instance, allows users to look up websites from their history.
For Personalized Web Search, the extracted knowledge is incorporated in a personalization module in order to facilitate the personalization functions. Many different forms of data representation can
be applied here. The two most common representations seem to be the
rather simple “list of URLs” and the “bag of words”. Different tables or
tensors can be used as well (for example: all queries by a user and/or
clicked results for every query).
The use of classification is also quite common. The categories for the classification can be taken from the Open Directory Project (or from a similar directory).
2.8 COLLABORATIVE FILTERING AND CONTENT-BASED RECOMMENDATION OF DOCUMENTS AND PRODUCTS
Collaborative filtering is a category of applications where similarity of sets
is crucial. In this approach, we propose to customers products that other
users who have shown similar likes have enjoyed. Millions of people use
Amazon, which also offers millions of products. The database keeps
track of which clients purchased which things. If two consumers'
collections of bought things have a high Jaccard similarity, we may claim
that they are comparable. Likewise, two goods will be considered similar
if their sets of buyers have significant Jaccard similarity. While we could
anticipate mirror sites to have Jaccard similarity exceeding 90%, it is
doubtful that any two clients would have that high of a level of similarity
(unless they have purchased only one item). Even a Jaccard similarity of
20% can be exceptional enough to locate clients with comparable
interests. The same is true for things; Jaccard similarities do not
necessarily need to be high to be meaningful. In addition to identifying
related clients or products, collaborative filtering necessitates a number of
technologies. For instance, two science fiction book fans who shop on
Amazon could each purchase a large number of titles, but they will only
have a small number of titles in common. But, by combining
similarity-finding with clustering, we may be able to identify the
similarities across science fiction literature and group them together. Next,
by determining whether they made purchases inside a large number of the
same groupings, we may develop a more potent idea of customer
similarity.
2.8.1 What is CF?
A friend whose tastes are similar to mine could opt to read a book I liked and then suggest it to their friends.
This procedure may be automated as follows on an e-commerce website.
An implicit suggestion occurs when I purchase a book, but the website
may ask me to explicitly rate the book, say on a scale of 1 to 10.
The CF system will be able to determine that my buddy has similar
reading preferences to mine when he goes onto the website since we have
already made comparable purchases. The computer will also detect that he
has not yet purchased the book I highly recommended, and will then
suggest this book to my buddy. The core of collaborative filtering is this.
Before showing the user the top suggestions, the algorithm will gather as
many recommendations as it can and rank them based on their general
popularity. CF may also be used in online navigation, to suggest links that similar users have followed. A model-based approach builds a model of the user-item matrix using machine learning techniques and utilises that model
to generate predictions. One such method develops a neural network that
learns to anticipate each user's rating of a new item. Another method
creates association rules such as "30% of users like all these products, and 90% of users who like items i and j also like item k."
As in user-based algorithms, the rules are generally of the form X ⇒ Y, where X is a set of items and Y is another item; in this case the rule is {i, j} ⇒ {k}. The support of the rule is 30%, meaning that 30% of the users in the user-item matrix like all three items (the items in both X and Y). The 90% is the confidence in the rule: among the users who like items i and j (the items in X), 90% also like item k.
Another method uses the naive Bayes classifier with the user-item matrix as input: the probability that the active user likes an item is taken to be proportional to the product of the probabilities that each of the other users also likes the item, given their ratings.
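As a small illustration of the support/confidence idea, the following sketch evaluates the rule {i, j} ⇒ {k} over a hypothetical user-item "like" matrix (the user names and items are made up).

likes = {
    "u1": {"i", "j", "k"},
    "u2": {"i", "j", "k"},
    "u3": {"i", "j"},
    "u4": {"k"},
    "u5": {"i"},
}
X, Y = {"i", "j"}, {"k"}

like_all = [u for u in likes if (X | Y) <= likes[u]]   # users liking every item in X and Y
like_X   = [u for u in likes if X <= likes[u]]         # users liking every item in X

support    = len(like_all) / len(likes)       # fraction of all users who like X ∪ Y
confidence = len(like_all) / len(like_X)      # of the users who like X, fraction also liking Y
print(support, confidence)                    # -> 0.4 0.666...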
2.8.2 Content-based recommendation of documents
Under the framework of user-based collaborative filtering, there are two
issues. Since a user will often only rate (or buy) a small number of things,
say 30, out of the millions that may be offered, the user–item matrix is
quite sparse. When an item has not yet been evaluated, there is another
difficulty that is similar to this one called the first-rater dilemma. A
content-based strategy is essential in this scenario since an e-commerce
website could still wish to promote products that do not have any ratings
associated with them. In order for content-based systems to be successful,
the system must have the capability of compiling an interest profile for
each user.
The user's preferred categories in regard to the application are considered
to be part of the user's interests. Consider, for instance, whether the user favors works of fiction over works of nonfiction, or mainstream music over classical music. Once the system has a user profile, it can examine the degree to which the item (or content) a user is viewing is similar to the profile, and it can then assign a rating to the item based on that degree of similarity. This is quite similar to the search procedure, except that the user's profile serves as the query and the items shown to the user serve as the results of the query. An item with a higher aggregate rating will be ranked higher when it is shown to the user.
2.9 HANDLING "INVISIBLE" WEB
The invisible (or deep) Web consists of:
• Private websites, such as VPNs (Virtual Private Networks) and login-required websites;
• Sites with limited access to content (which restrict access in a technical way, such as by using Captcha, the Robots Exclusion Standard, or no-cache HTTP headers that prevent search engines from browsing or caching them);
• Sites with unlinked content (pages that no other pages link to, so that crawlers cannot discover them).
2.9.1 Open Access Journal Databases
"Copernicus Publications," the "Directory of Open Access Journals," and similar services are examples of well-known and respectable open access databases, with journals sorted by category.
2.9.2 Invisible Web Search Engines
• "Complete Planet" searches more than 70,000 invisible-web databases and specialised search engines; it is a search engine that works equally well for academics and casual users.
• "Digital Librarian: A Librarian's Pick of the Best on the Web" is managed by a real librarian. Digital Librarian includes information from areas as varied as Activism/Non-Profits and Railroads and Waterways, with an eclectic mix of around 45 major categories.
• The Regents of the University of California's "Info Mine" is yet another
collection of online resources created by librarians.
• The "Wayback Machine," which enables users to find archived papers, is
only one of several categories on "Internet Archive," which also offers an
archive of Grateful Dead soundboard and crowd recordings. They provide
126K live music performances, 6 million texts, 2.5 million movies, and
2.9 million audio recordings.
• The Drexel University website "The Internet Public Library" (ipl1 and
ipl2) is a non-profit organisation maintained by students. Students
voluntarily take on the role of librarians and assist guests with their
inquiries. Data categories that are geared for kids and teens are among
them.
2.9.3 Ways to Make Content More Visible
• Make a sitemap public. It is essential to provide a fresh, serially linked sitemap for your website. It is no longer regarded as best practice to advertise it to your visitors; instead, post it and keep it current so that spiders can evaluate your site's content effectively.
• Write about it somewhere. Finding ways to provide links to your content from other, visible pages helps crawlers discover it.
2.9.4 How to Access and Search for Invisible Content
• Access a virtual private network via an employer.
• Request access; this could be as simple as a free registration.
• Pay for a subscription.
• Use a suitable resource. Use an invisible Web directory, portal or
specialized search engine such as Google Book Search, Librarian’s
Internet Index, or BrightPlanet’s Complete Planet.
2.9.5 Invisible Web Search Tools
• [Link]'s finance and investing section
• GPO's Catalog of US Government Publications for general research
Government Data - Copyright Records (LOCIS); International -
International Data Base (IDB); Library of Congress - Library of Congress;
Law and Politics - THOMAS; PubMed; Transportation - FAA Flight
Delay Information;
2.10 SNIPPET GENERATION
A snippet is a concise summary of a retrieved document that is intended to let the user judge its relevance. The title of the document plus an automatically derived brief summary usually make up the snippet.[2]
Figure 2.10.1 The first 4 snippets from Google for German Expressionism
Brucke
For instance, the figure above provides some sample snippets from Google that summarise the first four articles retrieved in response to the query "German Expressionism Brucke".
Figure 2.10.2 Example of product reviews rich snippets
A snippet is a brief summary or excerpt of a document, such as a web page
or article, that is intended to give users an idea of the content and help
them determine its relevance or applicability to their needs. The snippet
typically includes the title of the document and a brief summary of its
content, which may be automatically generated by search engines or other
software. Snippets are commonly used in search results, where they
provide users with a quick overview of the content and help them decide
which results to click on for more information.
The goal of dynamic summaries is to show the parts of the document that will be most useful to the user in assessing it in light of their information need. They may display one or more "windows" on the page; these windows are often referred to as keyword-in-context (KWIC) snippets, since they typically include one or more of the search keywords.
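A very simple keyword-in-context extractor is sketched below purely for illustration; real snippet generators score and select sentence-sized windows, handle overlaps, and highlight the query terms.

def kwic_snippet(text, query_terms, window=30):
    # Return small text windows around the first occurrence of each query term.
    lower = text.lower()
    pieces = []
    for term in query_terms:
        pos = lower.find(term.lower())
        if pos >= 0:
            start, end = max(0, pos - window), pos + len(term) + window
            pieces.append("..." + text[start:end].strip() + "...")
    return " ".join(pieces)

doc = ("Die Brucke was a group of German expressionist artists formed in Dresden "
       "in 1905, and it had a major influence on the development of modern art.")
print(kwic_snippet(doc, ["expressionist", "Dresden"]))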
2.11 SUMMARIZATION, QUESTION ANSWERING
Current research focuses on several significant types of summaries. These summarization tasks are often distinguished by their position along two dimensions:
• Summarizing a single document as opposed to many documents
• Generic versus query-specific summarization
A generic summary is one in which the main points of the material are presented without taking into account the needs of any particular user or information seeker.
The summary is created in response to a user query in query-focused
summarising, also known as focused summarization, topic-based
summarization, and user-focused summarization. A lengthy, non-factoid
response to a user inquiry might be considered query-focused
summarization.
Figure 2.12.1 Query focus summaries
To summarize multiple documents, we first choose the content, organise the information, and then generate the sentences that describe the group of documents.
2.12 CROSS-LINGUAL RETRIEVAL
Cross-lingual retrieval raises several issues, including the need for text in different scripts to be translated to Unicode format, morphological analysis (which differs from language to language), and out-of-vocabulary (OOV) problems: new terms are introduced to a language that the machine may not recognise.
This section discussed link analysis and specialised search in detail. It presented methods like PageRank and HITS for discovering significant pages within a network and discussed the significance of link analysis. The section also covered content-based recommendation.
[2] Modern Information Retrieval: The Concepts and Technology behind Search, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2nd Edition, ACM Press Books, 2012.
[3] Search Engines: Information Retrieval in Practice, Bruce Croft, Donald Metzler and Trevor Strohman, 1st Edition, Pearson, 2009.
[4] Information Retrieval Implementing and Evaluating Search Engines,
Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack, The MIT
Press; Reprint edition (February 12, 2016)
2.15 BIBLIOGRAPHY
[1] Manning, C., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
[2] Baeza-Yates, R., & Ribeiro-Neto, B. (2011). Modern Information
Retrieval: The Concepts and Technology behind Search (2nd ed.). ACM
Press Books.
[3] Croft, B., Metzler, D., & Strohman, T. (2009). Search Engines: Information Retrieval in Practice (1st ed.). Pearson.
[4] Büttcher, S., Clarke, C. L. A., & Cormack, G. V. (2016). Information
Retrieval Implementing and Evaluating Search Engines. The MIT Press;
Reprint edition.
2.16 UNIT END EXERCISES
1. i) Compute the Jaccard similarities of each pair of the following
three sets:
{1,2,3,4}, {2,3,5,7} and {2,4,6}
iii) Compute the Jaccard similarities of each pair of the following two sets:
{a,a,a,b} and {a,a,b,b,c}
2. What is the basic principle of the flow model in PageRank?
3. How can you represent webpages and their links?
4. What are the issues associated with the web structure?
5. Define collaborative filtering as a similar-sets problem, with examples.
6. What is meant by the similarity of documents? Give examples.
7. Define the Jaccard similarity of sets.
8. Briefly explain Hadoop MapReduce, with an example and a diagram.
3
WEB SEARCH ENGINE
Unit Structure
3.0 Objectives
3.1 Introduction and overview of web search
3.2 Web engine structure/architecture
3.3 The user interfaces
3.4 Paid placement
3.5 Search engine spam
3.6 Search engine optimization
3.6.1 Benefits of SEO
3.6.2 Working of SEO
3.6.3 SEO techniques
3.6.4 SEO tools
3.8 Search engine architectures
3.8.2 Harvest architecture
3.8.3 Google architecture
3.9 Summary
3.10 List of References
3.11 Unit End Exercises
3.0 OBJECTIVES
To understand the fundamentals of web search
To get familiar with various concepts, tools and techniques for managing web search
3.1 INTRODUCTION AND OVERVIEW OF WEB SEARCH
A tool used to find information on the World Wide Web is known as a
search engine. The search engine enables users to request information that
meets particular criteria (usually those that contain a particular word or
phrase), and it then returns a list of references that satisfy those
requirements.
Search engines are essentially large-scale full-text indexes of web pages. To work swiftly and effectively, they use indexes that are updated frequently. The quality of the search results is determined by the quality of these indexes and by how the engines use the data they contain. Search engine indexes are comparable to the index at the back of a book, but far more complicated. Understanding even a little about how they are built and used can greatly improve one's searching ability.
Additional types of search engines include mobile search engines, personal search engines that search individual computers, and enterprise search engines that search intranets. Several search engines also mine information from massive databases, open directories like [Link], newsgroups, and other sources. Search engines use algorithms rather than human editors to maintain their Web directories. Many websites that identify themselves as search engines are actually only front ends for search engines that belong to other businesses.
3.2 WEB ENGINE STRUCTURE/ARCHITECTURE
A web search engine has the following main components:
1] Web crawler
Web crawlers are also known as search engine bots, web robots, or spiders. They play a crucial role in search engine optimization (SEO) strategy. A crawler is essentially a piece of software that browses the web, downloads pages, and gathers the information they contain.
The following web crawler features can affect the search results:
o Included Pages
o Excluded Pages
o Document Types
o Frequency of Crawling
2] Database
The search engine database is a non-relational database in which all of the data gathered from the web is stored; it holds a huge number of online resources. Amazon Elasticsearch Service and Splunk are two of the most well-known search engine databases.
The following two characteristics of the database may have an impact on search results:
3] Search interfaces
The search interface is one of the most crucial elements of a search engine. It serves as the user's interface to the database and, in essence, helps users formulate and submit their queries. Search interface features that can affect the results include:
o Operators
o Phrase searching
o Truncation
4] Ranking algorithms
Google uses ranking algorithms to determine the order of web pages in its search results. The following ranking factors have an impact on the search results:
3.3 THE USER INTERFACES
The user interface of a search engine consists of the query interface and the answer interface. The basic query interface is a box into which a string of text is typed. Different search engines may return different results for the same string of words, depending on how they interpret it. For instance, HotBot conducts a search using the intersection of the terms (all terms must occur in the documents returned as results), whereas AltaVista uses their union (documents containing any of the terms are returned). Some search engines provide a sophisticated query interface that supports Boolean operators as well as additional capabilities such as phrase, proximity, URL, title, date-range, and data-type searches.
Regarding the answer interface, search engines typically return sites in order of relevance to the query; in other words, the sites most pertinent to the search are listed first. Every entry in the list of results typically includes the title of the page, a URL, a short synopsis, a size, a date, and the language in which the page is written.
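To make the intersection-versus-union distinction concrete, here is a minimal Python sketch of AND-style and OR-style term matching over a tiny in-memory inverted index (the index contents are invented purely for illustration and are not tied to any particular engine):

    # Tiny illustrative inverted index: term -> set of document IDs.
    index = {
        "bank":   {1, 2, 5},
        "merger": {2, 5, 7},
    }

    def search_and(terms):
        # Intersection semantics: every term must occur in the document.
        postings = [index.get(t, set()) for t in terms]
        return set.intersection(*postings) if postings else set()

    def search_or(terms):
        # Union semantics: any of the terms may occur in the document.
        result = set()
        for t in terms:
            result |= index.get(t, set())
        return result

    print(search_and(["bank", "merger"]))   # {2, 5}
    print(search_or(["bank", "merger"]))    # {1, 2, 5, 7}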
3.4 PAID PLACEMENT
Pay for placement, or P4P, is an Internet advertising model in which advertisements are displayed alongside a Web search engine's relevant search results. In this model, advertisers participate in an open auction, bidding for the opportunity to have an advertisement displayed for particular search terms (also known as keywords).
On the one hand, paid placement seems to be a necessity for businesses, and most significant Web search engines accept it. On the other hand, paid placement may reduce a search engine's market share and its potential for user-generated revenue.
3.5 SEARCH ENGINE SPAM
We have listed a few of the most popular spam techniques below, along with an explanation of how they are classified, to help you get a better picture of what they entail. Hopefully this gives you some insight into what search engines are looking for, as well as into where the next change might come from.
Metatag stuffing
Webmasters used to be able to include as many keywords as they wanted in their meta tags. Because these metatags were the main signal most engines used to evaluate a web site, a sufficient number of instances of the word "radioactive" in your metatags essentially guaranteed your site a top position for that keyword. In response to the stuffing problem, search engine administrators imposed character restrictions on the meta description and meta keywords elements. This meant that they would read only the first "X" characters, or would ignore the tag completely if it exceeded the character limit, so a site would no longer benefit greatly from repetitive keywords. The search engines also put a limit on the number of times a term may be used within a tag, to counteract websites that simply repeated a term such as "travel" 50 times in order to rank highly for that one term. If you exceed that limit, they may "spam" your site out of their index.
Using irrelevant keywords and terms in meta tags
Using completely unrelated keywords in metatags was another favoured method: a webmaster might, for example, stuff popular but irrelevant terms like "sex" or "MP3" into his or her tags. It goes without saying that this prompted the search engines to adjust once again, and they started penalising websites that employed keyphrases that did not appear in, or were unrelated to, the page's actual content.
Excessive submissions
Submitting a site to the search engines too many times within a given time frame is known as excessive submission. When search engines still favoured websites that were "active", re-submission would get your site re-indexed, because a resubmitted website was considered "active". Today, you risk having your website blacklisted if you go over the submission limits. Because many website owners submit their sites to numerous free submission platforms, it is very important to avoid this.
3.6.1 Benefits of SEO
Search engines use complex algorithms and rules to decide which pages to display for a given query. To decide the rankings of their SERPs, these algorithms, which have become incredibly complicated over time, consider hundreds or even thousands of different ranking indicators. Search engines, however, base their evaluation of a site's quality, and of where it should rank, on three key metrics:
3.6.3 SEO techniques
The first step in raising a site's search rankings is understanding how search engines operate. Actually raising a site's ranking requires applying different SEO strategies to enhance the site for search:
Content - Creating high-quality, relevant content improves the user experience and increases your chances of appearing higher in search engine results. A good piece of content is also more likely to be linked to and shared on social media.
Link building - Acquiring high-quality backlinks is one of the fundamental SEO strategies, since links from other websites, or "backlinks" in SEO lingo, are one of the main factors in search engine ranking algorithms.
3.6.4 SEO tools
… helps with resolving technical issues on the website and gives rankings and traffic reports for top keywords and pages.
… those of competitors.
Platforms for SEO - There are many SEO platforms that combine many of the technologies required to optimise websites. Siteimprove, Moz, BrightEdge, Searchmetrics, and Linkdex are a few of the most well-known. These tools assist with keyword research, tracking keyword rankings, finding on-page and off-page SEO opportunities, and a variety of other SEO-related tasks.
Pixel
A computer screen's display is made up of small dots called pixels; everything that appears on a computer screen is represented by pixels. A pixel is close to an absolute size because it varies little from computer to computer. A standard font size is 16px. Pixels have some disadvantages: people with visual issues may be frustrated because Internet Explorer users cannot resize text that has been set in pixels, and printing a Web page whose sizes are set in pixels can also lead to issues. Users of IE7, IE8 and Opera can resize the full page using "page zoom", which helps, although it does not completely address the issue.
Em
Ems have a relative size: their size varies according to the size of the element to which they are "relative", which generally means the font size. The size 2em is therefore twice as large as the current font size. Setting a base font size for your page is generally a smart idea. Most browsers have a default base size of 16px, although users occasionally reset their browser's display size to suit their needs. Simply set the font size to 16px (or whatever suits you) so that the page looks right on as many computers as possible. Many designers regard the use of ems as good practice. The disadvantage is that Internet Explorer 6 occasionally makes fonts set in ems appear much larger (or smaller) than they actually are.
Percentages
This unit of measurement is even more relative. You might, for example, use a font that is 80% of the size of the primary typeface: if your main font size is, say, 16px, text displayed at 80% would come in at roughly 13px. Percentages are determined relative to the inherited size of the parent text. If your primary font size is 16px, your inherited element is at 80%, and another element inherited from the second one is at, say, 80% again, the cumulative font size will be around 10px.
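The cumulative effect of nested percentage sizes is just repeated multiplication. The small Python sketch below (with values chosen to mirror the example above) makes the arithmetic explicit:

    def effective_font_size(base_px, *percentages):
        # Apply a chain of inherited percentage sizes to a base pixel size.
        size = base_px
        for pct in percentages:
            size *= pct / 100.0
        return size

    # A 16px base, one element at 80%, and a nested element at 80% again.
    print(effective_font_size(16, 80, 80))   # 10.24, which renders at roughly 10px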
The web designer Owen Briggs did some research on font sizing and provided 264 screenshots as examples, together with his suggestion, which is a combination of percentages and ems. Although dated 2003, it still works.
Although there are other absolute units for sizing items, such as points, picas, centimetres and millimetres, in practice we almost exclusively use pixels, ems, and percentages.
3.8 SEARCH ENGINE ARCHITECTURES
3.8.2 Harvest architecture
The crawler-indexer architecture comes in a number of different forms, and one of the variations is called Harvest. Harvest is the most significant variation, employing a distributed architecture to gather and distribute data. It is used by the US National Academy of Sciences, NASA, the CIA, and the US Government Printing Office. Moreover, the Harvest Cache from Network Appliance and the Netscape Catalog Server are commercial versions of Harvest.
As seen in Figure 4, Harvest has two key components: gatherers and brokers. A gatherer's duty is to collect and extract indexing information from one or more Web servers. The Harvest system schedules the gathering times, and the name Harvest denotes that these times are periodic. Brokers provide the indexing machinery and the query interface for the gathered data. To update their indices, brokers obtain data from gatherers or from other brokers. Brokers can also filter information and pass it on to other brokers, saving those brokers work.
Network traffic and server workload can be balanced through the configuration of gatherers and brokers. By building topic-specific brokers and focusing the index contents, the Harvest system avoids many of the vocabulary and scaling issues of generic indices. The system also …
3.8.3 Google architecture
The name Google is derived from the word googol, which signifies 10^100. The Google search engine ([Link]) makes heavy use of the hypertext structure of the Web, and it asserts that its results are superior to those of other search engines; over 24 million pages were indexed. To maximise efficiency, Google is written primarily in C/C++ and can run on Linux or Solaris systems. Figure 5 depicts the architecture.
The URL Server sends lists of URLs to the crawlers to retrieve. The crawlers download the listed pages and deliver them to the Store Server, which compresses the pages before storing them in the repository. Every time a new URL is extracted from a Web page, it is assigned an associated ID number called a docID. The indexer carries out the indexing task: it parses the downloaded pages and distributes the resulting word occurrences into a set of "barrels", creating a forward index, and it also extracts the links on each page into an anchors file.
The anchors file includes the text of each link as well as information about where the link points from and to. The URL Resolver reads the anchors file, converts the relative URLs into absolute URLs, and then maps these to docIDs. It adds the anchor text and docID to the forward index, and it builds a links database of pairs of docIDs. This links database is used to compute the PageRanks of all the documents.
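To illustrate what such a PageRank computation involves, here is a simplified Python sketch of the basic power-iteration form of PageRank over a tiny, made-up links database (this is only a toy illustration, not Google's actual implementation):

    def pagerank(links, damping=0.85, iterations=50):
        # links: dict mapping a docID to the list of docIDs it points to.
        pages = list(links.keys())
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in links.items():
                if outlinks:
                    share = damping * rank[page] / len(outlinks)
                    for target in outlinks:
                        new_rank[target] += share
                else:
                    # A dangling page spreads its rank evenly over all pages.
                    for target in pages:
                        new_rank[target] += damping * rank[page] / n
            rank = new_rank
        return rank

    # Toy links database: docID -> docIDs it links to.
    print(pagerank({1: [2, 3], 2: [3], 3: [1]}))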
To create the inverted index, the Sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID. The Sorter also produces a list of wordIDs and their offsets into the inverted index.
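The re-sorting step is easy to picture with a toy example. The Python sketch below (illustrative only; the real barrels are compressed on-disk structures) turns a small forward index keyed by docID into an inverted index keyed by wordID:

    from collections import defaultdict

    # Toy forward index: docID -> wordIDs occurring in that document.
    forward_index = {
        1: [10, 42, 17],
        2: [42, 99],
        3: [10, 42],
    }

    # Invert it: wordID -> list of docIDs containing that word.
    inverted_index = defaultdict(list)
    for doc_id, word_ids in sorted(forward_index.items()):
        for word_id in sorted(set(word_ids)):
            inverted_index[word_id].append(doc_id)

    print(dict(inverted_index))   # {10: [1, 3], 17: [1], 42: [1, 2, 3], 99: [2]}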
This list, together with the lexicon created by the indexer, is input to a program called DumpLexicon, which creates a new lexicon for the searcher to use. The searcher is run by a Web server and uses the inverted index, the PageRanks, and the lexicon created by DumpLexicon to answer queries.
In 1998, Google had downloaded approximately 24 million Web pages. Its total storage was 108.7 GB including the repository and 55.2 GB without it. Google typically took between 1 and 10 seconds to respond to a query.
3.9 SUMMARY
The relevance of the results a search engine returns determines how
valuable it is. Despite the fact that there may be millions of Web pages
using a specific word or phrase, some may be more authoritative, popular,
or relevant than others. The majority of search engines use techniques to
rank the results such that the "better" results are displayed first.
There are significant differences amongst search engines in how they decide which pages are the best matches and in what order the results should be displayed. The methods also change over time, as Internet usage changes and new techniques are developed. The chapter also explains why we need to study search engines and provides relevant references for readers who wish to go further. Most importantly, readers should try out the different search engines that are available today.
3.10 LIST OF REFERENCES
Modern Information Retrieval: The Concepts and Technology behind Search, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2nd Edition, ACM Press Books, 2011.
4
XML RETRIEVAL
Unit Structure
…
4.7 Summary
4.8 List of References
4.9 Unit End Exercises
4.0 OBJECTIVES
To get familiar with the challenges in XML retrieval, along with its models and their evaluation
4.1 INTRODUCTION
A relational database is best able to handle some highly structured text search situations, such as when you wish to identify all employees who are involved in billing and the employee table has an attribute containing short textual job descriptions. In this case a simple SQL selection on the job-description attribute, along the lines of the illustrative query below, could be more than enough to meet your information need with excellent precision and recall.
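A query of roughly the following shape would do the job (the table and column names here are purely illustrative, not taken from the original text):

    select lastname
    from employees
    where job_desc like 'billing%';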
Many text-based structured data sources, however, have long text fields and less precisely defined information needs, and these are better modelled as structured documents rather than as relational data. Searching over such structured documents is referred to as structured retrieval. In structured retrieval, queries can be either structured or unstructured, but for the purposes of this chapter we assume that the collection consists only of structured documents. Examples of structured retrieval include patent databases, blogs, text in which entities like people and locations have been tagged (by a technique called named entity tagging), and output from office suites like OpenOffice that save documents as marked-up text.
Extensible Markup Language, or XML, which is now the most popular such standard, is the only markup language we will examine for encoding structured documents. The nuances that set XML apart from other markup languages such as HTML and SGML will not be discussed; the majority of our discussion in this chapter applies to markup languages in general.
An XML document is a labelled, ordered tree. Each node of the tree is an XML element, written with an opening and a closing tag, and an element may have one or more XML attributes. In the XML document shown in Figure 1, the scene element is enclosed by the two tags <scene ...> and </scene>. It has two child elements, title and poem, as well as an attribute number with the value vii.
Figure 2: The XML document in Figure 1 as a simplified DOM object
The XML Document Object Model, or DOM, is the industry standard for XML document access and processing. The DOM represents elements, attributes, and the text contained within elements as nodes in a tree.
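As a small illustration, the Python snippet below parses a fragment assumed to resemble the document in Figure 1 (using the standard xml.etree.ElementTree module rather than a full DOM implementation) and walks its attribute and child nodes:

    import xml.etree.ElementTree as ET

    # A fragment assumed to resemble the scene element of Figure 1.
    doc = """<scene number="vii">
      <title>Macbeth's castle</title>
      <poem>Will I with wine and wassail ...</poem>
    </scene>"""

    scene = ET.fromstring(doc)
    print(scene.tag, scene.attrib["number"])        # scene vii
    for child in scene:
        # Each child element is a node in the tree with a tag and text.
        print(child.tag, "->", child.text.strip())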
Narrowed Extended XPath I, or NEXI, is a popular format for XML queries. Figure 3 provides an illustration. We display the query on four lines for typographical convenience, but it is intended to be read as one unit without line breaks. In particular, //section is embedded under //article.
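Written out on a single line, a NEXI query of the kind described here looks roughly as follows (an illustrative rendering assembled from the clauses discussed below, not a verbatim copy of Figure 3):

    //article[.//yr = 2001 or .//yr = 2002]//section[about(., summer holidays)]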
The search criteria in Figure 3 call for finding summer-holiday-related sections of articles from 2001 or 2002. As in XPath, double slashes denote that any number of elements may intervene on a path. The dot enclosed in square brackets indicates the element that a clause modifies: the clause [.//yr = 2001 or .//yr = 2002] modifies //article, so the dot refers to //article in this case. Similarly, the dot in [about(., summer holidays)] refers to the section that the clause modifies.
In structured retrieval, users expect the system to return parts of documents (i.e., XML elements) rather than full documents, as IR systems usually do in unstructured retrieval. This presents the first challenge in structured retrieval: deciding which level of the tree to return. If we search Shakespeare's plays for Macbeth's castle, should the full play, the act, or the scene in Figure 2 be returned? The user is most likely looking for the scene in this instance; on the other hand, a general search for Macbeth ought to return the play itself, not a component of it.
One criterion that is often used for making this decision is the structured document retrieval principle: a system should always retrieve the most specific part of a document that answers the query.
This principle motivates a retrieval strategy that descends only to the level of the smallest unit containing the information sought. Applying the principle algorithmically, however, can be challenging. Consider Figure 2 again with the query Macbeth's castle in mind.
The book element for a collection of books, or the play element for Shakespeare's works, can also serve as the indexing unit. The best-hitting subelement for each book or play can then be found by post-processing the search results. For instance, the search for Macbeth's castle might return the play Macbeth, which we can then post-process to find the best-matching subelement, act I, scene vii. Unfortunately, because the relevance of a book as a whole is frequently not a good predictor of the relevance of small subelements within it, this two-stage retrieval process often fails to return the best subelement.
Instead of retrieving large units and identifying subelements within them (top down), we can also search all leaves, choose the most relevant ones, and then expand them to larger units in post-processing (bottom up). For the query Macbeth's castle, for example, we would first retrieve the matching title element and then decide in post-processing whether to return the title, the scene, the act, or the play.
Because of the redundancy caused by nested elements, it is usual to restrict the set of elements that are eligible to be returned. Some possible restrictions are:
• Remove all small elements
• Remove all element types that users do not look at (this requires a working XML retrieval system that logs such information)
• Remove all element types that assessors generally consider irrelevant (if relevance assessments are available)
• Only keep the element types that a system designer or librarian has determined to be useful search results
Most of these approaches still return result sets containing nested elements, so we may want to remove some elements in a post-processing step to eliminate redundancy. Alternatively, we can collapse several nested elements in the results list and highlight the query terms to draw attention to the relevant passages. When query terms are highlighted, scanning a medium-sized element (such as a section) takes only slightly longer than scanning a small subelement (such as a paragraph). Hence, if both the section and the paragraph occur in the results list, it is sufficient to show the section. This approach has the additional advantage of presenting the paragraph together with its context (i.e., the embedding section); even when the paragraph answers the query on its own, the context may be useful for interpreting it, for example for identifying the source of the information reported.
Redundancy is less of an issue if the user knows the structure of the collection and can specify the type of element required, since few nested elements then share the same type. However, as noted in the introduction, users frequently do not know how to construct structured queries, or are unaware of the names of the elements in the collection.
… vocabulary terms in unstructured retrieval.
… lexicalized subtree, not a structural term. For structural terms, we employ the previously described pseudo-XPath notation. We ensure that retrieval results respect this preference by computing a weight for each match. A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:
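In the formulation of Manning, Raghavan and Schütze's Introduction to Information Retrieval, on which this treatment draws, the function can be written roughly as:

    CR(c_q, c_d) = \begin{cases} \dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\ 0 & \text{if } c_q \text{ does not match } c_d \end{cases}

where |c_q| and |c_d| are the numbers of nodes in the query path and the document path, respectively, and c_q matches c_d if c_q can be transformed into c_d by inserting additional nodes.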
The final score for a document is then computed by the scoring function SIMNOMERGE, which is defined as follows:
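Again following the formulation in Manning, Raghavan and Schütze, the function takes roughly the following form:

    SIMNOMERGE(q, d) = \sum_{c_k \in B} \sum_{c_l \in B} CR(c_k, c_l) \sum_{t \in V} weight(q, t, c_k) \, \frac{weight(d, t, c_l)}{\sqrt{\sum_{c \in B,\, t \in V} weight^2(d, t, c)}}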
where V is the vocabulary of non-structural terms, B is the set of all XML contexts, and weight(q, t, c) and weight(d, t, c) are the weights of term t in XML context c in the query and in the document, respectively. These weights can be computed with a standard weighting scheme such as idf_t · wf_{t,d}, where idf_t is based on the document frequency df_t of term t. The algorithm for computing SIMNOMERGE for all documents in the collection is shown in Figure 8.
4.5 EVALUATION OF XML RETRIEVAL
The premier venue for research on XML retrieval is the INEX (INitiative for the Evaluation of XML retrieval) programme, a collaborative effort that has produced reference collections, sets of queries, and relevance judgments. A yearly INEX meeting is held to present and discuss research results. The INEX 2002 collection consisted of about 12,000 articles from IEEE journals. We give collection statistics in Table 2 and show part of the schema of the collection in Figure [Link]. The IEEE journal collection was expanded in 2005, and since 2006 INEX has used the much larger English Wikipedia as a test collection. The relevance of documents is judged by human assessors using a methodology appropriately modified for structured documents.
Table 2: INEX 2002 collection statistics
Because CAS (content-and-structure) queries involve both structural and content criteria, relevance judgements are more complicated than in unstructured retrieval. INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance. The component coverage dimension evaluates whether the element retrieved is "structurally" correct, i.e., neither too low nor too high in the tree. We distinguish four cases:
Exact coverage (E): The information sought is the main topic of the
component and the component is a meaningful unit of information
Too small (S): The information sought is the main topic of the
component, but the component is not a meaningful (self-contained)
unit of information
Too large (L): The information sought is present in the component, but
is not the main topic
No coverage (N): The information sought is not a topic of the
component
The topical relevance dimension also has four levels: highly relevant (3),
fairly relevant (2), marginally relevant (1) and nonrelevant (0).
Components are judged on both dimensions and the judgments are then combined into a digit-letter code: 2S, for example, is a fairly relevant component that is too small, and 3E is a highly relevant component with exact coverage. In theory there are 16 combinations of coverage and relevance, but many of them cannot occur; for example, a nonrelevant component cannot have exact coverage, so the combination 3N is not possible. Each relevance-coverage combination is then quantized to a single graded relevance value, with the best combination (3E) mapped to the highest grade and nonrelevant components (0N) mapped to zero.
Because we sum graded rather than binary relevance values, the usual definitions of precision, recall, and F can be applied only with a suitably modified interpretation of the number of relevant items retrieved.
4.6 TEXT-CENTRIC VERSUS DATA-CENTRIC XML RETRIEVAL
The form of structured retrieval we examine in this chapter uses the XML structure as a framework for matching the text of the query with the text of the XML documents. This is an example of a system designed for text-centric XML: although both text and structure are important, we give text the higher priority, and we do so by adapting unstructured retrieval techniques to handle additional structural constraints. Our approach is predicated on the notion that XML document retrieval is characterised by (i) long text fields (for example, sections of a document), (ii) inexact matching, and (iii) relevance-ranked results. Relational databases do not handle this use case well.
Data-centric XML, on the other hand, mainly encodes numerical and non-text attribute-value data. When querying data-centric XML, we usually want to impose exact-match conditions, so the structural aspects of the XML documents and queries are emphasised. An example is:
Identify employees whose monthly salary is the same as it was a year ago.
There is no need to rank the results of this query. It is purely structural, and an exact match of the salaries in the two time periods is probably sufficient to meet the user's information need.
Text-centric techniques are appropriate for data that are essentially text documents marked up in XML to capture document structure, and most text documents contain some kind of interesting structure, such as paragraphs, sections and headings. Data-centric techniques, by contrast, are frequently used for data collections with complex structures that mainly contain non-text data. A text-centric retrieval engine will struggle, for example, with proteomic data in bioinformatics, or with the representation of a city map that, together with street names and other textual descriptions, forms a navigational database.
Joins and ordering constraints are two additional sorts of queries that are
challenging to handle in a text-centric structured retrieval architecture. A
join is necessary to find employees whose salaries have remained the
same. The following query imposes
an ordering constraint:
Get the chapter on algorithms that comes after the one on binomial heaps
in the book
This query depends on the ordering of XML elements, specifically the ordering of chapter elements under the book node. There are sophisticated XML query languages that support joins, ordering constraints, and numerical attributes. The best known of these is XQuery, a language that the W3C intends to standardise and that is designed to be applicable in every field where XML is used. Because of its complexity, it is difficult to build an XQuery-based ranked retrieval system with the performance characteristics that users have come to expect from information retrieval systems. This is currently one of the most active areas of XML retrieval research.
Relational databases are better suited to handling many structural constraints, especially joins (although ordering is also difficult in a database framework, since the tuples of a relation in the relational calculus are not ordered). For this reason, most data-centric XML retrieval systems are built on relational databases. Using a relational database for XML retrieval is appropriate if text fields are short, exact matching meets the users' needs, and retrieval results in the form of unordered sets are acceptable.
4.7 SUMMARY
XML retrieval, or XML information retrieval, is the content-based retrieval of documents structured with XML (eXtensible Markup Language); it is therefore used to determine the relevance of XML documents and elements. A particular challenge in XML retrieval, related to nesting, is that we may need to discriminate between multiple contexts of a term when computing term statistics for ranking, in particular inverse document frequency (idf) statistics.
The Initiative for the Evaluation of XML Retrieval (INEX) is an international campaign in which almost fifty organisations from around the world participate. It offers a way to assess retrieval systems that return components of documents rather than entire documents.
4.8 LIST OF REFERENCES
[4] Information Retrieval: Implementing and Evaluating Search Engines, Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack, The MIT Press, Reprint edition (February 12, 2016).