Classification Methods & Cluster Hypothesis
Information Retrieval CC4151
Classification Methods
In the context of information retrieval, a classification is required for a purpose.
The purpose may be to group the documents in such a way that retrieval will be faster or
alternatively it may be to construct a thesaurus automatically.
There are two main areas of application of classification methods in IR:
(1) keyword clustering;
(2) document clustering.
Clustering and Cluster Hypothesis
Clustering is used in information retrieval systems to
enhance the efficiency and effectiveness of the retrieval
process. Clustering is achieved by partitioning the documents
in a collection into classes such that documents that are
associated with each other are assigned to the same cluster.
The cluster hypothesis is an assumption about the nature of
the data being handled, and it takes various forms. In
information retrieval, it states that documents that are
clustered together "behave similarly with respect to
relevance to information needs".
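One way to see what the hypothesis implies is to compare document similarities directly. The following is a minimal sketch, assuming bag-of-words vectors and cosine similarity (neither is prescribed by the slides); the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents: d1 and d2 would be relevant to the same
# information need (the car), d3 to a different one (the animal).
docs = {
    "d1": "jaguar car engine speed",
    "d2": "jaguar car dealer price",
    "d3": "jaguar cat jungle habitat",
}
vectors = {name: Counter(text.split()) for name, text in docs.items()}

sim_same_topic = cosine(vectors["d1"], vectors["d2"])
sim_cross_topic = cosine(vectors["d1"], vectors["d3"])
print(sim_same_topic, sim_cross_topic)
```

If the hypothesis holds for a collection, documents relevant to the same need tend to score higher against each other than against unrelated documents, which is what makes grouping them useful.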
Applications of Clustering
Application               | What is clustered?       | Benefit
Search result clustering  | search results           | more effective information presentation to user
Scatter-Gather            | (subsets of) collection  | alternative user interface: "search without typing"
Collection clustering     | collection               | effective information presentation for exploratory browsing
Language modeling         | collection               | increased precision and/or recall
Cluster-based retrieval   | collection               | higher efficiency: faster search
Search Result Clustering
By search results we mean the documents that were returned in
response to a query.
The default presentation of search results in information retrieval is
a simple list.
Users scan the list from top to bottom until they have found the
information they are looking for. Instead, search result clustering
clusters the search results, so that similar documents appear
together.
It is often easier to scan a few coherent groups than many individual
documents.
This is particularly useful if a search term has different word senses.
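The grouping step can be sketched in a few lines. This is only an illustration, assuming a simple single-pass ("leader") clustering over bag-of-words vectors with a cosine threshold; the slides do not name a specific algorithm, and the function name `cluster_results`, the threshold value, and the result snippets are hypothetical.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_results(snippets, threshold=0.4):
    """Single-pass clustering: each result joins the first cluster whose
    leader (its first member) is similar enough, else starts a new one.
    Returns clusters as lists of indices into `snippets`."""
    vectors = [Counter(s.lower().split()) for s in snippets]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Made-up results for the ambiguous query "jaguar".
results = [
    "jaguar car review engine",
    "jaguar cat habitat jungle",
    "jaguar car price dealer",
    "jaguar cat diet prey",
]
groups = cluster_results(results)
print(groups)
```

In this toy run the two word senses of the query term end up in separate groups, so a user can skip the sense they are not interested in.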
Scatter-Gather
Scatter-Gather clusters the whole collection to get groups of documents that the user can
select or gather.
The selected groups are merged and the resulting set is again clustered. This process is
repeated until a cluster of interest is found.
Example: A collection of New York Times news stories is clustered ("scattered") into eight
clusters (top row). The user manually gathers three of these into a smaller collection,
"International Stories", and performs another scattering operation. This process repeats until a
small cluster with relevant documents is found (e.g., Trinidad).
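The scatter and gather steps can be sketched as two small functions. This is a rough illustration under stated assumptions: `scatter` uses farthest-first seeding followed by a single assignment pass (a stand-in for whatever clustering the real system uses), `gather` simulates the user's manual choice with a keyword filter, and the six headlines are made up.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def scatter(docs, k):
    """Split docs into at most k clusters (assumed, simplified method)."""
    if not docs:
        return []
    vecs = [Counter(d.lower().split()) for d in docs]
    seeds = [0]
    while len(seeds) < min(k, len(docs)):
        candidates = [i for i in range(len(docs)) if i not in seeds]
        # Next seed: the document least similar to its closest seed.
        nxt = min(candidates,
                  key=lambda i: max(cosine(vecs[i], vecs[s]) for s in seeds))
        seeds.append(nxt)
    clusters = [[] for _ in seeds]
    for i, vec in enumerate(vecs):
        best = max(range(len(seeds)),
                   key=lambda c: cosine(vec, vecs[seeds[c]]))
        clusters[best].append(docs[i])
    return clusters

def gather(clusters, wanted_terms):
    """Merge every cluster that mentions any wanted term. This stands in
    for the user picking clusters by hand."""
    merged = []
    for cluster in clusters:
        if any(t in doc.split() for doc in cluster for t in wanted_terms):
            merged.extend(cluster)
    return merged

stories = [
    "trinidad oil export", "trinidad carnival music",
    "germany election vote", "germany football cup",
    "stock market shares", "stock bond yields",
]
round1 = scatter(stories, k=3)                       # first scattering
international = gather(round1, {"trinidad", "germany"})
round2 = scatter(international, k=2)                 # scatter again
final = gather(round2, {"trinidad"})
print(final)
```

Each round works on a smaller set than the previous one, so the clusters become more specific until the user reaches the group of interest.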
Collection clustering
Clustered collections store documents ordered by the clustered index key value.
Clustered collections have the following benefits compared to non-clustered collections:
• Faster queries on clustered collections without needing a secondary index, such as queries
with range scans and equality comparisons on the clustered index key.
• Clustered collections have a lower storage size, which improves performance for queries
and bulk inserts.
• Clustered collections have additional performance improvements for inserts, updates,
deletes, and queries.
Language Modelling
A common suggestion to users for coming up with good queries is
to think of words that would likely appear in a relevant document,
and to use those words as the query. The language modelling
approach to IR directly models that idea: a document is a good
match to a query if the document model is likely to generate the
query, which will in turn happen if the document contains the query
words often. This approach thus provides a different realization of
some of the basic ideas for document ranking.
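The idea that a document matches a query when its model is likely to generate the query can be made concrete with a unigram model. The sketch below is one possible realization, not necessarily the one intended in the course: it assumes a unigram document model mixed with a collection model (linear interpolation with weight `lam`, a smoothing choice added here so that unseen query words do not zero out the score). The function name and the two toy documents are hypothetical.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(query | document model) under a smoothed unigram model.
    query, doc and collection are lists of tokens."""
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc)
        p_coll = coll_counts[term] / len(collection)
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            # The term occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score

# Made-up two-document collection.
docs = {
    "d1": "cluster analysis groups similar documents".split(),
    "d2": "language models rank documents".split(),
}
collection = [tok for tokens in docs.values() for tok in tokens]
query = "similar documents".split()

scores = {name: query_log_likelihood(query, tokens, collection)
          for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Documents that contain the query words more often receive a higher likelihood and therefore rank higher; here d1 contains both query words and d2 only one.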
Example: Finite Automata
Cluster-based Retrieval
Cluster-based information retrieval is one of the information retrieval (IR) tools
that organizes web documents, extracts their features, and categorizes them according
to their similarity.
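One common way to use such clusters at query time is to compare the query with a summary of each cluster first and then rank only the documents in the best-matching cluster. The sketch below assumes that reading, with term-count centroids and cosine similarity; the clusters are taken as already computed, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(cluster):
    """Summed term counts of a cluster. Cosine ignores vector length,
    so the sum serves as well as the mean here."""
    total = Counter()
    for doc in cluster:
        total.update(doc.lower().split())
    return total

def cluster_search(query, clusters):
    """Pick the cluster whose centroid best matches the query, then rank
    only that cluster's documents. A real system would precompute the
    centroids instead of rebuilding them per query."""
    q = Counter(query.lower().split())
    best = max(clusters, key=lambda c: cosine(q, centroid(c)))
    return sorted(best,
                  key=lambda d: cosine(q, Counter(d.lower().split())),
                  reverse=True)

# Made-up, pre-clustered documents.
clusters = [
    ["jaguar car engine", "jaguar car price"],
    ["jaguar cat jungle", "jaguar cat prey"],
]
hits = cluster_search("cat prey", clusters)
print(hits)
```

Only one centroid comparison per cluster is needed before the detailed ranking, which is where the efficiency gain comes from when clusters are large.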