IR Evaluation
We can evaluate IR systems along several different aspects:
1. Retrieval effectiveness (standard IR evaluation)
• Relevance of search results
2. System quality
a) Indexing speed (e.g., how many documents per hour?)
b) Search speed (search latency as a function of index size)
c) Coverage (document collection size and diversity)
d) Expressiveness of the query language
3. User utility
• User happiness based on relevance, speed, and user interface
• User return rate, user productivity (difficult to measure)
• A/B test: a slight change to a deployed system, visible to a fraction of users
• Differences are evaluated using clickthrough log analysis (see the sketch after this list)
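As a rough sketch of such a clickthrough comparison (the log format, names, and data below are assumptions for illustration, not part of the original notes):

```python
# Sketch: compare clickthrough rates (CTR) of two variants of a deployed system.
# The log is assumed to be an iterable of (variant, clicked) pairs.
from collections import Counter

def clickthrough_rates(log):
    impressions, clicks = Counter(), Counter()
    for variant, clicked in log:
        impressions[variant] += 1          # every log entry is one impression
        clicks[variant] += int(clicked)    # count only entries that were clicked
    return {v: clicks[v] / impressions[v] for v in impressions}

log = [("A", True), ("A", False), ("A", False),
       ("B", True), ("B", True), ("B", False)]
print(clickthrough_rates(log))             # {'A': 0.33..., 'B': 0.66...}
```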
Evaluation Criteria
• Effectiveness
• How “good” are the returned documents?
• Efficiency
• Retrieval time, indexing time, index size.
• Usability
• Learnability, flexibility
Reusable Test Collection
• Collection of documents
• Should be representative.
• Sample of information needs.
• Should be randomized and representative.
• Usually formalized as topic statements.
• Known relevance judgments.
• Assessed by humans.
• Binary judgments make evaluation easier.
Good Effectiveness Measures
• Should capture some aspects of what the user wants.
• The measure should be meaningful.
• Should be easily replicated by other researchers.
• Should be easily comparable.
• Expressed as a single number.
Effectiveness evaluation measures
• Set-based measures
• Rank-based measures
Set-based measures
• The IR system returns a set of retrieved results without ranking.
• There is no fixed number of results per query.
• Suitable for Boolean search.
Precision and recall
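Formally, where “relevant” is the set of relevant documents for a query and “retrieved” is the set the system returns, the standard set-based definitions are:

```latex
\mathrm{Precision} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{retrieved}|}
\qquad
\mathrm{Recall} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{relevant}|}
```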
Trade-off between R&P
• Precision
• The ability to retrieve top-ranked documents that are mostly relevant.
• Recall
• The ability to retrieve all of the relevant items.
• Retrieving more documents tends to increase recall but lower precision, and vice versa.
F-measure
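The F-measure combines precision and recall into a single number: the weighted harmonic mean of the two, where the balanced form F1 weights them equally:

```latex
F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R}
\qquad
F_1 = \frac{2 P R}{P + R}
```

A minimal sketch of these set-based measures in Python, assuming results and relevance judgments are available as sets of document IDs (the function name and example data are illustrative, not from the notes):

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and balanced F-measure (F1)."""
    hits = len(retrieved & relevant)       # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # Harmonic mean of P and R; defined as 0 when no relevant document is retrieved.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Example: 3 of the 5 retrieved documents are relevant, out of 6 relevant overall.
p, r, f = precision_recall_f1({1, 2, 3, 4, 5}, {3, 4, 5, 7, 8, 9})
print(f"P={p:.2f}  R={r:.2f}  F1={f:.2f}")   # P=0.60  R=0.50  F1=0.55
```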