Chris Kamphuis defends PhD thesis on Exploring Relations and Graphs for Information Retrieval

by Chris Kamphuis

Finding relevant information in a large collection of documents can be challenging, especially when only text is considered when determining relevance. This research leverages graph data to express information needs that consider more information than just text data. In parts of this work, instead of using inverted indexes as the data representation, we use database management systems to store the data.
First, we show that relational database systems are suited for retrieval experiments. A prototype system we built implements many proposed improvements to the BM25 ranking algorithm. In a large-scale reproduction study, we compare these improvements and find that the differences in effectiveness are smaller than we would expect, given the literature. We can easily change between versions of BM25 by rewriting the SQL query slightly, validating the usefulness of relational databases for reproducible IR research.
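The idea of expressing a ranking function as a declarative query can be sketched with standard SQL. The schema and query below are a hypothetical toy example (not the prototype's actual schema), run through SQLite from Python for self-containment; switching to a BM25 variant would only change the arithmetic inside the SUM() aggregate:

```python
import math
import sqlite3

# Hypothetical toy schema (not the prototype's actual schema):
# one row per (term, document) posting, plus per-document lengths.
con = sqlite3.connect(":memory:")
con.create_function("LN", 1, math.log)  # expose natural log to SQL
con.executescript("""
CREATE TABLE docs(docid INTEGER, len INTEGER);
CREATE TABLE postings(term TEXT, docid INTEGER, tf INTEGER);
INSERT INTO docs VALUES (1, 4), (2, 6), (3, 3);
INSERT INTO postings VALUES
  ('graph', 1, 2), ('graph', 2, 1), ('query', 2, 3), ('query', 3, 1);
""")

K1, B, N = 0.9, 0.4, 3  # BM25 free parameters and collection size
AVGDL = con.execute("SELECT AVG(len) FROM docs").fetchone()[0]

# BM25 as a single aggregation query; a BM25 variant only changes
# the arithmetic expression inside SUM().
ranking = con.execute(f"""
    SELECT p.docid,
           SUM(LN(({N} - idf.df + 0.5) / (idf.df + 0.5) + 1) *
               (p.tf * ({K1} + 1)) /
               (p.tf + {K1} * (1 - {B} + {B} * d.len / {AVGDL}))) AS score
    FROM postings p
    JOIN docs d ON d.docid = p.docid
    JOIN (SELECT term, COUNT(*) AS df FROM postings GROUP BY term) idf
      ON idf.term = p.term
    WHERE p.term IN ('graph', 'query')
    GROUP BY p.docid
    ORDER BY score DESC
""").fetchall()
print(ranking)  # doc 2 matches both query terms and ranks first
```

Because the ranking function lives entirely in the query text, an experiment that compares scoring variants is a diff between two SQL strings rather than a change to index code.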
Then, we extend the data model to a graph data model. Using a graph data model, we can include more diverse data than just text. We show that complex information needs are easier to express in a corresponding graph query language than in a relational language. This model is built on top of an embedded database system, allowing fast materialization of output data and its use in further steps.
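As a rough illustration of why graph traversal suits such information needs, here is a minimal sketch over a hypothetical document-entity graph (not the thesis's actual model or query language): finding documents that share an entity with a given document is a single two-hop path pattern in a graph query language, whereas in SQL it requires a self-join over the edge table:

```python
from collections import defaultdict

# Hypothetical document-entity graph: undirected edges between
# documents and the entities they mention.
edges = [("doc1", "Nijmegen"), ("doc2", "Nijmegen"),
         ("doc2", "Radboud"), ("doc3", "Radboud")]
graph = defaultdict(set)
for doc, entity in edges:
    graph[doc].add(entity)
    graph[entity].add(doc)

def related_docs(doc):
    """Documents that share at least one entity with `doc`: a single
    two-hop path pattern in a graph query language, but a self-join
    over the edge table in SQL."""
    return {d for entity in graph[doc] for d in graph[entity]} - {doc}

print(sorted(related_docs("doc1")))  # doc2 shares the entity 'Nijmegen'
```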
One of the aspects we capture in the graph is information about entities. We use the Radboud Entity Linking (REL) system to connect entity information with documents. To annotate a large document collection with REL in reasonable time, we first improved its efficiency. After these improvements, we used REL to create annotations for the MS MARCO document and passage collections. Using these annotations, we can significantly improve recall for the harder MS MARCO queries. The entities are also used in an interactive demonstration that exploits their geographical data.

[more information]

Clause-Driven Automated Grading of SQL’s DDL and DML Statements

by Benard Wanjiru, Patrick van Bommel and Djoerd Hiemstra

Automated grading systems for SQL courses can significantly reduce instructor workload while ensuring consistency and objectivity in assessment. At our university, an automated SQL grading tool has become essential for evaluating assignments. Initially, we focused on grading Data Query Language (SELECT) statements, which constitute the core content of assignments in our first-year computer science course. SELECT statements produce a results table, which makes automatic grading relatively easy. However, other SQL statements, such as CREATE TABLE, INSERT, DELETE, and UPDATE, do not produce a results table, which makes grading them more difficult. Recognizing the need to cover broader course material, we have extended our system to evaluate advanced Data Definition Language (DDL) and Data Manipulation Language (DML) statements. In this paper, we describe our approach to automated DDL/DML grading and illustrate our method of clause-driven tailored feedback generation. We explain how our system generates precise, targeted feedback based on specific SQL clauses or components. In addition, we present a practical example to highlight the benefits of our approach. Finally, we benchmark our grading tool against existing systems. Our extended tool can parse and provide feedback on most student SQL submissions. It consistently provides targeted feedback, generating nearly one suggestion per error. It generates shorter feedback for simpler DML queries, while more complex syntax leads to longer feedback. It can pinpoint the precise SQL error and generate actionable suggestions, with each message directly tied to the specific component that caused the error.
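The clause-driven idea can be sketched in a few lines: split a statement into its top-level clauses and compare each clause against a reference solution, emitting one targeted suggestion per deviating clause. This is a deliberately naive, regex-based sketch, not the actual grading tool (which parses SQL properly and normalizes equivalent answers):

```python
import re

CLAUSES = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY"]

def split_clauses(sql: str) -> dict:
    """Naively split a SELECT statement into its top-level clauses."""
    sql = re.sub(r"\s+", " ", sql.strip().rstrip(";"))
    pattern = "|".join(re.escape(c) for c in CLAUSES)
    parts = re.split(rf"\b({pattern})\b", sql, flags=re.IGNORECASE)
    clauses, i = {}, 1
    while i < len(parts) - 1:  # parts alternate: keyword, clause body
        clauses[parts[i].upper()] = parts[i + 1].strip()
        i += 2
    return clauses

def clause_feedback(student: str, reference: str) -> list:
    """One targeted suggestion per clause that deviates from the reference."""
    s, r = split_clauses(student), split_clauses(reference)
    feedback = []
    for clause in CLAUSES:
        if clause in r and clause not in s:
            feedback.append(f"Missing {clause} clause: expected '{r[clause]}'.")
        elif clause in s and clause not in r:
            feedback.append(f"Unexpected {clause} clause.")
        elif clause in s and s[clause].lower() != r[clause].lower():
            feedback.append(f"Check your {clause} clause: got '{s[clause]}'.")
    return feedback

msgs = clause_feedback(
    "SELECT name FROM students ORDER BY name",
    "SELECT name FROM students WHERE year = 1 ORDER BY name",
)
print(msgs)  # one suggestion, targeting only the missing WHERE clause
```

The point of the sketch is the one-suggestion-per-deviating-clause structure: feedback is tied to the component that caused the error rather than to the statement as a whole.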

To be presented at the SIGCSE Technical Symposium on Computer Science Education (SIGCSE 2026) on 18-21 February 2026 in St. Louis, United States of America.

Welcome to Databases 2025!

Welcome to Part B, Databases! We will resume on Tuesday 4 November with a lecture at 15:30h in HG00.304.

The Databases part contains mandatory individual quizzes, for which the following honour code applies:

  • You do not share the solutions;
  • The solutions to the quizzes should be your own work;
  • You do not post the quizzes, nor the solutions anywhere online;
  • You do not use instruction-tuned large language models like GitHub Copilot or ChatGPT;
  • You are allowed, and encouraged, to discuss the quizzes and to ask clarifying questions of your fellow students. Please use the Brightspace Discussion Forum to reach me, the teaching assistants, and your fellow students.

New this year are the online Socoles SQL exercises. Please register with the Socoles Autograder; see the previous announcement. Socoles will automatically give feedback on open questions that require SQL solutions. Socoles helps us grade the assignments of about 200 students in the course. Of course, you will get human feedback too, during the tutorials on Thursday mornings.

Wishing you a fruitful Part B!
Best wishes,  Djoerd Hiemstra and Benard Wanjiru

Fatemeh Sarvi defends PhD thesis on Learning to Rank for e-Commerce Search

by Fatemeh Sarvi

Ranking is at the core of information retrieval, from search engines to recommendation systems. The objective of a ranking model is to order items based on their degree of relevance to the user’s information need, which is often expressed by a textual query. In product search, customers search through numerous options using brief, unstructured phrases, and the goal is to find not only relevant but also appealing products that match their preferences and lead to purchases. On the other side are the providers of the products, who expect the ranking model to expose their items fairly to customers. These complications introduce unique characteristics that set product search apart from other types of search.
This thesis investigates the specific challenges of applying learning to rank models in product search and presents methods to improve relevance, fairness, and effectiveness in this setting. We start by focusing on query-product matching based on textual data, as traditional information retrieval methods rely heavily on text to determine relevance. The vocabulary gap refers to the difference between the language used in queries and the terms found in product descriptions; it has been shown to be larger in product search, mainly due to the limited and unstructured nature of queries and product descriptions. In Chapter 2, we conduct a comprehensive evaluation of state-of-the-art supervised learning to match models, comparing their performance in product search. Our findings identify models that balance both accuracy and efficiency, offering practical insights for real-world applications.
Next, in Chapters 3 and 4 we address fairness in ranking on two-sided platforms, where the goal is to satisfy both groups of product search users at the same time. Accurate exposure estimation is crucial to achieve this balance. To this end, we introduce the phenomenon of outlierness in ranking as a factor that can influence exposure-based fair ranking algorithms. Outlier items are products that deviate from the others in a ranked list due to distinct presentational features. We show empirically that these items attract more user attention and can impact the exposure distribution in a list. To account for this effect, we propose OMIT, a method that reduces outlierness without compromising user utility or fairness towards providers. In the next chapter, we investigate whether outlier items influence user clicks. We introduce outlier bias as a new type of click bias, and propose OPBM, an outlier-aware click model designed to account for both outlier and position bias. Our experiments show that, in the worst case, OPBM performs similarly to the well-known position-based model, making it a more reliable choice.
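As a rough illustration of outlierness (a simplified stand-in, not the thesis's actual definition), a ranked list's presentational features can be screened for outliers with a simple z-score:

```python
from statistics import mean, stdev

def outlier_items(feature_values, threshold=2.0):
    """Flag items whose presentational feature (e.g. a displayed
    discount percentage) deviates from the rest of the ranked list
    by more than `threshold` standard deviations: a simple z-score
    notion of outlierness."""
    mu, sigma = mean(feature_values), stdev(feature_values)
    if sigma == 0:
        return []  # all items look alike: no outliers
    return [i for i, v in enumerate(feature_values)
            if abs(v - mu) / sigma > threshold]

# Ten products; the one at position 4 (0-based) carries a 70%-off badge
discounts = [5, 0, 10, 5, 70, 0, 5, 10, 0, 5]
print(outlier_items(discounts))  # flags position 4
```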
Finally, in Chapter 5 we explore how different presentational features influence user attention and perception of outliers in product search results. Through visual search and eye-tracking experiments, along with visual saliency modeling, we identify user scanning patterns and determine the role of bottom-up and top-down factors in guiding attention and shaping the perception of outliers.

[download pdf]

Team OpenWebSearch at LongEval

Using Historical Data for Scientific Search

by Daria Alexander, Maik Fröbe, Gijs Hendriksen, Matthias Hagen, Djoerd Hiemstra, Martin Potthast and Arjen de Vries

We describe the submissions of the OpenWebSearch team for the CLEF 2025 LongEval Sci-Retrieval track. Our approaches aim to explore how historical data from the past can be re-used to build effective rankings. The Sci-Retrieval track uses click-data and documents from the CORE search engine. We start all our submissions from rankings of the CORE search engine that we crawled for all queries of the track. This has two motivations: first, we hypothesize that a good practical search engine should only make minor improvements in the ranking at a time (i.e., we would like to only make small adjustments to the production ranking), and, second, we hypothesize that only documents that are in the top ranks of the CORE ranking can be relevant in the setup of LongEval where relevance is derived from clicks (i.e., we try to incorporate the position bias of the clicks into our rankings). Based on this crawled CORE ranking, we try to make improvements via qrel-boosting, RM3 keyqueries, clustering, monoT5 re-ranking and user intent prediction. Our evaluation shows that qrel-boosting, RM3 keyqueries, clustering and intent prediction improve the CORE ranking that we re-rank.

To be presented at the 16th Conference and Labs of the Evaluation Forum (CLEF 2025) on 9-12 September in Madrid, Spain.

[download pdf]

Join us at DIR 2025!

Be part of the 22nd Dutch-Belgian Information Retrieval Workshop at Radboud University, Nijmegen. We warmly invite you to register and to share your latest research with the community.

  • Submission deadline: Friday 10 October 2025, 23:59 CEST
  • Notification: Monday 13 October 2025
  • Registration deadline: Monday 20 October 2025, 23:59 CEST

Sponsored by SIGIR (ACM Special Interest Group on Information Retrieval) and SIKS (School of Information and Knowledge Systems)

More information at: https://informagus.nl/dir2025/

On the Neural Hype and Improving Efficiency of Sparse Retrieval

“The Neural Hype, Justified!” exclaimed Jimmy Lin in an opinion paper in the SIGIR Forum of December 2019. But is it really? Effectiveness-wise, maybe not: I will share some recent examples that show that neural rankers on new data do not even significantly improve a weak sparse baseline. If they do improve on old data, some neural rankers have been pre-trained on the test data – the ultimate sin of the machine learning professional – convincingly shown for the MovieLens data in the SIGIR 2025 poster of Dario Di Palma and colleagues: “Do LLMs Memorize Recommendation Datasets?” Efficiency-wise, neural rankers are no match for sparse rankers. The standard BERT (re-)ranker hailed by Lin’s SIGIR Forum paper may be as much as 10 million times less efficient than a sparse ranker (Yes, you read that right). I will show some recent innovations for improving the efficiency of sparse rankers: the score-fitted index and the constant-length index (a SIGIR 2025 poster too!), which are implemented in Zoekeend, a new experimental search engine based on the relational database engine DuckDB and available from: https://gitlab.science.ru.nl/informagus/zoekeend/

Presented at the SIGIR Workshop on Reaching Efficiency in Neural Information Retrieval (ReNeuIR 2025)

[download slides]

Gebre Gebremeskel defends PhD thesis on recommender systems

Spotlight on Recommender Systems: Contributions to Selected Components in the Recommendation Pipeline

by Gebrekirstos Gebremeskel

This thesis sheds light on the different components of the recommendation pipeline under three themes, divided into 10 chapters. The first theme is Cumulative Citation Recommendation (CCR), which supports the automated maintenance of knowledge bases such as Wikipedia: given a set of knowledge base entities, CCR is the task of filtering and ranking documents according to their citation-worthiness to those entities. We specifically focused on the filtering stage of the recommendation process and on the interplay between feature sets and machine learning algorithms. There are four chapters under the first theme: Chapters 3 to 6. Chapter 3 presents experiments with string-matching and machine learning approaches to the task of CCR. Chapter 4 investigates the interplay between the choice of feature sets and their impact on the performance of machine learning algorithms. Chapter 5 investigates the impact of the initial filtering task on overall CCR performance, and what makes some documents unfilterable. Chapter 6 reviews new advances in the area of the theme and the specific chapters. Under this theme, we show that simple string-matching approaches can have advantages over complex machine learning approaches for the task of CCR, that comparisons of machine learning algorithms should take into account the sets of features used, and that the filtering stage of a CCR task can impact recommender system performance in different ways.

The second theme is News Recommendation. In this theme, we investigate news recommendation with a particular focus on evaluation. We study the role of geography in news consumption to understand the geographical focus of news items and the geographical location of readers, followed by the incorporation of geographic information into online deployments of algorithms.
We also attempt to quantify random fluctuations in the performance difference of a live recommender system. After that, we focus on the evaluation of news recommendation, investigating it from several angles: we conducted A/A tests (running two instances of the same algorithm), offline evaluations, online evaluations, and comparisons of algorithm performance across years. There are three chapters under the theme of News Recommendation. Chapter 7 investigates the role of geographic information in news consumption, and examines, in a real-world setting, the performance patterns of news recommender systems, one of which incorporates geographic information into its algorithm. Chapter 8 examines the challenges, validity, and consistency of news recommender system evaluations from multiple perspectives, involving A/A tests, offline evaluations, online evaluations, and comparisons of algorithm performance across years. Chapter 9 reviews advances in News Recommendation with a focus on developments relevant to the approaches and findings presented in Chapters 7 and 8. Under this theme, we show that user and item geography play a role in the consumption of news, that there are significant differences and discrepancies between offline and online evaluation of recommender system algorithms, and that random effects on online performance can result in statistically significant performance differences.

The third and final theme is Measuring Personalization, and consists of Chapter 10. We view personalization as introducing or imposing differentiation between users in terms of the items recommended to them. In this differentiation, some items will be shared between users, and some will not.
We then propose and apply a user-centric metric of personalization that uses the recommendation lists, together with the reaction lists that result from users choosing to click or react, to measure the degree to which users agree to the differentiation introduced or imposed between them by the recommender system: whether they converge (for example, by clicking more on shared items) or diverge from the differentiation (for example, by clicking more on items that are not shared).

[Read more]

Selective Search as a First-Stage Retriever

by Gijs Hendriksen, Djoerd Hiemstra, and Arjen de Vries

Selective search assumes a document collection can be partitioned into topical index shards in such a way that individual search requests would be satisfied with a few shards only. Previous work has considered primarily the retrieval effectiveness of selective search architectures in an early precision setting. In this work, we instead consider selective search as the first stage in a multi-stage pipeline, and therefore focus on obtaining high recall. We reproduce the most important algorithms from the selective search literature, and show that they can match the recall level of exhaustive search while reducing the required resources by 50%. We compare the different types of resource selection algorithms, and conclude that the more straightforward strategies that can select shards at a low cost actually outperform the more involved algorithms, in terms of reliably obtaining high recall with fewer shards.
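To give a flavor of low-cost resource selection, here is a simplified, hypothetical sketch in the spirit of sample-based algorithms such as ReDDE (omitting the shard-size scaling of the full algorithm, and not a reproduction of the paper's exact methods): shards are ranked by how many of their sampled documents appear at the top of a central sample index (CSI) ranking.

```python
from collections import Counter

def shard_ranking(csi_ranking, shard_of, k=10):
    """Rank shards by how many of the top-k documents of a central
    sample index (CSI) they contributed; the query is then forwarded
    only to the highest-ranked shards.

    csi_ranking: sampled doc ids, ordered by their CSI retrieval score
    shard_of:    maps each sampled doc id to the shard it came from
    """
    votes = Counter(shard_of[doc] for doc in csi_ranking[:k])
    return [shard for shard, _ in votes.most_common()]

# Toy example: three shards and a CSI ranking of six sampled documents
shard_of = {"d1": "A", "d2": "B", "d3": "A", "d4": "C", "d5": "A", "d6": "B"}
csi = ["d1", "d3", "d2", "d5", "d4", "d6"]
print(shard_ranking(csi, shard_of, k=4))  # shard A dominates the top-4
```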

To be presented at the 16th Conference and Labs of the Evaluation Forum (CLEF), in September 2025 in Madrid

[download pdf]

IRRJ Volume 1, Number 1

We are proud to introduce the first issue of the Information Retrieval Research Journal (IRRJ). IRRJ is the only peer-reviewed diamond open access journal that focuses exclusively on the information retrieval research community. The journal provides free and unrestricted online open access to papers in information retrieval, and runs fully on volunteer work by editors, reviewers, a production editor, a webmaster, and an advisory board. IRRJ requires neither subscription fees nor article processing fees: at IRRJ the readers do not pay, and the authors do not pay either. Instead, IRRJ plans to be completely self-funded, running on micro-donations and using resources and infrastructure provided by friend organizations and universities. We are grateful to Radboud University and the Royal Netherlands Academy of Arts and Sciences (KNAW) for providing the initial funding and infrastructure.

[read more]