TimIR - Time Traveling through IR History

Motivation

In some settings, the reproducibility of ranked lists is desirable, such as when extracting a subset of an evolving document corpus for research or in domains such as patent retrieval, in medical systematic reviews or when exploring the rich history of Information Retrieval. Currently the only reliable way of achieving reproducibility in Information Retrieval Settings is Boolean Retrieval, which has therefore became the standard retrieval strategy for such tasks.

We showcase a hybrid retrieval strategy which combines a fast traditional sparse retrieval engine (Lucene) for live queries and a slower columnstore retrieval engine (MonetDB) that keeps all historical changes to the term statistics and is able to recreate the document statistics for a given point in time.

To showcase our proposed system in a small real-world example, we indexed all abstracts of the IR Anthology until 2021 in yearly increments, this includes 52780 documents, 63882 terms in the dictionary, and 2701478 total used terms.

Indexing Strategy

For indexing, we built upon the functionality by Lucene, and are indexing the IR Anthology by Lucene. Once it is processed, the term and document statistics are then mirrored and versioned in MonetDB. Therefore, we applied the RDA Dynamic Data Citation guidelines to our database, and extended each document in the corpora with a validity period. The validity period shows, when a document was added to the index and at which point it was deleted from the corpora. With this information it is then possible to calculate the relevant frequencies for a given point in time.

Functionality

Our System supports a wide range of different tasks, as analyzing the performance of search queries over time, by travelling through the historical states of the index, recreating the index for a given point in time, and last but not least the citation of queries with variable subsets of data. To showcase that queries are reproducible, we further store at the time of execution a hash of the result list to allow the verification of the recreation of the ranked list.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.mvn/wrapper		.mvn/wrapper
backend		backend
columnstore		columnstore
streamlit		streamlit
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
mvnw		mvnw
mvnw.cmd		mvnw.cmd
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TimIR - Time Traveling through IR History

Motivation

Indexing Strategy

Functionality

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TimIR - Time Traveling through IR History

Motivation

Indexing Strategy

Functionality

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages