This repository contains the code and data for our SPIRE'22 paper on unintended train-test leakage with neural retrieval models.
We studied the effects of unintended train-test leakage between MS MARCO/ORCAS and Robust04 as well as the TREC 2017 and 2018 Common Core tracks, finding that 69% of the Robust04 queries have near-duplicates in MS MARCO/ORCAS (as do 74% of the TREC 2017 and 76% of the TREC 2018 Common Core queries; see the sketch below for how such near-duplicates can be detected). We then trained five neural retrieval models on a fixed number of MS MARCO/ORCAS queries that are highly similar to the actual test queries plus an increasing number of other queries, to study the effects of such leaked instances.
- Sentence-BERT embeddings for all MS MARCO/ORCAS, Robust04, and Common Core queries, together with their nearest neighbors (10 GB in total), are available at https://files.webis.de/corpora/corpora-webis/corpus-webis-leaking-queries.
- Our manually annotated data on query reformulation types and near-duplicate thresholds is located in `src/main/resources/manual-annotations/`.
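For illustration, the near-duplicate detection can be sketched with the `sentence-transformers` library as follows; the model name, similarity threshold, and example queries are placeholders, not the exact configuration from the paper:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder model; the paper's exact Sentence-BERT variant may differ.
model = SentenceTransformer('all-MiniLM-L6-v2')

test_queries = ['international organized crime']  # e.g., a Robust04 title query
train_queries = [                                 # hypothetical MS MARCO/ORCAS queries
    'what is international organized crime',
    'weather in san francisco',
]

test_emb = model.encode(test_queries, convert_to_tensor=True)
train_emb = model.encode(train_queries, convert_to_tensor=True)

# Cosine similarity between every test/train query pair.
similarities = util.cos_sim(test_emb, train_emb)

# Flag pairs above a (placeholder) near-duplicate threshold.
THRESHOLD = 0.9
for i, test_query in enumerate(test_queries):
    for j, train_query in enumerate(train_queries):
        score = float(similarities[i][j])
        if score >= THRESHOLD:
            print(f'Leak candidate: "{test_query}" ~ "{train_query}" ({score:.2f})')
```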
Compile everything (including running unit tests):

```bash
mvn clean install
```
First, create the query datasets with the notebook `src/main/jupyter/construction-of-query-datasets.ipynb`.
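Conceptually, the notebook assembles training sets that keep a fixed set of leaked (near-duplicate) queries while growing the number of non-leaked ones. A minimal sketch of that idea, with entirely hypothetical file names and sizes:

```python
import json
import random

random.seed(42)  # reproducible sampling

# Hypothetical input files: IDs of leaked (near-duplicate) training queries
# and of the remaining, non-leaked training queries.
with open('leaked-query-ids.json') as f:
    leaked = json.load(f)
with open('non-leaked-query-ids.json') as f:
    others = json.load(f)

# Keep the leaked queries fixed, grow the rest of the training set.
for size in (1_000, 5_000, 10_000, 50_000):
    dataset = leaked + random.sample(others, size)
    with open(f'query-dataset-{size}.json', 'w') as f:
        json.dump(dataset, f)
```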
Build Anserini indexes for the query datasets by running:

```bash
./src/main/bash/index-query-dataset.sh msmarco-document-train
./src/main/bash/index-query-dataset.sh msmarco-document-orcas
```
Construct candidates for leaking queries by running:

```bash
./src/main/bash/construct-candidates.sh msmarco-document-train
./src/main/bash/construct-candidates.sh msmarco-document-orcas
```
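To inspect one of these indexes interactively, a lexical candidate lookup can be sketched with Pyserini; the repository's shell scripts build the indexes with Anserini itself, so Pyserini is only a convenient way to peek at the result, and the index path and query below are placeholders:

```python
from pyserini.search.lucene import LuceneSearcher

# Placeholder path: wherever index-query-dataset.sh wrote the Anserini index.
searcher = LuceneSearcher('anserini-indexes/msmarco-document-train')

# Retrieve lexically similar training queries as leak candidates.
hits = searcher.search('international organized crime', k=10)
for hit in hits:
    print(f'{hit.docid:15} {hit.score:.4f}')
```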
The scripts in `./src/main/jupyter/model-training` were used to train all the models in the different scenarios.
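For orientation, fine-tuning a BERT-based re-ranker on labeled query-document pairs can be sketched with `sentence-transformers` as below; the base model, training pairs, and hyperparameters are placeholders, and the notebooks in `model-training` define the actual setups:

```python
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# Placeholder base model and toy training pairs (label 1.0 = relevant).
model = CrossEncoder('bert-base-uncased', num_labels=1)
train_samples = [
    InputExample(texts=['international organized crime',
                        'Report on transnational crime syndicates ...'], label=1.0),
    InputExample(texts=['international organized crime',
                        'Recipe for banana bread ...'], label=0.0),
]

loader = DataLoader(train_samples, shuffle=True, batch_size=2)
model.fit(train_dataloader=loader, epochs=1)
model.save('reranker-sketch')
```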
Start the Jupyter notebook with:

```bash
docker run --rm -ti -p 8888:8888 -v "${PWD}":/home/jovyan/work jupyter/datascience-notebook
```
The scripts in `./src/main/jupyter/reports-paper` contain all the evaluations and experiments reported in the paper.