This repository contains the code and the data for our SPIRE'22 paper on unintended train-test leakage with neural retrieval models.

SPIRE-22

We studied the effects of unintended train-test leakage from MS MARCO/ORCAS into Robust04 and the TREC 2017 and 2018 Common Core tracks. We found that 69% of the Robust04 queries have near-duplicates in MS MARCO/ORCAS, as do 74% of the TREC 2017 Common Core queries and 76% of the TREC 2018 Common Core queries. To study the effects of such leaked instances, we then trained five neural retrieval models on a fixed number of MS MARCO/ORCAS queries that are highly similar to the actual test queries, combined with an increasing number of other queries.
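To make the notion of a near-duplicate query concrete, the following is a minimal sketch that flags test queries with a highly similar training query. It uses Jaccard similarity over lowercase tokens purely for illustration; the similarity measure, the `threshold` value, the function names, and the toy queries are all assumptions and are not taken from the paper or from the code in this repository.

```python
import re


def tokens(query):
    """Lowercase a query and return its set of alphanumeric tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", query.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leaked_fraction(test_queries, train_queries, threshold=0.8):
    """Fraction of test queries that have at least one training query
    with Jaccard similarity >= threshold (hypothetical criterion)."""
    train_sets = [tokens(q) for q in train_queries]
    hits = sum(
        1
        for q in test_queries
        if any(jaccard(tokens(q), t) >= threshold for t in train_sets)
    )
    return hits / len(test_queries) if test_queries else 0.0


# Toy data, invented for this example only.
toy_test = ["international organized crime", "hubble telescope achievements"]
toy_train = ["International organized crime", "what is a telescope"]

print(leaked_fraction(toy_test, toy_train))  # 0.5
```

The pairwise comparison is quadratic in the number of queries, so this only works for small samples. At the scale of MS MARCO/ORCAS, candidates would first have to be retrieved from an index, which is the role of the indexing and candidate-construction steps in the Setup section.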

Data

Setup

  • Compile everything (including running unit tests): mvn clean install

  • Create the query datasets with the notebook src/main/jupyter/construction-of-query-datasets.ipynb.

  • Build Anserini indexes for the query datasets by running:

    ./src/main/bash/index-query-dataset.sh msmarco-document-train
    ./src/main/bash/index-query-dataset.sh msmarco-document-orcas
    
  • Construct candidates for leaking queries by running:

    ./src/main/bash/construct-candidates.sh msmarco-document-train
    ./src/main/bash/construct-candidates.sh msmarco-document-orcas
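
The indexing and candidate-construction steps above can also be driven from a small helper. The following is a minimal sketch, assuming that the project has already been compiled and that the query datasets have already been created with the notebook. The function names `setup_commands` and `run_setup` are hypothetical and not part of this repository; only the script paths and dataset names are taken from the steps above.

```python
import subprocess

# Dataset names and script paths as listed in the setup steps.
DATASETS = ["msmarco-document-train", "msmarco-document-orcas"]
SCRIPTS = [
    "./src/main/bash/index-query-dataset.sh",
    "./src/main/bash/construct-candidates.sh",
]


def setup_commands():
    """Return the commands in order: index all datasets, then build candidates."""
    return [[script, dataset] for script in SCRIPTS for dataset in DATASETS]


def run_setup(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in setup_commands():
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_setup(dry_run=True)
```

Running it with the default `dry_run=True` only prints the four commands, which is a cheap way to check the order before starting the actual indexing from the repository root.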
    

Training of Models

The scripts in ./src/main/jupyter/model-training were used to train all the models in the different scenarios.
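
As an illustration of the training setup described in the introduction (a fixed number of leaked queries combined with an increasing number of other queries), the following is a minimal sketch. The function name `build_training_sets`, the sizes, the seed, and the nesting of the samples are assumptions made for this example; the actual procedure is the one implemented in the model-training scripts.

```python
import random


def build_training_sets(leaked, others, sizes, seed=0):
    """Return a dict mapping each size to a training set that contains all
    leaked queries plus `size` other queries.

    The other queries are shuffled once and taken as prefixes, so every
    smaller training set is contained in every larger one (an assumption
    of this sketch)."""
    rng = random.Random(seed)
    shuffled = list(others)
    rng.shuffle(shuffled)
    training_sets = {}
    for size in sizes:
        if size > len(shuffled):
            raise ValueError(f"requested {size} other queries, only {len(shuffled)} available")
        training_sets[size] = list(leaked) + shuffled[:size]
    return training_sets


# Toy identifiers, invented for this example only.
example_sets = build_training_sets(
    leaked=["leaked-1", "leaked-2"],
    others=[f"other-{i}" for i in range(10)],
    sizes=[0, 5, 10],
)
for size, queries in example_sets.items():
    print(size, len(queries))
```

Keeping the leaked queries fixed while only the number of other queries grows means that differences between the resulting models can be attributed to the changing share of leaked instances rather than to a different selection of them.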

Evaluation

Start the Jupyter notebook server with:

docker run --rm -ti -p 8888:8888 -v "${PWD}":/home/jovyan/work jupyter/datascience-notebook

The scripts in ./src/main/jupyter/reports-paper contain all the evaluations and experiments reported in the paper.
