This repository contains the code and models for the paper "AugTriever: Unsupervised Dense Retrieval by Scalable Data Augmentation".
Our code is based on the following repositories:
- SimCSE released with the SimCSE paper.
- Contriever released with the Contriever paper.
- MoCo released with the MoCo paper.
- DPR released with the DPR paper.
- SPIDER released with the SPIDER paper.
- MTEB released with the MTEB paper.
- BEIR released with the BEIR paper.
- SentEval released with the SentEval paper.
- E5 released with the E5 paper.
- rank_bm25
AugQ-Wiki and AugQ-CC can be downloaded from the Hugging Face Hub.
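For instance, with the `datasets` library (the dataset ID below is a placeholder; substitute the actual AugQ repository names on the Hub):

```python
# Minimal sketch for loading an AugQ corpus with the `datasets` library.
# NOTE: "PLACEHOLDER/AugQ-Wiki" is a hypothetical ID -- substitute the actual
# AugQ-Wiki / AugQ-CC repository names published on the Hugging Face Hub.
from datasets import load_dataset

augq_wiki = load_dataset("PLACEHOLDER/AugQ-Wiki", split="train")
print(augq_wiki[0])  # inspect one augmented (pseudo-query, document) example
```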
Model names below correspond to Table 1 in the paper.
| Aug method | Model | MM | BEIR (14 tasks) | Download Link |
|---|---|---|---|---|
| Hybrid-TQGen+ | MoCo | 24.6 | 41.1 | [download] |
| Hybrid-All | MoCo | 23.5 | 39.4 | [download] |
| Hybrid-TQGen | MoCo | 23.3 | 39.4 | [download] |
| Doc-Title | MoCo | 21.8 | 38.7 | [download] |
| QExt-PLM | MoCo | 20.6 | 38.2 | [download] |
| TQGen-Topic | MoCo | 21.2 | 38.9 | [download] |
| TQGen-Title | MoCo | 21.8 | 39.3 | [download] |
| TQGen-AbSum | MoCo | 23.2 | 39.6 | [download] |
| TQGen-ExSum | MoCo | 23.0 | 39.4 | [download] |
| TQGen-Topic | InBatch | 20.7 | 39.0 | [download] |
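A minimal usage sketch, assuming a downloaded checkpoint has been converted to Hugging Face BERT format (see the conversion commands below); the checkpoint path is a placeholder. It applies the average pooling and dot-product scoring used throughout this repo:

```python
# Sketch: encode texts with a converted AugTriever checkpoint and score them
# with average pooling + dot product (matching `--pooling average --metric dot`).
# NOTE: "hf_ckpt_bert" is a placeholder path for a converted checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

path = "hf_ckpt_bert"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModel.from_pretrained(path)

def mean_pool(last_hidden_state, attention_mask):
    # Average token embeddings over non-padding positions.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

texts = [
    "what is dense retrieval?",
    "Dense retrieval encodes queries and documents into vectors.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    emb = mean_pool(model(**batch).last_hidden_state, batch["attention_mask"])
score = emb[0] @ emb[1]  # dot-product similarity between query and passage
```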
A few scripts for launching training are provided in the folder examples/training. For example:

```bash
cd $PATH_TO_REPO
sh examples/training/cc.moco.topic50.bs2048.gpu8.sh
```

Please refer to BEIR for data download.
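As a sketch, a single dataset (e.g. fiqa, used in the command below) can be fetched with the beir package's standard download helper:

```python
# Download and unpack one BEIR dataset into data/beir/ using the beir package.
from beir import util

dataset = "fiqa"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "data/beir/")
print(data_path)  # e.g. data/beir/fiqa
```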
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_addr=127.0.0.1 --master_port=2255 eval_beir.py --model_name_or_path output_dir/augtriever-release/cc.T03b_title50.moco-2e14.contriever256-special50.bert-base-uncased.avg.dot.q128d256.step100k.bs1024.lr5e5/ --dataset fiqa --metric dot --pooling average --per_gpu_batch_size 128 --beir_data_path data/beir/ --output_dir eval_dir/beir
```
Please refer to Spider for details about QA data download and processing.

```bash
export EXP_DIR="output_dir/cc-hybrid.RC20+T0gen80.seed477.moco-2e14.contriever256-special50.bert-base-uncased.avg.dot.q128d256.step200k.bs2048.lr5e5/"
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port=31133 --max_restarts=0 generate_passage_embeddings.py --model_name_or_path $EXP_DIR --output_dir $EXP_DIR/embeddings --passages data/nq/psgs_w100.tsv --per_gpu_batch_size 512
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python eval_qa.py --model_name_or_path $EXP_DIR --passages data/nq/psgs_w100.tsv --passages_embeddings "$EXP_DIR/embeddings/*" --qa_file data/nq/qas/*-test.csv,data/nq/qas/entityqs/test/P*.test.json --output_dir $EXP_DIR/qa_output --save_or_load_index
```
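To sanity-check the generated shards, a small sketch, assuming each shard is a pickle of (passage_ids, embeddings) as in the Contriever code this repo builds on (check generate_passage_embeddings.py for the exact format):

```python
# Inspect generated passage-embedding shards.
# Assumption: each shard is a pickle of (passage_ids, embeddings), following
# the Contriever codebase; verify against generate_passage_embeddings.py.
import glob
import os
import pickle

import numpy as np

for shard in sorted(glob.glob(os.path.expandvars("$EXP_DIR/embeddings/*"))):
    with open(shard, "rb") as f:
        ids, embeddings = pickle.load(f)
    print(shard, len(ids), np.asarray(embeddings).shape)
```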
Convert to a Huggingface BERT model:

```bash
python convert_checkpoint_to_hf_bert.py --ckpt_path output_dir/cc.T03b_topic.inbatch.contriever256-special.bert-base-uncased.avg.dot.q128d256.step100k.bs1024.lr5e5 --output_dir output_dir/cc.T03b_topic.inbatch.contriever256-special.bert-base-uncased.avg.dot.q128d256.step100k.bs1024.lr5e5/hf_ckpt_bert --model_type shared
```
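A quick check (not from the repo) that the converted checkpoint loads as a plain BERT encoder:

```python
# Smoke test: the converted checkpoint should load as a standard BERT model.
from transformers import AutoModel, AutoTokenizer

ckpt = "output_dir/cc.T03b_topic.inbatch.contriever256-special.bert-base-uncased.avg.dot.q128d256.step100k.bs1024.lr5e5/hf_ckpt_bert"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)
print(model.config.model_type)  # expect "bert"
```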
Convert to a Huggingface DPR model:

```bash
python convert_checkpoint_to_hf_dpr.py --ckpt_path output_dir/cc-hybrid.RC20+T0gen80.seed477.moco-2e14.contriever256-special50.bert-base-uncased.avg.dot.q128d256.step200k.bs2048.lr5e5 --output_dir output_dir/cc-hybrid.RC20+T0gen80.seed477.moco-2e14.contriever256-special50.bert-base-uncased.avg.dot.q128d256.step200k.bs2048.lr5e5/hf_ckpt_dpr --model_type shared
```
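And a hedged sketch for the DPR export: with `--model_type shared` we assume the question and context encoders share one set of weights, so loading a transformers DPR class directly from the output directory should work (adjust if the converter writes separate subfolders):

```python
# Sketch: load the converted checkpoint with transformers' DPR classes.
# Assumption: `--model_type shared` exports a single encoder loadable as both
# question and context encoder; adjust paths if the converter writes subfolders.
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

ckpt = "output_dir/cc-hybrid.RC20+T0gen80.seed477.moco-2e14.contriever256-special50.bert-base-uncased.avg.dot.q128d256.step200k.bs2048.lr5e5/hf_ckpt_dpr"
tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(ckpt)
encoder = DPRQuestionEncoder.from_pretrained(ckpt)
inputs = tokenizer("what is dense retrieval?", return_tensors="pt")
print(encoder(**inputs).pooler_output.shape)  # (1, hidden_size)
```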
To gather scores, replace the experiment path in gather_score_beir.py, gather_score_qa.py, or gather_score_senteval.py and run it. For example:

```bash
python gather_score_beir.py
```

AugTriever is licensed under the BSD 3-Clause License.
Evaluation code forked from external repositories is placed in subfolders (e.g. src/beir, src/beireval, src/mteb, src/mtebeval, src/qa, src/senteval). Please refer to the LICENSE in each subfolder for its copyright information.
If you find the AugTriever code or models useful, please cite the paper using the following BibTeX entry.
```bibtex
@article{meng2022augtriever,
  title={AugTriever: Unsupervised Dense Retrieval by Scalable Data Augmentation},
  author={Meng, Rui and Liu, Ye and Yavuz, Semih and Agarwal, Divyansh and Tu, Lifu and Yu, Ning and Zhang, Jianguo and Bhat, Meghana and Zhou, Yingbo},
  journal={arXiv preprint arXiv:2212.08841},
  year={2022}
}
```