TITE

This repository contains the code for the paper TITE: Token-Independent Text Encoder for Information Retrieval.

Model Zoo

Pre-trained Models

We provide two pre-trained models. The first uses intra-attention pooling, places the pooling layers late in the model, and uses a kernel size and stride of 2. The second uses the same pooling but additionally upscales the hidden states. The base model is available as webis/tite-2-late and the upscaled model as webis/tite-2-late-upscale. The configuration file pl_config.yaml for each model is available in the respective model repository. To reproduce a model, run the following command with its configuration file:

python main.py fit --config pl_config.yaml
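To give a rough sense of what a kernel size and stride of 2 mean for the sequence length, the sketch below tracks the number of token vectors after each layer. The helper `pooled_lengths`, the layer count, the positions of the pooling layers, and the ceil-style handling of odd lengths are illustrative assumptions, not values taken from the released configuration.

```python
import math


def pooled_lengths(seq_len, num_layers, pooling_layers, kernel_size=2, stride=2):
    """Return the sequence length after each layer.

    Hypothetical helper: assumes the sequence is padded so that a partial
    final window still produces one output vector (ceil-style pooling).
    """
    lengths = []
    for layer in range(num_layers):
        if layer in pooling_layers:
            # Number of windows needed to cover all seq_len positions.
            seq_len = max(1, math.ceil((seq_len - kernel_size) / stride) + 1)
        lengths.append(seq_len)
    return lengths


# Assumed setup: 12 layers, pooling only in the later layers (3 to 11).
print(pooled_lengths(512, 12, set(range(3, 12))))
```

Under these assumptions, each pooling layer halves the sequence, so placing the pooling layers late keeps the full sequence length in the early layers.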

Fine-tuned Models

We fine-tuned both models with Lightning IR on MS MARCO, using distillation scores from a large Set-Encoder model (webis/set-encoder-large). The table below reports the nDCG@10 scores of the fine-tuned models on TREC DL 19 and 20, along with the geometric mean on BEIR. The values differ slightly from those reported in the paper because the fine-tuning setup differs slightly. The configuration files for both fine-tuned models are available in the respective model repositories. To reproduce a fine-tuned model, run the following command with its configuration file:

lightning-ir fit --config pl_config.yaml

Model	TREC DL 19	TREC DL 20	BEIR (geometric mean)
`webis/tite-2-late-msmarco`	0.69	0.71	0.40
`webis/tite-2-late-upscale-msmarco`	0.68	0.71	0.41
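The BEIR column aggregates per-dataset scores with a geometric mean. A minimal sketch of that aggregation follows; the helper name and the per-dataset values are made up for illustration and are not results from this repository.

```python
import math


def geometric_mean(scores):
    """Geometric mean of strictly positive scores, computed in log space."""
    if not scores:
        raise ValueError("need at least one score")
    if any(score <= 0 for score in scores):
        raise ValueError("geometric mean requires strictly positive scores")
    return math.exp(sum(math.log(score) for score in scores) / len(scores))


# Made-up per-dataset nDCG@10 values, for illustration only.
example_scores = {"dataset_a": 0.25, "dataset_b": 0.50, "dataset_c": 1.00}
print(round(geometric_mean(list(example_scores.values())), 4))
```

Compared to an arithmetic mean, the geometric mean is pulled down more strongly by datasets on which a model scores poorly.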

The results in the table can be reproduced using the following commands, replacing {MODEL_NAME} with one of the model names from the table:

lightning-ir index \
  --config lightning-ir-configs/index/trainer.yaml \
  --config lightning-ir-configs/index/model.yaml \
  --config lightning-ir-configs/index/datamodule.yaml \
  --config lightning-ir-configs/index/index-callback.yaml \
  --model.model_name_or_path {MODEL_NAME}
  

lightning-ir search \
  --config lightning-ir-configs/search/trainer.yaml \
  --config lightning-ir-configs/search/model.yaml \
  --config lightning-ir-configs/search/datamodule.yaml \
  --config lightning-ir-configs/search/search-callback.yaml \
  --model.model_name_or_path {MODEL_NAME}
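To run both stages for both fine-tuned models without editing the commands by hand, the commands above can be assembled programmatically. The following is a small sketch: `build_command` is a hypothetical helper, and the script only prints the commands rather than executing them.

```python
import shlex

# Config files per stage, as used in the commands above.
CONFIG_FILES = {
    "index": ["trainer.yaml", "model.yaml", "datamodule.yaml", "index-callback.yaml"],
    "search": ["trainer.yaml", "model.yaml", "datamodule.yaml", "search-callback.yaml"],
}


def build_command(stage, model_name, config_root="lightning-ir-configs"):
    """Build the argument list for one lightning-ir stage (hypothetical helper)."""
    if stage not in CONFIG_FILES:
        raise ValueError(f"unknown stage: {stage}")
    command = ["lightning-ir", stage]
    for config_file in CONFIG_FILES[stage]:
        command += ["--config", f"{config_root}/{stage}/{config_file}"]
    command += ["--model.model_name_or_path", model_name]
    return command


for model_name in ["webis/tite-2-late-msmarco", "webis/tite-2-late-upscale-msmarco"]:
    for stage in ["index", "search"]:
        # Printed only; pass the list to subprocess.run to actually execute it.
        print(shlex.join(build_command(stage, model_name)))
```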

Pre-training

To pre-train a TITE model, run the following command:

python main.py fit \
  --config configs/trainer.yaml \
  --config configs/adamw.yaml \
  --config configs/data/datamodule-tite.yaml \
  --config configs/model/tite.yaml

See the configuration files in the configs directory for available configuration options.
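When several --config files are passed, the Lightning CLI applies them in the order given, and values set later take precedence over earlier ones. The sketch below mimics that behaviour in a simplified way with plain dictionaries; the keys and values are made up and do not reflect the actual files in the configs directory.

```python
def deep_merge(base, override):
    """Return a new dict in which values from `override` take precedence."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def compose(configs):
    """Apply configs in order, so that later ones override earlier ones."""
    result = {}
    for config in configs:
        result = deep_merge(result, config)
    return result


# Hypothetical stand-ins for the contents of three config files.
trainer_cfg = {"trainer": {"max_steps": 1000, "precision": "bf16-mixed"}}
optimizer_cfg = {"optimizer": {"lr": 1e-4}}
override_cfg = {"trainer": {"max_steps": 10}}

print(compose([trainer_cfg, optimizer_cfg, override_cfg]))
```

Splitting the configuration this way makes it possible to swap one part, such as the data module, without touching the rest.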

Note that this command streams the FineWeb dataset from Hugging Face. To avoid streaming during training, we recommend downloading the dataset first and pointing the data module to the local files, as in the following configuration:

data:
  class_path: tite.datasets.FineWebDataModule
  init_args:
    path: arrow
    data_files:
      train: ./HuggingFaceFW___fineweb-edu/default/0.0.0/*/fineweb-edu-train-*.arrow
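Before starting a long training run, it can be worth checking that the glob pattern in data_files actually matches local files. The following sketch uses only the standard library; `find_arrow_shards` is a hypothetical helper and not part of the tite package.

```python
import glob


def find_arrow_shards(pattern):
    """Return the sorted list of files matching a data_files glob pattern."""
    shards = sorted(glob.glob(pattern))
    if not shards:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return shards


if __name__ == "__main__":
    # Same pattern as in the configuration above.
    pattern = "./HuggingFaceFW___fineweb-edu/default/0.0.0/*/fineweb-edu-train-*.arrow"
    try:
        shards = find_arrow_shards(pattern)
        print(f"found {len(shards)} shard(s), first: {shards[0]}")
    except FileNotFoundError as error:
        print(error)
```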

Fine-tuning

We rely on the Lightning IR framework for fine-tuning, indexing, and retrieval. To fine-tune a TITE model, first insert the path or id of a pre-trained model in the lightning-ir-configs/fine-tune/model.yaml file. You can then use the following command:

lightning-ir fit \
  --config lightning-ir-configs/fine-tune/trainer.yaml \
  --config lightning-ir-configs/fine-tune/model.yaml \
  --config lightning-ir-configs/fine-tune/datamodule.yaml \
  --config lightning-ir-configs/fine-tune/adamw.yaml

To evaluate a fine-tuned model, first insert the fine-tuned model name in the lightning-ir-configs/index/model.yaml and lightning-ir-configs/search/model.yaml files. You can then use the following commands to index and search:

lightning-ir index \
  --config lightning-ir-configs/index/trainer.yaml \
  --config lightning-ir-configs/index/model.yaml \
  --config lightning-ir-configs/index/datamodule.yaml \
  --config lightning-ir-configs/index/index-callback.yaml

lightning-ir search \
  --config lightning-ir-configs/search/trainer.yaml \
  --config lightning-ir-configs/search/model.yaml \
  --config lightning-ir-configs/search/datamodule.yaml \
  --config lightning-ir-configs/search/search-callback.yaml

Reproduction

The run files to reproduce the tables in the paper are available on Zenodo. Download and unpack the run files and then run the notebooks/evaluate.ipynb notebook to reproduce the results.
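For orientation, the sketch below shows how nDCG@10 can be computed from a run and relevance judgments. It assumes run files in the common six-column TREC format (query id, Q0, document id, rank, score, tag) and uses linear gains; the function names and example data are hypothetical, and the notebook may rely on a different implementation or gain function.

```python
import math
from collections import defaultdict


def parse_run(lines):
    """Parse TREC-style run lines into {query_id: [(doc_id, score), ...]}."""
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        query_id, _, doc_id, _, score, _ = parts
        run[query_id].append((doc_id, float(score)))
    return run


def ndcg_at_k(ranking, relevance, k=10):
    """nDCG@k with linear gains for a single query."""
    # Rank documents by descending score and keep the top k.
    ranked = sorted(ranking, key=lambda pair: pair[1], reverse=True)[:k]
    dcg = sum(
        max(relevance.get(doc_id, 0), 0) / math.log2(rank + 2)
        for rank, (doc_id, _) in enumerate(ranked)
    )
    # Ideal ranking: judged relevant documents sorted by relevance grade.
    ideal = sorted((rel for rel in relevance.values() if rel > 0), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Made-up run lines and judgments for one query.
run_lines = [
    "q1 Q0 d1 1 3.0 example",
    "q1 Q0 d2 2 2.0 example",
    "q1 Q0 d3 3 1.0 example",
]
qrels = {"q1": {"d1": 2, "d2": 1}}
run = parse_run(run_lines)
print(ndcg_at_k(run["q1"], qrels["q1"]))
```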

The efficiency.py script can be used to reproduce the efficiency results. It writes an efficiency.json file; copy this file into the notebooks directory and run the notebooks/efficiency.ipynb notebook to evaluate it.
