This repository contains the code, experimental framework, and evaluation scripts to reproduce the results of the paper *Evaluating Large Language Models for Cross-Lingual Retrieval*.
- `retrieval/`: First-stage retrieval (e.g., BM25)
- `reranking/`: LLM-based second-stage reranking (listwise & pairwise)
In our work, we evaluate on CLEF2003 as a corpus of high-resource European languages and on CIRAL as a corpus of low-resource African languages.
To download CLEF2003, you first need to install `clef-dataloaders`. Follow the setup instructions in that repository, and run `pip install -e .` inside the directory where you extracted `clef-dataloaders`.
You can then use `python download_data.py` to download both CIRAL and CLEF2003.
The script preprocesses the data so that the format is compatible with all evaluation code inside this repository. By default, the script downloads both datasets, but you can restrict it to a single dataset by passing the argument `--dataset <clef/ciral>`.
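For example, to fetch only CLEF2003, run `python download_data.py --dataset clef`.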
You can also manually download the CIRAL queries/qrels from here. For queries, we used the `-test-a.tsv` files. For the qrels, we used the `-test-a-pools.tsv` files. The CIRAL corpus files can be downloaded from here.
Before running any scripts, you need to configure the `.env` file in the project root with the correct local paths.
All provided scripts automatically load `.env` from the repository root.
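As a rough illustration, the snippet below shows how a script can pick up these paths, assuming the common `python-dotenv` package; the keys `DATA_DIR` and `RUNS_DIR` are placeholders, not necessarily the variables this repository actually uses (check the scripts or the `.env` template for the real keys).

```python
# Minimal sketch of resolving local paths from .env, assuming python-dotenv.
# DATA_DIR / RUNS_DIR are placeholder keys, not necessarily the ones this repo uses.
import os

from dotenv import load_dotenv

load_dotenv()  # loads key=value pairs from the nearest .env file into the environment

data_dir = os.environ["DATA_DIR"]  # e.g., where download_data.py stored CLEF2003/CIRAL
runs_dir = os.environ["RUNS_DIR"]  # e.g., where retrieval/reranking runs are written
print(f"data: {data_dir}, runs: {runs_dir}")
```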
Refer to /retrieval/README.md for generating initial candidates using BM25 or bi-encoder models.
You can use either listwise or pairwise reranking. Refer to /reranking/README.md for details.
Run the provided script `build_score_table.py` to evaluate the runs and build the final retrieval or reranking results table.
```bash
python build_score_table.py \
    --stage <retrieval|reranking> \
    --dataset <clef|ciral> \
    --approach <listwise|pairwise>
```
- `--stage`: Stage of experiments, either `retrieval` or `reranking`
- `--dataset`: Dataset to evaluate, either `clef` or `ciral`
- `--approach`: Reranking method, either `listwise` or `pairwise`. Only required when `--stage=reranking`
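For example, to build the results table for listwise reranking on CIRAL: `python build_score_table.py --stage reranking --dataset ciral --approach listwise`.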
After evaluation, a significance test is run automatically; reranking results that differ significantly from the retrieval baseline (paired t-test, p < 0.05) are marked with `*`.
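As a rough sketch of what that check amounts to, the snippet below runs a paired t-test over per-query metric scores with `scipy.stats.ttest_rel`; the data and helper function are illustrative and not the exact code used by `build_score_table.py`.

```python
# Illustrative paired t-test over per-query scores (e.g., nDCG); not the exact
# evaluation code of build_score_table.py.
from scipy.stats import ttest_rel


def significantly_different(baseline_scores, reranked_scores, alpha=0.05):
    """Return True if the reranking run differs significantly from the baseline."""
    _, p_value = ttest_rel(reranked_scores, baseline_scores)
    return p_value < alpha


# Toy per-query scores for the same queries, in the same order.
baseline = [0.42, 0.31, 0.55, 0.20, 0.47]
reranked = [0.51, 0.38, 0.60, 0.33, 0.50]
print(significantly_different(baseline, reranked))  # True -> mark with '*'
```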
If you find this paper useful, please cite:
```bibtex
@inproceedings{zuo-etal-2025-evaluating,
    title = "Evaluating Large Language Models for Cross-Lingual Retrieval",
    author = "Zuo, Longfei and Hong, Pingjun and Kraus, Oliver and Plank, Barbara and Litschko, Robert",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.612/",
    pages = "11415--11429",
}
```