Evaluating Large Language Models for Cross-Lingual Information Retrieval

(Figure: pipeline overview)

This repository contains the code, experimental framework, and evaluation scripts to reproduce the results of the paper Evaluating Large Language Models for Cross-Lingual Retrieval.

Directory Structure

  • retrieval/: First-stage retrieval (e.g., BM25)
  • reranking/: LLM-based second-stage reranking (listwise & pairwise)

Data Loading

For our work, we evaluate on CLEF2003, a corpus of high-resource European languages, and CIRAL, a corpus of low-resource African languages.

To download CLEF2003, you first need to install clef-dataloaders: follow the setup instructions in that repository and run pip install -e . inside the directory where you extracted clef-dataloaders. You can then run python download_data.py to download both CIRAL and CLEF2003. The script also preprocesses the data so that the format is compatible with all evaluation code in this repository. By default, the script downloads both datasets; to download only a specific dataset, pass the argument --dataset <clef/ciral>.
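For example (the directory names below are placeholders; adjust them to where you cloned the repositories):

cd clef-dataloaders
pip install -e .
cd ../llm-clir
python download_data.py --dataset clef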

You can also manually download the CIRAL queries/qrels from here. For queries, we used the -test-a.tsv files. For the qrels, we used the -test-a-pools.tsv files. The CIRAL corpus files can be downloaded from here.

Running Experiments

1. Environment Setup

Before running any scripts, you need to configure the .env file in the project root with the correct local paths. All provided scripts automatically load .env from the repository root.
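A minimal sketch of such a file is shown below. The variable names are hypothetical placeholders used only for illustration; use the names actually read by the scripts in this repository.

# .env -- hypothetical variable names, for illustration only
DATA_DIR=/path/to/downloaded/datasets
RUNS_DIR=/path/to/retrieval/runs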

2. First-stage Retrieval

Refer to /retrieval/README.md for generating initial candidates using BM25 or bi-encoder models.

3. Second-stage Reranking

You can use either listwise or pairwise reranking. Refer to /reranking/README.md for details.

Evaluation

Run the provided script build_score_table.py to evaluate the results and build the final retrieval or reranking score table.

python build_score_table.py \
  --stage <retrieval|reranking> \
  --dataset <clef|ciral> \
  --approach <listwise|pairwise>
  • --stage: Stage of experiments, either retrieval or reranking
  • --dataset: Dataset to evaluate, either clef or ciral
  • --approach: Reranking method, either listwise or pairwise. Only required when --stage=reranking
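
For example, to build the listwise reranking table on CIRAL:

python build_score_table.py --stage reranking --dataset ciral --approach listwise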

After evaluation, a significance test is run automatically: reranking results that differ significantly from the first-stage retrieval results (paired t-test, p < 0.05) are marked with *.
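The snippet below sketches how such a marking can be computed; it is an illustration only, not the repository's implementation, and assumes per-query effectiveness scores are available for both stages.

# Sketch only: paired t-test between per-query scores of the two stages.
from scipy.stats import ttest_rel

bm25 = [0.41, 0.55, 0.38, 0.62, 0.47]     # hypothetical per-query scores (first-stage retrieval)
rerank = [0.48, 0.57, 0.45, 0.70, 0.51]   # hypothetical per-query scores (LLM reranking)

result = ttest_rel(rerank, bm25)
mean_score = sum(rerank) / len(rerank)
marker = "*" if result.pvalue < 0.05 else ""
print(f"{mean_score:.3f}{marker}")        # significant differences get a trailing *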

Citation

If you find this paper useful, please cite:

@inproceedings{zuo-etal-2025-evaluating,
    title = "Evaluating Large Language Models for Cross-Lingual Retrieval",
    author = "Zuo, Longfei and Hong, Pingjun and Kraus, Oliver and Plank, Barbara and Litschko, Robert",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.612/",
    pages = "11415--11429",
}
