# ReasonIR: Evaluation

This directory contains evaluation scripts for Information Retrieval (IR) and Retrieval-Augmented Generation (RAG) tasks as described in our research paper.

## BRIGHT

Setup (same as for synthetic data generation):

```shell
conda create -n reasonir python=3.10
conda activate reasonir
pip install -r evaluation/bright/requirements.txt

# to evaluate with BM25, you need to download Java
wget https://download.oracle.com/java/22/latest/jdk-22_linux-x64_bin.deb
sudo dpkg -i jdk-22_linux-x64_bin.deb
```

To evaluate ReasonIR on BRIGHT, run

```shell
bash evaluation/bright/script.sh
```

To evaluate ReasonIR on BRIGHT with a reranker (QwenRerank), run

```shell
bash evaluation/bright/reranker_script.sh
```

Note that this script first runs the retriever and then the reranker. It produces two results files: `reranker_results.json` and `reranker_retriever_results.json`. The former contains the results obtained using just the reranker scores; the latter contains the results obtained when the reranker scores are interpolated with the retriever scores. If you want to combine the reranker scores with BM25 scores instead, see the comments in `reranker_script.sh` for instructions.
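One common way to interpolate reranker and retriever scores is a weighted sum after per-query normalization. The sketch below illustrates that idea only; the weight `alpha`, the function names, and the normalization choice are our assumptions, not the repo's actual implementation (which lives in `reranker_script.sh` and the scripts it calls).

```python
# Illustrative sketch of score interpolation; NOT the repo's exact logic.
# alpha and min-max normalization are assumptions for this example.

def normalize(scores):
    """Min-max normalize a dict of doc_id -> score to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def interpolate(reranker_scores, retriever_scores, alpha=0.5):
    """Weighted sum of normalized scores; docs missing a score count as 0."""
    rr, rt = normalize(reranker_scores), normalize(retriever_scores)
    docs = set(rr) | set(rt)
    return {d: alpha * rr.get(d, 0.0) + (1 - alpha) * rt.get(d, 0.0)
            for d in docs}

if __name__ == "__main__":
    combined = interpolate({"d1": 2.0, "d2": -1.0}, {"d1": 0.9, "d2": 0.7})
    # Rank documents by the combined score, highest first
    print(sorted(combined, key=combined.get, reverse=True))
```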

To reproduce the results for some other baselines (such as Cohere and Voyage embeddings), please install the other required packages via `pip install -r evaluation/bright/other_requirements.txt`.

## Downstream RAG evaluation

In order to reduce the cost of datastore construction, we first retrieve the top-1000 documents for each benchmark from the original MassiveDS-1.4T built with Contriever. We then merge the retrieved documents into a new, smaller datastore pool for experiments. To merge and deduplicate these documents, we use the script `datastore/construct_datastore_corpus.py`.
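The merge-and-deduplicate step can be pictured as below. This is a simplified sketch, not the code in `datastore/construct_datastore_corpus.py`: the JSONL layout and the `"text"` field are assumptions for illustration, and here duplicates are dropped by exact text hash.

```python
# Simplified sketch of merging per-benchmark retrieval results into one
# deduplicated corpus. The real logic is in
# datastore/construct_datastore_corpus.py; field names are assumed here.
import hashlib
import json

def merge_and_dedup(jsonl_paths):
    """Merge retrieved docs from several JSONL files, dropping exact duplicates."""
    seen, corpus = set(), []
    for path in jsonl_paths:
        with open(path) as f:
            for line in f:
                doc = json.loads(line)
                # Hash the document text so identical passages retrieved
                # for different benchmarks are kept only once.
                key = hashlib.sha1(doc["text"].encode()).hexdigest()
                if key not in seen:
                    seen.add(key)
                    corpus.append(doc)
    return corpus
```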

To embed and index the filtered data with our retriever, run

```shell
git clone https://github.com/RulinShao/retrieval-scaling.git
bash evaluation/rag/datastore/build_datastore.sh
```

We then use the MassiveDS codebase to search for the queries, following the instructions in its Datastore documentation.

### MMLU

To evaluate ReasonIR on MMLU, replace the data directories and run

```shell
export retrieval_file=$YOUR_RETRIEVAL_FILE  # refer to README for more details
export raw_query_file=mmlu.jsonl  # refer to the original MMLU questions used for retrieval
bash evaluation/rag/mmlu_cot/scripts/eval_llama_3_8b_mmlu_rag.sh
```

### GPQA

First, launch the LLM with SGLang to obtain a local serving API:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --disable-cuda-graph --tp 1 --host 0.0.0.0
```

Then, run the evaluation:

```shell
cd evaluation/rag/gpqa
export RETRIEVED_FILE=YOUR_RETRIEVED_FILE
PYTHONPATH=. python src/main.py \
    --config-name naive_rag_default \
    model_path=Qwen/Qwen2.5-7B-Instruct \
    llm_endpoint=http://${VLLM_ENDPOINT}-${VLLM_PORT}:30000/v1 \
    top_k=5 \
    search_engine=offline_massiveds \
    use_query_rewriting=false \
    dataset_name=gpqa \
    split=diamond
```
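Before kicking off the evaluation, it can help to confirm the server is actually reachable. SGLang exposes an OpenAI-compatible API, so listing `/v1/models` should return the served model. The sketch below mirrors how the `llm_endpoint` value is composed from `VLLM_ENDPOINT` and `VLLM_PORT` above; the helper names are ours, not part of the repo.

```python
# Sanity-check sketch for the local serving endpoint (helper names are
# illustrative). SGLang's OpenAI-compatible API serves GET /v1/models.
import json
import os
import urllib.request

def build_endpoint(host: str, port: str) -> str:
    """Reproduce the llm_endpoint value used in the evaluation command."""
    return f"http://{host}-{port}:30000/v1"

def list_models(endpoint: str):
    """Query the OpenAI-compatible /models route and return the model ids."""
    with urllib.request.urlopen(f"{endpoint}/models", timeout=10) as resp:
        data = json.load(resp)
    return [m["id"] for m in data.get("data", [])]

if __name__ == "__main__":
    ep = build_endpoint(os.environ.get("VLLM_ENDPOINT", "localhost"),
                        os.environ.get("VLLM_PORT", "0"))
    print(ep)  # e.g. call list_models(ep) once the server is up
```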