This directory contains evaluation scripts for Information Retrieval (IR) and Retrieval-Augmented Generation (RAG) tasks as described in our research paper.
Setup (same as for synthetic data generation):

```bash
conda create -n reasonir python=3.10
conda activate reasonir
pip install -r evaluation/bright/requirements.txt
# to evaluate with BM25, you need to install Java
wget https://download.oracle.com/java/22/latest/jdk-22_linux-x64_bin.deb
sudo dpkg -i jdk-22_linux-x64_bin.deb
```

To evaluate ReasonIR on BRIGHT, run
```bash
bash evaluation/bright/script.sh
```

To evaluate ReasonIR on BRIGHT with a reranker (QwenRerank), run
```bash
bash evaluation/bright/reranker_script.sh
```

Note that this script first runs the retriever and then the reranker. It produces two results files, reranker_results.json and reranker_retriever_results.json: the former contains the results obtained using only the reranker scores, and the latter contains the results obtained when the reranker scores are interpolated with the retriever scores. If you want to combine the reranker scores with BM25 scores instead, see the comments in reranker_script.sh for instructions.
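For intuition, the interpolation amounts to a weighted sum of the two score lists after putting them on a comparable scale. A minimal sketch, assuming scores are dicts mapping document ids to floats; the min-max normalization and the weight `alpha` below are illustrative assumptions, not necessarily what reranker_script.sh uses:

```python
def min_max_normalize(scores):
    """Rescale raw scores to [0, 1] so the two score sources are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all scores being identical
    return {doc_id: (s - lo) / span for doc_id, s in scores.items()}


def interpolate(reranker_scores, retriever_scores, alpha=0.5):
    """Blend normalized reranker and retriever scores for reranked documents.

    Documents the retriever never scored fall back to 0.0 in this sketch;
    the actual script may handle that case differently.
    """
    rr = min_max_normalize(reranker_scores)
    rt = min_max_normalize(retriever_scores)
    return {doc_id: alpha * rr[doc_id] + (1 - alpha) * rt.get(doc_id, 0.0)
            for doc_id in rr}
```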
To reproduce the results for other baselines (such as Cohere and Voyage embeddings), install the additional required packages via `pip install -r evaluation/bright/other_requirements.txt`.
In order to reduce the cost of datastore construction, we first retrieve the top-1000 documents from the original MassiveDS-1.4T datastore (built with Contriever) for each benchmark. We then merge the retrieved documents into a new, smaller datastore pool for experiments. To merge and deduplicate these documents, we use the script in datastore/construct_datastore_corpus.py.
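As a rough sketch of what that step does (the authoritative logic lives in datastore/construct_datastore_corpus.py; the JSONL layout and the `text` field below are assumptions for illustration):

```python
import glob
import hashlib
import json


def merge_retrieved_pools(pattern, output_path):
    """Merge per-benchmark top-k retrieval files (JSONL) into one corpus,
    dropping documents whose text has already been seen."""
    seen = set()
    with open(output_path, "w") as out:
        for path in glob.glob(pattern):
            with open(path) as f:
                for line in f:
                    doc = json.loads(line)
                    # Deduplicate on a hash of the document text.
                    key = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
                    if key not in seen:
                        seen.add(key)
                        out.write(json.dumps(doc) + "\n")


merge_retrieved_pools("retrieved/*.jsonl", "merged_corpus.jsonl")
```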
To embed and index the filtered data with our retriever, run

```bash
git clone https://github.com/RulinShao/retrieval-scaling.git
bash evaluation/rag/datastore/build_datastore.sh
```

We then use the MassiveDS codebase to search for the queries, following its datastore instructions.
To evaluate ReasonIR on MMLU, replace the data directories and run

```bash
export retrieval_file=$YOUR_RETRIEVAL_FILE  # refer to the README for more details
export raw_query_file=mmlu.jsonl            # the original MMLU questions used for retrieval
bash evaluation/rag/mmlu_cot/scripts/eval_llama_3_8b_mmlu_rag.sh
```
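The exact schema of the retrieval file is described in the README; as a quick sanity check before running the script, you can inspect the first record. The snippet below only assumes the file is JSONL; the keys it prints are whatever your file actually contains:

```python
import json
import os

# Peek at the first record of the retrieval file to verify its schema
# against the one documented in the README.
with open(os.environ["retrieval_file"]) as f:
    first = json.loads(next(f))
print(sorted(first.keys()))
```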
To evaluate ReasonIR on GPQA, first launch the LLM with SGLang to obtain a local serving API:

```bash
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --disable-cuda-graph --tp 1 --host 0.0.0.0
```
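Before starting the evaluation, you can verify that the server is up; SGLang serves an OpenAI-compatible API on port 30000 by default, so a quick check (assuming a local launch) is:

```python
import requests

# Query the OpenAI-compatible /v1/models route to confirm the server is serving.
resp = requests.get("http://localhost:30000/v1/models")
resp.raise_for_status()
print(resp.json())  # should list Qwen/Qwen2.5-7B-Instruct
```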
Then, run the evaluation:

```bash
cd evaluation/rag/gpqa
export RETRIEVED_FILE=YOUR_RETRIEVED_FILE
PYTHONPATH=. python src/main.py \
    --config-name naive_rag_default \
    model_path=Qwen/Qwen2.5-7B-Instruct \
    llm_endpoint=http://${VLLM_ENDPOINT}-${VLLM_PORT}:30000/v1 \
    top_k=5 \
    search_engine=offline_massiveds \
    use_query_rewriting=false \
    dataset_name=gpqa \
    split=diamond
```