This repository contains the codebase for OpenScholar.
Blog | Demo | Paper | Model checkpoints and data | ScholarQABench | Expert Evaluation | Public demo code | Public demo data
- Overview of OpenScholar
- Repository Organizations
- Installation
- Run OpenScholar
- Train OpenScholar-8B
- Run Retriever
- Contact and Citation
Scientific progress hinges on our ability to find, synthesize, and build on relevant knowledge from the scientific literature. However, the exponential growth of this literature—with millions of papers now published each year—has made it increasingly difficult for scientists to find the information they need or even stay abreast of the latest findings in a single subfield.
To help scientists effectively navigate and synthesize scientific literature, we introduce OpenScholar, a retrieval-augmented language model (LM) designed to answer user queries by first searching for relevant papers in the literature and then generating responses grounded in those sources. Try open-scholar.allen.ai/ and check our paper for more details.
This repository contains the code to run OpenScholar inference.
- `src/`: Main source code for OpenScholar.
- `training/`: Training code to train Llama 3.1 8B using our processed data. We modified an earlier version of `torchtune` for training.
- `retriever/`: Code to run retrieval offline and to host retrieval servers for online retrieval.
For automatic and human evaluations, please check the following repositories.
- To run evaluations on ScholarQABench, please check the ScholarQABench repository.
- For our human evaluation interfaces as well as the results, please check the OpenScholar_ExpertEval repository.
To run OpenScholar inference, please ensure that all necessary libraries are installed.
Set up the environment:
conda create -n os_env python=3.10.0
conda activate os_env
pip install -r requirements.txt
python -m spacy download en_core_web_sm
Also, please set the following API keys:
export S2_API_KEY=YOUR_S2_API_KEY
See instructions to acquire API keys at the Semantic Scholar API page.
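To confirm the key is picked up before running the pipeline, you can issue a small request against the public Semantic Scholar Graph API. This is only a minimal sanity-check sketch; the query, fields, and limit below are illustrative and not part of this repository:

```python
import os
import requests

# Minimal sanity check for the Semantic Scholar API key.
# The query, fields, and limit are arbitrary illustrative values.
resp = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    headers={"x-api-key": os.environ["S2_API_KEY"]},
    params={"query": "retrieval-augmented language models", "fields": "title,year", "limit": 3},
)
resp.raise_for_status()
for paper in resp.json().get("data", []):
    print(paper.get("year"), paper.get("title"))
```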
If you also want to use a web search engine, sign up for the you.com web API and set the key:
export YOUR_API_KEY=YOUR_YOU_COM_API_KEY
For information related to OpenScholar training and retriever components, refer to the training/ and retrieval/ directories, respectively.
By default, OpenScholar consumes offline retrieval results produced by the scripts in retrieval/, supplemented with additional passages retrieved via the Semantic Scholar Paper API and web search APIs. See the script src/use_search_apis.py to retrieve related passages offline using these external APIs.
We released our retrieval results on Google Drive.
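For reference, each record in the input file is expected to carry its retrieved passages under a `ctxs` field (see the configuration notes below). The sketch here writes a toy input file; apart from `ctxs`, the field names are assumptions and should be checked against `src/` and the ScholarQABench data:

```python
import json

# Illustrative input record: only the "ctxs" field name is documented in this README;
# the other keys ("input", "title", "text") are assumptions to verify against src/.
example = {
    "input": "How do retrieval-augmented LMs reduce hallucination in literature synthesis?",
    "ctxs": [
        {"title": "Some retrieved paper", "text": "Chunked passage text from the datastore ..."},
        # ... more passages, e.g. produced by retrieval/ or src/use_search_apis.py
    ],
}

# Whether run.py expects a JSON list or JSONL is an assumption; a JSON list is shown here.
with open("YOUR_INPUT_FILE", "w") as f:
    json.dump([example], f, indent=2)
```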
- Run a Standard RAG pipeline using the top 10 passages
python run.py \
--input_file YOUR_INPUT_FILE \
--model_name OpenScholar/Llama-3.1_OpenScholar-8B \
--use_contexts \
--output_file OUTPUT_FILE_PATH \
--top_n 10 --llama3 --zero_shot
- Run a Retriever + Reranker pipeline
python run.py \
--input_file YOUR_INPUT_FILE \
--model_name OpenScholar/Llama-3.1_OpenScholar-8B \
--use_contexts \
--ranking_ce \
--reranker OpenScholar/OpenScholar_Reranker \
--output_file OUTPUT_FILE_PATH \
--top_n 10 --llama3 --zero_shot
- Run the Open Retrieval Self-reflective Generation pipeline
python run.py \
--input_file YOUR_INPUT_FILE \
--model_name OpenScholar/Llama-3.1_OpenScholar-8B \
--use_contexts --output_file OUTPUT_FILE_NAME \
--top_n 10 --llama3 \
--ranking_ce --reranker OpenScholar/OpenScholar_Reranker \
--posthoc --feedback --ss_retriever \
--use_abstract --norm_cite --zero_shot --max_per_paper 3
You can also combine the OpenScholar pipeline with proprietary LLMs by specifying model_name, api, and api_key_fp.
python run.py \
--input_file YOUR_INPUT_FILE \
--model_name "gpt-4o" \
--api "openai" \
--api_key_fp PATH_TO_YOUR_OPEN_AI_KEY \
--use_contexts \
--output_file OUTPUT_FILE_PATH \
--top_n 10 --llama3 --zero_shot
Below, we provide details of the configurations.
- `top_n`: The number of passages fed into the underlying LM. By default, we use `10` for multi-paper tasks.
- `feedback`: Set true if you want to use the self-feedback loop during generation.
- `posthoc_at`: Set true if you want to run post-hoc citation attribution.
- `zero_shot`: Set true if you want to run inference in a zero-shot manner.
- `ranking_ce`: Use a reranking model to rerank the `top_n` passages; if not set true, we take the `top_n` passages from the `ctxs` in the provided input file.
- `reranker`: Specify the path to the reranker model (local or HF Hub). If you use our OpenScholar reranker, set `OpenScholar/OpenScholar_Reranker`.
- `min_citation`: The minimum number of citations. If an `int` is given, we exclude papers whose citation count is below `min_citation`. By default, we set it to `None`, and all papers are considered regardless of their citation counts.
- `ss_retriever`: Use the Semantic Scholar API during the feedback generation loop to enhance the feedback results.
- `use_abstract`: Use abstracts to enhance the reranking results.
- `max_per_paper`: The maximum number of passages from the same paper used at inference time.
- `task_name`: Specify the task name when you run the single-paper tasks. For SciFact, PubMedQA, and QASA, the corresponding task names are `claim_full`, `boolean_question_full`, and `single_qa`, respectively.
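To spot-check a finished run, you can load the output file and print a few generations. This is a rough sketch under the assumption that the output is a JSON list whose records keep the generated answer (here called `output`) and the passages used (`ctxs`); check what `run.py` actually writes and adjust the keys accordingly:

```python
import json

# Hypothetical output schema: a list of records with the generated answer and the
# passages fed to the model. The key names here are assumptions, not the documented API.
with open("OUTPUT_FILE_PATH") as f:
    results = json.load(f)

for record in results[:3]:
    print(record.get("output", "")[:300])            # generated, citation-grounded answer
    print("passages used:", len(record.get("ctxs", [])))
```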
- We trained our embedding model using peS2o, which is used to form the OpenScholar Datastore.
- We used the Contriever code base to continue pre-training contriever (base), using the training script.
- We trained our reranker model using FlagEmbedding.
- We formatted our reranker training data following the original repo's instructions (a sketch of the expected format follows the command below), and then trained the BGE reranker with torchrun:
torchrun --nproc_per_node {number of gpus} \
-m FlagEmbedding.reranker.run \
--output_dir {path to save model} \
--model_name_or_path BAAI/bge-reranker-base \
--train_data PATH_TO_TRAIN_DATA \
--learning_rate 6e-5 \
--fp16 \
--num_train_epochs 5 \
--per_device_train_batch_size {batch size; set 1 for toy data} \
--gradient_accumulation_steps 4 \
--dataloader_drop_last True \
--train_group_size 16 \
--max_len 512 \
--weight_decay 0.01 \
--logging_steps 10
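As noted above, the FlagEmbedding reranker expects `--train_data` to be a JSONL file with one query and its positive and negative passages per line. The sketch below writes a single toy record in that format; the texts are placeholders, and the exact format should be confirmed against the FlagEmbedding repository:

```python
import json

# Toy training record in the FlagEmbedding reranker format: one JSON object per line,
# each with a query, positive passages ("pos"), and negative passages ("neg").
record = {
    "query": "What retrieval corpus does OpenScholar use?",
    "pos": ["OpenScholar retrieves from a datastore built over the peS2o corpus ..."],
    "neg": [
        "An unrelated passage about protein folding ...",
        "Another passage that does not answer the query ...",
    ],
}

with open("PATH_TO_TRAIN_DATA", "w") as f:
    f.write(json.dumps(record) + "\n")
```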
- We trained OpenScholar-8B using our OpenScholar/OS_Train_Data data, which consists of 13k instruction-tuning examples. We used our modified version of torchtune to train the 8B model on 8×A100 GPUs.
See more detailed instructions for setting up training in training/.
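If you want to inspect the released training set before launching torchtune, it can be loaded from the Hugging Face Hub. This is a minimal sketch; the split name and the printed fields are assumptions:

```python
from datasets import load_dataset

# Load the released instruction-tuning data referenced above; the "train" split is assumed.
ds = load_dataset("OpenScholar/OS_Train_Data", split="train")
print(len(ds))        # roughly 13k instruction-tuning examples
print(ds[0].keys())   # inspect the field names before wiring up torchtune configs
```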
Both our peS2o v2 and v3 datastores (chunked text + index) are available:
See the instructions under retriever/ to run the peS2o index locally. Note that due to the massive scale of the index (200M+ embeddings built from 45 million papers), the peS2o retriever requires a large amount of CPU memory. In our main experiments, we retrieved the initial passages offline.
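Because the embedding model is a continued-pretrained Contriever (see the retriever training notes above), queries and passages are encoded with Contriever-style mean pooling over token embeddings. The sketch below uses the public `facebook/contriever` checkpoint as a stand-in; substitute the OpenScholar retriever checkpoint you actually use:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Contriever-style encoding: mean-pool token embeddings over non-padding positions.
# "facebook/contriever" is the public base checkpoint; swap in the OpenScholar retriever.
name = "facebook/contriever"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

q = embed(["How large is the peS2o datastore used by OpenScholar?"])
p = embed(["The OpenScholar datastore contains passages chunked from 45 million papers."])
print((q * p).sum(dim=-1))  # inner-product relevance score
```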
We plan to publicly release the efficient sparse-dense retriever API endpoint used for the OpenScholar demo via the Semantic Scholar API, to accelerate research on LLMs for scientific literature synthesis. Stay tuned!
If you have any questions, please contact [email protected]. Note that I am currently applying for academic jobs so I may be slow to respond.
If you have any questions related to the demo, please file your request via the Google form.
@article{openscholar,
title={{OpenScholar}: Synthesizing Scientific Literature with Retrieval-Augmented Language Models},
author={Asai, Akari and He*, Jacqueline and Shao*, Rulin and Shi, Weijia and Singh, Amanpreet and Chang, Joseph Chee and Lo, Kyle and Soldaini, Luca and Feldman, Sergey and D'arcy, Mike and Wadden, David and Latzke, Matt and Tian, Minyang and Ji, Pan and Liu, Shengyan and Tong, Hao and Wu, Bohao and Xiong, Yanyu and Zettlemoyer, Luke and Weld, Dan and Neubig, Graham and Downey, Doug and Yih, Wen-tau and Koh, Pang Wei and Hajishirzi, Hannaneh},
journal={arXiv preprint},
year={2024},
}
