This repository contains the codebase for OpenScholar.
Blog | Demo | Paper | Model checkpoints and data | ScholarQABench | Expert Evaluation | Public demo code | Public demo data
- Overview of OpenScholar
- Repository Organizations
- Installation
- Run OpenScholar
- Train OpenScholar-8B
- Run Retriever
- Contact and Citation
Scientific progress hinges on our ability to find, synthesize, and build on relevant knowledge from the scientific literature. However, the exponential growth of this literature—with millions of papers now published each year—has made it increasingly difficult for scientists to find the information they need or even stay abreast of the latest findings in a single subfield.
To help scientists effectively navigate and synthesize scientific literature, we introduce OpenScholar, a retrieval-augmented language model (LM) designed to answer user queries by first searching for relevant papers in the literature and then generating responses grounded in those sources. Try open-scholar.allen.ai/ and check our paper for more details.
This repository contains the code to run OpenScholar inference.
- `src/`: Main source code for OpenScholar.
- `training/`: Training code to train Llama 3.1 8B using our processed data. We modified an earlier version of `torchtune` for training.
- `retriever/`: Code to run retrieval offline and to host retrieval servers for online retrieval.
For automatic and human evaluations, please check the following repositories.
- To run evaluations on ScholarQABench, please check the ScholarQABench repository.
- For our human evaluation interfaces as well as the results, please check the OpenScholar_ExpertEval repository.
To run OpenScholar inference, please ensure that all necessary libraries are installed.
Set up the environment:
conda create -n os_env python=3.10.0
conda activate os_env
pip install -r requirements.txt
python -m spacy download en_core_web_sm
Also, please set the following API keys:
export S2_API_KEY=YOUR_S2_API_KEY
See instructions to acquire API keys at the Semantic Scholar API page.
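To confirm the key is picked up before running the pipeline, you can issue a small request against the public Semantic Scholar Graph API. This is only a minimal sanity-check sketch; the query, fields, and limit below are illustrative and not part of this repository:

```python
import os
import requests

# Minimal sanity check for the Semantic Scholar API key.
# The query, fields, and limit are arbitrary illustrative values.
resp = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    headers={"x-api-key": os.environ["S2_API_KEY"]},
    params={"query": "retrieval-augmented language models", "fields": "title,year", "limit": 3},
)
resp.raise_for_status()
for paper in resp.json().get("data", []):
    print(paper.get("year"), paper.get("title"))
```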
If you also want to use a web search engine, sign up for the you.com web API and set the key:
export YOUR_API_KEY=YOUR_YOU_COM_API_KEY
For information related to OpenScholar training and retriever components, refer to the training/ and retrieval/ directories, respectively.
By default, OpenScholar consumes offline retrieval results produced by the scripts in retrieval/, supplemented with additional passages retrieved via the Semantic Scholar Paper API and web search APIs. See the script src/use_search_apis.py to retrieve related passages offline using these external APIs.
We released our retrieval results on Google Drive.
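For reference, each record in the input file is expected to carry its retrieved passages under a `ctxs` field (see the configuration notes below). The sketch here writes a toy input file; apart from `ctxs`, the field names are assumptions and should be checked against `src/` and the ScholarQABench data:

```python
import json

# Illustrative input record: only the "ctxs" field name is documented in this README;
# the other keys ("input", "title", "text") are assumptions to verify against src/.
example = {
    "input": "How do retrieval-augmented LMs reduce hallucination in literature synthesis?",
    "ctxs": [
        {"title": "Some retrieved paper", "text": "Chunked passage text from the datastore ..."},
        # ... more passages, e.g. produced by retrieval/ or src/use_search_apis.py
    ],
}

# Whether run.py expects a JSON list or JSONL is an assumption; a JSON list is shown here.
with open("YOUR_INPUT_FILE", "w") as f:
    json.dump([example], f, indent=2)
```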
- Run a Standard RAG pipeline using the top 10 passages
python run.py \
--input_file YOUR_INPUT_FILE \
--model_name OpenScholar/Llama-3.1_OpenScholar-8B \
--use_contexts \
--output_file OUTPUT_FILE_PATH \
--top_n 10 --llama3 --zero_shot
- Run a Retriever + Reranker pipeline
python run.py \
--input_file YOUR_INPUT_FILE \
--model_name OpenScholar/Llama-3.1_OpenScholar-8B \
--use_contexts \
--ranking_ce \
--reranker OpenScholar/OpenScholar_Reranker \
--output_file OUTPUT_FILE_PATH \
--top_n 10 --llama3 --zero_shot
- Run the Open Retrieval Self-reflective Generation pipeline
python run.py \
--input_file YOUR_INPUT_FILE \
--model_name OpenScholar/Llama-3.1_OpenScholar-8B \
--use_contexts --output_file OUTPUT_FILE_NAME \
--top_n 10 --llama3 \
--ranking_ce --reranker OpenScholar/OpenScholar_Reranker \
--posthoc --feedback --ss_retriever \
--use_abstract --norm_cite --zero_shot --max_per_paper 3
You can also combine the OpenScholar pipeline with proprietary LLMs by specifying model_name, api, and api_key_fp.
python run.py \
--input_file YOUR_INPUT_FILE \
--model_name "gpt-4o" \
--api "openai" \
--api_key_fp PATH_TO_YOUR_OPEN_AI_KEY \
--use_contexts \
--output_file OUTPUT_FILE_PATH \
--top_n 10 --llama3 --zero_shot
Below, we provide details of the configurations.
- `top_n`: The number of passages fed into the underlying LM. By default, we use `10` for multi-paper tasks.
- `feedback`: Set true if you want to use the self-feedback loop during generation.
- `posthoc_at`: Set true if you want to run post-hoc citation attribution.
- `zero_shot`: Set true if you want to run inference in a zero-shot manner.
- `ranking_ce`: Use a reranking model to rerank the `top_n` passages; if not set true, we take the `top_n` passages from the `ctxs` in the provided input file.
- `reranker`: Specify the path to the reranker model (local or HF Hub). If you use our OpenScholar reranker, set `OpenScholar/OpenScholar_Reranker`.
- `min_citation`: The minimum number of citations. If an `int` is given, we exclude papers whose citation count is below `min_citation`. By default, we set it to `None`, and all papers are considered regardless of their citation counts.
- `ss_retriever`: Use the Semantic Scholar API during the feedback generation loop to enhance the feedback results.
- `use_abstract`: Use abstracts to enhance the reranking results.
- `max_per_paper`: The maximum number of passages from the same paper used at inference time.
- `task_name`: Specify the task name when you run the single-paper tasks. For SciFact, PubMedQA, and QASA, the corresponding task names are `claim_full`, `boolean_question_full`, and `single_qa`, respectively.
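To spot-check a finished run, you can load the output file and print a few generations. This is a rough sketch under the assumption that the output is a JSON list whose records keep the generated answer (here called `output`) and the passages used (`ctxs`); check what `run.py` actually writes and adjust the keys accordingly:

```python
import json

# Hypothetical output schema: a list of records with the generated answer and the
# passages fed to the model. The key names here are assumptions, not the documented API.
with open("OUTPUT_FILE_PATH") as f:
    results = json.load(f)

for record in results[:3]:
    print(record.get("output", "")[:300])            # generated, citation-grounded answer
    print("passages used:", len(record.get("ctxs", [])))
```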
- We trained our embedding model using peS2o, which is used to form the OpenScholar Datastore.
- We used the Contriever code base to continue pre-training contriever (base), using the training script.
- We trained our reranker model using FlagEmbedding.
- We formatted our reranker training data following the original repo's instructions (a sketch of the expected format follows the command below), and then trained the BGE reranker with torchrun:
torchrun --nproc_per_node {number of gpus} \
-m FlagEmbedding.reranker.run \
--output_dir {path to save model} \
--model_name_or_path BAAI/bge-reranker-base \
--train_data PATH_TO_TRAIN_DATA \
--learning_rate 6e-5 \
--fp16 \
--num_train_epochs 5 \
--per_device_train_batch_size {batch size; set 1 for toy data} \
--gradient_accumulation_steps 4 \
--dataloader_drop_last True \
--train_group_size 16 \
--max_len 512 \
--weight_decay 0.01 \
--logging_steps 10
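As noted above, the FlagEmbedding reranker expects `--train_data` to be a JSONL file with one query and its positive and negative passages per line. The sketch below writes a single toy record in that format; the texts are placeholders, and the exact format should be confirmed against the FlagEmbedding repository:

```python
import json

# Toy training record in the FlagEmbedding reranker format: one JSON object per line,
# each with a query, positive passages ("pos"), and negative passages ("neg").
record = {
    "query": "What retrieval corpus does OpenScholar use?",
    "pos": ["OpenScholar retrieves from a datastore built over the peS2o corpus ..."],
    "neg": [
        "An unrelated passage about protein folding ...",
        "Another passage that does not answer the query ...",
    ],
}

with open("PATH_TO_TRAIN_DATA", "w") as f:
    f.write(json.dumps(record) + "\n")
```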
- We trained OpenScholar-8B using our OpenScholar/OS_Train_Data data, which consists of 13k instruction-tuning examples. We used our modified version of torchtune to train the 8B model on 8×A100 GPUs.
See more detailed instructions for setting up training in training/.
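If you want to inspect the released training set before launching torchtune, it can be loaded from the Hugging Face Hub. This is a minimal sketch; the split name and the printed fields are assumptions:

```python
from datasets import load_dataset

# Load the released instruction-tuning data referenced above; the "train" split is assumed.
ds = load_dataset("OpenScholar/OS_Train_Data", split="train")
print(len(ds))        # roughly 13k instruction-tuning examples
print(ds[0].keys())   # inspect the field names before wiring up torchtune configs
```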
Both our peS2o v2 and v3 datastores (chunked text + index) are available:
See the instructions under retriever/ to run the peS2o index locally. Note that due to the massive scale of the index (200M+ embeddings built from 45 million papers), the peS2o retriever requires a large amount of CPU memory. In our main experiments, we retrieved the initial passages offline.
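Because the embedding model is a continued-pretrained Contriever (see the retriever training notes above), queries and passages are encoded with Contriever-style mean pooling over token embeddings. The sketch below uses the public `facebook/contriever` checkpoint as a stand-in; substitute the OpenScholar retriever checkpoint you actually use:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Contriever-style encoding: mean-pool token embeddings over non-padding positions.
# "facebook/contriever" is the public base checkpoint; swap in the OpenScholar retriever.
name = "facebook/contriever"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

q = embed(["How large is the peS2o datastore used by OpenScholar?"])
p = embed(["The OpenScholar datastore contains passages chunked from 45 million papers."])
print((q * p).sum(dim=-1))  # inner-product relevance score
```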
We plan to publicly release the efficient sparse-dense retriever API endpoint used for the OpenScholar demo via the Semantic Scholar API, to accelerate research on LLMs for scientific literature synthesis. Stay tuned!
If you have any questions, please contact [email protected]. Note that I am currently applying for academic jobs so I may be slow to respond.
If you have any questions related to the demo, please file your request via the Google form.
@article{openscholar,
title={{OpenScholar}: Synthesizing Scientific Literature with Retrieval-Augmented Language Models},
author={Asai, Akari and He*, Jacqueline and Shao*, Rulin and Shi, Weijia and Singh, Amanpreet and Chang, Joseph Chee and Lo, Kyle and Soldaini, Luca and Feldman, Sergey and D'arcy, Mike and Wadden, David and Latzke, Matt and Tian, Minyang and Ji, Pan and Liu, Shengyan and Tong, Hao and Wu, Bohao and Xiong, Yanyu and Zettlemoyer, Luke and Weld, Dan and Neubig, Graham and Downey, Doug and Yih, Wen-tau and Koh, Pang Wei and Hajishirzi, Hannaneh},
journal={arXiv preprint},
year={2024},
}
