AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels

arXiv · Dataset on Hugging Face · License: MIT

Lei Li, Xiangxu Zhang, Xiao Zhou, Zheng Liu

Gaoling School of Artificial Intelligence, Renmin University of China

Beijing Academy of Artificial Intelligence

🔭 Overview

AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels (accepted to Findings of EMNLP 2025)

In this work, we propose Self-Learning Hypothetical Document Embeddings for zero-shot medical information retrieval, eliminating the need for relevance-labeled data.

We also develop a comprehensive Chinese Medical Information Retrieval Benchmark (CMIRB) and evaluate the performance of various text embedding models on it.

Figure: AutoMIR framework overview.
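The core idea of HyDE-style retrieval can be sketched as follows: at query time an LLM first writes a hypothetical answer document, and the retriever embeds that document instead of the raw query. The snippet below is a minimal, self-contained illustration only; the toy hash-based encoder and the stub generator stand in for the paper's fine-tuned embedding model and LLM generator.

```python
# Minimal sketch of HyDE-style zero-shot retrieval.
# Assumptions: embed() and generate_hypothetical_doc() are toy stand-ins
# for a real embedding model and the fine-tuned LLM generator.
import hashlib
import math

def embed(text: str, dim: int = 8) -> list[float]:
    # Toy deterministic "encoder": hash bytes, scaled and L2-normalized.
    h = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 for b in h[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def generate_hypothetical_doc(query: str) -> str:
    # Stand-in for the LLM generator, which would write a plausible
    # answer passage for the query.
    return f"A hypothetical medical passage answering: {query}"

def hyde_retrieve(query: str, corpus: dict[str, str], top_k: int = 2):
    # 1) generate a hypothetical document, 2) embed it instead of the
    # raw query, 3) rank corpus documents by cosine similarity.
    q_vec = embed(generate_hypothetical_doc(query))
    scored = []
    for doc_id, text in corpus.items():
        d_vec = embed(text)
        score = sum(a * b for a, b in zip(q_vec, d_vec))
        scored.append((doc_id, score))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_k]

corpus = {"d1": "Diabetes treatment overview.", "d2": "COVID-19 symptoms."}
print(hyde_retrieve("How is diabetes treated?", corpus))
```

The self-learning part of SL-HyDE then uses the retriever's own rankings as feedback to fine-tune both the generator and the retriever, removing the need for relevance labels.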

βš™οΈ Installation

Note that the code in this repo has been tested only on Linux; other operating systems are untested.

  1. Clone this repository:

    git clone https://github.com/ll0ruc/AutoMIR.git
    cd AutoMIR
  2. Create and activate the conda environment:

    conda create -n automir python=3.10
    conda activate automir
    pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
    pip install beir==2.0.0
    pip install mteb==1.1.1
    pip install deepspeed==0.15.1
    pip install peft==0.12.0
    pip install transformers==4.44.2
    pip install sentence-transformers==3.1.1
    pip install datasets==2.21.0
    pip install vllm==0.5.4

💾 Datasets Preparation

CMIRB Description:

CMIRB (Chinese Medical Information Retrieval Benchmark) is a specialized multi-task dataset designed specifically for medical information retrieval.

It consists of data collected from various medical websites, encompassing 5 tasks and 10 datasets, and reflects practical application scenarios.

Figure: CMIRB benchmark overview.

Datasets

The data preprocessing process can be seen in data_collection_and_processing.

An overview of the datasets available in CMIRB is provided in the following table:

| Name | Hub URL | Description | #Queries | #Documents |
| --- | --- | --- | --- | --- |
| MedExamRetrieval | CMIRB/MedExamRetrieval | Medical multi-choice exam | 697 | 27,871 |
| DuBaikeRetrieval | CMIRB/DuBaikeRetrieval | Medical search queries from Baidu Search | 318 | 56,441 |
| DXYDiseaseRetrieval | CMIRB/DXYDiseaseRetrieval | Disease questions from a medical website | 1,255 | 54,021 |
| MedicalRetrieval | CMIRB/MedicalRetrieval | Passage retrieval dataset collected from Alibaba medical-domain search engine systems | 1,000 | 100,999 |
| CmedqaRetrieval | CMIRB/CmedqaRetrieval | Online medical consultation text | 3,999 | 100,001 |
| DXYConsultRetrieval | CMIRB/DXYConsultRetrieval | Online medical consultation text | 943 | 12,577 |
| CovidRetrieval | CMIRB/CovidRetrieval | COVID-19 news articles | 949 | 100,001 |
| IIYiPostRetrieval | CMIRB/IIYiPostRetrieval | Medical post articles | 789 | 27,570 |
| CSLCiteRetrieval | CMIRB/CSLCiteRetrieval | Medical literature citation prediction | 573 | 36,703 |
| CSLRelatedRetrieval | CMIRB/CSLRelatedRetrieval | Similar medical literature | 439 | 36,758 |

Download the CMIRB dataset:

  • CMIRB: HF Datasets

    Place all zip files under ./AutoMIR/dataset and extract them.

Data Structure:

For each dataset, the data is expected in the following structure:

${DATASET_ROOT}        # Dataset root directory, e.g., ./dataset/MedExamRetrieval
├── query.jsonl        # Query file
├── corpus.jsonl       # Document file
└── qrels.txt          # Relevance labels file
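Under this layout, a dataset can be loaded with a few lines of Python. The JSON field names (`_id`, `text`) and the tab-separated qrels column order (query_id, doc_id, relevance) used below are assumptions; adjust them to match the actual files.

```python
# Hedged sketch of loading one CMIRB dataset in the layout above.
# Assumed formats: query.jsonl/corpus.jsonl hold one JSON object per
# line; qrels.txt holds "query_id<TAB>doc_id<TAB>relevance" per line.
import json

def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def load_qrels(path):
    """Map query_id -> {doc_id: relevance} from a tab-separated file."""
    qrels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            qid, did, rel = line.strip().split("\t")[:3]
            qrels.setdefault(qid, {})[did] = int(rel)
    return qrels
```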

🤖 Training

Download the medical corpus from huatuo_encyclopedia_qa.

1.0 Generate queries from the corpus

python gen_Data/gen_Query_data.py --corpus_path "./train_data/corpus.jsonl" --llm_name Qwen-32b

You will get the query.jsonl file in the train_data folder, which contains the generated queries for each document in the corpus.

1.1 Generate training data for LLM

python gen_Data/gen_LLM_data.py --query_path "./train_data/query.jsonl" --llm_name_gen_llm qwen

You will get the llm_train_data.jsonl file in the train_data/qwen folder, which contains the generated training data for LLM.

1.2 Fine-tuning LLM as Generator

bash train_llm.sh

You will get the fine-tuned LLM model in the outputs/qwen folder, which serves as the generator for producing rewritten queries.

2.1 Generate training data for Retriever

python gen_Data/gen_EMB_data.py --llm_name qwen

You will get the emb_train_data.jsonl file in the train_data/qwen folder, which contains the generated training data for the retriever.

2.2 Fine-tuning Retriever

bash train_emb.sh

You will get the fine-tuned retriever model in the outputs/qwen folder, which can be used for retrieving relevant documents based on the generated queries.

💽 Evaluation

We evaluate 10+ representative retrieval models of diverse sizes and architectures. Run the following command to get results:

cd ./src
python evaluate.py --retrieval_name bge-FT --llm_name qwen
* `--retrieval_name`: the retrieval model to evaluate.
* `--llm_name`: the LLM generator used for hypothetical document generation.
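The evaluation script scores how well each model ranks relevant documents. As an illustration of the kind of metric typically reported for such retrieval benchmarks, here is a standard NDCG@k computation; whether evaluate.py reports NDCG specifically is an assumption, so check the script for its exact metrics.

```python
# Standard NDCG@k over one query's ranking (illustrative; the repo's
# evaluate.py may compute its metrics differently).
import math

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    # DCG of the produced ranking: gain (2^rel - 1) discounted by rank.
    dcg = 0.0
    for i, doc_id in enumerate(ranked_doc_ids[:k]):
        rel = qrels.get(doc_id, 0)
        dcg += (2 ** rel - 1) / math.log2(i + 2)
    # Ideal DCG: the same gains in the best possible order.
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Perfect ranking scores 1.0; the relevant doc at rank 2 scores less.
print(ndcg_at_k(["d1", "d2"], {"d1": 1}))
print(ndcg_at_k(["d2", "d1"], {"d1": 1}))
```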

πŸ† Leaderboard

Information Retrieval

| Model | Dim. | Avg. | MedExam | DuBaike | DXYDisease | Medical | Cmedqa | DXYConsult | Covid | IIYiPost | CSLCite | CSLRel |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| text2vec-large-zh | 1024 | 30.56 | 41.39 | 21.13 | 41.52 | 30.93 | 15.53 | 21.92 | 60.48 | 29.47 | 20.21 | 23.01 |
| mcontriever (msmarco) | 768 | 35.20 | 51.5 | 22.25 | 44.34 | 38.5 | 22.71 | 20.04 | 56.01 | 28.11 | 34.59 | 33.95 |
| bm25 | - | 35.35 | 31.95 | 17.89 | 40.12 | 29.33 | 6.83 | 17.78 | 78.9 | 66.95 | 33.74 | 29.97 |
| text-embedding-ada-002 | - | 42.55 | 53.48 | 43.12 | 58.72 | 37.92 | 22.36 | 27.69 | 57.21 | 48.6 | 32.97 | 43.4 |
| m3e-large | 768 | 45.25 | 33.29 | 46.48 | 62.57 | 48.66 | 30.73 | 41.05 | 61.33 | 45.03 | 35.79 | 47.54 |
| multilingual-e5-large | 1024 | 52.08 | 53.96 | 53.27 | 72.1 | 51.47 | 28.67 | 41.35 | 75.54 | 63.86 | 42.65 | 37.94 |
| piccolo-large-zh | 1024 | 54.75 | 43.11 | 45.91 | 70.69 | 59.04 | 41.99 | 47.35 | 85.04 | 65.89 | 44.31 | 44.21 |
| gte-large-zh | 1024 | 55.40 | 41.22 | 42.66 | 70.59 | 62.88 | 43.15 | 46.3 | 88.41 | 63.02 | 46.4 | 49.32 |
| bge-large-zh-v1.5 | 1024 | 55.40 | 58.61 | 44.26 | 71.71 | 59.6 | 42.57 | 47.73 | 73.33 | 67.13 | 43.27 | 45.79 |
| peg | 1024 | 57.46 | 52.78 | 51.68 | 77.38 | 60.96 | 44.42 | 49.3 | 82.56 | 70.38 | 44.74 | 40.38 |
| HyDE (qwen+bge) | 1024 | 56.62 | 64.39 | 52.73 | 73.98 | 57.27 | 38.52 | 47.11 | 74.32 | 73.07 | 46.16 | 38.68 |
| SL-HyDE (qwen+bge) | 1024 | 59.38 | 71.49 | 60.96 | 75.34 | 58.58 | 39.07 | 50.13 | 76.95 | 73.81 | 46.78 | 40.71 |

📜 Reference

If this code or dataset contributes to your research, please consider citing our paper and giving this repo a ⭐️ :)

@inproceedings{li-etal-2025-automir,
    title = "{A}uto{MIR}: Effective Zero-Shot Medical Information Retrieval without Relevance Labels",
    author = "Li, Lei  and Zhang, Xiangxu  and Zhou, Xiao  and Liu, Zheng",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.1305/",
    doi = "10.18653/v1/2025.findings-emnlp.1305",
    pages = "24028--24047",
    ISBN = "979-8-89176-335-7"
}
