This repository is the official codebase of our paper "MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models".
The proposed MiniLongBench is a low-cost benchmark for evaluating the Long Context Understanding (LCU) capabilities of LLMs, featuring a compact yet diverse test set of only 237 samples spanning 6 major task categories and 21 distinct tasks.
Through empirical analysis of over 60 LLMs, MiniLongBench reduces the average evaluation cost to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results.
2025-07 - We won the Outstanding Paper Award at ACL 2025🎉🎉🎉🎉🎉!
2025-05 - We released the MiniLongBench dataset in [Baidu Drive] [Google Drive] [Hugging Face]. 👈🎉Please try it!
2025-05 - Our paper "MiniLongBench" has been accepted to ACL'25 main track! [Paper] 👈🎉Please read it!
Create a Python virtual environment and install the packages listed in requirements.txt:
```bash
conda create -n MiniLongBench python=3.11
conda activate MiniLongBench
pip install -r requirements.txt
```

To reproduce the construction of MiniLongBench, please install the adapted version of [py-irt]:
```bash
pip install poetry
git clone https://github.com/linggm3/py-irt.git
cd py-irt
poetry install
```
- Download MiniLongBench [Baidu Drive] [Google Drive] [Hugging Face].
- Obtain LLM responses on [OpenCompass]:
  - Evaluate the LLM across all 237 test samples in MiniLongBench.
  - Generate outputs in the format of `pred_data/example` (see the loader sketch below).
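For reference, here is a minimal loader sketch. It assumes the prediction files follow a LongBench-style JSONL layout (one JSON object per line with fields such as `pred` and `answers`), stored per task under `pred_data/example`; the bundled example files are the authoritative reference for the exact format.

```python
import json
from pathlib import Path

# Minimal sketch (assumption): pred_data/example holds per-task .jsonl files,
# each line a JSON object such as {"pred": "...", "answers": ["..."], "length": 1234}.
# Verify against the bundled example before relying on these field names.
def load_predictions(pred_dir: str) -> dict:
    records = {}
    for path in sorted(Path(pred_dir).glob("*.jsonl")):
        with open(path, encoding="utf-8") as f:
            records[path.stem] = [json.loads(line) for line in f if line.strip()]
    return records

if __name__ == "__main__":
    preds = load_predictions("pred_data/example")
    print({task: len(items) for task, items in preds.items()})
```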
To generate and store the evaluation scores on the 237 test samples:

```bash
python minilongbench_scorer.py
```

There are two evaluation methods for MiniLongBench.
- Predict the scores of LLMs on the full LongBench benchmark (`eval_new_llm_by_pred.ipynb`): this notebook shows how to obtain MiniLongBench scores by predicting the scores of LLMs on the full LongBench benchmark (illustrated in the sketch below).
- Directly calculate the scores of LLMs on MiniLongBench (`eval_new_llm_directly.ipynb`): this notebook shows how to obtain MiniLongBench scores directly.
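To give an intuition for the prediction-based route, here is a rough sketch of one way such a prediction can work: estimate the new LLM's ability from its responses on the 237 anchor samples under a 2PL-style IRT model, then predict its expected score on every full-benchmark item. All arrays and parameter names below are illustrative placeholders, not the notebook's actual code.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit  # logistic sigmoid

# Illustrative placeholders (assumptions): item parameters would come from the
# fitted IRT model, and y from the new LLM's results on the 237 anchor samples.
rng = np.random.default_rng(0)
a_mini, b_mini = rng.uniform(0.5, 2.0, 237), rng.normal(0.0, 1.0, 237)    # anchor items
a_full, b_full = rng.uniform(0.5, 2.0, 1000), rng.normal(0.0, 1.0, 1000)  # full-benchmark items (size illustrative)
y = rng.integers(0, 2, 237).astype(float)                                 # binarized anchor responses

def neg_log_likelihood(theta: float) -> float:
    p = np.clip(expit(a_mini * (theta - b_mini)), 1e-9, 1 - 1e-9)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# 1) Estimate the LLM's ability from its 237 anchor responses ...
theta_hat = minimize_scalar(neg_log_likelihood, bounds=(-4, 4), method="bounded").x
# 2) ... then predict its expected score averaged over the full benchmark's items.
predicted_full_score = expit(a_full * (theta_hat - b_full)).mean()
print(f"estimated ability: {theta_hat:.2f}, predicted full-benchmark score: {predicted_full_score:.3f}")
```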
representation_learning.ipynb demonstrates how to load LongBench's evaluation data, perform data preprocessing, and learn representations for both the LLMs and test samples.
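As a rough illustration of the idea (not the notebook's actual code, which builds on the adapted py-irt fork), learning representations here amounts to factoring the LLM-by-sample score matrix so that every LLM gets an ability vector and every test sample gets an item vector. The shapes, the random stand-in scores, and the loss below are assumptions:

```python
import torch

# Illustrative sketch only: factor an (n_llms x n_samples) score matrix into
# per-LLM ability vectors and per-sample item vectors, IRT-style.
n_llms, n_samples, dim = 60, 500, 8
scores = torch.rand(n_llms, n_samples)                    # stand-in for scores scaled to [0, 1]

ability = torch.randn(n_llms, dim, requires_grad=True)    # LLM representations
item = torch.randn(n_samples, dim, requires_grad=True)    # test-sample representations
difficulty = torch.zeros(n_samples, requires_grad=True)   # per-sample difficulty offset

opt = torch.optim.Adam([ability, item, difficulty], lr=0.05)
for _ in range(500):
    pred = torch.sigmoid(ability @ item.T - difficulty)   # predicted score matrix
    loss = torch.nn.functional.binary_cross_entropy(pred, scores)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final reconstruction loss: {loss.item():.4f}")
```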
sample_clustering.ipynb demonstrates how to cluster the representations of test samples and extract cluster centers as representative test samples.
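A minimal sketch of this clustering step, assuming the learned sample representations are available as a NumPy array (the array below is a random stand-in; the notebook works on the real representations):

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative sketch: cluster test-sample representations and keep, for each
# cluster, the sample closest to its center as the representative test sample.
rng = np.random.default_rng(0)
item_reprs = rng.random((2000, 8))     # random stand-in for the learned representations
n_representatives = 237                # MiniLongBench keeps 237 samples

km = KMeans(n_clusters=n_representatives, n_init=10, random_state=0).fit(item_reprs)
# Distance of every sample to every cluster center, then the nearest sample per cluster.
dists = np.linalg.norm(item_reprs[:, None, :] - km.cluster_centers_[None, :, :], axis=-1)
representative_idx = dists.argmin(axis=0)
print(f"selected {len(set(representative_idx))} representative samples")
```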
There are two evaluation methods for MiniLongBench.
- Predict the scores of LLMs on the full LongBench benchmark (`eval_by_pred.ipynb`).
- Directly calculate the scores of LLMs on MiniLongBench (`eval_directly.ipynb`).
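Either way, the agreement with full LongBench results (the ~0.97 rank correlation reported above) is a Spearman correlation over per-LLM scores. A minimal check looks like this; the score arrays are placeholders, and the notebooks compute the real per-LLM values:

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder arrays standing in for per-LLM average scores on the full
# LongBench and on MiniLongBench (the notebooks produce the real values).
rng = np.random.default_rng(0)
longbench_scores = rng.random(60)
minilongbench_scores = longbench_scores + 0.02 * rng.normal(size=60)

rho, pvalue = spearmanr(longbench_scores, minilongbench_scores)
print(f"Spearman rank correlation: {rho:.3f} (p = {pvalue:.1e})")
```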