[ACL 25] MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models


This repository is the official codebase of our paper "MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models".

The proposed MiniLongBench is a low-cost benchmark for evaluating the Long Context Understanding (LCU) capabilities of LLMs, featuring a compact yet diverse test set of only 237 samples spanning 6 major task categories and 21 distinct tasks.

Empirical analysis of over 60 LLMs shows that MiniLongBench reduces the average evaluation cost to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results.

🎉 News

2025-07 - We won the Outstanding Paper Award at ACL 2025🎉🎉🎉🎉🎉!

2025-05 - We released the MiniLongBench dataset on [Baidu Drive] [Google Drive] [Hugging Face]. 👈🎉 Please try it!

2025-05 - Our paper "MiniLongBench" has been accepted to the ACL 2025 main track! [Paper] 👈🎉 Please read it!

⚙️ Environment Setup

Create a Python virtual environment and install all the packages listed in requirements.txt:

conda create -n MiniLongBench python=3.11
conda activate MiniLongBench
pip install -r requirements.txt

To reproduce the construction of MiniLongBench, please install an adapted version of [py-irt]:

pip install poetry
git clone https://github.com/linggm3/py-irt.git
cd py-irt
poetry install

🧪 Testing on MiniLongBench

Obtain LLM outputs on MiniLongBench

  1. Download MiniLongBench from [Baidu Drive], [Google Drive], or [Hugging Face]; a programmatic loading sketch follows this list.

  2. Obtain LLM responses with [OpenCompass]:

    • Evaluate the LLM on all 237 test samples in MiniLongBench.
    • Generate outputs following the format of pred_data/example.
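
If you prefer to load the data programmatically, here is a minimal sketch using the Hugging Face datasets library. The dataset identifier MilkThink-Lab/MiniLongBench is an assumption based on the repository name; use the ID shown on the linked Hugging Face page.

from datasets import load_dataset  # pip install datasets

# NOTE: the dataset ID below is assumed from the repository name; check the
# linked Hugging Face page for the actual identifier and configuration.
minilongbench = load_dataset("MilkThink-Lab/MiniLongBench")
print(minilongbench)  # expect 237 test samples across 21 tasks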

Calculate scores across all test samples

To generate and store the evaluation scores on 237 test samples:

python minilongbench_scorer.py
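
As a rough illustration of the kind of per-sample metric such a scorer computes, the sketch below implements token-level F1, a metric commonly used for the QA tasks in LongBench; minilongbench_scorer.py may use different, task-specific metrics.

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    # Token-level F1 between a model prediction and a reference answer.
    # Illustrative only; the repository's scorer may differ per task.
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the answer is 42", "42"))  # 0.4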

Calculate scores on MiniLongBench

There are two evaluation methods for MiniLongBench.

  1. Predict the scores of LLMs on the full LongBench benchmark (eval_new_llm_by_pred.ipynb): this notebook shows how to obtain MiniLongBench scores by predicting the scores of LLMs on the full LongBench benchmark.

  2. Directly calculate the scores of LLMs on MiniLongBench (eval_new_llm_directly.ipynb): this notebook shows how to obtain MiniLongBench scores directly.
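
For a quick sanity check of how well MiniLongBench-based scores track full LongBench results (the rank correlation of 0.97 reported above), a minimal sketch using SciPy with hypothetical score lists looks like this:

from scipy.stats import spearmanr  # pip install scipy

# Hypothetical per-LLM average scores; replace with your own results.
minilongbench_scores = [42.1, 38.7, 51.3, 29.8, 45.0]
longbench_scores = [41.5, 39.2, 50.1, 30.4, 44.2]

rho, p_value = spearmanr(minilongbench_scores, longbench_scores)
print(f"Spearman rank correlation: {rho:.2f}")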

🛠️ Reproducing MiniLongBench

Representation Learning

representation_learning.ipynb demonstrates how to load LongBench's evaluation data, perform data preprocessing, and learn representations for both the LLMs and test samples.
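
The notebook builds on the adapted py-irt installed above. As a simplified sketch of the general idea (not the paper's exact model), an item-response-theory style factorization learns an embedding for each LLM and each test sample so that their interaction reconstructs the observed scores:

import torch

# Toy score matrix: rows = LLMs, columns = LongBench test samples, values in [0, 1].
# In the notebook this matrix comes from real LongBench evaluation results.
scores = torch.rand(60, 200)

dim = 8  # embedding dimension (a hypothetical choice)
llm_emb = torch.randn(scores.shape[0], dim, requires_grad=True)
item_emb = torch.randn(scores.shape[1], dim, requires_grad=True)
item_bias = torch.zeros(scores.shape[1], requires_grad=True)

opt = torch.optim.Adam([llm_emb, item_emb, item_bias], lr=0.01)
for step in range(2000):
    opt.zero_grad()
    pred = torch.sigmoid(llm_emb @ item_emb.T + item_bias)  # predicted scores
    loss = torch.nn.functional.mse_loss(pred, scores)
    loss.backward()
    opt.step()

# llm_emb and item_emb now act as learned representations of LLMs and test samples.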

Sample Clustering

sample_clustering.ipynb demonstrates how to cluster the representations of test samples and extract cluster centers as representative test samples.
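
A minimal sketch of that step, assuming the learned sample representations sit in a NumPy array, could use k-means and keep the sample closest to each cluster center; the notebook's actual clustering algorithm and settings may differ.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

# item_reprs: one row per LongBench test sample (from representation learning).
item_reprs = np.random.rand(1000, 8)  # placeholder data

n_representatives = 237  # size of MiniLongBench
kmeans = KMeans(n_clusters=n_representatives, n_init=10, random_state=0).fit(item_reprs)

# For each cluster center, keep the index of the nearest real test sample.
representative_idx = pairwise_distances_argmin(kmeans.cluster_centers_, item_reprs)
print(representative_idx[:10])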

Evaluation

There are two evaluation methods for MiniLongBench.

  1. Predict the scores of LLMs on the full LongBench benchmark (eval_by_pred.ipynb).
  2. Directly calculate the scores of LLMs on MiniLongBench (eval_directly.ipynb).
