[ACL 25] MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models


This repository is the official codebase of our paper "MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models".

The proposed MiniLongBench is a low-cost benchmark for evaluating the Long Context Understanding (LCU) capabilities of LLMs, featuring a compact yet diverse test set of only 237 samples spanning 6 major task categories and 21 distinct tasks.

Empirical analysis of over 60 LLMs shows that MiniLongBench reduces the average evaluation cost to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results.

🎉 News

2025-07 - We won the Outstanding Paper Award at ACL 2025🎉🎉🎉🎉🎉!

2025-05 - We released the MiniLongBench dataset on [Baidu Drive] [Google Drive] [Hugging Face]. 👈🎉 Please try it!

2025-05 - Our paper "MiniLongBench" has been accepted to the ACL 2025 main track! [Paper] 👈🎉 Please read it!

⚙️ Environment Setup

Create a Python virtual environment and install all the packages listed in requirements.txt:

conda create -n MiniLongBench python=3.11
conda activate MiniLongBench
pip install -r requirements.txt

To reproduce the construction of MiniLongBench, please install an adapted version of [py-irt]:

pip install poetry
git clone https://github.com/linggm3/py-irt.git
cd py-irt
poetry install

🧪 Testing on MiniLongBench

Obtain LLM outputs on MiniLongBench

  1. Download MiniLongBench from [Baidu Drive], [Google Drive], or [Hugging Face]; a programmatic loading sketch follows this list.

  2. Obtain LLM responses with [OpenCompass]:

    • Evaluate the LLM on all 237 test samples in MiniLongBench.
    • Generate outputs following the format of pred_data/example.
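
If you prefer to load the data programmatically, here is a minimal sketch using the Hugging Face datasets library. The dataset identifier MilkThink-Lab/MiniLongBench is an assumption based on the repository name; use the ID shown on the linked Hugging Face page.

from datasets import load_dataset  # pip install datasets

# NOTE: the dataset ID below is assumed from the repository name; check the
# linked Hugging Face page for the actual identifier and configuration.
minilongbench = load_dataset("MilkThink-Lab/MiniLongBench")
print(minilongbench)  # expect 237 test samples across 21 tasks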

Calculate scores across all test samples

To generate and store the evaluation scores on 237 test samples:

python minilongbench_scorer.py
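
As a rough illustration of the kind of per-sample metric such a scorer computes, the sketch below implements token-level F1, a metric commonly used for the QA tasks in LongBench; minilongbench_scorer.py may use different, task-specific metrics.

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    # Token-level F1 between a model prediction and a reference answer.
    # Illustrative only; the repository's scorer may differ per task.
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the answer is 42", "42"))  # 0.4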

Calculate scores on MiniLongBench

There are two evaluation methods for MiniLongBench.

  1. Predict the scores of LLMs on the full LongBench benchmark (eval_new_llm_by_pred.ipynb): this notebook shows how to obtain MiniLongBench scores by predicting the scores of LLMs on the full LongBench benchmark.

  2. Directly calculate the scores of LLMs on MiniLongBench (eval_new_llm_directly.ipynb): this notebook shows how to obtain MiniLongBench scores directly.
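
For a quick sanity check of how well MiniLongBench-based scores track full LongBench results (the rank correlation of 0.97 reported above), a minimal sketch using SciPy with hypothetical score lists looks like this:

from scipy.stats import spearmanr  # pip install scipy

# Hypothetical per-LLM average scores; replace with your own results.
minilongbench_scores = [42.1, 38.7, 51.3, 29.8, 45.0]
longbench_scores = [41.5, 39.2, 50.1, 30.4, 44.2]

rho, p_value = spearmanr(minilongbench_scores, longbench_scores)
print(f"Spearman rank correlation: {rho:.2f}")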

🛠️ Reproducing MiniLongBench

Representation Learning

representation_learning.ipynb demonstrates how to load LongBench's evaluation data, perform data preprocessing, and learn representations for both the LLMs and test samples.
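
The notebook builds on the adapted py-irt installed above. As a simplified sketch of the general idea (not the paper's exact model), an item-response-theory style factorization learns an embedding for each LLM and each test sample so that their interaction reconstructs the observed scores:

import torch

# Toy score matrix: rows = LLMs, columns = LongBench test samples, values in [0, 1].
# In the notebook this matrix comes from real LongBench evaluation results.
scores = torch.rand(60, 200)

dim = 8  # embedding dimension (a hypothetical choice)
llm_emb = torch.randn(scores.shape[0], dim, requires_grad=True)
item_emb = torch.randn(scores.shape[1], dim, requires_grad=True)
item_bias = torch.zeros(scores.shape[1], requires_grad=True)

opt = torch.optim.Adam([llm_emb, item_emb, item_bias], lr=0.01)
for step in range(2000):
    opt.zero_grad()
    pred = torch.sigmoid(llm_emb @ item_emb.T + item_bias)  # predicted scores
    loss = torch.nn.functional.mse_loss(pred, scores)
    loss.backward()
    opt.step()

# llm_emb and item_emb now act as learned representations of LLMs and test samples.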

Sample Clustering

sample_clustering.ipynb demonstrates how to cluster the representations of test samples and extract cluster centers as representative test samples.
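
A minimal sketch of that step, assuming the learned sample representations sit in a NumPy array, could use k-means and keep the sample closest to each cluster center; the notebook's actual clustering algorithm and settings may differ.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

# item_reprs: one row per LongBench test sample (from representation learning).
item_reprs = np.random.rand(1000, 8)  # placeholder data

n_representatives = 237  # size of MiniLongBench
kmeans = KMeans(n_clusters=n_representatives, n_init=10, random_state=0).fit(item_reprs)

# For each cluster center, keep the index of the nearest real test sample.
representative_idx = pairwise_distances_argmin(kmeans.cluster_centers_, item_reprs)
print(representative_idx[:10])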

Evaluation

There are two evaluation methods for MiniLongBench.

  1. Predict the scores of LLMs on the full LongBench benchmark (eval_by_pred.ipynb).
  2. Directly calculate the scores of LLMs on MiniLongBench (eval_directly.ipynb).
