Install the latest release from PyPI:

```bash
pip install HumbleBench
```

The following snippet demonstrates a minimal example to evaluate your model on HumbleBench.
```python
from HumbleBench import download_dataset, evaluate
from HumbleBench.utils.entity import DataLoader

# Download the HumbleBench dataset
dataset = download_dataset()

# Prepare the data loader (batch_size=16, no noise images)
data = DataLoader(dataset=dataset,
                  batch_size=16,
                  use_noise_image=False,  # For HumbleBench-GN, set this to True
                  nota_only=False)        # For HumbleBench-E, set this to True

# Run inference
results = []
for batch in data:
    # Replace the next line with your model's inference method
    predictions = your_model.infer(batch)
    # Expect `predictions` to be a list of dicts matching the batch keys,
    # each with an added 'prediction' key, e.g.
    # {'question_id': ..., 'question': ..., 'label': 'A', ..., 'prediction': 'A'}
    results.extend(predictions)

# Compute evaluation metrics
metrics = evaluate(
    input_data=results,
    model_name_or_path='YourModel',
    use_noise_image=False,  # For HumbleBench-GN, set this to True
    nota_only=False         # For HumbleBench-E, set this to True
)
print(metrics)
```
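If you want to keep the raw predictions for later re-scoring, you can write `results` to a JSONL file yourself. Below is a minimal sketch using only the standard library; the output path is just an illustration that mirrors the folder convention of the provided result files.

```python
# Optional: persist raw predictions as JSONL (one JSON object per line).
# The output path below is illustrative; any location works.
import json
import os

out_path = 'results/common/YourModel/YourModel.jsonl'
os.makedirs(os.path.dirname(out_path), exist_ok=True)
with open(out_path, 'w', encoding='utf-8') as f:
    for record in results:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')
```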
If you prefer to reproduce the published results, load one of our provided JSONL files (at `results/common`, `results/noise_image`, or `results/nota_only`):

```python
from HumbleBench.utils.io import load_jsonl
from HumbleBench import evaluate

path = 'results/common/Model_Name/Model_Name.jsonl'
data = load_jsonl(path)

metrics = evaluate(
    input_data=data,
    model_name_or_path='Model_Name',
    use_noise_image=False,  # For HumbleBench-GN, set this to True
    nota_only=False,        # For HumbleBench-E, set this to True
)
print(metrics)
```
HumbleBench provides a unified CLI for seamless integration with any implementation of our model interface. To get started, clone the repository:

```bash
git clone [email protected]:maifoundations/HumbleBench.git
cd HumbleBench
```

Create a subclass of `MultiModalModelInterface` and define the `infer` method:
```python
# my_model.py
from typing import Dict, List

from HumbleBench.models.base import register_model, MultiModalModelInterface


@register_model("YourModel")
class YourModel(MultiModalModelInterface):
    def __init__(self, model_name_or_path, **kwargs):
        super().__init__(model_name_or_path, **kwargs)
        # Load your model and processor here
        # Example:
        # self.model = ...
        # self.processor = ...

    def infer(self, batch: List[Dict]) -> List[Dict]:
        """
        Args:
            batch: List of dicts with keys:
                - label: one of 'A', 'B', 'C', 'D', 'E'
                - question: str
                - type: 'Object'/'Attribute'/'Relation'/...
                - file_name: path to the image file
                - question_id: unique identifier
        Returns:
            List of dicts with an added 'prediction' key (str).
        """
        # Your inference code here
        return predictions
```
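For illustration only, here is a minimal dummy implementation that satisfies this contract: it copies every incoming batch item and attaches a fixed `'prediction'`. It is not shipped with HumbleBench, and the class name `DummyModel` is purely hypothetical; a real model would read `file_name` and `question` and produce its own answer letter.

```python
# my_dummy_model.py (illustrative sketch only, not part of HumbleBench)
from typing import Dict, List

from HumbleBench.models.base import register_model, MultiModalModelInterface


@register_model("DummyModel")  # hypothetical name used just for this sketch
class DummyModel(MultiModalModelInterface):
    def __init__(self, model_name_or_path, **kwargs):
        super().__init__(model_name_or_path, **kwargs)
        # A real implementation would load its model and processor here.

    def infer(self, batch: List[Dict]) -> List[Dict]:
        predictions = []
        for item in batch:
            out = dict(item)  # keep question_id, question, label, type, file_name
            # A real model would look at item['file_name'] and item['question']
            # and answer with one of 'A'-'E'; this placeholder always returns 'A'.
            out['prediction'] = 'A'
            predictions.append(out)
        return predictions
```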
Edit `configs/models.yaml` to register your model and specify its weights:

```yaml
models:
  YourModel:
    params:
      model_name_or_path: "/path/to/your/checkpoint"
```

Then launch the evaluation:

```bash
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python main.py \
--model "YourModel" \
--config configs/models.yaml \
--batch_size 16 \
--log_dir results/common \
[--use-noise] \
[--nota-only]
```

- `--model`: Name registered via `@register_model`
- `--config`: Path to your `models.yaml`
- `--batch_size`: Inference batch size
- `--log_dir`: Directory to save logs and results
- `--use-noise`: Optional flag to assess HumbleBench-GN
- `--nota-only`: Optional flag to assess HumbleBench-E

🙇🏾🙇🏾🙇🏾

We have implemented many popular models in the `models` directory, along with corresponding shell scripts (including support for noise-image experiments) in the `shell` directory. If you'd like to add your own model to HumbleBench, feel free to open a Pull Request, and we'll review and merge it as soon as possible.
Please cite HumbleBench in your work:
```bibtex
@article{tong2025measuringepistemichumilitymultimodal,
  title={Measuring Epistemic Humility in Multimodal Large Language Models},
  author={Bingkui Tong and Jiaer Xia and Sifeng Shang and Kaiyang Zhou},
  journal={arXiv preprint arXiv:2509.09658},
  year={2025}
}
```

For bug reports or feature requests, please open an issue or email us at [email protected].