Install the latest release from PyPI:

```bash
pip install HumbleBench
```

The following snippet demonstrates a minimal example to evaluate your model on HumbleBench.
```python
from HumbleBench import download_dataset, evaluate
from HumbleBench.utils.entity import DataLoader

# Download the HumbleBench dataset
dataset = download_dataset()

# Prepare the data loader (batch_size=16, no noise images)
data = DataLoader(dataset=dataset,
                  batch_size=16,
                  use_noise_image=False,  # For HumbleBench-GN, set this to True
                  nota_only=False)        # For HumbleBench-E, set this to True

# Run inference
results = []
for batch in data:
    # Replace the next line with your model's inference method
    predictions = your_model.infer(batch)
    # Expect `predictions` to be a list of dicts matching the batch keys,
    # each with an added 'prediction' key, e.g.
    # {'question_id': ..., 'question': ..., 'label': 'A', ..., 'prediction': 'A'}
    results.extend(predictions)

# Compute evaluation metrics
metrics = evaluate(
    input_data=results,
    model_name_or_path='YourModel',
    use_noise_image=False,  # For HumbleBench-GN, set this to True
    nota_only=False         # For HumbleBench-E, set this to True
)
print(metrics)
```
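If you want to keep the raw predictions for later re-scoring, you can write `results` to a JSONL file yourself. Below is a minimal sketch using only the standard library; the output path is just an illustration that mirrors the folder convention of the provided result files.

```python
# Optional: persist raw predictions as JSONL (one JSON object per line).
# The output path below is illustrative; any location works.
import json
import os

out_path = 'results/common/YourModel/YourModel.jsonl'
os.makedirs(os.path.dirname(out_path), exist_ok=True)
with open(out_path, 'w', encoding='utf-8') as f:
    for record in results:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')
```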
If you prefer to reproduce the published results, load one of our provided JSONL files (at `results/common`, `results/noise_image`, or `results/nota_only`):

```python
from HumbleBench.utils.io import load_jsonl
from HumbleBench import evaluate

path = 'results/common/Model_Name/Model_Name.jsonl'
data = load_jsonl(path)

metrics = evaluate(
    input_data=data,
    model_name_or_path='Model_Name',
    use_noise_image=False,  # For HumbleBench-GN, set this to True
    nota_only=False,        # For HumbleBench-E, set this to True
)
print(metrics)
```
HumbleBench provides a unified CLI for seamless integration with any implementation of our model interface. To get started, clone the repository:

```bash
git clone [email protected]:maifoundations/HumbleBench.git
cd HumbleBench
```

Create a subclass of `MultiModalModelInterface` and define the `infer` method:
```python
# my_model.py
from typing import Dict, List

from HumbleBench.models.base import register_model, MultiModalModelInterface


@register_model("YourModel")
class YourModel(MultiModalModelInterface):
    def __init__(self, model_name_or_path, **kwargs):
        super().__init__(model_name_or_path, **kwargs)
        # Load your model and processor here
        # Example:
        # self.model = ...
        # self.processor = ...

    def infer(self, batch: List[Dict]) -> List[Dict]:
        """
        Args:
            batch: List of dicts with keys:
                - label: one of 'A', 'B', 'C', 'D', 'E'
                - question: str
                - type: 'Object'/'Attribute'/'Relation'/...
                - file_name: path to the image file
                - question_id: unique identifier
        Returns:
            List of dicts with an added 'prediction' key (str).
        """
        # Your inference code here
        return predictions
```
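For illustration only, here is a minimal dummy implementation that satisfies this contract: it copies every incoming batch item and attaches a fixed `'prediction'`. It is not shipped with HumbleBench, and the class name `DummyModel` is purely hypothetical; a real model would read `file_name` and `question` and produce its own answer letter.

```python
# my_dummy_model.py (illustrative sketch only, not part of HumbleBench)
from typing import Dict, List

from HumbleBench.models.base import register_model, MultiModalModelInterface


@register_model("DummyModel")  # hypothetical name used just for this sketch
class DummyModel(MultiModalModelInterface):
    def __init__(self, model_name_or_path, **kwargs):
        super().__init__(model_name_or_path, **kwargs)
        # A real implementation would load its model and processor here.

    def infer(self, batch: List[Dict]) -> List[Dict]:
        predictions = []
        for item in batch:
            out = dict(item)  # keep question_id, question, label, type, file_name
            # A real model would look at item['file_name'] and item['question']
            # and answer with one of 'A'-'E'; this placeholder always returns 'A'.
            out['prediction'] = 'A'
            predictions.append(out)
        return predictions
```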
Edit `configs/models.yaml` to register your model and specify its weights:

```yaml
models:
  YourModel:
    params:
      model_name_or_path: "/path/to/your/checkpoint"
```

Then launch the evaluation:

```bash
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python main.py \
--model "YourModel" \
--config configs/models.yaml \
--batch_size 16 \
--log_dir results/common \
[--use-noise] \
[--nota-only]
```

- `--model`: Name registered via `@register_model`
- `--config`: Path to your `models.yaml`
- `--batch_size`: Inference batch size
- `--log_dir`: Directory to save logs and results
- `--use-noise`: Optional flag to assess HumbleBench-GN
- `--nota-only`: Optional flag to assess HumbleBench-E

🙇🏾🙇🏾🙇🏾

We have implemented many popular models in the `models` directory, along with corresponding shell scripts (including support for noise-image experiments) in the `shell` directory. If you'd like to add your own model to HumbleBench, feel free to open a Pull Request, and we'll review and merge it as soon as possible.
Please cite HumbleBench in your work:
```bibtex
@article{tong2025measuringepistemichumilitymultimodal,
  title={Measuring Epistemic Humility in Multimodal Large Language Models},
  author={Bingkui Tong and Jiaer Xia and Sifeng Shang and Kaiyang Zhou},
  journal={arXiv preprint arXiv:2509.09658},
  year={2025}
}
```

For bug reports or feature requests, please open an issue or email us at [email protected].