This repository provides scripts and instructions to evaluate WINO on LLaDA and MMaDA.
- Installation

We recommend using uv for dependency and virtual environment management.

```shell
pipx install uv  # or pip install uv
cd LLaDA
uv venv --python 3.11 dev
source dev/bin/activate
uv pip install -r requirements.txt
```

- Prepare Model and Datasets
Before running inference or evaluation, please download the following models and datasets from Hugging Face into the specified local directories (e.g., ./LLaDA/models/ and ./LLaDA/data/).
You may use either huggingface-cli or the Python datasets library to complete the download.
| Model Name | Hugging Face Repo | Local Path |
|---|---|---|
| LLaDA-8B-Instruct | GSAI-ML/LLaDA-8B-Instruct | ./LLaDA/models/LLaDA-8B-Instruct/ |
| Dataset Name | Hugging Face Repo | Local Path |
|---|---|---|
| GSM8K | openai/gsm8k | ./LLaDA/data/gsm8k/ |
| MATH-500 | HuggingFaceH4/MATH-500 | ./LLaDA/data/math500/ |
| HumanEval | openai/openai_humaneval | ./LLaDA/data/humaneval/ |
| ai2_arc | allenai/ai2_arc | ./LLaDA/data/ai2_arc/ |
Datasets not listed above are already included in the ./LLaDA/data/ directory.
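If you prefer to script the downloads, the tables above can be turned into huggingface-cli commands. The sketch below only builds and prints the commands (repo IDs and local paths are copied from the tables); it does not execute them, so you can review the commands before running:

```python
# Build huggingface-cli download commands for the models and datasets
# listed in the tables above. The commands are printed, not executed.

MODELS = {
    "GSAI-ML/LLaDA-8B-Instruct": "./LLaDA/models/LLaDA-8B-Instruct/",
}
DATASETS = {
    "openai/gsm8k": "./LLaDA/data/gsm8k/",
    "HuggingFaceH4/MATH-500": "./LLaDA/data/math500/",
    "openai/openai_humaneval": "./LLaDA/data/humaneval/",
    "allenai/ai2_arc": "./LLaDA/data/ai2_arc/",
}

def download_command(repo_id: str, local_dir: str, repo_type: str = "model") -> str:
    """Return a huggingface-cli command that mirrors repo_id into local_dir."""
    cmd = f"huggingface-cli download {repo_id} --local-dir {local_dir}"
    if repo_type != "model":
        # Datasets need an explicit repo type; models are the default.
        cmd += f" --repo-type {repo_type}"
    return cmd

if __name__ == "__main__":
    for repo, path in MODELS.items():
        print(download_command(repo, path))
    for repo, path in DATASETS.items():
        print(download_command(repo, path, repo_type="dataset"))
```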
- Quick Demo

Make sure to set the correct model path in generate.py, then run:

```shell
python generate.py
```

- Evaluation
To evaluate WINO on a benchmark such as GSM8K, first configure the model and data paths in the corresponding config file, then run:

```shell
CUDA_VISIBLE_DEVICES=0 python eval.py --config ./configs/gsm8k.yaml
```

All available config files can be found in the ./LLaDA/configs/ directory.
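To sweep every benchmark instead of one, the same command can be generated for each config file. A minimal sketch, assuming eval.py is invoked from the LLaDA directory (the run_all helper is illustrative, not part of the repo):

```python
import glob
import os
import subprocess

def eval_command(config_path: str) -> list[str]:
    """Argument list for a single eval.py run on one benchmark config."""
    return ["python", "eval.py", "--config", config_path]

def run_all(configs_dir: str = "./configs", gpu: int = 0) -> None:
    """Run eval.py sequentially on every YAML config, pinned to one GPU."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    for cfg in sorted(glob.glob(os.path.join(configs_dir, "*.yaml"))):
        subprocess.run(eval_command(cfg), env=env, check=True)
```

Running each benchmark as a separate subprocess keeps one failing config from aborting GPU state for the rest; `check=True` still stops the sweep on the first error.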
For MMaDA, we evaluate WINO using lmms-eval. To run the evaluation, follow these steps:
- Install MMaDA dependencies

```shell
cd MMaDA
# pipx install uv
uv venv --python 3.11 dev
source dev/bin/activate
uv pip install -r requirements.txt
```

A quick inference demo can be run after this step:

```shell
python generate_demo.py
```

- Install lmms-eval dependencies
```shell
cd lmms_eval
uv pip install -e .
```

- Set necessary environment variables

Some environment variables are required for certain tasks to run.
```shell
export OPENAI_API_KEY="<YOUR_API_KEY>"
export HF_HOME="<Path to HF cache>"
export HF_TOKEN="<YOUR_HF_TOKEN>"
export HF_HUB_ENABLE_HF_TRANSFER="1"
```

Once all dependencies are installed and the variables above are set, you can run the evaluation scripts directly:
```shell
cd ..
# Evaluating MMaDA on the reported six multimodal benchmarks
bash scripts/eval_baseline.sh
# Evaluating WINO on the reported six multimodal benchmarks
bash scripts/eval_wino.sh
```
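A missing environment variable typically surfaces only partway through a run, so a small preflight check before launching the scripts above can fail fast. A sketch, with the variable list mirroring the exports in this section:

```python
import os

# Variables the evaluation expects, per the exports in this README.
REQUIRED_VARS = ["OPENAI_API_KEY", "HF_HOME", "HF_TOKEN", "HF_HUB_ENABLE_HF_TRANSFER"]

def missing_env_vars(required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit(f"Set these before running evaluation: {', '.join(missing)}")
```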