We will be releasing the following contents:
- Evaluation for ROVER-TG (please switch to the `ROVER-TG` branch; note that this feature is currently in beta testing)
- Inference code
Install the necessary dependencies:

```bash
pip3 install -r requirements.txt
```

Configure your OpenAI credentials (choose one):
Option A: Environment Variables (Recommended)
```bash
export OPENAI_API_KEY="your-api-key"
export OPENAI_MODEL="gpt-4.1"
```

Option B: Edit `config.py`
```python
OPENAI_API_KEY = "your-api-key"
OPENAI_MODEL = "gpt-4.1"
```

Legacy Azure Support (Deprecated):
```bash
export AZURE_OPENAI_API_KEY="your-api-key"
export AZURE_OPENAI_ENDPOINT="your-endpoint"
export AZURE_OPENAI_DEPLOYMENT="gpt-4.1"
```

Set the generation directory path (choose one):
Option A: Environment Variable (Recommended)
```bash
export ROVER_GEN_DIR="/path/to/your/generated/results"
export MAX_RETRIES="3"  # Optional: number of retries for failed evaluations
```

Option B: Edit `config.py`
```python
ROVER_GEN_DIR = "/path/to/your/generated/results"
MAX_RETRIES = 3  # Number of retries for failed evaluations
```

Your generation directory should contain files in this format:
```
your_gen_dir/
├── gen_{task_id}.png    # Generated image (required)
├── gen_{task_id}.txt    # Reasoning text (optional)
├── gen_science_temporal_1.png
├── gen_science_temporal_1.txt
├── gen_science_causal_3.png
├── gen_science_causal_3.txt
└── ...
```
File Naming Convention:
- Images: `gen_{task_id}.png`
- Reasoning text: `gen_{task_id}.txt`
- Task IDs follow the format `{dimension}_{reasoning_type}_{number}` (see the parsing sketch below)
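For orientation, here is one way such task IDs can be split back into their parts; the helper below is purely illustrative and not part of the released code (dimensions like `common_sense` themselves contain underscores, so the ID is split from the right):

```python
def parse_task_id(task_id: str):
    """Split '{dimension}_{reasoning_type}_{number}' from the right so that
    multi-word dimensions such as 'common_sense' stay intact."""
    dimension, reasoning_type, number = task_id.rsplit("_", 2)
    return dimension, reasoning_type, int(number)

# Examples (illustrative):
# parse_task_id("science_temporal_1")    -> ("science", "temporal", 1)
# parse_task_id("common_sense_causal_3") -> ("common_sense", "causal", 3)
```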
```bash
# Evaluate all available results
python evaluate_rover.py --output_dir results

# Filter by reasoning type
python evaluate_rover.py --output_dir results --reasoning_type temporal

# Filter by dimension
python evaluate_rover.py --output_dir results --dimension science

# Custom worker count
python evaluate_rover.py --output_dir results --workers 5

# Generate summary report
python summarize.py --results_file results/rover_metrics.jsonl --output_dir results

# View detailed results
cat results/rover_summary.json
```

The system evaluates 5 core metrics:
| Metric | Code | Description | Requires Text |
|---|---|---|---|
| Reasoning Process | RP | Quality of written reasoning steps | ✅ Yes |
| Reasoning Visual | RV | Visual result matches target description | ❌ No |
| Reasoning Alignment | RA | Consistency between text and visual result | ✅ Yes |
| Visual Consistency | VC | Non-target elements remain unchanged | ❌ No |
| Image Quality | IQ | Technical quality of generated image | ❌ No |
The system gracefully handles missing reasoning text files:
- ✅ 3 visual metrics (RV, VC, IQ) work normally without text
- ⚠️ 2 text-dependent metrics (RP, RA) use the fallback "No think output available"
- 📊 Scoring remains fair: missing text results in appropriate score penalties for text-dependent metrics
Example without text files:
```
your_gen_dir/
├── gen_science_temporal_1.png    # Only image, no .txt
├── gen_science_causal_3.png      # Only image, no .txt
└── ...
```
→ Still evaluates RV, VC, IQ normally; RP and RA get low scores due to missing reasoning.
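A minimal sketch of how such a fallback can be applied when the optional reasoning file is absent; the function name is illustrative, not the repo's actual interface:

```python
import os

NO_TEXT_FALLBACK = "No think output available"  # fallback string described above

def load_reasoning_text(gen_dir: str, task_id: str) -> str:
    """Illustrative: return the reasoning text if gen_{task_id}.txt exists,
    otherwise the fallback string used when scoring RP/RA."""
    path = os.path.join(gen_dir, f"gen_{task_id}.txt")
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return f.read().strip()
    return NO_TEXT_FALLBACK
```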
- Temporal: Changes over time (growth, aging, weather transitions)
- Spatial: Geometric transformations (rotation, perspective, positioning)
- Quantitative: Numerical changes (counting, scaling, proportions)
- Causal: Cause-effect relationships (interventions, reactions)
- Imaginative: Creative additions/modifications (style transfer, object addition)
- Natural Science: Physics, chemistry, biology principles
- Culture: Cultural, historical, social contexts
- Common Sense: Everyday knowledge and practical understanding
- Logic: Mathematical and formal reasoning
- Raw Results: `results/rover_metrics.jsonl` - one JSON object per evaluated task
- Summary Report: `results/rover_summary.json` - aggregated statistics by dimension, reasoning type, and overall
- Console Output: formatted table with scores (0-100 scale) for easy reading
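If you want a quick look at the raw metrics without running summarize.py, something along these lines works; the per-metric field names in the JSONL are an assumption and may differ from the released format:

```python
import json
from collections import defaultdict

totals, counts = defaultdict(float), defaultdict(int)
with open("results/rover_metrics.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        for metric in ("RP", "RV", "RA", "VC", "IQ"):  # assumed field names
            if metric in record:
                totals[metric] += record[metric]
                counts[metric] += 1

for metric in sorted(totals):
    print(f"{metric}: {totals[metric] / counts[metric]:.1f}")
```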
```
python evaluate_rover.py [OPTIONS]

--output_dir DIR        Output directory (default: rover_results)
--workers N             Number of parallel workers (default: 10)
--dimension {science,culture,common_sense,logic}
                        Filter by dimension
--reasoning_type {temporal,spatial,quantitative,causal,imaginative}
                        Filter by reasoning type
--metrics METRIC [METRIC ...]
                        Specific metrics to evaluate
```

The evaluation uses the cheryyunl/ROVER dataset from Hugging Face, which is downloaded automatically. No manual data preparation is needed.
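If you want to browse the benchmark tasks yourself, the dataset can be loaded directly with the `datasets` library; the snippet below only inspects splits and fields, since the exact schema is not documented here:

```python
from datasets import load_dataset

rover = load_dataset("cheryyunl/ROVER")  # downloads and caches the benchmark
print(rover)                             # available splits and their sizes
split = next(iter(rover))
print(rover[split][0].keys())            # field names of a single task
```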
The evaluation system uses a unified architecture:
- `evaluator.py`: Unified evaluation function supporting all reasoning types
- `base_metric.py`: Common functionality (image encoding, GPT evaluation, scoring, retry logic)
- `config.py`: Configuration management (API keys, paths, retry settings)
- `prompts.py`: Evaluation prompts for all reasoning types
- `evaluate_rover.py`: Main evaluation script with parallel processing
Key Features:
- Unified retry mechanism: All metrics use consistent retry logic for failed evaluations (see the sketch below)
- Configurable retries: Set `MAX_RETRIES` in config or environment variables
- Robust error handling: Graceful fallback for API failures and parsing errors
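As a rough illustration of what that retry behavior implies (the wrapper below is a sketch, not the repo's actual interface), a failed evaluation is simply retried up to `MAX_RETRIES` times before falling back:

```python
import os
import time

MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))  # mirrors the config option above

def evaluate_with_retries(evaluate_fn, *args, retries: int = MAX_RETRIES):
    """Illustrative retry wrapper: retry a failed evaluation, backing off
    briefly between attempts, and return None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return evaluate_fn(*args)
        except Exception as err:        # e.g. API failure or unparsable response
            if attempt == retries:
                print(f"Evaluation failed after {retries} attempts: {err}")
                return None             # caller decides how to score failures
            time.sleep(2 * attempt)     # simple linear backoff
```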
- Python 3.7+
- OpenAI API access
- Required packages: `datasets`, `openai`, `PIL`, `tqdm`
Common Issues:
- "Generated image not found" → Check
ROVER_GEN_DIRpath and file naming - API errors → Verify OpenAI credentials in
config.py - Low RP/RA scores → Normal if reasoning text files (.txt) are missing
- Dataset loading fails → Check internet connection for Hugging Face access
If you use this benchmark in your research, please consider citing:
```bibtex
@article{liang2025rover,
  title={ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation},
  author={Liang, Yongyuan and Chow, Wei and Li, Feng and Ma, Ziqiao and Wang, Xiyao and Mao, Jiageng and Chen, Jiuhai and Gu, Jiatao and Wang, Yue and Huang, Furong},
  journal={arXiv preprint arXiv:2511.01163},
  year={2025}
}
```