This is the code for our World Modeling Workshop paper. Verification through Spatial Assertion (ViSA) extends the MindJourney spatial reasoning pipeline with a Vision-Language Model (VLM) Verifier that implements a proposer-solver approach. The verifier adds a layer of consistency checking to ensure that generated world model outputs are reliable and accurate.
The figure above illustrates two pipelines:

- **MindJourney (top)**: The original pipeline that uses a world model to generate imagined camera views and scores them based on helpfulness for answering spatial reasoning questions.
- **ViSA (bottom)**: Our extension that adds verification by generating micro-claims about scene changes and verifying them against the imagined frames.
The ViSA verifier enhances the MindJourney pipeline through a two-step verification process:
After each camera action, the system:
- Compares the "before" and "after" images from the world model
- Generates frame-indexed micro-claims describing expected changes (e.g., "A red mug appears behind the box in frames 6-9")
- Creates claims about spatial relationships, object properties, and dynamic scene changes
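As a concrete illustration, a frame-indexed micro-claim can be represented as a small structured record. The field and class names below are illustrative, not the repository's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MicroClaim:
    """A hypothetical frame-indexed micro-claim about an imagined scene change."""
    text: str          # natural-language claim, e.g. "A red mug appears behind the box"
    frame_start: int   # first imagined frame the claim refers to (inclusive)
    frame_end: int     # last imagined frame the claim refers to (inclusive)

    def frames(self) -> range:
        # The imagined frames the verifier should inspect when checking this claim.
        return range(self.frame_start, self.frame_end + 1)

# Example matching the claim quoted above: "A red mug appears behind the box in frames 6-9"
claim = MicroClaim("A red mug appears behind the box", frame_start=6, frame_end=9)
```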
For each micro-claim:
- Uses a VLM to verify the claim against the visual evidence
- Outputs a verdict: ENTAILED (claim is true), CONTRADICTED (claim is false), or INSUFFICIENT (cannot determine)
- Provides confidence scores and reasoning for each verification
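The three-way verdict can be sketched as an enum plus a tolerant parser for the VLM's free-form answer. This is a minimal sketch; the repository's actual parsing and result structure may differ:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    ENTAILED = "entailed"          # claim is supported by the imagined frames
    CONTRADICTED = "contradicted"  # claim conflicts with the imagined frames
    INSUFFICIENT = "insufficient"  # evidence does not determine the claim

@dataclass
class VerificationResult:
    claim: str
    verdict: Verdict
    confidence: float  # VLM-reported confidence in [0, 1]
    reasoning: str     # VLM's free-form justification

def parse_vlm_verdict(raw: str) -> Verdict:
    # Map a free-form VLM answer onto one of the three verdicts;
    # anything unrecognized is conservatively treated as INSUFFICIENT.
    text = raw.strip().lower()
    if text.startswith("entailed"):
        return Verdict.ENTAILED
    if text.startswith("contradicted"):
        return Verdict.CONTRADICTED
    return Verdict.INSUFFICIENT
```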
The verification results are used to:
- Compute Claim-Acceptance Rate (CAR) from verified micro-claims
- Derive Evidence Quality (EQ) score from CAR, which measures the reliability of generated world model outputs
- Weight action scores based on Evidence Quality
- Filter out inconsistent or unreliable action results
- Boost scores for actions with high Evidence Quality
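Under the simplest reading of the steps above, CAR is the fraction of ENTAILED claims, EQ is derived from CAR, and action scores below the threshold are filtered while the rest are weighted so that high-EQ actions rank higher. A minimal sketch, assuming EQ equals CAR and a multiplicative weighting; the pipeline's exact scheme may differ:

```python
def claim_acceptance_rate(verdicts):
    """Fraction of micro-claims the VLM judged ENTAILED (0.0 if there are none)."""
    if not verdicts:
        return 0.0
    return sum(1 for v in verdicts if v == "ENTAILED") / len(verdicts)

def weighted_action_score(raw_score, evidence_quality, threshold=0.7):
    """Filter out low-EQ actions; scale the rest by EQ so reliable actions rank higher."""
    if evidence_quality < threshold:
        return 0.0  # treat the action's imagined evidence as unreliable
    return raw_score * evidence_quality

# Example: 2 of 3 claims entailed -> CAR ~ 0.67, below the 0.7 default threshold.
car = claim_acceptance_rate(["ENTAILED", "ENTAILED", "CONTRADICTED"])
```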
This zero-training approach uses off-the-shelf VLMs (GPT-4V, LLaVA, InternVL3) to add quality control and improve the reliability of spatial reasoning.
- **Environment Setup**: Follow the original MindJourney setup instructions.
- **VLM Configuration**:
  - For GPT-family models (gpt-4o, gpt-4.1, etc.): Set your Azure OpenAI API key and update `utils/api.py` with your Azure endpoint:

    ```bash
    export AZURE_OPENAI_API_KEY="your_api_key"
    ```

  - For InternVL3 models: Ensure adequate VRAM (see resource requirements below) and install the required dependencies.
- **Python Path**: Add the repository root to your Python path and set the world model type:

  ```bash
  export PYTHONPATH=$PYTHONPATH:./
  export WORLD_MODEL_TYPE="svc"
  ```
- **Resource Requirements**: The example SLURM scripts (e.g., `pipeline_svc_cfg_SAT_scaling_spatial_beam_search_slurm.sh`, `pipeline_baseline_slurm.sh`) provide guidance on resource requirements:
  - GPUs: 2x 80GB GPUs (e.g., A100) recommended for InternVL3-14B and world model inference
  - CPU: 4 cores per task
  - Memory: 70GB RAM
  - Time: 24 hours for full evaluation runs

  Adjust these based on your specific model choices and dataset size.
Runs a random baseline without any intelligent action selection:

```bash
python pipelines/random_with_log_probs.py \
    --input_dir data/SAT \
    --output_dir outputs/random \
    --vlm_model_name "OpenGVLab/InternVL3-14B" \
    --num_questions 10 \
    --split "val"
```

Runs the baseline pipeline without world model exploration, answering questions directly from the initial image:
```bash
bash scripts/pipeline_baseline.sh
```

Or directly:

```bash
python pipelines/pipeline_baseline.py \
    --input_dir data/SAT \
    --output_dir outputs/baseline \
    --vlm_model_name "OpenGVLab/InternVL3-14B" \
    --num_questions 10 \
    --split "val" \
    --max_images 1
```

Runs the original MindJourney pipeline with spatial beam search using the world model:
```bash
bash scripts/pipeline_svc_SAT_scaling_spatial_beam_search.sh
```

Or directly:

```bash
python pipelines/pipeline_svc_scaling_spatial_beam_search_basic.py \
    --input_dir data/SAT \
    --output_dir outputs/mindjourney \
    --vlm_model_name "OpenGVLab/InternVL3-14B" \
    --num_questions 10 \
    --split "val" \
    --max_steps_per_question 3 \
    --num_beams 3 \
    --num_top_candidates 5
```

Runs the enhanced pipeline with the ViSA verifier enabled:
```bash
bash scripts/pipeline_svc_SAT_scaling_spatial_beam_search_with_verifier.sh
```

Or directly:

```bash
python pipelines/pipeline_svc_scaling_spatial_beam_search_with_verifier.py \
    --enable_verifier \
    --verification_threshold 0.7 \
    --input_dir data/SAT \
    --output_dir outputs/svc_with_verifier \
    --vlm_model_name "OpenGVLab/InternVL3-14B" \
    --num_questions 10 \
    --split "val" \
    --max_steps_per_question 3 \
    --num_beams 3 \
    --num_top_candidates 5
```

Key ViSA parameters:

- `--enable_verifier`: Enable or disable the ViSA verifier (default: True)
- `--verification_threshold`: Evidence Quality (EQ) threshold, derived from CAR, used for score weighting (default: 0.7)
- `--baseline`: Run in baseline mode without the verifier
The ViSA pipeline includes verification metrics in the results, including Evidence Quality (EQ) derived from the Claim Acceptance Rate (CAR):

```json
{
  "accuracy": {...},
  "progress": {...},
  "verification_metrics": {
    "question_id": {
      "step_0": {
        "action_family": {
          "subaction": {
            "claim_acceptance_rate": 0.67,
            "evidence_quality_score": 0.67,
            "consistency_score": 0.67,
            "total_claims": 3,
            "accepted_claims": 2,
            "rejected_claims": 1,
            "claims": [...],
            "verification_results": [...]
          }
        }
      }
    }
  }
}
```

This code extends the original MindJourney framework. If you use this repository, please cite:
```bibtex
@misc{yang2025mindjourneytesttimescalingworld,
      title={MindJourney: Test-Time Scaling with World Models for Spatial Reasoning},
      author={Yuncong Yang and Jiageng Liu and Zheyuan Zhang and Siyuan Zhou and Reuben Tan and Jianwei Yang and Yilun Du and Chuang Gan},
      year={2025},
      eprint={2507.12508},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.12508},
}
```