Princeton-AI2-Lab/CubeBench

CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning under Partial Observations

Overview

LLM agents excel in digital tasks but struggle in physically grounded settings due to weak spatial mental models. We identify three core challenges: spatial reasoning, long-horizon state tracking, and active exploration under partial observation.

To evaluate these abilities, we introduce CubeBench, a benchmark based on the Rubik’s Cube with a three-tier diagnostic framework, ranging from full symbolic states to partial visual observations.

Experiments on leading LLMs reveal severe long-horizon failures, including a 0.00% pass rate on all long-horizon tasks. CubeBench further supports solver-based diagnostics to isolate cognitive bottlenecks, offering insights toward building more spatially grounded intelligent agents.

Setup

First, install the required dependencies:

pip install -r requirements.txt

Generating Pruning Tables

This project relies on Kociemba’s Solver. Since pruning tables are too large to store in Git, you must generate them locally before running experiments.

  1. Ensure the directory cube/solvers/optimal/tables exists.
  2. Run the following command:
python -c "import cube.solvers.optimal.solver"

The tables will then be generated. This process may take several hours, but can be significantly accelerated using PyPy.

For more details, see the original repository: RubiksCube-OptimalSolver.

Generating Test Cases

We provide the exact test cases used in our experiments:

experiments/generated_states/test_subset_5_back_checked.json

You can use this file directly, or generate your own test cases using the provided pipeline:
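If you want to inspect a test-case file programmatically, a minimal sketch like the following can count cases per scramble depth. The schema assumed here (a JSON object mapping depth to a list of states) is our assumption, not a documented format; check the actual file before relying on it.

```python
import json
from pathlib import Path

# Hypothetical inspection helper -- the schema (depth -> list of states)
# is an assumption; adjust to the actual JSON layout.
def count_cases_per_depth(states_file: str) -> dict:
    data = json.loads(Path(states_file).read_text())
    return {depth: len(states) for depth, states in data.items()}
```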

Step 1. Generate initial states with random scrambles

python experiments/generate_initial_states.py

Step 2. Bucket the states by depth (requires the optimal-solver pruning tables generated in the setup section above.)

python experiments/sieve.py --max-per-level 20

This step can be stopped once enough cases are collected. ⚠️ Depth-20 states are extremely rare, so they must be imported from an external source.

Step 3. Build the hard-20 bucket

We provide a helper script that uses experiments/hard20.txt (sourced from cube20.org).

python experiments/build_hard20_bucket.py

Step 4. Filter and select test cases

Remove center-symmetric cases and keep the first 5 cases from each depth bucket:

python filter_states_by_scrambles_back_num.py

Add Your API Config

Please set your API base and API key at policy/algorithm/smolagent/api_config.py.

API_CONFIGS = {
    'openrouter': {
        'base': 'https://openrouter.ai/api/v1',
        'key': 'add-your-key-here'
    },
    # more providers can be added here
}
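As a sketch of how such a config dict might be consumed, the helper below looks up a provider and fails loudly when it is missing. The `get_api_config` function is hypothetical, not part of the repository; the `API_CONFIGS` dict is repeated here only to keep the sketch self-contained.

```python
# Hypothetical helper for reading API_CONFIGS -- not part of the repo.
API_CONFIGS = {
    'openrouter': {
        'base': 'https://openrouter.ai/api/v1',
        'key': 'add-your-key-here',
    },
}

def get_api_config(provider: str) -> tuple:
    """Return (base_url, api_key) for a provider, failing loudly if missing."""
    try:
        cfg = API_CONFIGS[provider]
    except KeyError:
        raise ValueError(f"Unknown provider {provider!r}; add it to API_CONFIGS")
    return cfg['base'], cfg['key']
```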

Running the Main Experiment

The scripts for this experiment are designed for Linux, and we strongly recommend running the experiments on a Linux system.

You can use experiments/configs/toy_config.json for a quick verification of your environment setup. Afterward, you can customize your own configuration to run experiments.

Run a configuration for the first time:

python batch_test_generated_states.py --config {path_to_config}

If the run completes normally (including expected terminations such as Time Limit Exceeded, Token Limit Exceeded, Image Limit Exceeded), you can stop here.

If the run crashes unexpectedly and you want to resume, add --resume:

python batch_test_generated_states.py --config {path_to_config} --resume {path_to_resume}

If you want to run a single test case instead of a batch, you can use:

python -m agent.agent --your-arguments ...

An example is:

python -m agent.agent --observation-type state_string --scramble-moves 8 --model openai/gpt-5 --max-steps 20 --agent-type basic --provider openrouter --timeout 1800

All results will be saved in the test_outputs directory.

Configuration File Format

Configuration files are written in JSON.

Example:

{
  "experiment_parameters": {
    "states_file": "experiments/generated_states/test_subset_5_back_checked.json",
    "timeout": 1800,
    "provider": "openrouter",
    "max_input_tokens": 1000000,
    "max_output_tokens": 300000,
    "max_images": 200,
    "workers": 80,
    "parallel_strategy": "all",
    "checkpoint_interval": 6,
    "step_observation_callback": false,
    "reward_type": "heuristic"
  },
  "observation_types": [
    "state_string",
    "full_view",
    "face_view",
    "vertex_view"
  ],
  "scrambles": {
    "1": 20,
    "2": 20,
    "3": 20,
    "4": 20,
    "8": 20,
    "12": 20,
    "16": 20,
    "20": 20
  },
  "agent_types": [
    "basic",
    "standard_solver",
    "ideal_solver"
  ],
  "models": [
    "openai/gpt-5"
  ]
}
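To sanity-check a custom configuration before a long run, a small sketch like this can verify the top-level keys. The required-key list mirrors the example above; it is our assumption, not a schema shipped with the repo.

```python
import json

# Top-level keys taken from the example config; this list is an assumption.
REQUIRED_KEYS = {"experiment_parameters", "observation_types",
                 "scrambles", "agent_types", "models"}

def validate_config(path: str) -> dict:
    """Load a config file and fail early if a top-level section is missing."""
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"Config is missing keys: {sorted(missing)}")
    return cfg
```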

experiment_parameters

Defines the core execution settings:

  • states_file: Path to the initial states file
  • timeout: Timeout (in seconds) per test
  • provider: API provider (e.g., "openrouter")
  • max_input_tokens: Max input tokens
  • max_output_tokens: Max output tokens
  • max_images: Max images per run
  • workers: Number of parallel workers
  • parallel_strategy: "all" or "model"
  • checkpoint_interval: Save interval for checkpoints
  • step_observation_callback: Enable/disable step callbacks
  • reward_type: Reward calculation method. Choices: "no_reward", "heuristic", "face", "sticker"

scrambles

Specifies the number of states to generate for each scramble depth:

{
  "1": 20,   // 1 scramble move → up to 20 solving steps
  "2": 20,   // 2 scramble moves → up to 20 solving steps
  ...
}

agent_types

List of agents to test:

  • "basic" – Basic Agent
  • "standard_solver" – Standard-Solver Agent
  • "ideal_solver" – Ideal-Solver Agent

observation_types

List of observation types to test:

  • "state_string": Full Symbolic
  • "full_view": Full Visual
  • "face_view": Face View
  • "vertex_view": Vertex View

models

Models to evaluate, using the names recognized by your provider.
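Putting the sections together: if the batch runner takes the full cross product of models, agent types, observation types, and test states (our reading, not a documented guarantee), the example config above yields 1 model × 3 agent types × 4 observation types × (8 depths × 20 states) = 1,920 runs. A quick sketch of that arithmetic:

```python
def total_runs(cfg: dict) -> int:
    # Assumes the runner evaluates the full cross product of
    # models x agent_types x observation_types x states (our reading).
    n_states = sum(cfg["scrambles"].values())
    return (len(cfg["models"]) * len(cfg["agent_types"])
            * len(cfg["observation_types"]) * n_states)

example = {
    "models": ["openai/gpt-5"],
    "agent_types": ["basic", "standard_solver", "ideal_solver"],
    "observation_types": ["state_string", "full_view",
                          "face_view", "vertex_view"],
    "scrambles": {str(d): 20 for d in (1, 2, 3, 4, 8, 12, 16, 20)},
}
# total_runs(example) -> 1920
```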

Troubleshooting

Sometimes, providers return malformed responses that cause the smolagents framework to raise AgentGenerationError or JsonParsingError. We provide a patch to handle these.

  1. In patch_smolagents.py, replace /path/to/your/site-packages with your actual site-packages path.
  2. Modify _response.py in the openai package. In class BaseAPIResponse, method _parse, add:
# Existing code ......
return response.text  # type: ignore

# NEW CODE BEGIN
try:
    data = response.json()
except Exception as exc:
    log.debug("Could not parse JSON: %s - %s", type(exc), exc)
    # Return a minimal valid JSON structure
    data = {}
# NEW CODE END

# Existing code ......
return self._client._process_response_data(
    data=data,
    cast_to=cast_to,  # type: ignore
    response=response,
)
  3. In agent.py, import the patch:
try:
    from patch_smolagents import patched_generate
    print("✅ Loaded smolagents patch for empty response handling")
except ImportError as e:
    print(f"⚠️ Could not load smolagents patch: {e}")
    print("Place 'patch_smolagents.py' in the same directory as this script.")
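The core idea of the patch, stripped of the openai internals, is a fallback parse that degrades to an empty object instead of raising. A standalone sketch of that idea (the `safe_parse_json` helper is ours, for illustration only):

```python
import json
import logging

log = logging.getLogger(__name__)

def safe_parse_json(text: str) -> dict:
    """Parse JSON, falling back to {} on malformed input (the patch's intent)."""
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError) as exc:
        log.debug("Could not parse JSON: %s - %s", type(exc), exc)
        # Return a minimal valid JSON structure
        return {}
```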

Demos

2D version

A 2D interactive Rubik's Cube visualization for debugging and understanding sticker numbering and face relationships.

Run:

python examples/numbered_cube.py

Controls:

  • F/B/L/R/U/D: Rotate face clockwise
  • Shift+F/B/L/R/U/D: Rotate face counter-clockwise
  • Space: Scramble the cube
  • R: Reset to solved state
  • N: Toggle sticker numbers
  • Q or Esc: Quit
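The Shift-modified controls map directly onto standard cube notation (F clockwise, F' counter-clockwise). A small illustrative helper, independent of the demo code (the function name and structure are ours):

```python
FACES = set("FBLRUD")

def key_to_move(key: str, shift: bool) -> str:
    """Translate a face key plus Shift state into cube notation (e.g. F / F')."""
    face = key.upper()
    if face not in FACES:
        raise ValueError(f"Not a face key: {key!r}")
    return face + "'" if shift else face
```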

3D version

A 3D interactive Rubik's Cube visualization using Pygame and PyOpenGL, supporting real-time rotation, zoom, and sticker/face label debugging.

Run:

python examples/interactive_3d_cube.py

Controls:

  • F/B/L/R/U/D: Rotate face clockwise
  • Shift+F/B/L/R/U/D: Rotate face counter-clockwise
  • Space: Scramble the cube
  • Backspace: Reset to solved state
  • N: Toggle sticker numbers
  • T: Toggle face labels
  • S: Solve the cube
  • C: Stop solving process
  • Mouse Drag: Rotate view
  • Mouse Wheel: Zoom in/out
  • Q or Esc: Quit

Both versions are designed for debugging and verifying the cube's sticker mapping and move logic. The 3D version is especially useful for visually confirming the spatial relationships between faces and stickers.
