LLM agents excel in digital tasks but struggle in physically grounded settings due to weak spatial mental models. We identify three core challenges: spatial reasoning, long-horizon state tracking, and active exploration under partial observation.
To evaluate these abilities, we introduce CubeBench, a benchmark based on the Rubik’s Cube with a three-tier diagnostic framework, ranging from full symbolic states to partial visual observations.
Experiments on leading LLMs reveal severe long-horizon failures, including a 0.00% pass rate on all long-horizon tasks. CubeBench further supports solver-based diagnostics to isolate cognitive bottlenecks, offering insights toward building more spatially grounded intelligent agents.
First, install the required dependencies:
```bash
pip install -r requirements.txt
```

This project relies on Kociemba’s Solver. Since pruning tables are too large to store in Git, you must generate them locally before running experiments:

- Ensure the directory `cube/solvers/optimal/tables` exists.
- Run the following command:

```bash
python -c "import cube.solvers.optimal.solver"
```

The tables will then be generated. This process may take several hours, but can be significantly accelerated using PyPy.
For more details, see the original repository: RubiksCube-OptimalSolver.
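Before moving on, you can verify that the tables were actually created. A minimal sketch using only the standard library (the directory path is the one given above):

```python
# Sanity check: confirm the pruning tables exist before running experiments.
from pathlib import Path

tables_dir = Path("cube/solvers/optimal/tables")
files = list(tables_dir.glob("*")) if tables_dir.is_dir() else []
if not files:
    raise SystemExit('Tables missing; run: python -c "import cube.solvers.optimal.solver"')
print(f"Found {len(files)} table files in {tables_dir}.")
```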
We provide the exact test cases used in our experiments:
`experiments/generated_states/test_subset_5_back_checked.json`

You can use this file directly, or generate your own test cases using the provided pipeline:
Step 1. Generate initial states with random scrambles
```bash
python experiments/generate_initial_states.py
```

Step 2. Bucket the states by depth (requires the optimal solver pruning tables generated in the previous step)
```bash
python experiments/sieve.py --max-per-level 20
```

This step can be stopped once enough cases are collected.
Step 3. Build the hard-20 bucket
We provide a helper script that uses `experiments/hard20.txt` (sourced from cube20.org).
```bash
python experiments/build_hard20_bucket.py
```

Step 4. Filter and select test cases
Remove center-symmetric cases and keep the first 5 cases from each depth bucket:
```bash
python filter_states_by_scrambles_back_num.py
```

Please set your API base and API key in `policy/algorithm/smolagent/api_config.py`:
```python
API_CONFIGS = {
    'openrouter': {
        'base': 'https://openrouter.ai/api/v1',
        'key': 'add-your-key-here'
    },
    # more providers can be added here
}
```

The experiment scripts are designed for Linux, and we highly recommend running the experiments on a Linux system.
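For reference, a minimal sketch of how a script might read one of these entries (the module path follows the file location above; the actual lookup logic inside the policy code is an assumption):

```python
# Sketch: read a provider entry from API_CONFIGS (actual usage in the
# policy code may differ).
from policy.algorithm.smolagent.api_config import API_CONFIGS

provider = "openrouter"
cfg = API_CONFIGS[provider]
print(f"Using {provider} at {cfg['base']}")  # the API key is in cfg["key"]
```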
You can use `experiments/configs/toy_config.json` for a quick verification of your environment setup. Afterward, you can customize your own configuration to run experiments.
Run a configuration for the first time:
```bash
python batch_test_generated_states.py --config {path_to_config}
```

If the run completes normally (including expected terminations such as Time Limit Exceeded, Token Limit Exceeded, or Image Limit Exceeded), you can stop here.
If the run crashes unexpectedly and you want to resume, add `--resume`:
```bash
python batch_test_generated_states.py --config {path_to_config} --resume {path_to_resume}
```

If you want to run a single test case instead of a batch, you can use:
```bash
python -m agent.agent --your-arguments ...
```

An example is:
```bash
python -m agent.agent --observation-type state_string --scramble-moves 8 --model openai/gpt-5 --max-steps 20 --agent-type basic --provider openrouter --timeout 1800
```

All results will be saved in the `test_outputs` directory.
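For a quick look at what a run produced, a hypothetical snippet that simply walks the output tree (the internal layout of `test_outputs` is not documented here):

```python
# Hypothetical: list all files written under the output directory.
from pathlib import Path

for path in sorted(Path("test_outputs").rglob("*")):
    if path.is_file():
        print(path)
```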
Configuration files are written in JSON.
Example:
```json
{
  "experiment_parameters": {
    "states_file": "experiments/generated_states/test_subset_5_back_checked.json",
    "timeout": 1800,
    "provider": "openrouter",
    "max_input_tokens": 1000000,
    "max_output_tokens": 300000,
    "max_images": 200,
    "workers": 80,
    "parallel_strategy": "all",
    "checkpoint_interval": 6,
    "step_observation_callback": false,
    "reward_type": "heuristic"
  },
  "observation_types": [
    "state_string",
    "full_view",
    "face_view",
    "vertex_view"
  ],
  "scrambles": {
    "1": 20,
    "2": 20,
    "3": 20,
    "4": 20,
    "8": 20,
    "12": 20,
    "16": 20,
    "20": 20
  },
  "agent_types": [
    "basic",
    "standard_solver",
    "ideal_solver"
  ],
  "models": [
    "openai/gpt-5"
  ]
}
```

The `experiment_parameters` block defines the core execution settings:
- `states_file`: Path to the initial states file
- `timeout`: Timeout (in seconds) per test
- `provider`: API provider (e.g., `"openrouter"`)
- `max_input_tokens`: Max input tokens
- `max_output_tokens`: Max output tokens
- `max_images`: Max images per run
- `workers`: Number of parallel workers
- `parallel_strategy`: `"all"` or `"model"`
- `checkpoint_interval`: Save interval for checkpoints
- `step_observation_callback`: Enable/disable step callbacks
- `reward_type`: Reward calculation method. Choices: `"no_reward"`, `"heuristic"`, `"face"`, `"sticker"`
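Before launching a long batch run, it can help to sanity-check a config file. A minimal sketch, with the required section names taken from the example above (the batch runner may perform its own, stricter validation):

```python
# Sketch: verify a config file has all top-level sections before a batch run.
# Usage (hypothetical script name): python check_config.py path/to/config.json
import json
import sys

REQUIRED_SECTIONS = {
    "experiment_parameters", "observation_types",
    "scrambles", "agent_types", "models",
}

with open(sys.argv[1]) as f:
    config = json.load(f)

missing = REQUIRED_SECTIONS - config.keys()
if missing:
    raise SystemExit(f"Config is missing sections: {sorted(missing)}")
print("Config looks structurally complete.")
```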
The `scrambles` block specifies the number of states to generate for each scramble depth:
```
{
  "1": 20,   // 1 scramble move → up to 20 solving steps
  "2": 20,   // 2 scramble moves → up to 20 solving steps
  ...
}
```
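For planning runtime and cost, the example config above implies the following total, assuming each value is the number of test states at that depth (as described above) and that the batch runner executes the full cross product of states, observation types, agent types, and models; the actual scheduling may differ:

```python
# Rough run-count estimate for the example config (cross-product assumption).
scrambles = {"1": 20, "2": 20, "3": 20, "4": 20, "8": 20, "12": 20, "16": 20, "20": 20}
n_states = sum(scrambles.values())   # 160 initial states
n_runs = n_states * 4 * 3 * 1        # x 4 observation types x 3 agent types x 1 model
print(n_runs)                        # 1920
```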
The `agent_types` list specifies the agents to test:
"basic"– Basic Agent"standard_solver"– Standard-Solver Agent"ideal_solver"– Ideal-Solver Agent
The `observation_types` list specifies the observation types to test:
"state_string": Full Symbolic"full_view": Full Visual"face_view": Face View"vertex_view": Vertex View
The `models` list specifies the models to evaluate, using the names recognized by your provider.
Sometimes, providers return malformed responses that cause the smolagents framework to raise `AgentGenerationError` or `JsonParsingError`.
We provide a patch to handle these.
- In `patch_smolagents.py`, replace `/path/to/your/site-packages` with your actual site-packages path.
- Modify `_response.py` in the `openai` package. In class `BaseAPIResponse`, method `_parse`, add:
```python
# Existing code ......
return response.text  # type: ignore

# NEW CODE BEGIN
try:
    data = response.json()
except Exception as exc:
    log.debug("Could not parse JSON: %s - %s", type(exc), exc)
    # Return a minimal valid JSON structure
    data = {}
# NEW CODE END

# Existing code ......
return self._client._process_response_data(
    data=data,
    cast_to=cast_to,  # type: ignore
    response=response,
)
```

- In `agent.py`, import the patch:
```python
try:
    from patch_smolagents import patched_generate
    print("✅ Loaded smolagents patch for empty response handling")
except ImportError as e:
    print(f"⚠️ Could not load smolagents patch: {e}")
    print("Place 'patch_smolagents.py' in the same directory as this script.")
```

A 2D interactive Rubik's Cube visualization for debugging and understanding sticker numbering and face relationships.
Run:
```bash
python examples/numbered_cube.py
```

Controls:

- `F`/`B`/`L`/`R`/`U`/`D`: Rotate face clockwise
- `Shift`+`F`/`B`/`L`/`R`/`U`/`D`: Rotate face counter-clockwise
- `Space`: Scramble the cube
- `R`: Reset to solved state
- `N`: Toggle sticker numbers
- `Q` or `Esc`: Quit
A 3D interactive Rubik's Cube visualization using Pygame and PyOpenGL, supporting real-time rotation, zoom, and sticker/face label debugging.
Run:
```bash
python examples/interactive_3d_cube.py
```

Controls:

- `F`/`B`/`L`/`R`/`U`/`D`: Rotate face clockwise
- `Shift`+`F`/`B`/`L`/`R`/`U`/`D`: Rotate face counter-clockwise
- `Space`: Scramble the cube
- `Backspace`: Reset to solved state
- `N`: Toggle sticker numbers
- `T`: Toggle face labels
- `S`: Solve the cube
- `C`: Stop solving process
- `Mouse Drag`: Rotate view
- `Mouse Wheel`: Zoom in/out
- `Q` or `Esc`: Quit
Both versions are designed for debugging and verifying the cube's sticker mapping and move logic. The 3D version is especially useful for visually confirming the spatial relationships between faces and stickers.
