LLM agents excel in digital tasks but struggle in physically grounded settings due to weak spatial mental models. We identify three core challenges: spatial reasoning, long-horizon state tracking, and active exploration under partial observation.
To evaluate these abilities, we introduce CubeBench, a benchmark based on the Rubik’s Cube with a three-tier diagnostic framework, ranging from full symbolic states to partial visual observations.
Experiments on leading LLMs reveal severe long-horizon failures, including a 0.00% pass rate on all long-horizon tasks. CubeBench further supports solver-based diagnostics to isolate cognitive bottlenecks, offering insights toward building more spatially grounded intelligent agents.
First, install the required dependencies:
```bash
pip install -r requirements.txt
```

This project relies on Kociemba’s Solver. Since pruning tables are too large to store in Git, you must generate them locally before running experiments:

- Ensure the directory `cube/solvers/optimal/tables` exists.
- Run the following command:

```bash
python -c "import cube.solvers.optimal.solver"
```

The tables will then be generated. This process may take several hours, but can be significantly accelerated using PyPy.
For more details, see the original repository: RubiksCube-OptimalSolver.
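Before moving on, you can verify that the tables were actually created. A minimal sketch using only the standard library (the directory path is the one given above):

```python
# Sanity check: confirm the pruning tables exist before running experiments.
from pathlib import Path

tables_dir = Path("cube/solvers/optimal/tables")
files = list(tables_dir.glob("*")) if tables_dir.is_dir() else []
if not files:
    raise SystemExit('Tables missing; run: python -c "import cube.solvers.optimal.solver"')
print(f"Found {len(files)} table files in {tables_dir}.")
```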
We provide the exact test cases used in our experiments:
`experiments/generated_states/test_subset_5_back_checked.json`

You can use this file directly, or generate your own test cases using the provided pipeline:
Step 1. Generate initial states with random scrambles
```bash
python experiments/generate_initial_states.py
```

Step 2. Bucket the states by depth (requires the optimal solver pruning tables generated in the previous step)
```bash
python experiments/sieve.py --max-per-level 20
```

This step can be stopped once enough cases are collected.
Step 3. Build the hard-20 bucket
We provide a helper script that uses `experiments/hard20.txt` (sourced from cube20.org).
```bash
python experiments/build_hard20_bucket.py
```

Step 4. Filter and select test cases
Remove center-symmetric cases and keep the first 5 cases from each depth bucket:
```bash
python filter_states_by_scrambles_back_num.py
```

Please set your API base and API key in `policy/algorithm/smolagent/api_config.py`:
```python
API_CONFIGS = {
    'openrouter': {
        'base': 'https://openrouter.ai/api/v1',
        'key': 'add-your-key-here'
    },
    # more providers can be added here
}
```

The experiment scripts are designed for Linux, and we highly recommend running the experiments on a Linux system.
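For reference, a minimal sketch of how a script might read one of these entries (the module path follows the file location above; the actual lookup logic inside the policy code is an assumption):

```python
# Sketch: read a provider entry from API_CONFIGS (actual usage in the
# policy code may differ).
from policy.algorithm.smolagent.api_config import API_CONFIGS

provider = "openrouter"
cfg = API_CONFIGS[provider]
print(f"Using {provider} at {cfg['base']}")  # the API key is in cfg["key"]
```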
You can use `experiments/configs/toy_config.json` for a quick verification of your environment setup. Afterward, you can customize your own configuration to run experiments.
Run a configuration for the first time:
```bash
python batch_test_generated_states.py --config {path_to_config}
```

If the run completes normally (including expected terminations such as Time Limit Exceeded, Token Limit Exceeded, or Image Limit Exceeded), you can stop here.
If the run crashes unexpectedly and you want to resume, add `--resume`:
```bash
python batch_test_generated_states.py --config {path_to_config} --resume {path_to_resume}
```

If you want to run a single test case instead of a batch, you can use:
```bash
python -m agent.agent --your-arguments ...
```

An example is:
```bash
python -m agent.agent --observation-type state_string --scramble-moves 8 --model openai/gpt-5 --max-steps 20 --agent-type basic --provider openrouter --timeout 1800
```

All results will be saved in the `test_outputs` directory.
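For a quick look at what a run produced, a hypothetical snippet that simply walks the output tree (the internal layout of `test_outputs` is not documented here):

```python
# Hypothetical: list all files written under the output directory.
from pathlib import Path

for path in sorted(Path("test_outputs").rglob("*")):
    if path.is_file():
        print(path)
```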
Configuration files are written in JSON.
Example:
```json
{
  "experiment_parameters": {
    "states_file": "experiments/generated_states/test_subset_5_back_checked.json",
    "timeout": 1800,
    "provider": "openrouter",
    "max_input_tokens": 1000000,
    "max_output_tokens": 300000,
    "max_images": 200,
    "workers": 80,
    "parallel_strategy": "all",
    "checkpoint_interval": 6,
    "step_observation_callback": false,
    "reward_type": "heuristic"
  },
  "observation_types": [
    "state_string",
    "full_view",
    "face_view",
    "vertex_view"
  ],
  "scrambles": {
    "1": 20,
    "2": 20,
    "3": 20,
    "4": 20,
    "8": 20,
    "12": 20,
    "16": 20,
    "20": 20
  },
  "agent_types": [
    "basic",
    "standard_solver",
    "ideal_solver"
  ],
  "models": [
    "openai/gpt-5"
  ]
}
```

The `experiment_parameters` block defines the core execution settings:
- `states_file`: Path to the initial states file
- `timeout`: Timeout (in seconds) per test
- `provider`: API provider (e.g., `"openrouter"`)
- `max_input_tokens`: Max input tokens
- `max_output_tokens`: Max output tokens
- `max_images`: Max images per run
- `workers`: Number of parallel workers
- `parallel_strategy`: `"all"` or `"model"`
- `checkpoint_interval`: Save interval for checkpoints
- `step_observation_callback`: Enable/disable step callbacks
- `reward_type`: Reward calculation method. Choices: `"no_reward"`, `"heuristic"`, `"face"`, `"sticker"`
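Before launching a long batch run, it can help to sanity-check a config file. A minimal sketch, with the required section names taken from the example above (the batch runner may perform its own, stricter validation):

```python
# Sketch: verify a config file has all top-level sections before a batch run.
# Usage (hypothetical script name): python check_config.py path/to/config.json
import json
import sys

REQUIRED_SECTIONS = {
    "experiment_parameters", "observation_types",
    "scrambles", "agent_types", "models",
}

with open(sys.argv[1]) as f:
    config = json.load(f)

missing = REQUIRED_SECTIONS - config.keys()
if missing:
    raise SystemExit(f"Config is missing sections: {sorted(missing)}")
print("Config looks structurally complete.")
```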
The `scrambles` block specifies the number of states to generate for each scramble depth:
```
{
  "1": 20,   // 1 scramble move → up to 20 solving steps
  "2": 20,   // 2 scramble moves → up to 20 solving steps
  ...
}
```
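For planning runtime and cost, the example config above implies the following total, assuming each value is the number of test states at that depth (as described above) and that the batch runner executes the full cross product of states, observation types, agent types, and models; the actual scheduling may differ:

```python
# Rough run-count estimate for the example config (cross-product assumption).
scrambles = {"1": 20, "2": 20, "3": 20, "4": 20, "8": 20, "12": 20, "16": 20, "20": 20}
n_states = sum(scrambles.values())   # 160 initial states
n_runs = n_states * 4 * 3 * 1        # x 4 observation types x 3 agent types x 1 model
print(n_runs)                        # 1920
```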
The `agent_types` list specifies the agents to test:
"basic"– Basic Agent"standard_solver"– Standard-Solver Agent"ideal_solver"– Ideal-Solver Agent
The `observation_types` list specifies the observation types to test:
"state_string": Full Symbolic"full_view": Full Visual"face_view": Face View"vertex_view": Vertex View
The `models` list specifies the models to evaluate, using the names recognized by your provider.
Sometimes, providers return malformed responses that cause the smolagents framework to raise `AgentGenerationError` or `JsonParsingError`.
We provide a patch to handle these.
- In `patch_smolagents.py`, replace `/path/to/your/site-packages` with your actual site-packages path.
- Modify `_response.py` in the `openai` package. In class `BaseAPIResponse`, method `_parse`, add:
```python
# Existing code ......
return response.text  # type: ignore

# NEW CODE BEGIN
try:
    data = response.json()
except Exception as exc:
    log.debug("Could not parse JSON: %s - %s", type(exc), exc)
    # Return a minimal valid JSON structure
    data = {}
# NEW CODE END

# Existing code ......
return self._client._process_response_data(
    data=data,
    cast_to=cast_to,  # type: ignore
    response=response,
)
```

- In `agent.py`, import the patch:
```python
try:
    from patch_smolagents import patched_generate
    print("✅ Loaded smolagents patch for empty response handling")
except ImportError as e:
    print(f"⚠️ Could not load smolagents patch: {e}")
    print("Place 'patch_smolagents.py' in the same directory as this script.")
```

A 2D interactive Rubik's Cube visualization for debugging and understanding sticker numbering and face relationships.
Run:
```bash
python examples/numbered_cube.py
```

Controls:

- `F`/`B`/`L`/`R`/`U`/`D`: Rotate face clockwise
- `Shift`+`F`/`B`/`L`/`R`/`U`/`D`: Rotate face counter-clockwise
- `Space`: Scramble the cube
- `R`: Reset to solved state
- `N`: Toggle sticker numbers
- `Q` or `Esc`: Quit
A 3D interactive Rubik's Cube visualization using Pygame and PyOpenGL, supporting real-time rotation, zoom, and sticker/face label debugging.
Run:
```bash
python examples/interactive_3d_cube.py
```

Controls:

- `F`/`B`/`L`/`R`/`U`/`D`: Rotate face clockwise
- `Shift`+`F`/`B`/`L`/`R`/`U`/`D`: Rotate face counter-clockwise
- `Space`: Scramble the cube
- `Backspace`: Reset to solved state
- `N`: Toggle sticker numbers
- `T`: Toggle face labels
- `S`: Solve the cube
- `C`: Stop solving process
- `Mouse Drag`: Rotate view
- `Mouse Wheel`: Zoom in/out
- `Q` or `Esc`: Quit
Both versions are designed for debugging and verifying the cube's sticker mapping and move logic. The 3D version is especially useful for visually confirming the spatial relationships between faces and stickers.
