InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model for Spatial Intelligence

We present InSpatio-WorldFM, a real-time generative frame model for spatial intelligence. It delivers efficient, multi-view consistent spatial reasoning and enables real-time interactive exploration. Our goal is to bring real-time spatial reasoning to consumer-grade GPUs, moving spatial intelligence from data centers to edge devices.

We open-source the code for this project and provide an online demo for a live preview.

Code arXiv Demo

Multi-View Consistency

3D World Models vs. 2D World Models: Consistency for Spatial Intelligence

Unlike video generation models, which learn purely in 2D pixel space, World Models are founded on consistency: generated content does not arbitrarily drift, change, or disappear over time. Such spatial hallucinations remain prevalent even in the most advanced video generation models today.

We believe that purely 2D learning is insufficient to achieve spatial intelligence, and that the most fundamental property of spatial intelligence is 3D multi-view consistency.

In InSpatio-WorldFM, multi-view consistency serves as the core constraint for content generation, both in precomputed spatial representations and during real-time inference. This ensures that the model learns and maintains a coherent understanding of 3D spatial structure, rather than modeling the world as a sequence of unconstrained 2D pixel changes. As a result, InSpatio-WorldFM achieves strong spatial coherence and persistent consistency.
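To make the idea of multi-view consistency concrete, the following is a minimal sketch of a reprojection check, not the project's actual training or inference code. It assumes a simple pinhole camera with yaw-only rotation; the intrinsics (`FOCAL`, `CX`, `CY`) and the helper names `project`, `unproject`, and `reprojection_error` are hypothetical and chosen only for illustration. The check asks: if content appears at a given pixel and depth in view A, does view B show it where the 3D geometry says it should be?

```python
import math

# Hypothetical pinhole intrinsics, chosen for illustration only.
FOCAL, CX, CY = 500.0, 320.0, 240.0


def project(point, cam_pos, yaw):
    """Project a world-space point into a camera at cam_pos that is rotated
    by `yaw` radians about the vertical (y) axis and looks down its own +z.
    Returns (u, v) pixel coordinates, or None if the point is behind it."""
    dx, dy, dz = (point[i] - cam_pos[i] for i in range(3))
    c, s = math.cos(yaw), math.sin(yaw)
    # World -> camera: apply the inverse of the camera's yaw rotation.
    x, y, z = c * dx - s * dz, dy, s * dx + c * dz
    if z <= 0:
        return None
    return (FOCAL * x / z + CX, FOCAL * y / z + CY)


def unproject(pixel, depth, cam_pos, yaw):
    """Lift a pixel with known depth (distance along the camera's z axis)
    back to a world-space 3D point."""
    x = (pixel[0] - CX) * depth / FOCAL
    y = (pixel[1] - CY) * depth / FOCAL
    c, s = math.cos(yaw), math.sin(yaw)
    # Camera -> world: apply the yaw rotation, then translate.
    return (c * x + s * depth + cam_pos[0],
            y + cam_pos[1],
            -s * x + c * depth + cam_pos[2])


def reprojection_error(pixel_a, depth_a, cam_a, pixel_b, cam_b):
    """Pixel distance between where view A predicts the content should
    appear in view B and where view B actually shows it."""
    world = unproject(pixel_a, depth_a, *cam_a)
    predicted = project(world, *cam_b)
    if predicted is None:
        return math.inf
    return math.dist(predicted, pixel_b)


point = (1.0, 0.5, 4.0)
cam_a = ((0.0, 0.0, 0.0), 0.0)   # (position, yaw)
cam_b = ((1.0, 0.0, 0.0), 0.2)
pix_a = project(point, *cam_a)
pix_b = project(point, *cam_b)

# Content that respects the 3D geometry has (near) zero error ...
consistent = reprojection_error(pix_a, 4.0, cam_a, pix_b, cam_b)
# ... while content that drifted 10 px in view B does not.
drifted = reprojection_error(pix_a, 4.0, cam_a,
                             (pix_b[0] + 10.0, pix_b[1]), cam_b)
```

A model that treats the world as unconstrained 2D pixel changes has no reason to keep this error small across viewpoints; a model constrained by multi-view consistency does.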

Performance

Efficiency is the Key to Spatial Intelligence

Recent large language models (LLMs) and video diffusion models (VDMs) are trained on trillions of tokens spanning vast amounts of human knowledge and visual experience. Yet even at this scale, they remain fundamentally limited in spatial understanding: the ability to perceive and reason about the structure and arrangement of the physical world.

In contrast, spatial intelligence is remarkably efficient in biological systems. Even animals with very small brains naturally possess spatial intelligence, suggesting that spatial reasoning does not inherently require massive computational resources.

"InSpatio-WorldFM is built on the premise that spatial intelligence should be efficient and widely deployable. Our goal is to enable real-time spatial reasoning on consumer GPUs, bringing spatial intelligence from data centers to edge devices. In practice, InSpatio-WorldFM can achieve interactive frame rates on a single RTX 4090."

We achieve this through a frame-based architecture, combined with model distillation and inference optimization. We believe real-time interaction is central to spatial intelligence. Reducing computational requirements and model size is therefore a critical step toward making World Models practical, scalable, and deployable across a wide range of real-world systems.

High-Efficiency Low-Latency Real-Time Interaction

Memory Architecture

Explicit Anchors vs. Implicit Memory: Spatial Memory in Spatial Intelligence

A defining property of spatial intelligence is persistent memory. If a robot forgets the layout of a warehouse or the location of objects the moment it turns its head, true autonomy is impossible.

In InSpatio-WorldFM, we implement spatial memory through a hybrid design that combines explicit anchors with implicit neural memory. We use feedforward reconstruction as explicit spatial anchors, while reference frames serve as implicit memory within the neural model.

Explicit 3D Anchoring

We argue that explicit 3D anchoring provides a critical inductive structure for spatial intelligence. Rather than relying solely on implicit neural representations, explicit anchors enable stable, efficient, and scalable spatial reasoning across viewpoints and over time.
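The hybrid memory described above can be pictured with a small data-structure sketch. This is an illustrative assumption about how such a memory might be organized, not the released implementation: the class names `ReferenceFrame` and `SpatialMemory`, the distance-based selection, and the `max_range` and `k` parameters are all hypothetical. The explicit part is a persistent set of 3D anchor points; the implicit part is a pool of past reference frames from which the ones closest to the current camera are picked as conditioning.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ReferenceFrame:
    position: tuple   # camera position (x, y, z) when the frame was produced
    image_id: str     # stand-in for the actual image or latent


@dataclass
class SpatialMemory:
    anchors: list = field(default_factory=list)      # explicit: 3D points
    references: list = field(default_factory=list)   # implicit: past frames

    def add_anchor_points(self, points):
        """Store reconstructed 3D points; they persist for the whole session."""
        self.anchors.extend(points)

    def add_reference(self, frame):
        """Remember a frame so later views can be conditioned on it."""
        self.references.append(frame)

    def conditioning(self, cam_pos, k=2, max_range=10.0):
        """Gather what a frame generated at cam_pos would be conditioned on:
        anchor points within max_range and the k closest reference frames."""
        nearby = [p for p in self.anchors
                  if math.dist(p, cam_pos) <= max_range]
        closest = sorted(self.references,
                         key=lambda f: math.dist(f.position, cam_pos))[:k]
        return {"anchors": nearby, "references": closest}


memory = SpatialMemory()
memory.add_anchor_points([(0.0, 0.0, 1.0), (0.0, 0.0, 50.0)])
memory.add_reference(ReferenceFrame((0.0, 0.0, 0.0), "f0"))
memory.add_reference(ReferenceFrame((5.0, 0.0, 0.0), "f1"))
memory.add_reference(ReferenceFrame((1.0, 0.0, 0.0), "f2"))

cond = memory.conditioning((0.5, 0.0, 0.0))
```

Because the anchors are stored independently of any particular viewpoint, looking away from a region and returning to it later retrieves the same geometry, which is the property the warehouse example above depends on.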

Ultra-Long Inference Without Content Degradation

Applications

Infinite Possibilities Enabled by Spatial Intelligence

📷 From 2D Images to 3D Worlds: Real-Time Interactive Exploration

With just a single photo, you can immerse yourself and explore places you have never visited.

✏️ Consistent Spatial Editing

With just a textual description, you can generate customized environments with your desired style.

🎮 Game-Style Interaction

With no installation required, you can instantly enter and experience worlds in any game style.

Game World 1

Game World 2

🤖 Embodied Intelligence

Without expensive data collection, embodied intelligence gains access to infinitely generated, customizable spatial environments.

Room Scene

Factory Environment 1

Factory Environment 2

Future

Looking Ahead

InSpatio-WorldFM introduces an efficient real-time spatial intelligence model that runs on modest computational resources, and we are releasing its technical details to support open research and development.

At the same time, we are extending the model to support real-time interaction with dynamic worlds. These capabilities will be made available in the near future.

"We believe efficient spatial intelligence will become a foundational capability across generative models, embodied agents, and real-world systems. If you are excited about this future, we invite you to build it with us."