Netflix’s VOID: Fixing the Physics Problem in Video Editing

There’s a familiar trick in AI video editing: take an object out, fill in some new background, and call it good. That works for simple things, but it all falls apart the moment the object starts to move or interact with anything.

Take a domino chain, for example. Remove a few tiles in the middle and most AI models will still convincingly show the rest of the dominoes falling over. Visually it’s a-ok, but from a physics standpoint it’s just plain wrong.

Netflix’s new model, VOID (Video Object and Interaction Deletion), tackles that exact flaw. The goal is simple to state but tough to pull off: remove an object from a video and have everything else in the scene behave as if the object never existed.


The Problem with “Good Enough” Video Editing

Most video editing models today are really good at appearance.

They can:

  • remove an object
  • reconstruct the background
  • clean up shadows or reflections

But they struggle with interactions:

  • objects that collide
  • objects that support other objects
  • anything involving motion or cause-and-effect

Current models often generate scenes that look correct frame by frame but are physically implausible once you consider how the scene should evolve over time.


VOID’s Approach: Counterfactual Video

VOID reframes the task.

Instead of asking:

“What pixels should go here?”

It asks:

“What would have happened if this object didn’t exist?”

The model then generates a new video based on that scenario.

Formally, the system takes:

  • a video
  • a mask identifying the object to remove

and produces a counterfactual video where both the object and its downstream effects are gone.

That includes things like:

  • removing a person holding an object → the object falls
  • removing a blocker → a collision never happens
  • removing a force source → motion stops or changes

The key detail is that VOID doesn’t just erase the object. It regenerates the scene dynamics that depended on it.


How It Works

Under the hood, VOID is a mix of familiar components assembled in a very specific way.

Diffusion backbone

  • Built on CogVideoX, a video diffusion transformer
  • Initialized from prior work on layered video editing (Generative Omnimatte)

Training on counterfactual data

The model is trained on paired videos:

  • original scene with object
  • re-simulated scene without it

These are generated using:

  • physics simulation (Kubric)
  • human-object interaction data (HUMOTO)
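
To make the idea of a paired example concrete, here is a toy sketch built on the domino chain from the introduction. It is not how Kubric or HUMOTO produce data: the one-dimensional "simulator", the `CounterfactualPair` class, and the `reach` parameter are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterfactualPair:
    """One toy training example: the same scene with and without some dominoes."""
    original: list        # fallen state of each remaining domino, nothing removed
    counterfactual: list  # fallen state of each remaining domino, re-simulated


def simulate_chain(positions, reach):
    """Push the first domino; a fallen domino topples the next one if it is within reach."""
    fallen = [False] * len(positions)
    if not positions:
        return fallen
    fallen[0] = True
    for i in range(1, len(positions)):
        gap = positions[i] - positions[i - 1]
        fallen[i] = fallen[i - 1] and gap <= reach
    return fallen


def make_pair(positions, removed, reach):
    """Simulate the chain twice: once in full, once with the removed dominoes gone."""
    kept = [i for i in range(len(positions)) if i not in removed]
    full = simulate_chain(positions, reach)
    original = [full[i] for i in kept]
    # Re-simulate from scratch; the push goes to the first remaining domino.
    counterfactual = simulate_chain([positions[i] for i in kept], reach)
    return CounterfactualPair(original, counterfactual)


pair = make_pair([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], removed={2, 3}, reach=1.5)
print(pair.original)        # every remaining domino fell in the original scene
print(pair.counterfactual)  # the chain stops at the gap left by the removed tiles
```

The point of the pair is the difference between the two lists: an inpainting-only edit would keep the first outcome, while the re-simulated one is what the edited video should show.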

Quadmask conditioning

Instead of a simple mask, VOID uses a four-region mask:

  • object to remove
  • areas affected by removal
  • overlap regions
  • untouched regions

This gives the model explicit guidance about:

  • what must change
  • what must stay stable

It’s a small design choice that ends up doing a lot of work.
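
As a rough illustration, a four-region mask can be derived from two ordinary binary masks: one for the object and one for the areas its removal affects. The sketch below works on a single frame stored as nested lists. The `Region` names and label values are assumptions for this example, not VOID’s actual encoding, and a real pipeline would use array operations over whole videos.

```python
from enum import IntEnum


class Region(IntEnum):
    # Label values are illustrative assumptions, not VOID's actual encoding.
    KEEP = 0      # untouched: must stay stable
    REMOVE = 1    # the object to delete
    AFFECTED = 2  # content that changes once the object is gone
    OVERLAP = 3   # covered by both the object and an affected area


def label_pixel(is_object, is_affected):
    """Map the two binary flags of one pixel to one of the four regions."""
    if is_object and is_affected:
        return Region.OVERLAP
    if is_object:
        return Region.REMOVE
    if is_affected:
        return Region.AFFECTED
    return Region.KEEP


def build_quadmask(object_mask, affected_mask):
    """Merge two same-sized boolean frames (lists of rows) into one four-label frame."""
    if len(object_mask) != len(affected_mask):
        raise ValueError("masks must have the same number of rows")
    quadmask = []
    for obj_row, aff_row in zip(object_mask, affected_mask):
        if len(obj_row) != len(aff_row):
            raise ValueError("mask rows must have the same length")
        quadmask.append([label_pixel(o, a) for o, a in zip(obj_row, aff_row)])
    return quadmask


object_mask = [[True, True, False, False]]
affected_mask = [[False, True, True, False]]
print([int(label) for label in build_quadmask(object_mask, affected_mask)[0]])
```

Keeping the overlap as its own label means the model never has to guess whether a pixel under the object should be filled with static background or with something that moves.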


VLM-guided reasoning

VOID brings in a vision-language model to:

  • identify which parts of the scene are affected
  • expand the mask beyond the obvious object

That’s how it figures out things like:

  • which objects depend on the removed one
  • where motion changes will happen
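
One way to picture the mask expansion step is sketched below, under heavy assumptions. The `ask_vlm` callable is a hypothetical stand-in for a vision-language model query that returns the names of objects depending on the removed one, and `scene_masks` is assumed to come from a separate segmentation step. The sketch only unions the current masks of those objects; covering where their motion will change, as VOID has to, is not attempted here.

```python
def expand_affected_region(scene_masks, removed, ask_vlm):
    """Union the masks of every object the (hypothetical) VLM says is affected.

    scene_masks: dict mapping object name -> boolean frame (list of rows)
    removed:     name of the object being deleted
    ask_vlm:     callable (removed, candidate_names) -> list of affected names
    """
    reference = scene_masks[removed]
    affected = [[False] * len(row) for row in reference]
    candidates = [name for name in scene_masks if name != removed]

    for name in ask_vlm(removed, candidates):
        if name not in candidates:
            continue  # ignore names the VLM made up or the removed object itself
        for y, row in enumerate(scene_masks[name]):
            for x, value in enumerate(row):
                affected[y][x] = affected[y][x] or value
    return affected


scene_masks = {
    "person": [[True, False, False]],
    "cup": [[False, True, False]],
    "table": [[False, False, True]],
}


def fake_vlm(removed, candidates):
    # Stub answer: the cup is held by the person; "ghost" is a hallucinated name.
    return ["cup", "ghost"]


print(expand_affected_region(scene_masks, "person", fake_vlm))
```

Filtering the answer against known object names is a cheap guard: a language model can return labels that do not correspond to any segmented object.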

Two-pass generation

The system runs in two stages:

Pass 1

  • predicts the new motion and scene evolution

Pass 2 (optional)

  • fixes artifacts like deformation
  • uses motion-aligned noise to stabilize results

The second pass only kicks in when the model expects significant motion changes.
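
The control flow of the two stages can be sketched in a few lines. Everything here is a placeholder: `first_pass`, `second_pass`, and `expects_big_motion_change` are hypothetical callables standing in for the real models, and passing the first-pass result into the second pass is an assumption about how the refinement gets access to the predicted motion.

```python
def remove_object(video, quadmask, first_pass, second_pass, expects_big_motion_change):
    """Always run pass 1; run pass 2 only when large motion changes are expected."""
    draft = first_pass(video, quadmask)
    if not expects_big_motion_change(video, quadmask):
        return draft
    # Assumption: pass 2 receives the draft so it can follow the predicted motion.
    return second_pass(video, quadmask, draft)


# Usage with trivial stand-ins that just tag a string.
def demo_first_pass(video, quadmask):
    return video + "+pass1"


def demo_second_pass(video, quadmask, draft):
    return draft + "+pass2"


print(remove_object("clip", "mask", demo_first_pass, demo_second_pass, lambda v, m: False))
print(remove_object("clip", "mask", demo_first_pass, demo_second_pass, lambda v, m: True))
```

Making the second stage conditional keeps the cost of a simple removal close to a single generation run.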


What It Gets Right

The improvements show up where previous models fall apart.

  • domino chains stop when the middle is removed
  • objects fall when support disappears
  • collisions don’t happen if the obstacle is gone
  • reflections and shadows disappear correctly

The model also generalizes beyond training cases:

  • a balloon floats up when the person holding it is removed
  • a blender doesn’t turn on if the person activating it is gone

That’s not perfect physics simulation, but it’s a meaningful step toward causal consistency.


Practical Considerations: License, Deployment & Requirements

VOID is released under the Apache 2.0 license, which lets you use it commercially, tinker with the code, and pass it on to others with very little red tape. In practice, that takes most license headaches out of a production deployment.

You can run the model on your own machine. The repository has everything you need, including a full inference pipeline and the model weights, so this isn’t an API-only release and you can deploy it on your own hardware.

But let’s be real, the hardware requirements aren’t exactly trivial. Inference needs a GPU with around 40GB of VRAM, so we’re talking A100 or H100 territory here. You won’t be able to run this comfortably on a standard, consumer-grade PC. Training is even more demanding, but most teams aren’t going to be doing that anyway.
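
If you want a quick sanity check before downloading anything, a small script can compare the local GPU against that figure. This is a minimal sketch: the 40 GB threshold is the approximate number quoted above rather than an exact requirement, the helper names are made up for this example, and the check assumes PyTorch with CUDA support is what you would run the model with.

```python
REQUIRED_GB = 40  # approximate figure quoted above, not an exact requirement


def has_enough_vram(total_bytes, required_gb=REQUIRED_GB):
    """Compare a device's total memory in bytes against a threshold in decimal GB."""
    return total_bytes >= required_gb * 10**9


def check_local_gpu():
    """Describe whether the first CUDA device looks large enough."""
    try:
        import torch  # optional dependency, only needed for the live check
    except ImportError:
        return "PyTorch is not installed, so the GPU could not be inspected."
    if not torch.cuda.is_available():
        return "No CUDA GPU detected."
    props = torch.cuda.get_device_properties(0)
    verdict = "looks sufficient" if has_enough_vram(props.total_memory) else "is likely too small"
    return f"{props.name}: {props.total_memory / 10**9:.1f} GB, {verdict}"


if __name__ == "__main__":
    print(check_local_gpu())
```

The threshold uses decimal gigabytes on purpose, since a card sold as 40GB reports slightly less than 40 GiB of usable memory.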

And let’s not forget the pipeline itself is a fair bit more complicated than your standard video editing tool. It needs extra components like segmentation models and a vision-language model to generate those fancy interaction-aware masks. That means a lot more moving parts, and potentially external dependencies too – unless you’re happy to swap those bits out yourself.

If you’re working in an environment that already has:

  • dedicated GPU hardware
  • an existing video processing pipeline
  • some engineering know-how to wrap your head around multi-step systems

then VOID will likely fit right in for you.

But if you’re a smaller team or want to use this for a lighter-weight project, the setup overhead and hardware requirements might end up being a showstopper.


Does It Work?

Mostly yes.

In a human preference study:

  • VOID was chosen 64.8% of the time
  • the next best model (Runway) got 18.4%, so VOID was picked roughly 3.5 times as often

The biggest gains show up in:

  • interaction correctness
  • physical plausibility

Which is exactly what the model is designed to improve.


Where It Still Falls Short

Limitations:

  • struggles with unusual camera angles
  • limited video length – only a few seconds
  • resolution could be better
  • depends heavily on synthetic training data

There’s also a broader limitation that isn’t unique to VOID:

  • it approximates physics rather than simulating it

That works well enough for many cases, but edge scenarios will still break.


Why This Really Matters

VOID points to a shift in video models: they’re moving on from just trying to look the part.

The focus used to be on making individual video frames look realistic. Now the wall we’re hitting is getting these models to behave consistently over time.

Those are two different problems.

When a model has a good grasp of things like:

  • how objects support each other
  • how objects move
  • cause and effect in a scene

that opens up a whole new set of possibilities:

  • video editing is suddenly a lot more reliable
  • generation of simulation-like content gets a huge boost
  • you can count on your tools to behave as expected in a production environment

And to be clear, VOID isn’t some all-purpose AI system that does it all.

It’s all about helping out with:

  • video editing
  • VFX workflows – it’s especially good at really tricky effects
  • cleaning up messy content
  • doing some fundamental research into video generation

It’s not about:

  • building a chat interface that sounds human
  • automating enterprise workflows – that’s way beyond its scope
  • general reasoning tasks – nope, it’s not that kind of AI

The Bottom Line

Here’s the takeaway: VOID is focused on a pretty narrow problem, but it’s a real one.

Most models can easily make a scene look clean and tidy, but not many of them can actually make it behave as it should after you edit it.

And that becomes super obvious when you start throwing objects around and getting them to interact with each other.

VOID isn’t solving the physics problem in video generation just yet, but it is nudging us in the right direction.

And that’s where the next big gains in video model performance are probably going to come from.

Source: https://arxiv.org/pdf/2604.02296


Building with generative video or simulation-heavy systems?

If you’re wrestling with video models, scene editing, or anything that needs to behave like real life – you’ve probably already butted heads with the limitations of what’s currently available.

Getting something to look good is the easy part. It’s when you try to get it to act like it’s supposed to that things start to get really tough.

At RisingStack, we help teams move beyond the demo phase and turn their prototypes into systems that can actually handle the real deal. That includes:

  • Taming generative models to get them to play nice in production
  • Wrangling consistency, state, and interaction logic so it all makes sense across every frame
  • Building pipelines that work with the latest and greatest AI tech
  • Bridging that huge gap between research-grade models and actually building something people can use

If you’re exploring this space or just trying to figure out what actually works, we’re here to help you get unstuck.

👉 Drop us a line at RisingStack and let’s build something that doesn’t just look good, but behaves like it too.
