There’s a familiar trick in AI video editing: take an object out, fill in some new background, and call it good. That works for simple shots, but it falls apart the moment the object moves or interacts with anything.
Take a domino chain, for example. Remove a few tiles from the middle and most AI models will still show the rest of the dominoes falling over. Visually it’s a-ok, but from a physics standpoint it’s just plain wrong.
Netflix’s new model, VOID (Video Object and Interaction Deletion), tackles that exact flaw. The goal is straightforward, but boy is it tough to actually pull off: remove an object from a video and have everything else in the scene behave as if the object never existed.
The Problem with “Good Enough” Video Editing

Most video editing models today are really good at appearance.
They can:
- remove an object
- reconstruct the background
- clean up shadows or reflections
But they struggle with interactions:
- objects that collide
- objects that support other objects
- anything involving motion or cause-and-effect
Current models often generate scenes that look correct frame by frame but are physically implausible once you consider how the scene should evolve over time.
VOID’s Approach: Counterfactual Video
VOID reframes the task.
Instead of asking:
“What pixels should go here?”
It asks:
“What would have happened if this object didn’t exist?”
The model then generates a new video based on that scenario.
Formally, the system takes:
- a video
- a mask identifying the object to remove
and produces a counterfactual video where both the object and its downstream effects are gone.
That includes things like:
- removing a person holding an object → the object falls
- removing a blocker → a collision never happens
- removing a force source → motion stops or changes
The key detail is that VOID doesn’t just erase; it recomputes the scene dynamics.
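To make the difference between erasing and recomputing concrete, here is a toy sketch of a one-dimensional domino chain. It is not VOID’s code: `simulate_chain` and its `reach` parameter are names made up for this illustration, and the "physics" is just a gap check between neighbouring tiles.

```python
def simulate_chain(tiles, reach=1):
    """Return the positions of tiles that fall when the first tile is pushed.

    tiles: sorted positions of standing dominoes on a line.
    reach: how far a falling tile can knock over the next one (toy assumption).
    """
    if not tiles:
        return set()
    fallen = {tiles[0]}
    last = tiles[0]
    for pos in tiles[1:]:
        if pos - last > reach:
            break  # gap too wide: the chain reaction stops here
        fallen.add(pos)
        last = pos
    return fallen


original = list(range(10))            # tiles at positions 0..9
removed = {4, 5}                      # the tiles we edit out
factual = simulate_chain(original)    # what the source video shows

# "Erase only": hide the removed tiles but keep everyone else's original motion.
erase_only = factual - removed

# Counterfactual: re-run the dynamics on the scene without those tiles.
counterfactual = simulate_chain([t for t in original if t not in removed])

print(sorted(erase_only))       # [0, 1, 2, 3, 6, 7, 8, 9]
print(sorted(counterfactual))   # [0, 1, 2, 3]
```

In the erase-only result, tiles 6 to 9 still fall even though nothing pushes them. The counterfactual result stops at the gap, which is the behavior the article describes VOID aiming for.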
How It Works

Under the hood, VOID is a mix of familiar components assembled in a very specific way.
Diffusion backbone
- Built on CogVideoX, a video diffusion transformer
- Initialized from prior work on layered video editing (Generative Omnimatte)
Training on counterfactual data
The model is trained on paired videos:
- original scene with object
- re-simulated scene without it
These are generated using:
- physics simulation (Kubric)
- human-object interaction data (HUMOTO)
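As a rough picture of what one paired training example looks like, the sketch below builds a pair with a tiny hand-rolled simulator. It does not use Kubric or HUMOTO: `CounterfactualPair`, `roll_ball`, and `make_pair` are hypothetical names, and the scene is a single ball that either bounces off a wall or keeps rolling once the wall is removed.

```python
from dataclasses import dataclass


@dataclass
class CounterfactualPair:
    factual: list          # ball position per frame, with the wall present
    counterfactual: list   # ball position per frame, with the wall removed


def roll_ball(start, velocity, steps, wall=None):
    """Toy 1D physics: a ball moves at constant speed and bounces off a wall."""
    pos, vel, frames = start, velocity, []
    for _ in range(steps):
        nxt = pos + vel
        if wall is not None and vel > 0 and nxt >= wall:
            vel = -vel        # collision: reverse direction
            nxt = pos + vel
        pos = nxt
        frames.append(pos)
    return frames


def make_pair(start=0, velocity=1, steps=6, wall=3):
    """Simulate the same scene twice: once with the wall, once without it."""
    return CounterfactualPair(
        factual=roll_ball(start, velocity, steps, wall=wall),
        counterfactual=roll_ball(start, velocity, steps, wall=None),
    )


pair = make_pair()
print(pair.factual)          # [1, 2, 1, 0, -1, -2]
print(pair.counterfactual)   # [1, 2, 3, 4, 5, 6]
```

The point of the pairing is that the two sequences only diverge from the frame where the removed object would have mattered, so a model trained on such pairs sees exactly which downstream motion should change.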
Quadmask conditioning
Instead of a simple mask, VOID uses a four-region mask:
- object to remove
- areas affected by removal
- overlap regions
- untouched regions
This gives the model explicit guidance about:
- what must change
- what must stay stable
It’s a small design choice that ends up doing a lot of work.
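One way to picture a four-region mask is as a combination of two binary masks: one for the object and one for the area its removal affects. The sketch below assumes that "overlap" means pixels covered by both masks. The `Region` labels and their integer values are invented for this example and are not VOID’s actual encoding.

```python
from enum import IntEnum


class Region(IntEnum):
    # Label values are arbitrary choices for this sketch, not VOID's encoding.
    UNTOUCHED = 0
    OBJECT = 1
    AFFECTED = 2
    OVERLAP = 3


def build_quadmask(object_mask, affected_mask):
    """Combine two binary masks (lists of rows of 0/1) into a four-region mask."""
    quad = []
    for obj_row, aff_row in zip(object_mask, affected_mask):
        row = []
        for obj, aff in zip(obj_row, aff_row):
            if obj and aff:
                row.append(Region.OVERLAP)
            elif obj:
                row.append(Region.OBJECT)
            elif aff:
                row.append(Region.AFFECTED)
            else:
                row.append(Region.UNTOUCHED)
        quad.append(row)
    return quad


object_mask   = [[1, 1, 0, 0],
                 [1, 1, 0, 0]]
affected_mask = [[0, 1, 1, 0],
                 [0, 1, 1, 0]]

quadmask = build_quadmask(object_mask, affected_mask)
print([[int(v) for v in row] for row in quadmask])
# [[1, 3, 2, 0], [1, 3, 2, 0]]
```

A real pipeline would do this per frame on image-sized arrays. Plain lists are used here only to keep the example dependency-free.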
VLM-guided reasoning
VOID brings in a vision-language model to:
- identify which parts of the scene are affected
- expand the mask beyond the obvious object
That’s how it figures out things like:
- which objects depend on the removed one
- where motion changes will happen
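The "which objects depend on the removed one" step can be thought of as walking a dependency graph. The sketch below is an assumption about how that reasoning could be represented, not a description of VOID’s internals: the relations are hand-written here, where a real pipeline would get them from the vision-language model, and `affected_objects` is a hypothetical helper.

```python
from collections import deque


def affected_objects(removed, depends_on):
    """Return every object whose behavior changes once `removed` is gone.

    depends_on maps an object to the objects that rely on it (held,
    supported, or pushed by it). Hand-written here; a VLM would supply
    these relations in a real pipeline.
    """
    affected, queue = set(), deque([removed])
    while queue:
        current = queue.popleft()
        for dependent in depends_on.get(current, []):
            if dependent not in affected and dependent != removed:
                affected.add(dependent)
                queue.append(dependent)
    return affected


relations = {
    "person": ["cup"],      # the person holds the cup
    "cup": ["coffee"],      # the cup contains the coffee
    "table": ["laptop"],    # unrelated to the person
}

print(sorted(affected_objects("person", relations)))   # ['coffee', 'cup']
```

The mask would then be expanded to cover the pixels of every object in the returned set, which is how the affected region can grow beyond the obvious object.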
Two-pass generation
The system runs in two stages:
Pass 1
- predicts the new motion and scene evolution
Pass 2 (optional)
- fixes artifacts like deformation
- uses motion-aligned noise to stabilize results
The second pass only kicks in when the model expects significant motion changes.
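The control flow of the two stages can be sketched as below. Everything here is a placeholder: the article does not say how the system decides that motion changes are "significant", so the `motion_change` score and `threshold` are assumptions, and the two pass functions are stubs standing in for the actual diffusion model calls.

```python
def remove_object(video, quadmask, first_pass, second_pass,
                  motion_change, threshold=0.5):
    """Run pass 1 always, and pass 2 only when large motion changes are expected."""
    result = first_pass(video, quadmask)
    if motion_change(video, quadmask) >= threshold:
        result = second_pass(result, quadmask)
    return result


# Stub passes that just record which stages ran.
def fake_first_pass(video, quadmask):
    return video + ["pass1"]


def fake_second_pass(video, quadmask):
    return video + ["pass2"]


static_scene = remove_object(
    ["frames"], None, fake_first_pass, fake_second_pass, lambda v, m: 0.1
)
dynamic_scene = remove_object(
    ["frames"], None, fake_first_pass, fake_second_pass, lambda v, m: 0.9
)

print(static_scene)    # ['frames', 'pass1']
print(dynamic_scene)   # ['frames', 'pass1', 'pass2']
```

Keeping the second pass conditional means scenes where little changes pay for only one generation run.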
What It Gets Right
The improvements show up where previous models fall apart.
- domino chains stop when the middle is removed
- objects fall when support disappears
- collisions don’t happen if the obstacle is gone
- reflections and shadows disappear correctly
The model also generalizes beyond training cases:
- a balloon floats up when the person holding it is removed
- a blender doesn’t turn on if the person activating it is gone
That’s not perfect physics simulation, but it’s a meaningful step toward causal consistency.
Practical Considerations: License, Deployment & Requirements
VOID is released under the Apache 2.0 license, which permits commercial use, modification, and redistribution with very little red tape. In practice, that means no major license headaches when you take it to production.
You can run the model on your own hardware: the repository has everything you need, including a full inference pipeline and the model weights, so this isn’t an API-only release.
But let’s be real, the hardware requirements aren’t exactly trivial. Inference needs a GPU with around 40GB of VRAM, so we’re talking A100 or H100 territory here. You won’t be able to run this comfortably on a standard consumer-grade PC. Training is even more demanding, but most teams won’t need to do that anyway.
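If you want a quick sanity check before downloading anything, a small script along these lines can tell you whether a machine has a GPU in the right range. It is a sketch under assumptions: `REQUIRED_MIB` is a rough stand-in for "around 40GB" rather than an official figure, the helper names are made up, and the sample output string is invented. The `nvidia-smi` query flags are the standard ones for reading total memory in MiB.

```python
import subprocess

REQUIRED_MIB = 40_000  # rough stand-in for "around 40GB"; not an official figure


def parse_vram_mib(nvidia_smi_output):
    """Parse one memory.total value (in MiB) per GPU from nvidia-smi CSV output."""
    return [int(line) for line in nvidia_smi_output.splitlines() if line.strip()]


def gpus_with_enough_vram(nvidia_smi_output, required_mib=REQUIRED_MIB):
    """Return the indices of GPUs whose total memory meets the requirement."""
    totals = parse_vram_mib(nvidia_smi_output)
    return [i for i, mib in enumerate(totals) if mib >= required_mib]


def query_nvidia_smi():
    """Ask nvidia-smi for total memory per GPU (requires an NVIDIA driver)."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout


sample = "24576\n81920\n"   # made-up output: one 24 GiB GPU and one 80 GiB GPU
print(gpus_with_enough_vram(sample))   # [1]
```

On a real machine you would pass `query_nvidia_smi()` instead of `sample`. Reported totals can sit slightly below the marketing number, so treat the threshold as a guide.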
And let’s not forget the pipeline itself is a fair bit more complicated than your standard video editing tool. It needs extra components like segmentation models and a vision-language model to generate those fancy interaction-aware masks. That means a lot more moving parts, and potentially external dependencies too – unless you’re happy to swap those bits out yourself.
If you’re working in an environment that already has:
- dedicated GPU hardware
- an existing video processing pipeline
- some engineering know-how to wrap your head around multi-step systems
then VOID will likely fit right in for you.
But if you’re a smaller team or want to use this for a lighter-weight project, the setup overhead and hardware requirements might end up being a showstopper.
Does It Work?
Mostly yes.
In a human preference study:
- VOID was chosen 64.8% of the time
- the next best model (Runway) got 18.4%
The biggest gains show up in:
- interaction correctness
- physical plausibility
Which is exactly what the model is designed to improve.
Where It Still Falls Short
Limitations:
- struggles with unusual camera angles
- limited video length – only a few seconds
- resolution could be better
- depends heavily on synthetic training data
There’s also a broader limitation that isn’t unique to VOID:
- it approximates physics rather than simulating it
That works well enough for many cases, but edge scenarios will still break.
Why This Really Matters
VOID marks a big turning point for video models – they’re moving on from just trying to look the part.
We used to focus on:
- making video frames look realistic
Now we’re hitting a wall because:
- getting these models to behave consistently over time is a real pain
It’s like we’re solving two different problems here.
When a model has a good grasp of things like:
- how objects support each other
- how objects move around smoothly
- cause and effect in a scene
that opens up a whole new world of possibilities:
- video editing is suddenly a lot more reliable
- generation of simulation-like content gets a huge boost
- you can count on your tools to behave as expected in a production environment
And to be clear, VOID isn’t some all-purpose AI system that does it all.
It’s all about helping out with:
- video editing
- VFX workflows – it’s especially good at really tricky effects
- cleaning up messy content
- doing some fundamental research into video generation
It’s not about:
- building a chat interface that sounds human
- automating enterprise workflows – that’s way beyond its scope
- general reasoning tasks – nope, it’s not that kind of AI
The Bottom Line
Here’s the takeaway: VOID is focused on a pretty narrow problem, but it’s a real one.
Most models can easily make a scene look clean and tidy, but not many of them can actually make it behave as it should after you edit it.
And that becomes super obvious when you start throwing objects around and getting them to interact with each other.
VOID isn’t solving the physics problem in video generation just yet, but it is nudging us in the right direction.
And that’s where the next big gains in video model performance are probably going to come from.
Source: https://arxiv.org/pdf/2604.02296
Building with generative video or simulation-heavy systems?
If you’re wrestling with video models, scene editing, or anything that needs to behave like real life – you’ve probably already butted heads with the limitations of what’s currently available.
Getting something to look good is the easy part. It’s when you try to get it to act like it’s supposed to that things start to get really tough.
At RisingStack, we help teams move beyond the demo phase and turn their prototypes into systems that can actually handle the real deal. That includes:
- Taming generative models to get them to play nice in production
- Wrangling consistency, state, and interaction logic so it all makes sense across every frame
- Building pipelines that work with the latest and greatest AI tech
- Bridging that huge gap between research-grade models and actually building something people can use
If you’re exploring this space or just trying to figure out what actually works, we’re here to help you get unstuck.


