Netflix’s VOID: Fixing the Physics Problem in Video Editing

Last updated: April 13, 2026

There’s a familiar trick in modern AI video editing: take an object out, slap some new background in, and call it good. It works for simple scenes, but it falls apart the moment the object moves or interacts with anything.

Take a domino chain, for example: remove a few tiles in the middle, and most AI models will still convincingly show the rest of the dominoes falling over. Visually it’s fine, but from a physics standpoint it’s just plain wrong.

Netflix’s new model VOID (Video Object and Interaction Deletion) tackles that exact flaw. The goal is simple to state but tough to pull off: remove an object from a video and have everything else in the scene behave as if the object never existed.

The Problem with “Good Enough” Video Editing

Most video editing models today are really good at appearance.

They can:

remove an object
reconstruct the background
clean up shadows or reflections

But they struggle with interactions:

objects that collide
objects that support other objects
anything involving motion or cause-and-effect

Current models often generate scenes that look correct frame-by-frame but are physically implausible when you consider how the scene should evolve over time.

VOID’s Approach: Counterfactual Video

VOID reframes the task.

Instead of asking:

“What pixels should go here?”

It asks:

“What would have happened if this object didn’t exist?”

The model then generates a new video based on that scenario.

Formally, the system takes:

a video
a mask identifying the object to remove

and produces a counterfactual video where both the object and its downstream effects are gone.

That includes things like:

removing a person holding an object → the object falls
removing a blocker → a collision never happens
removing a force source → motion stops or changes

The key detail is that VOID doesn’t just erase the object; it regenerates the scene dynamics that depended on it.
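The gap between erasing and re-simulating is easy to see on a toy domino chain. The sketch below is not VOID’s method; it is a hypothetical pure-Python simulation, and the names `simulate_chain`, `erase_only`, `counterfactual`, and the `reach` parameter are all made up for illustration.

```python
def simulate_chain(positions, reach=1):
    """Return the set of tile positions that fall once the first tile is pushed.

    A falling tile knocks over the next one only if the gap is at most `reach`.
    """
    standing = sorted(positions)
    if not standing:
        return set()
    fallen = {standing[0]}
    for prev, cur in zip(standing, standing[1:]):
        if cur - prev <= reach:
            fallen.add(cur)
        else:
            break  # the chain is interrupted; nothing further falls
    return fallen


def erase_only(positions, removed, reach=1):
    """Appearance-only edit: keep the original outcome, just hide the removed tiles."""
    return simulate_chain(positions, reach) - set(removed)


def counterfactual(positions, removed, reach=1):
    """Re-run the dynamics as if the removed tiles had never been there."""
    removed_set = set(removed)
    remaining = [p for p in positions if p not in removed_set]
    return simulate_chain(remaining, reach)


chain = list(range(10))
removed = [4, 5]
print(sorted(erase_only(chain, removed)))      # [0, 1, 2, 3, 6, 7, 8, 9]
print(sorted(counterfactual(chain, removed)))  # [0, 1, 2, 3]
```

The erase-only result still shows tiles 6 through 9 falling even though nothing could have hit them, which is the kind of output the article describes as visually fine but physically wrong.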

How It Works

Under the hood, VOID is a mix of familiar components assembled in a very specific way.

Diffusion backbone

Built on CogVideoX, a video diffusion transformer
Initialized from prior work on layered video editing (Generative Omnimatte)

Training on counterfactual data

The model is trained on paired videos:

original scene with object
re-simulated scene without it

These are generated using:

physics simulation (Kubric)
human-object interaction data (HUMOTO)

Quadmask conditioning

Instead of a simple mask, VOID uses a four-region mask:

object to remove
areas affected by removal
overlap regions
untouched regions

This gives the model explicit guidance about:

what must change
what must stay stable

It’s a small design choice that ends up doing a lot of work.
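As a rough illustration of a four-region mask, here is a minimal sketch that merges two binary masks for a single frame into one label map. The label values, the `Region` enum, and `build_quadmask` are assumptions made for this example, as is the reading of “overlap” as pixels covered by both the object and the affected area; the actual encoding VOID uses may differ.

```python
from enum import IntEnum


class Region(IntEnum):
    # Illustrative label values, not VOID's actual encoding.
    UNTOUCHED = 0
    OBJECT = 1
    AFFECTED = 2
    OVERLAP = 3


def build_quadmask(object_mask, affected_mask):
    """Merge two same-sized binary masks (lists of rows) into a four-label mask."""
    if len(object_mask) != len(affected_mask):
        raise ValueError("masks must have the same height")
    quad = []
    for obj_row, aff_row in zip(object_mask, affected_mask):
        if len(obj_row) != len(aff_row):
            raise ValueError("masks must have the same width")
        row = []
        for obj, aff in zip(obj_row, aff_row):
            if obj and aff:
                row.append(Region.OVERLAP)
            elif obj:
                row.append(Region.OBJECT)
            elif aff:
                row.append(Region.AFFECTED)
            else:
                row.append(Region.UNTOUCHED)
        quad.append(row)
    return quad


object_mask = [[1, 1, 0, 0],
               [1, 1, 0, 0]]
affected_mask = [[0, 1, 1, 0],
                 [0, 0, 1, 0]]
quad = build_quadmask(object_mask, affected_mask)
print([[int(v) for v in row] for row in quad])  # [[1, 3, 2, 0], [1, 1, 2, 0]]
```

A single label map like this tells a model both where content must change (labels 1 to 3) and where it must stay put (label 0).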

VLM-guided reasoning

VOID brings in a vision-language model to:

identify which parts of the scene are affected
expand the mask beyond the obvious object

That’s how it figures out things like:

which objects depend on the removed one
where motion changes will happen
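To make the mask-expansion idea concrete, here is a small sketch under a strong assumption: that the vision-language model’s output can be reduced to a list of bounding boxes around dependent objects. The article does not say how VOID represents this, so `expand_mask` and the `(top, left, bottom, right)` box format are hypothetical.

```python
def expand_mask(object_mask, affected_boxes):
    """Return a binary mask covering the object plus every flagged box.

    `affected_boxes` holds (top, left, bottom, right) tuples, end-exclusive,
    standing in for regions a vision-language model might flag as affected.
    Boxes are clipped to the frame.
    """
    height = len(object_mask)
    width = len(object_mask[0]) if height else 0
    expanded = [list(row) for row in object_mask]  # copy, leave the input intact
    for top, left, bottom, right in affected_boxes:
        for y in range(max(top, 0), min(bottom, height)):
            for x in range(max(left, 0), min(right, width)):
                expanded[y][x] = 1
    return expanded


# A person in column 0 holding a balloon at row 0, columns 2-3.
person = [[1, 0, 0, 0],
          [1, 0, 0, 0]]
balloon_box = (0, 2, 1, 4)
print(expand_mask(person, [balloon_box]))  # [[1, 0, 1, 1], [1, 0, 0, 0]]
```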

Two-pass generation

The system runs in two stages:

Pass 1

predicts the new motion and scene evolution

Pass 2 (optional)

fixes artifacts like deformation
uses motion-aligned noise to stabilize results

The second pass only kicks in when the model expects significant motion changes.
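The control flow of the two passes can be sketched as below. Everything here is a stand-in: `first_pass`, `second_pass`, and `needs_refinement` are hypothetical callables rather than VOID’s API, and the “fraction of affected pixels” heuristic is only one guess at how a system might decide that significant motion changes are expected.

```python
def remove_object(video, mask, first_pass, second_pass, needs_refinement):
    """Run pass 1, then pass 2 only when `needs_refinement(mask)` is true.

    Returns the output video and the number of passes that ran.
    """
    draft = first_pass(video, mask)
    if not needs_refinement(mask):
        return draft, 1
    refined = second_pass(video, mask, draft)
    return refined, 2


def affected_fraction(mask, affected_labels=(2, 3)):
    """Share of pixels whose label marks them as affected (labels are illustrative)."""
    total = sum(len(row) for row in mask)
    hits = sum(1 for row in mask for value in row if value in affected_labels)
    return hits / total if total else 0.0


# Stub "models" that just tag each frame so the flow is visible.
def stub_first(video, mask):
    return [f"p1:{frame}" for frame in video]


def stub_second(video, mask, draft):
    return [f"p2:{frame}" for frame in draft]


def mostly_affected(mask):
    return affected_fraction(mask) > 0.25


print(remove_object(["f0", "f1"], [[1, 2, 2, 0]], stub_first, stub_second, mostly_affected))
# (['p2:p1:f0', 'p2:p1:f1'], 2)
print(remove_object(["f0", "f1"], [[1, 0, 0, 0]], stub_first, stub_second, mostly_affected))
# (['p1:f0', 'p1:f1'], 1)
```

Keeping the second pass behind a predicate means simple removals pay for only one generation run.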

What It Gets Right

The improvements show up where previous models fall apart.

domino chains stop when the middle is removed
objects fall when support disappears
collisions don’t happen if the obstacle is gone
reflections and shadows disappear correctly

The model also generalizes beyond training cases:

a balloon floats up when the person holding it is removed
a blender doesn’t turn on if the person activating it is gone

That’s not perfect physics simulation, but it’s a meaningful step toward causal consistency.

Practical Considerations: License, Deployment & Requirements

VOID is released under the Apache 2.0 license, which allows commercial use, modification, and redistribution with few restrictions. In practice, licensing is unlikely to be what holds up a production deployment.

You can run the model locally. The repository includes a full inference pipeline and the model weights, so this isn’t an API-only release, and you can deploy it on your own hardware.

The hardware requirements aren’t trivial, though. Inference needs a GPU with around 40GB of VRAM, which means A100 or H100 territory. You won’t run this comfortably on a standard consumer-grade PC. Training is even more demanding, but most teams won’t need to train the model themselves.

The pipeline is also more complicated than a standard video editing tool. It needs extra components, such as segmentation models and a vision-language model, to generate the interaction-aware masks. That means more moving parts and potentially external dependencies, unless you’re prepared to swap those components out yourself.

If you’re working in an environment that already has:

dedicated GPU hardware
an existing video processing pipeline
some engineering know-how to wrap your head around multi-step systems

then VOID will likely fit right in for you.

But if you’re a smaller team or working on a lighter-weight project, the setup overhead and hardware requirements may be a showstopper.

Does It Work?

Mostly yes.

In a human preference study:

VOID was chosen 64.8% of the time
the next best model (Runway) got 18.4%

The biggest gains show up in:

interaction correctness
physical plausibility

Which is exactly what the model is designed to improve.
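To put the two preference figures quoted above side by side, here is a quick bit of arithmetic. The ratio and the gap follow directly from the quoted numbers; the “remaining share” line additionally assumes the percentages across all compared options sum to 100, which the article does not state.

```python
void_share = 64.8        # % of comparisons where VOID was preferred
runner_up_share = 18.4   # % for the next best model (Runway)

ratio = void_share / runner_up_share
gap_points = void_share - runner_up_share
# Assumption: shares across all options add up to 100%.
remaining_share = 100 - void_share - runner_up_share

print(f"{ratio:.2f}x")           # 3.52x
print(f"{gap_points:.1f}")       # 46.4
print(f"{remaining_share:.1f}")  # 16.8
```

So VOID was picked roughly three and a half times as often as the runner-up.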

Where It Still Falls Short

Limitations:

struggles with unusual camera angles
limited video length – only a few seconds
resolution could be better
depends heavily on synthetic training data

There’s also a broader limitation that isn’t unique to VOID:

it approximates physics rather than simulating it

That works well enough for many cases, but edge scenarios will still break.

Why This Really Matters

VOID points to a shift in what video models are expected to do: looking right is no longer enough.

The focus used to be:

making video frames look realistic

The harder problem now is:

getting models to behave consistently over time

These are two different problems.

When a model has a good grasp of things like:

support
motion
cause and effect in a scene

that opens up new possibilities:

video editing becomes more reliable
generating simulation-like content gets easier
tools behave more predictably in a production environment

And to be clear, VOID isn’t some all-purpose AI system that does it all.

It’s aimed at:

video editing
VFX workflows, especially effects that involve object interactions
cleaning up messy content
fundamental research into video generation

It’s not about:

building a chat interface that sounds human
automating enterprise workflows – that’s way beyond its scope
general reasoning tasks – nope, it’s not that kind of AI

The Bottom Line

Here’s the takeaway: VOID focuses on a narrow problem, but it’s a real one.

Most models can make a scene look clean and tidy, but few can make it behave as it should after you edit it.

That becomes obvious as soon as objects start moving and interacting with each other.

VOID isn’t solving the physics problem in video generation just yet, but it is nudging us in the right direction.

And that’s where the next big gains in video model performance are probably going to come from.

Source: https://arxiv.org/pdf/2604.02296

Building with generative video or simulation-heavy systems?

If you’re wrestling with video models, scene editing, or anything that needs to behave like real life – you’ve probably already butted heads with the limitations of what’s currently available.

Getting something to look good is the easy part. It’s when you try to get it to act like it’s supposed to that things start to get really tough.

At RisingStack, we help teams move beyond the demo phase and turn prototypes into systems that can handle real workloads. That includes:

Getting generative models to behave reliably in production
Keeping consistency, state, and interaction logic coherent across every frame
Building pipelines that work with current AI tooling
Bridging the gap between research-grade models and products people can use

If you’re exploring this space or just trying to figure out what actually works, we’re here to help you get unstuck.

👉 Drop us a line at RisingStack and let’s build something that doesn’t just look good, but behaves like it too.
