Prompt Relay: Inference-Time Temporal-Semantic Control via Cross-Attention Routing


S-Lab, Nanyang Technological University

Temporal Cross-Attention Routing

A non-parametric method to support granular control over the temporal placement of each text prompt in video generation.

Temporal cross-attention teaser

We enforce temporal locality by injecting a distance-based penalty term, C, into the cross-attention mechanism:

\[
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} - C(Q, K)\right) V
\]
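One way to read this formula: subtracting C from the scaled logits multiplies each unnormalized attention weight by exp(-C) before normalization. The sketch below illustrates this for a single query row in plain Python. It assumes the scaled scores QKᵀ/√d are already computed, and the name `penalized_attention_row` is hypothetical, not taken from the authors' code.

```python
import math

def penalized_attention_row(scores, penalties):
    """Softmax over (scores - penalties) for one query row.

    scores:    scaled dot products q.k / sqrt(d), one per key (assumed given)
    penalties: non-negative penalty C(i, j), one per key
    """
    logits = [s - c for s, c in zip(scores, penalties)]
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative values only: three keys, the third heavily penalized.
scores = [1.0, 2.0, 3.0]
plain = penalized_attention_row(scores, [0.0, 0.0, 0.0])
routed = penalized_attention_row(scores, [0.0, 0.0, 50.0])
```

In `routed`, the third key receives almost no weight, while the relative weighting of the two unpenalized keys is unchanged.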

We penalize how strongly query i attends to key j according to how far the query's frame lies from the midpoint of the segment that the key's prompt is assigned to. Given a prompt P assigned to segment s of the video, with midpoint m_s and prompt keys K_s, the penalty matrix for segment s is:

\[
C(i, j) = \mathbb{1}[\, j \in K_s \,] \cdot \frac{\mathrm{softplus}\big(|f(i) - m_s| - w\big)^{2}}{2\sigma^{2}}
\]

Here, f(i) is the latent frame index of the i-th query token, w is a small window size proportional to the length of the segment, and σ is chosen so that attention weights decay to a negligible factor ε at the segment boundaries. This keeps each prompt from interfering with the segments of neighboring prompts.
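A minimal sketch of the penalty for one segment follows, in plain Python. The closed form for σ is an assumption, not stated on this page: it is obtained by requiring exp(-C) = ε when the query frame sits at the segment boundary, which gives σ = softplus(L/2 - w) / √(2 ln(1/ε)) for a segment of length L. The helper names and the example values (segment length, w, ε) are hypothetical.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigma_for_boundary(half_length, w, eps):
    # ASSUMPTION: sigma is chosen so that exp(-C) == eps at the segment boundary.
    return softplus(half_length - w) / math.sqrt(2.0 * math.log(1.0 / eps))

def penalty(frame, midpoint, w, sigma, key_in_segment=True):
    # C(i, j) for a query at latent frame f(i) = frame.
    # Keys outside K_s get no penalty from this segment's matrix.
    if not key_in_segment:
        return 0.0
    d = softplus(abs(frame - midpoint) - w)
    return d * d / (2.0 * sigma * sigma)

# Illustrative values only: a segment spanning latent frames 0..8.
midpoint, half_length = 4.0, 4.0
w, eps = 1.0, 1e-3
sigma = sigma_for_boundary(half_length, w, eps)

# Attenuation factor exp(-C) applied to the prompt's keys at each frame.
attenuation = [math.exp(-penalty(f, midpoint, w, sigma)) for f in range(9)]
```

Because softplus is smooth, the penalty is small but not exactly zero inside the window of width w, and it grows roughly quadratically once the query frame moves beyond it.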

Video Gallery

Scene Transitions

Scene transition 1
Flying Eagle → Cyberpunk City → 20th Century Living Room
Scene transition 2
Caveman → Spartan → Medieval Knight
Scene transition 3
Hong Kong → Grand Canyon

Event Composition

Event Composition 1
Pour Cereal → Pour Milk
Event Composition 2
Pan Right → Pan Left
Event Composition 3
Man Removes Glasses → Reverse Shot → Monkey Stops Eating