Prompt Relay

We enforce temporal locality by injecting a distance-based penalty term, C, into the cross-attention mechanism:

$$
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}} - C(Q, K)\right) V
$$
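
Because the penalty is subtracted inside the softmax, it acts as a multiplicative damping factor on the unnormalized attention weights. Writing z_{ij} for the scaled logit between query i and key j (a symbol introduced here only for illustration), the standard softmax identity gives:

$$
\mathrm{softmax}_j\bigl(z_{ij} - C(i, j)\bigr) = \frac{e^{z_{ij}} \, e^{-C(i, j)}}{\sum_{j'} e^{z_{ij'}} \, e^{-C(i, j')}}
$$

A key with C(i, j) = 0 is left untouched, while a large C(i, j) scales that key's weight by e^{-C(i, j)} before renormalization.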

We penalize the attention that query i pays to key j according to how far the query lies from the midpoint of the segment assigned to the key's prompt. Given a prompt P assigned to segment S of the video with midpoint m_s, and writing K_s for the set of key tokens belonging to P, the penalty matrix for segment S is:

$$
C(i, j) = \mathbb{1}[j \in K_s] \cdot \frac{\mathrm{softplus}\bigl(|f(i) - m_s| - w\bigr)^2}{2\sigma^2}
$$

Here, f(i) is the latent frame index of the ith query token, w is a small window size proportional to the length of the segment, and σ is derived so attention weights decay to a negligible factor ε at the segment boundaries. This keeps temporal prompts from interfering with neighboring prompts.
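
The penalty and the modified attention can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `(start, end)` segment representation, the `w_frac` and `eps` defaults, and in particular the closed form used for σ (chosen so that e^{-C} equals ε for a query exactly on the segment boundary) are all assumptions made for this example.

```python
import numpy as np


def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)).
    return np.logaddexp(0.0, x)


def sigma_for_segment(half_length, w, eps):
    # Assumption: pick sigma so that exp(-C) == eps for a query located
    # exactly on the segment boundary (half_length frames from the midpoint).
    boundary_dist = softplus(half_length - w)
    return boundary_dist / np.sqrt(2.0 * np.log(1.0 / eps))


def penalty_matrix(frame_idx, key_segment, segments, w_frac=0.1, eps=1e-3):
    """Build the penalty matrix C with shape (num_queries, num_keys).

    frame_idx:   f(i), latent frame index of each query token.
    key_segment: segment id of the prompt each key token belongs to.
                 Keys whose id matches no segment receive no penalty.
    segments:    list of (start_frame, end_frame) pairs (hypothetical layout).
    w_frac:      window size as a fraction of segment length (must be < 0.5).
    """
    frame_idx = np.asarray(frame_idx, dtype=float)
    key_segment = np.asarray(key_segment)
    C = np.zeros((frame_idx.shape[0], key_segment.shape[0]))
    for s, (start, end) in enumerate(segments):
        length = end - start
        m_s = 0.5 * (start + end)          # segment midpoint
        w = w_frac * length                # free window around the midpoint
        sigma = sigma_for_segment(0.5 * length, w, eps)
        dist = softplus(np.abs(frame_idx - m_s) - w)   # one value per query
        per_query = dist ** 2 / (2.0 * sigma ** 2)
        in_segment = (key_segment == s).astype(float)  # indicator 1[j in K_s]
        C += per_query[:, None] * in_segment[None, :]
    return C


def relay_attention(Q, K, V, C):
    # softmax(Q K^T / sqrt(d) - C) V, with the usual max-subtraction
    # for numerical stability.
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) - C
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

With this choice of σ, a query on a segment boundary sees that segment's keys damped by exactly `eps` before renormalization, and queries near the midpoint are penalized only slightly, because softplus is small but nonzero for negative arguments.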

Prompt Relay: Inference-Time Temporal-Semantic Control via Cross-Attention Routing

Temporal Cross-Attention Routing

Video Gallery

Scene Transitions

Event Composition

Full Prompt