Skip to content

Faster Pi0-FAST inference time; DCT early stopping trick implementation for FAST (Frequency-Space Action Sequence Tokenization)

Notifications You must be signed in to change notification settings

uynitsuj/FASTer

Repository files navigation

My implementation of a DCT (Discrete Cosine Transform) early stopping technique in FAST (Frequency-Space Action Sequence Tokenization) to decrease observation-to-action latency. The idea is to decode only the first few DCT coefficients from the PaLi-Gemma VLM rather than waiting for a full action chunk inference to complete. This can cut down inference to action time by a little under half since around the first 3-4 frequency coefficients of the DCT are typically enough for a coarse action sequence reconstruction.

Note: Action reconstruction accuracy is reduced but predictions can be improved/smoothed by implementing temporal action-chunk ensembling, introduced here.

FAST: Efficient Action Tokenization for Vision-Language-Action Models

This is a modified repo for the FAST action tokenizer.

The necessary changes to the pi0-fast model implementation can be found in a forked repo commit

The action tokenizer maps any sequence of robot actions into a sequence of dense, discrete action tokens for training autoregressive VLA models.

Intuition Behind DCT Early Stopping

The Discrete Cosine Transform (DCT) is a frequency-domain transformation technique that expresses a time-series sequence of data points as a weighted sum of cosine functions at different frequencies. In the context of robot action trajectories, we can leverage DCT's energy compaction property to represent complex action sequences with just a few coefficients.

Why DCT for Robot Actions?

Robot action trajectories often contain smooth, continuous movements across multiple dimensions. When transformed into the frequency domain using DCT most of the signal's energy gets concentrated in the lower frequency components and the DCT naturally captures the temporal correlation of action sequences at high control frequencies.

The DCT Early Stopping Technique

The key insight for early stopping is that we don't need to wait for all DCT coefficients to be generated by the model before beginning robot execution. Since the lower frequency components contain most of the action information, we can:

  1. Generate only the first few DCT coefficients
  2. Reconstruct a coarse but usable approximation of the action prediction
  3. Begin executing the action, and optionally, let the model continue generating tokens to resolve higher-frequency details

This approach cuts down inference-to-action time by roughly half, enabling more responsive robot control using pi0-fast.

Visual Demonstration

Full DCT Action Sequence Reconstruction 0-4th DCT Harmonics Reconstruction

This shows a side by side comparison of reconstructions from full DCT coefficients (left) compared to DCT early stopping (right).

fullDCT

Figure 1: Full DCT Reconstruction This shows action trajectories across multiple dimensions reconstructed using all DCT coefficients (decoded from real tokens generated by a finetuned PI0-FAST model as an example). These are complete, detailed action sequences with all fine-grained movements preserved.

zeroDCT

Figure 2: 0th Harmonic Only (DC Signal) This shows reconstructions using only the 0th DCT coefficient (DC component). This essentially captures the mean value of each action dimension - which is why we see flat lines. While this provides the general position for each dimension, it lacks any temporal dynamics.

zerofourDCT

Figure 3: Early Stopping (0th through 4th Harmonics) This demonstrates reconstruction using the 0th through 4th harmonics. Notice how it approximates the overall shape and critical movements from Figure 1, despite using less than half of the coefficients. The major action components - like the significant drops in Dimensions 1, 3, 5, and 9 - are clearly captured.

Caveat: Fewer DCT harmonics translates to a slightly larger action-sequence reconstruction error. Action accuracy is ever-so-slightly reduced.

About

Faster Pi0-FAST inference time; DCT early stopping trick implementation for FAST (Frequency-Space Action Sequence Tokenization)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages