[Bug] VLM TFLOPS calculation is inaccurate: uses config seq_length instead of actual FA sequence length, and omits ViT encoder FLOPs #3498

@SophusDavid

Description

Problem

The current TFLOPS calculation in flop_utils.py has two significant issues when training Vision-Language Models (e.g., Qwen3.5-VL):

Issue 1: LLM FLOPS uses configured seq_length instead of actual FA sequence length

During VLM SFT training, preprocess_packed_seqs / pack_or_pad_batch_sequences dynamically pads sequences to the actual content length (rounded up to divisible_by), not to cfg.model.seq_length. For example, with cfg.model.seq_length=4096, the actual FA sequence length observed is typically ~1280 (e.g., 950 vision tokens after spatial merge + ~330 text tokens).
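For reference, the padding behavior described above can be sketched like this (the helper name and the divisible_by value are illustrative only, not the actual pack_or_pad_batch_sequences implementation):

```python
def padded_fa_seq_len(content_len: int, divisible_by: int) -> int:
    """Round the actual content length up to a multiple of `divisible_by`."""
    return ((content_len + divisible_by - 1) // divisible_by) * divisible_by

# 950 vision tokens (after spatial merge) + ~330 text tokens = 1280 actual tokens,
# far below the configured cfg.model.seq_length of 4096.
print(padded_fa_seq_len(950 + 330, 64))  # -> 1280
```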

However, num_floating_point_operations() unconditionally uses cfg.model.seq_length for all computations, including:

  • The outer multiplier: batch_size * cfg.model.seq_length
  • Attention core FLOPs (which scale quadratically with sequence length)

This causes the reported TFLOPS to be significantly overestimated (potentially 3-10x higher than reality depending on the actual-to-configured length ratio).
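To quantify the overestimate for the numbers above (a rough sketch assuming the standard dense-transformer cost model, not the exact flop_utils.py code): terms carrying the outer batch_size * seq_length multiplier scale linearly with sequence length, while the attention core scales quadratically.

```python
cfg_seq, actual_seq = 4096, 1280

# Outer multiplier (batch_size * seq_length) scales with s.
linear_overestimate = cfg_seq / actual_seq

# Attention-core FLOPs scale with s^2.
quadratic_overestimate = (cfg_seq / actual_seq) ** 2

print(f"linear terms overestimated by {linear_overestimate:.1f}x")       # 3.2x
print(f"attention core overestimated by {quadratic_overestimate:.2f}x")  # 10.24x
```

This matches the 3-10x range: the blended overestimate lands between the linear and quadratic factors depending on how much of the model's compute is attention.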

Issue 2: ViT encoder FLOPs are completely missing

The FLOPS calculation only accounts for the LLM (language model) portion. For VLM models, the Vision Transformer (ViT) encoder contributes a non-trivial fraction of total compute:

  • ViT transformer layers with bidirectional full attention (not causal → 4hs² per attention core vs 2hs² for causal)
  • GELU MLP layers
  • Patch Merger (spatial merge 2×2 + 2-layer MLP projection from ViT hidden dim to LLM hidden dim)

For Qwen3.5-VL-9B with a typical 864×1152 input image (3888 patches), ViT accounts for ~15-18% of total training FLOPS — this is entirely omitted.
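A hedged sketch of what the missing ViT term could look like (all dimensions below are placeholders for illustration, not Qwen3.5-VL's published config; forward-pass FLOPs only, with the usual 3x factor for forward + backward):

```python
def vit_encoder_flops(num_patches, hidden, num_layers, llm_hidden,
                      mlp_ratio=4, merge=2):
    """Per-image forward FLOPs for a ViT encoder (rough, illustrative sketch)."""
    s, h = num_patches, hidden
    per_layer = (
        8 * s * h * h                 # QKV + output projections: 4 matmuls, 2*s*h*h each
        + 4 * s * s * h               # full bidirectional attention core (vs 2*s^2*h causal)
        + 4 * mlp_ratio * s * h * h   # GELU MLP: two matmuls, 2*s*h*(mlp_ratio*h) each
    )
    merged = s // (merge * merge)     # token count after 2x2 spatial merge
    d_in = merge * merge * h          # concatenated 2x2 patch features
    # 2-layer MLP projection from merged ViT features to LLM hidden dim (rough)
    merger = 2 * merged * d_in * d_in + 2 * merged * d_in * llm_hidden
    return num_layers * per_layer + merger

# Training FLOPs ~= 3x forward. Dimensions here are made up for illustration.
fwd = vit_encoder_flops(num_patches=3888, hidden=1280, num_layers=32, llm_hidden=4096)
print(f"~{3 * fwd / 1e12:.1f} TFLOPs per image (fwd + bwd)")
```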

Minimal repro

1. Train any Qwen VL model (e.g., Qwen3.5-VL-9B) with SFT
2. Set cfg.model.seq_length = 4096
3. Observe that the actual FA sequence length is ~1280 (visible via FA debug logs)
4. The reported TFLOPS will be ~3x higher than expected because the calculation assumes seq_length=4096

Expected behavior

  • TFLOPS calculation should use the actual sequence length that FA processes, not the configured maximum
  • ViT encoder FLOPS should be included in the total TFLOPS metric for VLM models
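The expected behavior could take roughly this shape (llm_flops_fn and vit_flops_fn are hypothetical placeholders standing in for the per-sequence LLM formula and the ViT formula, not real APIs in flop_utils.py):

```python
def total_tflops_per_iter(llm_flops_fn, vit_flops_fn, actual_seq_lens, images_per_batch):
    """Sum LLM FLOPs over the *actual* packed sequence lengths (not
    cfg.model.seq_length) and add the ViT encoder contribution per image."""
    llm = sum(llm_flops_fn(s) for s in actual_seq_lens)
    vit = images_per_batch * vit_flops_fn()
    return (llm + vit) / 1e12

# Toy usage with dummy cost functions:
print(total_tflops_per_iter(lambda s: s * 1e9, lambda: 2e12, [1280, 1280], 2))
```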

Affected area

area:perf

Regression?

No

Environment

Run on H20 with Qwen3.5-9B.
H20 peak FP16 throughput is ~150 TFLOPS, yet the log reports GPU utilization: 242.1 MODEL_TFLOP/s/GPU, which is above the hardware peak.

Logs

Step Time : 3.81s GPU utilization: 242.1MODEL_TFLOP/s/GPU
 [2026-04-21 02:40:03] iteration        3/      60 | consumed samples:           96 | elapsed time per iteration (ms): 3813.3 | learning rate: 7.500000E-08 | global batch size:    32 | lm loss: 1.315767E+01 | mtp_1 loss: 1.332609E+01 | loss scale: 1.0 | grad norm: 221.443 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 4.31s GPU utilization: 214.4MODEL_TFLOP/s/GPU
