Problem
The current TFLOPS calculation in flop_utils.py has two significant issues when training Vision-Language Models (e.g., Qwen3.5-VL):
Issue 1: LLM FLOPs calculation uses the configured seq_length instead of the actual FA sequence length
During VLM SFT training, preprocess_packed_seqs / pack_or_pad_batch_sequences dynamically pads sequences to the actual content length (rounded up to divisible_by), not to cfg.model.seq_length. For example, with cfg.model.seq_length=4096, the actual FA sequence length observed is typically ~1280 (e.g., 950 vision tokens after spatial merge + ~330 text tokens).
However, num_floating_point_operations() unconditionally uses cfg.model.seq_length for all computations, including:
- The outer multiplier: batch_size * cfg.model.seq_length
- Attention core FLOPs (which scale quadratically with sequence length)
This causes the reported TFLOPS to be significantly overestimated (potentially 3-10x higher than reality depending on the actual-to-configured length ratio).
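As a reference point, here is a minimal sketch of how a per-iteration LLM FLOP estimate scales with sequence length. The function name (estimate_llm_flops), the simplified dense-transformer formula, and the model dimensions are illustrative assumptions, not the actual num_floating_point_operations() implementation from flop_utils.py:

```python
# Simplified dense-transformer FLOP estimate; illustrative only.
def estimate_llm_flops(batch_size, seq_len, hidden, num_layers, vocab_size):
    tokens = batch_size * seq_len
    # Weight GEMMs (attention projections + MLP), fwd + bwd: ~6 FLOPs per param per token
    gemm = 6 * tokens * num_layers * 12 * hidden * hidden
    # Causal attention core: scales with seq_len^2, so using the configured 4096
    # instead of the actual ~1280 inflates this term quadratically
    attn_core = 6 * tokens * num_layers * 2 * seq_len * hidden
    # Output logits projection
    logits = 6 * tokens * hidden * vocab_size
    return gemm + attn_core + logits

# Illustrative Qwen-scale dimensions (hidden=4096, 36 layers, vocab=152064):
configured = estimate_llm_flops(1, 4096, 4096, 36, 152064)
actual = estimate_llm_flops(1, 1280, 4096, 36, 152064)
print(f"overestimation: {configured / actual:.1f}x")  # ~3.5x with these numbers
```

With these made-up dimensions the configured-vs-actual ratio comes out around 3.5x, consistent with the ~3x overestimation described above.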
Issue 2: ViT encoder FLOPs are completely missing
The FLOPS calculation only accounts for the LLM (language model) portion. For VLM models, the Vision Transformer (ViT) encoder contributes a non-trivial fraction of total compute:
- ViT transformer layers with bidirectional full attention (not causal → 4hs² per attention core vs 2hs² for causal)
- GELU MLP layers
- Patch Merger (spatial merge 2×2 + 2-layer MLP projection from ViT hidden dim to LLM hidden dim)
For Qwen3.5-VL-9B with a typical 864×1152 input image (3888 patches), ViT accounts for ~15-18% of total training FLOPS — this is entirely omitted.
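To make the missing terms concrete, below is a hedged sketch of a ViT-side FLOP estimate covering the three components listed above. The ViT depth, hidden size, merge factor, and per-term constants are assumptions for illustration, not values read from the Qwen3.5-VL config:

```python
# Rough ViT encoder + patch-merger FLOP estimate; all dims are illustrative.
def estimate_vit_flops(num_patches, vit_hidden, vit_layers, llm_hidden,
                       mlp_ratio=4, spatial_merge=2):
    s, h = num_patches, vit_hidden
    attn_proj = 2 * s * 4 * h * h          # QKV + output projections
    attn_core = 4 * s * s * h              # full bidirectional attention (QK^T + AV, no causal halving)
    mlp = 2 * 2 * s * h * (mlp_ratio * h)  # GELU MLP up/down projections
    merged_tokens = s // (spatial_merge ** 2)          # 2x2 spatial merge
    merge_dim = h * spatial_merge ** 2
    merger = 2 * merged_tokens * (merge_dim * merge_dim + merge_dim * llm_hidden)  # 2-layer MLP projection
    forward = vit_layers * (attn_proj + attn_core + mlp) + merger
    return 3 * forward                     # training: backward ~= 2x forward

# e.g. the 3888-patch image above, with assumed ViT dims (hidden=1280, 32 layers):
print(f"{estimate_vit_flops(3888, 1280, 32, 4096) / 1e12:.1f} TFLOPs per image")
```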
Minimal repro
1. Train any Qwen VL model (e.g., Qwen3.5-VL-9B) with SFT
2. Set cfg.model.seq_length = 4096
3. Observe that the actual FA sequence length is ~1280 (visible via FA debug logs)
4. The reported TFLOPS will be ~3x higher than expected because the calculation assumes seq_length=4096
Expected behavior
- TFLOPS calculation should use the actual sequence length that FA processes, not the configured maximum
- ViT encoder FLOPS should be included in the total TFLOPS metric for VLM models
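A minimal sketch of what the reported metric should combine once both fixes land, assuming LLM and ViT FLOP estimates like the sketches above; the numbers are made up for illustration only:

```python
# Expected combined metric: (actual-length LLM FLOPs + ViT FLOPs) per step,
# normalized by step time and GPU count. Inputs below are purely illustrative.
def tflops_per_gpu(llm_flops, vit_flops, step_time_s, num_gpus):
    return (llm_flops + vit_flops) / (step_time_s * num_gpus) / 1e12

print(f"{tflops_per_gpu(3.3e15, 0.6e15, 3.81, 8):.1f} TFLOP/s/GPU")  # ~128 with these made-up inputs
```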
Affected area
area:perf
Regression?
No
Environment
Run on H20 with Qwen3.5-9B.
H20 peak FP16 throughput is roughly 150 TFLOPS, yet the log reports GPU utilization: 242.1 MODEL_TFLOP/s/GPU, i.e., above the hardware peak.
Logs
Step Time : 3.81s GPU utilization: 242.1MODEL_TFLOP/s/GPU
[2026-04-21 02:40:03] iteration 3/ 60 | consumed samples: 96 | elapsed time per iteration (ms): 3813.3 | learning rate: 7.500000E-08 | global batch size: 32 | lm loss: 1.315767E+01 | mtp_1 loss: 1.332609E+01 | loss scale: 1.0 | grad norm: 221.443 | number of skipped iterations: 0 | number of nan iterations: 0 |
Step Time : 4.31s GPU utilization: 214.4MODEL_TFLOP/s/GPU