[Bug] VLM TFLOPS calculation is inaccurate: uses config seq_length instead of actual FA sequence length, and omits ViT encoder FLOPs #3498

@SophusDavid

Description

Problem

The current TFLOPS calculation in flop_utils.py has two significant issues when training Vision-Language Models (e.g., Qwen3.5-VL):

Issue 1: LLM FLOPS uses configured seq_length instead of actual FA sequence length

During VLM SFT training, preprocess_packed_seqs / pack_or_pad_batch_sequences dynamically pads sequences to the actual content length (rounded up to divisible_by), not to cfg.model.seq_length. For example, with cfg.model.seq_length=4096, the actual FA sequence length observed is typically ~1280 (e.g., 950 vision tokens after spatial merge + ~330 text tokens).
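For reference, the padding behavior described above can be sketched like this (the helper name and the divisible_by value are illustrative only, not the actual pack_or_pad_batch_sequences implementation):

```python
def padded_fa_seq_len(content_len: int, divisible_by: int) -> int:
    """Round the actual content length up to a multiple of `divisible_by`."""
    return ((content_len + divisible_by - 1) // divisible_by) * divisible_by

# 950 vision tokens (after spatial merge) + ~330 text tokens = 1280 actual tokens,
# far below the configured cfg.model.seq_length of 4096.
print(padded_fa_seq_len(950 + 330, 64))  # -> 1280
```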

However, num_floating_point_operations() unconditionally uses cfg.model.seq_length for all computations, including:

  • The outer multiplier: batch_size * cfg.model.seq_length
  • Attention core FLOPs (which scale quadratically with sequence length)

This causes the reported TFLOPS to be significantly overestimated (potentially 3-10x higher than reality depending on the actual-to-configured length ratio).
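To quantify the overestimate for the numbers above (a rough sketch assuming the standard dense-transformer cost model, not the exact flop_utils.py code): terms carrying the outer batch_size * seq_length multiplier scale linearly with sequence length, while the attention core scales quadratically.

```python
cfg_seq, actual_seq = 4096, 1280

# Outer multiplier (batch_size * seq_length) scales with s.
linear_overestimate = cfg_seq / actual_seq

# Attention-core FLOPs scale with s^2.
quadratic_overestimate = (cfg_seq / actual_seq) ** 2

print(f"linear terms overestimated by {linear_overestimate:.1f}x")       # 3.2x
print(f"attention core overestimated by {quadratic_overestimate:.2f}x")  # 10.24x
```

This matches the 3-10x range: the blended overestimate lands between the linear and quadratic factors depending on how much of the model's compute is attention.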

Issue 2: ViT encoder FLOPs are completely missing

The FLOPS calculation only accounts for the LLM (language model) portion. For VLM models, the Vision Transformer (ViT) encoder contributes a non-trivial fraction of total compute:

  • ViT transformer layers with bidirectional full attention (not causal → 4hs² per attention core vs 2hs² for causal)
  • GELU MLP layers
  • Patch Merger (spatial merge 2×2 + 2-layer MLP projection from ViT hidden dim to LLM hidden dim)

For Qwen3.5-VL-9B with a typical 864×1152 input image (3888 patches), ViT accounts for ~15-18% of total training FLOPS — this is entirely omitted.
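A hedged sketch of what the missing ViT term could look like (all dimensions below are placeholders for illustration, not Qwen3.5-VL's published config; forward-pass FLOPs only, with the usual 3x factor for forward + backward):

```python
def vit_encoder_flops(num_patches, hidden, num_layers, llm_hidden,
                      mlp_ratio=4, merge=2):
    """Per-image forward FLOPs for a ViT encoder (rough, illustrative sketch)."""
    s, h = num_patches, hidden
    per_layer = (
        8 * s * h * h                 # QKV + output projections: 4 matmuls, 2*s*h*h each
        + 4 * s * s * h               # full bidirectional attention core (vs 2*s^2*h causal)
        + 4 * mlp_ratio * s * h * h   # GELU MLP: two matmuls, 2*s*h*(mlp_ratio*h) each
    )
    merged = s // (merge * merge)     # token count after 2x2 spatial merge
    d_in = merge * merge * h          # concatenated 2x2 patch features
    # 2-layer MLP projection from merged ViT features to LLM hidden dim (rough)
    merger = 2 * merged * d_in * d_in + 2 * merged * d_in * llm_hidden
    return num_layers * per_layer + merger

# Training FLOPs ~= 3x forward. Dimensions here are made up for illustration.
fwd = vit_encoder_flops(num_patches=3888, hidden=1280, num_layers=32, llm_hidden=4096)
print(f"~{3 * fwd / 1e12:.1f} TFLOPs per image (fwd + bwd)")
```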

Minimal repro

1. Train any Qwen VL model (e.g., Qwen3.5-VL-9B) with SFT
2. Set cfg.model.seq_length = 4096
3. Observe that the actual FA sequence length is ~1280 (visible via FA debug logs)
4. The reported TFLOPS will be ~3x higher than expected because the calculation assumes seq_length=4096

Expected behavior

  • TFLOPS calculation should use the actual sequence length that FA processes, not the configured maximum
  • ViT encoder FLOPS should be included in the total TFLOPS metric for VLM models
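The expected behavior could take roughly this shape (llm_flops_fn and vit_flops_fn are hypothetical placeholders standing in for the per-sequence LLM formula and the ViT formula, not real APIs in flop_utils.py):

```python
def total_tflops_per_iter(llm_flops_fn, vit_flops_fn, actual_seq_lens, images_per_batch):
    """Sum LLM FLOPs over the *actual* packed sequence lengths (not
    cfg.model.seq_length) and add the ViT encoder contribution per image."""
    llm = sum(llm_flops_fn(s) for s in actual_seq_lens)
    vit = images_per_batch * vit_flops_fn()
    return (llm + vit) / 1e12

# Toy usage with dummy cost functions:
print(total_tflops_per_iter(lambda s: s * 1e9, lambda: 2e12, [1280, 1280], 2))
```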

Affected area

area:perf

Regression?

No

Environment

Run on H20 with Qwen3.5-9B.
H20 peak FP16 throughput is ~150 TFLOPS, yet the log reports GPU utilization: 242.1 MODEL_TFLOP/s/GPU, which is above the hardware peak.

Logs

Step Time : 3.81s GPU utilization: 242.1MODEL_TFLOP/s/GPU
 [2026-04-21 02:40:03] iteration        3/      60 | consumed samples:           96 | elapsed time per iteration (ms): 3813.3 | learning rate: 7.500000E-08 | global batch size:    32 | lm loss: 1.315767E+01 | mtp_1 loss: 1.332609E+01 | loss scale: 1.0 | grad norm: 221.443 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 4.31s GPU utilization: 214.4MODEL_TFLOP/s/GPU
