User problem
Track, implement, and benchmark the optimized determinism features described in the DeepSeek-V4 paper:
- Attention backward: independent accumulation buffers, followed by a global deterministic summation
- MoE backward: token-order pre-processing within each rank, and buffer isolation across ranks
- mHC: write out each split part separately, then perform a deterministic reduction in a subsequent kernel
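All three items share one pattern: each split writes into its own buffer, and a later pass sums the buffers in a fixed order. The sketch below is a CPU-only toy in plain Python, not the actual kernels; the function names `atomic_style_reduce` and `deterministic_reduce` are hypothetical and chosen only for illustration. It shows why accumulating in arrival order can change the floating-point result, while isolated buffers plus a fixed-order reduction cannot.

```python
def atomic_style_reduce(partials, arrival_order):
    """Mimic atomic adds: accumulate partials in whatever order they arrive."""
    acc = 0.0
    for idx in arrival_order:
        acc += partials[idx]
    return acc


def deterministic_reduce(partials, arrival_order):
    """Each split writes to its own slot; a later pass sums slots in index order."""
    buffers = [0.0] * len(partials)
    for idx in arrival_order:
        # Write order varies between runs, but slots never collide.
        buffers[idx] = partials[idx]
    acc = 0.0
    for value in buffers:
        # Fixed summation order, independent of arrival order.
        acc += value
    return acc


# Partial gradients from four hypothetical splits.
partials = [1e16, 1.0, -1e16, 1.0]
order_a = [0, 1, 2, 3]
order_b = [0, 2, 1, 3]

# Arrival-order accumulation depends on scheduling.
print(atomic_style_reduce(partials, order_a), atomic_style_reduce(partials, order_b))

# Isolated buffers plus fixed-order summation do not.
print(deterministic_reduce(partials, order_a), deterministic_reduce(partials, order_b))
```

The trade-off is extra memory for the per-split buffers and one additional reduction pass, which is why the benchmarking part of this request matters.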
Desired outcome
DeepSeek-V4 training is deterministic, with all the optimized deterministic kernels in place.
Alternatives considered
No response
Affected area
area:model
Urgency / use case
Blocking current work
Extra context
No response