[determinism] [feature] DSV4 Determinism Kernel Level Optimization #3538

@ZhiyuLi-Nvidia

Description

User problem

Per the DeepSeek-V4 paper, track, implement, and benchmark the following kernel-level optimizations for deterministic training:

  • Attention backward: independent accumulation buffers, followed by a global deterministic summation
  • MoE backward: token-order pre-processing within each rank, and buffer isolation across ranks
  • mHC: write out each split part separately, then perform a deterministic reduction in a subsequent kernel
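The three items above share one shape: keep partial results in isolated buffers, then combine them in a fixed order. The sketch below is a pure-Python model of that idea, not the kernels themselves. All names (`racy_accumulate`, `split_then_reduce`, `accumulate_in_token_order`) are hypothetical, and reading "token order pre-processing" as "sort by token id before accumulating" is an assumption, not something this issue specifies.

```python
import struct


def bits(x: float) -> bytes:
    """Exact float64 bit pattern, so comparisons are bitwise rather than approximate."""
    return struct.pack("<d", x)


def racy_accumulate(contribs, arrival_order):
    """Models atomic adds into one shared buffer: the sum follows arrival order."""
    total = 0.0
    for i in arrival_order:
        total += contribs[i]
    return total


def split_then_reduce(contribs, num_splits, schedule):
    """Stage 1: each split sums its own chunk into its own slot.
    Stage 2: a separate pass adds the slots in fixed index order."""
    chunk = (len(contribs) + num_splits - 1) // num_splits
    partials = [0.0] * num_splits
    for s in schedule:  # splits may complete in any order
        acc = 0.0
        for x in contribs[s * chunk:(s + 1) * chunk]:  # fixed order inside a split
            acc += x
        partials[s] = acc  # isolated buffer: no other split writes here
    total = 0.0
    for p in partials:  # fixed reduction order, independent of `schedule`
        total += p
    return total


def accumulate_in_token_order(entries):
    """Sort (token_id, value) pairs before summing, so arrival order is irrelevant."""
    total = 0.0
    for _, value in sorted(entries):
        total += value
    return total


contribs = [1e17, 1.0, -1e17, 1.0]

# Shared buffer: two arrival orders give different float64 results.
print(racy_accumulate(contribs, [0, 1, 2, 3]), racy_accumulate(contribs, [0, 2, 1, 3]))

# Isolated buffers plus ordered reduction: identical bits for every schedule.
print(bits(split_then_reduce(contribs, 2, [0, 1])) == bits(split_then_reduce(contribs, 2, [1, 0])))

# Token-order pre-processing: reversed arrival order gives the same sum.
entries = list(enumerate(contribs))
print(accumulate_in_token_order(entries) == accumulate_in_token_order(entries[::-1]))
```

With these inputs the shared-buffer version yields 1.0 for one arrival order and 2.0 for the other, while the two-stage version returns the same bits for both schedules. Note that the fixed-order result is reproducible, not more accurate: determinism here only pins down the rounding sequence.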

Desired outcome

DeepSeek-V4 training is deterministic when run with all of the optimized deterministic kernels.
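One way to make this outcome testable is a bitwise comparison across repeated runs. The following is a minimal sketch under stated assumptions: `make_run` is a hypothetical stand-in for a training step that returns a list of floats (for example losses or gradient values), and nothing here is an existing Megatron-Bridge API.

```python
import hashlib
import struct


def fingerprint(values):
    """SHA-256 over the raw float64 bytes; equal digests mean bitwise-equal values."""
    h = hashlib.sha256()
    for v in values:
        h.update(struct.pack("<d", v))
    return h.hexdigest()


def is_bitwise_reproducible(run, repeats=3):
    """Call `run` several times and require identical fingerprints each time."""
    reference = fingerprint(run())
    return all(fingerprint(run()) == reference for _ in range(repeats - 1))


def make_run(orders):
    """Hypothetical stand-in for a training step: sums fixed values in the
    order dictated by `orders`, cycling through them on successive calls."""
    values = [1e17, 1.0, -1e17, 1.0]
    state = {"calls": 0}

    def run():
        order = orders[state["calls"] % len(orders)]
        state["calls"] += 1
        total = 0.0
        for i in order:
            total += values[i]
        return [total]

    return run


stable = make_run([[0, 1, 2, 3]])                  # same accumulation order every call
unstable = make_run([[0, 1, 2, 3], [0, 2, 1, 3]])  # order changes between calls
print(is_bitwise_reproducible(stable), is_bitwise_reproducible(unstable))
```

Comparing raw bytes rather than using a tolerance is deliberate: a tolerance-based check would pass runs that differ only in rounding, which is exactly the difference a determinism benchmark needs to catch.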

Alternatives considered

No response

Affected area

area:model

Urgency / use case

Blocking current work

Extra context

No response

Metadata

Labels

  • Determinism: To track the bugs/issues in deterministic training in Megatron-Bridge.
  • feature: New capabilities, enhancements, or enablement work
  • tracking: Tracking issue for an ongoing project with smaller steps

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
