[determinism] [feature] DSV4 Determinism Kernel Level Optimization #3538

### User problem

Per the [DeepSeek-V4 paper](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf), track, implement, and benchmark the following kernel-level optimizations for deterministic training:
* Attention backward: independent accumulation buffers, followed by a global deterministic summation
* MoE backward: token-order pre-processing within each rank, and buffer isolation across ranks
* mHC: write each split's output separately, then perform a deterministic reduction in a subsequent kernel
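
All three items rely on the same idea: floating-point addition is not associative, so the result depends on accumulation order unless that order is fixed. The following is a minimal CPU-side sketch of that idea in plain Python, not the actual kernels. The names `atomic_style_sum`, `buffered_deterministic_sum`, `sorted_token_sum`, and `PARTIALS` are hypothetical and chosen only for illustration; the real implementations would be GPU kernels with details defined by the paper.

```python
# Illustrative sketch only: models accumulation order on the CPU.
# All names here are hypothetical, not taken from any existing codebase.

# Partial results, e.g. one per thread block / split. The magnitudes are
# chosen so that summation order visibly changes the float64 result.
PARTIALS = [1e16, 1.0, -1e16, 1.0]


def atomic_style_sum(partials, arrival_order):
    """Add partials into one shared accumulator in arrival order.

    Models atomic adds: the order depends on scheduling, so the
    result can differ from run to run.
    """
    total = 0.0
    for split_id in arrival_order:
        total += partials[split_id]
    return total


def buffered_deterministic_sum(partials, arrival_order):
    """Write each partial to its own buffer slot, then reduce in index order.

    Arrival order only affects when a slot is written, never the order
    of the final summation, so the result is reproducible.
    """
    buffers = [0.0] * len(partials)
    for split_id in arrival_order:
        buffers[split_id] = partials[split_id]  # isolated slot, no shared adds
    total = 0.0
    for value in buffers:  # fixed reduction order: 0, 1, 2, ...
        total += value
    return total


def sorted_token_sum(contributions):
    """Sum (token_id, value) pairs after sorting by token_id.

    Models token-order pre-processing: the input may arrive in any
    order, but accumulation always follows ascending token_id.
    Assumes token ids are unique.
    """
    total = 0.0
    for _, value in sorted(contributions, key=lambda pair: pair[0]):
        total += value
    return total


if __name__ == "__main__":
    orders = [[0, 1, 2, 3], [0, 2, 1, 3], [3, 1, 0, 2]]
    for order in orders:
        print(order,
              atomic_style_sum(PARTIALS, order),
              buffered_deterministic_sum(PARTIALS, order))
```

The trade-off in this sketch is extra memory (one slot per split) and a second reduction pass in exchange for bitwise-reproducible results, which is why the issue asks for benchmarking alongside implementation.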


### Desired outcome

DeepSeek-V4 training is fully deterministic, using the optimized deterministic kernels listed above.

### Alternatives considered

_No response_

### Affected area

area:model

### Urgency / use case

Blocking current work

### Extra context

_No response_
