[Epic] NeMo-Automodel integration

This issue tracks the overall status of the integration.

During the bring-up of NeMo RL we duplicated some logic from NeMo Automodel so that we could open-source quickly. It's now time to converge and deduplicate, so that we rely on a single source of truth for everything related to DTensor and Automodel.

Here is a rough breakdown of the stages:

* Stage 1: support NeMo Automodel APIs in a separate policy worker (named something like `DTensorPolicyWorkerV2`)
* Stage 2: upstream to NeMo Automodel the parallelize plans and the changes specific to NeMo RL (e.g., checkpointing, CP, tied embeddings, sequence packing)
* Stage 3: test the NeMo Automodel path against `DTensorPolicyWorker` to ensure parity
* ---- mark `DTensorPolicyWorker` for deprecation ----
* Stage 4: sunset `DTensorPolicyWorker` and promote `DTensorPolicyWorkerV2` in its place
* Stage 5: use the backported 2.7.0 DCP checkpointing in NeMo RL
* Stage 6: integrate the cut cross-entropy / Liger kernels from Automodel
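
For Stage 3, a rough sketch of what a parity check could look like, assuming both workers can be made to return per-token log-probabilities for the same batch. The helper names (`max_abs_diff`, `assert_parity`) and the tolerance are hypothetical and not part of NeMo RL or NeMo Automodel; the real test would compare tensors produced by the two policy workers.

```python
import math
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError(
            f"length mismatch: {len(reference)} vs {len(candidate)}"
        )
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    # NaN compares false against everything, so check for it explicitly.
    if any(math.isnan(d) for d in diffs):
        raise AssertionError("NaN encountered while comparing outputs")
    return max(diffs, default=0.0)


def assert_parity(
    reference: Sequence[float],
    candidate: Sequence[float],
    atol: float = 1e-5,  # hypothetical tolerance; tune per dtype and model
) -> float:
    """Raise if the two outputs differ by more than `atol`; return the max diff."""
    diff = max_abs_diff(reference, candidate)
    if diff > atol:
        raise AssertionError(f"parity check failed: max abs diff {diff} > {atol}")
    return diff


if __name__ == "__main__":
    # Placeholder values standing in for log-probs from the two workers.
    old_worker_logprobs = [-0.50, -1.25, -2.00]
    new_worker_logprobs = [-0.50, -1.25, -2.00]
    print(assert_parity(old_worker_logprobs, new_worker_logprobs))
```

The same comparison could be repeated for loss values and gradient norms over a few training steps; which quantities and tolerances are appropriate is left open here.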

CC: @akoumpa

related: #224 


[Epic] NeMo-Automodel integration #578
