[Epic] NeMo-Automodel integration #578

@terrykong

This issue tracks the overall status of the NeMo Automodel integration.

During the bring-up of NeMo RL, we duplicated some logic from NeMo Automodel so that we could open-source quickly. It is now time to converge and deduplicate so that we rely on a single source of truth for everything related to DTensor and Automodel.

Here is a rough breakdown of the stages:

  • Stage 1: support NeMo Automodel APIs in a separate policy (named something like DTensorPolicyWorkerV2)
  • Stage 2: upstream parallelize plans and changes specific to NeMo RL (e.g., checkpointing, CP, tied-embedding, seq-packing)
  • Stage 3: test the NeMo Automodel-based worker against DTensorPolicyWorker to ensure parity
  • ---- mark DTensorPolicyWorker for deprecation -----
  • Stage 4: sunset DTensorPolicyWorker and promote the new worker (e.g., DTensorPolicyWorkerV2) in its place
  • Stage 5: use the backported 2.7.0 DCP checkpointing in NeMo RL
  • Stage 6: integrate the Cut Cross-Entropy/Liger kernels from Automodel
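
For the Stage 3 parity check, one possible shape is an element-wise comparison of per-token log-probabilities produced by the two workers on the same batch. The sketch below is only an illustration under that assumption: `max_abs_diff`, `check_parity`, the tolerance, and the sample values are all hypothetical and are not part of NeMo RL or NeMo Automodel.

```python
import math
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def check_parity(
    reference: Sequence[float],
    candidate: Sequence[float],
    atol: float = 1e-5,  # hypothetical tolerance; the real threshold is a project decision
) -> bool:
    """Return True when every candidate value is finite and within atol of the reference."""
    if not all(math.isfinite(x) for x in candidate):
        return False
    return max_abs_diff(reference, candidate) <= atol


if __name__ == "__main__":
    # Made-up values standing in for per-token log-probabilities from each worker.
    old_worker_logprobs = [-0.6931, -1.2040, -2.3026]
    new_worker_logprobs = [-0.6931, -1.2040, -2.3026]
    print("parity:", check_parity(old_worker_logprobs, new_worker_logprobs))
```

In practice the same comparison would likely run on tensors (losses, gradients, or log-probabilities) gathered from both workers, with the tolerance chosen per dtype.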

CC: @akoumpa

related: #224
