Merged · 27 commits
- aea57f7 · MergeGRPO to main · nv-mmanohara · Nov 3, 2025
- 59dd3a0 · MergeGRPO to main · nv-mmanohara · Nov 3, 2025
- 5c70988 · SFT update · nv-mmanohara · Nov 4, 2025
- 200a25b · Resolving comments · nv-mmanohara · Nov 9, 2025
- 894d9e9 · lint · yuki-97 · Nov 11, 2025
- a82eeef · refactor yaml · yuki-97 · Nov 11, 2025
- 620fa54 · update custom parallel plan doc · yuki-97 · Nov 11, 2025
- 29c3655 · revert logger.py · yuki-97 · Nov 11, 2025
- 32d1726 · unify run_grpo with multiple env · RayenTian · Nov 18, 2025
- f45f55d · remove useless code · yuki-97 · Nov 18, 2025
- 2fadb92 · Enhance dataset handling by introducing RawDataset class and updating… · RayenTian · Nov 20, 2025
- e2ce46b · Add Code Jaccard Environment documentation and update dataset process… · RayenTian · Nov 20, 2025
- bca6506 · Refactor dataset classes to inherit from RawDataset · RayenTian · Nov 21, 2025
- 3ab7845 · Update environment configuration in YAML files and adjust dataset set… · RayenTian · Nov 21, 2025
- 6aafd29 · Remove unused base model parallel plan from custom parallel configura… · RayenTian · Nov 26, 2025
- d8cb985 · fix doc · RayenTian · Nov 26, 2025
- 48ebad7 · Update nightly compute test to allow for up to 1300 GPU hours · RayenTian · Dec 2, 2025
- c342278 · address comments · RayenTian · Dec 7, 2025
- 64fc14e · waive nemotron 49B because of issue #1571 · RayenTian · Dec 8, 2025
- a4f2f03 · Update loss_multiplier in helpsteer3 data processor to zero for trunc… · RayenTian · Dec 9, 2025
- 5dc34bb · remove examples in environments documentation · RayenTian · Dec 9, 2025
- 4878deb · fix comment · RayenTian · Dec 9, 2025
- 119e86a · Remove metrics.json file from functional distillation tests to stream… · RayenTian · Dec 9, 2025
- f1d37cd · remove path.py · RayenTian · Dec 9, 2025
- 09b8090 · update env name check · RayenTian · Dec 9, 2025
- db22883 · docs: minor revisions (#1626) · lbliii · Dec 12, 2025
- cf6e02a · fix doc · RayenTian · Dec 12, 2025
4 changes: 2 additions & 2 deletions docs/design-docs/fsdp2-parallel-plan.md
@@ -20,11 +20,11 @@ The Hugging Face tensor parallel plan is the default. It's available for most mo…

## Custom Parallel Plan Example

A custom parallel plan should be defined in a separate file, such as the example provided in `examples/custom_parallel.py`.
A custom parallel plan should be defined in a separate file, such as the example provided in `examples/custom_parallel/custom_parallel.py`.

To implement the custom parallel plan, either update the value of `custom_parallel_plan` in the `yaml` file directly, or pass the override via the command line. For example:

```bash
uv run examples/run_grpo_math.py \
policy.dtensor_cfg.custom_parallel_plan=examples.custom_parallel.custom_parallel_plan
policy.dtensor_cfg.custom_parallel_plan=examples.custom_parallel.custom_parallel.custom_parallel_plan
```
144 changes: 134 additions & 10 deletions docs/guides/environments.md
@@ -1,6 +1,6 @@
# Environments for GRPO Training

GRPO supports several examples of environments for different tasks. Each environment provides a standardized interface for reward computation and evaluation.
GRPO includes multiple environments, each offering a standard interface for reward computation and evaluation.

## Math Environment

@@ -40,9 +40,51 @@ code_env = CodeEnvironment.remote(env_config)

### Configuration
- `num_workers`: Number of parallel workers for code execution
- `terminate_on_evaluation`: Whether to terminate after code execution (True for single-turn, False for multi-turn)
- `terminate_on_evaluation`: Whether to terminate after code execution (True for single-turn, False for multi-turn).

We’re tracking an end-to-end example of this environment in [#858](https://github.com/NVIDIA-NeMo/RL/issues/858). Add a 👍 to show your interest.
We are tracking an end-to-end example of this environment in [#858](https://github.com/NVIDIA-NeMo/RL/issues/858). Add a 👍 to show your interest.

## Code Jaccard Environment

The Code Jaccard Environment evaluates code (or text) responses by measuring Jaccard-based similarity against ground-truth answers. This is a lightweight, text-similarity reward useful when an execution sandbox is unnecessary or unavailable.

### How It Works
- Extracts the assistant’s response text from each conversation.
- Computes a Jaccard similarity score between the response and the ground truth:
  - Tokenizes both texts by whitespace, computes intersection over union, then applies a length-ratio penalty (see the sketch below).
- Scores are in [0, 1]. Observations label responses as “aligned” or “misaligned” using a 0.5 threshold.
- Returns:
  - observations: Environment feedback strings.
  - rewards: Tensor of similarity scores.
  - terminateds: All ones (single-step episodes).
  - answers: The response text when requested (optional).
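
For concreteness, here is a minimal sketch of that scoring, assuming whitespace tokenization and a simple min/max token-count ratio as the length penalty; the `jaccard_score` helper is illustrative, not the library's actual implementation:

```python
def jaccard_score(response: str, ground_truth: str) -> float:
    # Whitespace tokenization, as described above.
    resp_tokens = set(response.split())
    gt_tokens = set(ground_truth.split())
    if not resp_tokens or not gt_tokens:
        return 0.0
    # Jaccard similarity: |intersection| / |union|.
    jaccard = len(resp_tokens & gt_tokens) / len(resp_tokens | gt_tokens)
    # Hypothetical length-ratio penalty: shrink the score when the two
    # texts differ greatly in length.
    ratio = min(len(resp_tokens), len(gt_tokens)) / max(len(resp_tokens), len(gt_tokens))
    return jaccard * ratio

# Scores land in [0, 1]; a response scoring >= 0.5 would be labeled "aligned".
```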

### Usage
```python
from nemo_rl.environments.code_jaccard_environment import CodeJaccardEnvironment

env_config = {
    "num_workers": 2,
    # Optional default stop strings (unused in scoring but available for consistency)
    "stop_strings": None,
}

code_jaccard_env = CodeJaccardEnvironment.remote(env_config)
```
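
As a hedged sketch of invoking the environment (the exact `step` signature and field names here are assumptions; consult `EnvironmentInterface` for the authoritative contract), a single-step evaluation might look like:

```python
import ray

# Hypothetical one-sample batch: a conversation plus per-sample metadata
# carrying the ground-truth answer.
message_log_batch = [
    [
        {"role": "user", "content": "Write a function that adds two numbers."},
        {"role": "assistant", "content": "def add(a, b):\n    return a + b"},
    ]
]
metadata = [{"ground_truth": "def add(x, y):\n    return x + y"}]

# step is a Ray actor method; ray.get blocks until the result is ready.
result = ray.get(code_jaccard_env.step.remote(message_log_batch, metadata))
# result carries the observations, rewards, and terminateds described above.
```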

### Configuration
- `num_workers` (int): Number of parallel verification workers.
- `stop_strings` (list[str] | None): Optional default stop strings (propagated downstream; not required for scoring).

### Sample GRPO Config
```yaml
env:
  code_jaccard:
    num_workers: 2
    stop_strings: null
data:
  env_name: code_jaccard
```

## Reward Model Environment

@@ -72,9 +114,9 @@ reward_env = RewardModelEnvironment.remote(env_config)

In GRPO training, resources are allocated across three main components:

- **Policy Actor**: The trained model
- **Policy Actor**: The trained model.
- **Generation Actor**: Used for generating responses during rollouts (can be colocated with policy or on separate nodes/GPUs).
- **Reward Model Environment Actor**: Evaluates generated responses and computes rewards
- **Reward Model Environment Actor**: Evaluates generated responses and computes rewards.

The resource allocation logic works as follows:

@@ -86,10 +128,10 @@ The resource allocation logic works as follows:
2. Policy and generation non-colocated: 8 GPUs total = 2 for policy + 2 for generation + 4 for reward model
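
As a toy illustration of that non-colocated split (plain arithmetic, not NeMo RL's actual allocation code):

```python
# Single-node, non-colocated example: 8 GPUs total.
total_gpus = 8
reward_model_gpus = 4  # from env.reward_model.resources
generation_gpus = 2    # from policy.generation.colocated.resources
policy_gpus = total_gpus - reward_model_gpus - generation_gpus
assert policy_gpus == 2  # matches the 2 + 2 + 4 split above
```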

#### Multi-Node Setup (`num_nodes > 1`)
- Policy training, generation, and reward model environment can be distributed across different nodes
- Reward model gets dedicated resources as specified in `env.reward_model.resources`
- Generation gets dedicated resources as specified in `policy.generation.colocated.resources`
- Remaining nodes are allocated to policy training
- Policy training, generation, and reward model environment can be distributed across different nodes.
- Reward model gets dedicated resources as specified in `env.reward_model.resources`.
- Generation gets dedicated resources as specified in `policy.generation.colocated.resources`.
- Remaining nodes are allocated to policy training.

In the future, resource control will be refactored to enable fine-grained resource configuration for each actor. For detailed resource management and optimization strategies, see [#1100](https://github.com/NVIDIA-NeMo/RL/issues/1100).

Expand All @@ -99,4 +141,86 @@ See [examples/run_grpo_rm.py](../../examples/run_grpo_rm.py) for a complete exam

### Configuration Examples

See [examples/configs/grpo_rm_1B.yaml](../../examples/configs/grpo_rm_1B.yaml) for a complete configuration example.


## Registering Custom Environments

NeMo RL provides a flexible environment registration mechanism that allows you to add custom environments without modifying the source code.

### Using the `register_env` Interface

Use the `register_env` function to dynamically register new environments at runtime.

**Function Signature**

```python
from nemo_rl.environments.utils import register_env

register_env(env_name: str, actor_class_fqn: str) -> None
```

**Parameters:**

- `env_name` (str): Unique identifier for the environment.
- `actor_class_fqn` (str): Fully qualified name of the environment actor class, in the format `'module.path.ClassName'`.

### Example: Registering a Custom Environment

Suppose you've created a custom reinforcement learning environment for code generation tasks:

**1. Create Your Custom Environment Actor Class**

```python
# File: my_custom_envs/code_gen_env.py
import ray
from nemo_rl.environments.interfaces import EnvironmentInterface

@ray.remote
class CodeGenEnvironmentActor(EnvironmentInterface):
    """Custom code generation environment."""

    def __init__(self, config):
        self.config = config
        # Initialize your environment (e.g., load test cases, set up a sandbox)

    async def reset(self):
        # Reset environment state and return the initial observation
        initial_state = {}
        return initial_state

    async def step(self, action):
        # Execute the action, then compute the reward and termination flag
        observation, reward, done, info = {}, 0.0, True, {}
        return observation, reward, done, info

    # Implement other required interface methods...
```

**2. Register the Environment in Your Training Script**

```python
# File: train.py
from nemo_rl.environments.utils import register_env

# Register your custom environment
register_env(
    env_name="code_gen",
    actor_class_fqn="my_custom_envs.code_gen_env.CodeGenEnvironmentActor",
)

# Now you can use "code_gen" in your config
# Training code...
```

**3. Use the Registered Environment in Your Config**

```yaml
# config.yaml
env:
  code_gen:
    num_workers: 2
    max_code_length: 512
    test_cases_per_problem: 5

data:
  env_name: code_gen  # Use your registered environment name
```