
NCCL error when using non-colocated generation on single node #722

@yuki-97

Description

Non-colocated generation works on multi-node setups, but fails on a single node with the NCCL error below.

Error:

raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
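The error message itself suggests rerunning with `NCCL_DEBUG=INFO`. A minimal sketch of the environment variables that usually surface the underlying CUDA error (these are standard NCCL/CUDA debug knobs, not settings specific to this repro):

```shell
# Standard NCCL/CUDA debug knobs; prepend these to the repro commands below.
export NCCL_DEBUG=INFO             # print NCCL init/transport logs to stderr
export NCCL_DEBUG_SUBSYS=INIT,NET  # narrow the logs to init and networking
export CUDA_LAUNCH_BLOCKING=1      # make the failing CUDA call synchronous
echo "NCCL_DEBUG=$NCCL_DEBUG"
```

With these set, the rank that hits the "unhandled cuda error" should log the failing NCCL call before raising.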

Repro:

  1. pytest

    1. Remove the pytest.skip guard:
      # Skip tensor_parallel_size == 2 until we have resources in CI
      if tensor_parallel_size == 2:
          pytest.skip(
              "Test requires at least three GPUs to run with tensor_parallel_size == 2 on separate clusters."
          )
    2. Change world_size to tensor_parallel_size + 1 (to be fixed properly in feat: support non-colocated in mcore #613):
      futures_train = lm_policy.init_collective(ip, port, world_size=tensor_parallel_size+1)
      futures_inference = vllm_generation.init_collective(ip, port, world_size=tensor_parallel_size+1)
    3. Run the following pytest command:
      uv run pytest -vv tests/unit/models/generation/test_vllm_generation.py::test_vllm_refit_non_collocated_update_weights[2-False]
      
  2. dtensor worker

    RUN_COMMAND="NRL_FORCE_REBUILD_VENVS=true uv run python examples/run_grpo_math.py \
        grpo.max_num_steps=100 \
        grpo.val_period=1000 \
        policy.generation.vllm_cfg.async_engine=false \
        policy.generation.vllm_cfg.tensor_parallel_size=2 \
        policy.generation.colocated.enabled=false \
        policy.generation.colocated.resources.gpus_per_node=4 \
        policy.dtensor_cfg.tensor_parallel_size=2 \
        policy.dynamic_batching.enabled=True \
        checkpointing.enabled=false \
        logger.wandb_enabled=true \
        logger.tensorboard_enabled=false \
        logger.monitor_gpus=true \
        logger.wandb.project=${PROJECT_NAME} \
        logger.wandb.name=${EXP_NAME} \
        cluster.num_nodes=1 \
        cluster.gpus_per_node=8"
  3. mcore worker

    RUN_COMMAND="NRL_FORCE_REBUILD_VENVS=true uv run python examples/run_grpo_math.py \
    --config examples/configs/grpo_math_1B_megatron.yaml \
        grpo.max_num_steps=100 \
        grpo.val_period=1000 \
        policy.generation.vllm_cfg.async_engine=false \
        policy.generation.vllm_cfg.tensor_parallel_size=2 \
        policy.generation.colocated.enabled=false \
        policy.generation.colocated.resources.gpus_per_node=4 \
        policy.megatron_cfg.enabled=true \
        policy.megatron_cfg.activation_checkpointing=false \
        policy.megatron_cfg.tensor_model_parallel_size=2 \
        policy.megatron_cfg.pipeline_model_parallel_size=1 \
        policy.dynamic_batching.enabled=True \
        checkpointing.enabled=false \
        logger.wandb_enabled=true \
        logger.tensorboard_enabled=false \
        logger.monitor_gpus=true \
        logger.wandb.project=${PROJECT_NAME} \
        logger.wandb.name=${EXP_NAME} \
        cluster.num_nodes=1 \
        cluster.gpus_per_node=8"
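The actual weight-update collective runs NCCL on GPUs; the sketch below is a CPU-only illustration (gloo backend) of the rank layout that pytest step 2 assumes: one training rank plus tensor_parallel_size inference ranks join a single process group, hence world_size = tensor_parallel_size + 1. All names here are illustrative, not NeMo RL's actual API.

```python
# CPU-only sketch (gloo, not NCCL) of the non-colocated collective layout:
# rank 0 stands in for the training policy, ranks 1..TP for vLLM TP workers.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

TP = 2
WORLD_SIZE = TP + 1  # 1 trainer rank + TP inference ranks

def rank_main(rank: int, world_size: int) -> None:
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29511",
        rank=rank,
        world_size=world_size,
    )
    # Stand-in for a weight sync: the trainer (rank 0) broadcasts a tensor.
    t = torch.full((1,), 42.0) if rank == 0 else torch.zeros(1)
    dist.broadcast(t, src=0)
    assert t.item() == 42.0  # every rank received the trainer's value
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(rank_main, args=(WORLD_SIZE,), nprocs=WORLD_SIZE, join=True)
    print(f"broadcast ok across {WORLD_SIZE} ranks")
```

If the two init_collective calls disagree on world_size, the group never fully forms and the collective hangs or errors, which is why step 2 changes both calls together.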

Metadata

Labels

- bug: Something isn't working
- qa_rcca_done: when the RCCA for the issue is finished, QA marks it with this label
