@kashif (Collaborator) commented Sep 10, 2025

What does this PR do?

Fixes #3979

This pull request improves the way distributed training environment variables are set for trainers in the trl package. Instead of hardcoding or defaulting to potentially conflicting values for MASTER_ADDR and MASTER_PORT, a new utility function is introduced to safely and automatically select a free port, reducing the risk of collisions during concurrent runs.

Distributed training environment setup improvements:

  • Added a new utility function ensure_master_addr_port in trl/trainer/utils.py that sets MASTER_ADDR and chooses a free MASTER_PORT if not already set, or if set to "0" or "auto", to avoid port collisions in distributed training.
  • Updated imports in trl/trainer/grpo_trainer.py, trl/trainer/online_dpo_trainer.py, and trl/trainer/rloo_trainer.py to include ensure_master_addr_port.
  • Replaced hardcoded or default assignments of MASTER_ADDR and MASTER_PORT in the constructors of GRPOTrainer, OnlineDPOTrainer, and RLOOTrainer with calls to ensure_master_addr_port, ensuring safer distributed setup.
  • Added necessary imports for os and socket in trl/trainer/utils.py to support the new utility function.
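
For readers who want a concrete picture of the behavior described in the list above, here is a minimal sketch. It is not the code from this PR: the function names `find_free_port` and `ensure_master_addr_port_sketch` are hypothetical, and the "localhost" default for MASTER_ADDR is an assumption. It only illustrates the general technique of binding to port 0 so the OS picks an unused port, and of respecting values that are already set.

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))
        # getsockname() returns (host, port); the port is the one the OS chose.
        return sock.getsockname()[1]


def ensure_master_addr_port_sketch() -> None:
    """Hypothetical sketch, not the TRL implementation.

    - Keeps MASTER_ADDR if already set; otherwise uses "localhost" (assumed default).
    - Keeps MASTER_PORT if already set to a concrete value.
    - Picks a free port if MASTER_PORT is unset, empty, "0", or "auto".
    """
    if not os.environ.get("MASTER_ADDR"):
        os.environ["MASTER_ADDR"] = "localhost"

    current_port = os.environ.get("MASTER_PORT", "").strip().lower()
    if current_port in {"", "0", "auto"}:
        os.environ["MASTER_PORT"] = str(find_free_port())


if __name__ == "__main__":
    # Request automatic selection explicitly, then inspect the result.
    os.environ["MASTER_PORT"] = "auto"
    ensure_master_addr_port_sketch()
    print(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"])
```

One caveat with this approach in general: the port is released before the training process binds it, so another process could claim it in between. The window is small, but it means the technique reduces collisions rather than ruling them out.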

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec (Member) left a comment

LGTM!

@qgallouedec changed the title from "[vllm] ensure MASTER_ADDR/MASTER_PORT are set safely" to "⚓ [vllm] ensure MASTER_ADDR/MASTER_PORT are set safely" Sep 23, 2025
@qgallouedec merged commit 008c7ad into huggingface:main Sep 23, 2025
9 of 10 checks passed
singing-cat pushed a commit to singing-cat/trl that referenced this pull request Sep 23, 2025
qgallouedec added a commit that referenced this pull request Sep 23, 2025
Development

Successfully merging this pull request may close these issues.

GRPOTrainer vLLM colocate hardcodes MASTER_PORT=12345 so no parallel runs possible

3 participants