[trainer, worker] feat: more flexible and easy-to-use reward model #3679
Conversation
The PR seems to be included in sglang 0.5.3.
A question: if I train with the latest main branch code now, is the reward manager under this path used by default? Setting reward_model.reward_manager=batch in my script initially raised an error; after I registered batch in verl/verl/experimental/reward/reward_loop, the error went away.
Yes, the batch reward manager is not supported in the reward loop (the reward loop provides another way to handle batch requests). If you want to use the legacy one, you can add
If I want to output reward scores by calling an LLM API after the policy model rollout, which approach is better for that?
See #4318 (comment)
The current reward model implementation faces the following challenges:

1. Model support: it is primarily designed for discriminative models and lacks robust support for generative reward models.
2. Complexity: it relies on heavyweight backends such as FSDP or Megatron, which are often unnecessary for typical reward model inference.
3. Flexibility: the batch-level synchronization mechanism hinders the implementation of more flexible, sample-level reward functions.

What this PR does

To address these issues, this PR introduces a more flexible and easy-to-use reward model design. Specifically, it implements two main classes, `RewardModelManager` and `RewardManagerWorker`, along with runnable scripts in `recipe/fapo`.

- `RewardModelManager` first launches multiple reward servers and then manages them with a router-based approach (using the [SGLang Router](https://docs.sglang.ai/advanced_features/router.html)), distributing requests across the reward servers.
- `RewardManagerWorker` retrieves the remote actor handle, giving users greater flexibility in designing custom reward functions. For example, a customized reward function can be implemented like the following:

```python
async def compute_score(
    data_source: str,
    solution_str: str,
    ground_truth: str,
    extra_info: dict,
    reward_router_address: str,
    reward_model_tokenizer: PreTrainedTokenizer,
):
    # Compute the rule-based reward score
    rule_based_score = ...

    # Compute the GRM reward score
    grm_prompts = ...
    grm_prompt_ids = ...
    # Users can directly call the reward model through the reward router
    grm_outputs = post(f"http://{reward_router_address}/generate", ...)
    ...

    # Final reward score
    final_score = ...
    return final_score
```

This implementation provides a `reward_model` interface in the `compute_score` method, maximizing flexibility and convenience for algorithmic design. Note that `compute_score` is an asynchronous function, so efficiency is not a concern; each sample is processed asynchronously.
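For reference, here is a minimal sketch of what the `post(...)` call to the reward router could look like, assuming the router exposes an SGLang-compatible `/generate` endpoint and using `aiohttp` for the asynchronous HTTP request. The payload fields and response parsing are illustrative assumptions, not the exact schema used in this PR:

```python
import aiohttp

async def query_reward_router(reward_router_address: str, prompt: str) -> str:
    """Illustrative sketch: send one generation request to the reward router."""
    url = f"http://{reward_router_address}/generate"
    payload = {
        "text": prompt,  # assumed field name; check the deployed router's schema
        "sampling_params": {"temperature": 0.0, "max_new_tokens": 512},
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload) as resp:
            resp.raise_for_status()
            result = await resp.json()
    # SGLang's /generate typically returns the generated text under "text";
    # adjust parsing to match the actual server response.
    return result["text"]
```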
Integration with AgentLoop
This PR introduces asynchronous reward computation for individual samples (`async def run_single(self, data: DataProto) -> dict`) and leverages an event loop to handle reward computation in parallel, significantly improving processing efficiency. Moreover, this implementation has already been integrated with `agentloop` for improved efficiency. In this mode, the reward model operates independently of the rollout process (standalone mode), enabling a natural async data flow where each sample undergoes reward rollout immediately after actor rollout.
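As a rough illustration of this sample-level parallelism (not the actual `RewardManagerWorker` code), an event loop can dispatch one reward computation per sample and gather the results concurrently; `compute_sample_reward` below is a hypothetical stand-in for the per-sample scoring routine such as `run_single`:

```python
import asyncio

async def compute_sample_reward(sample: dict) -> float:
    """Hypothetical per-sample scorer; in practice this would call compute_score / run_single."""
    await asyncio.sleep(0)  # placeholder for async I/O (e.g. a request to the reward router)
    return 0.0

async def compute_batch_rewards(samples: list[dict]) -> list[float]:
    # Each sample is scored independently, so the batch overlaps its I/O
    # instead of waiting on batch-level synchronization.
    return await asyncio.gather(*(compute_sample_reward(s) for s in samples))

# rewards = asyncio.run(compute_batch_rewards(batch_samples))
```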
This implementation reduces code redundancy in the existing reward model while maximizing flexibility for user-customized reward functions.
Runnable Scripts
A runnable example is provided in `recipe/fapo/`. The newly introduced parameters for this implementation are placed in `fapo/config` and will be integrated into the main codebase upon completion of the refactoring.