Description
Add Process Reward Mechanism
Motivation
Current frameworks for training LLM-based tool-calling agents rely on final-outcome rewards (e.g., task success/failure or the accuracy of queries). While effective for simple tasks, this approach may face critical limitations in complex, multi-turn tool-calling scenarios:
Sparse Feedback: Agents receive no guidance during multi-turn reasoning (e.g., API calls, data retrieval, tool chaining), leading to inefficient exploration.
We therefore propose introducing process rewards (step-level or milestone-based rewards), which opens up the following possibilities:
- User-Customizable Reward Modes: Users can decide whether to enable process rewards and how to define their own process reward (see the sketch after this list).
- Faster Convergence: By providing immediate feedback on tool selection, parameter validity, and reasoning coherence, process rewards may accelerate convergence of the training process.
- Tool-Calling Proficiency: We want to investigate whether process rewards help LLMs learn to use tools more proficiently.
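To make the "user-customizable" point concrete, here is a minimal sketch of how a user-supplied process reward could be plugged in and combined with the existing outcome reward. All class, field, and function names below are illustrative assumptions, not part of the current framework API.

```python
# Hypothetical sketch of a user-configurable process reward hook.
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToolCallStep:
    """One tool-calling step in a multi-turn trajectory (illustrative schema)."""
    tool_name: str
    arguments: dict
    result: Optional[str]
    succeeded: bool


# A process reward function maps a single step (plus the trajectory so far)
# to a scalar reward; users supply their own implementation.
ProcessRewardFn = Callable[[ToolCallStep, List[ToolCallStep]], float]


@dataclass
class RewardConfig:
    """Lets users opt in to process rewards and weight them against the outcome reward."""
    use_process_reward: bool = False
    process_reward_fn: Optional[ProcessRewardFn] = None
    process_weight: float = 0.1  # relative weight of step-level rewards (assumed default)


def total_reward(outcome_reward: float,
                 steps: List[ToolCallStep],
                 cfg: RewardConfig) -> float:
    """Combine the final-outcome reward with optional step-level process rewards."""
    if not cfg.use_process_reward or cfg.process_reward_fn is None:
        return outcome_reward
    process_sum = sum(cfg.process_reward_fn(step, steps[:i])
                      for i, step in enumerate(steps))
    return outcome_reward + cfg.process_weight * process_sum
```

With this shape, disabling process rewards reduces exactly to the current final-outcome setup, so existing training runs would be unaffected.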
Key points
1. How to obtain the process reward: rule-based scoring, a reward model, ...
2. Process reward injection points: stepwise correctness, toolchain efficiency, ...
3. Timing of the process reward: e.g., each time a tool call completes, ... (a rule-based sketch covering these points follows this list)
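Below is a minimal rule-based sketch of points 1-3: a step-level reward intended to fire each time a tool call completes, scoring stepwise correctness (call succeeded, arguments well formed) and toolchain efficiency (penalizing redundant repeated calls). It assumes step objects with the illustrative `tool_name` / `arguments` / `succeeded` fields from the sketch above; all weights are placeholder values, not tuned settings.

```python
from typing import List


def rule_based_step_reward(step, history: List) -> float:
    """Score one completed tool call; meant to run whenever a call finishes."""
    reward = 0.0

    # Stepwise correctness: did the call return without error?
    reward += 1.0 if step.succeeded else -1.0

    # Parameter validity: reward well-formed, non-empty arguments (assumed convention).
    if isinstance(step.arguments, dict) and step.arguments:
        reward += 0.5

    # Toolchain efficiency: penalize repeating the same tool with identical arguments.
    repeated = any(prev.tool_name == step.tool_name and prev.arguments == step.arguments
                   for prev in history)
    if repeated:
        reward -= 0.5

    return reward
```

A reward-model-based variant would keep the same signature but replace the hand-written rules with a scoring call to a learned model, so the two modes stay interchangeable from the user's point of view.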