
Process Reward Mechanism #6

@myAugust

Description

Add Process Reward Mechanism

Motivation

Current frameworks for training LLM-based tool-calling agents rely on final-outcome rewards (e.g., task success/failure or the accuracy of the final answer). While effective for simple tasks, this approach may face a critical limitation in complex, multi-turn tool-calling scenarios:

Sparse Feedback: the agent receives no guidance during intermediate multi-turn reasoning steps (e.g., API calls, data retrieval, tool chaining), leading to inefficient exploration.

We therefore propose introducing process rewards (step-level or milestone-based rewards) to explore the following possibilities:

  1. User-Customizable Reward Modes: users can choose whether to enable process rewards and how to define their own process reward (a sketch of a possible hook follows this list).
  2. Faster Convergence: by providing immediate feedback on tool selection, parameter validity, and reasoning coherence, process rewards may accelerate convergence of the training process.
  3. Tool-Calling Proficiency: we want to know whether process rewards help the LLM learn to use tools expertly.
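
As a rough illustration of point 1, here is a minimal sketch of what the user-facing hook could look like, assuming a Python training loop; the names `ProcessRewardFn`, `combine_rewards`, and `process_weight` are hypothetical, not existing framework API:

```python
from typing import Any, Protocol


class ProcessRewardFn(Protocol):
    """User-supplied hook that scores one intermediate step of a rollout."""

    def __call__(self, step: dict[str, Any]) -> float: ...


def combine_rewards(outcome_reward: float,
                    step_rewards: list[float],
                    process_weight: float = 0.3) -> float:
    """Blend the final-outcome reward with the mean of the step-level rewards."""
    if not step_rewards:
        return outcome_reward  # no process signal collected: fall back to outcome only
    process_reward = sum(step_rewards) / len(step_rewards)
    return (1 - process_weight) * outcome_reward + process_weight * process_reward
```

Setting `process_weight=0.0` would recover the current outcome-only behavior, keeping the mechanism fully opt-in.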

Key Points

1. How to obtain the process reward: rule-based checks, a learned reward model, etc. (a rule-based sketch follows this list).

2. Process reward injection points: stepwise correctness, toolchain efficiency, etc.

3. Timing of the process reward: e.g., each time a tool call completes.
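
To make point 1 concrete, here is a minimal rule-based sketch that scores each completed tool call; the reward values, the `tool_schemas` format, and the `get_weather` tool in the usage example are all illustrative assumptions:

```python
import json
from typing import Any


def rule_based_step_reward(tool_call: dict[str, Any],
                           tool_schemas: dict[str, dict[str, Any]]) -> float:
    """Score one completed tool call against the declared tool schemas."""
    schema = tool_schemas.get(tool_call.get("name", ""))
    if schema is None:
        return -1.0  # penalize calling a tool that does not exist
    try:
        args = json.loads(tool_call.get("arguments", "{}"))
    except json.JSONDecodeError:
        return -0.5  # penalize arguments that are not valid JSON
    if not set(schema.get("required", [])).issubset(args):
        return -0.25  # penalize a missing required parameter
    return 0.5  # well-formed call: partial credit before the final outcome


# Usage: a hypothetical weather tool that requires a "city" parameter.
schemas = {"get_weather": {"required": ["city"]}}
print(rule_based_step_reward(
    {"name": "get_weather", "arguments": '{"city": "Paris"}'}, schemas))  # 0.5
print(rule_based_step_reward(
    {"name": "get_wether", "arguments": "{}"}, schemas))                  # -1.0
```

A learned reward model could later replace this function behind the same interface, so the two approaches stay interchangeable from the trainer's point of view.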


Labels: enhancement (New feature or request)
