We implement several key modifications based on the verl framework, including:
- Asynchronous rollout and reward: we launch the reward model as an external service and implement sample-wise asynchronous reward computation; the detailed implementation is in `verl/experimental/agent_reward_loop` (a minimal client-side sketch is shown after this list).
- We extract the code most relevant to the FAPO algorithm into `fapo/` for reference, including `fapo/fapo_genrm` and `fapo/fapo_reasoning`. The corresponding training scripts for FAPO-GenRM and FAPO-Reasoning (and the baselines) are placed in `scripts/`.
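To illustrate the sample-wise asynchronous reward path described above, here is a minimal client-side sketch. It assumes a hypothetical GenRM HTTP endpoint (`/score`), payload schema, and `reward` response field; the actual service interface lives in `verl/experimental/agent_reward_loop` and may differ.

```python
import asyncio
import aiohttp

GENRM_URL = "http://localhost:8000/score"  # hypothetical endpoint of the external GenRM service

async def score_sample(session: aiohttp.ClientSession, prompt: str, response: str) -> float:
    """Send a single (prompt, response) pair to the external reward service."""
    payload = {"prompt": prompt, "response": response}  # assumed payload schema
    async with session.post(GENRM_URL, json=payload) as resp:
        resp.raise_for_status()
        result = await resp.json()
        return float(result["reward"])  # assumed response field

async def score_rollouts(samples: list[tuple[str, str]]) -> list[float]:
    """Score all rollout samples concurrently (sample-wise async)."""
    async with aiohttp.ClientSession() as session:
        tasks = [score_sample(session, p, r) for p, r in samples]
        return await asyncio.gather(*tasks)

# Example usage:
# rewards = asyncio.run(score_rollouts([("What is 2+2?", "4")]))
```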
Due to the file size limit, we only upload the first 100 rows of `example_data/fapo-critic.jsonl` (converted to JSONL for better readability).
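For a quick look at the example data, the snippet below loads the uploaded rows; it only assumes that each line is a JSON object and makes no assumption about field names.

```python
import json

# Inspect the truncated example data (first 100 rows of fapo-critic.jsonl)
with open("example_data/fapo-critic.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

print(f"{len(rows)} rows; keys of first row: {sorted(rows[0].keys())}")
```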
Train the FAPO-GenRM model:

```bash
bash scripts/run_fapo_genrm_4b.sh
```

Then launch the GenRM servers and the router used for reward computation during reasoning training:

```bash
# first launch multiple genrm servers
bash scripts/launch_server.sh

# launch a router to manage the data_parallel genrm servers;
# requests are sent to the router, which then distributes each
# request to the corresponding genrm server
bash scripts/launch_router.sh
# Note that you should specify the router address
# in `fapo/fapo_reasoning/reward_fn.py`
```
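A minimal sketch of how the router address might be wired into the reward function is shown below, assuming a module-level constant and a plain HTTP call with a hypothetical `/score` endpoint and `reward` field; the actual variable names and request format in `fapo/fapo_reasoning/reward_fn.py` may differ.

```python
import requests

# Address of the router started by scripts/launch_router.sh
# (hypothetical constant name; set this to your router's host:port)
ROUTER_ADDRESS = "http://127.0.0.1:9000"

def compute_score(prompt: str, response: str) -> float:
    """Query the GenRM router for a reward; the router forwards the
    request to one of the data-parallel genrm servers."""
    payload = {"prompt": prompt, "response": response}  # assumed schema
    resp = requests.post(f"{ROUTER_ADDRESS}/score", json=payload, timeout=60)
    resp.raise_for_status()
    return float(resp.json()["reward"])  # assumed response field
```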
Finally, train the baseline and FAPO reasoning models:

```bash
# Train Baseline Models
bash scripts/run_baseline_reasoning_7b.sh
bash scripts/run_baseline_reasoning_32b.sh

# Train FAPO Models
bash scripts/run_fapo_reasoning_7b.sh
bash scripts/run_fapo_reasoning_32b.sh
```