📖 Paper | 🤗 Checkpoints
Authors: Yoonjeon Kim¹*, Doohyuk Jang¹*, Eunho Yang¹·²†
¹KAIST, ²AITRICS
*Equal Contribution, †Corresponding author
Recent studies on reasoning models explore the meta-awareness of language models, i.e., the ability to determine "how to think" by themselves. We argue that large reasoning models lack this meta-awareness, demonstrating severe misalignment between true roll-outs and predicted meta-information. We hypothesize that aligning meta-predictions with true roll-outs leads directly to significant performance gains. To verify this hypothesis, we design a training pipeline that boosts Meta-Awareness via Self-Alignment (MASA). Unlike prior approaches, our method requires no external datasets, auxiliary models, or human-crafted reasoning pipelines, but instead leverages self-generated reasoning signals to train meta-awareness. Meta-awareness enables efficient training through prediction-based gating and cutoff, while behavior cloning on expert meta-trajectories ensures reliable meta-predictions. The results are inspiring: our strategy yields significant improvements in both accuracy and training efficiency on in-domain tasks and generalizes strongly to out-of-domain benchmarks. More specifically, our method speeds up GRPO training by over 1.28x to reach the same performance, achieves a 19.3% accuracy gain on AIME25, and a 6.2% average gain over six mathematics benchmarks. Aided by the enhanced meta-cognitive ability, our approach also improves generalization on out-of-domain benchmarks, gaining an additional 3.87% on GPQA-Diamond and an overall 2.08% accuracy gain over 13 benchmarks spanning logical, scientific, and coding domains.
- Integration into VeRL
- MASA with the DAPO algorithm
- Training pipeline of MASA
- Evaluation on mathematical benchmarks
- Evaluation on logical / scientific / coding benchmarks
We highly recommend using a uv environment for fast environment building. See the uv installation guide for how to install it.
uv venv --python 3.12
source .venv/bin/activate
uv pip install torch==2.7.0
uv pip install --no-build-isolation -e .
uv pip install latex2sympy2_extended fire tensordict pysnooper hydra-core wandb
uv pip install flash-attn==2.8.3 --no-build-isolation
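After installation, a quick import check can verify that the environment built correctly (a minimal sanity check of the packages installed above):
python3 -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available())"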
You can run inference, evaluation, and score calculation all at once using the following command:
bash eval.sh <MODEL_NAME> <DOMAIN [math|science|logic|coding]> <NUM_GPUS>
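For example, to evaluate a checkpoint on the math benchmarks with 4 GPUs (the model path is illustrative):
bash eval.sh /path/to/your/checkpoint math 4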
- For coding benchmarks, only MBPP is currently supported, since coding evaluation environments vary greatly across benchmarks.
- For other coding benchmarks, please use their official evaluation implementations.
- Inference
python3 eval/inference.py \
    --model_path MODEL_NAME_OR_PATH \
    --dataset DATASET_NAME \
    --temperature TEMPERATURE \
    --tp TENSOR_PARALLEL_SIZE \
    --max_model_len MAX_MODEL_LEN \
    --max_new_token MAX_NEW_TOKEN \
    --dtype MODEL_DTYPE \
    --output_dir OUTPUT_DIR \
    --n NUMBER_OF_SAMPLES_PER_PROMPT
Supported Datasets:
["math", "amc23", "aime2025", "math500", "minerva", "olympiad_math", "aime2024", "rbench", "gpqa_diamond", "arc_challenge", "scibench", "AR-LSAT", "FOLIO", "LogicalDeduction", "ProntoQA", "ProofWriter", "MBPP"]You can also integrate your custom dataset by following the provided data preprocessing code.
- Evaluation
This script evaluates the correctness of each response stored in the given response_path directory.
python3 eval/eval.py \
    --response_path INFERENCE_OUTPUT_PATH \
    --eval_path OUTPUT_EVAL_RESULT_PATH
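Continuing the inference example above (paths illustrative):
python3 eval/eval.py \
    --response_path outputs/aime2025 \
    --eval_path outputs/aime2025_eval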
- Score Calculation
This script computes the pass@k metrics (e.g., pass@1, pass@5, pass@10) based on your evaluation results.
python3 eval/calc_passk.py \
    --eval_path EVAL_RESULT_PATH \
    --k COMMA_SEPARATED_K_VALUES
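Continuing the example, the following reports pass@1, pass@5, and pass@10 (paths illustrative). For reference, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), averaged over prompts, where n is the number of samples per prompt and c the number of correct ones; we assume calc_passk.py follows this standard convention.
python3 eval/calc_passk.py \
    --eval_path outputs/aime2025_eval \
    --k 1,5,10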
wandb login # login to your wandb account
bash train_masagrpo.sh # To run our MASA with the GRPO algorithm
bash train_grpo.sh # To run the baseline GRPO algorithm without MASA
The training configurations (NGPUS, OUTPUT_DIR, BASE_MODEL, BASE_DIR) can be modified in the scripts. Training the 8B model runs on 8 x H100 GPUs and takes about 1-2 days to complete a single epoch (314 steps with batch size 128).
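For example, the variables at the top of train_masagrpo.sh might be set along these lines (values and layout are illustrative; check the script for the actual defaults):
NGPUS=8                           # number of GPUs to train on
BASE_MODEL=/path/to/base/model    # model to fine-tune
OUTPUT_DIR=checkpoints/masa_grpo  # where checkpoints are written
BASE_DIR=$(pwd)                   # repository root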
If an out-of-memory error occurs, adjust the following values:
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=36000
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=96000
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=96000
For further performance acceleration, refer to the verl Performance Tuning Guide.
To preprocess a custom dataset from the Hugging Face Hub for training:
python3 data_preprocess.py --dataset_name HF_DATASET_NAME
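For example (the dataset name is illustrative; any compatible Hugging Face dataset should work):
python3 data_preprocess.py --dataset_name openai/gsm8k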
This repository builds upon and references the following open-source projects:
We sincerely thank the authors of these repositories for their valuable contributions to the community.
@article{kim2025meta,
title={Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning},
author={Kim, Yoonjeon and Jang, Doohyuk and Yang, Eunho},
journal={arXiv preprint arXiv:2510.03259},
year={2025}
}