
[AAAI26] Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities

Weixiang Zhao1*, Xingyu Sui1*, Jiahe Guo1*, Yulin Hu1*, Yang Deng2,
Yanyan Zhao1, Bing Qin1, Wanxiang Che1, Tat-Seng Chua3, Ting Liu1

1Harbin Institute of Technology    2Singapore Management University    3National University of Singapore   

Warning: This paper contains model outputs that may be considered offensive.

[Paper]    [Project Page]   

News

  • [2025/03/20] We released our source code.

Install and Run

Installation

You need to install easyvllm for decoding with LRMs.

git clone https://github.com/SCIR-SC-Qiaoban-Team/easyvllm
cd easyvllm
pip install -e .

easyvllm depends on vLLM and requires additional installation steps to support LRMs. Please refer to the installation guide at https://github.com/SCIR-SC-Qiaoban-Team/easyvllm.

git clone https://github.com/SCIR-SC-Qiaoban-Team/FreeEvalLM
cd FreeEvalLM
pip install -e .

Additionally, you need to provide your OpenAI API token for the evaluation of certain benchmarks.
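For example, assuming the evaluators read the token from the standard OPENAI_API_KEY environment variable (check your setup if it expects a different name or a config file):

```shell
# Assumption: the OpenAI-based evaluators read the token from OPENAI_API_KEY
export OPENAI_API_KEY="sk-your-token-here"
```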

Run our code

cd FreeEvalLM
python freeEvalLM/src/decode.py

Our pipeline consists of two steps: generation and evaluation. If you need to run them separately, call the corresponding evaluator in the tasks module directly.
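As a generic illustration of this generate-then-evaluate split (the function and variable names below are ours, not FreeEvalLM's actual API):

```python
# Generic sketch of a two-step generate/evaluate pipeline;
# names are illustrative only, not FreeEvalLM's real interface.

def generate(model, queries):
    """Step 1: decode a response for every query."""
    return [model(q) for q in queries]

def evaluate(responses, scorer):
    """Step 2: score previously saved responses with a task evaluator."""
    return sum(scorer(r) for r in responses) / len(responses)

# The two steps can run back-to-back, or evaluate() can be applied
# later to responses loaded from disk.
toy_model = lambda q: q.upper()            # stand-in for an LRM
toy_scorer = lambda r: float(r.isupper())  # stand-in for a task evaluator
responses = generate(toy_model, ["hello", "world"])
accuracy = evaluate(responses, toy_scorer)  # 1.0
```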

Arguments

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_path | - | - | Path to the model |
| save_path | - | - | Path to save results |
| decode_type | Literal['query', 'query_reasoning_ctrl', 'query_force_reasoning_content'] | - | Type of decoding |
| file_path | str | None | Path to the input file |
| task | str | None | Task name. When file_path is None, the data and evaluator are loaded according to the predefined task type; supported tasks are livebench, ifeval, mmlu_pro, strong_reject, wild_jailbreak, and XSTest_S |
| sample | int | -1 | Number of sampled data points. When set to -1, all data are loaded |
| query_keys | str | None | Keys for query extraction |
| response_keys | str | None | Keys for response storage |
| reasoning_keys | str | None | Keys for reasoning extraction and storage |
| tensor_parallel_size | int | 1 | Number of tensor-parallel units in vLLM |
| model_num | int | None | Number of models to load in vLLM |
| port | int | 50000 | Port number for vLLM serve |
| max_model_len | int | None | Maximum model length in vLLM |
| show_log | bool | True | Whether to display vLLM logs |
| timeout | int | 30 | Timeout duration in seconds |
| threads | int | 20 | Number of threads |
| enable_reasoning | bool | False | Enable reasoning when using LRMs |
| reasoning_parser | str | 'deepseek_r1' | Reasoning parser type; supports deepseek_r1, openthinker, and simplescaling |
| system_prompt_file | str | None | Path to the system prompt file |
| chat_template_file | str | None | Path to the chat template file |
| max_new_tokens | int | 8192 | Maximum number of new tokens |
| device_ids | str | None | Device IDs to use |
| reasoning_max_retry | int | 10 | Maximum number of retries when the model's output does not conform to the expected format |
| add_reasoning_prompt | bool | False | Manually add the reasoning token |
| enable_length_ctrl | bool | False | Enable length control |
| reasoning_max_len | int | None | Maximum length for reasoning |
| reasoning_min_len | int | 0 | Minimum length for reasoning |
| reasoning_scale | float | None | Scaling factor for reasoning |
| cut_by_sentence | bool | False | Cut content by sentence for length control |
| force_reasoning_content_keys | str | None | Keys for forced reasoning content, usually consistent with the previously saved reasoning_keys |
| overwrite | bool | False | Overwrite the response and reasoning keys of the input files |
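Putting these together, a single run might look like the following. This is a hypothetical invocation: the model and save paths are placeholders, and the flag spelling simply mirrors the parameter names above — check the script's own help output for the exact interface.

```shell
# Hypothetical invocation; paths are placeholders and the flag
# spelling is assumed to match the parameter names in the table.
python freeEvalLM/src/decode.py \
    --model_path /path/to/your-lrm \
    --save_path results/ifeval.json \
    --decode_type query \
    --task ifeval \
    --tensor_parallel_size 1 \
    --enable_reasoning True \
    --reasoning_parser deepseek_r1 \
    --max_new_tokens 8192
```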

Tasks

We now support testing for MMLU-Pro, IFEval, Live-Bench, StrongReject, WildJailbreak, and XSTest. If you need to add a custom task, please refer to src/task.py to add the dataset and evaluator.
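A custom task generally needs two pieces: a data loader and an evaluator. The skeleton below is only a hypothetical illustration of that pattern — the real interface lives in src/task.py and its class and method names may differ.

```python
# Hypothetical skeleton of a custom task; the actual base class and
# method names in src/task.py may differ.

class MyCustomTask:
    name = "my_benchmark"  # illustrative task identifier

    def load_data(self):
        """Return (query, reference) pairs for the benchmark."""
        return [("What is 2 + 2?", "4")]

    def evaluate(self, responses):
        """Score model responses against the loaded references."""
        data = self.load_data()
        correct = sum(ref in resp for (_, ref), resp in zip(data, responses))
        return correct / len(data)

task = MyCustomTask()
score = task.evaluate(["The answer is 4."])  # 1.0
```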

Example

cd FreeEvalLM
bash scripts/distill-8b_ifeval.sh

If you want to control the thinking length or try multi-GPU inference, please refer to scripts/distill-8b_ifeval_length-control.sh.

Acknowledgments

We would like to express our sincere gratitude to the following open-source projects and their development teams.

Citation

If you find our work useful, please consider citing our paper:

@article{zhao2025trade,
  title={Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities},
  author={Zhao, Weixiang and Sui, Xingyu and Guo, Jiahe and Hu, Yulin and Deng, Yang and Zhao, Yanyan and Qin, Bing and Che, Wanxiang and Chua, Tat-Seng and Liu, Ting},
  journal={arXiv preprint arXiv:2503.17979},
  year={2025}
}
