
[AAAI26] Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities

Weixiang Zhao1*, Xingyu Sui1*, Jiahe Guo1*, Yulin Hu1*, Yang Deng2,
Yanyan Zhao1, Bing Qin1, Wanxiang Che1, Tat-Seng Chua3, Ting Liu1

1Harbin Institute of Technology    2Singapore Management University    3National University of Singapore   

Warning: This paper contains model outputs that may be considered offensive.

[Paper]    [Project Page]   

News

  • [2025/03/20] We released our source code.

Install and Run

Installation

You need to install easyvllm for decoding with LRMs.

git clone https://github.com/SCIR-SC-Qiaoban-Team/easyvllm
cd easyvllm
pip install -e .

easyvllm depends on vLLM and requires additional installation steps to support LRMs. Please refer to the installation guide at https://github.com/SCIR-SC-Qiaoban-Team/easyvllm.

git clone https://github.com/SCIR-SC-Qiaoban-Team/FreeEvalLM
cd FreeEvalLM
pip install -e .

Additionally, you need to provide your OpenAI API token for the evaluation of certain benchmarks.
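For example, assuming the evaluators read the token from the standard OPENAI_API_KEY environment variable (check your setup if it expects a different name or a config file):

```shell
# Assumption: the OpenAI-based evaluators read the token from OPENAI_API_KEY
export OPENAI_API_KEY="sk-your-token-here"
```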

Run our code

cd FreeEvalLM
python freeEvalLM/src/decode.py

Our pipeline consists of two steps: generation and evaluation. If you need to run them separately, call the corresponding evaluator in the tasks module directly.
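As a generic illustration of this generate-then-evaluate split (the function and variable names below are ours, not FreeEvalLM's actual API):

```python
# Generic sketch of a two-step generate/evaluate pipeline;
# names are illustrative only, not FreeEvalLM's real interface.

def generate(model, queries):
    """Step 1: decode a response for every query."""
    return [model(q) for q in queries]

def evaluate(responses, scorer):
    """Step 2: score previously saved responses with a task evaluator."""
    return sum(scorer(r) for r in responses) / len(responses)

# The two steps can run back-to-back, or evaluate() can be applied
# later to responses loaded from disk.
toy_model = lambda q: q.upper()            # stand-in for an LRM
toy_scorer = lambda r: float(r.isupper())  # stand-in for a task evaluator
responses = generate(toy_model, ["hello", "world"])
accuracy = evaluate(responses, toy_scorer)  # 1.0
```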

Arguments

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_path | - | - | Path to the model |
| save_path | - | - | Path to save results |
| decode_type | Literal['query', 'query_reasoning_ctrl', 'query_force_reasoning_content'] | - | Type of decoding |
| file_path | str | None | Path to the input file |
| task | str | None | Task name. When file_path is None, the data and evaluator are loaded according to the predefined task type; supported tasks are livebench, ifeval, mmlu_pro, strong_reject, wild_jailbreak, and XSTest_S |
| sample | int | -1 | Number of sampled data points. When set to -1, all data are loaded |
| query_keys | str | None | Keys for query extraction |
| response_keys | str | None | Keys for response storage |
| reasoning_keys | str | None | Keys for reasoning extraction and storage |
| tensor_parallel_size | int | 1 | Number of tensor-parallel units in vLLM |
| model_num | int | None | Number of models to load in vLLM |
| port | int | 50000 | Port number for vLLM serve |
| max_model_len | int | None | Maximum model length in vLLM |
| show_log | bool | True | Whether to display vLLM logs |
| timeout | int | 30 | Timeout duration in seconds |
| threads | int | 20 | Number of threads |
| enable_reasoning | bool | False | Enable reasoning when using LRMs |
| reasoning_parser | str | 'deepseek_r1' | Reasoning parser type; supports deepseek_r1, openthinker, and simplescaling |
| system_prompt_file | str | None | Path to the system prompt file |
| chat_template_file | str | None | Path to the chat template file |
| max_new_tokens | int | 8192 | Maximum number of new tokens |
| device_ids | str | None | Device IDs to use |
| reasoning_max_retry | int | 10 | Maximum number of retries when the model's output does not conform to the expected format |
| add_reasoning_prompt | bool | False | Manually add the reasoning token |
| enable_length_ctrl | bool | False | Enable length control |
| reasoning_max_len | int | None | Maximum length for reasoning |
| reasoning_min_len | int | 0 | Minimum length for reasoning |
| reasoning_scale | float | None | Scaling factor for reasoning |
| cut_by_sentence | bool | False | Cut content by sentence for length control |
| force_reasoning_content_keys | str | None | Keys for forced reasoning content, usually consistent with the previously saved reasoning_keys |
| overwrite | bool | False | Overwrite the response and reasoning keys of the input files |
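Putting these together, a single run might look like the following. This is a hypothetical invocation: the model and save paths are placeholders, and the flag spelling simply mirrors the parameter names above — check the script's own help output for the exact interface.

```shell
# Hypothetical invocation; paths are placeholders and the flag
# spelling is assumed to match the parameter names in the table.
python freeEvalLM/src/decode.py \
    --model_path /path/to/your-lrm \
    --save_path results/ifeval.json \
    --decode_type query \
    --task ifeval \
    --tensor_parallel_size 1 \
    --enable_reasoning True \
    --reasoning_parser deepseek_r1 \
    --max_new_tokens 8192
```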

Tasks

We now support testing for MMLU-Pro, IFEval, Live-Bench, StrongReject, WildJailbreak, and XSTest. If you need to add a custom task, please refer to src/task.py to add the dataset and evaluator.
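A custom task generally needs two pieces: a data loader and an evaluator. The skeleton below is only a hypothetical illustration of that pattern — the real interface lives in src/task.py and its class and method names may differ.

```python
# Hypothetical skeleton of a custom task; the actual base class and
# method names in src/task.py may differ.

class MyCustomTask:
    name = "my_benchmark"  # illustrative task identifier

    def load_data(self):
        """Return (query, reference) pairs for the benchmark."""
        return [("What is 2 + 2?", "4")]

    def evaluate(self, responses):
        """Score model responses against the loaded references."""
        data = self.load_data()
        correct = sum(ref in resp for (_, ref), resp in zip(data, responses))
        return correct / len(data)

task = MyCustomTask()
score = task.evaluate(["The answer is 4."])  # 1.0
```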

Example

cd FreeEvalLM
bash scripts/distill-8b_ifeval.sh

If you want to control the thinking length or try multi-GPU inference, please refer to scripts/distill-8b_ifeval_length-control.sh.

Acknowledgments

We would like to express our sincere gratitude to the following open-source projects and their development teams.

Citation

If you find our work useful, please consider citing our paper:

@article{zhao2025trade,
  title={Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities},
  author={Zhao, Weixiang and Sui, Xingyu and Guo, Jiahe and Hu, Yulin and Deng, Yang and Zhao, Yanyan and Qin, Bing and Che, Wanxiang and Chua, Tat-Seng and Liu, Ting},
  journal={arXiv preprint arXiv:2503.17979},
  year={2025}
}
