Code related to tool creation is in code/tool_creation/.
The commands below use the causality book as a running example. First, place the LaTeX source of the book in books/.
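As an illustration, the layout might look like the following (the directory and file names here are hypothetical; match them to your own LaTeX sources):

```
books/
  causality/
    main.tex
    chapter1.tex
    chapter2.tex
    ...
```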
- Extract the book structure (a sketch of this step appears after the list):
python extract_book_structure.py --domain causality
- Generate tools (see the generation sketch after the list):
python initial_tool_generation.py --model_name gpt4o --domain causality
- Validate tools (see the validation sketch after the list):
python validate_tools.py --model_name gpt4o --domain causality --stage unfiltered
- Refine tools:
python refine_tools.py --model_name gpt4o --domain causality
- Validate the refined tools:
python validate_tools.py --model_name gpt4o --domain causality --stage refined
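The extraction step can be pictured as follows. This is a minimal sketch, not the repository's actual implementation: it assumes the book is plain LaTeX and simply collects \chapter and \section headings with a regular expression.

```python
import re
from pathlib import Path

# Minimal sketch: scan LaTeX sources for \chapter{...} and \section{...}
# headings and build a nested structure. The real script may handle many
# more LaTeX constructs; the path below is illustrative.
HEADING = re.compile(r"\\(chapter|section)\{([^}]*)\}")

def extract_structure(book_dir: str) -> list[dict]:
    structure = []
    for tex_file in sorted(Path(book_dir).glob("*.tex")):
        for kind, title in HEADING.findall(tex_file.read_text(encoding="utf-8")):
            if kind == "chapter":
                structure.append({"chapter": title, "sections": []})
            elif structure:  # attach sections to the most recent chapter
                structure[-1]["sections"].append(title)
    return structure

if __name__ == "__main__":
    print(extract_structure("books/causality"))
```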
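Tool generation is essentially one LLM call per book section. The sketch below is only an assumption about the general shape of that call, written against the openai client for illustration; the repository's actual prompts and model wrapper may differ.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt: the real prompt engineering lives in
# initial_tool_generation.py and is not reproduced here.
PROMPT = (
    "You are given a section of a textbook. Implement its key method as a "
    "reusable Python function with a docstring.\n\nSection:\n{section_text}"
)

def generate_tool(section_text: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(section_text=section_text)}],
    )
    return response.choices[0].message.content
```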
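Validation can be thought of as trying to execute each generated tool and recording any failure. Again a sketch under assumptions, not the actual validator:

```python
import traceback

def validate_tool(tool_code: str) -> str | None:
    """Return None if the tool code executes cleanly, else the traceback.

    Sketch only: the real validator likely also runs test cases,
    not just a bare exec() of the tool definition.
    """
    namespace: dict = {}
    try:
        exec(compile(tool_code, "<generated_tool>", "exec"), namespace)
    except Exception:
        return traceback.format_exc()
    return None
```

An error string returned here is exactly the kind of feedback the refinement step (refine_tools.py) would pass back to the model before the second validation round.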
Code related to tool utilization and evaluation is in code/inference/.
- Chapter selection:
python select_chapter.py --model_name gpt4o --domain causality
- Tool selection within the selected chapter:
python select_skills_by_chapter.py --model_name gpt4o --domain causality
- Solution generation (see the prompt-assembly sketch after the list):
python run_tool_0shot.py --model_name gpt4o --domain causality
- Evaluation (see the answer-checking sketch after the list):
python evaluator.py --model_name gpt4o --domain causality --method tool_0shot --force_generate
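Conceptually, the 0-shot tool run injects the selected tools into the prompt ahead of the question. A sketch with hypothetical names; the actual template in run_tool_0shot.py may format tools and instructions differently:

```python
def build_tool_prompt(question: str, tool_codes: list[str]) -> str:
    """Assemble a 0-shot prompt that exposes the selected tools (sketch only)."""
    tools_block = "\n\n".join(tool_codes)
    return (
        "You may use the following Python tools to solve the question.\n\n"
        f"{tools_block}\n\nQuestion: {question}\nAnswer:"
    )
```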
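Answer checking for numeric benchmarks is typically a tolerance comparison. A minimal sketch, assuming answers can often be parsed as floats; the real evaluator.py may also handle multiple-choice and textual answers:

```python
import math

def is_correct(predicted: str, gold: str, rel_tol: float = 1e-2) -> bool:
    """Compare a predicted answer against the gold answer (sketch only).

    Falls back to case-insensitive string comparison when either
    value is not numeric.
    """
    try:
        return math.isclose(float(predicted), float(gold), rel_tol=rel_tol)
    except ValueError:
        return predicted.strip().lower() == gold.strip().lower()
```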
Evaluation questions for causality, physics, and chemistry are in evaluation_data/qrdata_causal.json, evaluation_data/theoremqa_phy.json, and evaluation_data/scibench_chem.json, respectively.
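To inspect the evaluation data, something like the following should work, assuming each file is a JSON array of question records (the exact field names are not documented here):

```python
import json

with open("evaluation_data/qrdata_causal.json", encoding="utf-8") as f:
    questions = json.load(f)
print(len(questions), "causality questions")
print(questions[0])  # inspect the record schema
```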
For causality, please also download the corresponding raw data from the original QRData benchmark: https://github.com/xxxiaol/QRData/blob/main/benchmark/data.zip. Unzip the archive and place the resulting data/ directory under evaluation_data/.
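For instance, after downloading data.zip manually, it can be unpacked in place with the standard library (the target path follows the instruction above):

```python
import zipfile

# Assumes data.zip was downloaded from the QRData repository linked above.
with zipfile.ZipFile("data.zip") as archive:
    archive.extractall("evaluation_data/")
# The archive is expected to contain a top-level data/ directory,
# yielding evaluation_data/data/.
```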
We do not provide the LaTeX files of the reference materials for copyright reasons.