TL;DR: Formally verifying LLM mathematical reasoning using the Lean 4 formal language!
This repository contains the official implementation of our ACL 2025 paper Safe (Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification) and its associated dataset FormalStep.
📃 [Paper] • 💻 [Github] • 🤗 [Dataset]
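For readers new to Lean 4: the core idea is that a natural-language reasoning step is translated into a formal statement whose proof can be machine-checked. A toy illustration (not taken from the paper's pipeline) in plain Lean 4:

```lean
-- A trivially checkable arithmetic claim, closed by definitional reduction.
example : 2 + 2 = 4 := rfl

-- A step-like claim about naturals, closed by a core library lemma.
example (n : Nat) : n ≤ n + 1 := Nat.le_succ n
```

If a step's formalized claim admits no proof (or its negation is provable), that step is flagged as a potential hallucination.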
To run this project, you need to locally deploy both a 7B reasoning model and a 7B prover using vLLM. We recommend using a server with at least 2 NVIDIA GPUs for optimal performance.
The Lean environment setup can be complex, so we strongly recommend using our provided Docker image to simplify configuration. For those interested in the manual setup process, please refer to our Dockerfile and compose.yml. Below is a step-by-step Docker-based configuration guide.
```shell
git clone 'https://github.com/liuchengwucn/Safe.git' && cd Safe
```

Ensure that Docker (with the NVIDIA Container Toolkit) and the Docker Compose plugin are properly installed. Reference: Docker, NVIDIA Container Toolkit and Docker Compose.
```shell
docker info | grep nvidia
```

A successful installation will display `nvidia` under `Runtimes`:

```
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
```
Create a `.env` file with your preferred editor and set `OPENAI_API_KEY` and `OPENAI_BASE_URL` for GPT-4o/GPT-4o-mini API access:
```
OPENAI_API_KEY="sk-123456"
OPENAI_BASE_URL="https://api.openai.com/v1"
```

We strongly recommend pre-downloading model weights and mounting `~/.cache/huggingface` into the Docker container. Use the following Python code to cache models (a Hugging Face CLI login may be required for gated models):
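As a quick sanity check before starting the containers, the `.env` file can be validated with a few lines of standard-library Python. The `parse_env` helper below is illustrative, not part of the repository:

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY="value" lines, skipping blanks and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env

sample = '''
OPENAI_API_KEY="sk-123456"
OPENAI_BASE_URL="https://api.openai.com/v1"
'''
env = parse_env(sample)
assert {"OPENAI_API_KEY", "OPENAI_BASE_URL"} <= env.keys()
```

If the assertion fails, docker compose will start but API calls from inside the container will error out.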
```python
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer

# Reasoning Models
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-math-7b-instruct")
# tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-math-7b-instruct")
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Prover Model
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-Prover-V1.5-RL")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-Prover-V1.5-RL")

# Reward Models
model = AutoModelForCausalLM.from_pretrained("peiyi9979/math-shepherd-mistral-7b-prm")
tokenizer = AutoTokenizer.from_pretrained("peiyi9979/math-shepherd-mistral-7b-prm")
# model = AutoModelForCausalLM.from_pretrained("RLHFlow/Llama3.1-8B-PRM-Deepseek-Data")
# tokenizer = AutoTokenizer.from_pretrained("RLHFlow/Llama3.1-8B-PRM-Deepseek-Data")
# model = AutoModelForSequenceClassification.from_pretrained("RLHFlow/ArmoRM-Llama3-8B-v0.1")
# tokenizer = AutoTokenizer.from_pretrained("RLHFlow/ArmoRM-Llama3-8B-v0.1")
# model = AutoModelForSequenceClassification.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
# tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
```

For online model loading (not recommended), modify `compose.yml` to allow downloads:
```yaml
services:
  deepseek-prover:
    ...
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
      # Comment out for online download
      # - HF_HUB_OFFLINE=1
  reasoning-model:
    ...
    environment:
      - NVIDIA_VISIBLE_DEVICES=2
      # Comment out for online download
      # - HF_HUB_OFFLINE=1
  ...
```

Required datasets: MATH, GSM8K, PRM800K (MATH-500), and CollegeMath. MATH, GSM8K, and CollegeMath are included in the repository; use `download.py` to fetch PRM800K:
```shell
python download.py
```

Specify GPU devices for each service in `compose.yml`. Empirically, each service (the prover and the reasoning model) requires about 40 GB of VRAM.
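As a rough sanity check of the 40 GB figure: a 7B-parameter model in bf16 already needs roughly 13 GiB for the weights alone, and vLLM preallocates most of the remaining budget for the KV cache. A back-of-envelope sketch (the numbers are illustrative, not measured):

```python
params = 7e9          # 7B-parameter model
bytes_per_param = 2   # bf16 precision
weights_gib = params * bytes_per_param / 1024**3
print(f"weights alone: {weights_gib:.1f} GiB")  # → weights alone: 13.0 GiB
# On top of the weights, vLLM reserves a large fraction of GPU memory
# for the KV cache (gpu_memory_utilization defaults to 0.9), so a
# ~40 GB budget per service leaves headroom for long-context decoding.
```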
Our images include pre-configured Python/Lean environments and a cached mathlib. While `docker compose up` pulls these automatically, you can also pull them manually:
```shell
docker pull vllm/vllm-openai:v0.8.5.post1
docker pull ghcr.io/liuchengwucn/safe:1.0.0
```

The mathlib cache will be automatically symlinked into the project directory.
```shell
docker compose up -d
docker compose exec safe bash
```

To reproduce our experiments, execute the following four scripts in order. You can modify the configuration at the beginning of each script, including the reasoning model and prover to be used, as well as the dataset, hyperparameters, and other settings.
- `collect_trace.py` - Collects reasoning traces (train traces for aggregator training, test traces for evaluation)
- `aggregator.py` - Trains the LSTM aggregator
- `benchmark.py` - Evaluates the Safe framework's performance using the test traces
- `benchmark_ensemble.py` - Tests the impact of the hyperparameter alpha on framework performance
If you find our work useful, please consider citing our paper.
```bibtex
@inproceedings{liu-etal-2025-safe,
    title = "Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification",
    author = "Liu, Chengwu and
      Yuan, Ye and
      Yin, Yichun and
      Xu, Yan and
      Xu, Xin and
      Chen, Zaoyu and
      Wang, Yasheng and
      Shang, Lifeng and
      Liu, Qun and
      Zhang, Ming",
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.594/",
    doi = "10.18653/v1/2025.acl-long.594",
    pages = "12171--12186",
    ISBN = "979-8-89176-251-0",
    abstract = "Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that ``the gold standard for supporting a mathematical claim is to provide a proof''. We propose a retrospective, step-aware formal verification framework Safe. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework Safe across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose FormalStep as a benchmark for step correctness theorem proving with 30,809 formal statements. To the best of our knowledge, our work represents the first endeavor to utilize formal mathematical language Lean 4 for verifying content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs."
}
```