Safe (ACL 2025 Main)

TL;DR: Formally verifying LLM mathematical reasoning using the Lean 4 formal language!

The official implementation of our paper Safe (Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification) and its associated dataset, FormalStep.

📃 [Paper] • 💻 [Github] • 🤗 [Dataset]

Configuration Guide

To run this project, you need to locally deploy both a 7B reasoning model and a 7B prover using vLLM. We recommend using a server with at least 2 NVIDIA GPUs for optimal performance.

The Lean environment setup can be complex, so we strongly recommend using our provided Docker image to simplify configuration. For those interested in the manual setup process, please refer to our Dockerfile and compose.yml. Below is a step-by-step Docker-based configuration guide.

Step 1: Clone the repository

git clone 'https://github.com/liuchengwucn/Safe.git' && cd Safe

Step 2: Verify Docker and NVIDIA Container Toolkit installation

Ensure that Docker, the NVIDIA Container Toolkit, and the Docker Compose plugin are properly installed. Reference: Docker, NVIDIA Container Toolkit, and Docker Compose.

docker info | grep nvidia

A successful installation lists nvidia under Runtimes:

Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc

Step 3: Configure environment variables

Create a .env file with your preferred editor and set OPENAI_API_KEY and OPENAI_BASE_URL for GPT-4o/GPT-4o-mini API access.

OPENAI_API_KEY="sk-123456"
OPENAI_BASE_URL="https://api.openai.com/v1"

Step 4 (Optional): Pre-download model weights

We strongly recommend pre-downloading model weights and mounting ~/.cache/huggingface to the Docker container. Use the following Python code to cache models (Hugging Face CLI login may be required for gated models):

from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
# Reasoning Models
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-math-7b-instruct")
# tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-math-7b-instruct")
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Prover Model
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-Prover-V1.5-RL")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-Prover-V1.5-RL")

# Reward Models
model = AutoModelForCausalLM.from_pretrained("peiyi9979/math-shepherd-mistral-7b-prm")
tokenizer = AutoTokenizer.from_pretrained("peiyi9979/math-shepherd-mistral-7b-prm")
# model = AutoModelForCausalLM.from_pretrained("RLHFlow/Llama3.1-8B-PRM-Deepseek-Data")
# tokenizer = AutoTokenizer.from_pretrained("RLHFlow/Llama3.1-8B-PRM-Deepseek-Data")
# model = AutoModelForSequenceClassification.from_pretrained("RLHFlow/ArmoRM-Llama3-8B-v0.1")
# tokenizer = AutoTokenizer.from_pretrained("RLHFlow/ArmoRM-Llama3-8B-v0.1")
# model = AutoModelForSequenceClassification.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
# tokenizer = AutoTokenizer.from_pretrained("Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")

For online model loading (not recommended), modify compose.yml to allow downloads:

services:
  deepseek-prover:
    ...
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
      # Comment out for online download
      # - HF_HUB_OFFLINE=1
      
  reasoning-model:
    ...
    environment:
      - NVIDIA_VISIBLE_DEVICES=2
      # Comment out for online download
      # - HF_HUB_OFFLINE=1
    ...

Step 5: Download datasets

Required datasets: MATH, GSM8K, PRM800K (MATH-500), and CollegeMath. MATH, GSM8K, and CollegeMath ship with the repository; run download.py to fetch PRM800K:

python download.py

Step 6: GPU allocation

Specify GPU devices for each service in compose.yml. In our experience, each model service (the prover and the reasoning model) needs roughly 40 GB of VRAM.
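Before editing compose.yml, you can check which GPUs have enough free memory with standard NVIDIA tooling (exact output depends on your hardware):

```shell
# List each GPU's index, name, and free/total memory so you can pick
# devices with roughly 40 GB free for the prover and reasoning model.
nvidia-smi --query-gpu=index,name,memory.free,memory.total --format=csv
```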

Step 7 (Optional): Pre-pull Docker images

Our images include pre-configured Python/Lean environments and a cached Mathlib build. docker compose up pulls them automatically, but you can also pull them manually:

docker pull vllm/vllm-openai:v0.8.5.post1
docker pull ghcr.io/liuchengwucn/safe:1.0.0

Step 8: Launch containers and enter the development environment

The Mathlib cache is automatically symlinked into the project directory.

docker compose up -d
docker compose exec safe bash
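Once inside the container, you can sanity-check that both vLLM services are reachable. The commands below are a sketch: they assume vLLM's default port 8000 and the service names deepseek-prover and reasoning-model from compose.yml; adjust both if your compose file differs.

```shell
# Query the OpenAI-compatible /v1/models endpoint of each vLLM service.
# Port and service names are assumptions; check compose.yml for yours.
curl -s http://deepseek-prover:8000/v1/models
curl -s http://reasoning-model:8000/v1/models
```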

Paper Reproduction

To reproduce our experiments, run the following four scripts in order. The configuration block at the top of each script lets you choose the reasoning model and prover, the dataset, hyperparameters, and other settings.

  1. collect_trace.py - Collects reasoning traces (train traces for aggregator training, test traces for evaluation)
  2. aggregator.py - Trains the LSTM aggregator
  3. benchmark.py - Evaluates the Safe framework's performance using test traces
  4. benchmark_ensemble.py - Measures the impact of the hyperparameter alpha on framework performance
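benchmark_ensemble.py sweeps the weighting hyperparameter alpha. As a hypothetical illustration only (not the paper's exact formula; see the script for the real logic), an ensemble of this kind can linearly interpolate between a PRM score and a formal-verification score when ranking candidate solutions:

```python
def ensemble_score(prm_score: float, verify_score: float, alpha: float) -> float:
    """Linearly interpolate between a PRM score and a formal-verification score.
    alpha=1.0 uses only the PRM; alpha=0.0 uses only verification.
    (Hypothetical sketch; benchmark_ensemble.py defines the actual combination.)"""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * prm_score + (1.0 - alpha) * verify_score

def best_candidate(candidates, alpha):
    """Pick the candidate with the highest ensemble score.
    `candidates` is a list of (answer, prm_score, verify_score) tuples."""
    return max(candidates, key=lambda c: ensemble_score(c[1], c[2], alpha))
```

Sweeping alpha over a grid (e.g. 0.0 to 1.0 in steps of 0.1) and re-ranking the same candidates at each value is one simple way to measure the hyperparameter's impact.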

Citation

If you find our work useful, please consider citing our paper.

@inproceedings{liu-etal-2025-safe,
    title = "Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification",
    author = "Liu, Chengwu  and
      Yuan, Ye  and
      Yin, Yichun  and
      Xu, Yan  and
      Xu, Xin  and
      Chen, Zaoyu  and
      Wang, Yasheng  and
      Shang, Lifeng  and
      Liu, Qun  and
      Zhang, Ming",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.594/",
    doi = "10.18653/v1/2025.acl-long.594",
    pages = "12171--12186",
    ISBN = "979-8-89176-251-0",
    abstract = "Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that ``the gold standard for supporting a mathematical claim is to provide a proof''. We propose a retrospective, step-aware formal verification framework Safe. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework Safe across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose FormalStep as a benchmark for step correctness theorem proving with 30,809 formal statements. To the best of our knowledge, our work represents the first endeavor to utilize formal mathematical language Lean 4 for verifying content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs."
}
