This repository contains the implementation used in the experiments for the DiffTrust paper, which introduces incoherence as a theoretically grounded proxy for correctness in LLM-based code generation—designed to operate without access to ground-truth implementations or oracles.
Large Language Models (LLMs) have demonstrated strong performance in code generation tasks, yet concerns about confabulation remain—models frequently produce syntactically valid but semantically incorrect programs. In DiffTrust, we propose a principled proxy for correctness called incoherence, which quantifies semantic disagreement between independently sampled model outputs.
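The sketch below illustrates the idea in miniature: sample several candidate implementations, execute them on the same randomly generated inputs, and report the fraction of candidate pairs whose observable behaviour disagrees. All names in this snippet are hypothetical simplifications rather than the repository's API; the actual estimator (see, e.g., difftrust/core/metrics.py and difftrust/fuzzer/) uses richer input generation and equivalence checking.

# Minimal, self-contained sketch of pairwise incoherence via differential
# testing on random inputs. Hypothetical helper names; not the repo's API.
import itertools
import random
from typing import Callable, List, Sequence

def disagree(f: Callable, g: Callable, test_inputs: Sequence) -> bool:
    """True if f and g observably differ on at least one input."""
    for x in test_inputs:
        try:
            out_f = f(*x)
        except Exception as e:
            out_f = ("error", type(e).__name__)
        try:
            out_g = g(*x)
        except Exception as e:
            out_g = ("error", type(e).__name__)
        if out_f != out_g:
            return True
    return False

def incoherence(candidates: List[Callable], test_inputs: Sequence) -> float:
    """Fraction of candidate pairs whose behaviour disagrees (0.0 = fully coherent)."""
    pairs = list(itertools.combinations(candidates, 2))
    if not pairs:
        return 0.0
    return sum(disagree(f, g, test_inputs) for f, g in pairs) / len(pairs)

if __name__ == "__main__":
    # Three sampled "candidates" for an absolute-value task; one is subtly wrong.
    cand_a = lambda n: abs(n)
    cand_b = lambda n: n if n > 0 else -n   # semantically equivalent to cand_a
    cand_c = lambda n: n                    # wrong for negative inputs
    inputs = [(random.randint(-100, 100),) for _ in range(1000)]
    print(incoherence([cand_a, cand_b, cand_c], inputs))  # ~0.67

The underlying intuition is simple: whenever two candidates disagree on an input, at most one of them can be correct, which is the observation the paper's probabilistic framework builds on.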
This repository supports empirical evaluation of our new metrics across two popular benchmarks: HumanEval and MBPP.
Note: This repository is not intended as a general-purpose benchmarking toolkit, but rather as the exact implementation behind our experimental results.
- Incoherence is proposed as an unsupervised, theoretically grounded estimator for correctness.
- A probabilistic framework links incoherence to model error via a provable lower bound.
- Empirical validation across 16 LLMs shows that incoherence alone detects roughly two-thirds of incorrect programs without any false positives, matching oracle-based rankings with Spearman’s ρ ≥ 0.92.
.
├── HumanEval
│   ├── instance.py
│   ├── remove_duplicates.py
│   ├── run.py
│   └── stats.py
├── MBPP
│   ├── instance.py
│   ├── remove_duplicates.py
│   ├── run.py
│   └── stats.py
├── README.md
├── difftrust
│   ├── config.json
│   ├── config.py
│   ├── core
│   │   ├── checking.py
│   │   ├── coder.py
│   │   ├── experiment.py
│   │   ├── function.py
│   │   ├── metrics.py
│   │   ├── refiner.py
│   │   └── specification.py
│   ├── fuzzer
│   │   ├── coverage.py
│   │   └── fuzzer.py
│   ├── generic
│   │   ├── generic_equal.py
│   │   ├── generic_explorer.py
│   │   ├── generic_fuzzer.py
│   │   ├── generic_mutator.py
│   │   └── generic_repr.py
│   ├── llm
│   │   ├── abstract.py
│   │   ├── chat.py
│   │   ├── chatgpt.py
│   │   ├── claude.py
│   │   ├── gemini.py
│   │   ├── open_router.py
│   │   └── open_router_models.json
│   ├── rqs
│   │   ├── ablation.py
│   │   ├── aggregate-plots
│   │   ├── constants.py
│   │   ├── costs_calculation.py
│   │   ├── counter.py
│   │   ├── custom_classes.py
│   │   ├── latex.py
│   │   ├── plots.py
│   │   ├── rq_utils.py
│   │   └── utils.py
│   └── tracing
│       ├── events.py
│       └── tracer.py
└── requirements.txt
Install dependencies via:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Each dataset folder (e.g., MBPP/ or HumanEval/) contains a run.py script that replicates the experimental setup from the paper.
# Run the pointwise incoherence experiment on HumanEval
python HumanEval/run.py

To adjust parameters such as the LLM, the number of candidate functions (nb_candidate), the number of test inputs (nb_sample), or the sampling temperature, edit the corresponding configuration options:
- llm_name: Name of the LLM to use (e.g., "gpt_4", "claude_opus_4", etc.)
- nb_candidate: Number of candidate programs to sample per task (default: 10)
- nb_sample: Number of test inputs per comparison (default: 1000)
- temperature: Sampling temperature for the LLM (default: 0.0 for deterministic outputs)
- timeout: Max execution time per comparison (default: 60 seconds)
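If you prefer to set these values directly rather than through run.py, a sketch like the following can patch difftrust/config.json before a run. This is illustrative only and assumes a flat JSON object keyed by the option names above; check difftrust/config.py for the actual schema.

# Hypothetical sketch: override experiment parameters before launching run.py.
# Assumes difftrust/config.json is a flat JSON object with the keys listed
# above; verify against difftrust/config.py before relying on this.
import json
from pathlib import Path

config_path = Path("difftrust/config.json")
config = json.loads(config_path.read_text())

config.update({
    "llm_name": "gpt_4",      # model identifier, e.g. "claude_opus_4"
    "nb_candidate": 10,       # candidate programs per task
    "nb_sample": 1000,        # test inputs per comparison
    "temperature": 0.0,       # 0.0 for deterministic sampling
    "timeout": 60,            # seconds per comparison
})

config_path.write_text(json.dumps(config, indent=2))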
MIT License. See LICENSE for full text.