Get a coding agent to find bugs in your codebase by mining properties and testing them via Hypothesis.
For the artifacts from the paper, including bug reports and rankings, see the paper directory. Note that the code that was used in the paper is slightly behind what is in the main folder. See paper/README.md for more details.
To see all the bugs our agent found, see our website.
To read the blog post on the Hypothesis website, see here.
The agent is a Claude Code command. You will need to have Claude Code installed to run it. You will need a subscription to Claude Code, or an API key (we recommend an API key if you are running it over a large number of packages, or to reproduce the paper).
The command is contained in the hypo.md file. You will need to place this file in the .claude/commands/ directory, which can either be in ~ or in whichever directory you are running the agent from. The agent can then be invoked with /hypo <target>.
You will need pytest, hypothesis, and the package you are testing installed.
The agent takes one argument, which is the target to test. This can be a file, a function, or a module. If no argument is given, it will test the entire codebase, i.e., the current working directory. You can pass whichever other arguments that Claude Code supports, like the model, permissions, etc.
Example usage:
claude "/hypo numpy"
claude "/hypo statistics.median" --model opusYou can also just start Claude Code, and then invoke the agent.
The run.py script is a wrapper around the agent to test multiple packages, in parallel. It is what was used in the paper. This script does not require any other requirements beyond the standard library (of course, you still need to have Claude Code installed). You need python3 and pip to be in your PATH.
Note that the runner operates at the module level.
The only required argument is the path to a json file containing the packages to test, and which modules to test within each package. It looks like:
{
"pathlib": {
"type": "stdlib",
"modules": ["pathlib"]
},
"numpy": {
"type": "pypi",
"modules": ["numpy"]
}
}The keys in the json file are the package names, either the standard library name or the PyPI name. For standard library packages, specify "stdlib", and for PyPI packages, specify "pypi". This is important so the runner knows how to set up the virtual environment.
The runner takes two optional arguments:
--max-workers: the number of parallel workers to use. Default is 20.--model: the model to use. Default is "opus".--preinstall-workers: the number of parallel workers to use for setting up the virtual environments. Default is 10.
The runner will output all bug reports in the results/ directory.
Example usage:
python run.py packages.jsonIn the example_packages/ directory, there are some example package json files to test:
packages_mini.json: a mini set of modules to test (this took 6 minutes to run, with default settings)packages_10k.json: top 10,000 pypi packages, with the main module and all submodules one level deep
The packages tested in the paper are in the paper/ directory.
The runner sets up virtual environments, with venv, for each package. Standard library packages just use the same virtual environment, and PyPI packages get their own virtual environment. The runner will also install pytest and hypothesis in each virtual environment. It does this in parallel, which is controllable; see the CLI arguments below.
It then then sets up directories, up to a specified number of maximum workers (see CLI arguments below), which is a "sandbox" for the agent to run in. It only has permission to edit files within this sandbox. Each worker directory also contains .claude/commands/hypo.md, so that the agent can run. The runner parallelizes across modules.
Note that the runner also checks if the module has already been tested, and skips it if so. So, you can easily resume a run by just running the runner again.
The runner calls the agent with restricted permissions. It only has permission to read/write/edit files in the sandbox in which it is called, and it also has read permission to the virtual environment, so that it can read the source code of the package. Furthermore, it can only write/edit .py and .md files. The only bash commands it can run are python and pytest. Note that because of how the virtual environments are set up, the Python command will be python. Lastly, it also has access to the Todo and WebFetch tools.
You should still be careful with the runner, because running arbitrary code is dangerous!
In the results/ directory, there will be a directory named after the package. Each of these will have the following structure:
bug_reports/- <all bug reports written by the agent>
logs/claude_call_$id.json<the log of the Claude Code call corresponding to this id>
aux_files/$id- <all other files written by Claude Code during the Claude Code call corresponding to this id, e.g., Python files>
call_mappings.jsonl, with the following format:call_id: the Claude Code call idmodule: the module testedtimestamp: date executedbug_reports: the filename of any bug reports in thebug_reportsdirectory written by this Claude Code callaux_files_dir: the directory containing all files written by the agent during the Claude Code call corresponding to this id
To score the bug reports, you can run python scoring.py results/. This uses the rubric contained in that file, and passes it to Claude (not Claude Code, just the Claude API). This script outputs a CSV file containing the scores for each bug report, as well as the reasoning.
It takes the following arguments:
--retry-failures: if set, it will retry the bug reports that failed to score. This requires the CSV file to already exist, as it checks for failed scores in the CSV file.reports_dir: the directory containing the bug reports to score. Default is "results/".--max-workers: the number of parallel workers to use. Default is 20.--model: the model to use. Default is "claude-opus-4-1" (note model names are different when using the Claude API directly)--csv-path: the path to the CSV file to write the results to. Default is "scoring_results.csv".
Example usage:
python scoring.py results/