Documentary generates docstrings for Python functions; these docstrings help both developers and LLMs perform software engineering tasks more effectively.
- Python 3.10+
- OpenAI API Key (for LLM calls)
- Docker (for running the evaluations)
- Gemini API Key (for running RQ2)
In the documentary directory, run:
# Create a virtual environment
python -m venv .venv
# Activate the virtual environment
source .venv/bin/activate
# Install requirements
pip install -r requirements.txt
# Install Documentary
pip install .

To generate documentation for a single function, create a config file in the format of example_config.json.
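We do not reproduce the exact schema of example_config.json here; consult the copy shipped in the repository. Purely as an illustration, a config written from Python might look like the sketch below, where every key name is a hypothetical placeholder rather than the real schema:

# Hypothetical sketch: the real keys are defined by example_config.json in the repository.
import json

config = {
    "file": "example_file.py",      # placeholder: path to the target Python file
    "model": "openai:gpt-5-nano",   # placeholder: LLM used for generation
}

with open("my_config.json", "w") as f:
    json.dump(config, f, indent=2)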
Then run:

python -m documentary.cli isomorphisize --config example_config.json

This will output the generated docstring for the first function in example_file.py to the console. To replace the docstring in the file, run:
python -m documentary.cli isomorphisize_and_replace --config example_config.json --write

This repository contains the implementation of Documentary in documentary/ and the scripts to reproduce our evaluation results in evaluation/.
Documentary is the tool that generates equivalent documentation for Python functions. It is implemented as a Python package and can be used as a command-line tool or as a library.
The package has three main components:
- cli.py: The command-line interface for Documentary.
- utils.py: Utility functions for LLM queries and simple code modifications.
- main.py: The main algorithm for generating equivalent documentation.
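The library entry points in main.py are not documented here, so their exact signatures are not shown. If you only need to script the tool, one option is to drive the documented CLI from Python; the sketch below simply wraps the isomorphisize command from the Usage section:

# Sketch: invoke the documented CLI from Python and capture the generated docstring.
import subprocess

result = subprocess.run(
    ["python", "-m", "documentary.cli", "isomorphisize", "--config", "example_config.json"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # the docstring that the CLI prints to the console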
In the evaluation/ directory there are three subdirectories:
- equivalence/: evaluations for generating equivalent documentation, including scripts to reproduce RQ1, RQ2, RQ3, and RQ4.
- output_prediction/: evaluations for predicting function outputs, including scripts to reproduce RQ6.
- user_study/: evaluations for the user study, including the source code of the web application used in the study and scripts to reproduce RQ5.
We provide three levels for reproducing our evaluation results, depending on the execution time and LLM costs involved:
- Level 1: Inspecting and visualizing the results of our executions (no LLM calls, very low execution time).
- Level 2: Running Documentary on a single function (low execution time and low LLM costs).
- Level 3: Running the full evaluation (high execution time and high LLM costs).
For level 1, we have published the logs and LLM interactions of our evaluation executions as a record on Zenodo. You can download the data from Zenodo and then follow the instructions marked with "level 1" below.
For level 2, follow the instructions in the "Setup" and "Usage" sections above.
For level 3, first follow the instructions in the "Setup" section above, then follow the instructions below.
In the .env file at the top level, set the OPENAI_API_KEY and GEMINI_API_KEY environment variables (the Gemini API key is optional and only used for RQ2).
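For reference, the .env file only needs these two entries; the values below are placeholders for your own keys:

OPENAI_API_KEY=<your OpenAI API key>
# Optional, only needed for RQ2
GEMINI_API_KEY=<your Gemini API key>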
- Build the Docker image (level 3):

bash evaluation/equivalence/docker_build.sh

- Run the evaluation (level 3):
bash evaluation/equivalence/docker_run.sh <instances>

This will run the evaluation with <instances> parallel containers. If <instances> is 1, the container will not be detached, so you can see the logs in the terminal.
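For example, to run the evaluation with four parallel containers (the number is only illustrative):

bash evaluation/equivalence/docker_run.sh 4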
The results will be stored in the evaluation/equivalence/results directory.
- Summarize the results (levels 1 and 3):
python evaluation/equivalence/quick_results.py

When running the command above in step 2, if the GEMINI_API_KEY is set in the environment, the evaluation will be run with openai:gpt-5-nano and gemini:gemini-2.5-flash-lite.
Otherwise, it will only be run with openai:gpt-5-nano.
(level 3) In step 2 above, you can instead use

bash evaluation/equivalence/docker_run.sh <instances> <iterations> <size_limit>

which will run the evaluation with <instances> parallel containers, a maximum of <iterations> iterations per function, and a size limit of <size_limit> for the generated docstrings.
The default values are iterations=5 and size_limit=1.0.
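For example, the following runs four parallel containers while spelling out the default limits explicitly (the instance count is only illustrative):

bash evaluation/equivalence/docker_run.sh 4 5 1.0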
(levels 1 and 3) When summarizing the results (step 3 above), you can pass --costs, which prints the cost values and plots the cost distribution.
python evaluation/equivalence/quick_results.py --costs

The results of the user study are available in evaluation/user_study/results/.
(levels 1 and 3) A summary of the results can be generated with:
python evaluation/user_study/analysis/calculate_treatment_stats.py

You can also run the user study to collect new data.
To do this, first set values for the environment variables in evaluation/user_study/study_interface/.env.example and then follow the instructions in evaluation/user_study/study_interface/README.md.
- Build the Docker image (level 3):

bash evaluation/output_prediction/docker_build_experiment.sh

- Run the evaluation (level 3):
bash evaluation/output_prediction/docker_run_experiment.sh

The results will be stored in evaluation/output_prediction/workspace/results_*.json.
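The schema of these JSON files is not described here, but a quick way to peek at them, assuming they are plain JSON, is:

# Sketch: list the result files and print their top-level structure without assuming a schema.
import glob
import json

for path in sorted(glob.glob("evaluation/output_prediction/workspace/results_*.json")):
    with open(path) as f:
        data = json.load(f)
    print(path, type(data).__name__)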