Documentary generates docstrings for Python functions; these docstrings help both developers and LLMs perform software engineering tasks more effectively.
- Python 3.10+
- OpenAI API Key (for LLM calls)
- Docker (for running the evaluations)
- Gemini API Key (for running RQ2)
In the documentary directory, run:
# Create a virtual environment
python -m venv .venv
# Activate the virtual environment
source .venv/bin/activate
# Install requirements
pip install -r requirements.txt
# Install Documentary
pip install .

To generate documentation for a single function, create a config file in the format of example_config.json.
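We do not reproduce the exact schema of example_config.json here; consult the copy shipped in the repository. Purely as an illustration, a config written from Python might look like the sketch below, where every key name is a hypothetical placeholder rather than the real schema:

# Hypothetical sketch: the real keys are defined by example_config.json in the repository.
import json

config = {
    "file": "example_file.py",      # placeholder: path to the target Python file
    "model": "openai:gpt-5-nano",   # placeholder: LLM used for generation
}

with open("my_config.json", "w") as f:
    json.dump(config, f, indent=2)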
Then run:

python -m documentary.cli isomorphisize --config example_config.json

This will output the generated docstring for the first function in example_file.py to the console. To replace the docstring in the file, run:
python -m documentary.cli isomorphisize_and_replace --config example_config.json --write

This repository contains the implementation of Documentary in documentary/ and the scripts to reproduce our evaluation results in evaluation/.
Documentary is the tool that generates equivalent documentation for Python functions. It is implemented as a Python package and can be used as a command-line tool or as a library.
The package has three main components:
- cli.py: The command-line interface for Documentary.
- utils.py: Utility functions for LLM queries and simple code modifications.
- main.py: The main algorithm for generating equivalent documentation.
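The library entry points in main.py are not documented here, so their exact signatures are not shown. If you only need to script the tool, one option is to drive the documented CLI from Python; the sketch below simply wraps the isomorphisize command from the Usage section:

# Sketch: invoke the documented CLI from Python and capture the generated docstring.
import subprocess

result = subprocess.run(
    ["python", "-m", "documentary.cli", "isomorphisize", "--config", "example_config.json"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # the docstring that the CLI prints to the console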
In the evaluation/ directory there are three subdirectories:
- equivalence/: evaluations for generating equivalent documentation, including scripts to reproduce RQ1, RQ2, RQ3, and RQ4.
- output_prediction/: evaluations for predicting function outputs, including scripts to reproduce RQ6.
- user_study/: evaluations for the user study, including the source code of the web application used in the study and scripts to reproduce RQ5.
We provide three levels for reproducing our evaluation results, depending on the execution time and LLM costs involved:
- Level 1: Inspecting and visualizing the results of our executions (no LLM calls, very low execution time).
- Level 2: Running Documentary on a single function (low execution time and low LLM costs).
- Level 3: Running the full evaluation (high execution time and high LLM costs).
For level 1, we have published the logs and LLM interactions of our evaluation executions as a record on Zenodo. You can download the data from Zenodo and then follow the instructions marked with "level 1" below.
For level 2, follow the instructions in the "Setup" and "Usage" sections above.
For level 3, first follow the instructions in the "Setup" section above, then follow the instructions below.
In the .env file at the top level, set the OPENAI_API_KEY and GEMINI_API_KEY environment variables (the Gemini API key is optional and only used for RQ2).
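For reference, the .env file only needs these two entries; the values below are placeholders for your own keys:

OPENAI_API_KEY=<your OpenAI API key>
# Optional, only needed for RQ2
GEMINI_API_KEY=<your Gemini API key>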
- Build the Docker image (level 3):

bash evaluation/equivalence/docker_build.sh

- Run the evaluation (level 3):
bash evaluation/equivalence/docker_run.sh <instances>

This will run the evaluation with <instances> parallel containers. If <instances> is 1, the container will not be detached, so you can see the logs in the terminal.
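For example, to run the evaluation with four parallel containers (the number is only illustrative):

bash evaluation/equivalence/docker_run.sh 4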
The results will be stored in the evaluation/equivalence/results directory.
- Summarize the results (levels 1 and 3):
python evaluation/equivalence/quick_results.py

When running the command above in step 2, if the GEMINI_API_KEY is set in the environment, the evaluation will be run with openai:gpt-5-nano and gemini:gemini-2.5-flash-lite.
Otherwise, it will only be run with openai:gpt-5-nano.
(level 3) In step 2 above, you can instead use

bash evaluation/equivalence/docker_run.sh <instances> <iterations> <size_limit>

which will run the evaluation with <instances> parallel containers, a maximum of <iterations> iterations per function, and a size limit of <size_limit> for the generated docstrings.
The default values are iterations=5 and size_limit=1.0.
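For example, the following runs four parallel containers while spelling out the default limits explicitly (the instance count is only illustrative):

bash evaluation/equivalence/docker_run.sh 4 5 1.0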
(levels 1 and 3) When summarizing the results (step 3 above), you can pass --costs, which prints the cost values and plots the cost distribution.
python evaluation/equivalence/quick_results.py --costs

The results of the user study are available in evaluation/user_study/results/.
(levels 1 and 3) A summary of the results can be generated with:
python evaluation/user_study/analysis/calculate_treatment_stats.py

You can also run the user study to collect new data.
To do this, first set values for the environment variables in evaluation/user_study/study_interface/.env.example and then follow the instructions in evaluation/user_study/study_interface/README.md.
- Build the Docker image (level 3):

bash evaluation/output_prediction/docker_build_experiment.sh

- Run the evaluation (level 3):
bash evaluation/output_prediction/docker_run_experiment.sh

The results will be stored in evaluation/output_prediction/workspace/results_*.json.
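The schema of these JSON files is not described here, but a quick way to peek at them, assuming they are plain JSON, is:

# Sketch: list the result files and print their top-level structure without assuming a schema.
import glob
import json

for path in sorted(glob.glob("evaluation/output_prediction/workspace/results_*.json")):
    with open(path) as f:
        data = json.load(f)
    print(path, type(data).__name__)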