How to Use LLMs for Dynamic Documentation

The other day I revisited a SQL query I’d written a while ago, in order to adapt it to the latest version of the GitHub plugin. The minor adjustments took longer than they should have. Why? I’d forgotten how the query works! And of course, I hadn’t written any comments; it seemed simple at the time, right?
Here’s the original query. It reports recent commits for a set of Steampipe plugins.
```sql
with repos as (
  select
    name_with_owner as repo_name,
    'author-date:>' || $2 || ' repo:' || name_with_owner as query
  from
    github_search_repository
  where
    query = $1 || ' in:name'
)
select
  r.repo_name,
  c.commit -> 'author' ->> 'name' as author,
  substring(c.commit -> 'committer' ->> 'date' from 1 for 10) as date,
  c.commit ->> 'message' as message
from
  repos r
join
  github_search_commit c
on
  r.repo_name = c.repository_full_name
  and r.query = c.query
order by
  date desc
```
Here’s how it works. If you fill in the $1 and $2 parameters with the values turbot/steampipe and 2023-10-04, the with repos CTE (common table expression) runs this query.
```sql
select
  g.name_with_owner as repo_name,
  'turbot/steampipe in:name' as query
from
  github_search_repository g
where
  query = 'turbot/steampipe in:name'
```
github_search_repository is a table, provided by the GitHub plugin, that encapsulates GitHub’s syntax for searching repositories. When you write a SQL query that uses the table, you’re required to provide a WHERE clause with a qualifier called query and a value that is a piece of query syntax for searching GitHub repositories.
At the moment, this query returns 208 rows like this.
```
+-----------------------------------+-----------------------------------------------+
| repo_name                         | query                                         |
+-----------------------------------+-----------------------------------------------+
| turbot/steampipe-plugin-ldap      | author-date:>2023-10-04 repo:turbot/steampipe |
| turbot/steampipe-plugin-bitbucket | author-date:>2023-10-04 repo:turbot/steampipe |
| turbot/steampipe-plugin-alicloud  | author-date:>2023-10-04 repo:turbot/steampipe |
...
```
The query column contains another piece of GitHub search syntax. In this case, it’s the required qualifier for a different Steampipe table, github_search_commit, which encapsulates the GitHub syntax for searching commits. Here’s a standalone version of a query against this table.
```sql
select
  *
from
  github_search_commit
where
  repository_full_name = 'turbot/steampipe-plugin-ldap'
  and query = 'author-date:>2023-10-04 repo:turbot/steampipe'
```
So we’ve got two different query qualifiers in play.
| used in | table | example qualifier |
|---|---|---|
| the with repos CTE | github_search_repository | where query = 'turbot/steampipe in:name' |
| the body of the main query | github_search_commit | where query = 'author-date:>2023-10-04 repo:turbot/steampipe' |
Now let’s focus on the body of the main query.
```sql
select
  r.repo_name,
  c.commit -> 'author' ->> 'name' as author,
  substring(c.commit -> 'committer' ->> 'date' from 1 for 10) as date,
  c.commit ->> 'message' as message
from
  repos r
join
  github_search_commit c
on
  r.repo_name = c.repository_full_name
  and r.query = c.query
order by
  date desc
```
There are no WHERE clauses in play here, but the same qualifiers are still required; they can also be provided by way of a JOIN.
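Conceptually, the join conditions stand in for those explicit qualifiers. Here's a minimal sketch of what the join amounts to for a single row of the repos CTE, reusing the example values shown earlier:

```sql
-- Sketch only: for one row of the repos CTE, the join conditions behave like
-- explicit qualifiers on github_search_commit; the real query does this for
-- every repo row via the join.
select
  commit -> 'author' ->> 'name' as author,
  commit ->> 'message' as message
from
  github_search_commit
where
  repository_full_name = 'turbot/steampipe-plugin-ldap'        -- from: r.repo_name = c.repository_full_name
  and query = 'author-date:>2023-10-04 repo:turbot/steampipe'  -- from: r.query = c.query
```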
So there are two levels at which GitHub search syntax is dynamically built, as the sketch after this list makes concrete.
- to find GitHub repositories matching steampipe-plugin
- for each of the 200+ repositories thus found, to find commits after a given date
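Here is that sketch, filling in the example values used throughout this walkthrough and picking one of the found repos purely as an illustration:

```sql
-- Sketch only: the two dynamically built GitHub search strings, using the
-- example values $1 = 'turbot/steampipe' and $2 = '2023-10-04'.
select
  'turbot/steampipe' || ' in:name' as repo_search,   -- level 1: passed to github_search_repository
  'author-date:>' || '2023-10-04'
    || ' repo:' || 'turbot/steampipe-plugin-ldap'
    as commit_search                                 -- level 2: passed to github_search_commit, once per repo
```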
Now that I’ve written out the explanation, it hardly seems simple at all. If you could only write a single comment to help clarify things, what would it be?
The Most Important Single Comment
When I first showed the query to ChatGPT it wasn’t to ask for documentation, but rather to resolve a name conflict I’d introduced when adapting to some changed column names in the new version of the GitHub plugin. Nevertheless, the LLM provided two critical improvements — unrelated to those column changes — that I hadn’t even asked for.
```
+ 'author-date:>' || $2 || ' repo:' || g.name_with_owner as commit_query

- and r.query = c.query
+ and r.commit_query = c.query  -- use constructed commit_query
```
That was exactly right. The query column produced by the CTE shouldn’t have shared the same name as the query qualifier used in the body of the query. This wasn’t a syntactic problem but it was a conceptual disaster. Renaming the CTE’s query column to commit_query signals that it’s distinct from the query qualifier: a different kind of thing that powers the github_search_commit query.
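Applied to the original query, the suggestion looks roughly like this. This is my reconstruction with the rename folded in, not ChatGPT's verbatim output; only the alias in the CTE and the join condition change.

```sql
with repos as (
  select
    name_with_owner as repo_name,
    'author-date:>' || $2 || ' repo:' || name_with_owner as commit_query
  from
    github_search_repository
  where
    query = $1 || ' in:name'
)
select
  r.repo_name,
  c.commit -> 'author' ->> 'name' as author,
  substring(c.commit -> 'committer' ->> 'date' from 1 for 10) as date,
  c.commit ->> 'message' as message
from
  repos r
join
  github_search_commit c
on
  r.repo_name = c.repository_full_name
  and r.commit_query = c.query  -- the constructed commit search meets the table's query qualifier
order by
  date desc
```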
This is the kind of tacit knowledge transfer discussed in Learning While Coding: How LLMs Teach You Implicitly. It was a bonus outcome that I didn’t even ask for.
I think this is a powerful prompt:
If you could add just one comment to clarify this uncommented code, what would it be? And why?
I put the question to both Copilot Chat and Sourcegraph Cody.
Copilot
“If I were to add just one comment to the query to clarify how it works, I would add a comment above the WITH clause to explain what it does.”
```sql
-- This CTE retrieves the top 10 repositories that match the search query
```
Bzzzt. Wrong answer! Nothing in the query selects a top 10.
Cody
“If I were to add one explanatory comment to this query, it would be here:
```sql
-- Construct dynamic commit search per repo
'author-date:>' || $2 || ' repo:' || g.name_with_owner as query
```
The reason I would add a comment here is because this line is constructing a custom search query for commits for each repository.
This is a key part of the query — it allows searching commits within the date range specifically for each repo.
Without this dynamic construction, we wouldn’t be able to filter commits effectively per repo.
So adding a quick comment explaining that this line is creating a custom commit search per repo helps clarify why it is needed and how it works with the rest of the query to connect repos to matching commits.”
Exactly right. Well done Cody!
Dynamic Documentation
I understand why programmers often resist writing comments. Laziness? Sure, but sometimes laziness is a virtue. The computer doesn't care about comments; it only cares about code. So if you can write code in a way that works for the computer while also helping the programmer understand it, that's a great start. The code itself is already a maintenance burden; comments that can easily drift from the source of truth become an additional burden.
Still, I’ve always been sympathetic to the idea of literate programming. A year ago, in Literate Programming in Go, I described a project to improve the documentation of the Steampipe plugin SDK. It was interesting to learn that Go source code can embed wiki-like documentation, and we made some headway toward enhancing the otherwise uninspiring docs generated from Go source. But we haven’t made further progress, and I’m keen to take another run at it in the LLM era.
For starters, I’d revisit the overview docs that we wrote to explain the high-level components and concepts. For that, I think Cody will serve us well, once it has indexed github.com/turbot/steampipe-plugin-sdk. For these overview docs, it will be worthwhile to iterate with LLMs to create documentation that will become a permanent part of the code and will be maintained accordingly.
But for function- and line-level comments, I’m now wondering whether a dynamic approach may sometimes (or often!) turn out to be best. I have never found mechanically-generated function-level comments to be particularly useful. But we have new machines to partner with now. Will the comments they can write dynamically be useful enough to avoid baking in function- and line-level documentation that’s liable to drift from the source of truth?
This isn’t an either/or question. We’re entering a phase in which both methods will coexist. But my hunch is that we’re about to see a fascinating new twist on the old idea of literate programming. Some explanations can, will, and should be written by code authors alone, or by those authors in partnership with LLMs. Others can, will, and should be conjured dynamically by code readers who ask LLMs for explanations on the fly.