Puzzling over the Postgres Query Planner with LLMs

Steampipe plugins read APIs into Postgres foreign tables. Each plugin has a home page that documents the tables it provides and, for each table, shows examples of SQL queries that use it. For example, here are the tables provided by the Mastodon plugin, and here’s the page for the mastodon_follower table, which includes two sample queries. When I recently tried one of those, Count followers by month of account creation, it returned zero rows. Why? I’d hardcoded my own account ID, which then changed when I moved from mastodon.social to mastodon.coop. Of course it was a bad idea to hardcode the account ID; the query should have looked up the ID of my account, or of anyone else running the query, from the mastodon_my_account table.
That seemed like an easy fix, but before tackling it I decided to dig into another bit of housekeeping. The example queries are supposed to be copy/paste/run artifacts. If you’ve installed a plugin, and authenticated to its API, you should be able to run all the examples and see results. But there wasn’t an automated way to do that, and I’d been meaning to create a test runner for examples, so I diverted to that task.
Web Scraping with LLM Assistance
The obvious approach was to lift the queries from the examples page and run them locally in Steampipe. Despite the proliferation of APIs, there always seem to be reasons to write web scrapers. That’s something I used to enjoy but, after all these years doing it, the thrill is gone. I just want to get the job done as quickly and painlessly as possible. This was my first scraper of the LLM era and, sure enough, my team of assistants made the chore impressively quick and painless.
Rather than scrape the web pages at hub.steampipe.io, I decided to go upstream to the Markdown files that are the canonical sources for the example pages — for example, https://github.com/turbot/steampipe-plugin-mastodon/blob/main/docs/tables/mastodon_account.md. Those pages are delivered to a web client, I learned, as JSON payloads. Deep in each payload, there’s a key, richText, which contains the HTML markup for the examples page. The richText object contains one or (usually) several examples. The scraper needed to capture the title for each example, and the SQL code that the test runner would run.
I’m a longtime fan of Leonard Richardson’s wonderful BeautifulSoup library. But I’ve paid my dues, and I don’t want to write code like this anymore — I want to coax an LLM assistant into writing it for me.
import re
from bs4 import BeautifulSoup

def examples_from_rich_text(rich_text):
    extracted_data = []
    # drop any stray Markdown code fences before parsing the HTML
    rich_text_cleaned = re.sub(r'^```.*$', '', rich_text, flags=re.MULTILINE)
    soup = BeautifulSoup(rich_text_cleaned, 'html.parser')
    # the H1 names the table, e.g. "Table: mastodon_account"
    table_tag = soup.find('h1')
    table_name = table_tag.get_text(strip=True).replace('Table: ', '') if table_tag else 'Unknown Table'
    # each example has an H3 title followed by a highlighted SQL block
    h3_tags = soup.find_all('h3')
    titles = [h3.get_text(strip=True) for h3 in h3_tags]
    query_blocks = soup.find_all('div', class_='highlight highlight-source-sql notranslate position-relative overflow-auto')
    for title, query_block in zip(titles, query_blocks):
        query_text = query_block.find('pre').get_text()
        query_text = re.sub(r'\n\s*', ' ', query_text)  # flatten the query to one line
        extracted_data.append((table_name, title, query_text.strip()))
    return extracted_data
Reviewing the ChatGPT transcript I can see plenty of coaxing. To find the richText key in the JSON object, it initially tried:
rich_text = data.get('blob').get('richText')
I had to point out that the actual path to the key is data -> payload -> blob -> richText, which then got us to the correct accessor.
rich_text = data.get('payload').get('blob').get('richText')
The lesson here, one that I keep learning, is to be as explicit as possible. I could see that path in the debugger, but ChatGPT can’t (yet); so the optimal human-machine partnership here was for me to provide the path and let the LLM worry about constructing the JSON accessor.
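For the record, here’s roughly how that fetch step can look in isolation. This is a sketch rather than the code from my transcript; the Accept: application/json request header is my assumption about how to ask GitHub for the JSON document, while the payload -> blob -> richText path is the one confirmed above.

import requests

# Sketch of the fetch step. Assumption: GitHub returns the JSON document when
# the request asks for application/json; only the payload -> blob -> richText
# path is confirmed above.
def fetch_rich_text(doc_url):
    response = requests.get(doc_url, headers={'Accept': 'application/json'})
    response.raise_for_status()
    data = response.json()
    return data.get('payload', {}).get('blob', {}).get('richText')

rich_text = fetch_rich_text(
    'https://github.com/turbot/steampipe-plugin-mastodon/blob/main/docs/tables/mastodon_account.md'
)
examples = examples_from_rich_text(rich_text)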
A similar thing happened when I asked it to match and extract titles. In response to “Please also find titles,” ChatGPT proposed:
titles = soup.find_all('h3', {'id': lambda x: x and x.startswith('user-content-')})
When that didn’t work I asked more explicitly.
The pattern to search for titles is: <h3 tabindex="-1" id="user-content-list-artifacts" dir="auto"><a class="heading-link" href="#list-artifacts">List artifacts<svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true">
Again the target was something I could see and report to the LLM, so it could then work out the details of the BeautifulSoup accessor. And likewise for the queries we needed to capture. In the Before Time, I had to identify the targets for these accessors and then work out the JSON or BeautifulSoup or regex idioms required to match those targets. Now, I point an assistant at the targets and am happy to let it figure out the idioms.
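With the examples extracted, the runner itself doesn’t need to be elaborate. Here’s a minimal sketch, not the runner I actually built: it assumes Steampipe’s local service is listening on its default Postgres endpoint (localhost:9193, database and user both named steampipe) and uses psycopg2 to run each example and report what comes back.

import psycopg2

def run_examples(examples, password):
    # examples is the list of (table_name, title, query) tuples produced by
    # examples_from_rich_text above
    conn = psycopg2.connect(
        host='localhost', port=9193,
        dbname='steampipe', user='steampipe', password=password
    )
    conn.autocommit = True  # keep going after a failing example
    results = []
    with conn.cursor() as cur:
        for table_name, title, query in examples:
            try:
                cur.execute(query)
                rows = cur.fetchall()
                results.append((table_name, title, len(rows), None))
            except Exception as e:
                results.append((table_name, title, None, str(e)))
    conn.close()
    return results

# report each example's title and whether it returned rows
for table_name, title, count, error in run_examples(examples, password='...'):
    status = error if error else f'{count} rows'
    print(f'{table_name} | {title} | {status}')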
Learning While Coding, Revisited
In Learning while coding: How LLMs teach you implicitly, I showed several examples of things I learned as a by-product of coding exercises. When you work with a human partner, some of what you learn is explicit: your partner knows something and consciously transmits it to you. But there’s also the transmission of tacit knowledge. You pick up things from your human partner that they may not even know that they know, and aren’t aware of teaching you.
This kind of tacit learning happens all the time when I work with LLMs, and the function shown above has a nice example of it. The function makes two pattern-matching passes over the examples: one for titles and the other for queries. It then walks the two lists in parallel using a technique I’d not known about: zip for parallel iteration. If I’d written the function myself, I’d have done this less elegantly by indexing into the parallel arrays. I could, of course, have asked any search engine: “How do I iterate over two lists in Python?” The answer is easy to find. But we are creatures of habit, and while I’d like to believe I’d have asked the question and found a better answer than my lazy default, that probably wouldn’t have happened.
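To make the contrast concrete, here it is in miniature, with made-up lists standing in for the scraper’s titles and query blocks.

# Illustrative only: two made-up parallel lists
titles = ['List artifacts', 'Count followers by month of account creation']
queries = ['select ... from ...', 'with data as (...) select ...']

# the idiom ChatGPT reached for: zip walks both lists in lockstep
for title, query in zip(titles, queries):
    print(title, '=>', query)

# my lazy default: indexing into the parallel arrays
for i in range(len(titles)):
    print(titles[i], '=>', queries[i])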
As I worked with ChatGPT to write that function, I was in a teachable moment. When ChatGPT implicitly asked and answered a question I wouldn’t have thought to ask — as a side-effect of solving a coding problem — I learned about the zip function in the best imaginable way.
Using the Test Runner
With the test runner in hand, I returned to the original task: convert a query that hardcodes a Mastodon account ID into a query that uses the mastodon_my_account table to provide the ID of the authenticated Mastodon user. The solution looked easy. Just replace the hardcoded ID with a subquery, right?
with data as (
  select
    to_char(created_at, 'YYYY-MM') as created
  from
    mastodon_follower
  where
    followed_account_id::text = (
      select id::text from mastodon_my_account limit 1) -- was '57523'
)
select
  created,
  count(*)
from
  data
group by
  created
order by
  created;
But the query still returned zero rows. How can this be? I can prove the subquery works.
select id from mastodon_my_account;
+--------------------+
| id                 |
+--------------------+
| 110147269591397842 |
+--------------------+
And I can prove that this hardcoded value yields the expected results.
with data as (
  select
    to_char(created_at, 'YYYY-MM') as created
  from
    mastodon_follower
  where
    followed_account_id::text = '110147269591397842'
)
select
  created,
  count(*)
from
  data
group by
  created
order by
  created;
+---------+-------+
| created | count |
+---------+-------+
| 1970-01 |     1 |
| 2016-10 |     4 |
| 2016-11 |    41 |
The two queries are logically identical, so how can the first one fail?
A Query Planning Conundrum
Let’s look at the EXPLAIN ANALYZE output.
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate (cost=70000000.06..70000000.11 rows=3 width=40) (actual time=890.370..890.374 rows=0 loops=1)
   Group Key: (to_char(mastodon_follower.created_at, 'YYYY-MM'::text))
   -> Sort (cost=70000000.06..70000000.07 rows=3 width=32) (actual time=890.366..890.369 rows=0 loops=1)
        Sort Key: (to_char(mastodon_follower.created_at, 'YYYY-MM'::text))
        Sort Method: quicksort Memory: 25kB
        -> Limit (cost=10000000.00..70000000.01 rows=3 width=32) (actual time=890.335..890.338 rows=0 loops=1)
             InitPlan 1 (returns $0)
               -> Limit (cost=0.00..10000000.00 rows=1 width=100) (never executed)
                    -> Foreign Scan on mastodon_my_account (cost=0.00..10000000000000.00 rows=1000000 width=100) (never executed)
             -> Foreign Scan on mastodon_follower (cost=0.00..20000000002500.00 rows=1000000 width=32) (actual time=890.326..890.327 rows=0 loops=1)
                  Filter: (followed_account_id = $0)
 Planning Time: 19.165 ms
 Execution Time: 893.598 ms
The foreign scan was never executed? How can that be? While puzzling over that question, I worked with my team of assistants to come up with an alternative query strategy. This one works as expected.
with my_account_id as (
  select
    id::text
  from
    mastodon_my_account
  limit 1
)
select
  to_char(mf.created_at, 'yyyy-mm') as created,
  count(*)
from
  mastodon_follower mf
join
  my_account_id mai
on
  mf.followed_account_id::text = mai.id
group by
  created
order by
  created
+---------+-------+
| created | count |
+---------+-------+
| 1970-01 |     1 |
| 2016-10 |     4 |
| 2016-11 |    41 |
Why the difference? Short answer: I don’t know, and I’ve yet to find anybody who does know. The Postgres query planner is a complex and mysterious beast even under normal circumstances, and foreign tables are unusual. Steampipe users have reported a number of these puzzling outcomes. There are always workarounds. You might need to switch from a subquery to a direct join, as seen here. It might help to define a CTE using the MATERIALIZED keyword, e.g. with data as materialized ( … ). In some cases adding a seemingly unnecessary ORDER BY to a CTE makes a difference. But … why?
I created a suite of examples to compare the two strategies shown here — subquery vs join — and put the question to my team of assistants. You can see the examples in this gist, where I prompted my assistants like so.
Consider these 5 cases. It looks like a join strategy against a foreign table makes it more likely the foreign table will be used than a where strategy. First, do you agree the evidence suggests that? Second, if so, why?
The consensus is that the join strategy seems more reliable, but I still don’t know why. An excursion into the steampipe-postgres-fdw code didn’t shed any light. I tried twiddling all the available knobs I could find in order to make the planner choose an effective plan for the subquery-based version of the query, and nothing seemed to make any difference. At the end of my conversation with Unblocked it made a wise suggestion.
I would also recommend reaching out to the PostgreSQL community for further assistance. They might be able to provide more insight into this issue and suggest possible solutions or workarounds.
LLMs are, of course, changing the game for experts in Postgres internals as well as for the rest of us. Phil Eaton has been writing a series of Postgres deep dives. His latest, on interpreting query plans, concludes:
My recent Postgres explorations would have been basically impossible if it weren’t for being able to ask ChatGPT simple, stupid questions like “How do I get from a Postgres Var to a column name”.
It isn’t always right. It doesn’t always give great code. Actually, it normally gives pretty weird code. But it’s been extremely useful for quick iteration when I get stuck.
The only other place the information exists is in small blog posts around the internet, the Postgres mailing lists (that so far for me hasn’t been super responsive), and the code itself.
I don’t know if I’ll arrive at a satisfactory answer to my question. But the ability of LLMs to synthesize what is known and discussed among experts, and thereby amplify the investigative power of other experts, makes that outcome more likely.