TNS
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Data / Large Language Models / Software Development

Techniques for Using LLMs to Improve SQL Queries

First we fixed a bug in an SQL query. Then we rethought the design of the query. Here are further ways to use LLMs to adjust your SQL queries.
Nov 9th, 2023 7:31am by
Featued image for: Techniques for Using LLMs to Improve SQL Queries
Image via Unsplash

The problematic query was supposed to detect Azure virtual machines whose network security groups allowed inbound access on sensitive ports. A user reported that for two different test cases the query reported OK when it should have reported Alarm. In the first case the security group was attached directly to the VM’s network interface; in the second it was attached to the VM’s subnet.

To debug the query, which interrogates an Azure subscription using Steampipe tables including azure_compute_virtual_machine and azure_network_security_group, there would need to be a pair of VMs with security groups arranged in those two different ways. Provisioning those resources in the pre-LLM era meant searching for Terraform examples, then ironing out the differences between found examples and the desired setup.

Now you can ask an LLM to write the initial Terraform code and then, in a back-and-forth dialogue, iteratively refine it. All of my LLM assistants proved capable of this: a miracle that we shouldn’t take for granted, but probably soon will. The LLM-written Terraform code made quick work of provisioning the needed resources, and we moved on to debugging the query.

Debugging a Faulty JOIN

Here was the main part of the failing query.


The join was failing because the two network_interfaces columns contained JSONB objects with differing shapes; Postgres’ JSONB containment operator, @>, couldn’t match them. Since the JSONB objects are arrays, and since the desired match was a key/value pair common to both arrays, it made sense to explode the array and iterate through its elements looking to match that key/value pair.

Terse and Verbose SQL Syntax

Initial solutions from ChatGPT, Copilot Chat, and newcomer Unblocked implemented that strategy using various flavors of cross joins involving Postgres’ jsonb_array_elements function. Here are some of the variants:

ChatGPT


Copilot Chat


Unblocked


While all these solutions accomplished the desired JOIN, they’re hard to read and understand — even for someone like me with a fair amount of experience using Postgres’ JSONB datatype and set-returning functions like jsonb_array_elements. When you combine these things with (possibly lateral) cross joins you wind up with SQL expressions that are concise but (to me at least) opaque and difficult to debug.

I have long advocated for restructuring queries as pipelines of CTEs (common table expressions), with each CTE performing one step in a complex transformation. That way you can reason about, test, and debug each step individually. But I acknowledge a role for terse syntax that collapses steps. With deep expertise and long practice, you can build up mental models that enable you to visualize these complex transformations. At that point, a more explicit step-by-step query can be too verbose; it’s more efficient to work with compact syntax.

So should you write for the novice or the expert? Ideally that’s a choice you ought not have to make. During this exercise I prompted my LLM assistants to rewrite terse expressions like those above as more verbose pipelines of simple CTEs. All were capable of doing that. I can envision having such novice-to-expert translations (in both directions) done automatically, on the fly, as another form of dynamic documentation.

Guilty Until Proven Innocent

Tests of the revised query surfaced a new problem. There are a number of ways to describe the inbound ports specified in a network security group: specific ports or port ranges, words like Internet and any, and wildcards like * and /0. Failure to account for any of these resulted in a false OK. How to make the query resilient to such an omission? Here’s how I framed the challenge to ChatGPT.

PROMPT:

Although our query now works as designed, there is a problem with the design. The query’s job is to ring an alarm bell if an azure vm is open to attack in two different ways. The first case is an Azure VM with a security group – that has sensitive open inbound ports – attached directly to its network interface. The second case is similar, but now the security group is attached to the VM’s subnet. If you have two VMs matching these conditions, two alarms should ring. The problem we solved was that the alarms weren’t ringing because there was a faulty SQL join between two of Steampipe’s Azure tables. The shape of the data structure in the JSONB network_interfaces column of one table didn’t match the shape of the data the corresponding column of the other, and the JSONB comparison operator couldn’t handle that difference. To fix the query, we switched to a strategy that unnests the JSONB structures and matches on a particular key/value pair. Are you with me so far?

RESPONSE

Yes, I understand the scenario you’re describing. [replay of prompt]

PROMPT:

OK good. Now here is the design problem I mentioned. This query isn’t failsafe. The VMs are deemed innocent until proven guilty. When the conditions of guilt aren’t met because of a query error or omission, the query reports OK instead of Alarm. How can we make it so the VMs are deemed guilty until proven innocent? In case of a query error or omission, they’ll report Alarm instead of OK. Better to have a false alarm in a case where the VMs are actually innocent than to have no alarm when they are guilty. Do you understand the problem?

RESPONSE:

Yes, I understand the problem. [replay of prompt]

We then discussed how to invert the logic of the query to make it failsafe. After a few iterations, ChatGPT arrived at the following strategy:

RESPONSE:

To modify the query according to the discussed plan, we will first create a list of all VMs and assign an initial status of ‘alarm’ to each. Then we’ll check against the conditions that would indicate a VM is not vulnerable (i.e., ‘innocent’). If a VM does not meet these conditions, it will remain ‘guilty’ (status remains ‘alarm’).

After a few more iterations, we arrived at a working query that correctly identifies vulnerable security groups but is resilient to errors or omissions. The VMs were innocent until proven guilty, now it’s the inverse: they’re guilty until proven innocent. And for now, I’ve left the query in the verbose, step-by-step format that we used while debugging.

Terse and Verbose Syntax Revisited

As a final test, I asked ChatGPT to make the verbose query more compact. It proposed a terse alternative that made use of CROSS JOIN LATERAL along with implicit cross joins of jsonb_array_elements in an EXISTS subquery. That approach didn’t work on the first try, and proved resistant to iterative attempts.

Automatic translation between verbose and terse SQL syntax is a compelling idea. But while LLMs are pretty good at boiling down a 5000-word article to 500 words, doing the same for SQL queries may be asking too much for now. Our effort to compact the query ended here.

RESPONSE:

Since the goal is to produce a more succinct query without losing the logical flow of the original, we must preserve the order of operations and the conditions applied in each CTE. However, the challenge is that compacting the query might obscure the complex relationships between network interfaces, security groups, and virtual machines. Considering the complexity, the verbose query might be the best approach for maintainability and debugging purposes. Compact queries can be harder to debug and understand, especially when dealing with nested JSONB structures and multiple JOIN conditions.

Pretty good advice from a stochastic parrot!

I’m having good success asking LLMs to unpack terse queries into more understandable pipelines of simple CTEs, and I highly recommend using them that way. If you want to go the other way, though, it looks like you’re on your own for now.

Group Created with Sketch.
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.