Automatic Metadata Extraction for Text-to-SQL

Vladislav Shkapenyuk, Divesh Srivastava, Theodore Johnson, Parisa Ghane
AT&T CDO

ABSTRACT

Large Language Models (LLMs) have recently become sophisticated enough to automate many tasks, ranging from pattern finding to writing assistance to code generation. In this paper, we examine text-to-SQL generation. We have observed from decades of experience that the most difficult part of query development lies in understanding the database contents. These experiences inform the direction of our research.

Text-to-SQL benchmarks such as SPIDER and BIRD contain extensive metadata that is generally not available in practice. Human-generated metadata requires the use of expensive Subject Matter Experts (SMEs), who are often not fully aware of many aspects of their databases. In this paper, we explore techniques for automatic metadata extraction to enable text-to-SQL generation.

We explore the use of two standard and one newer metadata extraction technique: profiling, query log analysis, and SQL-to-text generation using an LLM. We use the BIRD benchmark [JHQY+23] to evaluate the effectiveness of these techniques. BIRD does not provide query logs for its test database, so we prepared a submission that uses profiling alone and does not use any specially tuned model (we used GPT-4o). From Sept. 1 to Sept. 23, 2024, and from Nov. 11 through Nov. 23, 2024, we achieved the highest score both with and without using the "oracle" information provided with the question set. We regained the #1 spot on Mar. 11, 2025, and are still at #1 at the time of writing (May 2025).

1 INTRODUCTION

Recent developments in Large Language Models (LLMs) have spurred extensive research into automating many activities, including code generation. In this paper, we address text-to-SQL generation [QHWY+22].

Our research was informed by our experiences with large-scale industrial databases [DJMS02][GJ14]. We have consistently found that the most difficult part of writing correct SQL queries is understanding what is in the database to begin with. After that, writing queries is relatively straightforward. Examples of difficulties include:

- Lack of documentation: In a large number of cases, the database lacks documentation – no field or table descriptions. Primary key – foreign key dependencies are not labeled. The new database user has only cryptic field and table names for guidance.
- Incomplete or dated documentation: Databases undergo continual schema and data change. Fields get added or dropped, and their contents can change their format and meaning. Oftentimes the documentation is not updated. In general, even the Subject Matter Experts (SMEs) are often not fully aware of the current contents of a database.
- Unclear data formats: To take a simple example, suppose that we know that the owner field is the name of the owner of the account. What is the format? Is it last_name first_name; last_name, first_name; or first_name last_name? Is there any capitalization? Punctuation? How are more than two names handled?
- Multiple data formats: A single field might have data in multiple formats. For example, the owner field might have half its entries in format first_name last_name and half in format last_name, first_name.
- Multiple fields with similar meanings: A common example of this issue is a date field. Suppose that table customer has 4 date fields, a_date, b_date, c_date, d_date, each of which represents a different date of an interaction (actual names can be almost as cryptic). How does one find all customers who signed up on or after Sept. 1, 2024? Different fields might be filled in based on the sign-up process, so the proper formula might be Coalesce(a_date, c_date) >= date '01/09/2024'.
- Complex join paths: There can be many complexities in developing a correct join expression – the join key might involve multiple fields, it might involve conditional values (e.g. because of multiple processes adding records to the table – iff(R.vendor=5, R.id, R.serial_num)=S.serial_num), or it might involve transformations on join keys. In one example, we needed to join two well-documented tables on the IMEI [1]. The join result was empty - until we realized that in one table the IMEI was 13 digits and in the other it was 14. The longer IMEI always had '1' as a prefix.
- Complex formulae: The complexities of an operations or business process can make seemingly simple calculations obscure. For example, the revenue of an interaction might be calculated as iff(cds_code=1, revenue, zdc+total_rendered*10000).
- Default values: In many cases, seemingly required fields have nonsense values. For example, a telephone number field might have a very large number of entries such as 123-456-7890 or 111-111-1111. These default values should be excluded from results involving the telephone number. If the telephone number is a join key between two 1-million record tables and both fields have 123-456-7890 in one tenth of their records, then the query must materialize 10 billion useless records.

Several approaches exist for understanding complex databases. One well-known approach is profiling [AGN15], the systematic querying of the tables in a database to create reports on their properties. For example, the profiling might show that the T.IMEI field is always 13 digits while the S.IMEI field is always 14 digits and always starts with a 1. A natural conclusion is that T and S are joinable on '1'||T.IMEI=S.IMEI.

Another common approach is to examine queries that SMEs have written for clues to important fields, join paths, business logic, and so on. If a query log is available, this analysis can be automated to generate statistical reports which might indicate e.g. complex join paths [YPS09].

A third approach is newer, and is enabled by LLMs. If one has a query log and high-quality metadata, one can ask the LLM to translate the SQL into text. This technique allows the user to find related queries based on the textual similarity of the questions.

In this paper, we apply these techniques to LLM-based text-to-SQL generation. For evaluation, we use the BIRD benchmark [JHQY+23] to quantify their benefit. BIRD is a challenging benchmark, with often ambiguous and/or dirty schemas, data, and metadata. Our experiments are run on the dev database, with questions selected from MiniDev for some experiments.

In the BIRD benchmark, a submission is evaluated on an unknown test database – so no query log is available. We developed a submission which uses only profiling information. On Sept. 1, 2024 and again on Nov. 11, 2024, we achieved the highest scores both using and not using the oracle [2] information. Since oracle information is never present in practice, the test which does not use the oracle is more indicative of how a text-to-SQL technique works in practice. Without the oracle, our submission got a score 10.28 percentage points higher than the next best submission without the oracle, 67.41% vs. 57.13% (at the time of writing – Jan. 2025). On March 11, 2025 we submitted again using the oracle and achieved the #1 spot with a test score of 77.14. The top five scores using hints, at the time of writing, are: 77.14 (AT&T), 76.02 (Google), 75.63 (Contextual AI), 75.63 (Alibaba), 73.17 (IBM).

[1] A standard mobility equipment identifier. https://en.wikipedia.org/wiki/International_Mobile_Equipment_Identity
[2] The BIRD question set includes an "oracle", which is a hint about how to write the query, specialized for each query.

1.1 Contributions

In this paper, we investigate three schemes for automatic metadata generation for text-to-SQL applications. Two of these schemes are traditional: database profiling and query log analysis. In the context of database profiling, we show that an LLM can translate the profiling information (in the context of the table schema) into useful metadata about field meaning.

We use the BIRD benchmark to evaluate automatic metadata extraction. In the context of database profiling, we find that by using the LLM to summarize the profile metadata, we can gain significant insights into field contents.

Using profile-enhanced field metadata blows up the size of the schema provided in an LLM prompt. To obtain better results, we develop a novel schema linking algorithm. We find that using our schema linking algorithm provides a significant boost in accuracy scores. We also find that using the profile-generated metadata provides better results than using just the SME metadata supplied in the benchmark! Using fused metadata provides the best results, and the combination of techniques let us achieve the #1 spot on the BIRD leaderboard twice.

The BIRD query set and schemas are relatively simplistic, but interesting results can still be extracted. By using query log analysis, we can find a significant number (25% of the total) of undocumented join paths. We can also find complex join predicates, and business logic for predicates and fields that are only documented in the oracles, or not at all.

We also investigate the use of LLM SQL-to-text generation to create few-shot examples – a task made possible by the introduction of LLMs. While SQL-to-text has been used in e.g. [PLSC+24], we make an experimental study to evaluate the technique and find that the LLM can generate questions as good as or better than the human annotations.

2 Profiling

Database profiling has a huge literature dating back decades [AGN15]. The common idea is to analyze the contents of a database to extract properties that aid in understanding those contents. Basic profiling takes a pass over a table and collects statistics such as:

- The number of records in a table.
- For a field, the number of NULL vs. non-NULL values.
- For a field, the number of distinct values.
- For a field, the "shape" of the field, e.g. min and max, number of characters (digits), alphabet (upper/lower/punctuation/…), common prefixes, etc.
- For each field, a sample of the top-k field values.
- For each field, a minhash sketch.
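As a concrete illustration, the following is a minimal sketch of such a profiling pass over a SQLite table (the storage format used by BIRD). The function is our own illustration, not the exact profiler used in our submission, and it omits the shape statistics and minhash sketches.

  import sqlite3

  def profile_table(db_path, table):
      # One pass of basic statistics per field of the given table.
      con = sqlite3.connect(db_path)
      cur = con.cursor()
      fields = [row[1] for row in cur.execute(f'PRAGMA table_info("{table}")')]
      num_records = cur.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
      profile = {"table": table, "num_records": num_records, "fields": {}}
      for f in fields:
          nulls, distinct, vmin, vmax = cur.execute(
              f'SELECT COUNT(*) - COUNT("{f}"), COUNT(DISTINCT "{f}"), '
              f'MIN("{f}"), MAX("{f}") FROM "{table}"').fetchone()
          # Sample of the top-k most common values, for the LLM summary.
          top_k = cur.execute(
              f'SELECT "{f}", COUNT(*) AS n FROM "{table}" '
              f'GROUP BY "{f}" ORDER BY n DESC LIMIT 10').fetchall()
          profile["fields"][f] = {"nulls": nulls, "distinct": distinct,
                                  "min": vmin, "max": vmax, "top_k": top_k}
      con.close()
      return profile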
Count distinct, and the set of top-k field values and their counts, can be computed by approximate means [FFGM07][IBS08], and these functions are increasingly present in commercial databases [3][4].

[3] https://docs.databricks.com/en/sql/language-manual/sql-ref-functions-builtin-alpha.html
[4] https://docs.snowflake.com/en/sql-reference/functions-aggregation

A minhash sketch [B97] is a collection of K values computed by

  m_i(f) = min( h_i(v_j) over all values v_j of field f )

for i ranging from 1 to K, where each h_i is a different hash function. The minhash sketch can be used to estimate the resemblance of the contents of two fields F and G, which is

  res(F, G) = |F ∩ G| / |F ∪ G|

Given two minhash sketches m(f) and m(g), the resemblance between the values of fields f and g can be approximated by

  res(f, g) ≈ ( sum of if(m_i(f) = m_i(g), 1, 0) for i in 1, …, K ) / K

Given the minhash sketch of field f, the collection of fields g with a large intersection can be quickly computed. These can be used for tasks such as:

- Finding join paths
- Imputing metadata from field f to field g.
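As a concrete sketch of the estimate, the datasketch library (the same package we use for LSH indexing in Section 3) implements this directly; the field values below are illustrative placeholders.

  from datasketch import MinHash

  def field_minhash(values, num_perm=128):
      # num_perm plays the role of K: one retained minimum per hash function.
      m = MinHash(num_perm=num_perm)
      for v in values:
          m.update(str(v).encode("utf8"))
      return m

  f_values = ["0110017010", "0110017011", "0110017012"]  # values of field f
  g_values = ["0110017011", "0110017012", "0110017013"]  # values of field g
  mf, mg = field_minhash(f_values), field_minhash(g_values)
  # Approximates res(f, g) = |f ∩ g| / |f ∪ g|; the exact value here is 2/4.
  print(mf.jaccard(mg))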
Other more complex profiles can be collected, such as multi-field keys, functional dependencies, and other kinds of dependencies [AGN15]. In this study we restrict ourselves to the basic profiles.

2.1 Using Profiling Information for Text-to-SQL

Actually using profiling information for text-to-SQL requires transforming the raw statistics into a form that the LLM can readily use. We describe the process using examples from BIRD.

We can start with frpm.CDSCode. A mechanically generated English language description of the profile for this field is:

  Column CDSCode has 0 NULL values out of 9986 records. There are 9986 distinct values. The minimum value is '01100170109835' and the maximum value is '58727695838305'. Most common non-NULL column values are '01100170109835', '01100170112607', '01100170118489', '01100170123968', '01100170124172', '01100170125567', '01100170130401', '01100170130419', '01100176001788', '01100176002000'. The values are always 14 characters long. Every column value looks like a number.

The next step is to use the English-language profile, the provided metadata for this field (which is just "CDSCode"), the table name, and the names of other fields in the table to ask the LLM for a short description of the contents of the field. The resulting short description is:

  The CDSCode column stores unique 14-character numeric identifiers for each school in the database, where CDS stands for County-District-School.

The short description of CDSCode describes the format of the values in the field, and identifies its meaning: County-District-School. The LLM is able to pick up on the meaning of CDS because CDS is a common acronym of County-District-School. However, CDS can also mean cadmium sulfide, credit default swap, counterfeit deterrence system, cross domain solution, and so on. But in the context of the table name (FRPM, or Free or Reduced Price Meal) and column names such as "Academic Year", "County Code", and so on, the most likely meaning of CDS is the one chosen.

While this short description is good for identifying the meaning of the field, more detail about the field values can guide the text-to-SQL LLM to use proper literal values. A long description is:

  The CDSCode column stores unique 14-character numeric identifiers for each school in the database, where CDS stands for County-District-School. The CDSCode column contains 14-character numeric strings with no NULL values, 9986 distinct values, ranging from '01100170109835' to '58727695838305'; common values include '01100170109835', '01100170112607', '01100170118489', '01100170123968', '01100170124172', '01100170125567', '01100170130401', '01100170130419', '01100176001788', '01100176002000'.

For another example where the LLM can guide the choice of literals for constraints, consider frpm.`Academic Year`. The provided metadata is "Academic Year", with the field value format left vague. Even the short LLM description is specific about the field value format:

  The `Academic Year` column stores the academic year for each record in the format 'YYYY-YYYY'.

A particularly striking example is cards.leadershipSkills. This field contains JSON data, but this is not indicated in the field metadata:

  A list of formats the card is legal to be a commander in

The LLM recognizes the format of the field contents and provides this short summary:

  The leadershipSkills column stores JSON-formatted data indicating the formats in which a card is legal to be used as a commander, such as Brawl, Commander, and Oathbreaker.

3 Schema Linking with Profile Metadata

The examples in the previous section make clear that an LLM can very often generate excellent descriptions of field contents and meanings. However, these descriptions, especially the long descriptions, can overflow the token limit of LLM systems [TPCM+24]. In addition, we have observed that in the presence of long prompts, the LLM will pick up on the material at the beginning and the end, but tends to ignore the part in the middle (also observed by [TPCM+24]). The long field descriptions generated from profiling and LLM summarization are too long to be provided as context if such a description is supplied for every field of every table in a database.

Schema linking [TPCM+24][DZGM+23][QHWY+22][LPKP24][GWLS+23] refers to identifying which fields are relevant to generating an SQL query in response to a question. Some authors [FPZE+24] have found that schema linking improves text-to-SQL performance even when the schema fits into the prompt context. In CHESS [TPCM+24], the authors found that perfect schema linking significantly improves performance. While some authors [MAJM24] have expressed the opinion that schema linking is not necessary with newer LLMs with large prompt contexts, industrial databases can have hundreds of tables, each with hundreds of fields, so in practice schema linking is a necessity.
In this section, we describe how we performed schema linking in our BIRD submission. The value of the two types of LLM profile summaries (short and long) should be clear: the short summary is used to help with schema linking, while the long summary is used for generating SQL from text.

There are four common schema linking mechanisms [TPCM+24], which can be used in combination:

- Metadata similarity search: search a vector database for fields whose name and/or metadata are semantically similar to the question.
- Column filtering: for each field, ask the LLM if the field is relevant.
- Table selection: give the LLM the full schema and ask it to identify relevant tables.
- Column selection: give the LLM the full schema and ask it to identify relevant fields.

We tried these schema linking techniques but obtained unsatisfactory results. The authors of [QLLQ+24] have observed that LLMs have good performance at tasks for which they have been trained, but poor performance on tasks outside of these boundaries. This property is known as task alignment. Specifically, we have observed that LLMs are not good at direct schema linking – identifying which tables/columns are relevant to a question – but are good at generating SQL. So our schema linking method is focused on generating SQL queries and gathering the fields they reference.

Another consideration is that the literals used in the question can help to indicate the field that should be constrained [TPCM+24]. One step in the algorithm is to identify fields which can contain a literal, and if the field involved in the constraint is not among that set, ask the LLM to rephrase the generated SQL using one of the fields in the set.

Yet another consideration is that recall is better than precision – it is better to put too many fields in the prompt (within limits) than too few.

The general outline of our schema linking algorithm is:

- For several different variants of field collections and their metadata:
  - Ask the LLM to generate an SQL query based on the question and in the context of the metadata variant.
  - Collect the fields and literals in the generated query.
  - Adjust the query to try to use fields which contain the literals, if needed.
- Use the collection of all fields returned by the different variants as the schema linked to the question.

In the remainder of this section, we describe in more detail the methods we have found to be effective. The schema linking algorithm requires a preprocessing step to aid in finding fields that match a literal:

  1. For every field f:
     a. Fetch N distinct values of f, or as many distinct values as exist.
     b. Compute a string similarity index on these values.
     c. Attach the string similarity index to the field's entry in the profile.

We found that a Locality Sensitive Hash (LSH) index [5] on shingles [6] is effective for the field value index, as it provides an approximate match on values but does not do semantic similarity. For the BIRD benchmark, we used N=10000. By contrast, CHESS [TPCM+24] indexes all values of all fields for literal matching. This technique is not scalable outside of small benchmark databases, so we limit ourselves to a moderate size sample.
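The following sketch shows this preprocessing step using datasketch's MinHashLSH, with the shingling written inline rather than via the kshingle package; the shingle width, threshold, and sample values are illustrative choices, not the exact settings of our submission.

  from datasketch import MinHash, MinHashLSH

  def shingles(s, k=3):
      s = s.lower()
      return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}

  def value_minhash(value, num_perm=64):
      m = MinHash(num_perm=num_perm)
      for sh in shingles(value):
          m.update(sh.encode("utf8"))
      return m

  def build_field_index(distinct_values):
      # One LSH index per field, built from up to N distinct values.
      lsh = MinHashLSH(threshold=0.5, num_perm=64)
      for i, v in enumerate(distinct_values):
          lsh.insert(f"v{i}", value_minhash(v))
      return lsh

  # At question time: does a literal (approximately) occur in this field?
  index = build_field_index(["Fresno County Office of Education", "Alameda"])
  matches = index.query(value_minhash("fresno county office of education"))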
A second, semantic similarity index is computed on the profile using FAISS [7] [JDJ19]. This index is built on the textual description of each field's profile (i.e., the long summary), which allows for the efficient search of fields likely to be relevant to a textual query.
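A sketch of this second index is below, assuming FAISS and a generic embed() function that returns unit-normalized vectors; embed() is a stand-in for whatever embedding model is available, not part of FAISS.

  import numpy as np
  import faiss

  def build_profile_index(long_summaries, embed):
      # One vector per field, computed from its long profile summary.
      vecs = np.vstack([embed(s) for s in long_summaries]).astype("float32")
      index = faiss.IndexFlatIP(vecs.shape[1])  # inner product = cosine on unit vectors
      index.add(vecs)
      return index

  def likely_relevant_fields(index, question, embed, k=20):
      q = np.asarray(embed(question), dtype="float32").reshape(1, -1)
      scores, ids = index.search(q, k)  # ids index into the field list
      return ids[0]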
A key part of the schema linking algorithm is the metadata context. We use the following terms:

- Focused schema: Given a question, this is the set of fields which are textually similar to the user's question, based on the string similarity index on the fields. In addition, literals are extracted from the question, and additional fields which include that literal in their values (using the LSH indices in the profile) are added to the focused schema.
- Full schema: All fields in all tables.
- Minimal profile: Describe the field using the short LLM summary.
- Maximal profile: Describe the field using the long LLM summary.
- Full profile: Describe the field using the SME-supplied metadata along with the maximal profile.

The schema linking algorithm is:

  Input: profile Profile, vector database Index, textual question Question, int MaxRetry
  1. Let Fields be a set of fields and Lits be a set of literals.
  2. For each of the following five cases of schema Schema: a) focused schema, minimal profile; b) focused schema, maximal profile; c) full schema, minimal profile; d) full schema, maximal profile; e) focused schema, full profile; do the following:
     a. Use the LLM to generate an SQL query Q in response to Question and Schema.
     b. Let FieldsQ be the fields referenced in Q and let LitsQ be the literals in Q.
     c. Let LitFieldsQ and MissingLits be empty lists.
     d. For each literal l in LitsQ:
        i. Use the LSH indices in the profile to identify the fields Fields_l which contain l as a value.
        ii. If no field in Fields_l is in FieldsQ:
            1. Add Fields_l to LitFieldsQ.
            2. Add l to MissingLits.
     e. If LitFieldsQ is not empty and the number of retries is less than MaxRetry:
        i. Let AugmentedSchema be Schema augmented with any fields in LitFieldsQ which are not in Schema.
        ii. Write a prompt which asks the LLM to revise the SQL query Q, suggesting the use of fields which contain the literals in MissingLits, resulting in a revised SQL query Q.
        iii. Repeat steps 2.b through 2.e.
     f. Add FieldsQ and LitsQ to Fields and Lits.
  3. Return Fields as the set of fields for providing context.

[5] https://github.com/ekzhu/datasketch
[6] https://pypi.org/project/kshingle/0.1.0/
[7] https://github.com/facebookresearch/faiss
3.1 Example candidate set, we use two techniques:
We work through a simplified example from the BIRD benchmark:

  From the California Schools dataset, please list the zip code of all the charter schools in Fresno County Office of Education.

The schema variant we will use is the full schema with the minimal profile, a sample of which is below:

  Field frpm.CDSCode means: The CDSCode column stores unique 14-character numeric identifiers for each school in the database, where CDS stands for County-District-School. This field joins with schools.CDSCode.
  Field frpm.`Academic Year` means: The `Academic Year` column stores the academic year for each record in the format 'YYYY-YYYY'.
  Field frpm.`County Code` means: The `County Code` column stores 2-character codes representing different counties.
  …
  Field schools.LastUpdate means: The LastUpdate column stores the date when each record was last updated.

A prompt is prepared and sent to the LLM, which responds with

  SELECT T2.Zip FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode WHERE T1.`Charter School (Y/N)` = 1 AND T1.`County Name` = 'Fresno County Office of Education'

We extract the following fields: frpm.CDSCode, frpm.`County Name`, frpm.`Charter School (Y/N)`, schools.CDSCode, schools.Zip; and the following literal: 'Fresno County Office of Education'.

Using the LSH indices in the schemas, we find that the literal does not occur in any field in the generated query, but does occur in fields frpm.`District Name`, satscores.dname, and schools.District. Another prompt is generated which recommends that the LLM use one of these fields to match the literal. The revised SQL query is

  SELECT T2.Zip FROM frpm AS T1 INNER JOIN schools AS T2 ON T1.CDSCode = T2.CDSCode WHERE T1.`Charter School (Y/N)` = 1 AND T2.District = 'Fresno County Office of Education'

Now all literals are matched to a field in the query, so this field list is returned.

4 BIRD Submission Details

Our BIRD submission, which achieved the #1 spot on the BIRD leaderboard on Nov. 11, 2024, primarily uses the techniques of database profiling, LLM profile summarization, and schema linking described in Sections 2, 2.1, and 3. In this section, we describe some additional details of the BIRD benchmark submission.

Our BIRD submission makes use of few-shot examples [NZZR+23][GWLS+23][PL22] taken from the train query set. We use a technique described by [LPKP24]. In every question, we use the LLM to replace names with placeholders. The masked questions are put in a vector database with a reference to the corresponding SQL. To find few-shot examples, we mask the input question and find the 8 most similar questions in the vector database. These 8 queries are used as the few-shot examples.

As in Chase [PLSC+24], we generate multiple candidate queries and select one as the answer. To introduce variety into the candidate set, we use two techniques:

- Changing the LLM randomization seed.
- Changing the prompt by randomizing the order of the (schema linking-reduced) schema fields.

Our BIRD submission generates three candidates. These three candidates are checked for the validity of the SQL by using SQLglot [8].

We then check for SQL constructions that are likely to indicate an incorrect response. Some of these are checks on possible SQL problems. For example, a NULL value is ordered before all other values. So we check to ensure a NOT NULL predicate on field f if:

- The output is in ascending order on field f.
- The select list contains the aggregate min(f).

Other checks relate to the apparent preferences of the authors of the SQL queries. For example:

- Check if a min/max query used a nested subquery instead of an Order By.
- Check if a query performs string concatenation on fields instead of returning the fields individually.

If a bad SQL construction is detected, the LLM is asked for a correction, with up to three retries.

[8] https://github.com/tobymao/sqlglot
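A minimal sketch of the validity check, and of one of the construction checks, using sqlglot is below; the checks are simplified versions of those described above.

  import sqlglot
  from sqlglot import exp

  def is_valid_sql(sql, dialect="sqlite"):
      try:
          sqlglot.parse_one(sql, read=dialect)
          return True
      except sqlglot.errors.ParseError:
          return False

  def ascending_order_fields(sql, dialect="sqlite"):
      # Fields ordered in ascending order should be guarded by a NOT NULL
      # predicate, since NULL sorts before all other values.
      tree = sqlglot.parse_one(sql, read=dialect)
      return [o.this.sql() for o in tree.find_all(exp.Ordered)
              if not o.args.get("desc")]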
We use majority voting among the candidates to pick one to be the final answer. Each of the up to three candidates is executed, and their results are converted to sets. If there is agreement among two of the candidates, one of them is chosen. Else, an answer is chosen among the candidates randomly.
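A sketch of the selection step is below; execute() is a placeholder for running a candidate query against the database and returning its rows.

  import random

  def pick_final_answer(candidates, execute):
      # Compare result sets; row order and duplicates are ignored.
      results = [frozenset(map(tuple, execute(sql))) for sql in candidates]
      for i in range(len(candidates)):
          for j in range(i + 1, len(candidates)):
              if results[i] == results[j]:
                  return candidates[i]      # two candidates agree
      return random.choice(candidates)      # no agreement: choose randomly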
4.1 Experiments

The methods described in Sections 3 and 4 got us to the #1 spot on the BIRD benchmark leaderboard (currently at the #3 spot at the time of writing). As our interest is in the efficacy of automatically generated metadata, we ran some experiments on how additional field metadata helps to improve accuracy. For these experiments, we used the schema linking and additional techniques described in this section, with the GPT-4o LLM. We ran the experiments on the MiniDev questions/SQL set (of 500 questions). Our results are below.

  No metadata, no hints                          49.8%
  Bird metadata, no hints                        59.6%
  Profiling metadata, no hints                   61.2%
  Fused (Bird and profiling) metadata, no hints  63.2%
  Bird metadata, hints                           72.0%
  Profiling metadata, hints                      72.6%
  Fused metadata, hints                          73.0%

  Table 1. Effects of metadata and hints on accuracy.

A first takeaway is that the most powerful metadata available in BIRD are the hints, which in MiniDev are very clear and precise. However, such hints are not available in practice. If one has such exact information about how to write a query, one might as well develop a canned query system.

Without hints (for which our submission is at the #1 spot among those without hints), the text-to-SQL generation does surprisingly well even without field metadata. This result reflects the descriptive field and table names in many of the tables. Using field metadata improves accuracy as expected, but using profiling metadata results in a bigger accuracy boost than using the Bird-supplied metadata does! However, there are details in the Bird metadata that are missing in the profiling metadata, so the fused metadata naturally provides the best accuracy (with or without hints).

A next question to ask is, how well does schema linking work? We compare using the full schema, our schema linking (described in Section 3), and perfect schema linking. We used the fused metadata for these experiments. The results are in the table below.

  Full schema, no hints             61.2%
  Our schema linking, no hints      63.2%
  Perfect schema linking, no hints  69.0%
  Full schema, hints                71.2%
  Our schema linking, hints         73.0%
  Perfect schema linking, hints     77.4%

  Table 2. Effects of schema linking on accuracy.

Recent papers [MAJM24][PLSC+24b] claim that schema linking is not needed when using frontier models such as GPT-4o. However, these results show that this is not the case even for the small schemas in the BIRD benchmark. The full schema case corresponds to no schema linking. The algorithm described in Section 3 provides a significant improvement, so effective schema linking does help. However, perfect schema linking provides a large jump in scores. Clearly, further research on schema linking for text-to-SQL is needed.

4.1.1 Detailed Discussion

In this section, we describe some peculiarities we observed in these experiments. One phenomenon that we observed is that providing profile metadata can lead to generated queries being flagged as incorrect. For example, in Q356, the question is

  How many cards have infinite power?

with the hint

  infinite power refers to power = '*'

Profiling detects the presence of special symbols, and the LLM summary contains the phrase

  special symbols like "∞" for infinite power

The gold SQL is

  SELECT COUNT(*) FROM cards WHERE power = '*'

but the predicate should be power = '∞'.

Another example is Q1260, which has question

  Please list the ID of the patient whose RF is normal and who is older than 60.

with hint

  normal RF refers to RF < 20; don't have thrombosis refers to Thrombosis = '0'

The gold SQL is

  SELECT COUNT(DISTINCT T1.ID) FROM Examination AS T1 INNER JOIN Laboratory AS T2 ON T1.ID = T2.ID WHERE T2.RF < 20 AND T1.Thrombosis = 0

This query is wrong in several ways, but we will focus on the predicate on Laboratory.RF. The hint states that the predicate to use is RF < 20, and this predicate is pasted in. But RF is of type text, so the proper predicate should be CAST(T2.RF AS REAL) < 20.

In some cases, the query generated using the full schema is flagged as correct, but with the linked or perfect schema, it is flagged as incorrect. One example is Q1480, where there is an issue in formatting yearmonth.Date with the linked and perfect schema, but not the full schema. Another example is Q1505, where the choice between returning count(*) vs. count(distinct CustomerID) depends on the length of the schema. Here the linked and perfect schema SQLs are correct but the gold SQL is not, so they are marked as wrong. These problems are due to the instability of LLM answers.

5 Query Log Analysis

Most DBMS systems collect query logs – a log of all the SQL query text submitted to the DBMS. Many of these queries are written by expert SMEs. So, by extracting features from the log, we can obtain SME information without the need for extensive interviews. In addition, query logs likely contain features that SMEs don't know about or have forgotten.

The query log is kept for some window (e.g. 90 days) and contains a variety of additional metadata such as the time of submission, user ID, query status (succeed/fail), performance metrics, or even the query plan. While these metadata fields are valuable for filtering, in this study we focus on the query text.

In some data analytics platforms, e.g. DataBricks, a large portion of the queries submitted to the system do not have textual SQL queries. This is the case when data analytics frameworks such as PySpark [9] are used.
PySpark primarily operates on dataframes (which correspond to Spark RDDs), using a sequence of dataframe manipulation methods to construct an execution plan of select, project, join, aggregation, etc. operations. These are collected until an action triggers the dataframe evaluation. The execution plan is optimized and evaluated, generally in the same way that an explicit SQL query would be. So, dataframes allow the construction of SQL-equivalent plans, but no SQL text is involved and no SQL text is logged.

However, Databricks keeps logs of query plans, which can be handled in a manner similar to that of textual SQL queries. We will note how query plans can be handled in the discussion of query log processing. PySpark is becoming popular in other databases, e.g. Snowpark [10] in Snowflake.

[9] https://spark.apache.org/docs/latest/api/python/index.html
[10] https://docs.snowflake.com/en/developer-guide/snowpark/index

5.1 Query Log Processing

The raw query text is not directly useful; instead, it has to be processed to extract interesting features. To be useful for SQL generation, all fields referenced in the query must be traced back to their source table. Further, in the presence of subqueries, the formula used to compute a field should be substituted for that field.

For example, consider the following query:

  Select A.uop_cd, A.trans_amt+P.total_planned as current_spend_and_planned
  From accounting_table A,
       (Select source_code, sum(planned_amt) as total_planned
        From planning_table
        Where country_code='USA'
        Group By source_code) P
  Where A.uop_cd=P.source_code

In the select clause, A.uop_cd can be resolved to be accounting_table.uop_cd. The second element has an element sourced from a subquery. Resolving the formula for P.total_planned, we can determine that output field current_spend_and_planned is sourced from accounting_table.trans_amt + sum(planning_table.planned_amt).

To perform field resolution and feature extraction, the first step is to use an SQL parser to convert the SQL statement into an Abstract Syntax Tree (AST). There are many open source SQL parsers, e.g. sqlglot [11] and JSqlParser [12]. We developed our own using the Python Lark [13] parser.

[11] https://github.com/tobymao/sqlglot
[12] https://github.com/JSQLParser/JSqlParser
[13] https://github.com/lark-parser/lark

To simplify the discussion, we assume that the AST returns a node which represents a regular query, e.g. select/from/where/group-by/having. The only subqueries (used for field resolution) are in the From clause. The goal of the algorithm is to resolve the formulas used to compute the fields of the subqueries, if any, to use for resolving fields in the top-level query. The actual algorithm we use has a variety of complexities to handle those of SQL, but they are not important here.

The main data structure is a subquery summary table, which maps a field returned by a subquery to the formula which computes it. For example, the subquery summary for P is

  Field           Formula
  source_code     planning_table.source_code
  total_planned   sum(planning_table.planned_amt)

The algorithm for field resolution is:

  Resolve_fields(root)
  1. For every subquery q in the From clause:
     a. Summary(q) = Resolve_fields(q)
  2. Query_summary = <empty table>
  3. For every field f in the Select list:
     a. Resolve the fields referenced in f's formula using the tables and subquery summaries in the From clause.
     b. Query_summary(f) = <resolved formula>
  4. Return Query_summary

For example, to resolve the top level in the example query, we process each returned field in turn. uop_cd is computed from A.uop_cd, which resolves to accounting_table.uop_cd. current_spend_and_planned is computed from A.trans_amt+P.total_planned. The two fields in this formula resolve to accounting_table.trans_amt and sum(planning_table.planned_amt), so current_spend_and_planned is computed from their sum, with result:

  Field                       Formula
  uop_cd                      accounting_table.uop_cd
  current_spend_and_planned   accounting_table.trans_amt + sum(planning_table.planned_amt)
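As an illustration of extraction over the parsed query, the sketch below pulls field-to-field equality constraints out of a query using sqlglot (one of the open-source parsers mentioned above) instead of our Lark-based parser. Note that it reports range-variable aliases; the resolution algorithm above maps these back to source tables.

  import sqlglot
  from sqlglot import exp

  def equality_join_constraints(sql, dialect="sqlite"):
      tree = sqlglot.parse_one(sql, read=dialect)
      constraints = []
      for eq in tree.find_all(exp.EQ):
          left, right = eq.this, eq.expression
          if (isinstance(left, exp.Column) and isinstance(right, exp.Column)
                  and left.table != right.table):  # different range variables
              constraints.append((left.sql(), right.sql()))
      return constraints

  sql = """Select A.uop_cd From accounting_table A, planning_table P
           Where A.uop_cd = P.source_code"""
  print(equality_join_constraints(sql))  # [('A.uop_cd', 'P.source_code')]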
5.2 Query Plan Processing

Ideally, one has access to query logs containing the SQL text. However, in some cases only a query plan is available. We encountered this issue when trying to understand queries in DataBricks. Many data science workloads do not submit textual SQL queries, but rather are written in a programming system such as PySpark. For example, the query

  Select a, b from Table where c=5

could be expressed as

  df = spark.read.format("delta").load(<Table_delta_file>).where(col('c') == 5).select('a', 'b')

These constructions are not rendered into SQL; instead they create a query plan which is optimized and executed. Given that Spark integration is a desirable feature, additional vendors are adding similar execution environments, e.g. Snowpark in Snowflake.

The simple query above would be rendered into a query plan in the expected way: a table scan, followed by a Select operator, followed by a Project operator, followed by an Output operator [14]. Each operator can be viewed as a separate subquery, so the algorithm for resolving field formulas through subqueries applies to well-annotated query plans also.

[14] Simple selections and filters will get pushed into the scan after optimization.
CHASE-SQL [PLSC+24] processes the SQLite query plan to extract features, to ensure diversity in the SQL queries generated in response to a question, for candidate selection.

5.3 Feature Extraction

We can extract further features from the example query of Section 5.1, for example:

- current_spend_and_planned is computed from accounting_table.trans_amt + sum(planning_table.planned_amt)
- There is a constraint planning_table.country_code='USA' (or, generalizing the literal, planning_table.country_code=<string>)
- planning_table.source_code is used as a group-by variable
- There is a join path accounting_table.uop_cd=planning_table.source_code

These features provide valuable metadata about the use of accounting_table and planning_table. The named field, current_spend_and_planned, names the formula in the first bullet point. So if a question asks about current and planned spending, textual similarity leads to the formula. The labeling of a join path from accounting_table to planning_table provides solid information about how to perform joins if primary key/foreign key information is missing from the schema (as it often is).

The types of features one can extract from a query include:

- Named select fields
- Unnamed select fields (i.e. no AS clause, or a trivial AS clause)
- Non-join field constraints – i.e. between fields of the same range variable
- Join field constraints – between fields of different range variables
- Constraints in the ON clause
- Sets of constraints in the ON clause – two or more field constraints might be needed for the join from R to S
- Group-by formulas
- Group-by sets – all fields (formulas) in a group-by clause
- WITH subqueries – the naming of a subquery often indicates its purpose
- Sets of tables referenced
- Sets of fields referenced
- Query complexity

Query log features help in a variety of ways. For one, primary/foreign key constraints are often not well documented, and many join constraints involve multiple fields and data transformations. For another, extracted features can show details of how a database should be queried. A common strategy in text-to-SQL is the use of few-shot examples [NZZR+23][GWLS+23][PL22] – sample queries generally found by semantic similarity of their associated questions to the target question. However, the few-shot queries might not contain the necessary details, and might have the wrong query "shape". Further, features can be expressed with less text than few-shot examples, reducing prompt size.

5.4 Experiments

BIRD does not reveal any information about the test database used for ranking, so no query logs are available for analysis. Instead, we use BIRD as a test suite for determining if we can use log analysis for detecting missing metadata. We use the BIRD dev test suite from Sept 2023 (dev 9-23), as well as the newer cleaned-up version of June 2024 (dev 6-24), as the unrevised version is likelier to reflect actual industrial databases.

Our experiments focus on three questions:

1. Can we find equality (e.g. pk-fk) join constraints that are not documented in the SQLite schema?
2. Can we find other interesting join constraints?
3. Can we find interesting named formulas (business logic)?

For interesting constraints and named formulas, we compare the features that we find to the provided metadata, and also to the hints associated with each query. We note that in an industrial application, no hint is available, so one must use few-shot examples or relevant query features.

5.4.1 Primary Key - Foreign Key / Equality Constraints

Using the dev queries as the query log, we extract all constraints of the form R.f=S.g. This yields 109 distinct constraints for dev 9-23 and 107 for dev 6-24. Comparing the constraint lists, there are 3 new constraints and 5 missing constraints in dev 6-24 as compared to dev 9-23, reflecting a revision of the gold standard queries. An example of an added constraint is

  card_games.legalities.format = card_games.maxbanned.format

and an example of a missing constraint is

  financial.disp.account_id = financial.trans.account_id

Extracting the PK-FK constraints from the BIRD documentation required some manual intervention. We use the schemas of the SQLite databases to extract the foreign keys using PRAGMA foreign_key_list. On processing the results, we found that three constraints were improperly specified (for both versions):

  debit_card_specializing.yearmonth.CustomerID = debit_card_specializing.customers.CustomerID
  european_football_2.Match.league_id = european_football_2.League.id
  european_football_2.Match.country_id = european_football_2.Country.id

The problem is that the foreign key field was not specified in the SQLite schema, and the SQLite pragma wasn't able to identify the referenced field. So, the field on the right hand side was empty. We filled these in by hand. The result was 109 constraints for dev 9-23 and 104 for dev 6-24 (the unused database card_games_2 in dev 9-23 was removed for dev 6-24).
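A sketch of this extraction with sqlite3 is below; the "to" column of the pragma result is exactly the field that came back empty in the three cases above.

  import sqlite3

  def declared_fk_constraints(db_path):
      con = sqlite3.connect(db_path)
      cur = con.cursor()
      tables = [r[0] for r in cur.execute(
          "SELECT name FROM sqlite_master WHERE type = 'table'")]
      constraints = []
      for t in tables:
          # Row layout: (id, seq, table, from, to, on_update, on_delete, match)
          for row in cur.execute(f'PRAGMA foreign_key_list("{t}")'):
              ref_table, from_col, to_col = row[2], row[3], row[4]
              # to_col is None when the referenced field is unspecified --
              # the cases we filled in by hand.
              constraints.append((t, from_col, ref_table, to_col))
      con.close()
      return constraints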
We normalized the two sets of constraints and took set differences to find how the results differed. From the SQLite dev 9-23 constraints, there were 29 equality constraints that were never used, and from dev 6-24 there are 24 unused constraints. Examples include:

  card_games_2.foreign_data.uuid = card_games_2.cards.uuid
  european_football_2.Match.away_player_11 = european_football_2.Player.player_api_id
  toxicology.connected.atom_id = toxicology.atom.atom_id

From the constraints extracted from log analysis, there are 29 constraints detected in dev 9-23 queries but not documented in the SQLite schema, and for dev 6-24 there are 27. Examples include:

  debit_card_specializing.gasstations.gasstationid = debit_card_specializing.transactions_1k.gasstationid
  card_games.cards.setcode = card_games.sets.code
  thrombosis_prediction.examination.id = thrombosis_prediction.laboratory.id

Of the discovered constraints, 4 equated different field names (e.g. cards.setcode=sets.code). If the hand-constructed constraints are excluded, there are 32 / 30 discovered constraints, of which 6 equate different field names. So, 27% / 25% of the field-to-field equality constraints actually used in a dev query are either undocumented or require hand extraction. While dev 6-24 has slightly fewer undocumented equality constraints, a large fraction of these constraints are undocumented in the schema.

Many of the missing equality constraints have either the same field name, or they both end in "id". So the LLM can make a good guess about how to perform a join. However, there are two problems. First, industrial databases (especially those grown by accretion) often do not have such convenient naming systems. Second, it's better to know the join path than to need to guess it.

For an experiment, we counted the number of fields per table that end in "id". We found that there are an average of 1.8 id fields per table, and a maximum of 15 (in card_games.cards). 20 tables have no id field, and 23 have one id field. So guessing an equality predicate in BIRD can still present challenges.

5.4.2 Other Join Constraints

Many more join constraints can be found (examples from dev 6-24). For example, there are two multi-field joins:

  Q782: colour.id=superhero.eye_colour_id And colour.id=superhero.hair_colour_id
  Q1016: results.raceid=pitstops.raceid And results.driverid=pitstops.driverid

None of these 4 individual predicates are listed in the SQLite schema, although the hint in Q782 suggests the join. We can also find joins that require computation:

  Q234: bond.molecule_id||'_1'=connected.atom_id And bond.molecule_id||'_2'=connected.atom_id2 And bond.bond_id=connected.bond_id

The metadata for the bond.molecule_id field does mention the encoding standard. However, no join path from bond.molecule_id to connected.atom_id is explicitly documented. There are also many multi-table join predicates:

  Q146: card.disp_id=disp.disp_id AND client.client_id=disp.client_id AND disp.account_id=loan.account_id

These constraints indicate join patterns, which might go through an otherwise unrelated linkage table.

When a table has multiple date fields, understanding which should be used in a constraint can be obscure. For example, california_schools.schools has three date fields: opendate, closedate, and lastupdate. Opendate is constrained in 6 queries (4, 27, 39, 47, 66, 87) and closedate is constrained in three (27, 67, 68), while lastupdate is never constrained.

Finally, we can find date constraints:

  strftime('%Y', california_schools.schools.opendate)='1980'
  california_schools.schools.opendate>'2000-01-01'

5.4.3 Interesting Formulas
In our experience, queries contain many named non-trivial formulas (capturing useful business logic). The queries in BIRD generally don't name elements in the select clause, but there are some. For example, Q221 has a pair of named formulas in the Select clause, which we paraphrase as:

  atom_id1 is computed from substr(bond.bond_id, 1, 7)
  atom_id2 is computed from (bond.molecule_id)||(substr(bond.bond_id, 8, 2))

The field metadata contains the following:

  bond_id: unique id representing bonds; TRxxx_A1_A2: TRxxx refers to which molecule, A1 and A2 refer to which atoms
  molecule_id: identifying the molecule in which the bond appears

So there is an indication of the structure of bond_id, but the formulas used in Q221 are not clear from the text. Some other examples are:

  Q1499: monthlyconsumption is computed from (sum(debit_card_specializing.yearmonth.consumption))/(12)
  Q215: iodine_nums is computed from count(DISTINCT case When ('i'=toxicology.atom.element) Then toxicology.atom.atom_id Else Null End)
  Q222: diff_car_notcar is computed from (count(case When ('+'=toxicology.molecule.label) Then toxicology.molecule.molecule_id Else Null End)) - (count(case When ('-'=toxicology.molecule.label) Then toxicology.molecule.molecule_id Else Null End))

In each of these cases, the "evidence" hint suggests the use of these formulas. However, in practice these hints are not available. But they can be extracted by query log analysis.

5.4.4 Summary

We have shown that many useful query features can be found in the collection of dev queries, considered as a query log. For example, 25% of the equality joins that are used in at least one query are not documented in the SQLite schema. However, the usefulness of this information is limited in the context of the BIRD benchmark, which is fairly simple and readable. Field names are highly suggestive of e.g. join paths, and few formulas are explicitly named in the gold queries using AS (correctness checking is done using field positions, not names). The hints provided with the question generally contain the query features to be used in the generated query – a very unrealistic situation. A better test would list query features and require the submission to do query feature linking.

6 SQL to Text

The use of question/SQL pairs as few-shot examples has been shown to be an effective means of boosting text-to-SQL performance [NZZR+23][GWLS+23][PL22], and has been used in our BIRD submission, as described in Section 4. However, generating these pairs creates a very large workload for the SMEs, who must think up questions and write the corresponding SQL queries. For the Spider benchmark [YZYY+18], students spent 1,000 hours creating 5700 question/answer pairs. The BIRD benchmark does not list the number of work hours, but states that 11 contributors and three assessors were employed in developing 12000 question/answer pairs, and that $98,000 was spent.

If one has access to query logs, then one can sample a selection of these queries and use an LLM to generate a corresponding question from them. The procedure we use is:

  Input: SQL query Q, full schema S
  1. Analyze Q to determine the set of fields F referenced in Q.
  2. Extract a focused schema FS by selecting from S the fields in F.
  3. With the context of the focused schema FS, ask the LLM to create a long question and a short question from query Q.

That is, by query analysis, one can obtain perfect schema linking [TPCM+24] – helping to make SQL-to-text an easier problem than text-to-SQL.
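A sketch of how such a prompt might be assembled is below; the wording is illustrative rather than our exact prompt, and the schema argument is the focused schema from step 2.

  def sql_to_text_prompt(sql, focused_schema):
      # focused_schema: {field name -> description}, i.e. perfect schema
      # linking obtained by analyzing the query itself.
      schema_text = "\n".join(f"Field {f} means: {d}"
                              for f, d in focused_schema.items())
      return ("Given these field descriptions:\n" + schema_text +
              "\n\nWrite (1) a detailed question and (2) a short question "
              "that the following SQL query answers:\n" + sql)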
Query logs can contain a very large number of distinct queries, but only an interesting and representative sample should be selected for few-shot examples. For example, one might be focused on generating SQL for a subset of tables, or one might choose examples which use an interesting formula such as "(bond.molecule_id)||(substr(bond.bond_id, 8, 2))". We have found the following procedure to be useful:

  1. For each query Q in query log L, analyze Q, extract its features, and associate them with Q (e.g. in a JSON file).
  2. Provide the summary of extracted features to an SME, who identifies collections of query features to match against. Aggregate these collections into a set of feature sets FS.
  3. For each feature set fs in FS, collect the set of queries Qfs that contain fs.
  4. Return the union of all Qfs.

6.1 Experiments

Our experiments consist of generating text from SQL, and then comparing the generated questions to the supplied question and SQL. We used BIRD Minidev [15] as the source of the question/SQL pairs. However, there are 500 question/SQL pairs in Minidev, the grading process is very time consuming, and we have limited human resources. So we further selected every 6th entry (i.e., the first, 7th, 13th, etc. entry) for use in evaluation.

[15] https://github.com/bird-bench/mini_dev

The question/SQL pairs were human-generated in the text-to-SQL direction. For our experiments, we treat the corpus as having been generated in the opposite direction: given an SQL query, what is it asking?

We generate both the long and short question, and use the best result for grading – which we consider reasonable both for query explanation and for query retrieval from a vector database for few-shot selection.

Our grading is subjective, so we list the SQL, supplied question, generated questions, and their ratings in the Appendix. Our ratings are:

- Bad: The question is not related to the SQL.
- Bad+: The question is related to the SQL, but misses important details.
- Good-: The question generally matches the SQL, but misses some important detail.
- Good: The question matches the SQL and is on par with the supplied question in terms of accuracy and readability.
- Good+: The question matches the SQL and is better than the supplied question.

We note that good+ includes entries in which the supplied question and the SQL are not in agreement (problems with question/SQL pairs have been previously noted [TPCM+24]). We use four different kinds of metadata:

- Base: no metadata.
- Bird: the benchmark-supplied metadata (i.e., from the *.csv files in the database_description subdirectories).
- LLM: the short LLM summary generated from the field profile.
- Fused: both the Bird and the LLM metadata.

Our results, across all difficulty levels, are below. The results sliced by difficulty level (simple/moderate/challenging) are similar:

  Rating   Base   Bird only   LLM only   Fused
  Bad         0           0          0       0
  Bad+        1           1          3       0
  Good-      16          14         12       2
  Good       53          55         55      68
  Good+      13          14         13      13

  Table 3. SQL-to-text evaluation.

Even with no metadata, the SQL-to-text performance is surprisingly good – almost as good as human annotation. With fused metadata, the generated questions are significantly better than the human annotation. We conclude that, using the techniques described in Section 2, query extraction from a query log plus SQL-to-text generation is an effective technique for generating few-shot examples.

The human-generated question is worse in 13 out of 83 sample questions (16%). This is likely due to the tedious and exhausting nature of generating 12000+ question/SQL pairs. In the remainder of this section, we explore some examples.

We start with the example where the generated question is rated "bad+" (question_id 93). The SQL is

  SELECT COUNT(T1.client_id) FROM client AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id WHERE T1.gender = 'M' AND T2.A3 = 'north Bohemia' AND T2.A11 > 8000

The bad+ question using LLM metadata is

  How many men from districts in north Bohemia with populations over 8000 are clients?

Field A11 refers to salary, not population, so the generated question is significantly off. A good- base-generated question is:
  How many men from the North Bohemia district with an A11 value over 8000 are clients?

This question also misses the meaning of field A11, but does not try to guess the meaning. The question with fused metadata indicates a salary of 8000 or more, which is correct.

An example entry where all of the generated questions are good is question_id 710. The SQL is

  SELECT COUNT(T1.id) FROM comments AS T1 INNER JOIN posts AS T2 ON T1.PostId = T2.Id WHERE T2.CommentCount = 1 AND T2.Score = 0

while the supplied question is

  In posts with 1 comment, how many of the comments have 0 score?

The generated questions are similar, though the one generated with the fused metadata is more accurate:

  How many comments are linked to posts with only one comment and no upvotes?

In 8 of the 83 total questions, all of the generated questions are rated "good", but we added a clarification note, generally indicating that the supplied question is vague or has poor grammar. An example is question_id 39 with SQL:

  SELECT AVG(T1.NumTstTakr) FROM satscores AS T1 INNER JOIN schools AS T2 ON T1.cds = T2.CDSCode WHERE strftime('%Y', T2.OpenDate) = '1980' AND T2.County = 'Fresno'

and supplied question

  What is the average number of test takers from Fresno schools that opened between 1/1/1980 and 12/31/1980?

Even the question generated with the base metadata is more accurate:

  What's the average SAT participation for schools opened in 1980 in Fresno County?

Turning to generated questions labeled good+, an example where all generated questions are good+ is question_id 112, with SQL

  SELECT T1.A2 FROM district AS T1 INNER JOIN client AS T2 ON T1.district_id = T2.district_id WHERE T2.birth_date = '1976-01-29' AND T2.gender = 'F'

The supplied question is

  For the female client who was born in 1976/1/29, which district did she opened her account?

However, there is nothing in the SQL which suggests that there is only one match. A more accurate question is:

  Which districts have female clients born on 29th January 1976?

An example of a supplied question which can be considered accurate, but which has poor grammar, is question_id 862:

  For the Bahrain Grand Prix in 2007, how many drivers not finished the game?

An example where the supplied question does not match the SQL is question_id 231, with SQL

  SELECT T.bond_type FROM ( SELECT T1.bond_type, COUNT(T1.molecule_id) FROM bond AS T1 WHERE T1.molecule_id = 'TR010' GROUP BY T1.bond_type …

The supplied question is

  Which bond type accounted for the majority of the bonds found in molecule TR010 and state whether or not this molecule is carcinogenic?

The select list has no indication of carcinogenic status.

7 Related Work

Text-to-SQL code generation has attracted a great deal of attention in recent years [CLHY+22][QHWY+22][YWDD17][XLS17][ZM96]. Development has been accelerated by the release of standardized benchmarks: WikiSQL [16] [ZXS17], Spider [17] [YZYY+18], and BIRD [18] [JHQY+23]. In this discussion, all scores and rankings are as of the time of writing (Jan. 2025).

In [GWLS+23], the authors put a corpus of Spider questions in a vector database and extracted few-shot examples by similarity to the posed question. This technique and a tuned LLM got them the #1 spot on the Spider leaderboard (currently at #2).

The CHESS [19] submission to BIRD [TPCM+24] uses LLM-driven schema linking techniques: column filtering, table filtering, and column selection. This, plus query revision and query candidate selection, got them the #1 spot (currently at #7).

The BIRD submission from Distillery [MAJM24] takes a different approach. They argue that newer LLMs remove the need to do schema linking. They achieved the #1 spot on BIRD, and are currently at #6.

IBM achieved the #1 spot on BIRD, and is currently at the #4 spot. They have not posted a paper, but marketing materials [20] state that they use "extractive schema-linking" and a tuned version of their Granite LLM.

Chase [PLSC+24] has the current #2 spot on the BIRD benchmark. Their paper describes a number of interesting techniques, including methods for generating diverse query candidates. They use query plan analysis to determine the "shape" of the query, and try to get candidates with a variety of query shapes. They also use a tuned version of the Google Gemini LLM. Our submission and the Google submission have been trading the #1 spot on the BIRD benchmark, both recently beaten by Alibaba.

XiYan-SQL [GLLS+24] has the current #1 spot on the BIRD benchmark. Among the techniques used is to generate candidate SQL using several different models, and then use a fine-tuned selector model to pick the result.

We explore the use of well-known, and also newer, metadata extraction techniques for text-to-SQL generation. Database profiling [AGN15] has a large literature. Query parsing and feature extraction is at the core of query planning, and has been used to e.g. find join paths in large, complex databases [YPS09]. A new technique is the conversion of SQL to text questions. This technique has been used for candidate selection by Chase [PLSC+24].

[16] https://www.kaggle.com/datasets/shahrukhkhan/wikisql
[17] https://yale-lily.github.io/spider

8 Conclusions
18
ORDER BY COUNT(T1.molecule_id) DESC LIMIT 1 ) AS https://bird-bench.github.io/
19
T https://github.com/ShayanTalaei/CHESS
20
https://research.ibm.com/blog/granite-LLM-text-to-SQL
And supplied question
In our decades of experience developing queries on large and complex industrial databases, we have found that the hardest task is understanding the database contents. After that, writing SQL queries is generally straightforward. With that motivation, we develop several strategies for automatic metadata extraction for text-to-SQL generation: database profiling, query log analysis, and SQL-to-text generation.
Database profiling [AGN15] is a collection of well-known techniques used to characterize database contents. We combine profiling results with basic table metadata, and find that an LLM can provide a surprisingly good summary of the field meaning. In one example, the summary discovered the JSON format of the field contents.
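To make the profiling step concrete, the sketch below computes a small subset of the statistics surveyed in [AGN15] for a single column. The table and column names are caller-supplied placeholders and are assumed to come from a trusted catalog (they are interpolated into the SQL directly).

import sqlite3

def profile_column(conn: sqlite3.Connection, table: str, column: str) -> dict:
    # Single-column profile: row count, null fraction, distinct-value
    # count, and the five most frequent values with their counts.
    cur = conn.cursor()
    rows = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    nulls = cur.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL").fetchone()[0]
    distinct = cur.execute(
        f"SELECT COUNT(DISTINCT {column}) FROM {table}").fetchone()[0]
    top = cur.execute(
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"GROUP BY {column} ORDER BY COUNT(*) DESC LIMIT 5").fetchall()
    return {"rows": rows,
            "null_fraction": nulls / rows if rows else 0.0,
            "distinct": distinct,
            "top_values": top}

Profiles like these, combined with the basic table metadata, form the input to the LLM summarization step described above.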
For evaluation, we developed a novel schema linking strategy that relies on task alignment. Using this schema linking strategy, we find that field metadata generation using profiling and LLM summarization is highly effective. Using BIRD as a test suite, we find that using profiling metadata leads to higher accuracy than using the supplied metadata (with or without hints)! Using fused metadata unsurprisingly provides the highest accuracy. Our BIRD benchmark submission used this technique and achieved the #1 position on the leaderboard (with hints) from Sept. 1 to Sept. 23, 2024 and from Nov. 11 to Nov. 24, 2024, and retains the highest position on the leaderboard without hints at the time of this writing.
We applied query log analysis to the BIRD dev query set. Even though primary key – foreign key dependencies are well documented in the schema, the log analysis was able to find 25% more useful equality constraints than are documented. Query log mining also found multi-field join constraints, join constraints that involve computations, field constraint formulas, and formulas in the select clause. Many of these features are either documented only in the hints (which are not present in actual text-to-SQL tasks) or not documented at all.
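As an illustration of the kind of mining involved, the sketch below extracts single-column equality predicates from a query log using the open-source sqlglot parser. It is deliberately simplified: alias-to-table resolution is omitted, and the multi-field and computed join constraints mentioned above need additional handling.

import sqlglot
from sqlglot import exp
from collections import Counter

def mine_equality_predicates(query_log):
    # Count column = column predicates across the log; pairs that
    # recur across many queries are candidate join constraints.
    pairs = Counter()
    for sql in query_log:
        try:
            tree = sqlglot.parse_one(sql)
        except sqlglot.errors.ParseError:
            continue
        for eq in tree.find_all(exp.EQ):
            left, right = eq.left, eq.right
            if isinstance(left, exp.Column) and isinstance(right, exp.Column):
                pairs[tuple(sorted((left.sql(), right.sql())))] += 1
    return pairs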
Finally, we ran experiments on SQL-to-text, which can be used for query explanation or for generating few-shot examples from query logs. Even with no metadata, the LLM scored nearly as well as the human-generated questions. This result reflects the clean and clear naming in most parts of the BIRD schemas. With fused BIRD/profiling metadata, the generated questions scored significantly better than the human-generated questions. This is likely due to the tedious nature of question/SQL generation for 12000+ query pairs in dev and test alone, leading to frequent human mistakes.
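A minimal sketch of the resulting few-shot retriever, in the style of [GWLS+23], is shown below using FAISS [JDJ19]. The embed() function stands in for any sentence-embedding model, and the flat inner-product index is an illustrative choice.

import numpy as np
import faiss

def build_index(question_vectors: np.ndarray) -> faiss.IndexFlatIP:
    # Normalize so that inner product equals cosine similarity.
    vectors = question_vectors.astype("float32")
    faiss.normalize_L2(vectors)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index

def few_shot_examples(index, pairs, embed, user_question, k=5):
    # pairs[i] = (generated_question, sql) for the i-th indexed vector.
    query = embed([user_question]).astype("float32")
    faiss.normalize_L2(query)
    _, ids = index.search(query, k)
    return [pairs[i] for i in ids[0]]

The retrieved (question, SQL) pairs are then placed in the text-to-SQL prompt as in-context examples.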
Suggestions for Future Benchmarks

The BIRD benchmark is a significant step up in complexity as compared to the Spider benchmark. This can be seen in the accuracy scores of the entries at the #1 spot: 91.2% vs. 74.79%. Part of the increase in difficulty is in the ambiguity of the schema and field names, and the sometimes-poor field metadata. These issues allow our automatic metadata extraction techniques to work well.
However, the schemas are still simple and have highly suggestive names and join paths. This can be seen in the good performance of text-to-SQL and SQL-to-text without any field metadata.
In our experience, industrial databases are far more complex and dirty. Field and table names tend to be obscure, join paths are not obvious from the field names, and documentation tends to be incomplete and out of date. Many databases contain thousands of tables, many of which have 500+ fields. Generating SQL from questions for these databases requires a significant step up from the schema linking techniques developed to date.
BIRD relies on per-question hints to supply special information (e.g., formulas, predicates, join paths), allowing the query-generating prompt to incorporate these non-obvious details. However, per-query hints are very unrealistic in real-world databases. A more realistic benchmark would provide a table of query features which apply to the database in question. This change would allow for the testing of query feature linking as well as schema linking.
The BIRD benchmark has a second leaderboard for query execution efficiency (our submission is at the #3 spot). This is a curious metric for rating text-to-SQL systems, as the results are very highly dependent on the query optimizer, which varies from DBMS to DBMS and is frequently updated in actively supported DBMSs.
However, it is possible to engineer performance bombs in the data, which a good text-to-SQL system should be able to detect, or at the least warn about. One common example is the presence of default values instead of NULL values. Often there are multiple default values. For example, a telephone_number field might have 10% of its entries with value 123-456-7890, and another 5% with value 999-999-9999. If one does a join on telephone_number and both tables have a large overlap in their default values (likely), then a join which should return millions of records will return trillions of records, and most of these records are garbage. (With 10 million rows per side, a single shared default value covering 10% of each side contributes 10^6 x 10^6 = 10^12 joined rows all by itself.)
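A text-to-SQL system can screen for such performance bombs using the column profiles it already collects. The sketch below flags heavy-hitter values as suspected defaults and estimates the join blowup from the two frequency histograms; the 1% threshold is illustrative.

def candidate_defaults(top_values, n_rows, min_fraction=0.01):
    # top_values: [(value, count), ...] from the column profile.
    # Values far above the average frequency are suspected defaults.
    return [v for v, c in top_values if c >= min_fraction * n_rows]

def estimated_join_size(hist_a, hist_b):
    # |A JOIN B ON col| = sum over shared values v of
    # count_a(v) * count_b(v), so one shared heavy hitter dominates.
    return sum(c * hist_b[v] for v, c in hist_a.items() if v in hist_b)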
ACKNOWLEDGMENTS

We'd like to thank Chris Vo and Pavan Tagarisa for their help with creating the BIRD benchmark submissions, and the BIRD team for their prompt evaluations.

REFERENCES

[AGN15] Z. Abedjan, L. Golab, F. Naumann. Profiling relational data: a survey. VLDB J. 24(4): 557-581, 2015.
[B97] A. Broder. On the resemblance and containment of documents. Proc. Compression and Complexity of SEQUENCES, 1997.
[CLHY+22] R. Cai, B. Xu, Z. Zhang, X. Yang, Z. Li, Z. Liang. An encoder-decoder framework translating natural language to database queries. Proc. Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018.
[DJMS02] T. Dasu, T. Johnson, S. Muthukrishnan, V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. SIGMOD Conf., 2002.
[DZGM+23] X. Dong, C. Zhang, Y. Ge, Y. Mo, Y. Gao, I. Chen, J. Lin, D. Lou. C3: Zero-shot text-to-SQL with ChatGPT. CoRR, abs/2307.07306, 2023.
[FPZE+24] V. Floratou et al. NL2SQL is a solved problem... not! Conference on Innovative Data Systems Research, 2024.
[FFGM07] P. Flajolet, E. Fusy, O. Gandouet, F. Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. Discrete Mathematics and Theoretical Computer Science Proceedings, 137-156, 2007.
[GJ14] L. Golab, T. Johnson. Data stream warehousing. Proc. ICDE, 2014.
[GLLS+24] Y. Gao et al. XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL. arXiv:2411.08599, 2024.
[GWLS+23] D. Gao et al. Text-to-SQL empowered by large language models: A benchmark evaluation. CoRR, abs/2308.15363, 2023.
[IBS08] I. Ilyas, G. Beskales, M.A. Soliman. A survey of top-k processing techniques in relational databases. ACM Computing Surveys 40(4), article 11, 2008.
[JDJ19] J. Johnson, M. Douze, H. Jegou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7(3), 2019.
[JHQY+23] J. Li et al. Can LLM Already Serve as A Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs. 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 2023.
[LPKP24] D. Lee, C. Park, J. Kim, N. Tang. MCS-SQL: Leveraging multiple prompts and multiple-choice selection for text-to-SQL generation. CoRR, abs/2405.07467, 2024.
[MAJM24] K. Maamari, F. Abubaker, D. Jaroslawicz, A. Mhedhbi. The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models. arXiv:2408.07702, 2024.
[NZZR+23] L. Nan, Y. Zhao, W. Zou, N. Ri, J. Tae, E. Zhang, A. Cohan, D. Radev. Enhancing few-shot text-to-SQL capabilities of large language models: A study on prompt design strategies. CoRR, abs/2305.12586, 2023.
[PLSC+24] M. Pourreza, H. Li, R. Sun, Y. Chung, S. Talaei, G.T. Kakkar, Y. Gan, A. Saberi, F. Ozcan, S.O. Arik. CHASE-SQL: Multi-path reasoning and preference optimized candidate selection in text-to-SQL. arXiv:2410.01943, 2024.
[PLSC+24b] M. Pourreza et al. CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL. arXiv:2410.01943v1, 2024.
[PL22] A. Parnami, M. Lee. Learning from few examples: a summary of approaches to few-shot learning. arXiv:2203.04291, 2022.
[PR23] M. Pourreza, D. Rafiei. DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction. CoRR, abs/2304.11015, 2023.
[QHWY+22] B. Qin et al. A survey on text-to-SQL parsing: Concepts, methods, and future directions. arXiv:2208.13629, 2022.
[QLLQ+24] G. Qu et al. Before generation, align it! A novel and effective strategy for mitigating hallucinations in text-to-SQL generation. 62nd Annual Meeting of the Association for Computational Linguistics, 2024. Also in arXiv:2405.15307.
[TPCM+24] S. Talaei, M. Pourreza, Y.-C. Chang, A. Mirhoseini, A. Saberi. CHESS: Contextual Harnessing for Efficient SQL Synthesis. arXiv:2405.16755v2, 2024.
[WRYL24] B. Wang et al. MAC-SQL: A multi-agent collaborative framework for text-to-SQL. CoRR, abs/2312.11242, 2024.
[XLS17] X. Xu, C. Liu, D. Song. SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning. arXiv:1711.04436, 2017.
[YPS09] X. Yang, C.M. Procopiuc, D. Srivastava. Recommending Join Queries via Query Log Analysis. ICDE 2009: 964-975, 2009.
[YWDD17] N. Yaghmazadeh, Y. Wang, I. Dillig, T. Dillig. SQLizer: query synthesis from natural language. Proc. ACM OOPSLA, 2017.
[YZYY+18] T. Yu et al. Spider: a large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. Proc. Conf. on Empirical Methods in Natural Language Processing, 2018.
[ZM96] J.M. Zelle, R.J. Mooney. Learning to parse database queries using inductive logic programming. Proc. 14th Conf. on Artificial Intelligence / 9th Conf. Innovative Applications of Artificial Intelligence, 1996.
[ZXS17] V. Zhong, C. Xiong, R. Socher. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. arXiv:1709.00103, 2017.

Appendix

In this section, we show the SQL, supplied question, generated questions and their ratings from the experiments in Section 6.

Question 1471:
SQL:
SELECT CAST(SUM(IIF(Currency = 'EUR', 1, 0)) AS FLOAT) / SUM(IIF(Currency = 'CZK', 1, 0)) AS ratio FROM customers
Supplied question: What is the ratio of customers who pay in EUR against customers who pay in CZK?
Base generated questions: Q1:: What is the ratio of the number of customers using the EUR currency to the number of customers using the CZK currency in the customers table? Q2:: What is the EUR to CZK customer ratio?
Base question rating: good.
Bird only generated questions: Q1:: What is the ratio of the total number of customers using the Euro (EUR) to the total number of customers using the Czech koruna (CZK) in the customers table? Q2:: What is the EUR to CZK customer ratio?
Bird only question rating: good+. Reason: Q2 is succinct and expands the acronyms.
LLM only generated questions: Q1:: What is the ratio of the total number of customers using the EUR currency to the total number of customers using the CZK currency in the customers table? Q2:: What is the EUR to CZK customer ratio?
LLM only question rating: good.
Fused generated questions: Q1:: What is the ratio of the total number of customers using the EUR currency to the total number of customers using the CZK currency in the customers table? Q2:: What is the EUR to CZK customer ratio?
Fused question rating: good.
----------------------------------------------------
Question 1481:
SQL:
SELECT CAST(SUM(IIF(T1.Segment = 'SME', T2.Consumption, 0)) AS REAL) / COUNT(T1.CustomerID) - CAST(SUM(IIF(T1.Segment = 'LAM', T2.Consumption, 0)) AS REAL) / COUNT(T1.CustomerID), CAST(SUM(IIF(T1.Segment = 'LAM', T2.Consumption, 0)) AS REAL) / COUNT(T1.CustomerID) - CAST(SUM(IIF(T1.Segment = 'KAM', T2.Consumption, 0)) AS REAL) / COUNT(T1.CustomerID), CAST(SUM(IIF(T1.Segment = 'KAM', T2.Consumption, 0)) AS REAL) / COUNT(T1.CustomerID) - CAST(SUM(IIF(T1.Segment = 'SME', T2.Consumption, 0)) AS REAL) / COUNT(T1.CustomerID) FROM customers AS T1 INNER JOIN yearmonth AS T2 ON T1.CustomerID = T2.CustomerID WHERE T1.Currency = 'CZK' AND T2.Consumption = ( SELECT MIN(Consumption) FROM yearmonth ) AND T2.Date BETWEEN 201301 AND 201312
Supplied question: What is the difference in the annual average consumption of the customers with the least amount of consumption paid in CZK for 2013 between SME and LAM, LAM and KAM, and KAM and SME?
Base generated questions: Q1:: What is the difference in average minimum monthly consumption between 'SME' and 'LAM' segments, 'LAM' and 'KAM' segments, and 'KAM' and 'SME' segments for customers with the 'CZK' currency within the year 2013? Q2:: What are the average differences in the least monthly consumption for 'SME', 'LAM', and 'KAM' customer segments using Czech koruna in 2013?
Base question rating: good. Reason: Q2.
Bird only generated questions: Q1:: What is the difference in average consumption per customer between the SME and LAM segments, LAM and KAM segments, and KAM and SME segments for customers with the currency 'CZK' who had the minimum consumption in the year 2013? Q2:: What are the average consumption differences per customer among SME, LAM, and KAM segments for 'CZK' currency customers with the least consumption in 2013?
Bird only question rating: good.
LLM only generated questions: Q1:: What is the average difference in the minimum consumption between the 'SME' and 'LAM' segments, the 'LAM' and 'KAM' segments, and the 'KAM' and 'SME' segments for customers with the currency 'CZK' within the year 2013? Q2:: What are the average differences in the least consumption among 'SME', 'LAM', and 'KAM' customer segments in CZK for 2013?
LLM only question rating: good.
Fused generated questions: Q1:: What is the difference in average minimum monthly consumption in the year 2013 between the SME and LAM segments, between the LAM and KAM segments, and between the KAM and SME segments for customers using the CZK currency? Q2:: What is the average minimum consumption difference in 2013 between customer segments SME-LAM, LAM-KAM, and KAM-SME using Czech currency?
Fused question rating: good.
Base generated questions: Q1:: What is the difference in average
minimum monthly consumption between 'SME' and 'LAM'
segments, 'LAM' and 'KAM' segments, and 'KAM' and 'SME'
segments for customers with the 'CZK' currency within the year Fused question rating: good.
2013? Q2:: What are the average differences in the least monthly ----------------------------------------------------
consumption for 'SME', 'LAM', and 'KAM' customer segments Question 1507:
using Czech koruna in 2013?
SQL:
Base question rating: good. Reason: Q2.
SELECT DISTINCT T1.Time FROM transactions_1k AS T1
Bird only generated questions: Q1:: What is the difference in INNER JOIN gasstations AS T2 ON T1.GasStationID =
average consumption per customer between the SME and LAM T2.GasStationID WHERE T2.ChainID = 11
segments, LAM and KAM segments, and KAM and SME
segments for customers with the currency 'CZK' who had the Supplied question: Please list the disparate time of the
minimum consumption in the year 2013? Q2:: What are the transactions taken place in the gas stations from chain no. 11.
average consumption differences per customer among SME, Base generated questions: Q1:: At what distinct times did
LAM, and KAM segments for 'CZK' currency customers with the transactions occur at gas stations belonging to chain ID 11? Q2::
least consumption in 2013? When did gas stations in chain 11 have transactions?
Bird only question rating: good. Base question rating: good.
LLM only generated questions: Q1:: What is the average Bird only generated questions: Q1:: What are the distinct times
difference in the minimum consumption between the 'SME' and when transactions occurred at gas stations belonging to chain ID
'LAM' segments, the 'LAM' and 'KAM' segments, and the 'KAM' 11? Q2:: What times did transactions happen at chain 11 gas
and 'SME' segments for customers with the currency 'CZK' within stations?
the year 2013? Q2:: What are the average differences in the least Bird only question rating: good.
consumption among 'SME', 'LAM', and 'KAM' customer segments LLM only generated questions: **Q1::** What are the distinct
in CZK for 2013? times at which transactions occurred at gas stations belonging to
LLM only question rating: good. the chain with ID 11? **Q2::** What times did transactions
Fused generated questions: Q1:: What is the difference in average happen at gas stations of chain 11?
minimum monthly consumption in the year 2013 between the LLM only question rating: good.
SME and LAM segments, between the LAM and KAM segments, Fused generated questions: Q1:: What are the distinct transaction
and between the KAM and SME segments for customers using the times for transactions that occurred at gas stations belonging to
CZK currency? Q2:: What is the average minimum consumption chain number 11? Q2:: What times did transactions occur at
difference in 2013 between customer segments SME-LAM, chain 11 gas stations?
LAM-KAM, and KAM-SME using Czech currency? Fused question rating: good.
Fused question rating: good. ----------------------------------------------------
---------------------------------------------------- Question 1525:
Question 1493: SQL:
SQL: SELECT CAST(SUM(IIF(T2.Currency = 'EUR', 1, 0)) AS
SELECT CAST(SUM(IIF(Consumption > 528.3, 1, 0)) AS FLOAT) * 100 / COUNT(T1.CustomerID) FROM
FLOAT) * 100 / COUNT(CustomerID) FROM yearmonth transactions_1k AS T1 INNER JOIN customers AS T2 ON
WHERE Date = '201202' T1.CustomerID = T2.CustomerID WHERE T1.Date = '2012-08-
Supplied question: In February 2012, what percentage of 25'
customers consumed more than 528.3? Supplied question: What is the percentage of the customers who
Base generated questions: Q1:: What is the percentage of used EUR in 2012/8/25?
customers with a consumption greater than 528.3 units in Base generated questions: Q1:: What is the percentage of
February 2012? Q2:: What percent of February 2012 customers transactions made on August 25, 2012, by customers who have
used more than 528.3 units? their currency set to EUR, compared to the total number of
Base question rating: good. customers who made transactions on that same date? Q2:: What
Bird only generated questions: Q1:: What is the percentage of percent of August 25, 2012 transactions were made by customers
customers who had a consumption greater than 528.3 units in using euros?
February 2012? Q2:: What percent of customers consumed more Base question rating: good+. Reason: suppliedf question does not
than 528.3 units in Feb 2012? match the SQL.
Bird only question rating: good+. Bird only generated questions: Q1:: What is the percentage of
LLM only generated questions: Q1:: What is the percentage of transactions made on August 25, 2012, by customers who have
customers whose consumption exceeded 528.3 in February 2012? their currency set to EUR, relative to the total number of unique
Q2:: What percentage of customers had a consumption over 528.3 customers who made transactions on that date? Q2:: What
in February 2012? percentage of August 25, 2012, transactions were by Euro-using
LLM only question rating: good. customers?
Fused generated questions: Q1:: What percentage of customers Bird only question rating: good+. Reason: suppliedf question
consumed more than 528.3 units of resources in February 2012? does not match the SQL.
Q2:: What is the percentage of high consumption customers in LLM only generated questions: Q1:: What is the percentage of
February 2012? transactions made on August 25, 2012, by customers who have
their currency set to EUR, relative to the total number of unique LLM only generated questions: Q1:: Was the expense for the
customers who made transactions on that date? Q2:: What 'October Meeting' that occurred on October 8th, 2019 approved?
percentage of August 25, 2012 transactions were by Euro-using Q2:: Is the October 8th, 2019 'October Meeting' expense
customers? approved?
LLM only question rating: good+. Reason: suppliedf question LLM only question rating: good+. Reason: supplied question
does not match the SQL. doesn't quite match the SQL.
Fused generated questions: Q1:: What is the percentage of Fused generated questions: Q1:: Was the expense for the "October
transactions made in euros (EUR) by all customers on the date Meeting" that occurred on October 8th, 2019 approved? Q2:: Is
August 25, 2012? Q2:: What percentage of August 25, 2012, the October 8th, 2019 "October Meeting" expense approved?
transactions were in euros? Fused question rating: good+. Reason: supplied question doesn't
Fused question rating: good+. Reason: suppliedf question does quite match the SQL.
not match the SQL. ----------------------------------------------------
---------------------------------------------------- Question 1351:
Question 1312: SQL:
SQL: SELECT T2.major_name FROM member AS T1 INNER JOIN
SELECT T2.major_name FROM member AS T1 INNER JOIN major AS T2 ON T1.link_to_major = T2.major_id WHERE
major AS T2 ON T1.link_to_major = T2.major_id WHERE T1.first_name = 'Brent' AND T1.last_name = 'Thomason'
T1.first_name = 'Angela' AND T1.last_name = 'Sanders' Supplied question: What was Brent Thomason's major?
Supplied question: What's Angela Sanders's major? Base generated questions: Q1:: What is the major name associated
Base generated questions: Q1:: What is the major name associated with the member named Brent Thomason? Q2:: What is Brent
with the member named Angela Sanders? Q2:: What is Angela Thomason's major?
Sanders' major? Base question rating: good.
Base question rating: good. Bird only generated questions: Q1:: What is the major name
Bird only generated questions: Q1:: What is the major name associated with the member named Brent Thomason? Q2:: What
associated with a member named Angela Sanders in the database? is Brent Thomason's major?
Q2:: What is Angela Sanders' major? Bird only question rating: good.
Bird only question rating: good. LLM only generated questions: Q1:: What is the name of the
LLM only generated questions: Q1:: What is the major name major associated with a member named Brent Thomason? Q2::
associated with the member named Angela Sanders? Q2:: What What major is Brent Thomason studying?
major is Angela Sanders studying? LLM only question rating: good.
LLM only question rating: good. Fused generated questions: Q1:: What is the academic major of
Fused generated questions: Q1:: What is the academic major of the club member named Brent Thomason? Q2:: What major is
the club member Angela Sanders? Q2:: What major is Angela Brent Thomason studying?
Sanders studying? Fused question rating: good.
Fused question rating: good. ----------------------------------------------------
---------------------------------------------------- Question 1362:
Question 1338: SQL:
SQL: SELECT COUNT(city) FROM zip_code WHERE county =
SELECT T3.approved FROM event AS T1 INNER JOIN budget 'Orange County' AND state = 'Virginia'
AS T2 ON T1.event_id = T2.link_to_event INNER JOIN expense Supplied question: How many cities are there in Orange County,
AS T3 ON T2.budget_id = T3.link_to_budget WHERE Virginia?
T1.event_name = 'October Meeting' AND T1.event_date LIKE Base generated questions: Q1:: How many distinct cities are there
'2019-10-08%' in Orange County, Virginia, according to the zip_code database?
Supplied question: Was each expense in October Meeting on Q2:: What is the number of cities in Orange County, VA?
October 8, 2019 approved? Base question rating: good.
Base generated questions: Q1:: Was the expense for the 'October Bird only generated questions: Q1:: How many cities are located
Meeting' held on October 8th, 2019 approved? Q2:: Is the in Orange County, Virginia according to the zip_code database?
October 8th, 2019 'October Meeting' expense approved? Q2:: What is the number of cities in Orange County, VA?
Base question rating: good+. Reason: supplied question doesn't Bird only question rating: good.
quite match the SQL.
LLM only generated questions: Q1:: How many cities are there in
Bird only generated questions: Q1:: Which expenses have been Orange County, Virginia, according to the zip_code table? Q2::
approved for the event named 'October Meeting' that took place What is the city count in Orange County, VA?
on October 8th, 2019? Q2:: What approved expenses are there for
the 'October Meeting' on 2019-10-08? LLM only question rating: good.
Bird only question rating: good. Fused generated questions: Q1:: How many distinct cities are in
Orange County, Virginia, according to the ZIP code data? Q2::
How many cities are in Orange County, VA?
Fused question rating: good. T1.event_name = 'April Speaker' GROUP BY T2.category
---------------------------------------------------- ORDER BY SUM(T2.amount) ASC
Question 1380: Supplied question: Calculate the amount budgeted for 'April
SQL: Speaker' event. List all the budgeted categories for said event in
an ascending order based on their amount budgeted.
SELECT SUM(spent) FROM budget WHERE category = 'Food'
Base generated questions: Q1:: What are the total amounts spent
Supplied question: What is the total amount of money spent for in each category for the event named 'April Speaker', listed in
food? ascending order of amount spent? Q2:: How much was spent per
Base generated questions: Q1:: How much has been spent in total category for 'April Speaker', from least to most?
on the 'Food' category according to the budget records? Q2:: Base question rating: good.
What is the total expenditure on food?
Bird only generated questions: Q1:: What are the total amounts
Base question rating: good. spent in each category for the event named 'April Speaker', listed
Bird only generated questions: Q1:: What is the total amount of in ascending order of expenditure? Q2:: What is the cost
money spent on the 'Food' category according to the records in the breakdown by category for the 'April Speaker' event, from least to
budget table? Q2:: How much was spent on food? most expensive?
Bird only question rating: good. Bird only question rating: good.
LLM only generated questions: Q1:: What is the total amount of LLM only generated questions: Q1:: Which categories of the
money spent on the 'Food' category according to the budget budget are associated with the 'April Speaker' event and what is
records? Q2:: How much was spent on food? the total amount spent in each category, listed in ascending order
LLM only question rating: good. of expenditure? Q2:: What are the spending totals per category
Fused generated questions: Q1:: What is the total amount of for the 'April Speaker' event, from lowest to highest?
money spent on the category 'Food' according to the budget LLM only question rating: good.
records? Q2:: How much was spent on food? Fused generated questions: Q1:: What are the total budgeted
Fused question rating: good. amounts for each category, listed in ascending order of amount,
---------------------------------------------------- for the 'April Speaker' event? Q2:: What is the smallest to largest
Question 1394: budget allocation by category for the 'April Speaker' event?
SQL: Fused question rating: good.
SELECT COUNT(T2.member_id) FROM major AS T1 INNER ----------------------------------------------------
JOIN member AS T2 ON T1.major_id = T2.link_to_major Question 1427:
WHERE T1.major_name = 'Physics Teaching' SQL:
Supplied question: How many members of the Student_Club have SELECT DISTINCT T2.category, T1.type FROM event AS T1
major in 'Physics Teaching'? INNER JOIN budget AS T2 ON T1.event_id = T2.link_to_event
Base generated questions: Q1:: How many members are linked to WHERE T1.location = 'MU 215' AND T2.spent = 0 AND T1.type
the major with the name 'Physics Teaching'? Q2:: How many = 'Guest Speaker'
students are enrolled in the Physics Teaching major? Supplied question: What are the budget category of the events
Base question rating: good. located at MU 215 and a guest speaker type with a 0 budget
Bird only generated questions: Q1:: How many members are spent?
associated with the 'Physics Teaching' major according to the Base generated questions: Q1:: What are the distinct categories of
database? Q2:: How many members are studying to become budget items linked to events of the type 'Guest Speaker' that have
physics teachers? occurred at location 'MU 215' and have a recorded spent amount
Bird only question rating: good. of 0? Q2:: Which unpaid budget categories are associated with
'Guest Speaker' events at 'MU 215'?
LLM only generated questions: Q1:: How many members are
associated with the major named 'Physics Teaching'? Q2:: What Base question rating: good.
is the count of members majoring in 'Physics Teaching'? Bird only generated questions: Q1:: What are the distinct
LLM only question rating: good. categories of budget items associated with events of type 'Guest
Speaker' held at location 'MU 215' that have not had any money
Fused generated questions: Q1:: How many members are
spent? Q2:: What budget categories are linked to unpaid 'Guest
associated with the 'Physics Teaching' major according to the
Speaker' events at 'MU 215'?
database? Q2:: What is the count of 'Physics Teaching' major
members? Bird only question rating: good.
Fused question rating: good. LLM only generated questions: Q1:: What are the distinct
categories of budgets linked to events of type 'Guest Speaker' held
----------------------------------------------------
at location 'MU 215' which have had no expenditure? Q2:: What
Question 1405: budget categories are there for unspent 'Guest Speaker' events at
SQL: 'MU 215'?
SELECT T2.category, SUM(T2.amount) FROM event AS T1 LLM only question rating: good.
JOIN budget AS T2 ON T1.event_id = T2.link_to_event WHERE Fused generated questions: Q1:: What are the distinct budget
categories associated with 'Guest Speaker' events that have not
had any money spent (budget spent is $0) and are located at 'MU LLM only generated questions: Q1:: Which patients have a
215'? Q2:: Which budget categories have unused funds for 'Guest Thrombosis score of 2, and what are their unique IDs, genders,
Speaker' events at 'MU 215'? and diagnoses? Q2:: Who are the patients with a Thrombosis
Fused question rating: good. score of 2, and what are their genders and diagnoses?
---------------------------------------------------- LLM only question rating: good-. Reason: not picking up on
Question 1149: thrombrosis score of 2 meaning severe, but still accurate.
SQL: Fused generated questions: Q1:: Which patients have a severe
degree of thrombosis, as indicated by a thrombosis score of 2, and
SELECT CAST(SUM(CASE WHEN Admission = '+' THEN 1 what are their unique identification numbers, genders, and
ELSE 0 END) AS REAL) * 100 / SUM(CASE WHEN Admission diagnosed diseases? Q2:: List the ID, sex, and diagnosis of
= '-' THEN 1 ELSE 0 END) FROM Patient WHERE SEX = 'M' patients with severe thrombosis.
Supplied question: Are there more in-patient or outpatient who Fused question rating: good.
were male? What is the deviation in percentage?
----------------------------------------------------
Base generated questions: Step 1: What is the percentage of male
patients who were admitted to the hospital compared to the Question 1171:
number of male patients who were not admitted? Step 2: What is SQL:
the admission rate for male patients? SELECT COUNT(T1.ID) FROM Patient AS T1 INNER JOIN
Base question rating: good-. Reason: not picking up on admitted Examination AS T2 ON T1.ID = T2.ID WHERE
vs. outpatient, but still accurate. STRFTIME('%Y', T2.`Examination Date`) BETWEEN '1990'
Bird only generated questions: Q1:: What is the percentage of AND '1993' AND STRFTIME('%Y', T2.`Examination Date`) -
male patients (SEX = 'M') with a positive admission status ('+') STRFTIME('%Y', T1.Birthday) < 18
compared to the total number of male patients with a negative Supplied question: How many underage patients were examined
admission status ('-') in the Patient table? Q2:: What percentage during the course of the three-year period from 1990 to 1993?
more male patients are admitted than not admitted? Base generated questions: Q1:: How many patients were under the
Bird only question rating: good-. Reason: not picking up on age of 18 when they had an examination between the years 1990
admitted vs. outpatient, but still accurate. and 1993? Q2:: How many minors had examinations from 1990
LLM only generated questions: Q1:: What is the percentage of to 1993?
male patients admitted (marked with '+') relative to those not Base question rating: good.
admitted (marked with '-')? Q2:: What percentage of male Bird only generated questions: Q1:: How many patients were
patients were admitted compared to those not admitted? under the age of 18 at the time of their examination between the
LLM only question rating: good-. Reason: not picking up on years 1990 and 1993? Q2:: How many minors were examined
admitted vs. outpatient, but still accurate. from 1990 to 1993?
Fused generated questions: Q1:: What is the percentage of male Bird only question rating: good.
patients who were admitted to the hospital compared to those who LLM only generated questions: Q1:: How many patients were
were followed at the outpatient clinic? Q2:: What is the hospital under the age of 18 at the time of their examination that took
admission rate for male patients versus their outpatient clinic place between the years 1990 and 1993? Q2:: How many
visits? pediatric examinations occurred between 1990 and 1993?
Fused question rating: good. LLM only question rating: good.
---------------------------------------------------- Fused generated questions: Q1:: How many patients who were
Question 1157: under the age of 18 at the time of their examination had an
SQL: examination between the years 1990 and 1993? Q2:: How many
minors were examined from 1990 to 1993?
SELECT DISTINCT T1.ID, T1.SEX, T1.Diagnosis FROM
Patient AS T1 INNER JOIN Examination AS T2 ON T1.ID = Fused question rating: good.
T2.ID WHERE T2.Thrombosis = 2 ----------------------------------------------------
Supplied question: For patients with severe degree of thrombosis, Question 1192:
list their ID, sex and disease the patient is diagnosed with. SQL:
Base generated questions: Q1:: Which patients have a Thrombosis SELECT DISTINCT T1.ID FROM Patient AS T1 INNER JOIN
score of 2, and what are their unique IDs, genders, and diagnoses? Laboratory AS T2 ON T1.ID = T2.ID WHERE T1.Admission = '-
Q2:: Who are the patients with a Thrombosis score of 2? ' AND T2.`T-BIL` < 2.0 AND T2.Date LIKE '1991-10-%'
Base question rating: good-. Reason: not picking up on Supplied question: List all patients who were followed up at the
thrombrosis score of 2 meaning severe, but still accurate. outpatient clinic who underwent a laboratory test in October 1991
Bird only generated questions: Q1:: Which distinct patients have a and had a total blood bilirubin level within the normal range.
diagnosis recorded and have a Thrombosis value of 2 in their Base generated questions: Q1:: Which patient IDs correspond to
examination results? Q2:: Who are the unique patients diagnosed individuals who were not admitted, had a total bilirubin level
with a Thrombosis level of 2? below 2.0, and had laboratory tests conducted in October 1991?
Bird only question rating: good-. Reason: not picking up on Q2:: Who had normal bilirubin levels and lab tests in October
thrombrosis score of 2 meaning severe, but still accurate. 1991 without being admitted?
Base question rating: good-. Reason: not picking up on field SQL:
meaning details. SELECT DISTINCT T1.ID, T1.SEX , STRFTIME('%Y',
Bird only generated questions: Q1:: Which patient IDs correspond CURRENT_TIMESTAMP) - STRFTIME('%Y', T1.Birthday)
to individuals who were not admitted, had a total bilirubin level FROM Patient AS T1 INNER JOIN Laboratory AS T2 ON T1.ID
below 2.0, and underwent laboratory testing in October 1991? = T2.ID WHERE T2.GLU >= 180 AND T2.`T-CHO` < 250
Q2:: Who were the outpatients with normal bilirubin levels tested Supplied question: Provide ID, sex and age of patient who has
in October 1991? blood glucose (GLU) not within normal range but with total
Bird only question rating: good. cholesterol(T-CHO) within normal range.
LLM only generated questions: Q1:: Which patient IDs Base generated questions: Step 1: Which patients have a glucose
correspond to individuals who were not admitted to the hospital level of 180 or higher and a total cholesterol level of less than
and had a total bilirubin level of less than 2.0 mg/dL during any 250, and what are their IDs, sexes, and current ages? Step 2: Who
lab test conducted in October 1991? Q2:: What are the IDs of are the patients with high glucose but not high cholesterol, and
patients with normal bilirubin levels in October 1991 who were what are their IDs, genders, and ages?
not admitted? Base question rating: good.
LLM only question rating: good-. Reason: not picking up on field Bird only generated questions: Q1:: Which patients have a
menaing details. recorded glucose level of 180 or higher and a total cholesterol
Fused generated questions: Q1:: Which patients were followed at level below 250, and what are their IDs, sexes, and current ages?
the outpatient clinic and had total bilirubin levels below the Q2:: Who are the patients with high glucose and acceptable
normal range in October 1991? Q2:: Who had healthy liver cholesterol levels, and how old are they?
function tests at the outpatient clinic in October 1991? Bird only question rating: good.
Fused question rating: good. LLM only generated questions: Q1:: Which patients have a
---------------------------------------------------- glucose level of 180 or higher and a total cholesterol level of less
Question 1209: than 250, and what are their IDs, sexes, and ages based on the
SQL: current year? Q2:: Who are the patients with high glucose but not
high cholesterol, and what are their ID, gender, and age?
SELECT DISTINCT T1.Diagnosis FROM Patient AS T1 INNER
JOIN Laboratory AS T2 ON T1.ID = T2.ID WHERE T2.GPT > LLM only question rating: good.
60 ORDER BY T1.Birthday ASC Fused generated questions: Q1:: Which patients have a blood
Supplied question: Please provide the diagnosis of patients with glucose level of 180 or higher and a total cholesterol level below
ALT glutamic pylvic transaminase beyond the normal range by 250, and what are their genders and ages? Q2:: Who are the
ascending order of their date of birth. patients with high glucose and acceptable cholesterol levels, along
with their genders and ages?
Base generated questions: Q1:: What are the distinct diagnoses for
patients who have a GPT level greater than 60, sorted by their Fused question rating: good.
birthday in ascending order? Q2:: Which unique health ----------------------------------------------------
conditions are found in patients with elevated GPT levels, listed Question 1243:
from youngest to oldest? SQL:
Base question rating: good. SELECT CAST(SUM(CASE WHEN T2.PT >= 14 AND T1.SEX
Bird only generated questions: Q1:: Which distinct diagnoses are = 'F' THEN 1 ELSE 0 END) AS REAL) * 100 / COUNT(CASE
associated with patients who have a GPT level greater than 60, WHEN T2.PT >= 14 THEN 1 ELSE 0 END) FROM Patient AS
when the patients are sorted by their birth dates in ascending T1 INNER JOIN Laboratory AS T2 ON T1.ID = T2.ID WHERE
order? Q2:: What are the unique diagnoses for patients with high STRFTIME('%Y', CURRENT_TIMESTAMP) -
GPT levels, listed from oldest to youngest? STRFTIME('%Y', T1.Birthday) > 55
Bird only question rating: good. Supplied question: For all patients who are older than 55 years
LLM only generated questions: Q1:: What are the distinct old, what is the percentage of female who has abnormal
diagnoses for patients who have a GPT level greater than 60, prothrombin time (PT)?
sorted by their birthdays in ascending order? Q2:: What distinct Base generated questions: Q1:: What is the percentage of female
conditions have been diagnosed for patients with elevated GPT patients over the age of 55 who have a PT value of 14 or greater,
levels, listed from youngest to oldest? out of all patients over the age of 55 with a PT value of 14 or
LLM only question rating: good. greater? Q2:: What percentage of older females (over 55) have
Fused generated questions: Q1:: Which distinct diseases have high PT levels?
been diagnosed in patients who have an elevated GPT level above Base question rating: good.
60, sorted by the patient's age from the youngest to the oldest? Bird only generated questions: Step 1: What percentage of female
Q2:: What diseases are diagnosed in younger to older patients patients over the age of 55 have a PT (Prothrombin Time) value of
with high liver enzyme levels? 14 or higher? Step 2: What is the percentage of women older than
Fused question rating: good. 55 with a high PT level?
---------------------------------------------------- Bird only question rating: good.
Question 1232: LLM only generated questions: Q1:: What is the percentage of
female patients over the age of 55 who have a PT (Prothrombin
Time) value of 14 or greater out of all patients over the age of 55 Bird only generated questions: Q1:: What is the most recent
with a recorded PT value of 14 or greater? Q2:: What percentage birthday among patients who have a Glutamic-Oxaloacetic
of women over 55 have a high PT level? Transaminase (GOT) level of 60 or higher according to the
LLM only question rating: good. laboratory results? Q2:: Who is the youngest patient with a GOT
Fused generated questions: Q1:: What is the percentage of female level of at least 60?
patients over the age of 55 who have an abnormal prothrombin Bird only question rating: good-. Reason: Not picking up on field
time (PT) of 14 or higher, out of all patients over the age of 55 meanings.
with an abnormal PT? Q2:: What percentage of women over 55 LLM only generated questions: **Q1::** What is the most recent
have a high blood clotting time? birthday of a patient who has a Glutamate Oxaloacetate
Fused question rating: good. Transaminase (GOT) level of 60 or higher according to laboratory
---------------------------------------------------- records? **Q2::** What is the latest birthday for a patient with a
GOT level at least 60?
Question 1256:
LLM only question rating: good-. Reason: Not picking up on
SQL: field meanings.
SELECT COUNT(T1.ID) FROM Patient AS T1 INNER JOIN Fused generated questions: Q1:: What is the most recent birthday
Laboratory AS T2 ON T1.ID = T2.ID WHERE (T2.CRP = '+' ) of a patient who has an AST level of 60 or higher? Q2:: Who is
AND T1.Description IS NULL the youngest patient with elevated AST levels?
Supplied question: How many patients with a abnormal C-reactive Fused question rating: good.
protein don't have their data recorded?
----------------------------------------------------
Base generated questions: Q1:: How many patients have a
positive CRP test result recorded in the Laboratory table and have Question 1031:
a NULL description in the Patient table? Q2:: How many patients SQL:
with positive CRP lack a description? SELECT DISTINCT DATETIME() - T2.birthday age FROM
Base question rating: good. Player_Attributes AS t1 INNER JOIN Player AS t2 ON
Bird only generated questions: Q1:: How many patients have a t1.player_api_id = t2.player_api_id WHERE
positive C-reactive protein (CRP) test result recorded in the STRFTIME('%Y',t1.`date`) >= '2013' AND
Laboratory table and do not have a description entered in the STRFTIME('%Y',t1.`date`) <= '2015' AND t1.sprint_speed >= 97
Patient table? Q2:: How many patients with positive CRP lack a Supplied question: At present, calculate for the player's age who
description? have a sprint speed of no less than 97 between 2013 to 2015.
Bird only question rating: good. Base generated questions: Q1:: What are the distinct ages of
LLM only generated questions: Q1:: How many patients have a players who had a sprint speed of 97 or higher recorded between
positive C-reactive protein (CRP) test result in the Laboratory the years 2013 and 2015? Q2:: What ages did the fastest players
table and have a null description in the Patient table? Q2:: How have during 2013-2015?
many patients with positive CRP lack a description? Base question rating: good.
LLM only question rating: good. Bird only generated questions: Q1:: Which players had a sprint
Fused generated questions: Q1:: How many patients have a speed of 97 or higher between the years 2013 and 2015, and what
positive C-reactive protein test result indicating inflammation but were their ages at the time of the recorded sprint speed? Q2::
do not have a recorded first date of data entry in the system? Q2:: Who were the fastest players aged 97 or above from 2013 to
How many patients with inflammation have no recorded first data 2015?
entry? Bird only question rating: good-. Reason: getting ID, not age.
Fused question rating: good. LLM only generated questions: Q1:: What are the distinct ages of
---------------------------------------------------- players who had a sprint speed of 97 or higher recorded in their
player attributes between the years 2013 and 2015? Q2:: How old
Question 1281: were the fastest players between 2013 and 2015?
SQL: LLM only question rating: good.
SELECT T1.Birthday FROM Patient AS T1 INNER JOIN Fused generated questions: Q1:: What are the ages of players who
Laboratory AS T2 ON T1.ID = T2.ID WHERE T2.GOT >= 60 have a sprint speed rating of 97 or higher according to the
ORDER BY T1.Birthday DESC LIMIT 1 attributes recorded between the years 2013 and 2015? Q2:: How
Supplied question: Among the patients who have an abnormal old are the fastest players, with speeds of at least 97, based on
level of glutamic oxaloacetic transaminase, when was the data from 2013 to 2015?
youngest of them born? Fused question rating: good.
Base generated questions: Q1:: What is the most recent birthday ----------------------------------------------------
of a patient who has a Glutamic Oxaloacetic Transaminase (GOT)
level of 60 or higher as recorded in the laboratory results? Q2:: Question 1040:
Who is the youngest patient with a GOT level at or above 60? SQL:
Base question rating: good-. Reason: Not picking up on field SELECT t1.player_name FROM Player AS t1 INNER JOIN
meanings. Player_Attributes AS t2 ON t1.player_api_id = t2.player_api_id
WHERE t1.height > 180 GROUP BY t1.id ORDER BY
CAST(SUM(t2.heading_accuracy) AS REAL) / Fused generated questions: Q1:: What is the average overall rating
COUNT(t2.`player_fifa_api_id`) DESC LIMIT 10 for players taller than 170 cm based on the recorded attributes
Supplied question: List the top 10 players' names whose heights between the years 2010 and 2015? Q2:: What is the average FIFA
are above 180 in descending order of average heading accuracy. strength rating for players over 170 cm tall from 2010 to 2015?
Base generated questions: Q1:: Which ten players taller than 180 Fused question rating: good.
cm have the highest average heading accuracy in the ----------------------------------------------------
Player_Attributes table? Q2:: Who are the top ten tallest players Question 1088:
with the best average heading skills? SQL:
Base question rating: good-. Reason: not picking up on the order SELECT DISTINCT t1.player_name FROM Player AS t1 INNER
by. JOIN Player_Attributes AS t2 ON t1.player_api_id =
Bird only generated questions: Q1:: Which ten players taller than t2.player_api_id WHERE t2.volleys > 70 AND t2.dribbling > 70
180 cm have the highest average heading accuracy in the Supplied question: Please list the names of the players whose
Player_Attributes table? Q2:: Who are the top ten tallest players volley score and dribbling score are over 70.
with the best average heading skill?
Base generated questions: Q1:: Which players have a volleys
Bird only question rating: good-. Reason: not picking up on the attribute greater than 70 and a dribbling attribute also greater than
order by. 70, without listing any player more than once? Q2:: Who are the
LLM only generated questions: Q1:: Which ten players taller than players with both volleys and dribbling over 70?
180 cm have the highest average heading accuracy in the Base question rating: good.
Player_Attributes table? Q2:: Who are the top ten tallest players
with the best average heading accuracy? Bird only generated questions: Q1:: Which players have a volleys
attribute greater than 70 and a dribbling attribute greater than 70?
LLM only question rating: good-. Reason: not picking up on the Q2:: Who are the players with volleys and dribbling skills over
order by. 70?
Fused generated questions: Q1:: Which 10 players taller than 180 Bird only question rating: good.
cm have the highest average heading accuracy score in the
database, based on their attributes? Q2:: Who are the top 10 LLM only generated questions: Q1:: Which players have both
tallest players with the best average heading accuracy? volleys and dribbling attributes greater than 70? Q2:: Who are the
players with volleys and dribbling skills above 70?
Fused question rating: good-. Reason: not picking up on the order
by. LLM only question rating: good.
---------------------------------------------------- Fused generated questions: Q1:: Which players have a volley
score and a dribbling score both greater than 70 according to the
Question 1068: FIFA calculations? Q2:: Who are the players with volley and
SQL: dribbling skills above 70?
SELECT CAST(SUM(t2.overall_rating) AS REAL) / Fused question rating: good.
COUNT(t2.id) FROM Player AS t1 INNER JOIN ----------------------------------------------------
Player_Attributes AS t2 ON t1.player_api_id = t2.player_api_id
WHERE t1.height > 170 AND STRFTIME('%Y',t2.`date`) >= Question 1102:
'2010' AND STRFTIME('%Y',t2.`date`) <= '2015' SQL:
Supplied question: From 2010 to 2015, what was the average SELECT t1.player_name FROM Player AS t1 INNER JOIN
overall rating of players who are higher than 170? Player_Attributes AS t2 ON t1.player_api_id = t2.player_api_id
Base generated questions: Q1:: What is the average overall rating WHERE SUBSTR(t2.`date`, 1, 10) = '2016-06-23' AND
of players who are taller than 170 cm, based on the player t2.overall_rating = 77 ORDER BY t1.birthday ASC LIMIT 1
attributes recorded between the years 2010 and 2015? Q2:: Supplied question: For the players who had a 77 points overall
What's the average rating for players over 170 cm tall from 2010 rating on 2016/6/23, who was the oldest? Give the name of the
to 2015? player.
Base question rating: good. Base generated questions: Q1:: Who is the earliest born player
Bird only generated questions: Q1:: What is the average overall with an overall rating of 77 on June 23, 2016, according to the
rating of players who are taller than 170 cm, based on their Player and Player_Attributes tables? Q2:: Who was the oldest
attributes recorded between the years 2010 and 2015? Q2:: player with a rating of 77 on June 23, 2016?
What's the average rating for players over 170 cm tall from 2010 Base question rating: good.
to 2015? Bird only generated questions: Q1:: Who is the earliest born
Bird only question rating: good. player with an overall rating of 77 on the date of June 23, 2016,
LLM only generated questions: **Q1::** What is the average according to the Player and Player_Attributes tables? Q2:: Who
overall rating of players who are taller than 170 cm, based on their was the oldest player with a 77 rating on June 23, 2016?
attributes recorded between the years 2010 and 2015? **Q2::** Bird only question rating: good.
What is the average rating for players over 170 cm tall from 2010 LLM only generated questions: Q1:: Who is the earliest born
to 2015? player with an overall rating of 77 on June 23, 2016, according to
LLM only question rating: good. the Player and Player_Attributes tables? Q2:: Who is the oldest
player with a rating of 77 on June 23, 2016?
LLM only question rating: good. Bird only question rating: good.
Fused generated questions: Q1:: Who is the oldest player with an LLM only generated questions: Q1:: How many players named
overall rating of 77 according to the player attributes recorded on Aaron, or with a name starting with "Aaron", were born after the
June 23, 2016? Q2:: Which oldest player had a FIFA rating of 77 year 1990 according to the database? Q2:: How many Aarons are
on June 23, 2016? born post-1990?
Fused question rating: good. LLM only question rating: good.
---------------------------------------------------- Fused generated questions: Q1:: How many players in the
Question 1114:
SQL:
SELECT CAST(SUM(t2.overall_rating) AS REAL) / COUNT(t2.id) FROM Player AS t1 INNER JOIN Player_Attributes AS t2 ON t1.player_fifa_api_id = t2.player_fifa_api_id WHERE t1.player_name = 'Marko Arnautovic' AND SUBSTR(t2.`date`, 1, 10) BETWEEN '2007-02-22' AND '2016-04-21'
Supplied question: What was the average overall rating for Marko Arnautovic from 2007/2/22 to 2016/4/21?
Base generated questions: Q1:: What is the average overall rating of the player named 'Marko Arnautovic' in the Player_Attributes table, based on records from February 22, 2007, to April 21, 2016? Q2:: What is Marko Arnautovic's average rating between 2007 and 2016?
Base question rating: good.
Bird only generated questions: Q1:: What is the average overall rating of the player named 'Marko Arnautovic' according to the Player_Attributes table for the period between February 22, 2007, and April 21, 2016? Q2:: What is Marko Arnautovic's average rating from 2007 to 2016?
Bird only question rating: good.
LLM only generated questions: Q1:: What is the average overall rating of the player named 'Marko Arnautovic' according to the Player_Attributes table, for the records that fall between the dates '2007-02-22' and '2016-04-21'? Q2:: What is Marko Arnautovic's average rating from 2007 to 2016?
LLM only question rating: good.
Fused generated questions: Q1:: What is the average overall rating of the player named 'Marko Arnautovic' according to the FIFA database for the period between February 22, 2007, and April 21, 2016? Q2:: What is Marko Arnautovic's average FIFA rating from 2007 to 2016?
Fused question rating: good.
----------------------------------------------------
Question 1133:
SQL:
SELECT COUNT(id) FROM Player WHERE birthday > '1990' AND player_name LIKE 'Aaron%'
Supplied question: How many football players born after the 1990s have the first name "Aaron"?
Base generated questions: Q1:: How many players named Aaron who were born after the year 1990 are recorded in the Player database? Q2:: How many post-1990 Aarons are in the Player database?
Base question rating: good.
Bird only generated questions: Q1:: How many players in the Player table were born after the year 1990 and have a name that starts with "Aaron"? Q2:: How many "Aaron"s born post-1990 are listed as players?
Bird only question rating: good.
LLM only generated questions: Q1:: How many players named Aaron, or with a name starting with "Aaron", were born after the year 1990 according to the database? Q2:: How many Aarons are born post-1990?
LLM only question rating: good.
Fused generated questions: Q1:: How many players in the database were born after the year 1990 and have a name that starts with "Aaron"? Q2:: How many post-1990 players are named Aaron?
Fused question rating: good.
----------------------------------------------------
Question 1144:
SQL:
SELECT id, finishing, curve FROM Player_Attributes WHERE player_api_id = ( SELECT player_api_id FROM Player ORDER BY weight DESC LIMIT 1 ) LIMIT 1
Supplied question: Please state the finishing rate and curve score of the player who has the heaviest weight.
Base generated questions: Q1:: What are the id, finishing, and curve attributes of the player who has the highest weight in the Player_Attributes table? Q2:: What are the key stats for the heaviest player?
Base question rating: good.
Bird only generated questions: Q1:: What are the id, finishing, and curve attributes of the player with the highest weight in the Player_Attributes table? Q2:: What skills does the heaviest player have?
Bird only question rating: good.
LLM only generated questions: Q1:: What are the id, finishing, and curve attributes of the player who has the highest weight according to the Player_Attributes table? Q2:: What are the key attributes of the heaviest player?
LLM only question rating: good.
Fused generated questions: Q1:: What are the id, finishing, and curve attributes for the player with the highest weight in the Player table? Q2:: What finishing and curve scores does the heaviest player have?
Fused question rating: good.
----------------------------------------------------
Question 847:
SQL:
SELECT T2.surname FROM qualifying AS T1 INNER JOIN drivers AS T2 ON T2.driverId = T1.driverId WHERE T1.raceId = 19 ORDER BY T1.q2 ASC LIMIT 1
Supplied question: What is the surname of the driver with the best lap time in race number 19 in the second qualifying period?
Base generated questions: Q1:: Which driver's surname corresponds to the driver who achieved the fastest time in the second qualifying session (Q2) for the race with the raceId of 19? Q2:: Who had the fastest Q2 time in race 19?
Base question rating: good.
Bird only generated questions: Q1:: Which driver's surname is associated with the fastest qualifying time in Q2 for race number 19? Q2:: Who had the fastest Q2 time in race 19?
Bird only question rating: good.
LLM only generated questions: Q1:: Who is the driver with the surname that qualified fastest in the second qualifying session for race number 19? Q2:: Which driver had the best Q2 time in the 19th race?
LLM only question rating: good.
Fused generated questions: Q1:: Which driver's surname corresponds to the driver who had the fastest lap time in the second qualifying session (Q2) for the race with the identification number 19? Q2:: Who had the fastest Q2 lap in race 19?
Fused question rating: good.
----------------------------------------------------
Question 862:
SQL:
SELECT COUNT(T3.driverId) FROM races AS T1 INNER JOIN results AS T2 ON T2.raceId = T1.raceId INNER JOIN drivers AS T3 ON T3.driverId = T2.driverId WHERE T1.year = 2007 AND T1.name = 'Bahrain Grand Prix' AND T2.time IS NULL
Supplied question: For the Bahrain Grand Prix in 2007, how many drivers not finished the game?
Base generated questions: Q1:: How many drivers who participated in the 2007 Bahrain Grand Prix did not finish the race? Q2:: How many drivers did not finish the 2007 Bahrain Grand Prix?
Base question rating: good+. Reason: the supplied question is not proper English.
Bird only generated questions: Step 1: How many drivers who participated in the 2007 Bahrain Grand Prix did not finish the race (as indicated by having a NULL time in the results)? Step 2: How many drivers did not finish the 2007 Bahrain Grand Prix?
Bird only question rating: good+. Reason: the supplied question is not proper English.
LLM only generated questions: Q1:: How many drivers who participated in the 2007 Bahrain Grand Prix did not finish the race, as indicated by having no recorded finish time in the results? Q2:: How many drivers did not finish the 2007 Bahrain Grand Prix?
LLM only question rating: good+. Reason: the supplied question is not proper English.
Fused generated questions: Q1:: How many drivers who participated in the 2007 Bahrain Grand Prix did not finish the race? Q2:: How many drivers did not finish the 2007 Bahrain Grand Prix?
Fused question rating: good+. Reason: the supplied question is not proper English.
----------------------------------------------------
Question 875:
SQL:
SELECT T2.url FROM races AS T1 INNER JOIN seasons AS T2 ON T2.year = T1.year WHERE T1.raceId = 901
Supplied question: Show me the season page of year when the race No. 901 took place.
Base generated questions: Q1:: What is the URL of the season associated with the race that has the raceId of 901? Q2:: What's the season's URL for race 901?
Base question rating: good.
Bird only generated questions: Q1:: What is the URL of the season associated with the race that has the raceId of 901? Q2:: What's the season's URL for the race with ID 901?
Bird only question rating: good.
LLM only generated questions: Q1:: What is the URL of the season corresponding to the race with the raceId of 901 in the database? Q2:: What is the season URL for race 901?
LLM only question rating: good.
Fused generated questions: Q1:: What is the website link for the introduction page of the Formula One season in which the race with the raceId 901 occurred? Q2:: What's the intro page link for the F1 season of race 901?
Fused question rating: good. Reason: more accurate but not really needed.
----------------------------------------------------
Question 892:
SQL:
SELECT T3.forename, T3.surname, T2.points FROM races AS T1 INNER JOIN driverStandings AS T2 ON T2.raceId = T1.raceId INNER JOIN drivers AS T3 ON T3.driverId = T2.driverId ORDER BY T2.points DESC LIMIT 1
Supplied question: State the driver with the most points scored. Find his full name with that points.
Base generated questions: Q1:: Which driver has the highest number of points in a specific race, and what are their first and last names? Q2:: Who is the top-scoring driver in a race?
Base question rating: good-. Reason: a race vs. any race.
Bird only generated questions: Q1:: Which driver has the highest number of points in a single race, and what are their forename and surname? Q2:: Who is the top-scoring driver in any race?
Bird only question rating: good.
LLM only generated questions: Q1:: Which driver has the highest number of points, and what are their forename and surname, based on the latest race results in the database? Q2:: Who is the top-scoring driver in the latest race?
LLM only question rating: good. Reason: focusing on latest race.
Fused generated questions: Q1:: Which driver, including their first and last name, scored the highest number of points in a single race, according to the records in the database? Q2:: Who scored the most points in a single race?
Fused question rating: good.
----------------------------------------------------
Question 901:
SQL:
SELECT DISTINCT T2.name, T1.name, T1.location FROM circuits AS T1 INNER JOIN races AS T2 ON T2.circuitID = T1.circuitId WHERE T2.year = 2005 AND STRFTIME('%m', T2.date) = '09'
Supplied question: Name the races along with its circuit name and location for f1 races hosted in September 2005.
Base generated questions: Q1:: What are the names of the circuits and the corresponding race names for races that took place in September 2005, along with the locations of those circuits? Q2:: Which races and circuits occurred in September 2005, and where were they located?
Base question rating: good.
Bird only generated questions: Q1:: What are the names and locations of the circuits and the corresponding names of the races that took place in September 2005? Q2:: Which races and circuits occurred in September 2005?
Bird only question rating: good-. Reason: not saying all the output fields.
LLM only generated questions: Q1:: Which race names and circuit names with their locations are associated with races that took place in September 2005? Q2:: What are the names and locations of circuits that hosted races in September 2005?
LLM only question rating: good.
Fused generated questions: Q1:: What are the full names of the racing circuits and their locations, along with the official names of the races that took place in September 2005? Q2:: Which races and circuits were in the September 2005 schedule?
Fused question rating: good.
----------------------------------------------------
Question 912:
SQL:
SELECT circuitRef FROM circuits WHERE name = 'Marina Bay Street Circuit'
Supplied question: What's the reference name of Marina Bay Street Circuit?
Base generated questions: Q1:: What is the circuit reference code for the racing track known as 'Marina Bay Street Circuit' in the database? Q2:: What's the code for the 'Marina Bay Street Circuit'?
Base question rating: good.
Bird only generated questions: Q1:: What is the circuit reference code for the 'Marina Bay Street Circuit' in the circuits table? Q2:: What is the 'Marina Bay Street Circuit' reference code?
Bird only question rating: good.
LLM only generated questions: Q1:: What is the circuit reference code for the 'Marina Bay Street Circuit' in the circuits database? Q2:: What's the reference code for Marina Bay Street Circuit?
LLM only question rating: good.
Fused generated questions: Q1:: What is the unique reference name for the racing circuit with the full name 'Marina Bay Street Circuit' in the database? Q2:: What is the reference name for 'Marina Bay Street Circuit'?
Fused question rating: good.
----------------------------------------------------
Question 937:
SQL:
SELECT T1.time FROM results AS T1 INNER JOIN races AS T2 on T1.raceId = T2.raceId WHERE T1.rank = 2 AND T2.name = 'Chinese Grand Prix' AND T2.year = 2008
Supplied question: What's the finish time for the driver who ranked second in 2008's AustChineseralian Grand Prix?
Base generated questions: Q1:: What was the time recorded for the driver who finished in second place at the Chinese Grand Prix in the year 2008? Q2:: What was the runner-up's time in the 2008 Chinese Grand Prix?
Base question rating: good+. Reason: bad typo in supplied question.
Bird only generated questions: Q1:: What was the time achieved by the driver who finished in second place at the Chinese Grand Prix in the year 2008? Q2:: What was the runner-up's time in the 2008 Chinese Grand Prix?
Bird only question rating: good+. Reason: bad typo in supplied question.
LLM only generated questions: Q1:: What was the time taken by the driver who finished in second place in the Chinese Grand Prix in the year 2008? Q2:: What was the runner-up's time in the 2008 Chinese Grand Prix?
LLM only question rating: good+. Reason: bad typo in supplied question.
Fused generated questions: Q1:: What was the finish time difference of the driver who ranked second in the Chinese Grand Prix in 2008? Q2:: What was the runner-up's time behind the winner in the 2008 Chinese Grand Prix?
Fused question rating: good+. Reason: bad typo in supplied question.
----------------------------------------------------
Question 951:
SQL:
SELECT COUNT(T1.raceId) FROM constructorStandings AS T1 INNER JOIN constructors AS T2 on T1.constructorId = T2.constructorId WHERE T1.points = 0 AND T2.nationality = 'Japanese' GROUP BY T1.constructorId HAVING COUNT(raceId) = 2
Supplied question: How many Japanese constructors have 0 points in 2 races?
Base generated questions: Q1:: How many Japanese constructors have exactly two race entries where their teams scored zero points? Q2:: Which Japanese constructors have two races with no points scored?
Base question rating: good.
Bird only generated questions: Q1:: How many Japanese constructors have participated exactly in two races without scoring any points? Q2:: Which Japanese constructors have two zero-point races?
Bird only question rating: good.
LLM only generated questions: Q1:: How many Japanese constructors have exactly two race entries where their teams scored zero points? Q2:: Which Japanese constructors had two races with no points scored?
LLM only question rating: good.
Fused generated questions: Q1:: Which Japanese constructors have exactly two race entries where they scored zero points? Q2:: Which Japanese constructors finished two races with no points?
Fused question rating: good.
----------------------------------------------------
Question 963:
SQL:
SELECT COUNT(T1.driverId) FROM drivers AS T1 INNER JOIN lapTimes AS T2 on T1.driverId = T2.driverId WHERE T1.nationality = 'French' AND (CAST(SUBSTR(T2.time, 1, 2) AS INTEGER) * 60 + CAST(SUBSTR(T2.time, 4, 2) AS INTEGER) + CAST(SUBSTR(T2.time, 7, 2) AS REAL) / 1000) < 120
Supplied question: How many French drivers who obtain the laptime less than 02:00.00?
Base generated questions: Q1:: How many French drivers have recorded a lap time of less than two minutes in the lapTimes database? Q2:: How many French drivers completed a lap in under two minutes?
Base question rating: good+. Reason: supplied question is ambiguous about count vs. name.
Bird only generated questions: Q1:: How many laps have been completed by French drivers in the lapTimes table where their lap time was less than two minutes? Q2:: How many sub-two-minute laps were completed by French drivers?
Bird only question rating: good+. Reason: supplied question is ambiguous about count vs. name.
LLM only generated questions: Q1:: How many French drivers have completed a lap in less than 120 seconds according to the lapTimes database? Q2:: How many French drivers have lapped under 2 minutes?
LLM only question rating: good+. Reason: supplied question is ambiguous about count vs. name.
Fused generated questions: Q1:: How many French drivers have completed a lap in less than 120 seconds? Q2:: How many French drivers finished a lap in under 2 minutes?
Fused question rating: good+. Reason: supplied question is ambiguous about count vs. name.
----------------------------------------------------
Question 978:
SQL:
SELECT DISTINCT location, lat, lng FROM circuits WHERE country = 'Austria'
Supplied question: How many times the circuits were held in Austria? Please give their location and coordinates.
Base generated questions: Q1:: Which unique locations and their corresponding latitude and longitude coordinates are associated with circuits in Austria? Q2:: What are the distinct locations and coordinates of Austrian racing circuits?
Base question rating: good+. Reason: supplied question is ambiguous between count and list.
Bird only generated questions: Q1:: What are the unique locations and their latitude and longitude coordinates for the racing circuits located in Austria? Q2:: Where are Austria's racing circuits and their coordinates?
Bird only question rating: good+. Reason: supplied question is ambiguous between count and list.
LLM only generated questions: Q1:: What are the unique locations and their corresponding latitude and longitude coordinates for the racing circuits located in Austria? Q2:: What are the distinct locations and GPS coordinates of racing tracks in Austria?
LLM only question rating: good+. Reason: supplied question is ambiguous between count and list.
Fused generated questions: Q1:: What are the unique names of the cities along with their latitude and longitude coordinates where the racing circuits are located in Austria? Q2:: What are the distinct city names and coordinates of racing circuits in Austria?
Fused question rating: good+. Reason: supplied question is ambiguous between count and list.
----------------------------------------------------
Question 994:
SQL:
SELECT SUM(T1.points), T2.name, T2.nationality FROM constructorResults AS T1 INNER JOIN constructors AS T2 ON T1.constructorId = T2.constructorId INNER JOIN races AS T3 ON T3.raceid = T1.raceid WHERE T3.name = 'Monaco Grand Prix' AND T3.year BETWEEN 1980 AND 2010 GROUP BY T2.name ORDER BY SUM(T1.points) DESC LIMIT 1
Supplied question: Which constructor scored most points from Monaco Grand Prix between 1980 and 2010? List the score, name and nationality of this team.
Base generated questions: Q1:: Which constructor has the highest total number of points scored at the Monaco Grand Prix between the years 1980 and 2010, and what is the name and nationality of that constructor? Q2:: Who was the most successful constructor at the Monaco Grand Prix from 1980 to 2010, and where were they from?
Base question rating: good. Reason: more succinct but does not explicitly ask for the score.
Bird only generated questions: Q1:: Which constructor has scored the highest total points at the Monaco Grand Prix between the years 1980 and 2010, and what is the name and nationality of that constructor? Q2:: Who was the top-scoring constructor from 1980 to 2010 at the Monaco Grand Prix, and where were they from?
Bird only question rating: good. Reason: more succinct but does not explicitly ask for the score.
LLM only generated questions: Q1:: Which constructor has scored the highest total points at the Monaco Grand Prix from 1980 to 2010, and what is the name and nationality of that constructor? Q2:: Who was the most successful constructor at the Monaco Grand Prix between 1980 and 2010, and where were they from?
LLM only question rating: good. Reason: more succinct but does not explicitly ask for the score.
Fused generated questions: Q1:: Which constructor scored the highest total points at the Monaco Grand Prix from 1980 to 2010, and what is their nationality? Q2:: Who was the top-scoring constructor at the Monaco Grand Prix between 1980 and 2010, and where were they from?
Fused question rating: good. Reason: more succinct but does not explicitly ask for the score.
----------------------------------------------------
Question 719:
SQL:
SELECT COUNT(T1.id) FROM superhero AS T1 INNER JOIN hero_power AS T2 ON T1.id = T2.hero_id INNER JOIN superpower AS T3 ON T2.power_id = T3.id WHERE T3.power_name = 'Super Strength' AND T1.height_cm > 200
Supplied question: Among the superheroes with the super power of "Super Strength", how many of them have a height of over 200cm?
Base generated questions: Q1:: How many superheroes who are taller than 200 centimeters possess the power of 'Super Strength' according to the database? Q2:: How many tall superheroes have super strength?
Base question rating: good.
Bird only generated questions: Q1:: How many superheroes who are taller than 200 centimeters possess the 'Super Strength' power according to the database? Q2:: How many tall superheroes have super strength?
Bird only question rating: good.
LLM only generated questions: Q1:: How many superheroes are taller than 200 centimeters and possess the superpower of 'Super Strength'? Q2:: How many tall superheroes have super strength?
LLM only question rating: good.
Fused generated questions: Q1:: How many superheroes with a height greater than 200 centimeters possess the superpower 'Super Strength'? Q2:: How many tall superheroes (over 200 cm) have super strength?
Fused question rating: good.
----------------------------------------------------
Question 733:
SQL:
SELECT COUNT(T1.id) FROM superhero AS T1 INNER JOIN publisher AS T2 ON T1.publisher_id = T2.id INNER JOIN colour AS T3 ON T1.eye_colour_id = T3.id WHERE T2.publisher_name = 'Marvel Comics' AND T3.colour = 'Gold'
Supplied question: How many gold-eyed superheroes did Marvel Comics publish?
Base generated questions: Q1:: How many superheroes in the database have a publisher named 'Marvel Comics' and have gold as their eye color? Q2:: How many Marvel superheroes have gold eyes?
Base question rating: good.
Bird only generated questions: Q1:: How many superheroes in the database have a publisher named 'Marvel Comics' and gold eye colour? Q2:: How many Marvel superheroes have gold eyes?
Bird only question rating: good.
LLM only generated questions: Q1:: How many superheroes in the database have a publisher named 'Marvel Comics' and gold eye colour? Q2:: How many Marvel superheroes have gold eyes?
LLM only question rating: good.
Fused generated questions: Q1:: How many superheroes published by 'Marvel Comics' have 'Gold' as their eye color? Q2:: How many Marvel superheroes have gold eyes?
Fused question rating: good.
----------------------------------------------------
Question 744:
SQL:
SELECT SUM(CASE WHEN T2.publisher_name = 'Marvel Comics' THEN 1 ELSE 0 END) - SUM(CASE WHEN T2.publisher_name = 'DC Comics' THEN 1 ELSE 0 END) FROM superhero AS T1 INNER JOIN publisher AS T2 ON T1.publisher_id = T2.id
Supplied question: Between DC and Marvel Comics, which publisher has published more superheroes? Find the difference in the number of superheroes they have published.
Base generated questions: Q1:: What is the difference in the number of superheroes published by Marvel Comics and DC Comics according to the database? Q2:: How many more superheroes does Marvel have compared to DC?
Base question rating: good+. Reason: supplied question asks for name of more prolific publisher, this is not part of the SQL.
Bird only generated questions: Q1:: What is the difference in the number of superheroes published by Marvel Comics and DC Comics according to the database? Q2:: How many more superheroes does Marvel have compared to DC?
Bird only question rating: good+. Reason: supplied question asks for name of more prolific publisher, this is not part of the SQL.
LLM only generated questions: Q1:: What is the difference in the number of superheroes published by Marvel Comics and DC Comics according to the database? Q2:: How many more superheroes does Marvel have compared to DC?
LLM only question rating: good+. Reason: supplied question asks for name of more prolific publisher, this is not part of the SQL.
Fused generated questions: Q1:: How many more superheroes are published by Marvel Comics than by DC Comics according to the database? Q2:: What is the difference in the number of superheroes between Marvel Comics and DC Comics?
Fused question rating: good+. Reason: supplied question asks for name of more prolific publisher, this is not part of the SQL.
----------------------------------------------------
Question 758:
SQL:
SELECT DISTINCT T3.colour FROM superhero AS T1 INNER JOIN race AS T2 ON T1.race_id = T2.id INNER JOIN colour AS T3 ON T1.hair_colour_id = T3.id WHERE T1.height_cm = 185 AND T2.race = 'Human'
Supplied question: Provide the hair colour of the human superhero who is 185 cm tall.
Base generated questions: Q1:: What are the distinct hair colours of human superheroes who are exactly 185 cm tall? Q2:: What hair colours do 185 cm tall human superheroes have?
Base question rating: good+. Reason: question implies singular, query can return zero or more.
Bird only generated questions: Q1:: What are the distinct hair colours of human superheroes who are exactly 185 cm tall? Q2:: What hair colours do 185 cm tall human superheroes have?
Bird only question rating: good+. Reason: question implies singular, query can return zero or more.
LLM only generated questions: Q1:: What are the distinct hair colours of human superheroes who are exactly 185 cm tall? Q2:: What hair colours do 185 cm tall human superheroes have?
LLM only question rating: good+. Reason: question implies singular, query can return zero or more.
Fused generated questions: Q1:: What are the distinct hair colors of human superheroes who are exactly 185 centimeters tall? Q2:: What hair colors do 185 cm tall human superheroes have?
Fused question rating: good+. Reason: question implies singular, query can return zero or more.
----------------------------------------------------
Question 769:
SQL:
SELECT T1.superhero_name FROM superhero AS T1 INNER JOIN hero_attribute AS T2 ON T1.id = T2.hero_id INNER JOIN attribute AS T3 ON T3.id = T2.attribute_id INNER JOIN publisher AS T4 ON T4.id = T1.publisher_id WHERE T4.publisher_name = 'Dark Horse Comics' AND T3.attribute_name = 'Durability' ORDER BY T2.attribute_value DESC LIMIT 1
Supplied question: Which superhero has the most durability published by Dark Horse Comics?
Base generated questions: Q1:: Which superhero from Dark Horse Comics has the highest durability attribute value? Q2:: Who is the most durable superhero from Dark Horse Comics?
Base question rating: good. Reason: poor grammar in the supplied question.
Bird only generated questions: Q1:: Which superhero from Dark Horse Comics has the highest durability attribute value? Q2:: Who is the most durable superhero from Dark Horse Comics?
Bird only question rating: good. Reason: poor grammar in the supplied question.
LLM only generated questions: Q1:: Which superhero from 'Dark Horse Comics' has the highest 'Durability' attribute value? Q2:: Who is the most durable superhero from Dark Horse Comics?
LLM only question rating: good. Reason: poor grammar in the supplied question.
Fused generated questions: Q1:: Which superhero from 'Dark Horse Comics' has the highest 'Durability' attribute value? Q2:: Who is the most durable superhero from Dark Horse Comics?
Fused question rating: good. Reason: poor grammar in the supplied question.
----------------------------------------------------
Question 782:
SQL:
SELECT T1.superhero_name FROM superhero AS T1 INNER JOIN colour AS T2 ON T1.eye_colour_id = T2.id AND T1.hair_colour_id = T2.id WHERE T2.colour = 'Black'
Supplied question: List the heroes' names whose eyes and hair colours are both black.
Base generated questions: Q1:: Which superheroes have both black eye color and black hair color according to the database? Q2:: Who are the superheroes with black eyes and hair?
Base question rating: good.
Bird only generated questions: Q1:: Which superheroes have both black eye color and black hair color according to the database? Q2:: Who are the superheroes with black eyes and hair?
Bird only question rating: good.
LLM only generated questions: Q1:: Which superheroes have both eye and hair colour matching with the colour 'Black' in the database? Q2:: Who are the superheroes with black eyes and hair?
LLM only question rating: good.
Fused generated questions: Q1:: Which superheroes have both their eye color and hair color listed as black in the database? Q2:: Who are the superheroes with black eyes and hair?
Fused question rating: good.
----------------------------------------------------
Question 792:
SQL:
SELECT T3.power_name FROM superhero AS T1 INNER JOIN hero_power AS T2 ON T1.id = T2.hero_id INNER JOIN superpower AS T3 ON T2.power_id = T3.id WHERE T1.superhero_name = 'Abomination'
Supplied question: What is Abomination's superpower?
Base generated questions: Q1:: What are the names of the superpowers associated with the superhero known as 'Abomination' in the database? Q2:: What superpowers does Abomination have?
Base question rating: good.
Bird only generated questions: Q1:: What are the names of the superpowers associated with the superhero known as 'Abomination' in the database? Q2:: What superpowers does Abomination have?
Bird only question rating: good.
LLM only generated questions: Q1:: What are the names of the superpowers associated with the superhero known as 'Abomination' according to the database? Q2:: What superpowers does Abomination have?
LLM only question rating: good.
Fused generated questions: Q1:: What are the names of the superpowers associated with the superhero known as 'Abomination' according to the database? Q2:: What superpowers does 'Abomination' have?
Fused question rating: good.
----------------------------------------------------
Question 801:
SQL:
SELECT CAST(COUNT(CASE WHEN T2.gender = 'Male' THEN T1.id ELSE NULL END) AS REAL) / COUNT(CASE WHEN T2.gender = 'Female' THEN T1.id ELSE NULL END) FROM superhero AS T1 INNER JOIN gender AS T2 ON T1.gender_id = T2.id
Supplied question: Find the ratio between male superheroes and female superheroes.
Base generated questions: Step 1: The SQL code is performing the following operations: 1. It selects data from two tables - `superhero` and `gender` - using an INNER JOIN on the `gender_id` field of the `superhero` table and the `id` field of the `gender` table. This join operation combines rows from both tables based on a related column between them. 2. It then counts the number of rows from the `superhero` table (using `T1.id` as a reference) where the corresponding `gender` from `T2` is 'Male'. This is done using a CASE statement within the COUNT function, which only counts rows where the condition is met (gender = 'Male'). 3. Similarly, it counts the number of rows where the `gender` is 'Female'. 4. After obtaining the counts for both 'Male' and 'Female' genders, the SQL code casts the count of 'Male' to REAL to ensure that the division operation results in a floating-point number instead of an integer. 5. Finally, it divides the count of 'Male' superheroes by the count of 'Female' superheroes to obtain a ratio. Question that can be answered by the given SQL code: What is the ratio of the number of male superheroes to the number of female superheroes in the database? Step 2: Rewritten Question: What is the male to female superhero ratio?
Base question rating: good.
Bird only generated questions: Q1:: What is the ratio of the number of male superheroes to the number of female superheroes in the database, based on the gender information associated with each superhero in the 'superhero' and 'gender' tables? Q2:: What is the male to female superhero ratio?
Bird only question rating: good.
LLM only generated questions: Q1:: What is the ratio of the number of male superheroes to the number of female superheroes in the database, according to their respective gender identification in the gender table? Q2:: What is the male to female superhero ratio?
LLM only question rating: good.
Fused generated questions: Q1:: What is the ratio of the number of male superheroes to the number of female superheroes in the database? Q2:: What is the male to female superhero ratio?
Fused question rating: good.
----------------------------------------------------
Question 829:
SQL:
SELECT SUM(CASE WHEN T2.publisher_name = 'DC Comics' THEN 1 ELSE 0 END) - SUM(CASE WHEN T2.publisher_name = 'Marvel Comics' THEN 1 ELSE 0 END) FROM superhero AS T1 INNER JOIN publisher AS T2 ON T1.publisher_id = T2.id
Supplied question: Which publisher created more superheroes: DC or Marvel Comics? Find the difference in the number of superheroes.
Base generated questions: Step 1: The given SQL code is performing the following operations: 1. It selects records from two tables, `superhero` (aliased as T1) and `publisher` (aliased as T2), where the `superhero` table has a `publisher_id` that references the `id` in the `publisher` table. 2. It performs an INNER JOIN on these two tables using the `publisher_id` from the `superhero` table and the `id` from the `publisher` table to match the superheroes with their respective publishers. 3. For each matched record, it checks the `publisher_name` from the `publisher` table: - If the `publisher_name` is 'DC Comics', it counts this as 1. - If the `publisher_name` is 'Marvel Comics', it counts this as -1 (because it's subtracted). - All other `publisher_name` values do not affect the count. 4. It then sums these counts to get two totals: one for 'DC Comics' and one for 'Marvel Comics'. 5. Finally, it subtracts the total count for 'Marvel Comics' from the total count for 'DC Comics'. The question that can be answered by the given SQL code is: "What is the difference in the number of superheroes between DC Comics and Marvel Comics?" Step 2: "How many more superheroes does DC Comics have compared to Marvel Comics?"
Base question rating: good+. Reason: question asks for identity, but only a difference in counts is computed.
Bird only generated questions: Q1:: How many more superheroes are published by DC Comics compared to Marvel Comics in the database? Q2:: What is the difference in the number of superheroes between DC Comics and Marvel Comics?
Bird only question rating: good+. Reason: question asks for identity, but only a difference in counts is computed.
LLM only generated questions: Q1:: How many more superheroes are published by DC Comics than by Marvel Comics in the database? Q2:: What is the difference in the number of superheroes between DC Comics and Marvel Comics?
LLM only question rating: good+. Reason: question asks for identity, but only a difference in counts is computed.
Fused generated questions: Q1:: How many more superheroes are associated with DC Comics than with Marvel Comics in the database? Q2:: What is the difference in the number of superheroes between DC Comics and Marvel Comics?
Fused question rating: good+. Reason: question asks for identity, but only a difference in counts is computed.
----------------------------------------------------
Question 544:
SQL:
SELECT T2.DisplayName FROM posts AS T1 INNER JOIN users AS T2 ON T1.LastEditorUserId = T2.Id WHERE T1.Title = 'Examples for teaching: Correlation does not mean causation'
Supplied question: What is the display name of the user who last edited the post "Examples for teaching: Correlation does not mean causation"?
Base generated questions: Q1:: Who is the user that last edited the post with the title 'Examples for teaching: Correlation does not mean causation'? Q2:: Who last edited the 'Correlation vs Causation' teaching post?
Base question rating: good.
Bird only generated questions: Q1:: Who is the display name of the last editor who edited the post with the title 'Examples for teaching: Correlation does not mean causation'? Q2:: What is the name of the user who last edited the 'Correlation does not mean causation' teaching post?
Bird only question rating: good.
LLM only generated questions: Q1:: Who is the display name of the user who last edited the post titled "Examples for teaching: Correlation does not mean causation"? Q2:: Who last edited the "Correlation does not mean causation" teaching post?
LLM only question rating: good.
Fused generated questions: Q1:: Who is the display name of the user who last edited the post titled 'Examples for teaching: Correlation does not mean causation'? Q2:: Who last edited 'Correlation does not mean causation'?
Fused question rating: good.
----------------------------------------------------
Question 565:
SQL:
SELECT IIF(T2.ClosedDate IS NULL, 'NOT well-finished', 'well-finished') AS resylt FROM comments AS T1 INNER JOIN posts AS T2 ON T1.PostId = T2.Id WHERE T1.UserId = 23853 AND T1.CreationDate = '2013-07-12 09:08:18.0'
Supplied question: User No.23853 gave a comment to a post at 9:08:18 on 2013/7/12, was that post well-finished?
Base generated questions: Q1:: Was the post associated with the comment made by user ID 23853 on July 12, 2013, at 09:08:18 finished or not at the time of the query? Q2:: Is the post from July 12, 2013, by user 23853 closed?
Base question rating: good.
Bird only generated questions: Q1:: For the user with ID 23853, what is the status (well-finished or not well-finished) of the post associated with the comment they made on July 12, 2013, at 09:08:18.0, based on whether the post has a closed date? Q2:: What is the completion status of the post commented on by user 23853 on July 12, 2013?
Bird only question rating: good.
LLM only generated questions: Q1:: What is the status of completion ('well-finished' or 'NOT well-finished') for the posts associated with comments made by user ID 23853 on the specific date and time of July 12, 2013, at 09:08:18.0? Q2:: What is the completion status of posts commented on by user 23853 on July 12, 2013, at 9:08 AM?
LLM only question rating: good.
Fused generated questions: Q1:: Was the post associated with the comment made by user 23853 on July 12, 2013, at 09:08:18 well-finished or not? Q2:: Is the post linked to the user 23853's comment from July 12, 2013, closed?
Fused question rating: good.
----------------------------------------------------
Question 576:
SQL:
SELECT T1.DisplayName FROM users AS T1 INNER JOIN comments AS T2 ON T1.Id = T2.UserId WHERE T2.Text = 'thank you user93!'
Supplied question: Name the user that commented 'thank you user93!'
Base generated questions: Q1:: Which users have posted comments with the exact text 'thank you user93!'? Q2:: Who commented 'thank you user93!'?
Base question rating: good.
Bird only generated questions: Q1:: Which users have posted comments containing the exact text 'thank you user93!'? Q2:: Who commented 'thank you user93!'?
Bird only question rating: good.
LLM only generated questions: Q1:: What is the display name of the user who made a comment with the text 'thank you user93!'? Q2:: Who commented 'thank you user93!'?
LLM only question rating: good.
Fused generated questions: Q1:: Which users have posted a comment with the exact text 'thank you user93!'? Q2:: Who said 'thank you user93!'?
Fused question rating: good.
----------------------------------------------------
Question 592:
SQL:
SELECT COUNT(UserId) FROM ( SELECT UserId, COUNT(Name) AS num FROM badges GROUP BY UserId ) T WHERE T.num > 5
Supplied question: How many users are awarded with more than 5 badges?
Base generated questions: Q1:: How many users have been awarded more than 5 different types of badges? Q2:: How many users earned over 5 badge types?
Base question rating: good.
Bird only generated questions: Q1:: How many users have earned more than 5 different types of badges? Q2:: How many users received over 5 badges?
Bird only question rating: good.
LLM only generated questions: Q1:: How many users have been awarded more than 5 different badges? Q2:: How many users earned over 5 badges?
LLM only question rating: good.
Fused generated questions: Q1:: How many users have obtained more than five different badges? Q2:: How many users earned over five badges?
Fused question rating: good.
----------------------------------------------------
Question 634:
SQL:
SELECT T1.DisplayName FROM users AS T1 INNER JOIN postHistory AS T2 ON T1.Id = T2.UserId INNER JOIN posts AS T3 ON T2.PostId = T3.Id WHERE T1.DisplayName = 'Harvey Motulsky' OR T1.DisplayName = 'Noah Snyder' GROUP BY T1.DisplayName ORDER BY SUM(T3.ViewCount) DESC LIMIT 1
Supplied question: Among posts by Harvey Motulsky and Noah Snyder, which one has higher popularity?
Base generated questions: Q1:: Which user, either Harvey Motulsky or Noah Snyder, has the highest cumulative view count across all their posts? Q2:: Who has more post views, Harvey Motulsky or Noah Snyder?
Base question rating: good.
Bird only generated questions: Q1:: Which user, either Harvey Motulsky or Noah Snyder, has the highest total view count for all their posts combined in the database? Q2:: Who has more post views, Harvey Motulsky or Noah Snyder?
Bird only question rating: good.
LLM only generated questions: Q1:: Which user, Harvey Motulsky or Noah Snyder, has a display name associated with posts that have the highest total view count? Q2:: Who has more post views, Harvey Motulsky or Noah Snyder?
LLM only question rating: good.
Fused generated questions: Q1:: Which user, either Harvey Motulsky or Noah Snyder, has the highest total view count for all their posts? Q2:: Who has more popular posts, Harvey Motulsky or Noah Snyder?
Fused question rating: good.
----------------------------------------------------
Question 671:
SQL:
SELECT T1.DisplayName FROM users AS T1 INNER JOIN badges AS T2 ON T1.Id = T2.UserId WHERE T2.`Name` = 'Autobiographer' ORDER BY T2.Date LIMIT 1
Supplied question: What is the display name of the user who acquired the first Autobiographer badge?
Base generated questions: Q1:: Which user was the first to receive the 'Autobiographer' badge according to the records in the database? Q2:: Who first earned the 'Autobiographer' badge?
Base question rating: good.
Bird only generated questions: Q1:: Which user was the first to receive the 'Autobiographer' badge according to the recorded data in the system? Q2:: Who first earned the 'Autobiographer' badge?
Bird only question rating: good.
LLM only generated questions: Q1:: Which user was the first to receive the 'Autobiographer' badge according to the recorded dates in the database? Q2:: Who first earned the 'Autobiographer' badge?
LLM only question rating: good.
Fused generated questions: Q1:: Who was the first user to receive the 'Autobiographer' badge according to the recorded dates in the database? Q2:: Who first earned the 'Autobiographer' badge?
Fused question rating: good.
----------------------------------------------------
Question 687:
SQL:
SELECT COUNT(T2.Id) FROM posts AS T1 INNER JOIN comments AS T2 ON T1.Id = T2.PostId GROUP BY T1.Id ORDER BY T1.Score DESC LIMIT 1
Supplied question: How many comments were added to the post with the highest score?
Base generated questions: Q1:: What is the highest number of comments associated with a single post from the 'posts' table, considering only the post with the highest score? Q2:: What's the maximum comment count for the top-scoring post?
Base question rating: good.
Bird only generated questions: Step 1: The given SQL code is performing the following operations: 1. It is joining two tables: 'posts' (aliased as T1) and 'comments' (aliased as T2), on the condition that the 'Id' field in the 'posts' table matches the 'PostId' field in the 'comments' table. 2. It counts the number of comments (T2.Id) associated with each post. 3. It groups the results by the post ID (T1.Id), meaning it calculates the count of comments for each individual post. 4. It orders the grouped results by the 'Score' field of the posts table (T1) in descending order, meaning the post with the highest score will be first. 5. It limits the output to only the first row of the sorted result set. Question that can be answered by the given SQL code: "What is the number of comments on the post with the highest score?" Step 2: Rewritten question in natural language and in fewer words: "How many comments does the top-scoring post have?"
Bird only question rating: good.
LLM only generated questions: Q1:: What is the highest number of comments associated with a single post in the database, considering only the post with the highest score? Q2:: How many comments does the top-scoring post have?
LLM only question rating: good.
Fused generated questions: Q1:: What is the number of comments for the post with the highest score? Q2:: How many comments does the top scored post have?
Fused question rating: good.
----------------------------------------------------
Question 710:
SQL:
SELECT COUNT(T1.id) FROM comments AS T1 INNER JOIN posts AS T2 ON T1.PostId = T2.Id WHERE T2.CommentCount = 1 AND T2.Score = 0
Supplied question: In posts with 1 comment, how many of the comments have 0 score?
Base generated questions: Q1:: How many comments are there on posts that have exactly one comment and a score of zero? Q2:: What is the count of comments on single-comment, zero-score posts?
Base question rating: good.
Bird only generated questions: Q1:: How many comments are there in the database where the associated post has exactly one comment and a score of zero? Q2:: How many comments exist on single-comment, zero-score posts?
Bird only question rating: good.
LLM only generated questions: Q1:: How many comments are associated with posts that have exactly one comment and a score of zero? Q2:: How many comments are on posts with a single comment and no score?
LLM only question rating: good.
Fused generated questions: Q1:: How many comments are there where the associated post has exactly one comment and a score of zero? Q2:: How many comments are linked to posts with only one comment and no upvotes?
Fused question rating: good. Reason: 'no upvotes' is more in line with field metadata.
----------------------------------------------------
Question 346:
SQL:
SELECT T1.id, T1.artist FROM cards AS T1 INNER JOIN legalities AS T2 ON T1.uuid = T2.uuid WHERE T2.status = 'Legal' AND T2.format = 'commander' AND (T1.power IS NULL OR T1.power = '*')
Supplied question: List all the card id and artist with unknown power which are legal for commander play format.
Base generated questions: Q1:: Which cards have a legal status in the Commander format and either do not have a defined power or have a power denoted by an asterisk (*), and what are the names of the artists associated with these cards? Q2:: What are the artist names for Commander-legal cards with undefined or variable power?
Base question rating: good-. Reason: not picking up on * meaning unknown.
Bird only generated questions: Q1:: Which cards, identified by their ID and artist, are legal in the Commander format and either have an undefined power or a power denoted by an asterisk (*)? Q2:: What are the IDs and artists of Commander-legal cards with no specific power?
Bird only question rating: good.
LLM only generated questions: Q1:: Which cards, identified by their ID and artist, have a status of 'Legal' in the 'commander' format and either have a null power value or a power value of '*'? Q2:: What are the IDs and artists of 'commander' legal cards with no power or a power of '*'?
LLM only question rating: good-. Reason: not picking up on * meaning unknown.
Fused generated questions: Q1:: Which cards, identified by their unique ID number, have an artist associated with them and are legal in the Commander format, but either do not have a defined power level or have a power level marked as asterisk (*)? Q2:: What are the IDs and artists of cards with no specific power that are legal in Commander?
Fused question rating: good.
----------------------------------------------------
Question 366:
SQL:
SELECT T2.format FROM cards AS T1 INNER JOIN legalities AS T2 ON T1.uuid = T2.uuid WHERE T1.name = 'Benalish Knight'
Supplied question: What is the rule of playing card "Benalish Knight"?
Base generated questions: Q1:: What are the formats in which the card named 'Benalish Knight' is legal according to the 'legalities' table? Q2:: In which formats is 'Benalish Knight' a legal card?
Base question rating: good-. Reason: not picking up on format meaning rule, but it's in the metadata.
Bird only generated questions: Q1:: Which formats are legal for the card named 'Benalish Knight' according to the database? Q2:: In what formats can 'Benalish Knight' be played?
Bird only question rating: good-. Reason: not picking up on format meaning rule, but it's in the metadata.
LLM only generated questions: Q1:: What are the formats in which the card named 'Benalish Knight' is legal to play according to the database, based on the information in the 'legalities' table linked to the 'cards' table through the 'uuid'? Q2:: In which game formats is 'Benalish Knight' a legal card?
LLM only question rating: good-. Reason: not picking up on format meaning rule, but it's in the metadata.
Fused generated questions: Q1:: What are the formats of play in which the card named 'Benalish Knight' is legal according to the legalities table? Q2:: In which formats can you play 'Benalish Knight'?
Fused question rating: good-. Reason: not picking up on format meaning rule, but it's in the metadata.
----------------------------------------------------
Question 391:
SQL:
SELECT DISTINCT T1.name FROM cards AS T1 INNER JOIN foreign_data AS T2 ON T1.uuid = T2.uuid WHERE T1.originalType = 'Artifact' AND T1.colors = 'B'
Supplied question: Among the Artifact cards, which are black color and comes with foreign languague translation?
Base generated questions: Q1:: What are the names of the distinct cards that are classified as 'Artifact' in their original type and have the color 'B', which also have corresponding entries in the foreign_data table? Q2:: What are the unique names of black artifact cards listed in both the cards and foreign_data tables?
Base question rating: good. Reason: not picking up on foreign_data meaning foreign language, but there is no implication that there is a translation.
Bird only generated questions: Q1:: What are the names of distinct cards that are classified as 'Artifact' in their original type and have the color 'B' (Black), which also have corresponding entries in the foreign_data table? Q2:: List the unique names of black Artifact cards with foreign data.
Bird only question rating: good. Reason: not picking up on foreign_data meaning foreign language, but there is no implication that there is a translation.
LLM only generated questions: Q1:: What are the names of distinct cards that are classified as 'Artifact' in their original type and have 'B' as their color, according to the data in the 'cards' and 'foreign_data' tables where the 'uuid' fields match? Q2:: What are the unique names of black artifact cards?
LLM only question rating: good. Reason: not picking up on foreign_data meaning foreign language, but there is no implication that there is a translation.
Fused generated questions: Q1:: What are the names of all distinct 'Artifact' cards with black color ('B') that have corresponding foreign data entries? Q2:: List the unique black artifact cards with foreign entries.
Fused question rating: good. Reason: not picking up on foreign_data meaning foreign language, but there is no implication that there is a translation.
----------------------------------------------------
Question 409:
SQL:
SELECT COUNT(T1.id) FROM cards AS T1 INNER JOIN legalities AS T2 ON T1.uuid = T2.uuid INNER JOIN rulings AS T3 ON T1.uuid = T3.uuid WHERE T2.format = 'premodern' AND T3.text = 'This is a triggered mana ability.' AND T1.Side IS NULL
Supplied question: Indicates the number of cards with pre-modern format, ruling text "This is a triggered mana ability." that do not have multiple faces.
Base generated questions: Q1:: How many cards in the 'premodern' format do not have a specified side and have a ruling text that states 'This is a triggered mana ability.'? Q2:: How many 'premodern' cards with no side have a ruling about being a triggered mana ability?
Base question rating: good+. Reason: poor grammar in the supplied question.
Bird only generated questions: Q1:: How many cards that have no side specified are there in the 'premodern' format with rulings that state 'This is a triggered mana ability'? Q2:: How many 'premodern' cards without a side have a ruling mentioning a triggered mana ability?
Bird only question rating: good+. Reason: poor grammar in the supplied question.
LLM only generated questions: Q1:: How many cards from the 'cards' table, that are involved in the 'premodern' format as per the 'legalities' table and have a ruling stating 'This is a triggered mana ability' according to the 'rulings' table, do not have a specified side? Q2:: How many premodern cards with no specified side have a ruling that describes a triggered mana ability?
LLM only question rating: good+. Reason: poor grammar in the supplied question.
Fused generated questions: Q1:: How many cards that do not have multiple faces are associated with the 'premodern' format and have a ruling text that states 'This is a triggered mana ability'? Q2:: How many single-faced 'premodern' cards have a ruling about a triggered mana ability?
Fused question rating: good+. Reason: poor grammar in the supplied question.
----------------------------------------------------
Question 424:
SQL:
SELECT CAST(SUM(CASE WHEN isTextless = 1 AND layout = 'normal' THEN 1 ELSE 0 END) AS REAL) * 100 / COUNT(*) FROM cards
Supplied question: What proportion of cards do not have a text box with a normal layout?
Base generated questions: Q1:: What is the percentage of 'normal' layout cards without text out of the total number of cards in the 'cards' table? Q2:: What percent of all cards have a normal layout and no text?
Base question rating: good.
Bird only generated questions: Q1:: What is the percentage of 'cards' that have the 'isTextless' attribute set to 1 and the 'layout' attribute set to 'normal' out of all the cards in the database? Q2:: What percentage of all cards are textless with a normal layout?
Bird only question rating: good.
LLM only generated questions: Q1:: What is the percentage of cards that are both textless (isTextless = 1) and have a 'normal' layout out of the total number of cards in the database? Q2:: What percent of the cards are textless with a normal layout?
LLM only question rating: good.
Fused generated questions: Q1:: What is the percentage of cards in the 'cards' table that are textless (isTextless = 1) and have a 'normal' layout? Q2:: What percent of normal layout cards are textless?
Fused question rating: good.
----------------------------------------------------
Question 466:
SQL:
SELECT COUNT(T1.id) FROM cards AS T1 INNER JOIN set_translations AS T2 ON T2.setCode = T1.setCode WHERE T2.translation = 'Hauptset Zehnte Edition' AND T1.artist = 'Adam Rex'
Supplied question: Among the cards in the set "Hauptset Zehnte Edition", how many of them are designed by Adam Rex?
Base generated questions: Q1:: How many cards illustrated by Adam Rex are there in the set with the translation name 'Hauptset Zehnte Edition'? Q2:: How many 'Hauptset Zehnte Edition' cards did Adam Rex illustrate?
Base question rating: good. Reason: card_games.cards.artist refers to the illustrator, not the designer.
Bird only generated questions: Q1:: How many cards illustrated by Adam Rex are part of the set whose translation is 'Hauptset Zehnte Edition'? Q2:: How many 'Hauptset Zehnte Edition' cards did Adam Rex illustrate?
Bird only question rating: good. Reason: card_games.cards.artist refers to the illustrator, not the designer.
LLM only generated questions: Q1:: How many cards illustrated by Adam Rex are there in the set with the translation 'Hauptset Zehnte Edition'? Q2:: How many 'Hauptset Zehnte Edition' cards did Adam Rex illustrate?
LLM only question rating: good. Reason: card_games.cards.artist refers to the illustrator, not the designer.
Fused generated questions: Q1:: How many cards illustrated by Adam Rex are there in the database that belong to the set with the translation 'Hauptset Zehnte Edition'? Q2:: How many 'Hauptset Zehnte Edition' cards were illustrated by Adam Rex?
Fused question rating: good. Reason: card_games.cards.artist refers to the illustrator, not the designer.
----------------------------------------------------
Question 477:
LLM only question rating: good+. Reason: supplied question does not match the SQL.
Fused generated questions: Q1:: What percentage of cards from the 'Coldsnap' set have both a unique identifier for the standard version and a unique identifier for the foil version as assigned by Card Kingdom? Q2:: What is the percentage of 'Coldsnap' cards with identifiers for both standard and foil versions from Card Kingdom?
Fused question rating: good+. Reason: supplied question does not match the SQL.
----------------------------------------------------
Question 195:
SQL:
SELECT T.bond_type FROM ( SELECT bond_type, COUNT(bond_id) FROM bond GROUP BY bond_type ORDER BY COUNT(bond_id) DESC LIMIT 1 ) AS T
Supplied question: What is the most common bond type?
Base generated questions: Step 1: The given SQL code is performing the following steps: 1. It is selecting two columns from the 'bond' table: 'bond_type' and the count of 'bond_id'. 2. It groups the results by 'bond_type' so that it can calculate the count of 'bond_id' for each 'bond_type'. 3. It orders the grouped results in descending order by the count of 'bond_id' to find out which 'bond_type' has the highest number of bonds. 4. It limits the results to only the top row, which effectively means it selects the 'bond_type' with the highest count of bonds. 5. It then creates a subquery (alias as T) and from this subquery, it selects the 'bond_type'. Question that can be answered by the given SQL code: What is the type of bond that has the highest number of entries in the bond table? Step 2: What is the most common bond type?
Base question rating: good.
Bird only generated questions: Q1:: What is the most common type of bond in the bond table? Q2:: Which bond type occurs most frequently?
Bird only question rating: good.
LLM only generated questions: Q1:: What is the most common type of bond in the bond table, based on the highest occurrence of bond IDs? Q2:: What is the most frequent bond type?
LLM only question rating: good.
Fused generated questions: Q1:: What is the most common type of chemical bond present in the molecules recorded in the bond table? Q2:: Which bond type occurs most frequently?
Fused question rating: good.
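The subquery wrapper in the Question 195 SQL is not load-bearing; the step-by-step narration in the base generated questions describes what the inner query already computes on its own. A minimal equivalent form over the same bond table (SQLite syntax, shown only to make the narrated steps concrete):

-- Most frequent bond_type: group the rows, sort the groups by size, keep the top one.
SELECT bond_type FROM bond GROUP BY bond_type ORDER BY COUNT(bond_id) DESC LIMIT 1;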
----------------------------------------------------
Question 207:
SQL:
SELECT DISTINCT T1.element FROM atom AS T1 INNER JOIN bond AS T2 ON T1.molecule_id = T2.molecule_id INNER JOIN connected AS T3 ON T1.atom_id = T3.atom_id WHERE T2.bond_type = '='
Supplied question: What elements are in a double type bond?
Base generated questions: Q1:: What are the distinct elements that are part of molecules where at least one bond is a double bond (represented by '=')? Q2:: Which unique elements are in molecules with double bonds?
Base question rating: good.
Bird only generated questions: Q1:: What are the distinct elements in the 'atom' table that are part of molecules where at least one bond is a double bond ('=') and these atoms are also present in the 'connected' table? Q2:: Which elements form double bonds in connected molecules?
Bird only question rating: good.
LLM only generated questions: Q1:: What are the distinct elements from the 'atom' table that are part of molecules where there is at least one double bond ('=') present, according to the 'bond' table, and have a corresponding entry in the 'connected' table? Q2:: Which elements are in molecules with double bonds?
LLM only question rating: good.
Fused generated questions: Q1:: Which chemical elements are part of molecules that have at least one double bond? Q2:: What elements are in molecules with double bonds?
Fused question rating: good.
----------------------------------------------------
Question 219:
SQL:
SELECT CAST(COUNT(DISTINCT CASE WHEN T2.label = '+' THEN T2.molecule_id ELSE NULL END) AS REAL) * 100 / COUNT(DISTINCT T2.molecule_id) FROM atom AS T1 INNER JOIN molecule AS T2 ON T1.molecule_id = T2.molecule_id INNER JOIN bond AS T3 ON T2.molecule_id = T3.molecule_id WHERE T3.bond_type = '#'
Supplied question: What is the percentage of carcinogenic molecules in triple type bonds?
Base generated questions: Q1:: What is the percentage of distinct molecules labeled with a '+' that have at least one triple bond, out of the total number of distinct molecules that have at least one triple bond? Q2:: What percentage of molecules with a triple bond are positively labeled?
Base question rating: good-. Reason: not picking up on meaning of label.
Bird only generated questions: Q1:: What is the percentage of distinct molecules with a positive label ('+') among all distinct molecules that have at least one atom involved in a bond with a bond type of '#'? Q2:: What percentage of molecules with a triple bond are positively labeled?
Bird only question rating: good-. Reason: not picking up on meaning of label.
LLM only generated questions: Q1:: What is the percentage of molecules that have a label of '+' among all the molecules that contain at least one bond of type '#'? Q2:: What percentage of molecules with a '#' bond type are labeled with '+'?
LLM only question rating: bad+. Reason: not picking up on the meanings of the literals.
Fused generated questions: Q1:: What is the percentage of unique molecules that contain a triple bond and are carcinogenic relative to the total number of unique molecules that contain a triple bond? Q2:: What percentage of molecules with a triple bond are carcinogenic?
Fused question rating: good.
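Several of these queries, Question 219 above included, compute a conditional percentage with the same CAST-then-divide pattern; the CAST to REAL is what keeps SQLite's integer division from truncating the fractional part of the result. A minimal sketch of the idiom, using a hypothetical one-column table t(flag):

-- Percentage of rows with flag = 1; the CAST to REAL forces floating-point division.
SELECT CAST(SUM(CASE WHEN flag = 1 THEN 1 ELSE 0 END) AS REAL) * 100 / COUNT(*) FROM t;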
----------------------------------------------------
Question 231:
SQL:
SELECT T.bond_type FROM ( SELECT T1.bond_type, COUNT(T1.molecule_id) FROM bond AS T1 WHERE T1.molecule_id = 'TR010' GROUP BY T1.bond_type ORDER BY COUNT(T1.molecule_id) DESC LIMIT 1 ) AS T
Supplied question: Which bond type accounted for the majority of the bonds found in molecule TR010 and state whether or not this molecule is carcinogenic?
Base generated questions: Step 1: What is the most common type of bond found in the molecule with the ID 'TR010' in the bond table? Step 2: What is the most frequent bond type in molecule 'TR010'?
Base question rating: good+. Reason: supplied question does not match the SQL.
Bird only generated questions: Step 1: What is the most common type of bond in the molecule with the ID 'TR010' according to the bond records in the database? Step 2: What is the most frequent bond type in molecule 'TR010'?
Bird only question rating: good+. Reason: supplied question does not match the SQL.
LLM only generated questions: Q1:: What is the most common type of bond in molecule 'TR010' according to the records in the bond table? Q2:: What is the most frequent bond type for molecule 'TR010'?
LLM only question rating: good+. Reason: supplied question does not match the SQL.
Fused generated questions: Q1:: What is the most common type of chemical bond in the molecule with the identifier 'TR010'? Q2:: What is the predominant bond type in molecule 'TR010'?
Fused question rating: good+. Reason: supplied question does not match the SQL.
----------------------------------------------------
Question 242:
SQL:
SELECT DISTINCT T2.molecule_id FROM atom AS T1 INNER JOIN molecule AS T2 ON T1.molecule_id = T2.molecule_id WHERE SUBSTR(T1.atom_id, -2) BETWEEN '21' AND '25' AND T2.label = '+'
Supplied question: Among all the atoms from 21 to 25, list all the molecules that are carcinogenic.
Base generated questions: Q1:: Which molecule IDs are associated with atoms whose IDs end in a number between 21 and 25, and also have a molecule label of '+'? Q2:: What are the IDs of molecules with a '+' label and containing atoms ending in 21 to 25?
Base question rating: good-. Reason: not picking up on meaning of label.
Bird only generated questions: Q1:: Which molecule IDs are associated with atoms whose IDs end in a number between 21 and 25 and are also part of molecules labeled with a positive charge? Q2:: What are the IDs of positively charged molecules with atom IDs ending in 21 to 25?
Bird only question rating: good-. Reason: not picking up on meaning of label.
LLM only generated questions: Q1:: Which molecule IDs are associated with atoms whose IDs end in a number between 21 and 25, and also have a molecule label of '+'? Q2:: What are the IDs of molecules with '+' label and atoms ending in 21 to 25?
LLM only question rating: good-. Reason: not picking up on meaning of label.
Fused generated questions: Q1:: Which unique identifiers (molecule_id) belong to molecules that are carcinogenic and have at least one atom whose identifier ends with a number between 21 and 25? Q2:: What are the IDs of carcinogenic molecules with atoms numbered 21 to 25?
Fused question rating: good.
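In the Question 242 SQL, SUBSTR(T1.atom_id, -2) takes the last two characters of the atom identifier, and the BETWEEN comparison runs over strings, not numbers. A minimal illustration of both behaviors (SQLite; the literals are hypothetical):

-- A negative start offset counts back from the end of the string.
SELECT SUBSTR('TR000_21', -2);     -- '21'
-- Text BETWEEN compares lexicographically; it agrees with numeric order here because both bounds are two digits.
SELECT '23' BETWEEN '21' AND '25'; -- 1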
----------------------------------------------------
Question 249:
SQL:
SELECT T2.element FROM connected AS T1 INNER JOIN atom AS T2 ON T1.atom_id = T2.atom_id WHERE T1.bond_id = 'TR144_8_19'
Supplied question: What are the elements of the atoms of TR144_8_19?
Base generated questions: Q1:: What are the names of the elements connected by the bond with the ID 'TR144_8_19' in the database? Q2:: Which elements are bonded by 'TR144_8_19'?
Base question rating: good. Reason: supplied question is vague.
Bird only generated questions: Q1:: What are the elements associated with the atom entries in the 'atom' table that are connected through the bond with ID 'TR144_8_19' in the 'connected' table? Q2:: Which elements are bonded by 'TR144_8_19'?
Bird only question rating: good. Reason: supplied question is vague.
LLM only generated questions: Q1:: What are the elements associated with the atom_ids that are connected through the bond_id 'TR144_8_19'? Q2:: Which elements are connected by bond 'TR144_8_19'?
LLM only question rating: good. Reason: supplied question is vague.
Fused generated questions: Q1:: What are the chemical element symbols of the atoms that are connected by the bond with the identifier 'TR144_8_19'? Q2:: Which elements are bonded in 'TR144_8_19'?
Fused question rating: good. Reason: supplied question is vague.
----------------------------------------------------
Question 273:
SQL:
SELECT CAST(COUNT( CASE WHEN T1.element = 'cl' THEN T1.element ELSE NULL END) AS REAL) * 100 / COUNT(T1.element) FROM atom AS T1 INNER JOIN molecule AS T2 ON T1.molecule_id = T2.molecule_id WHERE T2.label = '+'
Supplied question: What is the percentage of element chlorine in carcinogenic molecules?
Base generated questions: Q1:: What is the percentage of 'cl' elements within all the elements associated with molecules that have a label of '+'? Q2:: What percentage of elements are chlorine in positively labeled molecules?
Base question rating: good-. Reason: not picking up on meaning of label.
Bird only generated questions: Q1:: What is the percentage of 'cl' elements within the atoms associated with molecules that have a label of '+'? Q2:: What percentage of atoms in positively labeled molecules are chlorine?
Bird only question rating: good-. Reason: not picking up on meaning of label.
LLM only generated questions: Q1:: What is the percentage of atoms labeled as 'cl' in molecules with a label of '+'? Q2:: What percentage of atoms are chlorine in positively labeled molecules?
LLM only question rating: good-. Reason: not picking up on meaning of label.
Fused generated questions: Q1:: What is the percentage of chlorine atoms among all atoms in carcinogenic molecules? Q2:: What percent of atoms are chlorine in carcinogenic compounds?
Fused question rating: good.
----------------------------------------------------
Question 12:
SQL:
SELECT MAX(CAST(T1.`Free Meal Count (Ages 5-17)` AS REAL) / T1.`Enrollment (Ages 5-17)`) FROM frpm AS T1 INNER JOIN satscores AS T2 ON T1.CDSCode = T2.cds WHERE CAST(T2.NumGE1500 AS REAL) / T2.NumTstTakr > 0.3
Supplied question: Among the schools with an SAT excellence rate of over 0.3, what is the highest eligible free rate for students aged 5-17?
Base generated questions: Q1:: What is the highest ratio of free meals provided to students aged 5-17 to the total enrollment of students aged 5-17 in schools where more than 30% of students taking the SAT scored 1500 or higher? Q2:: What is the maximum percentage of students aged 5-17 receiving free meals in schools with over 30% of SAT scores above 1500?
Base question rating: good-. Reason: Not picking up on sat>=1500 meaning excellent.
Bird only generated questions: Q1:: What is the highest ratio of free meals provided to students aged 5-17 to the total enrollment of students aged 5-17 in schools where more than 30% of students taking the SAT scored 1500 or higher? Q2:: What is the maximum percentage of students aged 5-17 receiving free meals in schools with over 30% of SAT scores above 1500?
Bird only question rating: good-. Reason: Not picking up on sat>=1500 meaning excellent.
LLM only generated questions: Q1:: What is the highest ratio of students aged 5-17 receiving free meals to the total enrollment of students aged 5-17 in schools where more than 30% of test takers scored above 1500 on the SAT? Q2:: What is the maximum proportion of students aged 5-17 on free meals in schools where over 30% of SAT takers score above 1500?
LLM only question rating: good-. Reason: Not picking up on sat>=1500 meaning excellent.
Fused generated questions: Q1:: What is the highest ratio of students eligible for free meals to total enrollment for schools where more than 30% of SAT test takers scored 1500 or higher? Q2:: What is the maximum free meal eligibility rate among schools with over 30% SAT excellence?
Fused question rating: good.
----------------------------------------------------
Question 27:
SQL:
SELECT T2.School, T1.AvgScrWrite, T2.Phone FROM schools AS T2 LEFT JOIN satscores AS T1 ON T2.CDSCode = T1.cds WHERE strftime('%Y', T2.OpenDate) > '1991' OR strftime('%Y', T2.ClosedDate) < '2000'
Supplied question: What is the average score in writing for the schools that were opened after 1991 or closed before 2000? List the school names along with the score. Also, list the communication number of the schools if there is any.
Base generated questions: Q1:: Which schools that opened after 1991 or closed before 2000 have associated average writing SAT scores and phone numbers according to the schools and satscores tables? Q2:: What are the phone numbers and average SAT writing scores for schools opened post-1991 or closed pre-2000?
Base question rating: good-. Reason: Not asking for school name.
Bird only generated questions: Q1:: Which schools that opened after 1991 or closed before 2000 have available average writing SAT scores and contact phone numbers? Q2:: What are the SAT writing scores and phone numbers for schools opened post-1991 or closed pre-2000?
Bird only question rating: good-. Reason: Not asking for school name.
LLM only generated questions: Q1:: Which schools that opened after 1991 or closed before 2000 have associated average SAT writing scores and contact phone numbers? Q2:: What are the names, average SAT writing scores, and phone numbers of schools opened post-1991 or closed pre-2000?
LLM only question rating: good.
Fused generated questions: Q1:: What are the names, average SAT writing scores, and contact phone numbers of schools that were opened after 1991 or closed before 2000? Q2:: List the schools with their SAT writing scores and phone numbers, which opened post-1991 or closed pre-2000.
Fused question rating: good.
----------------------------------------------------
Question 39:
SQL:
SELECT AVG(T1.NumTstTakr) FROM satscores AS T1 INNER JOIN schools AS T2 ON T1.cds = T2.CDSCode WHERE strftime('%Y', T2.OpenDate) = '1980' AND T2.County = 'Fresno'
Supplied question: What is the average number of test takers from Fresno schools that opened between 1/1/1980 and 12/31/1980?
Base generated questions: Q1:: What is the average number of test takers for the SAT from schools in Fresno County that were opened in the year 1980? Q2:: What's the average SAT participation for schools opened in 1980 in Fresno County?
Base question rating: good. Reason: supplied question is vague.
Bird only generated questions: Q1:: What is the average number of test takers for the SAT from schools in Fresno County that were opened in the year 1980? Q2:: What is Fresno's 1980 school average SAT test takers?
Bird only question rating: good. Reason: supplied question is vague.
LLM only generated questions: Q1:: What is the average number of test takers for the SAT from schools in Fresno County that were opened in the year 1980? Q2:: What's the average SAT participation for Fresno schools founded in 1980?
LLM only question rating: good. Reason: supplied question is vague.
Fused generated questions: Q1:: What is the average number of students who took the SAT at schools in Fresno County that were opened in the year 1980? Q2:: What's the average SAT participation at Fresno schools started in 1980?
Fused question rating: good. Reason: supplied question is vague.
----------------------------------------------------
Question 48:
SQL:
SELECT CAST(SUM(CASE WHEN DOC = 54 THEN 1 ELSE 0 END) AS REAL) / SUM(CASE WHEN DOC = 52 THEN 1 ELSE 0 END) FROM schools WHERE StatusType = 'Merged' AND County = 'Orange'
Supplied question: What is the ratio of merged Unified School District schools in Orange County to merged Elementary School District schools?
Base generated questions: Q1:: What is the ratio of the number of schools with DOC code 54 to the number of schools with DOC code 52 in Orange County where the StatusType is 'Merged'? Q2:: What is the ratio of type 54 to type 52 merged schools in Orange County?
Base question rating: good-. Reason: not picking up on school district code.
Bird only generated questions: Q1:: What is the ratio of the number of schools with a DOC code of 54 to the number of schools with a DOC code of 52 in Orange County that have a status type of 'Merged'? Q2:: What is the 'Merged' schools ratio of DOC 54 to DOC 52 in Orange County?
Bird only question rating: good-. Reason: not picking up on school district code.
LLM only generated questions: Q1:: What is the ratio of the number of schools with document type 54 to the number of schools with document type 52 in Orange County among those that have a status type of 'Merged'? Q2:: What is the ratio of type 54 to type 52 merged schools in Orange County?
LLM only question rating: good-. Reason: not picking up on school district code.
Fused generated questions: Q1:: What is the ratio of the number of merged unified school districts to the number of merged elementary school districts in Orange County? Q2:: What's the ratio of merged unified to elementary districts in Orange County?
Fused question rating: good.
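The good- ratings above fault the generated questions for echoing the bare DOC codes. Pairing the Question 48 SQL with its supplied question implies that DOC 54 denotes a Unified School District and DOC 52 an Elementary School District; a sketch of how the query reads once the codes are labeled (the label text is inferred from that pairing, not from the schema):

-- Label the two district codes and count merged Orange County schools of each kind.
SELECT CASE DOC WHEN 54 THEN 'Unified School District'
                WHEN 52 THEN 'Elementary School District' END AS district_type,
       COUNT(*) AS merged_schools
FROM schools
WHERE StatusType = 'Merged' AND County = 'Orange'
GROUP BY district_type;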
----------------------------------------------------
Question 82:
SQL:
SELECT GSoffered FROM schools ORDER BY ABS(longitude) DESC LIMIT 1
Supplied question: What is the grade span offered in the school with the highest longitude?
Base generated questions: Q1:: Which school offers a Graduate School program that is located furthest east or west from the Prime Meridian, based on the absolute value of its longitude? Q2:: What is the Graduate School offering of the school located at the extreme longitude?
Base question rating: bad+. Reason: not picking up on the meaning of GSoffered.
Bird only generated questions: Q1:: What is the grade span offered by the school that is located the farthest east or west from the Prime Meridian? Q2:: Which school farthest from the Prime Meridian offers what grade levels?
Bird only question rating: good.
LLM only generated questions: Q1:: Which school offers a graduate program and is located at the point with the greatest absolute value of longitude? Q2:: What is the graduate program at the school furthest east or west?
LLM only question rating: bad+. Reason: not picking up on meaning of GSoffered.
Fused generated questions: Q1:: What is the grade span offered by the school located furthest east or west from the prime meridian? Q2:: Which school has the widest range of grades and is furthest from the prime meridian?
Fused question rating: good. Reason: awkward but accurate.
----------------------------------------------------
Question 93:
SQL:
SELECT COUNT(T1.client_id) FROM client AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id WHERE T1.gender = 'M' AND T2.A3 = 'north Bohemia' AND T2.A11 > 8000
Supplied question: How many male customers who are living in North Bohemia have average salary greater than 8000?
Base generated questions: Q1:: How many male clients are there who live in a district classified as 'north Bohemia' where the district's attribute A11 (which could represent a specific demographic or economic feature) has a value greater than 8000? Q2:: How many men from the North Bohemia district with an A11 value over 8000 are clients?
Base question rating: good-. Reason: not picking up on meaning of A11.
Bird only generated questions: Q1:: How many male clients are there in the 'north Bohemia' region of the 'district' table where the socio-economic indicator 'A11' is greater than 8000? Q2:: How many men from North Bohemia districts with a high socio-economic status are clients?
Bird only question rating: good-. Reason: not picking up on meaning of A11.
LLM only generated questions: Q1:: How many male clients are there in the 'north Bohemia' region where the population of the district is greater than 8000? Q2:: How many men from districts in north Bohemia with populations over 8000 are clients?
LLM only question rating: bad. Reason: misses the meaning.
Fused generated questions: Q1:: How many male clients are associated with branches located in the 'north Bohemia' region where the average salary of the district is greater than 8000? Q2:: How many men are clients in 'north Bohemia' with district salaries over 8000?
Fused question rating: good.
----------------------------------------------------
Question 112:
SQL:
SELECT T1.A2 FROM district AS T1 INNER JOIN client AS T2 ON T1.district_id = T2.district_id WHERE T2.birth_date = '1976-01-29' AND T2.gender = 'F'
Supplied question: For the female client who was born in 1976/1/29, which district did she opened her account?
Base generated questions: Q1:: Which districts do female clients born on January 29, 1976, belong to according to the database? Q2:: What are the districts of female clients born on 29th January 1976?
Base question rating: good+. Reason: supplied question presupposes exactly one client.
Bird only generated questions: Q1:: Which districts do female clients born on January 29, 1976, belong to? Q2:: What are the districts of female clients born on 1/29/1976?
Bird only question rating: good+. Reason: supplied question presupposes exactly one client.
LLM only generated questions: Q1:: What are the names of the districts (A2) where female clients born on January 29, 1976, are located according to the database? Q2:: Which districts have female clients born on 29th January 1976?
LLM only question rating: good+. Reason: supplied question presupposes exactly one client.
Fused generated questions: Q1:: What are the names of the districts where female clients who were born on January 29, 1976, are associated with? Q2:: Which districts have female clients born on 29th January 1976?
Fused question rating: good+. Reason: supplied question presupposes exactly one client.
----------------------------------------------------
Question 128:
SQL:
SELECT T2.A2, COUNT(T1.client_id) FROM client AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id WHERE T1.gender = 'F' GROUP BY T2.district_id, T2.A2 ORDER BY COUNT(T1.client_id) DESC LIMIT 9
Supplied question: List the top nine districts, by descending order, from the highest to the lowest, the number of female account holders.
Base generated questions: Q1:: Which are the top 9 districts with the highest number of female clients, and how many female clients are there in each of these districts? Q2:: What are the top 9 districts by female client count?
Base question rating: good. Reason: supplied question is vague.
Bird only generated questions: Q1:: What are the top 9 districts with the highest number of female clients, along with the count of female clients in each of those districts? Q2:: Which 9 districts have the most female clients?
Bird only question rating: good. Reason: supplied question is vague.
LLM only generated questions: Q1:: Which nine districts have the highest number of female clients, and how many female clients are there in each of these districts? Q2:: What are the top nine districts with the most female clients?
LLM only question rating: good. Reason: supplied question is vague.
Fused generated questions: Q1:: What are the top 9 districts with the highest number of female clients, along with the count of female clients in each of those districts? Q2:: Which 9 districts have the most female clients?
Fused question rating: good. Reason: supplied question is vague.
----------------------------------------------------
Question 138:
SQL:
SELECT COUNT(T1.client_id) FROM client AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id WHERE T1.gender = 'M' AND T2.A15 = (SELECT T3.A15 FROM district AS T3 ORDER BY T3.A15 DESC LIMIT 1, 1)
Supplied question: In the branch where the second-highest number of crimes were committed in 1995 occurred, how many male clients are there?
Base generated questions: Q1:: How many male clients are there in the district with the second-highest value of A15? Q2:: How many men live in the district with the second-highest A15?
Base question rating: good-. Reason: not picking up on meaning of A15.
Bird only generated questions: Q1:: How many male clients are there in the district with the second-highest A15 value? Q2:: How many men live in the district with the second-highest A15?
Bird only question rating: good-. Reason: not picking up on meaning of A15.
LLM only generated questions: Q1:: How many male clients are there in the district with the second highest value of the A15 attribute? Q2:: How many men live in the district with the second highest A15 value?
LLM only question rating: good-. Reason: not picking up on meaning of A15.
Fused generated questions: Q1:: How many male clients are associated with the district that has the second-highest number of committed crimes in 1995? Q2:: How many men are from the district with the second-most crimes in 1995?
Fused question rating: good.
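In the Question 138 SQL, LIMIT 1, 1 is SQLite's MySQL-style offset form: skip one row, then return one row, so the subquery yields the second-highest A15 value. A minimal sketch of the idiom on a hypothetical table t(v); note that if the highest value is tied across rows, the offset form can return that same value again:

-- Second row in descending order = second-highest v (ties permitting).
SELECT v FROM t ORDER BY v DESC LIMIT 1, 1;
-- Equivalent spelling:
SELECT v FROM t ORDER BY v DESC LIMIT 1 OFFSET 1;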
----------------------------------------------------
Question 169:
SQL:
SELECT CAST((SUM(CASE WHEN STRFTIME('%Y', T1.date) = '1997' THEN T1.amount ELSE 0 END) - SUM(CASE WHEN STRFTIME('%Y', T1.date) = '1996' THEN T1.amount ELSE 0 END)) AS REAL) * 100 / SUM(CASE WHEN STRFTIME('%Y', T1.date) = '1996' THEN T1.amount ELSE 0 END) FROM loan AS T1 INNER JOIN account AS T2 ON T1.account_id = T2.account_id INNER JOIN disp AS T3 ON T3.account_id = T2.account_id INNER JOIN client AS T4 ON T4.client_id = T3.client_id WHERE T4.gender = 'M' AND T3.type = 'OWNER'
Supplied question: What was the growth rate of the total amount of loans across all accounts for a male client between 1996 and 1997?
Base generated questions: Q1:: What is the percentage change in the sum of loan amounts from male clients who are owners of their accounts between the years 1996 and 1997? Q2:: What is the loan amount percentage increase or decrease for male account owners from 1996 to 1997?
Base question rating: good. Reason: SQL has stability problem.
Bird only generated questions: Q1:: What is the percentage change in the sum of loan amounts for male clients who are owners of their accounts from the year 1996 to 1997? Q2:: What's the loan amount percentage increase for male account owners from 1996 to 1997?
Bird only question rating: good. Reason: SQL has stability problem.
LLM only generated questions: Q1:: What is the percentage change in the sum of loan amounts from the year 1996 to 1997 for male clients who are owners of their accounts? Q2:: What is the percentage increase or decrease in loans taken by male account owners from 1996 to 1997?
LLM only question rating: good. Reason: SQL has stability problem.
Fused generated questions: Q1:: What is the percentage change in the total amount of loans approved for male clients who are owners of the account between the years 1996 and 1997? Q2:: What is the percentage increase or decrease in approved loan amounts for male account owners from 1996 to 1997?
Fused question rating: good. Reason: SQL has stability problem.
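The ratings above flag a stability problem without spelling it out. One plausible reading, offered here only as a sketch, is the denominator: if the selected group took out no loans in 1996, the 1996 sum is zero and the division is undefined (SQLite quietly returns NULL, while most other engines raise an error). Under that reading, only the denominator of the Question 169 SQL needs a guard:

-- NULLIF makes the zero-denominator case explicit: the growth rate becomes NULL
-- instead of an engine-dependent failure.
... / NULLIF(SUM(CASE WHEN STRFTIME('%Y', T1.date) = '1996' THEN T1.amount ELSE 0 END), 0)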
----------------------------------------------------