Datrics Text 2 SQL
Datrics∗
Whitepaper v. 1.0
March 2025
Introduction
Importance of Text-to-SQL Systems
Text-to-SQL systems translate natural language questions into SQL queries, making data access
more inclusive and intuitive. This is especially valuable for non-technical users in organizations who
may not know how to write SQL. By bridging natural language and database queries, Text-to-SQL
empowers analysts and business users to access and analyze data without mastering complex SQL syntax [1, 2].
In effect, it democratizes data access – users across different departments can retrieve insights
using plain language, instead of relying on technical teams or learning query languages [3]. This
lowers the barrier to entry for data analysis and fosters a more data-driven culture. With more
employees able to query data directly, organizations can make faster, informed decisions at all
levels.
Furthermore, allowing people to ask questions in natural language speeds up the data retrieval
process, as business users can get answers immediately without waiting for a SQL expert to write
the query [4]. By streamlining how users interact with databases, Text-to-SQL systems help unlock
the full value of enterprise data and improve productivity in decision-making.
Key Challenges
• Ambiguous Questions: User questions are often vague or underspecified; studies indicate that around 20% of user questions to text-to-SQL systems are problematic – often ambiguous or even unanswerable without clarification [6].
• Domain-Specific Terminology: Every organization or dataset may use unique
vocabulary, acronyms, or business terms that do not exactly match the names
of tables or columns in the database [7, 8]. Text-to-SQL models struggle when
domain terms or synonyms don’t align with schema labels, requiring the
system to incorporate domain knowledge to make the correct associations [9, 10,
11]. Ensuring the model understands the context is essential but challenging.
• Complex Schemas and Queries: Real-world databases often have many tables
with intricate relationships, and user questions may require joining multiple tables
or applying advanced SQL functions. Generating correct SQL in these scenarios is
difficult. Writing efficient joins across numerous tables or handling nested
queries and analytics functions (e.g. window functions) is a major challenge
for Text-to-SQL [12]. Large schemas also introduce many similarly named fields,
increasing the risk that the model might confuse table or column names. LLMs
without guidance can hallucinate – producing SQL that is syntactically incorrect or
selecting the wrong table – especially when confronted with many tables or columns
with similar names [13]. Ensuring the generated query uses the right database
entities and is optimized for performance is non-trivial.
These challenges mean that out-of-the-box language models may falter on enterprise Text-to-SQL
tasks. Handling ambiguous phrasing, understanding domain context, and coping with complex,
multi-table databases all require carefully designed solutions beyond just training a model on
example queries.
The Datrics Text2SQL system is built around a Knowledge Base with three types of content:
1. Database Documentation: Structured descriptions of the tables, their columns, and the relationships that make up the database schema.
2. Question-Answer Examples: A collection of question-SQL pairs that is used
as a reference for mapping natural language questions to specific SQL syntax and
database operations.
3. Domain-Specific Rules: Specialized instructions that capture the business logic
and relationships not explicitly stored in the vector database but critical for accurate
query generation.
The system functions through three main phases:
1. Knowledge Base Training: The process of building a comprehensive understand-
ing of the database by analyzing documentation, examples, and schema information.
This involves transforming unstructured descriptions into structured metadata and
vector embeddings.
2. Content Retrieval: This phase retrieves the most relevant documentation and examples from the knowledge base using semantic similarity matching against the user's request, taking the business rules into account.
3. SQL Generation: The final stage is where the system utilizes the retrieved infor-
mation to generate a correct SQL query that satisfies the user’s intent, leveraging
both table descriptions and similar query examples.
This approach ensures that the Text2SQL system can handle complex queries by combining
structural knowledge about the database with practical examples of query patterns, all while
respecting domain-specific business rules that govern data relationships and access patterns.
• columns: An array where each element is an object describing a column (with its name and
a description)
• entities: A list of key concepts extracted from the documentation
• strong_entities: A subset of entities deemed particularly significant
You don’t need to manually create this structural definition. The system can automatically generate
it by analyzing any free-form text that contains descriptions of the table and its columns. This
automated approach saves time and reduces the potential for errors that might occur during manual
creation. The system can extract the necessary information about table relationships, column
definitions, and data types directly from natural language descriptions, converting them into a
formal structure without requiring manual intervention.
I. DOCUMENTS PROCESSING
The automated extraction of the required fields from a document is performed by processing the document with an LLM and is divided into two steps:
1. Extraction of name, summary, table purpose, columns with their description, key columns,
and connected tables
2. Extraction of entities that might be inferred from this table based on its purpose
For the second step, a voting strategy is applied: we generate several LLM responses for the same input content and assess how frequently each concept occurs across the runs. This allows the strong entities to be separated from the full entity list, as sketched below.
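A minimal sketch of this voting step is given below; the run count, the vote cutoff, and the extract_fn callable (an LLM call that returns entity strings) are assumptions made for illustration rather than the exact Datrics implementation.

from collections import Counter
from typing import Callable, List, Tuple

def extract_strong_entities(
    doc_text: str,
    extract_fn: Callable[[str], List[str]],  # placeholder for an LLM entity-extraction call
    n_runs: int = 5,
    min_votes: int = 2,
) -> Tuple[List[str], List[str]]:
    """Run the extraction several times and keep concepts that recur across runs."""
    votes = Counter()
    for _ in range(n_runs):
        # Each run contributes at most one vote per distinct concept.
        votes.update({e.strip().lower() for e in extract_fn(doc_text)})
    entities = sorted(votes)                                                 # full entity list
    strong_entities = sorted(e for e, c in votes.items() if c >= min_votes)  # recurring concepts
    return entities, strong_entities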
DOCUMENTS AUTO-GENERATION
Suppose the user's database contains well-named tables whose columns also have descriptive names (i.e., the names reflect the meaning of the table and its columns). In that case, we can infer their purposes and dependencies from the domain and general knowledge about the available data. So, given a DB schema, we can use an LLM to generate the table descriptions, with some reasonable limitations. In general, the generated description is usable as a first approximation, but validation by a domain expert is strongly required for complex data structures.
Here are the most important characteristics of the generated data:
1. Auto-generated descriptions follow a clear format with sections like "Table Description,"
"Columns Description," and "Potential Dependencies," making them easy to navigate and
review.
2. They provide detailed descriptions of columns, often attempting to explain their possible roles and relationships.
3. They highlight primary keys, sort keys, and potential table dependencies, which can be helpful in understanding the database structure and optimizing queries.
Limitations:
1. While auto-generated descriptions attempt to define a table’s role, they may sometimes
describe it too broadly rather than focusing on its specific function.
2. In some cases, the descriptions suggest dependencies between tables that are not explicitly
defined, which might lead to assumptions about relationships that do not exist.
3. While general column descriptions are useful, they may not always capture specific nuances
like predefined status codes or intentionally unused fields.
DOCUMENTS STRUCTURAL REPRESENTATION
For straightforward data structures, the generated descriptions offer enough context for a Text2SQL
agent to understand the table’s purpose and relationships without requiring expert review. Table and
column names typically provide sufficient clarity, making additional manual validation unnecessary.
However, expert review may be critical for more complex schemas with nuanced relationships or
specific field constraints.
Once all responses are received, the next step is to combine the extracted entities from each response
into one comprehensive collection. This aggregation allows the system to analyze the frequency
with which each unique entity appears across the multiple outputs. The underlying assumption is that entities mentioned consistently are more likely to be accurate or significant, whereas entities that occur only once within the series of LLM calls are considered insignificant. This approach
leverages the diversity of multiple model outputs to enhance the reliability of entity extraction. By
aggregating and filtering based on frequency, the process reduces the impact of any anomalous or
less confident responses, resulting in a more robust and consistent set of extracted entities.
EXAMPLE
Let us consider an example. Suppose we have a well-documented database with a detailed description of each table, which looks like this:
SMS Statuses
------------
The `sms_statuses` table contains data about which activists are subscribed to the mobile messaging list of groups being mirrored (or children, if in a network) and what their current statuses are (subscribed, unsubscribed, bouncing, or spam complaint).

This table contains data about the mobile subscription status such what status the activist has currently, as well as metadata about the mobile subscription status such as when it was created.

#### Fields

| Field Name | Description | Type | Is Sort Key? |
| --- | --- | --- | --- |
| id | The numerical identifier of the mobile subscription status. | INT | |
| subscriber_id | The numerical identifier of the activist who is attached to this status. Relates to the `id` field in the `[users](/mirroring/docs/users)` table. | INT | True |
| group_id | The numerical identifier of the group who's list this mobile subscription status is related to. Relates to the `id` field in the `[groups](/mirroring/docs/groups)` table. | INT | |
| user_id | This field is intentionally blank. | INT | |
| status | The mobile subscription status of the activist. `1` if subscribed. `2` if unsubscribed. `3` if bouncing. | INT | |
| source_action_id | This field is intentionally blank. | INT | |
| source_action_type | This field is intentionally blank. | VARCHAR | |
| join_date | The UTC timestamp when this mobile subscription status was last changed from any status (or none) to subscribed. Ex: The date the person last joined the list. | DATETIME | |
| created_at | The UTC timestamp when this mobile subscription status was created. | DATETIME | |
| updated_at | The UTC timestamp when this mobile subscription status was last updated. | DATETIME | |
Its structured representation, which is completely ready for use, looks as follows:
"name": "sms_statuses",
"summary": "Data about the mobile subscription statuses of activists within groups.",
"purpose": "To track and manage the subscription statuses (subscribed, unsubscribed,
bouncing, or spam complaint) of activists in mobile messaging lists.",
"dependencies_thoughts": "Relates to the `users` table (subscriber_id to id) and the
`groups` table (group_id to id).",
"keys": [
"subscriber_id",
"group_id"
],
"connected_tables": [
"users",
"groups"
],
"columns": [
"column": "id",
"description": "The numerical identifier of the mobile subscription status."
,
...
"column": "updated_at",
"description": "The UTC timestamp when this mobile subscription status was last updated."
6
II. TRAIN ON DOCUMENTS
If we have a structured and comprehensive description of the database, including table names,
descriptions, relationships, and metadata, we can leverage this data to enhance Retrieval-Augmented
Generation (RAG) using a Vector Database (Vector DB).
A Vector DB stores documents in a high-dimensional vector space, transforming each document
into numerical representations (embeddings) that capture semantic meanings. These embeddings
allow for more intelligent and context-aware searches, significantly improving how information is
retrieved. Each document, representing information about a table, is stored as a vector embedding.
The embedding is generated based on multiple document features.
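As a purely illustrative sketch of this step, the snippet below stores each structured table description as several category-tagged entries in a vector store. Chroma is used here only as a stand-in for the Vector DB, and the metadata layout (the description, entity, connected_tables, and table_name categories) mirrors the filters used later at retrieval time.

import json
import chromadb

client = chromadb.Client()
collection = client.create_collection("table_docs")

def index_table_doc(doc: dict) -> None:
    """Store one structured table description as several category-tagged embeddings."""
    entries = {
        "description": f'{doc["name"]}: {doc["summary"]} {doc["purpose"]}',
        "entity": " ".join(doc.get("entities", [])),
        "connected_tables": " ".join(doc.get("connected_tables", [])),
        "table_name": doc["name"],
    }
    for category, text in entries.items():
        if not text.strip():
            continue
        collection.add(
            ids=[f'{doc["name"]}:{category}'],
            documents=[text],  # the vector store embeds the text itself
            metadatas=[{
                "table": doc["name"],
                "category": category,
                "columns": json.dumps([c["column"] for c in doc["columns"]]),
            }],
        )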
Train based on Examples
While the documentation offers a comprehensive overview of the database, including relationships,
connections, and other intricacies, it does not ensure that this information alone is sufficient to
accurately interpret domain-specific nuances and generate the correct SQL query in response to a
user’s request. That is why the ability to work with question-answering examples is crucial.
Each example consists of a "Question - SQL Query" pair, which we store in the Vector Database
(Vector DB). These examples serve as reference points, allowing the system to retrieve relevant
patterns and structures when generating SQL queries in response to user requests. By leveraging
these stored examples, the model can improve accuracy and adapt to complex, domain-specific
queries more effectively.
To enhance SQL query generation and natural language understanding, we implemented a structured normalization and extraction process. This process converts each user request, together with its answer in the form of SQL code, into a structured JSON format that captures key elements such as the normalized question, requested entities, data sources, and calculations:
(Nk, Ek, Mk, Vk, Ok) = LLM(Ik, Scodek)

where:
Ik — the initial user query for the k-th example;
Scodek — the corresponding SQL query, if available;
Nk — the normalized form of the initial question;
Ek — the extracted entities relevant to the query;
Mk — the main clause of the query;
Vk — the identified data sources (tables and columns);
Ok — the operations (aggregations, filters, grouping) present in the query;
Tik ∈ T — the i-th table from the database that is used in the k-th example;
Cik — the set of columns from Tik that are used in the k-th example.
I. EXAMPLE NORMALIZATION AND STRUCTURING
First, we transform a natural language query into a standardized format while removing
unnecessary details. This ensures consistency and adaptability across different database queries.
Moreover, we analyze the pair of (natural language query, SQL code) to extract the entities
that are connected with this request and identify data sources and operations.
This process is executed using a Large Language Model (LLM), which simultaneously performs the tasks listed below. The following rules define how the LLM transforms and extracts structured information from natural language inputs.
1. Normalize the Question (Nk )
The LLM must standardize the user query by removing unnecessary details and ensuring a structured
format. The normalization process follows these steps:
• Remove database-specific names (e.g., table names, column names).
• Remove display instructions (e.g., "show", "display", "generate chart").
• Generalize conditions (e.g., "in 2023" → "for some time range").
• Abstract filtering details (e.g., replace specific IDs, names, or numbers with generalized terms).
• Maintain the directive form (e.g., "Show sales data" → "Sales data").
2. Requested Entities (Ek )
The LLM identifies key data elements referenced in the query. These entities define the core
information being retrieved.
• Identify core data being requested (e.g., "number of transactions per region").
• Keep significant categorical definitions (e.g., "mobile applications", "email campaigns").
• Strip out time filters and specific constraints if they don’t affect the entity’s identity.
Finally, we receive a structured set of entities:
Ek = {(Li , Ki , Ji )} , where Li — the label of the entity; Ki — the type of entity (category, metric,
or identifier); Ji — any additional attributes that provide context.
3. Data Sources (Vk)
The LLM must determine which tables and columns provide the necessary data.
• If an SQL snippet is provided, extract table names and necessary columns from it.
• If no SQL code is given, keep the data source field empty to avoid assumptions.
The structured output for data sources is represented as:
Vk = {Tik , Cik }
4. Extract Calculations & Filters
The LLM must identify and structure any mathematical operations, grouping fields, or
filtering conditions.
• Identify aggregation operations (e.g., SUM, AVG, COUNT).
• Extract grouping fields (e.g., BY department).
• Convert all filters and conditions into human-readable form (e.g., "where region = ’US’"
→ "Filter by region (United States)").
The structured output for operations is represented as:
O = {(operation, target, condition)}
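Putting these rules together, a record in the structured format could look roughly like the following; the question, field names, and values are hypothetical and merely illustrate the Nk, Ek, Mk, Vk, and Ok elements defined above.

# Hypothetical structured record for an illustrative question-SQL pair.
example_record = {
    "initial_question": "Show the number of orders per region in 2023",
    "normalized_question": "Number of orders per region for some time range",
    "main_clause": "number of orders per region",
    "entities": [
        {"label": "orders", "type": "metric", "attributes": ["count"]},
        {"label": "region", "type": "category", "attributes": []},
    ],
    "data_sources": {
        "orders": ["id", "region_id", "created_at"],
        "regions": ["id", "name"],
    },
    "operations": [
        {"operation": "COUNT", "target": "orders.id", "condition": None},
        {"operation": "GROUP BY", "target": "regions.name", "condition": None},
        {"operation": "FILTER", "target": "orders.created_at",
         "condition": "Filter by time range (year 2023)"},
    ],
}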
Let’s consider the example:
Question: Show me a list of events in the past month including name, permalink, start date, end date, and RSVP count
SQL Code:
The initial question was transformed to the normalized form "Retrieve a list of events
over a defined timeframe including their name, permalink, start date, end date,
and RSVP count", which
• contains all entities that have to be retrieved
• contains an indicator that the request refers to some specific timeframe
• does not contain any specific details that contradict the generalization principles
The requested entity and the main clause are "list of events with their characteristics" and "list of events in the past month", respectively. To provide wider coverage, we generated three additional questions similar to the normalized representation of the request:
• Provide a list of events occurring within a specified time range, including their names,
permalinks, start dates, end dates, and the number of RSVPs.
• Can you give me the details of events within a certain period, listing their name, permalink,
start date, end date, and RSVP count?
• Generate a report of events within a certain period, showing their name, permalink, start
date, end date, and number of RSVPs
So, the example is described by seven embeddings, which increases invariance to the specific wording of the request when the user asks about the same concepts.
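Assuming the structured record is extended with a similar_questions field holding the generated paraphrases, the embedding cloud for an example could be built roughly as follows; the sentence-transformers encoder named here is only an illustrative stand-in for whatever embedding model backs the Vector DB.

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def build_embedding_cloud(example: dict) -> dict:
    """Embed every descriptor of an example: initial question, normalized form,
    main clause, entities, and the generated similar questions (seven texts here)."""
    descriptors = {
        "Init": example["initial_question"],
        "Normalized": example["normalized_question"],
        "MainClause": example["main_clause"],
        "Entity": ", ".join(e["label"] for e in example["entities"]),
        **{f"Similar {i + 1}": q for i, q in enumerate(example["similar_questions"])},
    }
    vectors = encoder.encode(list(descriptors.values()), normalize_embeddings=True)
    # All vectors are stored in the Vector DB under the same example id,
    # so a hit on any descriptor retrieves the whole example.
    return dict(zip(descriptors.keys(), vectors))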
Let’s look at the diagrams below. The left diagram represents PCA Visualization of Text
Embeddings for two examples. Each color-coded group relates to an example, which is represented
via embeddings that originate from an initial question. Each labeled point corresponds to a text
embedding, with names such as "MainClause", "Entity", "Normalized", "Similar 1", "Similar 2",
"Similar 3", and "Init." We may see that these groups are well distinguished - embeddings within
the same group (same example) tend to be closer together.
The diagram on the right represents the inner-cluster pairwise distance comparison before and
after adding similar questions. It provides insight into how the embeddings’ spread changed with
the new data. According to the provided box plots, we may see that before adding similar
questions, the embeddings were more compact, meaning that questions and their variations
were closely related to the embedding space. After adding similar questions, the increase in
distances suggests that the embeddings became more dispersed, possibly introducing more variety
and reducing redundancy in the representations.
PCA visualization of text embeddings for two example clusters (left); comparison of intra-cluster distances before and after adding similar questions (right).
The wider distribution of the embedding suggests that new (similar) questions were not the exact
duplicates of the ones present in the training corpus. Rather, they were a paraphrase or a different
formulation of the query, which made the embeddings more diverse. This shows that the model
has improved its ability to deal with the different forms of the same query, which makes the model
more robust in the retrieval tasks. The increase in the median distance shows that the model
has learned to tell the difference between similar questions better, which in turn enhances the
generality of the encoding across the phrasing. This improvement makes the retrieval step inside a
Retrieval-Augmented Generation (RAG) system less dependent on the particular words used in a
request so that similar queries will always return relevant results no matter what phrase is used.
Finally, each example is stored in the Vector DB via a cloud of embeddings created for the Initial
Question and its variations, Normalized question, Main Clause, and other extracted entities.
Domain-Specific Instructions
In addition to the comprehensive description of question-answer examples stored in the Text2SQL model, we auto-generate domain-specific instructions that are inferred from the examples and provide a reasoning bridge between user questions expressed in natural language and the underlying database schema. By systematically analyzing question-SQL pairs, we obtain structured mappings that associate domain entities with the corresponding database tables.
The domain-specific mapping may be described as a multi-stage process that extracts knowledge
from each question-SQL pair to create a complete mapping framework:
{TLLMk; WLLMk} = LLM(Ik, {Tik}, R, Scodek)
where
TLLMk — the set of tables selected by the model;
WLLMk — justifications explaining why each table is necessary.
The language model produces a list of required tables along with justifications explaining why each
table is necessary for answering the question. This output is then structured into a consistent JSON
format for further processing.
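A hedged sketch of what such a table-selection call could look like is shown below; the OpenAI client, the model name, the prompt wording, and the JSON contract are illustrative assumptions rather than the exact Datrics prompts.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any LLM backend would do

def select_tables(question: str, candidate_tables: list[dict],
                  business_rules: str, sql_code: str | None = None) -> dict:
    """Ask the model which tables are required and why (TLLMk, WLLMk)."""
    prompt = (
        "Given the question, the candidate tables and the business rules, return "
        'JSON of the form {"tables": [{"name": "...", "why": "..."}]}.\n\n'
        f"Question: {question}\n"
        f"Candidate tables: {json.dumps(candidate_tables)}\n"
        f"Business rules: {business_rules}\n"
        f"Reference SQL (optional): {sql_code or 'n/a'}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force parseable JSON output
    )
    return json.loads(response.choices[0].message.content)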
EXAMPLE
Let us consider a simplified example to illustrate the process described above.
Question: "How many people RSVP’d to the climate march event last month?"
SQL Query:
SELECT COUNT(rsvps.id)
FROM rsvps
JOIN events ON rsvps.event_id = events.id
WHERE events.name LIKE '%climate march%'
AND events.start_date BETWEEN DATE_SUB(CURDATE(), INTERVAL 2 MONTH)
AND DATE_SUB(CURDATE(), INTERVAL 1 MONTH)
Selected Tables:
• "events" - To identify the specific climate march event and its timing
• "rsvps" - To count the number of people who RSVP’d to the event
Derived Entity Mappings:
Minor Entities:
• "event name" → table: events, column: name
• "event start date" → table: events, column: start_date
• "RSVP count" → table: rsvps, aggregation: COUNT(id)
Major Entity:
• "event participation" → tables: events + rsvps, joined on event_id
Conclusion
The training of the Knowledge Base creates the foundation for the Text2SQL model, enabling
accurate and context-aware SQL query generation. By structuring the Knowledge Base with
database documentation, example queries, and domain-specific mappings, the system ensures
efficient retrieval and highly reliable response generation:
• The documentation-driven training creates a structured representation of database tables,
relationships, and dependencies that allows the model to understand the database schema
effectively.
• The example-based training enhances SQL query generation by leveraging stored question-
SQL pairs, improving the system’s ability to interpret natural language queries and retrieve
relevant patterns. The process of query normalization, entity extraction, data source identification, and operation structuring refines query construction, leading to greater accuracy and adaptability.
• The domain-specific instructions serve as a reasoning bridge between natural language
queries and the underlying database schema, ensuring correct table selection, attribute
mapping, and relationship definition.
Retrieval from the Knowledge Base
When a user submits a question, the system searches the Knowledge Base for stored examples whose descriptors are sufficiently similar to the question:

Y∗ = {Yi | Sim(X, Yi) ⩾ τ}

where
X — the vector representation of the user's question or its derivatives;
Yi — the stored vector representations of an example's descriptors, such as the full-text question, its normalized form, extracted entities, etc.;
Y∗ — the set of all retrieved examples Yi;
τ — the similarity threshold.
The process of retrieving appropriate examples is depicted in the diagram below and illustrated by the code sketch that follows it. The example retrieval flow can be divided into two global stages:
• extract examples that perfectly match the user’s request
• extract examples that are considered as additional support for the documentation in order to
reduce the error risks in the SQL code-writing stage
In the first case (if we manage to find appropriate examples), the document retrieval phase can be skipped because all necessary information for SQL generation is already contained in the extracted examples. In the second case, the document retrieval procedure is executed as the main source of information for SQL generation.
Example retrieval flow with progressive thresholds.
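A compact code-level sketch of this two-stage logic follows; the threshold values and the kb.search_examples / kb.search_docs helpers are illustrative assumptions standing in for the Vector DB calls described in this paper.

def retrieve_context(question_vec, kb, strict_threshold=0.96, loose_threshold=0.84):
    """Two-stage example retrieval with progressive thresholds (illustrative values).

    kb.search_examples is assumed to return (example, similarity) pairs sorted by
    decreasing similarity; kb.search_docs returns table documentation entries.
    """
    hits = kb.search_examples(question_vec)

    # Stage 1: examples close enough to act as templates; documentation retrieval
    # can be skipped because they already contain the necessary logic.
    strong = [ex for ex, sim in hits if sim >= strict_threshold]
    if strong:
        return {"examples": strong, "docs": [], "skip_docs": True}

    # Stage 2: weaker matches are kept only as supporting context, and table
    # documentation becomes the main source of information for SQL generation.
    supporting = [ex for ex, sim in hits if sim >= loose_threshold]
    docs = kb.search_docs(question_vec)
    return {"examples": supporting, "docs": docs, "skip_docs": False}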
To run this flow, we first convert the user's request to the normalized form and extract the main clause, following the same process used during model training.
Since the normalized question is subject to the "Normalize the Question" rules (see Example Normalization and Structuring), we may expect high similarity between the normalized versions of different questions, even if they differ in specific details or slightly differ in writing style, as long as they convey the same meaning from a generalized perspective.
The boxplots below illustrate this concept. The diagrams display the distribution of pairwise distances between normalized questions and their alternatives. They compare the distances between nearest neighbours within the same example (intragroup) and between different examples (intergroup), over the full dynamic range on the left and limited to 0.05 on the right.
Distribution of pairwise distances between normalized questions: full range (left) and limited to 0.05 (right).
Nearest neighbours within the same example (or between different examples) refer to the closest matching questions, based on cosine similarity, that belong to the same question-answer example or to a different question-answer example, respectively. Given an example Gm
consisting of multiple question representations {Q1 , Q2 , ..., Qk }, the nearest neighbours within
the same example are defined as:
D_intra/inter(Qi)^N = argmin_{Qj ∈ Gm, j ≠ i} d(Qi, Qj) = argmax_{Qj ∈ Gm, j ≠ i} Sim(Qi, Qj)
where
d(Qi , Qj ) - represents the pairwise cosine distance between embeddings of the two questions
For intra-case:
• Qi , Qj ∈ Gm - are different representations of the same example
• The nearest neighbour Qj is the one with the smallest distance to Qi within the same example
group.
For inter-case:
• Qi ∈ Gm and Qj ∈ Gn where Gn ̸= Gm (questions come from different groups).
• The nearest neighbour Qj is selected only from a different group.
At this stage, we consider the questions that are the normalized representation of the original question and its alternatives, such as variations in phrasing or minor details, but with the same underlying meaning - D_intra(Qi)^N and D_inter(Qi)^N.
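The intra- and inter-group statistics, together with the percentile-based thresholds discussed next, can be computed roughly as in the sketch below; it assumes the embeddings are already grouped per example and uses numpy and scikit-learn purely for illustration.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def nearest_neighbour_stats(groups: dict):
    """groups maps an example id to an (n_i, d) array of its question embeddings
    (each example is assumed to contain at least two representations)."""
    labels, vectors = [], []
    for gid, embs in groups.items():
        labels += [gid] * len(embs)
        vectors.append(embs)
    labels = np.array(labels)
    sims = cosine_similarity(np.vstack(vectors))
    np.fill_diagonal(sims, -np.inf)                  # exclude self-matches

    intra, inter = [], []
    for i, gid in enumerate(labels):
        same = labels == gid
        same[i] = False
        intra.append(sims[i, same].max())            # nearest neighbour in the same example
        inter.append(sims[i, ~same].max())           # nearest neighbour in other examples
    p95_inter = np.percentile(inter, 95)             # threshold for template-level matches
    p99_inter = np.percentile(inter, 99)             # stricter threshold discussed below
    return np.array(intra), np.array(inter), p95_inter, p99_inter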
Regarding the normalized representation of a question, a cosine distance of less than P5(D_INT-N) = percentile_{x ∈ D_INT-N}(x, 5), where D_INT-N denotes the inter-group distances for the normalized-questions group, enables, in our embedding model, the detection of semantically equivalent questions while reducing the risk of misidentification. Therefore, by setting the similarity threshold for the normalized form of the user's question to Sim(A, B) ⩾ P95(S_INT-N) = percentile_{x ∈ S_INT-N}(x, 95), where S_INT-N is the inter-group similarity for the normalized-questions group, we ensure the correct positioning
of the user’s request among the saved examples. The retrieved example can be used as a template
to generate the correct SQL code without needing to include the documentation in the context, as
it already contains all essential logic, differing only in minor details such as timeframe, geolocation,
etc.
Ares = {Ak}, where Sim(N, Qk^N) ⩾ P95^N, Qk^N ∈ {Ik, Nk}, with P95^N = P95(S_INT-N)
Percentiles 99% 95% 90% 75% 50% 25% 10% 05% 01%
Inner-group 0.99 0.99 0.98 0.98 0.97 0.95 0.95 0.94 0.93
Inter-group 0.96 0.95 0.95 0.93 0.91 0.88 0.86 0.84 0.83
To maximize precision while maintaining a moderate recall, we set the similarity threshold at:

Sim(X, Y) > P′99,NM = P99(S_inter,N∪M)

where S_inter,N∪M = S_inter,N ∪ S_inter,M represents the inter-group similarity for the normalized-questions group expanded by the main clause. P99 of this inter-group similarity is 0.96, meaning that 99% of inter-group similarities are below this threshold.

By applying the threshold P′99,NM (the 99th percentile of inter-group similarity for the normalized-questions group expanded by the main clause) instead of P95^N (the 95th percentile of inter-group similarity for the normalized questions), we ensure that retrieved examples are strongly related to the user's request, effectively minimizing false positives and improving precision.
Since only 1% of inter-group examples have a similarity above 0.96, the risk of retrieving
incorrect examples is minimal. However, in this case, we also need to consider the table description
as an additional source of information since the examples themselves may not always include the
specific attributes requested.
A′res = {Ak}, where Sim(N, Qk) ⩾ P′99,NM, Qk ∈ {Ik, Nk, Mk}
Tcode = {Tk | Tk ∈ Ak, ∀Ak ∈ A′res}
S∗code = {Scodek | Scodek ∈ Ak, ∀Ak ∈ A′res}
where
S∗code — the set of SQL queries from the retrieved examples that have a sufficient match with the normalized question.
Percentiles   99%   95%   90%   80%   50%   25%   10%   05%   01%
Inner-group   0.99  0.97  0.96  0.95  0.93  0.90  0.87  0.84  0.82
Inter-group   0.98  0.88  0.86  0.83  0.78  0.76  0.74  0.72  0.70
This means that while the retrieved set of examples contains relevant information, it also has a
high level of noise. Therefore, an additional step is required to review the relevance of the extracted
question-code representatives and filter out highly irrelevant ones.
V. RERANKING
To achieve this, we apply a cross-encoder model trained on the Quora Duplicate Questions
dataset. This approach allows us to directly assess the semantic similarity between the questions that accompany the extracted examples and the user's request, providing a more accurate relevance ranking, because the cross-encoder processes both inputs jointly and captures deeper contextual relationships. The Quora Duplicate Questions dataset is specifically designed for paraphrase identification, so the cross-encoder model is able to recognize whether two questions have the same meaning despite differences in wording. This serves our goal well: detecting examples semantically similar to the user's request while ignoring lexical and stylistic variations.
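A minimal sketch of this reranking step with the sentence-transformers CrossEncoder class is shown below; the specific checkpoint (a publicly released cross-encoder trained on Quora duplicate questions) and the threshold value are assumptions made for illustration.

from sentence_transformers import CrossEncoder

# A public checkpoint trained on the Quora Duplicate Questions dataset
# (assumed here; any equivalent paraphrase cross-encoder would work).
reranker = CrossEncoder("cross-encoder/quora-distilroberta-base")

def rerank_examples(user_question: str, candidates: list, tau: float = 0.5) -> list:
    """Score each candidate's question jointly with the user's request and keep
    only candidates above the reranking threshold tau."""
    pairs = [(user_question, c["initial_question"]) for c in candidates]
    scores = reranker.predict(pairs)                 # duplicate-question scores
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, s in ranked if s > tau]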
Ares = {Ak ∈ A′res | getRank(Ak) > τ′}
Tcode = {Tk | Tk ∈ Ak, ∀Ak ∈ Ares}, if Ares ≠ ∅
S∗code = {Scodek | Scodek ∈ Ak, ∀Ak ∈ Ares}, if Ares ≠ ∅
where
S∗code — the set of SQL queries from the retrieved examples that have a sufficient match with the normalized question;
τ′ — the reranking threshold.
If we manage to find examples relevant to the input request (S∗code, Tcode), we put them into the context for SQL code generation, first accompanying them with the descriptions of the related tables. Otherwise, we take the candidate code examples and the corresponding tables (Scode, Tcode) and proceed with the retrieval of documentation on the related tables.
Document retrieval process flow.
To detect tables that match the user’s request and might be used at the stage of SQL code generation,
we execute multiple queries using different representations of the user’s request. Each query is
formulated as follows:
Zi = (zi , Fi , ni )
where:
zi is the query text derived from one of the following sources: main clause (zmain = M),
normalized question (znorm = N ), extracted concepts (zconcepts = E)
Fi is the filter condition applied to metadata categories, which restricts the search space.
ni is the number of results to retrieve per query.
The set of query parameters is defined as:
Z = {Z1 , Z2 , ..., Zn }
where each query applies different filters:
1. Matching descriptions & entities with normalized question N and main clause M
F1 = F2 = {"category" ∈ {"description", "entity"}}
2. Matching connected tables & entities with extracted concepts E
F3 = {"category" ∈ {"connected_tables", "entity"}}
3. Matching table names with all of these categories
F4 = F5 = F6 = {"category" = "table_name"}
For each query, we perform retrieval from the documentation collection C:
Ri = query(C, zi , Fi , ni )
where Ri represents the set of retrieved metadata entries. The final retrieved set is the union of all
query results:
R = ∪_{i=1}^{6} Ri
Each retrieved entry r ∈ R consists of metadata fields: table name (T ′ ), dependencies (D′ ),
connected tables (CT ′ )
We construct a structured representation of the retrieved data:
Q = {json(T ′ , D′ , CT ′ ) | (T ′ , D′ , CT ′ ) ∈ R}
To ensure uniqueness, we remove duplicate representations:
Qunique = unique(Q)
Each unique metadata representation is then converted into structured documentation entries:
Doc = {get_docs(q) | q ∈ Qunique }
The final tables set is obtained by merging all document entries:
Tdet = ∪_{d ∈ Doc} d
Finally, Tdet represents the set of detected tables, ensuring that all relevant table names and their
relationships are correctly identified.
In the current implementation, we set ni = 10, which leads to retrieving about 30 unique
documents with table descriptions (corresponding to 30 unique tables) on average. This ensures
broad coverage, considering that a typical SQL query involves 4 to 5 relevant tables while
maintaining computational efficiency. The objective of the retrieval process is to maximize recall (inclusion of relevant tables) and, at the same time, reduce the noise level, ensuring that the following steps operate on a well-curated candidate set of tables.
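Reusing the illustrative Chroma collection from the training-on-documents sketch (where each table was indexed under the categories description, entity, connected_tables, and table_name), the six filtered queries could be issued roughly as follows; the filter syntax is specific to that stand-in and the helper is not the exact Datrics implementation.

def retrieve_table_docs(collection, main_clause: str, normalized: str,
                        concepts: str, n_results: int = 10) -> set:
    """Run the six filtered queries Z1..Z6 and union the retrieved table names."""
    queries = [
        (main_clause, {"category": {"$in": ["description", "entity"]}}),
        (normalized,  {"category": {"$in": ["description", "entity"]}}),
        (concepts,    {"category": {"$in": ["connected_tables", "entity"]}}),
        (main_clause, {"category": "table_name"}),
        (normalized,  {"category": "table_name"}),
        (concepts,    {"category": "table_name"}),
    ]
    detected_tables = set()
    for text, where in queries:
        result = collection.query(query_texts=[text], n_results=n_results, where=where)
        for metadata in result["metadatas"][0]:
            detected_tables.add(metadata["table"])   # table name stored at indexing time
    return detected_tables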
This statement is reflected in the scatter plot that depicts the relationship between the Rand
Index and the Number of Unique Tables Retrieved.
The moderate overlap between retrieved sets, measured by a Rand Index of 0.36, shows that
different query formulations bring in new relevant tables while still having some common results.
This balance indicates that the search strategy is effective and reliable, using multiple query
approaches to capture more relevant tables without too much repetition. The results confirm that
the retrieval process maintains a good mix of variety and consistency, ensuring that important
tables are included while keeping the search efficient.
The table refinement step can be summarized as:

TR = LLM(N, Ttotal ∪ Tcode, Instr_Domain, R)

where
TR — the final refined set of tables selected by the model;
N — the normalized representation of the user's request;
Instr_Domain — the domain-specific mapping instructions;
R — the business rules;
Ttotal — the candidate tables with descriptions received at the document retrieval stage;
Tcode — the candidate tables with descriptions received at the example retrieval stage.
Conclusion
The retrieval process from the Knowledge Base is designed to obtain the examples and table documentation that are relevant to the user's request, which are necessary for the SQL query generation process. With the help of vector-based similarity searches, reranking techniques, and business-rule-driven refinements, the system effectively retrieves the relevant SQL query examples and their associated table documentation.
The multi-step approach described in this document optimizes precision while keeping error rates in table selection low, thus ensuring that the result is consistent with both the database structure and the domain requirements. The extracted context, based on a refined set of tables and retrieved SQL examples, is vital in the query generation process because it underpins the accuracy, efficiency, and contextual consistency of the whole process.
I. EXAMPLE-DRIVEN SQL GENERATION
• Primary Focus:
– Leverages previous SQL query examples (S∗∗code or S∗code) that are relevant to the user's question.
– Uses table information as supporting context (Tcode ) but relies mainly on patterns
from examples.
• Key Steps:
1. Analyze relevant SQL examples to extract query patterns.
2. Check documentation recommendations and apply direct insights.
3. Ensure table dependencies are correctly handled.
4. Minimize the number of tables included in the final query.
• Constraints:
– Must not introduce extra operations unless explicitly required by the question.
– Must follow the provided schema strictly.
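As a rough illustration of this example-driven path, the sketch below assembles the retrieved examples and table descriptions into a single generation prompt; the prompt wording, the sql field on retrieved examples, and the reuse of the OpenAI client from the earlier table-selection sketch are assumptions, not the exact Datrics implementation.

def generate_sql(client, question: str, examples: list, table_docs: list) -> str:
    """Build an example-driven prompt and ask the LLM for a single SQL query."""
    example_block = "\n\n".join(
        f"-- Q: {ex['initial_question']}\n{ex['sql']}" for ex in examples
    )
    docs_block = "\n".join(table_docs)  # supporting context (Tcode descriptions)
    prompt = (
        "Write one SQL query answering the user's question.\n"
        "Follow the patterns of the reference examples, respect the table "
        "dependencies described in the documentation, use as few tables as "
        "possible, and add no operations the question does not require.\n\n"
        f"Reference examples:\n{example_block}\n\n"
        f"Table documentation:\n{docs_block}\n\n"
        f"Question: {question}\nSQL:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()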
Conclusion
SQL code generation constructs accurate queries by leveraging retrieved examples and table
documentation. The system dynamically chooses between example-driven and documentation-
driven approaches based on relevance, ensuring alignment with the user’s intent and database
schema. By strictly following table relationships, dependencies, and constraints, this method ensures
robust, efficient, and reliable SQL with minimal errors.
Conclusions
The Text2SQL system presented in this document establishes a comprehensive approach to convert-
ing natural language queries into correct SQL code using Retrieval-Augmented Generation (RAG).
By integrating database knowledge, query examples, and domain-specific rules, this system handles
the challenges of database complexity and linguistic nuance.
Future Work
Future developments should include empirical evaluation on publicly available benchmarks to
quantify performance relative to existing approaches. Additional research could focus on optimizing
the extraction process for greater efficiency, expanding the range of supported SQL constructs, and
developing methods for automated extraction of domain-specific rules. We also see potential in
adapting this framework to support multi-turn interactions, where users can refine their queries
through conversation.
References
[1] Naihao Deng, Yulong Chen, and Yue Zhang. Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect. ACL Anthology. 2022. https://aclanthology.org/2022.coling-1.190/
[2] Anindyadeep Sannigrahi. State of Text-to-SQL 2024. Premai Blog. 2024. https://blog.premai.io/state-of-text2sql-2024/
[3] AtScale. What is Text-to-SQL? Benefits, How it Works. AtScale Glossary. 2023. https://www.atscale.com/glossary/text-to-sql/
[4] Jinqing Lian, Xinyi Liu, Yingxia Shao, Yang Dong, Ming Wang, Zhang Wei, Tianqi Wan, Ming Dong, and Hailin Yan. ChatBI: Towards Natural Language to Complex Business Intelligence SQL. arXiv preprint. 2024. https://doi.org/10.48550/arXiv.2405.00527
[5] Hiba Fathima. Natural Language to SQL: Simplifying Data Access for Everyone. Sequel Blog. 2024. https://sequel.sh/blog/natural-language-to-sql
[6] Bing Wang, Yan Gao, Zhoujun Li, and Jian-Guang Lou. Know What I Don't Know: Handling Ambiguous and Unanswerable Questions for Text-to-SQL. ACL Anthology. 2023. https://aclanthology.org/2023.findings-acl.352.pdf
[7] Yujian Gan, Xinyun Chen, and Matthew Purver. Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization. ACL Anthology. 2021. https://aclanthology.org/2021.emnlp-main.702.pdf
[8] Sanjeeb Panda and Burak Gözlüklü. Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources. AWS Machine Learning Blog. 2024. https://aws.amazon.com/blogs/machine-learning/build-a-robust-text-to-sql-solution-generating-complex-queries-self-correcting-and-querying-diverse-data-sources/
[9] Irina Saparina and Mirella Lapata. AMBROSIA: A Benchmark for Parsing Ambiguous Questions into Database Queries. arXiv preprint. 2024. https://doi.org/10.48550/arXiv.2406.19073
[10] Zijin Hong, Zheng Yuan, Qinggang Zhang, Hao Chen, Junnan Dong, Feiran Huang, and Xiao Huang. Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL. arXiv preprint. 2024. https://arxiv.org/html/2406.08426
[11] Longxu Dou, Yan Gao, Xuqi Liu, Mingyang Pan, Dingzirui Wang, Wanxiang Che, Dechen Zhan, Min-Yen Kan, and Jian-Guang Lou. Towards Knowledge-Intensive Text-to-SQL Semantic Parsing with Formulaic Knowledge. ACL Anthology. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.350
[12] Ali Buğra Kanburoğlu and Faik Boray Tek. Text-to-SQL: A methodical review of challenges and models. Turkish Journal of Electrical Engineering and Computer Sciences. 2024. https://doi.org/10.55730/1300-0632.4077
[13] Tobi Beck. How to Simplify SQL with Text-to-SQL Technology. Eckerson Group Blog. 2024. https://www.eckerson.com/articles/how-to-simplify-sql-with-text-to-sql-technology