Data-Centric Text-to-SQL with Large Language Models
Zezhou Huang (Columbia University, [email protected])
Shuo Zhang (Columbia University, [email protected])
Kechen Liu (Columbia University, [email protected])
Eugene Wu (DSI, Columbia University, [email protected])
Abstract
Text-to-SQL is crucial for enabling non-technical users to access data, and large
language models have significantly improved performance. However, recent frame-
works are largely Query-Centric, focusing on improving models’ ability to translate
natural language into SQL queries. Despite these advancements, real-world chal-
lenges—especially messy and large datasets—remain a major bottleneck. Our case
studies reveal that 11-37% of the ground truth answers in the BIRD benchmark
are incorrect due to data quality issues (duplication, disguised missing values,
incorrect data types, and inconsistent values). To address this, we propose a Data-Centric
Text-to-SQL framework that preprocesses and cleans data offline, builds a rela-
tionship graph between tables, and incorporates business logic. This allows LLM
agents to efficiently retrieve relevant tables and details during query time, signifi-
cantly improving accuracy. Our experiments show that this approach outperforms
human-provided ground truth answers on the BIRD benchmark by up to 33.89%.
1 Introduction
Text-to-SQL is a widely used task that enables domain experts who are not familiar with databases or
SQL to access data easily. Recent Text-to-SQL frameworks [10, 16, 3, 24], based on off-the-shelf
large language models like GPT-4 [1] and Claude 3.5, with trillions of parameters, have demonstrated
state-of-the-art results on public benchmarks such as BIRD [19], Spider [28], and KaggleDBQA [14].
Current Text-to-SQL frameworks are Query-Centric, focusing mainly on enhancing models’ ability
to translate natural language into SQL queries. For example, previous works have improved query
results by using better fine-tuned models [17, 18, 8], implementing self-debugging frameworks [2], or
decomposing tasks for individual SQL components [24, 27, 15]. However, in real-world applications,
a major challenge—beyond just constructing queries—is dealing with messy and large datasets,
which complicates the query construction process. Models must not only understand SQL
and map natural language concepts to the schema, but also search for the correct tables and perform cleaning on the fly:
• Messy: Text-to-SQL solutions are mostly needed for raw data sources, such as those in an
enterprise data lake [22], where data has not yet been transformed into reporting models needed
for answering queries. However, these raw data sources often have issues like duplication, missing
values, or inconsistencies [5], which can negatively impact the performance of Text-to-SQL.
• Large: Enterprise data often consists of a large number of data sources [21, 9]. Even within a
single data source, it’s common to find hundreds or even thousands of tables with complex join
and union relationships, e.g., for ERP and CRM systems [7]. To link the query to the table schema,
previous works have represented the schema in different structures [8, 25], and recent work suggests that schema
linking is less needed for the latest LLMs [20]. However, all these works study relatively small
databases with fewer than 20 tables, while enterprise data lakes contain thousands of tables.
To illustrate how data can be a bottleneck that complicates text-to-SQL tasks, consider the example
tables and queries from the BIRD benchmark [19], as shown in Figure 1. Duplication leads to incorrect
aggregation queries, such as AVG. Disguised missing values and inconsistent value representations
make filter conditions challenging by requiring additional predicates. Additionally, column types
may need to be cast before performing aggregation. These issues are not mere edge cases. Our case
study over 4 databases in BIRD reveals that, surprisingly, 11-37% of the ground truth answers
are incorrect, because the provided SQL queries don’t handle these issues. In addition to obviously
incorrect answers, there are other data issues like missing values and referential integrity violations
that are concerning, though we cannot confirm whether they affect the correctness of the answers.
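These issues can be surfaced mechanically before any query is written. Below is a minimal, hypothetical profiling sketch in Python (pandas); the disguised-null token list and the numeric-text threshold are our own assumptions, not part of BIRD.

import pandas as pd

# NULL-like tokens that often disguise missing values (assumed list).
DISGUISED_NULLS = {"nan", "na", "n/a", "none", "null", "-", ""}

def profile_issues(df: pd.DataFrame) -> dict:
    report = {}
    # (a) Duplication: exact duplicate rows inflate aggregates such as AVG.
    report["duplicate_rows"] = int(df.duplicated().sum())
    for col in df.columns:
        if df[col].dtype != object:
            continue
        values = df[col].dropna().astype(str).str.strip()
        lowered = values.str.lower()
        # (b) Disguised missing values: strings that survive IS NOT NULL.
        report[col + "/disguised_nulls"] = int(lowered.isin(DISGUISED_NULLS).sum())
        # (c) Numbers stored as TEXT: most values parse as numeric.
        parsed = pd.to_numeric(values, errors="coerce")
        if len(values) > 0 and parsed.notna().mean() > 0.9:  # assumed threshold
            report[col + "/numeric_stored_as_text"] = True
        # (d) Inconsistent representations: values that collide after lowercasing.
        if lowered.nunique() < values.nunique():
            report[col + "/case_variants"] = True
    return report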
We propose a Data-Centric Text-to-SQL framework that leverages large language models to pre-
process data. This is an offline process, independent of online queries, that prepares the RAG
structure and cleans the tables. It complements Query-Centric frameworks online by helping
with schema linking over many tables and eliminating the need for cleaning on the fly. Offline, the
Data-Centric framework (1) profiles and cleans data to address quality issues, (2) builds a relationship
graph to identify join and union paths between tables, and (3) summarizes these tables from different
data sources, incorporating business logic. Throughout the process, we utilize existing enterprise text
documents, data pipeline code (using embedding-based RAG) and human feedback. Online, when a
natural language query is made, the framework provides APIs for query-centric LLM agents to follow
an "overview to zoom-in" approach: (1) first use RAG APIs to find relevant data sources and tables,
(2) then call functions to retrieve table and column details (e.g., descriptions, value representations,
data types, etc.) to finalize query construction. We present early experimental results on the BIRD
benchmark, which show that such a data-centric framework, combined with a naive agent for online
querying, outperforms human-provided ground truth answers by up to 33.89%.
2 Data-Centric framework
In this section, we describe the details of the Data-Centric LLM Framework. We start with the offline
data preparation, then describe the APIs it exposes to online Query-Centric frameworks.
• Data cleaning: We begin by identifying and cleaning the data. Since end-to-end data cleaning is
too complex for current LLMs, we break it down into specific tasks such as handling duplicates,
disguised missing values, data types, and value representations across tables and columns [11, 29, 12].
Our previous experiments showed that this decomposition outperforms existing data cleaning tools
(a sketch of this step follows the list).
• Integration: We construct a relationship graph by identifying joinable and unionable paths [26].
For union, we focus on cases where there is a 1-1 correspondence between columns for table
partitions. We embed the schema and sample rows, then use LLMs to determine the 1-1
correspondence for the closest tables. Note that columns may correspond only partially or may
require transformations, which we leave for future work. For joins, we break down the task: First, we
identify primary and foreign keys for all tables. Then, for each foreign key, we use LLMs to
find the corresponding primary keys, considering factors such as table features, column name
descriptions, and column values. Additionally, we treat time and spatial columns as special cases.
For example, using time as a join key can make the graph too dense to navigate since time is a
common attribute.
• Modeling: The relationship graph can be too large to explore fully. To address this, we follow
the Graph RAG preparation process [4] by clustering nodes into communities using the Leiden
algorithm. We then generate natural language summaries for each community, incorporating
relevant business logic from existing documents. For example, in a relationship graph of Salesforce
databases [6] with about 500 tables, the summarized communities include "User Management
and Authentication," "Customer Relationship Management," and "Financial and Transactional."
(A sketch of this graph-and-community step also follows the list.)
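Below is a minimal sketch of the decomposed cleaning step referenced above, assuming a generic llm(prompt) completion function (a placeholder, not a specific vendor API); each prompt addresses exactly one cleaning task.

import json
import pandas as pd

def clean_column(df: pd.DataFrame, col: str, llm) -> pd.DataFrame:
    sample = df[col].dropna().astype(str).unique()[:20].tolist()
    # Task 1: disguised missing values -- ask only for NULL-like tokens.
    nulls = json.loads(llm(
        f"Which of these values denote a missing value? {sample} "
        "Answer with a JSON list of strings."))
    df = df.assign(**{col: df[col].replace(dict.fromkeys(nulls, pd.NA))})
    # Task 2: inconsistent representations -- ask for a variant-to-canonical map.
    mapping = json.loads(llm(
        f"Map casing/typo variants among {sample} to canonical forms. "
        "Answer with a JSON object."))
    return df.assign(**{col: df[col].replace(mapping)})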
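And a sketch of the integration and modeling steps, assuming python-igraph for Leiden clustering; find_fk_pk_pairs is a hypothetical stand-in for the LLM-based key matching described above.

import igraph as ig

def build_communities(tables, find_fk_pk_pairs):
    # Edges are (foreign-key table, primary-key table) pairs proposed by the LLM.
    edges = find_fk_pk_pairs(tables)
    names = sorted({t for pair in edges for t in pair})
    index = {name: i for i, name in enumerate(names)}
    graph = ig.Graph([(index[a], index[b]) for a, b in edges])
    graph.vs["name"] = names
    # Cluster the relationship graph into communities with the Leiden algorithm.
    partition = graph.community_leiden(objective_function="modularity")
    return [[names[v] for v in community] for community in partition]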
Online, we offer APIs for a Query-Centric framework that follows the "overview to zoom-in" approach.
Starting with a natural language query, we first identify the most relevant communities and tables
(through RAG), then zoom in on table specifics to complete the SQL construction.
(a) Duplication causes a direct average aggregation query without deduplication to be wrong.

user_reviews:
    Review        App   Sentiment
    I like diet   Diet  0.9
    I like diet   Diet  0.9

-- Wrong (includes duplicates)
SELECT AVG(Sentiment) FROM user_reviews
-- Correct (deduplicates first)
SELECT AVG(Sentiment) FROM (SELECT DISTINCT * FROM user_reviews)

(b) Disguised missing values cause the IS NULL filter without considering other variants to be wrong.

user_reviews:
    Review  App   Sentiment
    NULL    Diet  0.7
    nan     Diet  0.8

-- Wrong (misses 'nan' values)
SELECT AVG(Sentiment) FROM user_reviews WHERE Review IS NOT NULL
-- Correct (handles both NULL and 'nan')
SELECT AVG(Sentiment) FROM user_reviews
WHERE Review IS NOT NULL AND Review != 'nan'

(c) Incorrect data types cause a direct average aggregation query without proper type casting to be wrong.

Scoring:
    playerID  PPA (TEXT)
    1001      "15"
    1002      "18"

-- Wrong (doesn't cast TEXT to INT)
SELECT AVG(PPA) FROM Scoring
-- Correct (casts TEXT to INT)
SELECT AVG(CAST(PPA AS INT)) FROM Scoring

(d) Inconsistent data representation causes a selection query that doesn't account for variations to be wrong.

establishment:
    dba_name  city
    Mcdonald  Chicago
    KFC       CHICAGO

-- Wrong (doesn't account for variations)
SELECT dba_name FROM establishment WHERE city = 'Chicago'
-- Correct (accounts for variations)
SELECT dba_name FROM establishment WHERE city IN ('Chicago', 'CHICAGO')

Figure 1: Data quality issues in the BIRD benchmark leading to incorrect ground truth answers.
[Figure 2 diagram: Offline (Data-Centric), enterprise documents, data pipeline code, and human verification feed three stages: Cleaning (disguised missing values, inconsistency, column types, typos, duplication), Integration (primary/foreign keys, unions, referential integrity), and Modeling (community summarization, process documentation), which populate the Vector RAG and Graph RAG indexes. Online (Query-Centric), an LLM agent answers a natural language query via Vector RAG, Graph RAG, and get-table-details calls.]

Figure 2: Data-Centric Text-to-SQL framework: Offline, prepares data through cleaning, integration, and modeling. Online, exposes APIs for a Query-Centric framework to navigate data and write SQLs.
• Vector RAG: For each community (across data sources), we embed its natural language summary,
allowing the agent to find the closest match to the query within the embedding space (see the sketch after this list).
• Graph RAG: Once relevant communities are detected, we provide Graph RAG APIs that (1)
output all tables and their descriptions, helping the LLM to decide which tables are relevant for
schema linking [17]; (2) for any table, return all K-hop neighbors within the relationship graph
(including from other communities); and (3) given a set of tables, find the minimum spanning tree
to identify potential join/union paths between source tables. For each join path, we also provide
APIs to check referential integrity violations and join multiplicity to avoid fanout issues.
• Table Details: We provide details of: (1) table and column descriptions, and sample data [13];
(2) column representations: for numerical data, we provide the 0th, 25th, 50th, 75th, and 100th
percentiles; for categorical data, we list the categories; for free text, we return regex patterns if
available, or sample values otherwise; and (3) data quality issues such as missing values. (A profiling
sketch also follows.)
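To make these APIs concrete, here is a minimal sketch assuming networkx for the relationship graph and a generic embed(text) function returning unit-normalized vectors (both placeholders); networkx's approximate steiner_tree is one way to compute a minimum-weight tree connecting a given set of tables.

import numpy as np
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

def closest_community(query: str, summaries: dict, embed) -> str:
    # Vector RAG: match the query embedding to the nearest community summary
    # (assumes embed() returns unit-normalized vectors).
    q = embed(query)
    return max(summaries, key=lambda c: float(np.dot(q, embed(summaries[c]))))

def k_hop_tables(graph: nx.Graph, table: str, k: int) -> list:
    # Graph RAG API (2): all tables within K hops in the relationship graph.
    return list(nx.ego_graph(graph, table, radius=k))

def join_paths(graph: nx.Graph, tables: list) -> list:
    # Graph RAG API (3): a minimum-weight tree connecting the given tables;
    # its edges are candidate join/union paths.
    return list(steiner_tree(graph, tables).edges)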
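And a sketch of the column-representation details using pandas; the categorical cutoff and sample size are our own assumptions.

import pandas as pd

def column_details(df: pd.DataFrame, col: str):
    s = df[col].dropna()
    if pd.api.types.is_numeric_dtype(s):
        # Numerical: the 0th/25th/50th/75th/100th percentiles.
        return s.quantile([0.0, 0.25, 0.5, 0.75, 1.0]).to_dict()
    if s.nunique() <= 20:  # assumed cutoff for treating a column as categorical
        return sorted(s.astype(str).unique().tolist())
    # Free text: fall back to sample values (regex patterns omitted here).
    return s.sample(min(5, len(s)), random_state=0).tolist()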
3 Experiments
3.1 Setup
To evaluate our data-centric framework, we use the BIRD dataset [19], which contains real-world
data quality issues. A limitation of BIRD is that its data sources are small, with fewer than 20 tables. Therefore,
our experiment focuses on how data cleaning affects accuracy. For the query agent, we use a naive
implementation that first identifies the community (Vector RAG), then finds the related tables and
their join paths (Graph RAG), explores all relevant table details, and finally generates the SQL (a sketch follows).
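A minimal sketch of this naive agent, reusing the closest_community and join_paths helpers sketched in Section 2; llm, embed, and details_for are hypothetical placeholders.

import json

def answer(query, summaries, graph, community_tables, embed, llm, details_for):
    # Step 1 (Vector RAG): pick the most relevant community for the query.
    community = closest_community(query, summaries, embed)
    # Step 2 (Graph RAG): let the LLM pick tables, then find their join paths.
    tables = json.loads(llm(
        f"Given these tables: {community_tables[community]}, which are needed "
        f"to answer '{query}'? Return a JSON list of table names."))
    paths = join_paths(graph, tables) if len(tables) > 1 else []
    # Step 3: retrieve per-table details, then generate the final SQL.
    details = {t: details_for(t) for t in tables}
    return llm(f"Write SQLite SQL for '{query}' using tables {details} "
               f"and join paths {paths}.")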
For the evaluation, we compare the results to ground-truth SQL queries (run on cleaned data). If they
differ, we manually review whether the queries are semantically equivalent, allowing for three cases:
handling of ties, the order of columns in the SELECT clause, and the reasonable inclusion of extra columns [23, 15].
To demonstrate our framework’s effectiveness on large enterprise datasets, we present modeling results
for Fivetran data sources at https://cocoon-data-transformation.github.io/page/model.
The largest data source (Salesforce) contains ∼ 500 tables. Since no text-to-SQL benchmark exists
for these data, we leave the accuracy assessment for future work.
3.2 Results
Table 1: SQL correctness. By cleaning data, the Data-Centric framework outperforms the ground truth.
System Hockey Food Inspection 2 App Store Human Resources
Ground Truth 77.30% 88.49% 63.50% 62.72%
Data-Centric 82.13% 89.93% 76.19% 96.61%
Table 2: Distribution of Data Quality Issues causing Incorrect Ground Truth Answers
Dataset    DMV (Disguised Missing Values)    Inconsistency    Column Type    Duplication
Hockey 2.90% 5.31% 14.49% –
Food Inspection 2 – 9.35% 2.16% –
App Store 4.76% 3.17% 3.17% 25.40%
Human Resources 1.69% – 35.59% –
Across the 4 databases we analyzed, the accuracy of the provided ground truth queries and those
generated by the Data-Centric Framework are summarized in Table 1. The types of data quality
issues impacting the accuracy of the ground truth are shown in Table 2. Key findings include:
• 11-37% of the provided ground truth answers are incorrect due to data quality issues. Data
quality problems are widespread in BIRD, greatly affecting SQL accuracy. The most common
issue across all four databases is improper column types. Additionally, 3 of the 4 databases
have disguised missing values and inconsistent data representation (including typos). In the App
Store database, duplication affects 25.4% of the ground truth queries. Finally, we find that all 4
databases contain missing values in columns, while Hockey and Food Inspection 2 have referential
integrity violations. These may include incomplete records that the data cleaning process cannot
resolve, and their impact on query results remains unknown.
• The Data-Centric Framework’s queries outperform the ground truth queries by up to
33.89%. Our findings show that preparing the data (cleaning, integration and modeling), even
when done once and independently from the queries, can significantly boost the performance
of online SQL queries, surpassing the provided ground truth. The level of improvement varies
across different databases, with those having more severe data quality issues showing the greatest
gains. It’s important to note that this experiment used a basic Query-Centric Framework without
advanced components for individual SQL optimization or self-debugging, which previous studies
have demonstrated to improve performance significantly [2, 24, 27, 15]. We anticipate even better
results once we incorporate these frameworks, which will be explored in future work.
4 Conclusion
This work proposes a Data-Centric Text-to-SQL framework that complements existing Query-Centric
approaches by addressing data quality and complexity issues in large and messy datasets. We find
that preprocessing data, building relationship graphs, and incorporating business logic significantly
improves query accuracy, outperforming human-provided answers on the BIRD benchmark by up to
33.89%. Future work will focus on integrating this data-centric approach with existing query-centric
frameworks and evaluating the combined system on larger enterprise-level Text-to-SQL benchmarks.
Acknowledgements
This research received funding from multiple sources, including National Science Foundation grants
(NSF #1845638, #1740305, #2008295, #2106197, #2103794) as well as corporate support from
Amazon, Google, Adobe, and CAIT. The views and conclusions presented here are those of the
authors and should not be interpreted as representing the official positions of the funding organizations.
References
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4
technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language
models to self-debug. arXiv preprint arXiv:2304.05128, 2023.
[3] Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Jinshu Lin, Dongfang Lou,
et al. C3: Zero-shot text-to-sql with chatgpt. arXiv preprint arXiv:2307.07306, 2023.
[4] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven
Truitt, and Jonathan Larson. From local to global: A graph rag approach to query-focused
summarization. arXiv preprint arXiv:2404.16130, 2024.
[5] Lisa Ehrlinger and Wolfram Wöß. A survey of data quality measurement and monitoring tools.
Frontiers in big data, 5:850611, 2022.
[6] Fivetran. Salesforce. https://fivetran.com/docs/connectors/applications/
salesforce, n.d. Fivetran Documentation.
[7] Helena Forslund. Erp systems’ capabilities for supply chain performance management. Indus-
trial Management & Data Systems, 110(3):351–367, 2010.
[8] Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou.
Text-to-sql empowered by large language models: A benchmark evaluation. arXiv preprint
arXiv:2308.15363, 2023.
[9] Corinna Giebler, Christoph Gröger, Eva Hoos, Holger Schwarz, and Bernhard Mitschang.
Leveraging the data lake: current state and challenges. In Big Data Analytics and Knowledge
Discovery: 21st International Conference, DaWaK 2019, Linz, Austria, August 26–29, 2019,
Proceedings 21, pages 179–188. Springer, 2019.
[10] Gunther Hagleitner. Gpt-4’s sql mastery. https://medium.com/querymind/
gpt-4s-sql-mastery-2cd1f3dea543, April 2023.
[11] Zezhou Huang and Eugene Wu. Cocoon: Semantic table profiling using large language models.
In Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics, pages 1–7, 2024.
[12] Zezhou Huang and Eugene Wu. Relationalizing tables with large language models: The promise
and challenges. In 2024 IEEE 40th International Conference on Data Engineering Workshops
(ICDEW). IEEE, 2024.
[13] Zezhou Huang, Pavan Kalyan Damalapati, and Eugene Wu. Data ambiguity strikes back:
How documentation improves gpt’s text-to-sql. In NeurIPS 2023 Second Table Representation
Learning Workshop, 2023.
[14] Chia-Hsuan Lee, Oleksandr Polozov, and Matthew Richardson. Kaggledbqa: Realistic evalua-
tion of text-to-sql parsers. arXiv preprint arXiv:2106.11455, 2021.
[15] Dongjun Lee, Choongwon Park, Jaehyuk Kim, and Heesoo Park. Mcs-sql: Leveraging
multiple prompts and multiple-choice selection for text-to-sql generation. arXiv preprint
arXiv:2405.07467, 2024.
[16] Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, and Nan Tang. The dawn of natural
language to sql: Are we fully ready? arXiv preprint arXiv:2406.01265, 2024.
[17] Haoyang Li, Jing Zhang, Cuiping Li, and Hong Chen. Resdsql: Decoupling schema linking and
skeleton parsing for text-to-sql. In Proceedings of the AAAI Conference on Artificial Intelligence,
volume 37, pages 13067–13075, 2023.
[18] Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan
Pan, Cuiping Li, and Hong Chen. Codes: Towards building open-source language models for
text-to-sql. Proceedings of the ACM on Management of Data, 2(3):1–28, 2024.
[19] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin,
Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench for
large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems,
36, 2024.
[20] Karime Maamari, Fadhil Abubaker, Daniel Jaroslawicz, and Amine Mhedhbi. The death of
schema linking? text-to-sql in the age of well-reasoned language models. arXiv preprint
arXiv:2408.07702, 2024.
[21] Hassan Mehmood, Ekaterina Gilman, Marta Cortes, Panos Kostakos, Andrew Byrne, Katerina
Valta, Stavros Tekes, and Jukka Riekki. Implementing big data lake for heterogeneous data
sources. In 2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW),
pages 37–44. IEEE, 2019.
[22] Fatemeh Nargesian, Erkang Zhu, Renée J Miller, Ken Q Pu, and Patricia C Arocena. Data
lake management: challenges and opportunities. Proceedings of the VLDB Endowment, 12(12):
1986–1989, 2019.
[23] Mohammadreza Pourreza and Davood Rafiei. Evaluating cross-domain text-to-sql models and
benchmarks. arXiv preprint arXiv:2310.18538, 2023.
[24] Mohammadreza Pourreza and Davood Rafiei. Dts-sql: Decomposed text-to-sql with small large
language models. arXiv preprint arXiv:2402.01117, 2024.
[25] Ge Qu, Jinyang Li, Bowen Li, Bowen Qin, Nan Huo, Chenhao Ma, and Reynold Cheng. Before
generation, align it! a novel and effective strategy for mitigating hallucinations in text-to-sql
generation. arXiv preprint arXiv:2405.15307, 2024.
[26] Dezhao Song, Frank Schilder, Shai Hertz, Giuseppe Saltini, Charese Smiley, Phani Nivarthi,
Oren Hazai, Dudi Landau, Mike Zaharkin, Tom Zielund, et al. Building and querying an
enterprise knowledge graph. IEEE Transactions on Services Computing, 12(3):356–369, 2017.
[27] Chang-You Tai, Ziru Chen, Tianshu Zhang, Xiang Deng, and Huan Sun. Exploring chain-of-
thought style prompting for text-to-sql. arXiv preprint arXiv:2305.14215, 2023.
[28] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li,
Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex
and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887, 2018.
[29] Shuo Zhang, Zezhou Huang, and Eugene Wu. Data cleaning using large language models.
arXiv preprint arXiv:2410.15547, 2024.