
LLM Prompting for Text2SQL via Gradual SQL Refinement

Overview

[figure: GSR framework overview]

1. Clause Decomposition

We decompose the natural language question into logical clauses based on semantic units and include this decomposition in the prompt, letting the LLM generate the Pre-SQL. At this stage, the model draws on the Question and hint together with the DB schema information (without value details).
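The decomposition step can be sketched as follows. This is a simplified illustration, not the repository's actual code: `decompose_question` uses a naive rule-based split where the pipeline would use an LLM, and the prompt wording is hypothetical.

```python
import re

def decompose_question(question: str) -> list[str]:
    """Naively split a question into clause-like semantic units.
    A rule-based stand-in for the LLM-driven decomposition."""
    parts = re.split(r",\s*|\band\b|\bwhere\b|\bthat\b", question)
    return [p.strip() for p in parts if p.strip()]

def build_pre_sql_prompt(question: str, hint: str, schema: str) -> str:
    """Assemble a Pre-SQL prompt from the schema (no value details),
    the question, the hint, and the logical clauses."""
    clauses = decompose_question(question)
    clause_text = "\n".join(f"- {c}" for c in clauses)
    return (
        f"Schema (no values):\n{schema}\n\n"
        f"Question: {question}\nHint: {hint}\n\n"
        f"Logical clauses:\n{clause_text}\n\n"
        "Write a SQL query (Pre-SQL) answering the question."
    )

prompt = build_pre_sql_prompt(
    "List the names of schools in Alameda County that are charter schools",
    "charter schools refers to Charter = 1",
    "schools(CDSCode, School, County, Charter)",
)
```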

2. SQL-driven Schema Linking

We instruct the model to extract the tables and columns involved in the Pre-SQL and then construct the following information:

  1. With the Pre-SQL's tables and columns masked out of the DB schema, the model is tasked to explore other potentially relevant tables and columns based on the Question.

  2. For the tables and columns involved in the Pre-SQL, a value-condition checker identifies the columns that participate in value-condition judgments:

    2.1 For columns involved in value-condition judgments, similarity search is used to provide value examples that closely match keywords in the natural language question.

    2.2 For columns not involved in value-condition judgments, SQL queries are constructed directly to fetch value examples.

  3. For the tables and columns involved in the Pre-SQL, errors in table names, column names, or their relationships are captured using SQL.

In summary, three pieces of information will be obtained:

  1. Simplified DB schema information with the Pre-SQL tables and columns masked.
  2. Value example information based on the Pre-SQL.
  3. Potential error information in the Pre-SQL.

Then, the model is allowed to correct the Pre-SQL based on the information obtained above.
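The value-example step (2.1/2.2) can be sketched as follows. This is a simplified stand-in using SQLite and `difflib`, not the repository's actual tool; the function name and table are illustrative.

```python
import difflib
import sqlite3

def value_examples(conn, table, column, keyword=None, limit=3):
    """Fetch example values for a column. With a question keyword
    (the value-condition case, 2.1), rank stored values by string
    similarity; otherwise (2.2) fetch examples with plain SQL."""
    rows = conn.execute(
        f'SELECT DISTINCT "{column}" FROM "{table}" LIMIT 500'
    ).fetchall()
    values = [str(r[0]) for r in rows if r[0] is not None]
    if keyword is None:
        return values[:limit]                     # 2.2: direct SQL fetch
    return difflib.get_close_matches(             # 2.1: similarity search
        keyword, values, n=limit, cutoff=0.0
    )

# Demo on a toy in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools(County TEXT)")
conn.executemany("INSERT INTO schools VALUES (?)",
                 [("Alameda",), ("Alpine",), ("Amador",)])
```

With a misspelled question keyword such as "Alamedda", the similarity branch surfaces the stored value "Alameda" first, which is exactly the value example the model needs to write a correct condition.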

3. SQL Execution Refinement

Execute the Second-SQL on the database, then integrate the execution results with the prompt as input for the model. Instruct the model to analyze whether the execution results of the Second-SQL are reasonable and refine the Second-SQL accordingly to produce the Final-SQL.
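A minimal sketch of this refinement loop, assuming a SQLite database; the function names and prompt wording are illustrative, not the repository's actual implementation.

```python
import sqlite3

def execution_feedback(conn, sql, max_rows=5):
    """Run the Second-SQL and summarize the outcome for the model."""
    try:
        rows = conn.execute(sql).fetchmany(max_rows)
        if not rows:
            return "Execution returned no rows."
        return f"First rows: {rows}"
    except sqlite3.Error as e:
        return f"Execution error: {e}"

def refinement_prompt(question, second_sql, feedback):
    """Combine the Second-SQL and its execution result so the model
    can judge the result and produce the Final-SQL."""
    return (
        f"Question: {question}\n"
        f"Second-SQL: {second_sql}\n"
        f"Execution result: {feedback}\n"
        "Judge whether this result is reasonable and, if not, "
        "output a refined Final-SQL."
    )

# Demo: an empty result is surfaced to the model as feedback.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frpm(school TEXT)")
fb = execution_feedback(conn, "SELECT school FROM frpm")
```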

Project directory structure

GSR/
├── README.md
├── requirements.txt
│
├── data/
│   └── databases/
│       └── dev_20240627/
│
├── data_process/
│   └── sql_data_process_BIRD.py
│
├── run/
│   └── GSR.py
│
└── tools/

Environment

conda create -n GSR python=3.10
conda activate GSR
pip install -r requirements.txt

RUN

1. Data Preprocessing

Please place the test set files in the directory data/databases/. Then set the path parameters.

In data_process_config.py, set SQL_DATA_INFO and DATABASE_PATH. The fields to set in SQL_DATA_INFO are 'data_source', 'file', 'tables_file', and 'database_name'.
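An example of what these settings might look like for the BIRD dev set; the concrete paths and file names below are assumptions based on typical BIRD artifacts and may differ in your download.

```python
# data_process_config.py (example values, adjust to your setup)
DATABASE_PATH = "../data/databases/dev_20240627/dev_databases"

SQL_DATA_INFO = [
    {
        "data_source": "BIRD",            # benchmark name
        "file": "dev.json",               # questions + hints
        "tables_file": "dev_tables.json", # schema descriptions
        "database_name": "dev_databases", # folder of .sqlite files
    }
]
```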

cd data_process/
python sql_data_process_BIRD.py

Four files are generated after execution.

  1. all_mappings.json
  2. raw_format_data.json
  3. Pre_input.json
  4. Second_input.json

2. Generate SQL

Please set the generation parameters in GSR.py: mainly database_file_path, start_idx, and end_idx.

cd run/
python GSR.py

After execution, the generated SQL file is written to the output directory.

Results

Comparison of execution accuracy

[figure: execution accuracy comparison]

Ablation Study

[figure: ablation study results]
