
Build Text-to-SQL using LLM

Text-to-SQL is a powerful application of large language models (LLMs) such as GPT that enables users to interact with a database using natural language. In this article, we will walk through how to build a Text-to-SQL application using an LLM.

1. What Are OpenAI and LLMs?

OpenAI is an AI research and deployment company focused on ensuring that artificial general intelligence (AGI) benefits all of humanity. It is known for developing advanced language models like GPT-3, GPT-4, and the newer GPT-4o, which power applications such as ChatGPT. These models are capable of understanding and generating human-like language, performing tasks like summarization, question answering, translation, and even code generation.

LLM stands for Large Language Model, which refers to a type of AI model trained on massive amounts of textual data to understand context, syntax, semantics, and intent in human language. LLMs leverage deep learning, particularly transformer architectures, to process and generate language across a wide range of domains and tasks. When integrated with external systems like databases or APIs, LLMs can act as powerful natural language interfaces that convert user inputs into structured queries or commands.

2. PostgreSQL Setup on Docker and Sample Data

Begin by setting up a PostgreSQL container using Docker. This eliminates the need for a manual installation and provides an isolated, reproducible environment for development and testing.

docker run --name text2sql-postgres \
  -e POSTGRES_USER=admin \
  -e POSTGRES_PASSWORD=admin123 \
  -e POSTGRES_DB=employees_db \
  -p 5432:5432 \
  -d postgres

This command will create a new container named text2sql-postgres, set up a database user admin with the password admin123, initialize a database named employees_db, expose port 5432 to allow connections from your host machine, and run the official postgres image in detached mode. After the container starts, you can interact with the database using the `psql` CLI or GUI-based tools like DBeaver or pgAdmin. To connect with `psql`, use:

psql -h localhost -p 5432 -U admin -d employees_db
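
If `psql` is not installed on your host machine, you can instead open a session inside the running container, using the container name and credentials defined above:

docker exec -it text2sql-postgres psql -U admin -d employees_db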

Once connected, execute the SQL statements below to create a sample table and seed it with mock data:

CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    department VARCHAR(100),
    salary INTEGER
);

INSERT INTO employees (name, department, salary) VALUES
('Alice', 'Engineering', 95000),
('Bob', 'HR', 60000),
('Charlie', 'Engineering', 105000),
('Diana', 'Sales', 70000);

This will create a table named employees with basic fields for ID, name, department, and salary, and insert sample records to simulate a simple company structure. This sample data serves as a foundational dataset for testing Text-to-SQL conversions, building query interfaces, or training AI-based SQL generation models.
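
To verify the seed data before wiring up the Python code, run a quick query from the same psql session:

SELECT * FROM employees;

This should return the four rows inserted above, with ids 1 through 4 assigned automatically by the SERIAL column.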

3. Sample Python Example

We’ll now use Python to connect to both the Azure OpenAI API and the PostgreSQL database, using SQLAlchemy as the database access layer. This integration lets us dynamically generate SQL queries from natural language inputs and execute them against the database.

3.1 Dependencies

The following Python packages are required to build this integration:

pip install openai sqlalchemy psycopg2-binary python-dotenv
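
The script in Section 3.2 reads its Azure OpenAI credentials from environment variables via python-dotenv. A .env file placed next to main.py could look like the following (the values below are placeholders; the endpoint and deployment name must match your own Azure OpenAI resource):

OPENAI_API_KEY=<your-azure-openai-api-key>
OPENAI_API_AZURE_API_VERSION=2024-08-01-preview
OPENAI_API_AZURE_API_ENDPOINT=https://<your-resource-name>.openai.azure.com/
OPENAI_API_AZURE_DEPLOYMENT=gpt-4o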

3.2 Python Code

Create a main.py file and add the following code to implement the complete flow that connects your Python application to Azure OpenAI and PostgreSQL using SQLAlchemy. The script allows users to input natural language questions, convert them to SQL using OpenAI, and execute those queries on the database.

# Author: Yatin B.

import logging
import re
from os import getenv

import sqlalchemy
from dotenv import load_dotenv
from openai import AzureOpenAI
from sqlalchemy import text
from sqlalchemy.exc import SQLAlchemyError

# ========== Setup Logging ==========
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# ========== Load Environment ==========
load_dotenv()

# ========== Constants ==========
TABLE_NAME = "employees"
# Connection string for the PostgreSQL container created in section 2
DB_URL = 'postgresql://admin:admin123@localhost:5432/employees_db'


# ========== Setup Clients ==========

def get_openai_client() -> tuple[AzureOpenAI, str | None]:
    """Initialize and return an Azure OpenAI client."""
    api_key = getenv('OPENAI_API_KEY')
    api_version = getenv('OPENAI_API_AZURE_API_VERSION')
    api_endpoint = getenv('OPENAI_API_AZURE_API_ENDPOINT')
    deployment = getenv('OPENAI_API_AZURE_DEPLOYMENT')

    if not all([api_key, api_version, api_endpoint, deployment]):
        logger.error("Missing OpenAI Azure environment configuration.")
        raise EnvironmentError("Missing OpenAI Azure credentials.")

    client = AzureOpenAI(
        api_key=api_key,
        api_version=api_version,
        azure_endpoint=api_endpoint
    )
    return client, deployment


def get_db_engine() -> sqlalchemy.Engine:
    """Initialize and return a SQLAlchemy engine."""
    return sqlalchemy.create_engine(DB_URL)


# ========== Database Utilities ==========

def get_table_schema(engine: sqlalchemy.Engine, table_name: str) -> str:
    """Fetch column names and data types from the specified table."""
    logger.info(f"Fetching schema for table: {table_name}")
    query = text("""
        SELECT column_name, data_type 
        FROM information_schema.columns 
        WHERE table_name = :table
    """)
    try:
        with engine.connect() as connection:
            result = connection.execute(query, {"table": table_name})
            schema = "\n".join(f"{row.column_name}: {row.data_type}" for row in result)
            if not schema:
                raise ValueError(f"No schema found for table '{table_name}'")
            return schema
    except SQLAlchemyError as e:
        logger.exception("Error fetching schema")
        raise RuntimeError(f"Schema retrieval failed: {e}")


def run_sql_query(engine: sqlalchemy.Engine, sql: str) -> list[dict]:
    """Execute a SQL query and return results."""
    logger.info("Executing SQL query...")
    try:
        with engine.connect() as conn:
            result = conn.execute(text(sql))
            return [dict(row._mapping) for row in result]
    except SQLAlchemyError as e:
        logger.exception("SQL execution failed")
        raise RuntimeError(f"Database query failed: {e}")


# ========== OpenAI Utilities ==========

def create_prompt(question: str, schema: str, table_name: str) -> str:
    """Generate prompt to send to OpenAI."""
    return f"""
You are an assistant that converts natural language into SQL queries.
Table: {table_name}
Schema:
{schema}

Question: {question}
SQL:
"""


def extract_sql_from_response(response_content: str) -> str:
    """Extract SQL query from OpenAI response by removing markdown formatting."""
    return re.sub(r"```(?:sql)?\s*|\s*```", "", response_content).strip()


def text_to_sql(client: AzureOpenAI, deployment: str, question: str, schema: str, table_name: str) -> str:
    """Generate SQL from a natural language question using OpenAI."""
    logger.info(f"Generating SQL using OpenAI for question: {question}")
    prompt = create_prompt(question, schema, table_name)

    try:
        response = client.chat.completions.create(
            model=deployment,
            messages=[
                {"role": "system", "content": "You are a helpful assistant that converts natural language questions into SQL queries."},
                {"role": "user", "content": prompt}
            ]
        )
        raw_sql = response.choices[0].message.content
        sql_query = extract_sql_from_response(raw_sql)
        logger.info("Generated SQL: %s", sql_query)
        return sql_query
    except Exception as e:
        logger.exception("OpenAI API call failed")
        raise RuntimeError(f"Failed to generate SQL: {e}")


# ========== Main Entry Point ==========

def main():
    try:
        question = input("Enter your question: ").strip()
        if not question:
            raise ValueError("Empty question provided.")

        # Setup
        engine = get_db_engine()
        client, deployment = get_openai_client()

        # Generate SQL
        schema = get_table_schema(engine, TABLE_NAME)
        sql_query = text_to_sql(client, deployment, question, schema, TABLE_NAME)

        # Execute and display results
        results = run_sql_query(engine, sql_query)
        logger.info(f"\nGenerated SQL Query:\n{sql_query}\n")
        logger.info("Query Results:")
        for row in results:
            print(row)

    except Exception as e:
        logger.error(f"[ERROR] {e}")


if __name__ == "__main__":
    main()

3.2.1 Code Explanation

This Python script connects Azure OpenAI with a PostgreSQL database using SQLAlchemy to convert natural language questions into executable SQL queries. It starts by importing the required libraries for logging, environment management, database access, and OpenAI integration. Logging is configured to output timestamped messages, and environment variables are loaded using python-dotenv. Constants like TABLE_NAME and DB_URL point to the employees table and the PostgreSQL connection string respectively. The get_openai_client() function initializes an AzureOpenAI client using credentials from environment variables, while get_db_engine() sets up a SQLAlchemy engine to interact with the database.

The get_table_schema() function queries the database to retrieve column names and types, which are used to construct a schema prompt. The run_sql_query() function takes a SQL query, executes it, and returns the result as a list of dictionaries. The create_prompt() function builds a prompt that includes the table schema and the user's question, which is sent to OpenAI via text_to_sql(), where a chat completion request is made to generate the SQL query. The extract_sql_from_response() function cleans the SQL output by removing markdown formatting.

Finally, the main() function orchestrates the entire workflow: it takes user input, prepares the schema, gets the generated SQL from OpenAI, runs it against the database, and prints the result. The script is executed when run directly and is designed to handle common errors with informative logging throughout.
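
To make the flow concrete, the prompt built by create_prompt() for the sample question used in the next section would look roughly like the following (the column types are those reported by PostgreSQL's information_schema for the table created earlier):

You are an assistant that converts natural language into SQL queries.
Table: employees
Schema:
id: integer
name: character varying
department: character varying
salary: integer

Question: List all employees from Engineering
SQL: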

3.3 Sample Output

The following is an example of how the application behaves when a user inputs a natural language question. In this case, the question is List all employees from Engineering. The application fetches the schema of the target table, constructs a prompt, sends the prompt to Azure OpenAI, receives a SQL query as a response, executes the query against the PostgreSQL database, and logs the results. Each step is timestamped using the logging module for easy debugging and monitoring. The OpenAI response is successfully parsed, resulting in a valid SQL query that retrieves all employee records from the Engineering department.

Enter your question: List all employees from Engineering
2025-07-04 20:57:01,226 - INFO - Fetching schema for table: employees
2025-07-04 20:57:01,260 - INFO - Generating SQL using OpenAI for question: List all employees from Engineering
2025-07-04 20:57:02,739 - INFO - HTTP Request: POST https://jcg.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-08-01-preview "HTTP/1.1 200 OK"
2025-07-04 20:57:02,746 - INFO - Generated SQL: SELECT * FROM employees WHERE department = 'Engineering';
2025-07-04 20:57:02,746 - INFO - Executing SQL query...
2025-07-04 20:57:02,753 - INFO - 
Generated SQL Query:
SELECT * FROM employees WHERE department = 'Engineering';

2025-07-04 20:57:02,753 - INFO - Query Results:
{'id': 1, 'name': 'Alice', 'department': 'Engineering', 'salary': 95000}
{'id': 3, 'name': 'Charlie', 'department': 'Engineering', 'salary': 105000}

4. Conclusion

We’ve built a simple but functional Text-to-SQL application using OpenAI’s LLMs. With just natural language, users can query structured databases. For production scenarios, you can improve security by adding query validation (a minimal sketch follows below), and gain finer control over generation by using function calling or frameworks such as LangChain or LlamaIndex. This approach can be extended to multi-table joins, complex aggregations, and dynamic dashboarding with visualization tools like Streamlit or Dash.
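
As a starting point for the query-validation idea mentioned above, a minimal guard could reject anything other than a single SELECT statement before the generated SQL reaches run_sql_query(). The helper below is an illustrative sketch, not part of the script in Section 3.2:

import re

def is_safe_select(sql: str) -> bool:
    """Conservative check: allow only a single SELECT statement."""
    stripped = sql.strip().rstrip(";")
    # Reject stacked statements and common write/DDL keywords.
    forbidden = re.compile(
        r"\b(insert|update|delete|drop|alter|truncate|grant|create)\b",
        re.IGNORECASE,
    )
    return (
        stripped.lower().startswith("select")
        and ";" not in stripped
        and not forbidden.search(stripped)
    )

# Usage in main(), before calling run_sql_query():
# if not is_safe_select(sql_query):
#     raise ValueError("Generated SQL rejected by the safety check.")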

Yatin Batra

An experienced full-stack engineer well versed in Core Java, Spring/Spring Boot, MVC, Security, AOP, frontend (Angular & React), and cloud technologies (such as AWS, GCP, Jenkins, Docker, K8s).