
LLM-Enhanced Data Management [Vision]

Xuanhe Zhou, Xinyang Zhao, Guoliang Li
Tsinghua University
[email protected] [email protected] [email protected]

arXiv:2402.02643v1 [cs.DB] 4 Feb 2024

ABSTRACT

Machine learning (ML) techniques for optimizing data management problems have been extensively studied and widely deployed in the past five years. However, traditional ML methods have limitations in generalizability (adapting to different scenarios) and inference ability (understanding the context). Fortunately, large language models (LLMs) have shown high generalizability and human-competitive abilities in understanding context, which are promising for data management tasks (e.g., database diagnosis and data analytics). However, existing LLMs have several limitations: hallucination, high cost, and low accuracy for complicated tasks. To address these challenges, we design LLMDB, an LLM-enhanced data management paradigm which has good generalizability and high inference ability while avoiding hallucination, reducing LLM cost, and achieving high accuracy. LLMDB embeds domain-specific knowledge to avoid hallucination by LLM fine-tuning and prompt engineering. LLMDB reduces the high cost of LLMs by vector databases which provide semantic search and caching abilities. LLMDB improves the task accuracy by an LLM agent which provides multiple-round inference and pipeline executions. We showcase three real-world data management scenarios that LLMDB can well support, including query rewrite, database diagnosis, and data analytics. We also summarize the open research challenges of LLMDB.

[Figure 1: LLM for Data Management — (a) a basic LLM; (b) the LLMDB architecture (request pre-processing, LLM parser, executor agent, vector database, data sources, tool/model APIs); (c) a toy example of LLMDB answering "List the names and comments of mountains with elevation above 8,000 meters"; and (d) a comparison of three LLM-enhanced data management paradigms:]

            Task-Solving   Stability   Cost (manpower)   Efficiency
  LLM       Weak           Low         Medium            Medium
  Agent     Medium         Medium      High              Low
  LLMDB     High           High        Low               Medium
1 INTRODUCTION

Machine learning algorithms have been widely adopted to optimize data management problems, e.g., data cleaning [22], data analytics [5], query rewrite [29], and database diagnosis [14]. However, traditional machine learning algorithms cannot address the generalizability and inference problems. For example, existing machine learning algorithms are hard to adapt to different databases, different query workloads, and different hardware environments. Moreover, existing machine learning algorithms cannot understand the context or perform multi-step reasoning, e.g., database diagnosis for problem detection and root-cause analysis. Fortunately, large language models (LLMs) can address these limitations and offer great opportunities for data management [8-10]. For example, LLMs can be used to diagnose database problems and help DBAs find the root causes of slow SQL queries. LLMs can also enable natural language (NL) based data analytics such that users can use NL to analyze their datasets.

However, LLMs have three limitations: hallucination, high cost, and low accuracy for complicated tasks. These shortcomings pose significant challenges, even risks, especially when LLMs are used in critical data management steps (e.g., price trend analysis in financial forecasting). There are some LLM-agent-based methods to overcome these challenges, like chains of thought [20, 25-27] and tool-calling functionalities [19]. These works, while impressive, reveal several limitations. First, they heavily rely on LLMs to support nearly every task (e.g., designing, coding, and testing in LLM-based software development [13, 16]), leading to instability and a high error rate. For instance, GPT-4 might fail to extract text from slides due to misunderstanding the intent, despite having this capability. Second, for complex tasks like tool calling, extensive training data for specific APIs are required to fine-tune the LLM. This approach is vulnerable to API changes and can result in significant cost inefficiencies [18]. Third, LLM agents still lack the capability to fully utilize knowledge from multiple sources, which is vital to mitigate hallucination issues in LLMs (e.g., by serving as evaluation criteria). Thus it calls for a new LLM-enhanced data management paradigm to address these limitations.

Vision of LLM-Enhanced Data Management. Developing a robust LLM-enhanced data management system is essential to harness the full potential of LLMs. There are three main challenges. First, how to effectively utilize data sources (e.g., tabular data and various document formats) to reduce LLM hallucination problems (e.g., through knowledge-augmented answering)? Second, how to reduce the LLM overhead? It is rather expensive to call LLMs for every request, so it is important to accurately interpret the intent behind user requests and capture the domain knowledge in order to reduce the iterations with LLMs. Third, the execution of complex data management tasks may involve multiple operations; how to efficiently manage these operations and pipelines to enhance both execution effectiveness and efficiency?

Idea of LLMDB. To address these challenges, we propose LLMDB, a general framework for designing and implementing LLM-enhanced data management applications. As shown in Figure 1, LLMDB improves the basic LLM by incorporating a series of modular components. LLMDB embeds domain-specific knowledge to avoid hallucination by LLM fine-tuning and prompt engineering. LLMDB reduces the high cost of LLMs by vector databases which provide
semantic search and caching abilities. LLMDB improves the task accuracy by an LLM agent which provides multiple-round inference and pipeline executions. Specifically, LLMDB leverages a combination of general LLMs, domain-specific LLMs, an LLM agent, vector databases, and data sources to handle data management tasks. LLMs provide understanding and inference ability. Users provide domain-specific knowledge and data sources. Domain-specific LLMs fine-tune the general LLMs based on the user-provided data sources and provide the domain-specific knowledge. Vector databases generate embeddings of user-provided data sources, retrieve top-k document segments, take them as prompts, and input them to LLMs. With the domain-knowledge prompts, LLMs can provide better answers. Moreover, for complicated tasks, the LLM agent generates a pipeline with multiple operations to process the task.

Contributions. We make the following contributions.
(1) We design an LLM-enhanced data management system, which has generalizability and high inference ability while avoiding hallucination, reducing LLM cost, and achieving high accuracy.
(2) We propose a domain-knowledge embedding and semantic search method to address the hallucination problem. We adopt vector databases to reduce the LLM overhead. We design a pipeline generation and execution framework to optimize complicated tasks.
(3) We conduct case studies on three typical data management applications using LLMDB, and discuss the system advantages.
(4) We summarize research challenges of LLM-enhanced data management, including LLMDB operation design, pipeline executor optimization, embedding selection, and knowledge augmentation.

2 LLMDB PARADIGM

2.1 LLMDB Architecture

LLMDB has five important components, including general LLMs, domain-specific LLMs, an LLM executor agent, vector databases, and a data source manager. Specifically, general LLMs provide understanding and inference ability. The data source manager provides domain-specific data and knowledge. Domain-specific LLMs fine-tune the general LLMs based on the user-provided data sources, and are used by the request pre-processor and parser to capture the user intent and generate the execution pipeline. Vector databases generate embeddings of user-provided data sources, retrieve top-k document segments for user requests, take them as prompts, and input them to LLMs. With the domain-knowledge prompts, LLMs can provide high-quality results. Moreover, for complex tasks, the LLM agent generates a pipeline with multiple operations to optimize the complex task. LLMDB includes offline preparation and online inference.

Offline Preparation. Given an application requirement, LLMDB first collects and analyzes the information from various data sources. Then LLMDB derives detailed specifications that outline the application objectives, traditional tools (e.g., various database plugins), and the usage scenarios of domain AI models (e.g., the Marcoroni-7B model for sentiment analysis [2]).

Online Inference. For a user request, LLMDB first parses the request to detect the intent and operations. Following this, a pipeline of the operations is generated, which is then executed in a graph-style format to ensure efficient processing. The final step involves an iterative process of result evaluation and pipeline re-generation, ensuring the accuracy and relevance of the final outcome.

2.2 Offline Preparation

Offline preparation initializes LLM-based data management applications before online usage. This involves six main modules.
(1) Data Source Collection: This module gathers a variety of data sources. For instance, in the diagnosis task, the data can come from maintenance documents, historical logs, and operational records.
(2) Tool/Model Collection: Based on the collected data, this module gathers and sets up essential tools and small-scale AI models. For instance, in database diagnosis, this might include the collection of monitoring tools, analytics tools, and image-to-text models (e.g., for program running screenshots) like vit-gpt2 [1].
(3) Keyword-API Mapping: This module creates a mapping table (metadata) that links keywords or phrases to relevant tool or model APIs. For example, the keyword "image classification" might be mapped to an API of a convolutional neural network (CNN) model. This mapping enables the automated selection of appropriate tools or models based on the specific requirements of the input request.
(4) Domain LLM Training: This module fine-tunes small-scale domain models using the updated data sources, which ensures that the models are well-adapted to the application's needs, leading to improved accuracy and performance in the target scenarios.
(5) LLM Alignment: This module adapts LLM workers to changes such as interface language updates. For instance, an LLM worker may need to transition from SQL to a knowledge graph language, emphasizing semantic relationships over structured queries for richer, more complex data interactions.
(6) Vector Data Generation: This module generates an embedding for each data source and inserts the embeddings into vector databases. The vector database provides the vector search ability.

2.3 Online Inference

Given the prepared tool and model APIs, data sources, and metadata, we next explain how to handle various online requests.

Request Pre-processor. This module prepares the input request so as to ease the following parsing and execution procedures.
(1) Keyword Recognition: With the keyword corpus, it emphasizes keywords occurring within the input request, using techniques like synonym matching (e.g., "examination" -> "analysis") and stem matching (e.g., "computing" -> "comput-") [23].
(2) Intent Identification: Apart from basic keyword recognition, it identifies the underlying purpose of a request using semantic analysis. For instance, consider the request "Identify similar property listings and compare their features". A semantic model will not only focus on keywords like "compare" and "features" but also understand the contextual meaning of "similar property listings". This involves identifying the implicit criteria for similarity (e.g., location, price range) and relevant features for comparison (e.g., number of bedrooms). Thus, the request is transformed into a more actionable form, such as "Retrieve property listings based on [criteria for similarity] from ... . Compare their [feature 1], [feature 2], ...".
(3) Dependency Identification: It analyzes the grammatical structure to understand how keywords relate to each other, which is vital to the following semantic-based request segmentation.
(4) Interactive Inquiry: It allows the application to ask follow-up questions so as to obtain more information that may clarify the user's intent and required operations (e.g., more entity attributes).
Figure 2: LLMDB Overview

(5) Knowledge Augmentation: It enriches the request with valuable knowledge (e.g., the meaning of Prometheus in the diagnosis task) or empirical rules (e.g., the criteria of entity matching).
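The keyword-recognition step (1) of the request pre-processor can be sketched as follows. The keyword corpus, synonym table, and suffix-stripping stemmer below are illustrative assumptions for this sketch, not LLMDB's actual implementation:

```python
import re

# Hypothetical keyword corpus and synonym table (assumed for illustration).
SYNONYMS = {"examination": "analysis", "examine": "analysis"}
KEYWORDS = {"analysis", "comput"}  # "comput" is a stem, matching "computing" etc.

def stem(word: str) -> str:
    # Toy stemmer: strip a few common suffixes (real systems would use
    # e.g. Porter stemming).
    return re.sub(r"(ing|ation|ed|s)$", "", word.lower())

def highlight_keywords(request: str) -> list[str]:
    """Return tokens recognized as keywords via synonym and stem matching."""
    found = []
    for token in re.findall(r"[a-zA-Z]+", request):
        norm = SYNONYMS.get(token.lower(), token.lower())
        if norm in KEYWORDS or stem(norm) in KEYWORDS:
            found.append(token)
    return found

print(highlight_keywords("Run an examination and start computing statistics"))
```

Here "examination" is recognized through the synonym table and "computing" through its stem, mirroring the two matching techniques described above.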
Request Parser. This module interprets the user's request into an executable operation pipeline.
(1) Semantic-based Request Segmentation: Different from basic techniques like tokenization, it splits the input request into meaningful segments, for example with the Semantic Role Labeling (SRL) technique [15], which assigns roles to words in the pre-processed request to capture their semantic relationships. For instance, in the request "Find flights from New York to London departing tomorrow", it identifies segments such as "Find flights", "from New York to London", and "departing tomorrow", each conveying a specific operation or parameter for the flight search request.
(2) Operation Mapping: Based on the keyword-API mapping table, we convert the segments into functional operations (using tool or model APIs) and data access operations (using data sources).
(3) Execution Pipeline Generation: These operations form a basic execution pipeline based on the structural logic of the request. For example, for a segmented request R = ["Identify missing value", "in dataset column"], R0 is designated as the root, helping arrange the other operations based on both the order and semantic relations.
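The pipeline-generation step can be sketched as building a small root-anchored dependency graph over the mapped operations. The data structure and the ordering rule (chain the segments under the first one as root) are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Operation:
    name: str                                   # e.g., an API name from the keyword-API mapping
    deps: list = field(default_factory=list)    # upstream operations this one depends on

def build_pipeline(segments: list[str]) -> Operation:
    """Chain segment operations, with the first segment (R0) as the root.

    Returns the final operation; following .deps leads back to the root.
    """
    root = Operation(segments[0])
    prev = root
    for seg in segments[1:]:
        prev = Operation(seg, deps=[prev])
    return prev

tail = build_pipeline(["Identify missing value", "in dataset column"])
print(tail.name, "<-", tail.deps[0].name)
```

A real pipeline would be a DAG with fan-in and fan-out rather than a chain, but the root-anchored arrangement follows the description above.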
Executor Agent. Given the basic pipeline, this module refines the pipeline with underlying information, evaluates the outcomes, and re-generates a more effective pipeline (within a time or cost limit).
(1) Pipeline Refinement: For each operation in the pipeline, it enriches the operation with information like (i) which data source to use (e.g., the same document can be stored in both a vector database and an elastic search engine [11]) and (ii) which operation orders are most execution-efficient (using methods like genetic algorithms).
(2) Evaluation: It assesses the quality of results by executing the current pipeline. For instance, in a data analytics task, we can utilize an LLM to analyze user feedback about the usability and relevance of the analytics results.
(3) Pipeline Re-Generation: If the evaluation result is poor, it generates a new execution pipeline. For example, we can use decision trees and neural networks to identify and add intermediate operations that have the potential to derive more reliable results.

Vector Databases. Vector databases are used to improve the execution effectiveness and efficiency from multiple aspects.
(1) Semantic Augmentation and Search: When a knowledge query is issued by an LLM worker, the vector database (i) integrates context and intent analysis to enrich the query; and (ii) utilizes advanced similarity search algorithms (e.g., graph-network-based embeddings for relational knowledge patterns) to ensure high relevance in search results by understanding the semantic differences of queries.
(2) Caching Mechanism: We can cache hot user requests and their answers, and when a similar request is posted, we can use the vector database to directly answer the request, without involving LLMs.
(3) Vertical Knowledge: It serves as a metadata container, enabling unified access and management of multiple data sources.
(4) Long-term and Real-time Knowledge: It transforms time-series data into vector embeddings, facilitating the storage and retrieval of real-time and long-term data. This way, it enhances the system's ability to perform complex analysis over extended periods.
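The caching mechanism in (2) can be sketched as a similarity lookup over request embeddings. The character-bigram embedding and the 0.8 similarity threshold are toy assumptions; a real deployment would use a learned embedding model and a vector database such as FAISS or Milvus:

```python
import math

def embed(text: str) -> list[float]:
    # Toy embedding: hash character bigrams into a small count vector
    # (assumption; real systems use learned sentence embeddings).
    vec = [0.0] * 64
    low = text.lower()
    for a, b in zip(low, low[1:]):
        vec[(ord(a) * 31 + ord(b)) % 64] += 1.0
    return vec

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Answer near-duplicate requests from the cache instead of calling the LLM."""
    def __init__(self, threshold: float = 0.8):   # assumed threshold
        self.entries = []                          # list of (embedding, answer)
        self.threshold = threshold

    def insert(self, request: str, answer: str):
        self.entries.append((embed(request), answer))

    def lookup(self, request: str):
        e = embed(request)
        best = max(self.entries, key=lambda kv: cosine(e, kv[0]), default=None)
        if best and cosine(e, best[0]) >= self.threshold:
            return best[1]   # cache hit: the LLM call is skipped
        return None          # cache miss: fall through to the LLM

cache = SemanticCache()
cache.insert("List mountains above 8000 meters", "Everest, K2, ...")
print(cache.lookup("List the mountains above 8000 meters"))
```

A near-duplicate phrasing of a cached request is answered directly, which is exactly how the vector database avoids repeated LLM invocations for hot requests.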
[Figure 3: LLMDB for Database System Diagnosis — a system alert ("NodeLoadHigh") is pre-processed into an enriched request (the standard load of 159% exceeds total capacity, charts of abnormal metrics such as node_process_blocked, and workloads with sql_template, frequency, and average_time); an execution pipeline then performs metric analysis with a monitoring tool and activity analysis with the vector database (Diag-v1), and GPT-3.5 reports the root cause: continual and intense operations on indices, largely attributed to repeated CREATE INDEX and DROP INDEX operations.]

Data & Model Management. This module functions as the intermediary between the Pipeline Executor and data sources, tools, and AI models. It includes LLM agents (used during pipeline scheduling and execution), which serve as interfaces to external tool or model APIs. This module (i) aligns the system's internal understanding with external knowledge and (ii) mitigates over-reliance on LLMs.

2.4 Challenges

There are several research challenges in this LLMDB paradigm. First, how to effectively understand the user request and generate the execution pipeline? Second, how to select well-designed execution operations with completeness and minimalism such that they can be combined to generate high-quality execution pipelines? Third, how to design a high-quality executor agent that can utilize multiple operations to effectively answer a complicated task? Fourth, how to select an effective embedding method that can capture domain-specific proximity and enrich the query with semantic augmentation? Fifth, how to balance LLM fine-tuning and prompt engineering? Fine-tuning can optimize an LLM with domain-specific knowledge, but it requires a large volume of high-quality training data; prompting does not require training data, but it requires adding the prompts for every request. Sixth, how to utilize the LLMDB feedback to iteratively optimize the pipeline?

3 LLMDB FOR SYSTEM DIAGNOSIS

Database system diagnosis focuses on identifying the root causes of anomalies; it takes various monitoring information as input and outputs root causes and solutions to fix them. Traditional machine learning methods often treat system diagnosis as a classification problem [7, 14, 28]. These methods (i) rely heavily on large sets of high-quality labeled data, limiting their generalizability to new scenarios (e.g., with different monitoring metrics), and (ii) cannot generate detailed analysis reports like human DBAs. While there are prototypes leveraging LLMs to solve this problem [30, 31], their implementation requires substantial expertise. In contrast, we explore how to automate the development of LLM-enhanced database system diagnosis.

Domain-Specific Data Preparation. First, LLMDB extracts knowledge from extensive documents, including technical manuals, logs, and case studies. This extracted knowledge is then systematically structured and stored in a vector database, facilitating efficient retrieval and analysis. Second, LLMDB sets up a suite of diagnosis tools, each with (i) defined APIs, (ii) text specifications, and (iii) demonstration examples, which are tailored scenarios illustrating the application of these tools in real-world system diagnosis. They not only serve as a guide for understanding the practical application of the tools, but also assist in fine-tuning LLMs and domain models.

User Request Pre-Processing. LLMDB enriches original system alerts to create a comprehensive context for the following analysis. For instance, if an alert indicates high CPU usage, LLMDB further decides relevant factors like the duration of high usage, related processes consuming excessive CPU resources, and any concurrent system activities that might contribute to the anomaly. Additionally, this stage includes the workload or slow queries from the system (e.g., from logs or system views) during the anomaly period.

Request Parsing. Since the execution pipeline of database system diagnosis is relatively fixed, LLMDB directly maps the enriched alert into a pipeline with operations in the order of (i) high-level metric analysis, (ii) finer-grained metric analysis, (iii) activity analysis, (iv) workload analysis, and (v) solution recommendation. Each operation is accordingly matched with relevant tool APIs (e.g., an index advisor API for solution recommendation) and knowledge.

Execution Pipeline. The execution pipeline for database diagnosis is primarily focused on two areas: tool calls for obtaining specific abnormal status and knowledge-matching-based root cause analysis. First, tool calling sequences perform automated checks on database performance and integrity. For example, one operation may use a performance monitoring tool to identify slow queries, followed by an integrity checker for detecting data corruption. Second, knowledge matching in one operation leverages a vast repository of documented issues and solutions, comparing current anomalies with historical incidents to identify matches or similarities, which is addressed by vector databases. Moreover, there are also operations that need to implement selected solutions. For instance, a pattern of locking conflicts could lead to actions like (i) adjusting transaction isolation levels or (ii) redesigning the database schema.

Model Management. Database diagnosis seldom has an open corpus. Thus, it is vital to train the LLM to learn anomaly root causes from real cases. This involves unsupervised LLM training to recognize and interpret complex patterns from sampled time-series data. For example, an LLM needs to identify that recurring divergences in transactional data could suggest concurrency conflicts. Similarly, the LLM needs to analyze log files showing intermittent database access failures and correlate them with issues like network instability.

Research Challenges. First, how to involve human experts in the diagnosis loop? Human experts may provide a vast amount of valuable advice when using the diagnosis application in real
cases, which should be learned by caching in the vector database or by fine-tuning the models. Second, how to integrate multi-modal information sources like text, figures, and videos in the diagnosis process? This could provide a more comprehensive context and improve the accuracy of root cause identification.

4 LLMDB FOR DATA ANALYTICS

In many data analytical scenarios, users are unable to write SQL queries and prefer to use natural language (NL) to analyze the data. A challenge here is how to generate analytical results from NL queries. Current LLM works [6, 17] depend on LLMs for crafting these queries, which often leads to errors (e.g., non-executable queries or results that do not meet user requirements). Thus, we showcase how to accurately conduct data analysis with LLMDB.

[Figure 4: LLMDB for NL-based Data Analytics — the origin request "You are provided with a table detailing the Hospitals in Canada. Compute the average capacity of hospitals located in Toronto." (over a table with Hospital, City, Capacity, and Type columns, e.g., Sunnybrook / Toronto / 600 beds / Teaching) is pre-processed into "Retrieve hospital data from [data sources] in Canada. Filter and compute ... records whose 'City' column equals 'Toronto'", matched to a cached analysis program, and answered: the average capacity of Toronto hospitals is 483 beds.]

Domain-Specific Data Preparation. For a data analytics task like "You are given a table on ... . Filter the table to show only facilities located in Boston.", the main aim is to generate executable analysis programs with visualizations as outputs. Thus, we prepare (i) data sources like existing analysis requests and example programs in different languages (e.g., SQL, Python, R); and (ii) models that are good at translating the corresponding languages. For the data source preparation, we can adopt active learning to refine the LLM's understanding, focusing on generating diverse examples that cover a wide array of analytics scenarios. For model preparation, we can fine-tune LLMs with training examples that involve natural language (NL) instructions, sequences of API calls (code lines), and detailed explanations.

Request Pre-Processing. When handling an incoming analytics request, we first prepare and load the raw tabular data. For example, when processing a dataset with attributes such as sales records, we standardize date formats by writing a program, typically in Python, that utilizes a series of Pandas functions like drop_duplicates() and pd.to_datetime().

Request Parsing. The pipelines in data analytics are sequences of APIs tailored for the specific analytics task. Each chain in the pipeline represents a workflow, integrating multiple APIs to perform complex data analysis (e.g., data loading, transformation, and visualization). The operations of these chains utilize the fine-tuned LLM to predict the most relevant API calls based on the input NL queries and the data context (e.g., column correlations within the tabular data), ensuring an accurate data processing flow. The following code demonstrates how to translate the natural language query into a data analytics workflow using Pandas. It involves loading the hospital data, filtering it by the condition "located in Toronto", and computing the average capacity:

import pandas as pd
# Load data
data = pd.read_csv('hospital_data.csv')
# Filter data on the city column
filter_data = data[data['city'] == 'Toronto']
# Calculate average capacity
average_capacity = filter_data['capacity'].mean()
print(f"The average capacity for hospitals located in Toronto is: {average_capacity}")

Execution Pipeline. Apart from directly running the program derived by the above execution pipeline, we can utilize the vector database to function as a dynamic caching mechanism. That is, the vector database stores and manages the embeddings of natural language (NL) queries. These embeddings are mapped to the corresponding program or a sub-sequence of the APIs that are essential for executing the analytics task at hand. For instance, when a query related to trend analysis within a given dataset is received, the vector database strategically caches its embedding, which is then mapped to a series of specialized APIs that are proficient in time-series analysis. This approach also enables the system to adaptively learn from incoming queries, thereby progressively optimizing the caching mechanism for future requests.

Research Challenges. First, how to develop algorithms and models that can accurately generate analytical results from natural language queries? Second, how to automatically generate analysis programs, handle diverse data formats, and create models capable of translating different programming languages to facilitate seamless data analytics? Third, how to efficiently optimize the sequence of API calls in data analytics workflows, i.e., develop algorithms to predict the most effective sequence of API calls?

5 LLMDB FOR QUERY REWRITE

Query rewrite [12] aims to transform a SQL query into an equivalent but more efficient SQL query. It includes two main types of work. The first type [3, 24] aims to create new rewrite rules from existing SQL pairs [24] or to develop new domain-specific languages (DSLs) to make it easier to implement these rules [3]. However, these methods are limited because they can only handle simple rules (e.g., with up to three operators), and they do not make use of valuable rewrite experiences in textual documents and DBA experience. The second type of work [29, 32] tries to improve rewrite performance by changing the order of rule application. Yet, these methods also
Origin Request Executor Agent. Within the execution pipeline, the Pipeline Re-
Rewrite this query: select t1.c1 from t1 where exists(select
finement module begins by selecting the most effective strategy of
* from (select c2*2 as cc2 from t2) as t2 where t1.deptno=t2.cc2); applying these rules. This involves training an order assignment
LLM, which learns from experiences such as: (𝑖) rules involving
Pre-Processed Request different operators that can be applied simultaneously, and (𝑖𝑖)
SELECT t1.c1 FROM t1 WHERE EXISTS ( removing aggregations within a subquery may enable the appli-
SELECT * FROM ( cation of more effective rules that were previously inapplicable.
SELECT c2*2 AS cc2 FROM t2
) AS t2 WHERE t1.deptno=t2.cc2); Next, we evaluate the rewritten query by criteria like whether it has
effectively reduced the complexity of operators. The final step is to
[schema] : tpch_toy
adopt the rewritten query that ensures (𝑖) semantically equivalent
Execution Pipeline results and (𝑖𝑖) the lowest estimated cost.
Caching-based Execution. We also support an advanced caching
A g gre g at e Fi l t e rTr an s po s eR u l e
mechanism to enhance the rewrite procedure. After rewriting an
F il t er Pro j e ct Tr an sp os eR u l e
… SQL query, the vector database caches its embedding together with
the rewrite strategies (i.e., rule sequence). This cached embedding is
Input
SELECT t1.c1 FROM t1 WHERE exists (
SELECT * FROM (
then leveraged to rapidly rewrite the following queries with similar
SQL SELECT c2 * 2 AS cc2 FROM t2
rewrite preferences. For instance, if a cached embedding indicates
) AS t2 WHERE t1.deptno = t2.cc2 );

Action: Action: Extracted Rules a query with complex grouping, the vector database guides the
Rule A Rule B system to employ the strategy optimized for grouping operations,
Action
Rewrite 1 Rewrite 2 thereby saving the rewrite time.
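The caching-based execution step can be sketched as a nearest-neighbor lookup over query embeddings. The embedding function, similarity threshold, and linear scan below are illustrative assumptions; a production system would use a vector database index instead.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class RewriteCache:
    """Maps query embeddings to previously successful rule sequences."""
    def __init__(self, threshold: float = 0.9):
        self.entries = []          # list of (embedding, rule_sequence)
        self.threshold = threshold

    def put(self, embedding, rule_sequence):
        self.entries.append((embedding, rule_sequence))

    def lookup(self, embedding):
        # Return the rule sequence of the most similar cached query,
        # or None if nothing is similar enough.
        best, best_sim = None, self.threshold
        for cached, rules in self.entries:
            sim = cosine(embedding, cached)
            if sim >= best_sim:
                best, best_sim = rules, sim
        return best

cache = RewriteCache()
cache.put([0.9, 0.1, 0.0], ["AggregateFilterTransposeRule"])
print(cache.lookup([0.88, 0.12, 0.01]))  # similar query: reuse its strategy
```

On a cache hit, the system replays the cached rule sequence directly instead of invoking the LLM, which is where the rewrite-time saving comes from.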
Model and Embedding Management. During online usage, there
Feedback : Feedback can be SQL queries where it has low confidence or makes incorrect
Incorrect or …
Inefficient LLM predictions. For these queries, generate representative samples
to (𝑖) train the LLM in Executor Agent and (𝑖𝑖) cache in vector
SELECT t1.c1 FROM t1, LATERAL (
Enhanced Output SELECT MIN(1) AS c3 FROM t2 databases, where each sample is composed of the original SQL
Cost Model SQL WHERE t1.t2 = t2 * 2
) AS t2 WHERE t2.c3 IS NOT NULL; query, rule, rewritten query, and explanatory annotations. Given
the scarcity of high-quality samples for query rewriting [33], we
Figure 5: LLMDB for Query Rewrite can apply methods like active learning to proactively collect such
do not use structural information to (𝑖) reduce the search space or samples.
to (𝑖𝑖) effectively judge the quality of rewrites at the logical query Research Challenges. First, how to evaluate the quality and over-
level. In this section, we showcase the methods and techniques used lap of generated rules. Some rules may be redundant or represent
in LLMDB for query rewrite. similar conditions, which existing LLMs may fail to identify. Sec-
Domain-Specific Data Preparation – Rewrite Rules. Given ond, how to utilize LLM to accurately verify semantic equivalence
documents like database operation manuals or academic research between the original and rewritten queries, which is a complex
papers, we can employ LLMs to automatically extract, verify, and problem and lacks a general solution. Third, how to make a light-
correct optimization rules. These rules are formulated in a formal weight implementation within database kernels. This involves is-
language, either as a logical DSL rule or as an executable function sues like minimizing the computational overhead of LLM inference
(e.g., Java). For each rule, we also create a natural language descrip- and reducing the number of applied rules.
tion. This description includes details like the rule’s purpose, steps,
application scenarios, and examples. These descriptions aid LLMs
6 CONCLUSION
in the executor agent to effectively rewrite queries. In this paper, we propose an LLM-enhanced data management
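To make the rule representation from the Domain-Specific Data Preparation step concrete, one possible in-memory structure pairing the formal rule with its natural language description is sketched below. The field names and the example rule content are illustrative assumptions, not LLMDB's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class RewriteRule:
    name: str               # e.g., a Calcite-style rule name
    formal_rule: str        # logical DSL form, or a reference to an executable function
    purpose: str            # natural-language description consumed by the LLM
    scenarios: list = field(default_factory=list)  # when the rule applies
    example: str = ""       # a before/after SQL illustration

rule = RewriteRule(
    name="FilterProjectTransposeRule",
    formal_rule="Filter(Project(R)) => Project(Filter(R))",
    purpose="Push filters below projections so they are evaluated earlier.",
    scenarios=["filter references only columns produced from base columns"],
    example="SELECT c1 FROM (SELECT c1, c2 FROM t) s WHERE c1 > 0",
)
print(rule.purpose)
```

The natural-language fields (`purpose`, `scenarios`, `example`) are what the executor agent's LLM reads when deciding whether and how to apply the rule.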
Request Pre-Processing. In most cases, extracting the target framework for addressing the limitations of LLMs: hallucination,
query from the rewrite requests is straightforward. This can be high cost, and low accuracy. LLMDB can well cover diversified data
done by writing basic regular expression (regex) code to capture management applications. By integrating LLMs, vector database,
complete SQL statements. For example, we can use the regex pat- and LLM agent, LLMDB offers enhanced accuracy, relevance, and re-
tern r“(select|· · · |delete).*?;” to retrieve queries ranging from “select” liability of data processing. The framework’s potential in areas such
to “delete” commands. as query rewriting, data analytics, and diagnosis is vast, promising
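The pre-processing step above can be sketched in a few lines. The expanded keyword list replaces the elided "···" in the paper's pattern and is our assumption; the non-greedy `.*?;` matches each statement up to its terminating semicolon.

```python
import re

# Illustrative expansion of the abbreviated pattern r"(select|···|delete).*?;";
# the exact keyword list is an assumption, not taken from the paper.
SQL_PATTERN = re.compile(
    r"\b(select|insert|update|delete)\b.*?;",
    re.IGNORECASE | re.DOTALL,
)

def extract_sql(request: str) -> list:
    """Capture complete SQL statements embedded in a free-form request."""
    return [m.group(0).strip() for m in SQL_PATTERN.finditer(request)]

request = (
    "Rewrite this query: select t1.c1 from t1 where exists"
    "(select * from (select c2*2 as cc2 from t2) as t2 "
    "where t1.deptno=t2.cc2);"
)
print(extract_sql(request))
```

Because the match is non-greedy and anchored on a leading keyword, nested subqueries (the inner `select`s above) stay inside the single outer statement rather than producing spurious extra matches.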
significant advancements in various sectors. Part of the source code
Request Parsing. We match the extracted SQL queries with the is open-sourced in https://github.com/TsinghuaDatabaseGroup/DB-
corresponding table schema and applicable rules. These rules may GPT (and will soon be split off into an independent repository).
be well-formatted (executed by engines like Calcite [4] or Cock-
roachDB [21]), or require LLM in Execution Agent to interpret and ACKNOWLEDGMENTS
follow. Each rule is specifically designed to address different aspects
We thank Wei Zhou, Zhaoyan Sun, Zhiyuan Liu, and AgentVerse
of query rewrite. For example, one rule might aim at optimizing
Team for their valuable advice on this vision. Zui Chen and Lei Cao
subqueries, while another might be focused on making aggregation
have assisted in designing Figure 5.
processes more efficient. For each extracted query, we organize the
sequence of matched rules as a basic execution pipeline.
REFERENCES
[1] https://huggingface.co/nlpconnect/vit-gpt2-image-captioning. Last accessed on 2024-1.
[2] https://huggingface.co/thebloke/marcoroni-7b-v3-gguf. Last accessed on 2024-1.
[3] Qiushi Bai, Sadeem Alsudais, and Chen Li. QueryBooster: Improving SQL performance using middleware services for human-centered query rewriting. Proc. VLDB Endow., 16(11):2911–2924, 2023.
[4] Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J. Mior, and Daniel Lemire. Apache Calcite: A foundational framework for optimized query processing over heterogeneous data sources. In SIGMOD, pages 221–230. ACM, 2018.
[5] Chengliang Chai, Guoliang Li, Ju Fan, and Yuyu Luo. CrowdChart: Crowdsourced data extraction from visualization charts. IEEE Trans. Knowl. Data Eng., 33(11):3537–3549, 2021.
[6] Zui Chen, Lei Cao, Sam Madden, Tim Kraska, Zeyuan Shang, Ju Fan, Nan Tang, Zihui Gu, Chunwei Liu, and Michael Cafarella. SEED: Domain-specific data curation with large language models. arXiv e-prints, pages arXiv–2310, 2023.
[7] Karl Dias, Mark Ramacher, Uri Shaft, Venkateshwaran Venkataramani, and Graham Wood. Automatic performance diagnosis and tuning in Oracle. In CIDR, pages 84–94, 2005.
[8] Fatih Kadir Akın et al. Awesome ChatGPT prompts. https://github.com/f/awesome-chatgpt-prompts, 2023.
[9] Jules White et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. 2023.
[10] Pengfei Liu et al. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 2023.
[11] Clinton Gormley and Zachary Tong. Elasticsearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine. O'Reilly Media, Inc., 2015.
[12] Goetz Graefe. Volcano: An extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng., 6(1):120–135, 1994.
[13] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, et al. MetaGPT: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
[14] Minghua Ma, Zheng Yin, Shenglin Zhang, et al. Diagnosing root causes of intermittent slow queries in large-scale cloud databases. Proc. VLDB Endow., 13(8):1176–1189, 2020.
[15] Lluís Màrquez, Xavier Carreras, Kenneth C. Litkowski, and Suzanne Stevenson. Semantic role labeling: An introduction to the special issue, 2008.
[16] Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, et al. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
[17] Yujia Qin, Shengding Hu, Yankai Lin, et al. Tool learning with foundation models. arXiv preprint arXiv:2304.08354, 2023.
[18] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023.
[19] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023.
[20] Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: An autonomous agent with dynamic memory and self-reflection, 2023.
[21] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, et al. CockroachDB: The resilient geo-distributed SQL database. In SIGMOD, pages 1493–1509. ACM, 2020.
[22] Nan Tang, Ju Fan, Fangyi Li, Jianhong Tu, Xiaoyong Du, Guoliang Li, Sam Madden, and Mourad Ouzzani. RPT: Relational pre-trained transformer is almost all you need towards democratizing data preparation. arXiv preprint arXiv:2012.02469, 2020.
[23] Peter D. Turney. A uniform approach to analogies, synonyms, antonyms, and associations. In COLING, pages 905–912, 2008.
[24] Zhaoguo Wang, Zhou Zhou, Yicun Yang, Haoran Ding, Gansen Hu, Ding Ding, Chuzhe Tang, Haibo Chen, and Jinyang Li. WeTune: Automatic discovery and verification of query rewrite rules. In SIGMOD, pages 94–107. ACM, 2022.
[25] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
[26] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[27] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models, 2023.
[28] Dong Young Yoon, Ning Niu, and Barzan Mozafari. DBSherlock: A performance diagnostic tool for transactional databases. In SIGMOD, pages 1599–1614. ACM, 2016.
[29] Xuanhe Zhou, Guoliang Li, Chengliang Chai, and Jianhua Feng. A learned query rewrite system using Monte Carlo tree search. Proc. VLDB Endow., 15(1):46–58, 2021.
[30] Xuanhe Zhou, Guoliang Li, and Zhiyuan Liu. LLM as DBA. arXiv preprint arXiv:2308.05481, 2023.
[31] Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, Zhiyuan Liu, Weize Chen, Jianming Wu, Jiesi Liu, Ruohang Feng, and Guoyang Zeng. D-Bot: Database diagnosis system using large language models. CoRR, abs/2312.01454, 2023.
[32] Xuanhe Zhou, Guoliang Li, Jianming Wu, Jiesi Liu, Zhaoyan Sun, and Xinning Zhang. A learned query rewrite system. Proc. VLDB Endow., 16(12):4110–4113, 2023.
[33] Xuanhe Zhou, Zhaoyan Sun, and Guoliang Li. DB-GPT: Large language model meets database.