(5) Knowledge Augmentation enriches the request with valuable knowledge (e.g., the meaning of Prometheus in a diagnosis task) or empirical rules (e.g., the criteria of entity matching).

Request Parser. This module interprets the user's request into an executable operation pipeline.

(1) Semantic-based Request Segmentation: Different from basic techniques like tokenization, it splits the input request into meaningful segments. For example, with the Semantic Role Labeling (SRL) technique [15], it assigns roles to words in the pre-processed request to capture their semantic relationships. For instance, in the request “Find flights from New York to London departing tomorrow”, it identifies segments such as “Find flights”, “from New York to London”, and “departing tomorrow”, each conveying a specific operation or parameter of the flight search request.
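The following is a minimal sketch of semantic-based segmentation, assuming a pure-Python setting. A production system would use a trained SRL model [15]; here simple regular expressions stand in for role labels, and the Segment and segment_request names, as well as the role set, are hypothetical.

import re
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    role: str    # e.g., "operation", "route", "time"

# Hypothetical role patterns standing in for SRL role labels.
ROLE_PATTERNS = {
    "operation": r"^[A-Z]\w*\s+\w+",              # "Find flights"
    "route":     r"from\s+.+?\s+to\s+[A-Z]\w*",   # "from New York to London"
    "time":      r"departing\s+\w+",              # "departing tomorrow"
}

def segment_request(request: str) -> list:
    """Split a pre-processed request into role-labeled segments."""
    segments = []
    for role, pattern in ROLE_PATTERNS.items():
        match = re.search(pattern, request)
        if match:
            segments.append(Segment(match.group(0), role))
    return segments

print(segment_request("Find flights from New York to London departing tomorrow"))
# [Segment(text='Find flights', role='operation'),
#  Segment(text='from New York to London', role='route'),
#  Segment(text='departing tomorrow', role='time')]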
(2) Operation Mapping: Based on the keyword–API mapping table, we convert the segments into functional operations (using tool or model APIs) and data access operations (using data sources).

(3) Execution Pipeline Generation: These operations form a basic execution pipeline based on the structural logic of the request. For example, for a segmented request R = [“Identify missing value”, “in dataset column”], R0 is designated as the root, helping arrange the other operations based on both their order and semantic relations.
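A minimal sketch of operation mapping and pipeline generation follows, under the assumption that the keyword–API table is a plain dictionary; the table contents, the Operation and PipelineNode classes, and the API strings are illustrative, not LLMDB's actual interfaces.

from dataclasses import dataclass, field

@dataclass
class Operation:
    name: str
    kind: str    # "functional" (tool/model API) or "data_access" (data source)
    api: str

@dataclass
class PipelineNode:
    op: Operation
    children: list = field(default_factory=list)

# Hypothetical keyword–API mapping table.
KEYWORD_API_TABLE = {
    "identify missing value": Operation("detect_missing", "functional", "profiler.detect_missing_values"),
    "in dataset column":      Operation("load_column", "data_access", "datasource.read_column"),
}

def map_segments(segments):
    """Convert request segments into operations via the mapping table."""
    return [KEYWORD_API_TABLE[s.lower()] for s in segments if s.lower() in KEYWORD_API_TABLE]

def build_pipeline(operations):
    """Designate the first operation as the root R0 and chain the rest in order."""
    if not operations:
        return None
    root = PipelineNode(operations[0])
    current = root
    for op in operations[1:]:
        node = PipelineNode(op)
        current.children.append(node)
        current = node
    return root

segments = ["Identify missing value", "in dataset column"]
pipeline = build_pipeline(map_segments(segments))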
Executor Agent. Given the basic pipeline, this module refines the pipeline with underlying information, evaluates the outcomes, and re-generates a more effective pipeline when needed (e.g., under an arrival-time or cost limit).

(1) Pipeline Refinement: For each operation in the pipeline, it enriches the operation with information like (i) which data source to use (e.g., the same document can be stored in both a vector database and an elastic search engine [11]) and (ii) which operation orders are most execution-efficient (using methods like genetic algorithms).

(2) Evaluation: It assesses the quality of results by executing the current pipeline. For instance, in a data analytics task, we can utilize an LLM to analyze user feedback about the usability and relevance of the analytics results.

(3) Pipeline Re-Generation: If the evaluation result is poor, it generates a new execution pipeline. For example, we can use decision trees and neural networks to identify and add any intermediate operations that can potentially derive more reliable results.
Vector Databases. Vector databases are used to improve execution effectiveness and efficiency from multiple aspects.

(1) Semantic Augmentation and Search: When a knowledge query is issued by an LLM worker, the vector database (i) integrates context and intent analysis to enrich the query; and (ii) utilizes advanced similarity search algorithms (e.g., graph-network-based embeddings for relational knowledge patterns) to ensure high relevance in search results by understanding the semantic differences between queries.

(2) Caching Mechanism: We can cache hot user requests and their answers; when a similar request is posted, we can use the vector database to directly answer the request, without involving LLMs (a small sketch of this caching idea appears after this list).

(3) Vertical Knowledge: It serves as a metadata container, enabling unified access and management of multiple data sources.

(4) Long-term and Real-time Knowledge: It transforms time-series data into vector embeddings, facilitating the storage and retrieval of real-time and long-term data. This way, it enhances the system's ability to perform complex analysis over extended periods.
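The caching mechanism above can be sketched as follows. This is illustrative only: the toy embedding, the similarity threshold, and the RequestCache class are assumptions; a real deployment would use a learned embedding model and a vector database index.

import math

def embed(text: str) -> list:
    """Hypothetical embedding; a real system would call an embedding model."""
    vec = [0.0] * 26                      # toy bag-of-letters vector
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class RequestCache:
    """Cache of (request embedding, answer) pairs served by similarity search."""
    def __init__(self, threshold: float = 0.95):
        self.entries = []                 # list of (embedding, answer)
        self.threshold = threshold

    def store(self, request: str, answer: str):
        self.entries.append((embed(request), answer))

    def lookup(self, request: str):
        query = embed(request)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best and cosine(query, best[0]) >= self.threshold:
            return best[1]                # answer a similar hot request without the LLM
        return None                       # cache miss: fall back to the LLM worker

cache = RequestCache()
cache.store("find flights from New York to London", "cached answer for this request")
hit = cache.lookup("Find flights from New York to London departing tomorrow")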
Data & Model Management. This module functions as the intermediary between the Pipeline Executor and data sources, tools, and AI models. It includes LLM agents (used during pipeline scheduling and execution), which serve as interfaces to external tool or model APIs. This module (i) aligns the system's internal understanding with external knowledge and (ii) mitigates over-reliance on LLMs.

2.4 Challenges
There are several research challenges in the LLMDB paradigm. First, how to effectively understand the user request and generate the execution pipeline? Second, how to select well-designed execution operations with completeness and minimalism such that they can be combined to generate high-quality execution pipelines? Third, how to design a high-quality executor agent that can utilize multiple operations to effectively answer a complicated task? Fourth, how to select an effective embedding method that can capture domain-specific proximity so as to enrich the query with semantic augmentation? Fifth, how to balance LLM fine-tuning and prompt engineering? Fine-tuning can optimize an LLM with domain-specific knowledge, but it requires a large volume of high-quality training data; prompting does not require training data, but it requires adding prompts to every request. Sixth, how to utilize LLMDB feedback to iteratively optimize the pipeline.

3 LLMDB FOR SYSTEM DIAGNOSIS

Figure 3: LLMDB for Database System Diagnosis. (The figure traces a “NodeLoadHigh” system alert through a pre-processed request — the alert description, charts of abnormal metrics, and workloads — and an execution pipeline in which a metric-analysis step calls a monitoring tool, an activity-analysis step queries the vector database, and the final report attributes the CPU overload to continual ‘CREATE INDEX’ and ‘DROP INDEX’ operations.)

Database system diagnosis focuses on identifying the root causes of anomalies; it takes various monitoring information as input and outputs root causes together with solutions to fix them. Traditional machine learning methods often treat system diagnosis as a classification problem [7, 14, 28]. These methods (i) rely heavily on large sets of high-quality labeled data, limiting their generalizability to new scenarios (e.g., with different monitoring metrics), and (ii) cannot generate detailed analysis reports like human DBAs. While there are prototypes leveraging LLMs to solve this problem [30, 31], their implementation requires substantial expertise. In contrast, we explore how to automate the development of LLM-enhanced database system diagnosis.

Domain-Specific Data Preparation. First, LLMDB extracts knowledge from extensive documents, including technical manuals, logs, and case studies. This extracted knowledge is then systematically structured and stored in the vector database, facilitating efficient retrieval and analysis. Second, LLMDB sets up a suite of diagnosis tools, each with (i) defined APIs, (ii) text specifications, and (iii) demonstration examples, which are tailored scenarios illustrating the application of these tools in real-world system diagnosis. They not only serve as a guide for understanding the practical application of the tools, but also assist in fine-tuning LLMs and domain models.
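As a minimal sketch, a diagnosis tool could be registered with its API, specification, and demonstrations as below; the DiagnosisTool structure, the registry, and the example tool names are hypothetical illustrations rather than LLMDB's actual interface.

from dataclasses import dataclass, field

@dataclass
class Demonstration:
    scenario: str      # textual description of a real-world diagnosis scenario
    tool_call: str     # how the tool was invoked in that scenario
    outcome: str       # what the call revealed

@dataclass
class DiagnosisTool:
    name: str
    api: str                 # (i) defined API entry point
    specification: str       # (ii) text specification shown to the LLM
    demonstrations: list = field(default_factory=list)   # (iii) demonstration examples

TOOL_REGISTRY = {}

def register_tool(tool: DiagnosisTool):
    """Store the tool so the executor agent (and fine-tuning jobs) can retrieve it."""
    TOOL_REGISTRY[tool.name] = tool

register_tool(DiagnosisTool(
    name="metric_monitor",
    api="monitoring.fetch_metric(metric_name, start, end)",
    specification="Returns the time series of a system metric over a window.",
    demonstrations=[Demonstration(
        scenario="CPU usage spikes after a nightly batch job",
        tool_call="monitoring.fetch_metric('node_cpu_usage', t0, t1)",
        outcome="Revealed sustained high CPU during index rebuilds.",
    )],
))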
Request Pre-Processing. LLMDB enriches original system alerts to create a comprehensive context for the following analysis. For instance, if an alert indicates high CPU usage, LLMDB further decides relevant factors like the duration of high usage, related processes consuming excessive CPU resources, and any concurrent system activities that might contribute to the anomaly. Additionally, this stage includes workload or slow queries from the system (e.g., logs or system views) during the anomaly period.

Request Parsing. Since the execution pipeline of database system diagnosis is relatively fixed, LLMDB directly maps the enriched alert into a pipeline with operations in the order of (i) high-level metric analysis, (ii) finer-grained metric analysis, (iii) activity analysis, (iv) workload analysis, and (v) solution recommendation. Each operation is accordingly matched with relevant tool APIs (e.g., index advisor API for solution recommendation) and knowledge.
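A minimal sketch of this fixed mapping is shown below. The operation names follow the order in the text, while the tool API strings and the shape of the enriched alert are assumptions made for illustration.

# Fixed diagnosis pipeline; tool API names are hypothetical placeholders.
DIAGNOSIS_PIPELINE = [
    ("high_level_metric_analysis",   "monitoring.summarize_metrics"),
    ("fine_grained_metric_analysis", "monitoring.fetch_metric"),
    ("activity_analysis",            "vectordb.match_activity_knowledge"),
    ("workload_analysis",            "workload.extract_slow_queries"),
    ("solution_recommendation",      "advisor.recommend_index"),
]

def parse_request(enriched_alert: dict) -> list:
    """Turn an enriched alert into an ordered list of executable operations."""
    return [
        {"operation": name, "tool_api": api, "context": enriched_alert}
        for name, api in DIAGNOSIS_PIPELINE
    ]

enriched_alert = {
    "alert": "NodeLoadHigh",
    "duration_minutes": 42,
    "top_processes": ["postgres: CREATE INDEX"],
    "workloads": [{"sql_template": "INSERT INTO orders ...", "frequency": 120}],
}
pipeline = parse_request(enriched_alert)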
Execution Pipeline. The execution pipeline for database diagnosis is primarily focused on two areas: tool calls for obtaining specific abnormal status and knowledge-matching-based root cause analysis. First, tool calling sequences perform automated checks on database performance and integrity. For example, one operation may use a performance monitoring tool to identify slow queries, followed by an integrity checker for detecting data corruption. Second, knowledge matching in an operation leverages a vast repository of documented issues and solutions, comparing current anomalies with historical incidents to identify matches or similarities, which is addressed by vector databases. Moreover, there are also operations that need to implement selected solutions. For instance, a pattern of locking conflicts could lead to actions like (i) adjusting transaction isolation levels or (ii) redesigning the database schema.
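The following sketch chains two tool calls and matches their observations against historical incidents. The check functions, the incident store, and the exact-match lookup are stand-ins: a real pipeline would call external monitoring tools and use embedding similarity in the vector database.

def check_performance(db_state: dict) -> dict:
    """Stand-in for a performance monitoring tool call."""
    slow = [q for q in db_state["queries"] if q["latency_ms"] > 1000]
    return {"finding": "slow_queries", "details": slow}

def check_integrity(db_state: dict) -> dict:
    """Stand-in for an integrity checker tool call."""
    corrupted = [t for t in db_state["tables"] if t.get("checksum_ok") is False]
    return {"finding": "corruption", "details": corrupted}

HISTORICAL_INCIDENTS = [
    {"symptom": "slow_queries", "root_cause": "missing index", "solution": "create index"},
    {"symptom": "corruption",   "root_cause": "disk failure",  "solution": "restore from backup"},
]

def run_diagnosis(db_state: dict) -> list:
    report = []
    for tool in (check_performance, check_integrity):     # fixed tool-calling sequence
        observation = tool(db_state)
        if observation["details"]:
            # Knowledge matching: compare the observation with documented incidents.
            matches = [i for i in HISTORICAL_INCIDENTS
                       if i["symptom"] == observation["finding"]]
            report.append({"observation": observation, "matched_incidents": matches})
    return report

state = {
    "queries": [{"sql": "SELECT ...", "latency_ms": 2300}],
    "tables": [{"name": "orders", "checksum_ok": True}],
}
print(run_diagnosis(state))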
Model Management. Database diagnosis seldom has an open corpus. Thus, it is vital to train the LLM to learn anomaly root causes from real cases. This involves unsupervised LLM training to recognize and interpret complex patterns from sampled time-series data. For example, an LLM needs to identify that recurring divergences in transactional data could suggest concurrency conflicts. Similarly, the LLM needs to analyze log files showing intermittent database access failures and correlate this with issues like network instability.
Research Challenges. First, how to involve human experts in the diagnosis loop. Human experts may provide a vast amount of valuable advice when using the diagnosis application in real cases, which should be learned by caching it in the vector database or by fine-tuning the models. Second, how to integrate multi-modal information sources like text, figures, and videos into the diagnosis process. This could provide a more comprehensive context and improve the accuracy of root cause identification.
do not use structural information to (i) reduce the search space or to (ii) effectively judge the quality of rewrites at the logical query level. In this section, we showcase the methods and techniques used in LLMDB for query rewrite.

Figure 5: LLMDB for Query Rewrite. (The figure shows extracted rewrite rules — Rule A and Rule B — applied as actions to an example LATERAL subquery, producing candidate rewrites whose incorrect or inefficient results are fed back, with a cost model and the LLM guiding the enhanced output.)

Domain-Specific Data Preparation – Rewrite Rules. Given documents like database operation manuals or academic research papers, we can employ LLMs to automatically extract, verify, and correct optimization rules. These rules are formulated in a formal language, either as a logical DSL rule or as an executable function (e.g., Java). For each rule, we also create a natural language description. This description includes details like the rule's purpose, steps, application scenarios, and examples. These descriptions aid LLMs in the executor agent to effectively rewrite queries.
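As a minimal sketch, an extracted rule could be stored with both its formal form and its natural-language description, as below. The RewriteRule fields, the invented DSL syntax, and the example rule are illustrative assumptions, not LLMDB's rule format.

from dataclasses import dataclass

@dataclass
class RewriteRule:
    name: str
    formal_rule: str      # logical DSL form, or a pointer to an executable function
    description: str      # purpose, steps, application scenarios
    example_before: str
    example_after: str

remove_redundant_distinct = RewriteRule(
    name="remove_redundant_distinct",
    formal_rule="Distinct(Scan(t)) -> Scan(t) if unique_key(t) is projected",
    description=(
        "Purpose: drop DISTINCT when the projected columns already contain a unique key. "
        "Apply when the child operator scans a table whose primary key is projected."
    ),
    example_before="SELECT DISTINCT id, name FROM users;",
    example_after="SELECT id, name FROM users;",
)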
Request Pre-Processing. In most cases, extracting the target query from the rewrite requests is straightforward. This can be done by writing basic regular expression (regex) code to capture complete SQL statements. For example, we can use the regex pattern r“(select|···|delete).*?;” to retrieve queries ranging from “select” to “delete” commands.
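A runnable version of this idea is sketched below. The statement keywords are an assumed expansion of the elided “(select|···|delete)” alternation, chosen only to make the example self-contained.

import re

# Illustrative keyword list; not LLMDB's exact pattern.
SQL_PATTERN = re.compile(r"(select|insert|update|delete)\b.*?;", re.IGNORECASE | re.DOTALL)

def extract_queries(request_text: str) -> list:
    """Capture complete SQL statements (up to the terminating semicolon)."""
    return [m.group(0) for m in SQL_PATTERN.finditer(request_text)]

request = (
    "Please rewrite the following statement for better performance: "
    "SELECT * FROM orders WHERE id IN (SELECT order_id FROM items); Thanks!"
)
print(extract_queries(request))
# ['SELECT * FROM orders WHERE id IN (SELECT order_id FROM items);']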
Request Parsing. We match the extracted SQL queries with the corresponding table schema and applicable rules. These rules may be well-formatted (executed by engines like Calcite [4] or CockroachDB [21]), or require the LLM in the Executor Agent to interpret and follow. Each rule is specifically designed to address different aspects of query rewrite. For example, one rule might aim at optimizing subqueries, while another might be focused on making aggregation processes more efficient. For each extracted query, we organize the sequence of matched rules as a basic execution pipeline.
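A minimal sketch of rule matching is given below, assuming simple textual predicates as applicability checks; the rule names and predicates are illustrative stand-ins for real schema- and structure-aware matching.

import re

# Hypothetical applicability checks: (rule name, predicate over the SQL text).
RULES = [
    ("flatten_subquery",          lambda sql: bool(re.search(r"\(\s*select", sql, re.I))),
    ("pushdown_aggregation",      lambda sql: bool(re.search(r"\b(count|sum|min|max|avg)\s*\(", sql, re.I))),
    ("remove_redundant_distinct", lambda sql: "distinct" in sql.lower()),
]

def match_rules(sql: str) -> list:
    """Return the names of applicable rules, in their pipeline order."""
    return [name for name, applies in RULES if applies(sql)]

query = "SELECT DISTINCT o.id FROM orders o WHERE o.id IN (SELECT order_id FROM items);"
print(match_rules(query))
# ['flatten_subquery', 'remove_redundant_distinct']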
a query with complex grouping, the vector database guides the system to employ the strategy optimized for grouping operations, thereby saving the rewrite time.

Model and Embedding Management. During online usage, there can be SQL queries where the LLM has low confidence or makes incorrect predictions. For these queries, we generate representative samples to (i) train the LLM in the Executor Agent and (ii) cache in vector databases, where each sample is composed of the original SQL query, the rule, the rewritten query, and explanatory annotations. Given the scarcity of high-quality samples for query rewriting [33], we can apply methods like active learning to proactively collect such samples.
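The sample-collection step can be sketched as an active-learning-style selection by confidence, as below. The RewriteSample fields mirror the composition described above, while the confidence scores and the threshold value are assumptions made for illustration.

from dataclasses import dataclass

@dataclass
class RewriteSample:
    original_sql: str
    rule: str
    rewritten_sql: str
    annotation: str

def collect_samples(predictions, threshold: float = 0.6):
    """Select low-confidence rewrites for labeling, fine-tuning, and caching."""
    selected = []
    for p in predictions:
        if p["confidence"] < threshold:          # active-learning-style selection
            selected.append(RewriteSample(
                original_sql=p["original_sql"],
                rule=p["rule"],
                rewritten_sql=p["rewritten_sql"],
                annotation="needs expert verification (confidence %.2f)" % p["confidence"],
            ))
    return selected

predictions = [
    {"original_sql": "SELECT ...", "rule": "flatten_subquery",
     "rewritten_sql": "SELECT ...", "confidence": 0.42},
    {"original_sql": "SELECT ...", "rule": "pushdown_aggregation",
     "rewritten_sql": "SELECT ...", "confidence": 0.93},
]
samples = collect_samples(predictions)   # only the first, low-confidence rewrite is kept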
Research Challenges. First, how to evaluate the quality and overlap of generated rules. Some rules may be redundant or represent similar conditions, which existing LLMs may fail to identify. Second, how to utilize LLMs to accurately verify semantic equivalence between the original and rewritten queries, which is a complex problem and lacks a general solution. Third, how to make a lightweight implementation within database kernels. This involves issues like minimizing the computational overhead of LLM inference and reducing the number of applied rules.

6 CONCLUSION
In this paper, we propose an LLM-enhanced data management framework for addressing the limitations of LLMs: hallucination, high cost, and low accuracy. LLMDB can well cover diversified data management applications. By integrating LLMs, vector databases, and LLM agents, LLMDB offers enhanced accuracy, relevance, and reliability of data processing. The framework's potential in areas such as query rewriting, data analytics, and diagnosis is vast, promising significant advancements in various sectors. Part of the source code is open-sourced at https://github.com/TsinghuaDatabaseGroup/DB-GPT (and will soon be split off into an independent repository).

ACKNOWLEDGMENTS
We thank Wei Zhou, Zhaoyan Sun, Zhiyuan Liu, and the AgentVerse Team for their valuable advice on this vision. Zui Chen and Lei Cao have assisted in designing Figure 5.
REFERENCES
[1] https://huggingface.co/nlpconnect/vit-gpt2-image-captioning. Last accessed on 2024-1.
[2] https://huggingface.co/thebloke/marcoroni-7b-v3-gguf. Last accessed on 2024-1.
[3] Qiushi Bai, Sadeem Alsudais, and Chen Li. Querybooster: Improving SQL performance using middleware services for human-centered query rewriting. Proc. VLDB Endow., 16(11):2911–2924, 2023.
[4] Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J. Mior, and Daniel Lemire. Apache Calcite: A foundational framework for optimized query processing over heterogeneous data sources. In SIGMOD, pages 221–230. ACM, 2018.
[5] Chengliang Chai, Guoliang Li, Ju Fan, and Yuyu Luo. Crowdchart: Crowdsourced data extraction from visualization charts. IEEE Trans. Knowl. Data Eng., 33(11):3537–3549, 2021.
[6] Zui Chen, Lei Cao, Sam Madden, Tim Kraska, Zeyuan Shang, Ju Fan, Nan Tang, Zihui Gu, Chunwei Liu, and Michael Cafarella. Seed: Domain-specific data curation with large language models. arXiv e-prints, pages arXiv–2310, 2023.
[7] Karl Dias, Mark Ramacher, Uri Shaft, Venkateshwaran Venkataramani, and Graham Wood. Automatic performance diagnosis and tuning in Oracle. In Second Biennial Conference on Innovative Data Systems Research, CIDR 2005, Asilomar, CA, USA, January 4-7, 2005, Online Proceedings, pages 84–94. www.cidrdb.org, 2005.
[8] Fatih Kadir Akın et al. Awesome ChatGPT prompts. https://github.com/f/awesome-chatgpt-prompts, 2023.
[9] Jules White et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. 2023.
[10] Pengfei Liu et al. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 2023.
[11] Clinton Gormley and Zachary Tong. Elasticsearch: the definitive guide: a distributed real-time search and analytics engine. O'Reilly Media, Inc., 2015.
[12] Goetz Graefe. Volcano - an extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng., 6(1):120–135, 1994.
[13] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, et al. MetaGPT: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
[14] Minghua Ma, Zheng Yin, Shenglin Zhang, et al. Diagnosing root causes of intermittent slow queries in large-scale cloud databases. Proc. VLDB Endow., 13(8):1176–1189, 2020.
[15] Lluís Màrquez, Xavier Carreras, Kenneth C. Litkowski, and Suzanne Stevenson. Semantic role labeling: an introduction to the special issue, 2008.
[16] Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, et al. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
[17] Yujia Qin, Shengding Hu, Yankai Lin, et al. Tool learning with foundation models. arXiv preprint arXiv:2304.08354, 2023.
[18] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023.
[19] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023.
[20] Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection, 2023.
[21] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, et al. CockroachDB: The resilient geo-distributed SQL database. In SIGMOD, pages 1493–1509. ACM, 2020.
[22] Nan Tang, Ju Fan, Fangyi Li, Jianhong Tu, Xiaoyong Du, Guoliang Li, Sam Madden, and Mourad Ouzzani. RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation. arXiv preprint arXiv:2012.02469, 2020.
[23] Peter D. Turney. A uniform approach to analogies, synonyms, antonyms, and associations. In Donia Scott and Hans Uszkoreit, editors, COLING, pages 905–912, 2008.
[24] Zhaoguo Wang, Zhou Zhou, Yicun Yang, Haoran Ding, Gansen Hu, Ding Ding, Chuzhe Tang, Haibo Chen, and Jinyang Li. WeTune: Automatic discovery and verification of query rewrite rules. In SIGMOD, pages 94–107. ACM, 2022.
[25] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
[26] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[27] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models, 2023.
[28] Dong Young Yoon, Ning Niu, and Barzan Mozafari. DBSherlock: A performance diagnostic tool for transactional databases. In Fatma Özcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 1599–1614. ACM, 2016.
[29] Xuanhe Zhou, Guoliang Li, Chengliang Chai, and Jianhua Feng. A learned query rewrite system using Monte Carlo tree search. Proc. VLDB Endow., 15(1):46–58, 2021.
[30] Xuanhe Zhou, Guoliang Li, and Zhiyuan Liu. LLM as DBA. arXiv preprint arXiv:2308.05481, 2023.
[31] Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, Zhiyuan Liu, Weize Chen, Jianming Wu, Jiesi Liu, Ruohang Feng, and Guoyang Zeng. D-Bot: Database diagnosis system using large language models. CoRR, abs/2312.01454, 2023.
[32] Xuanhe Zhou, Guoliang Li, Jianming Wu, Jiesi Liu, Zhaoyan Sun, and Xinning Zhang. A learned query rewrite system. Proc. VLDB Endow., 16(12):4110–4113, 2023.
[33] Xuanhe Zhou, Zhaoyan Sun, and Guoliang Li. DB-GPT: Large language model meets database.