The document discusses the application of Large Language Models (LLMs) in data management, highlighting their potential for cost savings, improved service level agreements (SLAs), and adaptivity through automation and intelligent optimization. It outlines challenges faced in traditional AI for databases, such as adaptivity and generalization, and emphasizes the opportunities LLMs present for enhancing usability, performance, and maintainability in data tasks. Additionally, it covers various prompt engineering techniques for tasks like data preprocessing and query rewriting, aiming to improve efficiency and reduce the need for complex programming.

LLM for Data Management

Guoliang Li, Xuanhe Zhou, Xinyang Zhao


Department of Computer Science, Tsinghua University
https://github.com/code4DB/LLM4DB
AI4DB/ML4DB

• Cost Saving (resource, DBAs, …): Manual → Autonomous
  • Auto Knob Tuner: ↓ Maintenance cost
  • Auto Index Advisor: ↓ Optimization latency
• High SLAs (throughput, latency, scalability, …): Heuristic → Intelligent
  • Intelligent Optimizer: ↓ Query plan costs
  • Intelligent Scheduler: ↑ Workload performance
• Adaptivity (applications, hardware, data, query, …): Empirical → Data-Driven
  • Learned Index: ↑ Data access efficiency
  • Learned Layout: ↑ Data manipulation efficiency

Xuanhe Zhou, Chengliang Chai, Guoliang Li, Ji Sun. Database Meets AI: A Survey. TKDE 2021.
Challenges of AI4DB
q Challenges in Traditional AI4DB
• Adaptivity
  • Dynamic changing schema
  • Dynamic changing data
  • Dynamic changing workload
  • Dynamic changing hardware
• Generalization
  • Cold-start
  • High-quality training data
  • Interpretability

(Figure: under workload change, learned index tuning suffers 38.9% performance degradation.)

Traditional learned methods may not work for some scenarios.

Wei Zhou, Chen Lin, Xuanhe Zhou, Guoliang Li. Breaking It Down: An In-depth Study of Index Advisors. VLDB 2024.
Motivations of LLM for Data Management

q Excellent performance and generalization capability
• Support various natural and programming languages
• Improved reasoning ability (vs. traditional PLMs)
• Solve various real-world tasks (e.g., coding, report writing)

Wayne Xin Zhao, Kun Zhou, Junyi Li et al. A Survey of Large Language Models. arXiv, 2023.
Motivations of LLM for Data Management
q Opportunities of LLM for data management
• Improved usability (e.g., Text2SQL)
• Improved performance (e.g., Query Rewrite)
• Improved maintainability (e.g., Database Diagnosis)
• Improved interpretability (e.g., Database Tuning)

Example (R-Bot query rewrite):
Input SQL:
  SELECT … FROM emp WHERE empno IN (SELECT deptno FROM dept … );
Rewrite Analysis:
  … Convert the sub-query into a join between the "emp" and "dept" tables …
Rewritten SQL:
  SELECT … FROM emp INNER JOIN dept ON … AND emp.empno=dept.deptno;

(Figure: R-Bot pipeline — rewrite rules, rewrite engines, and expert experience feed case generation; the model is improved via supervised finetuning, reinforcement learning, and active learning; feedback comes from databases such as PostgreSQL, MySQL, Oracle, and SQL Server. Tasks shown: Text2SQL, Query Rewrite, Diagnosis.)
Different Stages of LLM
1. (Incremental) Pretraining
• Common Knowledge Acquisition
• Understanding Diverse Texts
2. (SFT) Finetuning
• Instruction Following
• Task Adaptation like Translation/Q&A
3. (RLHF) Finetuning
• Align with human preferences
4. Prompting
• Context Comprehension
• Learn from demo examples
5. Agent
• LLM system equipped with reasoning, tools, and memory
6. RAG
• External Knowledge Integration
• Contextual Relevance / QA Accuracy
LLM Pre-Training
q Pretrain an LLM as the foundation model for databases

Transformer-based LLM: predict the next word given a sequence of previous text.
LLM Finetuning
q Finetuning the LLM over a labeled dataset is used for instruction-following and understanding task-specific knowledge
q RLHF is used for human feedback and preference alignment
Prompt for LLM Inference
q Input text for the LLM to generate a response or execute a task
• Simple Prompt
  • (task) "Explain the theory of relational tables."

• Contextual Prompt
  • (context) "Undergraduate students are studying database concepts for the first time and are curious about fundamental theories."
  • (task) "Explain the theory of relational tables in a way beginners can understand."

• Contextual Prompt with Instructions
  • (context) "Undergraduate students are studying database ..."
  • (task) "Explain the theory of relational tables ..."
  • (instructions) "1. Make sure the explanation is clear and engaging for someone new to databases; 2. Limit the explanation to a few paragraphs with examples."

• Contextual Prompt with Instructions + Demonstration Examples …


LLM-Based Autonomous Agent
q LLM Agent: perceives the surrounding environment, plans, executes actions to complete tasks, and memorizes past executions

Lei Wang et al. A Survey on Large Language Model based Autonomous Agents. arXiv 2023.
RAG for LLM Inference

q Drawbacks of LLMs
• Hallucination
• Outdated information
• Low efficiency in LLM training
• Weak reasoning capability

q Practical Requirements
• Domain-Specific Accurate Q&A
• Frequent Data Update
• Explainability of Responses
• Controllable Cost
• Data Privacy Protection

(Figure: a motivating example.)
Overview of LLM Challenges and Solutions
• Inconsistency: gives conflicting outputs for very similar prompts
  → Task decomposition; prompt multiple times and vote; self-reflection …
• Hallucination: generates text that seems realistic and plausible but is actually inaccurate
  → RAG; write instructive prompts to ask for sources/evidence or call tools …
• Memory: cannot automatically retain information from previous chats or update in time
  → Cache and reuse historical messages …
• Reasoning: struggles with tasks requiring complex reasoning, multi-step problem solving, …
  → Task decomposition; provide reasoning-process examples; prompt engineering …
• Outdated knowledge: the knowledge an LLM uses can be out-of-date, because new knowledge is learned in batch in traditional model finetuning
  → RAG …
• Training cost: billions of parameters to update → LoRA; RAG …
• Resource: has memory limits on how much text it can process at once
  → Chunking; embedding; prompt compression; RAG + vector databases …

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. CoRR abs/2309.01219 (2023)
https://promptdrive.ai/llm-limitations/
Overview of LLM4DB Framework
q Data Management Tasks
q LLM Prompt for Data Management
• Instruction Prompting
• Few-Shot Prompting
q LLM Agent for Data Management
• Agent Models + Memory
• Reasoning / Planning Strategies
• Tool Management & Learning
q RAG for Data Management
• Semantic Segmentation
• Result Retrieval
• Result Reranking
q Finetuning for Data Management
• Reparameterization / LLM Adapter
q Data Preparation for LLM
q Open Problems
Overview of Data Management Tasks
q Data Management Tasks
Ø Data Preprocessing
• Data cleaning
• Data standardization
• Doc2Table
• Entity matching
Ø Database Optimization
• Knob tuning
• Database diagnosis
• Query rewrite
Ø Data Analysis
• Text2SQL
• Text2Code
• ...

(Diagram: each task is annotated with its typical prompt composition — instruction only, instruction + examples, or instruction + examples + extra info.)
LLM Prompt for Data Management
Prompt Engineering (PE)
q A high-quality prompt can instruct an LLM to optimize DB tasks without training
Ø Zero-shot Prompting
• Input the LLM with a task description, without training over labeled data
• Instruction Prompting
  • Input the LLM with explicit instructions on approaching the task, e.g., detailing the format, tone, or type of output response
Ø Few-shot Prompting
• Provide the LLM with a few examples of the task within the prompt to guide the model on how to generate responses

Prompt of Query Rewrite:
Task Description
  Write an equivalent SQL query that can be executed on a Postgres database with decreased latency.
Instruction
  1. Ensure the output query is semantically equivalent to the input query …
Example Input
  select ... from t1 where t1.a=(select avg(a) from t3 where t1.b=t3.b);
Example Output
  select … from t1 inner join (select avg(a) avg, t3.b from t3 group by t3.b) as t3 on (t1.a=avg and t1.b=t3.b);
Input
  select t1.* from t1 where t1.col1>(select max(t2.col2) from t2 where t2.col1 in (select t1.col1 from t1 where t1.col1=t2.col1));
Output
  select t1.* from t1 inner join (select max(t2.col2) max, t2.col1 from t2 group by t2.col1) as t2 on (t1.col1=t2.col1) where t1.col1>max;

Xuanhe Zhou, Zhaoyan Sun, Guoliang Li. DB-GPT: Large Language Model Meets Database. Data Science and Engineering 2023.
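To make the few-shot pattern concrete, here is a minimal sketch (an illustration, not DB-GPT's implementation) of assembling such a rewrite prompt programmatically; the constants mirror the example above:

```python
# Assemble a few-shot query-rewrite prompt from a task description,
# instructions, and demonstration examples.
TASK = ("Write an equivalent SQL query that can be executed on a "
        "Postgres database with decreased latency.")
INSTRUCTIONS = ["1. Ensure the output query is semantically equivalent to the input query."]
EXAMPLES = [(
    "select ... from t1 where t1.a=(select avg(a) from t3 where t1.b=t3.b);",
    "select ... from t1 inner join (select avg(a) avg, t3.b from t3 "
    "group by t3.b) as t3 on (t1.a=avg and t1.b=t3.b);",
)]

def build_rewrite_prompt(input_sql: str) -> str:
    parts = ["Task Description:", TASK, "Instructions:", *INSTRUCTIONS]
    for ex_in, ex_out in EXAMPLES:
        parts += ["Example Input:", ex_in, "Example Output:", ex_out]
    parts += ["Input:", input_sql, "Output:"]
    return "\n".join(parts)
```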
Challenges of PE for Data Management
q Challenge 1: How to Automatically Generate Input Prompts?
Ø Automatically generate proper instructions and select demonstration examples from a large space within limited prompt tokens (or limited interaction rounds).

q Challenge 2: How to Efficiently Interact with LLMs Using Prompts?
Ø Iteratively adjusting the prompt for an input request (e.g., selecting suitable examples from candidate ones) is time-consuming.
Ø Long prompts often include more useful information, but require long inference time for the LLM and are hard to understand.

(The query-rewrite prompt from the previous slide serves as the running example.)

Xuanhe Zhou, Zhaoyan Sun, Guoliang Li. DB-GPT: Large Language Model Meets Database. Data Science and Engineering 2023.
Prompt Engineering Techniques for Data Management

(Figure: per-work overview — instructions and examples are mostly written by humans; some works add example selection (BM25 or text-davinci-003) and ranking-based instruction selection.)
Prompt Engineering for Data Preprocessing
q Problems in Traditional Data Preprocessing
• Requires substantial programming skills to write UDFs
• Requires a complex running environment to run the UDFs
• Prior experience cannot be utilized in UDFs

q Decompose preprocessing into operations
• data structural consistency
• data type conversion
• data standardization
• data anomaly detection

q Manually write NL prompts to avoid writing complex functions: User-Defined Function → User-Defined LLM Prompt

Luyi Ma, et al. LLMs with User-defined Prompts as Generic Data Operators for Reliable Data Processing. IEEE Big Data 2023.
• E.g., for date data structuralization:
  • Traditional UDFs require enumerating the date formats / utilizing different date-processing packages
  → Manually define the output format (YYYYMMDD) in the prompt and let the LLM handle the data processing

Luyi Ma, et al. LLMs with User-defined Prompts as Generic Data Operators for Reliable Data Processing. IEEE Big Data 2023.
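As an illustration of the idea, a user-defined LLM prompt for the YYYYMMDD example might look like the sketch below; the client and model name are assumptions (any chat-completion API works), not the paper's code:

```python
# A user-defined LLM prompt replacing a date-parsing UDF: the LLM absorbs
# format enumeration ("Jan 3, 2024", "03/01/24", ...) that a UDF would
# otherwise handle case by case.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

DATE_PROMPT = (
    "Convert the given date string into the format YYYYMMDD. "
    "Return only the converted value.\nDate: {value}"
)

def standardize_date(value: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content": DATE_PROMPT.format(value=value)}],
    )
    return resp.choices[0].message.content.strip()
```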
Prompt Engineering for Query Rewrite
q Problems of Rule-driven Rewriters
• Inadequacy of rules: insufficient for handling complex query transformations (e.g., merging sub-queries)
• Cross-system migration: different programming languages and SQL syntax

Example input:
  SELECT … WHERE … AND (
    EXISTS (SELECT a2 FROM t2 WHERE t2.b2=1)
    OR EXISTS (SELECT a2 FROM t2 WHERE t2.b2=2)
  )
q 1. Prompt-based Rewrite Case Generation
• Utilize the LLM to generate well-explained rewrite cases, i.e., (original query, rewritten query, rewrite rule, rewrite analysis).
q 2. Prompt-based Query Rewrite
§ The quality of the prompt impacts the performance of the LLM on different rewrites → Automatic Prompt Generation

2.1 Instruction Generation
• Write a prompt to instruct the LLM to generate instruction candidates from examples (e.g., five pairs):
  "I followed the instruction to rewrite the input SQL query to produce an equivalent SQL query... Based on the instruction, they produced the following input-output pairs: \n\n[example pairs] \n\nInstruction:"
• Rank the generated instruction candidates based on their benefit on a validation set (e.g., the cost reduction after rewriting with the instructed LLM), and output the best instruction
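A minimal sketch of this ranking step, under the assumption of two hypothetical helpers (`rewrite_with` calls the LLM with a candidate instruction; `plan_cost` asks the optimizer, e.g., via EXPLAIN, for an estimated plan cost):

```python
# Rank instruction candidates by their total estimated cost reduction
# over a validation workload, and return the best one.
def rank_instructions(candidates, validation_queries, rewrite_with, plan_cost):
    def benefit(instruction):
        total = 0.0
        for q in validation_queries:
            rewritten = rewrite_with(instruction, q)
            total += plan_cost(q) - plan_cost(rewritten)
        return total
    return max(candidates, key=benefit)  # output the best instruction
```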
2.2 Demonstration Example Generation
• Match the current query q with a few candidate rewrites whose input queries are similar to q
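A minimal sketch of similarity-based demonstration selection; the embedding model is an assumption, and `cases` stands for the candidate (input query, rewritten query) pairs:

```python
# Select the k rewrite cases whose input queries are most similar to q,
# to serve as few-shot demonstrations in the rewrite prompt.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def select_examples(q: str, cases: list[tuple[str, str]], k: int = 3):
    q_emb = model.encode(q, convert_to_tensor=True)
    case_embs = model.encode([c[0] for c in cases], convert_to_tensor=True)
    scores = util.cos_sim(q_emb, case_embs)[0]
    top = scores.topk(min(k, len(cases))).indices.tolist()
    return [cases[i] for i in top]
```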
q 3. Efficiency Issues
§ Search-based example matching is time-consuming
  → Prompt / finetune a model to identify the most suitable demo examples
§ Excessively long prompts can slow down LLM inference
  → Strike a balance between prompt length and LLM performance (e.g., generate a summary of the detailed description of rewrite rules)

Xuanhe Zhou, Zhaoyan Sun, Guoliang Li. DB-GPT: Large Language Model Meets Database. Data Science and Engineering 2023.
Prompt Engineering for Table Data Extraction
q LLMs cannot directly extract structured tables from long-context and semi-structured documents
q Core Idea
• For long documents, feed sampled documents to the LLM, and prompt it to generate useful information (schema and cell values) that can form a structured table (e.g., writing code to extract the values of important attributes)

(Figure: EVAPORATE-CODE+ (Doc2Table) pipeline.)

Arora S, Yang B, Eyuboglu S, et al. Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. arXiv 2023.
q Prompt-based Table Data Extraction
q Schema Synthesis
• With a sampled subset of documents, prompt the LLM to extract attributes based on their occurrence frequencies
• Rerank the extracted attributes by adjusting their frequency weights with the LLM
q Function Synthesis
• Extracting attribute values from every document is a heavy job → prompt the LLM to write code to extract the attribute values

Arora S, Yang B, Eyuboglu S, et al. Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. arXiv preprint arXiv:2304.09433, 2023.
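The function-synthesis step could be sketched as follows; `llm()` is a hypothetical completion helper and the prompt wording is illustrative, not EVAPORATE's exact prompt:

```python
# Ask the LLM once for an extraction function, then run that function over
# every document (instead of calling the LLM per document).
FUNC_PROMPT = """Here is a sample file excerpt:
{sample}
Write a Python function `extract(text: str) -> str` that returns the value of
the attribute "{attribute}" from such a file. Return only the code."""

def synthesize_extractor(llm, sample: str, attribute: str):
    code = llm(FUNC_PROMPT.format(sample=sample, attribute=attribute))
    namespace: dict = {}
    exec(code, namespace)        # trust boundary: sandbox this in practice
    return namespace["extract"]  # cheap to apply to thousands of documents

# extractor = synthesize_extractor(llm, docs[0][:2000], "release date")
# values = [extractor(d) for d in docs]
```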
Prompt Engineering for NL2Code
q Motivation: textual instructions → execute by generating code
• Core Idea
  • Design a prompt to drive the LLM to (1) decompose the input query into a series of simple processing steps and (2) translate the instructed steps into executable code

(Example prompt: database description + processing steps + natural language instructions.)

Trummer I. From BERT to GPT-3 Codex: Harnessing the Potential of Very Large Language Models for Data Management. VLDB, 2022.
Prompt Engineering for Knob Tuning (Workload Generation)
q Motivation
• ML-based methods require numerous workloads as training data
q Main Steps
• Workload generation: use manually-crafted GPT-4 prompts to generate diversified workloads for a specific database schema and workload type

Workload Generation Prompt:
  "You are a helpful DBA, tasked with generating challenging OLAP / OLTP workloads and fulfilling my goals."

Huang X, Li H, Zhang J, et al. LLMTune: Accelerate Database Knob Tuning with Large Language Models. arXiv preprint arXiv:2404.11581, 2024.
• LLM finetuning: train Mistral-7B on workloads labeled with configurations recommended by a Bayesian Optimization (BO)-based algorithm
  • LLM Input: workload features, internal metrics, query plan
  • LLM Output: the configuration change based on the previous configuration

(Figure: workloads generated by instructed GPT-4 are labeled by the BO-based tuner.)

Huang X, Li H, Zhang J, et al. LLMTune: Accelerate Database Knob Tuning with Large Language Models. arXiv preprint arXiv:2404.11581, 2024.
Prompt Engineering for Knob Tuning (Knowledge Extraction)
q Textual experience is not well utilized in knob tuning
→ Split the knowledge extraction task into four main steps, and manually design prompts to guide the LLM in each step
• 1. Extract knowledge from LLM: retrieve knowledge by (1) directly asking GPT or (2) prompting the LLM to summarize from documents
• 2. Filter noisy knowledge: prompt the LLM to evaluate whether the tuning knowledge conflicts with the system view

Lao J, Wang, et al. GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization. Proceedings of the VLDB Endowment, 2024, 17(8): 1939-1952.
• 3. Summarize knowledge from various resources, with priority Manual > Web > GPT
• 4. Check factual inconsistency with the document sources

Lao J, Wang, et al. GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization. Proceedings of the VLDB Endowment, 2024, 17(8): 1939-1952.
LLM Prompt for Database Diagnosis
q Improve the diagnosis prompt with matched knowledge and tools
• 1. Anomaly description for triggered alerts, filled into a prompt template:
  - Expert Description: <Role>; <Task>; <Steps>
  - The anomaly alerts are {Anomaly Info}
  - In this step, you can use the following tools: {Tool APIs}
  - The matched knowledge is: {Knowledge}
  (followed by demonstration examples of available tool usage)
• 2. Tool matching with a finetuned embedding model: rank each tool API t_i against the context s by embedding similarity, e.g., sim(s, t_i) = cos(emb(s), emb(t_i)), where emb(·) is the embedding function of the fine-tuned Sentence-BERT model
• 3. Knowledge matching by metric attributes: score each candidate knowledge chunk D against the abnormal metric q (example knowledge chunks come from past diagnosis reports)

Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, et al. D-Bot: Database Diagnosis System using Large Language Models. VLDB 2024.
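A minimal sketch of step 2 (embedding-based tool matching); the checkpoint name is a stand-in, since D-Bot fine-tunes its own Sentence-BERT model:

```python
# Rank tool APIs by cosine similarity between the diagnosis context and
# each tool's description; the top-k tools go into the prompt.
from sentence_transformers import SentenceTransformer, util

emb = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the finetuned model

def match_tools(context: str, tool_descriptions: dict[str, str], k: int = 3):
    names = list(tool_descriptions)
    scores = util.cos_sim(
        emb.encode(context, convert_to_tensor=True),
        emb.encode([tool_descriptions[n] for n in names], convert_to_tensor=True),
    )[0]
    ranked = sorted(zip(names, scores.tolist()), key=lambda x: -x[1])
    return ranked[:k]
```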
Summarization of Prompt-based Data Management

| Task | LLM: Job | Operations | Prompt (Content) | Prompt (Examples) | Efficiency Issue |
| Data Standardization | GPT: Modify data format | Four preprocessing operations | Instruction | Manual | N/A |
| Query Rewrite | GPT: Query rewrite | Typical logical transformations | Instruction + Example | Ranking-based selection | 1. Small model for example selection; 2. Prompt compression; 3. Past-rewrite reuse |
| Table Data Extraction | GPT: Doc2table | Schema extraction; value extraction | Instruction + Example | Manual | N/A |
| NL2Code | GPT: Text2code | Processing steps and comments | Instruction | Manual | N/A |
| Knob Tuning | GPT: Workload generation; Mistral: Knob tuning | Workload generation + knob tuning | Instruction | Manual | N/A |
| Knob Tuning | GPT: Knowledge extraction | Four steps for knowledge extraction | Instruction + Example | Manual | N/A |
| Database Diagnosis | GPT / Llama: Root cause analysis | Three analysis and tool-calling steps | Instruction + Example + Knowledge/Tool | Matched by external info | N/A |
Take-aways
p Prompt generation
  • Existing prompts mostly depend on human experts to craft high-quality instructions and examples, which has a scalability problem
  • Prompt examples can be generated from humans, real cases, LLMs (e.g., adding explanations), and traditional algorithms (e.g., search-based for configs)
p Instruction selection
  • Instruction format can affect LLM performance (especially for weak LLMs)
  • The order of instructions / different examples can affect LLM performance
p Prompt selection
  • Prompt selection is critical to the success of the current task
  • There is no one-size-fits-all prompt; prompts need modifications for different LLMs
p Efficiency improvement: prompt compression, model-based example selection, and batch inference
LLM Agent for Data Management
LLM Agent
q Human-crafted prompt for each task → design automatic agents that execute complex tasks with predefined prompt templates
• LLM agents: an LLM system that can automatically resolve a series of domain-specific tasks with minor human involvement

(Figure: LLM Prompting vs. LLM Agent.)
Advantages of LLM Agent
q Autonomy
• Execute human instructions and complete tasks
• Initiate and execute operations without explicit human requests

q Reactivity
• Respond rapidly to immediate changes and interact with its environment

q Pro-activeness
• Anticipate the future, make plans, and take proactive measures to achieve specific goals

https://www.superannotate.com/blog/llm-agents
q Social (Multi-Agent) Ability
• Interact with other agents (including humans); generate and understand natural language

q Learning Ability
• Adaptively integrate new tools and refine execution pipelines based on environment feedback
• Memorize experience in both external knowledge bases and model parameters

q Mobility
• Easy to generalize to new platforms (e.g., Postgres → MySQL) and scenarios (e.g., specifying new workloads) via prompt-level modification

https://www.superannotate.com/blog/llm-agents
Main Components of LLM Agent
q Main components
• Agent Model: a powerful language model that can (1) conduct basic reasoning by following the instruction; (2) generate tool-calling commands and understand tool outputs …
• Planning: decompose complex tasks and conduct step-by-step reasoning
• Memory: store records of interactions from the agent's (long/short-term) previous tasks
• Tools: manage the agent's calling of external API tools

https://www.superannotate.com/blog/llm-agents
Example LLM Agent
(Example: an LLM-powered data-analytics agent by Nvidia that mines raw stock-market data (structured) and financial reports (unstructured).)

https://developer.nvidia.com/blog/build-an-llm-powered-data-agent-for-data-analysis/
Challenges & Techniques of Agent Components
q Agent Model: how to (1) conduct basic reasoning by following the instruction; (2) generate tool-calling commands and understand tool outputs …
• The selection / finetuning of a suitable agent model is vital to performance
  • GPT-3.5 Turbo is effective as a code interpreter, particularly in its ability to understand and translate complex task descriptions into functional code
  • GPT-4 demonstrates weaker consistency compared with GPT-3.5

Zeeshan Rasheed, et al. Large Language Model Evaluation Via Multi AI Agents: Preliminary Results. arXiv, 2024.
q Planning: how to decompose complex tasks and conduct step-by-step reasoning
• Chain-of-Thought (CoT): explicitly break down the task into a sequence of intermediate steps
• Tree-of-Thought (ToT): decompose the thought process into multiple branches and sub-nodes
• Graph-of-Thought (GoT): LLM thoughts as vertices, edges as thought dependencies → arbitrary thoughts can be aggregated by constructing vertices with multiple incoming edges

Maciej Besta, et al. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. AAAI 2024.
(Figure: decomposing a task into a tree of thoughts.)
q Memory: how to store historical messages for caching or effective task-solving
Ø The LLM may forget past actions when resolving the same task
  → Short-Term Memory: memory information is directly written into the prompt (e.g., for maintaining internal state while executing a task)
Ø Effective plans for historical tasks may be useful for the current task
  → Long-Term + Short-Term Memory: long-term memory for stable knowledge (e.g., reuse behaviors and thoughts from past plans in the current situation); short-term memory for flexible planning (e.g., adjust the plan with recent feedback)
• Memory formats: different memory formats possess distinct strengths

| Memory Format | Advantage | Case |
| Natural language | Flexible; rich semantics | Reflexion: stores experiential feedback in natural language within a sliding window |
| Embeddings | Benefit retrieval | ChatDev: encodes dialogue history into vectors for reuse |
| Databases | Support complex queries | DB-GPT: agents are fine-tuned to understand and execute SQL queries, enabling them to interact with databases |
q Tools: how to prepare tool APIs for better tool calling and result understanding
• Vast number of tools (length and latency issues) → Tool Selection
  • Retriever-based tool selection
  • LLM-based tool selection
• Adherence to API parameters and formats → Tool Calling
  • LLM finetuning
  • Output constraints (e.g., in JSON format)
  • …

Changle Qu, et al. Tool Learning with Large Language Models: A Survey. arXiv 2024.
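To illustrate the output-constraint point, here is a minimal sketch of JSON-constrained tool calling with argument validation before execution; the tool registry and schema are illustrative assumptions:

```python
# Parse the LLM's JSON tool call, validate required arguments, then dispatch.
import json

TOOLS = {
    "whether_is_abnormal_metric": {
        "required": {"metric", "time"},
        "fn": lambda metric, time: f"checked {metric} in {time}",
    },
}

def execute_tool_call(llm_output: str) -> str:
    call = json.loads(llm_output)  # expect {"action": ..., "arguments": {...}}
    spec = TOOLS[call["action"]]
    args = call["arguments"]
    missing = spec["required"] - args.keys()
    if missing:
        # Feed this back to the LLM so it can repair the call.
        return f"Error: missing arguments {sorted(missing)}"
    return spec["fn"](**args)

print(execute_tool_call(
    '{"action": "whether_is_abnormal_metric",'
    ' "arguments": {"metric": "cpu_usage", "time": "xxx - xxx"}}'
))
```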
LLM Agent Techniques for Data Management

(Figure: per-work overview of agent designs —
🏀 Tools (unified API design); Multi-Agent (e.g., split into chatter / annotator / executor)
🏀 Action (e.g., document retrieval); Memory (Q&A caching); human feedback & risk estimation
🏀 Tools (monitoring / optimization tools); Action (e.g., call tools, analyze, output causes); Planning (tree of thought); Memory (incremental summary of past messages); Multi-Agent (e.g., split into chief DBA / domain experts))
LLM Agent for Data Standardization
q Problem: the complexity of tools like Pandas requires great human effort to write code for various column types
• Data standardization: unify the format of values within the same column (e.g., the "Admission Date" column)
• Case-by-case analysis and coding ❌

Danrui Qi, Jiannan Wang et al. CleanAgent: Automating Data Standardization with LLM-based Agents. arXiv 2024.
q Core Idea: convert textual instructions into declarative API calls and automate data standardization with an LLM agent
q Challenges
• (1) How to design declarative and unified tool APIs for data standardization?
• (2) How to optimize the interaction between data scientists and the LLM agent?
q CleanAgent
Ø Tools
• Unified tool API with three arguments:
  df: input dataframe (table)
  column_name: the column that needs to be standardized
  target_format: the target standardization format users specify

Danrui Qi, Jiannan Wang. CleanAgent: Automating Data Standardization with LLM-based Agents. arXiv 2024.
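A minimal sketch of what such a declarative call could look like; the function name `clean_column` and the format registry are hypothetical stand-ins (CleanAgent itself builds on Dataprep-style cleaning APIs):

```python
# One declarative call per column: the agent only picks the column and the
# target format, instead of writing bespoke cleaning code.
import pandas as pd

FORMATTERS = {  # illustrative target formats
    "YYYYMMDD": lambda s: s.map(lambda v: pd.to_datetime(v).strftime("%Y%m%d")),
    "lower": lambda s: s.astype(str).str.lower(),
}

def clean_column(df: pd.DataFrame, column_name: str, target_format: str) -> pd.DataFrame:
    out = df.copy()
    out[column_name] = FORMATTERS[target_format](out[column_name])
    return out

df = pd.DataFrame({"Admission Date": ["Jan 3, 2024", "2024-02-10"]})
print(clean_column(df, "Admission Date", "YYYYMMDD"))
```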
Ø Multi-Agent for Planning (ChatGPT as the model)
• Chat Manager (Memory): stores historical messages
• Column Annotator (Model): annotates the type of each table column
• Python Programmer (Model): generates code with candidate APIs
• Code Executor (Action): executes code and feeds the result to the Chat Manager

Danrui Qi, Jiannan Wang. CleanAgent: Automating Data Standardization with LLM-based Agents. arXiv 2024.
LLM Agent for Database Diagnosis
• Numerous documents: a single database product provides 10,000+ pages of materials (e.g., operational guides and instructions); it is tricky for junior DBAs to fully grasp all of this
• Significant operational pressure: the number of cloud database instances is massive, but there is a shortage of DBAs
• Complex issues: many urgent problems are interconnected, making it hard to respond quickly, resulting in economic losses

(Figure: various root causes — query quality, caches, parameters, query plan, indexes, audit, others; 1M+ cloud instances; correlated online issues.)

Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, et al. D-Bot: Database Diagnosis System using Large Language Models. VLDB 2024.
• Goal: utilize the LLM as an "intelligent DBA assistant" that learns from human operational experience and diagnoses root causes
• Benefits: save manpower; enhance proficiency in using maintenance tools; improve the ability to generalize operational knowledge

(Figure: the DB agent reasons over knowledge from documents, e.g., "Thought: high memory usage seems to be caused by poor join performance and much inactive memory. Reasoning: poor joins can be solved by plan optimization. Action: optimize_query_plan." Other methods like expert systems are hard to reuse in similar cases.)

Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, et al. D-Bot: Database Diagnosis System using Large Language Models. VLDB 2024.
• Tool API: monitoring tools (e.g., logs, views, metrics); optimization tools (e.g., index)
• External knowledge extraction: segmentation of text blocks; generation of summary indexes; extraction of formalized knowledge
• Planning: improving tool-usage accuracy through a tree-search-based method
• Multi-Agent optimization: Chief DBA (diagnostic scheduling, finding summarization); domain experts (e.g., resource anomaly, query optimization); chat summarizers; users (providing effective feedback)

Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, et al. D-Bot: Database Diagnosis System using Large Language Models. VLDB 2024.
Agent-Based Data Management

| Methods | Model | Memory | Tool | Extra Knowledge | Planning | Multi-Agent |
| Data Standardization | GPT | Historical messages | Declarative APIs | N/A | Iterative | N/A |
| Database Diagnosis | GPT-4, Llama | Results of historical actions | Monitoring tools; optimization tools | Documents | Tree of Thought | √ |
Take-aways
p Agents integrate capabilities like planning, obtaining external knowledge, and reacting (i.e., tool learning), and so work better than prompt-only methods
p Existing tool APIs and prompt templates for agents are manually written, which also has a scalability problem
p Planning is vital for agents to resolve problems like early stopping, hallucination (e.g., inaccurate tool calling), and self-inconsistency
p Memory reminds the LLM of historical messages (avoiding repeated actions or serving as a cache), but also causes great prompt redundancy
p Multi-agent designs offer new mechanisms like collaboration (vs. single-agent), but the superiority needs further exploration, especially in real-world cases (where a single agent already incurs great overhead)
RAG for Data Management
Motivation of Retrieval Augmented Generation

Drawbacks of LLMs
• Hallucination
• Outdated information
• Low efficiency in updating parameterized knowledge
• Weak reasoning capability

Practical Requirements
• Domain-Specific Accurate Q&A
• Frequent Data Update
• Explainability of Generated Answers
• Controllable Cost
• Data Privacy Protection

(Figure: a motivating example.)
Retrieval and Generation Pipeline

Retrieval phase:
• The given corpus is segmented into many chunks.
• For a question, a retriever identifies and selects the top-K most related chunks as context.
Generation phase:
• The question, alongside the context, is input into an LLM to generate an answer.

(Figure: corpus → segmented chunks → embedding model → chunk embeddings stored in a vector database; question → embedding model → query vector → top-N chunks retrieved; prompt engineering combines question and chunks, and the LLM generates the answer for the user.)
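A minimal end-to-end sketch of this pipeline (chunk → embed → top-K retrieve → prompt); `llm()` is a hypothetical completion helper and the embedding model is an assumption:

```python
# Chunk the corpus, embed chunks and question, retrieve top-k by cosine
# similarity, then prompt the LLM with the retrieved context.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def chunk(corpus: str, size: int = 500) -> list[str]:
    return [corpus[i:i + size] for i in range(0, len(corpus), size)]

def answer(question: str, corpus: str, llm, k: int = 3) -> str:
    chunks = chunk(corpus)
    chunk_embs = embedder.encode(chunks, convert_to_tensor=True)  # offline in practice
    q_emb = embedder.encode(question, convert_to_tensor=True)
    top = util.cos_sim(q_emb, chunk_embs)[0].topk(min(k, len(chunks))).indices
    context = "\n---\n".join(chunks[i] for i in top)
    prompt = (f"Answer the question using only the context below.\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return llm(prompt)
```

In a production system the chunk embeddings live in a vector database rather than being recomputed per query.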
Limitations of RAG

Retrieval limitation:
• Chunk selection based on embedding similarity may lead to misleading or irrelevant chunks and miss crucial information.
• Methods:
  • Semantic segmentation
  • Reranking
  • Retrieval techniques

Generation limitation:
• LLMs may produce content that is not supported by the retrieved context.
• Methods:
  • Prompt engineering
  • Using LLMs with high proficiency levels
Semantic Segmentation
Motivation: segment the corpus into semantically coherent chunks, ensuring that the retrieved information is semantically complete and relevant.

Training:
• Collect sentence pairs with labels (label = 1 means the pair is semantically relevant).
• Fit the mapping between sentence pairs and labels using a model (embedding model + MLP).
Inference:
• Every two adjacent sentences in the corpus are judged by the segmentation model.
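A minimal sketch of such a segmentation model (embedding model + MLP); the checkpoint, dimensions, and threshold are assumptions, and training on the labeled pairs is omitted:

```python
# Score whether two adjacent sentences belong to the same chunk; a score
# below the threshold marks a chunk boundary.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

mlp = nn.Sequential(  # trained on (sentence pair -> label) data
    nn.Linear(2 * 384, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid(),
)

def segment(sentences: list[str], threshold: float = 0.5) -> list[list[str]]:
    embs = torch.tensor(embedder.encode(sentences))
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        pair = torch.cat([embs[i - 1], embs[i]])
        if mlp(pair).item() >= threshold:   # semantically relevant: same chunk
            current.append(sentences[i])
        else:                               # boundary: start a new chunk
            chunks.append(current)
            current = [sentences[i]]
    chunks.append(current)
    return chunks
```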
Reranking Technique
Reranking: reorder chunks according to their relevance to the question, and then select the top K based on a specific metric (e.g., relevance).
• Enhancer: boost the relevance and quality of the context.
• Filter: filter out misleading or irrelevant chunks.

Example (UPR | EMNLP 2022): use a PLM to compute the probability of the input question conditioned on a retrieved passage.

Devendra Singh Sachan et al. Improving Passage Retrieval with Zero-Shot Question Generation. EMNLP 2022.
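A minimal sketch of UPR-style scoring with a seq2seq PLM: rank each passage by log p(question | passage). The `t5-small` checkpoint is a small stand-in (UPR uses larger instruction-tuned models):

```python
# Score each passage by the log-likelihood of the question given the passage,
# then sort passages by that score.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")            # assumed checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

@torch.no_grad()
def upr_score(passage: str, question: str) -> float:
    prompt = f"Passage: {passage}. Please write a question based on this passage."
    inputs = tok(prompt, return_tensors="pt", truncation=True)
    labels = tok(question, return_tensors="pt").input_ids
    # The model's loss is the average token cross-entropy over the labels;
    # negate it to get a (length-normalized) log-likelihood.
    return -model(**inputs, labels=labels).loss.item()

def rerank(passages: list[str], question: str) -> list[str]:
    return sorted(passages, key=lambda p: upr_score(p, question), reverse=True)
```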
Retrieval Technique

Retrieval augmentation:
• Iterative Retrieval
• Recursive Retrieval
• Adaptive Retrieval

Yunfan Gao, Yun Xiong, Xinyu Gao, et al. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2023.
• Iterative Retrieval: repeatedly search based on the initial query and the generation result.
  Advantage: enhance the robustness of subsequent answer generation.

Example (ITER-RETGEN | EMNLP 2023): iterate for a specified number of times; in each iteration, use the generated content and the initial query to retrieve.

Zhihong Shao, Yeyun Gong et al. Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy. EMNLP 2023.
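A minimal sketch of this loop; `retrieve()` and `llm()` are hypothetical helpers:

```python
# Each iteration retrieves with (previous generation + initial query),
# then regenerates the answer with the newly retrieved context.
def iter_retgen(question: str, retrieve, llm, iterations: int = 3) -> str:
    generation = ""
    for _ in range(iterations):
        docs = retrieve(f"{generation} {question}".strip())
        context = "\n".join(docs)
        generation = llm(
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )
    return generation
```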
• Recursive Retrieval: iteratively refine search queries based on previous retrieval results.
  Advantage: improve the depth and relevance of search results.

Example (IRCoT): iteratively repeat two steps:
• Generate a CoT (chain-of-thought) sentence based on the retrieved corpus and the question.
• Retrieve with the previous CoT sentence.

Harsh Trivedi et al. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. ACL 2023.
• Adaptive Retrieval: enable LLMs to actively determine the moments for retrieval.
  Advantage: enhance the efficiency and relevance of the information sourced.

Example (FLARE): retrieve when the LLM generates tokens with probabilities below a set threshold.

Zhengbao Jiang, Frank F. Xu, Luyu Gao et al. Active Retrieval Augmented Generation. EMNLP 2023.
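A minimal sketch of FLARE-style triggering; all three helpers are hypothetical, and the threshold is an assumed hyperparameter:

```python
# Generate a tentative next sentence with token probabilities; if any token
# falls below the confidence threshold, retrieve and regenerate that sentence.
def flare_generate(question: str, generate_with_probs, retrieve, llm,
                   max_sentences: int = 8, threshold: float = 0.4) -> str:
    answer = ""
    for _ in range(max_sentences):
        sentence, token_probs = generate_with_probs(question, answer)
        if not sentence:
            break
        if min(token_probs) < threshold:
            # Low-confidence tokens trigger retrieval; the tentative
            # sentence itself serves as the search query.
            docs = retrieve(sentence)
            sentence = llm(f"Context:\n{''.join(docs)}\n\n"
                           f"Question: {question}\nContinue: {answer}")
        answer += sentence
    return answer
```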
Insights with Experimental Examples

• Noisy retrieved chunks considerably undermine the effectiveness of RAG systems.
• Precise retrieval is a predominant part of RAG.
• The proficiency level of LLMs plays a crucial role in RAG.
• Embedding models, though useful, are not as important as LLMs.
Graph RAG
• Graph RAG includes a graph database as a source of the retrieval context sent to the LLM.
• Retrieval methods: entity linking & NL-to-graph-query.

Advantages:
• Provides extra information like metadata for generation.
• Handles complex and nuanced queries.
• Supports data updates.

Example: https://github.com/falkorDB/ufc
RAG for Text2SQL

• Pre-Processing
• Schema Linking
• Database Content Retrieval
• Additional Information Acquisition
• NL2SQL Translation Methods
• Encoding Strategy
• Decoding Strategy
• Task-specific Prompt Strategies
• Intermediate Representation
• Post-Processing
• Correction
• Consistency
• Execution-Guided
• N-best Rerankers
Paper: https://arxiv.org/pdf/2408.05109
NL2SQL Handbook: https://github.com/HKUSTDial/NL2SQL_Handbook
RAG for Text2SQL

(Figures: NL2SQL design space and NL2SQL360 automated search.)

Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, Nan Tang. The Dawn of Natural Language to SQL: Are We Fully Ready? VLDB 2024.
RAG for Text2SQL

Multi-source RAG for QA
• Knowledge construction
• Knowledge retrieval
• Learning to embed and search
• Adaptive in-context learning (ICL) and generation by LLM
  • ICL is a technique used to improve LLMs' performance in handling contextual information by incorporating additional context during the training or inference phase

Xue S, Jiang C, Shi W, et al. DB-GPT: Empowering Database Interactions with Private Large Language Models. arXiv preprint arXiv:2312.17449, 2023.
RAG for Q&A Diagnosis

It is hard to accurately answer debugging questions with citations:
• 1. RAG-based document retrieval
• 2. Feature retrieval: identify the top-K feature names (e.g., queries, database, schema) from the retrieved documents
• 3. Context aggregator: merge retrieved docs (taking the top-3 chunks if exceeding the prompt limit) and features in JSON format → the LLM gives the root causes
• 4. Source citation: append the retrieved docs as passages (generated by the LLM)

Vikramank Singh et al. Panda: Performance Debugging for Databases using LLM Agents. CIDR, 2024.
Take-aways
p RAG is very important for adapting to vertical domains and supporting data updates
p It not only improves efficiency (vector database, vector index) but also improves recall (multiple retrievers, segmentation, embedding models)
p It is important to combine Prompt, RAG, and Agent to answer complex tasks
p It requires building a RAG system
Finetuning for Data Management
Abilities Gained by Fine-Tuning

(Diagram: abilities required from finetuning —
• Prompt: target & terminology understanding; instruction following
• RAG: domain knowledge integration
• Agent: tool calling; task decomposition)
Example: Prompt Only → Prompt + Finetuning

Query Rewrite Prompt
Task Description
  Translate the input query into an equivalent one that can be executed on Postgres with decreased latency.
Input
  select t1.* from t1 where t1.col1>(select max(t2.col2) from t2 where t2.col1 in (select t1.col1 from t1 where t1.col1=t2.col1));

Prompt Only — Output (LLMx-7B):
  (Translates the input query into Chinese)

Prompt + Finetuning — Output (LLMx-7B finetuned over a rewrite dataset):
  (Analysis of how to rewrite the input query)
  select t1.* from t1 inner join (select max(t2.col2) max, t2.col1 from t2 group by t2.col1) as t2 on (t1.col1=t2.col1) where t1.col1>max;

Benefits of Finetuning
• Domain knowledge memorization
• Redundant instructions (e.g., task, example, style) → a simple prompt
• Output quality
Typical Techniques for LLM Finetuning

q Reparameterization (LoRA): update layer parameters in a low-rank subspace

  H_out = W_0 H_in + ΔW H_in = W_0 H_in + B A H_in

• H_in and H_out are the input and output of the attention layer
• W_0 is the frozen model weights (MLP and attention layers)
• ΔW has much fewer parameters than W_0
• B, A are low-dimensional matrices approximating ΔW

• Given a base dimension of 1024 and LoRA rank r = 4:
  • #-parameters of weight W is 1024 × 1024 ≈ 1M
  • #-parameters of A and B are each r × 1024 ≈ 4K
• In this way, we train only about 0.8% of the parameters to update the LLM

Edward J. Hu, Yelong Shen, Phillip Wallis, et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
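The math above can be made concrete with a minimal LoRA layer sketch (an illustration under the slide's dimensions, not the paper's implementation):

```python
# A frozen weight W0 plus a trainable low-rank update B @ A,
# with d = 1024 and rank r = 4 as in the slide.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int = 1024, r: int = 4):
        super().__init__()
        self.W0 = nn.Linear(d, d, bias=False)            # pretrained, frozen
        self.W0.weight.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)  # r x d, trainable
        self.B = nn.Parameter(torch.zeros(d, r))         # d x r, init 0 => dW = 0

    def forward(self, h_in: torch.Tensor) -> torch.Tensor:
        # H_out = W0 H_in + B A H_in
        return self.W0(h_in) + h_in @ self.A.T @ self.B.T

layer = LoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 4 * 1024 = 8192 trainable params vs ~1M frozen
```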
q LLM Adapter: design and train additional learned modules for specific layers in an LLM

• H_in and H_out are the input and output of the attention layer
• H_in is down-projected by W_down and then up-projected by W_up

Different from LoRA, an adapter can introduce new capabilities (e.g., image understanding) after finetuning.

Zhiqiang Hu et al. LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models. EMNLP, 2023.
LLM Finetuning Techniques for Data Management

(Figure: per-work overview — existing works pair dataset preparation with either LoRA or an adapter:
🏐 Dataset Preparation + LoRA
🏐 Dataset Preparation + LoRA
🏐 Dataset Preparation + Adapter)
LLM Finetuning for Query Rewrite
q Problems of Prompt-Only Rewriters
• Prompt engineering has been criticized for limited knowledge capacity and unstable performance

q Finetuning for Query Rewrite
q Training data ~ (original query, rewritten query, steps)
§ Original Query: (1) from various schemas; (2) simple queries with atomic patterns; (3) complex queries merged from simple ones using an LLM like GPT-4
§ Rewritten Query: (1) heuristic policy; (2) volcano policy; (3) option Monte-Carlo tree search

(Abilities gained: targets & terminologies; domain knowledge)

DB-GPT: Large Language Model Meets Database. Data Science and Engineering 2023.
§ Steps: prompt an LLM to explain the rewrite procedure in detail

(Abilities gained: targets & terminologies; logical plan structures)

DB-GPT: Large Language Model Meets Database. Data Science and Engineering 2023.
LLM Finetuning for Tool Learning in Diagnosis Agent
q Train the LLM to select and call diagnosis APIs
Ø Prepare 1500 samples, split 8:1:1
Ø Converges after training for 8 epochs
Ø Accuracy rate over the test set: 149/150

Example trajectory (API calling by GPT-4), triggered by an anomaly alert:
(Step 1)
  Thought: Determine if the CPU usage was indeed abnormal
  Action: whether_is_abnormal_metric
  Arguments: {metric: "cpu_usage", time: "xxx - xxx"}
  Observation 1: The metric cpu_usage is abnormal
  Observation 2: The service is unavailable
(Step 2, with matched knowledge: cpu_relevant_metrics)
  Output 1 (early stop — diagnosis failure):
    Thought: The anomaly is caused by high CPU usage …
    Action: Finish
  Output 2:
    Thought: Investigate metrics like node_procs_running …
    Action: obtain_metric_values
    Arguments: {metrics: ["node_procs_running", …], time: ...}

Data format for Llama-13B finetuning:
  {"input": "analyse the status of sockstat_UDPLITE_inuse in xxx for the next 2 hours",
   "output": "Action: risk-analysis Action Input: {'metric': 'node_sockstat_UDPLITE_inuse', 'instance': '10.79.26.157:15766'}"}

Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, et al. D-Bot: Database Diagnosis System using Large Language Models. VLDB 2024.
LLM Finetuning for Diagnosis Agent

q Finetune a diagnosis model (LoRA) over five subtasks
• Five diagnosis subtasks with 2819 finetuning samples
  • Analysis; summarization (targets & terminologies); agent selection (task decomposition); tool calling
• Good testing performance: good at subtasks (e.g., tool calling, abnormal-metric analysis), but highly reliant on the quality of the finetuning data

Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, et al. D-Bot: Database Diagnosis System using Large Language Models. VLDB 2024.
LLM Finetuning for Table Learning

q Design and finetune a table-text model for tabular data tasks
• Model structure for new-modality learning
  • Visual encoder (pretrained ViT model) + vision-language connector (two-layer MLP) + foundation LLM (Vicuna)
• Model finetuning
  • The vision-language connector and LLM are jointly finetuned with instruction-following data of multi-modal tabular tasks

(Figure: textual Q&A over table images.)

Mingyu Zheng et al. Multimodal Table Understanding. ACL, 2024.
Take-aways
p Finetuning is critical to domain tasks, such as (1) terminology and syntax learning, (2) instruction following, (3) tool calling, and (4) even supporting new modalities
p The first three abilities can be achieved via reparameterization, but new-modality learning requires carefully-designed adapters
p The updated parameters and finetuning techniques depend on factors such as task complexity and the pretraining data acquired by the LLM
p Training data is vital to finetuning performance, which relies on data discovery, data processing, and expert insights
p LLMs can be enhanced by combining both finetuning and RAG
Data Preparation for LLM

| Stages \ LLM Training | Pretraining | Incr.-Pretraining | Finetuning | RLHF | RAG | Prompting |
| Data Selection | √ | √ | √ | √ | √ | √ |
| Data Cleaning | √ | √ | √ | - | - | - |
| Data Augmentation | √ | √ | √ | √ | √ | √ |
| Data Labeling | - | - | √ | √ | √ | √ |
| Data Mixing | √ | √ | √ | √ | √ | √ |
| Data Synthesis | √ | √ | √ | √ | - | - |
| Vector Embedding | - | - | - | - | √ | - |

(Figure: using a large-scale dirty dataset directly incurs high cost and low accuracy; cleaning (deduplication, outlier detection) plus data selection yields a coreset with low cost and high accuracy.)
Open Problems
Open Problem 1: Database-Specific LLM
p Case-by-case LLM finetuning → database-specific LLM construction
Ø Pretrain: collect sufficient database-domain tokens (e.g., in millions) as pre-training corpora from sources like database textbooks and query analysis
Ø Finetune: instruction understanding in SQL / text → basic Q&A (DB / product / instance) → task-solving in DB domains → alignment to database experts
Ø Evaluation: evaluate the accuracy and robustness of the database model with a carefully-crafted validation dataset, well-defined metrics, and an end-to-end testbed

(Figure: a database-specific LLM supports general Q&A, product Q&A, instance Q&A, diagnosis, SQL rewrite, and config tuning.)
Open Problem 2: Tabular Data Learning
p Existing adapters still cannot handle the following:

Ø Complex table structure understanding

Ø Alignment of table structure, table content, and in-context text

Ø Processing excessively large tables

Leiyuan Chen et al. TableVLM: Multi-modal Pre-training for Table Structure Recognition. ACL 2023.
Open Problem 3: Data Preparation System for LLM
p An effective system for preparing LLM data

Ø Data & model co-design

Ø Big data curation for different LLM stages

Ø Data synthesis: high LLM performance & privacy-preserving

Ø Data flywheel: self-data-reinforcing loop

Ø Data quality evaluation
Open Problem 4: Easy-to-Use LLM (Agent)
q How to reduce the usage costs of LLMs / LLM-based agents

Ø LLM distillation: large powerful LLM → cost-efficient LLM

Ø Training or RAG? (e.g., according to characteristics like update frequency)

Ø Prompt management: prompt template library + automatic template generation

Ø Tool API: manual generation → automatic API integration (e.g., from programs)

Ø Agent-as-a-Service: careful agent design → one-click agent generation

Ø …
Thanks!
Thanks for the help & advice from Jintao Zhang, Zhaoyan Sun, Yuyang Gao, Wei Zhou, and James Pan.
Slides: https://dbgroup.cs.tsinghua.edu.cn/ligl/activities.html
Complete Paper List: https://github.com/code4DB/LLM4DB
