LLM For Data Management

Wayne Xin Zhao, Kun Zhou, Junyi Li et al. A Survey of Large Language Models. arXiv, 2023.
Motivations of LLM for Data Management
q Opportunities of LLM for data management
• Improved usability (e.g., Text2SQL)
• Improved performance (e.g., Query Rewrite)
• Improved maintainability (e.g., Database Diagnosis)
• Improved interpretability (e.g., Database Tuning)
Example: LLM-based query rewrite
• Rewrite Analysis: "… Convert the sub-query into a join between the "emp" and "dept" tables …"
• Rewritten SQL: SELECT … FROM emp INNER JOIN dept ON … AND emp.empno=dept.deptno;
• Model improvement loop: Supervised Finetuning, Reinforcement Learning, Active Learning, driven by feedbacks and case generation
Transformer-based LLM: predicts the next word given a sequence of previous text (see the sketch below)
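To make "predict the next word" concrete, here is a minimal greedy next-token sketch. It assumes the Hugging Face transformers library and public GPT-2 weights; any causal LM would behave the same way.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "SELECT name FROM emp WHERE"
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits           # shape: (1, seq_len, vocab_size)
next_id = logits[0, -1].argmax()         # greedy: most likely next token
print(tok.decode(next_id))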
LLM Finetuning
q Finetuning an LLM over a labeled dataset is used for instruction following and learning task-specific knowledge
q RLHF is used to incorporate human feedback and align the model (preference alignment)
Prompt for LLM Inference
q Input text for the LLM to generate a response or execute a task
• Simple Prompt
• (task) "Explain the theory of relational tables."
• Contextual Prompt
• (context) "Undergraduate students are studying database concepts for the first time and are curious about fundamental theories."
• (task) "Explain the theory of relational tables in a way beginners can understand."
Lei Wang et al. A Survey on Large Language Model based Autonomous Agents. arXiv 2023.
RAG for LLM Inference
q Drawbacks of LLMs
• Hallucination
• Outdated information
• Low efficiency in LLM training
• Weak reasoning capability
q Practical Requirements
• Domain-Specific Accurate Q&A
• Frequent Data Update
• Explainability of Responses
• Controllable Cost
• Data Privacy Protection
(A motivating example.)
Overview of LLM Challenges and Solutions
q Give conflicting outputs for very similar prompts
• Task decomposition; prompt multiple times and vote; self-reflection …
q Generate text that seems realistic and plausible but is actually inaccurate
• RAG; write instructive prompts that ask for sources/evidence or call tools …
q Cannot automatically retain information from previous chats or update in time
• Cache and reuse historical messages …
q Resource: have memory limits on how much text they can process at once
• Chunking; embedding; prompt compression; RAG + vector databases …
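A minimal sketch of the "prompt multiple times and vote" (self-consistency) mitigation above; ask_llm is a hypothetical wrapper around any sampling-enabled LLM API.

from collections import Counter

def ask_llm(prompt: str) -> str:
    raise NotImplementedError  # call your LLM with temperature > 0

def self_consistent_answer(prompt: str, n: int = 5) -> str:
    answers = [ask_llm(prompt) for _ in range(n)]   # n independent samples
    return Counter(answers).most_common(1)[0][0]    # majority vote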
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. CoRR abs/2309.01219 (2023)
https://promptdrive.ai/llm-limitations/
Overview of LLM4DB Framework
q Data Management tasks
q LLM Prompt for Data Management
• Instruction Prompting
• Few-Shot Prompting
q LLM Agent for Data Management
• Agent Models + Memory
• Reasoning / Planning Strategies
• Tool Management & Learning
q RAG for Data Management
• Semantic Segmentation
• Result Retrieval
• Result Reranking
q Finetuning for Data Management
• Reparameterization / LLM Adapter
q Data Preparation for LLM
q Open Problems
Overview of Data Management Tasks
q Data Management tasks
q LLM Prompt for Data Management
• Instruction Prompting
• Few-Shot Prompting
q LLM Agent for Data Management
• Agent Models + Memory
• Reasoning / Planning Strategies
• Tool Management & Learning
q RAG for Data Management
• Semantic Segmentation
• Result Retrieval
• Result Reranking
q Finetuning for Data Management
• Reparameterization / LLM Adapter
q Data Preparation for LLM
q Open Problems
Overview of Data Management Tasks
q Data Management Tasks (⚽ = prompt style used)
Ø Data Preprocessing
• Data cleaning (⚽ Instruction)
• Doc2Table (⚽ Instruction)
• Entity matching (⚽ Instruction + Examples)
Ø Database Optimization
• Knob tuning (⚽ Instruction + Examples + Extra Info)
• Database diagnosis (⚽ Instruction + Examples + Extra Info)
• Query rewrite (⚽ Instruction + Examples)
Ø Data Analysis
• Text2SQL (⚽ Instruction + Examples + Extra Info)
• Text2Code (⚽ Instruction + Examples)
• ...
LLM Prompt for Data Management
q Data Management tasks
q LLM Prompt for Data Management
• Instruction Prompting
• Few-Shot Prompting
q LLM Agent for Data Management
• Agent Models + Memory
• Reasoning / Planning Strategies
• Tool Management & Learning
q RAG for Data Management
• Semantic Segmentation
• Result Retrieval
• Result Reranking
q Finetuning for Data Management
• Reparameterization / LLM Adapter
q Data Preparation for LLM
q Open Problems
Prompt Engineering (PE)
q A high-quality prompt can instruct the LLM to optimize DB tasks without training
Ø Zero-shot Prompting
• Input the LLM with a task description, without training over labeled data
• Instruction Prompting: input the LLM with explicit instructions on approaching the task, e.g., detailing the format, tone, or type of output response
Ø Few-shot Prompting
• Provide the LLM with a few examples of the task within the prompt to guide the model on how to generate responses (assembled programmatically in the sketch below)

Prompt of Query Rewrite
Task Description: Write an equivalent SQL query that can be executed on a Postgres database with decreased latency.
Instruction: 1. Ensure the output query is semantically equivalent to the input query …
Example Input:
select ... from t1 where t1.a=(select avg(a) from t3 where t1.b=t3.b);
Example Output:
select … from t1 inner join (select avg(a) avg, t3.b from t3 group by t3.b) as t3 on (t1.a=avg and t1.b=t3.b);
Input:
select t1.* from t1 where t1.col1>(select max(t2.col2) from t2 where t2.col1 in (select t1.col1 from t1 where t1.col1=t2.col1));
Output:
select t1.* from t1 inner join (select max(t2.col2) max, t2.col1 from t2 group by t2.col1) as t2 on (t1.col1=t2.col1) where t1.col1>max;
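A minimal sketch of assembling such a few-shot query-rewrite prompt from (input, output) demonstrations; the structure mirrors the example prompt above.

TASK = ("Write an equivalent SQL query that can be executed on a "
        "Postgres database with decreased latency.")

def few_shot_prompt(demos: list, query: str) -> str:
    parts = [TASK]
    for inp, out in demos:                         # demonstration examples
        parts += [f"Example Input:\n{inp}", f"Example Output:\n{out}"]
    parts += [f"Input:\n{query}", "Output:"]
    return "\n\n".join(parts)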
Xuanhe Zhou, Zhaoyan Sun, Guoliang Li. DB-GPT: Large Language Model Meets Database. Data Science and Engineering 2023.
Challenges of PE for Data Management
q Challenge 1: How to Automatically Generate the Input Prompt?
Ø Automatically generate proper instructions and select demonstration examples from a large space within limited prompt tokens (or limited interaction rounds)
q Challenge 2: How to Efficiently Interact with the LLM Using Prompts?
Ø Iteratively adjust the prompt for the input request, e.g., select suitable examples from past rewrite cases (see the sketch below)
🏀 Instruction Selection (ranking-based); Example Selection (text-davinci-003)
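A minimal sketch of ranking-based demonstration selection by embedding similarity; embed is a hypothetical sentence-embedding function and cases are (input query, rewrite) pairs.

import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError  # e.g., a small finetuned embedding model

def select_examples(query: str, cases: list, k: int = 3) -> list:
    q = embed(query)
    def sim(case):                      # cosine similarity to the new query
        v = embed(case[0])
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(cases, key=sim, reverse=True)[:k]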
Luyi Ma, et al. LLMs with User-defined Prompts as Generic Data Operators for Reliable Data Processing. IEEE Big Data 2023.
Prompt Engineering for Query Rewrite
q Problems of Rule-driven Rewriters
• Inadequacy of rules: insufficient for handling complex query transformations (e.g., merging sub-queries)
• Cross-system migration: different programming languages and SQL syntax
Example query:
SELECT … WHERE … AND (
  EXISTS (SELECT a2 FROM t2 WHERE t2.b2=1)
  OR EXISTS (SELECT a2 FROM t2 WHERE t2.b2=2)
)
Xuanhe Zhou, Zhaoyan Sun, Guoliang Li. DB-GPT: Large Language Model Meets Database. Data Science and Engineering 2023.
Prompt Engineering for Query Rewrite
q 1. Prompt-based Rewrite Case Generation
• Utilize the LLM to generate well-explained rewrite cases, i.e., (original query, rewritten query, rewrite rule, rewrite analysis)
Xuanhe Zhou, Zhaoyan Sun, Guoliang Li. DB-GPT: Large Language Model Meets Database. Data Science and Engineering 2023.
Prompt Engineering for Query Rewrite
q 2. Prompt-based Query Rewrite
§ The quality of the prompt impacts the performance of the LLM on different rewrites à Automatic Prompt Generation
2.1 Instruction Generation
• Write a prompt that instructs the LLM to generate instruction candidates from examples (e.g., five pairs):
• "I followed the instruction to rewrite the input SQL query to produce an equivalent SQL query..."
• "Based on the instruction, they produced the following input-output pairs: \n\n[example pairs] \n\nInstruction:"
• Output the best instruction
Xuanhe Zhou, Zhaoyan Sun, Guoliang Li. DB-GPT: Large Language Model Meets Database. Data Science and Engineering 2023.
Prompt Engineering for Query Rewrite
q 2. Prompt-based Query Rewrite
§ The quality of the prompt impacts the performance of the LLM on different rewrites à Automatic Prompt Generation
2.1 Instruction Generation (continued)
• Rank the generated instruction candidates based on their benefit on a validation set (e.g., the cost reduction after rewriting with the instructed LLM), and output the best instruction, as sketched below
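A minimal sketch of that candidate ranking: pick the instruction whose rewrites reduce estimated cost the most on the validation queries. rewrite_with and plan_cost are hypothetical stand-ins (an LLM call and, e.g., an optimizer EXPLAIN cost on Postgres).

def rewrite_with(instruction: str, sql: str) -> str:
    raise NotImplementedError   # LLM rewrite guided by the instruction

def plan_cost(sql: str) -> float:
    raise NotImplementedError   # e.g., estimated cost from EXPLAIN

def best_instruction(candidates: list, validation: list) -> str:
    def benefit(instr):          # total cost reduction over validation queries
        return sum(plan_cost(q) - plan_cost(rewrite_with(instr, q))
                   for q in validation)
    return max(candidates, key=benefit)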
Xuanhe Zhou, Zhaoyan Sun, Guoliang Li. DB-GPT: Large Language Model Meets Database. Data Science and Engineering 2023.
Prompt Engineering for Query Rewrite
q 2. Prompt-based Query Rewrite
2.2 Demonstration Example Generation
• Match the current query q with a few candidate rewrites whose input queries are similar to q
Xuanhe Zhou, Zhaoyan Sun, Guoliang Li. DB-GPT: Large Language Model Meets Database. Data Science and Engineering 2023.
Prompt Engineering for Table Data Extraction
q LLM cannot directly extract a structured table from long-context and semi-structured documents
q Core Idea
• For a long document, feed sampled document chunks to the LLM, and prompt it to generate useful information (schema and cell values) that can form a structured table (e.g., writing code to extract the values of important attributes)
EVAPORATE-CODE+ (Doc2Table)
Arora S, Yang B, Eyuboglu S, et al. Language models enable simple systems for generating structured views of heterogeneous data lakes[J]. arXiv 2023.
Prompt Engineering for Table Data Extraction
q LLM cannot directly extract a structured table from long-context and semi-structured documents
q Prompt-based Table Data Extraction
q Schema Synthesis
• With a sampled subset of documents, prompt the LLM to extract attributes based on their occurrence frequencies
• Rerank the extracted attributes by adjusting their frequency weights with the LLM
q Function Synthesis
• Extracting attribute values from every document with the LLM is a heavy job à prompt the LLM to write code that extracts the attribute values (see the sketch below)
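A minimal function-synthesis sketch in the spirit of EVAPORATE-CODE+: ask the LLM for an extraction function once, then run it cheaply over all documents. ask_llm is hypothetical, and the generated code is exec'd here for illustration only (sandbox it in practice).

def ask_llm(prompt: str) -> str:
    raise NotImplementedError

def synthesize_extractor(attribute: str, sample_doc: str):
    code = ask_llm(
        f"Write a Python function extract(text) that returns the value of "
        f"'{attribute}' from documents like:\n{sample_doc}"
    )
    scope = {}
    exec(code, scope)          # caution: run untrusted code in a sandbox
    return scope["extract"]    # reuse this function over every document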
Arora S, Yang B, Eyuboglu S, et al. Language models enable simple systems for generating structured views of heterogeneous data lakes[J]. arXiv preprint arXiv:2304.09433, 2023.
Prompt Engineering for NL2Code
q Prompt components: database description; processing steps; natural language instructions
Trummer I. From BERT to GPT-3 Codex: Harnessing the Potential of Very Large Language Models for Data Management[J]. VLDB, 2022.
Prompt Engineering for Knob Tuning (Workload Generation)
q Motivation
• ML-based methods require numerous workloads as training data
q Main Steps
• Workload generation: use manually crafted GPT-4 prompts to generate diversified workloads for a specific database schema and workload type
Workload Generation Prompt:
"You are a helpful DBA, tasked with generating challenging OLAP / OLTP workloads to fulfill my goals."
Huang X, Li H, Zhang J, et al. LLMTune: Accelerate Database Knob Tuning with Large Language Models[J]. arXiv preprint arXiv:2404.11581, 2024.
Prompt Engineering for Knob Tuning (Workload Generation)
q Motivation
• ML-based methods require numerous workloads as training data
q Main Steps
• LLM Finetuning: train Mistral-7B on workloads labeled with configurations recommended by a Bayesian Optimization (BO)-based algorithm
• LLM Input: workload features, internal metrics, query plan
• LLM Output: the configuration change based on the previous configuration
Huang X, Li H, Zhang J, et al. LLMTune: Accelerate Database Knob Tuning with Large Language Models[J]. arXiv preprint arXiv:2404.11581, 2024.
Prompt Engineering for Knob Tuning (Knowledge Extraction)
Lao J, Wang Y, et al. GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization[J]. Proceedings of the VLDB Endowment, 2024, 17(8): 1939-1952.
LLM Prompt for Database Diagnosis
q Improve the diagnosis prompt with matched knowledge and tools
• 1. Anomaly Description for triggered alerts
• 2. Tool Matching with a finetuned embedding model
Prompt Template:
- Expert Description: <Role> ; <Task> ; <Steps>
- The anomaly alerts are {Anomaly Info}
- In this step, you can use the following tools: {Tool APIs}
- The matched knowledge is: {Knowledge}
(Demonstration examples of available tool usage)
Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, et al. D-Bot: Database Diagnosis System using Large Language Models. VLDB 2024.
Summarization of Prompt-based Data Management
Job                   | LLM                                            | Operations                             | Prompt (Content)                         | Prompt (Examples)        | Efficiency Issue
Data Standardization  | GPT: modify data format                        | Four preprocessing operations          | Instruction                              | Manual                   | N/A
Query Rewrite         | GPT: query rewrite                             | Typical logical transformations        | Instruction + Example                    | Ranking-based selection  | 1. a small-sized model for example selection; 2. prompt compression; 3. past-rewrite reuse
Table Data Extraction | GPT: Doc2table                                 | Schema extraction; value extraction    | Instruction + Example                    | Manual                   | N/A
NL2Code               | GPT: Text2code                                 | Processing steps and comments          | Instruction                              | Manual                   | N/A
Knob Tuning           | GPT: workload generation; Mistral: knob tuning | Workload generation + knob tuning      | Instruction                              | Manual                   | N/A
Knob Tuning           | GPT: knowledge extraction                      | Four steps for knowledge extraction    | Instruction + Example                    | Manual                   | N/A
Database Diagnosis    | GPT / Llama: root cause analysis               | Three analysis and tool-calling steps  | Instruction + Example + Knowledge / Tool | Matched by external info | N/A
Take-aways
p Prompt generation
l Existing prompts mostly depend on human experts to craft high-quality instructions and examples, which poses a scalability problem
l Prompt examples can be generated from humans, real cases, LLMs (e.g., adding explanations), and traditional algorithms (e.g., search-based for configs)
p Instruction selection
l The instruction format can affect LLM performance (especially for weak LLMs)
l The order of instructions / different examples can affect LLM performance
p Prompt Selection
l Prompt selection is critical to the success of the current task
l There is no one-size-fits-all prompt; it needs modifications for different LLMs
Advantages of LLM Agent
q Autonomy
• Execute human instructions and complete tasks
• Start and execute operations without human intervention
q Reactivity
• Respond rapidly to immediate changes and interact with its environment
q Pro-activeness
• Anticipate the future: make plans and take proactive measures to achieve specific goals
https://www.superannotate.com/blog/llm-agents
Advantages of LLM Agent
q Learning Ability
• Adaptively integrate new tools and refine execution pipelines based on environment feedback
• Memorize experience in both an external knowledge base and model parameters
q Mobility
• Easy to generalize to new platforms (e.g., Postgres à MySQL) and scenarios (e.g., specifying new workloads) via prompt-level modification
https://www.superannotate.com/blog/llm-agents
Main Components of LLM Agent
q Main components
• Agent Model: A powerful language model that can (1) conduct basic reasoning by
following the instruction; (2) generate tool calling commands and understand tool outputs …
• Planning: Decompose complex task and conduct step-by-step reasoning
• Memory: Store records of interactions from the agent’s (long/short term) previous tasks
• Tools: Manage agent’s calling of external API tools
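A minimal ReAct-style loop sketch tying these components together; ask_llm, the tool registry, and the step format are hypothetical stand-ins, not a specific framework's API.

def ask_llm(prompt: str) -> dict:
    raise NotImplementedError   # returns {"thought":..,"tool":..,"args":..} or {"answer":..}

TOOLS = {"optimize_query_plan": lambda sql: "new plan for " + sql}  # tool registry
memory = []                     # short-term memory of this task

def run_agent(task: str, max_steps: int = 5) -> str:
    for _ in range(max_steps):  # planning: step-by-step reasoning
        step = ask_llm(task + "\n" + "\n".join(memory))
        if "answer" in step:
            return step["answer"]
        result = TOOLS[step["tool"]](step["args"])      # tool calling
        memory.append(f"{step['tool']} -> {result}")    # remember observation
    return "gave up"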
https://www.superannotate.com/blog/llm-agents
Example LLM Agent
q Example: Nvidia's Data-Analytics Agent, which mines raw stock market data (structured) and financial reports (unstructured)
https://developer.nvidia.com/blog/build-an-llm-powered-data-agent-for-data-analysis/
Challenges & Techniques of Agent Components
q Main components
• Agent Model: How to (1) conduct basic reasoning by following the instruction; (2) generate tool-calling commands and understand tool outputs …
• The selection / finetuning of a suitable agent model is vital to performance
• GPT-3.5 Turbo is effective as a code interpreter, particularly in its ability to understand and translate complex task descriptions into functional code
• GPT-4 demonstrates weaker consistency compared with GPT-3.5
Zeeshan Rasheed, et al. Large Language Model Evaluation Via Multi AI Agents: Preliminary results. arXiv, 2024.
Challenges & Techniques of Agent Components
q Main components
• Planning: How to decompose complex task and conduct step-by-step reasoning
• Chain-of-Thought (CoT): Explicitly break down the task into a sequence of intermediate steps
• Tree-of-Thought (ToT): Decompose the thought process into multiple branches and sub-nodes
• Graph-of-Thought (GoT): LLM thoughts as vertices, edges as thought dependencies à
Arbitrary thoughts can be aggregated by constructing vertices with multiple incoming edges
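A minimal zero-shot chain-of-thought sketch: the appended cue elicits intermediate reasoning steps before the final answer (the question is illustrative).

question = "Which index should be added to speed up: SELECT * FROM emp WHERE deptno=7?"
cot_prompt = question + "\nLet's think step by step."
# A ToT/GoT planner would instead sample several such reasoning paths and
# expand or aggregate the most promising ones.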
Maciej Besta, et al. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. AAAI 2024
Challenges & Techniques of Agent Components
(Figure: decomposing a task into a Tree of Thought)
Maciej Besta, et al. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. AAAI 2024
Challenges & Techniques of Agent Components
q Main components
Ø Memory: How to store historical messages for caching or effective task-solving
Ø The LLM may forget past actions when resolving the same task
• à Short-Term Memory: memory information is directly written into the prompt (e.g., for maintaining the internal state while executing a task)
Ø Effective plans for historical tasks may be useful for the current task
• à Long-Term + Short-Term Memory: long-term memory for stable knowledge (e.g., reuse behaviors and thoughts from past plans in the current situation); short-term memory for flexible planning (e.g., adjust the plan with recent feedback)
• Memory formats: different memory formats possess distinct strengths
Memory Format    | Advantage                | Case
Natural language | Flexible; rich semantics | Reflexion: stores experiential feedback in natural language within a sliding window
Embeddings       | Benefit retrieval        | ChatDev: encodes dialogue history into vectors for reuse
Databases        | Support complex queries  | DB-GPT: agents are fine-tuned to understand and execute SQL queries, enabling them to interact with databases
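A minimal sketch of combining long-term and short-term memory when building the agent prompt. The stores are plain lists here; per the table above, a real agent might back long-term memory with a vector database or SQL store.

long_term = []       # distilled experience across past tasks
short_term = []      # observations within the current task

def build_agent_prompt(task: str) -> str:
    relevant = [m for m in long_term if any(w in m for w in task.split())]  # naive recall
    return "\n".join(["Known experience:"] + relevant
                     + ["Recent steps:"] + short_term[-5:]   # sliding window
                     + ["Task: " + task])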
Challenges & Techniques of Agent Components
q Main components
• Tools: How to prepare tool APIs for better tool calling and result understanding
• Vast number of tools (prompt-length and latency issues) à Tool Selection
  • Retriever-based tool selection (see the sketch below)
  • LLM-based tool selection
• Adherence to API parameters and formats à Tool Calling
  • LLM finetuning
  • Output constraints (e.g., JSON format)
  • …
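A minimal retriever-based tool selection sketch: score tool descriptions against the request and keep the top-k. embed and the tool descriptions are hypothetical.

import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError

TOOL_DESCRIPTIONS = {
    "optimize_query_plan": "rewrite and optimize a slow SQL query plan",
    "show_slow_queries": "list the slowest recent queries on the instance",
}

def select_tools(request: str, k: int = 2) -> list:
    q = embed(request)
    def score(name: str) -> float:
        v = embed(TOOL_DESCRIPTIONS[name])   # in practice, precompute and cache
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(TOOL_DESCRIPTIONS, key=score, reverse=True)[:k]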
Changle Qu, et al. Tool Learning with Large Language Models: A Survey. arXiv 2024.
LLM Agent Techniques for Data Management
(Figure: agent pipeline with 🏀 Action (e.g., document retrieval), Memory (Q&A caching), and Human Feedback & Risk Estimation)
LLM Agent for Data Standardization
Danrui Qi, Jiannan Wang. CleanAgent: Automating Data Standardization with LLM-based Agents. arXiv 2024.
LLM Agent for Database Diagnosis
• Numerous Documents: A single database product provides 10,000+ pages of materials (e.g., operational guides and instructions). It is tricky for junior DBAs to fully grasp all of this.
• Significant Operational Pressure: The number of cloud database instances is massive, but there is a shortage of DBAs.
• Complex Issues: Many urgent problems are interconnected, making it hard to respond quickly, resulting in economic losses.
(Chart: root-cause categories include query quality, audit, caches, parameters, query plan, indexes, and others)
Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, et al. D-bot: Database diagnosis system using large language models[J]. VLDB, 2024.
LLM Agent for Database Diagnosis
• Goal: Utilize the LLM as an "intelligent DBA assistant", learning from human operational experience and diagnosing root causes. (Other methods like expert systems are hard to reuse across similar cases.)
• Benefits: Save manpower; enhance proficiency in using maintenance tools; improve the ability to generalize operational knowledge.
Example reasoning trace:
Thought: High memory usage seems to be caused by poor join performance and much inactive memory
Reasoning: Poor joins can be solved by plan optimization
Action: optimize_query_plan (rewrite logical query …, optimize query plan …)
Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, et al. D-bot: Database diagnosis system using large language models[J]. VLDB, 2024.
Agent-Based Data Management
Methods            | Model         | Memory                        | Extra Knowledge | Tool                                 | Planning        | Multi-Agent
Database Diagnosis | GPT-4 / Llama | Results of historical actions | Documents       | Monitoring tools; optimization tools | Tree of Thought | √
Take-aways
p Agents integrate capabilities like planning, obtaining external knowledge, and reacting (i.e., tool learning), and so work better than prompt-only methods
p Existing tool APIs and prompt templates for agents are manually written, which also poses a scalability problem
p Planning is vital for agents to resolve problems like early stopping, hallucination (e.g., inaccurate tool calling), and self-inconsistency
p Memory reminds the LLM of historical messages (avoiding repeated actions or serving as a cache), but also causes great prompt redundancy
p Multi-agent systems offer new mechanisms like collaboration (vs. single-agent), but their superiority needs further exploration, especially in real-world cases (where a single agent already incurs great overhead)
RAG for Data Management
Drawbacks of LLMs
• Hallucination
• Outdated information
• Low efficiency in updating parameterized knowledge
• Weak reasoning capability
Practical Requirements
• Domain-Specific Accurate Q&A
• Frequent Data Update
• Explainability of Generated Answers
• Controllable Cost
• Data Privacy Protection
(A motivating example.)
Retrieval and Generation Pipeline
Retrieval phase:
• The given corpus is segmented into many chunks.
• For a question, a retriever identifies and selects the top-K most related chunks as context.
Generation phase:
• The question, alongside the context, is input into an LLM to generate an answer.
Pipeline: Question à Embedding Model à Question Vector à Vector Database à Top-K Chunks à Prompt Engineering à LLM à Answer (returned to the user)
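A minimal retrieve-then-generate sketch of the pipeline above; embed and ask_llm are hypothetical stand-ins for an embedding model and an LLM API.

import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError   # any sentence-embedding model

def ask_llm(prompt: str) -> str:
    raise NotImplementedError   # any chat/completion API

def answer(question: str, chunks: list, k: int = 3) -> str:
    vecs = np.stack([embed(c) for c in chunks])   # offline: index the chunks
    qv = embed(question)
    sims = vecs @ qv / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(qv))
    context = "\n".join(chunks[i] for i in np.argsort(-sims)[:k])   # top-K retrieval
    return ask_llm(f"Context:\n{context}\n\nQuestion: {question}")  # generation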
Limitations of RAG
Retrieval limitation:
• Chunk selection based on embedding similarity may lead to misleading or
irrelevant chunks and missing crucial information.
• Methods:
• Semantic segmentation
• Reranking
• Retrieval techniques
Generation limitation:
• LLMs may produce content that is not supported by the retrieved context.
• Methods:
• Prompt engineering
• Using LLMs with high proficiency levels
Semantic segmentation
Motivation: Segment the corpus into semantically coherent chunks, ensuring that the retrieved information is semantically complete and relevant.
Training:
• Collect sentence pairs with labels (label=1 means the pair is semantically relevant).
• Fit the mapping between sentence pairs and labels using a model (embedding model + MLP).
Inference:
• Each pair of adjacent sentences in the corpus is judged by the segmentation model, as sketched below.
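A minimal sketch of such a boundary classifier (embedding model + MLP over sentence pairs); the architecture is illustrative and encode is a hypothetical sentence encoder.

import torch
import torch.nn as nn

def encode(sent: str) -> torch.Tensor:
    raise NotImplementedError        # returns a (d,)-dim sentence embedding

class BoundaryMLP(nn.Module):
    def __init__(self, d: int = 384):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # high score: the two sentences belong to the same chunk
        return torch.sigmoid(self.mlp(torch.cat([a, b], dim=-1)))

def segment(sentences: list, model: BoundaryMLP, thr: float = 0.5) -> list:
    chunks, cur = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if model(encode(prev), encode(nxt)).item() >= thr:
            cur.append(nxt)                    # semantically relevant: same chunk
        else:
            chunks.append(cur); cur = [nxt]    # start a new chunk
    chunks.append(cur)
    return chunks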
Reranking Technique
Reranking: Reorder chunks according to their relevance to the question, and then select the top K based on a specific metric (e.g., relevance).
• Enhancer: Boost the relevance and quality of the context.
• Filter: Filter out misleading or irrelevant chunks.
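A minimal reranking sketch; score_relevance is a hypothetical scorer (e.g., a cross-encoder, or an LLM scoring P(question | chunk) as in zero-shot question generation).

def score_relevance(question: str, chunk: str) -> float:
    raise NotImplementedError

def rerank(question: str, chunks: list, k: int) -> list:
    ranked = sorted(chunks, key=lambda c: score_relevance(question, c), reverse=True)
    return ranked[:k]    # enhancer: best context first; filter: drop the rest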
Devendra Singh Sachan et al. Improving Passage Retrieval with Zero-Shot Question Generation. EMNLP 2022.
Retrieval Technique
Retrieval augmentation
• Iterative Retrieval • Recursive Retrieval • Adaptive Retrieval
Yunfan Gao, Yun Xiong, Xinyu Gao , et al. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2023
Retrieval Technique
• Iterative Retrieval: Repeatedly search based on the initial query and the generation result.
Advantage: Enhance robustness of subsequent answer generation.
Zhihong Shao, Yeyun Gong et al. Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy. EMNLP 2023.
67
Retrieval Technique
• Recursive Retrieval: Iteratively refining search queries based on previous retrieval result.
Advantage: Improve the depth and relevance of search results.
Harsh Trivedi et al. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. ACL 2023.
Retrieval Technique
• Adaptive Retrieval: Enable LLMs to actively determine the moments for retrieval.
Advantage: Enhance the efficiency and relevance of the information sourced.
Example: FLARE
Retrieve when LLM generates tokens with probabilities
below a set threshold.
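A minimal FLARE-style sketch: retrieve only when the model is uncertain, i.e., some generated token probability falls below the threshold. generate_with_probs and retrieve are hypothetical stand-ins.

def generate_with_probs(prompt: str):
    raise NotImplementedError   # returns (next sentence, per-token probabilities)

def retrieve(query: str) -> str:
    raise NotImplementedError

def flare_generate(question: str, steps: int = 5, thr: float = 0.4) -> str:
    answer = ""
    for _ in range(steps):
        draft, probs = generate_with_probs(question + answer)
        if min(probs, default=1.0) < thr:    # a low-confidence token appeared
            ctx = retrieve(draft)            # re-retrieve using the draft
            draft, _ = generate_with_probs(f"{ctx}\n{question}{answer}")
        answer += draft
    return answer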
Zhengbao Jiang, Frank F. Xu, Luyu Gao et al. Active Retrieval Augmented Generation. EMNLP 2023.
Insights with experimental examples
Graph RAG
• Graph RAG includes a graph database as a source of the retrieval context sent to the LLM.
Retrieval Method: Entity Linking & NL-to-Graph-query.
Advantages:
• Providing extra information like metadata for generation.
• Handling complex and nuanced queries.
• Supporting data updates.
Example: https://github.com/falkorDB/ufc
RAG for Text2SQL
• Pre-Processing
• Schema Linking
• Database Content Retrieval
• Additional Information Acquisition
• NL2SQL Translation Methods
• Encoding Strategy
• Decoding Strategy
• Task-specific Prompt Strategies
• Intermediate Representation
• Post-Processing
• Correction
• Consistency
• Execution-Guided
• N-best Rerankers
Paper: https://arxiv.org/pdf/2408.05109
NL2SQL Handbook: https://github.com/HKUSTDial/NL2SQL_Handbook
RAG for Text2SQL
NL2SQL360-Automated Search
Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, Nan Tang. The Dawn of Natural Language to SQL: Are We Fully Ready?. VLDB 2024
RAG for Text2SQL
Xue S, Jiang C, Shi W, et al. Db-gpt: Empowering database interactions with private large language models[J]. arXiv preprint arXiv:2312.17449, 2023.
RAG for Q&A Diagnosis
Vikramank Singh et al. Panda: Performance Debugging for Databases using LLM Agents. CIDR, 2024
Take-aways
p RAG is very important for adapting to vertical domains and supporting data updates
p It not only improves efficiency (vector database, vector index) but also improves recall (multi-retriever, segmentation, embedding models)
p It is important to combine Prompt, RAG, and Agent to answer complex tasks
p It requires building a RAG system
Finetuning for Data Management
q Abilities required from finetuning:
• Prompt à target / terminology understanding; instruction following
• RAG à domain knowledge integration
• Agent à tool calling; task decomposition
Example: Prompt Only à Prompt + Finetuning
Edward J. Hu, Yelong Shen, Phillip Wallis, et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022
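A minimal LoRA setup sketch with the Hugging Face peft library; the model name and hyperparameters are illustrative, not the paper's exact configuration.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=16, lora_dropout=0.05,   # low-rank adapter settings
)
model = get_peft_model(base, config)          # only the adapter weights train
model.print_trainable_parameters()            # tiny fraction of base parameters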
Typical Techniques for LLM Finetuning
Zhiqiang Hu et al. LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models. EMNLP, 2023.
LLM Finetuning Techniques for Data Management
LLM Finetuning for Query Rewrite
DB-GPT: Large Language Model Meets Database. Data Science and Engineering 2023.
LLM Finetuning for Query Rewrite
• Targets, Terminologies
• Logical Plan Structures
DB-GPT: Large Language Model Meets Database. Data Science and Engineering 2023.
LLM Finetuning for Tool Learning in Diagnosis Agent
q Train the LLM to select and call diagnosis APIs (triggered by an anomaly alert)
• Good testing performance: good at subtasks (e.g., tool calling, abnormal metric analysis), but highly reliant on the quality of the finetuning data
Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, et al. D-Bot: Database Diagnosis System using Large Language Models. VLDB 2024
LLM Finetuning for Table Learning
• Model Finetuning
• The vision-language connector and the LLM are jointly finetuned with instruction-following data of multi-modal tabular tasks (e.g., textual Q&A over tables)
Open Problems
q Data Management tasks
q LLM Prompt for Data Management
• Instruction Prompting
• Few-Shot Prompting
q LLM Agent for Data Management
• Agent Models + Memory
• Reasoning / Planning Strategies
• Tool Management & Learning
q RAG for Data Management
• Semantic Segmentation
• Result Retrieval
• Result Reranking
q Finetuning for Data Management
• SFT Dataset Generation
q Data Preparation for LLM
q Open Problems
Open Problem 1: Database Specific LLM
p Case-by-Case LLM Finetuning à Database-Specific LLM Construction
Ø Pretrain: Collect sufficient database-domain tokens (e.g., in millions) as pre-training corpora from sources like database textbooks and query analysis
Ø Finetune: Instruction understanding in SQL / text à basic Q&A (DB / product / instance) à task-solving in DB domains à alignment to database experts
Ø Evaluation: Evaluate the accuracy and robustness of the database model with a carefully-crafted validation dataset, measuring metrics, and an end-to-end testbed
Leiyuan Chen et al. TableVLM: Multi-modal Pre-training for Table Structure Recognition. ACL 2023.
Open Problem 3: Data Preparation System for LLM
p An Effective System for Preparing LLM Data
Open Problem 4: Easy-to-Use LLM (Agent)
q How to reduce the usage costs of LLMs / LLM-based agents
Ø Tool API: Manual Generation à Automatic API Integration (e.g., from programs)
Ø …
Thanks!
Thanks for the help & advice from
Jintao Zhang, Zhaoyan Sun,
Yuyang Gao, Wei Zhou, James Pan
Slides: https://dbgroup.cs.tsinghua.edu.cn/ligl/activities.html
Complete Paper List: https://github.com/code4DB/LLM4DB