
Large Language Models as Data Preprocessors

Haochen Zhang¹, Yuyang Dong², Chuan Xiao¹,³, Masafumi Oyamada²

¹Osaka University, ²NEC Corporation, ³Nagoya University
{chou.koushin,chuanx}@ist.osaka-u.ac.jp, {dongyuyang,oyamada}@nec.com

ABSTRACT

Large Language Models (LLMs), typified by OpenAI's GPT, have marked a significant advancement in artificial intelligence. Trained on vast amounts of text data, LLMs are capable of understanding and generating human-like text across a diverse range of topics. This study expands on the applications of LLMs, exploring their potential in data preprocessing, a critical stage in data mining and analytics applications. Aiming at tabular data, we delve into the applicability of state-of-the-art LLMs such as GPT-4 and GPT-4o for a series of preprocessing tasks, including error detection, data imputation, schema matching, and entity matching. Alongside showcasing the inherent capabilities of LLMs, we highlight their limitations, particularly in terms of computational expense and inefficiency. We propose an LLM-based framework for data preprocessing, which integrates cutting-edge prompt engineering techniques, coupled with traditional methods like contextualization and feature selection, to improve the performance and efficiency of these models. The effectiveness of LLMs in data preprocessing is evaluated through an experimental study spanning a variety of public datasets. GPT-4 emerged as a standout, achieving 100% accuracy or F1 score on 4 of these datasets, suggesting LLMs' immense potential in these tasks. Despite certain limitations, our study underscores the promise of LLMs in this domain and anticipates future developments to overcome current hurdles.

1 INTRODUCTION

Large Language Models (LLMs), such as OpenAI's GPT and Meta's LLaMA, are becoming an increasingly important aspect of the AI landscape. These models, essentially ML systems, are trained on vast amounts of text data and characterized by an augmented number of parameters. They are capable of understanding and generating text across a diverse range of topics, thereby finding applications in numerous tasks. Consequently, research involving LLMs has garnered significant attention from both academia and industry. Recent endeavors have successfully leveraged LLMs for data management and mining. For instance, LLMs have been used for SQL generation [16], database diagnosis [5], data wrangling [12], and data analytics [2].

This paper investigates the potential of utilizing state-of-the-art (SOTA) LLMs for data preprocessing, a crucial step that refines data before it can be harnessed for downstream data mining and analytics applications. Given their comprehensive understanding of language semantics and structures, LLMs can identify errors or matches in text data. For example, they are capable of detecting spelling mistakes, grammar issues, contextual discrepancies, and near-duplicate records. Consequently, the application of LLMs in data preprocessing can pave the way for tackling tasks such as error detection, data imputation, schema matching, and entity matching.

While LLMs hold considerable potential for data preprocessing tasks, it is critical to comprehend their capabilities and limitations for effective application. Thus, as a preliminary study on employing LLMs for data preprocessing, this paper provides the following contributions.

(1) We examine the inherent knowledge and superior reasoning and learning abilities of LLMs, which can be further enhanced through zero- and few-shot prompting. These strengths position LLMs as competitive candidates for various data processing tasks. However, their computational expense and potential inefficiencies present challenges. We provide an analysis of these strengths and limitations in the context of data preprocessing.

(2) We propose a framework for LLM-based data preprocessing. This framework integrates a series of SOTA prompt engineering techniques, including zero-shot instructions, few-shot examples, and batch prompting, as well as traditional approaches such as contextualization and feature selection. We specifically instruct LLMs to follow an answer format and reason before providing an answer to enhance performance. Few-shot examples are used to condition LLMs so that they can learn error criteria, means of imputation, matching conditions, etc. Batch prompting amalgamates multiple data instances in a prompt to reduce token and time costs.

(3) We conduct experiments on 12 datasets for four data preprocessing tasks. We evaluate popular LLMs such as GPT-3.5, GPT-4, and GPT-4o. The results indicate that GPT-4 generally outperforms existing solutions, achieving 100% accuracy or F1 score on 4 out of 12 datasets. GPT-3.5 also delivers competitive performance and is recommended for data preprocessing. GPT-4o delivers inconsistent performance: competitive on data imputation and entity matching but mediocre on error detection and schema matching. The evaluation also sheds light on the effects of the proposed components of the solution framework on accuracy and efficiency.

2 PRELIMINARIES

2.1 Data Preprocessing

In this initial exploration of large language models (LLMs) for data preprocessing, we concentrate on tabular data. We target the following tasks: error detection (ED), data imputation (DI), schema matching (SM), and entity matching (EM). Other typical data preprocessing tasks, such as data fusion and data wrangling, are reserved for future work. Diverging from the traditional definition that presents the entire dataset and finds or fixes all the errors (or matches, etc.) within, we define the problem by handling one record (or a pair) at a time, so the prompt to an LLM can be easily written. We term each input object a data instance, i.e., a tuple for ED and DI, a pair of attributes for SM, and a pair of tuples for EM.
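For concreteness, the sketch below shows one way these four kinds of data instances could be represented in code; the class and field names are illustrative assumptions of ours, not part of the paper's implementation.

from dataclasses import dataclass
from typing import Dict, Tuple

Record = Dict[str, str]  # attribute name -> cell value

@dataclass
class EDInstance:    # error detection: a tuple plus the attribute to check
    record: Record
    target_attribute: str

@dataclass
class DIInstance:    # data imputation: a tuple with one missing attribute
    record: Record
    missing_attribute: str

@dataclass
class SMInstance:    # schema matching: a pair of attributes (name, description)
    attribute_a: Tuple[str, str]
    attribute_b: Tuple[str, str]

@dataclass
class EMInstance:    # entity matching: a pair of tuples
    record_a: Record
    record_b: Record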
2.2 Large Language Models

LLMs have become one of the hottest topics in the AI research community [20]. We discuss the strengths and limitations of using LLMs for data preprocessing.

Figure 1: Framework of data preprocessing with an LLM. (Raw data instances pass through zero-shot prompting, few-shot prompting, contextualization, feature selection, and batch prompting modules, which construct the prompt for the large language model; the model outputs the preprocessed data instances.)

Strengths. (1) With their comprehensive understanding of language semantics and structures, and the knowledge acquired through training on vast amounts of text data, LLMs are general problem solvers capable of identifying errors, anomalies, and matches in textual data, without needing human-engineered rules [13] or fine-tuning for specific tasks. (2) Most LLMs provide a prompting interface with which users can interact and assign tasks in natural language, contrasting with existing data preprocessing solutions that require computer programming or specific tools (e.g., HoloClean [15] and Magellan [8]). (3) LLMs are excellent reasoners [7], enabling them to not only return data preprocessing results but also provide the reasons for these results. In this sense, their answers are more interpretable than those of other DL approaches. (4) LLMs can be conditioned by few-shot prompting [1]. As such, we can tune the criteria for data preprocessing tasks (e.g., the degree of matching) using few-shot examples.

Limitations. (1) For data preprocessing, one of the major limitations is the difficulty in domain specification [12]. When dealing with data from highly specialized domains, training LLMs can be costly and sometimes even impossible due to frozen parameters. (2) LLMs sometimes generate text that is plausible-sounding but factually incorrect or nonsensical, as they lack a fundamental understanding of the world and rely solely on the patterns they learned during training. (3) LLMs often require substantial computational resources, thereby increasing the cost of use and compromising the efficiency and scalability of data preprocessing on large-scale data.

3 METHOD

Figure 1 illustrates our data preprocessing framework, which consists of several modules that construct the prompt serving as the input to the LLM. We design a prompt template as follows.

You are a database engineer.
[Zero-shot prompt]
[Few-shot prompt]
[Batch prompt]

Initially, we instruct the LLM to impersonate a database engineer. Other prompt components are marked within [] and will be discussed throughout this section.
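As a concrete illustration of how this template might be assembled programmatically, consider the following minimal Python sketch; the function and argument names are our own assumptions, not code released with the paper.

def build_prompt(zero_shot: str, few_shot: str = "", batch: str = "") -> str:
    # Assemble the template: the role instruction first, then the optional
    # zero-shot, few-shot, and batch components (Sections 3.1, 3.2, 3.5).
    parts = ["You are a database engineer.", zero_shot, few_shot, batch]
    return "\n".join(p for p in parts if p)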
3.1 Zero-shot Prompting

Zero-shot prompting is a technique that guides LLMs to generate the desired output. It has been demonstrated to effectively enhance the reasoning abilities of LLMs [7]. We employ zero-shot prompting to specify both the task and the answer format. Specifically, we adhere to the chain-of-thought paradigm [18], in which the LLM is expected to reason before delivering the answer. An example of a zero-shot prompt for DI is as follows:

You are requested to infer the value of the "city" attribute based on the values of other attributes.
MUST answer each question in two lines. In the first line, you give the reason for the inference. In the second line, you ONLY give the value of the "city" attribute.

We design specific zero-shot prompts for ED and DI. For ED, since we provide the entire record r but ask the LLM to detect an error in one attribute r_j at a time, the LLM might erroneously identify an error in attribute r_j', where j' ≠ j. To avoid this, we prompt the LLM to confirm the target attribute with: Please confirm the target attribute in your reason for inference. For DI, we provide a hint about the data type of the attribute to be imputed. For example, given the hint The "hoursperweek" attribute can be a range of integers, the LLM will respond with a range instead of a single number.
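For illustration, a zero-shot prompt of this form for DI could be generated as in the sketch below; the function name and parameters are our own assumptions, while the instruction text follows the example above.

def zero_shot_di_prompt(target: str, type_hint: str = "") -> str:
    # Task specification plus the two-line answer format; an optional data-type
    # hint (e.g., that "hoursperweek" can be a range of integers) is appended.
    prompt = (
        f'You are requested to infer the value of the "{target}" attribute '
        "based on the values of other attributes.\n"
        "MUST answer each question in two lines. In the first line, you give "
        "the reason for the inference. In the second line, you ONLY give the "
        f'value of the "{target}" attribute.'
    )
    return prompt + ("\n" + type_hint if type_hint else "")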
3.2 Few-shot Prompting

Few-shot prompting [1] involves providing a small selection of examples to condition the LLM for tasks that deviate from its pre-training objectives (e.g., text completion and code generation). We apply few-shot prompting by manually selecting a subset of data instances from the dataset and labeling them. For example, the few-shot examples for DI are presented as follows:

Users:
Question 1: Record is [Data Instance 1]. What is the city?
...
Assistant:
Answer 1: [Reason 1]
[Answer 1]
...

The data instances here adhere to the contextualization introduced in Section 3.3. Users are required to provide plausible reasoning for few-shot examples. For instance, given [name: "carey's corner", addr: "1215 powers ferry rd.", phone: "770-933-0909", type: "hamburgers", city: ???] as [Data Instance 1], [Reason 1] would be The phone number "770" suggests that the city should be either Atlanta or Marietta in Georgia. The addr attribute suggests a place in Marietta., and [Answer 1] would be Marietta.
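A minimal sketch of how such few-shot examples might be packed into a chat-style message list, reusing the record from this section, is given below; the helper name and message layout are illustrative assumptions.

def few_shot_messages(examples):
    # Turn labeled (record_text, reason, answer) triples into a user block of
    # questions and an assistant block of two-line answers, as in Section 3.2.
    questions, answers = [], []
    for i, (record_text, reason, answer) in enumerate(examples, start=1):
        questions.append(f"Question {i}: Record is {record_text}. What is the city?")
        answers.append(f"Answer {i}: {reason}\n{answer}")
    return [
        {"role": "user", "content": "\n".join(questions)},
        {"role": "assistant", "content": "\n".join(answers)},
    ]

messages = few_shot_messages([(
    '[name: "carey\'s corner", addr: "1215 powers ferry rd.", '
    'phone: "770-933-0909", type: "hamburgers", city: ???]',
    'The phone number "770" suggests that the city should be either Atlanta or '
    'Marietta in Georgia. The addr attribute suggests a place in Marietta.',
    "Marietta",
)])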

3.3 Contextualization

Given that LLMs take raw text as input, we convert the contents of each data instance to a text sequence in the following format:

[x_1.name: "x_1.value", ..., x_n.name: "x_n.value"]

Here, x_i denotes the i-th attribute of a data instance, name denotes the attribute name, value denotes the cell value, and n is the number of input attributes. Specifically, we use ??? to denote missing values for DI, and x_1.name = name and x_2.name = description for SM.
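A minimal sketch of this serialization, assuming each record is held as an ordered attribute-value dictionary (the function name is ours), could look as follows.

def contextualize(record: dict) -> str:
    # Serialize a data instance into the bracketed attribute-value format,
    # writing ??? for missing values (e.g., the target attribute for DI).
    parts = [f'{name}: ???' if value is None else f'{name}: "{value}"'
             for name, value in record.items()]
    return "[" + ", ".join(parts) + "]"

contextualize({"name": "carey's corner", "phone": "770-933-0909", "city": None})
# -> [name: "carey's corner", phone: "770-933-0909", city: ???]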
3.4 Feature Selection

If metadata is available, users can manually select a subset of features to improve performance. For instance, when imputing a restaurant's location, the phone number and street name are relevant features, while the restaurant's name and type (Asian, Italian, etc.) are irrelevant. Therefore, users may choose to use only the phone number and street name as attributes in the above prompt.

Table 1: Comparison with baselines, measured in accuracy (%) for data imputation and F1 score (%) for the other tasks. LLMs are
equipped with the best setting. “N/A” denotes not applicable or not reported in their original papers.

              Error Detection      Data Imputation      Schema Matching    Entity Matching
Methods       Adult    Hospital    Buy    Restaurant    Synthea            Amazon-Google  Beer  DBLP-ACM  DBLP-Google  Fodors-Zagats  iTunes-Amazon  Walmart-Amazon
HoloClean 54.5 51.4 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
HoloDetect 99.1 94.4 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
IPM N/A N/A 96.5 77.2 N/A N/A N/A N/A N/A N/A N/A N/A
SMAT N/A N/A N/A N/A 38.5 N/A N/A N/A N/A N/A N/A N/A
Magellan N/A N/A N/A N/A N/A 49.1 78.8 98.4 92.3 100 91.2 71.9
Ditto N/A N/A N/A N/A N/A 75.6 94.4 99.0 95.6 100 97.1 86.8
Unicorn N/A N/A N/A N/A N/A N/A 90.3 N/A 95.6 100 96.4 86.9
Unicorn ++ N/A N/A N/A N/A N/A N/A 87.5 N/A 96.2 97.7 98.2 86.9
Table-GPT N/A N/A N/A N/A N/A 70.1 96.3 93.8 92.4 97.7 92.9 82.4
GPT-3 99.1 97.8 98.5 88.4 45.2 63.5 100 96.6 83.8 100 98.2 87.0
GPT-3.5 92.0 90.7 98.5 94.2 57.1 66.5 96.3 94.9 76.1 100 96.4 86.2
GPT-4 92.0 90.7 100 97.7 66.7 74.2 100 97.4 91.9 100 100 90.3
GPT-4o 83.6 44.8 100 90.7 6.6 70.9 90.3 95.9 90.4 93.6 98.2 79.2

Table 2: Ablation study, measured in accuracy (%) for data imputation and F1 score (%) for the other tasks, using GPT-3.5. ZS-T
denotes zero-shot task specification. FS denotes few-shots. B denotes batch prompting. ZS-R denotes zero-shot reasoning.

              Error Detection      Data Imputation      Schema Matching    Entity Matching
Components    Adult    Hospital    Buy    Restaurant    Synthea            Amazon-Google  Beer  DBLP-ACM  DBLP-Google  Fodors-Zagats  iTunes-Amazon  Walmart-Amazon
ZS-T 25.9 18.4 86.2 81.4 18.2 54.7 83.3 94.7 58.5 92.7 80.0 81.5
ZS-T+B 37.8 19.1 83.1 81.4 17.4 60.1 78.3 94.9 59.6 92.7 83.9 81.6
ZS-T+B+ZS-R 46.3 26.2 89.2 65.1 5.9 45.8 50.0 72.6 47.6 92.7 82.0 60.7
ZS-T+FS 59.3 59.4 96.9 90.7 57.1 66.3 96.3 97.0 74.6 100 96.4 85.6
ZS-T+FS+B 58.1 56.1 96.9 86.2 53.3 66.5 96.3 96.2 76.1 97.8 94.7 86.2
ZS-T+FS+B+ZS-R 92.0 90.7 98.5 94.2 61.5 60.1 92.3 95.7 60.0 97.8 96.4 84.0

3.5 Batch Prompting

Considering the significant token and time cost of LLMs, batch prompting [3] was proposed to enable the LLM to run inference in batches, rather than processing one sample at a time. To implement this technique, multiple data instances are presented in a single prompt, and the LLM is instructed to answer all of them. For example, for DI, the prompt is the same as the first part of few-shot prompting (i.e., the part before Assistant:). We propose two modes for batching: the first is random batching, where data instances are randomly assigned to a batch; and the second is cluster batching, where we perform clustering on the dataset, and then random batching is conducted within each cluster.
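As a rough illustration of these two modes, the sketch below assumes instances have already been contextualized into strings; the helper names, the use of the sentence-transformers and scikit-learn libraries, and the embedding model choice are our assumptions (Section 4.2 mentions k-means over Sentence-BERT [14] embeddings for cluster batching).

import random
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def random_batches(instances, batch_size, seed=0):
    # Random batching: shuffle the instances and cut them into batches.
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

def cluster_batches(instances, n_clusters, batch_size, seed=0):
    # Cluster batching: k-means over Sentence-BERT embeddings, then random
    # batching within each cluster.
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(list(instances))
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(embeddings)
    batches = []
    for c in range(n_clusters):
        members = [x for x, lab in zip(instances, labels) if lab == c]
        batches.extend(random_batches(members, batch_size, seed))
    return batches

def batch_prompt(batch, question="What is the city?"):
    # Amalgamate one batch into a single prompt, one question per instance.
    return "\n".join(f"Question {i}: Record is {inst}. {question}"
                     for i, inst in enumerate(batch, start=1))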
4 EXPERIMENTS

4.1 Experimental Setup

We use the datasets evaluated in [12]. We evaluate three LLMs: GPT-3.5-turbo-0301 (referred to as GPT-3.5), GPT-4-0314 (referred to as GPT-4), and GPT-4o-2024-05-13 (referred to as GPT-4o). The temperature parameter for these models is set at 0.35. For SM tasks, we use 3 few-shot examples, and for the other tasks, we use 10. The default batch prompting method is random batching. The batch size ranges for GPT-3.5, GPT-4, and GPT-4o are [10, 20], [10, 15], and [5, 10], respectively. As baselines, we employ GPT-3 (text-davinci-002) with the best settings [12] for all four tasks, and HoloClean [15] and HoloDetect [6] for ED, IPM [11] for DI, SMAT [19] for SM, and Magellan [8], Ditto [10], Unicorn/Unicorn++ [17], and Table-GPT [9] for EM. As these baselines have been evaluated in [12], we use those results as a reference. Open LLMs such as LLaMA are not considered here, as they are generally less competitive than closed models [4].

4.2 Experimental Results

Performance comparisons are presented in Table 1. GPT-4 surpasses GPT-3 on three out of the four tasks: DI, SM, and EM. For DI and SM, it achieves superior performance to previous methods, particularly for SM. Moreover, GPT-4 emerges as the victor on 4 out of 7 datasets for EM. GPT-3.5 also presents strong competition, outperforming GPT-3 on DI and SM. GPT-4o is generally on a par with GPT-3.5 on DI and EM but turns out to be mediocre on ED and SM, showcasing inconsistent performance. Table-GPT, as GPT-3.5 fine-tuned for processing tabular input, exhibits roughly reduced performance relative to GPT-3.5 on EM. Consequently, we recommend that users either employ larger models or fine-tune model parameters for these tasks, and avoid the model with a stronger HCI focus (i.e., GPT-4o) for the time being.

Table 3: Batch size evaluation, measured on the Adult dataset for ED, using GPT-3.5 without few-shot prompting.

Batch size   F1 score (%)   Tokens (M)   Cost ($)   Time (hrs)
1            44.0           4.07         8.14       4.76
2            45.9           2.38         4.75       2.70
4            45.1           1.87         3.74       2.06
8            45.0           1.61         3.21       1.82
15           46.3           1.49         2.99       1.60
We also observe Ditto, a non-GPT method, excelling on a few datasets. For ED, our performance is not as competitive as the GPT-3 results reported in [12]. The prompts used for GPT-3 in [12] are not directly applicable to GPT-3.5 and GPT-4. We believe the results for ED warrant further investigation, such as a case-by-case comparison.

To assess the effectiveness of our prompting strategy, we test GPT-3.5, as it is more cost-effective and faster than GPT-4 while delivering notable performance in the above evaluations. This makes it a more desirable choice for applications dealing with large datasets. The results are reported in Table 2. We start with GPT-3.5 prompted with only the task specification (i.e., without reasoning, as shown in the first line of the prompt in Section 3.1) through zero-shot prompting. The result quality for ED and SM is very low, and roughly below 90% for DI and EM. The inclusion of few-shot examples improves performance across the board, exceeding 50% for ED and SM and reaching approximately 90% for the other tasks. Batch prompting generally has a slight negative effect on result quality. With zero-shot reasoning, the performance on ED, DI, and SM is further improved, with ED over 90% and SM over 60%. However, there is little improvement observed for EM, potentially due to GPT-3.5's reasoning limitations and the lack of adequate input information for reasoning.
Feature selection proves useful for GPT-4. For instance, for entity matching on the Beer dataset without few-shot prompting, the F1 scores before and after feature selection are 74.1% and 90.3%, respectively. In terms of batch prompting, we compare random batching with cluster batching, where data instances are clustered using k-means over their Sentence-BERT [14] embeddings. For entity matching on the Amazon-Google dataset without few-shot prompting, the F1 score increases from 45.8% to 50.6% when switching from random to cluster batching, illustrating the effectiveness of cluster batching.

We explore the impact of batch size and present the results in Table 3. As the batch size increases, there is a significant reduction in the number of tokens, dropping from over 4M without batch prompting to 1.5M with a batch size of 15. Both the cost and the processing time follow similar trends, decreasing from $8.14 to $2.99 and from 4.8 hours to 1.6 hours, respectively. Concurrently, the F1 score exhibits only minor fluctuations, even displaying an increase when the batch size is set to 15. This is because GPT-3.5 can identify commonalities in the questions and generate consistent solutions for all data instances in the batch, thereby enhancing overall performance.

ACKNOWLEDGMENTS

This work is mainly supported by NEC Corporation and partially supported by JSPS Kakenhi 23K17456, 23K25157, 23K28096, and JST CREST JPMJCR22M2.

REFERENCES

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. NeurIPS, 33:1877–1901, 2020.
[2] L. Cheng, X. Li, and L. Bing. Is GPT-4 a good data analyst? arXiv preprint arXiv:2305.15038, 2023.
[3] Z. Cheng, J. Kasai, and T. Yu. Batch prompting: Efficient inference with large language model APIs. arXiv preprint arXiv:2301.08721, 2023.
[4] W.-L. Chiang, L. Zheng, Y. Sheng, L. Dunlap, A. Angelopoulos, C. Chou, T. Li, and S. Zhuang. LMSYS Chatbot Arena leaderboard, 2024.
[5] csunny. DB-GPT: Revolutionizing database interactions with private LLM technology, 2023.
[6] A. Heidari, J. McGrath, I. F. Ilyas, and T. Rekatsinas. HoloDetect: Few-shot learning for error detection. In SIGMOD, pages 829–846, 2019.
[7] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.
[8] P. Konda, S. Das, A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad, et al. Magellan: Toward building entity matching management systems over data science stacks. PVLDB, 9(13):1581–1584, 2016.
[9] P. Li, Y. He, D. Yashar, W. Cui, S. Ge, H. Zhang, D. R. Fainman, D. Zhang, and S. Chaudhuri. Table-GPT: Table-tuned GPT for diverse table tasks. arXiv preprint arXiv:2310.09263, 2023.
[10] Y. Li, J. Li, Y. Suhara, A. Doan, and W.-C. Tan. Deep entity matching with pre-trained language models. PVLDB, 14(1):50–60, 2020.
[11] Y. Mei, S. Song, C. Fang, H. Yang, J. Fang, and J. Long. Capturing semantics for imputation with pre-trained language models. In ICDE, pages 61–72, 2021.
[12] A. Narayan, I. Chami, L. Orr, and C. Ré. Can foundation models wrangle your data? PVLDB, 16(4):738–746, 2022.
[13] S. Razniewski, A. Yates, N. Kassner, and G. Weikum. Language models as or for knowledge bases. arXiv preprint arXiv:2110.04888, 2021.
[14] N. Reimers. Sentence-BERT. https://www.sbert.net/, 2022.
[15] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. HoloClean: Holistic data repairs with probabilistic inference. PVLDB, 10(10):1190–1201, 2017.
[16] I. Trummer. CodexDB: Synthesizing code for query processing from natural language instructions using GPT-3 Codex. PVLDB, 15(11):2921–2928, 2022.
[17] J. Tu, J. Fan, N. Tang, P. Wang, G. Li, X. Du, X. Jia, and S. Gao. Unicorn: A unified multi-tasking model for supporting matching tasks in data integration. PACMMOD, 1(1):1–26, 2023.
[18] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
[19] J. Zhang, B. Shin, J. D. Choi, and J. C. Ho. SMAT: An attention-based deep learning solution to the automation of schema matching. In ADBIS, pages 260–274, 2021.
[20] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
