Entity Matching Using Large Language Models
ABSTRACT
Entity matching is the task of deciding whether two entity de-
scriptions refer to the same real-world entity. Entity matching
is a central step in most data integration pipelines. Many state-
of-the-art entity matching methods rely on pre-trained language
models (PLMs) such as BERT or RoBERTa. Two major drawbacks
of these models for entity matching are that (i) the models require significant amounts of task-specific training data and (ii) the fine-tuned models are not robust concerning out-of-distribution entities.
Table 3: Average F1-scores over all datasets for the zero-shot experiments (all datasets, average F1).

Prompt                  GPT-mini  GPT-4  GPT-4o  Llama2  Llama3.1  Mixtral
domain-complex-force       85.29  88.91   87.00   66.60     84.87    68.29
domain-complex-free        85.40  89.46   80.31   69.69     72.06    62.13
domain-simple-force        50.41  86.10   82.72   57.24     60.52    41.65
domain-simple-free         33.65  87.92   63.53   50.40     36.94    43.20
general-complex-force      83.50  87.94   85.02   63.89     83.26    59.51
general-complex-free       83.13  87.85   55.81   62.73     80.54    61.50
general-simple-force       52.88  81.12   83.65   62.31     72.65    33.59
general-simple-free        45.49  85.07   64.67   52.77     63.38    36.12
Narayan-complex            56.13  86.70   50.65   56.34     36.16    32.04
Narayan-simple             75.15  86.92   45.64   67.81     37.32    30.94
Mean                       65.10  86.80   69.90   60.98     62.77    46.90
Standard deviation         18.45   2.26   14.86    6.18     18.54    13.68

…by at least 1% F1, achieving an absolute performance of 89% or higher on 5 of 6 datasets without requiring any task-specific training data. On the publication datasets, GPT-4o achieves nearly the same performance (0.1-1% F1). This gap increases on the product datasets to 1-3% F1, making the more recent model marginally worse than GPT4. GPT-mini performs up to 6% F1 worse than GPT-4o, with only marginal performance differences on 4 of 6 datasets. Among the open-source LLMs, Llama3.1 consistently outperforms Llama2 by 1-21% F1. Llama3.1's performance is comparable to GPT-mini on all datasets. The Mixtral model performs less effectively on this task, lagging behind the other open-source models by 7-16% on 4 datasets. In summary, the results indicate that locally run open-source LLMs can perform similarly to OpenAI's GPT-mini model given that the right prompt is selected. However, if maximum performance is desired, none of the other LLMs can match GPT-4 in a zero-shot setting. GPT-4o offers a more cost-effective alternative to GPT-4 (see Section 5), though its performance is slightly lower. The GitHub repository provides additional results for the models GPT3.5-turbo, SOLAR, and StableBeluga2.

Sensitivity: Small variations in prompts can have a large impact on the overall task performance [25, 30, 51]. We measure this prompt sensitivity as the standard deviation (SD) of the F1 scores of a model over all 10 prompt designs and list this standard deviation in the lower section of Tables 2 and 3. Comparing the prompt sensitivity of the models, the GPT4 model is most invariant to the wording of the prompt (mean standard deviation 2.26) while also achieving high results with most of the prompt designs. Comparing the sensitivity of GPT4 to all other models shows that they have a significantly higher prompt sensitivity (standard deviation 6.18 to 18.54 in Table 3).

Prompt to Model Fit: The best result for each model is set bold in Table 2, the second best result is underlined. This highlighting shows that there is no prompt design that performs best for most models. As a result, a general statement of how to design a prompt for the entity matching task cannot be made. While the presented analysis is not exhaustive regarding all possible prompt designs, the results indicate that the best prompt depends on the model/dataset combination. While a good performing prompt can be found by testing a set of pre-defined prompts (as we did), automated approaches for prompt tuning and evolution could still further improve the results [18, 43].

Comparison to PLM Baselines: We compare the zero-shot performance of the LLMs to the performance of two PLM-based matchers: a fine-tuned RoBERTa model [26] and Ditto [23], an entity matching system which also relies on domain-specific training data. Table 4 shows the overall best results for each LLM in comparison to the two PLM-based matchers on all datasets. For three out of the six datasets, GPT4 achieves higher performance than the best PLM baseline (2.65-4.71% F1), while the performance for the other three datasets is 3.69, 4.49 and 0.73% F1 lower. This shows that GPT4, without using any task-specific training data, is able to reach comparable results or even outperform PLMs that were fine-tuned using thousands of training pairs (see Table 1). The reliance on large amounts of task-specific training data to achieve good performance is one of the main shortcomings of fine-tuned PLMs.
Table 4: Comparison of F1 scores of the best zero-shot
prompt per model with PLM baselines. The "unseen" rows
correspond to training on the dataset named in the column
and applying the model to the WDC Products test set.
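Since the exact prompt texts are not reproduced in this excerpt, the following is only a minimal sketch of how a zero-shot request in the style of the general-simple-force design could be issued with the OpenAI Python client; the prompt wording, the model name, and the example offers are assumptions.

    # Minimal sketch of a zero-shot matching request; wording and model name
    # are assumptions, not the exact prompts used in the paper.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def zero_shot_match(entity_a: str, entity_b: str, model: str = "gpt-4o") -> str:
        # "general-simple-force"-style prompt: generic task description, short
        # entity serializations, answer forced to Yes/No.
        prompt = (
            "Do the two entity descriptions refer to the same real-world entity? "
            "Answer with 'Yes' or 'No' only.\n"
            f"Entity 1: {entity_a}\n"
            f"Entity 2: {entity_b}"
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response.choices[0].message.content.strip()

    # Hypothetical usage:
    # zero_shot_match("DYMO D1 Tape 12mm x 7m", "Dymo D1 12mm x 7m label tape")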
…26% performance on most datasets. For all other LLMs, providing in-context examples usually leads to performance improvements, while the size of the improvements varies widely.

In summary, in-context learning improves the performance of the LLMs for approximately 61% of the model/dataset combinations that we tested (see row Δ Few-shot/zero-shot in Table 5). Providing demonstrations was not helpful for GPT4, which does not need the additional guidance on two datasets, as well as for the smaller models GPT-mini and Mixtral, which suffer large performance drops on many datasets. As a result, the usefulness of in-context learning cannot be assumed but needs to be determined experimentally for each model/dataset combination.

Comparison of Selection Methods: The best demonstration selection method also varies depending on the dataset. The open-source LLMs generally reach the best performance when random or handpicked demonstrations are provided. In contrast, GPT-4 and GPT-4o achieve the highest scores on most datasets using related demonstrations, suggesting that these models are better able to understand and apply specific patterns from closely related examples to the current matching decision. The handpicked demonstrations, while not helpful for the Llama models on their source dataset WDC Products, lead to improvements on all other product datasets. The same effect is visible for the handpicked demonstrations transferred to DBLP-ACM.

Table 6: Mean results for the in-context learning (all datasets, mean F1).

Prompt                 Shots  GPT-mini  GPT-4  GPT-4o  Llama2  Llama3.1  Mixtral
Fewshot-related            6     73.76  90.24   90.41   65.44     82.12    50.51
Fewshot-related           10     76.56  90.80   91.21   62.69     85.85    53.25
Fewshot-random             6     77.86  89.44   89.77   63.99     85.95    57.37
Fewshot-random            10     80.51  89.05   89.85   65.62     88.06    53.94
Fewshot-handpicked         6     72.81  88.61   89.44   70.52     84.87    57.76
Fewshot-handpicked        10     73.93  88.76   89.52   69.91     87.60    51.03
Hand-written rules         0     81.49  87.65   86.36   51.22     85.57    79.03
Learned rules              0     84.14  86.64   84.96   44.23     84.11    74.53
Mean                       -     77.63  88.90   88.94   61.70     85.51    59.68
Standard deviation         -      3.85   1.25    2.00    8.63      1.77    10.23
Best zero-shot             0     85.51  89.95   88.10   76.01     86.25    69.18
Δ Few-shot/zero-shot       -     -5.00   0.85    3.10   -5.49      1.81   -11.42
Δ Rules/zero-shot          -     -1.37  -2.29   -1.74  -24.78     -0.68     9.86
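The demonstration selection strategies compared above can be sketched as follows; the token-overlap heuristic used here to pick related demonstrations and the prompt wording are illustrative assumptions, not the paper's exact implementation.

    # Sketch of assembling a few-shot prompt with selected demonstrations.
    import random

    def token_overlap(a: str, b: str) -> float:
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(1, len(ta | tb))

    def select_demonstrations(query: str, pool: list[dict], k: int = 6,
                              strategy: str = "related") -> list[dict]:
        # pool entries: {"left": ..., "right": ..., "label": "Yes"/"No"}
        if strategy == "random":
            return random.sample(pool, k)
        if strategy == "related":
            return sorted(pool,
                          key=lambda d: token_overlap(query, d["left"] + " " + d["right"]),
                          reverse=True)[:k]
        raise ValueError(strategy)  # "handpicked" demonstrations are fixed manually

    def build_few_shot_prompt(pair: tuple[str, str], demos: list[dict]) -> str:
        lines = ["Do the two entity descriptions refer to the same real-world entity? "
                 "Answer with 'Yes' or 'No'.", ""]
        for d in demos:
            lines += [f"Entity 1: {d['left']}", f"Entity 2: {d['right']}",
                      f"Answer: {d['label']}", ""]
        lines += [f"Entity 1: {pair[0]}", f"Entity 2: {pair[1]}", "Answer:"]
        return "\n".join(lines)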
4.2 Learning Matching Rules

In the next set of experiments, we provide a set of textual matching rules in the prompt in order to guide the model to select the correct solution. We differentiate between two kinds of rules: (i) handwritten and (ii) learned rules. Handwritten rules are a set of binary rules created by defining which attributes need to match for the given domain to signify a match. The rules also inform the model of potential heterogeneity in these attributes, such as slight differences in surface form or value formats. For the learned rules, we pass the set of handpicked in-context pairs to GPT4 and ask the model to automatically generate matching rules from these examples. Similar to the handwritten rules, they refer to specific attributes that should match and potential sources of heterogeneity that the GPT4 model extracted from the provided examples. A subset of these handwritten and learned rules for the product domain is depicted in Figure 3. The full list of learned rules is available in the project repository.

Figure 3 (caption fragment): …matching rules for the product domain. A subset of the learned rules is depicted below.
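As a rough illustration of the learned-rule setup, the following sketch asks GPT4 to derive rules from the handpicked pairs; the prompt wording is an assumption, since the full prompts are only documented in the project repository.

    # Sketch of deriving "learned" matching rules from labeled demonstration pairs.
    def learn_rules(client, demos: list[dict], model: str = "gpt-4") -> str:
        examples = "\n".join(
            f"Entity 1: {d['left']}\nEntity 2: {d['right']}\nMatch: {d['label']}\n"
            for d in demos
        )
        prompt = (
            "Below are labeled examples of product offer pairs. Derive a short list of "
            "general matching rules that state which attributes must agree for a match "
            "and which differences (e.g., surface form or formatting) should be tolerated.\n\n"
            + examples
        )
        response = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}], temperature=0
        )
        # the returned rule text is later inserted into the matching prompt
        return response.choices[0].message.content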
Effectiveness: Table 5 shows the results of providing matching rules in comparison to the best zero-shot prompt and the in-context experiments. The results show that GPT4 with matching rules does not improve over its best zero-shot performance and instead loses 1% to 3% F1 on all datasets. All other models see improvements on some datasets of 0.3% to 17% F1 over zero-shot, depending on the model/dataset combination. Especially the Mixtral LLM, which has comparatively low performance compared to all other LLMs in the zero-shot and few-shot settings, improves significantly with the provision of rules on all datasets, gaining 3 to 17% F1. In summary, the provision of matching rules can be helpful, especially for the open-source LLMs, with Mixtral achieving its highest scores on all datasets using rules; however, providing task demonstrations generally leads to higher performance gains than providing matching rules for all other models.
Sensitivity: We measure the prompt sensitivity of the LLMs as the standard deviation of the F1-scores across all few-shot and rule experiments. We list this standard deviation in the lower part of Tables 5 and 6. Comparing the prompt sensitivity of the models to the zero-shot deviations across different prompt formulations, the average deviation from the mean has decreased for all models, suggesting that the additional guidance in the form of demonstrations and rules leads to more robust results.

4.3 Fine-Tuning

In the next set of experiments, we fine-tune the GPT-mini model via the OpenAI API as well as the Llama2 and Llama3.1 models using local hardware. We use the training and validation sets of each dataset to train a fine-tuned model with the domain-simple-force prompt and subsequently apply the fine-tuned models with this prompt to all datasets. We fine-tune GPT-mini for 10 epochs using the default parameters suggested by OpenAI. For the Llama models, we fine-tune using 4-bit quantization to manage the high VRAM requirements of the 70B models. We employ Low-Rank Adaptation (LoRA) and also train for 10 epochs.
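A minimal sketch of such a 4-bit LoRA setup with the Hugging Face transformers and peft libraries is shown below; the checkpoint name and the hyperparameters are illustrative assumptions rather than the exact configuration used in our experiments.

    # Sketch of 4-bit quantization plus LoRA adapters for fine-tuning a Llama model.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"  # assumed checkpoint name

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # 4-bit weights to fit the 70B model in VRAM
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )

    lora_config = LoraConfig(                   # train only small low-rank adapter matrices
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    # The adapter is then trained for 10 epochs on serialized entity pairs using a
    # domain-simple-force style prompt with "Yes"/"No" completions.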
Table 7: Results for fine-tuning LLMs and subsequent transfer to all datasets. Left-most column shows the dataset used for fine-tuning. (Lower comparison section of the table; the six value columns correspond to the benchmark datasets.)

Best zero-shot     Llama2     69.09   82.03   63.91   57.93   85.46   97.62
                   Llama3.1   83.67   89.84   84.85   73.99   86.32   98.81
                   GPT-mini   81.15   91.93   86.58   72.18   86.11   97.60
Δ best zero-shot   Llama2     -2.28  +10.12  +26.66  +18.26   +7.34   +1.58
                   Llama3.1   -5.80   +3.76   +6.16   +4.68   +6.05   +0.79
                   GPT-mini   +7.74   +3.01   +6.41  +14.93   +7.84   +1.80
Δ best GPT4        Llama2     -22.8   -3.63   +0.90   -0.19   +2.98   +0.79
                   Llama3.1  -11.74   -2.18   +1.34   +2.29   +2.55   +1.19
                   GPT-mini   -0.72   -0.84   +3.32  +10.73   +4.13   +0.99
Best GPT4          -          89.61   95.78   89.67   76.38   89.82   98.41

Effectiveness: The results of the fine-tuned LLMs are shown in Table 7. The lower part of the table restates the best zero-shot and GPT4 results for comparison. When comparing the fine-tuning results to the best zero-shot performance (section Δ best zero-shot in Table 7), we observe a substantial improvement of 1% to 26% F1 depending on the dataset for all models. Only the Llama models on WDC Products do not profit from fine-tuning. On four out of six datasets, the best fine-tuned Llama3.1 and GPT-mini models exceed the performance of zero-shot GPT4 by 1 to 10% F1 (see section Δ best GPT4 in Table 7). In summary, fine-tuning the models leads to improved results compared to the zero-shot version of the model, rivaling the performance of the best GPT4 prompts with the much cheaper GPT-mini model and consistently improving the performance of the Llama models by 1-26% F1 on 5 out of 6 datasets, leaving Llama3.1 only slightly behind GPT-mini on 4 datasets. Furthermore, the experiments show that the fine-tuned Llama models reach a similar performance or outperform GPT4 on 4 out of 6 datasets.

Generalization: We observe a generalization effect for the GPT-mini model fine-tuned on one dataset to datasets from related domains and across domains. Transferring models between related product domains leads to improved performance over the best zero-shot prompts for many combinations of datasets. The effect is especially visible for the combinations WDC Products, Abt-Buy and Walmart-Amazon, which contain similar products. The transfer to Amazon-Google results in better performance than zero-shot for all of the mentioned product datasets. Conversely, the reverse transfer from Amazon-Google does not yield improved results. Furthermore, all GPT-mini models fine-tuned on the datasets from the product domain exhibit good generalization to the publication domain, resulting in improvements of 1-3% F1 over the best zero-shot. Transferring fine-tuned models within the publication domain shows the same effect. The transfer does not work in the other direction, as transferring a model fine-tuned for the publication domain leads to lower performance on the product datasets. For the Llama models this effect is only visible for some inter-product transfers, mostly for Llama2.
5 COST AND RUNTIME ANALYSIS

Apart from pure matching performance, there are additional considerations such as data privacy requirements and the cost of using hosted LLMs, which may result in the decision to use a less performant but cheaper hosted LLM or to run an open-source LLM on local hardware. The cost analysis presented in the following gives an overview of expected costs for hosted models. The purpose of the analysis is to give the reader general guidance of what to expect with regards to the cost dimension. We leave a more in-depth analysis of costs, including acquisition costs for GPUs and electricity for the open-source models, to future work.

Table 8: Costs for hosted LLMs on WDC Products. Best performing prompts are selected for the analysis for each scenario.

Costs: Table 8 lists the costs associated with the hosted LLMs across all experimental scenarios for the WDC Products dataset. The cost of using a hosted LLM depends on the length of the respective prompts, measured by the number of tokens, and the current prices of the respective model. Thus, the results we present here are only a snapshot as of August 2024 as the prices are subject to change. We compare the costs of all OpenAI models. The prices for using the models were as follows for 1 million prompt/completion tokens: $0.15/$0.60 for GPT-mini, $30.00/$60.00 for GPT-4, and $2.50/$10.00 for GPT-4o.
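As a worked example of how these prices translate into experiment costs, the following sketch multiplies token counts by the per-million-token prices listed above; the token counts in the usage example are hypothetical placeholders, not measured values.

    # Cost calculation: tokens times price per million tokens.
    PRICES = {  # (prompt, completion) in USD per 1M tokens, August 2024 snapshot
        "gpt-mini": (0.15, 0.60),
        "gpt-4":    (30.00, 60.00),
        "gpt-4o":   (2.50, 10.00),
    }

    def run_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
        p_in, p_out = PRICES[model]
        return (prompt_tokens * p_in + completion_tokens * p_out) / 1_000_000

    # Hypothetical example: 4,500 test pairs, ~300 prompt and ~5 completion tokens each:
    # run_cost("gpt-4", 4500 * 300, 4500 * 5)    -> about $41.85
    # run_cost("gpt-mini", 4500 * 300, 4500 * 5) -> about $0.22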
Table 8 shows that the in-context learning (6-shot, 10-shot) and the rule-based approaches (hand-written, learned) from Section 4 require between 1.3 and 11 times the amount of tokens per prompt compared to basic zero-shot prompting (see row Token increase to ZS in Table 8). For all of them this is due to longer prompts, either because of the inclusion of few-shot demonstrations or rules. The fine-tuning approach, on the other hand, requires fewer tokens than zero-shot, as the prompt we chose for fine-tuning uses the restricted output format force (see Section 3) whereas the best zero-shot prompt for GPT-mini uses the free format, which allows the model to answer more verbosely. From a cost perspective, the in-context learning and the rule-based approaches increase the costs by 1.5 to 470 times compared to the cost of the zero-shot GPT-mini model. While GPT-mini is the cheapest model in this lineup, the GPT-4o model achieves significantly higher performance, often approaching or even surpassing GPT-4, at a fraction of GPT-4's cost. If many training examples are available, fine-tuning the GPT-mini model results in comparably high performance for a fraction of the cost of even GPT-4o.

Table 9: Runtime in seconds per prompt (request) for all LLMs using the best prompts from the previous sections on the WDC Products dataset. Runtimes marked with * are for the quantized version of the model used for fine-tuning.

Model      Zero-shot   6-Shot   10-Shot   Rules (written)   Rules (learned)   Fine-Tune (Inference)
GPT-mini      1.54 s   0.46 s    0.51 s            0.47 s            0.47 s                  0.46 s
GPT-4         2.19 s   0.75 s    0.78 s            0.68 s            0.76 s                       -
GPT-4o        0.51 s   0.48 s    0.53 s            0.48 s            0.49 s                       -
Llama2       22.62 s   7.15 s    7.82 s           23.16 s           24.51 s                 *0.30 s
Llama3.1      0.54 s   1.70 s    2.36 s            0.67 s            1.70 s                 *0.30 s

Runtime: Table 9 lists the average runtime per prompt for all LLMs. The selected prompts and the used numbers of tokens are the same as in Table 8. If the prompt allowed free-form answering, this leads to much longer runtimes compared to forcing the model to answer briefly. The large difference in runtimes between zero-shot Llama2 and Llama3.1 in Table 9 is an example of this. The runtimes of the hosted models are a snapshot of the API performance in August 2024 and may change at any time. Prompting GPT4 generally takes around 50% longer than the other two OpenAI models, which have comparable runtimes if the answering scheme is the same. The locally hosted open-source LLM Llama2 requires the largest amount of time for most scenarios on our hardware (see Section 2), particularly when generating freely in the zero-shot and rule-based cases, where its runtime is 10 to 33 times longer than that of GPT-4. On the other hand, the Llama3.1 model achieves a comparable runtime to the GPT models in most setups.

6 EXPLAINING MATCHING DECISIONS

Understanding the decisions of a matching model is important for users to build trust towards the systems. Explanations of model decisions can further be used for debugging matching pipelines. The size and structure of deep learning models make explaining their decisions a challenging task, which has led to a dedicated line of research in the field of entity matching [2, 12, 32, 33]. Instead of relying on external explainability methods, LLMs can directly be queried for explanations of their decisions. In this section, we use GPT4 to generate structured explanations for its decisions and show how to aggregate these explanations to derive global insights about matching decisions.

6.1 Generating Explanations

For the generation of explanations, we first prompt the LLM to match a pair of entities and subsequently ask the model for an explanation of its decision using a second prompt. If we do not pose any restrictions on the format of the explanation, the model would answer with natural language text describing the different aspects that influenced its decision [29]. Instead of allowing free-text explanations, we ask the model to organize its explanations into a fixed structure which will later allow us to parse and aggregate the explanations. Figure 4 shows examples of complete conversations for generating structured explanations of matching decisions for pairs from the Walmart-Amazon and DBLP-Scholar datasets. After prompting for and receiving a decision in the
first exchange with the model, we continue the conversation by
passing a second prompt (the second user prompt in Figure 4).
Specifically, we ask for a structured format of the explanation that
includes all attributes of both product offers that were used for
the matching decision. Each attribute should be accompanied by
an importance value as well as a similarity value for the compared
attributes. The sign of the importance values should be negative
if the attribute comparison contributed to a non-match decision
and vice versa.
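A sketch of this two-turn interaction is shown below; the JSON output format requested from the model is an assumption made for illustration, as the paper only prescribes that each attribute is accompanied by an importance value and a similarity value.

    # Sketch of the explanation-requesting second turn and of parsing its output.
    import json

    EXPLANATION_PROMPT = (
        "Explain your decision. Return a JSON list in which each element has the fields "
        "'attribute', 'importance' (between -1 and 1, negative if the comparison speaks "
        "against a match) and 'similarity' (between 0 and 1)."
    )

    def request_explanation(client, conversation: list[dict], model: str = "gpt-4") -> list[dict]:
        # conversation already contains the matching prompt and the model's decision
        messages = conversation + [{"role": "user", "content": EXPLANATION_PROMPT}]
        response = client.chat.completions.create(model=model, messages=messages, temperature=0)
        return json.loads(response.choices[0].message.content)

    # Example element of a parsed explanation:
    # {"attribute": "model", "importance": 0.4, "similarity": 0.9}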
The generated structured explanation of the product pair from
Walmart-Amazon is shown in the second blue AI row in Figure 4.
The explanation shows that the model is capable of extracting
various attributes from the serialized strings. The highest posi-
tive importance is assigned to the attribute model followed by
brand and price. Although none of the extracted attribute values
perfectly match, they are very similar and the model correctly
assigns them a high similarity and positive importance value and
considers them indications for matching product offers. Interestingly, the model extracted the hard drive size from the first offer, which is missing in the second offer, and assigned a low negative importance score due to this missing value. As the size of
the hard drive is an important piece of information for matching,
the model may be accounting for this uncertainty by reducing its
confidence in this specific case. The explanation for the DBLP-
Scholar pair is shown in the 4th blue AI row in Figure 4. The
values of the authors attribute match perfectly, which the model
recognizes as relevant evidence for a match by assigning a pos-
itive importance of 0.3. The model further correctly assigns a
high negative importance to year and conference which are rea-
sonably different to support a non-match decision. Here it is
interesting that while the title overlaps in all but two words, the
model still uses this as the most important evidence for predicting
a non-match.
To evaluate the meaningfulness of the similarity values created by the model in the structured explanations, we calculate their Pearson correlation with the well-known string similarity metrics Cosine and Generalized Jaccard. We apply the latter metrics to each of the extracted attributes found in the explanations and calculate the correlation between them and the generated similarities. We find that the model-generated similarities exhibit a strong positive correlation with Cosine similarity and Generalized Jaccard similarity, ranging between 0.75-0.85 and 0.73-0.83, respectively, across all datasets. These results point to the general meaningfulness of the GPT4-created similarity values.
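The correlation check can be sketched as follows; for brevity, plain token-set cosine and Jaccard similarities stand in here for the Cosine and Generalized Jaccard measures used in the paper, and the explanation field names are assumptions.

    # Sketch: correlate model-generated attribute similarities with string similarities.
    from scipy.stats import pearsonr

    def token_cosine(a: str, b: str) -> float:
        ta, tb = set(a.lower().split()), set(b.lower().split())
        if not ta or not tb:
            return 0.0
        return len(ta & tb) / (len(ta) ** 0.5 * len(tb) ** 0.5)

    def token_jaccard(a: str, b: str) -> float:
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(1, len(ta | tb))

    def correlate(explanations: list[dict]) -> tuple[float, float]:
        # explanations: [{"value_a": ..., "value_b": ..., "similarity": ...}, ...]
        llm_sims = [e["similarity"] for e in explanations]
        cos = [token_cosine(e["value_a"], e["value_b"]) for e in explanations]
        jac = [token_jaccard(e["value_a"], e["value_b"]) for e in explanations]
        return pearsonr(llm_sims, cos)[0], pearsonr(llm_sims, jac)[0]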
We subsequently generate structured explanations for all pairs
in the test sets of both datasets using the best-performing zero-
shot prompt. A sample of the generated explanations was manu-
ally verified against the corresponding model decisions, confirm-
ing the connection between the explanations and the model’s
decisions. All explanations are available in the project repository
to enable the further analysis of their quality.
                           Matches                          Non-Matches
Attribute        Freq.   Mean Import.   St.Dev.     Freq.   Mean Import.   St.Dev.

DBLP-Scholar
title             0.96          0.59      0.40       0.95          -0.40     0.38
authors           0.78          0.65      0.40       0.68          -0.66     0.34
conference        0.50          0.35      0.37       0.29          -0.11     0.29
year              0.46          0.26      0.37       0.43          -0.16     0.25
journal           0.14          0.40      0.43       0.05          -0.15     0.25

Walmart-Amazon
brand             0.98          0.78      0.34       0.99          -0.04     0.34
price             0.92         -0.03      0.27       0.86          -0.16     0.25
model             0.81          0.63      0.51       0.82          -0.77     0.37
color             0.24          0.23      0.31       0.35          -0.06     0.23
product type      0.12          0.64      0.48       0.11          -0.42     0.50
…brand and model for the matches, while the price was not considered relevant for these decisions on average. For non-matches, the model instead focuses on the model attribute and assigns a nearly neutral average importance to the brand attribute. For DBLP-Scholar, GPT4 focuses on differences and similarities of the title and author attributes of the publications for both matches and non-matches, while the attributes conference and year only contribute to a lesser extent to the matching decisions.

After the aggregation there are in total 81 attributes for DBLP-Scholar, with seven of them being used in at least 10% of decisions while the remaining 76 make up the long tail. 28 of 81 attributes have a mean importance, positive or negative, of at least 30%. For Walmart-Amazon there are 181 attributes, with seven of them used in at least 10% of decisions. 64 of 181 have a mean importance of at least 30% towards the decision. The aggregation of the structured explanations for the DBLP-Scholar and the Walmart-Amazon datasets has demonstrated that global insights about a model's decisions can be derived from the local explanations.
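The aggregation step that produces such per-attribute statistics can be sketched as follows, assuming the structured explanations have already been parsed into Python dictionaries; the field names are assumptions.

    # Sketch: aggregate structured explanations into global per-attribute statistics
    # (frequency, mean importance, standard deviation), split by the predicted class.
    from collections import defaultdict
    from statistics import mean, pstdev

    def aggregate(explained_pairs: list[dict]) -> dict:
        # explained_pairs: [{"prediction": "match" or "non-match",
        #                    "explanation": [{"attribute": ..., "importance": ...}, ...]}, ...]
        buckets = defaultdict(lambda: defaultdict(list))
        for pair in explained_pairs:
            for item in pair["explanation"]:
                buckets[pair["prediction"]][item["attribute"].lower()].append(item["importance"])
        stats = {}
        for prediction, attrs in buckets.items():
            n_pairs = sum(1 for p in explained_pairs if p["prediction"] == prediction)
            stats[prediction] = {
                attr: {"freq": len(vals) / n_pairs,
                       "mean_importance": mean(vals),
                       "std": pstdev(vals)}
                for attr, vals in attrs.items()
            }
        return stats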
Figure 5: Prompt used for the automatic generation of error classes given false positives and false negatives.
…correct by a human annotator for 15 of the 26 errors, while the third error class is relevant only for 5 of the errors, namely those where the model seemed to put too much emphasis on matching year and venue information in the pairs while ignoring crucial differences in the other attributes. After manual inspection, all of the created error classes are relevant for the errors being made and support a deeper understanding of what causes these errors. Some of the error classes also point at actions that could be taken to improve the matching pipeline. For example, the heterogeneity of how publication venues are listed in the DBLP-Scholar dataset (Table 11, error class 2 for false negatives) could prompt the user to improve the normalization of these values.

Table 11 (DBLP-Scholar): generated error classes with the number of errors per class.

False Negatives (26 overall):
1. Year Discrepancy: Differences in publication years lead to false negatives, even when other attributes match closely. (8 errors)
2. Venue Variability: Variations in how the publication venue is listed (e.g., abbreviations, full names) cause mismatches. (14 errors)
3. Author Name Variations: Differences in author names, including initials, order of names, or inclusion of middle names, lead to false negatives. (9 errors)
4. Title Variations: Minor differences in titles, such as missing words or different word order, can cause false negatives. (11 errors)
5. Author List Incompleteness: Differences in the completeness of the author list, where one entry has more authors listed than the other. (11 errors)

False Positives (26 overall):
1. Overemphasis on Title Similarity: High similarity in titles leading to false positives, despite differences in other critical attributes. (15 errors)
2. Author Name Similarity Overreach: False positives due to high similarity in author names, ignoring discrepancies in other attributes. (16 errors)
3. Year and Venue Ignored: Cases where the year and venue match or are close, but other discrepancies are overlooked. (5 errors)
4. Partial Information Match: Matching based on partial information, such as incomplete author lists or titles, leading to false positives. (19 errors)
5. Misinterpretation of Publication Types: Confusing different types of publications (e.g., conference vs. journal) when other attributes match. (9 errors)

Table 12 (Walmart-Amazon): generated error classes with the number of errors per class.

False Negatives (15 overall):
1. Model Number Mismatch: The system fails when there are slight differences in model numbers or product codes, even when other attributes match closely. (9 errors)
2. Attribute Missing or Incomplete: When one product listing includes an attribute that the other does not, the system may fail to recognize them as a match. (9 errors)
3. Minor Differences in Descriptions: Small differences in product descriptions or titles can lead to false negatives, such as slightly different wording or the inclusion/exclusion of certain features. (11 errors)
4. Price Differences: Even when products are very similar, significant price differences can lead to false negatives, as the system might weigh price too heavily. (12 errors)
5. Variant or Accessory Differences: Differences in product variants or accessories included can cause false negatives, especially if the system does not adequately account for these variations being minor. (7 errors)

False Positives (26 overall):
1. Overemphasis on Matching Attributes: The system might give too much weight to matching attributes like brand or model number, leading to false positives even when other important attributes differ. (23 errors)
2. Ignoring Minor but Significant Differences: The system fails to recognize important differences in product types, models, or features that are significant to the product identity. (21 errors)
3. Misinterpretation of Accessory or Variant Information: Including or excluding accessories or variants in the product description can lead to false positives if the system does not correctly interpret these differences. (8 errors)
4. Price Discrepancy Overlooked: The system might overlook significant price differences, assuming products are the same when they are not, particularly if other attributes match closely. (14 errors)
5. Condition or Quality Differences: Differences in the condition or quality of products (e.g., original vs. compatible, new vs. refurbished) are not adequately accounted for, leading to false positives. (2 errors)

7.2 Assignment of Errors to Error Classes

In this final experiment, we investigate whether GPT4-turbo is capable of categorizing errors into the created error classes. Such a categorization allows data engineers to drill down from the error classes to concrete example errors, which might give them hints on how to address the problem. For categorizing errors, we use the prompt shown in Figure 6. After instructing the model about the task, the prompt lists all error classes together with their descriptions. Subsequently, the prompt contains the entity pair to be categorized together with its correct as well as predicted label and the structured explanation of the matching decision. The model is asked to pick all error classes that apply to the pair and to provide a confidence value for each of its predictions.

Figure 6: Prompt used for the classification of errors.

Table 13: Accuracy of GPT4 for classifying errors.

              Walmart-Amazon        DBLP-Scholar
Error class      FP       FN          FP       FN
1             34.62    86.67       92.31    96.15
2             84.62    73.33       76.92    92.31
3             84.62    73.33       76.92    73.08
4             76.92      100         100    88.46
5             84.62    86.67       92.31    88.46
Mean          73.08    84.00       87.69    87.69

Table 13 shows the accuracy values the GPT4-turbo model reaches on this task. From these values we can see that the model on average achieves a mean accuracy of over 80% for most error types (see row Mean in Table 13). Only the mean accuracy on Walmart-Amazon's false positives is lower, which is caused by the low accuracy of the first error class, Overemphasis on Matching Attributes: the domain experts did not agree with the model's classification for this error class; more specifically, the model rarely assigned this class while the domain experts considered it relevant in 23 out of 26 cases. Apart from this disagreement, the model is capable of correctly categorizing the errors with high accuracy.

The presented methods for the automated creation of error classes and the classification of errors into these classes by an LLM can support data engineers in the analysis and debugging of specific combinations of models, prompts and datasets. The methods can also be used for the detailed comparison of different combinations of models, prompts and datasets. For example, the errors from all experiments presented in this paper could be classified into the classes presented in Tables 11 and 12, allowing the fine-grained comparison of the strengths and weaknesses of each combination. As this analysis goes beyond the scope of this paper, we leave it to future work.
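A sketch of how the categorization prompt described in Section 7.2 could be assembled is shown below; the wording and the data structures are assumptions, as the exact prompt is given in Figure 6.

    # Sketch of building the error-categorization prompt for a misclassified pair.
    def build_error_classification_prompt(error_classes: list[dict], pair: dict) -> str:
        class_list = "\n".join(f"{i + 1}. {c['name']}: {c['description']}"
                               for i, c in enumerate(error_classes))
        return (
            "You are given error classes observed for an entity matching system and one "
            "misclassified entity pair. Select all error classes that apply to this pair "
            "and give a confidence between 0 and 1 for each selection.\n\n"
            f"Error classes:\n{class_list}\n\n"
            f"Entity 1: {pair['left']}\nEntity 2: {pair['right']}\n"
            f"Correct label: {pair['label']}\nPredicted label: {pair['prediction']}\n"
            f"Explanation of the decision: {pair['explanation']}"
        )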
8 RELATED WORK

Entity Matching: Entity matching [3, 8, 15] has been researched for over 50 years [17]. Early approaches involved domain experts hand-crafting matching rules [17]. Over time, advancements were made with unsupervised and supervised machine learning techniques, resulting in improved matching performance [9]. By the late 2010s, the success of deep learning in areas such as natural language processing and computer vision paved the way for early applications in entity matching [28, 37]. The Transformer architecture [41] and pre-trained models like BERT [11] and RoBERTa [26] revolutionized natural language processing, which has led the data integration community to also turn to these language models for entity matching [5, 23, 33, 42, 47, 48]. More recent work delved into the application of self-supervised and supervised contrastive losses [7, 19, 21] in combination with PLM encoder networks for entity matching [34, 44]. Other studies have explored graph-based methods [20, 47] and the application of domain adaptation techniques for entity matching [1, 27, 39, 40].

LLM-based Entity Matching: Narayan et al. [30] were the first to experiment with using an LLM (GPT3) for entity matching as part of a wider study also covering data engineering tasks such as schema matching and missing value imputation. In [35], we employ ChatGPT for entity matching and test different prompt designs on a single benchmark dataset. Fan et al. [16] experiment with batching multiple entity matching decisions together with in-context demonstrations to reduce the cost of in-context learning. Wang et al. [45] go beyond binary matching and apply LLMs to select matching records from a set of candidate matches. Zhang et al. [49] experimented with fine-tuning a Llama2 model for several data preparation tasks at once and include entity matching as one of their fine-tuning tasks. In [38], we experiment with fine-tuning Llama and GPT models for entity matching using different example representations, including free text and structured explanations.

Explaining Entity Matching: The prevalence of PLMs over recent years in the field of entity matching has led to research into the explainability of these matching systems [2, 12, 32, 33]. Most methods [12, 33] for explaining the matching decisions of PLMs provide local explanations for single entity pairs, e.g., as importance scores of single tokens. Paganelli et al. [32] present an approach for explaining matching decisions by analyzing the attention scores of PLM-based matchers. The WYM [2] system is an example of an intrinsically interpretable system that was recently proposed based on the idea of finding important decision units among entity descriptions for PLM-based matchers. To the best of our knowledge, none of the existing methods automates the discovery of error classes and generates human-interpretable descriptions of these error classes like the ones we presented in Section 7.

9 CONCLUSION

This paper has investigated using LLMs as a more robust alternative to PLM-based matchers that is less dependent on task-specific training data. We can summarize the high-level implications of our findings concerning the selection of matching techniques in the following rules of thumb: For use cases that do not involve many unseen entities and for which a decent amount of training data is available, PLM-based matchers are a suitable option which does not require much compute due to the smaller size of the models. For use cases that involve a relevant amount of unseen entities and for which it is costly to gather and maintain a decent-sized training set, LLM-based matchers should be preferred due to their high zero-shot performance and ability to generalize to unseen entities. If using the best-performing hosted LLMs is not an option due to their high usage costs, fine-tuning a cheaper hosted model is an alternative that can deliver a similar F1 performance. If using hosted models is not an option due to privacy concerns, using an open-source LLM on local hardware can be an alternative, given that task-specific training data or domain-specific matching rules are available. Still, this approach is expected to result in a slightly lower F1 performance. We demonstrated that GPT4 can generate structured explanations of matching decisions and that we can automatically aggregate these explanations to gain global insights into the model's decisions. Finally, we have shown that GPT4-turbo can perform the creative task of automatically deriving error classes from the explanations. This automation of the error analysis can save data engineers time and can point them to issues that they might have otherwise overlooked.

ACKNOWLEDGMENTS

The authors acknowledge support by the state of Baden-Württemberg through bwHPC.

REFERENCES

[1] Mehdi Akbarian Rastaghi, Ehsan Kamalloo, and Davood Rafiei. 2022. Probing the Robustness of Pre-trained Language Models for Entity Matching. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3786–3790.
[2] Andrea Baraldi, Francesco Del Buono, Francesco Guerra, Matteo Paganelli, and Maurizio Vincini. 2023. An Intrinsically Interpretable Entity Matching System. In Proceedings 26th International Conference on Extending Database Technology, Ioannina, Greece, March 28-31, 2023. 645–657.
[3] Nils Barlaug and Jon Atle Gulla. 2021. Neural Networks for Entity Matching: A Survey. ACM Transactions on Knowledge Discovery from Data 15, 3 (2021), 52:1–52:37.
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, et al. 2020. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[5] Ursin Brunner and Kurt Stockinger. 2020. Entity Matching with Transformer Architectures - a Step Forward in Data Integration. In Proceedings of the International Conference on Extending Database Technology. 463–473.
[6] Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. 2020. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). 1335–1349.
[7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning. 1597–1607.
[8] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer-Verlag, Berlin Heidelberg.
[9] Vassilis Christophides, Vasilis Efthymiou, Themis Palpanas, George Papadakis, and Kostas Stefanidis. 2020. An Overview of End-to-End Entity Resolution for Big Data. Comput. Surveys 53, 6 (2020), 127:1–127:42.
[10] Vassilis Christophides, Vasilis Efthymiou, and Kostas Stefanidis. 2015. Entity Resolution in the Web of Data. Springer International Publishing, Cham.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1. 4171–4186.
[12] Vincenzo Di Cicco, Donatella Firmani, Nick Koudas, Paolo Merialdo, and Divesh Srivastava. 2019. Interpreting Deep Learning Models for Entity Resolution: An Experience Report Using LIME. In Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. 8:1–8:4.
[13] Huahua Ding, Chaofan Dai, Yahui Wu, Wubin Ma, and Haohao Zhou. 2024. SETEM: Self-ensemble Training with Pre-trained Language Models for Entity Matching. Knowledge-Based Systems 293 (June 2024), 111708.
[14] Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and Nan Tang. 2018. Distributed representations of tuples for entity resolution. Proc. VLDB Endow. 11, 11 (jul 2018), 1454–1467.
[15] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering 19, 1 (2007), 1–16.
[16] Meihao Fan, Xiaoyue Han, Ju Fan, Chengliang Chai, Nan Tang, et al. 2024. Cost-effective in-context learning for entity resolution: A design space exploration. In 2024 IEEE 40th International Conference on Data Engineering. IEEE, 3696–3709.
[17] Ivan P. Fellegi and Alan B. Sunter. 1969. A Theory for Record Linkage. J. Amer. Statist. Assoc. 64, 328 (1969), 1183–1210.
[18] Chrisantha Fernando, Dylan Sunil Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. 2024. Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution. In Forty-first International Conference on Machine Learning.
[19] Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 6894–6910.
[20] Congcong Ge, Pengfei Wang, Lu Chen, Xiaoze Liu, Baihua Zheng, et al. 2021. CollaborER: A Self-supervised Entity Resolution Framework Using Multi-features Collaboration. arXiv:2108.08090 [cs] (Sept. 2021).
[21] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, et al. 2020. Supervised Contrastive Learning. In Advances in Neural Information Processing Systems, Vol. 33. 18661–18673.
[22] Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of Entity Resolution Approaches on Real-World Match Problems. Proceedings of the VLDB Endowment 3, 1-2 (Sept. 2010), 484–493.
[23] Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep Entity Matching with Pre-Trained Language Models. Proceedings of the VLDB Endowment 14, 1 (2020), 50–60.
[24] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, et al. 2022. What Makes Good In-Context Examples for GPT-3?. In Proceedings of Deep Learning Inside Out: The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures. Association for Computational Linguistics, 100–114.
[25] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, et al. 2023. Pre-Train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. Comput. Surveys 55, 9, Article 195 (2023), 35 pages.
[26] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, et al. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs] (2019).
[27] Michael Loster, Ioannis Koumarelas, and Felix Naumann. 2021. Knowledge Transfer for Entity Resolution with Siamese Neural Networks. Journal of Data and Information Quality 13, 1 (Jan. 2021), 2:1–2:25.
[28] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, et al. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In Proceedings of the 2018 International Conference on Management of Data. 19–34.
[29] Navapat Nananukul, Khanin Sisaengsuwanchai, and Mayank Kejriwal. 2024. Cost-Efficient Prompt Engineering for Unsupervised Entity Resolution in the Product Matching Domain. Discover Artificial Intelligence 4, 1 (2024), 56.
[30] Avanika Narayan, Ines Chami, Laurel Orr, and Christopher Ré. 2022. Can Foundation Models Wrangle Your Data? Proceedings of the VLDB Endowment 16, 4 (2022), 738–746.
[31] Markus Nentwig, Michael Hartung, Axel-Cyrille Ngonga Ngomo, and Erhard Rahm. 2017. A Survey of Current Link Discovery Frameworks. Semantic Web 8, 3 (jan 2017), 419–436.
[32] Matteo Paganelli, Francesco Del Buono, Andrea Baraldi, and Francesco Guerra. 2022. Analyzing How BERT Performs Entity Matching. Proceedings of the VLDB Endowment 15, 8 (June 2022), 1726–1738.
[33] Ralph Peeters and Christian Bizer. 2021. Dual-Objective Fine-Tuning of BERT for Entity Matching. Proceedings of the VLDB Endowment 14, 10 (2021), 1913–1921.
[34] Ralph Peeters and Christian Bizer. 2022. Supervised Contrastive Learning for Product Matching. In Companion Proceedings of the Web Conference 2022. 248–251.
[35] Ralph Peeters and Christian Bizer. 2023. Using ChatGPT for Entity Matching. In New Trends in Database and Information Systems (Communications in Computer and Information Science). Springer Nature Switzerland, Cham, 221–230.
[36] Ralph Peeters, Reng Chiz Der, and Christian Bizer. 2024. WDC Products: A Multi-Dimensional Entity Matching Benchmark. In Proceedings of the 27th International Conference on Extending Database Technology, Paestum, Italy, March 25 - March 28. 22–33.
[37] Kashif Shah, Selcuk Kopru, and Jean David Ruvini. 2018. Neural Network Based Extreme Classification and Similarity Models for Product Matching. In Proceedings of the 2018 Conference of the Association for Computational Linguistics, Volume 3. 8–15.
[38] Aaron Steiner, Ralph Peeters, and Christian Bizer. 2024. Fine-tuning Large Language Models for Entity Matching. arXiv:cs.CL/2409.08185
[39] Mohamed Trabelsi, Jeff Heflin, and Jin Cao. 2022. DAME: Domain Adaptation for Matching Entities. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1016–1024.
[40] Jianhong Tu, Ju Fan, Nan Tang, Peng Wang, Chengliang Chai, et al. 2022. Domain Adaptation for Deep Entity Resolution. In Proceedings of the 2022 International Conference on Management of Data. 443–457.
[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, et al. 2017. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6000–6010.
[42] Jin Wang, Yuliang Li, and Wataru Hirota. 2021. Machamp: A Generalized Entity Matching Benchmark. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4633–4642.
[43] Pengfei Wang, Xiaocan Zeng, Lu Chen, Fan Ye, Yuren Mao, et al. 2022. PromptEM: Prompt-tuning for Low-resource Generalized Entity Matching. Proceedings of the VLDB Endowment 16, 2 (2022), 369–378.
[44] Runhui Wang, Yuliang Li, and Jin Wang. 2023. Sudowoodo: Contrastive Self-supervised Learning for Multi-purpose Data Integration and Preparation. In 2023 IEEE 39th International Conference on Data Engineering. 1502–1515.
[45] Tianshu Wang, Hongyu Lin, Xiaoyang Chen, Xianpei Han, Hao Wang, et al. 2024. Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching. arXiv preprint arXiv:2405.16884 (2024).
[46] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, et al. 2022. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research (2022).
[47] Dezhong Yao, Yuhong Gu, Gao Cong, Hai Jin, and Xinqiao Lv. 2022. Entity Resolution with Hierarchical Graph Attention Networks. In Proceedings of the 2022 International Conference on Management of Data. 429–442.
[48] Alexandros Zeakis, George Papadakis, Dimitrios Skoutas, and Manolis Koubarakis. 2023. Pre-trained embeddings for entity resolution: An experimental analysis. Proceedings of the VLDB Endowment 16, 9 (2023), 2225–2238.
[49] Haochen Zhang, Yuyang Dong, Chuan Xiao, and Masafumi Oyamada. 2024. Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
[50] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, et al. 2023. A Survey of Large Language Models. arXiv preprint arXiv:2303.18223 (2023).
[51] Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate Before Use: Improving Few-Shot Performance of Language Models. In Proceedings of the 38th International Conference on Machine Learning. 12697–12706.