Probabilistic Data Integration, Fig. 1 An uncareful data integration leading to a database with semantic duplicates
Probabilistic Data Integration, Fig. 4 Grey area in tuple matching: pairs with similarity (sim) below threshold tl are non-matches (U), pairs above tu are matches (M), and pairs in between are possible matches (P) (Taken from Panse et al. 2013)
set. Especially for categorical attributes, imputing with a wrong value can have grave consequences. In general, an imputation method is a classifier that predicts a most suitable value based on the other values in the record or data set. A classifier can easily not only predict one value but several possible ones, each with an associated probability of suitability. By representing the uncertainty around the missing value probabilistically, the result is more informative and more robust against imperfect imputations.
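A minimal sketch of this idea, using a simple frequency-based stand-in for a classifier that can output value probabilities (the relation and attribute names are made up for illustration):

```python
from collections import Counter

def impute_distribution(records, target, key):
    """Return a probability distribution over candidate values for the missing
    attribute `target`, instead of a single imputed value. Candidates are
    weighted by how often they co-occur with the same value of `key` in
    complete records (a stand-in for any classifier with class probabilities)."""
    incomplete = [r for r in records if r.get(target) is None]
    complete = [r for r in records if r.get(target) is not None]
    results = []
    for r in incomplete:
        counts = Counter(c[target] for c in complete if c[key] == r[key])
        if not counts:                      # fall back to the overall distribution
            counts = Counter(c[target] for c in complete)
        total = sum(counts.values())
        results.append((r, {value: n / total for value, n in counts.items()}))
    return results

# Hypothetical example: the 'city' of the incomplete record stays uncertain.
people = [
    {"zip": "7522", "city": "Enschede"},
    {"zip": "7522", "city": "Enschede"},
    {"zip": "7522", "city": "Hengelo"},
    {"zip": "7522", "city": None},
]
for record, dist in impute_distribution(people, target="city", key="zip"):
    print(record, dist)   # e.g. {'Enschede': 0.67, 'Hengelo': 0.33}
```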
Record Level

Semantic duplicates, entity resolution A semantic duplicate is almost never detected with absolute certainty unless both records are identical. Therefore, there is a gray area of record pairs that may or may not be semantic duplicates. Even if an identifier is present, in practice it may not be perfectly reliable. For example, it has once been reported in the UK that there were 81 million National Insurance numbers but only 60 million eligible citizens.

Traditional approaches for deduplication are based on pairwise tuple comparisons. Pairs are classified into matching (M) and unmatching (U) based on similarity, then clustered by transitivity, and, finally, merged by cluster. The latter may require solving inconsistencies (Naumann and Herschel 2010).

In such approaches with an absolute decision for tuples being duplicates or not, many realistic possibilities may be ignored, leading to errors in the data. Instead, a probabilistic database can directly store an indeterministic deduplication result (Panse et al. 2013). In this way, all significantly likely duplicate mergings find their way into the database, and any query answer or other derived data will reflect the inherent uncertainty.

Indeterministic deduplication deviates as follows (Panse et al. 2013). Instead of M and U, a portion of tuple pairs are now classified into a third set P of possible matches based on two thresholds (see Fig. 4). For pairs in this gray area, both cases are considered: a match or not. Duplicate clustering now forms clusters for M ∪ P (in Fig. 1, there are 3 clusters: {d1, d2, d5}, {d4}, {d3, d6}). For each cluster, the possible worlds are determined, e.g., d1, d2, d5 all different, d1, d5 the same and d2 different, or d1, d2, d5 all the same. To represent the probabilistic end result, a random variable is introduced for each cluster with as many values as possible worlds for that cluster, and merged and unmerged versions of the tuples are added according to the situation in the world. Figure 3 shows the end result.
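A sketch of the two-threshold classification described above, followed by clustering of M ∪ P by transitivity (the similarity measure and the concrete thresholds tl and tu are illustrative assumptions, not those of Panse et al.):

```python
from difflib import SequenceMatcher
from itertools import combinations

def classify(sim, tl=0.55, tu=0.85):
    """Classify a pair by similarity: U (non-match), P (possible match in the
    grey area between the thresholds), or M (match)."""
    if sim < tl:
        return "U"
    return "P" if sim < tu else "M"

def cluster(tuples, tl=0.55, tu=0.85):
    """Form duplicate clusters as connected components over M ∪ P pairs."""
    parent = {t: t for t in tuples}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in combinations(tuples, 2):
        sim = SequenceMatcher(None, a, b).ratio()
        if classify(sim, tl, tu) in ("M", "P"):
            parent[find(a)] = find(b)
    clusters = {}
    for t in tuples:
        clusters.setdefault(find(t), set()).add(t)
    return list(clusters.values())

print(cluster(["Volkswagen", "Volkswagon", "VW Golf", "Toyota"]))
```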
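Continuing the sketch, the possible worlds of one cluster can be enumerated as the set partitions of that cluster; a random variable with one value per world then captures the indeterminism. The variable name r1 and the uniform world probabilities are placeholders; a real implementation would derive the probabilities from the pair similarities.

```python
def partitions(elements):
    """Enumerate all set partitions of a cluster, i.e., all possible worlds
    for which of its members denote the same real-world entity."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for smaller in partitions(rest):
        # put `first` into each existing block, or into a block of its own
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        yield [[first]] + smaller

cluster = ["d1", "d2", "d5"]
worlds = list(partitions(cluster))          # 5 possible worlds for 3 tuples
p = 1 / len(worlds)                         # placeholder: uniform probabilities
for i, world in enumerate(worlds):
    print(f"r1 -> {i} (P={p:.2f}):", world)  # one random-variable value per world
```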
A related problem is that of entity resolution (Naumann and Herschel 2010). The goal of data integration is often to bring together data on the same real-world entities from different sources. In the absence of a usable identifier, this matching and merging of records from different sources is a similar problem.

Repairs Another record-level integration problem is when a resulting database state does not satisfy some constraints. Here the notion of a database repair is useful. A repair of an inconsistent database I is a database J that is consistent and “as close as possible” to I (Wijsen 2005). Closeness is typically measured in terms of the number of “insert,” “delete,” and “update” operations needed to change I into J. A repair, however, is in general not unique. Typically, one resorts to consistent query answering: the intersection of answers to a query posed on all possible repairs within a certain closeness bound. But, although there is no known work to refer to, it is perfectly conceivable that these possible repairs can be represented with a probabilistic database state.
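A toy sketch of repairs and consistent query answering for a single key-like constraint (the relation, the constraint, and the deletion-only repair strategy are simplifying assumptions):

```python
from itertools import combinations

# Inconsistent instance I: two tuples disagree on the brand of car c1.
I = [("c1", "Volkswagen"), ("c1", "VW"), ("c2", "Toyota")]

def consistent(db):
    """Constraint: a car id determines a single brand (a key constraint)."""
    brands = {}
    for car, brand in db:
        if brands.setdefault(car, brand) != brand:
            return False
    return True

def repairs(db):
    """All subset-maximal consistent subinstances of db (deletion repairs)."""
    result = []
    for k in range(len(db), -1, -1):
        for sub in combinations(db, k):
            if consistent(sub) and not any(set(sub) < set(r) for r in result):
                result.append(sub)
    return result

def consistent_answer(db, query):
    """Consistent query answering: tuples returned by the query in *every* repair."""
    answer_sets = [set(query(r)) for r in repairs(db)]
    return set.intersection(*answer_sets)

# Query: which cars occur in the database?
print(consistent_answer(I, lambda db: {car for car, _ in db}))   # {'c1', 'c2'}
```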
Grouping data When integrating grouping data, inconsistencies may also occur. A grouping can be defined as a membership of elements within groups. When different sources contain a grouping for the same set of elements, two elements may be in the same group in one source and in different groups in the other. Wanders et al. (2015) describe such a scenario with groups of orthologous proteins, which are expected to have the same function(s). Biological databases like HomoloGene, PIRSF, and eggNOG store results of determining orthology by means of different methods. An automatic (probabilistic!) combination of these sources may provide a continuously evolving unified view of combined scientific insight of higher quality than any single method could provide.
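A sketch of one naive probabilistic combination of groupings: the probability that two elements belong to the same group is estimated as the fraction of sources that put them together. Treating all sources as equally reliable is an illustrative assumption; Wanders et al. (2015) develop a proper probabilistic model.

```python
from itertools import combinations

# Three hypothetical sources, each grouping the same proteins differently.
sources = [
    [{"p1", "p2"}, {"p3", "p4"}],
    [{"p1", "p2", "p3"}, {"p4"}],
    [{"p1", "p2"}, {"p3"}, {"p4"}],
]

def same_group_probability(sources):
    """P(two elements belong to the same group) = fraction of agreeing sources."""
    elements = sorted(set().union(*[g for src in sources for g in src]))
    prob = {}
    for a, b in combinations(elements, 2):
        together = sum(any({a, b} <= group for group in src) for src in sources)
        prob[(a, b)] = together / len(sources)
    return prob

for pair, p in same_group_probability(sources).items():
    print(pair, p)    # e.g. ('p1', 'p2') 1.0, ('p2', 'p3') ~0.33
```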
Schema Level

Probabilistic data integration has been mostly applied to instance-level data, but it can also be applied on schema level. For example, if two sources hold data on entity types T and T′, and these seem similar or related, then a number of hypotheses may be drawn up:

• T could have exactly the same meaning as T′,
• T could be a subtype of T′ or vice versa, or
• T and T′ partially overlap and have a common supertype.
But it may be uncertain which one is true. It may even be the case that a hypothesis may only be partially true, for example, with source tables “student” and “PhD student.” In most cases, a PhD student is a special kind of student, but in some countries such as the Netherlands, a PhD student is actually an employee of the university. Also employees from a company may pursue a PhD. In short, not all tuples of table “PhD student” should be integrated into “student.” This also illustrates how this schema-level problem may be transformed into a record-level problem: a representation can be constructed where all tuples of a type probabilistically exist in a corresponding table. The uncertainty about two attributes being “the same” is an analogous problem.

Data Cleaning

Probabilistic data allows new kinds of cleaning approaches. High quality can be defined as a high probability for correct data and a low probability for incorrect data. Therefore, cleaning approaches can be roughly categorized into uncertainty reducing and uncertainty increasing.

Uncertainty Reduction: Evidence If due to some evidence from analysis, reasoning, constraints, or feedback, it becomes apparent that some cases are definitely (not) true, then uncertainty may be removed from the database. For example, if in Fig. 3 feedback is given from which it can be derived that d3 and d6 are for certain the same car brand, then in essence P(r2 ↦ 0) becomes 0 and P(r2 ↦ 1) becomes 1. Consequently, all tuples that need (r2 ↦ 0) to be true to exist can be deleted (d3 and d6). Furthermore, random variable r2 can be abolished, and the term (r2 ↦ 1) can be removed from all probabilistic sentences. This effectively removes all possible worlds that contradict the evidence. van Keulen and de Keijzer (2009) have shown that this form of cleaning may quickly and steadily improve the quality of a probabilistic integration result.
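A sketch of this evidence-based cleaning on a simple representation where each tuple carries a condition on a random variable: hard evidence fixes the variable, after which contradicting tuples and the now-redundant condition disappear. The variable and tuple names follow the d3/d6/r2 example above; the data values and probabilities are made up.

```python
# Each tuple is annotated with the random-variable assignment it depends on.
# r2 -> 0: d3 and d6 are different brands; r2 -> 1: they are the same brand.
tuples = [
    {"id": "d3",  "brand": "Volkswagen", "condition": ("r2", 0)},
    {"id": "d6",  "brand": "VW",         "condition": ("r2", 0)},
    {"id": "d36", "brand": "Volkswagen", "condition": ("r2", 1)},  # merged tuple
]
probabilities = {("r2", 0): 0.4, ("r2", 1): 0.6}

def apply_hard_evidence(tuples, probabilities, variable, value):
    """Evidence says `variable` certainly takes `value`: delete tuples whose
    condition contradicts it, and drop the variable altogether (the surviving
    condition now holds with probability 1)."""
    kept = []
    for t in tuples:
        var, val = t["condition"]
        if var == variable and val != value:
            continue                       # contradicts the evidence: delete
        if var == variable:
            t = {**t, "condition": None}   # condition became certain
        kept.append(t)
    probs = {k: p for k, p in probabilities.items() if k[0] != variable}
    return kept, probs

# Feedback established that d3 and d6 are for certain the same brand (r2 -> 1).
cleaned, probs = apply_hard_evidence(tuples, probabilities, "r2", 1)
print(cleaned)   # only the merged tuple d36 remains, now unconditional
```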
If such evidence cannot be taken as absolutely reliable, and hence cannot justify the cleaning actions above, the actual cleaning becomes a matter of massaging the probabilities. For example, P(r2 ↦ 1) may be increased only a little bit. In this approach, a probability threshold may be introduced above which the above-described random variable removal is executed. As evidence accumulates, this approach converges to a certain correct database state as well; data quality improvement is only slowed down, provided that the evidence is for a large part correct.
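A sketch of this softer variant: each piece of evidence only shifts probability mass toward the supported alternative, and the hard cleaning step is triggered once a threshold is exceeded. The update rule, the weight, and the threshold are illustrative choices, not prescribed by the literature.

```python
def apply_soft_evidence(probabilities, variable, value, weight=0.2, threshold=0.95):
    """Nudge P(variable -> value) upward by `weight` of the remaining mass,
    renormalize the alternatives, and report whether the hard cleaning step
    (removing the variable as above) may now be executed."""
    target = (variable, value)
    others = [k for k in probabilities if k[0] == variable and k != target]
    probabilities[target] += weight * (1.0 - probabilities[target])
    remaining = 1.0 - probabilities[target]
    old_rest = sum(probabilities[k] for k in others) or 1.0
    for k in others:
        probabilities[k] *= remaining / old_rest
    return probabilities[target] >= threshold

probs = {("r2", 0): 0.4, ("r2", 1): 0.6}
while not apply_soft_evidence(probs, "r2", 1):   # repeated pieces of evidence
    pass
print(probs)   # P(r2 -> 1) has converged above the threshold
```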
Uncertainty Increase: Casting Doubt Perhaps counterintuitively, increasing uncertainty may improve data quality, hence could be an approach for cleaning. For example, if due to some evidence it becomes unlikely that a certain tuple is correct, a random variable may be introduced, and possible repairs for the tuple may be inserted. In effect, we are casting doubt on the data and inserting what seems more likely. Consequently, the uncertainty increases, but the overall quality may increase, because the probability mass associated with incorrect data decreases and the probability mass for correct data increases (assuming the evidence is largely correct).
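A sketch of casting doubt: a suspect but certain tuple is replaced by two alternatives governed by a fresh random variable, the original value and a candidate repair, so some probability mass moves to the presumably correct alternative. The variable name r7, the values, and the 0.3/0.7 split are made up for illustration.

```python
def cast_doubt(tuples, probabilities, tuple_id, repair_value, variable, p_original=0.3):
    """Turn a certain but suspect tuple into two mutually exclusive alternatives
    governed by a fresh random variable: the original value and a likely repair."""
    doubted = []
    for t in tuples:
        if t["id"] == tuple_id and t["condition"] is None:
            doubted.append({**t, "condition": (variable, 0)})           # original value
            doubted.append({**t, "brand": repair_value,
                            "condition": (variable, 1)})                # proposed repair
        else:
            doubted.append(t)
    probabilities[(variable, 0)] = p_original
    probabilities[(variable, 1)] = 1.0 - p_original
    return doubted, probabilities

tuples = [{"id": "d4", "brand": "Vlokswagen", "condition": None}]
tuples, probs = cast_doubt(tuples, {}, "d4", repair_value="Volkswagen", variable="r7")
print(tuples, probs)   # the likely typo keeps 0.3 probability; the repair gets 0.7
```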
Measuring uncertainty and quality The above illustrates that uncertainty and quality are orthogonal notions. Uncertainty is usually measured by means of entropy. Quality measures for probabilistic data are introduced by van Keulen and de Keijzer (2009): expected precision and expected recall. These notions are based on the intuition that the quality of a correct query answer is better if the system dares to claim that it is correct with a higher probability.
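A sketch of these measures under one straightforward reading, taking the expectation of precision and recall over the possible worlds of a query answer (this operationalization and the example data are assumptions, not necessarily the exact definitions of van Keulen and de Keijzer (2009)):

```python
from math import log2

def entropy(distribution):
    """Uncertainty of a random variable, measured in bits."""
    return -sum(p * log2(p) for p in distribution.values() if p > 0)

def expected_precision_recall(worlds, correct):
    """Expectation over possible worlds of the precision/recall of the query
    answer in that world against the (assumed known) correct answer."""
    precision = recall = 0.0
    for p, answer in worlds:
        hits = len(answer & correct)
        precision += p * (hits / len(answer) if answer else 1.0)
        recall += p * (hits / len(correct) if correct else 1.0)
    return precision, recall

# Two possible worlds of a query answer, e.g., from the deduplication example.
worlds = [(0.7, {"Volkswagen", "Toyota"}), (0.3, {"Volkswagen", "VW", "Toyota"})]
correct = {"Volkswagen", "Toyota"}

print(entropy({0: 0.7, 1: 0.3}))                   # ~0.88 bits of uncertainty
print(expected_precision_recall(worlds, correct))  # higher when confidently right
```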
Example Applications

A notable application of probabilistic data integration is the METIS system, “an industrial prototype system for supporting real-time, actionable maritime situational awareness” (Huijbrechts et al. 2015). It aims to support operational work in domains characterized by constantly evolving situations with a diversity of entities, complex interactions, and uncertainty in the information gathered. It includes natural language processing of heterogeneous (un)structured data and probabilistic reasoning over uncertain information. METIS can be seen as an open-source intelligence (OSINT) application.

Another notable and concrete example of an existing system is MCDB-R (Arumugam et al. 2010). It allows risk assessment queries directly on the database. Risk assessment typically corresponds to computing interesting properties of the upper or lower tails of a query result distribution, for example, computing the probability of a large investment loss.

Probabilistic data integration is in particular suited for applications where much imperfection can be expected but where a quick-and-dirty integration and cleaning approach is likely to be sufficient. It has the potential of drastically lowering the time and effort needed for integration and cleaning, which can be considerable since “analysts report spending upwards of 80% of their time on problems in data cleaning” (Haas et al. 2015).

Other application areas include:

• Machine learning and data mining: since probabilistically integrated data has a higher information content than “data with errors,” it is expected that models of higher quality will be produced if probabilistic data is used as training data.
• Information extraction from natural language: since natural language is inherently ambiguous, it seems quite natural to represent the result of information extraction as probabilistic data.
• Web harvesting: websites are designed for use by humans. A probabilistic approach may lead to more robust navigation. Subtasks like finding search results (Trieschnigg et al. 2012) or finding target fields (Jundt and van Keulen 2013) are typically based on ranking “possible actions.” By executing not only one but a top-k of possible actions and representing the resulting data probabilistically, consequences of imperfect ranking are reduced.
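To illustrate the kind of risk assessment query mentioned for MCDB-R, here is a minimal Monte Carlo sketch that estimates a tail probability of a query result over sampled possible worlds; this shows only the general idea, not MCDB-R's actual machinery, and the portfolio data and distributional assumptions are made up.

```python
import random

def estimate_tail_probability(sample_world, aggregate, threshold, n=10_000, seed=42):
    """Estimate P(aggregate(world) > threshold) by sampling possible worlds."""
    rng = random.Random(seed)
    exceed = sum(aggregate(sample_world(rng)) > threshold for _ in range(n))
    return exceed / n

# Hypothetical portfolio: each position's loss is uncertain (normal assumption).
positions = [(100, 0.02, 0.05), (250, 0.01, 0.08), (50, 0.03, 0.10)]  # (value, mean, stddev)

def sample_world(rng):
    return [value * rng.gauss(mu, sigma) for value, mu, sigma in positions]

# Probability of a total loss larger than 20 (upper tail of the result distribution).
print(estimate_tail_probability(sample_world, aggregate=sum, threshold=20.0))
```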
ing data. In: Proceeding of DEXA. LNCS, vol 9261. Springer, pp 236–250. [Link]
Wanders B, van Keulen M, Flokstra J (2016) Judged: a probabilistic datalog with dependencies. In: Proceeding of DeLBP. AAAI Press
Widom J (2004) Trio: a system for integrated management of data, accuracy, and lineage. Technical report 2004-40, Stanford InfoLab. [Link]
Wijsen J (2005) Database repairing using updates. ACM TODS 30(3):722–768. [Link]