Probabilistic Data Integration, Fig. 1 An uncareful data integration leading to a database with semantic duplicates
Probabilistic Data Integration, Fig. 4 Grey area in tuple matching: pairs with similarity (sim) below threshold tl are non-matches (U), pairs above tu are matches (M), and pairs in between are possible matches (P) (Taken from Panse et al. 2013)
set. Especially for categorical attributes, imputing with a wrong value can have grave consequences. In general, an imputation method is a classifier that predicts a most suitable value based on the other values in the record or data set. A classifier can easily not only predict one value but several possible ones, each with an associated probability of suitability. By representing the uncertainty around the missing value probabilistically, the result is more informative and more robust against imperfect imputations.
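A minimal sketch of this idea, using a simple frequency-based stand-in for a classifier that can output value probabilities (the relation and attribute names are made up for illustration):

```python
from collections import Counter

def impute_distribution(records, target, key):
    """Return a probability distribution over candidate values for the missing
    attribute `target`, instead of a single imputed value. Candidates are
    weighted by how often they co-occur with the same value of `key` in
    complete records (a stand-in for any classifier with class probabilities)."""
    incomplete = [r for r in records if r.get(target) is None]
    complete = [r for r in records if r.get(target) is not None]
    results = []
    for r in incomplete:
        counts = Counter(c[target] for c in complete if c[key] == r[key])
        if not counts:                      # fall back to the overall distribution
            counts = Counter(c[target] for c in complete)
        total = sum(counts.values())
        results.append((r, {value: n / total for value, n in counts.items()}))
    return results

# Hypothetical example: the 'city' of the incomplete record stays uncertain.
people = [
    {"zip": "7522", "city": "Enschede"},
    {"zip": "7522", "city": "Enschede"},
    {"zip": "7522", "city": "Hengelo"},
    {"zip": "7522", "city": None},
]
for record, dist in impute_distribution(people, target="city", key="zip"):
    print(record, dist)   # e.g. {'Enschede': 0.67, 'Hengelo': 0.33}
```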
Record Level

Semantic duplicates, entity resolution A semantic duplicate is almost never detected with absolute certainty unless both records are identical. Therefore, there is a gray area of record pairs that may or may not be semantic duplicates. Even if an identifier is present, in practice it may not be perfectly reliable. For example, it has once been reported in the UK that there were 81 million National Insurance numbers but only 60 million eligible citizens.

Traditional approaches for deduplication are based on pairwise tuple comparisons. Pairs are classified into matching (M) and unmatching (U) based on similarity, then clustered by transitivity, and, finally, merged by cluster. The latter may require solving inconsistencies (Naumann and Herschel 2010).

In such approaches with an absolute decision for tuples being duplicates or not, many realistic possibilities may be ignored, leading to errors in the data. Instead, a probabilistic database can directly store an indeterministic deduplication result (Panse et al. 2013). In this way, all significantly likely duplicate mergings find their way into the database, and any query answer or other derived data will reflect the inherent uncertainty.

Indeterministic deduplication deviates as follows (Panse et al. 2013). Instead of M and U, a portion of tuple pairs are now classified into a third set P of possible matches based on two thresholds (see Fig. 4). For pairs in this gray area, both cases are considered: a match or not. Duplicate clustering now forms clusters for M ∪ P (in Fig. 1, there are 3 clusters: {d1, d2, d5}, {d4}, {d3, d6}). For each cluster, the possible worlds are determined, e.g., d1, d2, d5 all different, d1, d5 the same and d2 different, or d1, d2, d5 all the same. To represent the probabilistic end result, a random variable is introduced for each cluster with as many values as possible worlds for that cluster, and merged and unmerged versions of the tuples are added according to the situation in the world. Figure 3 shows the end result.
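A sketch of the two-threshold classification described above, followed by clustering of M ∪ P by transitivity (the similarity measure and the concrete thresholds tl and tu are illustrative assumptions, not those of Panse et al.):

```python
from difflib import SequenceMatcher
from itertools import combinations

def classify(sim, tl=0.55, tu=0.85):
    """Classify a pair by similarity: U (non-match), P (possible match in the
    grey area between the thresholds), or M (match)."""
    if sim < tl:
        return "U"
    return "P" if sim < tu else "M"

def cluster(tuples, tl=0.55, tu=0.85):
    """Form duplicate clusters as connected components over M ∪ P pairs."""
    parent = {t: t for t in tuples}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in combinations(tuples, 2):
        sim = SequenceMatcher(None, a, b).ratio()
        if classify(sim, tl, tu) in ("M", "P"):
            parent[find(a)] = find(b)
    clusters = {}
    for t in tuples:
        clusters.setdefault(find(t), set()).add(t)
    return list(clusters.values())

print(cluster(["Volkswagen", "Volkswagon", "VW Golf", "Toyota"]))
```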
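Continuing the sketch, the possible worlds of one cluster can be enumerated as the set partitions of that cluster; a random variable with one value per world then captures the indeterminism. The variable name r1 and the uniform world probabilities are placeholders; a real implementation would derive the probabilities from the pair similarities.

```python
def partitions(elements):
    """Enumerate all set partitions of a cluster, i.e., all possible worlds
    for which of its members denote the same real-world entity."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for smaller in partitions(rest):
        # put `first` into each existing block, or into a block of its own
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        yield [[first]] + smaller

cluster = ["d1", "d2", "d5"]
worlds = list(partitions(cluster))          # 5 possible worlds for 3 tuples
p = 1 / len(worlds)                         # placeholder: uniform probabilities
for i, world in enumerate(worlds):
    print(f"r1 -> {i} (P={p:.2f}):", world)  # one random-variable value per world
```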
A related problem is that of entity resolution (Naumann and Herschel 2010). The goal of data integration is often to bring together data on the same real-world entities from different sources. In the absence of a usable identifier, this matching and merging of records from different sources is a similar problem.

Repairs Another record-level integration problem is when a resulting database state does not satisfy some constraints. Here the notion of a database repair is useful. A repair of an inconsistent database I is a database J that is consistent and “as close as possible” to I (Wijsen 2005). Closeness is typically measured in terms of the number of “insert,” “delete,” and “update” operations needed to change I into J. A repair, however, is in general not unique. Typically, one resorts to consistent query answering: the intersection of answers to a query posed on all possible repairs within a certain closeness bound. But, although there is no known work to refer to, it is perfectly conceivable that these possible repairs can be represented with a probabilistic database state.
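A toy sketch of repairs and consistent query answering for a single key-like constraint (the relation, the constraint, and the deletion-only repair strategy are simplifying assumptions):

```python
from itertools import combinations

# Inconsistent instance I: two tuples disagree on the brand of car c1.
I = [("c1", "Volkswagen"), ("c1", "VW"), ("c2", "Toyota")]

def consistent(db):
    """Constraint: a car id determines a single brand (a key constraint)."""
    brands = {}
    for car, brand in db:
        if brands.setdefault(car, brand) != brand:
            return False
    return True

def repairs(db):
    """All subset-maximal consistent subinstances of db (deletion repairs)."""
    result = []
    for k in range(len(db), -1, -1):
        for sub in combinations(db, k):
            if consistent(sub) and not any(set(sub) < set(r) for r in result):
                result.append(sub)
    return result

def consistent_answer(db, query):
    """Consistent query answering: tuples returned by the query in *every* repair."""
    answer_sets = [set(query(r)) for r in repairs(db)]
    return set.intersection(*answer_sets)

# Query: which cars occur in the database?
print(consistent_answer(I, lambda db: {car for car, _ in db}))   # {'c1', 'c2'}
```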
Grouping data When integrating grouping data, inconsistencies may also occur. A grouping can be defined as a membership of elements within groups. When different sources contain a grouping for the same set of elements, two elements may be in the same group in one source and in different groups in the other. Wanders et al. (2015) describe such a scenario with groups of orthologous proteins, which are expected to have the same function(s). Biological databases like HomoloGene, PIRSF, and eggNOG store results of determining orthology by means of different methods. An automatic (probabilistic!) combination of these sources may provide a continuously evolving unified view of combined scientific insight of higher quality than any single method could provide.
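A sketch of one naive probabilistic combination of groupings: the probability that two elements belong to the same group is estimated as the fraction of sources that put them together. Treating all sources as equally reliable is an illustrative assumption; Wanders et al. (2015) develop a proper probabilistic model.

```python
from itertools import combinations

# Three hypothetical sources, each grouping the same proteins differently.
sources = [
    [{"p1", "p2"}, {"p3", "p4"}],
    [{"p1", "p2", "p3"}, {"p4"}],
    [{"p1", "p2"}, {"p3"}, {"p4"}],
]

def same_group_probability(sources):
    """P(two elements belong to the same group) = fraction of agreeing sources."""
    elements = sorted(set().union(*[g for src in sources for g in src]))
    prob = {}
    for a, b in combinations(elements, 2):
        together = sum(any({a, b} <= group for group in src) for src in sources)
        prob[(a, b)] = together / len(sources)
    return prob

for pair, p in same_group_probability(sources).items():
    print(pair, p)    # e.g. ('p1', 'p2') 1.0, ('p2', 'p3') ~0.33
```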
Schema Level

Probabilistic data integration has been mostly applied to instance-level data, but it can also be applied on schema level. For example, if two sources hold data on entity types T and T′, and these seem similar or related, then a number of hypotheses may be drawn up:

• T could have exactly the same meaning as T′,
• T could be a subtype of T′ or vice versa, or
• T and T′ partially overlap and have a common supertype.
But it may be uncertain which one is true. It may even be the case that a hypothesis may only be partially true, for example, with source tables “student” and “PhD student.” In most cases, a PhD student is a special kind of student, but in some countries such as the Netherlands, a PhD student is actually an employee of the university. Also employees from a company may pursue a PhD. In short, not all tuples of table “PhD student” should be integrated into “student.” This also illustrates how this schema-level problem may be transformed into a record-level problem: a representation can be constructed where all tuples of a type probabilistically exist in a corresponding table. The uncertainty about two attributes being “the same” is an analogous problem.

Data Cleaning

Probabilistic data allows new kinds of cleaning approaches. High quality can be defined as a high probability for correct data and a low probability for incorrect data. Therefore, cleaning approaches can be roughly categorized into uncertainty reducing and uncertainty increasing.

Uncertainty Reduction: Evidence If due to some evidence from analysis, reasoning, constraints, or feedback, it becomes apparent that some cases are definitely (not) true, then uncertainty may be removed from the database. For example, if in Fig. 3 feedback is given from which it can be derived that d3 and d6 are for certain the same car brand, then in essence P(r2 ↦ 0) becomes 0 and P(r2 ↦ 1) becomes 1. Consequently, all tuples that need (r2 ↦ 0) to be true to exist can be deleted (d3 and d6). Furthermore, random variable r2 can be abolished, and the term (r2 ↦ 1) can be removed from all probabilistic sentences. This effectively removes all possible worlds that contradict the evidence. van Keulen and de Keijzer (2009) have shown that this form of cleaning may quickly and steadily improve the quality of a probabilistic integration result.
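A sketch of this evidence-based cleaning on a simple representation where each tuple carries a condition on a random variable: hard evidence fixes the variable, after which contradicting tuples and the now-redundant condition disappear. The variable and tuple names follow the d3/d6/r2 example above; the data values and probabilities are made up.

```python
# Each tuple is annotated with the random-variable assignment it depends on.
# r2 -> 0: d3 and d6 are different brands; r2 -> 1: they are the same brand.
tuples = [
    {"id": "d3",  "brand": "Volkswagen", "condition": ("r2", 0)},
    {"id": "d6",  "brand": "VW",         "condition": ("r2", 0)},
    {"id": "d36", "brand": "Volkswagen", "condition": ("r2", 1)},  # merged tuple
]
probabilities = {("r2", 0): 0.4, ("r2", 1): 0.6}

def apply_hard_evidence(tuples, probabilities, variable, value):
    """Evidence says `variable` certainly takes `value`: delete tuples whose
    condition contradicts it, and drop the variable altogether (the surviving
    condition now holds with probability 1)."""
    kept = []
    for t in tuples:
        var, val = t["condition"]
        if var == variable and val != value:
            continue                       # contradicts the evidence: delete
        if var == variable:
            t = {**t, "condition": None}   # condition became certain
        kept.append(t)
    probs = {k: p for k, p in probabilities.items() if k[0] != variable}
    return kept, probs

# Feedback established that d3 and d6 are for certain the same brand (r2 -> 1).
cleaned, probs = apply_hard_evidence(tuples, probabilities, "r2", 1)
print(cleaned)   # only the merged tuple d36 remains, now unconditional
```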
If such evidence cannot be taken as absolutely reliable, and hence cannot justify the cleaning actions above, the actual cleaning becomes a matter of massaging the probabilities. For example, P(r2 ↦ 1) may be increased only a little bit. In this approach, a probability threshold may be introduced above which the above-described random variable removal is executed. As evidence accumulates, this approach converges to a certain correct database state as well; data quality improvement is only slowed down, provided that the evidence is for a large part correct.
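A sketch of this softer variant: each piece of evidence only shifts probability mass toward the supported alternative, and the hard cleaning step is triggered once a threshold is exceeded. The update rule, the weight, and the threshold are illustrative choices, not prescribed by the literature.

```python
def apply_soft_evidence(probabilities, variable, value, weight=0.2, threshold=0.95):
    """Nudge P(variable -> value) upward by `weight` of the remaining mass,
    renormalize the alternatives, and report whether the hard cleaning step
    (removing the variable as above) may now be executed."""
    target = (variable, value)
    others = [k for k in probabilities if k[0] == variable and k != target]
    probabilities[target] += weight * (1.0 - probabilities[target])
    remaining = 1.0 - probabilities[target]
    old_rest = sum(probabilities[k] for k in others) or 1.0
    for k in others:
        probabilities[k] *= remaining / old_rest
    return probabilities[target] >= threshold

probs = {("r2", 0): 0.4, ("r2", 1): 0.6}
while not apply_soft_evidence(probs, "r2", 1):   # repeated pieces of evidence
    pass
print(probs)   # P(r2 -> 1) has converged above the threshold
```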
Uncertainty Increase: Casting Doubt Perhaps counterintuitively, increasing uncertainty may improve data quality, hence could be an approach for cleaning. For example, if due to some evidence it becomes unlikely that a certain tuple is correct, a random variable may be introduced, and possible repairs for the tuple may be inserted. In effect, we are casting doubt on the data and inserting what seems more likely. Consequently, the uncertainty increases, but the overall quality may increase, because the probability mass associated with incorrect data decreases and the probability mass for correct data increases (assuming the evidence is largely correct).
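A sketch of casting doubt: a suspect but certain tuple is replaced by two alternatives governed by a fresh random variable, the original value and a candidate repair, so some probability mass moves to the presumably correct alternative. The variable name r7, the values, and the 0.3/0.7 split are made up for illustration.

```python
def cast_doubt(tuples, probabilities, tuple_id, repair_value, variable, p_original=0.3):
    """Turn a certain but suspect tuple into two mutually exclusive alternatives
    governed by a fresh random variable: the original value and a likely repair."""
    doubted = []
    for t in tuples:
        if t["id"] == tuple_id and t["condition"] is None:
            doubted.append({**t, "condition": (variable, 0)})           # original value
            doubted.append({**t, "brand": repair_value,
                            "condition": (variable, 1)})                # proposed repair
        else:
            doubted.append(t)
    probabilities[(variable, 0)] = p_original
    probabilities[(variable, 1)] = 1.0 - p_original
    return doubted, probabilities

tuples = [{"id": "d4", "brand": "Vlokswagen", "condition": None}]
tuples, probs = cast_doubt(tuples, {}, "d4", repair_value="Volkswagen", variable="r7")
print(tuples, probs)   # the likely typo keeps 0.3 probability; the repair gets 0.7
```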
Measuring uncertainty and quality The above illustrates that uncertainty and quality are orthogonal notions. Uncertainty is usually measured by means of entropy. Quality measures for probabilistic data are introduced by van Keulen and de Keijzer (2009): expected precision and expected recall. These notions are based on the intuition that the quality of a correct query answer is better if the system dares to claim that it is correct with a higher probability.
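A sketch of these measures under one straightforward reading, taking the expectation of precision and recall over the possible worlds of a query answer (this operationalization and the example data are assumptions, not necessarily the exact definitions of van Keulen and de Keijzer (2009)):

```python
from math import log2

def entropy(distribution):
    """Uncertainty of a random variable, measured in bits."""
    return -sum(p * log2(p) for p in distribution.values() if p > 0)

def expected_precision_recall(worlds, correct):
    """Expectation over possible worlds of the precision/recall of the query
    answer in that world against the (assumed known) correct answer."""
    precision = recall = 0.0
    for p, answer in worlds:
        hits = len(answer & correct)
        precision += p * (hits / len(answer) if answer else 1.0)
        recall += p * (hits / len(correct) if correct else 1.0)
    return precision, recall

# Two possible worlds of a query answer, e.g., from the deduplication example.
worlds = [(0.7, {"Volkswagen", "Toyota"}), (0.3, {"Volkswagen", "VW", "Toyota"})]
correct = {"Volkswagen", "Toyota"}

print(entropy({0: 0.7, 1: 0.3}))                   # ~0.88 bits of uncertainty
print(expected_precision_recall(worlds, correct))  # higher when confidently right
```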
Example Applications

A notable application of probabilistic data integration is the METIS system, “an industrial prototype system for supporting real-time, actionable maritime situational awareness” (Huijbrechts et al. 2015). It aims to support operational work in domains characterized by constantly evolving situations with a diversity of entities, complex interactions, and uncertainty in the information gathered. It includes natural language processing of heterogeneous (un)structured data and probabilistic reasoning over uncertain information. METIS can be seen as an open-source intelligence (OSINT) application.

Another notable and concrete example of an existing system is MCDB-R (Arumugam et al. 2010). It allows risk assessment queries directly on the database. Risk assessment typically corresponds to computing interesting properties of the upper or lower tails of a query result distribution, for example, computing the probability of a large investment loss.

Probabilistic data integration is in particular suited for applications where much imperfection can be expected but where a quick-and-dirty integration and cleaning approach is likely to be sufficient. It has the potential of drastically lowering the time and effort needed for integration and cleaning, which can be considerable since “analysts report spending upwards of 80% of their time on problems in data cleaning” (Haas et al. 2015).

Other application areas include:

• Machine learning and data mining: since probabilistically integrated data has a higher information content than “data with errors,” it is expected that models of higher quality will be produced if probabilistic data is used as training data.
• Information extraction from natural language: since natural language is inherently ambiguous, it seems quite natural to represent the result of information extraction as probabilistic data.
• Web harvesting: websites are designed for use by humans. A probabilistic approach may lead to more robust navigation. Subtasks like finding search results (Trieschnigg et al. 2012) or finding target fields (Jundt and van Keulen 2013) are typically based on ranking “possible actions.” By executing not only one but a top-k of possible actions and representing the resulting data probabilistically, consequences of imperfect ranking are reduced.
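To illustrate the kind of risk assessment query mentioned for MCDB-R, here is a minimal Monte Carlo sketch that estimates a tail probability of a query result over sampled possible worlds; this shows only the general idea, not MCDB-R's actual machinery, and the portfolio data and distributional assumptions are made up.

```python
import random

def estimate_tail_probability(sample_world, aggregate, threshold, n=10_000, seed=42):
    """Estimate P(aggregate(world) > threshold) by sampling possible worlds."""
    rng = random.Random(seed)
    exceed = sum(aggregate(sample_world(rng)) > threshold for _ in range(n))
    return exceed / n

# Hypothetical portfolio: each position's loss is uncertain (normal assumption).
positions = [(100, 0.02, 0.05), (250, 0.01, 0.08), (50, 0.03, 0.10)]  # (value, mean, stddev)

def sample_world(rng):
    return [value * rng.gauss(mu, sigma) for value, mu, sigma in positions]

# Probability of a total loss larger than 20 (upper tail of the result distribution).
print(estimate_tail_probability(sample_world, aggregate=sum, threshold=20.0))
```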
ing data. In: Proceeding of DEXA. LNCS, vol 9261. Springer, pp 236–250. [Link]
Wanders B, van Keulen M, Flokstra J (2016) Judged: a probabilistic datalog with dependencies. In: Proceeding of DeLBP. AAAI Press
Widom J (2004) Trio: a system for integrated management of data, accuracy, and lineage. Technical report 2004-40, Stanford InfoLab. [Link]
Wijsen J (2005) Database repairing using updates. ACM TODS 30(3):722–768. [Link]