Data Integration
Click to edit Master subtitle style
4/9/12
Data Models
Main aim of data models is to support the development of information systems by providing the definition and format of data
Conceptual
4/9/12
Data Modeling Process
4/9/12
4/9/12
Data Set Interaction
4/9/12
Steps in Data Integration
Step
1: Identification of common elements
Identify the common entities Example, Employee dimension can come from Sales system, Payroll System, etc. Products can come from manufacturing, sales, purchase etc. Once the common entity is identified its definition should be standardized. Example, does employee include fulltime employees as well as temporary workers Identify the common attributes. What are the attributes that are common to employee, 1st name, 2nd name, last name, date of joining, etc? Each attribute should be defined. Identify the common values 4/9/12 Same information can be represented in different
Steps in Data Integration
Second
step is identification of Data Steward who will own the responsibility and ownership for particular set of data elements Third step is to design an ETL process to integrate the data into the target. This is the most important area in the implementation of an ETL process for data integration. This topic will be discussed in more detailed under its own heading Fourth step is to establish a process of maintenance, review & reporting of such elements
4/9/12
Historical Perspective
Click to edit Master subtitle style
4/9/12
View Integration (1965-75)
From
application requirements to the database schema
Applicatio n requireme nts2 Applicatio n requireme nts3
Applicatio n requireme nts1
Applicatio n requireme ntsn
Integrated Conceptual Schema
4/9/12 9
Data Interoperability (1975-80)
LOCATION TRANSPARENCY global schema global access only: global + local access: DISTRIBUTED DBS FEDERATED DBS
I N T E G R A T I O N
LOCATION VISIBILITY no global schema multiDB views, multi DB access language: MULTIDATABASE SYSTEMS
unstructured or semi-structured data (files, repositories, knowledge bases, spreadsheets, ) information exchange protocols / languages: INTEROPERABLE SYSTEMS
4/9/12 10
Inter-organizational Data Integration (80-95)
Federated Databases
I.S. DBMS1
1
I.S. netw ork DBMS 3
DB1
FED. SCHEMA A
DB3 I.S. DBMS
2
FED. FED.SCHEM SCHEMA A B B
4/9/12 11
FDB Service Architecture
Negotiator Data Wharehouse
Translation: Wrappers / Mediators
Filtering
Integration
local export schemas
4/9/12
12
Data Warehousing (19952000)
Traditional
DW Architecture
OLAP Server OLAP
Internal Sources Data Integration Component Data Warehouse Query and Analysis Component
Reports
Operational DBs
Data Mining
Meta data External Sources Client Tools
4/9/12
13
The Web Era: Data Alignment
Catalog matching In order for a private company to participate in the marketplace (e.g., eBay), it has to determine correspondences between entries of its catalogs and entries of a common catalog of a marketplace. Once the correspondences between two schemas have been determined, the next step is to generate query expressions that automatically translate data instances of these catalogs under an integrated catalog Having aligned the catalogs, users of a marketplace have a unified access to the products which are on sale
4/9/12
14
Heterogeneity is the Issue
Schemas
Generalization Specialization
Aggregation Structural
Typing
Completenes s Taxonomy
Model Syntactical Language
Data "Conflicts"
Semantic
Values Cognitive
4/9/12
15
Library Example (homogeneous)
Same content, different structures
Schema 1
Book title ISBN authors name birthdate
Schema 2
name
Author birthdate books title ISBN 4/9/12 16
Thesis example (heterogeneous)
Schema S1 (OO)
Person
Pin Name
The integrated schema (OO)
Person Pin Name Faculty Rank Phd-advisor Student GP A PhD Student Thesis Adv . Titl e Student
Faculty
Rank
Student
GP A
Schema S2 (relational)
Thesis (Phd-advisor, Phd-student, title)
Different data models, overlapping content 4/9/12
17
DW: Heterogeneous External Structured Sources
Apply transformations as needed to homogenize data structures (optional) Integrate
transformation integration
source DBs
homogeneized DBs
DW
4/9/12 18
A Generic Framework for Integration
Source schema s schemas transformation correspondence s investigation schemas integration mapping rules
integrat ed schema
transformati on rules
investigatio n rules
integration rules
Also for alignment
4/9/12 19
A Generic Framework for Integration
Source schema s schemas transformation correspondence s investigation schemas integration mapping rules
integrated schema
transformati on rules
investigatio n rules
integration rules
Also for alignment
4/9/12 20
Step 1: Pre-Integration
Given
the heterogeneity of the local sources:
DBMSs , GISs, XML, UML, OWL, RDF, WSDL, Reduce this heterogeneity as much as possible to make the sources more suitable for integration
Often,
the source data sets cannot be modified
Keep current applications programs alive Proprietary sources with independent use 4/9/12 => Modifications through a view mechanism
21
Data Model Homogenization
Goal: use a single, common data model and format
transformation integration
source DBs
homogeneized DBs
DW
4/9/12
22
Design Homogenization
Explicit the semantics of the schema
relational normalization : one fact in one place complex object normalization
Enforce standard design rules to reduce the number of structural conflicts
Objectify multivalued attributes:
Person child Child
Person
Family
4/9/12
23
Filling the semantic gap
COBOL
files Relational databases Object-Oriented databases Spreadsheets
? ?
Problem: what do these legacy data sources contain?
4/9/12
24
Understanding legacy data
Typical data reverse engineering process:
From implementation structures To conceptual design
Documentation is non-existing or obsolete (not updated) Designers are gone Knowledge is missing
4/9/12
25
COBOL files
Data description is within the application programs Data descriptions for the same file vary from one application program to the next Data descriptions may be incomplete (FILLER) 1st problem: to recover the underlying structure for each file
Analysis of DATA DIVISIONs Dataflow analysis within application programs
4/9/12
26
Relational databases
The
existing schema is usually not normalized Information on primary keys, candidate keys, foreign keys,functional dependencies, inclusion dependencies, other constraints, derivation rules, may be missing and is essential for the reverse engineering process
4/9/12
27
Example
Person ( P#, name, address Firstnames ( P#, firstname ) Employee ( E#, position, salary ) EmpDept ( E#, department ) Department ( name, location, boss, ) budget )
P# name address firstname s 0:n position salary
Person
0:n
Employee
0:n
E. D. bo ss
0:n
Department
name location budget
1:1
4/9/12
28
Where to find
Primary keys / Candidate keys
often defined in the schema check Select Distinct in SQL queries or mine the DB
check domain compatibility from the schema check join conditions in SQL queries check values in the DB Mine the DB
Foreign keys
Dependencies (functional, inclusion, existence)
4/9/12
29
Primary Keys?
Person ( P# , name, address ) Firstnames ( P# , firstname ) Employee ( E# , position, salary ) EmpDept ( E# , department ) Department ( name, location, boss, budget ) linguistic heuristics: Person ( P# , name, address ) Firstnames ( P# , firstname ) => Firstnames ( P# , firstname ) Employee ( E# , position, salary ) EmpDept ( E# , department ) => EmpDept ( E# , department ) Department ( name, location, boss, budget )
4/9/12
30
Foreign keys?
Person ( P# , name, address ) Firstnames ( P# , firstname ) Employee ( E# , position, salary ) EmpDept ( E# , department ) Department ( name, location, boss, budget ) Finding foreign keys
Firstnames.P# Person.P# but: Person.P# Firstnames.P# EmpDept.E# Employee.E#
EmpDept.department Department.name
Department.boss Employee.E# Employee.E# Person.P# 4/9/12 31
Objects, relationships, attributes ) Person ( P# , name, address
Firstnames ( P# , firstname ) Employee ( E# , position, salary ) EmpDept ( E# , department ) Department ( name, location, boss, budget ) A relation with one PK only is an object
Person ( P# , name, address ) Employee ( E# , position, salary )
A relation whose PK is composed of FK is a relationship
EmpDept ( E# , department ) Employee, Department are objects
The FK boss in Department defines a relationship with 1:1 role cardinality firstname typically is a multi-valued attribute of Person
4/9/12
32
Is-a links
P# name address firstname s 0:n position salary
Person
0:n
Employee
0:n
E. D. bo ss
0:n
Departme nt
name location budget
1:1
Person ( P# , name, address ) Firstnames ( P# , firstname ) Employee ( E# , position, salary ) PK is FK weak entity
4/9/12 33
Further Semantic Enrichment
Goals: ensure better understanding of the schemas Completing existing descriptions with :
meta-data for each representation external keys thesauri semantic definitions explicit representation rules
Example : if a road crosses a river, a bridge exists
quality indicators
4/9/12 34
Enrichment Examples
Add information on how the schema maps to the real world
add a definition of "Employee" as "those persons who have an employment contract with the enterprise add a definition of "salary" in "Employee" as "the gross salary amount, before any deductions"
Add
information on the description of the real world
"boss" in "Department" is an external key to "Employee "salary" in "Employee" is expressed in Euros
4/9/12 35
Enrichment: Adding Constraints
E.g., add functional dependencies
E m p lo y e e
E#
p ro je c ts
Employee (E#, projectname)
n a m e Project r(name, manager) m anage
E m p lo y e e
E#
p ro je c ts nam e ro le
4/9/12
36
A Generic Framework for Integration
schem as mapping rules correspondences investigation schemas integration integrat ed schema
schemas transformation
transformati on rules
investigatio n rules
integration rules
4/9/12
37
Step 2: Correspondences Investigation
Correspondences relate (schema) elements which describe the same phenomena of the real world
What are the related schema elements ? How their sets of instances are related ? How corresponding instances are identified ? Do their descriptions match ? How ?
Known as the (Schema) Matching problem
4/9/12 38
Semantic Relativism
Information representation depends on perception
cable blue cable color material conflicts red cable copper cable color fiber material
Differences in representation as usually referred to as
4/9/12 39
Semantics of correspondences
Real World
Database System
Schema 1
Schema 2
4/9/12
40
Example of correspondences
Person
Schema S1
Faculty
Rank
Pin Name
Student
GP A
Schema S2
Thesis (Phd-advisor, Phd-student, title)
4/9/12
41
Some Library Correspondences
Book
Schema 1
title ISBN
authors
name birthdate
Schema
2
name
Author
birthdate books title ISBN 4/9/12 42
Asserting Correspondences
S1 thing1 set_relationship S2 thing2
With Corresponding Identifiers: thing1id-predicate = thing2id-predicate [With Corresponding Properties: thing1attribute
set_relationship
thing2attribute , ... ]
set_relationship :
4/9/12
43
Corresponding Schema Elements
one element --- one element (same construct, e.g. object type)
Node CONTAINED-IN Extremity
one element --- one element (different constructs, e.g. objec type and attribute )
Book EQUAL books authors EQUAL Author
4/9/12
44
The Match Operator
It
takes two schemas/ontologies, each consisting of a set of discrete entities (e.g., tables, XML elements, classes, properties) as input and determines as output the relationships (e.g., equivalence, subsumption) holding between these entities.
4/9/12
45
Matching Dimensions
Input
dimensions
Underlying data models (e.g., XML) Schema-level vs. Instance-level
Process
dimensions
Approximate vs. Exact Syntactic vs. semantic vs. external input
Output
dimensions
Cardinality (e.g., 1:1, 1:m) Equivalence vs. more relations Graded vs. absolute confidence
4/9/12 46
e.g. String-based Techniques
Prefix / Suffix
Checks whether the first string starts (ends) with the second one prefix: net = network; but also hot = hotel suffix: phone = telephone; but also word = sword
Edit distance
Calculates the number of insertions / deletions / substitutions of characters required to transform one string into another, ratio max(length(string1); length(string2))
EditDistance(NKN,Nikon) = 2/5 = 0.4
N-Gram
Calculates the number of identical n-grams (i.e., sequences of n characters) between them trigram(3) for the string nikon are nik, iko, kon n-gram(2) (nikon, konnichiwa) = 3 4/9/12 47
e.g. Language-based Techniques
Tokenization
Names are parsed into tokens by recognizing punctuation, cases Hands-Free_Kits => <hands, free, kits>
Lemmatization
Tokens are morphologically analyzed in order to find all their possible basic forms Kits => Kit
Elimination
Tokens that are articles, prepositions, conjunctions, and so on, are marked to be discarded a, the, by, type of
4/9/12
48
A Generic Framework for Integration
schem as mapping rules correspondences investigation schemas integration integrat ed schema
schemas transformation
transformati on rules
investigatio n rules
integration rules
4/9/12
49
Step 3 : Integration (or alignment)
Goal
for Integration: Creating an Integrated Schema ( IS ) and the mappings to the local databases Goal for Alignment: Conform a specification to a given prescriptive specification (PS) Goal for Matching: Find correspondences/complementarity PS IS
S1
S2
S3
S1
S1
4/9/12 50
GAV & LAV Integration
GAV (Global As View): the Integrated Schema provides an integrated description of all the data available in the sources
the IS is used to access data from any sources queries to the IS are mapped into queries to the sources (as in distributed databases) the IS is defined to allow access to all data: in case of conflicting specifications, an all-encompassing specification is elaborated for the IS
LAV (Local As View): the Integrated Schema provides an integrated description of all the data that is desirable and that somehow matches the requirements of users of the IS
the IS may define data that does not exist in any of the sources (missing / incomplete information problem)
Other Sub-Goals for integration:
minimality vs. exhaustiveness of the integrated schema, ... 4/9/12 51
Integration Process
Solving
the conflicts
Classification conflicts Descriptive conflicts Structural conflicts Fragmentation conflicts
Examples
of conflicts among related object
types
different different different different different
classifications (sets of instances) sets of properties structures coding schemes metadata 4/9/12
52
Integration rules
Rules defining the strategy to solve conflicts Example rules:
If an object type corresponds to an attribute, keep the object type If the population of an object type is included in the population of another object type, build an isa hierarchy
Integration rules depend on how you want the integrated schema to look like
4/9/12
53
Structural Conflicts
Different schema element types, e.g.: class, attribute, relationship Library example :
S1 : Book is a class S2 : books is an attribute of Author
Conflict resolution : Choose the less constraining structure
IS: Book are classes
4/9/12
54
Classification Conflicts
Corresponding elements describe different sets of real world objects
S1.Faculty CONTAINS S2.PhD-advisor
Conflict Resolution:
Generalization / Specialization hierarchy
Faculty
Phd-advisor Faculty
Merging
4/9/12
55
Descriptive conflicts
Corresponding types have different properties, or corresponding properties are described in different ways Object / relationship type:
naming conflicts :
synonyms homonyms Node , Extremity Highway (EU) , Highway (USA)
composition conflicts : different attributes and methods
Employee ( E# , name , address , dob )
4/9/12 56
Descriptive conflicts
(2)
Attribute :
naming conflicts : synonyms , homonyms cardinalities
firstname : one , two , N values
domains
salary : $, Euro ... student grade : [ 0 : 20] , [1 : 5 ] , [ A: D ] ... geometry : different scales, different reference systems, ...
integrity constraints
the average area for a flat in Japan, in USA ...
Method
:
4/9/12 57
naming conflicts : synonyms , homonyms
Descriptive conflicts
(3)
Solution depends upon :
the type of the descriptive conflict the type of classification conflict
S1 : Employee ( E# , name , address , dob ) S2 : Employee ( E# , position , salary ) S1.Employee EQUAL S2.Employee => IS. Employee ( E# , name , address , dob , position , salary )
4/9/12 58
Integration Methods: Manual
First method : manual integration do it yourself
a languag e
mapping rules integrate d schema
sche mas
DBA
easy to implement , flexible BUT time consuming for the DBA
4/9/12 59
Integration Methods: Semi-Automatic
Second
method : semi-automatic integration tell me about the problem , I will try to fix it
TOOL mapping rules integrate d schema DBA Opens to visual CASE tools, integration servers BUT knowledge acquisition can be painful
4/9/12 60
correspondence s schem as