1998, IEEE Transactions on Knowledge and Data …
Accessing many data sources aggravates problems for users of heterogeneous distributed databases. Database administrators must deal with fragile mediators, that is, mediators with schemas and views that must be significantly changed to incorporate a new data source. When implementing translators of queries from mediators to data sources, database implementors must deal with data sources that do not support all the functionality required by mediators. Application programmers must deal with graceless failures for unavailable data sources: queries simply return failure and no further information when data sources are unavailable for query processing. The Distributed Information Search COmponent (DISCO) addresses these problems. Data modeling techniques manage the connections to data sources, and sources can be added transparently to users and applications. The interface between mediators and data sources flexibly handles different query languages and different data source functionality. Query rewriting and optimization techniques rewrite queries so they are efficiently evaluated by sources. Query processing and evaluation semantics are developed to process queries over unavailable data sources. In this article, we describe 1) the distributed mediator architecture of DISCO; 2) the data model and its modeling of data source connections; 3) the interface to underlying data sources and the query rewriting process; and 4) the query processing semantics. We conclude with several advantages of our system.
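To make the graceful-failure idea concrete, here is a minimal Python sketch (the paper itself gives no code; `Wrapper`, `mediate`, and the placeholder subquery string are hypothetical names): instead of failing outright, the mediator returns the answers it could obtain plus the subqueries left unresolved by unavailable sources.

```python
# Sketch of graceful failure: unavailable sources yield unresolved
# subqueries rather than a global query failure. All names are hypothetical.

class SourceUnavailable(Exception):
    pass

class Wrapper:
    def __init__(self, name, rows, available=True):
        self.name, self.rows, self.available = name, rows, available

    def evaluate(self, predicate):
        if not self.available:
            raise SourceUnavailable(self.name)
        return [r for r in self.rows if predicate(r)]

def mediate(wrappers, predicate):
    """Evaluate the query over every source; collect partial answers
    and record the subqueries that could not be evaluated."""
    answers, unresolved = [], []
    for w in wrappers:
        try:
            answers.extend(w.evaluate(predicate))
        except SourceUnavailable as e:
            unresolved.append(f"subquery pending at source {e}")
    return answers, unresolved

if __name__ == "__main__":
    sources = [
        Wrapper("flights_db", [{"dest": "Paris", "price": 420}]),
        Wrapper("hotels_db", [{"dest": "Paris", "price": 150}], available=False),
    ]
    rows, pending = mediate(sources, lambda r: r["dest"] == "Paris")
    print(rows)     # partial answer from the available source
    print(pending)  # work left for the unavailable source
```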
Knowledge Representation Meets Databases, 1997
For advanced data-oriented applications in distributed environments, effective information is frequently obtained by integrating or fusing various autonomous information sources. There are many problems: how to resolve their heterogeneity, how to integrate target sources, how to represent information sources with a common protocol, and how to process queries. In this paper, we propose a new language, QUIK, as an extension of a deductive object-oriented database (DOOD) language, QUIXOTE, and...
Proceedings 11th Australasian Database Conference. ADC 2000 (Cat. No.PR00528), 1999
The need to transform queries between heterogeneous databases arises in many circumstances, including query transformation between a global and a local query. This paper describes a query mediation approach to the interoperability of heterogeneous databases. We develop a query mediation architecture that facilitates automated query transformation and is applicable in many settings, such as query transformation in a federated database environment. Different application languages are transformed into intermediate representations that can be processed by the query mediator. The main characteristic of our system is that the query transformation agent, which acts as a mediator, supports automated query transformation and shields users from the troublesome task of resolving semantic and representational discrepancies. Two query languages, ODMG-OQL and SQL, have been used to investigate the query mediation architecture.
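As an illustration of the intermediate-representation step, here is a hedged Python sketch; the `IRQuery` shape and the two emitters are assumptions, not the paper's actual mediator format:

```python
# One neutral intermediate form that can be rendered back into either
# SQL or ODMG-OQL. The IR layout and emitters are illustrative only.

from dataclasses import dataclass

@dataclass
class IRQuery:                      # language-neutral intermediate form
    projection: list
    source: str
    condition: str

def from_sql(select_cols, from_table, where_clause):
    return IRQuery(projection=select_cols, source=from_table,
                   condition=where_clause)

def to_oql(q: IRQuery):
    cols = ", ".join(f"x.{c}" for c in q.projection)
    return f"select {cols} from x in {q.source} where x.{q.condition}"

def to_sql(q: IRQuery):
    return f"SELECT {', '.join(q.projection)} FROM {q.source} WHERE {q.condition}"

ir = from_sql(["name", "salary"], "Employee", "salary > 50000")
print(to_oql(ir))   # the same IR rendered for an OQL source
print(to_sql(ir))   # ... and for a SQL source
```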
9th Asia-Pacific Conference on Communications (IEEE Cat. No.03EX732), 2003
In this paper, we propose an intelligent mediator, based on XML technology, for heterogeneous data sources. We call it an XML Mediator, or XMed for short. XMed is expected to reduce the overhead that may arise when adding a new data source and to allow uniform processing of data received from different data sources. It also allows dynamic connectivity to accessible data sources, provided that the connection protocol is either already available to it or can be uploaded dynamically. XMed receives requests from the client, analyzes them, and forwards them to their corresponding destinations. Results are wrapped in XML and sent back to the corresponding client. It manages users' accounts, authorization, and security to facilitate connection to the desired data source remotely. It is presumed that XMed will fulfill the role of an open system by exposing its services to any developed application. XMed maintains a knowledge base of user histories and metadata about database connections and queries. The system uses this knowledge base to assist users when accessing desired data or running queries.
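A small sketch of the result-wrapping step described above, assuming a dictionary-per-row result model; the XML element names are illustrative, not XMed's actual format:

```python
# Rows from any source are serialized into one XML envelope before being
# returned to the client. Element names are assumptions for illustration.

import xml.etree.ElementTree as ET

def wrap_results(source_name, rows):
    root = ET.Element("result", {"source": source_name})
    for row in rows:
        rec = ET.SubElement(root, "record")
        for field, value in row.items():
            ET.SubElement(rec, field).text = str(value)
    return ET.tostring(root, encoding="unicode")

print(wrap_results("customers_db", [{"id": 1, "name": "Ada"}]))
# <result source="customers_db"><record><id>1</id><name>Ada</name></record></result>
```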
… -Based Planning and …, 1994
A critical problem in building an information mediator is how to translate a domain-level query into an efficient query plan for accessing the required data. We have built a flexible and efficient information mediator, called SIMS. SIMS takes a domain-level query and dynamically selects the appropriate information sources based on their content and availability, generates a query access plan that specifies the operations and their order for processing the data, and then performs semantic query reformulation to minimize the overall execution time. This paper describes these three basic components of query processing in SIMS. [Figure 1: The SIMS Architecture]
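The source-selection step can be pictured with a toy Python sketch; the catalog structure is a simplifying assumption, and real SIMS also weighs cost and semantics, not just list order:

```python
# Toy SIMS-style source selection: per queried relation, pick a source
# whose advertised content covers the query and which is currently up.

catalog = {
    "airports": [
        {"name": "geo_db", "covers": {"airports", "cities"}, "up": True},
        {"name": "faa_db", "covers": {"airports"},           "up": False},
    ],
}

def select_source(relation):
    candidates = [s for s in catalog.get(relation, [])
                  if relation in s["covers"] and s["up"]]
    if not candidates:
        raise LookupError(f"no available source for {relation}")
    return candidates[0]["name"]   # real SIMS also weighs cost, not just order

print(select_source("airports"))   # -> geo_db (faa_db is down)
```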
Proceedings of …, 2002
Mediator systems aim to provide a unified global data schema over distributed, heterogeneous, structured and semi-structured data sources. These systems must deal with limitations on the query capabilities of the sources. This paper introduces a new framework for representing source query capabilities, along with the algorithms needed to compute the query capabilities of the global schema from the sources. Our approach to computing query capabilities supports a richer capabilities-representation framework than those previously presented in the literature. We show that those approaches are insufficient to properly represent many real sources, and how our approach overcomes those limitations.
Distributed and Parallel Databases, 1998
Users today are struggling to integrate a broad range of information sources providing different levels of query capabilities. Currently, data sources with different and limited capabilities are accessed either by writing rich functional wrappers for the more primitive sources, or by dealing with all sources at a “lowest common denominator”. This paper explores a third approach, in which a mediator ensures that sources receive queries they can handle, while still taking advantage of all of the query power of the source. We propose an architecture that enables this, and identify a key component of that architecture, the Capabilities-Based Rewriter (CBR). The CBR takes as input a description of the capabilities of a data source, and a query targeted for that data source. From these, the CBR determines component queries to be sent to the sources, commensurate with their abilities, and computes a plan for combining their results using joins, unions, selections, and projections. We provide a language to describe the query capability of data sources and a plan generation algorithm. Our description language and plan generation algorithm are schema independent and handle SPJ queries. We also extend CBR with a cost-based optimizer. The net effect is that we prune without losing completeness. Finally we compare the implementation of a CBR for the Garlic project with the algorithms proposed in this paper.
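The heart of the CBR, splitting a query into a part the source can handle and a residual part evaluated at the mediator, can be sketched as follows; the capability description is deliberately reduced to "which attributes the source can select on", far simpler than the paper's description language:

```python
# Minimal capabilities-based rewriting: push the conditions a source can
# evaluate down to it, keep the rest as a mediator-side filter.

def rewrite(conditions, source_selectable_attrs):
    """conditions: list of (attr, op, value); returns (pushed, residual)."""
    pushed   = [c for c in conditions if c[0] in source_selectable_attrs]
    residual = [c for c in conditions if c[0] not in source_selectable_attrs]
    return pushed, residual

conds = [("year", "=", 1998), ("title", "contains", "mediator")]
pushed, residual = rewrite(conds, source_selectable_attrs={"year"})
print("send to source:   ", pushed)    # the source evaluates year = 1998
print("apply at mediator:", residual)  # mediator filters on title itself
```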
Lecture Notes in Computer Science, 2002
Proceedings of the IEEE, 1987
Mermaid is a system that allows the user of multiple databases stored under various relational DBMSs running on different machines to manipulate the data using a common language, either ARIEL or SQL. It makes the complexity of this distributed, heterogeneous data processing transparent to the user. In this paper, we describe the architecture, system control, user interface, language and schema translation, query optimization, and network operation of the Mermaid system. Future research issues are also addressed.
Nigerian Annals of Pure and Applied Sciences, 2018
This study developed and implemented an improved mediator-wrapper approach to addressing the challenges of integrating semantically heterogeneous databases. It employed the Local-as-View (LAV) paradigm of database integration so as to reduce cost and offer the local sources a degree of independence. The developed model was implemented as a web service using JEE and other software development tools, with a number of heterogeneous databases as a case study.
Proceedings of the IEEE, 2000
Future Generation Computer Systems, 2005
Across a wide variety of fields, huge datasets are being collected and accumulated at a dramatic pace. The datasets addressed by individual applications are very often heterogeneous and geographically distributed. An important factor in the strength of a modern enterprise is its capability to effectively store and process information. As a legacy of the computing trends of recent decades, large enterprises often have many isolated data repositories used only within portions of the organization. For organizational and technical reasons, such isolated systems still emerge within different parts of enterprises. While these systems support the individual enterprise units, their inability to interoperate hinders the evolution of a system that provides the user with a unified information model of the data available in the whole organization. The problem is compounded when carried over to the emerging Grid concept, where a uniform information model must be provided over the shared resources of the enterprises participating in a virtual organization. Several technical obstacles arise in the design and implementation of a system for integrating such data sources, most notably distribution, autonomy, and data heterogeneity. This technical report presents a data integration system based on the wrapper-mediator approach, the Grid Data Mediation Service. The alternative of forcing each grid application to interface directly with a set of databases and resolve federation problems internally would lead to application complexity and duplication of effort. We describe our extensions and improvements to the reference implementation of the OGSA-DAI Grid Data Service prototype, an infrastructure that allows remote access to Grid databases, in order to provide a Virtual Data Source: a clean abstraction of heterogeneous, distributed data for users and applications. The open architecture (in terms of integrable data sources and operations) is implemented in a first prototype to show the feasibility of the developed concepts as well as their applicability to real needs and problems. We developed a flexible mapping schema to describe the process of building a virtual data source, with operators to query data sources and combine the results via join and union. Integrable data sources include relational databases, native XML databases, and files in the comma-separated-value format. Usability and integrability were major concerns during the design phase, so we decided to support a subset of the well-known Structured Query Language (SQL) for formulating queries against our global schema. By describing generally applicable access scenarios, we show the need for what is, to the best of our knowledge, the first Grid data mediation service, as well as its compliance with important requirements of virtual data sources.
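The mapping schema's operator model (query each source, then combine via union and join) might look like the following Python sketch, with in-memory lists standing in for relational, XML, and CSV sources:

```python
# A virtual data source built from operators: leaf scans over wrapped
# sources, combined via union and join. Source contents are stand-ins.

def scan(rows):                         # leaf operator: one wrapped source
    return list(rows)

def union(left, right):                 # combine same-shaped sources
    return left + right

def join(left, right, key):             # combine sources on a shared key
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

rdbms = scan([{"id": 1, "name": "Ada"}])
csv   = scan([{"id": 2, "name": "Grace"}])
xmldb = scan([{"id": 1, "dept": "CS"}, {"id": 2, "dept": "EE"}])

people = union(rdbms, csv)              # one virtual relation over two sources
print(join(people, xmldb, "id"))        # then joined with a third
```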
Montreal, Canada, 1996
2014
In the past, to answer a user query, we generally extracted data from one centralized database or from multiple sources with the same structure. Things have since changed, and in some cases it is necessary to use a set of data sources to provide complete information. These sources are physically separated but are logically seen as a single component by the final user. Besides structural heterogeneity, another important point for which specialists are trying to find a solution is the semantic heterogeneity of data sources. In this paper we provide a list of different approaches that have treated the query processing problem on heterogeneous data sources from different angles.
2010
Distributed heterogeneous data sources need to be queried uniformly using a global schema. A query on the global schema is reformulated so that it can be executed on local data sources. Constraints in the global schema and mappings are used for source selection, query optimization, and querying partitioned and replicated data sources. The system is entirely XML-based: it poses queries in XML form and transforms and integrates local results into an XML document. Contributions include the use of constraints in our existing global schema, which help in source selection and query optimization, and a global query-distribution framework for querying distributed heterogeneous data sources.
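A minimal sketch of constraint-driven source selection for partitioned data, assuming each source advertises a range constraint on a partitioning attribute (the mapping format is illustrative):

```python
# Each source carries a range constraint on a partitioning attribute; the
# mediator skips sources whose constraint contradicts the query.

sources = {
    "orders_2008": {"year_min": 2008, "year_max": 2008},
    "orders_2009": {"year_min": 2009, "year_max": 2009},
}

def relevant_sources(query_year):
    return [name for name, c in sources.items()
            if c["year_min"] <= query_year <= c["year_max"]]

print(relevant_sources(2009))   # ['orders_2009']; orders_2008 is pruned
```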
As the number of databases accessible on the Web grows, the ability to execute queries spanning multiple heterogeneous queryable sources is becoming increasingly important. To date, research in this area has focused on providing semantic completeness, and has generated solutions that work well when querying over a relatively small number of databases that have static and well-defined schemas. Unfortunately, these solutions do not extend to the scale of the present Internet, let alone the Internet of the future. In this paper, we present an approach that makes the opposite tradeoff: it provides a scalable, unified view over large numbers of queryable information sources by sacrificing some expressive power in the set of queries supported. We have developed a prototype system, IDB, which implements this approach. The IDB system provides scalability through three main techniques. First, it uses a collection of ontologies organized into hierarchical namespaces as a medium for expressing data semantics. Second, it employs a declarative query language to describe information sources so that source descriptions can be "executed" at run time instead of being pre-compiled into the system. Third, it utilizes inverted-index style operations to identify the subset of information sources that are relevant to a particular user query. We describe the design, architecture, and implementation of IDB and illustrate its use through case examples.
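The inverted-index style source identification can be sketched directly; the ontology terms and source descriptions below are invented examples:

```python
# Source descriptions indexed by the ontology terms they mention, so the
# sources relevant to a query are found by intersecting posting lists
# instead of scanning every description.

from collections import defaultdict

descriptions = {
    "weather_src": {"geo:city", "weather:forecast"},
    "flight_src":  {"geo:city", "travel:flight"},
    "stock_src":   {"finance:ticker"},
}

index = defaultdict(set)                 # term -> sources mentioning it
for src, terms in descriptions.items():
    for t in terms:
        index[t].add(src)

def sources_for(query_terms):
    postings = [index[t] for t in query_terms]
    return set.intersection(*postings) if postings else set()

print(sources_for({"geo:city", "travel:flight"}))   # {'flight_src'}
```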
1997
Semantic query optimization is the process of transforming a query issued by a user into a different query which, because of the semantics of the application, is guaranteed to yield the correct answer for all states of the database. While this process has been successfully applied in centralised databases, its potential for distributed and heterogeneous systems is enormous, since it can eliminate inter-site joins, which are the single biggest cost factor in query processing.
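A toy example of the idea: a constraint known to hold at one site lets the optimizer drop a predicate that would otherwise require an inter-site join (the constraint encoding is illustrative):

```python
# Semantic query optimization in miniature: a known integrity constraint
# ("dept = 'sales' implies bonus_eligible = True", where bonus_eligible
# lives in a remote table) makes one predicate redundant, so the remote
# join it would require can be eliminated.

constraints = {("dept", "sales"): [("bonus_eligible", True)]}

def optimize(predicates):
    """Drop predicates implied by a constraint triggered by another predicate."""
    implied = set()
    for attr, val in predicates:
        implied.update(constraints.get((attr, val), []))
    return [p for p in predicates if p not in implied]

query = [("dept", "sales"), ("bonus_eligible", True)]
print(optimize(query))  # [('dept', 'sales')] -- remote join no longer needed
```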
2004
A facility that allows users to query heterogeneous and distributed data repositories is needed by the majority of users. Moreover, that facility should hide all the intrinsic problems of the context, such as the location, organization/structure, query language, and semantics of the data in the various repositories. In this paper, a query processing strategy that focuses on information content is presented. For that purpose, domain-specific ontologies that capture the information content of data repositories are used. Those ontologies are described using a system based on Description Logics. We also explain the different steps followed to provide incremental answers to user queries by navigating across ontologies, using pre-defined semantic inter-ontology relationships. The capabilities of Description Logics systems are exploited to optimize queries and guide query processing.
Proceedings of the …, 1997
2009
In the Web environment, rich, diverse sources of heterogeneous and distributed data are ubiquitous. In fact, even the information characterizing a single entity, such as the information related to a Web service, is normally scattered over various data sources using various languages such as XML, RDF, and OWL. Hence, there is a strong need for Web applications to handle queries over heterogeneous, autonomous, and distributed data sources. However, existing techniques do not provide sufficient support for this task. In this paper we present DeXIN, an extensible framework for integrated access over heterogeneous, autonomous, and distributed Web data sources, which can be utilized for data integration in modern Web applications and service-oriented architectures. DeXIN extends the XQuery language by supporting SPARQL queries inside XQuery, thus facilitating queries over data modeled in XML, RDF, and OWL. DeXIN facilitates data integration in a distributed Web and service-oriented environment by avoiding both the transfer of large amounts of data to a central server for centralized integration and the transformation of huge amounts of data into a common format for integrated access.
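A hedged sketch of the dispatch-and-merge pattern this implies, with Python callables standing in for real XQuery and SPARQL engines (all names, URLs, and query fragments are placeholders):

```python
# Each query fragment runs at its own source; only the (small) results
# travel to the mediator, never the raw data sets.

def run_xquery(fragment):               # placeholder for an XML source engine
    return [{"service": "geocode", "wsdl": "http://example.org/geo?wsdl"}]

def run_sparql(fragment):               # placeholder for an RDF source engine
    return [{"service": "geocode", "rating": 4.5}]

def integrated_query():
    xml_part = run_xquery("for $s in doc('services.xml')//service ...")
    rdf_part = run_sparql("SELECT ?service ?rating WHERE { ... }")
    ratings = {r["service"]: r["rating"] for r in rdf_part}
    # merge per service name at the mediator, not at a central data store
    return [{**x, "rating": ratings.get(x["service"])} for x in xml_part]

print(integrated_query())
```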