Cosmos: Cluster of Systems of Metadata For Official Statistics
Prepared by Prof. H. Papageorgiou and M. Vardaki, University of Athens, Department of Mathematics, Athens, Greece
Contents
Preface
Chapter 1 - Projects presentation: general information
  1.1 FASTER
    1.1.1 Main objectives
    1.1.2 Development of a metadata specification in XML
    1.1.3 System architecture
    1.1.4 The Data Web technology
    1.1.5 User environment
    1.1.6 Access control
  1.2 IQML
    1.2.1 Main objectives
    1.2.2 Metadata specification
    1.2.3 Architecture
  1.3 IPIS
    1.3.1 Main objectives
    1.3.2 Architecture
  1.4 METAWARE
    1.4.1 Main objectives
    1.4.2 Architecture
    1.4.3 Metadata support
  1.5 MISSION
    1.5.1 Main objective
    1.5.2 Architecture
Chapter 2 - Projects comparison
  2.1 Objectives
  2.2 Comparative analysis of COSMOS cluster projects
    2.2.1 Data capture
    2.2.2 Data dissemination
    2.2.3 Metadata repository
    2.2.4 Metadata categories and modelling
  2.3 Metadata model comparisons
  2.4 Architecture comparisons
References
Annex 1 - Template
Preface
This is the COSMOS deliverable for the topic "COSMOS Projects profile", prepared by the UoA/Dept of Mathematics. This document provides a comparative analysis of the five projects that participate in COSMOS (FASTER, IQML, IPIS, METAWARE, MISSION). It intends to illustrate similarities and differences in the following areas:
- Objectives
- Areas of application
- User requirements
- User services provided
- System architecture
- Other significant issues
To obtain the required information, the following steps were taken. An initial table covering all five projects' similarities and differences in certain domains was prepared by the UoA/Dept of Mathematics and presented and discussed in detail at the COSMOS kick-off meeting in Essex; a number of alterations and additional differences and relationships were suggested by the partners. To achieve the best understanding of each project's specificities, a template was prepared by H. Papageorgiou and M. Vardaki, finalized with the help of Hilary Beedham, and then sent to all partners for completion. This template is provided in Annex 1. The responses collected are presented in two tables in Chapter 2. In some cases questions were not completed by the corresponding partners, and N.A. (not available) is indicated in the corresponding cell of the table; there are also cases where the answers provided were not clear enough, and this is also noted. In order to give an overview of each project and to be able to proceed to comparisons, we examined a number of documents and websites for each project, which are given as references. The first chapter of this deliverable is entirely extracted from the relevant documentation, and the comparisons in Chapter 2 concerning the metadata model and the possible relationships among the COSMOS cluster of projects were also obtained from the study of these documents.
Therefore, this deliverable not only attempts a comparison of the projects, but also presents their interrelations and some possible areas of project interaction within the COSMOS cluster framework.
Chapter 1 - Projects presentation: general information

1.1 FASTER
1.1.1 Main Objectives
FASTER is a dissemination project that aims to develop a flexible and intelligent platform for accessing various types of statistical data and electronic resources on the Internet. It focuses on extending the technology pioneered in the NESSTAR project by adding functionality for access control, statistical disclosure control and the management of hierarchical and aggregate data files, and by improving the usability of the software. The FASTER project revolves around the metadata repository and will use it to offer a rich, user-configurable environment that at the same time enforces access control rules on the underlying data. This metadata repository will be XML-based and standards-compliant to facilitate information interchange with other systems. FASTER updates approaches followed in NESSTAR in accordance with technological advances in the area of information interchange. Publicly available results indicate that, although the same general architecture will be followed, emphasis will now be given to standards compliance (the metadata repository will evolve in an RDF/XML direction while keeping the base DDI orientation), to an enhanced metadata role (metadata will be responsible not only for data conformance details, but will also refer to user requirements as users browse, i.e. personalization), and to metadata applicability to a wider array of data sources (time-variant, multidimensional, etc.). Finally, access control will be metadata-supported at all levels of the final system. Architecturally, greater emphasis is placed on XML support and access control issues, while some of the NESSTAR
approaches seem to have been deprecated (such as CORBA messaging and Cheshire). Relevant approaches that have been examined in this phase include the Cheshire project (a Z39.50 approach), the CBS Cristal model, XML approaches, RDF and the Data Documentation Initiative (DDI) standard. It has to be noted that NESSTAR is built upon a DDI-based metadata schema. It follows that FASTER can draw upon the experience accumulated in NESSTAR in using the DDI DTD and its relevant shortcomings. It should be noted that NESSTAR has produced a working system for both the front-end (a Java-based client for the user's browser) and the back-end (a collection of software tools that enable publishers to incorporate their data into a common repository). Some more information on how the project can meet its goals follows.
The focus will be on the definition of a flexible architecture in which new resources, ranging from electronic journals to multimedia objects, can be included. The main building blocks are:
- A Data Browser (Client) that will provide a user-friendly graphical user interface to the system, its role being similar to that played by the web browser in the WWW.
- An abstract protocol for statistical data access that will be mapped to one or more existing or emerging communication standards (e.g. XML over HTTP, CORBA or DCOM).
- A flexible Server that will be able to host a set of different services and gateways to useful external resources and services.
The Data Web will be seamlessly integrated with the WWW. It will be possible to create links from the WWW to Data Web resources and operations and, inversely, from statistical metadata to WWW sites. This integration will allow the creation of a new kind of data-rich, WWW-accessible document that blends text, images and live data. The interface represents a major step forward, allowing producers to build advanced systems that take data right through from collection to dissemination, and giving users seamless access to a wide variety of data resources. The key element is the interface between these resources. Full advantage will be taken of existing tools and systems in the design of this virtual data environment, and the design will be firmly based on the partners' practical and wide-ranging experience in disseminating and publishing data.
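As a purely illustrative aside, the abstract protocol idea above (statistical data access mapped to, e.g., XML over HTTP) can be sketched as follows. The element names and the endpoint URL are invented for this illustration and are not part of the FASTER specification.

```python
# Minimal sketch of an "XML over HTTP" statistical data request.
# All element names and the endpoint URL are hypothetical, not part
# of the FASTER specification.
import urllib.request
from xml.etree import ElementTree as ET

def build_request(dataset_id: str, variables: list[str]) -> bytes:
    """Encode a data-access request as a small XML document."""
    root = ET.Element("dataRequest")
    ET.SubElement(root, "dataset").text = dataset_id
    for name in variables:
        ET.SubElement(root, "variable").text = name
    return ET.tostring(root, encoding="utf-8")

def send_request(endpoint: str, payload: bytes) -> str:
    """POST the XML payload over plain HTTP and return the raw reply."""
    req = urllib.request.Request(
        endpoint, data=payload,
        headers={"Content-Type": "application/xml"})
    with urllib.request.urlopen(req) as reply:
        return reply.read().decode("utf-8")

# Example usage (hypothetical endpoint):
# xml = build_request("survey-1999-hbs", ["age", "income"])
# print(send_request("http://example.org/dataweb", xml))
```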
1.2 IQML
The main goal of IQML, according to the answer gathered from the completed template, is to improve the accuracy and timeliness of statistical data collection from enterprises and individuals whilst at the same time reducing the burden of statistical reporting on enterprises.
1.2.1 Main objectives
- To examine the realities of metadata interchange and object standards in order to facilitate an active contribution to the metadata interchange standards, by implementing in software chosen aspects of the international standards for metadata interchange (e.g. CWM from the Object Management Group, and Registry and Repository from ebXML), and by carrying out trials in the area of intelligent questionnaires.
- To exploit the emerging technologies to facilitate the automation, user-friendliness and application integration of the raw data collection demands of collection agencies.
- To assist raw data collection agencies to build collection instruments in a variety of forms (e.g. CATI, CAPI) using a common metadata model, which will facilitate the development of, and access to, a common metadata repository.
- To ensure that the metadata interchange and database access standards being elaborated at the international level by software vendors (e.g. the Open Information Model (OIM) from the Meta Data Coalition, the Common Warehouse Metamodel (CWM) from the Object Management Group) take into account the needs of the intelligent questionnaire, by participating in the standards process and by developing products, and re-engineering existing products, that use these standards in live data collection scenarios.
1.2.3 Architecture
The overall concept of the project's architecture is illustrated in the completed template for IQML by Chris Nelson.
The system consists of five modules:
- The Metadata Maintenance and Repository will support the definition of metadata objects, from fine-grained objects such as codes to coarser-grained objects such as tables, that can be used in a questionnaire. APIs will be developed to store and access these metadata objects. The product will allow questionnaire design systems and other software to access the metadata without needing to know the underlying structure or source of the metadata, by implementing object interfaces that follow international standards.
- The Questionnaire Designer package will enable the user to design and manage questionnaires which can be deployed using the other software modules of the suite. The tool will allow the user to define questionnaires at a number of levels: conceptual, logical and formal. Attention will be paid to the requirements of different types of respondent (business and individual), and to the different types of surveys (e.g. economic or social) that may be addressed. The questionnaire design tool will capture all relevant metadata and store it in the metadata repository.
- The Questionnaire Presentation tool will render the questionnaire for use with PCs and in particular with web browsers. XML support in the CWM for presentation, validation, navigation and calculation will be implemented by the tool. This will allow users to fill in the data and for it to be validated as appropriate.
- The Database Interrogation tool will support the extraction of data from popular databases and the mapping of these data to XML. It will also allow data to be extracted from the XML and loaded into a database. Once configured, this will support the automated loading and extraction of data to and from databases and the electronic questionnaire.
- The Survey Administration package will allow the questionnaires to be integrated with registers and sample frames. It will track the despatch and receipt of questionnaires and software to individuals and organisations.
Sources: the completed template for IQML (Chris Nelson); IQML: Registry and Repository Interface Specification, by Chris Nelson and Andy Jenkins, Dimension EDI.
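To make the flow from metadata object to rendered questionnaire more tangible, the following is a minimal Python sketch of how a question held as a metadata object might be rendered for a web browser. The class and field names are invented for illustration and do not reflect the actual IQML object model.

```python
# Hypothetical sketch: a question stored as a metadata object and
# rendered as HTML for web presentation. Names are illustrative only,
# not the actual IQML object model.
from dataclasses import dataclass, field

@dataclass
class Question:
    qid: str                      # identifier in the metadata repository
    text: str                     # question wording
    allowed_codes: dict = field(default_factory=dict)  # code -> label

    def to_html(self) -> str:
        """Render the question as an HTML select element."""
        options = "".join(
            f'<option value="{code}">{label}</option>'
            for code, label in self.allowed_codes.items())
        return (f'<label for="{self.qid}">{self.text}</label>'
                f'<select name="{self.qid}">{options}</select>')

q = Question("q_sex", "Sex of respondent", {"1": "Male", "2": "Female"})
print(q.to_html())
```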
1.3 IPIS
1.3.1 Main objectives
The IPIS project aims to develop and apply advanced methods and technologies required for maximum international compatibility, for efficient use of public information and for increasing the efficiency of administration. In particular, the ability to access, organise and disseminate relevant information is a prerequisite for any decision-making activity at all stages of the workflow process. The general objective is to develop new tools and services to enable Public Administrations (P.A.s) to design, organise, develop and disseminate Public Information Systems (PIS) in a pre-harmonised and standardised way. The objective concerning software development is the introduction of a new public information system which would allow for:
- Full exploitation of integrated information systems that combine in meaningful ways social and economic data of mutual relevance, for policy analysis and comparative research.
- Resolution of data and metadata storage problems.
- Adaptation to different types of data (labour-market indicators, industrial, trading and financial flows, balance sheets, etc.).
- Universal access to distributed public information systems for analysing a large variety of statistical data and their accompanying metadata.
The project will also develop a user-friendly, metadata-based software application tool that will assist data providers with on-time availability of a low-cost, high-performance and high-reliability integrated information infrastructure, thus increasing productivity and decreasing costs. Furthermore, the simultaneous provision of meta-information will increase the value of the statistical results for the end-users. The usefulness of such an information infrastructure will be demonstrated in two main specific applications, namely cross-border trading and vocational training.
Areas of application: Labour Market (pilot in Household Budget Surveys), Cross-Border Trading (pilot in External Trade), Vocational Education and Training.
1.3.2 Architecture
I. SYSTEM FUNCTIONAL REQUIREMENTS
i. The use of meta-information, in the form of metadata, for the proper processing and presentation of the unprocessed data is of major importance. The metadata will be used as a means of data identification, as a knowledge base and as a means for the automatic transformation of these data. A proper metadata model should be developed and used for the automation of the whole system and as a means to identify and exploit the full potential of all the resources of the system.
ii. The system shall be able to manage the effective storage and identification of statistical data and metadata and to provide a mechanism for their retrieval and processing. The data that will be stored on the system are expected to be in the form of surveys, tables referring to a level of aggregation, official data in the form of tables, data in registries and probably data from reports produced under special circumstances.
iii. A facility to search and retrieve external data described by some generally accepted format (like DDI or RDF) shall be taken into consideration.
iv. The most important requirement for the system is to process the statistical data and meta-information. The system needs the ability to select and present the required data at the level of analysis asked for, with all the required meta-information that must accompany the data in order for users to understand their meaning and how to handle them.
v. The system is required to provide some statistical processing procedures. These procedures may be implemented either as built-in functionality or as additional functionality offered to the user as a set of library functions, so that the final user will be able to use them for the processing required.
vi. The system shall have facilities to present data for spatial and temporal comparison, either within the same country or between different member countries of the EU. The system may also have the ability to present information at different levels of aggregation.
vii. A set of standard graphics facilities is required. The system must contain a business graphics facility able to present the typical set of graphics (xy-lines, bar graphs (simple or stacked), pies, scatter plots). These are required to be presented on the display, to be printed, or to be included in common packages.

II. DATA STORAGE
The system will have the ability to store and handle data sets from statistical surveys along with their specification schemes. Also the
system must have the ability to store data in the form of aggregates of these variables. The system will store, or refer to, a set of separate databases containing official data as they are collected by the appropriate Public Administrations. The design must be abstract enough to be able to accept data types not known at the time of its development. Data from registries will be stored in the IPIS repository, taking into consideration the principle of non-disclosure of personal information in the cases where that principle is applicable.
III. METADATA PRESENTATION AND STORAGE
The system is required to have facilities for the storage and presentation of the available metadata for the selected data sets. In addition, the presentation of the set of underlying classifications used for the data processing is required. Functionality for the automatic transformation of the presented data from one classification to another needs to be taken into consideration.

IV. DATA EXPORT
The requirements for export procedures include the following:
- A facility for exporting data and metadata in the most commonly used formats.
- A facility to export tables and graphs as embeddable objects to other software programs.
- A facility to export database data in the widely used formats.

V. DATA PROCESSING
Data Selection
The system must have the ability to select data across different data sets, either from the same country for different time periods or from different countries for the same time period. The selected data must contain the whole set of information required for their proper handling, as well as a full explanation of them. For that reason a kind of enhanced SQL query scheme must be supported; the enhancements needed concern the maintenance of the names of the variables and the explanatory texts associated with them.
Tabulations
The selected and manipulated data are required to be presented in tables containing the proper labels, as well as the proper names of variables along with the notes required for the appropriate comprehension of the tables. For presentation on the display, handlers are required in order to give additional information concerning explanatory issues or other statistical metadata issues. The proper format of the tables must be preserved. The system must also provide facilities for printing.
The system must have advanced facilities concerning tabulations, either for presentation purposes on the display or for printing. It must be possible to use the data contained in the tabulated tables for statistical processing and for graphics purposes. The statistical processing required involves facilities for standard statistical measures (descriptive statistics), statistical inference tools and regression models; some of these need to be offered as predefined selections and others as a set of library functions available to the user. The tables presented need to have the following capabilities:
- Presentation of the required data with the labels describing them and the meta-information required for their understanding. The meta-information will be presented as notes, as 'balloons', or with some similar facility indicating that meta-information exists for some element of the data.
- The variables presented need to be coded with a widely accepted coding scheme in order to prevent misuse of the data. A possibility is to hide the coding scheme from the final user, providing him with a kind of front-end GUI.
- The tables may provide either numeric values for the variables or percentages.
- Some kind of processing functionality is required for the elaboration of indicators defined by the end user.
- The tables defined may be single rows or columns, or cross-tabulated tables.
- The system needs to have the ability to present data at various levels of aggregation.
- The system needs to have the ability to present spatial and temporal data in tables.
- The presented data need to be exportable with their metadata in a set of formats for use with other software packages (statistical or general) for further processing.
- The tabulation facilities of the software need to be advanced enough to cover the needs of various categories of users.
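The "enhanced SQL" requirement described under Data Selection above can be illustrated with a small sketch: a query wrapper that returns variable labels and explanatory notes together with the values, so that self-describing tables can be built. The schema (tables named hbs and variable_metadata) and helper names are assumptions made for this illustration only.

```python
# Sketch of a metadata-preserving query: results carry variable labels
# and explanatory notes alongside the values. The schema is hypothetical.
import sqlite3

def labelled_query(conn, sql, params=()):
    """Run a query and attach label/note metadata to each result column."""
    cur = conn.execute(sql, params)
    columns = [d[0] for d in cur.description]
    meta = {}
    for name in columns:
        row = conn.execute(
            "SELECT label, note FROM variable_metadata WHERE name = ?",
            (name,)).fetchone()
        meta[name] = {"label": row[0], "note": row[1]} if row else {}
    return {"columns": columns, "metadata": meta, "rows": cur.fetchall()}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hbs (age INTEGER, income REAL)")
conn.execute("INSERT INTO hbs VALUES (34, 18000.0)")
conn.execute("CREATE TABLE variable_metadata (name TEXT, label TEXT, note TEXT)")
conn.execute("INSERT INTO variable_metadata VALUES "
             "('income', 'Net annual income', 'In euro, 1999 prices')")

result = labelled_query(conn, "SELECT age, income FROM hbs")
print(result["metadata"]["income"])   # label and note travel with the data
```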
Automatic Transformations of Data
The system must be able to automatically transform the underlying data between classifications or between monetary units, or to aggregate the data into higher classification categories. The user needs to have the ability to choose between the different transformation methods available at the stage of processing he is working on.
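As a concrete illustration of this transformation requirement, the sketch below re-aggregates data from one classification into another using a correspondence table. The codes and scheme are invented for the example.

```python
# Sketch: transform aggregate data between classifications using a
# correspondence table (hypothetical detailed codes -> higher-level
# categories). Purely illustrative.
from collections import defaultdict

correspondence = {          # source code -> target category
    "011": "A", "012": "A",
    "101": "C", "102": "C",
}

def transform(data: dict[str, float]) -> dict[str, float]:
    """Aggregate source-classified values into the target scheme."""
    out = defaultdict(float)
    for code, value in data.items():
        out[correspondence[code]] += value
    return dict(out)

print(transform({"011": 10.0, "012": 5.0, "101": 7.5}))
# {'A': 15.0, 'C': 7.5}
```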
Spatial Comparisons
The system is required to have the ability to select data and present spatial comparisons of statistical data from regions in a country or across countries, either in tabular form or graphically. These data are required to be usable in statistical manipulations.
Temporal Comparisons
The system is required to have the ability to select data from data sets of different years and to create time series which can be used for comparisons. These data have to be presented in tables or graphs, or to be manipulated statistically.
1.4 METAWARE

1.4.1 Main objectives
The objective of METAWARE is the "development of a standard metadata repository for data warehouses (DWH) and standard interfaces and functions to exchange metadata between the basic statistical production system and data warehouses. The aim is to make statistical data warehouse technologies more user-friendly for user access by the public sector. This will support the application of official statistics in society and broaden the scope of users. The system will operate both in the traditional client/server environment and in the Internet world. The aim is also to support and enhance standardisation both at the national and the international level." It should be considered a kind of continuation of the IMIM project. The main idea is the support of data warehouse applications with statistical metadata. The basic structure of the technical project solution, published in Annex 1, is illustrated in the figure below. (Sources: METAWARE Deliverables D1 and D2; the template completed for METAWARE by L. Planque; the IMIM project, http://imim.scb.se.)
[Figure: external MD targets connected through an MD exchange function and a metadata interface to the data.]
The main idea is to support the data warehouse tools used in the DW environment with metadata, and to support data access with relevant metadata. It has been decided to use an object-oriented approach to define and specify the metadata system necessary for a data warehouse approach. An object-oriented approach means primarily the definition and specification of a number of object types relevant to the DW application. It has also been decided to use the metadata specifications produced by the Neuchâtel Group (statistical classifications) and those that are available and useful from other sources. A close co-operation with the work done in the Metanet project is envisaged; the Metaware project expects both feedback and input from that project. The metadata specification and the logical system design will be done independently of any particular software approach. This means that the system approach will be adaptable to different technical software solutions. Within the Metaware project it is planned to develop a prototype based on the Bridge software, developed during and after the lifetime of the IMIM project.
1.4.2 Architecture
For the prototype development it is planned to use MS Analysis Services and Oracle Express as two existing data warehouse engines. Both systems have OLAP functions that do not sufficiently support statistical metadata. The project has to explore how the engines can be extended in their metadata functions via APIs. A special application layer has to be designed and implemented as a prototype development. The object types specified by the project have to be implemented in a common metadata interface. The project will use the ComeIn interface tool; the functionality of ComeIn has to be extended by implementing new object types, and the project will also use object types already defined in ComeIn.
[Figure: the metadata repository linked, via metadata and data import/export, to the DW application and to the data warehouse engines (MS Analysis Services, Oracle Express).]
1.4.3 Metadata support
The project will define object types relevant to data warehouse problems and introduce them also to the Metanet project for adoption. It could be the task of the Metanet project to define the reference object type, while the Metaware project will use the same object type but only with the attributes useful for its task. On the other hand, the Metaware project will use object types developed and defined in other projects. The reference model of the project is the general repository for the description of metadata object types; the Metaware project will use a number of these object types with a suitable subset of attributes. A general agreement about such a reference model for metadata object types will permit the development of common interfaces for the exchange of metadata. XML could, for instance, be one solution, but the ComeIn interface is also based on the same philosophy and will support the exchange of metadata between different software packages or different components of a system. The metadata model does not reflect versioning features and multilingual support, since those are not part of the conceptual model. Multilingual and version support is provided by the appropriate implementation of a ComeIn interface. All textual metadata objects should support any number of languages. Moreover, all metadata objects are assumed to support versions to reflect minor changes in a metadata object. This is not reflected in the data model because it is not of conceptual interest and can be implemented in many different ways. The technical level just defines the way in which input relations (record types) are processed by operation implementations in order to create one or more output relations (record types). The conceptual part is
divided into the variable definition part and the process definition part. In order to support retrieval functions, keywords (thesaurus) and statistical activities (surveys and products) have been introduced. The conceptual part is more complex, since the concept of statistical variables is itself rather complex. On the other hand, variable and process definitions provide the meta-information which is required for retrieval processes and for providing conceptual information about the data. Moreover, conceptual information can be used for generating 50% or more of the technical metadata.
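The last point can be made concrete with a small sketch: given a conceptual variable definition (variable, value set, datatype), a technical record-type description follows almost mechanically. All names below are invented for illustration; this is not the Metaware reference model.

```python
# Sketch: deriving a technical record-type description from conceptual
# variable definitions, as suggested above. Names are illustrative.

conceptual = [
    # (variable, value set / classification, datatype)
    ("region", "NUTS level 2 codes", "char(4)"),
    ("turnover", "euro amounts", "decimal(12,2)"),
]

def record_type(name: str, variables) -> dict:
    """Generate a simple technical record layout from variable definitions."""
    return {
        "record": name,
        "fields": [{"name": v, "type": t, "values": c}
                   for v, c, t in variables],
    }

print(record_type("enterprise_stats", conceptual))
```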
1.5 MISSION

1.5.1 Main objective
The main goal of this project is to utilise the World Wide Web and emerging agent-based technologies to provide a modular system of software which will enable providers of official statistics to publish their data in a unified framework, and to allow consumers of statistics to access these data in an informed manner with minimum effort, as indicated in the completed template for MISSION sent by Yaxin Bi.
In all, it is evident that MISSION is set up to be able to incorporate any set of statistical tables, and thus includes among its examples of use output data from Eurostat, Statistics Finland and the ONS. It is paying considerable attention to data (including large amounts of microdata and administrative data) from Education and Health. It is not paying special attention to extracting indicators, but considers the production and maintenance of relevant ones a requirement.
1.5.2 Architecture
The MISSION system relies on agents (specific software modules) for both the communication between various modules and the coordination of their actions.
Components of the Architecture
The architecture comprises five basic logical, or conceptual, units:
- The Client component is a web-based user interface which connects a user to all sites participating in the architecture. The Client obtains a request from the user and sends an agent to search for a Library that can satisfy the request. It is expected that the Client will take the form of a Java applet or a true web interface using HTML and JavaScript; of course, other approaches may prove suitable, e.g. plug-ins.
- The Compute server is a statistical processing engine which stores no information of its own. Based on the query it receives, it obtains the necessary data from various data servers, performs the request, and returns the result to the Library which made the request. It may also make requests to third-party statistical packages. A primary objective of the compute server unit is to integrate a distributed declarative querying facility and a distributed statistical aggregation system, using distributed database and web technology. The compute server architecture will be designed to incorporate intelligent agent techniques: query agents will facilitate interaction with library server units concerning locational and other operational metadata; query agents will also facilitate interaction with data server units concerning macrodata; mediation agents will enable the user-specified merger of heterogeneous macrodata and accompanying metadata. The compute server will be specified and designed to efficiently implement the statistical macrodata operators and associated metadata operators necessary to accomplish the required data merger through a Java-embedded query language. The principal task of the compute server unit is therefore to receive
and interpret queries from library units and to return macrodata and metadata results, along with action-plan metadata useful for future query optimisation.
- The Library is a repository for statistical metadata. It holds three different kinds of metadata. When a Library receives a request, it decomposes it and, if necessary, it can send requests to other Libraries in the system for any metadata it requires. Once it has built up an operation, it submits it to a compute server. On receiving the reply to the request, it returns the answer to the Client.
- The Data server is the unit which gives access to the data. The data server holds the data itself, management tools for registering and maintaining the system, and a gateway module. The gateways hold the minimum amount of metadata necessary for the safe use of the data. This includes registration details to allow the Provider to control access to the data, and information about the physical structure of the datastore. Other metadata is made available to be uploaded to Libraries that request it.
In brief, the Library is a server in the application layer that:
- serves as a statistical metadata repository. Metadata in the context of the MISSION project are: access metadata (machine-readable, containing the physical and logical information needed to access the actual data), methodological metadata (machine-readable, required to process data for statistical analysis) and, finally, contextual metadata (human-readable, providing extra information for the user in the form of notes, footers, survey details, etc.);
- performs front-end pre-processing of user requests (perhaps performing such tasks as syntax checking, validation, conformance to metadata requirements, etc.) before dispatching them to other modules of the system;
- holds no actual data, only the relevant metadata.
Agents form the dynamic part of the system. Agents perform intermediate processing and navigate the Internet to access the appropriate building blocks of the system. Once these are located and accessed, agents are responsible for invoking the appropriate computations on the engines or for retrieving the appropriate data and metadata according to the user request/goal. Specifically, it is specified that suitable agents, to be developed in the course of the project, will:
- perform intermediate processing and navigate to the appropriate part of the system to obtain resources (data, computations, etc.);
- invoke computational routines (as needed) and provide data to other agents when requested.
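To fix ideas, the following toy sketch mirrors the request flow just described (Client to Library to compute server to data server). All classes and data are hypothetical stand-ins: in MISSION itself these are distributed, agent-mediated components rather than local objects.

```python
# Toy sketch of the MISSION request flow (Client -> Library ->
# Compute server -> Data server). All classes are hypothetical
# stand-ins for distributed, agent-mediated components.

class DataServer:
    def fetch(self, table):            # holds the actual data
        return {("GR", 1999): 42.0} if table == "unemployment" else {}

class ComputeServer:
    def run(self, operation, data):    # stateless statistical engine
        if operation == "total":
            return sum(data.values())

class Library:
    def __init__(self, compute, data_server):
        self.metadata = {"unemployment": "locational metadata"}  # no data
        self.compute, self.data_server = compute, data_server

    def handle(self, request):
        """Decompose the request, gather data, delegate computation."""
        table = request["dataset"]
        assert table in self.metadata      # the Library holds only metadata
        data = self.data_server.fetch(table)
        return self.compute.run(request["operation"], data)

library = Library(ComputeServer(), DataServer())
print(library.handle({"dataset": "unemployment", "operation": "total"}))
```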
Chapter 2 - Projects comparison

2.1 Objectives

It seems that all five projects aim, directly or indirectly, at the following:
- Development of a standard metadata repository
- Use of the web for data dissemination
- Metadata collection and manipulation
- Use of a metadata model (with different classes, attributes and structure, although overlapping with some of the COSMOS projects, particularly for the description of data storage and processing)
- Support and manipulation of micro- and macrodata (current or planned)
- Use of current state-of-the-art technologies
2.2 Comparative analysis of COSMOS cluster projects

Remark: In the following tables, some information has not been specified, because we received only general answers from the corresponding partners; these answers are added as footnotes. In cases where N.A. (Not Available) is indicated, the responsible partners did not complete this question.
COMPARATIVE ANALYSIS OF THE COSMOS PROJECTS: USER TYPES AND SERVICES

Areas of application
- FASTER: General areas of social science - health, income, voting patterns, labour market, housing tenure, employment, etc.
- IQML: Not specified (1)
- IPIS: Labour Market (mainly the area of the Household Budget Survey), Vocational Education and Training, and Cross-Border Trading (External Trade).
- METAWARE: Not specified (2)
- MISSION: Not specified (3) (We can assume from the response to the question about classifications that some of the areas are: Labour Market, Health and Environmental statistics.)

User types
- FASTER: End users (academic community, policy makers, mass media community, commercial community, management consultants); data disseminators (data archives, local authorities, central government departments); data producers (government departments, survey organisations, research institutes).
- IQML: Respondents (either as individuals or companies); questionnaire designer (i.e. statistician or data collector that designs the questionnaire); survey administrator (person responsible for administering the survey in terms of population, survey sample, follow-up of responses/non-responses, etc.); collecting agency (i.e. organization responsible for collecting the completed questionnaires).
- IPIS: Policy makers (international organisations, international and national statistical offices, public administrations and institutions (ministries, etc.), European Commission, etc.); end-users (universities, research institutes, service providers, enterprises, general public (citizens, consumers, workers, etc.)).
- METAWARE: Data producers and publishers (e.g. users of commercial data warehousing packages) for the main objective; other statistics users for sub-objectives.

Data collection (does the project's software facilitate data collection?)
- FASTER: NO; IQML: YES; IPIS: NO; METAWARE: NO.

Data collections offered, data types supported and manipulated, metadata manipulation, data dictionaries
- IPIS: Data and metadata from all kinds of data providers; micro and macro data supported and manipulated; metadata manipulation: YES; data dictionaries: YES (by statistical domain, statistical product, data producer, year), time-series dictionaries, specialized collections of indicators, user-defined collections of indicators, etc.; data import: YES.
- METAWARE: NO; the Metaware model does not support data in itself but only metadata (including links to datasets, records and physical data elements).

Search facilities
- IQML: Searches on the repository to assist the designer or survey administrator to find metadata relating to the questionnaire or survey.
- IPIS: Wizard-like facility; search by variables, variable id, metadata, keywords, codes, etc.

Languages supported
- FASTER: Native language of the survey for the majority of DDI fields; agreement amongst archives that certain fields of the DDI will be translated into English allows searching across archives in English for these fields. Collaboration with the LIMBER project (which is developing a multilingual thesaurus) means that a test version demonstrating natural language searches across catalogues in English, German, French and Spanish is expected by the end of the project.
- IQML: All European languages, perhaps other languages.
- IPIS: All.
- METAWARE: English only.

Collection of indicators
- FASTER: Not in the standard interface, but this is provided in additional interfaces.

Collection of classifications
- FASTER: NO. These could be included as part of the standard coding information held in the DDI, but they could not be mapped to other classifications.
- IQML: YES; all repository objects can be classified by many different categories (a category = one classification node from one classification scheme). Allowable responses to questions may be taken from a classification. No mapping between classifications.
- MISSION: May store international classifications in the system and allow mappings to be set up automatically or manually.

Data presentation
- IQML: Questionnaires can be represented in many media; current external formats are HTML and XML. HTML for the respondents; Java-based GUI for the survey administrator or questionnaire designer.

Transformations supported
- FASTER: NO, but the software includes facilities to tabulate, produce descriptive statistics, scatterplots and regression.
- IQML: YES; responses to questions can be generated from calculations.
- MISSION: YES (transformation is used for metadata harmonisation and to support heterogeneous queries).

Harmonisation of results
- MISSION: YES (a heterogeneous query facility is required) in Labour Market, Health and Environmental statistics.

GUI specification
- FASTER: Java or HTML.
- IQML: The Questionnaire Designer Tool (QDT), the Survey Administration Tool (SAT) and the repository each have a GUI; the Questionnaire Presentation Tool (QPT) GUI is HTML.
- IPIS: HTML.
- METAWARE: User-friendly Visual Basic interface to enter and process meta-information on data warehouses and cubes, and to pilot commercial data warehousing packages.

User action histories
- IQML: YES; limited audit trail on repository items. No history is kept of respondent navigation of the questionnaire.

Footnotes:
(1) The answer to the relevant question of the template was "raw data collection, questionnaire design, survey administration, registry and repository".
(2) The answer to the relevant question of the template was "the whole metadata domain, but particularly metadata on data warehouses and cubes".
(3) The answer to the relevant question of the template was "allowing statistical data providers to share expertise and publish data on the Web".
COMPARATIVE ANALYSIS OF THE COSMOS PROJECTS: ARCHITECTURE

Type of architecture / data repository
- FASTER: 3-tier; distributed.
- IQML: 3-tier; centralized (the repository can be located on any HTTP server, but is at present a single, shared resource).
- IPIS: 3-tier; centralized.
- METAWARE: 3-tier; distributed.
- MISSION: 3-tier; distributed.

Metadata model
- FASTER: DDI.
- IQML: IQML model (question bank objects, survey administration objects); the repository model is based on the ebXML registry/repository model.
- IPIS: Uniquely developed for the IPIS project, following the OECD MEI. The classes hold information on: statistical populations, survey variables, indicators, classifications and other standards, data quality issues, source agencies and collection information, logistic metadata, process metadata.
- METAWARE: Process definitions (cubes, registers, statistical processes, etc.); variable definitions (variables, classifications, measure units, etc.); technical level (record types, process implementations, etc.); thesaurus (keywords, terms, synonyms, etc.).
- MISSION: MIMAMED.

Interchange protocol
- FASTER: XML.
- IQML: HTTP, SOAP.
- IPIS: XML, DDI.
- METAWARE: XML.
- MISSION: XML.

Use of agents
- FASTER: NO.
- IQML: NO.
- IPIS: NO.
- METAWARE: NO.
- MISSION: YES.
The interaction of all five projects is presented in a figure in COSMOS Annex 1.
However, the intensive analysis performed showed that some more concrete converging points can be found, which have to be discussed. Therefore, following that diagram, we would like to mention the following:

2.2.1 Data capture

IQML is a data capture system that aims at improving the accuracy and timeliness of statistical data. No specific data providers are considered, meaning that no specific areas of application are assumed; we can deduce that the questionnaires designed will fully support various areas of application. IPIS is a dissemination system. However, it incorporates not only the output from statistical agencies but also that from administrative sources such as Customs and Vocational Education and Training (VET) institutions, as well as organisations related to Customs operations. To that extent it can be regarded as an input system as well.
2.2.2 Data dissemination

Apart from IQML, which is clearly an input system, and METAWARE, which is concerned mainly with metadata and standards, the other three projects (MISSION, FASTER and IPIS) are clearly data dissemination projects. All of them provide a web-based mechanism that allows users to access the existing data and perform certain manipulations from anywhere in the world.
2.2.3 Metadata repository

The key link between all COSMOS projects is the use of repositories. All projects have followed the standards in this area closely. Linking metadata repositories, or accessing multiple metadata repositories from one application, can be achieved by having a common understanding of the domain model for the chosen business area and a common way of accessing registries and repositories.
2.2.4 Metadata categories and modelling

FASTER
Three models seem to have links to FASTER's metadata needs: GESMES (GEneric Statistical MESsage), CWM (Common Warehouse Metamodel, http://www.cwmforum.org/) and Registries and Repositories. The GESMES model is suitable for cubes and time series; in that context, it could play a role in the FASTER environment. The CWM is a system of combining models on several levels of abstraction. Some parts of the model seem to have relations with FASTER: OLAP is a system for describing cubes, very much like what is needed in FASTER; the description of the records could also be performed in the CWM; classifications and taxonomies might be described using the Business Nomenclature. Registries and Repositories could form the interface between web applications and the data archives; OASIS and ebXML provide tools for such registries. If an extension to questionnaires is needed, IQML could be a good candidate. The FASTER model is composed of a set of sub-models:
- Dataset Model
- Datatypes Model
- Cube Model
- Classification Model
- Transformation Model
- Documentation Model
The following also have to be mentioned:
- the Object Model is an extension of the OO model defined by the W3C RDF and RDF Schema standards;
- RPCs (remote procedure calls) are performed as normal HTML FORM calls;
- the properties are the attributes of an object; they can be either literals (such as a String, an Integer, or any other basic type such as the ones defined by XML Schema) or a reference to another object.
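A brief sketch may clarify the distinction between literal and reference properties in such an RDF-style object model. The class and property names below are illustrative and are not the FASTER API.

```python
# Sketch of RDF-style objects whose properties are either literals or
# references to other objects. Illustrative only, not the FASTER API.

class MetaObject:
    def __init__(self, uri):
        self.uri = uri
        self.properties = {}

    def set(self, name, value):
        """Value may be a literal (str, int, ...) or another MetaObject."""
        self.properties[name] = value

dataset = MetaObject("urn:example:dataset/hbs1999")
classification = MetaObject("urn:example:classification/nace")
dataset.set("title", "Household Budget Survey 1999")   # literal property
dataset.set("usesClassification", classification)      # reference property

ref = dataset.properties["usesClassification"]
print(ref.uri)   # a reference can be followed, like a link in the Data Web
```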
Especially for Statistical Disclosure Control, two steps can be identified. The first step is that the data are checked for statistical confidentiality. After that, data found to be unsafe are made safe in a second step. In the case of multidimensional tables, a third step can follow, to secure data that, in combination with the secured data, can still reveal individuals. This implies that there is a difference between Statistical Disclosure Control for rectangular record data and for multidimensional table data. In analogy to the Argus program, record data will be referred to as microdata, while multidimensional data will be referred to as tabular data. The two different kinds of data also require different approaches to disclosure control; therefore the two streams will be described separately.
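The first step, checking tabular data for confidentiality, is in practice often a simple frequency rule. The sketch below marks cells as unsafe when they have fewer than a threshold number of contributors; it is a generic illustration, not the Argus algorithm.

```python
# Generic illustration of a primary confidentiality check on tabular
# data: cells with fewer than `threshold` contributors are unsafe and
# would be suppressed in the second step. Not the Argus algorithm.

def unsafe_cells(table: dict, threshold: int = 3):
    """Return the cells whose contributor count falls below the rule."""
    return [cell for cell, count in table.items() if count < threshold]

frequency_table = {("region A", "sector 1"): 12,
                   ("region A", "sector 2"): 2,   # unsafe: 2 < 3
                   ("region B", "sector 1"): 7}
print(unsafe_cells(frequency_table))
```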
IQML:
This project has developed its own metadata model, including information about questions, which represent variables, and about survey administration (e.g. sampling and collection method). Additionally, its repository model is based on the ebXML registry/repository model. According to this, a registry is assigned to each repository and contains metadata about the objects in the repository, so that the system can find out whether it contains data relevant to the user's question. The metadata model of IQML can be divided into the following parts:
- Question Bank, holding information on questionnaire design and structure;
- Navigation, Calculation and Validation, describing processes and the use of the questionnaire;
- Survey Administration, containing information about the survey under consideration, the sample, etc.
Of course, there are interrelations between these parts, and we should also stress the concept of the Node class, as any content or expression can be linked to any Node. The major classes of the Question Bank are subclasses of Node. The Content is dependent upon the Context: different Content can be present for the same Node depending on the Context. Examples of Context are age range, language, etc.
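The Node/Content/Context idea can be sketched briefly: the same Node yields different Content depending on the Context in which it is requested. The class design below is invented for illustration and is not the actual IQML model.

```python
# Sketch of context-dependent content: one Node, several Contents,
# selected by Context (e.g. language). Invented names, not the actual
# IQML classes.

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self._content = {}              # context -> content

    def add_content(self, context, content):
        self._content[context] = content

    def content(self, context):
        return self._content[context]

question = Node("q_age")
question.add_content(("lang", "en"), "What is your age?")
question.add_content(("lang", "fr"), "Quel est votre âge ?")
print(question.content(("lang", "fr")))
```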
IPIS
The project has started from scratch, with no predecessor. The metadata types supported fall into the following four categories:
- Semantic metadata: the metadata that give the meaning of the data. Examples of semantic metadata are the sampling population used, the variables measured, the nomenclatures used, etc.
- Documentation metadata: mainly text-based meta-information, for example labels, used in the presentation of the data. Documentation metadata are useful for creating user-friendly interfaces, since semantic metadata are usually too complex to be presented to the user. Usually, an overlap between the semantic and documentation metadata occurs since, many times, we have to store metadata in both structured (i.e. semantic metadata, used mainly by machines) and verbal-text (i.e. documentation metadata, used by humans) form.
- Logistic metadata: miscellaneous metadata used for manipulating the data sets. Examples of logistic metadata are the data's URL, the type of RDBMS used, the format and version of the files used, etc.
- Process metadata: the metadata used by information systems to support metadata-guided statistical processing. These metadata are transparent to the data consumer and are used in data and metadata transformations. In the rest of the document we will focus our attention on this type of metadata.
A figure in the project documentation describes the overlap of the above-mentioned metadata categories.
In general, the classes of the IPIS statistical metadata model hold information on the following:
- Statistical populations and survey variables
- Indicators, classifications and other standards
- Data quality issues
- Source agencies and collection information
- Logistic metadata
- Process metadata
METAWARE:
The project follows the recommendations of the IMIM project, and it is planned to develop a prototype based on the Bridge software. The technical basic structure of the metadata support for a data warehouse approach has been demonstrated in the initial description of the METAWARE project (Chapter 1, relevant part). The project holds three types of metadata, classified according to purpose: physical metadata, operational metadata and conceptual metadata. Conceptual metadata address external users and statisticians, while operational and physical metadata address systems and applications.
Communication between users, or between systems and users, is frequently based on conceptual metadata, while systems refer to operational and physical metadata to provide the required information to the user. These three different aspects of metadata are not clearly distinguished from each other; operational and physical metadata can be derived from conceptual metadata in many cases. The ComeIn metadata model will be used. ComeIn 3.0, to be released in January 2002, will fully support the DDI Codebook, the Dublin Core and the ISO 11179 standard. Thus, it will be possible to generate from a ComeIn-compliant metadata system (e.g. Bridge NA) DDI Codebooks as well as ISO 11179-compliant registry entries. Moreover, a ComeIn SOAP server (Simple Object Access Protocol, a W3C standard) will be provided with
ComeIn 3.0, allowing direct exchange of metadata over the Internet via SOAP, an XML-based object access protocol.
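For orientation, the following shows roughly the shape of a SOAP message that could carry a metadata object request. The body elements are invented for illustration and do not reflect the actual ComeIn interface specification.

```python
# Rough shape of a SOAP 1.1 envelope carrying a metadata object request.
# The body elements are invented for illustration; they do not reflect
# the ComeIn interface specification.
soap_message = """<?xml version="1.0"?>
<soap:Envelope
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <getMetadataObject xmlns="urn:example:metadata">
      <objectType>StatisticalClassification</objectType>
      <objectId>nace-rev1</objectId>
      <language>en</language>
      <version>latest</version>
    </getMetadataObject>
  </soap:Body>
</soap:Envelope>"""
print(soap_message)
```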
MISSION
This project is the successor of the ADDSIA project, which used the MAMED model supporting macrodata and metadata. The MISSION project has now extended the existing metadata model into the MIMAMED model, which supports microdata in addition to macrodata and metadata. Three kinds of metadata are distinguished:
a. technical metadata, which refer to the physical storage means and location of data;
b. active metadata, which actually define the manipulations that a user can perform;
c. passive metadata, which describe certain features of the data in free text, e.g. quality.
Furthermore, the MISSION project supports metadata capture using additional methods and standards, such as the PC-Axis standard. Finally, it must be noted that there is a special package, called the Standard Metadata Package, which contains the items of MISSION that can be mapped onto the standards being developed for metadata, for example the CWM model by the OMG, the Guidelines for Statistical Metadata by the United Nations, the Corporate Metadata Repository (CMR) model, etc.
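A compact sketch shows how the three kinds of MIMAMED metadata might sit side by side in a dataset description; the record layout is invented for illustration.

```python
# Sketch of a MISSION-style metadata record distinguishing technical,
# active and passive metadata. The layout is invented for illustration.
dataset_metadata = {
    "technical": {                 # physical storage means and location
        "url": "http://example.org/data/lfs1999",
        "format": "relational table",
    },
    "active": {                    # manipulations a user may perform
        "operations": ["aggregate", "tabulate", "merge"],
        "aggregation_levels": ["NUTS1", "NUTS2"],
    },
    "passive": {                   # free-text descriptions, e.g. quality
        "quality_note": "Response rate 78%; regional cells may be sparse.",
    },
}
print(dataset_metadata["active"]["operations"])
```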
2.3 Metadata model comparisons
There are clear correspondences between, for example, the Survey Administration part of the IQML model and the part of the IPIS model holding information on statistical populations. In addition, in METAWARE, there are parts of the model on process implementation, process definition (for table manipulation) and variable definition (treating measure units, statistical objects, value sets, classification items, etc.) that can be linked with the metadata models of almost all COSMOS projects; for IQML and IPIS in particular, they are essential. Besides, the Thesaurus part of the METAWARE model can be of value for the MISSION project, and the Library part of MISSION, whose main function is to hold the metadata that allows the user to search for and query data, can serve the purposes of METAWARE. Furthermore, MISSION's Standard Metadata Package, which holds the items of MISSION that can be mapped onto the standards being developed for metadata, can be connected well with the other projects' parts on standards, classifications and indicators (with a reservation as to whether it can serve the purpose of the FASTER project, where some of this information is provided by additional interfaces, not the standard one). In addition, since all projects will provide data dictionaries, the part of the MISSION project serving this purpose is deemed necessary for inclusion in the COSMOS metadata model. Last but not least, the FASTER part of the model on space, variables and populations can be considered one of the most analytical ones compared with the other projects of the COSMOS cluster in this specific area, and can serve as a guideline for the related procedures.
2.4 Architecture comparisons
Basically, all projects employ an n-tier architecture (3-tier, as illustrated in the completed templates), used either in a distributed system (MISSION, METAWARE, FASTER) or in a centralized one (IPIS, IQML), together with some kind of communication glue to tie the various modules together (XML, Z39.50, CORBA), and they mostly use standards-based technologies. XML is used as the interchange protocol by four COSMOS projects (the exception is IQML, which uses HTTP and SOAP), and the DDI is used in parallel by IPIS and FASTER. Only MISSION uses agents at the moment; FASTER also plans to develop them. Main technologies:
- XML, HTML (DDI: FASTER and IPIS; SOAP: IQML)
- UML (most of the projects use Rational Rose)
- Java, C++
- Generic registry and repository
- MS Analysis Services and Oracle Express
- Oracle 8i
References
Papers and presentations
De Vaney, Chris (1997), Common Application Platform Architecture of the Distributed Application Server, Working Paper, WSEL/WP003/Rev.001, 27/12/1996.
Karge, Reinhard (1997), BRIDGE, IMIM Workshop on Output Databases, Stockholm, WSIG/WP/017/Rev.000.
Karge, Reinhard, Bridge Functionality, http://imim.scb.se.
Musgrave, S. and Ryssevik, J. (2000), Beyond NESSTAR: FASTER access to data, IASSIST.

Project deliverables and documents
IPIS: Deliverable D5 (by the UoA/Dept of Mathematics team), Deliverables D6 and D7.1.
IQML: Registry and Repository Interface Specification, by Chris Nelson and Andy Jenkins, Dimension EDI.
MISSION: Deliverables D4, D6.
METAWARE: Deliverables D1, D2.
COSMOS: Annex 1, projects' interrelations.
EPROS publication documents for all five projects.
Web pages
The NESSTAR project: http://www.nesstar.org/
The FASTER project: http://www.faster-data.org/
The IQML project: http://www.epros.ed.ac.uk/iqml/
The MISSION project: http://www.epros.ed.ac.uk/mission/
The IPIS project: http://www.instore.gr/ipis/
The METANET project: http://www.epros.ed.ac.uk/metanet/
The CWM metamodel: http://www.cwmforum.org/
The IMIM project: http://imim.scb.se
The LIMBER project: http://www.venus.cis.rl.ac.uk/limber/
The Cheshire project: http://cheshire.lib.berkeley.edu/
Bridge software: http://imim.scb.se/software/bridge.htm
The CBS Cristal model: http://neon.vb.cbs.nl/sos_cubes
DDI DTD beta testers results: http://www.icpsr.umich.edu/DDI/codebook/testers.html
CORBA: http://www.corba.org/
The OMG: http://www.omg.org/
Annex 1

Template (TO BE COMPLETED BY EACH COSMOS PROJECT)

General Information:

Project objectives
Main objective:
Sub-objectives:
Areas of Application:
User types: (please provide an example for each user category in order to avoid errors due to differences in terminology)

User Services provided
Data Collection (does the project's software facilitate data collection?): Yes / No (Macro)
Data Collections (does the project offer collections of data, i.e. collections of survey data, indicators, etc.?): Yes / No (Macro)
Metadata Manipulation (i.e. manipulation in the metadata model, harmonisation of metadata, transformations of metadata with pre- and post-conditions, etc.): Yes / No (if yes, what?)
Data Dictionaries: Yes / No
Data Import: Yes / No (if yes, in what format?)
Data Export: Yes / No (if yes, in what format?)
Pre-selected indicator groups (i.e. does the system store and use groups of pre-selected indicators, e.g. in the area of Labour Market, Household Budget Survey, etc.?): Yes / No
Pivot classifications (i.e. do you store some pivot classifications (e.g. international ones) to allow for mapping of other classifications onto them?): Yes / No
E-mail facilities: Yes / No
Data publishing: Yes / No
Data presentation: Yes / No (if yes, in what format? e.g. HTML tables)
Data visualization: Yes / No (if yes, in what format?)
Data Browsing: Yes / No
Harmonisation of results: Yes / No
Transformations supported: Yes / No (if yes, which ones are supported?)
GUI specification: Yes / No
Access control/security functions: Yes / No
Statistical disclosure control: Yes / No
User action histories: Yes / No

Metadata model used: (please provide the main metadata categories supported)
Interchange protocol:
Use of Agents? Yes / No

Other
Other characteristics that may have been omitted and are essential for a better understanding of the project's framework.

Please provide any common features between the project you are involved in and any other COSMOS project that cannot be derived from the previous questions.
The FASTER project employs a three-tier distributed architecture that builds on the developments of the NESSTAR and LIMBER projects. It combines a distributed search facility, using XML syntax for seamless remote database searching, with a consistent XML-based interface for accessing multiple data repositories. This architecture makes statistical data more accessible through a Data Web technology that offers universal information access similar to the WWW, integrating statistical data with text and live data.
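As a concrete illustration of the kind of XML message such a distributed search facility might exchange, the following minimal Python sketch builds a hypothetical search request. The element names (searchRequest, keyword, scope, repository) are illustrative assumptions and are not taken from the FASTER specification:

    import xml.etree.ElementTree as ET

    # Hypothetical XML search request of the kind a Data Web client could
    # send to remote statistical repositories. All element names here are
    # invented for illustration, not drawn from the FASTER documents.
    request = ET.Element("searchRequest")
    ET.SubElement(request, "keyword").text = "unemployment"
    scope = ET.SubElement(request, "scope")
    ET.SubElement(scope, "repository").text = "http://www.nesstar.org/"

    print(ET.tostring(request, encoding="unicode"))

Each participating server would parse such a request and answer with an XML result set, which is what allows a single search to span heterogeneous databases behind a consistent interface.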
FASTER addresses metadata management by developing an XML-based, standards-compliant metadata repository that facilitates data interchange and usability. It enhances metadata's role by making it responsible not only for data conformance but also for personalization and access control. The project leverages the Data Documentation Initiative (DDI) standard and other XML approaches for metadata specification, ensuring a flexible and extensible structure for metadata and interfaces.
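To make the DDI connection concrete, the sketch below parses a heavily abridged DDI-Codebook-style fragment with Python's standard library. The element names shown (codeBook, stdyDscr, dataDscr, var, labl) follow the DDI Codebook structure, but a real DDI instance carries many more elements and attributes:

    import xml.etree.ElementTree as ET

    # Abridged, illustrative DDI-Codebook-style fragment: a study title
    # plus one variable description. Real DDI instances are far richer.
    ddi = """
    <codeBook>
      <stdyDscr>
        <citation><titlStmt><titl>Household Budget Survey</titl></titlStmt></citation>
      </stdyDscr>
      <dataDscr>
        <var name="income"><labl>Net household income</labl></var>
      </dataDscr>
    </codeBook>
    """

    root = ET.fromstring(ddi)
    for var in root.iter("var"):
        print(var.get("name"), "-", var.findtext("labl"))

Because the same DTD-governed structure is used by every data provider, a client can extract study and variable descriptions this way regardless of which repository supplied the codebook.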
FASTER addresses privacy and security concerns by implementing access control mechanisms that are supported by metadata at all levels of the system. This ensures that sensitive statistical data is accessed only by authorized users, maintaining confidentiality and compliance with data protection standards. The emphasis on XML-based interfaces also contributes to secure data handling across different server environments.
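The following minimal Python sketch shows, in principle, how an access rule carried in metadata could be enforced; the field names and roles are invented for illustration and do not come from the FASTER design documents:

    # Hypothetical access rule travelling alongside a dataset's metadata
    # and checked before any data is served. All names are illustrative.
    metadata = {
        "dataset": "labour_force_microdata",
        "access": {"allowed_roles": ["researcher", "nsi_staff"]},
    }

    def may_access(user_roles, meta):
        """Return True if any of the user's roles is permitted by the metadata."""
        allowed = set(meta["access"]["allowed_roles"])
        return bool(allowed.intersection(user_roles))

    print(may_access({"researcher"}, metadata))  # True
    print(may_access({"anonymous"}, metadata))   # False

Because the rule travels with the metadata rather than living in application code, every tier of the system can enforce the same policy.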
The interrelation between FASTER and the other COSMOS cluster projects is significant, as it facilitates collaborative approaches to shared challenges in data dissemination and usage. These projects interact by sharing metadata repositories and processes, revealing converging points in their architectures and goals, such as data capture and dissemination. Such interactions enhance resource sharing and technology transfer, and provide a united framework for advancing statistical data management.
FASTER introduces several methodological advancements over NESSTAR, including improved functionality with standards compliance, such as XML and RDF, for broader metadata applicability. It abandons some NESSTAR design choices, such as CORBA messaging, in favor of more robust XML approaches, enhancing metadata's role in usability and user personalization. By incorporating access control within metadata and expanding to multi-dimensional data sources, FASTER builds on NESSTAR's groundwork to offer a more flexible and comprehensive data management solution.
FASTER's reliance on XML and RDF standards implies a commitment to interoperability, flexibility, and semantic precision in metadata management. Using these standards facilitates information interchange with other systems and extends the applicability of metadata to diverse data sources, including time-variant and multi-dimensional ones. This reliance keeps the metadata repository standards-compliant, supporting seamless integration and future scalability.
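As an illustration of what RDF adds, the sketch below describes a dataset with Dublin Core properties in RDF/XML and reads the title back using Python's standard library. The dataset URI is a made-up example, while the rdf: and dc: namespaces are the standard ones:

    import xml.etree.ElementTree as ET

    # Illustrative RDF/XML description of a statistical dataset using
    # Dublin Core properties. The dataset URI is invented for the example.
    rdf = """
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="http://example.org/dataset/lfs-2000">
        <dc:title>Labour Force Survey 2000</dc:title>
        <dc:date>2000</dc:date>
      </rdf:Description>
    </rdf:RDF>
    """

    root = ET.fromstring(rdf)
    ns = {"dc": "http://purl.org/dc/elements/1.1/"}
    print(root[0].findtext("dc:title", namespaces=ns))

The value of the RDF layer is that the properties are globally identified by namespace URIs, so any standards-aware system can interpret the description without prior coordination.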
The Data Web technology in FASTER aims to transform statistical data handling by providing universal information access akin to the WWW's impact on text publishing. It involves creating a Data Browser (client) that offers a user-friendly interface to standardized services for data access and processing. This allows seamless integration with the WWW, enabling the creation of data-rich documents that blend text, images, and live data, thereby simplifying data identification and dissemination.
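A tiny Python sketch of the data-rich document idea: a piece of text with a placeholder that is filled from a live statistical source when the document is viewed. The field name and value below are invented; in a real Data Web client they would come from a remote server:

    # Minimal sketch of a data-rich document: static text plus a live value.
    template = "<p>Unemployment rate this quarter: {rate}%</p>"

    def render(live_data):
        """Merge live statistical values into the document's text."""
        return template.format(rate=live_data["unemployment_rate"])

    # Illustrative stand-in for a value fetched from a statistical server.
    print(render({"unemployment_rate": 7.4}))

The point is that the document stays current without republication: each viewing pulls the latest value from the statistical source.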
The user-friendly graphical user interface in FASTER enhances data dissemination by offering seamless navigation and interaction with statistical data. The Data Browser component simplifies data access and processing with a Web browser-like experience, easing user engagement with complex datasets. This GUI design allows users to intuitively manage data, automate routine data tasks, and obtain real-time data views, thereby making statistical data more accessible and actionable.
Workshops and multidisciplinary teams are crucial for FASTER in developing metadata specifications and flexible user environments, as they allow for collaboration and consensus building on metadata semantics and structures. These expert workshops facilitate discussions with interested parties, leveraging their expertise to refine the architecture and data accessibility. Multidisciplinary teams bring together leaders in metadata management and statistical control to address FASTER's goals at a European level.
FASTER plans to enhance user interaction with statistical data by developing a configurable Web-based client application that allows users to personalize their environment for immediate interaction with statistical data. Users will be able to explore and interact with data using tools such as visualizations and bookmarks. This setup is expected to facilitate easier data access and management, help users combine various data sources, and improve engagement with a wide range of statistical data.