Sponsoring Editor: Diane Cerra; Director of Production and Manufacturing: Yonie Overton; Production Editor: Heather Collins; Editorial Coordinator: Belinda Breyer; Cover Design: Martin Heirakuji; Text Design: Mark Ong; Composition and Illustration...
Queries navigate semistructured data via path expressions, and can be accelerated using an index. Our solution encodes paths as strings, and inserts those strings into a special index that is highly optimized for long and complex keys. We...
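The idea of encoding paths as strings can be illustrated with a minimal Python sketch. It is not the authors' index: a sorted list with binary search stands in for the specialized index, and the names `encode_paths`, `PathIndex`, and the `/` separator are assumptions made for illustration only.

```python
from bisect import bisect_left, insort


def encode_paths(node, prefix=""):
    """Yield one 'label/label/.../value' string per leaf of a nested dict."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from encode_paths(child, f"{prefix}{label}/")
    else:
        yield f"{prefix}{node}"


class PathIndex:
    """Hypothetical stand-in index: encoded paths kept in a sorted list."""

    def __init__(self):
        self._keys = []

    def insert(self, document):
        # Every root-to-leaf path of the document becomes one string key.
        for key in encode_paths(document):
            insort(self._keys, key)

    def lookup_prefix(self, prefix):
        # Keys sharing a prefix are contiguous in sorted order, so one
        # binary search plus a short scan answers a path-expression lookup.
        start = bisect_left(self._keys, prefix)
        matches = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            matches.append(key)
        return matches


index = PathIndex()
index.insert({"invoice": {"buyer": {"name": "ABC Corp"}, "total": 120}})
index.insert({"invoice": {"buyer": {"name": "Acme"}}})
buyer_names = index.lookup_prefix("invoice/buyer/name/")
```

Here a path expression such as "all buyer names in invoices" reduces to a single prefix lookup over string keys, which is the property a path-string index exploits.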
We describe a configurable tool for extracting semistructured data from a set of HTML pages and for converting the extracted information into database objects. The input to the extractor is a declarative specification that states where...
Providing integrated access to multiple heterogeneous sources is a challenging issue in global information systems for cooperation and interoperability. In this context, two fundamental problems arise. First, how to determine if the...
This paper presents structural recursion as the basis of the syntax and semantics of query languages for semistructured data and XML. We describe a simple and powerful query language based on pattern matching and show that it...
Semistructured data is characterized by the lack of any fixed and rigid schema, although typically the data has some implicit structure.
Recent work on semi-structured data has revitalized the interest in path queries, i.e., queries that ask for all pairs of objects in the database that are connected by a path conforming to a certain specification, in particular to a...
We posit that a semistructured data model offers the right balance of rich structure and flexible (or absent) schema, allowing naive end users to record information in whatever form makes it easy for them to manage. We describe our...
All query languages proposed for semistructured data share as a common characteristic the ability to traverse arbitrarily long paths in the data in the form of regular path expressions. The expressive power of these languages lies in between...
In this paper we discuss the management of semi-structured data, i.e., data that has irregular or dynamically changing structure. We describe components of the Stanford TSIMMIS Project that help extract semi-structured data from Web pages,...
Because of the widespread diffusion of semistructured data in XML format, much research effort is currently devoted to support the storage and retrieval of large collections of such documents. XML documents can be compared as to their...
Recent IR extensions to XML query languages such as XPath 1.0 Full-Text or the NEXI query language of the INEX benchmark series reflect the emerging interest in IR-style ranked retrieval over semistructured data. TopX is a top-k retrieval...
As web applications mature and evolve, the nature of the semistructured data that drives these applications also changes. An important trend is the need for increased flexibility in the structure of web documents. Hence, applications...
Relational or semi-structured data is naturally represented by a graph, where nodes denote entities and directed typed edges represent the relations between them. Such graphs are heterogeneous in the sense that they describe different...
Community Web Portals (e.g., digital libraries, vertical aggregators, infomediaries) have become quite popular nowadays in supporting specific communities of interest on corporate intranets or the Web. Portal Catalogs organize and...
Motivated to a large extent by the substantial and growing prominence of the World-Wide Web and the potential benefits that may be obtained by applying database concepts and techniques to web data management, new data models and query...
This paper describes the theoretical framework and implementation of a database management system for storing and manipulating diverse probability distributions of discrete random variables with finite domains, and associated information....
The increasing availability of heterogeneous XML informative sources has raised a number of issues concerning how to represent and manage semistructured data. Although XML sources can exhibit proper structures and contents, differently...
Semistructured data occur in situations where information lacks a homogeneous structure and is incomplete. Yet, up to now the incompleteness of information has not been reflected by special features of query languages for semistructured...
We present a Heterogenous Data Quality Methodology (HDQM) for Data Quality (DQ) assessment and improvement that considers all types of data managed in an organization, namely structured data represented in databases, semistructured data...
Tree-walking automata (TWAs) recently received new attention in the fields of formal languages and databases. To achieve a better understanding of their expressiveness, we characterize them in terms of transitive closure logic formulas in...
Many Web data sources and APIs make their data available in XML, JSON, or a domain-specific semi-structured format, with the goal of making the data easily accessible and usable by Web application developers. Although such data formats...
Existing systems for managing and querying semistructured-data sources store the schema with the data. Lorel [QRS+95] and Tsimmis [PGMW95] store their data as graphs. The schema is stored as attributes labeling the graph's edges. Strudel...
We consider semistructured data as multi-rooted edge-labeled directed graphs, and path inclusion constraints on these graphs. A path inclusion constraint p ⊆ q is satisfied by a semistructured data instance if any node reached by the regular query p...
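A path inclusion constraint can be sketched in Python under two loudly stated assumptions: first, the usual reading of inclusion, namely that every node reached by p from the roots is also reached by q; second, a simplification in which p and q are plain sequences of edge labels rather than full regular queries. The names `reach` and `satisfies_inclusion` and the example graph are hypothetical.

```python
def reach(graph, roots, path):
    """Return the nodes reached from any root by following the label sequence."""
    frontier = set(roots)
    for label in path:
        # Follow every outgoing edge that carries the current label.
        frontier = {
            target
            for node in frontier
            for (edge_label, target) in graph.get(node, [])
            if edge_label == label
        }
    return frontier


def satisfies_inclusion(graph, roots, p, q):
    """Assumed semantics: every node reached by p is also reached by q."""
    return reach(graph, roots, p) <= reach(graph, roots, q)


# Multi-rooted, edge-labeled directed graph: node -> [(label, target), ...]
graph = {
    "r1": [("dept", "d"), ("staff", "e1")],
    "r2": [("staff", "e2")],
    "d": [("member", "e1")],
}
roots = ["r1", "r2"]

holds = satisfies_inclusion(graph, roots, ["dept", "member"], ["staff"])
fails = satisfies_inclusion(graph, roots, ["staff"], ["dept", "member"])
```

In this example every department member is also reachable as staff, so the first constraint holds, while the converse fails because `e2` is staff but not reachable through a department. Handling genuine regular queries would require evaluating an automaton alongside the graph traversal, which this sketch omits.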
This work describes a new theoretical framework for uniform storage and management of diverse probabilistic information.
Semi-structured data has become prevalent with the growth of the Internet. The data is usually stored in a traditional database system or in a specialized repository. While many information providers have presented their databases on the...
Path queries have been extensively used to query semistructured data, such as the Web and XML documents. In this paper we introduce weighted path queries, an extension of path queries enabling several classes of optimization problems...
Due to its flexibility, XML is becoming the de facto standard for exchanging and querying documents over the Web. Many XML query languages such as XQuery and XPath use label paths to traverse the irregularly structured XML data. Without a...
The diversity and availability of information sources on the World Wide Web have set the stage for integration and reuse at an unparalleled scale. There remain significant hurdles to exploiting the extent of the Web's resources in a...
The nature of semistructured data in web collections is evolving. Increasingly, XML web documents (or documents exchanged via web services) are valid with regard to a schema, yet the actual structure of such documents exhibits significant...
We propose combining query approximation and query relaxation techniques in order to support flexible querying of heterogeneous data arising from lifelong learners' educational and work experiences. A key aim of such querying facilities...
Database management systems are becoming available for semistructured data; however, these tools cannot be used on many real-world data sources (e.g., most web sites) in their native form. Often, wrappers are needed to extract information...
In this paper we investigate the quantifier-free fragment of the TQL logic proposed by Cardelli and Ghelli. The TQL logic, inspired by the ambient logic, is the core of a query language for semistructured data represented as unranked...
Query processing in global information systems integrating multiple heterogeneous sources is a challenging issue in relation to the effective extraction of information available on-line. In this paper we propose intelligent,...
While most business applications typically operate on structured data that can be effectively managed using relational databases, some applications use more complex semistructured data that lacks a stable schema. XML techniques are...
Multidimensional XML (MXML) is an extension of XML that incorporates dimensions in order to represent in an elegant and concise way context-dependent data, that is, data which can exhibit different variations in value or structure (e.g....