Papers by Michael E Flaster
Hidden text detection for search result scoring
Methods and apparatus for contextual schema mapping of source documents to target documents
Method for maintaining consistency and performing recovery in a replicated data storage system
Query generation using structural similarity between documents
Methods and apparatus for mapping source schemas to a target schema using schema embedding
Methods and Apparatus for User-Guided Inference of Regular Expressions for Information Extraction
Patch panel cover mounted antenna grid for use in the automatic determination of network cable connections using RFID tags
Evaluating website properties by partitioning user feedback
Method and apparatus for enabling authorized and billable message transmission between multiple communications environments
Efficient document clustering
Identifying transient portions of web pages
Identifying transient paths within websites
Proceedings of the …, 2003
Equivalence class-based method and apparatus for cost-based repair of database constraint violations

In this paper we present StarFish, a highly-available geographically-dispersed block storage system built from commodity servers running FreeBSD, which are connected by standard high-speed IP networking gear. StarFish achieves high availability by transparently replicating data over multiple storage sites. StarFish is accessed via a host-site appliance that masquerades as a host-attached storage device, hence it requires no special hardware or software in the host computer. We show that a StarFish system with 3 replicas and a write quorum size of 2 is a good choice, based on a formal analysis of data availability and reliability: 3 replicas with individual availability of 99%, a write quorum of 2, and read-only consistency gives better than 99.9999% data availability. Although StarFish increases the per-request latency relative to a direct-attached RAID, we show how to design a highly-available StarFish configuration that provides most of the performance of a direct-attached RAID on...
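The availability figure in the abstract follows from simple probability: with read-only consistency, data stays readable as long as any one of the three 99%-available replicas is up. A short sketch checking that arithmetic (the write-quorum figure is our own calculation under the same independence assumption, not a number quoted in the abstract):

```python
# Check the availability arithmetic for a 3-replica, write-quorum-2
# configuration, assuming replicas fail independently.
p_up = 0.99                     # availability of a single replica
p_down = 1 - p_up

# Reads need at least one replica up (read-only consistency).
read_avail = 1 - p_down ** 3

# Writes need at least 2 of the 3 replicas up.
write_avail = p_up ** 3 + 3 * p_up ** 2 * p_down

print(f"read availability:  {read_avail:.6f}")   # 0.999999, i.e. > 99.9999%
print(f"write availability: {write_avail:.6f}")
```

The read-side result matches the abstract's "better than 99.9999% data availability" claim exactly: 1 − (0.01)³ = 0.999999.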

A fundamental concern of information integration in an XML context is the ability to embed one or more source documents in a target document so that (a) the target document conforms to a target schema and (b) the information in the source document(s) is preserved. In this paper, information preservation for XML is formally studied, and the results of this study guide the definition of a novel notion of schema embedding between two XML DTD schemas represented as graphs. Schema embedding generalizes the conventional notion of graph similarity by allowing an edge in a source DTD schema to be mapped to a path in the target DTD. Instance-level embeddings can be defined from the schema embedding in a straightforward manner, such that conformance to a target schema and information preservation are guaranteed. We show that it is NP-complete to find an embedding between two DTD schemas. We also provide efficient heuristic algorithms to find candidate embeddings, along with experimental results to evaluate and compare the algorithms. These yield the first systematic and effective approach to finding information preserving XML mappings.
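The key relaxation in the abstract, mapping a source edge to a target *path* rather than a single target edge, can be illustrated with a toy reachability check (the schemas and node names below are hypothetical, and this is only the verification step, not the paper's NP-complete search for an embedding):

```python
from collections import deque

def is_embedding(src_edges, tgt_edges, node_map):
    """Check that every source edge (u, v) maps to some path from
    node_map[u] to node_map[v] in the target graph -- the relaxation
    of ordinary graph similarity described in the abstract."""
    adj = {}
    for a, b in tgt_edges:
        adj.setdefault(a, []).append(b)

    def reachable(start, goal):
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                return True
            for nxt in adj.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

    return all(reachable(node_map[u], node_map[v]) for u, v in src_edges)

# Hypothetical DTDs: the source edge book -> title is embedded via the
# longer target path book -> info -> title.
src = [("book", "title")]
tgt = [("book", "info"), ("info", "title")]
print(is_embedding(src, tgt, {"book": "book", "title": "title"}))  # True
```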

Coded Replication: A Space-Efficient Technique for Increasing File Availability
Distributed file systems offer a potential increase in file availability through replication of data. Many previous solutions have had large space requirements to achieve high availability. In this paper, we propose a method of replication that is extremely space efficient and yet provides significantly better availability than Dynamic Voting (the best of the previous methods) for reasonably reliable systems. The method employs Reed-Solomon encoding techniques, permitting each node to hold a small amount of the file, and yet allow reconstruction of the entire file given only a subset of the nodes. This increases availability at the cost of increased processing time, instead of increased disk space. The technique is shown to be flexible both in system resource demands and in the availability provided.
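The space/availability tradeoff the abstract describes can be quantified: an (n, k) erasure code stores n fragments, each 1/k of the file, and any k fragments reconstruct it. A small sketch comparing full replication against a coded scheme (the node-availability figure and the (6, 4) parameters are illustrative choices, not numbers from the paper):

```python
from math import comb

def k_of_n_availability(n, k, p):
    """Probability that at least k of n independent nodes are up,
    each up with probability p (binomial tail sum)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 0.95  # illustrative per-node availability

# 3 full copies: any 1 node suffices, but storage overhead is 3x.
full_rep = k_of_n_availability(3, 1, p)

# 6 coded fragments, any 4 reconstruct: storage overhead only 1.5x.
coded = k_of_n_availability(6, 4, p)

print(f"3x replication:    {full_rep:.6f} available")
print(f"(6,4) erasure code: {coded:.6f} available")
```

With these numbers the coded scheme gives comparable availability at half the storage cost, which is the essence of the tradeoff; Reed-Solomon codes are one concrete way to realize an (n, k) scheme.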

Proceedings of the 2005 ACM SIGMOD international conference on Management of data - SIGMOD '05, 2005
Data integrated from multiple sources may contain inconsistencies that violate integrity constraints. The constraint repair problem attempts to find "low cost" changes that, when applied, will cause the constraints to be satisfied. While in most previous work repair cost is stated in terms of tuple insertions and deletions, we follow recent work to define a database repair as a set of value modifications. In this context, we introduce a novel cost framework that allows for the application of techniques from record-linkage to the search for good repairs. We prove that finding minimal-cost repairs in this model is NP-complete in the size of the database, and introduce an approach to heuristic repair-construction based on equivalence classes of attribute values. Following this approach, we define two greedy algorithms. While these simple algorithms take time cubic in the size of the database, we develop optimizations inspired by algorithms for duplicate-record detection that greatly improve scalability. We evaluate our framework and algorithms on synthetic and real data, and show that our proposed optimizations greatly improve performance at little or no cost in repair quality.
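The idea of repairing by value modification over equivalence classes can be illustrated with a toy functional dependency zip → city: values sharing a zip form one class, and the repair picks a single representative that minimizes the number of changes. This is only a conceptual sketch with made-up data, not the paper's cost framework or greedy algorithms:

```python
from collections import Counter, defaultdict

# Toy data violating the functional dependency zip -> city.
rows = [
    {"zip": "07974", "city": "Murray Hill"},
    {"zip": "07974", "city": "Murray Hill"},
    {"zip": "07974", "city": "Muray Hill"},   # typo: violates the FD
    {"zip": "10001", "city": "New York"},
]

# Group city values into equivalence classes keyed by zip.
classes = defaultdict(list)
for r in rows:
    classes[r["zip"]].append(r["city"])

# Repair each class to its most frequent value (fewest modifications).
for zip_code, cities in classes.items():
    target, _ = Counter(cities).most_common(1)[0]
    for r in rows:
        if r["zip"] == zip_code:
            r["city"] = target

print(rows)  # all "07974" rows now agree on "Murray Hill"
```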
Attribute-level schema matching has proven to be an important first step in developing mappings for data exchange, integration, restructuring and schema evolution. In this paper we investigate contextual schema matching, in which selection conditions are associated with matches by the schema matching process in order to improve overall match quality. We define a general space of matching techniques, and
Exploratory Analysis System for Semi-structured Engineering Logs
Lecture Notes in Computer Science, 2006
Engineering diagnosis often involves analyzing complex records of system states printed to large, textual log files. Typically the logs are designed to accommodate the widest debugging needs without rigorous plans on formatting. As a result, critical quantities and flags are mixed with less important messages in a loose structure. Once the system is sealed, the log format is not changeable,