2020, Proceedings of the VLDB Endowment
Why and why-not provenance have been studied extensively in recent years. However, why-not provenance and, to a lesser degree, why provenance can be very large, resulting in severe scalability and usability challenges. We introduce a novel approximate summarization technique for provenance to address these challenges. Our approach uses patterns to encode why and why-not provenance concisely. We develop techniques for efficiently computing provenance summaries that balance informativeness, conciseness, and completeness. To achieve scalability, we integrate sampling techniques into provenance capture and summarization. Our approach is the first to both scale to large datasets and generate comprehensive and meaningful summaries.
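To make the pattern idea concrete, here is a minimal sketch, not the paper's implementation: the wildcard-pattern encoding, the scoring heuristic, the sampling strategy, and the toy data are all assumptions, but they illustrate how patterns can summarize a set of why-provenance tuples with coverage estimated on a sample.

```python
# Minimal sketch (not the paper's algorithm): summarize why-provenance tuples
# with wildcard patterns, estimating pattern coverage on a uniform sample.
import random
from itertools import combinations

def matches(pattern, tup):
    """A pattern is a tuple in which None acts as a wildcard."""
    return all(p is None or p == v for p, v in zip(pattern, tup))

def candidate_patterns(tup):
    """All generalizations of a tuple obtained by wildcarding attribute subsets."""
    n = len(tup)
    for r in range(n + 1):
        for idx in combinations(range(n), r):
            yield tuple(None if i in idx else v for i, v in enumerate(tup))

def summarize(provenance, sample_size=1000, top_k=3):
    """Score patterns by approximate coverage minus a crude penalty for vagueness."""
    sample = random.sample(provenance, min(sample_size, len(provenance)))
    scores = {}
    for tup in sample:
        for pat in candidate_patterns(tup):
            cover = sum(matches(pat, t) for t in sample) / len(sample)
            vague = sum(p is None for p in pat) / len(pat)
            scores[pat] = cover - 0.5 * vague   # informativeness/conciseness trade-off
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy why-provenance: (author, venue, year) tuples that justify a query answer.
prov = [("alice", "VLDB", 2019), ("alice", "VLDB", 2020), ("bob", "VLDB", 2020)]
print(summarize(prov, top_k=2))
```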
2015
As users are confronted with a deluge of provenance data, dedicated techniques are required to make sense of this kind of information. We present Aggregation by Provenance Types, a provenance graph analysis that is capable of generating provenance graph summaries. It proceeds by converting provenance paths up to some length k to attributes, referred to as provenance types, and by grouping nodes that have the same provenance types. The summary also includes numeric values representing the frequency of nodes and edges in the original graph. A quantitative evaluation and a complexity analysis show that this technique is tractable; with small values of k, it can produce useful summaries and can help detect outliers. We illustrate how the generated summaries can further be used for conformance checking and visualization.
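A rough sketch of the underlying grouping step, assuming a simple dictionary-based provenance graph; the helper names and the toy graph are illustrative, not the authors' code.

```python
# Hedged sketch of the idea behind aggregation by provenance types: a node's
# type is built from the labelled paths of length <= k leading into it, and
# nodes sharing a type collapse into one summary node annotated with a count.
from collections import defaultdict

def provenance_type(node, edges, labels, k):
    """edges: node -> list of predecessor nodes; labels: node -> label string."""
    if k == 0:
        return (labels[node],)
    sub = tuple(sorted(provenance_type(p, edges, labels, k - 1)
                       for p in edges.get(node, [])))
    return (labels[node], sub)

def summarize(nodes, edges, labels, k=2):
    groups = defaultdict(list)
    for n in nodes:
        groups[provenance_type(n, edges, labels, k)].append(n)
    # each group becomes one summary node annotated with its frequency
    return {t: len(ns) for t, ns in groups.items()}

# toy provenance graph: two activities, each using the same kind of input entity
labels = {"a1": "activity", "a2": "activity", "e1": "entity", "e2": "entity"}
edges = {"a1": ["e1"], "a2": ["e2"]}          # "used" edges
print(summarize(labels.keys(), edges, labels, k=1))
# both activities share a provenance type, so they collapse into one summary node
```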
Lecture Notes in Computer Science, 2012
As Open Data becomes commonplace, methods are needed to integrate disparate data from a variety of sources. Although Linked Data design has promise for integrating worldwide data, integrators often struggle to provide appropriate transparency for their sources and transformations. Without this transparency, cautious consumers are unlikely to find enough information to allow them to trust third-party content. While capturing provenance in RPI's Linking Open Government Data project, we were faced with the common problem that only a portion of provenance that is captured is effectively used. Using our water quality portal's use case as an example, we argue that one key to enabling provenance use is a better treatment of provenance granularity. To address this challenge, we have designed an approach that supports deriving abstracted provenance from granular provenance in an open environment. We describe the approach, show how it addresses the naturally occurring unmet provenance needs in a family of applications, and describe how the approach addresses similar problems in open provenance and open data environments.
Proceedings of the 2019 International Conference on Management of Data, 2019
Provenance and intervention-based techniques have been used to explain surprisingly high or low outcomes of aggregation queries. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instance, an unusually low number of publications of a prolific researcher in a certain venue and year can be explained by an increased number of publications in another venue in the same year. We present a novel approach for explaining outliers in aggregation queries through counterbalancing. That is, explanations are outliers in the opposite direction of the outlier of interest. Outliers are defined w.r.t. patterns that hold over the data in aggregate. We present efficient methods for mining such aggregate regression patterns (ARPs), discuss how to use ARPs to generate and rank explanations, and experimentally demonstrate the efficiency and effectiveness of our approach.
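The counterbalancing intuition can be illustrated with a toy sketch; it uses simple per-venue z-scores as a crude stand-in for the paper's aggregate regression patterns, and the data and threshold below are made up.

```python
# Illustrative sketch only (not the ARP mining algorithm from the paper): flag
# yearly counts that deviate strongly from a venue's mean, then look for a
# counterbalancing high outlier in another venue for the same year.
import statistics

def outliers(counts, z=1.5):
    """counts: {year: value}. Return years whose z-score magnitude is >= z."""
    mean = statistics.mean(counts.values())
    sd = statistics.pstdev(counts.values()) or 1.0
    return {y: (v - mean) / sd for y, v in counts.items() if abs(v - mean) / sd >= z}

# publications of one author, split by venue
venue_a = {2016: 5, 2017: 6, 2018: 1, 2019: 5}   # surprisingly low in 2018
venue_b = {2016: 2, 2017: 2, 2018: 7, 2019: 2}   # surprisingly high in 2018

low = {y: s for y, s in outliers(venue_a).items() if s < 0}
for year in low:
    counter = {y: s for y, s in outliers(venue_b).items() if s > 0 and y == year}
    if counter:
        print(f"{year}: low count in venue A counterbalanced by high count in venue B")
```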
2008
ZOOM*UserViews presents a model of provenance for scientific workflows that is simple, generic, and yet sufficiently expressive to answer questions of data and step provenance that have been encountered in a large variety of scientific case studies. In addition, ZOOM builds on the concept of composite step-classes, or sub-workflows, which is present in many scientific workflow systems, to develop a notion of user views. This paper discusses the design and implementation of ZOOM in the context of the queries posed by the provenance challenge, and shows how user views affect the level of granularity at which provenance information can be seen and reasoned about. Keywords: user views; multiple levels of granularity for provenance; scientific workflows.
Concurrency and Computation: Practice and Experience, 2008
Workflows and data pipelines are becoming increasingly valuable to computational and experimental sciences. These automated systems are capable of generating significantly more data within the same amount of time compared to their manual counterparts. Automatically capturing and recording data provenance and annotation as part of these workflows is critical for data management, verification, and dissemination. We have been prototyping a workflow provenance system, targeted at biological workflows, that extends our content management technologies and other open source tools. We applied this prototype to the provenance challenge to demonstrate an end-to-end system that supports dynamic provenance capture, persistent content management, and dynamic searches of both provenance and metadata. We describe our prototype, which extends the Kepler system for the execution environment, the Scientific Annotation Middleware (SAM) content management software for data services, and an existing HTTP-based query protocol. Our implementation offers several unique capabilities, and through the use of standards, is able to provide access to the provenance record with a variety of commonly available client tools.
Concurrency and Computation: Practice and Experience, 2008
VisTrails is a new workflow and provenance management system that provides support for scientific data exploration and visualization. Whereas workflows have been traditionally used to automate repetitive tasks, for applications that are exploratory in nature, change is the norm. VisTrails uses a new change-based provenance mechanism which was designed to handle rapidly-evolving workflows. It uniformly and automatically captures provenance information for data products and for the evolution of the workflows used to generate these products. In this paper, we describe how the VisTrails provenance data is organized in layers and present a first approach for querying this data that we developed to tackle the Provenance Challenge queries.
Lecture Notes in Computer Science, 2006
This paper presents Provenance Explorer, a secure provenance visualization tool, designed to dynamically generate customized views of scientific data provenance that depend on the viewer's requirements and/or access privileges. Using RDF and graph visualizations, it enables scientists to view the data, states and events associated with a scientific workflow in order to understand the scientific methodology and validate the results. Initially the Provenance Explorer presents a simple, coarse-grained view of the scientific process or experiment. However the GUI allows permitted users to expand links between nodes (input states, events and output states) to reveal more fine-grained information about particular sub-events and their inputs and outputs. Access control is implemented using Shibboleth to identify and authenticate users and XACML to define access control policies. The system also provides a platform for publishing scientific results. It enables users to select particular nodes within the visualized workflow and drag-and-drop them into an RDF package for publication or e-learning. The direct relationships between the individual components selected for such packages are inferred by the rule-inference engine.
The first Provenance Challenge was set up in order to provide a forum for the community to understand the capabilities of different provenance systems and the expressiveness of their provenance representations. To this end, a Functional Magnetic Resonance Imaging workflow was defined, which participants had to either simulate or run in order to produce some provenance representation, from which a set of identified queries had to be implemented and executed. Sixteen teams responded to the challenge, and submitted their inputs. In this paper, we present the challenge workflow and queries, and summarise the participants' contributions.
Concurrency and Computation: Practice and Experience, 2008
Provenance-aware storage systems (PASS) are a new class of storage system treating provenance as a first-class object, providing automatic collection, storage, and management of provenance as well as query capabilities. We developed the first PASS prototype between 2005 and 2006, targeting scientific end users. Prior to undertaking the provenance challenge, we had focused on provenance collection and storage, without much emphasis on a query model or language. The challenge forced us to (quickly) develop a query model and infrastructure implementing this model. We present a brief overview of the PASS prototype and a discussion of the evolution of the query model that we developed for the challenge.
Lecture Notes in Computer Science, 2012
As interest in provenance grows among the Semantic Web community, it is recognized as a useful tool across many domains. However, existing automatic provenance collection techniques are not universally applicable. Most existing methods either rely on (low-level) observed provenance, or require that the user discloses formal workflows. In this paper, we propose a new approach for automatic discovery of provenance, at multiple levels of granularity. To accomplish this, we detect entity derivations, relying on clustering algorithms, linked data and semantic similarity. The resulting derivations are structured in compliance with the Provenance Data Model (PROV-DM). While the proposed approach is purposely kept general, allowing adaptation in many use cases, we provide an implementation for one of these use cases, namely discovering the sources of news articles. With this implementation, we were able to detect 73% of the original sources of 410 news stories, at 68% precision. Lastly, we discuss possible improvements and future work.
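As a rough illustration of the general idea, not the authors' pipeline, one could link an article to its likely source via TF-IDF similarity and emit a PROV-style derivation statement for strong matches; the identifiers, threshold, and toy texts below are assumptions.

```python
# Hedged sketch: guess which earlier document an article was derived from using
# text similarity, and emit a PROV-style wasDerivedFrom statement for matches.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sources = {
    "wire:001": "Storm causes flooding across the coastal region, thousands evacuated.",
    "wire:002": "Parliament passes the new data protection bill after long debate.",
}
article = ("Thousands were evacuated after a storm caused severe flooding "
           "in the coastal region, officials said.")

vec = TfidfVectorizer().fit(list(sources.values()) + [article])
src_matrix = vec.transform(sources.values())
sims = cosine_similarity(vec.transform([article]), src_matrix)[0]

THRESHOLD = 0.3   # assumed cut-off; in practice it would be tuned empirically
for (src_id, _), score in zip(sources.items(), sims):
    if score >= THRESHOLD:
        print(f"prov:wasDerivedFrom(article:42, {src_id})  # similarity={score:.2f}")
```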
2013
Systems that gather fine-grained provenance metadata must process and store large amounts of information. Filtering this metadata as it is collected has a number of benefits, including reducing the amount of persistent storage required and simplifying subsequent provenance queries. However, writing these filters in a procedural language is verbose and error prone. We propose a simple declarative language for processing provenance metadata and evaluate it by translating filters implemented in SPADE [9], an open-source provenance collection platform.
2009
Most application provenance systems are hard-coded for a particular type of system or data, while current provenance file systems maintain in-memory provenance graphs and reside in kernel space, leading to complex and constrained implementations. Story Book resides in user space and treats provenance events as a generic event log, leading to a simple, flexible, and easily optimized system.
Lecture Notes in Computer Science, 2016
Provenance generated by different workflow systems is generally expressed using different formats. This is not an issue when scientists analyze provenance graphs in isolation, or when they use the same workflow system. However, analyzing heterogeneous provenance graphs from multiple systems poses a challenge. To address this problem we adopt ProvONE as an integration model, and show how different provenance databases can be converted to a global ProvONE schema. Scientists can then query this integrated database, exploring and linking provenance across several different workflows that may represent different implementations of the same experiment. To illustrate the feasibility of our approach, we developed conceptual mappings between the provenance databases of two workflow systems (e-Science Central and SciCumulus). We provide cartridges that implement these mappings and generate an integrated provenance database expressed as Prolog facts. To demonstrate its usage, we have developed Prolog rules that enable scientists to query the integrated database.
2013
Data provenance is a form of metadata that records the activities involved in data production. It can be used to help data consumers to form judgments regarding data reliability. The PROV data model, released by the W3C in 2013, defines a relational model and constraints which provide a structural and semantic foundation for provenance. This enables the exchange of provenance between data producers and consumers. When the provenance content is sensitive and subject to disclosure restrictions, however, a complementary model is needed to enable producers to partially obfuscate provenance in a principled way. In this paper we propose such a formal model. It is embodied by a grouping operator, whereby a set of nodes in a PROV-compliant provenance graph is replaced by a new abstract node, leading to a new valid PROV graph. We define graph editing rules which allow existing dependencies to be removed, but guarantee that no spurious dependencies are introduced in the abstracted graph. As ...
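A naive version of such a grouping operator is easy to sketch with networkx; note that this toy version only redirects edges that cross the group boundary and does not implement the paper's editing rules that rule out spurious dependencies.

```python
# Naive sketch of a grouping operator: replace a set of provenance nodes by one
# abstract node, redirecting boundary-crossing edges. The paper's editing rules
# additionally guarantee no spurious dependencies; this toy version does not.
import networkx as nx

def group_nodes(g, nodes, abstract_id):
    nodes = set(nodes)
    h = g.copy()
    h.add_node(abstract_id)
    for u, v in g.edges():
        if u in nodes and v not in nodes:
            h.add_edge(abstract_id, v)
        elif u not in nodes and v in nodes:
            h.add_edge(u, abstract_id)
    h.remove_nodes_from(nodes)
    return h

# toy PROV graph: e2 wasGeneratedBy a1, a1 used e1
g = nx.DiGraph([("e2", "a1"), ("a1", "e1")])
abstracted = group_nodes(g, {"a1"}, "abs1")
print(list(abstracted.edges()))   # [('e2', 'abs1'), ('abs1', 'e1')]
```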
2009
Provenance capture as applied to execution-oriented and interactive workflows is designed to record the minute detail needed to support a "modify and restart" paradigm as well as re-execution of past workflows. In our experience, provenance also plays an important role in human-centered verification, results tracking, and knowledge sharing. However, the amount of information recorded by provenance capture mechanisms generally obfuscates the conceptual view of events. There is a need for a flexible means to create and dynamically control user-oriented views over the detailed provenance record. In this paper, we present a design which leverages named graphs and extensions to the SPARQL query language to create and manage views as a server-side function, simplifying user presentation of provenance data.
2015
The problem of answering Why-Not questions consists in explaining why the result of a query does not contain some expected data, i.e., missing answers. To solve this problem, we resort to identifying where in the query the data relevant to the missing answer was lost. Existing algorithms producing such query-based explanations rely on a query tree representation, potentially leading to different or partial explanations. This significantly impairs the effectiveness of the computed explanations. Here we present an effective, query-tree-independent representation of query-based explanations, for a wide class of Why-Not questions, based on provenance polynomials. We further describe an algorithm that efficiently computes the complete set of these explanations. An experimental evaluation validates our claims.
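For readers unfamiliar with provenance polynomials (the semiring annotations, in the style of Green et al., that such explanations build on), here is a small worked example: each input tuple gets a variable, joint use multiplies, and alternative derivations add. The relations and query below are illustrative only, not taken from the paper.

```python
# Background example of provenance polynomials: annotate each input tuple with
# a variable; a join multiplies annotations, alternative derivations add them.
from sympy import symbols

r1, r2, s1 = symbols("r1 r2 s1")
R = {("a", "b"): r1, ("a", "c"): r2}      # annotated relation R(x, y)
S = {("b", "d"): s1}                      # annotated relation S(y, z)

# Q(x, z) :- R(x, y), S(y, z)   -- join on y, then project onto (x, z)
Q = {}
for (x, y), ann_r in R.items():
    for (y2, z), ann_s in S.items():
        if y == y2:
            Q[(x, z)] = Q.get((x, z), 0) + ann_r * ann_s

print(Q)   # {('a', 'd'): r1*s1} -- the polynomial records which inputs were used
```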
International Journal of Digital Curation, 2012
Experimental science can be thought of as the exploration of a large research space, in search of a few valuable results. While it is this "Golden Data" that gets published, the history of the exploration is often as valuable to the scientists as some of its outcomes. We envision an e-research infrastructure that is capable of systematically and automatically recording such history, an assumption that holds today for a number of workflow management systems routinely used in e-science. In keeping with our gold rush metaphor, the provenance of a valuable result is a "Golden Trail". Logically, this represents a detailed account of how the Golden Data was arrived at, and technically it is a sub-graph in the much larger graph of provenance traces that collectively tell the story of the entire research (or of some of it).
ArXiv, 2017
Explaining why an answer is in the result of a query or why it is missing from the result is important for many applications including auditing, debugging data and queries, and answering hypothetical questions about data. Both types of questions, i.e., why and why-not provenance, have been studied extensively. In this work, we present the first practical approach for answering such questions for queries with negation (first-order queries). Our approach is based on a rewriting of Datalog rules (called firing rules) that captures successful rule derivations within the context of a Datalog query. We extend this rewriting to support negation and to capture failed derivations that explain missing answers. Given a (why or why-not) provenance question, we compute an explanation, i.e., the part of the provenance that is relevant to answer the question. We introduce optimizations that prune parts of a provenance graph early on if we can determine that they will not be part of the explanation...
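The why versus why-not distinction can be illustrated with a toy sketch that enumerates successful and failed derivations of a single rule with negation; this conveys only the intuition, not the paper's firing-rule rewriting, and the rule and facts are made up.

```python
# Toy illustration: for the rule q(X) :- hop(X, Y), not blocked(Y), enumerate
# successful derivations (why-provenance) and failed ones (why-not provenance).
hop = {("a", "b"), ("c", "d")}
blocked = {("d",)}

def explain(x):
    successes, failures = [], []
    for (u, y) in hop:
        if u != x:
            continue
        if (y,) not in blocked:
            successes.append(f"hop({x},{y}) AND not blocked({y})")
        else:
            failures.append(f"hop({x},{y}) holds but blocked({y}) also holds")
    if not any(u == x for (u, _) in hop):
        failures.append(f"no hop({x}, _) fact exists")
    return successes, failures

print(explain("a"))   # q(a) holds: one successful derivation explains why
print(explain("c"))   # q(c) is missing: the only candidate derivation fails
```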