Papers by Bertram Ludaescher

arXiv (Cornell University), Dec 15, 2021
OpenRefine is a popular open-source data cleaning tool. It allows users to export a previously executed data cleaning workflow in a JSON format for possible reuse on other datasets. We have developed or2yw, a novel tool that maps a JSON-formatted OpenRefine operation history to a YesWorkflow (YW) model, which then can be visualized and queried using the YW tool. The latter was originally developed to allow researchers a simple way to annotate their program scripts in order to reveal the workflow steps and dataflow dependencies implicit in those scripts. With or2yw the user can automatically generate YW models from OpenRefine operation histories, thus providing a "workflow view" on a previously executed sequence of data cleaning operations. The or2yw tool can generate different types of YesWorkflow models, e.g., a linear model which mirrors the sequential execution order of operations in OpenRefine, and a parallel model which reveals independent workflow branches, based on a simple analysis of dependencies between steps: if two operations are independent of each other (e.g., when the columns they read and write do not overlap) then these can be viewed as parallel steps in the data cleaning workflow. The resulting YW models can be understood as a form of prospective provenance, i.e., knowledge artifacts that can be queried and visualized (i) to help authors document their own data cleaning workflows, thereby increasing transparency, and (ii) to help other users, who might want to reuse such workflows, to understand them better.
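The column-overlap independence rule described in the abstract can be sketched in a few lines of Python. The operation records and names below are invented for illustration and are not or2yw's actual data model:

```python
# Simplified sketch of the column-overlap independence test: two data
# cleaning steps can form parallel branches when the column sets they
# touch do not overlap. Operation records here are made up.

def columns(op):
    """All columns an operation touches (reads or writes)."""
    return op["reads"] | op["writes"]

def independent(a, b):
    """True when the two operations touch disjoint column sets."""
    return not (columns(a) & columns(b))

trim = {"op": "trim_whitespace", "reads": {"name"}, "writes": {"name"}}
split = {"op": "split_column", "reads": {"address"}, "writes": {"city", "zip"}}
upper = {"op": "to_uppercase", "reads": {"city"}, "writes": {"city"}}

assert independent(trim, split)       # disjoint columns: parallel branches
assert not independent(split, upper)  # both touch "city": stay sequential
```

A linear YW model would keep all three steps in execution order; a parallel model would place `trim` and `split` on independent branches.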
figureS3-7_mir
SUPPLEMENTARY MATERIALS S2. – Set of Euler/X toolkit output Maximally Informative Relations (MIR) for the input data files provided in the Supplementary Materials S1 and for the entire Prim-UC ("primates-all.csv"). Each output file is saved in .csv format. The MIR files form the basis for analyses of name:meaning relations (Tables 3–5)
FIGURE4
SUPPLEMENTARY MATERIALS S1. – Set of Euler/X toolkit input data files for all alignments produced in the Prim-UC (Figs. 1, 3–7, S3 1–13). Each file is saved in .txt format and contains annotations and instructions for run commands to yield the alignments and visualizations shown in the 19 corresponding figures
figureS3-3_mir
SUPPLEMENTARY MATERIALS S2. – (Same description as under figureS3-7_mir above.)
FIGURE6_MIR
SUPPLEMENTARY MATERIALS S2. – (Same description as under figureS3-7_mir above.)
figureS3-7
SUPPLEMENTARY MATERIALS S1. – (Same description as under FIGURE4 above.)
figureS3-4
SUPPLEMENTARY MATERIALS S1. – (Same description as under FIGURE4 above.)
FIGURE7
SUPPLEMENTARY MATERIALS S1. – (Same description as under FIGURE4 above.)

SupplementaryMaterials-S3
SUPPLEMENTARY MATERIALS S3. – Visualizations for the primary set (3 partitions; Figs. S3 1–3) and secondary set (10 partitions; Figs. S3 4–13) of alignments for the Prim-UC. Each caption specifies the most inclusive concept sec. Groves (2005), aligned with the respective congruent and/or entailed concepts sec. Groves (1993). Fig. S3–1: 2005.Strepsirrhini (partition 3; see Table 2). Fig. S3–2: 2005.Haplorrhini, excluding 2005.Catarrhini (partition 4; see Table 2). Fig. S3–3: 2005.Catarrhini (partition 5; see Table 2). Fig. S3–4: 2005.Cheirogaleoidea. Fig. S3–5: 2005.Lemuroidea. Fig. S3–6: 2005.Lorisiformes. Fig. S3–7: 2005.Chiromyiformes. Fig. S3–8: 2005.Tarsiiformes. Fig. S3–9: 2005.Platyrrhini, excluding 2005.Callitrichinae. Fig. S3–10: 2005.Callitrichinae. Fig. S3–11: 2005.Cercopithecinae. Fig. S3–12: 2005.Colobinae. Fig. S3–13: 2005.Hominoidea (= Fig. 5). See also Supplementary Materials S1 and S2
FIGURE5
SUPPLEMENTARY MATERIALS S1. – (Same description as under FIGURE4 above.)
2019 15th International Conference on eScience (eScience), 2019
In this paper we describe our experience adopting the Research Object Bundle (RO-Bundle) format with BagIt serialization (BagIt-RO) for the design and implementation of "tales" in the Whole Tale platform. A tale is an executable research object intended for the dissemination of computational scientific findings that captures information needed to facilitate understanding, transparency, and re-execution for review and computational reproducibility at the time of publication. We describe the Whole Tale platform and requirements that led to our adoption of BagIt-RO, specifics of our implementation, and discuss migrating to the emerging Research Object Crate (RO-Crate) standard.
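For readers unfamiliar with BagIt, a minimal bag per the BagIt specification (RFC 8493) can be produced with a few lines of Python. The bag name and payload file below are invented, and this sketch omits the RO-Bundle metadata that BagIt-RO layers on top:

```python
import hashlib
import pathlib

# Minimal BagIt 1.0 bag (RFC 8493): a bag declaration, a data/ payload
# directory, and a payload manifest with checksums. Contents are made up.
bag = pathlib.Path("example-tale")
(bag / "data").mkdir(parents=True, exist_ok=True)

payload = bag / "data" / "analysis.csv"
payload.write_text("x,y\n1,2\n")

# Required bag declaration.
(bag / "bagit.txt").write_text(
    "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
)

# Payload manifest: one "<checksum>  <path>" line per payload file.
digest = hashlib.sha256(payload.read_bytes()).hexdigest()
(bag / "manifest-sha256.txt").write_text(f"{digest}  data/analysis.csv\n")
```

A tale adds research-object metadata (provenance, environment, authorship) as tag files alongside this basic layout.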
Beyond Bibliographic Citation: Provenance and Dependency Metadata for Complex Research Objects

arXiv, 2021
Scientific workflows are a cornerstone of modern scientific computing, and they have underpinned some of the most significant discoveries of the last decade. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale HPC platforms. Workflows will play a crucial role in the data-oriented and post-Moore's computing landscape as they democratize the application of cutting-edge research techniques, computationally intensive methods, and use of new computing platforms. As workflows continue to be adopted by scientific projects and user communities, they are becoming more complex. Workflows are increasingly composed of tasks that perform computations such as short machine learning inference, multi-node simulations, long-running machine learning model training, amongst others, and thus increasingly rely on heterogeneous architectures that include CPUs but ...

arXiv, 2017
Explaining why an answer is in the result of a query or why it is missing from the result is important for many applications including auditing, debugging data and queries, and answering hypothetical questions about data. Both types of questions, i.e., why and why-not provenance, have been studied extensively. In this work, we present the first practical approach for answering such questions for queries with negation (first-order queries). Our approach is based on a rewriting of Datalog rules (called firing rules) that captures successful rule derivations within the context of a Datalog query. We extend this rewriting to support negation and to capture failed derivations that explain missing answers. Given a (why or why-not) provenance question, we compute an explanation, i.e., the part of the provenance that is relevant to answer the question. We introduce optimizations that prune parts of a provenance graph early on if we can determine that they will not be part of the explanation...
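The why vs. why-not distinction can be illustrated with a toy, hand-rolled evaluator for the single rule hop(X,Y) :- edge(X,Z), edge(Z,Y). This sketch only conveys the idea of successful and failed derivations; the paper instead rewrites the Datalog rules themselves into firing rules:

```python
# Toy "why" and "why-not" provenance for hop(X,Y) :- edge(X,Z), edge(Z,Y).
# The edge facts below are invented; this is not the paper's rewriting.

edge = {("a", "b"), ("b", "c"), ("b", "d")}
nodes = sorted({n for e in edge for n in e})

def why(x, y):
    """Successful derivations of hop(x, y): each witness is the pair
    of edge facts that made the rule body succeed."""
    return [(("edge", x, z), ("edge", z, y))
            for z in nodes if (x, z) in edge and (z, y) in edge]

def why_not(x, y):
    """Failed derivations explaining a missing hop(x, y): for each
    candidate middle node z, the body atoms that are absent."""
    failures = []
    for z in nodes:
        missing = [atom for atom in (("edge", x, z), ("edge", z, y))
                   if (atom[1], atom[2]) not in edge]
        if missing:
            failures.append((z, missing))
    return failures

assert why("a", "c") == [(("edge", "a", "b"), ("edge", "b", "c"))]
assert why("a", "a") == []  # missing answer: every candidate derivation fails
```

Here `why("a", "c")` returns the one successful derivation, while `why_not("a", "a")` enumerates, per middle node, which body atom would have to exist for hop(a, a) to be derived.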
Data re-use: Tools for producing and displaying data provenance across DataONE repositories
Data from: Two influential primate classifications logically aligned
arXiv, 2016
We present an overview of the recently funded "Merging Science and Cyberinfrastructure Pathways: The Whole Tale" project (NSF award #1541450). Our approach has two nested goals: 1) deliver an environment that enables researchers to create a complete narrative of the research process including exposure of the data-to-publication lifecycle, and 2) systematically and persistently link research publications to their associated digital scholarly objects such as the data, code, and workflows. To enable this, Whole Tale will create an environment where researchers can collaborate on data, workspaces, and workflows and then publish them for future adoption or modification. Published data and applications will be consumed either directly by users using the Whole Tale environment or can be integrated into existing or future domain Science Gateways.
Proceedings of the VLDB Endowment, 2020
Why and why-not provenance have been studied extensively in recent years. However, why-not provenance and, to a lesser degree, why provenance can be very large, resulting in severe scalability and usability challenges. We introduce a novel approximate summarization technique for provenance to address these challenges. Our approach uses patterns to encode why and why-not provenance concisely. We develop techniques for efficiently computing provenance summaries that balance informativeness, conciseness, and completeness. To achieve scalability, we integrate sampling techniques into provenance capture and summarization. Our approach is the first to both scale to large datasets and generate comprehensive and meaningful summaries.
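The core pattern idea can be sketched briefly: provenance tuples that agree on some attributes are covered by a single pattern in which the varying attributes become wildcards. The pairwise merge below is invented for illustration; the paper develops dedicated techniques (plus sampling) to balance informativeness, conciseness, and completeness:

```python
# Toy sketch of pattern-based provenance summarization: a pattern uses
# '*' for attributes that vary across the tuples it covers. The example
# tuples are made up.

def merge(p, q):
    """Generalize two tuples (or patterns) into one covering pattern."""
    return tuple(a if a == b else "*" for a, b in zip(p, q))

def matches(pattern, t):
    """True when the tuple t is covered by the pattern."""
    return all(a == "*" or a == b for a, b in zip(pattern, t))

provenance = [("alice", "NYC", 2020), ("bob", "NYC", 2021), ("carol", "LA", 2021)]

nyc = merge(provenance[0], provenance[1])
assert nyc == ("*", "NYC", "*")             # two tuples, one concise pattern
assert matches(nyc, provenance[0]) and matches(nyc, provenance[1])
assert not matches(nyc, provenance[2])      # covers no unrelated tuples
```

Merging everything would yield the all-wildcard pattern, which is maximally concise but uninformative; the tension between these extremes is what the paper's summarization techniques navigate.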
The VLDB Journal, 2018
It has been issued as a Technical Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IIT-DB prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g. payment of royalties).
Past Global Change Magazine, 2018