Papers by Alison Callahan

Journal of biomedical semantics, 2011
Key to the success of e-Science is the ability to computationally evaluate expert-composed hypoth... more Key to the success of e-Science is the ability to computationally evaluate expert-composed hypotheses for validity against experimental data. Researchers face the challenge of collecting, evaluating and integrating large amounts of diverse information to compose and evaluate a hypothesis. Confronted with rapidly accumulating data, researchers currently do not have the software tools to undertake the required information integration tasks. We present HyQue, a Semantic Web tool for querying scientific knowledge bases with the purpose of evaluating user submitted hypotheses. HyQue features a knowledge model to accommodate diverse hypotheses structured as events and represented using Semantic Web languages (RDF/OWL). Hypothesis validity is evaluated against experimental and literature-sourced evidence through a combination of SPARQL queries and evaluation rules. Inference over OWL ontologies (for type specifications, subclass assertions and parthood relations) and retrieval of facts sto...
With the growth of the Semantic Web as a medium for creating, consuming, mashing up and republish... more With the growth of the Semantic Web as a medium for creating, consuming, mashing up and republishing data, our ability to trace any statement(s) back to their origin is becoming ever more important. Several approaches have now been proposed to associate statements with provenance, with multiple applications in data publication, attribution and argumentation. Here, we describe the ovopub, a modular model for data publication that enables encapsulation, aggregation, integrity checking, and selective-source query answering. We describe the ovopub RDF specification, key design patterns and their application in the publication and referral to data in the life sciences.

Journal of Biomedical Semantics, 2013
Background: A key activity for life scientists in this post "-omics" age involves searching for a... more Background: A key activity for life scientists in this post "-omics" age involves searching for and integrating biological data from a multitude of independent databases. However, our ability to find relevant data is hampered by non-standard web and database interfaces backed by an enormous variety of data formats. This heterogeneity presents an overwhelming barrier to the discovery and reuse of resources which have been developed at great public expense.To address this issue, the open-source Bio2RDF project promotes a simple convention to integrate diverse biological data using Semantic Web technologies. However, querying Bio2RDF remains difficult due to the lack of uniformity in the representation of Bio2RDF datasets. Results: We describe an update to Bio2RDF that includes tighter integration across 19 new and updated RDF datasets. All available open-source scripts were first consolidated to a single GitHub repository and then redeveloped using a common API that generates normalized IRIs using a centralized dataset registry. We then mapped dataset specific types and relations to the Semanticscience Integrated Ontology (SIO) and demonstrate simplified federated queries across multiple Bio2RDF endpoints. Conclusions: This coordinated release marks an important milestone for the Bio2RDF open source linked data framework. Principally, it improves the quality of linked data in the Bio2RDF network and makes it easier to access or recreate the linked data locally. We hope to continue improving the Bio2RDF network of linked data by identifying priority databases and increasing the vocabulary coverage to additional dataset vocabularies beyond SIO.

Biocomputing 2015, 2014
Post-market drug safety surveillance is hugely important and is a significant challenge despite t... more Post-market drug safety surveillance is hugely important and is a significant challenge despite the existence of adverse event (AE) reporting systems. Here we describe a preliminary analysis of search logs from healthcare professionals as a source for detecting adverse drug events. We annotate search log query terms with biomedical terminologies for drugs and events, and then perform a statistical analysis to identify associations among drugs and events within search sessions. We evaluate our approach using two different types of reference standards consisting of known adverse drug events (ADEs) and negative controls. Our approach achieves a discrimination accuracy of 0.85 in terms of the area under the receiver operator curve (AUC) for the reference set of well-established ADEs and an AUC of 0.68 for the reference set of recently labeled ADEs. We also find that the majority of associations in the reference sets have support in the search log data. Despite these promising results additional research is required to better understand users' search behavior, biasing factors, and the overall utility of analyzing healthcare professional search logs for drug safety surveillance.

Journal of Biomedical Semantics, 2014
Background: Two distinct trends are emerging with respect to how data is shared, collected, and a... more Background: Two distinct trends are emerging with respect to how data is shared, collected, and analyzed within the bioinformatics community. First, Linked Data, exposed as SPARQL endpoints, promises to make data easier to collect and integrate by moving towards the harmonization of data syntax, descriptive vocabularies, and identifiers, as well as providing a standardized mechanism for data access. Second, Web Services, often linked together into workflows, normalize data access and create transparent, reproducible scientific methodologies that can, in principle, be re-used and customized to suit new scientific questions. Constructing queries that traverse semantically-rich Linked Data requires substantial expertise, yet traditional RESTful or SOAP Web Services cannot adequately describe the content of a SPARQL endpoint. We propose that content-driven Semantic Web Services can enable facile discovery of Linked Data, independent of their location. Results: We use a well-curated Linked Dataset -OpenLifeData -and utilize its descriptive metadata to automatically configure a series of more than 22,000 Semantic Web Services that expose all of its content via the SADI set of design principles. The OpenLifeData SADI services are discoverable via queries to the SHARE registry and easy to integrate into new or existing bioinformatics workflows and analytical pipelines. We demonstrate the utility of this system through comparison of Web Service-mediated data access with traditional SPARQL, and note that this approach not only simplifies data retrieval, but simultaneously provides protection against resource-intensive queries. Conclusions: We show, through a variety of different clients and examples of varying complexity, that data from the myriad OpenLifeData can be recovered without any need for prior-knowledge of the content or structure of the SPARQL endpoints. We also demonstrate that, via clients such as SHARE, the complexity of federated SPARQL queries is dramatically reduced.

BMC Bioinformatics, 2015
Extensive studies have been carried out on Caenorhabditis elegans as a model organism to elucidat... more Extensive studies have been carried out on Caenorhabditis elegans as a model organism to elucidate mechanisms of aging and the effects of perturbing known aging-related genes on lifespan and behavior. This research has generated large amounts of experimental data that is increasingly difficult to integrate and analyze with existing databases and domain knowledge. To address this challenge, we demonstrate a scalable and effective approach for automatic evidence gathering and evaluation that leverages existing experimental data and literature-curated facts to identify genes involved in aging and lifespan regulation in C. elegans. We developed a semantic knowledge base for aging by integrating data about C. elegans genes from WormBase with data about 2005 human and model organism genes from GenAge and 149 genes from GenDR, and with the Bio2RDF network of linked data for the life sciences. Using HyQue (a Semantic Web tool for hypothesis-based querying and evaluation) to interrogate this knowledge base, we examined 48,231 C. elegans genes for their role in modulating lifespan and aging. HyQue identified 24 novel but well-supported candidate aging-related genes for further experimental validation. We use semantic technologies to discover candidate aging genes whose effects on lifespan are not yet well understood. Our customized HyQue system, the aging research knowledge base it operates over, and HyQue evaluations of all C. elegans genes are freely available at http://hyque.semanticscience.org .

The lack of reproducibility in many areas of experimental science has a number of causes, includi... more The lack of reproducibility in many areas of experimental science has a number of causes, including a lack of transparency and precision in the description of experimental approaches. This has far-reaching consequences, including wasted resources and slowing of progress. Additionally, the large number of laboratories around the world publishing articles on a given topic make it difficult, if not impossible, for individual researchers to read all of the relevant literature. Consequently, centralized databases are needed to facilitate the generation of new hypotheses for testing. One strategy to improve transparency in experimental description, and to allow the development of frameworks for computer-readable knowledge repositories, is the adoption of uniform reporting standards, such as common data elements (data elements used in multiple clinical studies) and minimum information standards. This article describes a minimum information standard for spinal cord injury (SCI) experiments, its major elements, and the approaches used to develop it. Transparent reporting standards for experiments using animal models of human SCI aim to reduce inherent bias and increase experimental value.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2012
ABSTRACT

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2013
ABSTRACT Background / Purpose: The world wide web acts as a major publication platform for scient... more ABSTRACT Background / Purpose: The world wide web acts as a major publication platform for scientific publications, but more and more life sciences data are being published in a variety of incompatible formats that preclude easy integration, information retrieval and query answering. Although the semantic web effort offers machine understandable languages such as RDF and OWL to publish semantically annotated data on an emerging web of linked data, life sciences data providers don’t always use the same identifier to refer to the same entities and data must be processed in order to ensure link accuracy. Main conclusion: Bio2RDF is an open source project that uses semantic web technologies to build and provide the largest network of linked data for the life sciences. Here, we present the second release of the Bio2RDF project which features up-to-date, open-source scripts, IRI normalization through a common dataset registry, dataset provenance, data metrics, public SPARQL endpoints, compressed RDF files and full text-indexed virtuoso triple stores for download.

Journal of Biomedical Semantics, 2014
The Semanticscience Integrated Ontology (SIO) is an ontology to facilitate biomedical knowledge d... more The Semanticscience Integrated Ontology (SIO) is an ontology to facilitate biomedical knowledge discovery. SIO features a simple upper level comprised of essential types and relations for the rich description of arbitrary (real, hypothesized, virtual, fictional) objects, processes and their attributes. SIO specifies simple design patterns to describe and associate qualities, capabilities, functions, quantities, and informational entities including textual, geometrical, and mathematical entities, and provides specific extensions in the domains of chemistry, biology, biochemistry, and bioinformatics. SIO provides an ontological foundation for the Bio2RDF linked data for the life sciences project and is used for semantic integration and discovery for SADI-based semantic web services. SIO is freely available to all users under a creative commons by attribution license. See website for further information: http://sio.semanticscience.org.
Journal of the American Society for Information Science and Technology, 2010
... of the citing paper with DOI 10.1186/1471-2180-7-53. This DOI corresponds to the following BM... more ... of the citing paper with DOI 10.1186/1471-2180-7-53. This DOI corresponds to the following BMC Microbiology article: Gutierrez-Rios, RM, Freyre-Gonzalez, JA, Resendi, O., Collado-Vides, J., Saier, M., & Gosset, G. (2007). ...

Drug Safety, 2014
Text mining is the computational process of extracting meaningful information from large amounts ... more Text mining is the computational process of extracting meaningful information from large amounts of unstructured text. Text mining is emerging as a tool to leverage underutilized data sources that can improve pharmacovigilance, including the objective of adverse drug event detection and assessment. This article provides an overview of recent advances in pharmacovigilance driven by the application of text mining, and discusses several data sources-such as biomedical literature, clinical narratives, product labeling, social media, and Web search logs-that are amenable to text-mining for pharmacovigilance. Given the state of the art, it appears text mining can be applied to extract useful ADE-related information from multiple textual sources. Nonetheless, further research is required to address remaining technical challenges associated with the text mining methodologies, and to conclusively determine the relative contribution of each textual source to improving pharmacovigilance.
Uploads
Papers by Alison Callahan