Papers by Sarah Cohen-Boulakia

In the era of high-throughput visual plant phenotyping, it is crucial to design fully automated and flexible workflows able to derive quantitative traits from plant images. Over the last years, several software tools have supported the extraction of architectural features of shoot systems. Yet no current end-to-end system can automatically extract both the 3D shoot topology and the geometry of plants from images, on large datasets and across a large range of species. In particular, these tools essentially deal with dicotyledons, whose architecture is comparatively easier to analyze than that of monocotyledons. To tackle these challenges, we designed the Phenomenal software, featuring: (i) a completely automatic workflow system including data import, reconstruction of 3D plant architecture for a range of species, and quantitative measurements on the reconstructed plants; (ii) an open source library for the development and comparison of new algorithms to perform 3D shoot reconstruction; and (iii) an integ…

Proceedings of the VLDB Endowment, 2015
The problem of aggregating multiple rankings into one consensus ranking is an active research topic, especially in the database community. Various studies have implemented methods for rank aggregation and have come up with contradictory conclusions as to which algorithms work best. Comparing such results is cumbersome, as the original studies mixed different approaches and used very different evaluation datasets and metrics. Additionally, in real applications the rankings to be aggregated may not be permutations where elements are strictly ordered; they may have ties where some elements are placed at the same position. However, most studies have not considered ties. This paper introduces the first large-scale study of algorithms for rank aggregation with ties. More precisely, (i) we review rank aggregation algorithms and determine whether or not they can handle ties; (ii) we propose the first implementation to compute the exact solution of the rank aggregation with ties…
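To make the setting concrete, here is a minimal sketch, not taken from the paper, of a generalized Kendall-tau distance between rankings with ties, with each ranking represented as a list of buckets; exact aggregation then seeks a candidate ranking minimizing the summed distance to all inputs. The penalty p = 0.5 for pairs tied in only one ranking is one common convention, assumed here.

```python
from itertools import combinations

def bucket_positions(ranking):
    """Map each element to its bucket index; a ranking with ties is a
    list of buckets, e.g. [["a"], ["b", "c"], ["d"]]."""
    pos = {}
    for i, bucket in enumerate(ranking):
        for elem in bucket:
            pos[elem] = i
    return pos

def kendall_tau_with_ties(r1, r2, p=0.5):
    """Generalized Kendall-tau distance between two rankings with ties
    over the same elements: pairs ordered oppositely count 1, pairs
    tied in exactly one ranking count p (the paper's exact variant
    may differ)."""
    pos1, pos2 = bucket_positions(r1), bucket_positions(r2)
    dist = 0.0
    for x, y in combinations(pos1, 2):
        a = (pos1[x] > pos1[y]) - (pos1[x] < pos1[y])  # -1, 0, or 1
        b = (pos2[x] > pos2[y]) - (pos2[x] < pos2[y])
        if a and b and a != b:      # strictly ordered, opposite ways
            dist += 1
        elif (a == 0) != (b == 0):  # tied in exactly one ranking
            dist += p
    return dist

# A consensus candidate is scored by its summed distance to all
# input rankings (the quantity exact methods minimize):
rankings = [[["a"], ["b", "c"]], [["b"], ["a"], ["c"]]]
candidate = [["a", "b"], ["c"]]
print(sum(kendall_tau_with_ties(candidate, r) for r in rankings))
```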
Lecture Notes in Computer Science, 2006
Future Generation Computer Systems, 2017
With the development of new experimental technologies, biologists are faced with an avalanche of data to be computationally analyzed for scientific advancements and discoveries to emerge. Faced with the complexity of analysis pipelines, the large number of computational tools, and the enormous amount of data to manage, there is compelling evidence that many if not most scientific discoveries will not stand the test of time: increasing the reproducibility…

2018 IEEE 14th International Conference on e-Science (e-Science)
SPARQL is the standard query language used to access RDF linked data sets available on the Web. However, designing a SPARQL query can be a tedious task, even for experienced users. This is often due to the user's imperfect knowledge of the ontologies involved in the query. To overcome this problem, a growing number of query editors offer autocompletion features. Such features are nevertheless limited and mostly focused on typo checking. In this context, our contribution is four-fold. First, we analyze the autocompletion features proposed by the main editors, highlighting needs that are currently not addressed although they are expressed by a user community we work with, scientists. Second, we introduce the first (to our knowledge) autocompletion approach able to consider snippets (fragments of SPARQL queries) based on queries expressed by previous users, enriching the user experience. Third, we introduce a usable, open and concrete solution, implemented in an editor, able to consider a large panel of SPARQL autocompletion features. Last but not least, we demonstrate the interest of our approach on real biomedical queries involving services offered by the Wikidata collaborative knowledge base.
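As one flavor of data-driven completion (a simplified stand-in for the paper's snippet-based approach, using the real SPARQLWrapper library against the public Wikidata endpoint), an editor could suggest the properties most frequently used on instances of a class; the sampling limit keeps the query cheap, and the class IRI below is just an example.

```python
# pip install SPARQLWrapper
from SPARQLWrapper import SPARQLWrapper, JSON

WIKIDATA = "https://query.wikidata.org/sparql"

def suggest_predicates(class_iri, limit=10):
    """Rank candidate predicates by how often they occur on a sample
    of instances of a class, so an editor can propose likely next
    triple patterns."""
    sparql = SPARQLWrapper(WIKIDATA, agent="autocomplete-sketch/0.1")
    sparql.setQuery(f"""
        SELECT ?p (COUNT(*) AS ?uses) WHERE {{
          {{ SELECT ?s WHERE {{ ?s wdt:P31 <{class_iri}> }} LIMIT 200 }}
          ?s ?p ?o .
        }} GROUP BY ?p ORDER BY DESC(?uses) LIMIT {limit}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [(b["p"]["value"], int(b["uses"]["value"]))
            for b in results["results"]["bindings"]]

# e.g. predicates commonly attached to humans (wd:Q5):
for pred, uses in suggest_predicates("http://www.wikidata.org/entity/Q5", 5):
    print(pred, uses)
```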

SPARQL s'est impose comme le langage de requetes le plus utilise pour acceder aux masses de d... more SPARQL s'est impose comme le langage de requetes le plus utilise pour acceder aux masses de donnees RDF disponibles sur le Web. Neanmoins, rediger une requete en SPARQL peut se reveler fastidieux, y compris pour des utilisateurs experimentes. Cela tient souvent d'une maitrise imparfaite par l'utilisateur des ontologies impliquees pour decrire les connaissances. Pour pallier ce probleme, un nombre croissant d'editeurs de requetes SPARQL proposent des fonctionnalites d'autocompletion qui restent limitees car souvent associees a un unique champ et toujours associees a un service SPARQL fixe. Dans cet article, nous demontrons, au travers d'une experimentation, une approche permettant de proposer des completions d'une requete en cours de redaction en exploitant de nombreux types d'autocompletion et ce dans un contexte multi-services. Cette experimentation s'appuie sur un editeur SPARQL auquel nous avons ajoute des mecanismes d'autocompletion qui su...

Lecture Notes in Computer Science, 2005
Biologists face two problems in interpreting their experiments: the integration of their data with information from multiple heterogeneous sources, and data analysis with bioinformatics tools. It is difficult for scientists to choose between the numerous sources and tools without assistance. Following a thorough analysis of scientists' needs during the querying process, we found that biologists express preferences concerning the sources to be queried and the tools to be used. Interviews also showed that the querying process itself, the strategy followed, differs between scientists. In response to these findings, we have introduced a user-centric framework for specifying various querying processes. We have then developed the BioGuide system, which helps scientists choose suitable sources and tools, find complementary information in sources, and deal with divergent data. It is generic in that each user can adapt it to obtain answers that respect his/her preferences and follow his/her strategies.

Journal of Bioinformatics and Computational Biology, 2006
Fueled by novel technologies capable of producing massive amounts of data for a single experiment, scientists are faced with an explosion of information which must be rapidly analyzed and combined with other data to form hypotheses and create knowledge. Today, numerous biological questions can be answered without entering a wet lab. Scientific protocols designed to answer these questions can be run entirely on a computer. Biological resources are often complementary, focused on different objects and reflecting various experts' points of view. Exploiting the richness and diversity of these resources is crucial for scientists. However, with the increase of resources, scientists face the problem of selecting sources and tools when interpreting their data. In this paper, we analyze the way in which biologists express and implement scientific protocols, and we identify the requirements for a system which can guide scientists in constructing protocols to answer new biological …
Scientific workflow rewriting while preserving provenance
2012 IEEE 8th International Conference on E-Science, 2012
Scientific workflow systems are numerous and equipped with provenance modules able to collect the data produced and consumed during workflow runs to enhance reproducibility. An increasing number of approaches have been developed to help manage this provenance information. Some of them are able to process data in polynomial time, but they require workflows to have series-parallel (SP) structures. Rewriting any workflow into an SP workflow is thus particularly important. In this paper, (i) we introduce the concept of a provenance-equivalent rewriting process, (ii) we review existing graph transformations, (iii) we design the provenance-equivalent SPFlow algorithm, and (iv) we evaluate our approach on over a thousand real workflows.
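For intuition, here is a small recognizer, a simplified sketch and not the SPFlow algorithm itself, that tests whether a two-terminal workflow DAG is series-parallel by exhaustively applying the classical series and parallel reductions; SPFlow goes further and rewrites non-SP workflows into provenance-equivalent SP ones.

```python
from collections import Counter

def is_series_parallel(edges, source, sink):
    """Reduce a two-terminal DAG with series/parallel reductions until
    either a single source->sink edge remains (SP) or no rule applies
    (not SP). In the spirit of Valdes, Tarjan and Lawler."""
    multi = Counter(edges)  # multigraph: (u, v) -> multiplicity
    changed = True
    while changed:
        changed = False
        # Parallel reduction: collapse duplicate edges.
        for e, k in list(multi.items()):
            if k > 1:
                multi[e] = 1
                changed = True
        # Series reduction: bypass internal vertices of degree (1, 1).
        ins, outs = Counter(), Counter()
        for (u, v), k in multi.items():
            outs[u] += k
            ins[v] += k
        for w in list(ins):
            if w in (source, sink):
                continue
            if ins[w] == 1 and outs[w] == 1:
                (u,) = [a for (a, b) in multi if b == w]
                (v,) = [b for (a, b) in multi if a == w]
                del multi[(u, w)], multi[(w, v)]
                multi[(u, v)] += 1
                changed = True
                break  # degrees changed; recompute them
    return set(multi) == {(source, sink)} and multi[(source, sink)] == 1

# The diamond a->b, a->c, b->d, c->d is SP; adding the cross edge b->c
# keeps it a DAG but breaks the SP structure:
diamond = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(is_series_parallel(diamond, "a", "d"))                 # True
print(is_series_parallel(diamond + [("b", "c")], "a", "d"))  # False
```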
Bioinformatics, 2004
Motivation: Biologists are now faced with the problem of integrating information from multiple heterogeneous public sources with their own experimental data contained in individual sources. The selection of the sources to be considered is thus critically important. Results: Our aim is to support biologists by developing a module based on an algorithm that presents a selection of sources relevant to their query and matched to their own preferences. We approached this task by investigating the characteristics of biomedical data and introducing several preference criteria useful for bioinformaticians. This work was carried out in the framework of a project which aims to develop an integrative platform for the multiple parametric analysis of cancer. We illustrate our study through an elementary biomedical query occurring in a CGH analysis scenario.
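As a purely illustrative sketch of preference-driven source selection (the criteria names, scores, and weights below are invented, not the paper's), sources described by scores on generic criteria can be ranked by a user-weighted sum:

```python
# Illustrative source descriptions: each source scored on generic criteria.
SOURCES = {
    "SourceA": {"completeness": 0.9, "reliability": 0.6, "freshness": 0.8},
    "SourceB": {"completeness": 0.5, "reliability": 0.9, "freshness": 0.4},
    "SourceC": {"completeness": 0.7, "reliability": 0.7, "freshness": 0.9},
}

def rank_sources(sources, weights):
    """Rank sources by the user's weighted preference score."""
    def score(criteria):
        return sum(weights.get(c, 0.0) * v for c, v in criteria.items())
    return sorted(sources, key=lambda s: score(sources[s]), reverse=True)

# A user who values reliability above all:
print(rank_sources(SOURCES,
                   {"reliability": 0.7, "completeness": 0.2, "freshness": 0.1}))
```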
Providing techniques to automatically infer molecular networks is particularly important for understanding complex relationships between biological objects. We present a logic-based method to infer such networks and show how it allows signalling networks to be inferred from the design of a knowledge base. The provenance of inferred data is carefully collected, allowing quality evaluation. More precisely, our method (i) takes into account various kinds of biological experiments and their origin; (ii) mimics the scientist's reasoning within a first-order logic setting; (iii) specifies precisely the kind of interaction between the molecules; (iv) provides the user with the provenance of each interaction; and (v) automatically builds and draws the inferred network.
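To illustrate the flavor of such inference (a toy sketch; the rule, facts, and provenance labels below are invented, and the paper works in a richer first-order setting), a single forward-chaining rule can derive an interaction while propagating the provenance of its premises:

```python
from itertools import product

# Facts: (kind, a, b) annotated with their experimental origin.
facts = {
    ("binds", "A", "B"): {"yeast-two-hybrid"},
    ("phosphorylates", "B", "C"): {"kinase-assay"},
}

def infer(facts):
    """One illustrative rule: if X binds Y and Y phosphorylates Z,
    infer that X activates Z. The provenance of a conclusion is the
    union of the provenance of its premises."""
    inferred = dict(facts)
    for (k1, x, y1), (k2, y2, z) in product(list(facts), repeat=2):
        if k1 == "binds" and k2 == "phosphorylates" and y1 == y2:
            prov = facts[(k1, x, y1)] | facts[(k2, y2, z)]
            inferred.setdefault(("activates", x, z), set()).update(prov)
    return inferred

for fact, prov in infer(facts).items():
    print(fact, "from", sorted(prov))
```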

In many scientific domains, large amounts of data are generated daily and must be analyzed. In these analysis processes, the initial data are combined with other massive datasets. To guarantee a correct interpretation of the results of these analyses, it is crucial to be able to trace the provenance of the produced data back to the initial data. The database community has proposed a unifying formal framework of "provenance semirings". The goal of this article is to certify, a posteriori, that a provenance is correct. To this end, we propose a Coq formalization, based on the provenance-semiring model, for data analyses expressed in relational algebra. In particular, we introduce a proof of the adequacy of this provenance with respect to the usual interpretation of relational algebra. This is a first step towards the formalization of c…
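For readers unfamiliar with provenance semirings (Green et al.'s framework, which the article formalizes in Coq), the sketch below shows, in Python rather than Coq, how annotations in the polynomial semiring N[X] propagate through relational operators: joins multiply annotations and unions add them. It is a minimal illustration, not the article's formalization.

```python
from collections import Counter

# A relation is a dict: tuple -> annotation. Annotations are polynomials
# in N[X]: Counter {monomial: coefficient}, a monomial being a sorted
# tuple of base-tuple identifiers.
def var(x):
    return Counter({(x,): 1})

def plus(p, q):
    """Union / projection: add coefficients."""
    return p + q

def times(p, q):
    """Join / product: multiply polynomials."""
    r = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            r[tuple(sorted(m1 + m2))] += c1 * c2
    return r

def union(r1, r2):
    out = dict(r1)
    for t, p in r2.items():
        out[t] = plus(out[t], p) if t in out else p
    return out

def join(r1, r2):
    # Cartesian product for brevity; a natural join would match columns.
    return {t1 + t2: times(p1, p2)
            for t1, p1 in r1.items() for t2, p2 in r2.items()}

R = {("a",): var("r1"), ("b",): var("r2")}
S = {("c",): var("s1")}
print(join(union(R, R), S))  # coefficients record how often a derivation arises
```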

Future Generation Computer Systems, 2016
Scientific workflows have become a valuable tool for large-scale data processing and analysis. This has led to the creation of specialized online repositories to facilitate workflow sharing and reuse. Over time, these repositories have grown to sizes that call for advanced methods to support workflow discovery, in particular similarity search. Effective similarity search requires both high-quality algorithms for the comparison of scientific workflows and efficient strategies for indexing, searching, and ranking of search results. Yet the graph structure of scientific workflows poses severe challenges to each of these steps. Here, we present a complete system for effective and efficient similarity search in scientific workflow repositories, based on the Layer Decomposition approach to scientific workflow comparison. Layer Decomposition specifically accounts for the directed dataflow underlying scientific workflows and, compared to other state-of-the-art methods, delivers the best results for similarity search at comparably low runtimes. Stacking Layer Decomposition with even faster, structure-agnostic approaches allows us to use proven, off-the-shelf tools for workflow indexing to further reduce runtimes and scale similarity search to the sizes of current repositories.
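As rough intuition for the layer idea (a simplified stand-in, not the actual Layer Decomposition algorithm), one can group workflow tasks by longest-path depth and compare the resulting layer sequences; the Jaccard-based similarity below is an assumption made for illustration.

```python
from collections import defaultdict

def layers(edges, nodes):
    """Group nodes of a workflow DAG by longest-path depth from sources."""
    preds = defaultdict(set)
    for u, v in edges:
        preds[v].add(u)
    depth = {}
    def d(n):
        if n not in depth:
            depth[n] = 1 + max((d(p) for p in preds[n]), default=-1)
        return depth[n]
    out = defaultdict(list)
    for n in nodes:
        out[d(n)].append(n)
    return [sorted(out[i]) for i in range(len(out))]

def layer_similarity(w1, w2):
    """Average Jaccard overlap of aligned layers, normalized by the
    longer decomposition."""
    l1, l2 = layers(*w1), layers(*w2)
    jac = [len(set(a) & set(b)) / len(set(a) | set(b))
           for a, b in zip(l1, l2)]
    return sum(jac) / max(len(l1), len(l2))

wf1 = ([("load", "align"), ("align", "plot")], ["load", "align", "plot"])
wf2 = ([("load", "align"), ("align", "stats")], ["load", "align", "stats"])
print(layer_similarity(wf1, wf2))  # 2/3: the first two layers match
```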
An increasing number of scientific workflow systems provide support for the automated tracking and storage of provenance information. However, the amount of provenance information recorded can become very large, even for a single execution of a workflow; [6] estimates a ten-fold blowup of the size of the original input data. There is therefore a need to provide ways of allowing users to focus their attention on meaningful provenance information in provenance queries. We highlight recent work in this area on user views, showing how they can be efficiently computed given user input on relevance, and how pre-existing views can be corrected to provide accurate provenance information. We also discuss how to search a repository of workflow specifications and their views, returning workflows at an appropriate level of complexity with respect to a hierarchy of views.

Workflow systems have become increasingly popular for managing experiments where many bioinformatics tasks are chained together. Due to the large amount of data generated by these experiments and the need for reproducible results, provenance has become of paramount importance. Workflow systems are therefore starting to provide support for querying provenance. However, the amount of provenance information may be overwhelming, so there is a need for abstraction mechanisms to help users focus on the most relevant information. The technique we pursue is that of "user views". Since bioinformatics tasks may themselves be complex sub-workflows, a user view determines what level of sub-workflow the user can see, and thus what data and tasks are visible in provenance queries. In this paper, we formalize the notion of user views, demonstrate how they can be used in provenance queries, and give an algorithm for generating a user view based on which tasks are relevant for the user. We then…
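A minimal sketch of the idea, with invented task names: a user view maps low-level tasks to composite tasks, and provenance edges are rewritten through that mapping, hiding dataflow internal to a composite task (the paper's formal construction guarantees soundness properties not enforced here).

```python
view = {  # illustrative mapping: low-level task -> composite task
    "trim_reads": "preprocess", "filter_reads": "preprocess",
    "align": "align", "call_variants": "call_variants",
}

# Fine-grained provenance: (producing task, data item, consuming task)
provenance = [
    ("trim_reads", "reads.trim", "filter_reads"),
    ("filter_reads", "reads.ok", "align"),
    ("align", "bam", "call_variants"),
]

def abstract_provenance(prov, view):
    """Rewrite provenance through the view, dropping edges that are
    internal to a single composite task."""
    out = []
    for src, data, dst in prov:
        s, t = view[src], view[dst]
        if s != t:  # hide dataflow inside a composite task
            out.append((s, data, t))
    return out

print(abstract_provenance(provenance, view))
# [('preprocess', 'reads.ok', 'align'), ('align', 'bam', 'call_variants')]
```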

Search, Adapt, and Reuse: The Future of Scientific Workflows
Over the last years, a number of scientific workflow management systems (SciWFM) have been brought to a state of maturity that should permit their usage in a production-style environment. This is especially true for the Life Sciences, but SciWFM also attract considerable attention in fields like geophysics or climate research. These developments, accompanied by the growing availability of analytical tools wrapped as (web) services, were driven by a series of very interesting promises: end users will be empowered to develop their own pipelines; reuse of services will be enhanced by easier integration into custom workflows; the time necessary for developing analysis pipelines will decrease; etc. But despite all efforts, SciWFM have not yet found widespread acceptance among their intended audience. In this paper, we argue that a wider adoption of SciWFM will only be achieved if the focus of research and development is shifted from methods for developing and running workflows to searching, ada…
As the number, richness and diversity of biological sources grow, scientists are increasingly confronted with the problem of selecting appropriate sources and tools. To address this problem, we have designed BioGuide, a user-centric framework that helps scientists choose sources and tools according to their preferences and strategy, by specifying queries through a user-friendly visual interface. In this paper, we provide a complete RDF representation of BioGuide and introduce XPR (eXtensible Path language for RDF), an extension of FSL that is expressive enough to model all BioGuide queries. BioGuide queries modeled as XPR expressions can then be saved, compared, evaluated and exchanged through the Web between users and applications.

Motivation: High-throughput technologies provide fundamental information concerning thousands of genes. Many current research laboratories daily use one or more of these technologies and end up with lists of genes. Assessing the originality of the results obtained includes being aware of the number of publications available concerning individual or multiple genes, and accessing information about these publications. Faced with the exponential growth of available publications and the number of genes involved in a study, this task is becoming particularly difficult to achieve. Results: We introduce GeneValorization, a web-based tool which gives a clear and handy overview of the bibliography corresponding to the user input, formed by (i) a gene list (expressed by gene names or ids from EntrezGene) and (ii) a context of study (expressed by keywords). From this input, GeneValorization provides a matrix containing the number of publications with co-occurrences of gene name…
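As a sketch of the underlying bibliography matrix (not GeneValorization's actual implementation), one can count PubMed publications mentioning both a gene and a context keyword via NCBI's public esearch E-utility; the gene names and keyword below are illustrative, and production use should respect NCBI rate limits and API keys.

```python
import json
import urllib.parse
import urllib.request

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_count(term):
    """Number of PubMed records matching a query term."""
    url = ESEARCH + "?" + urllib.parse.urlencode(
        {"db": "pubmed", "term": term, "retmode": "json"})
    with urllib.request.urlopen(url) as resp:
        return int(json.load(resp)["esearchresult"]["count"])

def cooccurrence_matrix(genes, keywords):
    """Matrix cell (gene, keyword) = publications mentioning both."""
    return {(g, k): pubmed_count(f"{g} AND {k}")
            for g in genes for k in keywords}

# e.g. two genes in a "breast cancer" context:
for cell, n in cooccurrence_matrix(["BRCA1", "TP53"], ["breast cancer"]).items():
    print(cell, n)
```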